AI Without the Cloud — What Actually Works, and What Doesn't

I've been running local AI for a while now on a Raspberry Pi 4, a Pi 5, and my MacBook Air M2, with models ranging from gemma2:2b up to considerably larger ones like llama3:70b on the MBA. And I can tell you: expectation and reality often diverge quite a bit, in both directions.

Not because local AI is bad. But because most articles on the topic are either completely enthusiastic ("it runs on a Pi!") or completely dismissive ("too slow, forget it"). The truth is — as usual — somewhere in the middle. And the details, and above all the specific use case, make all the difference.

This article isn't a setup guide. I've already written that one. This is the field test: what have I actually done with it, what worked, what didn't, and where does the sensible boundary lie.


The Honest Starting Point

Before we dive in, let's establish the baseline. Anyone starting out with local AI on a Pi or a small mini-PC usually arrives with one of two expectations:

Expectation A: "I want to be completely independent from cloud services."
Expectation B: "I want to handle simple tasks without always needing an internet connection."

Expectation B is realistic and will be met. Expectation A is a mindset — one I respect and share — but it comes with trade-offs. Accept those trade-offs and you'll be happy. Expect a 2B model on a Pi to match Claude and you'll be disappointed.

graph TD
    A[Expectation A\nComplete cloud independence] --> B{Realistic?}
    C[Expectation B\nLocal for simple tasks] --> B
    B -->|Yes, with trade-offs| D[✅ For specific\nuse cases]
    B -->|No, for complex tasks| E[⚠️ Cloud remains\nnecessary]

With that baseline established, let's get into the field test.


What Works Well

Summarising and Structuring Text

This is the use case that surprised me most, in a good way. Local models are remarkably capable at summarisation: not perfect, but good enough for everyday use.

A concrete example from my own practice: I use local models to summarise longer log files or configuration documents. The quality is perfectly adequate for this task — and crucially: the data never leaves my network.

flowchart TD
    classDef success fill:#14532d,color:#86efac,stroke:#86efac,stroke-width:1.5px
    A[📄 Local document\nLog file / notes] --> B[Ollama\ngemma2:2b]
    B --> C[📝 Summary\ngenerated locally]
    C --> D[✅ Data stays\nin the network]
    class D success

Why it works: Summarisation is a relatively simple pattern — the model doesn't need to generate new ideas, just condense existing information. Even small models handle this well.

Where it breaks down: Very long documents that exceed the context window. You have to split them up — which is doable, but complicates the workflow. More importantly, you have to be careful that context from other sections doesn't get lost in the process. A summary that only knows part of a document can lead to wrong conclusions — and that's a mistake that doesn't always surface immediately.
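The splitting step can be sketched roughly like this. It is a minimal sketch, not my exact setup: it assumes Ollama is listening on its default local port, and the chunk size, overlap, and prompt wording are illustrative values. The overlap is there so that a sentence cut at a chunk boundary still appears whole in one of the chunks; it does not solve the lost-context problem, it only softens it.

```python
import json
import urllib.request
from typing import Callable, List

# Assumption: Ollama runs locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def split_with_overlap(text: str, max_chars: int = 4000, overlap: int = 400) -> List[str]:
    """Split text into chunks of at most max_chars that share `overlap` characters."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    step = max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def ask_ollama(prompt: str, model: str = "gemma2:2b") -> str:
    """Send one non-streaming prompt to the local Ollama server and return its text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def summarise_long(
    text: str,
    ask: Callable[[str], str] = ask_ollama,
    max_chars: int = 4000,
    overlap: int = 400,
) -> str:
    """Summarise each chunk, then merge the partial summaries in a final pass."""
    chunks = split_with_overlap(text, max_chars, overlap)
    if not chunks:
        return ""
    partials = [ask(f"Summarise this excerpt in three bullet points:\n\n{chunk}") for chunk in chunks]
    if len(partials) == 1:
        return partials[0]
    combined = "\n\n".join(partials)
    return ask(f"Combine these partial summaries into one summary:\n\n{combined}")
```

The `ask` parameter is injectable on purpose: it makes the chunking logic testable without a running model, and it lets you swap in a different backend. The final merge pass is where cross-section context can still go missing, so the merged summary deserves a read against the source.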


Classification and Categorisation

Another pleasant surprise. Categorising text, assigning priorities, making simple yes/no decisions — this works very reliably locally.

In my own practice I use this for:
- Classifying log entries by severity
- Pre-sorting emails by subject line (locally, before anything gets uploaded anywhere)
- Simple rule checks on configuration files

| Task | Model | Quality | Speed |
|---|---|---|---|
| Log classification | gemma2:2b | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| Text categorisation | gemma2:2b | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| Yes/No decisions | phi3:mini | ⭐⭐⭐⭐⭐ | ⭐⭐ |
| Priority assignment | gemma2:2b | ⭐⭐⭐ | ⭐⭐⭐ |
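Much of the reliability in classification comes from constraining the task: a fixed label set in the prompt and strict parsing of whatever comes back. The sketch below shows that idea. The label names, prompt wording, and function names are hypothetical, chosen for illustration; the model call itself is left out.

```python
import re
from typing import Sequence

# Hypothetical label set for log severity; adjust to your own categories.
SEVERITIES = ("critical", "error", "warning", "info")


def build_prompt(log_line: str, labels: Sequence[str] = SEVERITIES) -> str:
    """Build a prompt that restricts the model to a fixed set of labels."""
    options = ", ".join(labels)
    return (
        f"Classify the log entry into exactly one of: {options}.\n"
        "Answer with the label only.\n\n"
        f"Log entry: {log_line}"
    )


def parse_label(
    raw_answer: str,
    labels: Sequence[str] = SEVERITIES,
    fallback: str = "unknown",
) -> str:
    """Map free-form model output to one allowed label.

    Returns the fallback when no label or more than one label appears,
    so ambiguous answers are flagged instead of silently guessed.
    """
    words = set(re.findall(r"[a-z]+", raw_answer.lower()))
    found = [label for label in labels if label in words]
    return found[0] if len(found) == 1 else fallback
```

Small models often wrap the label in extra words ("The severity is: warning."), which is why the parser looks for whole-word matches rather than comparing the full answer. Anything that ends up as `unknown` can be retried or reviewed by hand.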

Code Snippets for Standard Tasks

Bash scripts, simple Python functions, regex patterns — for well-defined, contained tasks, phi3:mini delivers usable results. Not always correct on the first try, but a solid starting point.

The key point: I review every generated piece of code. I do the same with Claude — as I wrote in Article 1. The origin of the code doesn't change that obligation.

Where it stops working: As soon as the task requires context across multiple files, or the model needs framework-specific knowledge beyond its training. That's where the quality gap compared to Claude becomes significant.


Local Document Search with RAG

This is the most involved use case — but once it's running, one of the most useful. RAG stands for Retrieval-Augmented Generation: your own documents are indexed, and the LLM can search them and answer questions based on what it finds.

flowchart TD
    classDef success fill:#14532d,color:#86efac,stroke:#86efac,stroke-width:1.5px
    A[📚 Local documents\nMarkdown, PDF, text] --> B[Embedding model\nlocal]
    B --> C[(Vector database\nlocal)]
    D[❓ Question] --> E[Retrieve relevant\nchunks]
    C --> E
    E --> F[Ollama LLM\nGenerate answer]
    F --> G[✅ Answer\nbased on your\nown documents]
    class G success

I use this for my own documentation — homelab notes, configuration references, article drafts. Quality depends heavily on the embedding model and chunk size, but the approach works.
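To make the retrieval step concrete, here is a deliberately small sketch. It is not my actual setup: `toy_embed` is a word-count stand-in for a real local embedding model, and the list of chunks stands in for the vector database. The structure is the same, though: embed the question, rank the chunks by similarity, and put the best ones into the prompt.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding based on word counts (assumption: replaced by a real model in practice)."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(
    question: str,
    chunks: List[str],
    embed: Callable[[str], Vector] = toy_embed,
    top_k: int = 2,
) -> List[str]:
    """Return the top_k chunks most similar to the question."""
    question_vector = embed(question)
    ranked = sorted(chunks, key=lambda chunk: cosine(question_vector, embed(chunk)), reverse=True)
    return ranked[:top_k]


def build_rag_prompt(question: str, context_chunks: List[str]) -> str:
    """Assemble a prompt that asks the model to answer from the retrieved chunks only."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

A real pipeline embeds the chunks once at indexing time and stores the vectors, instead of re-embedding on every question as this sketch does. Chunk size matters here for the same reason as in summarisation: chunks that are too small lose context, chunks that are too large dilute the match.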

Honest effort required: The initial setup isn't trivial. It's worth setting aside some time for it — and this is actually an area where Claude Code can be genuinely helpful.


Where Local Models Hit Their Limits

Complex Reasoning and Multi-Step Tasks

This is the expected weakness of small models — and it shows up clearly in practice. When a task requires multiple reasoning steps — "analyse this problem, derive a solution, and explain why" — small models start to struggle quickly.

The pattern I see repeatedly: the model starts promisingly, then loses the thread, and ends up with an answer that sounds fluent but is factually off. This is the "confident wrongness" I described in Article 1 — and in small models it's even more pronounced.

graph TD
    classDef danger  fill:#7f1d1d,color:#fca5a5,stroke:#fca5a5,stroke-width:1.5px
    A[Complex task\nmultiple steps] --> B{Model size}
    B -->|Small 1B-4B| C[⚠️ Loses thread\nafter 2-3 steps]
    B -->|Large 70B+| D[✅ Maintains context\nacross many steps]
    C --> E[Plausible-sounding\nbut incorrect answer]
    D --> F[Reliable\nmulti-step answer]
    class E danger

My conclusion: Complex analyses, architecture decisions, code reviews across multiple files — that stays with Claude. Not because I haven't tried it locally, but because I've compared the results and the quality gap is too significant.


Creative Writing and Style

Structuring initial ideas, suggesting a rough outline, rephrasing a paragraph — I've tested local models for these kinds of support tasks. The result: usable for rough structural suggestions, but as soon as style, tone, or nuance come into play, the limitations become clear. The output sounds generic and the style is hard to influence.

This comes down to model size. Stylistic sensitivity and linguistic nuance require a large parameter space — one that 2-4B models simply don't have. For anything beyond basic structural help, Claude remains the better choice.


Multilingual Tasks

German works noticeably worse than English with phi3:mini and gemma2:2b. That's no surprise: most of these models were trained primarily on English-language data.

| Language | phi3:mini | gemma2:2b |
|---|---|---|
| English | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| German | ⭐⭐⭐ | ⭐⭐⭐ |
| Other EU languages | ⭐⭐ | ⭐⭐ |

For this blog — which is bilingual — that means: German text is written with Claude, not locally.


The Grey Zone

Some tasks work locally — but with trade-offs worth knowing about.

Explaining Code

Having existing code explained works surprisingly well, as long as the code is manageable in scope. With more complex systems or unfamiliar frameworks, the explanation becomes imprecise. Good as a first orientation, not as a reliable source.

Translations

Simple translations between common languages work. For nuance, technical terminology, or stylistically demanding text — better to use Claude.

Structuring Data

Extracting JSON from unstructured text, cleaning up tables, renaming fields — this works well for simple cases. Complex transformations sometimes produce subtle errors that are easy to miss.
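One way to catch at least part of those subtle errors is to never trust the model's output as JSON until it has been parsed and checked. The sketch below pulls the first valid JSON object out of a free-form answer and reports missing fields. The function names and the required-field list are hypothetical; a fuller version might also check value types.

```python
import json
from typing import Any, Dict, Iterable, List, Optional


def extract_json_object(raw: str) -> Optional[Dict[str, Any]]:
    """Return the first decodable JSON object found in raw, or None if there is none."""
    decoder = json.JSONDecoder()
    position = raw.find("{")
    while position != -1:
        try:
            value, _ = decoder.raw_decode(raw, position)
        except json.JSONDecodeError:
            value = None  # not valid JSON from this brace; try the next one
        if isinstance(value, dict):
            return value
        position = raw.find("{", position + 1)
    return None


def missing_fields(record: Dict[str, Any], required: Iterable[str]) -> List[str]:
    """List the required field names that are absent from the record."""
    return [name for name in required if name not in record]
```

Small models tend to wrap JSON in chatty text ("Sure! Here is the JSON: ..."), which is why the extractor scans for an object instead of calling `json.loads` on the whole answer. Structural checks like this catch missing or malformed fields; they do not catch a value that is well-formed but wrong, so spot checks against the source text remain necessary.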


Hybrid Workflows: The Best of Both Worlds

After several weeks of hands-on experience, I've settled into a workflow that deliberately combines local and cloud — not as a compromise, but as a conscious architectural decision.

flowchart TD
    classDef cloud   fill:#1e3a5f,color:#93c5fd,stroke:#93c5fd,stroke-width:1.5px
    classDef success fill:#14532d,color:#86efac,stroke:#86efac,stroke-width:1.5px
    classDef warning fill:#d97706,color:#ffffff,stroke:#92400e,stroke-width:1.5px
    A[Task] --> B{Data sensitivity?}
    B -->|High\nprivate/internal data| C{Task complexity?}
    B -->|Low\npublic data| D{Quality requirements?}

    C -->|Simple| E[✅ Local\nOllama]
    C -->|Complex| F[⚠️ Weigh up:\nLocal with review\nvs. Cloud with ZDR]

    D -->|Standard| G[✅ Local\nOllama]
    D -->|High| H[✅ Cloud\nClaude]

    class E success
    class G success
    class H cloud
    class F warning

The decision logic in practice:

Sensitive data, simple task → local, done.

Sensitive data, complex task → this is the hard case. I weigh it up: can I simplify the task enough that local is sufficient? Or is there a cloud option with zero data retention (ZDR)? There's no blanket answer; it's a case-by-case call.

Non-sensitive data → cloud, if the quality justifies it. No reason for local constraints when the data would be publicly accessible anyway.
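Written down as code, the decision logic is small enough to fit in one function. This is only a sketch of the flowchart above, with hypothetical names; the hard case deliberately returns "decide manually" instead of pretending there is an automatic answer.

```python
from enum import Enum


class Route(Enum):
    LOCAL = "local"
    CLOUD = "cloud"
    CASE_BY_CASE = "case-by-case"


def choose_route(sensitive: bool, complex_task: bool, high_quality_needed: bool) -> Route:
    """Pick where a task should run, following the data-sensitivity-first flowchart."""
    if sensitive:
        # Sensitive data: complexity decides. Complex tasks need a manual call
        # (local with review versus a zero-data-retention cloud option).
        return Route.CASE_BY_CASE if complex_task else Route.LOCAL
    # Non-sensitive data: only the quality requirement matters.
    return Route.CLOUD if high_quality_needed else Route.LOCAL
```

For example, `choose_route(sensitive=True, complex_task=False, high_quality_needed=True)` returns `Route.LOCAL`: for sensitive data the quality requirement is not consulted at all, which mirrors the order of the questions in the diagram.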


What I'd Do Differently

Three things I wish I'd known earlier:

1. Define the use case first — then make the architecture decision.
What sounds like a basic principle in theory has proven itself in practice. I went from MBA M2 with 24GB, to Pi 5 with 16GB, to Pi 4 with 4GB. Each stage has different strengths and limits, and the right choice depends directly on the use case. Whoever reverses that order and buys the hardware first often finds it's either over-engineered or too limited for the actual task.

2. Small models have their own character — and you need to learn it.
phi3:mini is stronger at reasoning, gemma2:2b is faster and more stable. You only really feel this after several hours of real use. Benchmarks help, but they don't replace hands-on experience.

3. Hybrid isn't a defeat.
For a while I tried to keep as much local as possible — out of conviction and out of curiosity. At some point I accepted that a sensible hybrid approach is more pragmatic than dogmatic cloud avoidance. Control through architecture where it matters, and cloud where it makes sense.


Conclusion

Local AI without the cloud works. For the right tasks, with the right expectations.

What surprised me most: how well it handles summarisation and classification. And how clearly the limits show up in complex reasoning.

What I've taken away: the hybrid approach isn't the path of least resistance — it's the pragmatically right path. Local where data sovereignty matters and the task allows it. Cloud where quality, complexity, regulatory requirements, or legal obligations justify it.

And anyone who has actually tried it — got a model running locally, watched the traffic monitor, tested the limits — understands the technology in a way that no article can convey.


All figures are based on personal tests with Ollama on MacBook Air M2 (24GB RAM), Raspberry Pi 5 (16GB RAM), and Raspberry Pi 4 (4GB RAM), using various models from gemma2:2b to llama3:70b. Current as of June 2026.

This article reflects my personal views exclusively and has no connection to any professional affiliation.

