Why I Run My AI Locally — And What That Really Means

There are two honest reasons why I run a dedicated Raspberry Pi 4 in my home network with a local language model running around the clock. The first is privacy and digital sovereignty — a conviction, not paranoia. The second is pure curiosity: there's something almost magical about a small computer running a language model, and I've always been someone who wants to understand how things work — from the inside. Theory matters, but nothing replaces hands-on experience.

The road to get here was a proper experimental journey: starting on the MacBook Air M2 where Apple Silicon shows what local inference can really do, moving to a Raspberry Pi 5 (16GB) as the first dedicated ARM server, then to a Pi 4 with 4GB RAM as a pragmatic interim step — and finally to the current setup: a Pi 4 with 8GB RAM, reserved exclusively for Ollama. Every stage taught something — and you only really understand why once you've been through it yourself.

This article isn't a tutorial for the perfect local AI stack. It's an honest field report with everything that comes with it: motivation, setup, capabilities, limits, and a clear assessment of when cloud AI — like Claude — is still the better choice.


Motivation: Privacy Meets Curiosity

The Privacy Angle

When I use Claude, ChatGPT, or any other cloud-based AI tool, my prompts leave my network. That's not speculation — it's the technical reality of every API call. What happens to that data depends on the terms of service, the pricing tier, and how much you trust the provider.

That's not an argument against cloud AI. It's an argument for using it deliberately. Not every prompt is equally sensitive. Explaining a general technical concept is a very different thing from processing internal project notes, analysing proprietary code, or summarising personal notes.

The question I ask myself isn't: "Is this provider trustworthy?" It's: "Which data do I want leaving my network at all — and which not?"

For the second category, local AI is the answer. Not because cloud providers are malicious, but because data sovereignty means making that decision yourself.

The regulatory background — GDPR, EU AI Act, Cyber Resilience Act — and their implications for AI usage are covered in a dedicated article. Coming soon.

The Curiosity Angle

Getting a local language model running is technically interesting. It's about understanding how quantisation works, why RAM is the critical resource on CPU-only hardware, which model architectures are efficient and which aren't. It's the same feeling that led me to set up a Raspberry Pi as a Gitea server, a second one for Ghost, and my own CI/CD pipeline with Gitea Actions.

Self-hosting isn't a means to an end for me. It's a mindset.


The Setup: Raspberry Pi 4 as a Local AI Hub

Hardware

My AI server is a Raspberry Pi 4 with 8GB RAM, running headless in the home network — dedicated to Ollama, nothing else. No screen, no keyboard — just SSH access over the local network.

The role split across the homelab has become clear over time: the Pi 5 (16GB) runs as NAS and Gitea server, the Pi 4-1 (4GB) handles Ghost, Nginx, and OpenWebUI as the web service layer, and the Pi 4-2 (8GB) is reserved exclusively for Ollama. Clean separation of responsibilities — no mixing of web services and AI inference on the same device.

Worth noting: I originally started this setup on a Raspberry Pi 5 with 16GB RAM — and the difference in performance and model choice is significant. Before the Pi 5, I also experimented on a MacBook Air M2 with 24GB RAM, which is a different league entirely: Apple Silicon with unified memory is excellent for local LLM inference.

For anyone starting fresh: a Raspberry Pi 5 with 8GB or 16GB RAM is the clear recommendation for a dedicated server — or Apple Silicon if you already have a Mac.

Performance benchmarks for local LLM inference with Ollama:

| Hardware | RAM | Recommended model size | Realistic tokens/s |
| --- | --- | --- | --- |
| Raspberry Pi 4 | 4 GB | 1B–2B | 0.5–1.5 |
| Raspberry Pi 4 | 8 GB | 1B–4B | 1–3 |
| Raspberry Pi 5 | 8 GB | 1B–7B | 3–8 |
| Raspberry Pi 5 | 16 GB | 3B–13B | 4–10 |
| MacBook Air M2 | 16 GB | 3B–7B | 15–25 |
| MacBook Air M2 | 24 GB | 7B–13B | 20–35 |
| Mini-PC (e.g. Intel N100) | 16 GB | 3B–7B | 8–20 |
| Mini-PC with dedicated GPU | 8+ GB VRAM | 7B–13B | 40–100+ |

Sources: serverman.co.uk, toolhalla.ai, mljourney.com — CPU-only inference, Q4 quantisation. Results vary by model and prompt.

Software: Ollama

Ollama is the tool of choice for local LLM inference on ARM hardware. Installation is a single line — full documentation at ollama.com:

curl -fsSL https://ollama.com/install.sh | sh

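Ollama handles model download, model management, and serving via a local API (default: port 11434); the models in its library come pre-quantised. On Linux, the install script sets it up as a systemd service that starts automatically on boot. By default the API listens on localhost only, so to reach it from other machines on the network, the OLLAMA_HOST environment variable has to be set accordingly.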
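To make the local API a bit more concrete, here is a minimal sketch of a client call using only the Python standard library. It assumes the Pi is reachable from the client; the address `192.168.20.10` and the helper names `build_generate_request`, `tokens_per_second`, and `generate` are placeholders of mine, not part of Ollama. The `/api/generate` endpoint and the `eval_count` / `eval_duration` response fields are taken from Ollama's API documentation.

```python
import json
import urllib.request

# Hypothetical address of the Pi on the local network; adjust to your setup.
OLLAMA_URL = "http://192.168.20.10:11434"


def build_generate_request(prompt, model="gemma2:2b", base_url=OLLAMA_URL):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(result):
    """Compute generation speed from a parsed /api/generate response.

    eval_count is the number of generated tokens,
    eval_duration is the generation time in nanoseconds.
    """
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(prompt, model="gemma2:2b", base_url=OLLAMA_URL, timeout=600):
    """Send the prompt to the Pi and return the parsed JSON response."""
    request = build_generate_request(prompt, model, base_url)
    # Generous timeout: a small board can need minutes for a single answer.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `generate("Summarise this note: ...")` returns a dictionary whose `response` field holds the generated text, and passing that dictionary to `tokens_per_second` gives a quick way to measure your own hardware instead of relying on published figures.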

Model Selection for the Pi 4 (8GB)

With 8GB of RAM, model selection is noticeably more comfortable than on a 4GB board — more headroom, no swap pressure. Here's an honest rundown of the models I've tested:

| Model | Size | RAM required | Tokens/s (Pi 4, 8GB) | Assessment |
| --- | --- | --- | --- | --- |
| phi3:mini | 3.8B | ~3.5 GB | 1–2 | Best quality, runs comfortably |
| gemma2:2b | 2B | ~2.0 GB | 2–3 | Fast and stable |
| llama3.2:3b | 3B | ~3.2 GB | 1–2 | Solid, no RAM pressure |
| qwen2.5:1.5b | 1.5B | ~1.5 GB | 3–4 | Very fast, good for simple tasks |

Sources: serverman.co.uk, toolhalla.ai, own tests. Q4_K_M quantisation, CPU-only.
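The RAM column can be sanity-checked with a back-of-the-envelope calculation: parameter count times bits per weight, plus some working memory. The sketch below is only a rough estimate. The default of 4.8 bits per weight for Q4_K_M and the 0.8 GB allowance for context cache and runtime are my assumptions, and both function names are hypothetical. Real usage depends on the model architecture and context length, so the results land in the same ballpark as the table rather than matching it.

```python
def estimate_model_ram_gb(params_billion, bits_per_weight=4.8, overhead_gb=0.8):
    """Rough RAM estimate in GB for a quantised model.

    bits_per_weight: assumed average for a 4-bit quantisation (illustrative).
    overhead_gb: assumed allowance for context cache and runtime (illustrative).
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


def fits_in_ram(params_billion, ram_gb, reserve_gb=1.0):
    """Check whether the estimate fits, leaving reserve_gb for the OS."""
    return estimate_model_ram_gb(params_billion) <= ram_gb - reserve_gb


for name, size in [("qwen2.5:1.5b", 1.5), ("gemma2:2b", 2.0), ("phi3:mini", 3.8)]:
    print(f"{name}: ~{estimate_model_ram_gb(size):.1f} GB, fits on 8 GB: {fits_in_ram(size, 8)}")
```

The same arithmetic shows why RAM, not CPU, decides which models are possible at all: a 13B model at this quantisation already exceeds what an 8GB board can hold next to the operating system.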

phi3:mini from Microsoft benefits noticeably on the 8GB board: no swap, stable operation, and the quality comes through properly. On 4GB it was a constant balancing act.

gemma2:2b from Google remains the pragmatic choice when speed matters more than maximum quality — ideal for batch tasks and simple classification.

My honest take: A Pi 4 with 8GB is a solid dedicated AI server for the home — as long as it's genuinely reserved for Ollama only. The key is role separation: no mixing with other services.


The Network: Full Isolation Is Possible

One of the most compelling aspects of running AI locally is control over network access. My Pi 4 runs in a dedicated VLAN in the home network, managed via a UniFi UCG Max.

How the Isolation Works

The Pi can be fully isolated from the internet without losing its ability to serve local requests. The UniFi Policy Engine makes this straightforward:

graph TD
    A[MacBook / local clients] -->|Port 11434| B[Raspberry Pi 4\nOllama]
    B -->|No outbound traffic| C[🚫 Internet]
    D[UniFi UCG Max] -->|Firewall rule: Block WAN| B
    D -->|Traffic monitoring| E[Dashboard]
    B -->|Local VLAN| A

The concrete configuration in the UniFi Controller:

1. Dedicated VLAN for the Pi (e.g. VLAN 20 "AI-Local")

2. Firewall rule in the Policy Engine:
- Source: Pi's IP (static IP recommended)
- Destination: WAN / internet
- Action: Block
- Direction: Outbound

3. Traffic monitoring via the UniFi dashboard gives full visibility: every outbound connection attempt is logged and — thanks to the firewall rule — blocked.

The result: the Pi responds to local API requests from my MacBook, my homelab, or other local services — but not a single packet leaves the home network. Whatever gets processed on the Pi stays on the Pi.

Sources: Ubiquiti Help Center, help.ui.com — Traffic & Policy Management in UniFi
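A firewall rule is only as good as its verification. As a minimal sketch, the following script can be run on the Pi itself to confirm that outbound connections really fail. The function names are hypothetical, and the two targets (public DNS resolvers addressed by IP, so no name lookup is needed) are just examples of my choosing.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def check_isolation(external_targets, timeout=3.0):
    """Return the list of external targets that are still reachable."""
    return [(host, port) for host, port in external_targets
            if can_connect(host, port, timeout)]


# Example targets, assumed here for illustration only.
EXTERNAL_TARGETS = [("1.1.1.1", 443), ("8.8.8.8", 53)]

# Usage on the Pi:
#   leaks = check_isolation(EXTERNAL_TARGETS)
#   print("isolated" if not leaks else f"still reachable: {leaks}")
```

An empty result means none of the tested targets could be reached. It does not prove that every possible path is closed, so the dashboard logs remain the more complete picture.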

Why This Matters

With this setup, it's technically impossible for a local model to phone home. That's a fundamentally different security model from "we trust the provider not to share the data."

It's not about distrusting specific providers. It's about the principle: control through architecture, not through trust.

An interesting side effect: traffic monitoring quickly reveals how differently local AI tools behave. Model weights can't open connections themselves, but some of the software around them tries to check for updates or send telemetry on startup, and you see it immediately in the dashboard. That sharpens your understanding of what's actually happening in the background.


What Actually Works: Local LLM Use Cases

Local models on a Pi 4 aren't a replacement for Claude or GPT-4. But for specific use cases, they're exactly the right tool:

Processing sensitive documents
Internal project notes, personal journals, proprietary code — anything that shouldn't leave the house but still benefits from language model processing. The model sees the data, but only locally.

Local automation and scripts
Bash scripts that process text, simple classification tasks, summarising local log files — none of this needs an internet connection or a large model.

Home automation and IoT
Processing sensor data locally with an LLM, anomaly detection, natural-language control of smart home devices — all without cloud dependency.

Experimentation and learning
Understanding how quantisation works, how different model architectures behave, how to build a local RAG pipeline — the Pi is an excellent testbed.

Offline assistant for simple tasks
Rephrasing text, short summaries, answering simple questions — works without an internet connection.


What Doesn't Work: The Honest Limits

A Pi 4 with 2B–4B models is not a Claude replacement — and it's not trying to be.

Performance
A Raspberry Pi 4 with 8GB RAM reaches around 1–2 tokens per second with 3B models, noticeably better than 4GB, where swap usage drags performance down further. A short 100-word response is roughly 130 tokens, so it still takes about one to two minutes. For interactive chat that's frustrating. For background batch tasks it's fine.
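The waiting time follows directly from response length and generation speed. A small sketch of that arithmetic, assuming a rule of thumb of about 1.3 tokens per English word (the real ratio depends on the tokenizer and the text), with a function name of my own:

```python
def estimate_response_seconds(words, tokens_per_second, tokens_per_word=1.3):
    """Estimate generation time in seconds for a response of the given length.

    tokens_per_word is a rule-of-thumb assumption for English text.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return words * tokens_per_word / tokens_per_second


# 100 words at the speeds quoted for 3B models on the Pi 4.
for speed in (1.0, 2.0):
    print(f"{speed} tokens/s: {estimate_response_seconds(100, speed):.0f} s")
```

Prompt processing comes on top of this and grows with the length of the input, so the estimate is a lower bound rather than a precise prediction.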

Reasoning and complexity
Small models (1B–4B parameters) hit their limits quickly on complex multi-step tasks. Code debugging, complex analysis, creative writing — the quality gap compared to Claude or GPT-4 is significant and noticeable.

Context window
Small models often have limited context windows. Long documents have to be split up, which complicates the workflow.
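Splitting a long document can be as simple as cutting it into overlapping word windows. The following is one possible sketch, not the only way to do it. The function name and the default sizes are assumptions, and counting words is only a stand-in for counting tokens.

```python
def split_into_chunks(text, max_words=300, overlap=30):
    """Split text into word-based chunks that overlap with their neighbours.

    The overlap keeps some shared context across chunk boundaries.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Each chunk is then summarised on its own, and the partial summaries are combined in a final pass. That is where the extra workflow complexity comes from: one request becomes several, on hardware where every request is already slow.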

Model size
On a Pi 4 with 8GB, models above 7B parameters aren't realistically usable. The strongest current models (Claude, GPT-4o, Gemini Ultra) are far larger: their exact sizes aren't published, but they're widely assumed to run to tens or hundreds of billions of parameters. That's simply not possible on this hardware.

Multimodality
Image processing, audio transcription, and other multimodal tasks aren't readily available on the Pi with Ollama.


Local vs. Cloud: What Goes Where?

This isn't an either/or question. I use both — deliberately and depending on the task:

| Task | Local (Pi + Ollama) | Cloud (e.g. Claude) |
| --- | --- | --- |
| Sensitive / private data | ✅ First choice | ⚠️ With caution |
| Simple automation | ✅ Sufficient | Overkill |
| Complex code tasks | ⚠️ Limited | ✅ First choice |
| Creative writing | ⚠️ Limited | ✅ First choice |
| Offline / no internet | ✅ Only option | ❌ Not possible |
| Fast interactive responses | ⚠️ Slow | ✅ First choice |
| Regulatory requirements | ✅ Clearly advantageous | ⚠️ Needs checking |
| Experimentation / learning | ✅ Ideal | Expensive |
| Long documents / large context | ⚠️ Limited | ✅ First choice |

My personal decision rule:

Local, when data shouldn't leave the house, when the task is simple enough for a small model, or when I want to experiment.

Cloud (Claude), when I need maximum quality, when the task is complex, when speed matters, or when the data isn't sensitive.

That's not a weakness of the local approach. It's a deliberate hybrid — the best of both worlds, used with intent.
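Written down as code, the decision rule could look like the sketch below. The function and its parameters are hypothetical, and the order of the checks is my own reading of the rule: when criteria conflict (sensitive but complex, for example), privacy wins.

```python
def choose_backend(*, sensitive=False, experiment=False,
                   complex_task=False, needs_speed=False):
    """Return "local" or "cloud" for a task, following the rule above.

    The priority order (privacy first, then experimentation, then
    quality and speed) is an assumption, not something the rule fixes.
    """
    if sensitive:
        return "local"   # data should not leave the house
    if experiment:
        return "local"   # tinkering is the whole point
    if complex_task or needs_speed:
        return "cloud"   # quality or latency matters more
    return "local"       # simple enough for a small model


print(choose_backend(sensitive=True, complex_task=True))
print(choose_backend(complex_task=True))
print(choose_backend())
```

A sensitive task that is too complex for a small model still ends up local here. In practice that is the case worth a second look: either simplify the task or strip the sensitive parts before anything goes to the cloud.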


Regulation as a Tailwind

The regulatory direction in the EU reinforces the case for local AI in certain contexts — particularly for companies and developers working with sensitive data.

GDPR sets clear requirements for processing personal data. The EU AI Act defines risk categories and transparency obligations. The Cyber Resilience Act mandates security by design. Local AI infrastructure simplifies compliance in many of these areas significantly — because the data simply never leaves your own systems.

The regulatory details — timelines, fines, specific requirements — are covered in a dedicated article. Coming soon.

Conclusion: Sovereignty Isn't About Paranoia

Running AI locally is a deliberate choice. Not out of distrust, not out of paranoia — but because data sovereignty means deciding for yourself which data touches which systems.

A dedicated Raspberry Pi 4 with 8GB RAM and Ollama is not a perfect system. It's slow, limited in model choice, and no substitute for the quality of large cloud models. But it's mine. Fully under my control, fully isolatable from the internet, fully transparent about what it's doing.

And sometimes that's exactly the right thing — not the fastest, not the most powerful, but the controllable.

There's also the curiosity factor. Understanding how a language model generates text on an ARM processor drawing just a few watts is fascinating. And anyone who has actually got a model running locally understands the technology in a way no API call can ever convey.


All performance figures are based on publicly available benchmarks and personal tests. Current as of June 2026. Hardware prices and model availability may change.

This article reflects my personal views exclusively and has no connection to any professional affiliation.

