100% local: Ollama + Aider + llama.cpp
Full offline dev workflow on a single workstation. No tokens leave the room. Qwen2.5-Coder for daily work, DeepSeek-Coder for the hard parts, Llama 3.3 for chat.
Who this is for
You work on code you cannot upload. Regulated industries, client NDAs, internal R&D, a side project you would rather not see end up in someone's training data. Or you travel to places with unreliable internet and want your tools to work in an airport lounge. Or the Anthropic bill last month made you do a spit take.
You need at least 24GB of GPU VRAM (RTX 4090, RTX 6000 Ada, RTX A6000) or a Mac with 48GB+ unified memory. 16GB works for smaller models but you will feel the ceiling within a week.
The three-tool stack
- Ollama: model runner, one-line install, REST API on :11434
- Aider: terminal coding agent, reads your repo, edits files, commits
- llama.cpp: raw inference engine, what Ollama wraps; keep it installed for GGUF files Ollama does not have in its registry yet
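A quick way to confirm the runner is listening is to hit its REST API. The sketch below assumes Ollama's stock port from the list above and its /api/version endpoint; the helper name ollama_up is made up for illustration.

```shell
# Return 0 if an Ollama server answers at the given base URL, non-zero otherwise.
# ollama_up is a hypothetical helper name; the default URL assumes the stock port.
ollama_up() {
  base="${1:-http://localhost:11434}"
  curl --silent --fail --max-time 2 "$base/api/version" >/dev/null 2>&1
}

# Usage:
#   ollama_up && echo "Ollama is listening" || echo "start it with: ollama serve"
```

Passing a different base URL as the first argument lets the same check cover a server on another port.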
Model picks that actually earn their slot
Be picky. A local model you use is infinitely more useful than three you downloaded and never opened.
- Qwen2.5-Coder-32B-Instruct: the daily driver. Shockingly good at multi-file reasoning. Q5_K_M just about fits in 24GB; if 16k context pushes layers onto the CPU, drop to Q4_K_M.
- DeepSeek-Coder-V3 (or V2.5 Lite for 16GB cards): reach for it on math-heavy and algorithmic work. Slower but sharper on edge cases.
- Llama 3.3 70B Instruct: the chat brain. Use it for architecture conversations, not code edits. Needs a Q4_K_M quant + 48GB to breathe.
- Qwen2.5-Coder-7B: the autocomplete workhorse. Pin this as the tab model; it answers in under 200ms on a laptop GPU.
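The memory figures above follow from simple arithmetic: parameter count times bits per weight, divided by eight. The sketch below is a rough, weight-only estimate. The bits-per-weight values (about 5.5 for Q5_K_M, about 4.8 for Q4_K_M) are ballpark assumptions, real GGUF files vary by a gigabyte or two, and the KV cache for your context window comes on top.

```shell
# Rough weight-only size in GB: params (billions) * bits per weight / 8.
# est_weights_gb is a hypothetical helper; it ignores KV cache and runtime overhead.
est_weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

est_weights_gb 32 5.5   # 32B at ~Q5_K_M -> 22.0
est_weights_gb 70 4.8   # 70B at ~Q4_K_M -> 42.0
```

If the estimate lands within a couple of GB of your card's VRAM, expect to trade context length or quant level for headroom.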
Setup in five steps
01. Install Ollama and pull the core set
# macOS
brew install ollama
brew services start ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# pull the four keepers
ollama pull qwen2.5-coder:32b-instruct-q5_K_M
ollama pull qwen2.5-coder:7b-instruct-q5_K_M
ollama pull deepseek-coder-v3:16b-q5_K_M
ollama pull llama3.3:70b-instruct-q4_K_M
Disk cost: ~90GB. Worth every block.
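After a 90GB download it is worth checking that every tag actually landed. A minimal sketch, assuming `ollama list` prints a header line followed by rows whose first column is the model name; missing_models is a hypothetical helper, not part of Ollama.

```shell
# Print each wanted tag that is absent from `ollama list` output read on stdin.
# Assumes the first line is a header and the first column is the model NAME.
missing_models() {
  installed="$(awk 'NR > 1 { print $1 }')"
  for tag in "$@"; do
    printf '%s\n' "$installed" | grep -qxF -- "$tag" || printf '%s\n' "$tag"
  done
}

# Usage:
#   ollama list | missing_models \
#     qwen2.5-coder:32b-instruct-q5_K_M \
#     qwen2.5-coder:7b-instruct-q5_K_M \
#     llama3.3:70b-instruct-q4_K_M
```

An empty result means everything on the list is installed; any tag printed still needs an `ollama pull`.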
02. Wire Aider to Ollama
Aider speaks OpenAI-compatible. Ollama serves OpenAI-compatible. You point one at the other.
pip install aider-chat
# ~/.aider.conf.yml
cat > ~/.aider.conf.yml <<EOF
openai-api-base: http://localhost:11434/v1
openai-api-key: ollama
model: openai/qwen2.5-coder:32b-instruct-q5_K_M
weak-model: openai/qwen2.5-coder:7b-instruct-q5_K_M
edit-format: diff
auto-commits: true
dirty-commits: false
EOF
edit-format: diff matters. With whole, Aider will re-emit your whole file and the local model will occasionally lose punctuation. Diff mode is surgical.
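To try the wiring before committing it to ~/.aider.conf.yml, the same settings can be passed as command-line flags. This is a sketch of that one-off form; aider_local is a hypothetical wrapper name, and the flags mirror the config keys above.

```shell
# One-off equivalent of the config file, passed as command-line flags.
# aider_local is a hypothetical wrapper name; extra arguments pass through.
aider_local() {
  aider \
    --openai-api-base http://localhost:11434/v1 \
    --openai-api-key ollama \
    --model openai/qwen2.5-coder:32b-instruct-q5_K_M \
    --weak-model openai/qwen2.5-coder:7b-instruct-q5_K_M \
    --edit-format diff \
    "$@"
}

# Usage:
#   aider_local src/main.py
```

Once the session behaves, move the values into the config file so plain `aider` picks them up.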
03. Use llama.cpp for the models Ollama is slow to ship
Sometimes HuggingFace has a quant a week before Ollama’s registry does. Keep llama.cpp for those.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# recent llama.cpp builds with CMake; older checkouts used `make LLAMA_CUDA=1 -j`
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# serve a GGUF directly
./build/bin/llama-server \
  -m models/Qwen2.5-Coder-32B-Instruct-Q5_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  -c 16384 -ngl 99
It also exposes /v1/chat/completions, so Aider can flip over with a one-line config change.
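The one-line change is the openai-api-base entry. A minimal sketch of flipping it from the shell, assuming the config file from step 02 and a URL that contains no "|" or "&" characters; set_api_base is a hypothetical helper.

```shell
# Rewrite the openai-api-base line of an Aider config file in place.
# set_api_base is a hypothetical helper; it assumes the URL has no "|" or "&".
set_api_base() {
  conf="$1"; url="$2"
  tmp="$(mktemp)" || return 1
  sed "s|^openai-api-base:.*|openai-api-base: $url|" "$conf" > "$tmp" && mv "$tmp" "$conf"
}

# Usage:
#   set_api_base ~/.aider.conf.yml http://localhost:8080/v1    # llama-server
#   set_api_base ~/.aider.conf.yml http://localhost:11434/v1   # back to Ollama
```

Writing to a temporary file and moving it into place avoids the `sed -i` differences between macOS and Linux.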
04. Pin a keyboard-level routing strategy
Three aliases in your shell: cheap uses the 7B, smart uses the 32B, chat opens the 70B in ask mode. Everything else is branch names.
# ~/.config/fish/config.fish (or bashrc)
alias cheap='aider --model openai/qwen2.5-coder:7b-instruct-q5_K_M'
alias smart='aider --model openai/qwen2.5-coder:32b-instruct-q5_K_M'
alias chat='aider --model openai/llama3.3:70b-instruct-q4_K_M --edit-format ask'
05. Set up a fallback to the Gateway for the nights you travel
Local-first does not mean local-always. When the GPU is 9,000km away, keep a fallback config that points at the OVTH Gateway. Swap with one environment variable.
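One way the environment-variable swap could look is sketched below. USE_GATEWAY and GATEWAY_URL are hypothetical variable names chosen for illustration, and the gateway's address is not given here, so it has to come from your own account details.

```shell
# Choose an API base from one environment variable.
# USE_GATEWAY and GATEWAY_URL are hypothetical names; supply the real gateway
# endpoint yourself, since it is not specified in this guide.
pick_api_base() {
  if [ "${USE_GATEWAY:-0}" = "1" ]; then
    if [ -z "${GATEWAY_URL:-}" ]; then
      echo "GATEWAY_URL is not set" >&2
      return 1
    fi
    printf '%s\n' "$GATEWAY_URL"
  else
    printf '%s\n' "http://localhost:11434/v1"
  fi
}

# Usage (export first: a VAR=1 prefix on the aider line would be applied
# after "$(pick_api_base)" has already been expanded):
#   export USE_GATEWAY=1 GATEWAY_URL=<your gateway endpoint>
#   aider --openai-api-base "$(pick_api_base)"
```

A remote endpoint will also want a real key passed through --openai-api-key instead of the placeholder used for Ollama.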
Cost, privacy, performance
Related flash tutorials
- Ollama self-host in sixty seconds — the install-and-pull version
- Aider + git workflow — branches, commits, rollbacks
- OpenCode multi-provider — alternative agent with local backends
The local stack is not a purity test. It is the version of the workflow that keeps running when the internet does not, when the client’s security policy says no, and when you just want to hear your GPU fans earn their electricity.