Who this is for

You work on code you cannot upload. Regulated industries, client NDAs, internal R&D, a side project you just do not want training on. Or you travel to places with unreliable internet and want your tools to work in an airport lounge. Or the Anthropic bill last month made you do a spit take.

You need at least 24GB of GPU VRAM (RTX 4090, RTX 6000 Ada, RTX A6000) or a Mac with 48GB+ unified memory. 16GB works for smaller models but you will feel the ceiling within a week.

The three-tool stack

Ollama — model runner, one-line install, REST API on :11434
Aider — terminal coding agent, reads your repo, edits files, commits
llama.cpp — raw inference engine, what Ollama wraps; keep it installed for GGUF files Ollama does not have in its registry yet

Model picks that actually earn their slot

Be picky. A local model you use is infinitely more useful than three you downloaded and never opened.

Qwen2.5-Coder-32B-Instruct — the daily driver. Shockingly good at multi-file reasoning. Q5_K_M fits in 24GB with room for 16k context.
DeepSeek-Coder-V3 (or V2.5 Lite for 16GB cards) — reaches for the math-heavy and algorithmic work. Slower but sharper on edge cases.
Llama 3.3 70B Instruct — the chat brain. Use it for architecture conversations, not code edits. Needs Q4_K_M quant + 48GB to breathe.
Qwen2.5-Coder-7B — the autocomplete workhorse. Pin this as the tab model, it answers in under 200ms on a laptop GPU.

Setup in five steps

01. Install Ollama and pull the core set

bash

# macOS
brew install ollama
brew services start ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# pull the four keepers
ollama pull qwen2.5-coder:32b-instruct-q5_K_M
ollama pull qwen2.5-coder:7b-instruct-q5_K_M
ollama pull deepseek-coder-v3:16b-q5_K_M
ollama pull llama3.3:70b-instruct-q4_K_M

Disk cost: ~90GB. Worth every block.

02. Wire Aider to Ollama

Aider speaks OpenAI-compatible. Ollama serves OpenAI-compatible. You point one at the other.

bash

pip install aider-chat

# ~/.aider.conf.yml
cat > ~/.aider.conf.yml <<EOF
openai-api-base: http://localhost:11434/v1
openai-api-key: ollama
model: openai/qwen2.5-coder:32b-instruct-q5_K_M
weak-model: openai/qwen2.5-coder:7b-instruct-q5_K_M
edit-format: diff
auto-commits: true
dirty-commits: false
EOF

edit-format: diff matters. With whole, Aider will re-emit your whole file and the local model will occasionally lose punctuation. Diff mode is surgical.

03. Use llama.cpp for the models Ollama is slow to ship

Sometimes HuggingFace has a quant a week before Ollama’s registry does. Keep llama.cpp for those.

bash

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make LLAMA_CUDA=1 -j

# serve a GGUF directly
./llama-server \
-m models/Qwen2.5-Coder-32B-Instruct-Q5_K_M.gguf \
--host 0.0.0.0 --port 8080 \
-c 16384 -ngl 99

It also exposes /v1/chat/completions, so Aider can flip over with a one-line config change.

04. Pin a keyboard-level routing strategy

Two aliases in your shell: cheap uses the 7B, smart uses the 32B. Everything else is branch names.

bash

# ~/.config/fish/config.fish (or bashrc)
alias cheap='aider --model openai/qwen2.5-coder:7b-instruct-q5_K_M'
alias smart='aider --model openai/qwen2.5-coder:32b-instruct-q5_K_M'
alias chat='aider --model openai/llama3.3:70b-instruct-q4_K_M --edit-format ask'

05. Set up a fallback to the Gateway for the nights you travel

Local-first does not mean local-always. When the GPU is 9,000km away, keep a fallback config that points at the OVTH Gateway. Swap with one environment variable.

Quantization is compression for model weights. Lower bits = smaller file = faster inference, at the cost of quality. Here is how we actually choose:

Q4_K_M — the sweet spot for 70B models on 48GB rigs. Measurable quality loss but still usable.
Q5_K_M — the sweet spot for 32B and below. The quality gap to the full model is within noise on real coding tasks. Use this by default.
Q6_K — only if you have VRAM to burn. Diminishing returns vs Q5_K_M.
Q8_0 — for benchmarks, not for work. It is basically the full weights.
Q4_0 / Q3_K — avoid. Coherence drops on long prompts, hallucinations rise on structured output (JSON, code).

The _K_M suffix means k-quants with mixed precision for attention layers. They outperform the flat Q4_0 style at every size we tested.

Rule of thumb: pick the largest model that fits in VRAM at Q5_K_M with 16k context, not the largest model period. A Q4_0 70B loses to a Q5_K_M 32B on real multi-step coding work.

Cost, privacy, performance

Ollama self-host in sixty seconds — the install-and-pull version
Aider + git workflow — branches, commits, rollbacks
OpenCode multi-provider — alternative agent with local backends

The local stack is not a purity test. It is the version of the workflow that keeps running when the internet does not, when the client’s security policy says no, and when you just want to hear your GPU fans earn their electricity.

Who this is for

The three-tool stack

Model picks that actually earn their slot

Setup in five steps

01. Install Ollama and pull the core set

02. Wire Aider to Ollama

03. Use llama.cpp for the models Ollama is slow to ship

04. Pin a keyboard-level routing strategy

05. Set up a fallback to the Gateway for the nights you travel

Cost, privacy, performance

Related flash tutorials