OVTH / 2026 LIVE · V0.4.0
Ø Overthinking Gateway ↗

100% local: Ollama + Aider + llama.cpp

Full offline dev workflow on a single workstation. No tokens leave the room. Qwen2.5-Coder for daily work, DeepSeek-Coder for the hard parts, Llama 3.3 for chat.

Updated May 08, 2026 by xlrd · Fig. S02

Who this is for

You work on code you cannot upload. Regulated industries, client NDAs, internal R&D, a side project you just do not want training on. Or you travel to places with unreliable internet and want your tools to work in an airport lounge. Or the Anthropic bill last month made you do a spit take.

You need at least 24GB of GPU VRAM (RTX 4090, RTX 6000 Ada, RTX A6000) or a Mac with 48GB+ unified memory. 16GB works for smaller models but you will feel the ceiling within a week.

The three-tool stack

  • Ollama — model runner, one-line install, REST API on :11434
  • Aider — terminal coding agent, reads your repo, edits files, commits
  • llama.cpp — raw inference engine, what Ollama wraps; keep it installed for GGUF files Ollama does not have in its registry yet

Model picks that actually earn their slot

Be picky. A local model you use is infinitely more useful than three you downloaded and never opened.

  • Qwen2.5-Coder-32B-Instruct — the daily driver. Shockingly good at multi-file reasoning. Q5_K_M fits in 24GB with room for 16k context.
  • DeepSeek-Coder-V3 (or V2.5 Lite for 16GB cards) — reaches for the math-heavy and algorithmic work. Slower but sharper on edge cases.
  • Llama 3.3 70B Instruct — the chat brain. Use it for architecture conversations, not code edits. Needs Q4_K_M quant + 48GB to breathe.
  • Qwen2.5-Coder-7B — the autocomplete workhorse. Pin this as the tab model, it answers in under 200ms on a laptop GPU.

Setup in five steps

01. Install Ollama and pull the core set

bash
# macOS
brew install ollama
brew services start ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# pull the four keepers
ollama pull qwen2.5-coder:32b-instruct-q5_K_M
ollama pull qwen2.5-coder:7b-instruct-q5_K_M
ollama pull deepseek-coder-v3:16b-q5_K_M
ollama pull llama3.3:70b-instruct-q4_K_M

Disk cost: ~90GB. Worth every block.

02. Wire Aider to Ollama

Aider speaks OpenAI-compatible. Ollama serves OpenAI-compatible. You point one at the other.

bash
pip install aider-chat

# ~/.aider.conf.yml
cat > ~/.aider.conf.yml <<EOF
openai-api-base: http://localhost:11434/v1
openai-api-key: ollama
model: openai/qwen2.5-coder:32b-instruct-q5_K_M
weak-model: openai/qwen2.5-coder:7b-instruct-q5_K_M
edit-format: diff
auto-commits: true
dirty-commits: false
EOF

edit-format: diff matters. With whole, Aider will re-emit your whole file and the local model will occasionally lose punctuation. Diff mode is surgical.

03. Use llama.cpp for the models Ollama is slow to ship

Sometimes HuggingFace has a quant a week before Ollama’s registry does. Keep llama.cpp for those.

bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make LLAMA_CUDA=1 -j

# serve a GGUF directly
./llama-server \
-m models/Qwen2.5-Coder-32B-Instruct-Q5_K_M.gguf \
--host 0.0.0.0 --port 8080 \
-c 16384 -ngl 99

It also exposes /v1/chat/completions, so Aider can flip over with a one-line config change.

04. Pin a keyboard-level routing strategy

Two aliases in your shell: cheap uses the 7B, smart uses the 32B. Everything else is branch names.

bash
# ~/.config/fish/config.fish (or bashrc)
alias cheap='aider --model openai/qwen2.5-coder:7b-instruct-q5_K_M'
alias smart='aider --model openai/qwen2.5-coder:32b-instruct-q5_K_M'
alias chat='aider --model openai/llama3.3:70b-instruct-q4_K_M --edit-format ask'

05. Set up a fallback to the Gateway for the nights you travel

Local-first does not mean local-always. When the GPU is 9,000km away, keep a fallback config that points at the OVTH Gateway. Swap with one environment variable.

Cost, privacy, performance

The local stack is not a purity test. It is the version of the workflow that keeps running when the internet does not, when the client’s security policy says no, and when you just want to hear your GPU fans earn their electricity.