Proxy everything: why aggregation wins
The vendor with the best model loses to the proxy with the best routing. Here's why, with code.
I used to have six SDKs in my package.json. @anthropic-ai/sdk, openai, @google/generative-ai, @mistralai/mistralai, groq-sdk, and a hand-rolled adapter for a local Ollama. Each with its own retry logic, its own streaming quirks, its own auth dance. Each wrapped in a try/catch. Each with a config block in production that nobody wanted to touch.
Today I have one. openai. Pointed at https://gateway.ovth.dev/v1. It talks to a public model catalog through one API surface. The code doesn’t know how requests are routed. The code doesn’t need to know.
This is the aggregation pattern, and I want to argue it’s the single most underrated architectural decision in modern AI tooling.
The case, compressed
A gateway in front of your model layer gives you five things that no single vendor can:
- Model independence. Your code targets the OpenAI wire protocol. Your gateway routes to whoever’s actually serving. Swap Claude for GPT for Gemini by editing a JSON config, not by refactoring client code.
- Failover. Anthropic has a bad hour, the gateway retries via Google. Your product keeps working. Your users never see the outage page.
- Cost control. Route cheap traffic to cheap models. Route premium traffic to premium models. Enforce per-team budgets centrally. Alert when someone’s prompt-chain is hemorrhaging tokens.
- One auth surface. One key, one rotation policy, one audit log. Not six.
- Observability. Request traces, token counts, latency histograms — in one place, for every model, in the same schema.
None of these are hypothetical. All of them are necessary the moment you have more than one model in production.
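The cost-control and observability points can be made concrete with a small sketch. This is not OVTH’s implementation: `UsageRecord`, `BudgetTracker`, the model labels, and the per-token prices are all hypothetical, chosen only to show one record schema for every model and a per-team cap enforced in one place.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens; real prices vary by provider.
PRICE_PER_1K_TOKENS = {"frontier-model": 0.03, "cheap-model": 0.001}


@dataclass(frozen=True)
class UsageRecord:
    """One row per request, same shape regardless of which model served it."""

    team: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float

    @property
    def cost_usd(self) -> float:
        total_tokens = self.prompt_tokens + self.completion_tokens
        return total_tokens / 1000 * PRICE_PER_1K_TOKENS[self.model]


class BudgetTracker:
    """Accumulates spend per team and reports whether a team is over its cap."""

    def __init__(self, caps_usd: dict[str, float]) -> None:
        self.caps_usd = caps_usd
        self.spent_usd: dict[str, float] = defaultdict(float)

    def record(self, usage: UsageRecord) -> None:
        self.spent_usd[usage.team] += usage.cost_usd

    def over_budget(self, team: str) -> bool:
        # Teams without a configured cap are treated as unlimited.
        return self.spent_usd[team] > self.caps_usd.get(team, float("inf"))
```

A gateway could call `record` once per completed request and check `over_budget` before forwarding the next one, which is the sort of central enforcement that is awkward to do across six separate SDKs.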
The code looks like nothing
That’s the point.
    import os

    from openai import OpenAI

    # One client, pointed at the gateway instead of a vendor endpoint.
    client = OpenAI(
        base_url="https://gateway.ovth.dev/v1",
        api_key=os.environ["OVTH_API_KEY"],
    )

    response = client.chat.completions.create(
        model="claude-opus-4.7",
        messages=[{"role": "user", "content": "hello"}],
    )

That request is routed, retried if needed, schema-normalized, and shaped into the response format your client expects, all invisibly. You get an OpenAI-shaped response object, and you move on.
    import OpenAI from "openai";

    const client = new OpenAI({
      baseURL: "https://gateway.ovth.dev/v1",
      apiKey: process.env.OVTH_API_KEY,
    });

    const stream = await client.chat.completions.create({
      model: "kimi-k2.6",
      messages: [{ role: "user", content: "refactor this" }],
      stream: true,
    });

    // Write each streamed delta to stdout as it arrives.
    for await (const chunk of stream) {
      process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
    }

Claude today. Open-source tomorrow. Gemini next week. Same code.
Why this wins structurally
Software has a long history of this pattern. TCP/IP won because it didn’t care what was underneath. Kubernetes won because it didn’t care what cloud you were on. Stripe won because it didn’t care which card network processed the charge.
The AI layer is finally getting its own abstraction. The OpenAI chat-completions schema became the de facto wire format, not because it’s beautiful, but because it’s good enough and everywhere. Anthropic speaks it via a compatibility mode. Google speaks it. Mistral speaks it. Even Ollama speaks it locally on port 11434.
Once a wire format becomes universal, the switching cost collapses. Once switching cost collapses, the power moves up the stack — to whoever owns routing, observability, and cost management. That’s the gateway.
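Because the wire format is shared, switching backends can reduce to switching a base URL. A minimal sketch, assuming the gateway URL used earlier and Ollama’s OpenAI-compatible endpoint under `/v1` on its default port. The `BACKENDS` table and the `client_kwargs` helper are hypothetical names, and the Ollama key is a placeholder, since a local Ollama does not check it.

```python
import os

# Hypothetical table of OpenAI-compatible backends.
# Only the base URL and the source of the API key differ between entries.
BACKENDS = {
    "gateway": {"base_url": "https://gateway.ovth.dev/v1", "key_env": "OVTH_API_KEY"},
    "ollama": {"base_url": "http://localhost:11434/v1", "key_env": None},
}


def client_kwargs(backend: str) -> dict[str, str]:
    """Build keyword arguments for openai.OpenAI(...) for the named backend."""
    entry = BACKENDS[backend]
    if entry["key_env"] is None:
        # Placeholder: the client library wants a non-empty key even if unused.
        api_key = "ollama"
    else:
        api_key = os.environ[entry["key_env"]]
    return {"base_url": entry["base_url"], "api_key": api_key}
```

With the `openai` package installed, `OpenAI(**client_kwargs("ollama"))` would produce a client that the rest of the calling code can treat exactly like the gateway-backed one.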
The OVTH Gateway pattern, specifically
I’ll use OVTH as the example because I know it best and because it’s the reference implementation I reach for.
    # gateway.yaml: a real fragment
    providers:
      anthropic:
        priority: 1
        models: [claude-opus-4.7, claude-sonnet-4.6]
      google:
        priority: 2
        models: [gemini-3.1-pro, gemini-3-flash]
      openai:
        priority: 3
        models: [gpt-5.5, gpt-5.4-mini]
    routing:
      - match: { model: "claude-opus-4.7" }
        try: [anthropic, openai]
      - match: { model: "gemini-3-flash" }
        try: [google, anthropic]
      - match: { prompt_length: "<2000" }
        prefer: cheapest  # keep the cheap lane warm

The gateway:
- exposes one stable OpenAI-compatible endpoint and one Anthropic-native endpoint.
- lets clients use public model IDs instead of provider-specific configuration.
- normalizes common schema and message-shape differences before a request leaves the public API boundary.
- supports reasoning controls, streaming, tool calls, and image inputs through the same client-facing contract.
- keeps routing, retry, and backend details out of the calling code.
None of this is visible to the client. The public contract is the API shape, model IDs, limits, and capabilities — not the internal routing machinery.
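To make the routing block in gateway.yaml less abstract, here is one way such rules could be evaluated. This is a sketch under assumptions, not the gateway’s actual logic: it assumes rules are checked top to bottom and the first match wins, it handles only the `"<N"` form of `prompt_length`, and it leaves open whether that length is counted in characters or tokens. `ROUTING` and `first_matching_rule` are hypothetical names.

```python
from typing import Any, Optional

# Hypothetical in-memory form of the routing rules from the YAML fragment.
ROUTING: list[dict[str, Any]] = [
    {"match": {"model": "claude-opus-4.7"}, "try": ["anthropic", "openai"]},
    {"match": {"model": "gemini-3-flash"}, "try": ["google", "anthropic"]},
    {"match": {"prompt_length": "<2000"}, "prefer": "cheapest"},
]


def _matches(match: dict[str, Any], model: str, prompt_length: int) -> bool:
    """Return True if every condition present in `match` holds."""
    if "model" in match and match["model"] != model:
        return False
    bound = match.get("prompt_length")
    if bound is not None:
        # Only "<N" is supported here; any other form is treated as no match.
        if not bound.startswith("<") or prompt_length >= int(bound[1:]):
            return False
    return True


def first_matching_rule(
    model: str, prompt_length: int, rules: list[dict[str, Any]] = ROUTING
) -> Optional[dict[str, Any]]:
    """Assumed semantics: scan rules in order and return the first that matches."""
    for rule in rules:
        if _matches(rule["match"], model, prompt_length):
            return rule
    return None
```

Under these assumptions a request for `claude-opus-4.7` resolves to the `try: [anthropic, openai]` rule regardless of length, while a short request for any other model falls through to the `prefer: cheapest` rule.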
Why aggregation matters
A serious production stack needs three tiers, and a gateway that speaks all three is the difference between “works in demo” and “works at 3 AM on a Tuesday”:
- Frontier — Claude Opus, GPT-5.5, Gemini 3.1 Pro. The models you reach for when correctness matters.
- Fast — Haiku, GPT-5.4-mini, Gemini Flash, DeepSeek Flash. The models you reach for when latency matters.
- Cheap — whatever lane gives you the most tokens per dollar for triage and pre-processing.
A gateway that speaks to all three means your code runs the same whether the query lands on a frontier cluster or a cost-optimized backup. That is what optionality looks like.
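The three tiers can be expressed as a small lookup that calling code consults before it picks a model ID. The policy below is a hypothetical illustration (correctness first, then latency, otherwise cost). The frontier and fast model IDs come from the gateway.yaml fragment, and the cheap entry is deliberately a placeholder, since the right lane depends on current pricing.

```python
from enum import Enum


class Tier(Enum):
    FRONTIER = "frontier"
    FAST = "fast"
    CHEAP = "cheap"


# The cheap entry is a placeholder ID, not a real model name.
TIER_MODELS = {
    Tier.FRONTIER: "claude-opus-4.7",
    Tier.FAST: "gemini-3-flash",
    Tier.CHEAP: "cheap-lane-placeholder",
}


def pick_tier(needs_correctness: bool, latency_sensitive: bool) -> Tier:
    """Hypothetical policy: correctness first, then latency, otherwise cost."""
    if needs_correctness:
        return Tier.FRONTIER
    if latency_sensitive:
        return Tier.FAST
    return Tier.CHEAP


def model_for(needs_correctness: bool, latency_sensitive: bool) -> str:
    """Map a request's needs to the model ID sent to the gateway."""
    return TIER_MODELS[pick_tier(needs_correctness, latency_sensitive)]
```

Because only the model ID changes, the request code itself stays identical across tiers.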
The objection I keep hearing
“But a gateway is another single point of failure.”
True in principle. Less true in practice. A well-designed gateway has more failover paths than any single vendor, because it aggregates them. If Anthropic goes down, direct clients break while gateway clients route to Google. On the provider side, the gateway only runs out of options when everyone fails, at which point you’ve got bigger problems than the gateway. That leaves the gateway process itself as the thing that can go down.
Self-hosting answers the rest. OVTH Gateway is a single binary. Run two replicas behind a load balancer. Now your SPOF is your own ops, which you already trust.
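The failover argument comes down to a loop: try providers in order and give up only when all of them have failed. A minimal sketch, assuming each provider is represented by a callable that either returns a response dict or raises a retryable error. `ProviderError`, `AllProvidersFailed`, and `call_with_failover` are hypothetical names, not part of any SDK.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical error for a retryable upstream failure (timeout, 5xx)."""


class AllProvidersFailed(Exception):
    """Raised only when every provider in the list has failed."""


Sender = Callable[[dict], dict]


def call_with_failover(
    providers: Sequence[tuple[str, Sender]], request: dict
) -> dict:
    """Try each (name, send) pair in order and return the first success."""
    failures: list[tuple[str, str]] = []
    for name, send in providers:
        try:
            response = send(request)
        except ProviderError as exc:
            # Retryable failure: remember it and move on to the next provider.
            failures.append((name, str(exc)))
            continue
        return {**response, "served_by": name}
    raise AllProvidersFailed(failures)
```

Only `ProviderError` is caught here, so a non-retryable problem such as a malformed request would still surface immediately instead of being replayed against every backend.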
The real bet
Every time you add a provider-specific SDK, you’re betting that provider is the right long-term choice. Every time you put a gateway in front, you’re betting the abstraction outlasts any provider.
Given the pace at which frontier models rotate leadership, I know which bet I’d rather hold.
Proxy everything. Let your code be boring. Keep your optionality. The vendor with the best model this quarter isn’t the vendor with the best model next quarter. The proxy in front of all of them always wins.