The Overthink Economy

Somewhere between OpenAI’s o1 in late 2024 and Claude’s extended thinking rollout in 2025, thinking stopped being a side effect and became a product feature. By 2026 it’s a checkbox in every serious API: reasoning: true, thinking: { budget_tokens: 16000 }, show_work: verbose. The entire industry pivoted from “generate a response” to “generate a visible process and then a response,” and the whole shape of the economy around LLMs shifted with it.

Let’s call it what it is. The overthink economy.

From blast to deliberate

The old model was simple. You sent a prompt, the model produced tokens left-to-right, you got an answer. Fast, cheap, confident, and often wrong in ways that looked suspiciously like confidence.

The new model is inverted. The model thinks, visibly, sometimes for thirty seconds, sometimes for ten minutes. You see scratch pads. Chains of reasoning. Self-correction mid-thought. Rejected approaches. Then a final answer that’s often genuinely different from what a blast-response would have given.

This isn’t a gimmick. On hard problems (competitive math, multi-hop logic, complex refactors, anything requiring actual planning) reasoning models outperform their blast equivalents by a wide margin. The cost is latency. And latency is now a UX axis we have to deliberately design around.

The bill that pricing wasn’t ready for

Here’s what surprised me. Reasoning tokens get billed. They count against your context window. Some providers bill them at a premium. A single call to a reasoning model can spend 40,000 tokens thinking before producing 200 tokens of output. Your invoice doesn’t care that more than 99% of those tokens were invisible to you.

The pricing page had to grow a new column. Input tokens. Output tokens. Reasoning tokens. Three SKUs where there used to be two. Procurement teams blinked and suddenly their spend on a single model tripled, not because they sent more queries, but because each query now thought longer.
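
To make the three-column bill concrete, here is a minimal sketch of the arithmetic. The per-million-token prices are made-up placeholders, not any provider’s real rates, and the Pricing and call_cost names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical USD prices per million tokens (assumed, not real rates)."""
    input_per_m: float
    output_per_m: float
    reasoning_per_m: float


def call_cost(p: Pricing, input_tokens: int, output_tokens: int,
              reasoning_tokens: int) -> float:
    """Dollar cost of one call, summing the three billed token classes."""
    total = (input_tokens * p.input_per_m
             + output_tokens * p.output_per_m
             + reasoning_tokens * p.reasoning_per_m)
    return total / 1_000_000


# Assumed prices: reasoning billed at the same rate as visible output.
pricing = Pricing(input_per_m=2.0, output_per_m=8.0, reasoning_per_m=8.0)

# Same prompt and same visible answer, with and without a thinking phase.
blast = call_cost(pricing, input_tokens=1_000, output_tokens=200,
                  reasoning_tokens=0)
deliberate = call_cost(pricing, input_tokens=1_000, output_tokens=200,
                       reasoning_tokens=40_000)

# Share of generated tokens the user never sees in the final answer.
hidden_share = 40_000 / (40_000 + 200)

print(f"blast: ${blast:.4f}  deliberate: ${deliberate:.4f}")
print(f"hidden share of generated tokens: {hidden_share:.1%}")
```

Under these assumed prices the query volume never changes; only the reasoning column moves the total.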

This has a second-order effect nobody fully priced in: it made prompts matter differently. A tight prompt with good constraints means less wandering. A vague prompt means the model thinks itself in circles for a minute and bills you for the privilege. Prompt engineering used to mean getting a good answer. In 2026 it also means not getting a $4 answer.

Latency as UX

Thirty seconds is an eternity in chat. Ten minutes is a commercial break. But here’s the thing: users accepted it. Once they saw the scratch pad — the visible thinking — they started treating those seconds as progress, not delay.

This is a huge UX finding and it’s worth sitting with. We didn’t make models faster; we made waiting legible. The progress bar is the reasoning. The user sees “the model considered approach A, rejected it, tried B, tested it against the constraint, moved on,” and they feel informed, not stalled.

Streaming made slow acceptable in 2023. Visible reasoning made very slow acceptable in 2026. I don’t think anyone predicted that.

It also rewired our expectations of what a model should do. A model that just blasts an answer now feels… careless. Even when it’s right. The aesthetic shifted. We want to see the work.

Tool calls got slower too

Agentic workflows stacked on top of reasoning models are even slower. A single task might be: think → call tool → read output → think → call another tool → think → answer. Each “think” is 5 to 30 seconds, so ten of them already add up to as much as five minutes. Add tool latency and a ten-step agent run can sit at 5 to 8 minutes end-to-end.
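
The arithmetic behind that estimate can be sketched in a few lines. The 18-second tool latency is my own assumption, chosen only to show how a run reaches the upper end of the range; agent_run_seconds is a hypothetical helper.

```python
def agent_run_seconds(steps: int, think_s: float, tool_s: float) -> float:
    """Wall-clock time for a sequential agent run.

    Each step is modeled as one thinking phase followed by one tool call.
    """
    if steps < 0 or think_s < 0 or tool_s < 0:
        raise ValueError("steps and durations must be non-negative")
    return steps * (think_s + tool_s)


# Ten steps at the slow end of the thinking range, ignoring tools entirely.
thinking_only_worst = agent_run_seconds(steps=10, think_s=30, tool_s=0)

# Same run with an assumed 18 seconds of tool latency per step.
with_tools_worst = agent_run_seconds(steps=10, think_s=30, tool_s=18)

print(f"thinking only: {thinking_only_worst / 60:.0f} min")
print(f"with tools:    {with_tools_worst / 60:.0f} min")
```

Because the steps run one after another, every extra second of thinking or tool latency is multiplied by the step count.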

Nobody wants to watch that in a chat window. The interfaces that won are the ones that moved agentic work out of the synchronous chat loop entirely. Background tasks. Notifications when done. Parallel sub-agent runs where the user does other work while the system chews.
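
As a rough illustration of the background-task pattern, here is a sketch using Python’s asyncio. The sub-agent names and the sleep-based stand-in for a reasoning loop are hypothetical; a real system would call a model and tools where the sleep is.

```python
import asyncio


async def run_subagent(name: str, seconds: float) -> str:
    """Stand-in for a slow think/tool loop; the sleep simulates its latency."""
    await asyncio.sleep(seconds)
    return f"{name}: done"


async def main() -> list[str]:
    inbox: list[str] = []  # plays the role of a notification center

    # Fire off the sub-agents in parallel instead of awaiting them in turn.
    jobs = [("research", 0.3), ("refactor", 0.1), ("tests", 0.2)]
    tasks = [asyncio.create_task(run_subagent(n, s)) for n, s in jobs]

    # The foreground is free to do other work here while the tasks run.

    # Deliver a notification as each sub-agent finishes, in finish order.
    for finished in asyncio.as_completed(tasks):
        inbox.append(await finished)
    return inbox


inbox = asyncio.run(main())
print(inbox)
```

The total wait is roughly the slowest sub-agent rather than the sum of all three, which is the whole appeal of moving the work out of the chat loop.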

If your product is still trying to do agentic reasoning in a synchronous typing-dots interface, you’re fighting physics. The economy already moved to fire-and-forget.

What the pricing implies about the future

If reasoning tokens grow faster than linearly with problem complexity (quadratically, say), and providers keep billing them at premium rates, there’s a crossover coming where the marginal cost of more reasoning exceeds the value of the marginal correctness. We’re not there yet. But a $2 answer that’s 94% correct vs a $12 answer that’s 97% correct is the frontier every team now navigates.
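
One way to reason about that frontier is to ask how expensive a wrong answer has to be before the pricier call pays for itself. The sketch below uses the $2 and $12 figures from the example above; breakeven_error_cost is a hypothetical helper, and it assumes the only benefit of the extra spend is fewer wrong answers.

```python
def breakeven_error_cost(cheap_cost: float, cheap_acc: float,
                         rich_cost: float, rich_acc: float) -> float:
    """Cost of one wrong answer above which the pricier option wins.

    Accuracies are probabilities in [0, 1]. The pricier option is worth it
    when (errors avoided per query) * (cost of an error) >= extra spend.
    """
    errors_avoided = rich_acc - cheap_acc
    if errors_avoided <= 0:
        raise ValueError("the pricier option must be more accurate")
    return (rich_cost - cheap_cost) / errors_avoided


# $2 at 94% correct versus $12 at 97% correct.
breakeven = breakeven_error_cost(2.0, 0.94, 12.0, 0.97)
print(f"upgrade pays off if a wrong answer costs more than ${breakeven:.2f}")
```

With these numbers the extra $10 buys three fewer errors per hundred queries, so the upgrade only makes sense when a single wrong answer costs more than roughly $333.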

Aggregators help. Gateways can decide: easy task → blast model, hard task → reasoning model, really hard task → reasoning model with big budget. The routing becomes economic, not just technical.
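
A gateway’s routing rule can be as small as the sketch below. Everything in it is an assumption for illustration: the difficulty score, the 0.4 and 0.8 thresholds, the model names, and the budget sizes. How the difficulty score is produced is out of scope here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    model: str                      # hypothetical model identifier
    thinking_budget: Optional[int]  # reasoning tokens; None means no thinking


def route(difficulty: float) -> Route:
    """Map an estimated difficulty in [0, 1] to a model and thinking budget."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be between 0 and 1")
    if difficulty < 0.4:
        # Easy task: cheap, instant blast model.
        return Route(model="blast-model", thinking_budget=None)
    if difficulty < 0.8:
        # Hard task: reasoning model with a modest budget.
        return Route(model="reasoning-model", thinking_budget=4_000)
    # Really hard task: reasoning model with a big budget.
    return Route(model="reasoning-model", thinking_budget=32_000)


for score in (0.1, 0.5, 0.9):
    print(score, route(score))
```

The thresholds are where the economics live: moving them trades spend against the risk of under-thinking a hard request.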

The deepest shift is cultural. We’re paying for models to overthink, and that pricing signal will eventually land back on human work too. If customers tolerate a model that thinks for ten minutes, they might tolerate a human reviewer who thinks for an hour. If “fast but sloppy” gets penalized in AI pricing, it might get penalized everywhere else.

The tax we agreed to

We bought overthinking with dollars and seconds. The overthink economy is the trade of immediacy for reliability, of flat pricing for tiered pricing, of chat UI for task UI.

I think it’s a good trade most of the time. Hard problems deserve thought. Software is mostly hard problems. But I also think we should notice what we gave up: the cheap, instant, blast-a-response interaction that made LLMs feel magical in the first place. That product is still out there (Haiku, Flash, Mistral Small), and it’s still the right call for most of what we actually ask, perhaps 70% of it.

The mature move is to pick the right brain for the right task. Blast when you can. Overthink when you must. Pay attention to the bill.