Why the model that writes more actually does more

Benchmarks don't tell the full story. After running Claude Code and Codex CLI side by side on real production work, I've found that the difference that actually matters in agentic workflows isn't capability: it's output style.

I've been running Claude Code and Codex CLI side by side for several months across real production work. Not synthetic benchmarks, not toy projects. Actual platform architecture, multi-tenant routing, MCP integration. The kind of work where a wrong turn costs days.

The capability benchmarks are genuinely close. SWE-bench scores sit within a few percentage points of each other, and on any given task you could make a case for either. But there's a difference that doesn't show up in any benchmark I've seen, and it's been consistently decisive in my workflows.

Claude Code, running Opus, writes research output the way a senior engineer thinks through a problem. Verbose, explanatory, structured reasoning. You get the conclusion and the chain of thought that led there. Codex tends toward concision. Bullet points, short paragraphs, clean summaries. On the surface, that looks efficient.

The problem is that in agentic workflows, research doesn't end at the research step. It feeds execution. The output of one agent becomes the context for the next. And when that context is a bulleted summary, you've already stripped out the connective tissue: the qualifications, the trade-off reasoning, the "this approach works unless you're doing X" caveats that only appear in full prose. By the time a coding agent picks up a compressed summary and starts generating, it's working from an impoverished foundation.
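To make the compression point concrete, here's a minimal sketch of the handoff. Every name here is hypothetical (this is not either tool's API): a research step produces a recommendation plus its reasoning, and the prompt builder either forwards both or strips the reasoning the way a bulleted summary does.

```python
from dataclasses import dataclass


@dataclass
class ResearchOutput:
    recommendation: str  # the conclusion itself
    reasoning: str       # trade-offs and "works unless X" caveats


def build_execution_prompt(research: ResearchOutput, compress: bool) -> str:
    """Assemble the context handed to the next agent in the chain.

    With compress=True, only the recommendation survives the handoff;
    the connective tissue in `reasoning` is stripped before execution.
    """
    if compress:
        return f"Task: implement the following.\n{research.recommendation}"
    return (
        "Task: implement the following.\n"
        f"{research.recommendation}\n\n"
        "Background reasoning (keep these constraints in mind):\n"
        f"{research.reasoning}"
    )
```

The compressed prompt is shorter, but the execution agent receiving it never sees the caveats, which is exactly the information it needs to avoid the correction cycles described above.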

I noticed this pattern when debugging a particularly awkward data routing problem. Codex gave me a clean summary of the options with a recommended approach. Claude gave me the same recommendation, plus a three-paragraph explanation of why one alternative looked attractive but would fail under specific cache invalidation conditions I hadn't explicitly mentioned. The Codex path required two correction cycles. The Claude path didn't require any.

This matters more as agentic task chains get longer. A single-step task doesn't need rich context. But when you're orchestrating agents across planning, research, implementation, and review, each handoff is a potential compression point where reasoning degrades. The model that writes like it's thinking out loud, rather than summarising for a slide deck, keeps that reasoning intact longer.

I'm not arguing that verbosity is inherently better. A model that rambles without structure is just as useless. The distinction is between explanation and padding. Opus writes long because it's reasoning, not because it's hedging. That's a meaningful difference.

The benchmark conversation in AI tends to flatten this. SWE-bench, Terminal-Bench, OSWorld. All useful, all measuring something real. But they're single-task evaluations. They don't capture what happens when you chain ten tasks together and the quality of step three depends on how thoroughly step one was explained. That's a workflow property, not a model property, and it's only visible in production use.

For anyone building serious agentic pipelines: test your models on context retention, not just task completion. Run a research step, then pass the output to an execution step without modification, and see how much hand-holding the second step needs. That gap is where the real performance difference lives.

The model that writes more, in my experience, consistently does more with less correction.

Tags: AI, Claude Code, Agentic Workflows, Developer Tools, Codex
