Three Engineers, a Million Lines of Code, and the Part They Forgot to Mention
Source: OpenAI

OpenAI published a post this week about how a team of three engineers built a million lines of code in five months using their Codex agent. No human wrote a single line directly. Average throughput: 3.5 pull requests per engineer per day. The numbers are impressive. The post is well-written. And it is, in a very specific way, a piece of marketing dressed as an engineering retrospective.
That's not a dismissal. There's a real idea in here. It just requires some excavation.
The real idea is this: when an AI agent can't access something in context, it doesn't exist. The Slack thread where your team aligned on an architectural decision. The Google Doc with the original design rationale. The thing Dave just knows. From an agent's point of view, none of it is real. So the team rebuilt their entire approach to documentation — not as a courtesy to future colleagues, but as functional infrastructure. They called it "harness engineering": designing environments, versioning decisions into the repository, and specifying intent precisely enough for an agent to act on it reliably.
It reframes documentation from bureaucratic overhead into something closer to the actual product. If the knowledge doesn't live in the repo, the system can't use it. That principle will still be true in five years, regardless of which tools are fashionable.
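To make the principle concrete — this is my own illustration, not an artifact from the OpenAI post — "versioning decisions into the repository" might look like a lightweight decision record checked in next to the code, so an agent encounters the rationale in context instead of depending on a Slack thread it can't see. The filename, numbering, and scenario below are all hypothetical:

```markdown
<!-- docs/decisions/0007-queue-backend.md — hypothetical example -->
# 0007: Use a Postgres-backed job queue instead of Redis

Status: accepted (2025-03-14)

## Context
We need durable background jobs. The team debated Redis vs. Postgres in a
Slack thread; that thread is invisible to any coding agent, so the outcome
is recorded here.

## Decision
Jobs are stored in Postgres (the `jobs` table) and claimed with
`SELECT ... FOR UPDATE SKIP LOCKED`. Do not introduce a Redis dependency.

## Consequences
- One fewer service to operate; queue throughput is bounded by Postgres.
- Agents and humans alike should extend `jobs`, not add a new queue.
```

The exact format matters less than the property the post identifies: because the decision lives in the repo, it exists for the agent.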
But here's what the post doesn't say.
This was a greenfield project. Empty repository, clean slate, no legacy constraints, no inherited technical debt, no team members who've been doing it their way since 2017. The team was at OpenAI, running OpenAI's own frontier coding model, with the kind of feedback loop and scaffolding investment that requires real engineering sophistication to build and maintain. When the agent hit a wall, the response wasn't to prompt harder. It was to step back, identify the missing capability, and engineer around it. That's a high bar. Most teams encountering AI coding tools are doing so inside messy fifteen-year-old codebases, under delivery pressure that doesn't accommodate that kind of reflective investment.
There's also a metric problem. A million lines of agent-generated code, reviewed by three people quickly enough to sustain 3.5 PRs per engineer per day — the quality question is simply never asked. The post mentions "hundreds of internal users, including daily power users." That's early-stage internal validation. It's not nothing, but it's a long way from the production stress-testing that reveals what a million lines of fast-generated code actually contains. Perhaps it's excellent. Perhaps it's fine. We're not told, because throughput, not quality, is the story.
None of this makes the piece wrong. "Context architecture is now a primary engineering discipline" is a real and important observation. "The best use of human time is asking what capability is missing, not reaching for the keyboard" — also real.
What the piece is not is a replicable playbook. It's a proof of concept run under ideal conditions by people who built the tool, to demonstrate the potential of the tool. That's a legitimate and interesting thing to publish. But the gap between this and "my team should do this" is enormous, and the post is carefully uninterested in that gap.
The insight is genuine. The conditions required to reproduce it are conspicuously absent.
Read it for the documentation principle. Be skeptical of the throughput numbers. And notice who's telling you the story and why.