GPT-5.4: What's real, what's packaging, and what got left out

Commentary · 3 min read · Published 2026-03-05 · AI Primer

Source: OpenAI

Tags: AI Hype · AI and Software · Market Narratives

OpenAI released GPT-5.4 today, positioned as their most capable and efficient frontier model for professional work, rolling out across ChatGPT, the API, and Codex. The headline numbers are strong: on GDPval, the benchmark testing agent performance across 44 occupations, GPT-5.4 matched or exceeded industry professionals in 83% of comparisons, up from 70.9% for GPT-5.2. Hallucinations dropped 33% at the claim level compared to GPT-5.2, with overall responses 18% less likely to contain errors.

Credit where it's due: the practical improvements matter more than the benchmarks. GPT-5.4 merges the coding strengths of GPT-5.3-Codex into a general-purpose model, improves tool use across software environments, and is the first mainline model with built-in computer-use capabilities. The API now supports context windows up to 1 million tokens alongside a new "Tool Search" system for managing tool calling. For developers building agents, that's useful. Less duct tape between separate models, more coherent workflows in a single system.

Now, the part most coverage will skip.

The framing around "professional work" is doing heavy lifting here. GDPval measures performance on well-specified knowledge work tasks — meaning tasks with clear inputs, defined outputs, and limited ambiguity. That's an important subset of professional work, but it's not most of professional work. The hardest parts of any knowledge worker's day involve unclear requirements, competing priorities, political context, and judgement calls that can't be captured in a benchmark prompt. Saying a model "matches or exceeds professionals" on well-specified tasks is a bit like saying a calculator exceeds accountants at arithmetic. True, but it leaves out the interesting part.

The release cadence itself tells a story worth reading. OpenAI teased "5.4 sooner than you think" barely an hour after releasing GPT-5.3 Instant on 3 March. GPT-5.2 launched in December as a direct response to Gemini competition. Three major model iterations in three months is less "steady progress" and more "arms race pacing." For professionals trying to build workflows around these tools, the instability is the feature nobody's selling you on. Any process you build today around GPT-5.4's specific capabilities may need revisiting by summer — from OpenAI or their competitors.

GPT-5.4 is a real step forward in making these models more useful as working tools rather than party tricks. The coding, tool-use, and error-reduction improvements are real and worth evaluating. But "matches professionals 83% of the time on well-defined tasks" is not the same as "replaces professionals" — and the gap between those two statements is where most of the important decisions about your team and your career actually live.
