Devin, but Good

Scott Wu, CEO of Cognition, announcing a major Devin update:

It turns out that when you build for the future... eventually the future comes.

There's a version of this post that's pure victory lap. Wu mostly avoids it. The admission that Devin was "frankly too early" at launch — that it took months to drive real PRs even in their own codebase — is more candour than you typically get from a founder sitting on top of a hype cycle. And describing the update as "just Devin, but good" is the kind of phrasing that only works if you're willing to concede the original product wasn't.

The substance of the update is telling. Faster startup. Smoother Slack and Linear integrations. Hundreds of UX fixes. No new paradigm, no architectural leap — just the slow, tedious work of making something actually usable. That's the part I believe. That's what separates products from demos.

What I'm less sold on is the metrics. Internal PRs went from 154 to 659 per week. Enterprise sessions are "doubling every six weeks." Usage grew "65x in thirteen months." These numbers sound enormous until you think about what they're actually measuring. A pull request is a unit of activity, not a unit of value. Sixty-five times near-zero is still a number that fits comfortably on a napkin. And none of this tells you the thing you'd actually want to know: what percentage of Devin's output ships to production without a human effectively rewriting it?
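For anyone who wants to put the quoted figures on a common scale, here is a rough back-of-envelope sketch. It assumes smooth exponential growth and an average month of 52/12 weeks, neither of which the post states, and the helper names are mine. The PR, session, and usage figures also measure different things, so the comparison is illustrative only.

```python
import math

# Assumption: an average month is 52/12 weeks. The post gives no exact dates.
WEEKS_PER_MONTH = 52 / 12


def doubling_time_weeks(multiple: float, months: float) -> float:
    """Doubling time implied by growing `multiple`-fold over `months`,
    assuming steady exponential growth (an assumption, not a claim from the post)."""
    weeks = months * WEEKS_PER_MONTH
    return weeks / math.log2(multiple)


def multiple_from_doubling(doubling_weeks: float, months: float) -> float:
    """Total growth multiple if something doubles every `doubling_weeks`
    for `months`, under the same steady-growth assumption."""
    weeks = months * WEEKS_PER_MONTH
    return 2 ** (weeks / doubling_weeks)


# Internal PRs per week: 154 -> 659 (time window not given in the post).
pr_multiple = 659 / 154

# "65x in thirteen months" expressed as a doubling time.
usage_doubling = doubling_time_weeks(65, 13)

# "Doubling every six weeks", if it were sustained for thirteen months.
sessions_multiple = multiple_from_doubling(6, 13)

print(f"PR growth multiple: {pr_multiple:.2f}x")
print(f"Doubling time implied by 65x in 13 months: {usage_doubling:.1f} weeks")
print(f"Multiple implied by 6-week doubling over 13 months: {sessions_multiple:.0f}x")
```

Under those assumptions the PR count grew a little over fourfold, 65x in thirteen months works out to a doubling time of a bit over nine weeks, and six-week doubling sustained for the same period would compound to several hundredfold. The headline figures describe noticeably different growth rates, and none of them says anything about how much of the output survives review.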

Wu credits the improvement to a 100x leap in model capabilities, citing METR benchmarks showing frontier models handling tasks that would take humans six-plus hours. Fine. But if the gains are mostly downstream of better foundation models, then every competitor building on the same models gets the same tailwind for free. The interesting question isn't whether Devin is better — everything built on top of Opus 4.6 and Codex 5.3 is better — it's whether Cognition's tooling layer is a durable product or a temporary head start.

To be fair, "we built the infrastructure early and it paid off when the models caught up" is a legitimate strategy, and it may well be the right one. The point stands that engineering leaders who dismissed Devin a year ago should probably look again. AI tooling has a shorter shelf life on first impressions than almost any other product category.

But I keep coming back to those 659 PRs. It's the number Wu chose to lead with, which means it's the number that makes Devin look best. The number he didn't share — the one measuring how much of that output actually stands on its own — would tell you whether the future has really arrived, or just sent a very promising postcard.
