№ 013 · April 22, 2026 · 18 min read

Benchmarking 7 coding agents on a real refactor

Same 12k-line TypeScript codebase, same task: extract a domain layer. I ran every agent twice and graded the diffs.

I gave seven coding agents the same task: extract the domain layer from a 12,000-line TypeScript codebase. Same prompt, two attempts each, identical environment. Then I read every diff.

§ 01 The setup

The codebase was a real one — a side project, not a benchmark — with the usual tangle of business logic stuffed into route handlers. The brief was deliberately under-specified, the way real refactor tickets are.
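For readers who have not done this kind of refactor, here is a minimal sketch of what "extracting a domain layer" usually means. It is not taken from the benchmarked codebase. Every name in it (`checkoutHandlerBefore`, `orderTotalCents`, the `SAVE10` coupon, the `Req`/`Res` stand-in types) is hypothetical and chosen only for illustration.

```typescript
// Hypothetical illustration only: none of these names come from the
// benchmarked codebase.

// A domain type with no knowledge of HTTP.
interface LineItem { priceCents: number; qty: number }

// Minimal stand-ins for a web framework's request/response types (assumed).
interface Req { body: { items: LineItem[]; coupon?: string } }
interface Res { status: number; json: unknown }

// Before: the pricing rule lives inside the route handler.
function checkoutHandlerBefore(req: Req): Res {
  let total = 0;
  for (const item of req.body.items) {
    total += item.priceCents * item.qty;
  }
  if (req.body.coupon === "SAVE10") {
    total = Math.round(total * 0.9);
  }
  return { status: 200, json: { totalCents: total } };
}

// After: the rule moves into a pure domain function that takes and
// returns plain values, so it can be tested without a request object.
function orderTotalCents(items: LineItem[], coupon?: string): number {
  const subtotal = items.reduce((sum, i) => sum + i.priceCents * i.qty, 0);
  return coupon === "SAVE10" ? Math.round(subtotal * 0.9) : subtotal;
}

// The handler shrinks to translating between HTTP and the domain.
function checkoutHandlerAfter(req: Req): Res {
  const totalCents = orderTotalCents(req.body.items, req.body.coupon);
  return { status: 200, json: { totalCents } };
}
```

The behavior is meant to stay identical before and after. What changes is where the rule lives and what it depends on.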

