A field guide to writing your own eval harness
Why "vibes-based" testing collapses past 50 prompts, and the smallest harness that scales without becoming a second product.
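A minimal sketch of the harness shape the subtitle argues for: prompts paired with programmatic checks and a pass count, nothing more. Every name here (`Case`, `run_model`, `CASES`) is an illustrative assumption, not anything from the post itself.

```python
"""Smallest-possible eval harness: cases are (prompt, check) pairs."""
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # a programmatic grader, not vibes

def run_model(prompt: str) -> str:
    # Stand-in for a real model call (assumption for this sketch).
    return "4" if "2+2" in prompt else ""

CASES = [
    Case("What is 2+2? Answer with a digit.", lambda out: out.strip() == "4"),
    Case("Reply with the word OK.", lambda out: "OK" in out),
]

def run(cases: list[Case]) -> tuple[int, int]:
    # Count how many checks pass; a failing case is the whole point.
    passed = sum(c.check(run_model(c.prompt)) for c in cases)
    return passed, len(cases)

if __name__ == "__main__":
    passed, total = run(CASES)
    print(f"{passed}/{total} passed")
```

The point of the shape: once cases are data, scaling past 50 prompts is appending to a list, not re-reading transcripts.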
"The taxonomy I keep coming back to. Workflows vs. agents, with worked patterns."
via Anthropic

"A pragmatic field report. The section on guardrail budgets is worth the read alone."
via Medium

Same 12k-line TypeScript codebase, same task: extract a domain layer. I ran every agent twice and graded the diffs.

"Worth reading the verified subset methodology before quoting any number from the headline board."
via MarkTechPost

"Skim the intro, read the middle. Their framing of 'negotiated autonomy' is sticky."
via The New Stack

"Annual roundup. Section on tool-calling reliability is gold."
via simonwillison.net

Most MCP servers I see are CRUD wrappers. Here is what changes when you design tools as if a model were a junior engineer with no memory.

"Forking a verifier you never await is the cheapest CI you will ever ship."
via Anthropic Docs

A 90th-percentile breakdown of where the seconds actually go in a tool-using chat. Spoiler: it is not the model.

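The verifier-forking line above fits in a few lines of code: launch the checks detached and never call `wait()` on them. The inline command and the `verify.log` path are stand-ins (assumptions); a real setup would launch `pytest` or a linter.

```python
"""Fork a verifier you never await: fire-and-forget checks."""
import subprocess
import sys

def fork_verifier(log_path: str = "verify.log") -> subprocess.Popen:
    # Spawn the verifier as a detached child; the trivial inline
    # script stands in for a real test command. No .wait() here.
    log = open(log_path, "w")
    return subprocess.Popen(
        [sys.executable, "-c", "print('checks passed')"],
        stdout=log,
        stderr=subprocess.STDOUT,
    )

proc = fork_verifier()
# ...keep working; the verdict lands in verify.log when it lands.
```

The design choice is that the happy path costs nothing: you only go read the log when something feels off.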
"The PageRank-on-symbols trick is more interesting than the headline feature."
via aider.chat

When the user is not the only one reading your UI. A short manifesto on machine-legible interfaces.
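The "PageRank-on-symbols trick" praised two items up is aider's repo-map idea: run PageRank over a who-references-whom graph of code and surface the top-ranked definitions. A toy power-iteration sketch; the four-file edge list is invented, whereas the real map derives its graph from parsed symbol references.

```python
"""PageRank over a (made-up) symbol-reference graph between files."""

# edges: (referencing file, defining file) for each symbol reference
EDGES = [
    ("cli.py", "core.py"),
    ("api.py", "core.py"),
    ("core.py", "utils.py"),
    ("api.py", "utils.py"),
]

def pagerank(edges, damping=0.85, iters=50):
    nodes = {n for edge in edges for n in edge}
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * rank[n] / len(out[n])
                for dst in out[n]:
                    nxt[dst] += share
            else:  # dangling node: spread its rank everywhere
                for m in nodes:
                    nxt[m] += damping * rank[n] / len(nodes)
        rank = nxt
    return rank

ranks = pagerank(EDGES)
# Heavily referenced files (core.py, utils.py) outrank the leaf
# files that only reference them, so they win the context budget.
```

That ordering is the whole trick: when context is scarce, spend it on the symbols everything else depends on.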