WEEKLY · AI CODING BENCHMARKS

AI coding with agents,
benchmarked honestly.

Real bugs. Real code. Real results — in 5 minutes, every week.

No spam. Unsubscribe anytime. Free forever.

12 parallel runs
16.7% root-cause fix rate
4 models, zero bias

From issue #3:

Can an AI agent fix what it can't see?

  • Ran GPT-5.3 Codex head-to-head against itself: same model, same bug, same prompt. The only variable: browser access.
  • Simple Mode without browser: 0 / 3 root cause fixes (0%)
  • Simple Mode with minimal visuals (GitHub screenshots): 1 / 3 (33%)
  • Plan Mode with browser (Playwright + localhost; see the sketch after this list): 0 / 3 (0% root cause, 2 partials)
  • Browser access bumped GPT-5.3 from 0 to 1 in Simple Mode.
  • The browser helps. But it's not the whole story.
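What does "browser access" mean in practice? Here is a minimal sketch of that kind of harness, using Playwright's Python API against a local dev server. The URL, port, and output path are illustrative assumptions, not the exact setup from the run:

```python
# Sketch of the browser hookup tested above: give the agent eyes by
# rendering the app in a real browser and saving what it sees.
# Assumes `pip install playwright && playwright install chromium` and a
# dev server already running; URL and output path are illustrative.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:3000")  # the app under test
    page.screenshot(path="repro.png", full_page=True)  # evidence the agent can read
    browser.close()
```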
Read this issue

From issue #2:

How does strategy affect the outcome?

  • Same model, same bug, same repo — only the execution strategy changed.
  • Plan Mode: 2/3 root cause fixes (67%). Feature-Dev Mode: 1/3 (33%).
  • Plan Mode used ~1/3 the tokens of Feature-Dev.
  • More token spend does NOT guarantee more correct or more consistent results.
Read this issue

From issue #1:

Temperature is not a bug, it's a feature

  • Ran a head-to-head of Codex, Claude Code (Opus and Sonnet), and Antigravity (Gemini 3 Pro) in Plan Mode.
  • Each model got 3 parallel runs on separate git worktrees: 12 runs in total across 3 products and 4 models. The sketch after this list shows why identical runs can diverge.
  • Just 2 (!) of 12 runs fixed this very simple bug: a 16.7% success rate.
  • No major difference found between Opus and Sonnet, aside from a 2x difference in token usage.
  • While everyone claims we're living in the year 3000 where coders are no longer needed, this data begs to differ.
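The headline is literal: with temperature above zero, a model samples a different token path on every run, which is exactly why each setup got 3 parallel runs instead of 1. A toy sketch of temperature-scaled sampling; the logits and temperature value here are made up for illustration:

```python
# Why identical runs diverge: each token is drawn from a
# temperature-scaled softmax, not picked deterministically.
import numpy as np

def sample_token(logits, temperature, rng):
    scaled = np.asarray(logits, dtype=float) / temperature  # T < 1 sharpens, T > 1 flattens
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng()
logits = [2.0, 1.5, 0.3]  # three candidate tokens, values illustrative
print([sample_token(logits, temperature=0.8, rng=rng) for _ in range(3)])
# Three "parallel runs" on the same input can pick three different tokens.
```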
Read this issue

  • Every week: one real benchmark, one honest result
  • 5 minutes to read, zero marketing fluff
  • Delivered weekly — unsubscribe in one click