WEEKLY · AI CODING BENCHMARKS
AI coding with agents,
benchmarked honestly.
Real bugs. Real code. Real results — in 5 minutes, every week.
No spam. Unsubscribe anytime. Free forever.
12
parallel runs
16.6%
root-cause fix rate
4
models, zero bias
From issue #3:
Can an AI agent fix what it can't see?
- All runs used GPT-5.3 Codex: same model, same bug, same prompt. The only variable: browser access.
- Simple Mode, no browser: 0/3 root-cause fixes (0%)
- Simple Mode, minimal visuals (GitHub screenshots): 1/3 (33%)
- Plan Mode with browser (Playwright + localhost): 0/3 (0% root cause; 2 partials)
- Browser access bumped GPT-5.3 Codex from 0 to 1 in Simple Mode.
- The browser helps. But it's not the whole story.
From issue #2:
How does strategy affect the outcome?
- Same model, same bug, same repo — only the execution strategy changed.
- Plan Mode: 2/3 root-cause fixes (67%). Feature-Dev Mode: 1/3 (33%).
- Plan Mode used ~1/3 the tokens of Feature-Dev.
- Spending more tokens does NOT guarantee more consistent correctness.
From issue #1:
Temperature is not a bug, it's a feature
- Ran a head-to-head of Codex, Claude Code (Opus and Sonnet), and Antigravity (Gemini 3 Pro) in Plan Mode.
- Each model got 3 parallel runs on separate worktrees: 12 runs across 4 products and models.
- Only 2 (!) out of 12 runs fixed this very simple bug: a 16.6% success rate.
- No major difference between Opus and Sonnet, aside from 2x token usage.
- While everyone claims we're living in the year 3000 where coders are no longer needed, this data begs to differ.
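For the curious, the headline rate is just root-cause fixes divided by total parallel runs. A minimal sketch of that bookkeeping (the outcome labels and their arrangement below are illustrative, not the actual benchmark data):

```python
# Illustrative only: outcome labels stand in for the real run logs.
runs = [
    "fail", "root_cause", "fail",   # e.g. model A, 3 parallel runs
    "fail", "fail", "fail",         # model B
    "root_cause", "fail", "fail",   # model C
    "fail", "fail", "fail",         # model D
]

fixes = runs.count("root_cause")
rate = fixes / len(runs)
print(f"{fixes}/{len(runs)} = {rate:.1%}")  # 2/12 = 16.7% (rounded)
```

Twelve runs, two real fixes: the same division behind the 16.6% figure above.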
What to expect
- Every week: one real benchmark, one honest result
- 5 minutes to read, zero marketing fluff
- Delivered weekly — unsubscribe in one click