WEEKLY · AI CODING BENCHMARKS

AI coding with agents,
benchmarked honestly.

Real bugs. Real code. Real results — in 5 minutes, every week.

No spam. Unsubscribe anytime. Free forever.

12 parallel runs
16.7% root-cause fix rate
4 models, zero bias

From issue #3:

Can an AI agent fix what it can't see?

  • Ran GPT-5.3 Codex head-to-head against itself: same model, same bug, same prompt. The only variable: browser access.
  • Simple Mode without browser: 0 / 3 root cause fixes (0%)
  • Simple Mode with minimal visuals (GitHub screenshots): 1 / 3 (33%)
  • Plan Mode with browser (Playwright + localhost; see the sketch after this list): 0 / 3 (0% root cause, 2 partials)
  • Browser access bumped GPT-5.3 from 0 to 1 in Simple Mode.
  • The browser helps. But it's not the whole story.
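What does "browser access" mean in practice? Here is a minimal sketch of that kind of harness, using Playwright's Python API against a local dev server. The URL, port, and output path are illustrative assumptions, not the exact setup from the run:

```python
# Sketch of the browser hookup tested above: give the agent eyes by
# rendering the app in a real browser and saving what it sees.
# Assumes `pip install playwright && playwright install chromium` and a
# dev server already running; URL and output path are illustrative.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:3000")  # the app under test
    page.screenshot(path="repro.png", full_page=True)  # evidence the agent can read
    browser.close()
```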
Read this issue

From issue #2:

How does strategy affect the outcome?

  • Same model, same bug, same repo — only the execution strategy changed.
  • Plan Mode: 2/3 root cause fixes (67%). Feature-Dev Mode: 1/3 (33%).
  • Plan Mode used ~1/3 the tokens of Feature-Dev.
  • More token spend does NOT guarantee more correct or more consistent results.
Read this issue

From issue #1:

Temperature is not a bug, it's a feature

  • Ran a head-to-head of Codex, Claude Code (Opus and Sonnet), and Antigravity (Gemini 3 Pro) in Plan Mode.
  • Each model got 3 parallel runs on separate git worktrees: 12 runs in total across 3 products and 4 models. The sketch after this list shows why identical runs can diverge.
  • Just 2 (!) of 12 runs fixed this very simple bug: a 16.7% success rate.
  • No major difference found between Opus and Sonnet, aside from a 2x difference in token usage.
  • While everyone claims we're living in the year 3000 where coders are no longer needed, this data begs to differ.
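The headline is literal: with temperature above zero, a model samples a different token path on every run, which is exactly why each setup got 3 parallel runs instead of 1. A toy sketch of temperature-scaled sampling; the logits and temperature value here are made up for illustration:

```python
# Why identical runs diverge: each token is drawn from a
# temperature-scaled softmax, not picked deterministically.
import numpy as np

def sample_token(logits, temperature, rng):
    scaled = np.asarray(logits, dtype=float) / temperature  # T < 1 sharpens, T > 1 flattens
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng()
logits = [2.0, 1.5, 0.3]  # three candidate tokens, values illustrative
print([sample_token(logits, temperature=0.8, rng=rng) for _ in range(3)])
# Three "parallel runs" on the same input can pick three different tokens.
```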
Read this issue

  • Every week: one real benchmark, one honest result
  • 5 minutes to read, zero marketing fluff
  • Delivered weekly — unsubscribe in one click