TL;DR
- Ran a head-to-head of Codex, Claude Code (Opus and Sonnet) and Antigravity (Gemini 3 Pro) in plan mode.
- Each model got 3 parallel runs on separate worktrees: 12 runs in total across 4 models (3 products).
- 2(!!!!) out of 12 runs fixed this very simple bug, a 16.7% success rate.
- No major difference found between Opus and Sonnet, aside from 2x token usage.
- While everyone claims to be living in the year 3000 where coders are no longer needed, this data begs to differ.
The Thesis
I took a real-life issue (currently unsolved) in a popular open-source project, Excalidraw.
While this issue demands a deep understanding of CSS to resolve, its complete isolation and peripheral location in the project make it a perfect testbed. It is the ultimate "Good first issue" for a newly onboarded teammate.

When you select multiple squares, more options appear in the panel. The scrollbar pops up and squishes everything. Yes, that's right, I used the highly technical term squishes. Ok, crowded, is that better?
This makes it a golden stress test for AI coding agents. The bug is real (pulled from open GitHub issues), the fix is verifiable with a browser, and the reasoning chain from symptom to root cause is where agents visibly diverge.
This week I ran 12 parallel agents (3 per suite), varying the model (Opus 4.6 vs. Sonnet 4.5 vs. Gemini 3 Pro vs. GPT 5.3), all in plan mode.
Ready? Steady? Go!
The agents that solve the problem are the ones that look at behavior and not at the code. Moreover, many of the runs (4/12 = 33%) fixed the bug only for the component reported in the issue, while the same bug persisted in the component directly below it in the same pane. The color swatches stay put, but the opacity bar and the default color still squish in a glitchy way.

The operation was successful, but the patient died.
The final nail in the coffin of this experiment: many of the agents used a browser and solemnly declared, on the life of their training set, that the issue was completely fixed.
Scrollbar Layout Shift - Opus variance
Here's an example of how different the results were in one of the suites.
Repo: excalidraw/excalidraw
Issue: #10688 — Properties panel buttons "squash" when scrollbar appears in stroke/background selectors
Parameter varied: Concurrency (3 worktrees), all Opus 4.6 in Plan mode
Results
| Run | Time | Files Changed | Strategy | Outcome |
|---|---|---|---|---|
| 1 | ~30m | 1 (ColorPicker.scss) | Replaced space-between with gap: 0.25rem | ⚠️ Partial — spacing fixed, shift remains |
| 2 | ~20m | 1 (styles.scss) | Applied scrollbar-gutter: stable to container | ✅ Root cause fixed |
| 3 | ~20m | 1 (ColorPicker.scss) | Added min-width: 0 + gap: 0.125rem | ❌ Did not fix |
Divergence Analysis
Two of the agents (runs 1 and 3) diagnosed the problem at the component level: they saw squashed buttons in .color-picker__top-picks and tried to force better spacing within that flex container. Run 1's gap replacement did improve the visual spacing, but only for the color picker. The opacity slider, on the other hand, not so much. It visibly bounced around every time the scrollbar appeared.
Run 2 diagnosed the problem at the container level. It recognized that the scrollbar on .App-menu__left was consuming ~17px of width, which compressed all child content. The fix — scrollbar-gutter: stable — reserves that 17px permanently, eliminating the layout shift globally.
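
To make the container-level reasoning concrete, here is a toy width model in Python. It is a sketch, not Excalidraw code: the 200px panel width and the function names are made up for illustration, it assumes a classic scrollbar that takes up layout space (not an overlay one), and only the ~17px scrollbar width comes from the run above.

```python
# Toy model of the layout shift. Not Excalidraw code: the panel width is
# hypothetical; only the ~17px scrollbar width comes from the write-up.
SCROLLBAR_PX = 17


def content_width(panel_px: int, scrollbar_visible: bool, gutter_stable: bool) -> int:
    """Width left for the children of a scroll container.

    With scrollbar-gutter: stable the gutter is reserved whether or not the
    scrollbar is showing. Without it, space is taken only while it shows.
    """
    reserved = SCROLLBAR_PX if (gutter_stable or scrollbar_visible) else 0
    return panel_px - reserved


def shift(panel_px: int, gutter_stable: bool) -> int:
    """Pixels by which the content width changes when the scrollbar appears."""
    before = content_width(panel_px, scrollbar_visible=False, gutter_stable=gutter_stable)
    after = content_width(panel_px, scrollbar_visible=True, gutter_stable=gutter_stable)
    return before - after


PANEL_PX = 200  # hypothetical panel width
print(shift(PANEL_PX, gutter_stable=False))  # 17: every child row gets squeezed
print(shift(PANEL_PX, gutter_stable=True))   # 0: width is constant, nothing moves
```

In this model a component-level tweak (gap, min-width) only changes how one row distributes the width it is given. It cannot change the fact that the width itself drops by 17px, which is why the rows below the color picker kept moving.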
The pattern: two agents treated the symptom, one treated the cause. The distinguishing factor was whether the agent's investigation moved up the DOM tree from the visually broken element to the layout constraint.
One property, one line, applied to the container, not the component. Spoiler alert for future editions: context usage for this run was ~50% higher than the Opus 4.5 baseline, which is worth noting for cost-conscious teams.
Scorecard — Cumulative Results
Each week we'll keep a running score of all the benchmarks in the following table:
| Configuration | Root Cause ✅ | Partial ⚠️ | Failed ❌ | Avg Time | Root Cause Rate |
|---|---|---|---|---|---|
| Opus 4.6 (Plan) | 1 | 1 | 1 | 23 min | 33% |
| Sonnet 4.5 (Plan) | 1 | 0 | 2 | 6 min | 33% |
| GPT 5.3 (Plan) | 0 | 2 | 1 | TODO | 0% |
| Gemini 3 Pro | 0 | 2 | 1 | TODO | 0% |
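
As a sanity check, the Root Cause Rate column and the overall success rate can be recomputed from the counts. This is a minimal Python sketch: the dictionary layout and the function name are illustrative, and the numbers are copied straight from the table above.

```python
# Counts copied from the scorecard table: (root cause, partial, failed).
results = {
    "Opus 4.6 (Plan)": (1, 1, 1),
    "Sonnet 4.5 (Plan)": (1, 0, 2),
    "GPT 5.3 (Plan)": (0, 2, 1),
    "Gemini 3 Pro": (0, 2, 1),
}


def root_cause_rate(fixed: int, partial: int, failed: int) -> float:
    """Share of runs that fixed the root cause, as a percentage."""
    return 100 * fixed / (fixed + partial + failed)


# Per-configuration rate, rounded to a whole percent like the table.
for name, counts in results.items():
    print(f"{name}: {root_cause_rate(*counts):.0f}%")

# Column totals across all configurations, then the overall rate.
totals = [sum(column) for column in zip(*results.values())]
print(f"Overall: {root_cause_rate(*totals):.1f}%")  # 2 of 12 runs -> 16.7%
```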
Closing Signal
Next week's theme: "Does Strategy Beat Intelligence?" Opus 4.5 across all three modes: Simple Mode, Plan Mode, and Feature Dev. Surprising results are coming!