Temperature is not a bug, it's a feature

but the outcome of testing this feature, the magnitude of the variance, caught me by surprise!

TL;DR

Ran a head-to-head of Codex, Claude Code (Opus and Sonnet) and Antigravity (Gemini 3 Pro) in plan mode.
Each model got 3 parallel runs on separate worktrees: 12 runs across 3 products and 4 models.
2(!!!!) out of 12 runs fixed this very simple bug - a 16.7% success rate.
No major difference found between Opus and Sonnet, aside from 2x token usage.
While everyone claims to be living in the year 3000 where coders are no longer needed, this data begs to differ.

The Thesis

I took a real-life issue (currently unsolved) in a popular open source project - Excalidraw.

While this issue demands a deep understanding of CSS to resolve, its complete isolation and peripheral location in the project make it a perfect testbed - it is the ultimate "Good first issue" for a newly onboarded teammate.

The bug: scrollbar squishes the panel when multiple shapes are selected

When you select multiple squares, more options appear in the panel. The scrollbar pops up and squishes everything. Yes, that's right, I used the highly technical term "squishes". OK, crowded, is that better?

This makes it a golden stress test for AI coding agents. The bug is real (pulled from open GitHub issues), the fix is verifiable with a browser, and the reasoning chain from symptom to root cause is where agents visibly diverge.

This week I ran 12 parallel agents (3 per suite), varying the model (Opus 4.6 vs. Sonnet 4.5 vs. Gemini 3 Pro vs. GPT 5.3) in plan mode.

Ready? Steady? Go!

The agents that solved the problem were the ones that looked at behavior, not at the code. Moreover, many of the runs (4/12 = 33%) fixed the bug only for the component reported in the issue, without fixing the same bug in the component directly below it in the same panel. The color swatches stay put, but the opacity bar and the default color still squish in a glitchy way.

The partial fix: color swatches stable but opacity bar still squishes

The operation was successful, but the patient died.

The final nail in the coffin of this experiment - many of the agents used a browser and solemnly declared, on the life of their training set, that the issue was completely fixed.

Scrollbar Layout Shift - Opus variance

Here's an example of how different the results were in one of the suites.

Repo: excalidraw/excalidraw

Issue: #10688 — Properties panel buttons "squash" when scrollbar appears in stroke/background selectors

Parameter varied: Concurrency (3 worktrees), all Opus 4.6 in Plan mode

Results

Run	Time	Files Changed	Strategy	Outcome
1	~30m	1 (ColorPicker.scss)	Replaced `space-between` with `gap: 0.25rem`	⚠️ Partial — spacing fixed, shift remains
2	~20m	1 (styles.scss)	Applied `scrollbar-gutter: stable` to container	✅ Root cause fixed
3	~20m	1 (ColorPicker.scss)	Added `min-width: 0` + `gap: 0.125rem`	❌ Did not fix

Divergence Analysis

Two of the agents (runs 1 and 3) diagnosed the problem at the component level: they saw squashed buttons in .color-picker__top-picks and tried to force better spacing within that flex container. Run 1's gap replacement did improve the visual spacing, but only for the color picker. The opacity slider, on the other hand, not so much: it visibly bounced around every time the scrollbar appeared.
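
To make "component level" concrete, here is roughly what a Run 1 style change looks like. This is a sketch reconstructed from the strategy column in the results table, not the agent's actual diff; the `display: flex` and `justify-content` lines are my assumptions.

```css
/* Sketch of a component-level change in ColorPicker.scss (not the real diff). */
.color-picker__top-picks {
  display: flex; /* assumption: the row is already a flex container */

  /* Before (assumed): justify-content: space-between; */
  gap: 0.25rem; /* fixed spacing between the swatches */
}
```

Nothing in a rule like this reaches outside the color picker, so whatever sits below it in the panel keeps reacting to the scrollbar.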

Run 2 diagnosed the problem at the container level. It recognized that the scrollbar on .App-menu__left was consuming ~17px of width, which compressed all child content. The fix — scrollbar-gutter: stable — reserves that 17px permanently, eliminating the layout shift globally.
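
As a rough sketch of what the container-level fix looks like: only the selector and the `scrollbar-gutter` declaration come from the run described here. The `overflow-y` line is my assumption about how the panel scrolls, and the real rule in styles.scss may look different.

```css
/* Sketch only: not the agent's actual change to styles.scss. */
.App-menu__left {
  /* Assumption: the panel scrolls vertically once its content overflows. */
  overflow-y: auto;

  /* Reserve the scrollbar's track up front, so the content box keeps
     the same width whether or not the scrollbar is showing. */
  scrollbar-gutter: stable;
}
```

The trade-off is that the reserved strip stays empty while nothing overflows. On platforms with overlay scrollbars the property changes nothing, because no gutter is needed there in the first place.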

The pattern: two agents treated the symptom, one treated the cause. The distinguishing factor was whether the agent's investigation moved up the DOM tree from the visually broken element to the layout constraint.

One property, one line, applied to the container, not the component. Spoiler alert for future editions: context usage for this run was ~50% higher than the Opus 4.5 baseline, which is worth noting for cost-conscious teams.

Scorecard — Cumulative Results

Each week we'll keep a running score of all the benchmarks in the following table:

Configuration	Root Cause ✅	Partial ⚠️	Failed ❌	Avg Time	Root Cause Rate
Opus 4.6 (Plan)	1	1	1	23 min	33%
Sonnet 4.5 (Plan)	1	0	2	6 min	33%
GPT 5.3 (Plan)	0	2	1	TODO	0%
Gemini 3 Pro	0	2	1	TODO	0%

Closing Signal

Next week's theme: "Does Strategy Beat Intelligence?" — Opus 4.5 across all three modes - Simple Mode, Plan Mode and Feature Dev. Surprising results are coming!