Last week we explored variance (or lack thereof) across models.
This week we remove intelligence from the equation. Same model. Same bug. Same repository.
Only one thing changes: the execution strategy (Simple, Feature-Dev, and Plan Mode), all powered by Opus 4.5.
What if workflow matters more than horsepower?
TL;DR
Using Opus 4.5 across three execution modes on the same Excalidraw bug:
- Plan Mode: 2 / 3 root cause fixes (67%)
- Simple Mode: 1 / 3 (33%)
- Feature-Dev Mode: 1 / 3 (33%)
- Plan Mode used ~1/3 the tokens of Feature-Dev
At face value, Plan Mode “wins”.
But the real story isn't that Plan Mode won (and to be honest, the win isn't statistically significant).
The real story is that the more expensive, thorough, and autonomous workflow didn't win. It even failed in a pompous, embarrassing way!
The Thesis
This week we wanted to test the hypothesis that some workflows are simply better. Everyone is shouting through a megaphone that this (insert random workflow here) changes everything and that they just became a 10x engineer without ever writing code. To test that, we ran a head-to-head between three execution strategies supplied by Anthropic.
We used the same issue as Edition #1. When selecting multiple shapes in Excalidraw, the properties panel grows vertically. A scrollbar appears. The layout “squishes”.
While it looks like a component spacing problem, it's actually a container constraint problem.
In last week's edition we saw that the obvious (and incorrect) fix is to apply constraints on the color picker, but the actual solution is to fix the container itself.
The Three Modes
All runs used:
- Claude Code
- Opus 4.5
- 3 parallel worktrees
- Same base commit
Only the execution protocol changed.
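As a minimal sketch of what that setup looks like (our own reconstruction, not the authors' exact harness; the repo here is a throwaway stand-in for a real clone of excalidraw, and all names are illustrative), each mode gets its own detached worktree off the same base commit, so every run starts from identical code:

```python
import os
import subprocess
import tempfile

def git(*args, cwd):
    # Thin wrapper so each git call fails loudly and returns trimmed stdout.
    return subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    ).stdout.strip()

root = tempfile.mkdtemp()
repo = os.path.join(root, "repo")
os.makedirs(repo)

git("init", "-q", cwd=repo)
# Inline identity flags keep the sketch self-contained; a real run would
# clone the actual repository instead of committing an empty placeholder.
git("-c", "user.email=ci@example.com", "-c", "user.name=ci",
    "commit", "-q", "--allow-empty", "-m", "base commit", cwd=repo)
base = git("rev-parse", "HEAD", cwd=repo)

# One detached worktree per execution mode, all pinned to the same commit.
for mode in ("simple", "feature-dev", "plan"):
    git("worktree", "add", "-q", "--detach",
        os.path.join(root, f"run-{mode}"), base, cwd=repo)
```

Detached worktrees (rather than branches) make it obvious that no run can see another run's changes.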
The results? Surprising to say the least!
In Simple Mode, the agent got just a direct instruction: “Fix the bug in this issue: https://github.com/excalidraw/excalidraw/issues/10688”
As expected:
- 1 out of 3 success rate (33.3%)
- Moderate token usage
- Fast runtime
In Feature-Dev Mode, we wrapped the same instruction in the workflow: “/feature-dev:feature-dev Fix the bug in this issue: https://github.com/excalidraw/excalidraw/issues/10688”
A completely identical success rate!
- 1 out of 3 success rate (33.3%)
- Crazy high token usage (3x the tokens of Plan Mode!)
- An extremely long runtime to top it off
In one failed run, the agent (arrogantly?) abandoned live reproduction: “Let me proceed with the code analysis approach. I have enough understanding of the bug from reading the source code.”
It was overconfident! Next time, serve it slightly chilled, please!
And finally, Plan Mode. Though not statistically significant, drumroll please...
- 2 out of 3 success rate (66.7%)
- ~1/3 the token usage of Feature-Dev (and even less than Simple Mode!)
- Fastest runtime!
- Verified visually.
The Scorecard
Here are all the benchmarks we've run since the first edition of the newsletter, in one handy table.
| Week | Configuration / Mode | Model | Root Cause ✅ | Partial ⚠️ | Failed ❌ | Success Rate |
|---|---|---|---|---|---|---|
| March 3rd | Plan Mode | Opus 4.6 | 1 | 1 | 1 | 33% |
| March 3rd | Plan Mode | Sonnet 4.5 | 1 | 0 | 2 | 33% |
| March 3rd | Plan Mode | GPT 5.3 | 0 | 2 | 1 | 0% |
| March 3rd | Plan Mode | Gemini 3 Pro | 0 | 2 | 1 | 0% |
| March 10th | Simple Mode | Opus 4.5 | 1 | 0 | 2 | 33% |
| March 10th | Feature-Dev Mode | Opus 4.5 | 1 | 0 | 2 | 33% |
| March 10th | Plan Mode | Opus 4.5 | 2 | 0 | 1 | 67% |
At first glance, Plan Mode “wins”. So why is it not statistically significant? Because with n=3 per condition, this could easily be variance.
So... this experiment does not prove Plan Mode is superior.
But every mode, including the most expensive and autonomous one, failed at least once. None achieved 3/3. And those were expensive failures at that!
That's not just variance. That's an economic signal. If more autonomy and more token burn consistently produced perfect reliability, you could argue cost buys certainty. But it sure doesn't.
More computational spend does NOT guarantee more consistent correctness.
Why This Matters
If you're building AI-assisted development workflows:
- Don't assume the most autonomous mode is the safest.
- Don't assume more tokens equals more accuracy.
- Don't assume cost buys determinism.
Reliability is still probabilistic.
Variance is still real.
And process design still matters.
Closing Signal
Next week's theme: “The Browser Paradox” — GPT-5.3 Codex with and without browser access. The cleanest controlled experiment yet. Does vision beat reasoning? The answer is awkward.