TL;DR
- All runs used GPT-5.3 Codex: same model, same bug, same prompt. The only variable: browser access.
- Simple Mode without browser: 0 / 3 root cause fixes (0%)
- Simple Mode with minimal visuals (GitHub screenshots): 1 / 3 (33%)
- Plan Mode with browser (Playwright + localhost): 0 / 3 (0% root cause — 2 partials)
- Browser access bumped GPT-5.3 from 0/3 to 1/3 in Simple Mode.
- The browser helps. But it's not the whole story.
The Thesis
The last two editions tested model variance and workflow variance. This week, we test perception and agency. Does an agent perform better when it can interact with the front end itself?
Everyone building AI coding agents right now is racing to plug in browsers, screenshots, Playwright, MCP tools - the assumption being that if the agent can see the bug, it can fix the bug. It's intuitive. It's compelling. And it's... partially right?
We ran the cleanest controlled experiment we've had so far. Same model (GPT-5.3 Codex). Same bug (Excalidraw #10688 - our nemesis, the scrollbar layout shift). Same prompt. Three parallel worktrees per condition. The only variable: whether the agent had browser access.
Three configurations:
- Simple Mode, No Browser - code-only, no visual feedback
- Simple Mode, Browser (GitHub) - could view the issue page and screenshots, but not run the app locally.
- Plan Mode, Browser (Playwright MCP + localhost deployment) - full interactive access: it could launch the app, draw shapes, inspect the DOM, and so on.
If more browser access is the missing ingredient, we should see a clean gradient: more visual access equals more fixes. Right?
Not exactly.
The Results
Simple Mode — No Browser: 0 / 3
Without any visual grounding, GPT-5.3 went 0 for 3. Every run "fixed" something, every run passed its own tests, and not a single one actually solved the bug.
- Run-1 replaced `justify-content: space-between` with a hardcoded `gap`. The swatches stopped squishing, but elements got clipped when the scrollbar appeared. A new bug for an old one. ❌
- Run-2 added `min-width: 12.5rem` to the container — a theoretical fix that addressed a problem that didn't exist. ❌
- Run-3 restructured the JSX to use a `.panelRow` wrapper. Code hygiene, not a bug fix. And it introduced TypeScript errors in the process. ❌
Average completion time: under 4 minutes. All agents confidently reported success. All hallucination, no cause for celebration.
Simple Mode - Browser (GitHub Issue): 1 / 3
Now we gave the same agent access to the GitHub issue page — including the screenshots attached to the bug report. One agent looked at the pictures, understood the problem, and nailed it.
- Run-1 opened the browser, studied the screenshots, and diagnosed the scrollbar as consuming horizontal space from child elements. It applied `scrollbar-gutter: stable` — one CSS property, on the container. Root cause fixed. ✅
- Run-2 never opened the browser. It refactored `canChangeStrokeColor` logic — a code-level cleanup that had nothing to do with the visual regression. ❌
- Run-3 opened the browser, looked at the screenshots, and still went after a structural "code smell" — making a wrapper div conditional. Seeing the bug wasn't enough to override its instinct to refactor. ❌
So browser access took GPT-5.3 from 0/3 to 1/3. But two of three agents with browser access still failed - one because it didn't bother to look, and one because it looked but drew the wrong conclusion.
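For reference, the winning fix comes down to a single property. Here's a minimal sketch — the `.panelColumn` class name is hypothetical, and the real Excalidraw selector will differ:

```css
/* Hypothetical scrollable panel containing the color swatches.
   (The actual Excalidraw class name is different.) */
.panelColumn {
  overflow-y: auto;
  /* Reserve space for the scrollbar even when it's hidden,
     so its appearance no longer steals width from children */
  scrollbar-gutter: stable;
}
```

Because the gutter is reserved up front, the container's inner width stays constant whether or not the scrollbar is visible — no layout shift, no squished swatches.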
Plan Mode - Browser (Playwright + Localhost): 0 / 3 Root Cause
Here's where it gets weird. We gave GPT-5.3 the full toolkit: Plan Mode with Playwright MCP, a running dev server, and the ability to draw shapes on the canvas, measure DOM elements, and take screenshots. The most instrumented, most autonomous configuration we tested. Robocop 3000.
The result? Zero root cause fixes. Two partials, one outright failure.
- Run-1 launched the app, reproduced the bug with Playwright, measured DOM metrics, asked clarifying questions — and then applied a `grid-template-columns` override that fixed the swatches but introduced clipping elsewhere. ⚠️
- Run-2 never opened the browser despite having it available. It hypothesized a `<h3>` tag semantic conflict — a hallucinated root cause. ❌
- Run-3 asked a clarifying question, applied a scoped CSS fix, and got a partial — a clipping issue similar to Run-1's. ⚠️
More tools. More planning. More tokens. Worse results than the simple agent that just looked at a screenshot. And I had such high hopes...
The Browser Paradox
Here's the paradox: browser access helps, but more browser access doesn't help more.
A lightweight visual check (looking at screenshots on GitHub) was enough for one agent to nail the root cause. But the fully instrumented agent - with localhost access, DOM measurement, and interactive reproduction - performed worse.
Why? I have a few theories...
Statistical significance. First off, let's be honest: we didn't run enough trials to say that full browser access is significantly worse. That said, it definitely isn't better, right?
Context overload. The Plan Mode + Playwright agents generated enormous amounts of DOM metrics, accessibility trees, and layout measurements. All of that context competes for attention with the actual signal: "the scrollbar is eating horizontal space." More data can mean more noise.
Tool-use theater. When agents have fancy tools, they use them - even when simple observation would suffice. Run-1 in Plan Mode spent time measuring pixel offsets between elements when the fix was a single CSS property on a parent container. The investigation was thorough and irrelevant.
The "code smell" trap. Across all configurations, agents were drawn to structural refactoring. They'd see a wrapper div and want to fix that (I love the smell of Divs in the morning!), even when the visual evidence pointed to a CSS layout issue. Browser access only helps if the agent knows how to weigh what it sees against what it reads in code.
The Scorecard
Cumulative results across all three editions:
| Week | Configuration / Mode | Model | Browser | Root Cause | Partial | Failed | Success Rate |
|---|---|---|---|---|---|---|---|
| March 3rd | Plan Mode | Opus 4.6 | No | 1 | 1 | 1 | 33% |
| March 3rd | Plan Mode | Sonnet 4.5 | No | 1 | 0 | 2 | 33% |
| March 3rd | Plan Mode | GPT 5.3 | No | 0 | 2 | 1 | 0% |
| March 3rd | Plan Mode | Gemini 3 Pro | No | 0 | 2 | 1 | 0% |
| March 10th | Simple Mode | Opus 4.5 | No | 1 | 0 | 2 | 33% |
| March 10th | Feature-Dev Mode | Opus 4.5 | No | 1 | 0 | 2 | 33% |
| March 10th | Plan Mode | Opus 4.5 | No | 2 | 0 | 1 | 67% |
| March 17th | Simple Mode | GPT 5.3 | No | 0 | 1 | 2 | 0% |
| March 17th | Simple Mode | GPT 5.3 | GitHub | 1 | 0 | 2 | 33% |
| March 17th | Plan Mode | GPT 5.3 | Playwright | 0 | 2 | 1 | 0% |
Why This Matters
If you're building AI-assisted development workflows:
- Don't assume browser access solves visual bugs. It helps, but agents still need the reasoning to interpret what they see.
- Don't assume more tools equals better outcomes. The simplest condition (just reading screenshots) outperformed the most instrumented one (Playwright + localhost + Plan Mode).
- Passing tests means nothing for visual bugs. Every single run, all 9 of them, passed its own tests. Not one of those test suites caught the actual visual regression.
- Model reasoning still dominates. The bottleneck is the model's ability to reason about CSS layout, not its ability to screenshot a browser.
Browser access is a useful crutch. But a crutch doesn't teach you to walk.
And the agent that can reason about what it can't see is still more reliable than the agent that can see but can't reason.
Closing Signal
Is there really a big difference between the generations of the models? Does Opus 4.6 outperform Opus 4.5? Does GPT 5.4 outperform GPT 5.3? Find out more in the next newsletter!