Ran 35 agent trials across 4 browser-snapshot formats - pass rate was identical, token cost wasn't

Reddit r/AI_Agents Tools

Summary

Opera's team tested browser snapshot formats for AI agents, finding identical pass rates but significantly lower token cost with their compressed format (opera-compact), which eliminates redundant ARIA attributes and repeated URLs. They released it as an open-source MCP server.

Not going to pretend, I'm on the team at Opera that builds browser tooling for agents, so take the numbers with whatever grain of salt that deserves. Question we kept running into: when an agent drives a browser, how much of its context actually goes to the task vs. re-reading the same page structure every step? Wanted a real number instead of a guess. Setup: 7 browser tasks (adapted from AXI's bench-browser suite), gpt-5.5 medium reasoning, 5 runs per condition. | Format | Pass | Avg input tokens | Tool calls | |---------------------------------------|------|------------------|------------| | unprocessed MCP (chrome-devtools-mcp) | 100% | 179.2k | 2.1 | | AXI reference CLI | 100% | 102.2k | 1.5 | | our raw output, compression off | 100% | 107.5k | 1.6 | | our compressed format (opera-compact) | 100% | 36.3k | 1.4 | Same pass rate everywhere - the difference is entirely in what the agent pays to get there. The compression comes from cutting stuff that's genuinely redundant: ARIA attributes implicit for a given role anyway, echoed text nodes, repeated URLs collapsed into a lookup table instead of inlined every time. Ships as opera-browser-cli / opera-devtools-mcp, Apache-2.0, plugs in as an MCP server: npm install -g opera-browser-cli && opera-browser-cli setup If you're running multi-agent pipelines where a sub-agent re-snapshots constantly, curious whether this gap widens or shrinks at that scale - we only tested single-agent loops. (links in comment)
Original Article

Similar Articles

Agent Execution Tax: new procurement metric for browser agent benchmarks?

Reddit r/LocalLLaMA

Fireworks AI and Notte introduce the 'Agent Execution Tax' metric after running 720 browser agent tasks across four LLMs, finding that execution reliability—not intelligence—is the primary bottleneck in agentic AI, with one model wasting 22.9% of inference calls on malformed JSON.