40% of my browser agent's sessions were silently failing, and the LLM wasn't the problem

Reddit r/AI_Agents Tools

Summary

A developer discovered that 40% of browser agent sessions silently failed due to browser fingerprinting and automation detection, not LLM reasoning. An open-source tool called Leakish identified the issues.

I built a Puppeteer agent that passed every reasoning eval. In production, 40% of sessions returned degraded results with zero errors. The LLM was reasoning correctly over poisoned input; the browser was the blind spot. I verified this with Leakish, an open-source scanner whose full codebase is on GitHub and whose fingerprint checks execute locally, so I trusted the output before pointing it at my agent's sessions. My sessions were flagged on Canvas rendering, WebRTC, and automation-detection surfaces I never thought to monitor. I still don't have a clean fix for making the browser layer invisible to these detection systems.
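To make the three flagged surfaces concrete, here is a minimal sketch of a per-session audit that turns a silent degradation into an explicit finding. It is not Leakish's code or API. The `BrowserSignals` shape, the `auditSignals` function, the list of known canvas hashes, and the expected exit IP are all hypothetical names introduced for illustration. The only real-world facts assumed are that automated Chrome sets `navigator.webdriver` to true and that headless Chrome has historically put `HeadlessChrome` in its user agent.

```typescript
// Hypothetical snapshot of signals gathered inside the page, for example with
// Puppeteer's page.evaluate(() => navigator.webdriver). Field names are
// illustrative assumptions, not part of Leakish or Puppeteer.
interface BrowserSignals {
  webdriver: boolean;            // value of navigator.webdriver
  userAgent: string;             // value of navigator.userAgent
  canvasHash: string | null;     // hash of a canvas render, null if it failed
  webrtcIps: string[];           // IPs observed in WebRTC ICE candidates
  expectedExitIp: string | null; // IP the session is supposed to expose
}

type Surface = "automation" | "canvas" | "webrtc";

interface Finding {
  surface: Surface;
  reason: string;
}

// Returns at most one finding per surface, in a fixed order.
function auditSignals(s: BrowserSignals, flaggedCanvasHashes: string[]): Finding[] {
  const findings: Finding[] = [];

  // Automation surface: the webdriver flag or a headless user agent.
  const headlessUa = s.userAgent.indexOf("HeadlessChrome") !== -1;
  if (s.webdriver || headlessUa) {
    findings.push({
      surface: "automation",
      reason: s.webdriver ? "navigator.webdriver is true" : "user agent contains HeadlessChrome",
    });
  }

  // Canvas surface: a missing render, or a hash already known to be flagged
  // (the list of flagged hashes is an assumption supplied by the caller).
  if (s.canvasHash === null) {
    findings.push({ surface: "canvas", reason: "canvas render unavailable" });
  } else if (flaggedCanvasHashes.indexOf(s.canvasHash) !== -1) {
    findings.push({ surface: "canvas", reason: "canvas hash matches a flagged value" });
  }

  // WebRTC surface: any candidate IP other than the expected exit IP.
  if (s.expectedExitIp !== null) {
    const exitIp = s.expectedExitIp;
    const leaked = s.webrtcIps.filter((ip) => ip !== exitIp);
    if (leaked.length > 0) {
      findings.push({ surface: "webrtc", reason: "unexpected IP in ICE candidates: " + leaked.join(", ") });
    }
  }

  return findings;
}
```

A session with a non-empty result could be marked as suspect before its page content ever reaches the LLM, so degraded input shows up as a logged finding instead of a quietly wrong answer. These checks are simple heuristics. Real detection systems look at far more than three signals, so a clean result here does not mean the session went undetected.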
