Masking Stale Observations Helps Search Agents -- Until It Doesn't: A Regime Map and Its Mechanism

Hugging Face Daily Papers Papers

Summary

This paper studies observation masking in long-horizon search agents, finding that accuracy gains follow an asymmetric inverted-U shape depending on the interplay between retriever capability and model capacity, with a collapse when the model is saturated. It provides a mechanistic analysis and a regime map for context management.

Long-horizon search agents accumulate large amounts of retrieved content across many tool calls, making context-budget efficiency increasingly important. A minimal intervention is to mask stale observations from the context as the trajectory progresses, but it remains unclear when this form of context management helps and why. We study observation masking through a systematic sweep over various agent backbones (4B to 284B parameters) and three retrievers on offline and live-web agentic search benchmarks. We find that the accuracy gain from masking follows an asymmetric inverted-U shape when plotted against the model's accuracy without context management: a plateau under weak retrievers, a peak when a strong retriever meets a mid-capacity model, and a sharp collapse when the model is saturated. This pattern reflects the interaction between retriever recall and the model's implicit filtering capacity, rather than either factor in isolation. Mechanistically, masking implements a token-for-turn trade-off: it removes observations the model has largely stopped attending to and pages the agent rarely re-opens. The added turns help when they convert failures into successes, but they fail when masking removes evidence the model would otherwise have used. We therefore reframe context management as a regime-dependent intervention and provide a holistic perspective for analyzing context use in agentic deep search. We release our scaffold and trajectories here (https://github.com/i-DeepSearch/observation-masking) to support future research.
Original Article
View Cached Full Text

Cached at: 06/02/26, 03:23 AM

Paper page - Masking Stale Observations Helps Search Agents – Until It Doesn’t: A Regime Map and Its Mechanism

Source: https://huggingface.co/papers/2606.00408

Abstract

Observation masking in long-horizon search agents shows variable accuracy gains depending on the interaction between retriever capability and model capacity, following an asymmetric inverted-U pattern.

Long-horizon search agents accumulate large amounts of retrieved content across many tool calls, making context-budget efficiency increasingly important. A minimal intervention is to mask stale observations from the context as the trajectory progresses, but it remains unclear when this form ofcontext managementhelps and why. We studyobservation maskingthrough a systematic sweep over variousagent backbones(4B to 284B parameters) and threeretrieverson offline and live-webagentic searchbenchmarks. We find that the accuracy gain from masking follows an asymmetric inverted-U shape when plotted against the model’s accuracy withoutcontext management: a plateau under weakretrievers, a peak when a strong retriever meets a mid-capacity model, and a sharp collapse when the model is saturated. This pattern reflects the interaction between retriever recall and the model’simplicit filtering capacity, rather than either factor in isolation. Mechanistically, masking implements atoken-for-turn trade-off: it removes observations the model has largely stopped attending to and pages the agent rarely re-opens. The added turns help when they convert failures into successes, but they fail when masking removes evidence the model would otherwise have used. We therefore reframecontext managementas a regime-dependent intervention and provide a holistic perspective for analyzing context use in agentic deep search. We release our scaffold and trajectories here (https://github.com/i-DeepSearch/observation-masking) to support future research.

View arXiv pageView PDFGitHub0Add to collection

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2606.00408 in a model README.md to link it from this page.

Datasets citing this paper1

#### i-DeepSearch/observation-masking-eval-logs Preview• Updated37 minutes ago • 548 • 1

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2606.00408 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

Search Discipline for Long-Horizon Research Agents

arXiv cs.AI

This paper identifies a failure mode in long-horizon research agents where optimizing an aggregate metric can select candidates that improve the headline number but break critical subgroups (inversion). It proposes a search-discipline protocol with an external control loop that audits candidates based on disaggregated behavior rather than the score.

@omarsar0: // The Memory Curse in LLM Agents // (bookmark it) Long histories apparently degrades agents as they become increasingl…

X AI KOLs Following

This research paper identifies the 'memory curse' in LLM agents, demonstrating that expanded context windows systematically degrade cooperative behavior in multi-agent social dilemmas by eroding forward-looking intent. The authors show that targeted fine-tuning, synthetic memory sanitization, and reducing explicit Chain-of-Thought reasoning can effectively mitigate this behavioral decay.

Constraint-Enhanced Physical Search through Correlation Matching

arXiv cs.AI

This paper proposes a principle of 'constraint-enhanced physical search' where temporal correlations in exploration are matched to constraint-induced spatial correlations in update dynamics, demonstrated via a tug-of-war bandit model. The authors show that efficient search emerges not from maximal randomness but from matching temporal correlation to the physical update scale that converts feedback into evidence.