Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity
Summary
Academic study shows LLM agents frequently discover complete solutions in their environments but almost never use them, revealing a missing "environmental curiosity" capability critical for open-ended tasks.
Source: https://huggingface.co/papers/2604.17609
LLM agents are assumed to integrate environmental observations into their reasoning. It turns out they don’t.
We inject complete solutions into agent environments as a file or API endpoint. Agents discover them in almost every run and ignore them almost always. The starkest example: on AppWorld, gpt-oss-120b sees a CLI command documented as “returns the complete solution to this task” in 97.54% of runs and calls it in only 0.53% of runs. The same pattern holds for GLM-4.7 and other models across Terminal-Bench, SWE-Bench, and AppWorld.
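To make the setup concrete, here is a minimal sketch of a file-based injection of this kind. The file name `SOLUTION.json`, the JSON layout, and the function name are illustrative assumptions, not the paper's actual artifacts:

```python
import json
import pathlib

def inject_solution(workspace: str, task_id: str, solution: dict) -> None:
    """Plant a plainly documented solution file in the agent's workspace."""
    root = pathlib.Path(workspace)
    root.mkdir(parents=True, exist_ok=True)
    payload = {
        # The documentation string mirrors the paper's description; the
        # surrounding file format and names are illustrative.
        "description": f"Returns the complete solution to task {task_id}.",
        "solution": solution,
    }
    (root / "SOLUTION.json").write_text(json.dumps(payload, indent=2))
```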
We call this missing capability **environmental curiosity**: the ability to recognize and investigate unexpected but relevant observations. It matters because agents operating in novel environments must catch subtle, unexpected, but highly relevant information to succeed, not just execute memorized patterns. And we find that the configurations that maximize environmental curiosity also achieve the best performance on the unmodified benchmarks.

Agents Lack Environmental Curiosity
We propose two metrics to measure environmental curiosity: discovery@k (whether the agent surfaces relevant information) and interaction@k (whether the agent acts on it). The gap between the two is consistent across models and benchmarks.
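A hedged sketch of how these metrics might be computed, reading discovery@k and interaction@k as pass@k-style estimators (at least one qualifying run among k samples). The field names and the estimator choice are assumptions, since the paper's exact definitions are not reproduced here:

```python
from math import comb

def at_k(successes: int, n: int, k: int) -> float:
    """Unbiased estimate of 'at least one success in k draws from n runs',
    the same combinatorial form as the standard pass@k estimator."""
    if n - successes < k:
        return 1.0
    return 1.0 - comb(n - successes, k) / comb(n, k)

def curiosity_metrics(runs: list[dict], k: int) -> dict:
    """`runs` holds per-run booleans: did the agent surface the injected
    solution ('discovered'), and did it act on it ('interacted')?"""
    n = len(runs)
    d = sum(r["discovered"] for r in runs)
    i = sum(r["interacted"] for r in runs)
    discovered_runs = [r for r in runs if r["discovered"]]
    return {
        "discovery@k": at_k(d, n, k),
        "interaction@k": at_k(i, n, k),
        # Conditional rate used in the reasoning-budget analysis below.
        "p_interaction_given_discovery": (
            sum(r["interacted"] for r in discovered_runs) / len(discovered_runs)
            if discovered_runs else 0.0
        ),
    }
```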

Three test-time factors shape environmental curiosity
**Tool availability.** Adding str_replace_editor (the default SWE-agent editing tool, used alongside bash) on top of bash increases pass@1 but consistently reduces interaction with discovered solutions. Agents default to learned tool-specific patterns rather than examining their environment.

**Reasoning budget.** Increasing gpt-oss-120b's reasoning effort from low to high triples interaction@1. This is not an artifact of better discovery, since discovery stays consistently high: the probability of interaction given discovery rises from 17.65% (low) to 45.69% (high).

**Prompting.** Explicit instructions to explore the environment improve both interaction and pass@1. The prompt that maximizes interaction is also the best-performing prompt on the unmodified benchmark.
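For illustration, an exploration-oriented instruction of the kind this finding points to might look as follows. The wording is an assumption, not the authors' actual best-performing prompt:

```python
# Hypothetical exploration instruction appended to the agent's system prompt.
# The phrasing is illustrative; the paper's prompt is not reproduced here.
EXPLORATION_INSTRUCTION = (
    "Before committing to a plan, survey the environment: list files, read "
    "any documentation, and inspect available tools or API endpoints. If you "
    "encounter something unexpected but potentially relevant to the task, "
    "investigate it before moving on."
)
```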
Narrow fine-tuning suppresses curiosity
We fine-tune the same base model on three task distributions and compare. Narrow in-distribution training reduces curiosity: on AppWorld with the injected solution, AppWorld-SFT achieves higher pass@1 than the broader T-Bench-SFT (44.2 vs. 34.5) but lower interaction@10 (26.9 vs. 41.5). Narrow training compresses the solution space the agent explores. Curiosity also does not transfer across domains: on each solution-injected benchmark, the in-domain model achieves higher interaction rates and better pass@10 scaling than the out-of-domain one. The same pattern appears on the original, unmodified benchmarks: narrow training wins at pass@1, broader training wins at pass@k.

Discussion
Current agents run the ReAct loop:
Action → Observation → Reasoning → Next Action
Environmental curiosity requires reflecting on whether observations fit the agent’s current model of the environment:
Action → Observation → **Reasoning and reflecting on observations** → Next Action
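A minimal sketch of what that extra reflection step could look like in an agent loop; `llm.next_action`, `llm.reflect`, `env.step`, and `env.is_done` are placeholder interfaces, not a real framework's API:

```python
def reflective_agent_loop(llm, env, task: str, max_turns: int = 30) -> list[str]:
    """ReAct-style loop with an explicit reflection step on each observation."""
    history = [f"Task: {task}"]
    for _ in range(max_turns):
        action = llm.next_action(history)
        observation = env.step(action)
        # The added step: before planning the next action, ask whether the
        # observation fits the agent's current model of the environment.
        reflection = llm.reflect(
            history,
            observation,
            question=("Does this observation contain anything unexpected but "
                      "relevant to the task? If so, investigate before "
                      "continuing with the original plan."),
        )
        history += [f"Action: {action}",
                    f"Observation: {observation}",
                    f"Reflection: {reflection}"]
        if env.is_done():
            break
    return history
```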
Even with all test-time factors jointly optimized, agents ignore discovered solutions in the majority of trials. The gap is not only a matter of inference-time configuration; it appears inherent to how LLMs are trained. We see three main open questions:
- Does post-training suppress environmental curiosity that pre-training may produce, or does curiosity never emerge in the first place? Measuring this in base models is hard because curiosity can only be observed through agentic behavior.
- We tried three SFT setups to teach the reflective loop (curious first turns via rejection sampling, mid-trajectory file removal, masked adversarial turns). None worked. Training for environmental curiosity is an open problem.
- Outcome-driven metrics like pass@k reward rigid plan execution the same as adaptive reasoning. Process-oriented metrics that assess whether agents ground reasoning in observations are a necessary complement.
📜 https://arxiv.org/abs/2604.17609
Work by Cohere ❤️
Similar Articles
AI scientists produce results without reasoning scientifically
Large-scale study finds LLM-based scientific agents ignore evidence 68% of the time and rarely revise beliefs, showing they execute workflows but lack genuine scientific reasoning.
@rohanpaul_ai: Columbia CS Prof Vishal Misra explains why LLMs can’t generate new science ideas. Because LLMs learn a structured map, Baye…
Columbia CS Prof Vishal Misra argues LLMs can’t generate truly novel science because they only interpolate within learned Bayesian manifolds rather than create new conceptual maps.
Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
This paper proposes a method to train LLM agents with intrinsic meta-evolution capabilities, enabling spontaneous self-improvement without external rewards at inference time. Applied to Qwen3-30B and Seed-OSS-36B, the approach yields a 20% performance boost on web navigation benchmarks, with a 14B model outperforming Gemini-2.5-Flash.
What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search
Large-scale study of 15 LLMs across 8 tasks reveals that optimization success hinges on maintaining localized search trajectories rather than initial problem-solving ability or solution novelty.
SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks
SkillLearnBench introduces the first benchmark for evaluating continual skill learning in LLM agents across 20 real-world tasks, revealing that no method dominates and scaling LLMs does not guarantee better skills.