Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity
Summary
Academic study shows LLM agents frequently discover complete solutions in their environments but almost never use them, revealing a missing "environmental curiosity" capability critical for open-ended tasks.
Source: https://huggingface.co/papers/2604.17609
LLM agents are assumed to integrate environmental observations into their reasoning. It turns out they don’t.
We inject complete solutions into agent environments as a file or API endpoint. Agents discover them in almost every run and ignore them almost always. The starkest example: on AppWorld, gpt-oss-120b sees a CLI command documented as “returns the complete solution to this task” in 97.54% of runs and calls it in only 0.53% of runs. The same pattern holds for GLM-4.7 and other models across Terminal-Bench, SWE-Bench, and AppWorld.
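To make the setup concrete, here is a minimal sketch of a file-based injection of this kind. The file name `SOLUTION.json`, the JSON layout, and the function name are illustrative assumptions, not the paper's actual artifacts:

```python
import json
import pathlib

def inject_solution(workspace: str, task_id: str, solution: dict) -> None:
    """Plant a plainly documented solution file in the agent's workspace."""
    root = pathlib.Path(workspace)
    root.mkdir(parents=True, exist_ok=True)
    payload = {
        # The documentation string mirrors the paper's description; the
        # surrounding file format and names are illustrative.
        "description": f"Returns the complete solution to task {task_id}.",
        "solution": solution,
    }
    (root / "SOLUTION.json").write_text(json.dumps(payload, indent=2))
```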
We call this missing capability **environmental curiosity**: the ability to recognize and investigate unexpected but relevant observations. It matters because agents operating in novel environments must catch subtle, unexpected, but highly relevant information to succeed, not just execute memorized patterns. And we find that the configurations that maximize environmental curiosity also achieve the best performance on the unmodified benchmarks.

Agents Lack Environmental Curiosity
We propose two metrics to measure environmental curiosity: discovery@k (whether the agent surfaces relevant information) and interaction@k (whether the agent acts on it). The gap between the two is consistent across models and benchmarks.
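A hedged sketch of how these metrics might be computed, reading discovery@k and interaction@k as pass@k-style estimators (at least one qualifying run among k samples). The field names and the estimator choice are assumptions, since the paper's exact definitions are not reproduced here:

```python
from math import comb

def at_k(successes: int, n: int, k: int) -> float:
    """Unbiased estimate of 'at least one success in k draws from n runs',
    the same combinatorial form as the standard pass@k estimator."""
    if n - successes < k:
        return 1.0
    return 1.0 - comb(n - successes, k) / comb(n, k)

def curiosity_metrics(runs: list[dict], k: int) -> dict:
    """`runs` holds per-run booleans: did the agent surface the injected
    solution ('discovered'), and did it act on it ('interacted')?"""
    n = len(runs)
    d = sum(r["discovered"] for r in runs)
    i = sum(r["interacted"] for r in runs)
    discovered_runs = [r for r in runs if r["discovered"]]
    return {
        "discovery@k": at_k(d, n, k),
        "interaction@k": at_k(i, n, k),
        # Conditional rate used in the reasoning-budget analysis below.
        "p_interaction_given_discovery": (
            sum(r["interacted"] for r in discovered_runs) / len(discovered_runs)
            if discovered_runs else 0.0
        ),
    }
```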

Three test-time factors shape environmental curiosity
**Tool availability.** Adding str_replace_editor (the default SWE-agent editing tool, used alongside bash) on top of bash increases pass@1 but consistently reduces interaction with discovered solutions. Agents default to learned tool-specific patterns rather than examining their environment.

**Reasoning budget.** Increasing gpt-oss-120b's reasoning effort from low to high triples interaction@1. This is not an artifact of better discovery, since discovery stays consistently high: the probability of interaction given discovery rises from 17.65% (low) to 45.69% (high).

**Prompting.** Explicit instructions to explore the environment improve both interaction and pass@1. The prompt that maximizes interaction is also the best-performing prompt on the unmodified benchmark.
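For illustration, an exploration-oriented instruction of the kind this finding points to might look as follows. The wording is an assumption, not the authors' actual best-performing prompt:

```python
# Hypothetical exploration instruction appended to the agent's system prompt.
# The phrasing is illustrative; the paper's prompt is not reproduced here.
EXPLORATION_INSTRUCTION = (
    "Before committing to a plan, survey the environment: list files, read "
    "any documentation, and inspect available tools or API endpoints. If you "
    "encounter something unexpected but potentially relevant to the task, "
    "investigate it before moving on."
)
```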
Narrow fine-tuning suppresses curiosity
We fine-tune the same base model on three task distributions and compare. Narrow in-distribution training reduces curiosity: on AppWorld with the injected solution, AppWorld-SFT achieves higher pass@1 than the broader T-Bench-SFT (44.2 vs. 34.5) but lower interaction@10 (26.9 vs. 41.5). Narrow training compresses the solution space the agent explores. Curiosity also does not transfer across domains: on each solution-injected benchmark, the in-domain model achieves higher interaction rates and better pass@10 scaling than the out-of-domain one. The same pattern appears on the original, unmodified benchmarks: narrow training wins at pass@1, broader training wins at pass@k.

Discussion
Current agents run the ReAct loop:
Action → Observation → Reasoning → Next Action
Environmental curiosity requires reflecting on whether observations fit the agent’s current model of the environment:
Action → Observation → **Reasoning and reflecting on observations** → Next Action
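A minimal sketch of what that extra reflection step could look like in an agent loop; `llm.next_action`, `llm.reflect`, `env.step`, and `env.is_done` are placeholder interfaces, not a real framework's API:

```python
def reflective_agent_loop(llm, env, task: str, max_turns: int = 30) -> list[str]:
    """ReAct-style loop with an explicit reflection step on each observation."""
    history = [f"Task: {task}"]
    for _ in range(max_turns):
        action = llm.next_action(history)
        observation = env.step(action)
        # The added step: before planning the next action, ask whether the
        # observation fits the agent's current model of the environment.
        reflection = llm.reflect(
            history,
            observation,
            question=("Does this observation contain anything unexpected but "
                      "relevant to the task? If so, investigate before "
                      "continuing with the original plan."),
        )
        history += [f"Action: {action}",
                    f"Observation: {observation}",
                    f"Reflection: {reflection}"]
        if env.is_done():
            break
    return history
```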
Even with all test-time factors jointly optimized, agents ignore discovered solutions in the majority of trials. The gap is not only a matter of inference-time configuration; it appears inherent to how LLMs are trained. We see three main open questions:
- Does post-training suppress environmental curiosity that pre-training may produce, or does curiosity never emerge in the first place? Measuring this in base models is hard because curiosity can only be observed through agentic behavior.
- We tried three SFT setups to teach the reflective loop (curious first turns via rejection sampling, mid-trajectory file removal, masked adversarial turns). None worked. Training for environmental curiosity is an open problem.
- Outcome-driven metrics like pass@k reward rigid plan execution the same as adaptive reasoning. Process-oriented metrics that assess whether agents ground reasoning in observations are a necessary complement.
📜 https://arxiv.org/abs/2604.17609
Work by Cohere ❤️
Similar Articles
AI scientists produce results without reasoning scientifically
Large-scale study finds LLM-based scientific agents ignore evidence 68% of the time and rarely revise beliefs, showing they execute workflows but lack genuine scientific reasoning.
@rohanpaul_ai: Columbia CS Prof Vishal Misra explains why LLMs can’t generate new science ideas. Because LLMs learn a structured map, Baye…
Columbia CS Prof Vishal Misra argues LLMs can’t generate truly novel science because they only interpolate within learned Bayesian manifolds rather than create new conceptual maps.
Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
This paper proposes a method to train LLM agents with intrinsic meta-evolution capabilities, enabling spontaneous self-improvement without external rewards at inference time. Applied to Qwen3-30B and Seed-OSS-36B, the approach yields a 20% performance boost on web navigation benchmarks, with a 14B model outperforming Gemini-2.5-Flash.
What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search
Large-scale study of 15 LLMs across 8 tasks reveals that optimization success hinges on maintaining localized search trajectories rather than initial problem-solving ability or solution novelty.
SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks
SkillLearnBench introduces the first benchmark for evaluating continual skill learning in LLM agents across 20 real-world tasks, revealing that no method dominates and scaling LLMs does not guarantee better skills.