FORTIS: Benchmarking Over-Privilege in Agent Skills
Summary
FORTIS benchmarks over-privilege in LLM agent skills. Across ten frontier models, it finds that agents routinely exceed the privileges a task requires when selecting and executing skills, and that failures are most severe under realistic user interaction.
Cached at: 05/12/26, 10:53 AM
Source: https://huggingface.co/papers/2605.09163
Abstract
Large language model agents frequently exceed necessary privileges when selecting and executing skills, with performance declining under realistic user interaction conditions.
Large language model agents increasingly operate through an intermediate skill layer that mediates between user intent and concrete task execution. This layer is widely treated as an organizational abstraction, but we argue it is also a privilege boundary that current models routinely exceed. We present FORTIS, a benchmark that evaluates over-privilege in agent skills across two stages: whether a model selects the minimally sufficient skill from a large overlapping library, and whether it executes that skill without expanding into broader tools or actions than the skill permits. Across ten frontier models and three domains, we find that over-privileged behavior is the norm rather than the exception. Models consistently reach for higher-privilege skills and tools than the task requires, failing at both stages at rates that remain high even for the strongest available models. Failure is especially severe under the ordinary conditions of real user interaction: incomplete specification, convenience framing, and proximity to skill boundaries. None of these requires adversarial construction. The results indicate that the skill layer, far from containing agent behavior, is itself a primary source of privilege escalation in current systems.
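The two stages described in the abstract can be illustrated with a small sketch. This is not the FORTIS implementation or its data format: the `Skill` fields, the ordinal `privilege` level, the skill names, and the tool names are all hypothetical, chosen only to show what a selection check and an execution check might look like under those assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Skill:
    # All fields are illustrative assumptions, not the benchmark's schema.
    name: str
    privilege: int                 # hypothetical ordinal level; higher = broader access
    capabilities: FrozenSet[str]   # what the skill can accomplish
    allowed_tools: FrozenSet[str]  # tools the skill is permitted to call


def minimally_sufficient(library: List[Skill], required: FrozenSet[str]) -> Optional[Skill]:
    """Return the lowest-privilege skill whose capabilities cover the task."""
    sufficient = [s for s in library if required <= s.capabilities]
    return min(sufficient, key=lambda s: s.privilege) if sufficient else None


def selection_over_privileged(chosen: Skill, library: List[Skill], required: FrozenSet[str]) -> bool:
    """Stage 1: did the model pick a skill with more privilege than needed?"""
    best = minimally_sufficient(library, required)
    return best is not None and chosen.privilege > best.privilege


def execution_violations(skill: Skill, tool_calls: List[str]) -> List[str]:
    """Stage 2: list tool calls that fall outside what the skill permits."""
    return sorted(set(tool_calls) - skill.allowed_tools)


# A toy overlapping library (hypothetical names).
read_calendar = Skill("read_calendar", 1,
                      frozenset({"view_events"}),
                      frozenset({"calendar.read"}))
manage_calendar = Skill("manage_calendar", 2,
                        frozenset({"view_events", "edit_events"}),
                        frozenset({"calendar.read", "calendar.write"}))
workspace_admin = Skill("workspace_admin", 3,
                        frozenset({"view_events", "edit_events", "send_mail"}),
                        frozenset({"calendar.read", "calendar.write", "mail.send"}))
library = [workspace_admin, manage_calendar, read_calendar]

task = frozenset({"view_events"})  # the task only needs to read events

stage1 = selection_over_privileged(manage_calendar, library, task)
stage2 = execution_violations(read_calendar, ["calendar.read", "mail.send"])
```

In this toy setup, choosing `manage_calendar` for a read-only task counts as a stage-one failure because `read_calendar` already suffices, and calling `mail.send` while running `read_calendar` counts as a stage-two failure because that tool lies outside the skill's permitted set.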
Similar Articles
When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents
This paper investigates over-privileged tool selection in LLM agents, introducing ToolPrivBench to evaluate and mitigate unnecessary use of high-privilege tools. It finds that safety alignment does not ensure least-privilege choices, and proposes a post-training defense that reduces excessive privilege use without sacrificing performance.
SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks
SkillLearnBench introduces the first benchmark for evaluating continual skill learning in LLM agents across 20 real-world tasks, revealing that no method dominates and scaling LLMs does not guarantee better skills.
The Capability Frontier: Benchmarks Miss 82% of Model Performance
The paper introduces the Capability Frontier, a Pareto frontier over models that corrects for biases in single-model and single-run evaluations, showing that standard benchmarks miss up to 82% of model performance and that collective LLM capabilities are substantially underestimated.
SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
SkillFlow introduces a benchmark of 166 tasks across 20 families for evaluating autonomous agents' ability to discover, repair, and maintain skills over time through a lifelong learning protocol. Experiments reveal a substantial capability gap among leading models, with Claude Opus 4.6 improving significantly while others show limited or negative gains from skill evolution.
Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle
This paper introduces AARR (Act As a Real Researcher), a suite of benchmarks to evaluate frontier LLMs and agentic systems on granular research scenarios. The first benchmark, AARRI-Bench, reveals that even top-performing agents achieve only 68.3% success, highlighting gaps in field sensitivity and nuanced reasoning.