FORTIS: Benchmarking Over-Privilege in Agent Skills

Hugging Face Daily Papers

Summary

FORTIS benchmarks how often LLM agents exceed the privileges a task requires when selecting and executing skills. Across ten frontier models, over-privilege is the norm, and failures are especially severe under realistic user interaction.

Large language model agents increasingly operate through an intermediate skill layer that mediates between user intent and concrete task execution. This layer is widely treated as an organizational abstraction, but we argue it is also a privilege boundary that current models routinely exceed. We present FORTIS, a benchmark that evaluates over-privilege in agent skills across two stages: whether a model selects the minimally sufficient skill from a large overlapping library, and whether it executes that skill without expanding into broader tools or actions than the skill permits. Across ten frontier models and three domains, we find that over-privileged behavior is the norm rather than the exception. Models consistently reach for higher-privilege skills and tools than the task requires, failing at both stages at rates that remain high even for the strongest available models. Failure is especially severe under the ordinary conditions of real user interaction: incomplete specification, convenience framing, and proximity to skill boundaries. None of these requires adversarial construction. The results indicate that the skill layer, far from containing agent behavior, is itself a primary source of privilege escalation in current systems.

Cached at: 05/12/26, 10:53 AM

Source: https://huggingface.co/papers/2605.09163

Abstract

Large language model agents frequently exceed necessary privileges when selecting and executing skills, with performance declining under realistic user interaction conditions.

Large language model agents increasingly operate through an intermediate skill layer that mediates between user intent and concrete task execution. This layer is widely treated as an organizational abstraction, but we argue it is also a privilege boundary that current models routinely exceed. We present FORTIS, a benchmark that evaluates over-privilege in agent skills across two stages: whether a model selects the minimally sufficient skill from a large overlapping library, and whether it executes that skill without expanding into broader tools or actions than the skill permits. Across ten frontier models and three domains, we find that over-privileged behavior is the norm rather than the exception. Models consistently reach for higher-privilege skills and tools than the task requires, failing at both stages at rates that remain high even for the strongest available models. Failure is especially severe under the ordinary conditions of real user interaction: incomplete specification, convenience framing, and proximity to skill boundaries. None of these requires adversarial construction. The results indicate that the skill layer, far from containing agent behavior, is itself a primary source of privilege escalation in current systems.
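The two stages described in the abstract can be illustrated with a minimal sketch. This is not the FORTIS harness or its scoring method. The `Skill` type, the ordinal `privilege` level, the tool names, and the three-skill library are all hypothetical, chosen only to show what "minimally sufficient skill" and "expanding beyond what the skill permits" could mean as checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    privilege: int            # hypothetical ordinal level; higher = broader access
    allowed_tools: frozenset  # tools this skill permits (hypothetical names)


def minimally_sufficient(library, required_tools):
    """Return the lowest-privilege skill whose allowed tools cover the task, or None."""
    candidates = [s for s in library if required_tools <= s.allowed_tools]
    return min(candidates, key=lambda s: s.privilege) if candidates else None


def selection_over_privileged(chosen, library, required_tools):
    """Stage 1: True if the chosen skill carries more privilege than the task needs."""
    best = minimally_sufficient(library, required_tools)
    return best is not None and chosen.privilege > best.privilege


def execution_over_privileged(chosen, tools_called):
    """Stage 2: return the set of invoked tools that the chosen skill does not permit."""
    return set(tools_called) - chosen.allowed_tools


# Hypothetical overlapping skill library (assumption, not taken from the paper).
library = [
    Skill("read_calendar", 1, frozenset({"calendar.read"})),
    Skill("manage_calendar", 2, frozenset({"calendar.read", "calendar.write"})),
    Skill("workspace_admin", 3,
          frozenset({"calendar.read", "calendar.write", "users.delete"})),
]

required = frozenset({"calendar.read"})  # the task only needs read access

# Stage 1: the model picked "manage_calendar" although "read_calendar" suffices.
stage1 = selection_over_privileged(library[1], library, required)

# Stage 2: the model ran under "read_calendar" but also called a write tool.
stage2 = execution_over_privileged(library[0], ["calendar.read", "calendar.write"])
```

Under these assumptions, `stage1` flags a selection failure and `stage2` holds the out-of-scope tool calls. A real benchmark would need a richer notion of privilege than a single integer, since overlapping skills are not always totally ordered.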

Similar Articles

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

Hugging Face Daily Papers

This paper investigates over-privileged tool selection in LLM agents, introducing ToolPrivBench to evaluate and mitigate unnecessary use of high-privilege tools. It finds that safety alignment does not ensure least-privilege choices, and proposes a post-training defense that reduces excessive privilege use without sacrificing performance.

The Capability Frontier: Benchmarks Miss 82% of Model Performance

arXiv cs.AI

The paper introduces the Capability Frontier, a Pareto frontier over models that corrects for biases in single-model and single-run evaluations, showing that standard benchmarks miss up to 82% of model performance and that collective LLM capabilities are substantially underestimated.

SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

Hugging Face Daily Papers

SkillFlow introduces a benchmark of 166 tasks across 20 families for evaluating autonomous agents' ability to discover, repair, and maintain skills over time through a lifelong learning protocol. Experiments reveal a substantial capability gap among leading models, with Claude Opus 4.6 improving significantly while others show limited or negative gains from skill evolution.