Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents
Summary
The paper introduces PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps, for evaluating phone-use agents. Results show that avoiding a harmful outcome does not by itself indicate safety: a model may avoid harm only because it failed to act, so evaluation must separate unsafe judgment from inability to act.
Cached at: 05/12/26, 02:52 PM
Source: https://huggingface.co/papers/2605.07630
Abstract
The PhoneSafety benchmark shows that avoiding a harmful outcome does not necessarily indicate safety: a model may avoid harm only because it failed to act, so capability signals must be distinguished from safety signals.
When a phone-use agent avoids harm, does that show safety, or simply inability to act? Existing evaluations often cannot tell. A harmful outcome may be avoided because the agent recognized the risk and chose the safe action, or because it failed to understand the screen or execute any relevant action at all. These cases have different causes and call for different fixes, yet current benchmarks often merge them under task success, refusal, or final harmful outcome. We address this problem with PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. Each instance isolates the next decision at a risky moment and asks a simple question: does the model take the safe action, take the unsafe action, or fail to do anything useful? We evaluate eight representative phone-use agents under this framework. Our results reveal two main patterns. First, stronger general phone-use ability does not reliably imply safer choices at risky moments. Models that perform better on ordinary app tasks are not always the ones that behave more safely when the next action matters. Second, failures to do anything useful behave like a capability signal rather than a safety signal: they are concentrated in more visually and operationally demanding settings and remain stable when the evaluation protocol changes. Across models, failures split into two recurring patterns: unsafe choices in settings where the model can act but chooses wrongly, and inability to act in more visually and operationally demanding screens. Overall, a harmless outcome is not enough to count as evidence of safety. Evaluating phone-use agents requires separating unsafe judgment from inability to act.
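The three-way question in the abstract (safe action, unsafe action, or nothing useful) can be illustrated with a small scoring sketch. This is not the paper's evaluation code: the `Outcome` labels, the `summarize` function, and the three rate names are hypothetical, chosen only to show why a single "harmless outcome" rate can hide the difference between safe judgment and inability to act.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    """Hypothetical per-instance labels mirroring the abstract's three cases."""
    SAFE = "safe"            # the model took the safe action
    UNSAFE = "unsafe"        # the model took the unsafe action
    NO_USEFUL = "no_useful"  # the model failed to do anything useful


def summarize(outcomes):
    """Return rates that keep unsafe judgment apart from inability to act.

    All metric names here are assumptions made for illustration.
    """
    counts = Counter(outcomes)
    total = len(outcomes)
    # Instances where the model actually committed to one of the two actions.
    acted = counts[Outcome.SAFE] + counts[Outcome.UNSAFE]
    return {
        # Share of instances with no harmful outcome (merges SAFE and NO_USEFUL).
        "harmless_rate": (total - counts[Outcome.UNSAFE]) / total if total else 0.0,
        # Capability-style signal: the model did nothing useful.
        "inability_rate": counts[Outcome.NO_USEFUL] / total if total else 0.0,
        # Safety-style signal: unsafe choices among instances where it acted.
        "unsafe_rate_when_acting": counts[Outcome.UNSAFE] / acted if acted else 0.0,
    }


# Two made-up models with the same harmless rate but different failure profiles.
model_a = [Outcome.SAFE] * 8 + [Outcome.UNSAFE] * 2
model_b = [Outcome.SAFE] * 2 + [Outcome.UNSAFE] * 2 + [Outcome.NO_USEFUL] * 6

report_a = summarize(model_a)
report_b = summarize(model_b)
```

In this made-up data both models are harmless on 80% of instances, yet model B reaches that figure mostly by failing to act, and it chooses the unsafe action half the time when it does act. Reporting the inability rate and the unsafe rate separately keeps that difference visible.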
Similar Articles
On Safety Risks in Experience-Driven Self-Evolving Agents
Researchers from Harbin Institute of Technology and Singapore Management University investigate safety risks in experience-driven self-evolving LLM agents, finding that even benign task experience can compromise safety in high-risk scenarios due to agents' execution-oriented tendencies, and revealing a fundamental safety–utility trade-off.
I built an AI support agent where the main metric is unsafe auto-action rate, not just accuracy
A technical walkthrough of building a telecom customer support agent that prioritizes safety metrics over classifier accuracy, using a deterministic access gate, scoped tool execution, and route-level evaluation.
What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents
This paper argues that current benchmarks for autonomous agents fail to evaluate whether an agent should have proceeded at all, introducing a 'compliance bias'. The authors propose a taxonomy of abstention-warranted scenarios and new evaluation protocols (Safety Rate, Usability Rate, Informed Refusal Rate) with preliminary results showing tunable safety–usability tradeoffs across model families.
SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces
SABER introduces a benchmark for evaluating the operational safety of LLM coding agents in realistic stateful project workspaces, showing that even the best model has over a 54% harmful safety-violation rate, indicating insufficient alignment for real-world environments.
AI safety is arguing about the wrong boundary
This article argues that the AI safety debate is misdirected, focusing on model alignment and internal controls instead of the critical boundary: external admission authority over agent execution. It warns that systems capable of self-authorizing high-impact actions (e.g., deploying code, moving money) pose a fundamental risk that logging and monitoring cannot mitigate.