Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

Hugging Face Daily Papers 05/08/26, 12:00 AM Papers

safety evaluation phone-use agents benchmark ai-safety capability-vs-safety

Summary

The paper introduces PhoneSafety, a benchmark of 700 safety-critical moments across more than 130 apps for evaluating phone-use agents. Results show that avoiding a harmful outcome does not necessarily indicate safety, because a model may avoid harm simply by failing to act. Evaluation therefore needs to separate capability signals (inability to act) from safety signals (unsafe choices).

When a phone-use agent avoids harm, does that show safety, or simply inability to act? Existing evaluations often cannot tell. A harmful outcome may be avoided because the agent recognized the risk and chose the safe action, or because it failed to understand the screen or execute any relevant action at all. These cases have different causes and call for different fixes, yet current benchmarks often merge them under task success, refusal, or final harmful outcome. We address this problem with PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. Each instance isolates the next decision at a risky moment and asks a simple question: does the model take the safe action, take the unsafe action, or fail to do anything useful? We evaluate eight representative phone-use agents under this framework. Our results reveal two main patterns. First, stronger general phone-use ability does not reliably imply safer choices at risky moments. Models that perform better on ordinary app tasks are not always the ones that behave more safely when the next action matters. Second, failures to do anything useful behave like a capability signal rather than a safety signal: they are concentrated in more visually and operationally demanding settings and remain stable when the evaluation protocol changes. Across models, failures split into two recurring patterns: unsafe choices in settings where the model can act but chooses wrongly, and inability to act in more visually and operationally demanding screens. Overall, a harmless outcome is not enough to count as evidence of safety. Evaluating phone-use agents requires separating unsafe judgment from inability to act.
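The three-way question described above (safe action, unsafe action, or no useful action) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Outcome` labels, the `summarize` helper, the metric names, and the example counts are hypothetical and are not taken from the paper or its released artifacts. It only shows why a single "harmless outcome" rate can hide two different failure profiles.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for the next decision at a risky moment."""
    SAFE = "safe"                    # model took the safe action
    UNSAFE = "unsafe"                # model took the unsafe action
    NO_USEFUL = "no_useful_action"   # model failed to do anything useful


def summarize(outcomes):
    """Report rates that keep unsafe judgment and inability to act apart.

    The metric names here are illustrative, not the paper's metrics.
    """
    total = len(outcomes)
    if total == 0:
        raise ValueError("need at least one labeled instance")
    counts = Counter(outcomes)
    acted = counts[Outcome.SAFE] + counts[Outcome.UNSAFE]
    return {
        # Share of instances with no harmful action (mixes both causes).
        "harmless_rate": (total - counts[Outcome.UNSAFE]) / total,
        # Capability-style signal: the model did nothing useful.
        "inability_rate": counts[Outcome.NO_USEFUL] / total,
        # Safety-style signal: among instances where it acted, how often safely.
        "safe_given_acted": counts[Outcome.SAFE] / acted if acted else None,
    }


# Made-up label counts for two hypothetical models (not real results).
model_a = [Outcome.SAFE] * 6 + [Outcome.UNSAFE] * 2 + [Outcome.NO_USEFUL] * 2
model_b = [Outcome.SAFE] * 2 + [Outcome.UNSAFE] * 2 + [Outcome.NO_USEFUL] * 6

summary_a = summarize(model_a)
summary_b = summarize(model_b)
print(summary_a)
print(summary_b)
```

In this made-up example both models have the same harmless rate, yet one mostly acts and chooses safely while the other mostly fails to act, which is the distinction the abstract argues a single outcome metric cannot show.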

Original Article

Cached at: 05/12/26, 02:52 PM

Source: https://huggingface.co/papers/2605.07630

Abstract

The PhoneSafety benchmark shows that avoiding a harmful outcome does not necessarily indicate safety: a model may avoid harm simply because it fails to act. Evaluation must therefore distinguish capability signals from safety signals.

When a phone-use agent avoids harm, does that show safety, or simply inability to act? Existing evaluations often cannot tell. A harmful outcome may be avoided because the agent recognized the risk and chose the safe action, or because it failed to understand the screen or execute any relevant action at all. These cases have different causes and call for different fixes, yet current benchmarks often merge them under task success, refusal, or final harmful outcome. We address this problem with PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. Each instance isolates the next decision at a risky moment and asks a simple question: does the model take the safe action, take the unsafe action, or fail to do anything useful? We evaluate eight representative phone-use agents under this framework. Our results reveal two main patterns. First, stronger general phone-use ability does not reliably imply safer choices at risky moments. Models that perform better on ordinary app tasks are not always the ones that behave more safely when the next action matters. Second, failures to do anything useful behave like a capability signal rather than a safety signal: they are concentrated in more visually and operationally demanding settings and remain stable when the evaluation protocol changes. Across models, failures split into two recurring patterns: unsafe choices in settings where the model can act but chooses wrongly, and inability to act in more visually and operationally demanding screens. Overall, a harmless outcome is not enough to count as evidence of safety. Evaluating phone-use agents requires separating unsafe judgment from inability to act.

Get this paper in your agent:

hf papers read 2605.07630

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

On Safety Risks in Experience-Driven Self-Evolving Agents

I built an AI support agent where the main metric is unsafe auto-action rate, not just accuracy

What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces

AI safety is arguing about the wrong boundary
