The paper introduces PhoneSafety, a benchmark of 700 safety-critical moments drawn from 130+ apps for evaluating phone-use agents. The results show that avoiding a harmful outcome does not by itself indicate safety: a model may simply fail to act, or it may make an unsafe choice. Evaluation therefore needs to distinguish capability signals from safety signals.