PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions
Summary
PhoneHarness is a mixed-action benchmark and execution framework that evaluates phone-use agents on verifiable mobile workflows. On the annotated evaluation split it reaches a 75.0% pass rate, 12.9 percentage points above the strongest non-PhoneHarness settings, using deterministic action routing and auditable execution traces.
Source: https://huggingface.co/papers/2606.14832
Abstract
PhoneHarness presents a mixed-action benchmark and execution framework for evaluating phone-use agents on verifiable mobile workflows, demonstrating superior performance over existing approaches through deterministic action routing and auditable execution traces.
Phone agents are increasingly expected to complete real mobile workflows rather than merely predict the next screen action. However, much of the current mobile-agent literature still evaluates agents primarily as GUI controllers that observe a screen, emit taps and swipes, and are scored by target app state. Real phone-use tasks are broader: they require deciding when to use app GUIs, device-side commands, or structured tools, while leaving evidence that the intended side effect actually occurred. We introduce PhoneHarness, a mixed-action benchmark and execution harness for studying phone-use agents on verifiable mobile workflows. PhoneHarness runs a device-side agent loop over GUI, CLI, and host-side tool actions, combining deterministic action routing with bounded GUI delegation and auditable execution traces. Its benchmark, PhoneHarness Bench, evaluates whether agents complete tasks with observable side effects, not only whether they produce plausible final answers. On the annotated evaluation split, PhoneHarness reaches a 75.0% pass rate, outperforming the strongest non-PhoneHarness settings by 12.9 percentage points. PhoneHarness and PhoneHarness Bench therefore play distinct but mutually dependent roles: the harness makes mixed phone workflows executable, while the benchmark measures whether agents can use that harness reliably and safely. Our findings suggest that reliable phone automation depends on action-surface routing and verifiable execution, not only visual GUI control.
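The abstract names three mechanisms: deterministic action routing, bounded GUI delegation, and auditable execution traces. The Python sketch below is one way to picture how they could fit together. It is not the authors' implementation. Every name in it (Action, Harness, gui_budget, the handler functions, the rejection strings) is a hypothetical chosen for illustration, and the paper's actual interfaces may look quite different.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    """A hypothetical agent action tagged with the surface it targets."""
    surface: str  # assumed values: "gui", "cli", or "tool"
    command: str


@dataclass
class Harness:
    """Illustrative router: one handler per surface, a GUI step budget, a trace."""
    handlers: Dict[str, Callable[[str], str]]
    gui_budget: int = 3  # assumed cap on delegated GUI steps
    trace: List[dict] = field(default_factory=list)

    def execute(self, action: Action) -> str:
        # Deterministic routing: the surface tag alone selects the handler.
        if action.surface not in self.handlers:
            result = "rejected: unknown surface"
        elif action.surface == "gui" and self.gui_budget <= 0:
            # Bounded delegation: refuse GUI steps once the budget is spent.
            result = "rejected: gui budget exhausted"
        else:
            if action.surface == "gui":
                self.gui_budget -= 1
            result = self.handlers[action.surface](action.command)
        # Auditable trace: every attempt is recorded, including rejections.
        self.trace.append({
            "step": len(self.trace),
            "surface": action.surface,
            "command": action.command,
            "result": result,
        })
        return result


if __name__ == "__main__":
    # Stub handlers stand in for real GUI, CLI, and tool backends.
    harness = Harness(
        handlers={
            "gui": lambda c: "gui:" + c,
            "cli": lambda c: "cli:" + c,
            "tool": lambda c: "tool:" + c,
        },
        gui_budget=1,
    )
    for act in [Action("cli", "list-files"), Action("gui", "tap-send"),
                Action("gui", "tap-again")]:
        harness.execute(act)
    for entry in harness.trace:
        print(entry)
```

In this sketch, a benchmark-style check would inspect the trace, or the device state the handlers changed, rather than trusting the agent's final answer. That mirrors the abstract's emphasis on observable side effects.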
Similar Articles
PhoneWorld: Scaling Phone-Use Agent Environments
PhoneWorld is a pipeline that transforms real GUI trajectories into controllable mobile environments, enabling scalable creation of phone-use benchmarks. It covers 34 apps across 16 domains and shows that using its supervision improves performance on multiple evaluation benchmarks.
HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry
HarnessX is a foundry for composable, adaptive, and evolvable AI agent harnesses that uses compositional primitives and trace-driven evolution to improve agent performance. Across five benchmarks, it achieves an average gain of +14.5% (up to +44.0%), demonstrating that runtime interface evolution is a complementary lever to model scaling.
SkillHarness: Harnessing Safe Skills for Computer-Use Agents
SkillHarness is a framework that enables computer-use agents to safely learn and execute skills in dynamic environments by incorporating safety constraints and adaptive skill selection mechanisms, reducing unsafe rates by 57.1%.
Harness design for long-running application development
Anthropic engineers detail a multi-agent harness design using generator and evaluator agents to improve Claude's ability to build complete, high-quality frontend applications autonomously over long durations.
Auditing Agent Harness Safety
This paper proposes HarnessAudit, a framework for auditing LLM agent execution trajectories beyond final outputs, focusing on boundary compliance, execution fidelity, and system stability. It introduces HarnessAudit-Bench with 210 tasks across eight domains and evaluates ten harness configurations, finding that task completion misaligns with safe execution and violations accumulate with trajectory length.