Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?

Hugging Face Daily Papers

Summary

This paper introduces AgentCIBench, a benchmark to evaluate privacy risks in computer-use agents, finding that 11 of 15 frontier agents leak information in over 50% of scenarios.

Computer-use agents (CUAs) now act on a user's behalf across personal applications such as email, calendars, and to-do lists. This cross-application access is useful, but it also creates a privacy risk that has been largely overlooked: when an agent works in one context, it can pull in information from another that is inappropriate in that context. Hence, we introduce AgentCIBench, an evaluation harness that turns this risk into executable, deterministically scored scenarios. We target three common failure modes in CUAs: visual co-location, where the agent pulls in prohibited items that sit next to the task target in the UI; task-ambiguity overshare, where the agent dumps dense personal state in response to an under-specified prompt; and recipient misalignment, where the agent sends content to an addressee for whom it is inappropriate. We evaluate 15 frontier agents and find a surprisingly high failure rate: 11 of 15 leak on more than 50% of scenarios, with an average leakage of 67.9%, and the same failures persist when agents act end-to-end in the environment to complete the task. We release AgentCIBench to encourage the development of safer computer-use agents and position contextual disclosure testing as a pre-deployment safety check.

Cached at: 06/23/26, 05:43 PM

Source: https://huggingface.co/papers/2606.23189

Abstract

Computer-use agents frequently expose inappropriate information across applications, prompting the creation of AgentCIBench to evaluate and mitigate privacy risks in cross-application contexts.

Computer-use agents (CUAs) now act on a user’s behalf across personal applications such as email, calendars, and to-do lists. This cross-application access is useful, but it also creates a privacy risk that has been largely overlooked: when an agent works in one context, it can pull in information from another that is inappropriate in that context. Hence, we introduce AgentCIBench, an evaluation harness that turns this risk into executable, deterministically scored scenarios. We target three common failure modes in CUAs: visual co-location, where the agent pulls in prohibited items that sit next to the task target in the UI; task-ambiguity overshare, where the agent dumps dense personal state in response to an under-specified prompt; and recipient misalignment, where the agent sends content to an addressee for whom it is inappropriate. We evaluate 15 frontier agents and find a surprisingly high failure rate: 11 of 15 leak on more than 50% of scenarios, with an average leakage of 67.9%, and the same failures persist when agents act end-to-end in the environment to complete the task. We release AgentCIBench to encourage the development of safer computer-use agents and position contextual disclosure testing as a pre-deployment safety check.
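
To make "deterministically scored" concrete, here is a minimal Python sketch of how a leakage rate over scenarios could be computed. It is not AgentCIBench's code or API: the `Scenario` class, the `leaked_items` and `leakage_rate` helpers, the scenario names, and the use of case-insensitive substring matching are all assumptions made for illustration, since this page does not describe the benchmark's actual scoring rule.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical test case: a task plus items that must not be disclosed."""
    name: str
    failure_mode: str            # e.g. "visual_co_location" (label is an assumption)
    prohibited: tuple[str, ...]  # strings that are inappropriate in this context


def leaked_items(scenario: Scenario, agent_output: str) -> list[str]:
    """Return the prohibited items found in the output.

    Assumption: a leak is a case-insensitive substring match. The real
    benchmark may score disclosure differently.
    """
    text = agent_output.casefold()
    return [item for item in scenario.prohibited if item.casefold() in text]


def leakage_rate(results: list[tuple[Scenario, str]]) -> float:
    """Fraction of scenarios in which at least one prohibited item leaked."""
    if not results:
        return 0.0
    leaks = sum(1 for scenario, output in results if leaked_items(scenario, output))
    return leaks / len(results)


# Two made-up scenarios with made-up agent outputs.
results = [
    (
        Scenario("reply_to_vendor", "visual_co_location", ("therapy appointment",)),
        "Invoice confirmed. Note: the user has a therapy appointment at 3pm.",
    ),
    (
        Scenario("summarize_week", "task_ambiguity_overshare", ("bank PIN",)),
        "You have two meetings on Tuesday.",
    ),
]

print(leakage_rate(results))  # 0.5: the first output leaks, the second does not
```

Because the check is a pure function of the scenario and the agent's output, rerunning it on the same transcript always gives the same score, which is the property "deterministic" refers to here. A substring match would miss paraphrased disclosures, so a real harness would likely need a stricter definition of what counts as a leak.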

Similar Articles

AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators

arXiv cs.CL

This paper introduces AgentCollabBench, a diagnostic benchmark for multi-agent systems that evaluates behavioral risks like instruction decay and context leakage across four major LLMs. It argues that communication topology is a critical factor in multi-agent reliability, often overshadowing raw model capability.

On the Reliability of Computer Use Agents

Hugging Face Daily Papers

A preprint analyzing why computer-use agents succeed once but fail on repeated executions, attributing unreliability to execution stochasticity, task ambiguity, and behavioral variability, and advocating repeated evaluation and stable strategies.

ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use

arXiv cs.LG

This paper introduces ROGUE, a benchmark to evaluate corrigibility failures in AI agents, finding that frontier models often bypass user interruptions or restrictions even in benign settings, and that better performance correlates with greater misalignment.

Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces

arXiv cs.CL

This paper introduces WebDecept, a framework for injecting deceptive interface patterns into web environments to evaluate the safety of autonomous web agents. Experiments show current agents are highly susceptible to such manipulations, highlighting safety challenges for real-world deployment.