Governance by Construction for Generalist Agents
Summary
This paper presents CUGA's policy system, a modular policy-as-code layer that enforces governance at multiple checkpoints in LLM agent execution, enabling predictable and auditable behavior without model fine-tuning.
# Governance by Construction for Generalist Agents
Source: [https://arxiv.org/html/2605.20874](https://arxiv.org/html/2605.20874)
Iftach Shoham, Alon Oved, Ido Levy, Sami Marreed, Harold Ship, Offer Akrabi, Sergey Zeltyn, Avi Yaeli, and Nir Mashkif \(IBM, Haifa, Israel\)
\(2026\)
###### Abstract\.
Enterprise agents are increasingly expected to operate autonomously across tools and interfaces, yet production deployments require governance by construction\. Systems must specify which actions are allowed, when human oversight is required, and what information may be exposed, without rebuilding the agent for each domain\. This demo presents CUGA’s policy system, a modular policy\-as\-code layer that composes with a generalist LLM agent to deliver predictable, auditable, and compliance\-aware behavior in compound workflows without model fine\-tuning\. We present a runtime governance architecture that enforces policy interventions at every critical stage of execution\. Rather than passively constraining behavior, policies intercept the agent at five structural checkpoints: upstream of planning \(Intent Guard\), within the system prompt to steer reasoning \(Playbook\), at the tool\-call boundary to enforce proper usage \(Tool Guide\), outside the reasoning loop as a Human\-in\-the\-Loop gate for high\-risk actions \(Tool Approvals\), and at the output stage to filter and structure the final response \(Output Formatter\)\. Together, these stages embed governance continuously across the agent’s execution pipeline rather than treating it as an afterthought\. Using a healthcare scenario and a multi\-layered enforcement intervention, the demo shows dynamic playbook injection for structured tool\-sequence enforcement, intent guards that block malicious or accidental harmful requests, and human\-in\-the\-loop tool approval checkpoints for potentially destructive actions\. The artifact illustrates how typed governance primitives enable faster, safer deployment of enterprise agentic systems while improving policy adherence and execution consistency\.
Keywords: Generalist Agent, Computer Using Agent, Governance, Policy System, LLM Agent
Copyright: acmlicensed, cc\. Journal year: 2026\. Conference: ACM Conference on AI and Agentic Systems \(CAIS ’26\), May 26–29, 2026, San Jose, CA, USA\. DOI: 10\.1145/3786335\.3813192\. ISBN: 979\-8\-4007\-2415\-2/2026/05\. CCS: Computing methodologies, Intelligent agents\.
## 1\.Introduction
LLM\-based agents are increasingly adapted to perform complex, multi\-step tasks across enterprise software environments, extending earlier work on conversational RPA, next\-action recommendation, and process\-aware automation \(Yaeli et al\., [2022](https://arxiv.org/html/2605.20874#bib.bib17); Zeltyn et al\., [2022](https://arxiv.org/html/2605.20874#bib.bib12); Oved et al\., [2025](https://arxiv.org/html/2605.20874#bib.bib14)\)\. Unlike static chatbots, modern agents integrate planning, internal and external tool use, memory, and iterative reasoning to execute compound workflows across heterogeneous systems\. Recent advances in tool\-augmented and computer\-using agents enable interaction with APIs, databases, and user interfaces, allowing agents to autonomously retrieve data, modify records, and trigger communications\. Although this flexibility enables generalization, prior work on LLM automation agents and web\-agent benchmarks \(Ying et al\., [2025](https://arxiv.org/html/2605.20874#bib.bib26)\) suggests \(Ma et al\., [2026](https://arxiv.org/html/2605.20874#bib.bib21); Yang et al\., [2025](https://arxiv.org/html/2605.20874#bib.bib23); Luo et al\., [2025](https://arxiv.org/html/2605.20874#bib.bib24); Shi et al\., [2025](https://arxiv.org/html/2605.20874#bib.bib25); Jiang et al\., [2025](https://arxiv.org/html/2605.20874#bib.bib27); Chen et al\., [2025](https://arxiv.org/html/2605.20874#bib.bib28)\) that it also introduces unpredictability: agents may hallucinate facts, misuse tools, violate procedural constraints, behave inconsistently, or expose sensitive information \(Schwartz et al\., [2023](https://arxiv.org/html/2605.20874#bib.bib13); Shlomov et al\., [2025](https://arxiv.org/html/2605.20874#bib.bib15); Levy et al\., [2024](https://arxiv.org/html/2605.20874#bib.bib16)\)\. In enterprise settings, such failures are not merely quality degradations; they can result in compliance violations, data leakage, financial impacts, or reputational damage\.
Many current governance strategies rely on prompt\-engineering techniques, instruction stuffing, constraint injection, and post\-hoc validation \(Tsai and Bagdasarian, [2025](https://arxiv.org/html/2605.20874#bib.bib1); Gaurav et al\., [2025](https://arxiv.org/html/2605.20874#bib.bib2); Zwerdling et al\., [2025](https://arxiv.org/html/2605.20874#bib.bib4)\)\. While these approaches can shape model behavior in controlled scenarios, they exhibit several structural limitations\. Behavioral constraints are tightly coupled to prompt structure and can become brittle as policies evolve; governance logic often becomes duplicated across multi\-agent deployments; enforcement decisions are delegated to model reasoning and are therefore difficult to audit; and independently implemented guardrails may produce inconsistent outcomes without principled conflict resolution\. More structured approaches encode operating procedures through role specialization or guard code \(Hong et al\., [2023](https://arxiv.org/html/2605.20874#bib.bib3); Zwerdling et al\., [2025](https://arxiv.org/html/2605.20874#bib.bib4)\), but typically require substantial architectural commitment or domain\-specific reconfiguration when procedures change\. Parlant’s guideline\-driven alignment layers \(Parlant, [2025](https://arxiv.org/html/2605.20874#bib.bib10)\) dynamically retrieve and inject natural\-language rules and tool\-use guidance at inference time but remain focused on local, chat\-level compliance\. Skill\-based frameworks such as Claude’s skill system provide reusable behavioral templates applied during execution\. Both approaches still rely on the model to interpret and follow instructions, leaving policy adherence inherently probabilistic and vulnerable\. As enterprise adoption accelerates, governance must evolve beyond prompt\-level heuristics toward policy\-by\-construction: explicit, typed, runtime\-enforced control primitives that operate independently of the model itself\.
In this work, we present the implementation of our vision for trustworthy, enterprise\-ready generalist agents, building on prior work on automation trust, web\-agent bottlenecks, and safety\-oriented agent evaluation, and demonstrate it through the open\-source CUGA agent \(Shlomov et al\., [2026](https://arxiv.org/html/2605.20874#bib.bib6); Marreed et al\., [2025](https://arxiv.org/html/2605.20874#bib.bib5); CUGA Project, [2026a](https://arxiv.org/html/2605.20874#bib.bib18)\)\. We introduce the CUGA policy system, a modular policy\-as\-code layer built into CUGA without requiring model fine\-tuning\. The framework introduces typed governance primitives that constrain intent recognition, planning, tool invocation, human approval requirements, and output formatting *at runtime*, enforcing consistency and governance\. Policies are matched using lightweight triggers \(keywords and embedding similarity\), which reduces reliance on probabilistic model reasoning; conflicts between matching policies are resolved explicitly; and each decision produces structured explanation traces for observability and quality assurance\. The system requires no architectural changes, achieves compliance with open\-source models \(e\.g\., GPT\-OSS\-120B\) suitable for on\-prem deployment, does not affect or modify the agent’s output, and provides built\-in observability\. By externalizing governance into composable runtime policies, the system enables predictable, auditable, and compliance\-aware behavior across compound workflows while preserving the flexibility of general\-purpose LLM agents\. We demonstrate our approach through an end\-to\-end enterprise workflow that highlights runtime policy enforcement\. The demonstration centers on two representative scenarios: a healthcare assistance workflow and a multi\-layered enforcement intervention\.
The demo showcases four of the five modular governance primitives, each defined as a markdown policy that intervenes at a different execution checkpoint\. The Playbook enforces structured multi\-step execution by dynamically associating requests with predefined tool sequences and enterprise\-specific constraints, while the Tool Guide enriches tool descriptions\. The Intent Guard prevents malicious or accidentally harmful actions by intercepting and blocking restricted intents prior to tool execution\. The Tool Approval mechanism introduces a human\-in\-the\-loop \(HITL\) checkpoint that pauses the execution graph and requires explicit confirmation before potentially destructive actions can proceed\.
## 2\.Methods Overview
CUGA’s policy system strengthens enterprise\-ready governance by using the trigger mechanism to intercept the agent’s execution graph at five points\. First, *Intent Guards* sit at the very beginning of the process to block bad requests before the agent takes any action\. Second, *Playbooks* are injected into the system prompt as step\-by\-step guidance that steers the agent’s planning and influences its reasoning\. Third, *Tool Guides* update tool descriptions right before execution to instruct the agent on proper tool usage\. Fourth, *Tool Approvals* act as a safeguard outside the reasoning loop, pausing the graph to wait for human confirmation if a risky action is attempted\. Finally, the *Output Formatter* serves as the last intervention point, filtering the final answer before it is returned\. This section presents the architectural design principles, data models, and runtime mechanisms underlying each policy category\. The emphasis is placed on modularity, extensibility, and seamless integration with the LangGraph execution framework\.
### 2\.1\.Policy System Architecture
The policy system is implemented as a modular, layered framework consisting of four architectural tiers:
1. Policy Models Layer: Strongly\-typed data models defining policy schemas, triggers, and action semantics\.
2. Storage Layer: Persistent and semantic storage backed by a vector database for similarity\-based retrieval\.
3. Policy Agent Layer: Runtime matching and conflict\-resolution logic that evaluates policies against execution context\.
4. Enactment Layer: Execution primitives that apply policy decisions within the LangGraph workflow\.
This layered design enforces separation of concerns: policy representation is decoupled from storage, matching logic, and execution semantics\. Policy evaluation occurs at four semantically meaningful checkpoints: \(1\) intent analysis, \(2\) tool preparation, \(3\) post\-code generation, and \(4\) final response generation\.
#### Trigger System\.
All policy types, except for Tool Approval, rely on a configurable trigger mechanism\. Triggers are defined as discriminated union types, enabling flexible matching strategies within a unified interface\. Supported trigger mechanisms include:
- Natural Language: Semantic similarity matching via embedding\-based retrieval with configurable similarity thresholds\.
- Keyword: Exact or fuzzy keyword matching with logical composition \(AND/OR\) and case sensitivity controls\.
- Application: Contextual matching based on the active application domain\.
- State: Evaluation against structured agent state using equality, containment, or regular expression operators\.
- Tool: Detection of tool usage at specific execution stages \(pre\- or post\-invocation\)\.
Triggers may target different contextual fields, including inferred intent, intermediate sub\-tasks, and final agent responses\. This flexibility enables policies to intervene at different abstraction levels of the execution life\-cycle\. Policy persistence and semantic retrieval are implemented using Milvus as a vector database backend\. Embeddings are generated using either an API\-based encoder or a local encoder \(Wang et al\., [2020](https://arxiv.org/html/2605.20874#bib.bib8)\)\.
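To make the discriminated-union idea concrete, the following is a minimal sketch of two of the trigger kinds listed above (Keyword and State). It is not CUGA's actual schema: the class names, field names, and the `matches` helper are hypothetical, and only exact substring keyword matching is shown (fuzzy matching is omitted).

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Any, Literal, Union


@dataclass(frozen=True)
class KeywordTrigger:
    """Hypothetical keyword trigger with AND/OR composition."""
    keywords: tuple[str, ...]
    operator: Literal["and", "or"] = "or"
    case_sensitive: bool = False
    kind: Literal["keyword"] = "keyword"  # discriminator field


@dataclass(frozen=True)
class StateTrigger:
    """Hypothetical trigger evaluated against structured agent state."""
    key: str
    op: Literal["equals", "contains", "regex"]
    value: str
    kind: Literal["state"] = "state"  # discriminator field


# The union is "discriminated" by the literal `kind` field.
Trigger = Union[KeywordTrigger, StateTrigger]


def matches(trigger: Trigger, text: str, state: dict[str, Any]) -> bool:
    """Dispatch on the discriminator and evaluate the trigger."""
    if trigger.kind == "keyword":
        haystack = text if trigger.case_sensitive else text.lower()
        needles = [k if trigger.case_sensitive else k.lower() for k in trigger.keywords]
        hits = [needle in haystack for needle in needles]
        return all(hits) if trigger.operator == "and" else any(hits)
    if trigger.kind == "state":
        actual = str(state.get(trigger.key, ""))
        if trigger.op == "equals":
            return actual == trigger.value
        if trigger.op == "contains":
            return trigger.value in actual
        return re.search(trigger.value, actual) is not None
    raise ValueError(f"unsupported trigger kind: {trigger.kind}")
```

Because every variant carries its own `kind`, a new trigger type can be added without changing the signature that callers use, which is the practical benefit of a unified trigger interface.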
### 2\.2\.Intent Guard
Intent Guard policies enforce hard constraints by intercepting user intent outside the agent reasoning loop, before the agent acts\. Their primary purpose is early\-stage blocking of unauthorized or restricted operations\. Upon activation, an Intent Guard terminates execution immediately; this early termination prevents downstream reasoning and unwanted tool invocation\. Intent Guard trigger evaluation follows a two\-phase process\. \(1\) Deterministic matching: keyword\-based triggers are evaluated first\. Intent Guards are prioritized over other policy types to guarantee that blocking constraints supersede advisory policies\. \(2\) Conflict resolution: for natural\-language triggers, multiple policies may satisfy the similarity threshold\. In such cases, an LLM\-based structured reasoning step selects the most appropriate policy\. The model outputs a selected index, confidence score, and justification\.
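The two-phase evaluation can be sketched as follows. This is an illustration under loud assumptions: the bag-of-words `embed` function stands in for a real embedding model, and `pick_most_similar` stands in for the LLM-based conflict-resolution step described above (it simply takes the highest similarity). All names here are hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class IntentGuard:
    name: str
    keywords: tuple = ()      # phase 1: deterministic keyword trigger
    description: str = ""     # phase 2: natural-language trigger
    threshold: float = 0.5    # similarity threshold for phase 2


def embed(text):
    """Toy embedding (word counts); a stand-in for a real encoder."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pick_most_similar(candidates):
    """Stand-in resolver: the source uses an LLM reasoning step here."""
    return max(candidates, key=lambda pair: pair[1])[0]


def evaluate(intent, guards, resolver=pick_most_similar):
    """Return the guard that blocks `intent`, or None to let it pass."""
    lowered = intent.lower()
    # Phase 1: keyword triggers are checked first and win immediately.
    for guard in guards:
        if guard.keywords and all(k.lower() in lowered for k in guard.keywords):
            return guard
    # Phase 2: similarity triggers; several guards may clear their threshold.
    query = embed(intent)
    scored = [(g, cosine(query, embed(g.description))) for g in guards if g.description]
    candidates = [(g, score) for g, score in scored if score >= g.threshold]
    if not candidates:
        return None
    return candidates[0][0] if len(candidates) == 1 else resolver(candidates)
```

A caller would terminate the execution graph whenever `evaluate` returns a guard, so the agent never sees the blocked request.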
### 2\.3\.Playbook
Playbook policies provide structured workflow guidance for complex tasks\. Rather than blocking execution, they shape agent planning behavior by injecting step\-by\-step instructions\. This is particularly valuable because it delivers precise, targeted instructions without inflating the prompt with excessive tokens, which helps the agent follow the instructions and comply consistently with the user’s task\. A Playbook consists of markdown\-formatted guidance content, an optional ordered list of steps, optional expected outcomes per step, optional tool constraints per step, trigger definitions, and a priority\. This structured representation enables fine\-grained orchestration and verification of multi\-step workflows\.
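A minimal sketch of the Playbook fields named above, together with one way the guidance could be appended to a system prompt. The class names, the rendering format, the tool name `get_coverage`, and the choice to inject only the highest-priority match are all assumptions made for illustration; trigger definitions are omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlaybookStep:
    instruction: str
    expected_outcome: Optional[str] = None  # optional expected outcome
    allowed_tools: tuple = ()               # optional tool constraints


@dataclass(frozen=True)
class Playbook:
    name: str
    guidance_md: str   # markdown-formatted guidance content
    steps: tuple = ()  # optional ordered steps
    priority: int = 0


def render(playbook):
    """Render a playbook as a compact markdown block."""
    lines = [f"## Playbook: {playbook.name}", playbook.guidance_md]
    for number, step in enumerate(playbook.steps, start=1):
        line = f"{number}. {step.instruction}"
        if step.allowed_tools:
            line += f" (tools: {', '.join(step.allowed_tools)})"
        if step.expected_outcome:
            line += f" -> expect: {step.expected_outcome}"
        lines.append(line)
    return "\n".join(lines)


def inject(system_prompt, matched):
    """Append the highest-priority matched playbook (an assumption)."""
    if not matched:
        return system_prompt
    best = max(matched, key=lambda pb: pb.priority)
    return f"{system_prompt}\n\n{render(best)}"
```

Only playbooks whose triggers matched the request are rendered, which is how targeted guidance can stay short instead of placing every procedure in the base prompt.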
### 2\.4\.Tool Approval
Tool Approval policies enforce HITL oversight by requiring explicit confirmation before executing sensitive tools\. Unlike trigger\-based policies, they are evaluated after code generation, allowing inspection of the actual tools the agent intends to invoke at runtime\.
After code generation, the system scans the code for tool invocations\. If a matching tool is detected, execution pauses and enters a waiting state\. The agent resumes only after explicit approval \(or automatic approval, if configured\)\. If multiple policies match, the highest\-priority policy is applied\. This mechanism is particularly useful for sensitive operations such as data modifications \(e\.g\., database writes and updates\) and external API calls, where interactions with third\-party services should be reviewed before execution\.
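The scan-then-pause flow can be sketched with Python's standard `ast` module. This assumes the generated code is Python and that tools appear as ordinary function or method calls; the policy fields and the function names `called_names`, `gate`, and `may_execute` are hypothetical, and a real system would suspend the graph rather than return a boolean.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolApprovalPolicy:
    name: str
    tools: frozenset            # tool names that require approval
    priority: int = 0
    auto_approve: bool = False  # "automatic approval, if configured"


def called_names(code):
    """Collect the names of all functions/methods called in `code`."""
    names = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                names.add(func.id)
            elif isinstance(func, ast.Attribute):
                names.add(func.attr)
    return names


def gate(code, policies):
    """Return the highest-priority matching policy, or None."""
    used = called_names(code)
    matching = [p for p in policies if p.tools & used]
    if not matching:
        return None
    return max(matching, key=lambda p: p.priority)


def may_execute(code, policies, approved_by_human=False):
    """True if the code may run now; False means 'pause and wait'."""
    policy = gate(code, policies)
    if policy is None or policy.auto_approve:
        return True
    return approved_by_human
```

Inspecting the generated code rather than the user's wording is what lets this check catch a paraphrased request that slipped past an earlier guard.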
### 2\.5\.Tool Guide
Tool Guide policies augment tool descriptions with contextual or compliance\-related guidance\. Multiple Tool Guide policies may apply simultaneously, as they are cumulative rather than mutually exclusive\. At runtime, tool definitions are deep\-copied and enriched with the configured guidance, which may be appended to the original description\. The deep\-copy mechanism ensures that modifications remain session\-scoped and do not permanently alter the underlying tool metadata\.
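The deep-copy enrichment step might look like the sketch below. Tool definitions are modeled as plain dictionaries with `name` and `description` keys, and guides as dictionaries with `tools` and `guidance` keys; these shapes and the `apply_tool_guides` name are assumptions, not CUGA's data model.

```python
import copy


def apply_tool_guides(tools, guides):
    """Return enriched copies of `tools`; the originals stay untouched.

    Guides are cumulative: every guide that names a tool appends its
    guidance to that tool's description.
    """
    enriched = copy.deepcopy(tools)
    for tool in enriched:
        for guide in guides:
            if tool["name"] in guide["tools"]:
                tool["description"] = (
                    f'{tool["description"]}\n\nGuidance: {guide["guidance"]}'
                )
    return enriched
```

Since the function never mutates its input, calling it once per session keeps the base tool metadata identical across sessions.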
### 2\.6\.Output Formatter
Output Formatter policies transform the final response into a structured format\. They support three modes: \(1\) returning a predefined template verbatim, \(2\) restructuring the response as formatted Markdown, or \(3\) extracting structured data according to a specified JSON schema\. Trigger evaluation considers both the user input and the generated response, enabling context\-aware formatting decisions without relying on the agent’s capabilities\.
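The three modes can be illustrated with a small dispatcher. The transformations below are deliberately simple deterministic stand-ins (bullet-listing lines for Markdown, `key: value` parsing for JSON extraction) and should not be read as how CUGA performs restructuring; the `format_output` name and its parameters are hypothetical.

```python
import json


def format_output(response, mode, template="", schema=None):
    """Apply one of three formatter modes to the final response."""
    if mode == "template":
        # Mode 1: return the predefined template verbatim.
        return template
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    if mode == "markdown":
        # Mode 2 (stand-in): present each non-empty line as a bullet.
        return "\n".join(f"- {line}" for line in lines)
    if mode == "json":
        # Mode 3 (stand-in): keep only "key: value" lines whose key is
        # declared under the schema's "properties".
        wanted = set((schema or {}).get("properties", {}))
        data = {}
        for line in lines:
            key, sep, value = line.partition(":")
            if sep and key.strip() in wanted:
                data[key.strip()] = value.strip()
        return json.dumps(data, sort_keys=True)
    raise ValueError(f"unknown mode: {mode}")
```

In the JSON mode, fields not declared in the schema are dropped, which is one simple way an output-stage policy can limit what information is exposed.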
## 3\.System Demonstration: CUGA’s Policy System
The demo video \([https://www\.youtube\.com/watch?v=Ie5ZODIW5gk](https://www.youtube.com/watch?v=Ie5ZODIW5gk)\) presents the live CUGA interface, illustrating how governance policies are enforced through targeted interventions in the agent execution graph\. Rather than embedding complex enterprise constraints directly in prompts, CUGA applies policies dynamically during agent execution, enabling safer and more reliable task handling \(for more information see Section [2](https://arxiv.org/html/2605.20874#S2)\)\. Highlighting the importance of governance by construction, the walkthrough presents two representative scenarios designed to showcase CUGA’s ability to guide agent behavior while maintaining strong safety guarantees: \(1\) a complex healthcare workflow \(OAK\) and \(2\) a multi\-layered security intervention\.
#### Playbook
The first scenario demonstrates CUGA in the context of a healthcare assistance task\. A user issues the request “find primary care doctors near me\.” CUGA dynamically associates the request with a predefined playbook that enforces a mandatory tool sequence to guarantee reliable compliance\. The execution begins with context extraction, where CUGA retrieves the user’s active insurance coverage and extracts key attributes required for downstream operations, including the relevant contract UID and brand code\. Guided by the injected policy, the agent then invokes the find care suggestions tool to map the natural language phrase “primary care” to the corresponding internal system code \(code 25\)\. This mapping ensures alignment with the underlying service taxonomy\.
Following this step, the system performs a paginated provider search guided by the Tool Guide’s critical pagination requirement, issuing multiple API calls that collectively return 14 candidate providers\. During the retrieval process, CUGA automatically applies a network\-status constraint to ensure that only in\-network providers are considered\. The results are then aggregated and presented to the user as a cleanly formatted table listing nearby in\-network primary care physicians\. This scenario demonstrates how CUGA operates within a controlled policy framework while enforcing structured workflows, tool usage constraints, and enterprise\-specific rules\.
#### Intent Guards and Tool Approval
The second scenario focuses on CUGA’s preventive capabilities, introducing safety mechanisms that operate independently of the agent’s reasoning process\. First, a user attempts to execute a malicious \(or accidentally harmful\) command by requesting to “delete every contact in CRM\.” The Intent Guard immediately intercepts the request and issues a block command that terminates the execution graph\. As a result, the agent never receives the instruction and therefore cannot reason about or execute the harmful request\. The demonstration then simulates a scenario in which an attacker paraphrases the request in a way that somehow bypasses the initial guard\. In this case, the agent proceeds to generate code intended to drop a database\. Before the action can be executed, CUGA’s Tool Approval mechanism pauses the execution graph and introduces a HITL checkpoint\. The system requires explicit human confirmation before allowing the operation to proceed, ensuring that potentially destructive actions cannot be executed without oversight or accountability by the operator\. Together, these scenarios highlight CUGA’s ability to combine policy\-driven execution with layered safety mechanisms, enabling reliable and secure deployment of language agents in enterprise environments\.
## 4\.Results
We evaluate CUGA’s policy system on two enterprise\-focused generalist agent benchmarks that reflect the structured, policy\-governed workflows central to our demonstration\. OAK \(CUGA Project, [2026b](https://arxiv.org/html/2605.20874#bib.bib19)\) is a 27\-task customer\-care benchmark grounded in realistic insurance scenarios, assessing the agent’s ability to process claims, retrieve coverage information, and answer health\-insurance questions via API calls\. BPO \(IBM Research, [2026](https://arxiv.org/html/2605.20874#bib.bib20)\) is a 26\-task business\-process benchmark covering a broader range of enterprise back\-office operations, providing a more challenging generalist evaluation\. Together, these benchmarks reflect real\-world enterprise settings where tasks carry well\-defined procedures, tool\-heavy execution, and low tolerance for deviation\. All results are reported across three backbone models: GPT\-OSS\-120B \(our main focus, as an open\-source model\), GPT\-4\.1, and Claude Opus\-4\.5\. The primary metric is Success Rate \(SR\): the percentage of tasks completed correctly end\-to\-end\. Each configuration was executed over multiple independent runs with a clean environment\.
Table 1\. Success Rate on OAK and BPO benchmarks, with and without CUGA’s policy system, across three backbone models\. W/O = without policy system; W = with policy system\.
The results in Table [1](https://arxiv.org/html/2605.20874#S4.T1) show consistent and substantial gains across both benchmarks and all three models, ranging from \+15 to \+37\.7 percentage points\. On OAK Bench, the policy system drives GPT\-OSS\-120B to a perfect 100% SR, with GPT\-4\.1 and Claude Opus\-4\.5 reaching 96%\. This is particularly evident in complex, multi\-step tasks where, without the policy system, the agent deviates from required procedures or invokes tools out of sequence, mirroring how the demo’s workflow becomes a structured, reliable process only once the Playbook is active\. On the more demanding BPO benchmark, absolute gains are even larger: GPT\-4\.1 improves by \+37\.7 pp despite its lower baseline, suggesting the policy system is especially impactful where the model’s native instruction\-following is weakest\. Taken together, these results confirm that *governance\-by\-construction* is both effective and model\-agnostic, achieving strong and consistent performance without reliance on frontier LLMs\. Runtime governance does, however, introduce token overhead, an inherent consequence of replacing probabilistic compliance with explicit, auditable instructions \(full figures in Section [A](https://arxiv.org/html/2605.20874#A1)\)\. In enterprise deployments where a single policy violation may constitute a compliance breach or trigger a destructive operation, this tradeoff is well justified\.
### 4\.1\.BPO Policy Ablation
To isolate the contribution of individual governance primitives, we ran CUGA on BPO under a controlled ablation: *no policies*, *2 policies*, and *5 policies*, using GPT\-OSS\-120B\. Each configuration was executed over three independent runs with a clean vector store \(Milvus\) to eliminate carryover effects\. Policies were matched at runtime via lightweight triggers and injected into the agent loop without model fine\-tuning\. Success rate improves monotonically: 46\.2% \(12\.0/26\) with no policies → 71\.8% with 2 policies → 78\.2% with 5 policies \(\+32\.0 pp overall\)\. Each step corresponds directly to the governance primitives introduced in Section [2](https://arxiv.org/html/2605.20874#S2):
\(1\) Capability Boundaries \(Intent Guard\)\. Without policies, the agent attempted to answer unsupported queries by requesting irrelevant identifiers or invoking unrelated APIs\. An explicit capability\-boundary policy caused it to correctly decline out\-of\-scope questions, converting multiple tasks from 0/3 to 3/3 \(e\.g\., Tasks 16, 19, 21–23\)\.
\(2\) Tool Guide\. Several failures stemmed from calling known\-unreliable endpoints \(returning 503 errors or malformed schemas\)\. Augmenting tool descriptions with warning guidance caused the agent to re\-route to stable APIs, stabilizing previously flaky tasks \(e\.g\., Tasks 12, 14, 18\)\.
\(3\) Playbook \(Structured Reasoning Constraint\)\. For multi\-metric questions, the agent previously shortcut to summary endpoints and returned incorrect counts\. A multi\-API reasoning playbook enforced granular tool selection, resolving systematic count errors \(e\.g\., Task 9\) and restoring full correctness\.
Full per\-task results and evaluation sub\-metrics are provided in Section [A](https://arxiv.org/html/2605.20874#A1) of the supplementary material\.
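The headline ablation figures can be re-derived with a few lines of arithmetic. The sketch below only re-checks the numbers reported above (12.0 of 26 tasks, 46.2%, 78.2%, +32.0 pp); the helper name `success_rate` is illustrative.

```python
def success_rate(successes, tasks):
    """Success rate as a percentage of tasks completed correctly."""
    return 100.0 * successes / tasks


# Baseline without policies: 12.0 successful tasks out of 26.
baseline = round(success_rate(12.0, 26), 1)

# Reported success rate with all 5 policies active.
with_five_policies = 78.2

# Overall gain in percentage points.
gain_pp = round(with_five_policies - baseline, 1)

print(baseline, gain_pp)
```

The rounded baseline and the gain agree with the 46.2% and +32.0 pp values stated in the ablation.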
## 5\.Conclusion \- Our Vision
As LLM\-based agents become increasingly prevalent across industry, their autonomous operation in diverse and sensitive environments introduces critical challenges around safety, compliance, and predictability\. Recent incidents involving data and privacy leakage in computer\-using agents underscore the urgency of addressing these risks, particularly as deployment moves beyond controlled research settings into production environments handling real users and sensitive data\. Our vision is grounded in the belief that capable agents and trustworthy agents should be one and the same\. As generalist computer\-using agents progress toward production readiness, a principled governance layer becomes not a limitation, but a prerequisite for responsible deployment\. This work demonstrates one realization of that vision through CUGA’s policy system, which provides structured mechanisms to constrain unwanted and unpredictable agent behavior without sacrificing generality or performance; on the contrary, our results show that it improves task success\. CUGA, as an open\-source, enterprise\-oriented agent supporting on\-premises deployment, reflects the operational realities of organizations that cannot rely on third\-party cloud infrastructure for sensitive workloads\. Its strong benchmark performance across diverse agent evaluation settings \(e\.g\., AppWorld \(Trivedi et al\., [2024](https://arxiv.org/html/2605.20874#bib.bib9)\), WebArena \(Zhou et al\., [2023](https://arxiv.org/html/2605.20874#bib.bib11)\), BPO\-TA \(Shlomov et al\., [2026](https://arxiv.org/html/2605.20874#bib.bib6)\)\), together with our results in Section [4](https://arxiv.org/html/2605.20874#S4), further validates it as a credible and production\-relevant baseline\. These properties, combined with lessons drawn from prior work and companies’ use cases, informed both our design choices and our understanding of where policy enforcement is most needed\.
Ultimately, our policy system is a means to a broader end: the development of generalist agents that organizations can deploy with confidence\.
## References
- Z\. Chen, M\. Kang, and B\. Li \(2025\) Shieldagent: shielding agents via verifiable safety policy reasoning\. arXiv preprint arXiv:2503\.22738\.
- CUGA Project \(2026a\) CUGA: computer\-using generalist agent\. Note: [https://github\.com/cuga\-project/cuga\-agent](https://github.com/cuga-project/cuga-agent) Accessed: 2026\-04\-27\.
- CUGA Project \(2026b\) OAK Bench: customer\-care benchmark for insurance workflows\. Note: [https://github\.com/cuga\-project/oak\-bench](https://github.com/cuga-project/oak-bench) Accessed: 2026\-04\-27\.
- S\. Gaurav, J\. Heikkonen, and J\. Chaudhary \(2025\) Governance\-as\-a\-service: a multi\-agent framework for ai system compliance and policy enforcement\. arXiv preprint arXiv:2508\.18765\.
- S\. Hong, M\. Zhuge, J\. Chen, X\. Zheng, Y\. Cheng, J\. Wang, C\. Zhang, Z\. Wang, S\. K\. S\. Yau, Z\. Lin, et al\. \(2023\) MetaGPT: meta programming for a multi\-agent collaborative framework\. In The twelfth international conference on learning representations\.
- IBM Research \(2026\) BPO\-Bench: business process operations benchmark\. Note: [https://huggingface\.co/datasets/ibm\-research/BPO\-Bench](https://huggingface.co/datasets/ibm-research/BPO-Bench) Accessed: 2026\-04\-27\.
- C\. Jiang, X\. Pan, and M\. Yang \(2025\) Think twice before you act: enhancing agent behavioral safety with thought correction\. arXiv preprint arXiv:2505\.11063\.
- I\. Levy, B\. Wiesel, S\. Marreed, A\. Oved, A\. Yaeli, and S\. Shlomov \(2024\) St\-webagentbench: a benchmark for evaluating safety and trustworthiness in web agents\. arXiv preprint arXiv:2410\.06703\.
- H\. Luo, S\. Dai, C\. Ni, X\. Li, G\. Zhang, K\. Wang, T\. Liu, and H\. Salam \(2025\) Agentauditor: human\-level safety and security evaluation for llm agents\. arXiv preprint arXiv:2506\.00641\.
- X\. Ma, Y\. Gao, Y\. Wang, R\. Wang, X\. Wang, Y\. Sun, Y\. Ding, H\. Xu, Y\. Chen, Y\. Zhao, et al\. \(2026\) Safety at scale: a comprehensive survey of large model and agent safety\. Foundations and Trends in Privacy and Security 8\(3\-4\), pp\. 1–240\.
- S\. Marreed, A\. Oved, A\. Yaeli, S\. Shlomov, I\. Levy, O\. Akrabi, A\. Sela, A\. Adi, and N\. Mashkif \(2025\) Towards enterprise\-ready computer using generalist agent\. arXiv preprint arXiv:2503\.01861\.
- A\. Oved, S\. Shlomov, S\. Zeltyn, N\. Mashkif, and A\. Yaeli \(2025\) SNAP: semantic stories for next activity prediction\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 39, pp\. 28871–28877\.
- Parlant \(2025\) Parlant: conversational control layer for customer\-facing llm agents\. Note: [https://parlant\.io/](https://parlant.io/) and [https://github\.com/emcie\-co/parlant](https://github.com/emcie-co/parlant) Accessed March 2026\.
- S\. Schwartz, A\. Yaeli, and S\. Shlomov \(2023\) Enhancing trust in llm\-based ai automation agents: new considerations and future challenges\. arXiv preprint arXiv:2308\.05391\.
- Y\. Shi, W\. Yu, J\. Huang, W\. Yao, W\. Chen, and N\. Liu \(2025\) Towards trustworthy gui agents: a survey\. arXiv preprint arXiv:2503\.23434\.
- S\. Shlomov, A\. Oved, S\. Marreed, I\. Levy, O\. Akrabi, A\. Yaeli, Ł\. Strąk, E\. Koumpan, Y\. Goldshtein, E\. Shapira, et al\. \(2026\) From benchmarks to business impact: deploying ibm generalist agent in enterprise production\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 40, pp\. 40423–40431\.
- S\. Shlomov, B\. Wiesel, A\. Sela, I\. Levy, L\. Galanti, and R\. Abitbol \(2025\) From grounding to planning: benchmarking bottlenecks in web agents\. In ECAI 2025 – 28th European Conference on Artificial Intelligence, Frontiers in Artificial Intelligence and Applications, Vol\. 413, pp\. 4815–4822\. External Links: 2409\.01927\.
- H\. Trivedi, T\. Khot, M\. Hartmann, R\. Manku, V\. Dong, E\. Li, S\. Gupta, A\. Sabharwal, and N\. Balasubramanian \(2024\) Appworld: a controllable world of apps and people for benchmarking interactive coding agents\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 16022–16076\.
- L\. Tsai and E\. Bagdasarian \(2025\) Contextual agent security: a policy for every purpose\. In Proceedings of the 2025 Workshop on Hot Topics in Operating Systems, pp\. 8–17\.
- W\. Wang, F\. Wei, L\. Dong, H\. Bao, N\. Yang, and M\. Zhou \(2020\) Minilm: deep self\-attention distillation for task\-agnostic compression of pre\-trained transformers\. Advances in neural information processing systems 33, pp\. 5776–5788\.
- A\. Yaeli, S\. Shlomov, A\. Oved, S\. Zeltyn, and N\. Mashkif \(2022\) Recommending next best skill in conversational robotic process automation\. In International Conference on Business Process Management, pp\. 215–230\.
- X\. Yang, J\. Chen, J\. Luo, Z\. Fang, Y\. Dong, H\. Su, and J\. Zhu \(2025\) Mla\-trust: benchmarking trustworthiness of multimodal llm agents in gui environments\. arXiv preprint arXiv:2506\.01616\.
- Z\. Ying, Y\. Shao, J\. Gan, G\. Xu, W\. Zhang, Q\. Zou, J\. Shi, Z\. Yin, M\. Zhang, A\. Liu, et al\. \(2025\) Securewebarena: a holistic security evaluation benchmark for lvlm\-based web agents\. arXiv preprint arXiv:2510\.10073\.
- S\. Zeltyn, S\. Shlomov, A\. Yaeli, and A\. Oved \(2022\) Prescriptive process monitoring in intelligent process automation with chatbot orchestration\. arXiv preprint arXiv:2212\.06564\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, et al\. \(2023\) Webarena: a realistic web environment for building autonomous agents\. arXiv preprint arXiv:2307\.13854\.
- N\. Zwerdling, D\. Boaz, E\. Rabinovich, G\. Uziel, D\. Amid, and A\. A\. Tavor \(2025\) Towards enforcing company policy adherence in agentic workflows\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pp\. 595–606\.
## Appendix A Ablation Study: BPO Benchmark
This appendix provides the complete experimental results for the CUGA policy evaluation on the BPO benchmark, including per\-run statistics, all metric breakdowns, per\-task pass rates, and descriptions of each implemented policy\.
### A\.1\. Aggregate Results Across Policy Configurations
Each configuration was evaluated over three independent runs on a clean Milvus database\. Table [2](https://arxiv.org/html/2605.20874#A1.T2) reports per\-run scores and summary statistics\.
Table 2\. Per\-run scores for each policy configuration \(out of 26 tasks\)\.
Table [3](https://arxiv.org/html/2605.20874#A1.T3) summarizes all evaluation metrics across configurations, along with the total improvement from the no\-policy baseline to the five\-policy configuration\.
Table 3\. Evaluation metrics by policy configuration\.
### A\.2\. Per\-Task Breakdown
Table [4](https://arxiv.org/html/2605.20874#A1.T4) reports pass counts \(out of 3 runs\) for each task across the three configurations, together with the net improvement and task category\.
Table 4\. Per\-task pass rates across policy configurations\.
### A\.3\. Policy Descriptions
The five implemented policies are described below in order of application\.
#### Policy \#1 — API Capability Boundaries
Type: Playbook\. Trigger: keywords \+ natural language \(threshold 0\.65, priority 90\)\.
This policy teaches the agent to recognise when no available API can answer a query\. It enumerates supported and unsupported API capabilities and instructs the agent to decline out\-of\-scope requests directly, rather than soliciting a requisition ID or invoking irrelevant tools\.
Failure pattern addressed\. Without this policy the agent would request a requisition ID, or call arbitrary APIs, for queries relating to job descriptions, time\-to\-fill, geography filtering, SLA deadlines, funnel timing, and job\-card details, none of which are supported by any API\.
Tasks fixed\. 16, 19, 21, 22, 23 \(all from 0/3 to 3/3\)\.
#### Policy \#2 — Error\-Prone Tool Warnings
Type: Tool Guide\. Target: 19 error\-prone tools \(warning prepended to each description\)\.
This policy prepends a warning to the descriptions of the 19 known\-unreliable tools \(those returning 503 errors, schema violations, or type mismatches\)\. It steers the agent towards the 13 reliable core tools and teaches graceful recovery when an error tool is nonetheless invoked\.
Failure pattern addressed\. The agent would call tools such as funnel\_status \(503 error\), model\_registry \(incorrect data\), or source\_recommendation\_summary \(incomplete shortcut\) instead of the correct granular APIs\.
Tasks fixed\. 12, 14, 18 \(stabilised from 2/3 to 3/3\); 24 \(0/3 to 1–2/3, partial improvement\)\.
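The description-prepending mechanism of this Tool Guide can be illustrated with a minimal sketch. The `Tool` dataclass, the warning wording, and the tool descriptions are hypothetical stand-ins chosen for illustration; only the tool and endpoint names come from the text above, and CUGA's actual data model may differ.

```python
from dataclasses import dataclass, replace

# Hypothetical warning text; the wording used in the evaluation is not given.
WARNING = "WARNING: this tool is known to be unreliable; prefer the granular core APIs."


@dataclass(frozen=True)
class Tool:
    """Hypothetical minimal tool record: a name plus the description shown to the agent."""
    name: str
    description: str


def apply_tool_guide(tools, error_prone, warning=WARNING):
    """Return a new tool list with the warning prepended to error-prone tools only."""
    return [
        replace(t, description=f"{warning}\n{t.description}")
        if t.name in error_prone
        else t
        for t in tools
    ]


# Descriptions below are placeholders, not taken from the benchmark.
tools = [
    Tool("funnel_status", "Returns funnel status for a requisition."),
    Tool("definitions-and-methodology", "Returns metric definitions and counts."),
]
guided = apply_tool_guide(tools, error_prone={"funnel_status"})
```

Returning new records rather than mutating the originals keeps the unguided tool list available for a no-policy comparison run.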
#### Policy \#3 — Multi\-API Reasoning
Type: Playbook\. Trigger: keywords \+ natural language \(threshold 0\.65, priority 80\)\.
This policy instructs the agent to call multiple specific APIs for multi\-metric questions rather than relying on a single summary endpoint\. It provides an explicit mapping from question type to the correct tool and clarifies the distinction between “total requisitions used for computation” \(definitions\-and\-methodology\) and “similar requisitions analysed” \(metadata\-and\-timeframe\)\.
Failure pattern addressed\. The agent would use the summary shortcut tool for multi\-metric queries, or confuse which endpoint returns the requisition count\.
Tasks fixed\. 9 \(0/3 to 3/3; agent now correctly returns 1047 from definitions\-and\-methodology rather than 40 from metadata\-and\-timeframe\)\.
#### Policy \#4 — Average vs\. Total Calculations
Type: Playbook\. Trigger: keywords \+ natural language \(threshold 0\.65, priority 70\)\.
This policy teaches the agent that when a user asks for “average” or “typical” values it must compute a per\-requisition average by dividing the aggregate total by the count of similar requisitions, rather than returning the raw total\.
Failure pattern addressed\. The agent would return the total candidate count \(2,913\) when asked “how many candidates do we usually get?” instead of computing the per\-requisition average \($2{,}913 \div 40 \approx 73$\)\.
Tasks fixed\. 26 \(0/3 to 3/3\)\.
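The arithmetic this policy asks for is small enough to check directly. The sketch below uses the two figures quoted above (2,913 candidates in total, 40 similar requisitions); the function name is hypothetical.

```python
def per_requisition_average(total_candidates: int, similar_requisitions: int) -> float:
    """Divide the aggregate total by the number of similar requisitions."""
    if similar_requisitions <= 0:
        raise ValueError("similar_requisitions must be positive")
    return total_candidates / similar_requisitions


average = per_requisition_average(2913, 40)
rounded_average = round(average)  # the value the agent should report
```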
#### Policy \#5 — Missing Requisition ID vs\. Unsupported Query
Type: Playbook\. Trigger: keywords \+ natural language \(threshold 0\.60, priority 85\)\.
This policy helps the agent distinguish between “I need a requisition ID to answer this” \(answerable but missing context\) and “this cannot be answered regardless of requisition ID” \(unsupported by any API\)\. It reinforces Policy \#1 for edge cases where the triggering language differs\.
Failure pattern addressed\. The agent would request a requisition ID even for queries that no API supports under any circumstances\.
Tasks fixed\. Overlaps with Policy \#1; provides reinforcement for edge cases not caught by Policy \#1’s triggers\.
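The threshold and priority values listed for the four Playbook policies suggest a simple selection rule, sketched below. This is an assumption about how such triggers could be combined, not a description of CUGA's matcher: it assumes a policy fires when its trigger similarity score meets its threshold and that higher priority values are applied first. The class, the function, and the example scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaybookPolicy:
    name: str
    threshold: float  # minimum trigger similarity required to fire
    priority: int     # assumed: higher values are applied first


# Thresholds and priorities as listed in the policy descriptions above.
POLICIES = [
    PlaybookPolicy("API Capability Boundaries", 0.65, 90),
    PlaybookPolicy("Multi-API Reasoning", 0.65, 80),
    PlaybookPolicy("Average vs. Total Calculations", 0.65, 70),
    PlaybookPolicy("Missing Requisition ID vs. Unsupported Query", 0.60, 85),
]


def select_policies(scores, policies=POLICIES):
    """Keep policies whose score meets the threshold, ordered by descending priority."""
    matched = [p for p in policies if scores.get(p.name, 0.0) >= p.threshold]
    return sorted(matched, key=lambda p: p.priority, reverse=True)


# Hypothetical similarity scores for a single user query.
example_scores = {
    "API Capability Boundaries": 0.70,
    "Multi-API Reasoning": 0.64,
    "Missing Requisition ID vs. Unsupported Query": 0.62,
}
selected = [p.name for p in select_policies(example_scores)]
```

Under these assumptions, the lower threshold on Policy \#5 (0.60) lets it fire on phrasing that scores too low for a 0.65 trigger, which is consistent with its stated role as reinforcement for edge cases.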
### A\.4\. Remaining Failing Tasks
Table [5](https://arxiv.org/html/2605.20874#A1.T5) documents the six tasks that remain unsolved \(or partially unsolved\) after all five policies are applied, together with a diagnosis of each failure mode\.
Table 5\. Remaining failing tasks after five\-policy configuration\.
## Appendix B Policy System Evaluations
We presented CUGA’s policy system evaluation in Section [4](https://arxiv.org/html/2605.20874#S4)\. Here, we extend that discussion by reporting the token usage associated with each evaluation, providing a quantitative measure of the overhead introduced by our method\. Results are shown in Table [6](https://arxiv.org/html/2605.20874#A2.T6)\.
| Benchmark | Model | Accuracy \(W/O → W\) | Δ | Tokens \(W/O → W\) |
| --- | --- | --- | --- | --- |
| OAK | GPT\-OSS\-120B | 75% → 100% | \+25 pp | 26\.2k → 40k |
| OAK | GPT\-4\.1 | 70% → 96% | \+26 pp | 28\.2k → 44k |
| OAK | Claude Opus\-4\.5 | 81% → 96% | \+15 pp | 44k → 85k |
| BPO | GPT\-OSS\-120B | 49\.2% → 82\.3% | \+33\.1 pp | 3\.7M → 6\.6M |
| BPO | GPT\-4\.1 | 28\.5% → 66\.2% | \+37\.7 pp | 2\.7M → 8\.8M |
| BPO | Claude Opus\-4\.5 | 50\.0% → 68\.5% | \+18\.5 pp | 6\.2M → 9\.0M |

Table 6\. Success Rate and Token Usage on OAK and BPO benchmarks, with and without CUGA’s policy system, across three backbone models\. W/O = without policy system; W = with policy system\.
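To make the overhead in Table 6 easier to compare across models, the short sketch below recomputes the accuracy gain in percentage points and derives a token multiplier (tokens with policies divided by tokens without) for each row. The accuracy and token figures are copied from the table; the token multiplier is a derived quantity that the table does not report, and the function and field names are hypothetical.

```python
# (benchmark, model, accuracy without, accuracy with, tokens without, tokens with)
ROWS = [
    ("OAK", "GPT-OSS-120B", 75.0, 100.0, 26.2e3, 40e3),
    ("OAK", "GPT-4.1", 70.0, 96.0, 28.2e3, 44e3),
    ("OAK", "Claude Opus-4.5", 81.0, 96.0, 44e3, 85e3),
    ("BPO", "GPT-OSS-120B", 49.2, 82.3, 3.7e6, 6.6e6),
    ("BPO", "GPT-4.1", 28.5, 66.2, 2.7e6, 8.8e6),
    ("BPO", "Claude Opus-4.5", 50.0, 68.5, 6.2e6, 9.0e6),
]


def summarize(rows):
    """Return the accuracy gain (pp) and token multiplier for each table row."""
    summary = []
    for benchmark, model, acc_without, acc_with, tok_without, tok_with in rows:
        summary.append({
            "benchmark": benchmark,
            "model": model,
            "delta_pp": round(acc_with - acc_without, 1),
            "token_ratio": round(tok_with / tok_without, 2),
        })
    return summary


summary = summarize(ROWS)
for row in summary:
    print(f"{row['benchmark']:4} {row['model']:16} "
          f"+{row['delta_pp']} pp  x{row['token_ratio']} tokens")
```

On these figures the token multiplier ranges from roughly 1.45x to 3.26x, so the accuracy gains come with a non-trivial increase in token usage that varies by backbone model.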