DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Hugging Face Daily Papers 05/06/26, 12:00 AM Papers

ai-agents red-teaming security simulation evaluation-benchmark autonomous-testing

Summary

This paper introduces the DecodingTrust-Agent Platform (DTap), a controllable and interactive red-teaming platform for evaluating AI agent security across multiple domains. It also presents DTap-Red, an autonomous agent for discovering attack strategies, and DTap-Bench, a large-scale dataset for risk assessment.

AI agents are increasingly deployed across diverse domains to automate complex workflows through long-horizon and high-stakes action executions. Due to their high capability and flexibility, such agents raise significant security and safety concerns. A growing number of real-world incidents have shown that adversaries can easily manipulate agents into performing harmful actions, such as leaking API keys, deleting user data, or initiating unauthorized transactions. Evaluating agent security is inherently challenging, as agents operate in dynamic, untrusted environments involving external tools, heterogeneous data sources, and frequent user interactions. However, realistic, controllable, and reproducible environments for large-scale risk assessment remain largely underexplored. To address this gap, we introduce the DecodingTrust-Agent Platform (DTap), the first controllable and interactive red-teaming platform for AI agents, spanning 14 real-world domains and over 50 simulation environments that replicate widely used systems such as Google Workspace, Paypal, and Slack. To scale the risk assessment of agents in DTap, we further propose DTap-Red, the first autonomous red-teaming agent that systematically explores diverse injection vectors (e.g., prompt, tool, skill, environment, combinations) and autonomously discovers effective attack strategies tailored to varying malicious goals. Using DTap-Red, we curate DTap-Bench, a large-scale red-teaming dataset comprising high-quality instances across domains, each paired with a verifiable judge to automatically validate attack outcomes. Through DTap, we conduct large-scale evaluations of popular AI agents built on various backbone models, spanning security policies, risk categories, and attack strategies, revealing systematic vulnerability patterns and providing valuable insights for developing secure next-generation agents.

Original Article

View Cached Full Text

Cached at: 05/11/26, 07:18 AM

Paper page - DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Source: https://huggingface.co/papers/2605.04808 Authors:

Abstract

A comprehensive platform and autonomous agent framework for evaluating and enhancing AI agent security through controlled red-teaming across multiple real-world domains and simulation environments.

AI agentsare increasingly deployed across diverse domains to automate complex workflows through long-horizon and high-stakes action executions. Due to their high capability and flexibility, such agents raise significant security and safety concerns. A growing number of real-world incidents have shown that adversaries can easily manipulate agents into performing harmful actions, such as leaking API keys, deleting user data, or initiating unauthorized transactions. Evaluating agent security is inherently challenging, as agents operate in dynamic, untrusted environments involving external tools, heterogeneous data sources, and frequent user interactions. However, realistic, controllable, and reproducible environments forlarge-scale risk assessmentremain largely underexplored. To address this gap, we introduce the DecodingTrust-Agent Platform (DTap), the first controllable and interactivered-teamingplatform forAI agents, spanning 14 real-world domains and over 50simulation environmentsthat replicate widely used systems such as Google Workspace, Paypal, and Slack. To scale the risk assessment of agents in DTap, we further propose DTap-Red, the firstautonomous red-teaming agentthat systematically explores diverse injection vectors (e.g., prompt, tool, skill, environment, combinations) and autonomously discovers effectiveattack strategiestailored to varying malicious goals. Using DTap-Red, we curate DTap-Bench, a large-scalered-teamingdataset comprising high-quality instances across domains, each paired with a verifiable judge to automatically validate attack outcomes. Through DTap, we conduct large-scale evaluations of popularAI agentsbuilt on various backbone models, spanning security policies, risk categories, andattack strategies, revealing systematicvulnerability patternsand providing valuable insights for developing secure next-generation agents.

View arXiv page View PDF Project page GitHub16 Add to collection

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2605.04808 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2605.04808 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2605.04808 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Paper page - DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Abstract

Models citing this paper0

Datasets citing this paper0

Spaces citing this paper0

Collections including this paper0

Similar Articles

We built a public red team environment for our AI agent security proxy — submit attacks and get a full security trace back

Advancing red teaming with people and AI

Introducing Trusted Access for Cyber

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming

Submit Feedback

Similar Articles

We built a public red team environment for our AI agent security proxy — submit attacks and get a full security trace back

Advancing red teaming with people and AI

Introducing Trusted Access for Cyber

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming