From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

Hugging Face Daily Papers Papers

Summary

This paper introduces multi-step trojan attacks against local LLM agents, where malicious prompts are embedded across multiple operations to bypass existing defenses. It proposes ClawTrojan benchmark and DASGuard defense to detect and mitigate such attacks.

LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such capabilities enhance utility, they also expose a new attack surface for attackers. Attackers can embed a prompt injection within a file or tool output. Agents may read this hidden instruction, store it, and execute it later. In this multi-step trojan attack paradigm, no individual step appears malicious on its own, but these steps can collectively turn untrusted text into persistent control content. However, existing defenses often inspect each step in isolation. As a result, they can block a clear harmful action, but fail to detect the earlier write operation that plants the backdoor. To reveal this threat, we introduce ClawTrojan, a benchmark designed to identify multi-step trojan attacks in local agentic harnesses. In an OpenClaw-style simulated workspace with GPT-5.4, ClawTrojan reaches a 95.5% attack success rate (ASR), while existing single-turn prompt-injection attacks produce near-zero ASR on the same model. To address this threat, we propose DASGuard, which scans control-like text in sensitive local files, traces its origin, and removes control content that does not originate from a trusted source. Our results show that DASGuard achieves strong dynamic defense by combining runtime attack blocking with sanitized commits to the workspace.
Original Article
View Cached Full Text

Cached at: 06/01/26, 03:18 AM

Paper page - From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

Source: https://huggingface.co/papers/2605.31042

Abstract

Multi-step trojan attacks in local LLM agents can bypass existing defenses by embedding malicious prompts across multiple operations, requiring new detection methods like DASGuard for effective protection.

LLM agentsare evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such capabilities enhance utility, they also expose a new attack surface for attackers. Attackers can embed aprompt injectionwithin a file or tool output. Agents may read this hidden instruction, store it, and execute it later. In thismulti-step trojan attackparadigm, no individual step appears malicious on its own, but these steps can collectively turn untrusted text into persistent control content. However, existing defenses often inspect each step in isolation. As a result, they can block a clear harmful action, but fail to detect the earlier write operation that plants the backdoor. To reveal this threat, we introduce ClawTrojan, a benchmark designed to identifymulti-step trojan attacks in local agentic harnesses. In anOpenClaw-style simulated workspace with GPT-5.4, ClawTrojan reaches a 95.5% attack success rate (ASR), while existing single-turn prompt-injection attacks produce near-zero ASR on the same model. To address this threat, we proposeDASGuard, which scans control-like text in sensitive local files, traces its origin, and removes control content that does not originate from a trusted source. Our results show thatDASGuardachieves strong dynamic defense by combiningruntime attack blockingwithsanitized commitsto the workspace.

View arXiv pageView PDFGitHub1Add to collection

Get this paper in your agent:

hf papers read 2605\.31042

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2605.31042 in a model README.md to link it from this page.

Datasets citing this paper1

#### zstanjj/ClawTrojan Updated3 minutes ago

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2605.31042 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

@NFTCPS: Damn, a doxxing tool is here! Just enter a username, and it scrapes over 840 platforms for you. It's called ALIENS EYE. It's not stupid—not just guessing by HTTP status codes—it uses a trained ML model with 25 features to judge, with results in three tiers: Found, Maybe, Not Found, …

X AI KOLs Timeline

ALIENS EYE is an AI-powered open-source username scanner that uses a machine learning model and 25 features to detect across 840+ platforms, with support for proxies, Tor, and multiple export formats.