POISE: Position-Aware Undetectable Skill Injection on LLM Agents
Summary
POISE is a stealthy skill-poisoning attack that embeds malicious triggers within benign-looking instructions, achieving high attack success rates while evading detection by LLM scanners.
Source: https://huggingface.co/papers/2606.07943
Abstract
POISE is a stealthy skill-poisoning attack that embeds malicious triggers within benign-looking instructions, achieving high attack success rates while avoiding detection by LLM scanners that are overly sensitive to privileged tool operations.
Agent skills provide a lightweight mechanism for extending general-purpose agents, but their open format exposes them to skill-poisoning attacks. A practically dangerous injection must stay invisible: if executing the payload derails the user’s legitimate task, the resulting failure signal invites inspection of the skill. We therefore evaluate attacks by Attack Success Rate (ASR), which requires the injected payload to execute and the user’s task to still pass its verifier in the same trial. Prior skill-poisoning attacks face a reliability-stealth trade-off under this lens: YAML-header injections are reliably loaded but easily inspected, whereas stealthier body injections that place explicit malicious commands in the skill prose are less reliable because out-of-context commands invite the agent’s own suspicion. We introduce POISE, a position-aware attack that compresses the trigger into a single, benign-looking body instruction, placing it at a feasible position and using a context-aware generator to blend it with nearby setup or prerequisite steps. On Skill-Inject with codex + gpt-5.2, POISE achieves an 89.3% ASR, 28.0 points above a random-placement body baseline and 2.6 points above a YAML-only baseline, while retaining the stealth advantage of body placement. That stealth is the decisive margin: because legitimate skill bodies naturally require privileged tool operations, LLM scanners are hyper-sensitive, falsely flagging 74.6% of clean skills on average across four judges and both benchmarks. Blending into these false alarms, POISE causes only 5.6% of poisoned variants to gain a new high-risk alert over their clean baselines, rendering current static defenses ineffective.
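The two metrics in the abstract can be made concrete with a small sketch. This is not the authors' evaluation code; the `Trial` and `ScanPair` records, their field names, and both function names are hypothetical, chosen only to illustrate the definitions as the abstract states them. ASR counts a trial only when the payload executed and the user's task passed its verifier in that same trial. The new-alert rate is assumed here to be the fraction of all poisoned variants that a scanner flags as high-risk when the matching clean skill was not flagged.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Trial:
    """One agent run on a poisoned skill (hypothetical record layout)."""
    payload_executed: bool  # did the injected payload actually run?
    task_passed: bool       # did the user's task pass its verifier?


@dataclass(frozen=True)
class ScanPair:
    """Scanner verdicts for a clean skill and its poisoned variant (hypothetical)."""
    clean_flagged: bool     # high-risk alert on the clean baseline
    poisoned_flagged: bool  # high-risk alert on the poisoned variant


def attack_success_rate(trials: Iterable[Trial]) -> float:
    """Fraction of trials where the payload ran AND the task still passed."""
    trials = list(trials)
    if not trials:
        raise ValueError("need at least one trial")
    # Both conditions must hold within the same trial, not just on average.
    successes = sum(1 for t in trials if t.payload_executed and t.task_passed)
    return successes / len(trials)


def new_alert_rate(pairs: Iterable[ScanPair]) -> float:
    """Fraction of poisoned variants flagged when their clean baseline was not.

    Assumption: the denominator is all poisoned variants, matching the
    abstract's phrasing "5.6% of poisoned variants".
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one scan pair")
    new_alerts = sum(1 for p in pairs if p.poisoned_flagged and not p.clean_flagged)
    return new_alerts / len(pairs)
```

Under these definitions, an attack that always fires but breaks the user's task scores an ASR of zero, and a poisoned skill whose clean version was already flagged contributes no new alert, which is why a high false-positive rate on clean skills gives the scanner little power to separate the two.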
Similar Articles
Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems
This paper identifies a new class of injection attacks where payloads mimic the domain language to evade LLM injection detectors, showing detection rates drop dramatically (e.g., from 93.8% to 9.7% on Llama 3.1 8B). The vulnerability is systematic and extends to dedicated safety classifiers like Llama Guard 3, which detected zero camouflage payloads.
I got paranoid about OpenClaw skills injecting crap into my system prompt, so I built a quarantine pipeline with two LLMs as reviewers (93.75% detection, zero false negatives)
A developer built a quarantine pipeline using two LLM reviewers (Claude and Codex) to detect injection attacks in OpenClaw skills, achieving 93.75% detection rate with zero false negatives. The system uses a dual mandate of checklist-based pattern matching and open analysis to catch both known and novel injection techniques.
Prompt Injection as Role Confusion
This paper presents a theory that prompt injection attacks on LLMs stem from a fundamental flaw in how models perceive roles, treating roles as a type system for language. It explains existing attacks, predicts new ones, and proposes a research agenda for a science of roles.
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
OpenAI proposes an instruction hierarchy approach to defend LLMs against prompt injection and jailbreak attacks by training models to prioritize system instructions over user inputs. The method significantly improves robustness without degrading standard capabilities.
Prompt Injection as Role Confusion
Research paper shows that LLMs suffer from 'role confusion', where they prioritize the style of text over its actual role tags, enabling prompt injection attacks. Destyling text reduces attack success from 61% to 10%, indicating a fundamental challenge for LLM security.