A system-level approach to prompt injection: separating instruction and data channels in LLM agents [P]

Reddit r/MachineLearning Papers

Summary

This paper proposes Sentinel Gateway, a middleware layer that enforces strict separation between trusted instruction channels and untrusted data channels to mitigate prompt injection in LLM agents, using signed runtime authorization tokens and offering audit logging capabilities.

Prompt injection has emerged as one of the most persistent failure modes in tool-using LLM systems, particularly in agentic workflows where models interact with external data sources. Most mitigation strategies focus on input filtering or model-side alignment, but these approaches struggle because the core issue is structural.

Approach

I explored a system-level mitigation strategy by introducing a middleware layer (Sentinel Gateway) that enforces a strict separation between:

- Instruction channel: trusted, runtime-issued commands
- Data channel: untrusted external inputs (web, files, APIs)

Instead of attempting to classify malicious inputs, the system ensures that all agent actions require a signed, scoped runtime authorization token, effectively decoupling observation from execution.

Implementation

- FastAPI middleware layer for agent tool calls
- Token-based authorization for execution requests
- Streamlit interface for inspection and debugging
- Audit logging of agent decisions and tool usage
- Supports multi-agent integration patterns (e.g., Claude-based sessions)
- Local or Postgres-backed persistence layer

Repo

https://github.com/cmtopbas/Sentinel-Gateway

Discussion question

I’m interested in feedback on:

- whether instruction/data separation is a meaningful abstraction for agent safety
- failure modes in token-based execution gating
- how this compares conceptually to other agent safety or sandboxing approaches
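To make the token-gating idea from the Approach section concrete, here is a minimal sketch of signed, scoped runtime authorization using only the Python standard library. It is not taken from the Sentinel Gateway repo: the names `SECRET_KEY`, `issue_token`, `authorize`, and `execute_tool_call`, the token layout, and the HMAC-SHA256 scheme are all assumptions chosen for illustration. The point it tries to show is that text arriving on the data channel cannot trigger a tool call by itself, because execution requires a token that only the trusted instruction channel can sign.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical secret, assumed to be held only by the trusted instruction channel.
SECRET_KEY = b"demo-secret-not-for-production"


def _sign(payload: bytes) -> str:
    """Return the HMAC-SHA256 signature of the payload as a hex string."""
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def issue_token(tool: str, scope: str, ttl_seconds: int = 60,
                now: Optional[float] = None) -> str:
    """Issue a signed token authorizing one tool on one scope until expiry."""
    issued_at = time.time() if now is None else now
    claims = {"tool": tool, "scope": scope, "exp": issued_at + ttl_seconds}
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    return payload.decode() + "." + _sign(payload)


def authorize(token: str, tool: str, scope: str,
              now: Optional[float] = None) -> bool:
    """Check the signature, the tool, the scope, and the expiry of a token."""
    try:
        payload, signature = token.rsplit(".", 1)
    except ValueError:
        return False  # no separator, so this is not a token we issued
    expected = _sign(payload.encode())
    # Constant-time comparison; bytes are used so arbitrary input cannot raise.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    # The payload is only parsed after its signature has been verified.
    claims = json.loads(base64.urlsafe_b64decode(payload.encode()))
    current = time.time() if now is None else now
    return (claims["tool"] == tool
            and claims["scope"] == scope
            and current < claims["exp"])


def execute_tool_call(tool: str, scope: str, token: Optional[str]) -> str:
    """Gate a (simulated) tool call on a valid token for exactly this action."""
    if token is None or not authorize(token, tool, scope):
        return "denied"
    return f"executed {tool} on {scope}"


if __name__ == "__main__":
    token = issue_token("read_file", "/tmp/report.txt")
    print(execute_tool_call("read_file", "/tmp/report.txt", token))
    # An injected instruction asks for a different tool; the token does not cover it.
    print(execute_tool_call("delete_file", "/tmp/report.txt", token))
```

In this sketch a token is bound to a single tool and scope, so a request injected through web content or a file would need its own token to run. A real deployment would also need key management, replay protection, and a decision about who is allowed to request tokens, none of which is modeled here.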
