AI Agent Security - MIT 6.566 guest lecture

Lobsters Hottest 05/18/26, 03:41 PM Events

ai-security agent-security llm-security prompt-injection lecture mit ai-agents

Summary

Guest lecture at MIT 6.566 on AI agent security covering system-level threats, prompt injection, tool-use vulnerabilities, and demonstrations with LLMs like GPT-5.4 and Qwen 3.5.

<p><a href="https://lobste.rs/s/evwqcs/ai_agent_security_mit_6_566_guest_lecture">Comments</a></p>

Original Article

View Cached Full Text

Cached at: 05/18/26, 04:32 PM

anishathalye/ai-agent-security-lecture

Source: https://github.com/anishathalye/ai-agent-security-lecture

AI Agent Security (guest lecture, MIT 6.566, April 2026)

You can run demos with uv, for example uv run 00_completion.py. For some, you will need Ollama and the appropriate models downloaded. For others, you’ll need the appropriate API keys, such as OPENAI_API_KEY, set.

General information

Reading: Defeating Prompt Injections by Design (CaMeL) (Debenedetti et al., 2025)
Speaker: Anish Athalye

Introduction

Examples: Claude Code, OpenClaw
What is an agent?
- AI system that perceives its environment, makes decisions, and takes autonomous actions to achieve user-defined goals
System-level model
- User <-> Agent <-> Environment
- Agent often operates with high privilege
Not robust (even under natural inputs)
- Example: PocketOS founder using Cursor + Opus 4.6, agent deleted production database and backups: https://x.com/lifeof_jer/status/2048103471019434248
Susceptible to various types of attacks
- Example: ChatGPT data exfiltration
- Example: ICML organizers prompt-injecting LLMs being used for reviews: https://blog.icml.cc/2026/03/18/on-violations-of-llm-review-policies/
AI and agents are evolving faster than security can keep up

Background: a system-level view of LLMs and agents

Omitting how the model itself is trained
The foundation: a large language model (LLM)
- Probabilistic next-token prediction: $p(\cdot \mid x_1, x_2, \ldots, x_n)$
- Sampling: $y_1 \sim p(\cdot \mid x_1, x_2, \ldots, x_n)$ , $y_2 \sim p(\cdot \mid x_1, x_2, \ldots, x_n, y_1)$ , $\ldots$
- Code: ./00_completion.py
  - Here, using a “base model” (pretrained, like GPT-3, but not instruction-tuned like InstructGPT / ChatGPT)
Conversational chat
- Informally, the LLM is role playing, so give it an input that looks like a conversation thread
- Poor man’s version: ./01_messages.py
- Can build multi-turn messaging on top of this: ./02_multi_turn_messages.py
- Modern models have native support for this via special control tokens that mark start-of-turn, end-of-turn, etc. (example from Qwen): ./03_native_messages.py
  - Now, using an instruction-tuned model (Qwen 3.5 9B)
Tool use
- Can tell the LLM about “tools”, and have code that dispatches requests to call tools and returns values back to the model: ./04_tools.py
  - Here, we also introduce “system messages”, context that is included up front to steer the model
- We can have multiple tools, and dispatch tool calls in a loop until the model is done: ./05_multiple_tools.py
  - Observe, the model has “agency” here, it dictates control flow
  - Surrounding code is called an “agent harness”
- Modern models have native support for tool calling, too, via a well-defined way to encode tool schemas that are passed to the model up front: ./06_native_tools.py
  - Here, switching to a more powerful model, run via API (GPT 5.4 / GPT 5.4 Mini)
Agents
- Common pattern that is implemented in many libraries, to simplify your code: ./07_native_agent.py
  - These libraries often implement the ReAct pattern, the way most modern agents work at a high level (at its core, just dispatching tool calls in a loop)
- Agent can complete a complex task by chaining together many tool calls: ./08_complex_agent.py
- Having all data flow go through the model is inefficient; instead, can have the model generate code that uses tools: ./09_code_agent.py
  - This is the CodeAct pattern