This article introduces a new natural-language testing system for AI agents that uses simulated isolates to automatically generate multi-turn simulations and evaluate agent behavior, helping developers catch regressions from prompt changes.
TL;DR: agent builders can now test their agents by simulating conversations described in natural-language prompts.

When you run AI agents in production, they constantly encounter unexpected situations. Over time, you extend your system prompt and tools to handle these edge cases. That's a natural part of building agents.

The problem is that prompts and tools, unlike code, are notoriously difficult to test. Imagine a 10,000-token prompt full of carefully engineered instructions and tool descriptions. Is your latest change strong enough? Is it too broad? Too distracting? You might tweak a single word to fix one issue, only to accidentally break five other behaviors.

To handle this, we built a robust, side-effect-free, multi-turn testing system directly into the platform. Here's how it works.

Imagine a simple pizza ordering bot in NYC. Initially, it's configured to deliver only to Manhattan and Brooklyn. You update its prompt to include Queens, and you want to verify that the agent now correctly tells users that Queens is supported.

Instead of writing brittle mocks for your database, payment, or other custom tools, the testing environment automatically intercepts every tool call and replaces your handlers with an AI-powered simulator. The simulator reads each tool's description, parameters, and the conversation history to generate realistic, context-aware responses on the fly.

You define the test with a single natural-language assertion: "When asked where you deliver, the agent should explain that we ship to Manhattan, Brooklyn, and Queens."

From that single sentence, prompt2bot automatically generates an entire multi-turn simulation:

- an initial user message (for example, "Where do you deliver?")
- a user simulator persona (such as a customer in Queens trying to place an order)
- a semantic evaluation rule that determines whether the agent behaved correctly

The simulation runs end-to-end.
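To make the tool-interception idea concrete, here is a minimal sketch of how handlers could be swapped for a simulator. This is not prompt2bot's actual API: `Tool`, `with_simulated_handlers`, and `stub_simulator` are hypothetical names invented for illustration, and the stub stands in for the AI-powered simulator (a real one would call a model with the tool description, parameters, and conversation history).

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    """Hypothetical tool definition: metadata plus the real handler."""
    name: str
    description: str
    parameters: dict
    handler: Callable[..., Any]


def with_simulated_handlers(tools, simulate, history):
    """Return copies of the tools whose handlers call `simulate`.

    The real handlers are never invoked, so a test run has no side effects.
    """
    def wrap(tool):
        def fake_handler(**args):
            # The simulator sees the tool's metadata, the call arguments,
            # and the conversation so far.
            return simulate(tool, args, history)
        return Tool(tool.name, tool.description, tool.parameters, fake_handler)

    return [wrap(t) for t in tools]


def stub_simulator(tool, args, history):
    """Stand-in for an AI-powered simulator (assumption: no model call here)."""
    return {"tool": tool.name, "args": args, "simulated": True}


def charge_card(amount):
    # The real handler: must never run during a test.
    raise RuntimeError("real payment handler called during a test")


tools = [Tool("charge_card", "Charge the customer's card.",
              {"amount": "number"}, charge_card)]
safe_tools = with_simulated_handlers(tools, stub_simulator, history=[])
result = safe_tools[0].handler(amount=12.5)
```

Wrapping the tools rather than mutating them leaves the production definitions untouched, which is one simple way to keep a test run side-effect-free.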
The agent interacts with the simulated tools, while the semantic judge evaluates every turn. If the assertion is violated at any point, the test immediately fails and returns the exact offending message along with an explanation.

This gives you confidence that prompt changes fix the intended behavior without introducing unintended regressions.

Because the testing system is exposed through a first-class API, you can run simulations locally, from the terminal, or automatically in your GitHub Actions CI pipeline, keeping deployments fully automated.

As a bonus, you don't even have to write the test yourself. You can simply ask: "Test that agent X responds with Y when asked Z." The builder generates and runs the simulation for you.

And, of course, tests can be as simple or as sophisticated as you need. They can span many turns, involve complex tool-calling workflows, and validate nuanced agent behavior.

Now we can sleep a bit better.
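The run-and-judge loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the platform's implementation: `GeneratedTest`, `Verdict`, and `run_simulation` are hypothetical names, and the agent, user simulator, and judge are plain stub functions (a real judge would be a semantic, model-based check rather than a keyword match).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeneratedTest:
    """Hypothetical shape of what gets generated from one assertion."""
    initial_message: str
    persona: str
    rule: str


@dataclass
class Verdict:
    passed: bool
    offending_message: Optional[str] = None
    explanation: Optional[str] = None


def run_simulation(spec, agent, user_sim, judge, max_turns=5):
    """Run a multi-turn conversation, judging every agent reply.

    Fails fast: the first reply that violates the rule ends the run and is
    returned together with the judge's explanation.
    """
    history = ["user: " + spec.initial_message]
    for _ in range(max_turns):
        reply = agent(history)
        history.append("agent: " + reply)
        problem = judge(spec.rule, reply)
        if problem is not None:
            return Verdict(False, reply, problem)
        history.append("user: " + user_sim(spec.persona, history))
    return Verdict(True)


# Stubs standing in for the model-backed components (assumptions).
def keyword_judge(rule, reply):
    text = reply.lower()
    if "deliver" in text and "queens" not in text:
        return "Reply describes the delivery area but omits Queens."
    return None


def old_agent(history):
    return "We deliver to Manhattan and Brooklyn."


def new_agent(history):
    return "We deliver to Manhattan, Brooklyn, and Queens."


def scripted_user(persona, history):
    return "Great, I'd like to order a pizza."


spec = GeneratedTest(
    initial_message="Where do you deliver?",
    persona="A customer in Queens trying to place an order",
    rule="When asked where you deliver, the agent should explain that "
         "we ship to Manhattan, Brooklyn, and Queens.",
)
before = run_simulation(spec, old_agent, scripted_user, keyword_judge)
after = run_simulation(spec, new_agent, scripted_user, keyword_judge)
```

With these stubs, `before` fails on the first agent reply and carries that reply as the offending message, while `after` passes every turn. Judging each reply as it arrives, rather than only the final transcript, is what makes it possible to point at the exact message that broke the rule.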
This article argues that standard test suites with fixed inputs and expected outputs are insufficient for AI agents due to infinite input spaces and non-deterministic behavior, advocating for property-based testing instead.
The article argues that traditional chatbot QA is broken because it only tests happy paths, and proposes using an AI-powered user simulator that attacks the bot with diverse personas and edge cases to find vulnerabilities before deployment.
This paper presents an AI agent that integrates large language models with laboratory orchestration software, allowing scientists to create, monitor, and manage automated lab protocols using natural language. Evaluated on three simulated labs, the agent achieves a 97% first-attempt protocol generation success rate and requires far fewer interface actions.
The author is building a tool to automatically test AI agents by simulating realistic user conversations and providing pass/fail reports, saving developers from manual testing.
This paper introduces RogueAI, a reverse Turing test implemented as an interactive webapp where human players interrogate two LLM agents to identify which one is licensed to deceive within a shared fictional scenario. A pilot deployment shows a gap between heuristic detection (75.6% accuracy) and human performance (56.6%), highlighting the potential of the system as a data-collection and teaching tool for AI deception and honesty.