Tag
This paper introduces Turing-RL, a reinforcement learning approach that uses Turing-test-based rewards to train language models to generate responses indistinguishable from those of human users in conversational and forum settings. The approach outperforms baseline methods.
This paper introduces Dialogue-SWE-Bench, a benchmark for evaluating coding agents' ability to resolve software engineering problems through dialogue with a user. It also proposes a persona-grounded user simulator and a schema-guided agent with improved dialogue capabilities.
The article argues that traditional chatbot QA is broken because it tests only happy paths, and proposes an AI-powered user simulator that attacks the bot with diverse personas and edge cases to find vulnerabilities before deployment.