Tag
The paper introduces ATOD, a hybrid online distillation algorithm combining on-policy distillation and reinforcement learning for training small language model agents in multi-turn tasks, featuring an annealed OPD-RL schedule and Turn-level Disagreement-Uncertainty Reweighting to improve dense supervision.
This article introduces a new natural-language testing system for AI agents that uses simulated isolates to automatically generate multi-turn simulations and evaluate agent behavior, helping developers catch regressions from prompt changes.
A new benchmark for multi-turn prompt injection attacks reveals that most current defenses fail to detect sophisticated, multi-step attacks.
Introduces EHRNote-ChatQA, a benchmark for evidence-grounded multi-turn clinical question answering over multiple discharge summaries, constructed with expert validation. Benchmarking 22 LLMs reveals challenges in evidence grounding and multi-turn error accumulation.
CacheRL trains small agent foundation models for multi-step tool-calling tasks, achieving 92% process accuracy (approaching GPT-5's 94%) with 100x less compute by using cached rollouts and hybrid reward shaping. Its contributions include knowledge transfer, cache-aware rewards, and iterative SFT/GRPO training.
DLawBench is a new benchmark for evaluating large language models in multi-turn legal consultation, covering Chinese and US law with four client types. Experiments show significant room for improvement, with the best model achieving only 0.562 on legal reasoning.
The Shopping Reasoning Bench is an expert-authored benchmark for evaluating multi-turn conversational shopping assistants, with 525 missions and over 10,000 binary rubrics. Evaluations of GPT, Claude, and Gemini show that current models achieve only 57-77% pass rates, revealing significant gaps in expert-level shopping reasoning.
HERO introduces a hindsight-enhanced self-distillation framework that uses environment observations as locally aligned feedback to improve multi-turn agent capabilities, outperforming existing methods on TauBench and WebShop, especially under limited turn budgets.
This paper introduces ISE, a three-stage synthesis paradigm for generating multi-turn OS-agent trajectories with grounded execution, demonstrating that fine-tuning on the resulting ISE-Trace dataset significantly improves agent performance on ClawEval.
IntentKV introduces a cross-turn intent-aware KV cache pruning method for multi-turn LLM agents, maintaining session-level query memory to efficiently prune cache without accuracy loss, significantly reducing token usage and KV reads.
This paper studies a deployed LLM-as-judge system for evaluating multi-turn conversational agents and finds it catches far fewer defects than human review, revealing a structured blind-spot taxonomy and routing failures.
Proposes Adwm, an autoregressive diffusion world model for off-policy evaluation of LLM agents, enabling reliable value estimates from pre-collected trajectories without online interaction.
This paper introduces PersuasionTrace, a framework for studying multi-turn persuasion in human-LLM interaction, using a Bayesian-network simulated target that models belief updates. The framework reveals that LLMs are persuasive across topics and modalities, and that the Bayesian target better matches human belief dynamics than vanilla LLM simulators.
AdaPlanBench is a dynamic benchmark for evaluating LLM agents' ability to adaptively plan under progressively revealed world and user constraints through multi-turn interactions, showing current models struggle especially with user constraints.
This paper proposes WRIT, a pipeline for synthesizing multi-turn agent training trajectories that balance write-intensive and read-heavy complexity. The method generates diverse tasks and simulations, enabling small models to achieve strong performance with reduced inference cost.
Discussion of Salesforce's CRMArena-Pro benchmark showing agent success drops from 58% on single-turn to 35% on multi-turn tasks, plus practical advice for splitting agent workflows into narrow stages to reduce error compounding.
This article explains the 'Token-In, Token-Out' (TITO) invariant in reinforcement learning for LLMs, highlighting a common error when training multi-turn agents with tool calls. It presents two solutions: using per-model renderers or designing training to avoid re-encoding decoded tokens, emphasizing prefix-preserving chat templates.
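A minimal sketch of why re-encoding decoded tokens can break the invariant, using a toy three-entry vocabulary and a greedy longest-match encoder. Both are assumptions made for illustration and are not taken from the article; real tokenizers differ in detail, but they can likewise map the same text to a different ID sequence than the one the model sampled.

```python
# Toy vocabulary (hypothetical): ids 0 and 1 are single characters,
# id 2 is the merged pair "ab".
VOCAB = {0: "a", 1: "b", 2: "ab"}
TEXT_TO_ID = {text: tok_id for tok_id, text in VOCAB.items()}
MAX_PIECE_LEN = max(len(text) for text in TEXT_TO_ID)


def decode(ids):
    """Concatenate the text of each token id."""
    return "".join(VOCAB[i] for i in ids)


def encode(text):
    """Greedy longest-match encoding, a stand-in for a real tokenizer."""
    ids, pos = [], 0
    while pos < len(text):
        # Try the longest candidate piece first, then shorter ones.
        for length in range(min(MAX_PIECE_LEN, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in TEXT_TO_ID:
                ids.append(TEXT_TO_ID[piece])
                pos += length
                break
        else:
            raise ValueError(f"cannot encode text at position {pos}")
    return ids


def extend_context(context_ids, sampled_ids):
    """Token-in, token-out: append the sampled ids without a text round trip."""
    return context_ids + sampled_ids


# The policy sampled "a" then "b" as two separate tokens.
sampled = [0, 1]

# Decoding and re-encoding yields the same text but a different id sequence,
# so a trainer would score tokens the policy never actually emitted.
reencoded = encode(decode(sampled))
print(sampled, reencoded, decode(sampled) == decode(reencoded))

# Keeping the sampled ids as-is preserves exactly what the policy produced.
print(extend_context([2], sampled))
```

In this sketch, `extend_context` corresponds loosely to the second solution the summary mentions: the training loop carries token IDs forward and never passes them back through text.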
LongDS is a benchmark for evaluating AI agents on long-horizon, multi-turn data analysis tasks derived from Kaggle notebooks. Experiments show that the best models achieve only 48% accuracy, and accuracy drops significantly as the number of turns grows.
The paper introduces SeDT, a training-free inference-time method that improves LLM reliability in multi-turn conversations by annotating conversation history with cumulative relevance scores from three signals, achieving performance gains of up to 37.7% on the Lost-in-Conversation benchmark.
This paper introduces EnterpriseMem-Bench, a multi-turn Text-to-SQL benchmark, and evaluates five frontier models across memory architectures, finding that stateless models collapse by the third turn and that working memory yields the largest gains.