multi-turn

#multi-turn

ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents

arXiv cs.AI · 3h ago

The paper introduces ATOD, a hybrid online distillation algorithm combining on-policy distillation and reinforcement learning for training small language model agents in multi-turn tasks, featuring an annealed OPD-RL schedule and Turn-level Disagreement-Uncertainty Reweighting to improve dense supervision.

Natural-Language Testing for AI Agents (using simulated isolates)

Reddit r/AI_Agents · 9h ago

This article introduces a new natural-language testing system for AI agents that uses simulated isolates to automatically generate multi-turn simulations and evaluate agent behavior, helping developers catch regressions from prompt changes.

I built a benchmark for multi-turn prompt injection attacks. Most defenses never see them coming.

Reddit r/artificial · 2026-06-19

A new benchmark for multi-turn prompt injection attacks reveals that most current defenses fail to detect sophisticated, multi-step attacks.

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

arXiv cs.CL · 2026-06-16

Introduces EHRNote-ChatQA, a benchmark for evidence-grounded multi-turn clinical question answering over multiple discharge summaries, constructed with expert validation. Benchmarking 22 LLMs reveals challenges in evidence grounding and multi-turn error accumulation.

CacheRL: Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward

arXiv cs.CL · 2026-06-15

CacheRL trains small agent foundation models for multi-step tool-calling tasks, achieving 92% process accuracy (approaching GPT-5's 94%) with 100x less compute using cached rollouts and hybrid reward shaping, with innovations in knowledge transfer, cache-aware rewards, and iterative SFT/GRPO training.

DLawBench: Evaluating LLMs Through Multi-Turn Legal Consultation

arXiv cs.CL · 2026-06-15

DLawBench is a new benchmark for evaluating large language models in multi-turn legal consultation, covering Chinese and US law with four client types. Experiments show significant room for improvement, with the best model achieving only 0.562 on legal reasoning.

Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

arXiv cs.CL · 2026-06-12

The Shopping Reasoning Bench is an expert-authored benchmark for evaluating multi-turn conversational shopping assistants, with 525 missions and over 10,000 binary rubrics. Evaluations of GPT, Claude, and Gemini show that current models achieve only 57-77% pass rates, revealing significant gaps in expert-level shopping reasoning.

HERO: Hindsight-Enhanced Reflection from Environment Observations for Agentic Self-Distillation

arXiv cs.AI · 2026-06-11

HERO introduces a hindsight-enhanced self-distillation framework that uses environment observations as locally aligned feedback to improve multi-turn agent capabilities, outperforming existing methods on TauBench and WebShop, especially under limited turn budgets.

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

arXiv cs.CL · 2026-06-11

This paper introduces ISE, a three-stage synthesis paradigm for generating multi-turn OS-agent trajectories with grounded execution, demonstrating that fine-tuning on the resulting ISE-Trace dataset significantly improves agent performance on ClawEval.

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

arXiv cs.LG · 2026-06-10

IntentKV introduces a cross-turn intent-aware KV cache pruning method for multi-turn LLM agents, maintaining session-level query memory to efficiently prune cache without accuracy loss, significantly reducing token usage and KV reads.

Catching One in Five: LLM-as-Judge Blind Spots in Production Multi-Turn Transaction Agents

arXiv cs.CL · 2026-06-10

This paper studies a deployed LLM-as-judge system for evaluating multi-turn conversational agents and finds it catches far fewer defects than human review, revealing a structured blind-spot taxonomy and routing failures.

Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents

arXiv cs.LG · 2026-06-05

Proposes Adwm, an autoregressive diffusion world model for off-policy evaluation of LLM agents, enabling reliable value estimates from pre-collected trajectories without online interaction.

A Model of Multi-turn Human Persuadability Using Probabilistic Belief Tracing

arXiv cs.CL · 2026-06-05

This paper introduces PersuasionTrace, a framework for studying multi-turn persuasion in human-LLM interaction, using a Bayesian-network simulated target that models belief updates. The framework reveals that LLMs are persuasive across topics and modalities, and that the Bayesian target better matches human belief dynamics than vanilla LLM simulators.

AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

Hugging Face Daily Papers · 2026-06-04

AdaPlanBench is a dynamic benchmark for evaluating LLM agents' ability to adaptively plan under progressively revealed world and user constraints through multi-turn interactions, showing current models struggle especially with user constraints.

WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents

arXiv cs.CL · 2026-06-03

This paper proposes WRIT, a pipeline for synthesizing multi-turn agent training trajectories that balance write-intensive and read-heavy complexity. The method generates diverse tasks and simulations, enabling small models to achieve strong performance with reduced inference cost.

Did you see when Salesforce ran their own AI Agents benchmark?

Reddit r/ArtificialInteligence · 2026-06-01

Discussion of Salesforce's CRMArena-Pro benchmark showing agent success drops from 58% on single-turn to 35% on multi-turn tasks, plus practical advice for splitting agent workflows into narrow stages to reduce error compounding.

Agentic RL: Token-In, Token-Out Done Right (16 minute read)

TLDR AI · 2026-06-01

This article explains the 'Token-In, Token-Out' (TITO) invariant in reinforcement learning for LLMs, highlighting a common error when training multi-turn agents with tool calls. It presents two solutions: using per-model renderers or designing training to avoid re-encoding decoded tokens, emphasizing prefix-preserving chat templates.

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

Hugging Face Daily Papers · 2026-05-28

LongDS is a benchmark for evaluating AI agents on long-horizon, multi-turn data analysis tasks derived from Kaggle notebooks. Experiments show that the best models achieve only 48% accuracy, with a significant drop as the number of turns grows.

SeDT: Sentence-Transformer Decision-Transformer Conditioning for Multi-Turn Conversation Reliability

arXiv cs.CL · 2026-05-27

The paper introduces SeDT, a training-free inference-time method that improves LLM reliability in multi-turn conversations by annotating conversation history with cumulative relevance scores from three signals, achieving up to +37.7% performance gains on the Lost-in-Conversation benchmark.

Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study

arXiv cs.CL · 2026-05-27

This paper introduces EnterpriseMem-Bench, a multi-turn Text-to-SQL benchmark, and evaluates five frontier models across memory architectures, finding that stateless models collapse by the third turn and that working memory yields the largest gains.
