multi-turn

#multi-turn

EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

arXiv cs.AI · 2026-05-26

Introduces EvoCode-Bench, a benchmark of 26 stateful coding tasks across 227 rounds that evaluates coding agents in multi-turn iterative interactions, revealing that single-round performance overestimates multi-round capabilities by 22–40 points.

Found in Conversation: LLMs Teach Themselves to Close the Multi-Turn Gap

arXiv cs.CL · 2026-05-26

This paper introduces Found in Conversation (FiC), a training framework using View-Asymmetric Self-Distillation to close the multi-turn performance gap in LLMs. The method teaches models to recover single-turn competence from underspecified multi-turn prompts, achieving 92–100% recovery across model families and sizes.

SEAL: Synergistic Co-Evolution of Agents and Learning Environments

arXiv cs.CL · 2026-05-26

SEAL proposes a closed-loop framework for jointly evolving LLM agents and their training environments, using diagnosis-guided labels to align both sides. It achieves substantial gains in multi-turn tool-use tasks with only 400 training samples, demonstrating improved robustness and out-of-distribution transfer.

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Hugging Face Daily Papers · 2026-05-25

WBench is a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns, providing automatic sub-metrics and diagnostic insights. It reveals that no single model excels across all dimensions.

LLM Guard scored 0/8 on a USENIX 2025 multi-turn jailbreak. Here’s what caught it instead.

Reddit r/artificial · 2026-05-23

Arc Sentry detects multi-turn jailbreaks like Crescendo by reading model internal state rather than text output, catching attacks that text-based monitors miss entirely.

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

arXiv cs.CL · 2026-05-22

RankJudge is a benchmark generator that creates paired multi-turn conversations with injected flaws to evaluate LLM judges on their ability to correctly identify better and worse responses in complex dialogues.

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

arXiv cs.AI · 2026-05-20

This paper presents the first systematic study of credit assignment in multi-turn LLM agents, introducing SERL, a selective environment-reweighted learning framework. SERL uses environment feedback to sharpen the RL objective on causally relevant actions, achieving 90.0% and 80.1% success rates on ALFWorld and WebShop respectively.

π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

Hugging Face Daily Papers · 2026-05-19

π-Bench is a new benchmark comprising 100 multi-turn tasks with hidden user intents across 5 domain-specific user personas, designed to evaluate proactive assistance in long-horizon workflows for personal assistant agents.

When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

arXiv cs.AI · 2026-05-14

This paper provides a mechanistic explanation for why LLMs lose track of instructions in long multi-turn interactions, introducing the Goal Accessibility Ratio (GAR) metric and a channel-transition framework. Through ablation studies and residual stream probes, it shows that attention to goal-defining tokens closes over turns while goal information persists in residual representations, with architecture-specific failure modes.

IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

Hugging Face Daily Papers · 2026-05-13

IndicMedDialog is a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages, with a fine-tuned model for personalized symptom elicitation. The dataset is derived from MDDial, enhanced with LLM-generated synthetic consultations and expert verification, supporting multilingual healthcare AI.

Revisiting DAgger in the Era of LLM-Agents

Hugging Face Daily Papers · 2026-05-13

This paper revisits Dataset Aggregation (DAgger) for training long-horizon LLM agents, demonstrating that turn-level teacher-student policy interpolation mitigates covariate shift and outperforms existing methods on software engineering benchmarks like SWE-bench Verified.

SEQUOR: A Multi-Turn Benchmark for Realistic Constraint Following

arXiv cs.CL · 2026-05-08

This paper introduces SEQUOR, a benchmark for evaluating how well AI models follow constraints in long, multi-turn conversations. It finds that current models struggle to maintain instruction adherence over extended interactions.

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

Hugging Face Daily Papers · 2026-05-08

This paper introduces AEM, a supervision-free method for agentic reinforcement learning that adapts entropy dynamics at the response level to improve exploration-exploitation trade-offs. It demonstrates performance gains on benchmarks like ALFWorld and SWE-bench by aligning uncertainty estimation with action granularity.

RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation

Hugging Face Daily Papers · 2026-05-06

This paper presents the winning system for SemEval-2026 Task 8's generation subtask, using a heterogeneous ensemble of seven LLMs with dual prompting strategies and a GPT-4o-mini judge to select the best response. The system achieved first place with a conditioned harmonic mean of 0.7827, outperforming all baselines and demonstrating the value of model diversity.

Demystifying evals for AI agents

Anthropic Engineering · 2026-05-08

Anthropic provides a guide on designing rigorous automated evaluations for AI agents, addressing the complexities of multi-turn interactions and state modifications.
