Tag
This paper investigates whether vision-language models can distinguish potential from established common ground in asymmetric dialogue. Experiments on MapTask data show that providing task-relevant map content (visual or textual) biases models toward over-predicting alignment, as they rely on static referential cues rather than tracking grounding through dialogue history.
An essay exploring why thinking out loud with another person produces better understanding and insight than solitary reflection, drawing on cognitive science and philosophy.
Introduces Dialogue-SWE-Bench, a benchmark for evaluating coding agents' ability to resolve software engineering problems through dialogue with a user. Proposes a persona-grounded user simulator and a schema-guided agent that improves dialogue capabilities.
ParaBridge is an on-policy self-distillation method that bridges the gap between paralinguistic perception and dialogue behavior in speech language models, significantly improving safety and empathy without external rewards.
Introduces CRADLE-Dialogue, a clinician-annotated benchmark for turn-level crisis detection in mental health conversations, together with an Alert–Confirm evaluation protocol, a synthetic training corpus, and a 32B model that outperforms existing open-source and proprietary models.
Ψ-Bench is a benchmark for evaluating LLMs' ability to influence users through persuasive dialogues, incorporating user profiles for personalized persuasion. Experiments show that even state-of-the-art models have room for improvement, and access to client profiles significantly boosts performance.
This paper studies how humans and large language models linguistically accommodate each other during multi-turn conversations, finding that LLMs overconverge to user style while humans accommodate LLMs no differently than they accommodate other humans.
SwanVoice is a zero-shot text-to-speech model designed for expressive long-form monologue and dialogue synthesis, combining VAE, flow-matching DiT, and diffusion post-training to achieve higher richness and hierarchy scores than existing baselines.
Conv-to-Bench is a multi-stage framework that automatically transforms multi-turn user-assistant dialogues into structured, verifiable requirement checklists for evaluating large language models on code tasks, achieving near-perfect alignment with human-authored benchmarks at lower computational cost.
This paper introduces Bot-Mod, a moderation framework that identifies malicious intent in multi-agent systems through multi-turn dialogue and Gibbs-based sampling, and presents a dataset from Moltbook for evaluation.
Anthropic announces a series of dialogues with religious, philosophical, and cultural groups to broaden perspectives on building safe and beneficial AI. The conversations aim to inform the moral formation of AI systems like Claude.