SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations
Summary
SoCRATES introduces a realistic multi-domain benchmark for evaluating proactive LLM mediators, showing that top models resolve only about one-third of the consensus gap in conflict resolution.
View Cached Full Text
Cached at: 06/08/26, 11:15 AM
Paper page - SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations
Source: https://huggingface.co/papers/2606.05563
Abstract
SoCRATES presents a realistic multi-domain benchmark for evaluating proactive LLM mediators across various socio-cognitive adaptation axes, demonstrating that even top-performing models only resolve about one-third of the consensus gap in conflict resolution.
EvaluatingLLM mediatorsremains challenging, as mediation unfolds as areal-time trajectoryshaped by disputants’ shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly strategic posture, and score every turn against every topic, introducing off-topic noise. We introduce SoCRATES, a benchmark for evaluating proactiveLLM mediatorsin realistic,multi-domain testbeds. It constructs scenarios from real conflicts through anagentic pipelineacross eight domains, probes fivesocio-cognitive adaptationaxes (strategic posture, party composition, history length, emotional reactivity, and cultural identity), and scores each topic only on the turns that advance it via atopic-localized evaluator. The evaluator reaches 0.82 alignment with human experts, more than doubling a per-turn baseline. Benchmarking eight frontier LLMs, we find that even the strongest mediator closes only about a third of the unmediatedconsensus gapunder diverse and realistic testbeds, with performance varying sharply by socio-cognitive axis, highlighting that progress lies in social adaptation to diverse conditions.
View arXiv pageView PDFProject pageAdd to collection
Get this paper in your agent:
hf papers read 2606\.05563
Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash
Models citing this paper0
No model linking this paper
Cite arxiv.org/abs/2606.05563 in a model README.md to link it from this page.
Datasets citing this paper0
No dataset linking this paper
Cite arxiv.org/abs/2606.05563 in a dataset README.md to link it from this page.
Spaces citing this paper0
No Space linking this paper
Cite arxiv.org/abs/2606.05563 in a Space README.md to link it from this page.
Collections including this paper0
No Collection including this paper
Add this paper to acollectionto link it from this page.
Similar Articles
Automated Mediator for Human Negotiation: Pre-Mediation via a Structured LLM Pipeline
This paper introduces an automated mediator for human negotiation that uses a structured pipeline of LLM modules to conduct pre-mediation. In human-subject experiments, the system achieves preparation outcomes comparable to professional human mediators while reducing error in preference inference.
Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas
This study presents a 33-model atlas analyzing domain-level metacognitive monitoring in frontier LLMs using MMLU benchmarks, revealing significant variations in confidence calibration across different knowledge domains that are obscured by aggregate metrics.
Counsel: A Meta-Evaluation Dataset for Agentic Tasks
Counsel is the first public dataset of human meta-evaluations of LLM critiques for agentic tasks, designed to improve the calibration and reliability of automated evaluation methods.
Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle
This paper introduces AARR (Act As a Real Researcher), a suite of benchmarks to evaluate frontier LLMs and agentic systems on granular research scenarios. The first benchmark, AARRI-Bench, reveals that even top-performing agents achieve only 68.3% success, highlighting gaps in field sensitivity and nuanced reasoning.
RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity
RoleConflictBench is a novel benchmark containing over 13,000 scenarios across 65 roles designed to evaluate how well LLMs handle contextual sensitivity in role conflict situations where multiple social expectations clash. Analysis of 10 LLMs reveals that models predominantly rely on learned role preferences rather than dynamic contextual cues when making decisions.