SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
Summary
SCICONVBENCH is a benchmark that evaluates LLMs on multi-turn clarification for ill-posed scientific queries across computational science domains, finding that even frontier models struggle with disambiguation and frequently make silent assumptions.
Cached at: 05/19/26, 10:34 PM
Source: https://huggingface.co/papers/2605.18630
Abstract
SCICONVBENCH evaluates large language models’ ability to handle ill-posed scientific queries through multi-turn dialogue, focusing on clarifying ambiguous requests and resolving inconsistent information across computational science domains.
Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and partial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (inconsistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM performance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs frequently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at https://github.com/csml-rpi/SciConvBench.
Similar Articles
Can LLMs model real-world systems in TLA+?
Researchers from the Specula team created SysMoBench, a benchmark evaluating whether LLMs can faithfully model real-world computing systems in TLA+ or merely recite textbook specifications. The benchmark tests 11 systems across four phases and reveals systematic gaps in current LLMs' ability to accurately model system implementations versus reference papers.
LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
Introduces LinAlg-Bench, a diagnostic benchmark evaluating 10 frontier LLMs on structured linear algebra computation across matrix dimensions, revealing that LLM mathematical failure is structurally constrained and transitions from execution errors to computational abandonment at 4x4 scale.
EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation
This paper introduces EnvSimBench, a benchmark for evaluating Large Language Models' ability to simulate environments for agent training. It identifies a 'state change cliff' in current LLMs and proposes a constraint-driven pipeline to reduce hallucinations and costs.
A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation
Introduces A2RBench, an automated pipeline for generating formally verifiable abstract reasoning benchmarks for LLMs, using cycle consistency to ensure unique solutions, and reveals that current LLMs underperform humans significantly on 3D reasoning tasks.
LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening
LLMEval-Logic is a new Chinese benchmark for evaluating logical reasoning in LLMs, featuring solver-verified answers and adversarial hardening. The benchmark reveals significant gaps in current models, with the best reaching only 37.5% accuracy on hard items.