SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

Hugging Face Daily Papers 05/18/26, 12:00 AM Papers

Summary

SCICONVBENCH is a benchmark that evaluates LLMs on multi-turn clarification for ill-posed scientific queries across computational science domains, finding that even frontier models struggle with disambiguation and frequently make silent assumptions.

Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and partial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (inconsistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM performance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs frequently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at https://github.com/csml-rpi/SciConvBench.
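To make the failure modes named above concrete, here is a minimal Python sketch of how missing information, silent assumptions, and ungrounded repairs could be detected for a single case. It is purely illustrative: the `TaskCase` record, its field names, and the slot names in the example are assumptions made for this sketch and are not taken from the SCICONVBENCH ontology, rubric, or repository.

```python
from dataclasses import dataclass, field


@dataclass
class TaskCase:
    """Hypothetical case record; the schema is an assumption, not the paper's."""
    required: set[str]            # slots a well-posed task needs
    user_stated: dict[str, str]   # everything the user said, across all turns
    final_spec: dict[str, str] = field(default_factory=dict)  # assistant's final specification


def missing_slots(case: TaskCase) -> set[str]:
    """Slots the assistant would still need to ask about (disambiguation)."""
    return case.required - set(case.user_stated)


def silent_assumptions(case: TaskCase) -> set[str]:
    """Slots filled in the final spec that the user never provided."""
    return {k for k in case.final_spec if k not in case.user_stated}


def ungrounded_repairs(case: TaskCase) -> set[str]:
    """Slots whose final value differs from what the user actually stated."""
    return {
        k for k, v in case.final_spec.items()
        if k in case.user_stated and case.user_stated[k] != v
    }


# Illustrative fluid-mechanics request with one missing slot.
case = TaskCase(
    required={"geometry", "reynolds_number", "boundary_conditions"},
    user_stated={"geometry": "channel", "reynolds_number": "100"},
    final_spec={
        "geometry": "channel",
        "reynolds_number": "1000",          # changed without asking the user
        "boundary_conditions": "no-slip",   # filled in without asking the user
    },
)

print(sorted(missing_slots(case)))
print(sorted(silent_assumptions(case)))
print(sorted(ungrounded_repairs(case)))
```

In this toy setup a well-grounded assistant would ask about the boundary conditions and confirm any change to the Reynolds number before committing to a final specification. The benchmark itself uses rubric-based evaluation, which this set comparison does not reproduce.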

Original Article

Cached at: 05/19/26, 10:34 PM

Source: https://huggingface.co/papers/2605.18630

Abstract

SCICONVBENCH evaluates large language models’ ability to handle ill-posed scientific queries through multi-turn dialogue, focusing on clarifying ambiguous requests and resolving inconsistent information across computational science domains.

Similar Articles

Are LLMs Ready for Scientific Discovery? A Capability-Oriented Benchmark for AI Scientists

SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

DLawBench: Evaluating LLMs Through Multi-Turn Legal Consultation

Can LLMs model real-world systems in TLA+?

LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
