KWBench: Measuring Unprompted Problem Recognition in Knowledge Work
Summary
KWBench introduces a benchmark of 223 professional tasks to evaluate whether LLMs can recognize the underlying game-theoretic structure of a situation without prompting, finding that even the best model succeeds on only 27.9% of tasks. The benchmark targets unprompted problem recognition—a step prior to task execution—across domains like acquisitions, clinical pharmacy, and fraud analysis.
Source: https://huggingface.co/papers/2604.15760
Abstract
KWBench presents a benchmark for evaluating large language models’ ability to recognize professional scenarios without prompting, focusing on identifying underlying game-theoretic structures from raw inputs.
We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring is a three-tier rubric gated by a mandatory conjunctive check; the mandatory criteria encode the predicted wrong paths. We evaluate 16 models. The best model passes on 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approximately 83% across models); unconditional scores do not. The same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work: scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.
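The gated scoring and the routing-coverage figures follow mechanically from the rubric described above. Below is a minimal sketch of that logic, assuming a per-task record of boolean mandatory criteria and tiered quality scores; the field names, equal tier weighting, and toy numbers are illustrative assumptions, not details from the paper.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """One model's graded output on one KWBench task (illustrative schema)."""
    mandatory: list[bool]          # conjunctive gate: did the model avoid each predicted wrong path?
    tier_scores: dict[str, float]  # e.g. {"tier1": 0.9, "tier2": 0.8, "tier3": 0.7}, each in [0, 1]

def passes(result: TaskResult) -> bool:
    # A task counts as passed only if every mandatory criterion holds.
    return all(result.mandatory)

def quality(result: TaskResult) -> float:
    # Quality is only meaningful conditional on passing the gate.
    # Equal weighting across tiers is an assumption for this sketch.
    if not passes(result):
        return 0.0
    return sum(result.tier_scores.values()) / len(result.tier_scores)

def routed_coverage(per_model_passes: dict[str, set[int]], total_tasks: int) -> float:
    # "Routing" coverage: fraction of tasks passed by at least one of the given models.
    covered = set().union(*per_model_passes.values()) if per_model_passes else set()
    return len(covered) / total_tasks

# Toy usage: two models whose pass sets barely overlap, as reported for the top models.
if __name__ == "__main__":
    passes_by_model = {"model_a": {1, 2, 5}, "model_b": {2, 7, 9}}
    print(routed_coverage(passes_by_model, total_tasks=223))
```

On the reported numbers, 50.7% of 223 tasks corresponds to roughly 113 tasks covered by the union of the top-8 pass sets, versus about 62 (27.9%) for the best single model.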
Similar Articles
SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks
SkillLearnBench introduces the first benchmark for evaluating continual skill learning in LLM agents across 20 real-world tasks, revealing that no method dominates and scaling LLMs does not guarantee better skills.
RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity
RoleConflictBench is a novel benchmark containing over 13,000 scenarios across 65 roles designed to evaluate how well LLMs handle contextual sensitivity in role conflict situations where multiple social expectations clash. Analysis of 10 LLMs reveals that models predominantly rely on learned role preferences rather than dynamic contextual cues when making decisions.
The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring
A new cross-domain benchmark (Metacognitive Monitoring Battery) with 524 items evaluates LLM self-monitoring capabilities across six cognitive domains using human psychometric methodology. Applied to 20 frontier LLMs, it reveals three distinct metacognitive profiles and shows that accuracy rank and metacognitive sensitivity rank are largely inverted.
CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks
CulturALL introduces a 2,610-sample benchmark across 14 languages and 51 regions to evaluate LLMs on real-world, culturally grounded tasks; the top model scores only 44.48%, leaving substantial room for improvement.
BAGEL: Benchmarking Animal Knowledge Expertise in Language Models
BAGEL is a new benchmark for evaluating animal-related knowledge in large language models, constructed from diverse scientific sources and covering taxonomy, morphology, habitat, behavior, and species interactions through closed-book question-answer pairs. The benchmark enables fine-grained analysis across taxonomic groups and knowledge categories, providing insights into model strengths and failure modes for biodiversity applications.