KWBench: Measuring Unprompted Problem Recognition in Knowledge Work
Summary
KWBench introduces a benchmark of 223 professional tasks to evaluate whether LLMs can recognize the underlying game-theoretic structure of a situation without prompting, finding that even the best model succeeds on only 27.9% of tasks. The benchmark targets unprompted problem recognition—a step prior to task execution—across domains like acquisitions, clinical pharmacy, and fraud analysis.
Source: https://huggingface.co/papers/2604.15760
Abstract
KWBench presents a benchmark for evaluating large language models’ ability to recognize professional scenarios without prompting, focusing on identifying underlying game-theoretic structures from raw inputs.
We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring is a three-tier rubric gated by a mandatory conjunctive check; the mandatory criteria encode the predicted wrong paths. We evaluate 16 models. The best model passes on 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approximately 83% across models); unconditional scores do not. The same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work: scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.
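The gated scoring and the routing-coverage figures follow mechanically from the rubric described above. Below is a minimal sketch of that logic, assuming a per-task record of boolean mandatory criteria and tiered quality scores; the field names, equal tier weighting, and toy numbers are illustrative assumptions, not details from the paper.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """One model's graded output on one KWBench task (illustrative schema)."""
    mandatory: list[bool]          # conjunctive gate: did the model avoid each predicted wrong path?
    tier_scores: dict[str, float]  # e.g. {"tier1": 0.9, "tier2": 0.8, "tier3": 0.7}, each in [0, 1]

def passes(result: TaskResult) -> bool:
    # A task counts as passed only if every mandatory criterion holds.
    return all(result.mandatory)

def quality(result: TaskResult) -> float:
    # Quality is only meaningful conditional on passing the gate.
    # Equal weighting across tiers is an assumption for this sketch.
    if not passes(result):
        return 0.0
    return sum(result.tier_scores.values()) / len(result.tier_scores)

def routed_coverage(per_model_passes: dict[str, set[int]], total_tasks: int) -> float:
    # "Routing" coverage: fraction of tasks passed by at least one of the given models.
    covered = set().union(*per_model_passes.values()) if per_model_passes else set()
    return len(covered) / total_tasks

# Toy usage: two models whose pass sets barely overlap, as reported for the top models.
if __name__ == "__main__":
    passes_by_model = {"model_a": {1, 2, 5}, "model_b": {2, 7, 9}}
    print(routed_coverage(passes_by_model, total_tasks=223))
```

On the reported numbers, 50.7% of 223 tasks corresponds to roughly 113 tasks covered by the union of the top-8 pass sets, versus about 62 (27.9%) for the best single model.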
Similar Articles
SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks
SkillLearnBench introduces the first benchmark for evaluating continual skill learning in LLM agents across 20 real-world tasks, revealing that no method dominates and scaling LLMs does not guarantee better skills.
RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity
RoleConflictBench is a novel benchmark containing over 13,000 scenarios across 65 roles designed to evaluate how well LLMs handle contextual sensitivity in role conflict situations where multiple social expectations clash. Analysis of 10 LLMs reveals that models predominantly rely on learned role preferences rather than dynamic contextual cues when making decisions.
The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring
A new cross-domain benchmark (Metacognitive Monitoring Battery) with 524 items evaluates LLM self-monitoring capabilities across six cognitive domains using human psychometric methodology. Applied to 20 frontier LLMs, it reveals three distinct metacognitive profiles and shows that accuracy rank and metacognitive sensitivity rank are largely inverted.
CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks
CulturALL introduces a 2,610-sample benchmark across 14 languages and 51 regions to evaluate LLMs on real-world, culturally grounded tasks; the top model scores only 44.48%, leaving substantial room for improvement.
BAGEL: Benchmarking Animal Knowledge Expertise in Language Models
BAGEL is a new benchmark for evaluating animal-related knowledge in large language models, constructed from diverse scientific sources and covering taxonomy, morphology, habitat, behavior, and species interactions through closed-book question-answer pairs. The benchmark enables fine-grained analysis across taxonomic groups and knowledge categories, providing insights into model strengths and failure modes for biodiversity applications.