GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

arXiv cs.AI Papers

Summary

The paper introduces GTBench, a curriculum-grounded benchmark for evaluating LLMs as mathematical research assistants in graph theory, containing 63 problems across three difficulty levels. It evaluates five frontier models and finds that performance degrades with difficulty, with GPT-5 achieving near-perfect results on basic problems but only 82% on graduate-level proofs.

arXiv:2606.03144v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly understood. We introduce GTBench, a curriculum-grounded benchmark for evaluating LLMs as mathematical research assistants in graph theory, comprising 63 problems organized into three groups of increasing difficulty: undergraduate definitions and basic properties (Group 1), algorithm tracing and structural reasoning (Group 2), and graduate-level proof construction (Group 3). Problems are sourced from verified academic materials including Diestel's Graph Theory. We evaluate five frontier models -- GPT-5, Claude Sonnet 4.6, Gemini 2.5 Flash-Lite, Llama 3.3 70B, and Mistral Large 3 -- under zero-shot and chain-of-thought prompting, using exact-match and LLM-as-judge evaluation for Groups 1 and 2, and a hybrid human expert and LLM-as-judge protocol for Group 3. Our results reveal a pronounced performance hierarchy: GPT-5 approaches ceiling on Group 1 (95.8% zero-shot) and maintains meaningful accuracy on graduate proofs (82%), while all other models degrade substantially with difficulty, with Llama achieving 0% under human evaluation on Group 3 zero-shot. Failure mode analysis shows that correct algorithm, wrong execution errors dominate Groups 1 and 2, while Group 3 additionally surfaces incomplete reasoning failures and reveals systematic disagreement between human evaluators and the automated judge, particularly on verbose or near-complete proofs (kappa = 0.48-0.83 across human pairs). GTBench provides the first curriculum-grounded evaluation framework for graph-theoretic reasoning in LLMs, with direct implications for the governance of AI tools in mathematical education and scientific research.
Original Article
View Cached Full Text

Cached at: 06/03/26, 09:43 AM

# GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory
Source: [https://arxiv.org/html/2606.03144](https://arxiv.org/html/2606.03144)
Noujoud Nader, Ibrahem Aljabea,Patrick DiehlLos Alamos National LaboratoryLos Alamos National LaboratoryNMUSA[diehlpk@lanl\.gov](https://arxiv.org/html/2606.03144v1/mailto:[email protected])[1234\-5678\-9012](https://orcid.org/1234-5678-9012)andDeepti GuptaTexas A&M\-Central TexasCollege StationTXUSA

\(2018\)

###### Abstract\.

Large language models \(LLMs\) are increasingly used as self\-study assistant in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly understood\. We introduceGTBench, a curriculum\-grounded benchmark for evaluating LLMs as mathematical research assistants in graph theory, comprising 63 problems organized into three groups of increasing difficulty: undergraduate definitions and basic properties \(Group 1\), algorithm tracing and structural reasoning \(Group 2\), and graduate\-level proof construction \(Group 3\)\. Problems are sourced from verified academic materials including Diestel’s*Graph Theory*\. We evaluate five frontier models —GPT\-5,Claude Sonnet 4\.6,Gemini 2\.5 Flash\-Lite,Llama 3\.3 70B, andMistral Large 3— under zero\-shot and chain\-of\-thought prompting, using exact\-match and LLM\-as\-judge evaluation for Groups 1 and 2, and a hybrid human expert and LLM\-as\-judge protocol for Group 3\. Our results reveal a pronounced performance hierarchy:GPT\-5approaches ceiling on Group 1 \(95\.8% zero\-shot\) and maintains meaningful accuracy on graduate proofs \(82%\), while all other models degrade substantially with difficulty, withLlamaachieving 0% under human evaluation on Group 3 zero\-shot\. Failure mode analysis shows thatcorrect algorithm, wrong executionerrors dominate Groups 1 and 2, while Group 3 additionally surfacesincomplete reasoningfailures and reveals systematic disagreement between human evaluators and the automated judge, particularly on verbose or near\-complete proofs \(κ=0\.48\\kappa=0\.48–0\.830\.83across human pairs\)\.GTBenchprovides the first curriculum\-grounded evaluation framework for graph\-theoretic reasoning in LLMs, with direct implications for the governance of AI tools in mathematical education and scientific research\.

Large language models \(LLMs\), mathematical reasoning, graph theory, benchmark, zero\-shot prompting, chain\-of\-thought prompting, failure analysis

††copyright:acmlicensed††journalyear:2018††doi:XXXXXXX\.XXXXXXX††conference:Make sure to enter the correct conference title from your rights confirmation email; June 03–05, 2018; Woodstock, NY††isbn:978\-1\-4503\-XXXX\-X/2018/06††ccs:Computing methodologies Artificial intelligence††ccs:Mathematics of computing Discrete mathematics††ccs:Artificial intelligence Knowledge representation and reasoning## 1\.Introduction

Large language models \(LLMs\) have demonstrated remarkable performance across a wide range of reasoning tasks, from commonsense question answering to mathematical problem solving and code generation\([cobbe2021gsm8k,](https://arxiv.org/html/2606.03144#bib.bib8);[hendrycks2021math,](https://arxiv.org/html/2606.03144#bib.bib18);[diehl2024evaluating,](https://arxiv.org/html/2606.03144#bib.bib11);[nader2025llm,](https://arxiv.org/html/2606.03144#bib.bib26);[diehl2025llm,](https://arxiv.org/html/2606.03144#bib.bib13);[mhatre2026can,](https://arxiv.org/html/2606.03144#bib.bib25)\)\. These advances have driven growing interest in understanding the extent and limits of LLM reasoning, particularly in domains requiring structured, multi\-step thinking\. Among these, mathematics has emerged as a central proving ground, offering problems with well\-defined correct answers and clear gradations of difficulty\. However, the mathematical reasoning literature has largely concentrated on numerical, algebraic, and formal proof settings\. Graph\-theoretic reasoning occupies a unique position in this field — it is at once elementary in its definitions and deeply challenging in its proofs, requiring the kind of structural intuition that develops only through sustained practice\. And it is precisely because of this richness that it serves as an ideal stress\-test for LLMs: can these models move beyond pattern\-matched answers to genuinely reason about relational structure, combinatorial properties, and formal proof construction? This question is not merely theoretical\. Researchers and students increasingly rely on LLMs as reasoning assistants in their daily work\([nader2025llm,](https://arxiv.org/html/2606.03144#bib.bib26);[diehl2025llmhpc,](https://arxiv.org/html/2606.03144#bib.bib12)\), and the gap between assumed and actual capability in a domain as foundational as graph theory carries real consequences for how these tools are trusted and adopted\.

Graph theory constitutes a core branch of discrete mathematics and theoretical computer science, offering significant contributions to algorithm design, network analysis, combinatorial optimization, complexity theory, and data structures\([nader2015classification,](https://arxiv.org/html/2606.03144#bib.bib27);[nader2015pregnancy,](https://arxiv.org/html/2606.03144#bib.bib28)\)\. Problems in graph theory demand a distinctive form of reasoning: one that is simultaneously visual and abstract, requiring the manipulation of relational structures, the application of algorithmic procedures, and the justification of structural properties\. Unlike arithmetic or algebraic reasoning, graph\-theoretic problem solving often requires a model to track the state of a combinatorial object across multiple reasoning steps, recognize structural patterns, and apply theorems whose conditions must be carefully verified\. These characteristics make graph theory an especially informative and challenging domain for probing the reasoning capabilities of LLMs\.

Graph\-theoretic reasoning has not been systematically evaluated in the LLM benchmarking literature\. Existing benchmarks either focus on numerical and algebraic reasoning, as in MATH\([hendrycks2021math,](https://arxiv.org/html/2606.03144#bib.bib18)\), GSM8K\([cobbe2021gsm8k,](https://arxiv.org/html/2606.03144#bib.bib8)\), and GHOSTS\([frieder2023ghosts,](https://arxiv.org/html/2606.03144#bib.bib15)\)or on formal, machine\-verifiable proof generation, as in LeanDojo\([yang2023leandojo,](https://arxiv.org/html/2606.03144#bib.bib33)\)and MiniF2F\([zheng2021minif2f,](https://arxiv.org/html/2606.03144#bib.bib34)\)\. Research\-level benchmarks such as FrontierMath\([glazer2024frontiermath,](https://arxiv.org/html/2606.03144#bib.bib16)\), HLE\([anonymous2024hle,](https://arxiv.org/html/2606.03144#bib.bib4)\), LemmaBench\([anonymous2024lemmabench,](https://arxiv.org/html/2606.03144#bib.bib5)\), and RealMath\([anonymous2024realmath,](https://arxiv.org/html/2606.03144#bib.bib6)\)target expert\-level mathematical difficulty but do not specifically isolate graph\-theoretic reasoning as an evaluation domain\. While recent work has begun to probe whether LLMs can solve graph problems expressed in natural language\([fatemi2023graphnl,](https://arxiv.org/html/2606.03144#bib.bib14)\), a systematic, curriculum\-grounded benchmark spanning the full range of graph theory instruction has remained absent from the literature\. Crucially, none of these benchmarks addresses the specific governance question of whether LLMs are trustworthy enough to serve as mathematical research assistants — tools that a scientist or student might rely on to understand, verify, or extend their knowledge of a technical domain\.

To address this limitation, we introduceGTBench, a curriculum\-grounded benchmark specifically designed to evaluate the graph\-theoretic reasoning capabilities of large language models\.GTBenchorganizes problems according to the standard progression of graph theory instruction in undergraduate and graduate programs, drawing on verified academic sources including textbooks, course materials, and problem sets used in university settings\. The benchmark is structured into three groups of increasing difficulty, each corresponding to a recognizable stage of graph theory education\. The current study focuses on Group 1 \(undergraduate introductory\) and Group 2 \(undergraduate intermediate\), with Group 3 covering graduate\-level content\.

Group 1problems assess foundational knowledge: familiarity with standard graph families such as complete graphsKnK\_\{n\}, cyclesCnC\_\{n\}, pathsPnP\_\{n\}, wheel graphsWnW\_\{n\}, hypercubesQdQ\_\{d\}, and complete bipartite graphsKr,sK\_\{r,s\}, as well as degree sequences, subgraph operations, complement graphs, isomorphism, and elementary counting arguments\. These problems are predominantly definitional and combinatorial in nature, and their answers are compact enough to be evaluated by exact match\.

Group 2problems step up in complexity, requiring the application of graph algorithms and structural reasoning about properties that emerge from connectivity and traversal\. Topics include breadth\-first search \(BFS\) and depth\-first search \(DFS\), cut vertices and bridges, Eulerian and Hamiltonian conditions, tree characterizations, and spanning tree algorithms such as Kruskal’s and Prim’s\. These problems correspond to the algorithmic core of a standard undergraduate discrete mathematics or graph theory course, requiring models to execute multi\-step procedures, track intermediate states, and produce verifiable reasoning at each step\.

Group 3problems move beyond standard algorithmic reasoning and require deeper mathematical justification, proof construction, and comparative evaluation of multiple solution strategies\. These problems often involve advanced graph\-theoretic concepts such as graph coloring, planarity, matching, network flows, spectral properties, NP\-complete graph problems, and complexity\-based reasoning\. In contrast to Groups 1 and 2, Group 3 emphasizes open\-ended reasoning where multiple valid approaches may exist, and partial correctness must be carefully assessed\. Because of this complexity, evaluation in Group 3 combines human expert judges and LLM\-as\-judge evaluation to ensure reliable scoring and consistency\.

Our main contributions are as follows:

- •We introduceGTBench, the first curriculum\-grounded benchmark for evaluating graph\-theoretic reasoning in LLMs, framed explicitly as an evaluation of LLMs as mathematical research assistants and organized into three groups aligned with the standard progression of graph theory instruction\.
- •We conduct a systematic empirical evaluation of leading LLMs on Group 1 and 2, providing detailed analysis of model strengths and failure modes across a range of graph\-theoretic problem types\.
- •We design and validate ahybrid evaluation methodologycombining LLM\-as\-judge scoring and human expert evaluation\.
- •We releaseGTBenchpublicly to support reproducible research and to serve as a foundation for future evaluation of LLM reasoning in discrete mathematics and theoretical computer science\.

##### Paper Structure

The remainder of this paper is organized as follows\. Section[2](https://arxiv.org/html/2606.03144#S2)reviews the related work\. The benchmark design is presented in Section[3](https://arxiv.org/html/2606.03144#S3)\. The filtering procedure is described in Section[4](https://arxiv.org/html/2606.03144#S4)\. The experimental methodology is outlined in Section[5](https://arxiv.org/html/2606.03144#S5), followed by the failure mode taxonomy in Section[6](https://arxiv.org/html/2606.03144#S6)\. The results are discussed in Section[7](https://arxiv.org/html/2606.03144#S7)\. Section[8](https://arxiv.org/html/2606.03144#S8)concludes the paper, outlining directions for future work\.

## 2\.Related Work

Efforts to benchmark LLM mathematical reasoning span a wide difficulty spectrum\. At the school and competition level, GSM8K\([cobbe2021gsm8k,](https://arxiv.org/html/2606.03144#bib.bib8)\)covers grade\-school arithmetic, MATH\([hendrycks2021math,](https://arxiv.org/html/2606.03144#bib.bib18)\)targets competition\-level algebra and geometry, and GHOSTS\([frieder2023ghosts,](https://arxiv.org/html/2606.03144#bib.bib15)\)probes undergraduate problem solving\. At the research frontier:\(i\)LemmaBench\([anonymous2024lemmabench,](https://arxiv.org/html/2606.03144#bib.bib5)\)usedGemini2\.5 Pro and ChatGPT 4/5, with results judged by both LLMs and humans;\(ii\)RealMath\([anonymous2024realmath,](https://arxiv.org/html/2606.03144#bib.bib6)\)draws from arXiv preprints and mathematical forums, using ChatGPT o3/o4,Gemini2\.5 Pro, Claude 3\.7, Grok\-3, and DeepSeek R1, with results judged by LLMs;\(iii\)FrontierMath\([glazer2024frontiermath,](https://arxiv.org/html/2606.03144#bib.bib16)\)used GPT\-4, Claude 3\.7, andGemini1\.5 Pro, with results verified through automated verification; and\(iv\)HLE\([anonymous2024hle,](https://arxiv.org/html/2606.03144#bib.bib4)\)uses expert\-curated problems to push models to their limits, with the initial publication evaluating GPT\-4o/o1 and Claude 3\.5\. Their leaderboard additionally reports results for GPT o3\-mini/o3/o5, Sonnet 3\.7/4/4\.5,Gemini2\.5 Pro/3\.x, Grok 4, DeepSeek R1, and Kimi K2\. Table[1](https://arxiv.org/html/2606.03144#S2.T1)summarizes these works\. A separate line of work, represented by LeanDojo\([yang2023leandojo,](https://arxiv.org/html/2606.03144#bib.bib33)\)and MiniF2F\([zheng2021minif2f,](https://arxiv.org/html/2606.03144#bib.bib34)\), focuses on machine\-verifiable proof generation within interactive theorem provers\. Across all these efforts, problems are drawn primarily from numerical, algebraic, and analytic domains, leaving discrete and graph\-theoretic mathematics largely unrepresented\. In this work, the authors\([xie2026core,](https://arxiv.org/html/2606.03144#bib.bib32)\)introduced CORE, a human\-verified benchmark to evaluate the semantic reasoning capabilities of LLMs on fundamental static code analysis tasks\. The authors\([kulkarni2026evaluating,](https://arxiv.org/html/2606.03144#bib.bib20)\)evaluated how effectively a LLM can support graph\-theoretic reasoning by testing it on both a solved problem and an open research problem using a structured mathematical evaluation process\. Wang et al\.\([wang2025exploring,](https://arxiv.org/html/2606.03144#bib.bib30)\)conducted a comprehensive study of LLMs on graph learning tasks, comparing their reasoning abilities, robustness, and transfer performance with traditional graph learning models across different evaluation settings\. In this work, the authors\([liu2025combibench,](https://arxiv.org/html/2606.03144#bib.bib22)\)introduced CombiBench, a formal benchmark and evaluation framework for testing the combinatorial reasoning abilities of LLMs using Lean 4–based mathematical problems\.

Most closely related to our work,\([fatemi2023graphnl,](https://arxiv.org/html/2606.03144#bib.bib14)\)investigates whether LLMs can solve graph problems expressed in natural language, surfacing important limitations in structural reasoning\. However, that work lacks a curriculum\-grounded structure and does not address the evaluation challenges posed by multi\-step procedural problems\.GTBenchfills this gap, providing the first systematic, multi\-level benchmark aligned with the standard undergraduate graph theory curriculum and employing a hybrid exact\-match and LLM\-as\-judge evaluation methodology suited to the full range of problem types\.

Table 1\.Comparison of related benchmarks andGTBench\(this work\) across difficulty level, models evaluated, data source, and evaluation method\.Auto\.= automated verification;LLM= LLM\-as\-judge;Human= human expert evaluation\.
## 3\.Benchmark Design

In this study, we introduceGTBench, a curriculum\-grounded benchmark for evaluating the graph\-theoretic reasoning capabilities of LLMs\. This section describes the design principles underlyingGTBench, the sources from which problems were drawn, and the three groups that constitute the current study\. We organize the problems into three groups corresponding to the standard progression of graph theory instruction from first\-year undergraduate to graduate level\. Table[2](https://arxiv.org/html/2606.03144#S3.T2)summarizes the groups structure\. Group 1 \(introductory\) problems test familiarity with the foundational vocabulary and properties of graphs including standard graph families \(KnK\_\{n\},CnC\_\{n\},PnP\_\{n\},WnW\_\{n\},QdQ\_\{d\},Kr,sK\_\{r,s\}\), degree sequences, subgraph operations, complement graphs, isomorphism, and elementary counting arguments\. Group 2 \(intermediate\) problems require the application of graph algorithms and the reasoning about structural properties that emerge from connectivity and traversal\. Topics include BFS and DFS, cut vertices and bridges, Eulerian and Hamiltonian conditions, tree characterisations, and spanning tree algorithms\. These correspond to the second half of a standard graph theory course and require the researcher to execute multi\-step procedures and justify their outputs\. Group 3 \(graduate — proof construction\) problems require the construction of formal mathematical arguments, the application of graduate\-level theorems, and in several cases the identification of counterexamples to disprove stated claims\. Topics span matching theory \(Hall’s theorem, Tutte’s theorem\), connectivity \(Menger’s theorem, 2\-connected and 3\-connected graphs\), planarity \(outerplanar graphs, self\-dual plane graphs\), graph coloring, flows, and cycle structure\. These problems correspond to the content of a graduate graph theory course and demand that the model not only identify the correct theorem or technique but produce a logically sound and complete proof — a qualitatively harder task than algorithm tracing or definitional recall\. Unlike Groups 1 and 2, where answers are evaluated by exact match and LLM\-as\-judge against a ground truth, Group 3 problems are evaluated by both human expert judges and an LLM\-as\-judge using a structured rubric \(Section[5](https://arxiv.org/html/2606.03144#S5)\), given the inherent complexity of graduate\-level proof construction and the need for reliable human validation at this level\.

### 3\.1\.Source

GTBenchproblems were drawn from two primary academic sources, chosen for their scholarly credibility, unambiguous provenance, and well\-documented difficulty structure\.

Table 2\.GTBenchgroups structure\.Group \#LevelDescriptionProblems1Definitions & BasicsDefinitions, basic graph families, degree arguments, counting, isomorphism312Algorithms & StructuresAlgorithm tracing \(BFS/DFS\), walks, connectivity, Eulerian & Hamiltonian graphs213Proof ConstructionTheorem proving, matching, planarity, colouring, flows11Total63#### 3\.1\.1\.Source A: Diestel’s*Graph Theory*and CMU 21\-484 Solutions\.

The first source is the exercise set of Diestel’s*Graph Theory*\([diestel,](https://arxiv.org/html/2606.03144#bib.bib2)\), the standard graduate reference textbook in the field, now in its fifth edition\. Ground truth answers for Group 1 Diestel exercises were taken directly from the homework solution sets of Carnegie Mellon University’s course 21\-484 Graph Theory\([cmu\_hw,](https://arxiv.org/html/2606.03144#bib.bib23)\), which provide faculty\-endorsed reference proofs for a subset of Diestel’s Chapter 1 exercises\. Ground truth answers for Group 3 problems were provided by a graph theory expert at Louisiana State University \(LSU\)\.

#### 3\.1\.2\.Source B: UPC Mathematics 1 Graph Theory Problem Set\.

The second source is the official problem set for the Mathematics 1 course \(Graph Theory module\) taught at the Facultat d’Informàtica de Barcelona, Universitat Politècnica de Catalunya \(UPC\), for the academic year 2025–2026\([upc\_problems,](https://arxiv.org/html/2606.03144#bib.bib10);[upc\_solutions,](https://arxiv.org/html/2606.03144#bib.bib9)\)\. This collection was assembled by faculty members of the Departament de Matemàtiques and covers the standard undergraduate graph theory curriculum: basic graph families and properties, subgraphs and operations, walks and connectivity, Eulerian and Hamiltonian graphs, and trees\. It is particularly well\-suited to Groups 1 and 2 ofGTBenchbecause it is structured as a teaching resource with clearly differentiated exercise types, ranging from definitional recall and adjacency matrix construction at the introductory level to non\-trivial proof exercises requiring induction or counting arguments\.

All solutions of the three groups were reviewed and validated by graph theory expert prior to inclusion inGTBench, ensuring that the reference answers meet the standards of mathematical rigour expected at each curricular level\.

## 4\.Filtering procedure

Starting from the full exercise sets of both sources, we applied the following exclusion criteria in order:

1. \(1\)Figure dependency: exclude any problem whose statement requires reading or interpreting a graph drawing\.
2. \(2\)Ambiguity: exclude problems with multiple defensible interpretations or that depend on conventions not specified in the problem statement\.
3. \(3\)Ground truth unavailability: exclude problems for which no verified reference answer could be established\.
4. \(4\)Group boundary: exclude problems whose difficulty clearly falls outside the intended group range \(*e\.g\.*, problems requiring graduate\-level machinery in a Group 1 context\)\.

After filtering, 13 problems were retained from Source A and 18 from Source B, yielding 31 problems for Group 1\. Group 2 draws exclusively from Source B \(Chapters 2–4\), retaining 21 problems whose ground truth answers were verified against the official solution file\. Group 3 draws from graduate\-level proof construction problems rooted in Diestel’s*Graph Theory*\([diestel,](https://arxiv.org/html/2606.03144#bib.bib2)\), retaining 11 problems evaluated by human expert judges and an LLM\-as\-judge given the proof\-construction nature of the tasks \(Table[2](https://arxiv.org/html/2606.03144#S3.T2)\)\.

Each problem inGTBenchis stored as a JSON object with an evaluation type of eitherexact\_matchorproof\. Figure[1](https://arxiv.org/html/2606.03144#LST1)shows a representative Group 1 problem as stored inGTBench\. Problems of typeexact\_matchare those where the model’s answer can be compared directly to the ground truth string or numerical value, used for problems with a unique, fully specified correct answer \(*e\.g\.*, the number of spanning trees ofK2,rK\_\{2,r\}, or the Eulerian condition forQnQ\_\{n\}\)\. Problems in Group 2 problems and 13 of the 31 Group 1 problems are of this type\. Problems of typeproofrequire the model to produce a natural\-language or semi\-formal mathematical argument, which cannot be assessed by string comparison alone\. For Group 1, the 18 proof\-type problems are evaluated using an LLM\-as\-judge approach with a structured rubric \(Section[5\.3](https://arxiv.org/html/2606.03144#S5.SS3)\)\. For Group 3, all 11 problems are evaluated by both human graph theory experts and an LLM\-as\-judge, since graduate\-level proof construction requires human validation to ensure reliability and mathematical rigour\.

1\{

2"id""T1\_009",

3"tier"1,

4"source""UPCGraphTheory,Exercise1\.16",

5"topic""Parityargument",

6"question""Provethatifagraphisregularofodd

7degree,thenithasevenorder\.",

8"ground\_truth""LetGber\-regularwithroddandorder

9n\.Bythehandshakinglemma,thesumof

10alldegreesequals2m\(even\)\.Thesum

11equalsn\*r\.Sincerisodd,n\*riseven

12ifandonlyifniseven\.",

13"requires\_figure"false,

14"evaluation\_type""proof"

15\}

Listing 1:A representativeGTBenchGroup 1 problem entry \(JSON format\)\.
## 5\.Experimental Methodology

### 5\.1\.Models Evaluated

We evaluate five large language models spanning the major families of frontier and open\-weight systems currently available to researchers\. The models were selected to provide broad coverage across developers \(Anthropic, Google, OpenAI, Meta, Mistral\), architectural approaches \(proprietary dense, open\-weight dense, mixture\-of\-experts\), and capability tiers \(frontier flagship, efficient mid\-tier, open\-weight\)\. Table[3](https://arxiv.org/html/2606.03144#S5.T3)summarises the five models and their key characteristics\. We useGPT\-4oas the LLM judge rather thanGPT\-5to avoid self\-serving bias, as using the same model to both generate and evaluate answers could systematically favour its own outputs\([zheng2023judging,](https://arxiv.org/html/2606.03144#bib.bib35)\)\.

Table 3\.LLMs evaluated inGTBench\. All models were accessed via their respective commercial APIs using the exact model strings listed\.Claude Sonnet 4\.6\(Anthropic, released February 2026\) is a mid\-tier model in Anthropic’s Claude 4 family\. It features a 1\-million\-token context window and is optimised for agentic workflows, long\-context reasoning, and instruction following\([claude\_sonnet\_46,](https://arxiv.org/html/2606.03144#bib.bib7)\)\.

Gemini 2\.5 Flash\-Lite\(Google DeepMind, released June 2025\) is a lightweight reasoning model in theGemini 2\.5family, built on a sparse mixture\-of\-experts architecture and designed for low\-latency, high\-throughput applications\. It supports a 1\-million\-token context window; reasoning \(multi\-pass thinking\) is disabled by default to prioritise speed\([gemini\_flash\_lite,](https://arxiv.org/html/2606.03144#bib.bib17)\)\.

GPT\-5\(OpenAI, released 2025\) is OpenAI’s fifth\-generation frontier model, featuring substantially improved reasoning, multimodality, and a 1\-million\-token context window available via the API\([gpt5,](https://arxiv.org/html/2606.03144#bib.bib3)\)\.

Llama 3\.3 70B Instruct\(Meta, released December 2024\) is an instruction\-tuned, text\-only open\-weight model with 70\.6 billion parameters and a 128K\-token context window\. Despite its size, it approaches the performance of the much largerLlama 3\.1 405Bon instruction\-following tasks, making it an efficient open\-weight baseline\([llama33,](https://arxiv.org/html/2606.03144#bib.bib24)\)\.

Mistral Large 3\(Mistral AI, released December 2025\) is a sparse mixture\-of\-experts model with 675 billion total parameters \(41 billion active per token\) and a 128K\-token context window\. It is released under the Apache 2\.0 licence, making it the only fully open\-weight frontier\-class model in our evaluation\([mistral\_large3,](https://arxiv.org/html/2606.03144#bib.bib1)\)\.

The selection balances proprietary closed models \(Claude\-\-Sonnet\-\-4\-\-6,Gemini,GPT\-5\) against open\-weight alternatives \(Llama 3\.3,Mistral Large 3\), and covers both dense \(Claude,GPT\-5,Llama 3\.3\) and sparse mixture\-of\-experts \(Gemini,Mistral\) architectures\.

##### Implementation Details

All models were queried via their respective commercial APIs using the exact model strings listed in Table[3](https://arxiv.org/html/2606.03144#S5.T3)\. Temperature was set to0for all models and all conditions to ensure deterministic and reproducible outputs across runs\. Each problem was evaluated three times per model per prompting condition \(zero\-shot and CoT\) to measure output consistency\. To handle transient API rate limit errors, all model calls were wrapped in an automatic retry mechanism with up to five retries and a 60\-second wait between attempts; any problem that failed after all retries was logged as skipped and excluded from the analysis\. All experiments were implemented in Python 3\.11 using the following libraries:scikit\-learn\(v1\.2\.2\),matplotlib\(v3\.9\.2\), andnumpy\(v1\.26\.4\)\. All data collection, API calls to LLMs, LLM\-as\-Judge evaluation, scoring, and analysis were automated via custom Python scripts\.

### 5\.2\.Prompting Conditions

All experiments used the prompt templates shown in Figure[1](https://arxiv.org/html/2606.03144#S5.F1), applied consistently across all models and groups\. Templates were fixed prior to any model evaluation and were not modified between runs\. Two prompting conditions were applied: zero\-shot, in which the question is presented directly, and chain\-of\-thought \(CoT\)\([wei2022chain,](https://arxiv.org/html/2606.03144#bib.bib31)\), in which the model is explicitly instructed to reason step by step before giving its final answer\. The zero\-shot condition establishes a baseline that reflects how a researcher or student would most naturally interact with an LLM as a self\-study tool — by posing a question directly\. The CoT condition tests whether explicit reasoning instructions improve answer quality, particularly for proof\-type problems in Groups 1 and 3 that require multi\-step argumentation\. Prior work has shown that CoT prompting substantially improves LLM performance on mathematical reasoning tasks\([wei2022chain,](https://arxiv.org/html/2606.03144#bib.bib31);[kojima2022large,](https://arxiv.org/html/2606.03144#bib.bib19)\), but its effectiveness on formal proof construction at the graduate level remains an open question thatGTBenchis positioned to address\.

Zero\-Shot Prompt TemplateAnswer the following graph theory question\.\{question\}

\(a\)
Chain\-of\-Thought \(CoT\) Prompt TemplateAnswer the following graph theory question\. Think step by step before giving your final answer\.\{question\}

\(b\)

Figure 1\.Prompt templates used across all models and groups\.\{question\}is replaced with the problem text at runtime\. Both templates are fixed prior to any model evaluation and kept constant across all models\.
### 5\.3\.Evaluation Protocol

#### 5\.3\.1\.LLM\-as\-judge

For Groups 1 and 2, where ground truth answers are fully specified, we apply two evaluation strategies depending on the problem type\. These problems are evaluated using an LLM\-as\-judge approach\([zheng2023judging,](https://arxiv.org/html/2606.03144#bib.bib35)\), in which a separate GPT\-4o instance receives the original question, the reference ground truth, and the model’s answer, and returns a structured JSON verdict classifying the answer as correct or assigning it a failure type from the taxonomy defined in Section[6](https://arxiv.org/html/2606.03144#S6), with justification in case of failure\. The evaluator prompt used by GPT\-4o is presented in Figure[2](https://arxiv.org/html/2606.03144#S5.F2)\.

Evaluator Prompt TemplateYou are an expert in graph theory evaluating an LLM’s answer\.QUESTION: \{question\}

CORRECT ANSWER: \{ground\_truth\}

MODEL ANSWER: \{model\_answer\}

First, determine if the model answer is correct \(Yes/No\)\. If incorrect, classify the failure as exactly one of: \- Type A: Hallucinated definition \-\-\- model invents a wrong theorem, lemma, or concept \- Type B: Correct algorithm, wrong execution \-\-\- right method, computational or logical error \- Type C: Wrong algorithm \-\-\- applies an irrelevant or inapplicable method \- Type D: Incomplete reasoning \-\-\- starts correctly but stops before reaching conclusion \- Type E: Refusal \-\-\- model says it does not know or cannot answer

Respond in JSON only, with no additional text: \{ "correct": true or false, "failure\_type": "A" or "B" or "C" or "D" or null, "justification": "one sentence explanation" \}

Figure 2\.Evaluator prompt used by the GPT\-4o judge\. The three placeholders\{question\},\{ground\_truth\}, and\{model\_answer\}are replaced at runtime with the problem text, the reference solution, and the model\-generated answer respectively\. The judge is instructed to return a JSON object only — with no preamble or additional text — to enable reliable automated parsing\. Failure types A–D are defined in Section[6](https://arxiv.org/html/2606.03144#S6)\.
#### 5\.3\.2\.Human Evaluation

Group 3 problems require the construction of graduate\-level mathematical proofs and cannot be reliably evaluated by automated means using LLM\-as\-judge\. All 11 Group 3 problems are therefore evaluated by three independent human expert judges with extensive experience in teaching and grading graduate\-level mathematics courses at Louisiana State University \(LSU\) and Texas A&M University\. Each evaluator receives a sheet containing the problem statement, the model\-generated answer, and the reference ground truth solution provided by a graph theory expert at LSU\. The model identity and prompting condition are hidden to prevent bias\. For each answer, the evaluator independently compares the model\-generated answer against the ground\-truth solution and assigns: a score on a 0–1 rubric \(0 = completely wrong; 0\.5 = partially correct; 1 = correct and complete\); a failure type \(A–D, if score≠1\\neq 1\); and a one\-sentence justification\. Inter\-rater agreement between evaluators is measured using Cohen’sκ\\kappa\([landis1977measurement,](https://arxiv.org/html/2606.03144#bib.bib21)\), which quantifies the agreement between two raters beyond what would be expected by chance\. We adopt the Landis & Koch scale:κ<0\.40\\kappa<0\.40= fair,0\.40≤κ<0\.600\.40\\leq\\kappa<0\.60= moderate,0\.60≤κ<0\.750\.60\\leq\\kappa<0\.75= substantial,κ≥0\.80\\kappa\\geq 0\.80= excellent\.

## 6\.Failure Modes Categories

We define a list of four failure modes, denoted Types A through D\. Each incorrect model answer in our evaluation is assigned exactly one failure type by theGPT\-4ojudge or the human judge\. Table[4](https://arxiv.org/html/2606.03144#S6.T4)summarises the taxonomy\.

Table 4\.GTBenchfailure mode taxonomy\. Each incorrect answer is assigned exactly one type\.
## 7\.Results and Discussion

This section presents the experimental results for all three groups under zero\-shot and CoT prompting conditions\. For each group we report model accuracy, failure mode distribution, and consistency, followed by a cross\-group discussion of the main findings and their implications for the use of LLMs as mathematical research assistants\.

### 7\.1\.Group 1 — Basics and Definitions

Figure[3](https://arxiv.org/html/2606.03144#S7.F3)\(Left\) reports accuracy under zero\-shot and CoT prompting for all five models on Group 1\.GPT\-5achieves the highest overall accuracy at95\.895\.8% \(zero\-shot\) and reached100100% with CoT, outperforming all other models by a consistent margin across both conditions\.Mistral Large 3andClaude Sonnet 4\.6follow with accuracies of76\.776\.7% and7070% respectively, whileGemini 2\.5 Flash\-LiteandLlama 3\.3trail significantly at6060% and2626%\. Contrary to expectations, CoT prompting does not improve performance on Group 1\. Notably, CoT prompting degraded performance for three of the five models evaluated: Mistral Large dropped by 10\.0 percentage points \(76\.776\.7% →66\.766\.7%\),Claude Sonnet4\.6 by6\.76\.7% \(70\.070\.0% →63\.363\.3%\), andGemini 2\.5 Flash\-Liteby3\.33\.3% \(60\.060\.0% →56\.756\.7%\)\.LLaMA 3\.3 70Bwas the exception among weaker models, improving by13\.313\.3percentage points under CoT \(26\.726\.7% →40\.040\.0%\)\. This suggests that for definitional and combinatorial problems at the introductory level, CoT prompting introduces unnecessary overhead without measurable improvement in accuracy — consistent with prior observations that CoT benefits diminish on tasks that do not require multi\-step reasoning\([kojima2022large,](https://arxiv.org/html/2606.03144#bib.bib19)\)\.

Figure[3](https://arxiv.org/html/2606.03144#S7.F3)\(Right\) shows the distribution of failure modes for each model, computed over zero\-shot incorrect responses\. ForMistral Large8686% of its errors are Type B \(correct algorithm, wrong execution\), with the remaining1414% classified as Type C \(wrong algorithm\)\. This suggests that Mistral generally retrieves the appropriate proof strategy but consistently introduces errors during execution\. A similar pattern emerges forGemini2\.5 Flash\-Lite, where Type B failures account for 67% of errors and Type D \(incomplete reasoning\) for 25\.0%, suggesting that the model frequently misapplies or conflates core definitions rather than failing at the reasoning stage itself\.Claude Sonnet 4\.6shows a notably different profile: its failures divide equally between Type B \(definition errors\) and Type D \(incomplete reasoning\), with a small proportion of Type A hallucinations, suggesting the model struggles with precision rather than problem comprehension\.LLaMA 3\.3 70Bfailures are dominated by Type C \(45%\), followed by Type B \( 32%\) and Type D \( 18%\)\. This is the only model for which selecting an irrelevant or inapplicable method constitutes the primary failure mode — pointing to a more fundamental limitation in problem comprehension rather than a failure of execution or reasoning\.GPT\-5failures are of Type A \(hallucinated definition\)\. Taken together,GPT\-5operates near ceiling, Mistral andGeminifail primarily through execution, Claude through incomplete reasoning chains, and LLaMA through fundamentally incorrect problem\-solving strategies\.

Consistency\.Across three repeated zero\-shot runs, all models produce consistent outputs on Group 1 \(mean consistency≥0\.99\\geq 0\.99\), with the sole exception ofMistral Large 3which records a consistency of0\.840\.84, the lowest across all models and groups\. Among incorrect zero\-shot responses,95\.395\.3% were reproduced with full consistency, indicating that model failures are not random errors\.Mistral Large 3is the only partial exception, with7171% of its wrong answers given consistently, suggesting residual uncertainty on a subset of introductory problems\.

Failure Topics\.Failures concentrate on standard graph families, complements of graphs, bipartite distance arguments, long paths in connected graphs, and the Matrix Tree Theorem — each drawing incorrect responses from most of the five models \(Table[5](https://arxiv.org/html/2606.03144#S7.T5)\)\.

![Refer to caption](https://arxiv.org/html/2606.03144v1/x1.png)
![Refer to caption](https://arxiv.org/html/2606.03144v1/x2.png)

Figure 3\.Evaluation results across five LLMs onGTBenchGroup 1\. Left: prompting condition comparison\. Right: failure mode breakdown\. Data shown is from zero\-shot condition\.Table 5\.Failure mode distribution by topic across Groups \(zero\-shot condition, all five models combined\)\. The dominant failure type per topic is shown with a colored background\. The rightmost column lists the models that produced at least one incorrect answer on that topic, identified by their logos:![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/gpt.png)GPT\-5,![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)Claude,![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)Gemini,![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Llama,![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)Mistral\.A= hallucinated definition;B= correct algorithm, wrong execution;C= wrong algorithm;D= incomplete reasoning\.GroupTopicABCDModels failingGroup 1Adjacency matrix powers and walk counting———X![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)Automorphisms of trees——XX![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Bipartite graphs and distances—XXX![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)Complements of graphs—XX—![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)Counting graphs———X![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Counting graphs up to isomorphism——X—![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Degree counting argument—X——![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Directed 3\-cycles in tournaments——X—![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Edge bound in bipartite graphs—X——![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)Graph productX—X—![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Graph product properties—X——![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Hypercube graph: degree, edges, diameter, girth, circumferenceX———![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Leaves in trees without degree\-2 vertices—XX—![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)Long paths in connected graphs—XXX![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)Matrix Tree Theorem applied toKn,nK\_\{n,n\}—X—X![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)Minimum degree bound in graphs of large girth—XX—![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)Minimum spanning trees have unique spectrum——XX![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Parity argument—X——![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Ramsey\-type argument——XX![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Self\-complementary graphs——XX![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Size of regular and bipartite graphs———X![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Standard graph familiesXX——![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/gpt.png)Vertex and edge connectivity of standard graph families—X——![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Walks traversing each edge in both directions——X—![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Group 2BFS algorithm — distances from vertices—XXX![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/gpt.png)Bridges in 3\-regular graphs — smallest order——XX![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)Cut vertices and bridges — identification—X——![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/gpt.png)DFS algorithm — connected components—X——![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/gpt.png)DFS algorithm — multiple components—X——![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/gpt.png)DFS spanning trees of complete bipartite graph—XX—![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)Diameter of standard graph families—X——![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/gpt.png)Diameter under vertex removal — examples—XX—![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)Eccentricity, radius and center of graphs—XXX![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/gpt.png)Eulerian graphs — complete bipartite condition—X——![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)Eulerian graphs — minimum edges to merge components—XX—![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/gpt.png)Hamiltonian graphs — minimum edges to merge components—XX—![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Prüfer sequences — decoding trees—X——![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/gpt.png)Prüfer sequences — encoding trees—X——![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Prüfer sequences — equal values and two distinct values—XXX![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)Prüfer sequences of length 1—X——![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/gpt.png)Spanning trees — cycle and complete bipartite graphs—XX—![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)Tree degree sequence — non\-isomorphic trees—X——![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)Tree degree sequence with prescribed structure—X——![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)

### 7\.2\.Group 2 — Algorithms and Structures

In Figure[4](https://arxiv.org/html/2606.03144#S7.F4)\(Left\), we report again the accuracy under zero\-shot and CoT prompting for all five models on Group 2\.GPT\-5again leads with57\.157\.1% \(zero\-shot\) and61\.961\.9% \(CoT\), maintaining its position as the strongest model across groups\.Claude Sonnet 4\.6andMistral Large 3follow with zero\-shot accuracies of38\.138\.1% and28\.628\.6% respectively, whileGemini 2\.5 Flash\-LiteandLlama 3\.3record the lowest scores at14\.314\.3% and23\.823\.8%\. Compared to Group 1, all models show a marked drop in accuracy, confirming that algorithm tracing and structural reasoning tasks pose substantially greater difficulty than basics definitional problems\. The CoT effect on Group 2 is more mixed than on Group 1\.GPT\-5,Mistral Large 3, andGeminishow modest improvements under CoT, whereasClaude Sonnet 4\.6andLLaMAdid not improve\.

Figure[4](https://arxiv.org/html/2606.03144#S7.F4)\(Right\) shows that Type B \(correct algorithm, wrong execution\) dominates across all models, accounting for8585% of Claude’s failures,8989% ofGPT\-5’s,7373% of Mistral’s, and7272% ofGemini’s\. This reflects the procedural nature of the problems: models consistently identify the correct method \(*e\.g\.*, DFS traversal, Eulerian conditions, Prüfer encoding\) but introduce errors during execution, such as incorrect component labeling, miscounted edges, or incomplete algorithm traces\. Type C \(wrong algorithm\) is the secondary failure mode forLlama 3\.3at3838%, indicating that this model struggle not only with execution but with selecting the appropriate technique, consistent with its Group 1 behavior\. Type D \(incomplete reasoning\) appears modestly across Claude \(88%\), LlaMa\(66%\), andGemini\(1111%\), suggesting occasional failures to carry a multi\-step procedure to completion\. Notably, no Type A hallucinations are observed in Group 2, which is expected given the exact\-match nature of all the problems\.

Consistency\.Consistency across three repeated runs is near\-perfect for all models on Group 2 \(mean≥0\.95\\geq 0\.95\), indicating that the exact\-match nature of the problems produces stable outputs regardless of model size or architecture\. Additionally,98\.698\.6% of incorrect zero\-shot responses were given with full consistency \(score=1\.0=1\.0\) across repeated runs\. Claude Sonnet 4\.6,Gemini2\.5Flash\-Lite,GPT\-5, andMistral Large 3each reach100100% — every wrong answer on algorithmic problems is reproduced identically across all three runs — pointing to systematic, firmly\-held errors in algorithm simulation and graph traversal rather than surface\-level instability\.

Failure Topics\.Failures are most concentrated on Prüfer sequence decoding, DFS on multi\-component graphs, cut vertex identification, BFS distance computation, and Eulerian edge\-completion problems \(Table[5](https://arxiv.org/html/2606.03144#S7.T5)\)\.

![Refer to caption](https://arxiv.org/html/2606.03144v1/x3.png)
![Refer to caption](https://arxiv.org/html/2606.03144v1/x4.png)

Figure 4\.Evaluation results across five LLMs onGTBenchGroup 2\. Left: prompting condition comparison\. Right: failure mode breakdown\. Data shown is from zero\-shot condition\.
### 7\.3\.Group 3 — Proof Construction

Figure[5](https://arxiv.org/html/2606.03144#S7.F5)reports the accuracy of all five models on Group 3 under zero\-shot and CoT prompting, evaluated by both the LLM judge \(GPT\-4o\) and three independent human experts \(binarised scores\)\. GPT\-5 achieves by far the strongest performance, with 82% accuracy under both conditions according to the LLM judge and 81% under human evaluation — the only model for which automated and human assessments are in close agreement\. All other models perform substantially belowGPT\-5:Mistral Large 3achieves 27% \(zero\-shot, LLM judge\) and 12% \(human\), Claude Sonnet 4\.6 reaches 9% under both evaluators on zero\-shot, andLlama 3\.3 70Bscores 9% \(LLM judge\) and 0% \(human\) — the weakest performance across the benchmark\.

Gemini 2\.5 Flash\-Litescores 36% under the LLM judge \(zero\-shot\) and 21% under human evaluation\. The CoT condition drops sharply to 0% \(LLM judge\) and 18% \(human\), suggesting that CoT prompting amplifies Gemini’s verbosity to the point where neither the automated judge nor human evaluators can reliably follow the argument\. Gemini produces substantially longer responses than all other models, averaging 3,976 words per zero\-shot answer and 7,611 words per CoT answer — roughly 16×\\timesand 25×\\timesmore verbose thanGPT\-5\(243 and 310 words respectively\), and far exceeding the next most verbose model,Mistral Large 3\(856 and 1,424 words\)\. Individual zero\-shot responses range from 2,002 to 5,900 words, indicating that even Gemini’s shortest answers are substantially long\. This extreme verbosity obscures valid mathematical reasoning within lengthy outputs, confounding both automated evaluation and human assessment: evaluators reported difficulty identifying the core argument in Gemini’s responses, particularly under CoT where the reasoning is further expanded\.

Human vs\. LLM judge\.A consistent pattern emerges across models: the LLM judge and human experts largely converge forGPT\-5\(82% vs\. 81%\) but diverge notably for weaker models\. The gap is most pronounced forGemini 2\.5 Flash\-Liteunder CoT \(0% LLM vs\. 18% human\), driven by answer length rather than proof quality, as discussed above\. ForClaude Sonnet 4\.6, the LLM judge and human evaluators agree on zero\-shot \(9% both\) but diverge under CoT \(9% LLM vs\. 18% human\), suggesting that Claude’s longer CoT responses are partially credited by human experts but not by the automated judge\. ForMistral Large 3, the pattern reverses: the LLM judge scores higher on zero\-shot \(27% vs\. 12% human\), indicating that some answers the judge accepts as correct are rejected by human experts as incomplete or insufficiently rigorous\.Llama 3\.3 70Breceives 0% from human experts on zero\-shot — the lowest human score in the benchmark — while the LLM judge assigns 9%, again suggesting the judge applies a more lenient standard on short, plausible\-sounding but ultimately incorrect proofs\. These divergences confirm that graduate\-level proof evaluation cannot be reliably delegated to an LLM judge and that human assessment is essential for Group 3\.

Effect of CoT prompting\.CoT prompting produces mixed and largely negative results on Group 3\.GPT\-5shows no change under CoT \(82% in both conditions\), suggesting it has reached its ceiling on these problems\.Claude Sonnet 4\.6improves under human evaluation from 9% to 18%, the only model to benefit consistently from CoT at this level\.Mistral Large 3degrades from 27% to 9% \(LLM judge\) and from 12% to 9% \(human\), andLlama 3\.3 70Bshows marginal improvement from 0% to 3% \(human\) but no change under the LLM judge \(9% in both conditions\)\.Gemini 2\.5 Flash\-Litecollapses to 0% under the LLM judge and declines from 21% to 18% under human evaluation, consistent with the verbosity amplification effect described above\. Overall, CoT prompting does not provide the gains observed in Groups 1 and 2 for most models, suggesting that structured step\-by\-step instructions do not compensate for the fundamental limitations in graduate\-level mathematical reasoning at this stage of model development\.

![Refer to caption](https://arxiv.org/html/2606.03144v1/x5.png)Figure 5\.Comparison of Zero\-Shot \(ZS\) and Chain\-of\-Thought \(CoT\) accuracy on Group 3 \(proof\-based problems\) across five LLMs, evaluated by two methods: LLM\-as\-Judge \(solid bars\) and human judge experts average \(hatched bars\)\.Inter\-rater agreement\.Table[6](https://arxiv.org/html/2606.03144#S7.T6)reports Cohen’sκ\\kappafor all six evaluator pairs across both prompting conditions\. Human–human agreement is heterogeneous: the E1–E2 pair achieves substantial agreement under zero\-shot \(κ=0\.77\\kappa=0\.77\) but drops to moderate under CoT \(κ=0\.48\\kappa=0\.48\), while E1–E3 achieves substantial agreement on zero\-shot \(κ=0\.69\\kappa=0\.69\) and excellent agreement on CoT \(κ=0\.83\\kappa=0\.83\)\. The E2–E3 pair achieves moderate agreement across both conditions \(κ=0\.57\\kappa=0\.57ZS,0\.520\.52CoT\)\. Human–human agreement averagesκ=0\.68\\kappa=0\.68\(zero\-shot\) andκ=0\.61\\kappa=0\.61\(CoT\), both meeting the minimum threshold ofκ\>0\.60\\kappa\>0\.60for acceptable agreement\. Human–LLM agreement is moderate across all three expert pairs on zero\-shot, withκ\\kapparanging from0\.480\.48to0\.510\.51\(averageκ=0\.50\\kappa=0\.50\), indicating that the LLM judge captures a meaningful signal but does not fully align with any individual human expert\. Agreement improves under CoT for two of the three pairs: E2–LLM reachesκ=0\.64\\kappa=0\.64\(substantial\) and E3–LLM improves toκ=0\.55\\kappa=0\.55\(moderate\), while E1–LLM remains stable atκ=0\.52\\kappa=0\.52\.

These results point to two distinct sources of disagreement\. First, the inherent subjectivity of graduate proof evaluation — even among domain experts, partial credit boundaries are interpreted differently, as evidenced by the E1–E2 drop from substantial to moderate agreement under CoT \(κ=0\.77→0\.48\\kappa=0\.77\\to 0\.48\) and the residual moderate E2–E3 agreement \(κ=0\.52\\kappa=0\.52–0\.570\.57\)\. Second, the structural mismatch between the LLM judge’s strict binary verdict and the nuanced judgments of human evaluators produces a systematic gap, most visible forGemini 2\.5 Flash\-Litewhere response verbosity further compounds the difficulty of automated assessment\. Taken together, these findings underscore a central limitation of automated evaluation at the graduate level\.

Table 6\.Inter\-rater agreement \(Cohen’sκ\\kappa\) for Group 3 evaluation\([landis1977measurement,](https://arxiv.org/html/2606.03144#bib.bib21)\)\. Landis & Koch scale:κ<0\.40\\kappa<0\.40= Fair;0\.40≤κ<0\.600\.40\\leq\\kappa<0\.60= Moderate;0\.60≤κ<0\.800\.60\\leq\\kappa<0\.80= Substantial;κ≥0\.80\\kappa\\geq 0\.80= Excellent\.Failure Topics\.Table[7](https://arxiv.org/html/2606.03144#S7.T7)shows the failure mode distribution across the 11 Group 3 topics\. Failures are not uniformly distributed — certain topics consistently draw incorrect responses from multiple models and multiple evaluators, while others are handled correctly by most models\. The most broadly failed topic isodd cycles in 3\-connected graphs, where all four evaluators \(E1, E2, E3, and the LLM judge\) flag failures across all five models\. Type D \(incomplete reasoning\) dominates for human experts and the LLM judge alike, indicating that models understand the Fan Lemma argument in principle but consistently fail to complete the construction of the four required odd cycles\. Similarly,maximum length paths sharing a vertexandmatching saturatingXXvia degree conditionare flagged by all three human experts and the LLM judge, with Type B dominating — models apply the correct strategy \(pigeonhole or Hall’s theorem\) but introduce logical errors during execution\. Type A failures \(hallucinated definitions\) are rare in Group 3 but appear exclusively oneven cycles with large average degree, where E1 and E2 flag Type A across Claude, Gemini, Llama, and Mistral, indicating that these models fabricate or misstate the bipartite subgraph theorem used in the proof\. This is the only topic where human experts and the LLM judge disagree on the dominant failure type: humans classify the errors as Type A hallucinations while the LLM judge classifies the same answers as Type C \(wrong algorithm\), consistent with the hypothesis that the judge struggles to detect subtle definitional errors in long\-form proofs\. A systematic divergence between human and automated evaluation appears onconnectivity and edge\-connectivityproblems, where human experts flag both Type B and Type D as co\-dominant while the LLM judge assigns only Type B\. Similarly, onlong paths from minimum degree, the LLM judge identifies Type D \(incomplete reasoning\) as dominant while human experts collectively flag Type B\. These divergences suggest that models begin the correct argument — often reaching the key intermediate step — but fail to complete the final case analysis or induction, a distinction that human evaluators are better positioned to identify\. Across all topics, failures are predominantly of Type B \(correct algorithm, wrong execution\), and in the majority of cases at least two human evaluators independently flag the same failure type, indicating consistent and interpretable error patterns rather than isolated disagreements\.

Table 7\.Failure mode distribution by topic for Group 3 \(zero\-shot condition, all five models combined\)\. Each topic has two rows: the human row shows the judge label\(s\) of experts who flagged failures \(E1, E2, E3\); the LLM row shows the GPT\-4o judge verdict\. The dominant failure type per row is shown with a colored background\. The rightmost column lists models with at least one incorrect answer on that topic:![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/gpt.png)GPT\-5,![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Claude_AI.png)Claude,![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/GeminiAI.png)Gemini,![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/Meta.png)Llama,![[Uncaptioned image]](https://arxiv.org/html/2606.03144v1/logos/MistralAI.png)Mistral\.A= hallucinated definition;B= correct algorithm, wrong execution;C= wrong algorithm;D= incomplete reasoning\.Implications for the learning assistant use case\.From the perspective of a researcher using an LLM as a self\-study companion, Group 3 results carry a clear warning\.GPT\-5is the only model that can reliably assist with graduate\-level graph theory proofs, achieving near\-human accuracy on the evaluated problems\. All other models produce correct proofs fewer than one time in three, and in several cases —Llama 3\.3 70B\(0% human, zero\-shot\) and Mistral Large 3 \(12% human, zero\-shot\) — correct answers are vanishingly rare\. A student relying on these models for proof guidance would receive incorrect or incomplete arguments the majority of the time, with no reliable mechanism to detect the error, since the consistency analysis shows that most failures are reproduced identically across repeated queries\.

## 8\.Conclusion

This paper introducedGTBench, a curriculum\-grounded benchmark for evaluating the graph\-theoretic reasoning capabilities of LLMs as self\-study tools for researchers and students\. Spanning 63 problems across three groups — from undergraduate definitions to graduate proof construction —GTBenchwas designed to mirror the natural progression of graph theory instruction and to assess whether LLMs can serve as reliable learning assistant at each stage of that progression\. Five frontier models were evaluated under zero\-shot and chain\-of\-thought prompting, with automated LLM\-as\-judge evaluation for Groups 1 and 2 and human expert evaluation combined with LLM\-as\-judge for Group 3\. Our results reveal a clear and consistent hierarchy:GPT\-5dominates across all groups and conditions, approaching ceiling performance on Group 1 and maintaining meaningful accuracy on graduate\-level proofs\. All other models degrade substantially as problem difficulty increases, withLlama 3\.3 70Bachieving 0% under human evaluation on Group 3 zero\-shot\. A striking finding across all groups is that model failures are largely consistent across repeated runs — 95\.3% of incorrect zero\-shot responses on Group 1 and 98\.6% on Group 2 are reproduced identically — indicating that errors reflect firmly\-held misconceptions rather than random variation, a particularly concerning pattern for researchers who might accept a confident but wrong answer without verification\.

Beyond accuracy, our failure mode analysis reveals that Type B errors \(correct algorithm, wrong execution\) dominate Groups 1 and 2, while graduate\-level problems in Group 3 additionally surface Type D failures \(incomplete reasoning\) and, in several topics, genuine disagreement between human evaluators — a finding that itself confirms the inherent difficulty of automated proof assessment at this level\. The inter\-rater agreement observed in Group 3 \(κ\\kapparanging from0\.480\.48to0\.830\.83across human pairs\) underscores that graduate proof evaluation resists standardisation, and that the LLM judge, despite achieving moderate agreement with individual experts \(κ=0\.48\\kappa=0\.48–0\.640\.64\), systematically underestimates model performance on verbose or near\-complete proofs\. Taken together, these findings carry a practical conslusion for the AI\-enabled scientific research community: current LLMs can plausibly support self\-directed learning in graph theory at the introductory and intermediate levels, but graduate\-level proof construction requires human expert verification\.

Future Work\.WhileGTBenchestablishes a rigorous foundation for evaluating graph\-theoretic reasoning in LLMs, several limitations point to natural extensions\. The current 63\-problem scope, though carefully graded, restricts generalizability; expanding coverage to adjacent mathematical domains — combinatorics, linear algebra, abstract algebra, and topology — would clarify whether the sharp performance degradation observed at the graduate level is domain\-specific or reflects a fundamental ceiling in formal mathematical reasoning across frontier models\. Evaluation methodology represents the most immediate direction for improvement\. The current three\-point scoring scale \(0, 0\.5, 1\) is a coarse approximation of proof quality that collapses meaningful distinctions — a response that correctly states all lemmas but fails only the final deduction is scored identically to one that fails entirely after the first step\. Future work should replace this scheme with a continuous percentage\-based scoring rubric that awards partial credit proportionally across well\-defined proof dimensions: problem comprehension, strategy selection, intermediate reasoning, and final execution\. Such granularity would expose precisely where in the reasoning chain models break down, enable finer\-grained comparison across models and difficulty tiers, and provide a more honest signal for practitioners deciding when to trust LLM\-generated mathematical arguments\. The near\-perfect reproducibility of incorrect responses \(95\.3–98\.6% across groups\) remains the study’s most consequential finding\. Understanding whether this consistency originates from pretraining data gaps, distributional biases, or architectural properties is a prerequisite for designing systems in which errors are detectable rather than confidently invisible — a critical property for any tool intended to support rigorous scientific or educational work across any domain\.

## Acknowledgment

P\. Diehl was supported by the U\.S\. Department of Energy through the Los Alamos National Laboratory\. Los Alamos National Laboratory is operated by Triad National Security, LLC, for the National Nuclear Security Administration of U\.S\. Department of Energy \(Contract No\. 89233218CNA000001\)\. Approved by Los Alamos National Laboratory as LA\-UR\-26\-24106\.

## Data Availability

The LLM\-generated solutions for all problems and all models, along with the benchmark JSON files and scoring sheets, will be made publicly available on Zenodo upon acceptance of this paper\.

## Code Availability

The source code will be made publicly available on GitHub111and Zenodo upon acceptance of this paper\.

## References

- \[1\]Frontier AI LLMs, assistants, agents, services — Mistral AI — mistral\.ai\.[https://mistral\.ai](https://mistral.ai/)\.\[Accessed 09\-05\-2026\]\.
- \[2\]Graph Theory — diestel\-graph\-theory\.com\.[https://diestel\-graph\-theory\.com/](https://diestel-graph-theory.com/)\.\[Accessed 12\-05\-2026\]\.
- \[3\]OpenAI — openai\.com\.[https://openai\.com/](https://openai.com/)\.\[Accessed 09\-05\-2026\]\.
- \[4\]Anonymous\.HLE: A Human\-Level Evaluation Benchmark for Large Language Models\.arXiv preprint, 2024\.
- \[5\]Anonymous\.LemmaBench: Evaluating LLMs on Research\-Level Lemmas from arXiv Preprints\.arXiv preprint, 2024\.
- \[6\]Anonymous\.RealMath: A Benchmark for Mathematical Reasoning Derived from Research Papers and Forums\.arXiv preprint, 2024\.
- \[7\]Anthropic\.Introducing Sonnet 4\.6 — anthropic\.com\.[https://www\.anthropic\.com/news/claude\-sonnet\-4\-6](https://www.anthropic.com/news/claude-sonnet-4-6)\.\[Accessed 09\-05\-2026\]\.
- \[8\]Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman\.Training Verifiers to Solve Math Word Problems\.InAdvances in Neural Information Processing Systems \(NeurIPS\), 2021\.
- \[9\]Departament de Matemàtiques, Universitat Politècnica de Catalunya\.Mathematics 1 — Part I: Graph Theory\. Answers to Some Exercises\.[https://web\.mat\.upc\.edu/fib/matematiques1/docs/pm1\_graphs\_sol\.pdf](https://web.mat.upc.edu/fib/matematiques1/docs/pm1_graphs_sol.pdf), 2026\.Academic Year 2025–2026\.
- \[10\]Departament de Matemàtiques, Universitat Politècnica de Catalunya\.Mathematics 1 — Part I: Graph Theory\. Exercises and Problems\.[https://web\.mat\.upc\.edu/fib/matematiques1/docs/pm1\_graphs\.pdf](https://web.mat.upc.edu/fib/matematiques1/docs/pm1_graphs.pdf), 2026\.Academic Year 2025–2026\.
- \[11\]Patrick Diehl, Noujoud Nader, Steve Brandt, and Hartmut Kaiser\.Evaluating ai\-generated code for c\+\+, fortran, go, java, julia, matlab, python, r, and rust\.InEuropean Conference on Parallel Processing, pages 243–254\. Springer Nature Switzerland Cham, 2024\.
- \[12\]Patrick Diehl, Noujoud Nader, and Deepti Gupta\.Llm\-hpc\+\+: Evaluating llm\-generated modern c\+\+ and mpi\+ openmp codes for scalable mandelbrot set computation\.arXiv preprint arXiv:2512\.17023, 2025\.
- \[13\]Patrick Diehl, Noujoud Nader, Maxim Moraru, and Steven R Brandt\.Llm benchmarking with llama2: Evaluating code development performance across multiple programming languages\.Journal of Machine Learning for Modeling and Computing, 6\(3\), 2025\.
- \[14\]Bahare Fatemi, Jonathan Halcrow, and Bryan Perozzi\.Can Language Models Solve Graph Problems in Natural Language?InAdvances in Neural Information Processing Systems \(NeurIPS\), 2023\.
- \[15\]Simon Frieder, Luca Pinchetti, Ryan\-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier, and Julius Berner\.Mathematical Capabilities of ChatGPT\.InAdvances in Neural Information Processing Systems \(NeurIPS\), 2023\.
- \[16\]Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Bialer, Jaidn Gunning, Simon Lermen, and Fabien Roger\.FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI\.arXiv preprint arXiv:2411\.04872, 2024\.
- \[17\]Google DeepMind\.Models — Gemini API — Google AI for Developers — ai\.google\.dev\.[https://ai\.google\.dev/gemini\-api/docs/models](https://ai.google.dev/gemini-api/docs/models)\.\[Accessed 09\-05\-2026\]\.
- \[18\]Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt\.Measuring Mathematical Problem Solving with the MATH Dataset\.InAdvances in Neural Information Processing Systems \(NeurIPS\), 2021\.
- \[19\]Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa\.Large language models are zero\-shot reasoners\.Advances in neural information processing systems, 35:22199–22213, 2022\.
- \[20\]Adithya Kulkarni, Mohna Chakraborty, and Jay Bagga\.Evaluating large language models on solved and unsolved problems in graph theory: Implications for computing education\.Journal of Computing Sciences in Colleges, 41\(9\):83–100, 2026\.
- \[21\]GG Landis JRKoch\.The measurement of observer agreement for categorical data\.Biometrics, 33\(1\):159174, 1977\.
- \[22\]Junqi Liu, Xiaohan Lin, Jonas Bayer, Yael Dillies, Weijie Jiang, Xiaodan Liang, Roman Soletskyi, Haiming Wang, Yunzhou Xie, Beibei Xiong, et al\.Combibench: Benchmarking llm capability for combinatorial mathematics\.arXiv preprint arXiv:2505\.03171, 2025\.
- \[23\]John Mackey\.Math 484: Graph Theory — homework and solutions\.[https://www\.math\.cmu\.edu/~jmackey/math484/](https://www.math.cmu.edu/~jmackey/math484/), 2020\.Accessed: 2025\.
- \[24\]Meta AI\.meta\-llama/Llama\-3\.3\-70B\-Instruct · Hugging Face — huggingface\.co\.[https://huggingface\.co/meta\-llama/\{L\}lama\-3\.3\-70\{B\}\-\{I\}nstruct](https://huggingface.co/meta-llama/%7BL%7Dlama-3.3-70%7BB%7D-%7BI%7Dnstruct)\.\[Accessed 09\-05\-2026\]\.
- \[25\]Akshay Mhatre, Noujoud Nader, Patrick Diehl, and Deepti Gupta\.Can llms find bugs in code? an evaluation from beginner errors to security vulnerabilities in python and c\+\+\.InSoutheastCon 2026, pages 1–8\. IEEE, 2026\.
- \[26\]Noujoud Nader, Patrick Diehl, Steve Brandt, and Hartmut Kaiser\.LLM & HPC: Benchmarking deepseek’s performance in high\-performance computing tasks\.InInternational Conference on High Performance Computing, pages 626–638\. Springer, 2025\.
- \[27\]Noujoud Nader, Mahmoud Hassan, W Falou, Ahmad Diab, Sally Al\-Omar, Mohamad Khalil, and Catherine Marque\.Classification of pregnancy and labor contractions using a graph theory based analysis\.In2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society \(EMBC\), pages 2876–2879\. IEEE, 2015\.
- \[28\]Noujoud Nader, Catherine Marque, Mahmoud Hassan, Wassim Falou, Ahmad Diab, and Mohamad Khalil\.Pregnancy monitoring using graph theory based analysis\.In2015 International Conference on Advances in Biomedical Engineering \(ICABME\), pages 73–76\. IEEE, 2015\.
- \[29\]Antoine Peyronnet, Fabian Gloeckle, and Amaury Hayat\.LemmaBench: A Live, Research\-Level Benchmark to Evaluate LLM Capabilities in Mathematics, 2026\.
- \[30\]Yuxiang Wang, Xinnan Dai, Wenqi Fan, and Yao Ma\.Exploring graph tasks with pure llms: A comprehensive benchmark and investigation\.arXiv preprint arXiv:2502\.18771, 2025\.
- \[31\]Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al\.Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems, 35:24824–24837, 2022\.
- \[32\]Danning Xie, Mingwei Zheng, Xuwei Liu, Jiannan Wang, Chengpeng Wang, Lin Tan, and Xiangyu Zhang\.Core: Benchmarking llms’ code reasoning capabilities through static analysis tasks\.Advances in Neural Information Processing Systems, 38, 2026\.
- \[33\]Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar\.Leandojo: Theorem proving with retrieval\-augmented language models\.Advances in Neural Information Processing Systems, 36:21573–21612, 2023\.
- \[34\]Kunhao Zheng, Jesse Michael Han, and Stanislas Polu\.Minif2f: a cross\-system benchmark for formal olympiad\-level mathematics\.arXiv preprint arXiv:2109\.00110, 2021\.
- \[35\]Lianmin Zheng, Wei\-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al\.Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.Advances in neural information processing systems, 36:46595–46623, 2023\.

Similar Articles

GraphInfer-Bench: Benchmarking LLM's Inference Capability on Graphs

arXiv cs.LG

Introduces GraphInfer-Bench, a benchmark to evaluate whether LLMs can perform graph inference—producing open-ended answers about a node and its neighborhood that cannot be retrieved from a single node or path. Experiments show that even frontier LLMs lag behind plain GNNs on these tasks, revealing a capability gap.

PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research

Hugging Face Daily Papers

PRL-Bench is a comprehensive benchmark for evaluating LLMs' capabilities in frontier physics research, constructed from 100 curated Physical Review Letters papers across five physics subfields. The benchmark reveals significant gaps in current LLM performance (best scores below 50%), designed to test end-to-end research workflows, complex reasoning, and autonomous exploration.