CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models
Summary
CombEval is a dynamic benchmark for evaluating combinatorial counting in large language models, using typed specifications to generate problems with solver-verified answers. It tests 11 LLMs under direct and code-augmented settings and finds brittleness on ordered objects, indistinguishable elements, relative constraints, and nested dependencies.
View Cached Full Text
Cached at: 06/20/26, 02:33 PM
# CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models Source: [https://arxiv.org/html/2606.19788](https://arxiv.org/html/2606.19788) Yuxu Zhou1,Ondřej Kuželka2,Yuyi Wang3,4,Yuanhong Wang1,Yi Chang1,5,611footnotemark:1 1School of Artificial Intelligence, Jilin University, Changchun, China 2Czech Technical University in Prague, Prague, Czech Republic 3CRRC Zhuzhou Institute, Zhuzhou, China4Tengen Intelligence Institute, China 5International Center of Future Science, Jilin University 6Engineering Research Center of Knowledge\-Driven Human\-Machine Intelligence, MOE, China Correspondence:[lucienwang@jlu\.edu\.cn](https://arxiv.org/html/2606.19788v1/mailto:[email protected]) ###### Abstract We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models\. CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies, and constraints, enabling controlled generation of natural\-language counting problems with exact solver\-verified answers\. Unlike static collections, CombEval supports systematic variation of object type, entity scale, constraint count, and reasoning depth\. We evaluate 11 LLMs under direct and code\-augmented settings and find that models remain brittle on ordered objects, indistinguishable elements, relatively positional constraints, and nested object dependencies\. Error analysis further identifies failures in constraint interpretation and counting principles\. CombEval provides a diagnostic testbed for studying when and why LLMs fail at combinatorial reasoning\. The code and generated benchmark suites are publicly available at[https://github\.com/YuxuZhou\-CN/combination\-problem\-generation](https://github.com/YuxuZhou-CN/combination-problem-generation)\. CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models Yuxu Zhou1, Ondřej Kuželka2, Yuyi Wang3,4, Yuanhong Wang1††thanks:Corresponding authors\., Yi Chang1,5,611footnotemark:11School of Artificial Intelligence, Jilin University, Changchun, China2Czech Technical University in Prague, Prague, Czech Republic3CRRC Zhuzhou Institute, Zhuzhou, China4Tengen Intelligence Institute, China5International Center of Future Science, Jilin University6Engineering Research Center of Knowledge\-Driven Human\-Machine Intelligence, MOE, ChinaCorrespondence:[lucienwang@jlu\.edu\.cn](https://arxiv.org/html/2606.19788v1/mailto:[email protected]) ## 1Introduction *Combinatorial counting*\(*CO*\) or*enumeration*is a fundamental branch of mathematicsStanley \([2011](https://arxiv.org/html/2606.19788#bib.bib25)\), with one of its most critical functions lies in calculating probability\. For most equiprobable events, probability is derived by using CO to count the number of favorable outcomes and the total number of possible outcomes, then computing the ratio of these two values\. CO is also crucial in key domains such as finance, logistics, and healthcare, where it empowers constraint\-driven decision optimization by quantifying viable options amid complex requirements\. Yet its complexity stems primarily from the intricate nature of these constraints, which are often interdependent, dynamic, and multi\-faceted, making it challenging to develop one\-size\-fits\-all systematic solutions that adapt to diverse real\-world scenarios\. This is where large language models \(LLMs\) emerge as a promising approach to CO problems\. The strength of LLMs in abstract reasoning and processing unstructured, complex constraints allows them to tackle CO problems without relying on rigid, scenario\-specific algorithms\. LLMs have demonstrated remarkable capabilities in tasks such as natural language processingZhaoet al\.\([2023](https://arxiv.org/html/2606.19788#bib.bib1)\); Kalyan \([2024](https://arxiv.org/html/2606.19788#bib.bib2)\), code generationJianget al\.\([2024](https://arxiv.org/html/2606.19788#bib.bib3)\), and question answeringKuanget al\.\([2025](https://arxiv.org/html/2606.19788#bib.bib7)\), with their superior performance on mathematical reasoning benchmarks being particularly noteworthyXuet al\.\([2025](https://arxiv.org/html/2606.19788#bib.bib4)\)\. Current mainstream modelsOpenAI \([2025](https://arxiv.org/html/2606.19788#bib.bib74)\); Qwen Team \([2025](https://arxiv.org/html/2606.19788#bib.bib73)\); DeepSeek\-AI \([2025](https://arxiv.org/html/2606.19788#bib.bib78)\)have achieved or even surpassed human\-average accuracy on widely used mathematical reasoning datasets like GSM8KCobbeet al\.\([2021](https://arxiv.org/html/2606.19788#bib.bib8)\)and MATHHendryckset al\.\([2021](https://arxiv.org/html/2606.19788#bib.bib27)\), sparking widespread interest in exploring the potential of LLMs to tackle more complex mathematical problems and reasoning tasksYanget al\.\([2025](https://arxiv.org/html/2606.19788#bib.bib5)\)\. Current benchmarks for evaluating the CO capabilities of LLMs are often contained within broader mathematical reasoning datasetsHendryckset al\.\([2021](https://arxiv.org/html/2606.19788#bib.bib27)\); Zhenget al\.\([2021](https://arxiv.org/html/2606.19788#bib.bib11)\); Veeraboina \([2023](https://arxiv.org/html/2606.19788#bib.bib9)\);[Xuet al\.](https://arxiv.org/html/2606.19788#bib.bib44); Xuejunet al\.\([2025](https://arxiv.org/html/2606.19788#bib.bib12)\)\. For example, the MATH datasetHendryckset al\.\([2021](https://arxiv.org/html/2606.19788#bib.bib27)\)includes12451245problems in the "Counting and Probability" category, covering topics such as basic counting principles, permutations and combinations, and probability calculations\. There are also limited benchmarks specifically targeting CO problems, such as CombiBenchLiuet al\.\([2025](https://arxiv.org/html/2606.19788#bib.bib14)\), where100100CO problems are collected spanning from middle school to International Mathematical Olympiad levels\. Though facilitating initial assessment of LLMs’ CO capabilities, these static benchmarks usually face the*data contamination*and*spurious reasoning*issues prevalent in mathematics benchmarks\. The data contamination issue arises when test questions or their variants are present in the training data of LLMsBalloccuet al\.\([2024](https://arxiv.org/html/2606.19788#bib.bib29)\); Golchin and Surdeanu \([2023](https://arxiv.org/html/2606.19788#bib.bib30)\); Zhouet al\.\([2025](https://arxiv.org/html/2606.19788#bib.bib32)\)\. This issue makes models potentially able to achieve high scores through memorization rather than genuine reasoning, leading to overestimation of their true reasoning capabilitiesDenget al\.\([2024](https://arxiv.org/html/2606.19788#bib.bib6)\)\. On the other hand, even when the benchmark problems are not directly present in the training data, models may still exploit superficial textual patterns to arrive at correct answers without engaging in deep logical reasoningMirzadehet al\.\([2024](https://arxiv.org/html/2606.19788#bib.bib58)\); Laiet al\.\([2025](https://arxiv.org/html/2606.19788#bib.bib59)\); Boye and Moell \([2025](https://arxiv.org/html/2606.19788#bib.bib60)\)\. In scenarios requiring structural abstraction like combinatorics, models might memorize the superficial association that "choosingkkitems fromnn" corresponds to binomial coefficients, but their reasoning ability collapses rapidly when additional constraints are appliedShresthaet al\.\([2025](https://arxiv.org/html/2606.19788#bib.bib61)\)\. Figure 1:Overview of the CombEval generation framework with the current Cofola backend\. The generator samples a typed object DAG𝒫=⟨D,O,C⟩\\mathcal\{P\}=\\langle D,O,C\\rangle, attaches compatible constraints, verbalizes the resulting formal specification with templates, and verifies the exact answer using the Cofola solver\.Therefore, this paper proposes CombEval, a dynamic evaluation framework for CO in LLMs\. CombEval synthesizes CO instances from a typed formal specification rather than from a fixed question bank, as outlined in[Figure˜1](https://arxiv.org/html/2606.19788#S1.F1)\. This construction makes it possible to sample fresh problems after benchmark design time, while also controlling the structural variables that affect difficulty\. Moreover, because every instance is grounded in a formal program and solved exactly before evaluation, the benchmark can test whether models reason about the underlying structure rather than merely matching familiar surface templates\. At the core of CombEval is Cofola\(Wanget al\.,[2026](https://arxiv.org/html/2606.19788#bib.bib88)\), a typed declarative language and solver for combinatorial counting\. Following Cofola, we represent a problem as⟨D,O,C⟩\\langle D,O,C\\rangle, whereDDis a finite entity domain,OOis a set of combinatorial objects defined overDD\(e\.g\., sets, sequences, partitions\) with potential dependencies between them, andCCis a set of constraints over these objects\. Cofola provides seven object types including sets, bags, tuples, sequences, circles, partitions, and compositions, together with constraints for membership, subset and disjointness, multiplicity and cardinality, absolute tuple positions, relative order patterns, and group\-level conditions \(see[Tables˜1](https://arxiv.org/html/2606.19788#S3.T1)and[2](https://arxiv.org/html/2606.19788#S3.T2)\)\. Its solver compiles the formal specification into weighted first\-order model counting \(WFOMC\) instances, which can be solved efficiently with off\-the\-shelf WFOMC solversvan Bremen and Kuzelka \([2021](https://arxiv.org/html/2606.19788#bib.bib82)\); Kuzelka \([2021](https://arxiv.org/html/2606.19788#bib.bib83)\); Zouet al\.\([2025](https://arxiv.org/html/2606.19788#bib.bib87)\), enabling exact answer verification for a wide range of combinatorial problems\. Using CombEval, we construct several solver\-verified evaluation suites, and conduct a systematic evaluation of1111mainstream general\-purpose, math\-specialized, and thinking\-specialized LLMs, including both open\-source models \(e\.g\., Qwen, DeepSeek, LLaMA, and gpt\-oss\) and closed\-source models \(e\.g\., GPT\-5 and Gemini\-3\)\. The main observations are: - •Model scale and advanced reasoning techniques lead to improved overall accuracy under both zero\-shot and code\-augmented settings\. However, all evaluated models still show clear degradation on typed objects that require indistinguishable elements, ordered structures, or multi\-step dependencies\. - •CombEval enables fine\-grained difficulty control\. Increasing entity size, constraint count, or object\-dependency depth predictably lowers model accuracy, showing that the generated suites support granular model comparisons rather than only a single aggregate score\. - •Surface template variations can cause performance fluctuations to some extent, especially for less capable \(but still strong\) models\. However, the latestgpt\-5\.5, one of the strongest models, show greater robustness to these variations, suggesting that their reasoning capabilities are not tightly coupled to specific prompt formulations\. ## 2Related Work Mathematical reasoning benchmarks such as GSM8KCobbeet al\.\([2021](https://arxiv.org/html/2606.19788#bib.bib8)\)and MATHHendryckset al\.\([2021](https://arxiv.org/html/2606.19788#bib.bib27)\)have established a foundation for evaluating arithmetic and algebraic capabilities including combinatorial counting in LLMs, but they also face significant limitations\. Primarily, these static datasets are highly susceptible to data contamination\. As pre\-training corpora expand, test questions are increasingly memorized rather than solved, distorting evaluation resultsBalloccuet al\.\([2024](https://arxiv.org/html/2606.19788#bib.bib29)\); Golchin and Surdeanu \([2023](https://arxiv.org/html/2606.19788#bib.bib30)\); Zhouet al\.\([2025](https://arxiv.org/html/2606.19788#bib.bib32)\)\. While researchers have attempted to mitigate this through dynamic generation in other reasoning fieldsShiet al\.\([2022](https://arxiv.org/html/2606.19788#bib.bib23)\); Saparovet al\.\([2023](https://arxiv.org/html/2606.19788#bib.bib17)\); Wanet al\.\([2024](https://arxiv.org/html/2606.19788#bib.bib18)\); Opedalet al\.\([2024](https://arxiv.org/html/2606.19788#bib.bib31)\); Wanget al\.\([2025](https://arxiv.org/html/2606.19788#bib.bib24)\); Ariyaniet al\.\([2025](https://arxiv.org/html/2606.19788#bib.bib16)\), constructing benchmarks specifically for CO presents unique challenges\. First, CO problems possess high structural and semantic diversity, making it difficult to design a unified framework that covers the full spectrum of problem types\. Second, efficient solutions often require specialized mathematical tools, such as generating functions and inclusion\-exclusion principlesStanley \([2011](https://arxiv.org/html/2606.19788#bib.bib25)\); Pak \([2019](https://arxiv.org/html/2606.19788#bib.bib22)\); Ferrariset al\.\([2015](https://arxiv.org/html/2606.19788#bib.bib26)\), which complicates the design of automated answer verification\. Finally, as many CO problems go beyond polynomial\-time complexityValiant \([1979](https://arxiv.org/html/2606.19788#bib.bib15)\), generation and evaluation frameworks must be carefully designed to ensure scalability without resorting to brute\-force enumeration\. Solver\-backed benchmark generation therefore depends not only on natural\-language templates, but also on the expressiveness and reliability of the formal backend\. There are various existing methods for automatically solving CO problems, ranging from Constraint Satisfaction Programming \(CSP\)Akgünet al\.\([2022](https://arxiv.org/html/2606.19788#bib.bib76)\), Answer Set Programming \(ASP\)Gebseret al\.\([2022](https://arxiv.org/html/2606.19788#bib.bib72)\), to lifted inference techniquesVan Den Broecket al\.\([2011](https://arxiv.org/html/2606.19788#bib.bib70)\); Taghipouret al\.\([2012](https://arxiv.org/html/2606.19788#bib.bib79)\); Kuzelka \([2021](https://arxiv.org/html/2606.19788#bib.bib83)\)\. We provide a brief overview of these methods in[Appendix˜B](https://arxiv.org/html/2606.19788#A2)\. Recently,Totiset al\.\([2023](https://arxiv.org/html/2606.19788#bib.bib19)\)proposed CoLa/CoSo, a lifted framework for modeling and solving CO problems\. CoLa/CoSo provides an important lifted approach to combinatorial counting, but it is centered on a more restricted single\-configuration language, where only one combinatorial object can be defined and solved per problem\. The updated Cofola language and solver\(Wanget al\.,[2026](https://arxiv.org/html/2606.19788#bib.bib88)\)used in CombEval extends this line with typed object dependencies, sets and bags, ordered objects, grouped objects, and WFOMC\-based solving\. This broader formal layer allows CombEval to generate larger and more structurally varied suites while retaining exact answer verification\. ## 3Formalization of Combinatorial Counting Problems CombEval follows the updated Cofola semantics and formalizes each CO problem as a triple𝒫=⟨D,O,C⟩\\mathcal\{P\}=\\langle D,O,C\\rangle\. HereDDis a finite domain of entities,OOis a set of typed combinatorial objects overDD, andCCis a set of constraints over these objects\. The answer is the number of valid combinatorial structures, i\.e\., assignments of concrete instances to every object inOOthat respect the object dependencies and satisfy all constraints inCC\. We introduce the components below with the following running example\. ###### Example 1 I have seven books on a shelf\. Two are math books and one is a physics book\. How many ways can I select55books and arrange them if 1\) the two math books and one physics book must be included, 2\) the math books must be adjacent, and 3\) both math books must be to the left of the physics book? ##### Entity domain The entity domainDDcontains the atomic objects that may appear in the count\. In Example[1](https://arxiv.org/html/2606.19788#Thmexample1),DDcontains seven book entities, including two math booksM1,M2M\_\{1\},M\_\{2\}, one physics bookPP, and four other booksO1,…,O4O\_\{1\},\\dots,O\_\{4\}\. Entities may be distinguishable, as in ordinary sets, or may appear with multiplicity inside bags \(multisets\), where indistinguishable copies must not be over\-counted\. ##### Combinatorial objects The object setOOcontains the combinatorial structures defined overDDthat are relevant to the problem\. Each object is defined either directly fromDDor derived from other objects through combinatorial operations\. For instance, for the problem in Example[1](https://arxiv.org/html/2606.19788#Thmexample1),OOcontains a set object defined asS=set\(M1,M2,P,O1,…,O4\)S=\\texttt\{set\}\(M\_\{1\},M\_\{2\},P,O\_\{1\},\\dots,O\_\{4\}\), a subset object defined asS′=choose\(S,5\)S^\{\\prime\}=\\texttt\{choose\}\(S,5\), and a sequence object defined asT=sequence\(S′\)T=\\texttt\{sequence\}\(S^\{\\prime\}\)\.[Table˜1](https://arxiv.org/html/2606.19788#S3.T1)summarizes the object types and representative operators available in Cofola\. Table 1:Cofola object types and representative operators used by CombEval\. Sets and bags are basic objects that can be defined directly from the entity domain, while tuples, sequences, circles, partitions, and compositions are derived from existing set or bag objects\.Object TypeRepresentative OperatorsDescriptionSetset,choose,union,intersect,diff,suppUnordered entities and set operations\.Bagbag,choose,choose\_replace,add\_union,union,intersect,diffMultisets and replacement\-based choices\.Tupletuple,choose\_tuple,choose\_replace\_tupleOrdered arrangements supporting absolute\-position constraints\.Sequencesequence,choose\_sequence,choose\_replace\_sequenceLinear arrangements supporting relative\-position patterns\.Circlecircle,choose\_circle,choose\_replace\_circleCircular arrangements, optionally identifying reflection\.PartitionpartitionUnordered grouping into disjoint parts\.Compositioncompose,indexOrdered grouping into disjoint parts and access to a specific group\. ##### Constraints The constraint setCCrestricts the valid configurations of objects inOO\. Cofola supports constraints over basic, ordered, and grouped objects, as summarized in[Table˜2](https://arxiv.org/html/2606.19788#S3.T2)\. For Example[1](https://arxiv.org/html/2606.19788#Thmexample1), the constraints includesubsetconstraints to ensure the required books are included, atogetherconstraint to enforce adjacency, and aprecedenceconstraint to ensure the math books are to the left of the physics book\. The language also allows logical combinations of atomic constraints, enabling negated or disjunctive conditions when needed\. Table 2:Representative constraint families supported by the updated Cofola backend\.Object ScopeConstraint FamilyDescriptionSet/BagMembership, subset, equality, disjointnessRequire entities or objects to be included, excluded, equal, nested, or disjoint\.Set/BagCardinality and multiplicityConstraints over cardinalities and multiset multiplicities\.TupleAbsolute\-position constraintsConstrain a fixed tuple position, e\.g\.,T\[i\] == eorT\[i\] in S\.TupleCount and deduplicated\-count constraintsCount positions occupied by elements, or count distinct appearing elements\.Sequence/CircleRelative\-position patternsExpress together, precedence, immediate predecessor, and adjacency patterns\.Sequence/CirclePattern\-count constraintsCount occurrences of relative\-position patterns\.Partition/CompositionGroup\-level and indexed constraintsConstrain every group, or constrain a specific indexed group in a composition\. Putting these components together, Example[1](https://arxiv.org/html/2606.19788#Thmexample1)can be expressed as the following Cofola program: `We do not expect that CombEval covers all CO problems\. More complex domains, such as graph objects, grid placements, arithmetic over entity\-level numeric attributes, and arbitrary symmetry groups, would require additional language constructs and hand\-crafted approaches; some of these extensions run into known \#𝖯\\mathsf\{P\}\-hard barriers Valiant \(1979\)\.` `4 Combinatorial Counting Instance Generation This section details the generation method for CO instances that we use in this work\. Problem Generation The generation of a CO problem follows from its formalization 𝒫=⟨D,O,C⟩\\mathcal\{P\}=\\langle D,O,C\\rangle\. As shown in Figure 1, our framework generates problems through a typed object\-DAG pipeline\. First, we initialize a finite entity domain with nn entities and sample mm initial set or bag objects, forming the reusable source pool\. Next, we construct a Cofola object DAG by repeatedly sampling type\-compatible operators, such as choose, choose\_replace, set/bag operations, tuple, sequence, circle, compose, partition, and indexing\. The maximum dependency depth dd and operator count ono\_\{n\} control the length and breadth of the generated reasoning chain\. Finally, we sample kk constraints whose types match the selected objects, covering structural constraints, cardinality and multiplicity constraints, absolute\-position constraints, relative\-position patterns, and group\-level constraints\. Natural Language Translation We use a template\-based method to convert the generated CO problem into natural language\. Each Cofola object declaration and constraint is verbalized by a template associated with its type\. For example, a choose declaration becomes a sentence describing a selection without replacement, a sequence declaration becomes an arrangement in a line, and a compose declaration becomes a distribution into labeled groups\. The final query is generated from the target object or object set being counted, e\.g\., "How many different ways can we obtain the resulting objects?" The full template library is provided in Appendix˜E\. Problem Solving We use the Cofola solver as our backend because it supports a broader typed object language, multiple dependent objects, and a WFOMC\-based compilation pipeline\. For each generated instance, Cofola computes the exact answer under a timeout budget; instances with solver failures, unsupported features, or unstable answers are filtered out before LLM evaluation\. 5 Experiments In this section, we present experiments evaluating the performance of various LLMs on the CombEval dataset generated using the proposed framework\. Our experimental evaluation is systematically designed to address the following three primary research questions: • RQ1: Capabilities of LLMs in Combinatorial Reasoning\. Can the generated combinatorial counting problems in CombEval faithfully reflect the capability hierarchy of current LLMs, producing performance trends consistent with existing benchmarks where stronger models achieve higher accuracy? • RQ2: Controllability of Problem Difficulty\. Can our generation framework effectively and predictably regulate the reasoning difficulty of problem instances by tuning multiple structural parameters? • RQ3: Impact of Prompt Templates on LLM Reasoning\. Does the surface form of custom problem templates bias or constrain the models’ mathematical reasoning performance when underlying logical structures are kept strictly identical? 5\.1 Experimental Setup Models\. We evaluate a set of LLMs on the CombEval dataset, including both open\-source and closed\-source systems\. The open\-source models comprise Meta\-Llama\-3\-8B\-Instruct, DeepSeek\-Math\-7B\-Instruct Shao et al\. \(2024\), DeepSeek\-V3\.2 DeepSeek\-AI \(2025\), Qwen2\.5\-Math\-7B\-Instruct Yang et al\. \(2024\), Qwen3\-4B\-Thinking\-2507, Qwen3\-30B\-Thinking\-2507 Qwen Team \(2025\), gpt\-oss\-20B OpenAI \(2025\), and LLaMA\-3\.3\-70B\-Instruct Dubey et al\. \(2024\), covering math\-specialized, instruction\-tuned, and reasoning\-enhanced architectures across different scales\. For closed\-source models, we evaluate the state\-of\-the\-art models gpt\-5\.1, gpt\-5\.5 and gemini\-3\-flash\-preview\-thinking\. Closed\-source models are evaluated through their official APIs, while open\-source models are run locally on a machine with 4 NVIDIA A40 GPUs\. All models are used under their licensed terms for research purposes\. Evaluation Protocol\. We use a zero\-shot setting with only the problem description and the required output format\. Temperature is set to 0 for deterministic outputs, with a maximum generation length of 4096 tokens\. The full prompt templates used for evaluation are provided in Appendix C\.2\. Metrics\. Accuracy is the primary metric\. We use regular expressions to extract the final numerical answer; failure or mismatch is counted as an error\. For ease of analysis, we categorize the Cofola objects into four main types: Choose \(objects derived from choose or choose\_replace\), Group \(objects derived from compose or partition\), Tuple \(objects derived from tuple or choose\_tuple\), and Sequence \(objects derived from sequence or choose\_sequence\)\. 5\.2 RQ1: Capabilities of LLMs in Combinatorial Reasoning Using CombEval, we vary the key difficulty parameters: constraint count kk from 0 to 3, entity size nn over \{6,8,10,12\}, and number of combinatorial operations ono\_\{n\} from 1 to 4\. We fix m=1m=1 and leave the DAG depth dd unconstrained, allowing the number of operations per layer to vary freely\. With a 2000\-second Cofola timeout to filter unstable instances, we construct 1,500 high\-quality CombEval problems, whose type distribution is reported in the appendix C\.1\. For a sanity check, we also use CoLa/CoSo to solve the same problem set, where 705705 problems are solvable by CoLa/CoSo that produce the same answers as Cofola\. To comprehensively assess reasoning capabilities, we evaluate the models under two distinct paradigms: standard natural language reasoning \(Zero\-shot\) and code\-augmented reasoning, where models are prompted to return executable Python code to compute the final answer\. The Python setting is designed to test whether models can leverage programmatic reasoning to handle complex combinatorial structures\. As shown in Figure 2, the generated evaluation problems exhibit strong discriminative power and clearly characterize the capability hierarchy of current LLMs\. Stronger models with larger parameter scales or advanced reasoning mechanisms, such as gpt\-5\.5 and gemini\-3\-flash\-preview\-thinking, achieve substantial performance advantages in both settings, whereas smaller models show clear limitations\. This performance ladder not only validates CombEval as an effective benchmark, but also directly reveals the capability boundaries of different LLM tiers when facing structured mathematical reasoning\. Figure 2: Overall accuracy comparison of evaluated LLMs on CombEval\. Models were tested using standard natural language reasoning \(Zero\-shot\) and code\-augmented reasoning \(Python\)\. Due to the exceedingly weak coding capabilities of smaller models, which frequently fail to return compliant and executable code, they were not evaluated under the Python setting\. 5\.3 RQ2: Controllability of Problem Difficulty To verify whether the generation parameters selected in our framework can effectively and predictably regulate problem difficulty, we adopt a controlled\-variable design and analyze the impact of entity size, constraint count, and reasoning depth on LLM performance\. We also conduct experiments on the runtime of the backend CO solver under different d,n,kd,n,k\. The results, shown in Section˜C\.3, indicate that nn and kk have a significant impact on the solving time, while dd has a moderate effect\. Though, the impact of dd may be more pronounced when considering the basic combinatorial reasoning ability of LLMs, deeper depth often requires more complex multi\-step reasoning\. 5\.3\.1 Impact of Operation Type and Difficulty Parameters Figure 3: Impact of entity size and constraint count on model performance\. We analyze operation type, constraint count, and entity scale on the same controlled subset\. To isolate these effects, we fix m=1m=1, on=1o\_\{n\}=1, and d=1d=1 \(i\.e\., a single non\-nested operation\) and vary entity size n∈\{6,8,10,12,14\}n\\in\\\{6,8,10,12,14\\\} and constraint count k∈\{0,1,2,3\}k\\in\\\{0,1,2,3\\\}\. For each configuration, we generate 25 valid instances under the same 2000\-second Cofola timeout, resulting in 2,000 problems\. As shown in Figure 3, both factors consistently increase problem difficulty\. Adding more constraints leads to a clear accuracy drop for most models, indicating that membership, positional, and relative\-order constraints effectively disrupt simple counting patterns\. Similarly, increasing nn enlarges the combinatorial search space and places greater pressure on numerical calculation and symbolic reasoning\. These trends confirm that CombEval can control difficulty in a fine\-grained and predictable way through structural generation parameters\. Table 3 further shows that operation type substantially affects model performance\. Across models, Sequence is the most challenging category, with the lowest overall average accuracy, suggesting that ordered selections with relative\-position constraints are particularly difficult for current LLMs, likely because they require more precise state tracking and constraint handling\. Choose is comparatively easier, while Group and Tuple fall between these extremes\. The same model hierarchy remains visible across operation types: stronger models such as gpt\_5\.5, gemini\_3\_flash\_preview\_thinking, and DeepSeek\_V3\.2 consistently outperform smaller models, but their performance still varies substantially by operator\. This confirms that CombEval not only separates model tiers overall, but also exposes fine\-grained weaknesses tied to specific combinatorial operations\. Table 3: Accuracy across combinatorial operation types\. Gray cells indicate the best open\-source result, and blue cells indicate the best closed\-source result\. Model CHOOSE GROUP TUPLE SEQUENCE Avg\. Open\-source Models Meta\_Llama\_3\_8B\_Instruct 0\.026 0\.008 0\.028 0\.018 0\.020 deepseek\_math\_7b\_instruct 0\.074 0\.010 0\.054 0\.040 0\.045 Qwen2\.5\_Math\_7B\_Instruct 0\.252 0\.166 0\.316 0\.316 0\.263 llama\_3\.3\_70b\_instruct 0\.462 0\.254 0\.368 0\.380 0\.366 Qwen3\_4B\_Thinking\_2507 0\.518 0\.408 0\.380 0\.242 0\.387 Qwen3\_30B\_A3B\_Thinking\_2507 0\.636 0\.256 0\.428 0\.254 0\.394 gpt\_oss\_20b 0\.672 0\.630 0\.616 0\.458 0\.594 DeepSeek\_V3\.2 0\.840 0\.744 0\.730 0\.540 0\.714 Closed\-source Models gpt\_5\.1 0\.920 0\.744 0\.802 0\.498 0\.741 gemini\_3\_flash\_preview\_thinking 0\.978 0\.908 0\.906 0\.668 0\.865 gpt\_5\.5 0\.996 0\.992 0\.992 0\.764 0\.936 Overall Average 0\.579 0\.465 0\.511 0\.380 0\.484 5\.3\.2 Impact of Reasoning Depth Beyond horizontal constraint accumulation, another key source of difficulty is multi\-step operator nesting, which requires chain\-structured reasoning\. To study this factor in isolation, we use a minimally nested linear execution chain \(e\.g\., Choose→Choose→Tuple\\textsc\{Choose\}\\rightarrow\\textsc\{Choose\}\\rightarrow\\textsc\{Tuple\}\)\. We eliminate confounding factors by fixing n=10n=10, m=1m=1, and k=0k=0, while strictly enforcing that the nesting depth equals the number of operations \(d=ond=o\_\{n\}\)\. We systematically vary this depth across \{1,2,3,4\}\\\{1,2,3,4\\\}\. For each depth level, we generate 300 instances, resulting in 1,200 depth\-controlled problems\. As illustrated in Figure 4, model accuracy degrades substantially as the nesting depth dd increases\. Even in the absence of additional constraints and within a restricted entity space, each successive operation forces the model to track extended intermediate state chains, thereby exacerbating the risk of reasoning drift and error accumulation\. This monotonic performance degradation quantitatively demonstrates CombEval’s efficacy in modulating the difficulty of long\-chain combinatorial reasoning via depth adjustments\. Furthermore, while frontier models equipped with advanced reasoning paradigms \(e\.g\., DeepSeek\-V3\.2, gpt\-5\.5, and gemini\-3\-flash\-preview\-thinking\) exhibit greater resilience, they are not immune to this degradation\. Figure 4: Impact of reasoning depth on model performance\. 5\.4 RQ3: Impact of Prompt Templates on LLM Reasoning A central question is whether LLMs fail on CombEval because they lack deep combinatorial reasoning ability, or because they are unfamiliar with our automatically generated custom templates\. To isolate the influence of surface wording, we compare each original template problem with a solver\-verified natural\-language rewrite that preserves the same Cofola structure\. The full rewrite\-and\-verification protocol is described in Appendix D\. In total, we obtain 661 verified isomorphic pairs and randomly sample 300 pairs for this analysis\. We report both single\-prompt accuracy and paired consistency, where the latter requires the model to answer both logically equivalent formulations correctly\. Model Setting Acc\. Δ\\Delta vs\. Template 95% CI DeepSeek\-V3\.2 Template 80\.42% – \[77\.41%, 83\.52%\] ReStory 79\.56% −0\.86\-0\.86 \[76\.41%, 82\.26%\] Both 72\.51% −7\.91\-7\.91 \[68\.89%, 75\.92%\] gpt\-5\.1 Template 82\.25% – \[79\.08%, 85\.59%\] ReStory 80\.25% −2\.00\-2\.00 \[77\.74%, 83\.26%\] Both 74\.23% −8\.02\-8\.02 \[71\.08%, 77\.85%\] gemini\-3\-flash \-preview\-thinking Template 87\.38% – \[84\.74%, 89\.33%\] ReStory 89\.28% \+1\.90\+1\.90 \[87\.67%, 91\.33%\] Both 84\.33% −3\.05\-3\.05 \[81\.48%, 86\.92%\] gpt\-5\.5 Template 92\.59% – \[89\.89%, 94\.33%\] ReStory 94\.55% \+1\.96\+1\.96 \[92\.74%, 96\.00%\] Both 91\.83% −0\.76\-0\.76 \[88\.82%, 93\.67%\] Table 4: Prompt\-template robustness of DeepSeek\-V3\.2, gemini\-3\-flash\-preview\-thinking, gpt\-5\.1, and gpt\-5\.5\. Template is the baseline; Δ\\Delta reports percentage\-point changes relative to Template\. ReStory reports accuracy on rewritten prompts, and Both requires both formulations to be correct\. As shown in Table 4, all models perform similarly on original templates and rewritten prompts\. This suggests that CombEval’s custom templates do not artificially depress model performance\. However, paired consistency of relatively weaker models DeepSeek\-V3\.2, gpt\-5\.1, and gemini\-3\-flash\-preview\-thinking is lower than single\-prompt accuracy \(outside the confidence intervals of the original/rephrased template\), indicating that some correct answers are not preserved under superficial rewrites\. This gap becomes negligible for gpt\-5\.5, dropping only 0\.76%0\.76\\% compared to the original template\. Taken together, these results indicate that surface template variations can cause some performance fluctuations, especially for less capable \(but still strong\) models\. The latest stronger models, however, show greater robustness to these variations, suggesting that their reasoning capabilities are not tightly coupled to specific prompt formulations\. 5\.5 Error Analysis and Qualitative Case Studies We manually analyze the 208208 wrong answers of gpt\-5\.5, of which 1515 are zero predictions and 193193 disagree on a non\-zero value\. Errors span all major operators including Sequence: 121121, Group: 6262, Tuple: 5555 and constraints together: 3535, indexequalmember: 3434, lessthan: 2929, next\_to: 2727 \(note that one problem can have multiple operators and constraints, so these counts are not mutually exclusive\)\. They can be broadly categorized into: \(i\) collapsing a together block to one fixed internal order, missing its k\!k\! permutations; \(ii\) reading “together\(A\) not in seq” as “no two of AA adjacent” instead of “AA not contiguous”; \(iii\) treating multiple choose\_tuple on the same bag as one deal rather than independent draws; \(iv\) dropping the \(n−1\)\!\(n\{\-\}1\)\! quotient for multiset circles; \(v\) collapsing chains formed by together that share an entity; and \(vi\) collapsing internal ordering and direction of together\+\+next\_to blocks in circles\. Patterns \(i\)–\(ii\) reflect a semantic gap between natural\-language and Cofola, while \(iii\)–\(vi\) are classical combinatorial slips\. Representative examples are in Appendix˜F\. 6 Conclusion We present CombEval, an framework for dynamically generating diverse and difficulty\-controllable combinatorial counting problems, used to evaluate the combinatorial counting capabilities of large language models\. Experimental results on CombEval reveal that current LLMs, including state\-of\-the\-art closed\-source models, still struggle with various combinatorial counting challenges, particularly those involving indistinguishable elements, complex sequencing constraints, and deep reasoning chains\. We anticipate that CombEval could serve as a valuable resource for advancing research in combinatorial counting and reasoning with LLMs, inspiring the development of more robust and capable models in this domain\. In future work, we plan to scale the LLM\-facing suites to additional combinatorial object types, such as circular arrangements, partitions, graph objects, and numeric attributes, and to explore agentic techniques to enhance combinatorial reasoning performance\. Limitations While CombEval provides a structured benchmark for combinatorial reasoning, it has several limitations\. The benchmark currently supports only English, and extending it to other languages would require additional template development and validation\. In addition, verification with the Cofola solver is constrained by computational time and memory, which prevents the generation of more complex problem instances\. Finally, human validation was performed only on a subset of samples, motivating future human\-in\-the\-loop evaluation\. Ethical considerations This work introduces CombEval, a benchmark for evaluating combinatorial counting capabilities of large language models\. CombEval is intended for research purposes to advance understanding of LLM reasoning abilities\. We do not foresee direct ethical concerns arising from the use of CombEval itself\. Acknowledgements Yuanhong Wang and Yuxu Zhou are supported by the National Natural Science Foundation of China \(No\.62506141\)\. Ondřej Kuželka is supported by the Czech Science Foundation project “The Automatic Combinatorialist” \(24\-11820S\)\. Additionally, this work is funded by the New Cornerstone Science Foundation via the XPLORER PRIZE\. References Ö\. Akgün, A\. M\. Frisch, I\. P\. Gent, C\. Jefferson, I\. Miguel, and P\. Nightingale \(2022\) Conjure: automatic generation of constraint models from problem specifications\. Artificial Intelligence 310, pp\. 103751\. Cited by: Appendix B, §2\. N\. F\. Ariyani, Z\. Bouraoui, R\. Booth, and S\. Schockaert \(2025\) There’s no such thing as simple reasoning for llms\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 4503–4514\. Cited by: §2\. S\. Balloccu, P\. Schmidtová, M\. Lango, and O\. Dušek \(2024\) Leak, cheat, repeat: data contamination and evaluation malpractices in closed\-source llms\. arXiv preprint arXiv:2402\.03927\. Cited by: §1, §2\. J\. Boye and B\. Moell \(2025\) Large language models and mathematical reasoning failures\. arXiv preprint arXiv:2502\.11574\. Cited by: §1\. K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, et al\. \(2021\) Training verifiers to solve math word problems\. arXiv preprint arXiv:2110\.14168\. Cited by: §1, §2\. DeepSeek\-AI \(2025\) DeepSeek\-v3\.2: pushing the frontier of open large language models\. Cited by: §1, §5\.1\. C\. Deng, Y\. Zhao, X\. Tang, M\. Gerstein, and A\. Cohan \(2024\) Investigating data contamination in modern benchmarks for large language models\. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\), Mexico City, Mexico, pp\. 8706–8719\. External Links: Document, Link Cited by: §1\. A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Yang, A\. Fan, et al\. \(2024\) The llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\. Cited by: §5\.1\. S\. Ferraris, A\. Mendelson, G\. Ballesio, and T\. Vercauteren \(2015\) Counting sub\-multisets of fixed cardinality\. arXiv preprint arXiv:1511\.06142\. Cited by: §2\. A\. M\. Frisch, W\. Harvey, C\. Jefferson, B\. Martínez\-Hernández, and I\. Miguel \(2008\) Essence: a constraint language for specifying combinatorial problems\. Constraints 13 \(3\), pp\. 268–306\. Cited by: Appendix B\. L\. Gao, A\. Madaan, S\. Zhou, U\. Alon, P\. Liu, Y\. Yang, J\. Callan, and G\. Neubig \(2023a\) Pal: program\-aided language models\. In International Conference on Machine Learning, pp\. 10764–10799\. Cited by: Appendix F\. Y\. Gao, Y\. Xiong, X\. Gao, K\. Jia, J\. Pan, Y\. Bi, Y\. Dai, J\. Sun, H\. Wang, and H\. Wang \(2023b\) Retrieval\-augmented generation for large language models: a survey\. arXiv preprint arXiv:2312\.10997 2 \(1\)\. Cited by: §D\.2\. M\. Gebser, R\. Kaminski, B\. Kaufmann, and T\. Schaub \(2022\) Answer set solving in practice\. Springer Nature\. Cited by: Appendix B, §2\. I\. P\. Gent and C\. Jefferson \(2006\) MINION: a fast, scalable, constraint solver1\. In ECAI 2006: 17th European Conference on Artificial Intelligence, August 29\-September 1, 2006, Riva del Garda, Italy: including Prestigious Applications of Intelligent Systems \(PAIS 2006\): proceedings, Vol\. 141, pp\. 98\. Cited by: Appendix B\. S\. Golchin and M\. Surdeanu \(2023\) Time travel in llms: tracing data contamination in large language models\. arXiv preprint arXiv:2308\.08493\. Cited by: §1, §2\. D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\) Measuring mathematical problem solving with the math dataset\. In Proceedings of NeurIPS, Cited by: §D\.2, §1, §1, §2\. J\. Jiang, F\. Wang, J\. Shen, S\. Kim, and S\. Kim \(2024\) A survey on large language models for code generation\. arXiv preprint arXiv:2406\.00515\. Cited by: §1\. K\. S\. Kalyan \(2024\) A survey of gpt\-3 family large language models including chatgpt and gpt\-4\. Natural Language Processing Journal 6, pp\. 100048\. Cited by: §1\. J\. Kuang, Y\. Shen, J\. Xie, H\. Luo, Z\. Xu, R\. Li, Y\. Li, X\. Cheng, X\. Lin, and Y\. Han \(2025\) Natural language understanding and inference with mllm in visual question answering: a survey\. ACM Computing Surveys 57 \(8\), pp\. 1–36\. Cited by: §1\. O\. Kuzelka \(2021\) Weighted first\-order model counting in the two\-variable fragment with counting quantifiers\. Journal of Artificial Intelligence Research 70, pp\. 1281–1307\. External Links: Document, ISSN 1076\-9757, 2007\.05619 Cited by: Appendix B, §1, §2\. Z\. Lai, X\. Geng, Z\. Wang, Y\. Bai, J\. Li, R\. Weng, J\. Wang, X\. Cao, X\. Cai, and S\. Huang \(2025\) Making mathematical reasoning adaptive\. arXiv preprint arXiv:2510\.04617\. Cited by: §1\. J\. Liu, X\. Lin, J\. Bayer, Y\. Dillies, W\. Jiang, X\. Liang, R\. Soletskyi, H\. Wang, Y\. Xie, B\. Xiong, et al\. \(2025\) CombiBench: benchmarking llm capability for combinatorial mathematics\. arXiv preprint arXiv:2505\.03171\. Cited by: §1\. I\. Mirzadeh, K\. Alizadeh, H\. Shahrokhi, O\. Tuzel, S\. Bengio, and M\. Farajtabar \(2024\) Gsm\-symbolic: understanding the limitations of mathematical reasoning in large language models\. arXiv preprint arXiv:2410\.05229\. Cited by: §1\. A\. Opedal, H\. Shirakami, B\. Schölkopf, A\. Saparov, and M\. Sachan \(2024\) Mathgap: out\-of\-distribution evaluation on problems with arbitrarily complex proofs\. arXiv preprint arXiv:2410\.13502\. Cited by: §2\. OpenAI \(2025\) Gpt\-oss\-120b & gpt\-oss\-20b model card\. External Links: Link, 2508\.10925 Cited by: §1, §5\.1\. I\. Pak \(2019\) COMPLEXITY PROBLEMS IN ENUMERATIVE COMBINATORICS\. In Proceedings of the International Congress of Mathematicians \(ICM 2018\), Rio de Janeiro, Brazil\. External Links: Document, 1803\.06636 Cited by: §2\. Qwen Team \(2025\) Qwen3 technical report\. External Links: Link, 2505\.09388 Cited by: §1, §5\.1\. A\. Saparov, R\. Y\. Pang, V\. Padmakumar, N\. Joshi, M\. Kazemi, N\. Kim, and H\. He \(2023\) Testing the general deductive reasoning capacity of large language models using ood examples\. Advances in Neural Information Processing Systems 36, pp\. 3083–3105\. Cited by: §2\. Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\) DeepSeekMath: pushing the limits of mathematical reasoning in open language models\. External Links: Link, 2402\.03300 Cited by: §5\.1\. Z\. Shi, Q\. Zhang, and A\. Lipani \(2022\) Stepgame: a new benchmark for robust multi\-hop spatial reasoning in texts\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 36, pp\. 11321–11329\. Cited by: §2\. S\. Shrestha, M\. Kim, and K\. Ross \(2025\) Mathematical reasoning in large language models: assessing logical and arithmetic errors across wide numerical ranges\. arXiv preprint arXiv:2502\.08680\. Cited by: §1\. R\. P\. Stanley \(2011\) Enumerative combinatorics volume 1 second edition\. Cambridge studies in advanced mathematics\. Cited by: §1, §2\. N\. Taghipour, D\. Fierens, J\. Davis, and H\. Blockeel \(2012\) Lifted variable elimination with arbitrary constraints\. In Artificial Intelligence and Statistics, pp\. 1194–1202\. Cited by: Appendix B, §2\. M\. Thurley \(2006\) SharpSAT–counting models with advanced component caching and implicit bcp\. In International Conference on Theory and Applications of Satisfiability Testing, pp\. 424–429\. Cited by: Appendix B\. J\. Tóth and O\. Kuželka \(2022\) Lifted inference with linear order axiom\. Proceedings of the AAAI Conference on Artificial Intelligence 37 \(10\), pp\. 12295–12304\. External Links: Document Cited by: Appendix B\. P\. Totis, J\. Davis, L\. De Raedt, and A\. Kimmig \(2023\) Lifted Reasoning for Combinatorial Counting\. Journal of Artificial Intelligence Research 76, pp\. 1–58\. External Links: Document, ISSN 1076\-9757 Cited by: Appendix B, §2\. L\. G\. Valiant \(1979\) The Complexity of Enumeration and Reliability Problems\. SIAM Journal on Computing 8 \(3\), pp\. 410–421\. External Links: ISSN 0097\-5397, 1095\-7111 Cited by: §2, §3\. T\. van Bremen and O\. Kuzelka \(2021\) Faster lifting for two\-variable logic using cell graphs\. In Proceedings of the Thirty\-Seventh Conference on Uncertainty in Artificial Intelligence, UAI 2021, Virtual Event, 27\-30 July 2021, C\. P\. de Campos, M\. H\. Maathuis, and E\. Quaeghebeur \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 161, pp\. 1393–1402\. External Links: Link Cited by: Appendix B, §1\. G\. Van Den Broeck, N\. Taghipour, W\. Meert, J\. Davis, and L\. De Raedt \(2011\) Lifted Probabilistic Inference by First\-Order Knowledge Compilation\. In IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16\-22, 2011, IJCAI 2011, Vol\. 29, pp\. 2178–2185\. External Links: Document, ISBN 978\-1\-57735\-512\-0, 1412\.0315 Cited by: Appendix B, §2\. H\. Veeraboina \(2023\) Cited by: §1\. Y\. Wan, W\. Wang, Y\. Yang, Y\. Yuan, J\. Huang, P\. He, W\. Jiao, and M\. Lyu \(2024\) LogicAsker: evaluating and improving the logical reasoning ability of large language models\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp\. 2124–2155\. Cited by: §2\. J\. Wang, A\. Rutkiewicz, A\. Wang, and M\. Sachan \(2025\) Generating pedagogically meaningful visuals for math word problems: a new benchmark and analysis of text\-to\-image models\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 11229–11257\. Cited by: §2\. Y\. Wang, J\. Pu, Y\. Zhou, Y\. Wang, and O\. Kuželka \(2026\) Solving combinatorial counting problems with weighted first\-order model counting\. External Links: 2605\.24845, Link Cited by: Appendix B, §1, §2\. \[44\] C\. Xu, S\. Guan, D\. Greene, and M\. Kechadi Benchmark data contamination of large language models: a survey, 2024\. URL https://arxiv\. org/abs/2406\.04244\. Cited by: §1\. F\. Xu, Q\. Hao, Z\. Zong, J\. Wang, Y\. Zhang, J\. Wang, X\. Lan, J\. Gong, T\. Ouyang, F\. Meng, et al\. \(2025\) Towards large reasoning models: a survey of reinforced reasoning with large language models\. arXiv preprint arXiv:2501\.09686\. Cited by: §1\. Y\. Xuejun, J\. Zhong, Z\. Feng, P\. Zhai, R\. Yousefzadeh, W\. C\. Ng, H\. Liu, Z\. Shou, J\. Xiong, Y\. Zhou, et al\. \(2025\) Mathesis: towards formal theorem proving from natural languages\. arXiv preprint arXiv:2506\.07047\. Cited by: §1\. A\. Yang, B\. Zhang, B\. Hui, B\. Gao, B\. Yu, C\. Li, D\. Liu, J\. Tu, J\. Zhou, J\. Lin, K\. Lu, M\. Xue, R\. Lin, T\. Liu, X\. Ren, and Z\. Zhang \(2024\) Qwen2\.5\-math technical report: toward mathematical expert model via self\-improvement\. arXiv preprint arXiv:2409\.12122\. Cited by: §5\.1\. K\. Yang, G\. Poesia, J\. He, W\. Li, K\. E\. Lauter, S\. Chaudhuri, and D\. Song \(2025\) Position: formal mathematical reasoning—a new frontier in ai\. In Forty\-second International Conference on Machine Learning Position Paper Track, Cited by: §1\. W\. X\. Zhao, K\. Zhou, J\. Li, T\. Tang, X\. Wang, Y\. Hou, Y\. Min, B\. Zhang, J\. Zhang, Z\. Dong, et al\. \(2023\) A survey of large language models\. arXiv preprint arXiv:2303\.18223 1 \(2\)\. Cited by: §1\. K\. Zheng, J\. M\. Han, and S\. Polu \(2021\) Minif2f: a cross\-system benchmark for formal olympiad\-level mathematics\. arXiv preprint arXiv:2109\.00110\. Cited by: §1\. X\. Zhou, M\. Weyssow, R\. Widyasari, T\. Zhang, J\. He, Y\. Lyu, J\. Chang, B\. Zhang, D\. Huang, and D\. Lo \(2025\) Lessleak\-bench: a first investigation of data leakage in llms across 83 software engineering benchmarks\. arXiv preprint arXiv:2502\.06215\. Cited by: §1, §2\. K\. Zou, J\. Mai, Y\. Zhang, Y\. Wang, O\. Kuzelka, Y\. Wang, and Y\. Chang \(2025\) Faster lifting for ordered domains with predecessor relations\. CoRR abs/2507\.19182\. External Links: Document, Link, 2507\.19182 Cited by: §1\. Appendix A Problem Examples Choose Set set0set\_\{0\} contains: e4e\_\{4\}, e1e\_\{1\}, e5e\_\{5\}, e3e\_\{3\}, e2e\_\{2\}, e6e\_\{6\}, e7e\_\{7\}, e8e\_\{8\}\. From set0set\_\{0\}, we repeatedly select 16 elements \(with replacement\) to form choose0choose\_\{0\}\. choose0choose\_\{0\} contains e4e\_\{4\}\. The element e5e\_\{5\} must appear at most 11 times in choose0choose\_\{0\}\. choose0choose\_\{0\} is disjoint from \{e1,e2,e6\}\\\{e\_\{1\},e\_\{2\},e\_\{6\}\\\}\. How many different ways can we obtain choose0choose\_\{0\}? Count each distinct combination of choose0choose\_\{0\} as one way\. \(Answer: 3841\) Group Set set0set\_\{0\} contains: e6e\_\{6\}, e2e\_\{2\}, e9e\_\{9\}, e10e\_\{10\}, e3e\_\{3\}, e7e\_\{7\}, e4e\_\{4\}, e8e\_\{8\}, e5e\_\{5\}, e1e\_\{1\}\. We divide set0set\_\{0\} into 4 groups \(allowing empty groups and groups are labeled\), storing the result in group0group\_\{0\}\. The 4th division group of group0group\_\{0\} is a subset of \{e1,e10,e2,e3,e5,e6,e8\}\\\{e\_\{1\},e\_\{10\},e\_\{2\},e\_\{3\},e\_\{5\},e\_\{6\},e\_\{8\}\\\}\. The 3rd division group of group0group\_\{0\} is disjoint from \{e8\}\\\{e\_\{8\}\\\}\. How many different ways can we obtain group0group\_\{0\}? Count each distinct combination of group0group\_\{0\} as one way\. \(Answer: 331776\) Sequence Set set0set\_\{0\} contains: e5e\_\{5\}, e1e\_\{1\}, e11e\_\{11\}, e3e\_\{3\}, e2e\_\{2\}, e8e\_\{8\}, e10e\_\{10\}, e4e\_\{4\}, e12e\_\{12\}, e6e\_\{6\}, e9e\_\{9\}, e7e\_\{7\}, e13e\_\{13\}, e14e\_\{14\}\. We draw 4 elements from set\_0set\\\_0 and arrange them in sequence choose\_sequence0choose\\\_sequence\_\{0\}\. Sequence choose\_sequence0choose\\\_sequence\_\{0\} must contain e8e\_\{8\} and e10e\_\{10\} such that e8e\_\{8\} is immediately before e10e\_\{10\}\. How many different ways can we obtain choose\_sequence0choose\\\_sequence\_\{0\}? Count each distinct combination of choose\_sequence0choose\\\_sequence\_\{0\} as one way\. \(Answer: 396\) Tuple Set set0set\_\{0\} contains: e3e\_\{3\}, e2e\_\{2\}, e4e\_\{4\}, e6e\_\{6\}, e1e\_\{1\}, e8e\_\{8\}, e5e\_\{5\}, e7e\_\{7\}\. We draw 4 elements from set0set\_\{0\} \(with replacement\) and arrange them in choose\_tuple0choose\\\_tuple\_\{0\}\. The element at index 4 of choose\_tuple0choose\\\_tuple\_\{0\} must not be e4e\_\{4\}\. The element e5e\_\{5\} must appear at most 3 times in choose\_tuple0choose\\\_tuple\_\{0\}\. The element at index 2 of choose\_tuple0choose\\\_tuple\_\{0\} must be e6e\_\{6\}\. How many different ways can we obtain choose\_tuple0choose\\\_tuple\_\{0\}? Count each distinct combination of choose\_tuple0choose\\\_tuple\_\{0\} as one way\. \(Answer: 448\) Appendix B Solvers for Combinatorial Counting We briefly review the existing solvers for combinatorial counting \(CO\) problems\. These algorithmic approaches can be broadly categorized into Constraint Satisfaction Problems \(CSPs\), propositional logic frameworks \(including Answer Set Programming and \#SAT\), and lifted inference methods\. Constraint Satisfaction Problems \(CSPs\) Constraint\-based frameworks typically employ a two\-step approach: modeling and solving\. In modeling, problems are defined using high\-level constraint languages, such as ESSENCE Frisch et al\. \(2008\); Akgün et al\. \(2022\), which allow users to specify combinatorial objects using primitives like sets, multisets, and partitions, along with constraints on valid configurations\. Then, these high\-level models are refined into lower\-level representations \(e\.g\., ESSENCE′\) and compiled into a counting constraint satisfaction problem \(\#CSP\)\. Finally, specialized solvers \(e\.g\., Minion Gent and Jefferson \(2006\)\) are used to count the number of solutions satisfying all constraints, answering the original combinatorial counting question\. Propositional Logic and Answer Set Programming \(ASP\) Logic\-based solvers approach counting by finding the number of "models" \(satisfying truth assignments\) for a given logical theory, i\.e\., via \#𝖲𝖠𝖳\\\#\{\\mathsf\{SAT\}\} Thurley \(2006\) or \#count\{l1,l2,…,ln\}\\\#count\\ \\\{l\_\{1\},l\_\{2\},\\dots,l\_\{n\}\\\} queries \(which count the number of true literals among l1,l2,…,lnl\_\{1\},l\_\{2\},\\dots,l\_\{n\}\) Gebser et al\. \(2022\)\. The main limitation of these methods is that they do not natively support high\-level combinatorial constructs, e\.g\., sets or multisets, requiring explicit encoding of such structures into propositional variables and clauses\. Lifted Inference Methods Lifted inference frameworks, such as Forclift Van Den Broeck et al\. \(2011\), GC\-FOVE Taghipour et al\. \(2012\), WFOMC Kuzelka \(2021\); van Bremen and Kuzelka \(2021\); Tóth and Kuželka \(2022\), aim to count solutions without exhaustively enumerating every individual state\. These methods rely on the principle of exchangeability \(symmetries of the domain constants\), reasoning about groups of individuals together rather than unique constants\. This is very similar to how combinatorial counting formulas operate on groups of indistinguishable objects, making lifted inference a natural fit for CO problems\. Recently, Totis et al\. \(2023\) attempted to bridge lifted inference and \#CSP, proposing a lifted reasoning framework specifically designed for combinatorial counting tasks\. This framework, containing a Domain\-Specific Language \(DSL\) called CoLa and the solver CoSo, showed promising results on modeling and solving various CO problems Totis et al\. \(2023\)\. Cofola \(Wang et al\., 2026\) used in this work extends this direction with a typed object language and a WFOMC compilation pipeline, supporting multiple dependent objects, sets and bags, tuples, sequences, circles, partitions, compositions, and relative or group\-level constraints\. Evaluation Prompt for Zero\-Shot Reasoning You are an expert in combinatorics\. Your task is to solve combinatorics problems by providing a step\-by\-step reasoning process followed by a single, exact numerical answer\. Instructions: 1\. First, reason step\-by\-step to solve the problem\. Show all intermediate steps, including the use of standard combinatorial notation and formulas\. 2\. Compute the exact numerical value; do not leave expressions unevaluated\. 3\. After completing the reasoning, report the final answer as a single number enclosed in \\boxed\{\}\. 4\. Do not include any text after the boxed answer\. Question: \{question\} Figure 5: The prompt template used during evaluation\. The placeholder \{question\} is instantiated with a specific problem instance at inference time\. Evaluation Prompt for Code Augmented Reasoning You are an expert in combinatorics and Python programming\. Your task is to solve combinatorics problems by writing Python code\. Instructions: 1\. You must ONLY output executable Python code\. Do not include any natural language explanations, introduction, text, or mathematical derivation\. 2\. The code must be self\-contained and calculate the exact numerical answer to the problem\. 3\. Print or return the final result within the code\. Do not add any text or formatting \(like \\boxed\{\}\) outside the code block\. Question: \{question\} Figure 6: The prompt template used during code\-augmented evaluation \(Python setting\)\. The placeholder \{question\} is instantiated with a specific problem instance at inference time\. Appendix C Experimental Details C\.1 Capabilities of LLMs in Combinatorial Reasoning dataset distribution For RQ1 \(Capabilities of LLMs in Combinatorial Reasoning\), we construct an evaluation dataset across the four operation groups sampled in the main CombEval suite: Choose, Group, Sequence, and Tuple\. The final dataset comprises 1,500 problem instances in total\. By operation type, instances are distributed as follows: Tuple \(589, 39\.3%\), Group \(337, 22\.5%\), Sequence \(297, 19\.8%\), and Choose \(277, 18\.5%\)\. Instance difficulty varies across multiple axes\. By reasoning depth, most instances involve a single operator application \(depth\-1: 1,272, 84\.8%\), with decreasing numbers at deeper nesting levels \(depth\-2: 193, 12\.9%; depth\-3: 32, 2\.1%; depth\-4: 3, 0\.2%\)\. By constraint count, 574 instances \(38\.3%\) contain no constraints, 413 \(27\.5%\) contain one constraint, 303 \(20\.2%\) contain two constraints, and 210 \(14\.0%\) contain three constraints\. By operator count, the majority of problems use a single combinatorial operator \(858, 57\.2%\), with 441 \(29\.4%\) using two operators, 146 \(9\.7%\) using three operators, and 55 \(3\.7%\) using four operators\. All instances are verified against the Cofola solver to ensure ground\-truth answers exist\. Figure 7: Distribution of problem instances across operation types and difficulty parameters for the RQ1 dataset\. C\.2 Evaluation Prompts All models are evaluated using a standardized prompt template designed to elicit step\-by\-step chain\-of\-thought reasoning followed by a clearly marked final answer\. This prompt engineering approach serves two purposes: \(1\) it encourages models to show their reasoning process, enabling error analysis and identification of failure modes, and \(2\) it enforces a consistent output format that facilitates automatic answer extraction and evaluation\. Figure 5 presents the complete prompt template used across all experiments\. The prompt explicitly instructs models to act as combinatorics experts, decompose problems into intermediate steps using standard notation \(e\.g\., \(nk\)\\binom\{n\}\{k\}, permutations, factorial expressions\), and compute exact numerical values rather than leaving symbolic expressions unevaluated\. Critically, the final answer must be enclosed in \\boxed\{\} delimiters with no additional text afterward, allowing reliable extraction via regex pattern matching\. During evaluation, the \{question\} placeholder is replaced with individual problem instances from the dataset\. We apply this template uniformly to all problem types \(Choose, Group, Sequence, Tuple\) without type\-specific modifications, ensuring fair comparison across tasks\. Models receive no few\-shot examples or additional context beyond the problem statement itself, constituting a pure zero\-shot evaluation setting\. C\.3 Runtime as Difficulty Indicator Figure 8: Performance analysis of solver runtime across different problem parameters\. To validate that generated problems exhibit genuine computational complexity, we analyze the relationship between problem parameters and solver runtime, which serves as an objective measure of intrinsic difficulty independent of model\-specific biases\.The scaling of solver performance with depth, entity size, and constraint count on a logarithmic scale is shown in Figure 8\. The results reveal several important patterns that support our difficulty parameterization\. Notably, solver runtime exhibits a marginal decrease as the reasoning depth scales from 1 to 4\. This counterintuitive behavior stems from our isolated experimental design for depth evaluation, where the constraint count is strictly fixed to k=0k=0 to eliminate confounding factors\. Without the overhead of processing complex constraint interactions, the solver can efficiently traverse the nested chain\-structured operators \(e\.g\., Choose→Sequence→…\\textsc\{Choose\}\\rightarrow\\textsc\{Sequence\}\\rightarrow\\dots\), bypassing intensive filtering or backtracking\. Consequently, deeper nesting under zero\-constraint settings effectively simplifies the solver’s localized search paths, resulting in a slight reduction in execution time\. In contrast, entity size and constraint count lead to substantial, exponential increases in solver runtime, reflecting their dominant impact on expanding the combinatorial search space and complicating constraint satisfaction\. Appendix D Experimental Details for RQ3 D\.1 Dataset We construct the evaluation dataset for RQ3 directly from the problem instances used in our main experiments \(RQ1 and RQ2\)\. Each problem is originally synthesized by the CombEval framework in the formal Cofola representation, then automatically translated into natural language using a structured template, producing a baseline natural language formulation\. To obtain a linguistically diversified variant while strictly preserving semantic equivalence, we apply a closed\-loop verification pipeline to rewrite each problem into a MATH\-style natural language formulation, as detailed in the next subsection\. As shown in Figure 12, the same underlying combinatorial structure can be expressed in both the template\-based and rewritten natural\-language forms while maintaining semantic equivalence\. D\.2 Closed\-Loop Verification for Semantic Preservation A core challenge in evaluating prompt sensitivity is ensuring that rewritten problems are mathematically isomorphic to the originals—i\.e\., any observed performance variation must be attributable solely to surface\-level linguistic changes, not to unintended semantic drift\. To this end, we design a closed\-loop verification pipeline comprising three stages: Style Transfer, Formal Back\-translation, and Solver Verification\. Step 1: Style Transfer\. We leverage gemini\-3\-flash\-preview\-thinking to rewrite each template\-generated problem into a MATH\-style natural language formulation\. The rewriting is conducted in a retrieval\-augmented generation \(RAG\) setting Gao et al\. \(2023b\): reference examples are retrieved from the “Counting and Probability” category of the MATH dataset Hendrycks et al\. \(2021\) \(Figure 11\) and provided as few\-shot stylistic guidance\. The rewriting prompt \(Figure 9\) explicitly conditions generation on the underlying Cofola representation,separating mathematical logic from surface\-level language to encourage diverse narrative reformulations without altering the solution space\. Step 2: Formal Back\-translation\. To guard against condition omission or semantic shift introduced during rewriting, we employ the same LLM to translate the rewritten problem back into Cofola code using a dedicated NL\-to\-Cofola translation prompt \(Figure 10\)\. This prompt provides exhaustive one\-to\-one mappings between natural language patterns and Cofola constructs, enforcing a deterministic, line\-by\-line translation that recovers the formal structure from the paraphrased text\. Step 3: Solver Verification\. The back\-translated Cofola expression is fed into the Cofola constraint solver to compute its exact solution\.If and only if the solver’s output matches the ground\-truth answer of the original problem exactly—across all valid combinatorial configurations—is the rewritten instance accepted as a semantically lossless paraphrase\.Instances that fail this verification are discarded and re\-generated \(with different random seeds for the style transfer stage\) until convergence or a maximum retry budget is exhausted\. This closed\-loop protocol guarantees that every accepted Rewritten variant encodes precisely the same combinatorial enumeration problem as its Math and Cards counterparts,enabling a rigorous, confounding\-free evaluation of model robustness to paraphrasing in RQ3\. D\.3 Prompt for Problem Rewriting Prompt Template for Problem Rewriting You are an expert in creating combinatorics math problems\. Your task is to rewrite a given math problem to make it more engaging, story\-like, or similar in style to the provided reference problems, while STRICTLY preserving the underlying mathematical logic and constraints\. Here are some reference problems that have a similar mathematical structure or style\. Use them as inspiration for the tone and narrative style: \-\-\- \{references\_text\} \-\-\- Here is the mathematical logic of the original problem \(represented in Cofola language\): \{original\_cofola\} \(Note: The rewritten problem MUST correspond exactly to this logic\. Do not change numbers, constraints, or the type of combinatorial object being counted\.\) Original Problem Text: ‘‘\{original\_question\}’’ Task: Rewrite the ‘‘Original Problem Text’’ above\. 1\. Keep the mathematical core exactly the same \(same numbers, same conditions\)\. 2\. You can change the setting \(e\.g\., from balls in boxes to students in classrooms\) if it fits the logic, or just improve the narrative of the current setting\. 3\. Ensure the question is clear and unambiguous\. Output Format: Just provide the rewritten problem text\. Do not include explanations or the Cofola code\. Rewritten Problem: Figure 9: The prompt used to generate rewritten variations of combinatorial problems\. Placeholders in blue are replaced with reference examples, the Cofola logic representation, and the original problem text for each instance\. Prompt Template for Natural Language to Cofola Translation \# Role You are an expert combinatorial problem solver and an assistant strictly trained to translate natural language combinatorial questions into a specialized domain\-specific language \(DSL\) called Cofola\. \# Task Your task is to read a combinatorics problem described in English and map it line\-by\-line or step\-by\-step into Cofola code\. You must use ONLY the mappings provided in the dictionary below\. \# Dictionary: Cofola to Natural Language Mapping \[The full dictionary defines exhaustive one\-to\-one mappings between natural language patterns and Cofola constructs across the following categories: \(1\) Types & Declarations; \(2\) Operations \(Set & Bag Math, Choice & Arrangement, Partitions & Groupings\); \(3\) Constraints \(comparators, general constraint mappings, and sequence/tuple\-specific constraints\); \(4\) Question Template \(target output format\)\. See the supplementary material for the complete dictionary\.\] Here are some reference examples showing the expected input\-output format: \-\-\- \{references\_text\} \-\-\- Important requirements: • Analyze the reference examples carefully to understand the expected code structure and style • Generate code that exactly matches the format shown in the examples • Output only the program code \(Cofola DSL\) without any explanations, comments or additional markdown text like ‘‘‘cofola • Ensure the code follows proper syntax and can be executed correctly Task: Translate the following natural language problem into a Cofola program\. Problem: \{original\_question\} Program: Figure 10: The prompt template used for translating natural language combinatorial problems into Cofola DSL code\. The Dictionary section \(summarized above\) provides exhaustive one\-to\-one mappings between natural language patterns and Cofola constructs; the full dictionary is available in the supplementary material\. Placeholders in blue are replaced with retrieved reference examples and the target problem\. Example 1: Choose \(Level 2\) Problem: My school’s math club has 6 boys and 8 girls\. I need to select a team to send to the state math competition\. We want 6 people on the team\. In how many ways can I select the team without restrictions? Cofola Encoding: boys = set\(boy0\.\.\.6\) girls = set\(girl0\.\.\.8\) team = choose\(boys \+ girls, 6\) Example 2: Group \(Level 5\) Problem: How many ways are there to put 4 balls in 3 boxes if two balls are indistinguishably green, two are indistinguishably red, and the boxes are distinguishable? Cofola Encoding: balls = bag\(gree: 2, red: 2\) boxes = Group\(balls, 3\) Figure 11: Example problems from the MATH dataset with their formal Cofola encodings used as references during the rewriting process, where bag is the Cofola construct for multisets\. Figure 9 presents the prompt template used for problem rewriting\. The prompt instructs the model to generate diverse, story\-like reformulations while strictly preserving the underlying combinatorial logic specified in the Cofola representation\. By explicitly separating mathematical constraints from surface\-level language and providing reference problems as stylistic guidance, this design enables controlled variation in problem phrasing without altering the solution space\. Figure 10 presents the prompt template used for natural\-language\-to\-Cofola back\-translation\. This prompt asks the model to recover the formal Cofola program from a rewritten natural\-language problem using an explicit dictionary of mappings between linguistic patterns and Cofola constructs\. The generated Cofola program is then executed by the solver and compared against the original ground\-truth answer\. Together, the rewriting prompt and the back\-translation prompt form a closed verification loop, allowing us to study reasoning robustness under paraphrasing while avoiding confounding changes to problem structure\. Template\-based Set 𝑠𝑒𝑡0\\mathit\{set\}\_\{0\} contains: e8\\mathit\{e\}\_\{8\}, e6\\mathit\{e\}\_\{6\}, e2\\mathit\{e\}\_\{2\}, e9\\mathit\{e\}\_\{9\}, e7\\mathit\{e\}\_\{7\}, e4\\mathit\{e\}\_\{4\}, 𝑒𝑙𝑒𝑚𝑒𝑛𝑡5\\mathit\{element\}\_\{5\}, e1\\mathit\{e\}\_\{1\}, e3\\mathit\{e\}\_\{3\}, e10\\mathit\{e\}\_\{10\}\. From 𝑠𝑒𝑡0\\mathit\{set\}\_\{0\}, we select 7 elements without replacement to form 𝑐ℎ𝑜𝑜𝑠𝑒0\\mathit\{choose\}\_\{0\}\. How many different ways can we obtain 𝑐ℎ𝑜𝑜𝑠𝑒0\\mathit\{choose\}\_\{0\}? Count each distinct combination of 𝑐ℎ𝑜𝑜𝑠𝑒0\\mathit\{choose\}\_\{0\} as one way\. Rewritten\* A school’s debate team consists of 10 students\. For an upcoming national competition, the coach needs to select a group of 7 students to represent the school\. In how many different ways can the coach choose the 7\-member team from the 10 available students? \*Rewritten by gemini\-3\-flash\-preview\-thinking\. Figure 12: An example of a generated combinatorial problem in its template\-based and rewritten natural\-language forms\. Both formulations encode the same underlying combinatorial structure \(Choose from a set of 10 elements without replacement, selecting 7\)\. Appendix E Template We list the template system we design to verbalize generated Cofola instances as natural\-language problems\. The template system separates problem statements into four components: entity/object declarations, set and bag construction steps, typed combinatorial\-object operations, and constraints\. This separation allows the generator to preserve the formal structure of each instance while producing readable natural\-language descriptions\. Templates are parameterized by placeholders enclosed in braces, such as \{set\_name\}, \{source\_name\}, \{node\_name\}, and \{k\}\. During generation, these placeholders are filled with sampled entities, intermediate variable names, operation parameters, and constraint values\. For example, the template “From \{source\_name\}, we select \{k\} elements without replacement to form \{node\_name\}\.” can be instantiated as “From aux\_0, we select 3 elements without replacement to form aux\_1\.” The same mechanism is used for set/multiset construction, sequence constraints, tuple constraints, and final counting questions\. Table 5 gives templates for entity declarations and set/bag construction; Table 6 lists typed object\-operation templates; Tables 7, 8, and 9 summarize constraint templates; and Table 10 reports the question, comparator, polarity, and entity\-pool templates\. Table 5: Math templates for entity type declarations and universe construction\. Category Template Entity Type Declarations Set Set \{set\_name\} contains: \{description\}\. Bag \(Multiset\) \{set\_name\} contains: \{description\}\. Tuple \{set\_name\} contains: \{description\}\. Sequence \{set\_name\} contains: \{description\}\. Universe Construction Operations SET\_UNION We take the union of \{left\_name\} and \{right\_name\}, calling the result \{node\_name\}\. SET\_INTERSECTION We take the intersection of \{left\_name\} and \{right\_name\}, calling the result \{node\_name\}\. SET\_DIFFERENCE Starting from \{left\_name\}, we remove elements of \{right\_name\} to get \{node\_name\}\. BAG\_UNION We merge \{left\_name\} and \{right\_name\} by adding their quantities to obtain \{node\_name\}\. BAG\_INTERSECTION We combine \{left\_name\} and \{right\_name\} by taking the minimum quantity of each element to get \{node\_name\}\. BAG\_DIFFERENCE Starting from \{left\_name\}, we subtract the quantities in \{right\_name\} to get \{node\_name\}, ensuring no element’s quantity becomes negative\. Table 6: Math templates for combinatorial operations\. Operation Template CHOOSE CHOOSE From \{source\_name\}, we select \{k\} elements without replacement to form \{node\_name\}\. CHOOSE\_REPLACE From \{source\_name\}, we repeatedly select \{k\} elements \(with replacement\) to form \{node\_name\}\. Group COMPOSE We divide \{source\_name\} into \{k\} groups \(allowing empty groups and groups are labeled\), storing the result in \{node\_name\}\. PARTITION We split \{source\_name\} into \{k\} groups \(allowing empty groups and groups are not labeled\), storing the result in \{node\_name\}\. TUPLE CHOOSE\_REPLACE\_TUPLE We draw \{k\} elements from \{source\_name\} \(with replacement\) and arrange them in \{node\_name\}\. CHOOSE\_TUPLE From \{source\_name\}, we pick \{k\} elements and arrange them in order to form \{node\_name\}\. TUPLE We convert \{source\_name\} into an ordered sequence \{node\_name\}\. SEQUENCE SEQUENCE We arrange all elements from \{source\_name\} to form an ordered sequence \{node\_name\}\. CHOOSE\_REPLACE\_SEQUENCE We draw \{k\} elements from \{source\_name\} with replacement and arrange them in sequence \{node\_name\}\. CIRCLE We arrange all elements from \{source\_name\} around a circle called \{node\_name\}\{reflection\}\. Table 7: Math templates for constraints \(Part 1\): General and Group constraints\. Constraint Name Template General Constraints MemberConstraint \{target\} \{positive\}contains \{entity\}\. DisjointConstraint \{second\_set\} is disjoint from \(\{first\_set\}\)\. SubsetConstraint \{first\_set\} is a subset of \(\{second\_set\}\)\. EqualityConstraint \{left\_name\} must be the same as \{right\_name\}\. CardinalityConstraint \{target\} must contain \{comparator\} \{value\} elements\. CountConstraint The element \{entity\} must appear \{comparator\} \{value\} times in \{target\}\. NonEmptyConstraint \{target\} must not be empty\. LinearCardinalityConstraint The total size of \{left\} and \{right\} must be \{comparator\} \{value\}\. IndexMemberConstraint The set at index \{index\} of \{target\} \{positive\} belong to \{container\}\. IndexEqualMemberConstraint The element at index \{index\} of \{target\} \{positive\} be \{entity\}\. Group Constraints ComposeSizeConstraint The size of the \{index\}th subset must be \{comparator\} \{param\}\. QuantifiedConstraint For each group in the \{target\}, the number of entities it contains must \{comparator\} \{value\}\. Table 8: Math templates for constraints \(Part 2\): Sequence ordering constraints\. Constraint Name Template Sequence Ordering Constraints \(Standard\) PredecessorConstraint \(pos\) Sequence \{target\} must contain \{entity1\} and \{entity2\} such that \{entity1\} is immediately before \{entity2\}\. PredecessorConstraint \(neg\) Sequence \{target\} does not have \{entity1\} immediately before \{entity2\} if both are present\. LessThanConstraint \(pos\) Sequence \{target\} must contain \{entity1\} and \{entity2\} such that \{entity1\} comes before \{entity2\} \(not necessarily immediately\)\. LessThanConstraint \(neg\) Sequence \{target\} does not have \{entity1\} before \{entity2\} \(immediately or not\) if both are present\. NextToConstraint \(pos\) Sequence \{target\} must contain \{entity1\} and \{entity2\} such that they are adjacent\. NextToConstraint \(neg\) Sequence \{target\} does not have \{entity1\} and \{entity2\} adjacent if both are present\. TogetherConstraint \(pos\) Entities \(\{group\}\) must be together in sequence \{target\} if any are present\. TogetherConstraint \(neg\) Sequence \{target\} contains the entities \(\{group\}\) and they are not together\. Sequence Ordering Constraints \(Bag Semantics\) PredecessorBagConstraint \(pos\) There is a pair of \{entity1\} and \{entity2\} in sequence \{target\} such that \{entity1\} is immediately before \{entity2\}\. PredecessorBagConstraint \(neg\) Sequence \{target\} does not have any \{entity1\} immediately before any \{entity2\} \(\{target\} that contain only \{entity1\} or only \{entity2\} or none of them are also allowed\)\. LessThanBagConstraint \(pos\) There is a pair of \{entity1\} and \{entity2\} in sequence \{target\} such that \{entity1\} comes before \{entity2\} \(not necessarily immediately\)\. LessThanBagConstraint \(neg\) Sequence \{target\} does not have any \{entity1\} before any \{entity2\} \(immediately or not\) \(\{target\} that contain only \{entity1\} or only \{entity2\} or none of them are also allowed\)\. NextToBagConstraint \(pos\) There is a pair of \{entity1\} and \{entity2\} in sequence \{target\} such that they are adjacent\. NextToBagConstraint \(neg\) Sequence \{target\} does not have any \{entity1\} and \{entity2\} adjacent \(\{target\} that contain only \{entity1\} or only \{entity2\} or none of them are also allowed\)\. Complex Sequence Count Constraints SequenceCountConstraint In sequence \{target\}, \{count\_type\}\. next\_to \{entity1\} and \{entity2\} are adjacent \(in any order\) \{comparator\} \{value\} times predecessor \{entity1\} is immediately followed by \{entity2\} \{comparator\} \{value\} times less\_than \{entity1\} is followed by \{entity2\} \(allowing gaps\) \{comparator\} \{value\} times Table 9: Math templates for constraints \(Part 3\): Tuple\-specific constraints\. Constraint Name Template TupleMembershipConstraint \{target\} \{positive\} contains \{entity\}\. TupleIndexMembershipConstraint The element at position \{index\} of \{target\} must \{positive\}be \{entity\}\. TupleAllMembersInConstraint All elements in \{target\} must belong to \{target\_set\}\. TupleCountSizeConstraint The number of \{count\_obj\} in \{target\} must be \{comparator\} \{param\}\. TupleIndexEqConstraint The \{index\}th element of \{target\} \{positive\}is \{entity\}\. TupleDedupCountSizeConstraint In \{target\}, the number of distinct elements belonging to \{count\_obj\} must be \{comparator\} \{param\}\. PositionConstraint The \{ordinal\} position in \{target\} must be \{entity\}\. Table 10: Question templates and comparator mappings for natural language generation\. Category Template / Mapping Default Question Template Counting question How many different ways can we obtain \{uncertain\_node\_name\}? Count each distinct combination of \{uncertain\_node\_name\} as one way\. Comparator to Natural Language \(comparator2str\) == exactly <= at most < less than \>= at least \> greater than \!= not exactly Positive/Negative Boolean Mapping \(positive\_bool2txt\) true \(empty string\) false not Special Tokens entity\_type element entity\_unit \(single space\) empty\_entity no \{entity\_type\} at all Appendix F Error Examples of gpt\-5\.5 We present representative error examples generated by gpt\-5\.5 in Appendices˜F, F, F, F, F and F\. The text in red indicates the parts where the model made a mistake\. We note that we do not cover all error types here as exhaustive manual inspection is labor\-intensive\. Now, let us analyze these error examples in detail\. • Appendix˜F shows a circle arrangement where the model treats a “together” block as a fixed ordered chunk, omitting the k\!k\! internal permutations of the block and undercounting by a factor of 3\!3\!\. • Appendix˜F shows a sequence problem where the model reads “together not in seq” as “no two elements of the set are adjacent”, a much stricter condition than the intended “the elements do not all form one contiguous block”\. • Appendix˜F illustrates a case where two independent choose\_tuple operators on the same bag are treated by the model as a single joint deal that depletes the bag, rather than two independent draws from the original bag\. • Appendix˜F shows a multiset circular arrangement where the model uses the linear formula n\!/∏mi\!n\!/\\prod m\_\{i\}\! instead of the circular formula \(n−1\)\!/∏mi\!\(n\{\-\}1\)\!/\\prod m\_\{i\}\!, overcounting by a factor of nn\. • Appendix˜F shows a circle problem with two together constraints sharing an entity\. The model correctly identifies the forced chain but treats it as a single fixed orientation, missing the four internal orientations that the chain admits\. • Appendix˜F shows a circle problem combining together \(over three entities\) with next\_to\. The combined block has four distinct configurations \(two internal orderings ×\\times two chain directions in a rotation\-only circle\), all of which the model collapses to one\. The first two patterns reflect a systematic gap between the natural\-language reading of Cofola constraints and their formal semantics over multisets, particularly the symmetry of “together” blocks\. The remaining patterns reflect classical combinatorial slips that scale with the structure of compound constraints\. We anticipate that explicitly formalizing these semantic edge cases in the prompt template, or pairing the LLM with a symbolic verifier, would mitigate a substantial fraction of these errors \(Gao et al\., 2023a\)\. Error 1: Missing internal permutations of a “together” block Question: Set set0set\_\{0\} contains: element7, element5, element8, element6, element2, element4, element1, element3\. We arrange all elements from set0set\_\{0\} around a circle called circle0circle\_\{0\}\. Entities \(element2, element3, element5\) must be together in sequence circle0circle\_\{0\} if any are present\. How many different ways can we obtain circle0circle\_\{0\}? Count each distinct combination of circle0circle\_\{0\} as one way\. Reference: ‘720‘ Prediction: Treat \(element2, element3, element5\) as one fixed consecutive block, since they must be together in that sequence\. So we arrange the following 6 objects around a circle: \[element2,element3,element5\],element1,element4,element6,element7,element8\.\[\\text\{element2\},\\text\{element3\},\\text\{element5\}\],\\ \\text\{element1\},\\ \\text\{element4\},\\ \\text\{element6\},\\ \\text\{element7\},\\ \\text\{element8\}\. The number of circular arrangements of 6 distinct objects is \(6−1\)\!=5\!=120\(6\-1\)\!=5\!=120\. 120 Error 2: Misreading negated “together” as “no two adjacent” Question: bag0bag\_\{0\} contains: element3:2, element5:2, element6:2\. We arrange all elements from bag0bag\_\{0\} to form an ordered sequence sequence0sequence\_\{0\}\. Sequence sequence0sequence\_\{0\} contains the entities \(element3, element5\) and they are not together\. How many different ways can we obtain sequence0sequence\_\{0\}? Count each distinct combination of sequence0sequence\_\{0\} as one way\. Reference: ‘72‘ Prediction: Let A=element3A=\\text\{element3\}, B=element5B=\\text\{element5\}, C=element6C=\\text\{element6\}\. We arrange the multiset \{A,A,B,B,C,C\}\\\{A,A,B,B,C,C\\\} with the condition that AA and BB are not together, meaning no AA is adjacent to any BB\. Use the two CC’s as separators: \_C\_C\_\\\_\\ C\\ \\\_\\ C\\ \\\_\. There are 3 gaps where the AA’s and BB’s may be placed\. To avoid adjacency between AA and BB, each gap can contain only AA’s, only BB’s, or be empty\. We distribute 2 copies of AA and 2 copies of BB among these 3 gaps: • AA’s in 1 gap and BB’s in 1 gap: 3×2=63\\times 2=6\. • AA’s in 1 gap and BB’s split across 2 gaps: 33\. • AA’s split across 2 gaps and BB’s in 1 gap: 33\. Total: 6\+3\+3=126\+3\+3=12\. 12 Error 3: Joint vs\. independent sampling from the same bag Question: bag0bag\_\{0\} contains: element2:2, element5:1, element6:3\. From bag0bag\_\{0\}, we pick 2 elements and arrange them in order to form choose\_tuple0choose\\\_tuple\_\{0\}\. From bag0bag\_\{0\}, we pick 3 elements and arrange them in order to form choose\_tuple1choose\\\_tuple\_\{1\}\. How many different ways can we obtain choose\_tuple0,choose\_tuple1choose\\\_tuple\_\{0\},choose\\\_tuple\_\{1\}? Count each distinct combination as one way\. Reference: ‘152‘ Prediction: The bag contains a multiset of 6 elements: \{2,2,5,6,6,6\}\\\{2,2,5,6,6,6\\\}\. Together, the two tuples use 2\+3=52\+3=5 elements from the bag, so this is equivalent to forming an ordered sequence of length 5 from the bag \(first 2 positions = choose\_tuple0choose\\\_tuple\_\{0\}, last 3 = choose\_tuple1choose\\\_tuple\_\{1\}\)\. Since we use 5 out of 6 elements, exactly one element is left unused\. • Leave out one 2: counts \(1,1,3\)⇒5\!/\(1\! 1\! 3\!\)=20\(1,1,3\)\\Rightarrow 5\!/\(1\!\\,1\!\\,3\!\)=20\. • Leave out the 5: counts \(2,0,3\)⇒5\!/\(2\! 3\!\)=10\(2,0,3\)\\Rightarrow 5\!/\(2\!\\,3\!\)=10\. • Leave out one 6: counts \(2,1,2\)⇒5\!/\(2\! 1\! 2\!\)=30\(2,1,2\)\\Rightarrow 5\!/\(2\!\\,1\!\\,2\!\)=30\. Total: 20\+10\+30=6020\+10\+30=60\. 60 Error 4: Missing rotational quotient for a multiset circle Question: bag0bag\_\{0\} contains: element1:1, element2:1, element5:3, element6:1, element7:1, element9:1, element10:2\. We arrange all elements from bag0bag\_\{0\} around a circle called circle0circle\_\{0\}\. How many different ways can we obtain circle0circle\_\{0\}? Count each distinct combination of circle0circle\_\{0\} as one way\. Reference: ‘30240‘ Prediction: The bag has 10 elements with multiplicities \{1,1,3,1,1,1,2\}\\\{1,1,3,1,1,1,2\\\}\. The number of distinguishable circular arrangements of the multiset equals the multinomial coefficient 10\!1\! 1\! 3\! 1\! 1\! 1\! 2\!=3,628,80012=302,400\.\\frac\{10\!\}\{1\!\\,1\!\\,3\!\\,1\!\\,1\!\\,1\!\\,2\!\}=\\frac\{3\{,\}628\{,\}800\}\{12\}=302\{,\}400\. 302400 Error 5: Chained “together” constraints sharing an entity Question: Set set0set\_\{0\} contains: element7, element5, element8, element6, element2, element4, element1, element3\. We arrange all elements from set0set\_\{0\} around a circle called circle0circle\_\{0\}, two arrangements are considered the same if one can be rotated or reflected \(flipped\) to match the other\. Entities \(element6, element7\) must be together in sequence circle0circle\_\{0\} if any are present\. Entities \(element5, element2, element7\) must be together in sequence circle0circle\_\{0\} if any are present\. How many different ways can we obtain circle0circle\_\{0\}? Count each distinct combination of circle0circle\_\{0\} as one way\. Reference: ‘48‘ Prediction: The required ‘‘together’’ of \(element5, element2, element7\) means these three elements must be consecutive around the circle\. Also, \(element6, element7\) must be consecutive, so element7 must be adjacent to both element2 and element6\. So the forced block is the chain element5\-\-element2\-\-element7\-\-element6 up to reflection\. Treating this chain as one single block, we need to arrange the following 5 objects around a circle: \[element5,element2,element7,element6\]\[\\text\{element5\},\\text\{element2\},\\text\{element7\},\\text\{element6\}\], element1, element3, element4, element8\. The number of circular arrangements of 5 distinct objects up to rotation and reflection is \(5−1\)\!/2=12\(5\{\-\}1\)\!/2=12\. 12 Error 6: Compound block from together and next\_to in a circle Question: Set set0set\_\{0\} contains: element7, element5, element8, element6, element11, element12, element9, element2, element4, element1, element3, element10\. We arrange all elements from set0set\_\{0\} around a circle called circle0circle\_\{0\}\. Entities \(element12, element6, element9\) must be together in circle0circle\_\{0\} if any are present\. circle0circle\_\{0\} must contain element12 and element11 such that they are adjacent\. circle0circle\_\{0\} does not have element2 and element3 adjacent if both are present\. How many different ways can we obtain circle0circle\_\{0\}? Reference: ‘120960‘ Prediction: We treat \(element12, element6, element9\) as a required consecutive sequence\. Since element12 must also be adjacent to element11, and element12 is already adjacent to element6 in the required sequence, element11 must be on the other side of element12\. Thus we form one block \(element11, element12, element6, element9\)\. Now we arrange this block together with the remaining 8 elements, giving 99 objects around a circle: \(9−1\)\!=8\!=40320\(9\-1\)\!=8\!=40320\. Subtract arrangements where element2 and element3 are adjacent \(treat them as one block of size 2\): 2\!⋅7\!=100802\!\\cdot 7\!=10080\. Valid arrangements: 40320−10080=3024040320\-10080=30240\. 30240`
Similar Articles
OpenCompass: A Universal Evaluation Platform for Large Language Models
OpenCompass is a one-stop, scalable, high-concurrency evaluation platform for large language models, supporting diverse benchmarks and modular design to unify and standardize LLM assessment.
ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics
ComBench is an Olympiad-level combinatorics benchmark with 100 problems designed to evaluate rigorous proof reasoning and constructive realization in large language models, revealing that frontier models like GPT-5.5 achieve only 65.4% overall average and that these two capabilities are distinct.
From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text
A comprehensive dual-aspect evaluation framework for large language models on Vietnamese legal text simplification, combining quantitative benchmarking (Accuracy, Readability, Consistency) with qualitative error analysis across GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1.
EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation
This paper introduces EnvSimBench, a benchmark for evaluating Large Language Models' ability to simulate environments for agent training. It identifies a 'state change cliff' in current LLMs and proposes a constraint-driven pipeline to reduce hallucinations and costs.
WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models
WebCompass is a multimodal benchmark for evaluating LLMs on web coding tasks across three input modalities (text, image, video) and three task types (generation, editing, repair). It introduces an Agent-as-a-Judge paradigm that autonomously executes generated websites in a real browser to assess visual fidelity and interactivity.