CRAFT: A Unified Counterfactual Reasoning Framework for Tabular Question Answering and Fact Verification

arXiv cs.CL 06/08/26, 04:00 AM Papers
Summary
CRAFT is a unified counterfactual reasoning framework that improves tabular question answering and fact verification by constructing both original and counterfactual statements, extracting evidence from bidirectional reasoning paths, and integrating them via a weighted mechanism. Experiments show consistent improvements over baselines on WikiTQ and TabFact datasets.
arXiv:2606.06842v1 Announce Type: new Abstract: Table reasoning remains challenging for large language models (LLMs), particularly in tasks that require multi-step inference over long and structured tables. Existing approaches predominantly rely on single-direction reasoning, which limits their ability to explore alternative hypotheses across tasks. In this work, we propose CRAFT, a unified Counterfactual Reasoning Framework that reformulates Tabular question answering and fact verification into a general bidirectional verification process. Our method explicitly constructs both declarative statements and their counterfactual variants. Evidence is then extracted from reasoning along both the original and counterfactual paths, and integrated via a weighted mechanism to arrive at the final answer. Experimental results show that our approach consistently surpasses representative baselines on table reasoning datasets such as WikiTQ and TabFact, achieving especially large improvements on complex question answering. Our framework also significantly mitigates performance gaps between different backbone LLMs. This indicates that counterfactual reasoning effectively overcomes the limitations of single-direction inference, guiding LLMs toward more discerning reasoning and establishing a more principled paradigm for structured reasoning tasks. Our code will be made publicly available upon acceptance.
Cached at: 06/08/26, 09:21 AM
# CRAFT: A Unified Counterfactual Reasoning Framework for Tabular Question Answering and Fact Verification
Source: [https://arxiv.org/html/2606.06842](https://arxiv.org/html/2606.06842)
Chenshuo Pan¹, Yu Zhao¹, Jie Zhang¹, Changzai Pan¹, Zhenhe Wu¹, Jiayi Liang¹, Yujie Mao¹, Shuangyong Song¹, Yongxiang Li¹, Zhongjiang He¹

¹ Xingchen AGI Lab, China Telecom Artificial Intelligence Technology \(Beijing\) Co\., Ltd

###### Abstract

Table reasoning remains challenging for large language models \(LLMs\), particularly in tasks that require multi\-step inference over long and structured tables\. Existing approaches predominantly rely on single\-direction reasoning, which limits their ability to explore alternative hypotheses across tasks\. In this work, we propose CRAFT, a unified Counterfactual Reasoning Framework that reformulates Tabular question answering and fact verification into a general bidirectional verification process\. Our method explicitly constructs both declarative statements and their counterfactual variants\. Evidence is then extracted from reasoning along both the original and counterfactual paths, and integrated via a weighted mechanism to arrive at the final answer\. Experimental results show that our approach consistently surpasses representative baselines on table reasoning datasets such as WikiTQ and TabFact, achieving especially large improvements on complex question answering\. Our framework also significantly mitigates performance gaps between different backbone LLMs\. This indicates that counterfactual reasoning effectively overcomes the limitations of single\-direction inference, guiding LLMs toward more discerning reasoning and establishing a more principled paradigm for structured reasoning tasks\. Our code will be made publicly available upon acceptance\.


## 1 Introduction

Tables are a representative form of structured data and are widely found in financial reports, statistical yearbooks, and various domain\-specific knowledge bases \(Liu et al\., [2023a](https://arxiv.org/html/2606.06842#bib.bib20); Chen et al\., [2021b](https://arxiv.org/html/2606.06842#bib.bib7)\)\. Compared to natural language text, tables organize information explicitly through row–column structures, which poses challenges for data localization, symbolic computation, and entity linking \(Fang et al\., [2024](https://arxiv.org/html/2606.06842#bib.bib1); DU, [2025](https://arxiv.org/html/2606.06842#bib.bib25)\)\. Understanding and reasoning over tables has become an important research topic in natural language processing, with table question answering \(QA\) and table fact verification \(FV\) as two representative tasks\. To reason over tabular information, Large Language Models \(LLMs\) must interpret not only natural language questions but also the structural relations across rows and columns \(Liu et al\., [2023b](https://arxiv.org/html/2606.06842#bib.bib22); Ruan et al\., [2024](https://arxiv.org/html/2606.06842#bib.bib35)\)\. This requires the ability to perform operations such as cross\-cell comparison and numerical reasoning to produce reliable inference results \(Badaro et al\., [2023](https://arxiv.org/html/2606.06842#bib.bib2)\)\.

![Refer to caption](https://arxiv.org/html/2606.06842v1/x1.png)

Figure 1: Overview of CRAFT, a multi\-agent framework for table reasoning\. Rewriter forms an initial statement from the question, Reverser derives a counterfactual\-inspired reverse statement to create a complementary reasoning path, Extractor gathers evidence and predictions from both paths, and Rethinker jointly evaluates them and generates the final answer\.

Recent work addresses tabular reasoning challenges through two main strategies\. One line of research improves table understanding at the parameter level by incorporating table\-specific inductive bias through pre\-training or fine\-tuning LLMs \(Herzig et al\., [2020](https://arxiv.org/html/2606.06842#bib.bib12); Su et al\., [2024](https://arxiv.org/html/2606.06842#bib.bib36); Zhang et al\., [2025b](https://arxiv.org/html/2606.06842#bib.bib55)\), which is typically computationally expensive\. Another line of research promotes explicit reasoning and prompt engineering for sub\-table extraction \(Ye et al\., [2023](https://arxiv.org/html/2606.06842#bib.bib49); Sui et al\., [2024](https://arxiv.org/html/2606.06842#bib.bib37); Nahid and Rafiei, [2024](https://arxiv.org/html/2606.06842#bib.bib27); Wang et al\., [2024](https://arxiv.org/html/2606.06842#bib.bib43)\) to break down questions into verifiable statements without updating model parameters Wei et al\. \([2022](https://arxiv.org/html/2606.06842#bib.bib46)\); Chen et al\. \([2023](https://arxiv.org/html/2606.06842#bib.bib8)\); Xiong et al\. \([2025](https://arxiv.org/html/2606.06842#bib.bib65)\)\.

Critique\-based methods such as posterior self\-critique \(Yu et al\., [2025](https://arxiv.org/html/2606.06842#bib.bib51)\) and consistency\-based aggregation \(Ji et al\., [2024](https://arxiv.org/html/2606.06842#bib.bib14); Wang et al\., [2023](https://arxiv.org/html/2606.06842#bib.bib42); Liu et al\., [2024b](https://arxiv.org/html/2606.06842#bib.bib23)\) have recently proven effective for TableQA by detecting and correcting errors\. Nevertheless, the performance gain of repeated single\-direction reasoning often diminishes rapidly as the number of iterations increases\. This suggests that reasoning in a single direction is insufficient for robust table understanding\. To effectively challenge initial premises, it is crucial to explicitly integrate counterfactual analysis into the reasoning process, which has received little attention in previous studies\.

To bridge this gap, we propose CRAFT, a novel counterfactual reasoning framework for tabular data\. CRAFT jointly constructs both supporting and counterfactual evidence chains to systematically explore alternative scenarios\. Under each assumption, distinct reasoning chains are executed by LLMs, with the resulting evidence then aggregated to synthesize a final answer\. This eliminates the need for extensive, task\-specific pipeline engineering, offering a versatile solution for both table\-based question answering and fact verification\. By addressing the brittleness of unidirectional reasoning, CRAFT offers a principled approach advancing toward more structured and reliable table comprehension\.

Specifically, we incorporate four cooperative modules: 1\) Rewriter reformulates the original question into a declarative hypothesis and transforms the problem into a verifiable claim\. 2\) Reverser applies targeted transformation rules to this hypothesis to generate an informative counterfactual statement, creating a complementary scenario for reasoning\. 3\) Extractor distills essential supporting evidence from the intermediate reasoning steps of LLMs, proposing candidate answers under both factual and counterfactual assumptions\. 4\) Rethinker aggregates the extracted evidence and candidate answers to deliberate and arrive at the final, verified decision\. The overall framework is illustrated in Figure [1](https://arxiv.org/html/2606.06842#S1.F1)\.

In summary, our contributions are threefold:

• We propose CRAFT, a counterfactual reasoning framework that is agnostic to specific task formulations and applies uniformly to reasoning over structured data\.

• Extensive experiments and analyses are conducted to validate the effectiveness and robustness of CRAFT, which significantly outperforms existing methods and consistency\-based strategies in both accuracy and stability\.

• Our results indicate that counterfactual reasoning helps guide LLMs away from fixed patterns and toward more heuristic inference, providing a new perspective for the development of trustworthy reasoning systems\.

## 2 Related work

#### Table Reasoning

Table reasoning has gained significant attention, with many studies focusing on modifying model parameters to better adapt to the structural characteristics of tabular data\. Among early attempts, TAPAS Herzig et al\. \([2020](https://arxiv.org/html/2606.06842#bib.bib12)\) models row–column relations through structured embeddings\. A parallel line of research trains models on large collections of table\-related corpora through pre\-training, instruction tuning, or structured fine\-tuning, exemplified by RePanda Chegini et al\. \([2025](https://arxiv.org/html/2606.06842#bib.bib63)\), enabling them to acquire tabular semantic patterns, alignment behavior, and execution abilities Liu et al\. \([2022](https://arxiv.org/html/2606.06842#bib.bib21)\); Zhang et al\. \([2025b](https://arxiv.org/html/2606.06842#bib.bib55)\); Su et al\. \([2024](https://arxiv.org/html/2606.06842#bib.bib36)\)\. Studies also incorporate reinforcement learning, using feedback signals to further enhance model performance Aly et al\. \([2023](https://arxiv.org/html/2606.06842#bib.bib40)\); Wu et al\. \([2025](https://arxiv.org/html/2606.06842#bib.bib60)\); Pan et al\. \([2026](https://arxiv.org/html/2606.06842#bib.bib30)\)\.

On the other hand, large language models exhibit strong inherent reasoning abilities, enabling competitive performance on table tasks without modifying parameters Brown et al\. \([2020](https://arxiv.org/html/2606.06842#bib.bib5)\)\. Prompting\-based strategies have therefore been widely explored as a way to elicit this latent reasoning capability Zhou et al\. \([2023a](https://arxiv.org/html/2606.06842#bib.bib57)\); Chen et al\. \([2023](https://arxiv.org/html/2606.06842#bib.bib8)\); Brown et al\. \([2020](https://arxiv.org/html/2606.06842#bib.bib5)\)\. Chain\-of\-Thought Wei et al\. \([2022](https://arxiv.org/html/2606.06842#bib.bib46)\) explicitly encourages models to articulate intermediate reasoning steps\. Subsequent work extends this paradigm to tables by exploiting their structural properties\. Chain\-of\-Table Wang et al\. \([2024](https://arxiv.org/html/2606.06842#bib.bib43)\) achieves this by breaking reasoning into executable sub\-steps, where each step builds on the sub\-table produced by the last\. Other methods condense or reorganize table inputs prior to reasoning to ensure that models operate on compact, task\-relevant data Sui et al\. \([2024](https://arxiv.org/html/2606.06842#bib.bib37)\); Nahid and Rafiei \([2024](https://arxiv.org/html/2606.06842#bib.bib27)\), or incorporate external symbolic execution frameworks that offload computation and validation to SQL or Python engines Cheng et al\. \([2023](https://arxiv.org/html/2606.06842#bib.bib9)\); Zhang et al\. \([2024](https://arxiv.org/html/2606.06842#bib.bib52)\); Ni et al\. \([2023](https://arxiv.org/html/2606.06842#bib.bib29)\); Mao et al\. \([2025](https://arxiv.org/html/2606.06842#bib.bib26)\), allowing LLMs to concentrate on planning rather than calculation\.

Building further, other studies emphasize multi\-round and structural reasoning mechanisms for robustness and corrective capability\. Examples include self\-consistency Wang et al\. \([2023](https://arxiv.org/html/2606.06842#bib.bib42)\); Li et al\. \([2025](https://arxiv.org/html/2606.06842#bib.bib18)\), tree\-structured Ji et al\. \([2024](https://arxiv.org/html/2606.06842#bib.bib14)\) and graph\-structured reasoning Li et al\. \([2025](https://arxiv.org/html/2606.06842#bib.bib18)\), which enhance stability through multi\-branch logical inference\. MIX\-SC Liu et al\. \([2024b](https://arxiv.org/html/2606.06842#bib.bib23)\) combines textual and symbolic reasoning, while Table\-Critic Yu et al\. \([2025](https://arxiv.org/html/2606.06842#bib.bib51)\) introduces critical assessment to allow self\-correction of intermediate steps\. However, existing table\-reasoning methods generally reinforce inference around the original question and do not explore the use of counterfactual reasoning paths for explicit evidence contrast\.

#### Counterfactual Reasoning

Many studies construct counterfactual variants of inputs to enhance causal stability Feng et al\. \([2021](https://arxiv.org/html/2606.06842#bib.bib11)\), strengthen out\-of\-distribution generalization through relation\-driven counterfactual contrasts Yang et al\. \([2023](https://arxiv.org/html/2606.06842#bib.bib48)\), and test inference discrimination under atypical reasoning conditions Webb et al\. \([2025](https://arxiv.org/html/2606.06842#bib.bib45)\)\. Counterfactual reasoning has also been employed for semantic delimitation and self\-calibration, such as distinguishing hypothetical from factual knowledge Li et al\. \([2023](https://arxiv.org/html/2606.06842#bib.bib17)\) and improving contextual faithfulness through counterfactual exemplars in prompts Zhou et al\. \([2023b](https://arxiv.org/html/2606.06842#bib.bib58)\)\. The Direct–Indirect Reasoning Zhang et al\. \([2025c](https://arxiv.org/html/2606.06842#bib.bib56)\) framework provides a more general paradigm for solving mathematical problems by jointly incorporating forward and reverse verification\. Kim and Hwang \([2025](https://arxiv.org/html/2606.06842#bib.bib15)\) introduce Counterfactual\-Consistency Prompting, which constructs temporal counterfactual questions to enforce consistency, showing strong performance in relative temporal reasoning\. However, this approach relies exclusively on counterfactuals and is mainly effective for time\-related verification tasks\. To our knowledge, counterfactual reasoning has not been explored in table reasoning, even though the structured nature of tabular data makes it naturally compatible with counterfactual validation\.

## 3 Method

Notation and Definitions\. The overall framework of CRAFT consists of four modules: Rewriter, Reverser, Extractor, and Rethinker\. To formalize the problem setting, we denote the table as $T$, the ground\-truth answer as $A$, the natural\-language question as $Q$, and the declarative statement derived from $Q$ as $S$\. In table fact verification \(FV\), the input is already a declarative claim, and we have $S=Q$\.

### 3\.1 Rewriter

The Rewriter module is designed to transform an open\-ended table QA task into an informative and verifiable declarative statement that facilitates the subsequent generation of counterfactual statements\. Formally, the Rewriter process can be written as:

$$a\sim\Omega_{T,Q}\;:=\;\{\,v\mid\tau(v)=\tau_{A_{Q}}\,\}. \tag{1a}$$

$$S=\mathcal{M}(Q,a). \tag{1b}$$

We first infer the expected answer type $\tau_{A_{Q}}$ from the semantics of $Q$ alone, without referring to any ground\-truth answer at inference time\. Here, $\tau_{A_{Q}}$ denotes the semantic type expected by $Q$ \(e\.g\., time, number, or boolean\), and $\tau(v)$ denotes the semantic type of a candidate value $v$\. Accordingly, $\Omega_{T,Q}$ denotes the type\-consistent candidate value space induced by $T$ and $Q$\. An atomic value $a$ is sampled from this space and used to instantiate the semantic slot in $Q$, yielding a declarative statement $S$\. In practice, the LLM is prompted to infer $a$ from cells in $T$\. When no explicit cell is available, it may randomly instantiate $a$ with a type\-consistent value grounded in $T$ and $Q$\.

Unlike prior work that primarily uses QA pairs to improve fact verification \(Aly et al\., [2023](https://arxiv.org/html/2606.06842#bib.bib40)\), our Rewriter explicitly bridges QA and FV by transforming a question into a declarative statement $S$ via type\-consistent instantiation\. In this way, QA is recast into the same statement\-based form as FV, allowing the downstream modules to function uniformly across both tasks\.
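
To make Eq. (1a) and (1b) concrete, the following is a minimal sketch under simplifying assumptions, not the implementation used in the paper. The regex-based `value_type` is a toy stand-in for the type function τ, the expected answer type is supplied by hand rather than inferred by an LLM, and a string template replaces the model call M(Q, a). All function names are hypothetical.

```python
import random
import re


def value_type(v):
    """Toy stand-in for tau(v): classify a cell as 'number', 'time', or 'text'."""
    if re.fullmatch(r"-?\d+(\.\d+)?", v):
        return "number"
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", v):
        return "time"
    return "text"


def candidate_space(table, expected_type):
    """Omega_{T,Q}: cell values whose type matches the expected answer type."""
    return sorted({cell for row in table for cell in row
                   if value_type(cell) == expected_type})


def rewrite(statement_template, table, expected_type, rng):
    """Sample a from Omega_{T,Q} and fill the answer slot to obtain S."""
    omega = candidate_space(table, expected_type)
    a = rng.choice(omega)
    return statement_template.format(answer=a), a


# Hypothetical table and question "How old is the oldest player?"
table = [["Alice", "31", "2001-05-03"], ["Bob", "27", "1999-11-20"]]
rng = random.Random(0)  # seeded only to make the sketch reproducible
statement, a = rewrite("The oldest player is {answer} years old.",
                       table, "number", rng)
print(statement)
```

The sampled value does not need to be correct: the resulting statement is only a verifiable hypothesis that the later modules confirm or refute against the table.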

### 3\.2 Reverser

The Reverser module takes a declarative statement $S$ as input and aims to construct an optimal counterfactual statement $R^{*}$\. Counterfactual reasoning is not a simple binary flip; instead, a statement $S$ may admit multiple counterfactual directions\. In particular, when $Q$ is available, $S$ may carry latent semantics beyond its surface form, so the admissible counterfactual directions should be defined with respect to the joint semantics of $(Q,S)$\. For clarity, we denote counterfactuals instantiated along different directions by $R_{i}=\phi^{\delta_{i}}(S)$, where $\delta_{i}$ simply indexes distinct admissible directions under the semantic context\. Prior work has shown that models may underutilize information depending on input position or presentation \(Liu et al\., [2024a](https://arxiv.org/html/2606.06842#bib.bib24); Wan et al\., [2025](https://arxiv.org/html/2606.06842#bib.bib44)\)\. Therefore, we introduce $\Psi(\cdot)$ to denote the latent reasoning space induced by a declarative statement during inference\. Under this notation, counterfactuals constructed along different directions are characterized by inducing different reasoning subspaces:

$$\delta_{i}\neq\delta_{j}\;\Rightarrow\;\Psi(R_{i})\neq\Psi(R_{j}). \tag{2}$$

The purpose of introducing counterfactuals is to broaden the reasoning space beyond that induced by $S$ alone\. We characterize the optimal counterfactual $R^{*}$ as the one whose induced reasoning, when considered together with that of $S$, provides the greatest overall informational contribution:

$$R^{*}\;=\;\arg\max_{i}\;\mathcal{G}\!\left(\Psi(S)\;\cup\;\Psi(R_{i})\right). \tag{3}$$

To construct an informative reverse statement $R^{*}$, we first generate a small set of candidate counterfactuals from $S$ using a rule\-based template set derived from statement semantics \(and the original question $Q$ when available\), ensuring that the candidates remain plausible and semantically aligned with the original statement\. We then prompt the model to generate a SQL\-style verification program for $S$ and each reverse candidate, conditioned on the table’s column names and the entities mentioned in the statement, and use the resulting program structure as a proxy for the induced reasoning space\. The final reverse statement $R^{*}$ is selected as the candidate whose induced program yields the largest structural expansion beyond that of $S$\. While more advanced SQL generation techniques may further improve performance, we intentionally adopt a lightweight generation strategy in this work\. More implementation details are provided in Appendix [D\.1](https://arxiv.org/html/2606.06842#A4.SS1)\.
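
The selection step in Eq. (3) can be illustrated with a small sketch. The exact measure of structural expansion is deferred to the appendix and is not reproduced here, so the token set difference below is only an assumed proxy for 𝒢; the SQL strings, the candidate statements, and all function names are hypothetical.

```python
import re


def program_tokens(sql):
    """Lower-cased keyword, identifier, operator, and number tokens of a program."""
    return set(re.findall(r"[a-z_]+|[<>=!]+|\d+", sql.lower()))


def structural_expansion(base_sql, candidate_sql):
    """Assumed proxy for G: number of tokens the candidate adds beyond S."""
    return len(program_tokens(candidate_sql) - program_tokens(base_sql))


def select_reverse(base_sql, candidates):
    """Pick the candidate statement whose program expands the most on S's program."""
    return max(candidates,
               key=lambda stmt: structural_expansion(base_sql, candidates[stmt]))


# Hypothetical verification program for S and for two reverse candidates.
base = "SELECT name FROM t WHERE points = 10"
candidates = {
    "The player did not score 10 points.":
        "SELECT name FROM t WHERE points != 10",
    "Another player scored more than 10 points.":
        "SELECT name FROM t WHERE points > 10 ORDER BY points DESC LIMIT 1",
}
best = select_reverse(base, candidates)
print(best)
```

In this toy case the plain negation adds a single new token, while the second candidate introduces ordering and limiting operations, so it is preferred as the more informative counterfactual.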

Table 1: Performance comparison between CRAFT and representative table\-reasoning baselines on WikiTQ and TabFact across multiple backbone LLMs\. Bold denotes the best performance, while underline denotes the second\-highest performance\.

### 3\.3 Extractor

Extractor aims to produce both supporting evidence $E$ and candidate answers $A$ for the original question $Q$\. Given the rewritten statement $S$ and the counterfactual statement $R^{*}$, the model is prompted to perform step\-by\-step reasoning grounded in the table $T$,¹ resulting in two reasoning traces, $\mathrm{Trace}(S)$ and $\mathrm{Trace}(R^{*})$\. Each reasoning trace is treated as an instantiation of the corresponding latent reasoning space $\Psi(X)$\.

¹ We implement our method within the Table\-Critic framework for execution convenience, but our approach can be deployed within other table reasoning systems as well\.

We collect evidence from each trace as follows:

$$
\begin{aligned}
\mathbb{I}_{T,Q}(e) &= \begin{cases} 1, & \text{if } e \text{ is consistent w.r.t. } (T,Q),\\ 0, & \text{otherwise,} \end{cases}\\
E(X) &= \sum_{e\in\mathrm{Trace}(X)}\mathbb{I}_{T,Q}(e),\qquad X\in\{S,R^{*}\}.
\end{aligned}
\tag{4}
$$

Here, $\mathbb{I}_{T,Q}(e)$ is obtained by using an LLM to evaluate whether a candidate evidence item $e$ appearing in $\mathrm{Trace}(X)$ is logically consistent under the table–question context $(T,Q)$ and informative for answering the question\. Items that fail this check are discarded, while those with $\mathbb{I}_{T,Q}(e)=1$ are accumulated along the trace to form the evidence associated with $X$, denoted $E(X)$\.

Applying this procedure to the two traces yields two evidence sets, $E_{1}=E(S)$ and $E_{2}=E(R^{*})$, corresponding to the Rewriter and Reverser reasoning processes, respectively\. Each evidence set is subsequently used together with the original question and table to produce a candidate answer via the same large language model, i\.e\., $A_{1}=\mathcal{M}(Q,T,E_{1})$ and $A_{2}=\mathcal{M}(Q,T,E_{2})$\.
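
The filtering in Eq. (4) amounts to keeping only the trace items that pass the consistency check. The sketch below assumes the check is available as a callable; in the paper it is an LLM judgement, whereas here a toy lookup against known table facts stands in for it. All names and data are hypothetical.

```python
from typing import Callable, List


def collect_evidence(trace: List[str],
                     is_consistent: Callable[[str], bool]) -> List[str]:
    """Keep the items e with I_{T,Q}(e) = 1, preserving their order in the trace."""
    return [e for e in trace if is_consistent(e)]


# Toy stand-in for the LLM-based indicator: an item is consistent
# if it matches a fact that can be read off the table.
table_facts = {"Alice scored 12", "Bob scored 9"}


def toy_judge(e: str) -> bool:
    return e in table_facts


trace_s = ["Alice scored 12", "Bob scored 15", "Bob scored 9"]   # Trace(S)
trace_r = ["Bob scored 9", "Carol scored 3"]                     # Trace(R*)

E1 = collect_evidence(trace_s, toy_judge)
E2 = collect_evidence(trace_r, toy_judge)
print(E1, E2)
```

Each filtered list would then be passed, together with the question and the table, to the model that produces the candidate answer for that path.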

### 3\.4 Rethinker

We design a set of weighted decision rules for the Rethinker module\. When $A_{1}$ and $A_{2}$ coincide, the module directly returns the shared answer\. When the two candidate answers differ, we compute self\-consistency scores for each $(\text{answer},\text{evidence})$ pair produced by Rewriter and Reverser, mapping each score into the range $[-1,1]$\. The module then compares the score difference and determines the answer according to a predefined set of rules\. For cases in which the two candidates cannot be reliably distinguished, the model re\-evaluates each candidate under the opposing evidence to test whether its reasoning remains consistent under counter perspectives\. The detailed decision algorithm is provided in Appendix [D\.2](https://arxiv.org/html/2606.06842#A4.SS2)\.
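
The control flow described above can be sketched as follows. The actual rules live in the appendix and are not reproduced here, so the `margin` threshold, the tie-break toward the Rewriter answer, and the `cross_check` callable are all assumptions made purely for illustration.

```python
from typing import Callable, List


def rethink(a1: str, a2: str, s1: float, s2: float,
            e1: List[str], e2: List[str],
            cross_check: Callable[[str, List[str]], float],
            margin: float = 0.2) -> str:
    """Choose between the Rewriter answer a1 and the Reverser answer a2.

    s1 and s2 are self-consistency scores in [-1, 1]; margin is an assumed
    threshold below which the two candidates count as indistinguishable.
    """
    if a1 == a2:
        return a1                      # both paths agree
    if abs(s1 - s2) >= margin:
        return a1 if s1 > s2 else a2   # clear score difference
    # Ambiguous case: re-evaluate each candidate under the opposing evidence.
    c1 = cross_check(a1, e2)
    c2 = cross_check(a2, e1)
    return a1 if c1 >= c2 else a2      # assumed tie-break toward a1


def toy_cross_check(answer: str, opposing_evidence: List[str]) -> float:
    """Toy scorer: 1.0 if the answer is mentioned in the opposing evidence."""
    return 1.0 if any(answer in e for e in opposing_evidence) else -1.0


e1 = ["Alice scored 12"]
e2 = ["Alice scored 12", "Bob scored 9"]
result = rethink("12", "9", 0.50, 0.45, e1, e2, toy_cross_check)
print(result)
```

Here the scores differ by less than the assumed margin, so the decision falls to the cross-check, where only the answer "12" survives evaluation under the opposing path's evidence.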

## 4 Experiments

### 4\.1 Experimental Setup

#### Datasets

We evaluate the proposed CRAFT on two widely used table\-understanding datasets: TabFact \(Chen et al\., [2020](https://arxiv.org/html/2606.06842#bib.bib6)\) and WikiTQ \(Pasupat and Liang, [2015](https://arxiv.org/html/2606.06842#bib.bib32)\)\.

TabFact is a table\-based binary fact verification \(FV\) dataset in which the task is to determine whether a given statement is supported by the table\. Following prior work, we report binary classification accuracy as the evaluation metric\. In contrast, WikiTQ is a table\-based question answering \(QA\) dataset with short, structured answers\. We evaluate performance using the official denotation accuracy\.

For all datasets, we conduct systematic evaluations using several strong base models, including Deepseek\-R1\-distilled\-14B \(Guo et al\., [2025](https://arxiv.org/html/2606.06842#bib.bib61)\), LLaMA 3\.3–70B \(Grattafiori et al\., [2024](https://arxiv.org/html/2606.06842#bib.bib16)\), Qwen 2\.5–Instruct \(Qwen et al\., [2025](https://arxiv.org/html/2606.06842#bib.bib34)\), and GPT\-5\-mini \(OpenAI, [2025](https://arxiv.org/html/2606.06842#bib.bib4)\)\.² The full details of all tested model settings, prompt formats, hyperparameters, and additional experimental settings are provided in Appendix [A](https://arxiv.org/html/2606.06842#A1)\.

² For each model family, we report the best\-performing version observed within the models we evaluated\.

#### Baselines

We compare CRAFT with three representative categories of baseline methods:

\(1\) Standard reasoning methods typically rely on direct prompting\. End\-to\-End QA feeds the table and the question directly into the model, without any explicit intermediate reasoning\. Few\-Shot QA extends this setting by conditioning the model on a small set of in\-context table–question–answer examples, from which the model implicitly learns the desired input–output patterns and reasoning regularities\.

\(2\) Task decomposition methods decompose table reasoning into structured subtasks\. Binder \(Cheng et al\., [2023](https://arxiv.org/html/2606.06842#bib.bib9)\) translates questions into executable SQL or Python programs, enabling interpretable symbolic reasoning\. Dater \(Ye et al\., [2023](https://arxiv.org/html/2606.06842#bib.bib49)\) follows a parsing–execution–filling process and performs question decomposition at the sub\-table level\. Chain\-of\-Table \(Wang et al\., [2024](https://arxiv.org/html/2606.06842#bib.bib43)\) incrementally constructs sub\-tables, with each step building on the previous one\. ALTER \(Zhang et al\., [2025a](https://arxiv.org/html/2606.06842#bib.bib53)\) retrieves a relevant subset of data and augments it with schema and semantic information\.

\(3\) Critic\-based methods refine reasoning through self\-evaluation or multi\-agent critique\. Critic\-CoT \(Zheng et al\., [2025](https://arxiv.org/html/2606.06842#bib.bib59)\) applies a self\-critique stage that reviews and refines generated CoTs to revise erroneous inference\. Table\-Critic \(Yu et al\., [2025](https://arxiv.org/html/2606.06842#bib.bib51)\) adopts an agentic multi\-stage workflow that iteratively evaluates, critiques, and revises intermediate reasoning steps to enhance logical consistency and reliability\.

### 4\.2 Main Results

Table [1](https://arxiv.org/html/2606.06842#S3.T1) reports the performance of our method and representative baselines under different backbone LLMs\.

Across the four backbone LLMs, our method achieves an average of 82\.4% on WikiTQ and 94\.6% on TabFact, exceeding the strongest baseline by 4\.7 and 1\.1 points, respectively\. This improvement is observed for all four backbones\. Notably, the largest absolute gain occurs with Llama3\.3–70B, where WikiTQ accuracy increases by 11\.3 points over the strongest baseline\. In standard baseline settings and at comparable model scales, Qwen2\.5\-72B consistently outperforms Llama3\.3\-70B\. With our framework in place, however, this ordering is substantially altered: Llama3\.3\-70B surpasses Qwen2\.5\-72B on WikiTQ and TabFact\. This reversal indicates that our method significantly reduces cross\-model performance discrepancies\.

Beyond absolute performance, the results reveal two systematic patterns\. First, although open\-ended TableQA remains more challenging than fact verification, our framework yields larger relative gains on WikiTQ, providing direct evidence that casting QA into a fact\-verification–style reasoning paradigm is an effective modeling choice\. Second, our method exhibits uniform applicability across a wide range of model regimes, spanning different model families, different parameter scales, and both open\-source and closed\-source LLMs\. This robustness further extends to different reasoning instantiations: even when executed by a simpler table\-reasoning framework such as Chain\-of\-Table, our method still outperforms other baseline methods, suggesting that it serves as a general reasoning scaffold rather than backbone\-dependent optimization\.

Taken together, these results indicate that counterfactual reasoning functions as a general reasoning heuristic rather than a task\-specific solution\. By guiding how models explore alternatives, it provides a more effective way of thinking about table reasoning, leading to consistent performance gains across settings\.

### 4\.3 Ablation Study on Rewriter and Reverser

As shown in Table [2](https://arxiv.org/html/2606.06842#S4.T2), we conduct ablation experiments to verify the effectiveness of Rewriter, Reverser, and their combination\. For TabFact, inputs are already in declarative form, so Rewriter\-only is equivalent to the baseline\. Taking LLaMA3\.3–70B as an example, Rewriter\-only matches the baseline performance of 92\.1%, while Reverser\-only follows the counterfactual path alone and drops slightly to 91\.9%\. This difference is as anticipated because counterfactual reasoning deliberately explores alternative and potentially incorrect hypotheses\. Notably, the comparable accuracy between the two paths indicates that solving counterfactual statements can still recover sufficient evidence to answer the original question\.

In TableQA, however, Rewriter\-only leads to a substantial improvement, highlighting the benefit of converting questions into declarative statements\. The additional gain from combining Rewriter and Reverser suggests a complementary interaction, where counterfactual reasoning broadens the evidence considered for the original question\.

Table 2: Accuracy of Rewriter\-only, Reverser\-only, and the full CRAFT based on both Rewriter and Reverser across different backbone models\. † Rewriter differs from the Baseline for WikiTQ but is equivalent for TabFact\.

![Refer to caption](https://arxiv.org/html/2606.06842v1/latex/figures/framework_ablation_bar_acl_column_v2.png)

\(a\)

![Refer to caption](https://arxiv.org/html/2606.06842v1/latex/figures/path_at_k_two_panel_acl_real_polished.png)

\(b\)

Figure 2: Repeated\-sampling analyses on WikiTQ/TabFact\. \(a\) Accuracy comparison between voting methods and CRAFT\. \(b\) Ideal \(Pass@K\) accuracy upper bounds compared with our method\.

### 4\.4 Effectiveness Beyond Repeated\-Sampling

Given that our framework structurally involves one more reasoning path, we perform controlled comparisons to rule out the concern that the observed gains are merely due to repeated sampling\. Specifically, we compare our method with two voting methods based on Table\-Critic: \(1\) Self\-Consistency \(SC\), which samples $N$ answers for each question and selects the final prediction by majority voting over their frequencies; and \(2\) Confidence\-Weighted \(CW\), which also samples $N$ answers but aggregates them by summing $\exp(\text{score})$ for each unique candidate and choosing the answer with the highest total weight\. Here we choose $N=3$ to allow a fair comparison within a reasonable computational cost\.³

³ From this section onward, all experiments use Llama3–70B unless stated otherwise\. Outputs are normalized and format mismatches are not counted as errors, unlike in the main table\.

As shown in Fig\. [2](https://arxiv.org/html/2606.06842#S4.F2)\(a\), our framework achieves improvements of 3\.8 and 1\.0 percentage points over the better of the two voting methods on table question answering \(QA\) and table fact verification \(FV\), respectively\. This indicates that the improvements do not simply stem from ensembling a diverse set of answers\.
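
For reference, the two voting baselines described above can be sketched in a few lines. This is an illustration of the aggregation rules only, with made-up answers and scores; how the per-sample score is obtained is not specified here, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def self_consistency(answers):
    """Majority vote; ties go to the answer that was sampled first."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted(answers, scores):
    """Sum exp(score) per unique answer and return the heaviest one."""
    weights = defaultdict(float)
    for answer, score in zip(answers, scores):
        weights[answer] += math.exp(score)
    return max(weights, key=weights.get)


# N = 3 sampled answers with hypothetical confidence scores.
answers = ["1998", "2001", "2001"]
scores = [2.0, 0.1, 0.2]

print(self_consistency(answers))
print(confidence_weighted(answers, scores))
```

With these inputs the two rules disagree: majority voting follows the two low-confidence samples, while the exponential weighting lets the single high-confidence sample dominate. Both still aggregate samples drawn along the same reasoning direction.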

To further quantify the potential benefit of our method, we adopt the *Pass@K* metric \(Chen et al\., [2021a](https://arxiv.org/html/2606.06842#bib.bib10)\) to measure whether the correct answer appears among a set of $K$ generated predictions\. Given the set of predictions $\{\hat{y}_{1},\dots,\hat{y}_{K}\}$ and the ground\-truth label $y$, Pass@K is defined as:

Pass@K=𝕀\(∃i∈\{1,…,K\}s\.t\.y^i=y\),\\mathrm\{Pass@\}K=\\mathbb\{I\}\\\!\\left\(\\exists\\,i\\in\\\{1,\\dots,K\\\}\\;\\text\{s\.t\.\}\\;\\hat\{y\}\_\{i\}=y\\right\),\(5\)where𝕀\(⋅\)\\mathbb\{I\}\(\\cdot\)denotes the indicator function\.

Pass@KKmeasures whether the candidate answer set contains at least one correct prediction, thereby reflecting how many truly solvable cases are covered by the model’s output distribution and providing an upper bound on the achievable accuracy underKKsamplesYueet al\.\([2025](https://arxiv.org/html/2606.06842#bib.bib64)\)\.
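As a concrete reading of Eq\. \(5\), the sketch below computes the per\-example indicator and averages it over a small dataset for several values of $K$. The helper names and toy data are illustrative assumptions, and exact string equality stands in for whatever answer normalization the evaluation applies.

```python
def pass_at_k(predictions, gold):
    """Indicator from Eq. (5): 1 if any of the K predictions equals the gold label."""
    return int(any(p == gold for p in predictions))


def mean_pass_at_k(all_predictions, golds, k):
    """Average the indicator over a dataset, using the first k predictions per example."""
    hits = [pass_at_k(preds[:k], gold) for preds, gold in zip(all_predictions, golds)]
    return sum(hits) / len(hits)


# Toy data: three examples with three sampled predictions each (hypothetical).
all_predictions = [["a", "b", "c"], ["x", "y", "z"], ["m", "m", "n"]]
golds = ["b", "q", "m"]

curve = {k: mean_pass_at_k(all_predictions, golds, k) for k in (1, 2, 3)}
print(curve)
```

Because the indicator only asks whether a correct answer is present, the averaged value can never decrease as $K$ grows, which is why it serves as an upper bound on what any selection rule over the same $K$ candidates could achieve.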

As reported in Fig\.[2](https://arxiv.org/html/2606.06842#S4.F2)\(b\), we compare our method with repeated sampling of the strongest baseline under different values of $K$\. Our method effectively corresponds to $K=2$, since it explicitly constructs two distinct reasoning paths\. Under the same setting, our method achieves a higher Pass@$K$ than the repeated\-sampling baseline\. Even when the baselines are allowed to increase $K$, they do not match the performance of our approach\. This suggests that the direction of sampling \(i\.e\., encouraging diverse and structured exploration of the reasoning space\) is more important than the sheer number of samples\. By guiding the model to explore different regions of table semantics rather than repeatedly sampling from the same distribution, our method achieves higher coverage of correct reasoning trajectories with only a small number of candidates\.

![Refer to caption](https://arxiv.org/html/2606.06842v1/latex/figures/reflection_budget_QA_FV.png)

Figure 3: WikiTQ and TabFact accuracy as the number of self\-critic iterations increases, comparing CRAFT with single\-direction reasoning\.

### 4\.5Impact of Self\-Critique Iterations

We examine the relationship between our method and self\-critique by varying the number of self\-critic iterations\. In our method, self\-critique is used in the Extractor\. As shown in Figure [3](https://arxiv.org/html/2606.06842#S4.F3), across both WikiTQ and TabFact, increasing the number of self\-critic iterations consistently improves accuracy over the no\-self\-critic setting, while the performance gain of single\-direction reasoning quickly saturates\.

Notably, even as the maximum number of self\-critic iterations increases \($r_{\max}\in[1,10]$\), the performance of single\-direction reasoning remains below that of our method, which integrates bidirectional evidence from the Rewriter and Reverser\. At the same time, the performance of CRAFT also improves as $r_{\max}$ increases, indicating that it is compatible with self\-critique\. These trends suggest that self\-critique effectively strengthens individual candidates, while bidirectional reasoning provides additional, complementary gains that cannot be recovered by increasing self\-critic iterations alone\.
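To make the role of the iteration budget concrete, the following is a generic sketch of a bounded critique\-and\-revise loop. It is not the Extractor's actual implementation: the `critique` and `revise` callables stand in for LLM calls, and the early exit once the critic accepts a candidate is an assumption suggested only by the word "maximum".

```python
from typing import Callable, Tuple


def refine_with_self_critique(
    candidate: str,
    critique: Callable[[str], Tuple[bool, str]],
    revise: Callable[[str, str], str],
    r_max: int,
) -> Tuple[str, int]:
    """Run at most r_max critique/revise rounds; return the final candidate and revisions made."""
    revisions = 0
    for _ in range(r_max):
        accepted, feedback = critique(candidate)
        if accepted:  # assumed early exit: the critic has no further objection
            break
        candidate = revise(candidate, feedback)
        revisions += 1
    return candidate, revisions


# Toy stand-ins for the LLM calls (purely hypothetical behaviour).
def toy_critique(text: str) -> Tuple[bool, str]:
    return text.endswith("."), "add a final period"


def toy_revise(text: str, feedback: str) -> str:
    return text + "."


result, used = refine_with_self_critique(
    "row 3 has the max value", toy_critique, toy_revise, r_max=3
)
print(result, used)
```

Setting `r_max=0` returns the candidate unchanged, which corresponds to the no\-self\-critic setting used as the reference point in Figure 3.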

### 4\.6Effect of Multiple Counterfactual Statements

To examine whether multiple counterfactual directions can further improve reasoning coverage beyond the single\-counterfactual setting, we generate $K$ counterfactual statements \($K\in\{1,2,3\}$\) for each original statement, run the same reasoning pipeline independently on each of them, and evaluate the resulting predictions with Pass@K\. As shown in Table [3](https://arxiv.org/html/2606.06842#S4.T3), increasing $K$ yields consistent but modest gains, with diminishing returns as more counterfactual statements are added\. Notably, compared with the repeated\-sampling results in Fig\.[2](https://arxiv.org/html/2606.06842#S4.F2)\(b\), introducing additional counterfactual\-driven reasoning paths leads to larger Pass@K improvements than simply repeating the reasoning process\. However, this comes at a higher computational cost, and although a higher Pass@K suggests greater potential benefit from multiple paths, reliably selecting the best one remains non\-trivial, which we leave for future work\.

Table 3: Pass@K results when using Rewriter with $K$ Reverse \(counterfactual\) reasoning paths\.

### 4\.7Performance Across Different Table Sizes

![Refer to caption](https://arxiv.org/html/2606.06842v1/latex/figures/framework_ablation_line_acl_column.png)

Figure 4: Performance across different table sizes\. We partition tables into three size groups with the following thresholds\. For WikiTQ: small \(<2000 tokens\), medium \(2000–4000\), and large \(\>4000\)\. For TabFact: small \(<500 tokens\), medium \(500–800\), and large \(\>800\)\.

Large tables pose substantial challenges for large language models \(LLMs\), as they often struggle to effectively track, integrate, and reason over long input contexts \(Liu et al\., [2023a](https://arxiv.org/html/2606.06842#bib.bib20); Ye et al\., [2023](https://arxiv.org/html/2606.06842#bib.bib49)\)\. To evaluate the impact of table size on performance, we compare our method against two representative baselines, as shown in Figure [4](https://arxiv.org/html/2606.06842#S4.F4)\.

For WikiTQ, all methods exhibit degraded performance as table size increases, a trend that reflects the growing complexity of reasoning over larger contexts\. In contrast, results on TabFact show no significant correlation with table size\. A plausible explanation is that fact verification primarily requires pinpointing and validating specific claims against relevant table cells and is therefore less sensitive to overall table scale\. Notably, our approach demonstrates superior robustness, consistently outperforming baselines across all table sizes\. This robustness allows the counterfactual framework to maintain computational efficiency as context length increases\.
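The size\-bucketed analysis can be sketched as below, using the token thresholds stated in the Figure 4 caption. How tokens are counted, how boundary values \(e\.g\. exactly 2000 or 4000 tokens\) are assigned, and the helper names are assumptions made for this illustration; the records are toy data, not results from the paper.

```python
from collections import defaultdict

# (small/medium boundary, medium/large boundary) in tokens, from the Figure 4 caption.
SIZE_THRESHOLDS = {"WikiTQ": (2000, 4000), "TabFact": (500, 800)}


def size_group(dataset, n_tokens):
    """Map a table's token count to 'small', 'medium', or 'large'."""
    low, high = SIZE_THRESHOLDS[dataset]
    if n_tokens < low:
        return "small"
    if n_tokens <= high:  # assumption: both boundary values fall into 'medium'
        return "medium"
    return "large"


def accuracy_by_size(dataset, records):
    """records: iterable of (n_tokens, is_correct); returns accuracy per size group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for n_tokens, is_correct in records:
        group = size_group(dataset, n_tokens)
        total[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / total[group] for group in total}


# Toy records (hypothetical token counts and correctness flags).
records = [(1500, True), (1800, False), (3000, True), (5000, False)]
by_size = accuracy_by_size("WikiTQ", records)
print(by_size)
```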

## 5Conclusion

We propose CRAFT, an explicit counterfactual framework for table reasoning\. The method jointly guides an LLM through both forward and counterfactual reasoning paths, which are integrated via a dedicated rethink module to arrive at the final decision\. The framework not only delivers substantial performance gains but also unifies table\-based question answering and fact verification under a shared counterfactual reasoning paradigm\. Furthermore, the counterfactual perspective provides new insight into how LLMs can be more effectively steered toward structured table understanding\.

## Limitations

While our framework achieves competitive and encouraging results, we acknowledge several limitations that call for continued exploration\. First, we did not conduct experiments with smaller base models such as 3B, because tabular inputs typically require long contexts\. Second, our method is tailored to table reasoning tasks\. Although the underlying ideas may generalize to other forms of reasoning, such as text\-based QA, ensuring semantic consistency and generating atomic facts that align closely with ground\-truth answers remains highly challenging\. Finally, the performance of our Rethink module still leaves room for improvement\. While the current results are encouraging, the module has not yet fully realized the potential of counterfactual reasoning\. Further refinement is required to maximize its corrective and reasoning capabilities, providing an important direction for future work\.

## Ethics Statement

This work studies a multi\-agent reasoning framework for table\-based question answering and fact verification\. All experiments in this paper are conducted on publicly available benchmark datasets \(WikiTQ and TabFact\), which do not contain personal or sensitive information\. Parts of the implementation and baseline systems are adapted from prior work and publicly released codebases, with appropriate attribution and in compliance with their respective licenses\. We view this framework as a research contribution and encourage its use in settings consistent with its intended scope\. We therefore believe that this study complies with the ARR Ethics Policy\.

## References

- R\. Aly, M\. Strong, and A\. Vlachos \(2023\)QA\-NatVer: question answering for natural logic\-based fact verification\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 8376–8391\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.521/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.521)Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p1.1),[§3\.1](https://arxiv.org/html/2606.06842#S3.SS1.p3.1)\.
- G\. Badaro, M\. Saeed, and P\. Papotti \(2023\)Transformers for tabular data representation: a survey of models and applications\.Transactions of the Association for Computational Linguistics11,pp\. 227–249\.External Links:[Link](https://aclanthology.org/2023.tacl-1.14/),[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00544)Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p1.1)\.
- A\. Bavaresco, R\. Bernardi, L\. Bertolazzi, D\. Elliott, R\. Fernández, A\. Gatt, E\. Ghaleb, M\. Giulianelli, M\. Hanna, A\. Koller, A\. Martins, P\. Mondorf, V\. Neplenbroek, S\. Pezzelle, B\. Plank, D\. Schlangen, A\. Suglia, A\. K\. Surikuchi, E\. Takmaz, and A\. Testoni \(2025\)LLMs instead of human judges? a large scale empirical study across 20 NLP evaluation tasks\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 238–255\.External Links:[Link](https://aclanthology.org/2025.acl-short.20/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-short.20),ISBN 979\-8\-89176\-252\-7Cited by:[§D\.2](https://arxiv.org/html/2606.06842#A4.SS2.p1.1)\.
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, S\. Agarwal, A\. Herbert\-Voss, G\. Krueger, T\. Henighan, R\. Child, A\. Ramesh, D\. Ziegler, J\. Wu, C\. Winter, C\. Hesse, M\. Chen, E\. Sigler, M\. Litwin, S\. Gray, B\. Chess, J\. Clark, C\. Berner, S\. McCandlish, A\. Radford, I\. Sutskever, and D\. Amodei \(2020\)Language models are few\-shot learners\.InAdvances in Neural Information Processing Systems,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\),Vol\.33,pp\. 1877–1901\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf)Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p2.1)\.
- A\. Chegini, K\. Rezaei, H\. Eghbalzadeh, and S\. Feizi \(2025\)RePanda: pandas\-powered tabular verification and reasoning\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 32200–32212\.External Links:[Link](https://aclanthology.org/2025.acl-long.1549/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1549),ISBN 979\-8\-89176\-251\-0Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. de Oliveira Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, A\. Ray, R\. Puri, G\. Krueger, M\. Petrov, H\. Khlaaf, G\. Sastry, P\. Mishkin, B\. Chan, S\. Gray, N\. Ryder, M\. Pavlov, A\. Power, L\. Kaiser, M\. Bavarian, C\. Winter, P\. Tillet, F\. P\. Such, D\. Cummings, M\. Plappert, F\. Chantzis, E\. Barnes, A\. Herbert\-Voss, W\. H\. Guss, A\. Nichol, A\. Paino, N\. Tezak, J\. Tang, I\. Babuschkin, S\. Balaji, S\. Jain, W\. Saunders, C\. Hesse, A\. N\. Carr, J\. Leike, J\. Achiam, V\. Misra, E\. Morikawa, A\. Radford, M\. Knight, M\. Brundage, M\. Murati, K\. Mayer, P\. Welinder, B\. McGrew, D\. Amodei, S\. McCandlish, I\. Sutskever, and W\. Zaremba \(2021a\)Evaluating large language models trained on code\.External Links:2107\.03374,[Link](https://arxiv.org/abs/2107.03374)Cited by:[§4\.4](https://arxiv.org/html/2606.06842#S4.SS4.p3.3)\.
- W\. Chen, X\. Ma, X\. Wang, and W\. W\. Cohen \(2023\)Program of thoughts prompting: disentangling computation from reasoning for numerical reasoning tasks\.External Links:2211\.12588,[Link](https://arxiv.org/abs/2211.12588)Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p2.1),[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p2.1)\.
- W\. Chen, H\. Wang, J\. Chen, Y\. Zhang, H\. Wang, S\. Li, X\. Zhou, and W\. Y\. Wang \(2020\)TabFact: a large\-scale dataset for table\-based fact verification\.External Links:1909\.02164,[Link](https://arxiv.org/abs/1909.02164)Cited by:[§4\.1](https://arxiv.org/html/2606.06842#S4.SS1.SSS0.Px1.p1.1)\.
- Z\. Chen, W\. Chen, C\. Smiley, S\. Shah, I\. Borova, D\. Langdon, R\. Moussa, M\. Beane, T\. Huang, B\. Routledge, and W\. Y\. Wang \(2021b\)FinQA: a dataset of numerical reasoning over financial data\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,M\. Moens, X\. Huang, L\. Specia, and S\. W\. Yih \(Eds\.\),Online and Punta Cana, Dominican Republic,pp\. 3697–3711\.External Links:[Link](https://aclanthology.org/2021.emnlp-main.300/),[Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.300)Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p1.1)\.
- Z\. Cheng, T\. Xie, P\. Shi, C\. Li, R\. Nadkarni, Y\. Hu, C\. Xiong, D\. Radev, M\. Ostendorf, L\. Zettlemoyer, N\. A\. Smith, and T\. Yu \(2023\)Binding language models in symbolic languages\.External Links:2210\.02875,[Link](https://arxiv.org/abs/2210.02875)Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p2.1),[Table 1](https://arxiv.org/html/2606.06842#S3.T1.1.1.5.5.1),[§4\.1](https://arxiv.org/html/2606.06842#S4.SS1.SSS0.Px2.p3.1)\.
- W\. Lu, J\. Zhang, J\. Fan, Z\. Fu, Y\. Chen, and X\. Du \(2025\)Large language model for table processing: a survey\.Frontiers of Computer Science19,pp\. 192350\.External Links:ISSN 2095\-2228,[Document](https://dx.doi.org/https%3A//doi.org/10.1007/s11704-024-40763-6),[Link](https://journal.hep.com.cn/fcs/EN/10.1007/s11704-024-40763-6)Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p1.1)\.
- X\. Fang, W\. Xu, F\. A\. Tan, Z\. Hu, J\. Zhang, Y\. Qi, S\. H\. Sengamedu, and C\. Faloutsos \(2024\)Large language models \(LLMs\) on tabular data: prediction, generation, and understanding \- a survey\.Transactions on Machine Learning Research\.External Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=IZnrCGF9WI)Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p1.1)\.
- F\. Feng, J\. Zhang, X\. He, H\. Zhang, and T\. Chua \(2021\)Empowering language understanding with counterfactual reasoning\.InFindings of the Association for Computational Linguistics: ACL\-IJCNLP 2021,C\. Zong, F\. Xia, W\. Li, and R\. Navigli \(Eds\.\),Online,pp\. 2226–2236\.External Links:[Link](https://aclanthology.org/2021.findings-acl.196/),[Document](https://dx.doi.org/10.18653/v1/2021.findings-acl.196)Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, A\. Yang, A\. Fan, A\. Goyal, A\. Hartshorn, A\. Yang, A\. Mitra, A\. Sravankumar, A\. Korenev, A\. Hinsvark, A\. Rao, A\. Zhang, A\. Rodriguez, A\. Gregerson, A\. Spataru, B\. Roziere, B\. Biron, B\. Tang, B\. Chern, C\. Caucheteux, C\. Nayak, C\. Bi, C\. Marra, C\. McConnell, C\. Keller, C\. Touret, C\. Wu, C\. Wong, C\. C\. Ferrer, C\. Nikolaidis, D\. Allonsius, D\. Song, D\. Pintz, D\. Livshits, D\. Wyatt, D\. Esiobu, D\. Choudhary, D\. Mahajan, D\. Garcia\-Olano, D\. Perino, D\. Hupkes, E\. Lakomkin, E\. AlBadawy, E\. Lobanova, E\. Dinan, E\. M\. Smith, F\. Radenovic, F\. Guzmán, F\. Zhang, G\. Synnaeve, G\. Lee, G\. L\. Anderson, G\. Thattai, G\. Nail, G\. Mialon, G\. Pang, G\. Cucurell, H\. Nguyen, H\. Korevaar, H\. Xu, H\. Touvron, I\. Zarov, I\. A\. Ibarra, I\. Kloumann, I\. Misra, I\. Evtimov, J\. Zhang, J\. Copet, J\. Lee, J\. Geffert, J\. Vranes, J\. Park, J\. Mahadeokar, J\. Shah, J\. van der Linde, J\. Billock, J\. Hong, J\. Lee, J\. Fu, J\. Chi, J\. Huang, J\. Liu, J\. Wang, J\. Yu, J\. Bitton, J\. Spisak, J\. Park, J\. Rocca, J\. Johnstun, J\. Saxe, J\. Jia, K\. V\. Alwala, K\. Prasad, K\. Upasani, K\. Plawiak, K\. Li, K\. Heafield, K\. Stone, K\. El\-Arini, K\. Iyer, K\. Malik, K\. Chiu, K\. Bhalla, K\. Lakhotia, L\. Rantala\-Yeary, L\. van der Maaten, L\. Chen, L\. Tan, L\. Jenkins, L\. Martin, L\. Madaan, L\. Malo, L\. Blecher, L\. Landzaat, L\. de Oliveira, M\. Muzzi, M\. Pasupuleti, M\. Singh, M\. Paluri, M\. Kardas, M\. Tsimpoukelli, M\. Oldham, M\. Rita, M\. Pavlova, M\. Kambadur, M\. Lewis, M\. Si, M\. K\. Singh, M\. Hassan, N\. Goyal, N\. Torabi, N\. Bashlykov, N\. Bogoychev, N\. Chatterji, N\. Zhang, O\. Duchenne, O\. Çelebi, P\. Alrassy, P\. Zhang, P\. Li, P\. Vasic, P\. Weng, P\. Bhargava, P\. Dubal, P\. Krishnan, P\. S\. Koura, P\. Xu, Q\. He, Q\. Dong, R\. Srinivasan, R\. Ganapathy, R\. Calderer, R\. S\. 
Cabral, R\. Stojnic, R\. Raileanu, R\. Maheswari, R\. Girdhar, R\. Patel, R\. Sauvestre, R\. Polidoro, R\. Sumbaly, R\. Taylor, R\. Silva, R\. Hou, R\. Wang, S\. Hosseini, S\. Chennabasappa, S\. Singh, S\. Bell, S\. S\. Kim, S\. Edunov, S\. Nie, S\. Narang, S\. Raparthy, S\. Shen, S\. Wan, S\. Bhosale, S\. Zhang, S\. Vandenhende, S\. Batra, S\. Whitman, S\. Sootla, S\. Collot, S\. Gururangan, S\. Borodinsky, T\. Herman, T\. Fowler, T\. Sheasha, T\. Georgiou, T\. Scialom, T\. Speckbacher, T\. Mihaylov, T\. Xiao, U\. Karn, V\. Goswami, V\. Gupta, V\. Ramanathan, V\. Kerkez, V\. Gonguet, V\. Do, V\. Vogeti, V\. Albiero, V\. Petrovic, W\. Chu, W\. Xiong, W\. Fu, W\. Meers, X\. Martinet, X\. Wang, X\. Wang, X\. E\. Tan, X\. Xia, X\. Xie, X\. Jia, X\. Wang, Y\. Goldschlag, Y\. Gaur, Y\. Babaei, Y\. Wen, Y\. Song, Y\. Zhang, Y\. Li, Y\. Mao, Z\. D\. Coudert, Z\. Yan, Z\. Chen, Z\. Papakipos, A\. Singh, A\. Srivastava, A\. Jain, A\. Kelsey, A\. Shajnfeld, A\. Gangidi, A\. Victoria, A\. Goldstand, A\. Menon, A\. Sharma, A\. Boesenberg, A\. Baevski, A\. Feinstein, A\. Kallet, A\. Sangani, A\. Teo, A\. Yunus, A\. Lupu, A\. Alvarado, A\. Caples, A\. Gu, A\. Ho, A\. Poulton, A\. Ryan, A\. Ramchandani, A\. Dong, A\. Franco, A\. Goyal, A\. Saraf, A\. Chowdhury, A\. Gabriel, A\. Bharambe, A\. Eisenman, A\. Yazdan, B\. James, B\. Maurer, B\. Leonhardi, B\. Huang, B\. Loyd, B\. D\. Paola, B\. Paranjape, B\. Liu, B\. Wu, B\. Ni, B\. Hancock, B\. Wasti, B\. Spence, B\. Stojkovic, B\. Gamido, B\. Montalvo, C\. Parker, C\. Burton, C\. Mejia, C\. Liu, C\. Wang, C\. Kim, C\. Zhou, C\. Hu, C\. Chu, C\. Cai, C\. Tindal, C\. Feichtenhofer, C\. Gao, D\. Civin, D\. Beaty, D\. Kreymer, D\. Li, D\. Adkins, D\. Xu, D\. Testuggine, D\. David, D\. Parikh, D\. Liskovich, D\. Foss, D\. Wang, D\. Le, D\. Holland, E\. Dowling, E\. Jamil, E\. Montgomery, E\. Presani, E\. Hahn, E\. Wood, E\. Le, E\. Brinkman, E\. Arcaute, E\. Dunbar, E\. Smothers, F\. Sun, F\. Kreuk, F\. Tian, F\. Kokkinos, F\. 
Ozgenel, F\. Caggioni, F\. Kanayet, F\. Seide, G\. M\. Florez, G\. Schwarz, G\. Badeer, G\. Swee, G\. Halpern, G\. Herman, G\. Sizov, Guangyi, Zhang, G\. Lakshminarayanan, H\. Inan, H\. Shojanazeri, H\. Zou, H\. Wang, H\. Zha, H\. Habeeb, H\. Rudolph, H\. Suk, H\. Aspegren, H\. Goldman, H\. Zhan, I\. Damlaj, I\. Molybog, I\. Tufanov, I\. Leontiadis, I\. Veliche, I\. Gat, J\. Weissman, J\. Geboski, J\. Kohli, J\. Lam, J\. Asher, J\. Gaya, J\. Marcus, J\. Tang, J\. Chan, J\. Zhen, J\. Reizenstein, J\. Teboul, J\. Zhong, J\. Jin, J\. Yang, J\. Cummings, J\. Carvill, J\. Shepard, J\. McPhie, J\. Torres, J\. Ginsburg, J\. Wang, K\. Wu, K\. H\. U, K\. Saxena, K\. Khandelwal, K\. Zand, K\. Matosich, K\. Veeraraghavan, K\. Michelena, K\. Li, K\. Jagadeesh, K\. Huang, K\. Chawla, K\. Huang, L\. Chen, L\. Garg, L\. A, L\. Silva, L\. Bell, L\. Zhang, L\. Guo, L\. Yu, L\. Moshkovich, L\. Wehrstedt, M\. Khabsa, M\. Avalani, M\. Bhatt, M\. Mankus, M\. Hasson, M\. Lennie, M\. Reso, M\. Groshev, M\. Naumov, M\. Lathi, M\. Keneally, M\. Liu, M\. L\. Seltzer, M\. Valko, M\. Restrepo, M\. Patel, M\. Vyatskov, M\. Samvelyan, M\. Clark, M\. Macey, M\. Wang, M\. J\. Hermoso, M\. Metanat, M\. Rastegari, M\. Bansal, N\. Santhanam, N\. Parks, N\. White, N\. Bawa, N\. Singhal, N\. Egebo, N\. Usunier, N\. Mehta, N\. P\. Laptev, N\. Dong, N\. Cheng, O\. Chernoguz, O\. Hart, O\. Salpekar, O\. Kalinli, P\. Kent, P\. Parekh, P\. Saab, P\. Balaji, P\. Rittner, P\. Bontrager, P\. Roux, P\. Dollar, P\. Zvyagina, P\. Ratanchandani, P\. Yuvraj, Q\. Liang, R\. Alao, R\. Rodriguez, R\. Ayub, R\. Murthy, R\. Nayani, R\. Mitra, R\. Parthasarathy, R\. Li, R\. Hogan, R\. Battey, R\. Wang, R\. Howes, R\. Rinott, S\. Mehta, S\. Siby, S\. J\. Bondu, S\. Datta, S\. Chugh, S\. Hunt, S\. Dhillon, S\. Sidorov, S\. Pan, S\. Mahajan, S\. Verma, S\. Yamamoto, S\. Ramaswamy, S\. Lindsay, S\. Lindsay, S\. Feng, S\. Lin, S\. C\. Zha, S\. Patil, S\. Shankar, S\. Zhang, S\. Zhang, S\. Wang, S\. Agarwal, S\. 
Sajuyigbe, S\. Chintala, S\. Max, S\. Chen, S\. Kehoe, S\. Satterfield, S\. Govindaprasad, S\. Gupta, S\. Deng, S\. Cho, S\. Virk, S\. Subramanian, S\. Choudhury, S\. Goldman, T\. Remez, T\. Glaser, T\. Best, T\. Koehler, T\. Robinson, T\. Li, T\. Zhang, T\. Matthews, T\. Chou, T\. Shaked, V\. Vontimitta, V\. Ajayi, V\. Montanez, V\. Mohan, V\. S\. Kumar, V\. Mangla, V\. Ionescu, V\. Poenaru, V\. T\. Mihailescu, V\. Ivanov, W\. Li, W\. Wang, W\. Jiang, W\. Bouaziz, W\. Constable, X\. Tang, X\. Wu, X\. Wang, X\. Wu, X\. Gao, Y\. Kleinman, Y\. Chen, Y\. Hu, Y\. Jia, Y\. Qi, Y\. Li, Y\. Zhang, Y\. Zhang, Y\. Adi, Y\. Nam, Yu, Wang, Y\. Zhao, Y\. Hao, Y\. Qian, Y\. Li, Y\. He, Z\. Rait, Z\. DeVito, Z\. Rosnbrick, Z\. Wen, Z\. Yang, Z\. Zhao, and Z\. Ma \(2024\)The llama 3 herd of models\.External Links:2407\.21783,[Link](https://arxiv.org/abs/2407.21783)Cited by:[§4\.1](https://arxiv.org/html/2606.06842#S4.SS1.SSS0.Px1.p3.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, X\. Zhang, X\. Yu, Y\. Wu, Z\. F\. Wu, Z\. Gou, Z\. Shao, Z\. Li, Z\. Gao, A\. Liu, B\. Xue, B\. Wang, B\. Wu, B\. Feng, C\. Lu, C\. Zhao, C\. Deng, C\. Ruan, D\. Dai, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Xu, H\. Ding, H\. Gao, H\. Qu, H\. Li, J\. Guo, J\. Li, J\. Chen, J\. Yuan, J\. Tu, J\. Qiu, J\. Li, J\. L\. Cai, J\. Ni, J\. Liang, J\. Chen, K\. Dong, K\. Hu, K\. You, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Zhao, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, M\. Zhang, M\. Zhang, M\. Tang, M\. Zhou, M\. Li, M\. Wang, M\. Li, N\. Tian, P\. Huang, P\. Zhang, Q\. Wang, Q\. Chen, Q\. Du, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. J\. Chen, R\. L\. Jin, R\. Chen, S\. Lu, S\. Zhou, S\. Chen, S\. Ye, S\. Wang, S\. Yu, S\. Zhou, S\. Pan, S\. S\. Li, S\. Zhou, S\. Wu, T\. Yun, T\. Pei, T\. Sun, T\. Wang, W\. Zeng, W\. Liu, W\. Liang, W\. Gao, W\. Yu, W\. Zhang, W\. L\. Xiao, W\. An, X\. Liu, X\. Wang, X\. Chen, X\. Nie, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yang, X\. Li, X\. Su, X\. Lin, X\. Q\. Li, X\. Jin, X\. Shen, X\. Chen, X\. Sun, X\. Wang, X\. Song, X\. Zhou, X\. Wang, X\. Shan, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. Wei, Y\. Zhang, Y\. Xu, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Wang, Y\. Yu, Y\. Zhang, Y\. Shi, Y\. Xiong, Y\. He, Y\. Piao, Y\. Wang, Y\. Tan, Y\. Ma, Y\. Liu, Y\. Guo, Y\. Ou, Y\. Wang, Y\. Gong, Y\. Zou, Y\. He, Y\. Xiong, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Y\. X\. Zhu, Y\. Huang, Y\. Li, Y\. Zheng, Y\. Zhu, Y\. Ma, Y\. Tang, Y\. Zha, Y\. Yan, Z\. Z\. Ren, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Ma, Z\. Yan, Z\. Wu, Z\. Gu, Z\. Zhu, Z\. Liu, Z\. Li, Z\. Xie, Z\. Song, Z\. Pan, Z\. Huang, Z\. Xu, Z\. Zhang, and Z\. Zhang \(2025\)DeepSeek\-r1 incentivizes reasoning in llms through reinforcement learning\.Nature645\(8081\),pp\. 
633–638\.External Links:ISSN 1476\-4687,[Link](http://dx.doi.org/10.1038/s41586-025-09422-z),[Document](https://dx.doi.org/10.1038/s41586-025-09422-z)Cited by:[§4\.1](https://arxiv.org/html/2606.06842#S4.SS1.SSS0.Px1.p3.1)\.
- R\. Haldar and J\. Hockenmaier \(2025\)Rating roulette: self\-inconsistency in LLM\-as\-a\-judge frameworks\.InFindings of the Association for Computational Linguistics: EMNLP 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 24986–25004\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.1361/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.1361),ISBN 979\-8\-89176\-335\-7Cited by:[§D\.2](https://arxiv.org/html/2606.06842#A4.SS2.p1.1)\.
- J\. Herzig, P\. K\. Nowak, T\. Müller, F\. Piccinno, and J\. Eisenschlos \(2020\)TaPas: weakly supervised table parsing via pre\-training\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\),Online,pp\. 4320–4333\.External Links:[Link](https://aclanthology.org/2020.acl-main.398/),[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.398)Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p2.1),[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p1.1)\.
- D\. Ji, L\. Zhu, S\. Gao, P\. Xu, H\. Lu, J\. Ye, and F\. Zhao \(2024\)Tree\-of\-table: unleashing the power of llms for enhanced large\-scale table understanding\.External Links:2411\.08516,[Link](https://arxiv.org/abs/2411.08516)Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p3.1),[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p3.1)\.
- J\. Kim and S\. Hwang \(2025\)Counterfactual\-consistency prompting for relative temporal understanding in large language models\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 1210–1225\.External Links:[Link](https://aclanthology.org/2025.acl-short.97/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-short.97),ISBN 979\-8\-89176\-252\-7Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Li, L\. Yu, and A\. Ettinger \(2023\)Counterfactual reasoning: testing language models’ understanding of hypothetical scenarios\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 804–815\.External Links:[Link](https://aclanthology.org/2023.acl-short.70/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-short.70)Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px2.p1.1)\.
- Q\. Li, C\. Huang, S\. Li, Y\. Xiang, D\. Xiong, and W\. Lei \(2025\)GraphOTTER: evolving LLM\-based graph reasoning for complex table question answering\.InProceedings of the 31st International Conference on Computational Linguistics,O\. Rambow, L\. Wanner, M\. Apidianaki, H\. Al\-Khalifa, B\. D\. Eugenio, and S\. Schockaert \(Eds\.\),Abu Dhabi, UAE,pp\. 5486–5506\.External Links:[Link](https://aclanthology.org/2025.coling-main.368/)Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p3.1)\.
- J\. Liu, Y\. Chabot, R\. Troncy, V\. Huynh, T\. Labbé, and P\. Monnin \(2023a\)From tabular data to knowledge graphs: a survey of semantic table interpretation tasks and methods\.Journal of Web Semantics76,pp\. 100761\.External Links:ISSN 1570\-8268,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.websem.2022.100761),[Link](https://www.sciencedirect.com/science/article/pii/S1570826822000452)Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p1.1),[§4\.7](https://arxiv.org/html/2606.06842#S4.SS7.p1.1)\.
- J\. Liu, Y\. Chabot, R\. Troncy, V\. Huynh, T\. Labbé, and P\. Monnin \(2023b\)From tabular data to knowledge graphs: a survey of semantic table interpretation tasks and methods\.Journal of Web Semantics76,pp\. 100761\.External Links:ISSN 1570\-8268,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.websem.2022.100761),[Link](https://www.sciencedirect.com/science/article/pii/S1570826822000452)Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p1.1)\.
- N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2024a\)Lost in the middle: how language models use long contexts\.Transactions of the Association for Computational Linguistics12,pp\. 157–173\.External Links:[Link](https://aclanthology.org/2024.tacl-1.9/),[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00638)Cited by:[§3\.2](https://arxiv.org/html/2606.06842#S3.SS2.p1.9)\.
- Q\. Liu, B\. Chen, J\. Guo, M\. Ziyadi, Z\. Lin, W\. Chen, and J\. Lou \(2022\)TAPEX: table pre\-training via learning a neural sql executor\.External Links:2107\.07653,[Link](https://arxiv.org/abs/2107.07653)Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Liu, F\. Wang, and M\. Chen \(2024b\)Rethinking tabular data understanding with large language models\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\),Mexico City, Mexico,pp\. 450–482\.External Links:[Link](https://aclanthology.org/2024.naacl-long.26/),[Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.26)Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p3.1),[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p3.1)\.
- Q\. Mao, Q\. Liu, Z\. Li, M\. Cheng, Z\. Zhang, and R\. Li \(2025\)PoTable: towards systematic thinking via stage\-oriented plan\-then\-execute reasoning on tables\.External Links:2412\.04272,[Link](https://arxiv.org/abs/2412.04272)Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p2.1)\.
- M\. M\. H\. Nahid and D\. Rafiei \(2024\)TabSQLify: enhancing reasoning capabilities of LLMs through table decomposition\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\),Mexico City, Mexico,pp\. 5725–5737\.External Links:[Link](https://aclanthology.org/2024.naacl-long.320/),[Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.320)Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p2.1),[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p2.1)\.
- A\. Ni, S\. Iyer, D\. Radev, V\. Stoyanov, W\. Yih, S\. I\. Wang, and X\. V\. Lin \(2023\)LEVER: learning to verify language\-to\-code generation with execution\.External Links:2302\.08468,[Link](https://arxiv.org/abs/2302.08468)Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p2.1)\.
- OpenAI \(2025\)GPT\-5 system card\.Note:[https://openai\.com/index/gpt\-5\-system\-card/](https://openai.com/index/gpt-5-system-card/)Cited by:[§4\.1](https://arxiv.org/html/2606.06842#S4.SS1.SSS0.Px1.p3.1)\.
- C\. Pan, J\. Zhang, K\. Wei, C\. Pan, Y\. Zhao, J\. Huang, J\. Yang, Z\. Wu, H\. Zeng, X\. Gu, W\. Sun, Y\. Zhai, Y\. Mao, Z\. Jiang, J\. Zhong, S\. Song, Y\. Li, and Z\. He \(2026\)ReasonTabQA: a comprehensive benchmark for table question answering from real world industrial scenarios\.External Links:2601\.07280,[Link](https://arxiv.org/abs/2601.07280)Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Pasupat and P\. Liang \(2015\)Compositional semantic parsing on semi\-structured tables\.InProceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),C\. Zong and M\. Strube \(Eds\.\),Beijing, China,pp\. 1470–1480\.External Links:[Link](https://aclanthology.org/P15-1142/),[Document](https://dx.doi.org/10.3115/v1/P15-1142)Cited by:[§4\.1](https://arxiv.org/html/2606.06842#S4.SS1.SSS0.Px1.p1.1)\.
- Qwen, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§4\.1](https://arxiv.org/html/2606.06842#S4.SS1.SSS0.Px1.p3.1)\.
- V\. Raina, A\. Liusie, and M\. Gales \(2024\)Is LLM\-as\-a\-judge robust? investigating universal adversarial attacks on zero\-shot LLM assessment\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 7499–7517\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.427/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.427)Cited by:[§D\.2](https://arxiv.org/html/2606.06842#A4.SS2.p1.1)\.
- Y\. Ruan, X\. Lan, J\. Ma, Y\. Dong, K\. He, and M\. Feng \(2024\)Language modeling on tabular data: a survey of foundations, techniques and evolution\.External Links:2408\.10548,[Link](https://arxiv.org/abs/2408.10548)Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p1.1)\.
- A\. Su, A\. Wang, C\. Ye, C\. Zhou, G\. Zhang, G\. Chen, G\. Zhu, H\. Wang, H\. Xu, H\. Chen, H\. Li, H\. Lan, J\. Tian, J\. Yuan, J\. Zhao, J\. Zhou, K\. Shou, L\. Zha, L\. Long, L\. Li, P\. Wu, Q\. Zhang, Q\. Huang, S\. Yang, T\. Zhang, W\. Ye, W\. Zhu, X\. Hu, X\. Gu, X\. Sun, X\. Li, Y\. Yang, and Z\. Xiao \(2024\)TableGPT2: a large multimodal model with tabular data integration\.External Links:2411\.02059,[Link](https://arxiv.org/abs/2411.02059)Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p2.1),[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Sui, J\. Zou, M\. Zhou, X\. He, L\. Du, S\. Han, and D\. Zhang \(2024\)TAP4LLM: table provider on sampling, augmenting, and packing semi\-structured data for large language model reasoning\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 10306–10323\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.603/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.603)Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p2.1),[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p2.1)\.
- D\. Wan, J\. Vig, M\. Bansal, and S\. Joty \(2025\)On positional bias of faithfulness for long\-form summarization\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 8791–8810\.External Links:[Link](https://aclanthology.org/2025.naacl-long.442/),[Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.442),ISBN 979\-8\-89176\-189\-6Cited by:[§3\.2](https://arxiv.org/html/2606.06842#S3.SS2.p1.9)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2023\)Self\-consistency improves chain of thought reasoning in language models\.InProceedings of the International Conference on Learning Representations \(ICLR\),Note:Camera\-ready version at ICLR 2023External Links:[Link](https://arxiv.org/abs/2203.11171)Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p3.1),[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p3.1)\.
- Z\. Wang, H\. Zhang, C\. Li, J\. M\. Eisenschlos, V\. Perot, Z\. Wang, L\. Miculicich, Y\. Fujii, J\. Shang, C\. Lee, and T\. Pfister \(2024\)Chain\-of\-table: evolving tables in the reasoning chain for table understanding\.InProceedings of the International Conference on Learning Representations \(ICLR\),Note:Published as a conference paper at ICLR 2024External Links:[Link](https://arxiv.org/abs/2401.04398)Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p2.1),[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p2.1),[Table 1](https://arxiv.org/html/2606.06842#S3.T1.1.1.7.7.1),[§4\.1](https://arxiv.org/html/2606.06842#S4.SS1.SSS0.Px2.p3.1)\.
- T\. W\. Webb, K\. J\. Holyoak, and H\. Lu \(2025\)Evidence from counterfactual tasks supports emergent analogical reasoning in large language models\.PNAS Nexus4\(5\),pp\. pgaf135\.External Links:[Document](https://dx.doi.org/10.1093/pnasnexus/pgaf135)Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.InProceedings of the 36th International Conference on Neural Information Processing Systems,NIPS ’22,Red Hook, NY, USA\.External Links:ISBN 9781713871088Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p2.1),[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p2.1)\.
- Z\. Wu, J\. Yang, J\. Liu, X\. Wu, C\. Pan, J\. Zhang, Y\. Zhao, S\. Song, Y\. Li, and Z\. Li \(2025\)Table\-r1: region\-based reinforcement learning for table understanding\.External Links:2505\.12415,[Link](https://arxiv.org/abs/2505.12415)Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Xiong, Z\. He, Z\. He, Y\. Zhao, C\. Pan, J\. Zhang, S\. Song, and Y\. Li \(2025\)TableZoomer: a collaborative agent framework for large\-scale table question answering\.Vicinagearth2\(1\),pp\. 11\.External Links:[Document](https://dx.doi.org/10.1007/s44336-025-00016-x),[Link](https://doi.org/10.1007/s44336-025-00016-x)Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p2.1)\.
- H\. Yang, S\. Hwang, and J\. So \(2023\)Relation\-based counterfactual data augmentation and contrastive learning for robustifying natural language inference models\.InProceedings of INTERSPEECH 2023,Note:Accepted at INTERSPEECH 2023Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Ye, B\. Hui, M\. Yang, B\. Li, F\. Huang, and Y\. Li \(2023\)Large language models are versatile decomposers: decomposing evidence and questions for table\-based reasoning\.InProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval,SIGIR ’23,New York, NY, USA,pp\. 174–184\.External Links:ISBN 9781450394086,[Link](https://doi.org/10.1145/3539618.3591708),[Document](https://dx.doi.org/10.1145/3539618.3591708)Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p2.1),[Table 1](https://arxiv.org/html/2606.06842#S3.T1.1.1.6.6.1),[§4\.1](https://arxiv.org/html/2606.06842#S4.SS1.SSS0.Px2.p3.1),[§4\.7](https://arxiv.org/html/2606.06842#S4.SS7.p1.1)\.
- P\. Yu, G\. Chen, and J\. Wang \(2025\)Table\-critic: a multi\-agent framework for collaborative criticism and refinement in table reasoning\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 17432–17451\.External Links:[Link](https://aclanthology.org/2025.acl-long.853/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.853),ISBN 979\-8\-89176\-251\-0Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p3.1),[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p3.1),[Table 1](https://arxiv.org/html/2606.06842#S3.T1.1.1.10.10.1),[§4\.1](https://arxiv.org/html/2606.06842#S4.SS1.SSS0.Px2.p4.1)\.
- Y\. Yue, Z\. Chen, R\. Lu, A\. Zhao, Z\. Wang, Y\. Yue, S\. Song, and G\. Huang \(2025\)Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?\.External Links:2504\.13837,[Link](https://arxiv.org/abs/2504.13837)Cited by:[§4\.4](https://arxiv.org/html/2606.06842#S4.SS4.p4.2)\.
- H\. Zhang, Y\. Ma, and H\. Yang \(2025a\)ALTER: augmentation for large\-table\-based reasoning\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 179–198\.External Links:[Link](https://aclanthology.org/2025.naacl-long.9/),[Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.9),ISBN 979\-8\-89176\-189\-6Cited by:[Table 1](https://arxiv.org/html/2606.06842#S3.T1.1.1.9.9.1),[§4\.1](https://arxiv.org/html/2606.06842#S4.SS1.SSS0.Px2.p3.1)\.
- X\. Zhang, S\. Luo, B\. Zhang, Z\. Ma, J\. Zhang, Y\. Li, G\. Li, Z\. Yao, K\. Xu, J\. Zhou, D\. Zhang\-Li, J\. Yu, S\. Zhao, J\. Li, and J\. Tang \(2025b\)TableLLM: enabling tabular data manipulation by LLMs in real office usage scenarios\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 10315–10344\.External Links:[Link](https://aclanthology.org/2025.findings-acl.538/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.538),ISBN 979\-8\-89176\-256\-5Cited by:[§1](https://arxiv.org/html/2606.06842#S1.p2.1),[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Zhang, Y\. Sun, Y\. Zhan, D\. Tao, D\. Tao, and C\. Gong \(2025c\)Large language models as an indirect reasoner: contrapositive and contradiction for automated reasoning\.InProceedings of the 31st International Conference on Computational Linguistics,O\. Rambow, L\. Wanner, M\. Apidianaki, H\. Al\-Khalifa, B\. D\. Eugenio, and S\. Schockaert \(Eds\.\),Abu Dhabi, UAE,pp\. 5040–5057\.External Links:[Link](https://aclanthology.org/2025.coling-main.337/)Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Zhang, Y\. Gao, B\. Chen, W\. Li, S\. Sun, and J\. Su \(2025d\)High\-quality complex text\-to\-SQL data generation through chain\-of\-verification\.InProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia\-Pacific Chapter of the Association for Computational Linguistics,K\. Inui, S\. Sakti, H\. Wang, D\. F\. Wong, P\. Bhattacharyya, B\. Banerjee, A\. Ekbal, T\. Chakraborty, and D\. P\. Singh \(Eds\.\),Mumbai, India,pp\. 2368–2379\.External Links:[Link](https://aclanthology.org/2025.findings-ijcnlp.143/),ISBN 979\-8\-89176\-303\-6Cited by:[§D\.1](https://arxiv.org/html/2606.06842#A4.SS1.p3.1)\.
- Y\. Zhang, J\. Henkel, A\. Floratou, J\. Cahoon, S\. Deep, and J\. M\. Patel \(2024\)ReAcTable: enhancing react for table question answering\.Proc\. VLDB Endow\.17\(8\),pp\. 1981–1994\.External Links:ISSN 2150\-8097,[Link](https://doi.org/10.14778/3659437.3659452),[Document](https://dx.doi.org/10.14778/3659437.3659452)Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p2.1)\.
- X\. Zheng, J\. Lou, B\. Cao, X\. Wen, Y\. Ji, H\. Lin, Y\. Lu, X\. Han, D\. Zhang, and L\. Sun \(2025\)Critic\-CoT: boosting the reasoning abilities of large language model via chain\-of\-thought critic\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 1768–1806\.External Links:[Link](https://aclanthology.org/2025.findings-acl.89/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.89),ISBN 979\-8\-89176\-256\-5Cited by:[Table 1](https://arxiv.org/html/2606.06842#S3.T1.1.1.8.8.1),[§4\.1](https://arxiv.org/html/2606.06842#S4.SS1.SSS0.Px2.p4.1)\.
- D\. Zhou, N\. Schärli, L\. Hou, J\. Wei, N\. Scales, X\. Wang, D\. Schuurmans, C\. Cui, O\. Bousquet, Q\. Le, and E\. Chi \(2023a\)Least\-to\-most prompting enables complex reasoning in large language models\.InProceedings of the International Conference on Learning Representations \(ICLR\),Note:Published at ICLR 2023External Links:[Link](https://arxiv.org/abs/2205.10625)Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px1.p2.1)\.
- W\. Zhou, S\. Zhang, H\. Poon, and M\. Chen \(2023b\)Context\-faithful prompting for large language models\.InFindings of the Association for Computational Linguistics: EMNLP 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 14544–14556\.External Links:[Link](https://aclanthology.org/2023.findings-emnlp.968/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.968)Cited by:[§2](https://arxiv.org/html/2606.06842#S2.SS0.SSS0.Px2.p1.1)\.

## Appendix

## Appendix A Implementation Details

All experiments were conducted on a dedicated server equipped with NVIDIA A100 40GB GPUs \(×8\) with CUDA 12\.2\. We evaluated our framework using multiple backbone large language models\. Open\-source models including Llama\-3\.3\-70B and DeepSeek\-R1\-14B were deployed locally using the vLLM inference engine, while GPT\-5\-mini was accessed via the Microsoft Azure OpenAI API\.

Unless otherwise specified, we used a temperature of 0 and a top-$p$ of 1.0. The maximum generation length was set to 2048 tokens. In the main experiments, we selected the number of self-critique iterations $r$ from $\{1,\dots,10\}$ based on validation performance and used the best-performing value. For API-based models, due to endpoint constraints, we set the temperature to 1. In addition, when inference timeouts occurred, we set the reasoning level to "minimum" to ensure stable execution, while keeping all other settings unchanged.

Footnote 4: We do not report WikiTQ results for gpt5-mini under ALTER because a subset of batches returned empty responses during inference, preventing reliable aggregation of results.

### A.1 Output Normalization and Evaluation Metrics

In our evaluation, post-processing is applied at two different levels, corresponding to the main results and the supplementary analysis, respectively. For the main results reported in the primary tables, we apply a strictly minimal normalization only to DeepSeek-R1 distilled models (e.g., DeepSeek-Distilled-14B) by removing explicit reasoning traces or special markers (e.g., think blocks). No matching or answer extraction is performed, and all other models are evaluated on their raw outputs without any post-processing. This setting is intentionally conservative, ensuring that the main comparisons reflect strict adherence to the canonical answer formats of WikiTQ and TabFact.

In the supplementary analysis reported after Section 4.3, we additionally enable a lightweight matching-based extraction on top of the same normalization to better reflect practical usage scenarios in which users accept semantically correct answers even when they are embedded in natural language statements. For example, if a model outputs "The answer is X" or similar declarative forms, we extract X and treat it as the predicted answer. Unlike the main results, this matching-based extraction is not restricted to DeepSeek distilled models, and is applied whenever a model's output is non-canonical but contains an unambiguous answer span. Both stages are conservative and purely surface-level: they neither modify model predictions nor introduce task-specific heuristics, but only standardize syntactic variants for consistent evaluation.
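The two post-processing levels described above can be illustrated with a minimal Python sketch. The exact rules used in the experiments are not listed here, so the `<think>...</think>` tag format, both regular expressions, and the function names `strip_reasoning` and `extract_answer` are assumptions made for illustration only.

```python
import re

# Assumed marker format for reasoning traces; the actual markers may differ.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL | re.IGNORECASE)

# Assumed declarative pattern of the form "The answer is X".
ANSWER_SPAN = re.compile(
    r"^\s*the (?:final )?answer is[:\s]+(.+?)[.\s]*$", flags=re.IGNORECASE
)


def strip_reasoning(output: str) -> str:
    """Level 1 (main results): remove reasoning traces and nothing else."""
    return THINK_BLOCK.sub("", output).strip()


def extract_answer(output: str) -> str:
    """Level 2 (supplementary analysis): also pull X out of 'The answer is X'.

    Outputs that do not match the declarative pattern are returned unchanged
    after level-1 normalization, so predictions are never altered.
    """
    cleaned = strip_reasoning(output)
    match = ANSWER_SPAN.match(cleaned)
    return match.group(1).strip() if match else cleaned
```

Under these assumptions, `extract_answer` leaves an already canonical output such as `1984` untouched and only strips the declarative wrapper when one is present.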

### A.2 TAT-QA Format Alignment and Evaluation Protocol

For the additional TAT\-QA experiments, we do not modify the reasoning procedure or the prompts of any compared baseline\. Since TAT\-QA is a text\-table hybrid dataset, while the evaluated methods are originally designed around table\-centric inputs, we only perform input\-format alignment to make the dataset compatible with the same evaluation interface\. Specifically, we prepend the associated textual context before the table caption in the original table serialization, while keeping the remaining table content and question format unchanged\. The same serialized input format is used for all compared methods\. To enable a consistent accuracy\-style comparison with WikiTQ, we also canonicalize TAT\-QA gold answers into WikiTQ\-style answer strings and evaluate predictions with the same WikiTQ evaluator\. This conversion only standardizes the surface format of the reference answers, such as answer strings or lists of answer strings, and does not change the semantic content of the gold labels\. Therefore, the TAT\-QA results reported in the supplementary analysis should be interpreted under this adapted accuracy\-style protocol\.
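The alignment step can be sketched as follows. This is an illustrative sketch only: the pipe-separated table serialization, the `Caption:` and `Question:` labels, and the helper names are hypothetical, since the concrete serialization format is not specified in this appendix. What the sketch preserves from the description is the ordering (textual context first, then caption, then table, then question) and the conversion of gold answers into a list of answer strings.

```python
from typing import List, Sequence, Union


def serialize_hybrid_example(
    context_paragraphs: Sequence[str],
    caption: str,
    header: Sequence[str],
    rows: Sequence[Sequence[object]],
    question: str,
) -> str:
    """Prepend the textual context before the caption of a serialized table."""
    text = "\n".join(context_paragraphs)
    table_lines = [" | ".join(header)]
    table_lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(
        [text, f"Caption: {caption}", *table_lines, f"Question: {question}"]
    )


def canonicalize_gold(answer: Union[object, Sequence[object]]) -> List[str]:
    """Convert a gold answer (scalar or list) into a list of answer strings."""
    items = answer if isinstance(answer, (list, tuple)) else [answer]
    return [str(item).strip() for item in items]


# Hypothetical example input, not taken from TAT-QA.
prompt = serialize_hybrid_example(
    ["Revenue rose in 2019."],
    "Income statement",
    ["Year", "Revenue"],
    [[2018, 10], [2019, 12]],
    "What was revenue in 2019?",
)
```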

## Appendix B Additional Analysis

### B.1 Additional Results with Qwen3.5

We further conduct supplementary experiments with Qwen3.5 on WikiTQ and TabFact. As shown in Table [4](https://arxiv.org/html/2606.06842#A2.T4), CRAFT achieves the best performance on both datasets, with 86.1% on WikiTQ and 95.3% on TabFact. This trend is consistent with the main results, indicating that the conclusions remain unchanged when using Qwen3.5-27B as an additional backbone.

Table 4: Performance (%) of different methods on WikiTQ and TabFact with Qwen3.5 as the backbone.

### B.2 Evaluation on TAT-QA

Table 5: Performance (%) on TAT-QA under the same adapted text-table input setting.

We further conduct supplementary experiments on TAT-QA with LLaMA3-70B and Qwen2.5-72B, following the format alignment and evaluation protocol described in Section [A.2](https://arxiv.org/html/2606.06842#A1.SS2). As shown in Table [5](https://arxiv.org/html/2606.06842#A2.T5), CRAFT consistently achieves the best performance under both backbones, reaching 71.8% and 73.9%, respectively. This trend is consistent with the main results, indicating that the conclusions remain unchanged in the text-table hybrid setting.

### B.3 Step-0 Candidate Accuracy and Error-Correction Analysis

To examine whether the performance gain mainly comes from the initial candidate already being correct, we conduct an additional WikiTQ analysis with LLaMA3 over three independent runs. Let $y$ denote the gold answer, $\hat{y}_{0}$ denote the Step-0 candidate, and $\hat{y}$ denote the final prediction. Table [6](https://arxiv.org/html/2606.06842#A2.T6) reports the mean Step-0 candidate accuracy and the corresponding error-correction results.

Table 6: Step-0 candidate accuracy and error-correction analysis on WikiTQ with LLaMA3; accuracy is averaged over three runs.

The Step-0 candidate accuracy is only 61.5% on average, substantially lower than the final-answer accuracy of 81.3%. This indicates that the final performance gain cannot be explained by simply inheriting a correct initial candidate. In particular, CRAFT corrects 59.3% of initially wrong Step-0 predictions, and 28.1% of final correct answers come from cases where the Step-0 prediction was wrong. These results show that CRAFT has a strong error-correction ability: even when the initial candidate is incorrect, the reverse path and the Rethinker can often recover the correct answer by integrating complementary table-grounded evidence.
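As a quick consistency check, the reported percentages can be related to each other with a few lines of arithmetic. The sketch below uses only the three figures stated above; because they are means over three runs, multiplying them is an approximation rather than an exact per-run identity, and the variable names are ours.

```python
# Figures reported in the text (means over three runs).
step0_acc = 0.615        # Step-0 candidate accuracy
final_acc = 0.813        # final-answer accuracy
corrected_rate = 0.593   # share of wrong Step-0 predictions that get corrected

# Fraction of all examples that start wrong at Step-0 and end up correct.
wrong_at_step0 = 1.0 - step0_acc
corrected_share = wrong_at_step0 * corrected_rate

# Share of final correct answers whose Step-0 prediction was wrong.
share_of_final_correct = corrected_share / final_acc

print(f"{corrected_share:.3f} of all examples are corrected")
print(f"{share_of_final_correct:.3f} of final correct answers started wrong")
```

The second quantity comes out at roughly 0.281, which agrees with the 28.1% stated above.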

## Appendix C Evaluations on Token Consumption

Table 7: Token consumption comparison across different methods on (a) table question answering (WikiTQ) and (b) fact verification (TabFact). Token counts are measured over the full reasoning pipeline, including candidate generation, critic evaluation, and final decision making.

In this section, we analyze the token consumption of the full reasoning pipeline under different numbers of critic evaluations. The reported token counts shown in Table [7](https://arxiv.org/html/2606.06842#A3.T7) cover the entire process, including candidate answer generation, critic-based evaluation, and final decision making.

In our framework, the critic is applied to assess the quality and evidence consistency of candidate answers. We vary the number of critic evaluations from $c=1$ to $c=10$, where each additional critic evaluation independently reassesses the candidate answers using the same criteria. As the number of critic evaluations increases, the overall token consumption grows substantially, since each evaluation requires reprocessing the candidate answers together with their associated evidence.

Beyond raw token counts, we also consider a more realistic cost model that reflects current commercial pricing schemes, in which input and output tokens are priced differently\. Following the latest OpenAI pricing, output tokens are substantially more expensive than input tokens; we therefore normalize token usage by weighting output tokens more heavily\. Concretely, we adopt a normalized cost defined as:

$$\text{Normalized Cost} \;=\; 0.125\,T_{\text{in}} \;+\; 0.875\,T_{\text{out}},$$

where $T_{\text{in}}$ and $T_{\text{out}}$ denote the numbers of input and output tokens, and the weights 0.125 and 0.875 reflect the relative per-token prices $P_{\text{in}}$ and $P_{\text{out}}$ of input and output tokens.
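The normalized cost is a plain weighted sum, as the short sketch below shows. The function name and the token counts in the example are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from Table 7.

```python
def normalized_cost(
    input_tokens: int,
    output_tokens: int,
    w_in: float = 0.125,
    w_out: float = 0.875,
) -> float:
    """Weighted token cost with output tokens weighted more heavily."""
    return w_in * input_tokens + w_out * output_tokens


# Hypothetical run: 8000 input tokens and 1000 output tokens.
example_cost = normalized_cost(8000, 1000)
print(example_cost)  # 0.125 * 8000 + 0.875 * 1000 = 1875.0
```

With these weights, one output token costs as much as seven input tokens, so methods that generate long reasoning traces are penalized more than methods that mostly re-read the table.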

As shown in the main results, using a larger number of critic evaluations generally leads to stronger performance, as repeated evaluation helps reduce unreliable decisions. Therefore, higher values of $c$ are preferred when computational resources permit. At the same time, even a single critic evaluation ($c=1$), with no further critic evaluations, already consistently outperforms the baseline while incurring substantially lower normalized cost. In addition, when the Extractor stage is based on Chain-of-Table, CRAFT significantly reduces token consumption and does not introduce a critic, while its performance remains among the leading methods in the main results.

These results indicate a clear trade-off between performance and computational cost controlled by the number of critic evaluations. While multiple critic evaluations yield stronger results, using $c=1$ serves as an economical alternative that retains most of the performance gains with substantially reduced token usage under realistic pricing assumptions.

## Appendix D A Pseudocode Description of Rethinker/Reverser

This section includes pseudocode and brief supplementary notes on the implementation of the Reverser and Rethinker modules.

Table 8: Atomic rewriting rules and templates.

### D.1 The Reverser

Algorithm 1 Reverser Constructor

Input: Declarative statement $S$; optional question $Q$; table column names $\mathcal{C}$; reverse template set $\mathcal{D}$

Output: Counterstatement $R^{*}$

- Step 1: Template matching and candidate instantiation ▷ generate multiple reverse candidates
  - $\{\delta_{1},\dots,\delta_{k}\} \leftarrow \mathcal{M}_{\text{Match}}(S, Q, \mathcal{D})$
  - for $j = 1$ to $k$ do
    - $R_{j} \leftarrow \phi^{\delta_{j}}(S, Q)$
  - end for
  - $\mathcal{R} \leftarrow \{R_{1},\dots,R_{k}\}$
- Step 2: SQL-signature induction ▷ use SQL structure as proxy for induced reasoning space
  - $P_{S} \leftarrow \mathcal{M}_{\text{SQL}}(S, \mathcal{C})$
  - $\Sigma_{S} \leftarrow \text{ExtractSignature}(P_{S})$
  - for $j = 1$ to $k$ do
    - $P_{j} \leftarrow \mathcal{M}_{\text{SQL}}(R_{j}, \mathcal{C})$
    - $\Sigma_{j} \leftarrow \text{ExtractSignature}(P_{j})$
  - end for
  - ▷ each signature is represented by operation–object pairs extracted from minimal SQL
- Step 3: Reverse selection ▷ maximize structural expansion beyond $S$
  - $j^{*} \leftarrow \arg\max\limits_{1 \leq j \leq k} \bigl|\Sigma_{S} \cup \Sigma_{j}\bigr|$
  - $R^{*} \leftarrow R_{j^{*}}$
  - return $R^{*}$

The Reverser procedure is summarized in Algorithm [1](https://arxiv.org/html/2606.06842#alg1). To instantiate reverse candidates, we compile a small set of commonly used reverse templates based on empirical observation of statement semantics. These templates are summarized in Appendix Table [8](https://arxiv.org/html/2606.06842#A4.T8) and are used to prompt candidate generation while keeping the candidates semantically aligned with the original statement.

Given the original statement and each candidate reverse statement, we independently prompt the model to generate a minimal SQL\-style query for verification\. This design is intended to reduce reward hacking during SQL generation: instead of allowing candidates to benefit from arbitrarily long or structurally inflated programs, we explicitly require each statement to be expressed by a minimal verification query\.

We first elicit the SQL program itself and then apply rule-based extraction to derive normalized structural representations from the generated SQL. Specifically, we represent each SQL query using operation–object signatures, where each signature associates a SQL operation with its main object. In implementation, reverse selection is based primarily on structural expansion at the operation–object level, namely, the amount of new operation–object structure introduced beyond the original statement. If multiple candidates introduce the same amount of operation–object expansion, we further ask the model to decide which candidate is more helpful for solving the original question. Based on the SQL query patterns and functions discussed by Zhang et al. ([2025d](https://arxiv.org/html/2606.06842#bib.bib62)), we summarize the SQL operator vocabulary and the corresponding object-level extraction used in our implementation in Table [9](https://arxiv.org/html/2606.06842#A4.T9).
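The selection rule in Step 3 of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: signatures are written by hand as sets of (operation, object) pairs rather than extracted from model-generated SQL, the example statements are invented, and the function `select_reverse` returns all tied candidates instead of asking a model to break the tie.

```python
from typing import FrozenSet, List, Sequence, Tuple

# A signature is a set of (SQL operation, main object) pairs.
Signature = FrozenSet[Tuple[str, str]]


def select_reverse(
    original_sig: Signature,
    candidates: Sequence[Tuple[str, Signature]],
) -> List[str]:
    """Return the candidate statements that maximize |Sigma_S ∪ Sigma_j|.

    More than one statement is returned only when candidates tie; the text
    above resolves such ties with an additional model call, omitted here.
    """
    if not candidates:
        raise ValueError("at least one reverse candidate is required")
    sizes = [len(original_sig | sig) for _, sig in candidates]
    best = max(sizes)
    return [stmt for (stmt, _), size in zip(candidates, sizes) if size == best]


# Hypothetical hand-written signatures for illustration.
sigma_s: Signature = frozenset({("SELECT", "team"), ("WHERE", "wins")})
reverse_candidates = [
    ("Team A won at most 5 games.",
     frozenset({("SELECT", "team"), ("WHERE", "wins")})),
    ("Another team won more games than Team A.",
     frozenset({("SELECT", "team"), ("WHERE", "wins"), ("ORDER BY", "wins")})),
]
chosen = select_reverse(sigma_s, reverse_candidates)
```

In this example the second candidate is selected because it adds an `ORDER BY` pair that the original statement does not use, which is the kind of structural expansion the selection step rewards.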

Table 9: SQL operator vocabulary and object-level signature construction used in reverse selection.

### D.2 The Rethinker

Using large language models as critics or judges has been shown to be effective, but it is also challenging in practice, as their feedback can be noisy, inconsistent, overconfident, or sensitive to prompt phrasing and context (Raina et al., [2024](https://arxiv.org/html/2606.06842#bib.bib41); Bavaresco et al., [2025](https://arxiv.org/html/2606.06842#bib.bib3); Haldar and Hockenmaier, [2025](https://arxiv.org/html/2606.06842#bib.bib39)). As a result, we do not rely on the model's feedback alone. Instead, based on empirical validation, we adopt a hybrid design that combines model feedback with simple structural constraints. This design allows us to leverage the model's semantic judgments while preventing unstable or unconstrained revisions. To assess the contribution of this design, we report a simple ablation in which the structural constraints are removed and the system relies solely on model feedback (Table [10](https://arxiv.org/html/2606.06842#A5.T10)). Algorithm [2](https://arxiv.org/html/2606.06842#alg2) provides the pseudocode description of this module.

## Appendix E Case Study and Prompts

We provide qualitative illustrations and implementation details for each step of CRAFT. This appendix is organized to support a clearer understanding of how the proposed method operates, how its components interact, and how its behavior differs from that of prior reflection-based approaches.

Table 10: Ablation study results comparing Rethink with a purely model-based prompting baseline on WikiTQ and TabFact tasks, using Llama 3 as the backbone model.

We first present a detailed case study (Figure [5](https://arxiv.org/html/2606.06842#A5.F5)) that traces the reasoning process of our method and contrasts it with that of a representative self-reflection–based baseline on the same example. By comparing the key intermediate steps under the two settings, the case study illustrates how counterfactual construction leads the model to explore alternative reasoning paths, and how inappropriate intermediate restrictions in prior approaches can result in incorrect conclusions, whereas the proposed framework reaches a consistent and correct outcome. In a separate example (Figure [6](https://arxiv.org/html/2606.06842#A5.F6)), we further show that CRAFT can still approach the correct solution even when neither the Rewriter nor the Reverser statements include the gold answer. CRAFT leverages the partial but complementary clues uncovered along the two reasoning paths. This allows CRAFT to filter out less reliable candidates, retain the more informative comparisons, and ultimately recover the correct answer through the final evidence aggregation stage.

Following this, we report the exact prompts used for all modules in CRAFT to ensure transparency and reproducibility. Specifically, we provide the prompts for the Rewriter, Reverser, Extractor, and Rethinker components (see Figures [7](https://arxiv.org/html/2606.06842#A5.F7) to 10). Since the intermediate reasoning traces produced by baseline methods can be directly used as evidence inputs, we only include the Extractor prompt that maps a given piece of evidence to a final answer. This prompt defines how evidence is interpreted and converted into a prediction.

Algorithm 2 Rethink Component

Input: Question $Q$, table $T$; Rewriter $(a_{A}, e_{A})$; Reverser $(a_{B}, e_{B})$; fallback answer $a_{\mathrm{base}}$; thresholds $\tau, \tau_{\mathrm{swap}}$.

Output: Selected answer $\hat{a} \in \{a_{A}, a_{B}\}$.

- Function Score($Q, T, a, e$):
  - $S \leftarrow \text{ToStatement}(Q, a)$
  - $(y, \alpha) \leftarrow \text{LLMJudge}(S, e, T)$ ▷ $y \in \{\mathsf{Support}, \mathsf{Contradict}\},\ \alpha \in [0,1]$
  - if $y = \mathsf{Support}$ then
    - return $+\alpha$
  - else
    - return $-\alpha$
  - end if
- Main-view margin (self-evidence)
  - $s_{A} \leftarrow \text{Score}(Q, T, a_{A}, e_{A})$, $s_{B} \leftarrow \text{Score}(Q, T, a_{B}, e_{B})$ ▷ $s \in [-1, 1]$
  - $\Delta \leftarrow |s_{A} - s_{B}|$
  - if $\Delta \geq \tau$ then
    - return $\arg\max_{a \in \{a_{A}, a_{B}\}} s(a)$
  - end if
  - ▷ If $\Delta < \tau$, Stage 1 is inconclusive and we enter Stage 2
- Swapped-evidence
  - $s_{A\leftrightarrow} \leftarrow \text{Score}(Q, T, a_{A}, e_{B})$, $s_{B\leftrightarrow} \leftarrow \text{Score}(Q, T, a_{B}, e_{A})$
  - $c_{A} \leftarrow \mathbb{I}\left[s_{A} \cdot s_{A\leftrightarrow} > 0\right]$, $c_{B} \leftarrow \mathbb{I}\left[s_{B} \cdot s_{B\leftrightarrow} > 0\right]$
  - ▷ $c = 1$: sign preserved; $c = 0$: sign flips
  - $\Delta_{\mathrm{swap}} \leftarrow |s_{A\leftrightarrow} - s_{B\leftrightarrow}|$
- Decision
  - if $c_{A} \neq c_{B}$ then
    - if $\Delta_{\mathrm{swap}} \geq \tau_{\mathrm{swap}}$ then
      - return $a_{\arg\max_{v \in \{A, B\}} c_{v}}$
    - end if
  - end if
  - return $a_{\mathrm{base}}$
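The two-stage decision logic of Algorithm 2 can be sketched in Python as shown below. This is a minimal sketch under stated assumptions: the LLM judge is replaced by a caller-supplied function `judge(answer, evidence)` that is assumed to already close over the question and table and to perform the statement conversion internally, the stub judge and its scores are invented for illustration, and ties in Stage 1 are resolved in favor of the Rewriter answer.

```python
from typing import Callable, Dict, Tuple

# judge(answer, evidence) -> (label, confidence); label is "Support" or "Contradict".
Judge = Callable[[str, str], Tuple[str, float]]


def score(judge: Judge, answer: str, evidence: str) -> float:
    """Signed confidence: +alpha if the evidence supports the answer, else -alpha."""
    label, alpha = judge(answer, evidence)
    return alpha if label == "Support" else -alpha


def rethink(judge: Judge, a_A: str, e_A: str, a_B: str, e_B: str,
            a_base: str, tau: float, tau_swap: float) -> str:
    # Stage 1: each answer is scored against its own evidence.
    s_A, s_B = score(judge, a_A, e_A), score(judge, a_B, e_B)
    if abs(s_A - s_B) >= tau:
        return a_A if s_A >= s_B else a_B

    # Stage 2: each answer is scored against the other path's evidence.
    s_A_swap, s_B_swap = score(judge, a_A, e_B), score(judge, a_B, e_A)
    c_A = int(s_A * s_A_swap > 0)  # 1 if the sign is preserved under swapping
    c_B = int(s_B * s_B_swap > 0)

    # Decision: prefer the answer whose verdict is stable, if the swapped
    # margin is large enough; otherwise fall back to the base answer.
    if c_A != c_B and abs(s_A_swap - s_B_swap) >= tau_swap:
        return a_A if c_A > c_B else a_B
    return a_base


def make_stub_judge(table: Dict[Tuple[str, str], Tuple[str, float]]) -> Judge:
    """Hypothetical judge backed by a lookup table, used only for the example."""
    return lambda answer, evidence: table[(answer, evidence)]


stub = make_stub_judge({
    ("Paris", "eA"): ("Support", 0.6), ("Lyon", "eB"): ("Support", 0.5),
    ("Paris", "eB"): ("Support", 0.7), ("Lyon", "eA"): ("Contradict", 0.8),
})
picked = rethink(stub, "Paris", "eA", "Lyon", "eB", "Lyon", tau=0.5, tau_swap=0.5)
```

In the example, Stage 1 is inconclusive because the two self-evidence scores differ by only 0.1. Under swapped evidence the verdict for "Paris" keeps its sign while the verdict for "Lyon" flips, so the sketch selects "Paris" rather than the fallback.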

![Refer to caption](https://arxiv.org/html/2606.06842v1/latex/figures/Case_study.png)

Figure 5: A Case Study Comparing Self-Critique and Counterfactual Reasoning Paths.

![Refer to caption](https://arxiv.org/html/2606.06842v1/x2.png)

Figure 6: A Case Study Showing How CRAFT Gets the Correct Answer When Both Reasoning Paths Start with a Wrong Answer.

![Refer to caption](https://arxiv.org/html/2606.06842v1/x3.png)

Figure 7: Rewriter Prompt Template.

![Refer to caption](https://arxiv.org/html/2606.06842v1/x4.png)

Figure 8: Reverser Prompt Template.

![Refer to caption](https://arxiv.org/html/2606.06842v1/x5.png)

Figure 9: Extractor Prompt Template.

![Refer to caption](https://arxiv.org/html/2606.06842v1/x6.png)

Figure 10: Rethinker Prompt Template.