Rethinking Stepwise Model Routing: A Cost-Efficient Table Reasoning Perspective

arXiv cs.CL Papers

Summary

This paper proposes EcoTab, a table-aware stepwise routing framework that separately estimates uncertainty for table tokens and text tokens to dynamically route reasoning steps between small and large models, achieving a better accuracy-efficiency trade-off on table reasoning tasks.

arXiv:2605.29319v1 Announce Type: new

# Rethinking Stepwise Model Routing: A Cost-Efficient Table Reasoning Perspective
Source: [https://arxiv.org/html/2605.29319](https://arxiv.org/html/2605.29319)
Shenghao Ye¹\*, Yuxiang Wang²\*, Yu Guo¹\*, Dong Jin³†, Shuangwu Chen¹†, Jian Yang¹

¹University of Science and Technology of China, ²The University of Melbourne, ³Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

{ssh0321y, yukariguo}@mail.ustc.edu.cn, {kingdon, chensw, jianyang}@ustc.edu.cn

\*Equal contribution. †Corresponding authors.

###### Abstract

Large Reasoning Models \(LRMs\) achieve strong performance on table reasoning tasks but incur substantial inference cost due to long reasoning traces\. Stepwise model routing mitigates this issue by dynamically assigning reasoning steps to smaller or larger models\. However, stepwise model routing for table reasoning remains underexplored\. Through empirical analysis, we find that reasoning steps involving tables contain two types of tokens with distinct uncertainty distributions: table tokens grounded in table structure, such as cell values and headers, and text tokens representing surrounding natural\-language reasoning\. The uncertainty of both token types is correlated with the risk that the model makes an error in the next reasoning step\. However, existing methods fail to model them separately, leading to suboptimal routing decisions\. To address this, we propose EcoTab, a table\-aware stepwise routing framework for efficient table reasoning\. At each reasoning step, EcoTab separately estimates the uncertainties of table tokens and text tokens, maps them to next\-step failure risks for the small model, and combines the two risks for routing\. Experiments on multiple table reasoning benchmarks show that EcoTab consistently outperforms strong baselines and achieves a better balance between accuracy and efficiency\.


## 1 Introduction

Table reasoning plays a critical role in real-world applications, including data analytics (Zhao et al., [2024](https://arxiv.org/html/2605.29319#bib.bib8)), fact verification (Parikh et al., [2020](https://arxiv.org/html/2605.29319#bib.bib54)), and scientific reporting (Newman et al., [2024](https://arxiv.org/html/2605.29319#bib.bib9)). However, it remains challenging because tables contain complex structures and implicit relations across rows and columns. Recent Large Reasoning Models (LRMs), such as DeepSeek-R1 (Guo et al., [2025](https://arxiv.org/html/2605.29319#bib.bib10)) and OpenAI’s o-series (Pfister and Jud, [2025](https://arxiv.org/html/2605.29319#bib.bib11)), improve performance by using test-time scaling to produce long reasoning chains during inference. Despite their strong results, this process introduces high computational overhead. The large model size and heavy token usage make LRMs difficult to deploy for table reasoning in latency-sensitive and resource-constrained settings (Zeng et al., [2026](https://arxiv.org/html/2605.29319#bib.bib18)).

To mitigate this bottleneck, stepwise model routing (Shi et al., [2025](https://arxiv.org/html/2605.29319#bib.bib19); Lee et al., [2025](https://arxiv.org/html/2605.29319#bib.bib22)) has emerged as a promising direction. It decomposes the inference process into multiple reasoning steps, allocating simpler steps to smaller, cheaper models and more complex ones to larger, more expensive models. In this way, stepwise model routing offers an effective balance between efficiency and performance (Fernandez et al., [2026](https://arxiv.org/html/2605.29319#bib.bib29)). Existing methods, such as SpecCoT (Shi et al., [2025](https://arxiv.org/html/2605.29319#bib.bib19)) and SpecReason (Pan et al., [2025](https://arxiv.org/html/2605.29319#bib.bib31)), perform well on free-form text reasoning tasks like mathematical reasoning. However, their effectiveness on structured table reasoning tasks remains underexplored.

![Refer to caption](https://arxiv.org/html/2605.29319v1/x1.png)

Figure 1: Effectiveness analysis on table and free-form text reasoning tasks. WikiTQ and TableBench represent table reasoning benchmarks, while MATH500 and AIME24 correspond to free-form text reasoning benchmarks.

To examine this question, we revisit several stepwise model routing methods and evaluate their effectiveness on table reasoning tasks, as detailed in Sec. [3](https://arxiv.org/html/2605.29319#S3). Our analysis reveals a clear gap in the efficiency and performance trade-off between free-form text reasoning and table reasoning. We find that existing methods often misroute “table-specific steps”, such as retrieving the relevant subtable for a question or performing numerical operations over tabular content. To understand this failure, we further analyze the root cause and identify a key insight: such steps contain two types of tokens with distinct uncertainty distributions, namely table tokens, grounded in table structure such as cell values and headers, and text tokens, which reflect the surrounding natural language reasoning. We further find that the uncertainty of both token types is correlated with the next-step failure risk of the small model, which in turn is informative for model selection. However, existing methods do not distinguish table tokens from text tokens when estimating uncertainty, resulting in poor routing performance on table reasoning tasks.

Motivated by these findings, we propose EcoTab, an efficient table-aware stepwise model routing framework for table reasoning. EcoTab is built on a simple intuition: table tokens and text tokens exhibit different uncertainty distributions and should therefore be modeled separately during routing. For each reasoning step, EcoTab first identifies table tokens and text tokens in the current reasoning step and estimates their uncertainties separately. To account for their distinct distributions, EcoTab constructs two offline risk mappings that convert these uncertainties into next-step failure risks, where each risk reflects the likelihood that the small model will fail on the next step. Finally, EcoTab combines the two failure risks into a unified routing score and compares it with a threshold to decide whether the next step should be generated by a small model or a large one.

Our Contributions. (1) New Perspective. We conduct the first systematic study of stepwise model routing for table reasoning. We reveal that table reasoning steps contain two types of tokens with distinct uncertainty distributions, explaining why existing routing methods designed for free-form text fail on table reasoning. (2) Novel Framework. We propose EcoTab, an efficient table-aware stepwise model routing framework for table reasoning. (3) SOTA Performance. Experiments on multiple table reasoning benchmarks show that EcoTab consistently outperforms strong baselines and achieves a better balance between accuracy and efficiency.

## 2 Preliminary

#### Table Reasoning with LRMs.

Given a table $T$ and a natural language query $Q$, an LRM generates a sequence of reasoning steps $s_1, \dots, s_n$, denoted as $s_{1:n}$, where each step $s_i$ contains $k_i$ tokens. Following prior studies (Pan et al., [2025](https://arxiv.org/html/2605.29319#bib.bib31); Zeng et al., [2026](https://arxiv.org/html/2605.29319#bib.bib18)), we segment reasoning traces into steps using the newline delimiter “\n\n”.
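As a minimal sketch of this segmentation rule (the function name `split_steps` and the toy trace are illustrative assumptions, not part of the paper), a trace can be split on blank lines as follows:

```python
def split_steps(trace: str) -> list[str]:
    """Split a reasoning trace into steps on the "\n\n" delimiter.

    Empty fragments (e.g. from three or more consecutive newlines)
    are dropped; this handling is an assumption of the sketch.
    """
    return [step.strip() for step in trace.split("\n\n") if step.strip()]


# Hypothetical trace with three reasoning steps.
trace = "Find the row for 2019.\n\nRead the Revenue cell: 120.\n\nAnswer: 120"
steps = split_steps(trace)
print(len(steps))
print(steps[0])
```

Each element of `steps` then plays the role of one $s_i$, and its token count under the model tokenizer gives $k_i$.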

#### Stepwise Model Routing\.

Given the current reasoning prefix $s_{1:i}$, stepwise model routing dynamically chooses between the small reasoning model (SRM) and the large reasoning model (LRM) to generate the next reasoning step $s_{i+1}$, so as to improve computational efficiency. Formally, the next step is generated as

$$s_{i+1} \sim p_{\theta_{i+1}}(\cdot \mid T, Q, s_{1:i}), \quad \theta_{i+1} = r(\mathcal{I}_{i+1}) \tag{1}$$

where $p_{\theta_{i+1}}$ denotes the probability distribution of the selected reasoning model, and $\theta_{i+1} \in \{\theta_M, \theta_m\}$ corresponds to the LRM or the SRM, respectively. Here, $r(\cdot)$ denotes the routing function, and $\mathcal{I}_{i+1}$ denotes the routing information used to determine the model for step $i+1$. Depending on the routing function, $\mathcal{I}_{i+1}$ may come from the previously generated step $s_i$ (Lee et al., [2025](https://arxiv.org/html/2605.29319#bib.bib22)), or from a lightweight preview or draft of the next step (Pan et al., [2025](https://arxiv.org/html/2605.29319#bib.bib31); Zeng et al., [2026](https://arxiv.org/html/2605.29319#bib.bib18); Shi et al., [2025](https://arxiv.org/html/2605.29319#bib.bib19)). In this work, we use $s_i$ as $\mathcal{I}_{i+1}$, which avoids additional token generation overhead.

![Refer to caption](https://arxiv.org/html/2605.29319v1/x2.png)

Figure 2: Error distribution across four step categories over 1000 incorrect cases under GlimpRouter.

## 3 Motivation

In this section, we investigate why table reasoning requires a dedicated stepwise model routing method beyond existing methods designed for free-form text reasoning. This leads to our first research question:

RQ1 – Do free-form text routing methods adapt effectively to table reasoning?

Effectiveness Analysis. The effectiveness of stepwise model routing depends on reaching LRM-only accuracy at lower cost, which we measure by FLOPs. We evaluate several representative methods, including GlimpRouter (Zeng et al., [2026](https://arxiv.org/html/2605.29319#bib.bib18)), SpecReason (Pan et al., [2025](https://arxiv.org/html/2605.29319#bib.bib31)), SpecCoT (Shi et al., [2025](https://arxiv.org/html/2605.29319#bib.bib19)), and Random Routing. We adopt Qwen3-1.7B and Qwen3-14B (Yang et al., [2025a](https://arxiv.org/html/2605.29319#bib.bib32)) as the SRM and LRM. The evaluation covers table reasoning benchmarks, including WikiTQ (Pasupat and Liang, [2015](https://arxiv.org/html/2605.29319#bib.bib36)) and TableBench (Wu et al., [2025a](https://arxiv.org/html/2605.29319#bib.bib33)), as well as free-form text reasoning benchmarks such as MATH500 (Lightman et al., [2023](https://arxiv.org/html/2605.29319#bib.bib34)) and AIME24. More experimental details are provided in Appendix [A](https://arxiv.org/html/2605.29319#A1). As shown in Figure [1](https://arxiv.org/html/2605.29319#S1.F1), assigning more steps to the stronger LRM naturally increases both FLOPs and accuracy. However, existing routing methods are much less efficient on table reasoning than on free-form text reasoning. On free-form text benchmarks, they achieve near-LRM performance with only about 60% of the full FLOPs. In contrast, on table reasoning benchmarks, they require nearly 80% of the full FLOPs to reach a similar level of performance. This efficiency gap indicates that existing methods do not transfer effectively to table reasoning.

![Refer to caption](https://arxiv.org/html/2605.29319v1/x3.png)

Figure 3: (Left) Difference in average entropy between correct and incorrect steps for Free-Text steps and Table-specific steps. (Right) Entropy distributions of table tokens and text tokens within Table-specific steps.

Error Analysis. To understand the source of this efficiency gap, we conduct an error analysis of the routing process. Specifically, we randomly sample 1000 erroneous cases that are correctly solved under the LRM-only setting but fail under GlimpRouter (Zeng et al., [2026](https://arxiv.org/html/2605.29319#bib.bib18)), and ask human experts to identify the failure step in each case and classify it into one of four error types, following TaTToo (Zou et al., [2026](https://arxiv.org/html/2605.29319#bib.bib24)): (i) Table Retrieval, (ii) Table Operation, (iii) Inner-Thinking, and (iv) Others (defined in Appendix [A](https://arxiv.org/html/2605.29319#A1)). Table Retrieval and Table Operation are defined as “table-specific steps”, as they are unique to table reasoning, while Inner-Thinking and Others are regarded as “free-text steps”. As shown in Figure [2](https://arxiv.org/html/2605.29319#S2.F2), 82.7% of routing errors arise from table-specific steps. This indicates that free-form text routing methods fail to properly route these steps, often assigning them to the SRM when the LRM is actually needed.

Finding of RQ1. Free-form text routing methods fail to properly route table-specific steps, leading to a notable efficiency gap in table reasoning.

![Refer to caption](https://arxiv.org/html/2605.29319v1/x4.png)

Figure 4: (Left) Error percentages across four entropy groups on WikiTQ and TableBench for Table-specific steps. (Right) Overall error distribution across the four groups.

This finding raises a follow-up research question:

RQ2 – Why do free-form text routing methods fail on table-specific reasoning steps?

Table Tokens Differ from Text Tokens. To understand this failure, we compare free-text steps with table-specific steps. Following GlimpRouter, we use the average step entropy as the routing score and randomly sample 500 correct steps and 500 incorrect steps. As shown in Figure [3](https://arxiv.org/html/2605.29319#S3.F3) (left), free-text steps show a clear separation between correct and incorrect cases, with an average entropy gap of 0.14. In contrast, the gap for table-specific steps drops sharply to 0.06. This suggests that step-level entropy is much less informative for routing table-specific steps. We then analyze table-specific steps at the token level by separating each step into table tokens and text tokens. As shown in Figure [3](https://arxiv.org/html/2605.29319#S3.F3) (right), the two token types exhibit clearly different entropy distributions. This suggests that they play different roles during reasoning, which is also consistent with prior studies on table reasoning (Wang et al., [2025a](https://arxiv.org/html/2605.29319#bib.bib43); Zou et al., [2026](https://arxiv.org/html/2605.29319#bib.bib24); Li et al., [2025](https://arxiv.org/html/2605.29319#bib.bib23)).

![Refer to caption](https://arxiv.org/html/2605.29319v1/x5.png)

Figure 5: Overview of the EcoTab framework. By separately modeling table tokens and text tokens in each reasoning step $s_i$, EcoTab enables more effective routing between the SRM and LRM for table reasoning.

Both Token Types Matter for Routing. Building on the analysis above, we further ask whether both table tokens and text tokens are related to the next-step failure risk. Specifically, we compute the average entropy of table tokens and text tokens for 1,000 sampled steps, including 500 correct steps and 500 incorrect steps. For each type of token, we use the 70th percentile as the threshold (Notin et al., [2021](https://arxiv.org/html/2605.29319#bib.bib27)). A score above the threshold is labeled as High, and a score below it is labeled as Low. This gives four groups, namely High-High, High-Low, Low-High, and Low-Low. We then examine how the 500 incorrect steps are distributed across these four groups. As shown in Figure [4](https://arxiv.org/html/2605.29319#S3.F4), errors are not concentrated only in the High-High group. A substantial portion also falls into the High-Low and Low-High groups. This shows that either high table-token uncertainty or high text-token uncertainty alone can be associated with failure risk. Therefore, effective routing should consider both token types rather than relying on only one of them.
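The grouping procedure above can be sketched in a few lines. This is only an illustration under stated assumptions: the nearest-rank percentile, the table-first ordering of the group label, the treatment of ties as Low, and the toy data are all choices made here, not details given in the paper.

```python
import math
from collections import Counter


def percentile(values, q):
    """Nearest-rank percentile for an integer q in (0, 100] (assumed variant)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def entropy_group(tab_entropy, text_entropy, tab_thr, text_thr):
    """Label a step as '<table level>-<text level>'; ties count as Low (assumption)."""
    tab = "High" if tab_entropy > tab_thr else "Low"
    text = "High" if text_entropy > text_thr else "Low"
    return f"{tab}-{text}"


# Hypothetical steps: (table-token entropy, text-token entropy, step is incorrect).
toy_steps = [
    (0.1, 0.2, False), (0.2, 0.9, True), (0.3, 0.3, False), (0.4, 0.1, False),
    (0.5, 0.4, False), (0.6, 0.5, False), (0.7, 0.6, False), (0.8, 0.2, True),
    (0.9, 0.8, True), (1.0, 0.7, False),
]
tab_thr = percentile([s[0] for s in toy_steps], 70)
text_thr = percentile([s[1] for s in toy_steps], 70)

# Count only the incorrect steps in each of the four groups.
errors_by_group = Counter(
    entropy_group(tab, text, tab_thr, text_thr)
    for tab, text, is_error in toy_steps
    if is_error
)
print(dict(errors_by_group))
```

In this toy data the errors spread over three different groups, which mirrors the qualitative point of the analysis: a single high-uncertainty token type can already signal risk.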

Insight for EcoTab. In table reasoning, table tokens and text tokens exhibit different uncertainty distributions, and both are informative of the SRM’s next-step failure risk.

## 4 EcoTab

Motivated by this insight, we propose EcoTab, an efficient table-aware stepwise model routing framework for table reasoning. EcoTab works in three stages. Given the input table $T$, it first builds a lightweight word-level Table Trie to identify table tokens and separate them from text tokens in each reasoning step $s_i$ (Sec. [4.1](https://arxiv.org/html/2605.29319#S4.SS1)). For each $s_i$, EcoTab then estimates table-token uncertainty $\Phi_{\text{tab}}^{(i)}$ and text-token uncertainty $\Phi_{\text{text}}^{(i)}$ (Sec. [4.2](https://arxiv.org/html/2605.29319#S4.SS2)). Finally, EcoTab constructs two offline risk mappings, maps the two uncertainties into two failure risks, and combines them into a final routing score $d_{\text{final}}^{(i)}$, which determines whether the next step $s_{i+1}$ should be generated by the SRM or the LRM (Sec. [4.3](https://arxiv.org/html/2605.29319#S4.SS3)).

### 4.1 Table Trie Construction

To separate table tokens from text tokens, we build a word-level Table Trie from the input table $T$. The Trie stores normalized table content, including column headers and cell values. Before insertion, we apply simple normalization such as lowercasing, removing extra spaces, and standardizing numbers and punctuation. For each reasoning step $s_i = (t_{i,1}, t_{i,2}, \dots, t_{i,k_i})$, we apply the same normalization and scan the step text from left to right with longest-prefix matching over the Trie. If a span matches a column header or cell value, we mark its tokens as table tokens. We then map the matched spans back to token positions and obtain a boolean mask $\mathbf{m}^{(i)} \in \{0,1\}^{k_i}$, where $\mathbf{m}^{(i)}_j = 1$ indicates that $t_{i,j}$ is a table token. Based on this mask, we divide the tokens of $s_i$ into a table-token set $\mathcal{V}_{\text{tab}}^{(i)}$ and a text-token set $\mathcal{V}_{\text{text}}^{(i)}$:

$$\mathcal{V}_{\text{tab}}^{(i)} = \{t_{i,j} \mid \mathbf{m}^{(i)}_j = 1\}, \quad \mathcal{V}_{\text{text}}^{(i)} = \{t_{i,j} \mid \mathbf{m}^{(i)}_j = 0\}, \tag{2}$$

where $1 \leq j \leq k_i$. This procedure is model-free and introduces little overhead in practice.
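A minimal sketch of the Trie-based matching is given below. It is not the authors' implementation: the normalization rules are one simple reading of the description, whitespace-level words stand in for model tokens (so the final mapping from spans back to tokenizer positions is omitted), and all names (`normalize`, `build_trie`, `table_mask`) are hypothetical.

```python
import re

END = "$end"  # end-of-entry marker; cannot collide with a normalized word


def normalize(text):
    """Lowercase, drop thousands separators, and keep alphanumeric words."""
    text = re.sub(r"(?<=\d),(?=\d)", "", text.lower())
    return re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text)


def build_trie(table_entries):
    """Insert every header or cell value as a sequence of normalized words."""
    root = {}
    for entry in table_entries:
        words = normalize(entry)
        if not words:
            continue
        node = root
        for word in words:
            node = node.setdefault(word, {})
        node[END] = True
    return root


def table_mask(step_text, trie):
    """Left-to-right scan with longest matching; returns (words, 0/1 mask)."""
    words = normalize(step_text)
    mask = [0] * len(words)
    i = 0
    while i < len(words):
        node, j, last = trie, i, -1
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if END in node:
                last = j  # longest complete entry seen so far ends before j
        if last > i:
            for k in range(i, last):
                mask[k] = 1
            i = last
        else:
            i += 1
    return words, mask


# Hypothetical table content and reasoning step.
trie = build_trie(["Year", "Net Revenue", "2019", "1,200"])
words, mask = table_mask("The Net Revenue in 2019 is 1,200, so revenue grew.", trie)
print(list(zip(words, mask)))
```

Note that the standalone word "revenue" is not marked in this example because only the full header "Net Revenue" is stored; whether partial header matches should count is a design choice the sketch leaves open.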

### 4.2 Step-Level Uncertainty Estimation

Prior studies have shown that reasoning correctness is closely related to model uncertainty (Xie et al., [2023](https://arxiv.org/html/2605.29319#bib.bib45); Wang and Zhou, [2024](https://arxiv.org/html/2605.29319#bib.bib46)). In particular, the uncertainty of the current step $s_i$ can serve as a useful signal for judging whether the current model has sufficient capability to generate the next step correctly (Lee et al., [2025](https://arxiv.org/html/2605.29319#bib.bib22)). Specifically, we use Shannon entropy (Shannon, [1948](https://arxiv.org/html/2605.29319#bib.bib25)) as the uncertainty measure for each reasoning step. Each step is generated by the current model, which can be either the LRM or the SRM. To quantify the uncertainty of a reasoning step, we first define token-level uncertainty for each token in the step.

Token-Level Uncertainty. For the $j$-th token $t_{i,j}$ in reasoning step $s_i$, we define its token-level uncertainty $c_{i,j} \in \mathbb{R}$ as

$$c_{i,j} = -\sum_{v \in \mathcal{V}} p_{i,j}(v) \log p_{i,j}(v), \tag{3}$$

where $\mathcal{V}$ denotes the model vocabulary, $v \in \mathcal{V}$ is a candidate token, and $p_{i,j}(v)$ denotes the predicted probability of token $v$ at position $(i,j)$.
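For concreteness, Eq. (3) can be evaluated as below. This is a sketch that assumes the natural logarithm (the paper does not state the base) and takes the next-token distribution as a plain list of probabilities; in practice the distribution would come from the generating model's softmax output.

```python
import math


def token_entropy(probs):
    """Shannon entropy, in nats, of one next-token distribution.

    Zero-probability entries are skipped, using the convention 0 * log 0 = 0.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# A peaked distribution is less uncertain than a flat one.
peaked = token_entropy([0.97, 0.01, 0.01, 0.01])
flat = token_entropy([0.25, 0.25, 0.25, 0.25])
print(peaked < flat)
```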

Step-Level Uncertainty. For each reasoning step $s_i$, we use the Table Trie to partition its tokens into a table-token set $\mathcal{V}_{\text{tab}}^{(i)}$ and a text-token set $\mathcal{V}_{\text{text}}^{(i)}$. We then average the token-level uncertainty separately over the two sets to obtain the table uncertainty $\Phi_{\text{tab}}^{(i)}$ and text uncertainty $\Phi_{\text{text}}^{(i)}$:

$$\Phi_{\ast}^{(i)} = \frac{1}{|\mathcal{V}_{\ast}^{(i)}|} \sum_{t_{i,j} \in \mathcal{V}_{\ast}^{(i)}} c_{i,j}, \quad \ast \in \{\text{tab}, \text{text}\}. \tag{4}$$
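A short sketch of Eq. (4), given per-token entropies and the mask from Sec. 4.1. Returning `None` when a step contains no tokens of one type is an assumption of this sketch; the equation itself is undefined for an empty set, and the paper does not say how that case is handled.

```python
def step_uncertainties(token_entropies, mask):
    """Average token entropies separately over table (mask=1) and text (mask=0) tokens."""
    tab = [c for c, m in zip(token_entropies, mask) if m == 1]
    text = [c for c, m in zip(token_entropies, mask) if m == 0]

    def mean(xs):
        return sum(xs) / len(xs) if xs else None  # empty set: assumed fallback

    return mean(tab), mean(text)


# Hypothetical step with four tokens, two of which are table tokens.
phi_tab, phi_text = step_uncertainties([0.2, 1.0, 0.6, 0.4], [0, 1, 1, 0])
print(phi_tab, phi_text)
```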

### 4.3 Routing Score via Noisy-OR

Given table-token uncertainty $\Phi_{\text{tab}}^{(i)}$ and text-token uncertainty $\Phi_{\text{text}}^{(i)}$, we need to combine them into a single routing score for model selection. However, the two uncertainties follow different distributions. If they are directly averaged or linearly weighted, the resulting score can be poorly calibrated and less reliable for routing. To make them more comparable, we first map them into next-step failure risks in $[0,1]$. For each uncertainty $\Phi_{\ast}^{(i)}$, where $\ast \in \{\text{tab}, \text{text}\}$, we define

$$d_{\ast}^{(i)} \approx \Pr(\text{SRM fails on } s_{i+1} \mid \Phi_{\ast}^{(i)}),$$

which measures how likely the SRM is to fail if it is used to generate the next step. To estimate this mapping, we build two offline failure-risk mappings on a held-out validation set. Specifically, we construct step-level supervision by identifying a critical routing boundary for each retained sample, namely the step whose routing score should trigger switching to the LRM for generating the next step. This boundary is determined through suffix replacement on validation trajectories, as detailed in Appendix [C](https://arxiv.org/html/2605.29319#A3). The identified boundary step is labeled as positive, earlier retained steps are labeled as negative, and later steps are discarded. This construction aligns the supervision target with the next-step routing objective. We then fit a sigmoid mapping for each uncertainty:

$$d_{\ast}^{(i)} = f_{\ast}\left(\Phi_{\ast}^{(i)}\right) = \sigma\left(a_{\ast} \Phi_{\ast}^{(i)} + b_{\ast}\right), \tag{5}$$

where $a_{\ast}$ and $b_{\ast}$ are learned from the validation set. After calibration, the two failure risks are combined using Noisy-OR (Pearl, [2014](https://arxiv.org/html/2605.29319#bib.bib44)):

$$d_{\text{final}}^{(i)} = 1 - \left(1 - d_{\text{tab}}^{(i)}\right)\left(1 - d_{\text{text}}^{(i)}\right). \tag{6}$$

Finally, we compare $d_{\text{final}}^{(i)}$ with a threshold $\tau$ to decide whether the next step $s_{i+1}$ should be generated by the LRM or the SRM, with the LRM selected when the score exceeds $\tau$.
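The calibration and fusion steps of Eqs. (5) and (6) can be sketched as follows. Several parts are assumptions made for illustration: the paper only says that $a_{\ast}$ and $b_{\ast}$ are learned on the validation set, so the plain gradient-descent logistic fit, its learning rate and epoch count, the toy uncertainties and labels, and the function names are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_risk_mapping(phis, labels, lr=0.5, epochs=2000):
    """Fit d = sigmoid(a * phi + b) by gradient descent on the logistic loss.

    labels: 1 for a boundary step (positive), 0 for an earlier step (negative).
    """
    a, b, n = 0.0, 0.0, len(phis)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for phi, y in zip(phis, labels):
            err = sigmoid(a * phi + b) - y
            grad_a += err * phi
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def noisy_or(d_tab, d_text):
    """Eq. (6): the combined risk is high if either individual risk is high."""
    return 1.0 - (1.0 - d_tab) * (1.0 - d_text)


def route(phi_tab, phi_text, tab_params, text_params, tau):
    """Return the model for the next step and the routing score d_final."""
    d_tab = sigmoid(tab_params[0] * phi_tab + tab_params[1])
    d_text = sigmoid(text_params[0] * phi_text + text_params[1])
    d_final = noisy_or(d_tab, d_text)
    return ("LRM" if d_final > tau else "SRM"), d_final


# Toy validation data (hypothetical): step uncertainties and boundary labels.
tab_phis = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
text_phis = [0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7, 1.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
tab_params = fit_risk_mapping(tab_phis, labels)
text_params = fit_risk_mapping(text_phis, labels)

model, score = route(0.75, 1.8, tab_params, text_params, tau=0.5)
print(model, round(score, 3))
```

Because each mapping has its own slope and intercept, the two uncertainties are put on a common probability scale before fusion, which is the motivation the section gives for not averaging the raw entropies.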

### 4.4 Overall Pipeline

The overall procedure distinguishes the initial step from all subsequent steps, since no prior routing score is available for the first step. Here, $\theta_M$ and $\theta_m$ denote the LRM and the SRM, respectively.

(1) Initial step. For all samples, $\theta_m$ first generates the first reasoning step $s_1$.

(2) First-step refinement. Compute the routing score $d_{\text{final}}^{(1)}$ of $s_1$, and regenerate the first step $s_1$ with $\theta_M$ if $d_{\text{final}}^{(1)} > \tau$.

(3) Iterative. For each $s_i$, the routing score $d_{\text{final}}^{(i)}$ is used to choose $\theta_M$ or $\theta_m$ for the next step $s_{i+1}$.

(4) Answer generation. Repeat step (3) until the final answer is produced or the iteration limit is reached.
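The four stages above can be sketched as a control loop. This is a toy under explicit assumptions: the models, the scoring function, and the stopping test are stand-in callables (in the real system the score would be $d_{\text{final}}^{(i)}$ computed from the generating model's token entropies), and the names `ecotab_loop`, `is_final`, and `max_steps` are hypothetical.

```python
def ecotab_loop(srm, lrm, score, tau, is_final, max_steps=32):
    """Run stepwise routing; returns the steps and which model produced each.

    srm / lrm: callables mapping the current prefix (list of steps) to the next step.
    score:     callable mapping a step to its routing score in [0, 1].
    """
    steps, used = [], []

    # (1) Initial step: the SRM always drafts the first step.
    first, who = srm(steps), "SRM"
    # (2) First-step refinement: regenerate with the LRM if the score exceeds tau.
    if score(first) > tau:
        first, who = lrm(steps), "LRM"
    steps.append(first)
    used.append(who)

    # (3)-(4) Iterate until a final answer appears or the step limit is hit.
    while not is_final(steps[-1]) and len(steps) < max_steps:
        use_lrm = score(steps[-1]) > tau
        model = lrm if use_lrm else srm
        steps.append(model(steps))
        used.append("LRM" if use_lrm else "SRM")
    return steps, used


# Stand-in models that only report who generated which step index.
toy_srm = lambda prefix: f"S{len(prefix) + 1}"
toy_lrm = lambda prefix: f"L{len(prefix) + 1}"
toy_scores = {"S1": 0.2, "S2": 0.8, "L3": 0.1}
toy_score = lambda step: toy_scores.get(step, 0.0)

steps, used = ecotab_loop(toy_srm, toy_lrm, toy_score, tau=0.5,
                          is_final=lambda step: step.endswith("4"))
print(steps)
print(used)
```

In the toy run only the third step is escalated, because the second step is the only one whose score exceeds the threshold.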

## 5 Experiments

In this section, we present a comprehensive evaluation of EcoTab. We first describe the experimental setup and evaluation metrics (Sec. [5.1](https://arxiv.org/html/2605.29319#S5.SS1)). We then organize our experiments around four key research questions: Q1: Can EcoTab consistently outperform existing state-of-the-art stepwise model routing methods on table reasoning tasks (Sec. [5.2](https://arxiv.org/html/2605.29319#S5.SS2))? Q2: How does each core component of EcoTab contribute to its overall performance (Sec. [5.3](https://arxiv.org/html/2605.29319#S5.SS3))? Q3: Is EcoTab robust to different calibration and threshold settings (Sec. [5.4](https://arxiv.org/html/2605.29319#S5.SS4))? Q4: Can EcoTab effectively capture reasoning difficulty while remaining lightweight in routing overhead (Sec. [5.5](https://arxiv.org/html/2605.29319#S5.SS5))?

### 5.1 Experimental Setup

| Method | Extra Tokens | WikiTQ Acc↑ | WikiTQ FLOPs↓ | TabFact Acc↑ | TabFact FLOPs↓ | TableBench Acc↑ | TableBench FLOPs↓ | HiTab Acc↑ | HiTab FLOPs↓ | FinQA Acc↑ | FinQA FLOPs↓ | Average Acc↑ | Average FLOPs↓ | Average A/F↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Qwen3-Instruct* | | | | | | | | | | | | | | |
| 1.7B Only | – | 67.34 | 3.84 | 82.42 | 2.51 | 42.55 | 6.12 | 59.34 | 4.82 | 47.52 | 7.38 | 59.83 | 4.93 | 15.2 |
| 14B Only | – | 84.12 | 21.6 | 90.47 | 14.6 | 54.85 | 42.4 | 80.17 | 20.5 | 67.74 | 48.0 | 75.47 | 29.4 | 3.34 |
| Random | ✗ | 75.82 | 18.5 | 86.76 | 10.6 | 46.12 | 36.7 | 73.43 | 17.4 | 58.78 | 40.4 | 68.18 | 24.7 | 3.84 |
| RSD | ✓ | 79.73 | 16.2 | 88.24 | 9.32 | 49.21 | 30.4 | 75.88 | 15.2 | 63.87 | 36.4 | 71.39 | 21.5 | 4.55 |
| SpecCoT | ✓ | 78.50 | 15.8 | 87.98 | 8.98 | 49.87 | 29.7 | 76.04 | 15.0 | 64.01 | 36.0 | 71.28 | 21.1 | 4.66 |
| SpecReason | ✓ | 78.76 | 15.7 | 87.74 | 9.21 | 50.02 | 29.9 | 76.41 | 15.0 | 64.21 | 36.0 | 71.43 | 21.1 | 4.62 |
| STEER | ✗ | 79.46 | 15.4 | 88.44 | 8.43 | 50.33 | 29.4 | 77.04 | 14.9 | 63.73 | 35.4 | 71.80 | 20.7 | 4.87 |
| GlimpRouter | ✓ | 79.04 | 16.4 | 88.85 | 8.56 | 50.52 | 30.1 | 76.22 | 15.4 | 63.98 | 36.1 | 71.72 | 21.3 | 4.72 |
| EcoTab (ours) | ✗ | 80.83 | 14.0 | 89.14 | 8.42 | 51.77 | 28.1 | 78.32 | 13.9 | 65.48 | 34.3 | 73.11 | 19.7 | 5.15 |
| *DeepSeek-R1-Distill-Qwen* | | | | | | | | | | | | | | |
| 1.7B Only | – | 67.34 | 3.84 | 82.42 | 2.51 | 42.55 | 6.12 | 59.34 | 4.82 | 47.52 | 7.38 | 59.83 | 4.93 | 15.22 |
| 14B Only | – | 81.71 | 20.8 | 89.21 | 15.3 | 53.65 | 40.5 | 70.08 | 22.9 | 69.40 | 30.9 | 72.81 | 26.0 | 3.28 |
| Random | ✗ | 74.78 | 17.9 | 85.77 | 11.1 | 44.03 | 36.4 | 63.06 | 18.1 | 59.67 | 26.7 | 65.46 | 22.0 | 3.77 |
| RSD | ✓ | 78.25 | 15.3 | 87.12 | 9.01 | 48.81 | 28.6 | 66.67 | 15.3 | 66.02 | 21.9 | 69.37 | 18.0 | 4.77 |
| SpecCoT | ✓ | 78.12 | 15.0 | 87.41 | 8.86 | 48.74 | 28.3 | 66.94 | 15.0 | 65.89 | 22.6 | 69.42 | 17.9 | 4.83 |
| SpecReason | ✓ | 78.01 | 15.0 | 87.57 | 8.94 | 48.89 | 28.4 | 66.58 | 15.1 | 66.07 | 22.2 | 69.42 | 17.9 | 4.82 |
| STEER | ✗ | 78.32 | 14.4 | 87.64 | 8.83 | 49.22 | 29.1 | 67.01 | 15.3 | 66.15 | 21.3 | 69.67 | 17.8 | 4.91 |
| GlimpRouter | ✓ | 78.04 | 14.8 | 87.81 | 8.78 | 49.54 | 28.1 | 66.71 | 14.8 | 65.81 | 22.8 | 69.58 | 17.8 | 4.89 |
| EcoTab (ours) | ✗ | 79.56 | 13.1 | 88.04 | 8.81 | 50.41 | 27.4 | 67.97 | 13.9 | 67.63 | 21.4 | 70.72 | 16.9 | 5.19 |

Table 1: Main Results. Acc↑ is measured at 60% of LRM-only FLOPs, and FLOPs↓ denotes the computation required to reach 98% of LRM-only accuracy. The best result is bold, and the second-best result is underlined. Extra Tokens denotes whether the method requires additional token generation during the reasoning process.

#### Models and Configurations.

We use Qwen3-1.7B (Yang et al., [2025a](https://arxiv.org/html/2605.29319#bib.bib32)) as the SRM, and evaluate two LRMs: Qwen3-14B (Yang et al., [2025a](https://arxiv.org/html/2605.29319#bib.bib32)) and DeepSeek-R1-Distill-Qwen-14B (Guo et al., [2025](https://arxiv.org/html/2605.29319#bib.bib10)). This setup allows us to study both same-family and cross-family collaboration.

#### Benchmarks\.

We evaluate EcoTab on five table reasoning benchmarks. TableBench (Wu et al., [2025a](https://arxiv.org/html/2605.29319#bib.bib33)) contains 886 complex questions spanning numerical reasoning, fact checking, and data analysis. WikiTQ (Pasupat and Liang, [2015](https://arxiv.org/html/2605.29319#bib.bib36)) focuses on question answering over Wikipedia tables, and TabFact (Chen et al., [2020a](https://arxiv.org/html/2605.29319#bib.bib37)) evaluates table-based fact verification. To further test robustness, we include HiTab (Cheng et al., [2022](https://arxiv.org/html/2605.29319#bib.bib38)), which features hierarchical nested tables, and FinQA (Chen et al., [2021](https://arxiv.org/html/2605.29319#bib.bib39)), a text+table reasoning dataset requiring joint understanding of financial reports and tabular data. Additional details are provided in Appendix [A.2](https://arxiv.org/html/2605.29319#A1.SS2).

#### Baselines\.

We compare EcoTab against standalone models and state-of-the-art stepwise model routing baselines, including SRM/LRM Only, Random, RSD (Liao et al., [2025](https://arxiv.org/html/2605.29319#bib.bib40)), SpecCoT (Shi et al., [2025](https://arxiv.org/html/2605.29319#bib.bib19)), SpecReason (Pan et al., [2025](https://arxiv.org/html/2605.29319#bib.bib31)), STEER (Lee et al., [2025](https://arxiv.org/html/2605.29319#bib.bib22)), and GlimpRouter (Zeng et al., [2026](https://arxiv.org/html/2605.29319#bib.bib18)).

#### Evaluation Metric\.

Following prior work (Liao et al., [2025](https://arxiv.org/html/2605.29319#bib.bib40); Sardana et al., [2024](https://arxiv.org/html/2605.29319#bib.bib41)), we use accuracy as the performance metric and estimate inference cost using the standard Transformer FLOPs approximation of $2N$ per generated token for a model with $N$ parameters. To enable clearer comparison across baselines, we conduct a grid search over threshold values with a step size of 0.05 for all baseline methods and EcoTab, and derive the corresponding accuracy–FLOPs trade-off curves. We then report the accuracy achieved at $60\%$ of the LRM-only FLOPs, as well as the FLOPs required to reach $98\%$ of the LRM-only accuracy. In addition, we report Accuracy-per-FLOPs (A/F), adapted from Ma et al. ([2025](https://arxiv.org/html/2605.29319#bib.bib42)), to better characterize the trade-off between performance and computational cost.
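As a worked example of the $2N$-per-token cost estimate, the sketch below totals the FLOPs of a routed trace and compares them with an LRM-only run. The parameter counts are nominal values read off the model names, the trace is invented, and the comparison assumes the LRM-only run generates the same number of tokens, which need not hold in practice.

```python
def routed_flops(trace, params):
    """Sum 2 * N * generated_tokens over steps.

    trace:  list of (model_name, generated_tokens) pairs.
    params: mapping from model_name to parameter count N.
    """
    return sum(2 * params[model] * n_tokens for model, n_tokens in trace)


# Nominal sizes assumed from the model names (1.7B and 14B parameters).
PARAMS = {"SRM": 1.7e9, "LRM": 14e9}

# Hypothetical routed trace: two SRM steps and one LRM step.
trace = [("SRM", 100), ("LRM", 50), ("SRM", 80)]
total_tokens = sum(n for _, n in trace)

routed = routed_flops(trace, PARAMS)
lrm_only = routed_flops([("LRM", total_tokens)], PARAMS)
print(f"routed / LRM-only FLOPs = {routed / lrm_only:.3f}")
```

Sweeping the routing threshold and recording (FLOPs, accuracy) pairs for each value would then trace out the trade-off curve from which the reported operating points are read.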

Table 2: Ablation results of EcoTab on TableBench. We report accuracy (Acc) and the average FLOPs per query.

### 5\.2 Main Results

Table [1](https://arxiv.org/html/2605.29319#S5.T1) presents the main results of EcoTab on five table reasoning benchmarks under both same\-family and cross\-family collaboration settings\. Overall, EcoTab consistently outperforms all baselines and achieves the best accuracy–efficiency trade\-off in both settings\. In particular, it obtains the best overall average performance in terms of accuracy, FLOPs, and A/F, showing that EcoTab can improve reasoning quality while reducing inference cost\. A closer look shows that EcoTab remains consistently effective across different datasets and model combinations\. With either Qwen3\-14B or DeepSeek\-R1\-Distill\-Qwen\-14B as the LRM, EcoTab achieves the highest accuracy on all five benchmarks while also maintaining the lowest overall inference cost\. This result shows that the advantage of EcoTab is not limited to a specific model family, but generalizes to both same\-family and cross\-family collaboration\. Another notable advantage is that EcoTab does not require extra token generation during routing, yet it still surpasses strong baselines that incur verification overhead\. This suggests that explicitly separating table\-token and text\-token uncertainty is more effective for table reasoning than directly applying free\-form text routing methods\. Overall, the results confirm that EcoTab is a more suitable step\-level routing framework for table reasoning, delivering a stronger balance between effectiveness and efficiency\.

### 5\.3 Ablation Study

Table [2](https://arxiv.org/html/2605.29319#S5.T2) reports the ablation results of EcoTab on TableBench\. Overall, the full EcoTab consistently performs best under both Qwen3\-Instruct and DeepSeek\-R1 settings, achieving the highest accuracy with the lowest FLOPs\. Removing the separation between table tokens and text tokens \(*w/ average token*\) causes the largest performance drop and a clear increase in inference cost, showing that collapsing all tokens into a single score weakens routing decisions\. Using only table tokens \(*w/ only table token*\) or only text tokens \(*w/ only text token*\) also degrades performance, indicating that both token types are necessary and complementary for effective routing\. In addition, replacing Noisy\-OR with simple linear weighting \(*w/ linear weighting*\) consistently hurts both accuracy and efficiency, suggesting that EcoTab benefits not only from separating the two token types, but also from using a more suitable fusion strategy\. Overall, these results indicate that every component of EcoTab contributes to its effectiveness\.
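To make the difference between the two fusion strategies concrete, the sketch below contrasts a textbook Noisy\-OR combination of two risks with a simple linear weighting\. The exact parameterization used by EcoTab is defined in Sec\. 4\.3 and is not restated in this section, so the unweighted Noisy\-OR form, the function names, and the default weight of 0\.5 are assumptions for illustration only\.

```python
def noisy_or(risk_table, risk_text):
    """Textbook Noisy-OR: the step fails if either risk source triggers a failure."""
    return 1.0 - (1.0 - risk_table) * (1.0 - risk_text)


def linear_weighting(risk_table, risk_text, weight=0.5):
    """Convex combination of the two risks (the ablated alternative)."""
    return weight * risk_table + (1.0 - weight) * risk_text


# Hypothetical step: high table-token risk, low text-token risk.
fused_or = noisy_or(0.6, 0.1)           # 1 - 0.4 * 0.9 = 0.64
fused_lin = linear_weighting(0.6, 0.1)  # 0.5 * 0.6 + 0.5 * 0.1 = 0.35
```

In this form, Noisy\-OR never falls below the larger of the two risks, whereas an averaged score can dilute a single high\-risk signal\. This is one plausible reading of why the linear variant routes less reliably, not an explanation given by the authors\.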

![Refer to caption](https://arxiv.org/html/2605.29319v1/x6.png) Figure 6: Failure\-risk mapping transferability of EcoTab\. ID denotes in\-domain evaluation, and OOD denotes out\-of\-domain transfer\.

### 5\.4 Robustness Analysis

#### Failure\-risk mapping transferability\.

We study whether the fitted score\-to\-risk mapping in Sec\. [4\.3](https://arxiv.org/html/2605.29319#S4.SS3) can transfer across domains\. Specifically, we compare an in\-domain setting \(ID\), where the mapping is fitted on the target dataset, with an out\-of\-domain setting \(OOD\), where the mapping fitted on the WikiTQ validation split is directly applied to TableBench\. Figure [6](https://arxiv.org/html/2605.29319#S5.F6) shows that the OOD variant still outperforms Random and STEER on both metrics\. It achieves higher accuracy at 60% of the LRM\-only FLOPs and requires fewer FLOPs to reach 98% of the LRM\-only accuracy\. Although the ID setting performs slightly better, the gap is small, which suggests that the learned risk mapping generalizes well across table reasoning domains\.
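The ID/OOD protocol can be sketched as fitting a score\-to\-risk table on one dataset and applying it unchanged to another\. The actual form of the mapping is defined in Sec\. 4\.3 and is not restated here, so the sketch below substitutes a simple equal\-width binning estimator\. That estimator, the function names, and the toy data are all assumptions for illustration\.

```python
import bisect


def fit_binned_risk(scores, failed, num_bins=5):
    """Fit an empirical score-to-risk table with equal-width bins.

    `scores` are uncertainty scores and `failed` are 0/1 next-step failure flags.
    Empty bins fall back to the overall failure rate.
    """
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / num_bins or 1.0  # guard against identical scores
    edges = [lo + width * k for k in range(1, num_bins)]  # interior bin edges
    fails, counts = [0] * num_bins, [0] * num_bins
    for score, flag in zip(scores, failed):
        b = bisect.bisect_right(edges, score)
        counts[b] += 1
        fails[b] += int(flag)
    overall = sum(fails) / len(scores)
    rates = [fails[b] / counts[b] if counts[b] else overall for b in range(num_bins)]
    return edges, rates


def predict_risk(mapping, score):
    """Look up the failure risk for a score; out-of-range scores use the end bins."""
    edges, rates = mapping
    return rates[bisect.bisect_right(edges, score)]


# Fit on a (toy) source validation split, then apply unchanged to target scores.
source_mapping = fit_binned_risk([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1], num_bins=2)
target_risks = [predict_risk(source_mapping, s) for s in (0.3, 0.7, 1.5)]
```

In the ID setting the same procedure would be run with the target dataset's own validation split in place of the source split\.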

#### Threshold robustness\.

We further vary the routing threshold and plot the accuracy–FLOPs curves on WikiTQ and TableBench\. A higher threshold makes the router more conservative in calling the LRM, so fewer steps are assigned to the LRM and the total FLOPs become lower\. As shown in Figure [7](https://arxiv.org/html/2605.29319#S5.F7), EcoTab consistently stays above STEER and Random on both datasets\. It achieves higher accuracy at similar cost, or lower cost at similar accuracy\. Overall, the gain comes from more reliable model routing rather than from a specific threshold choice\.
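The relationship between the threshold and LRM usage can be illustrated with a minimal sketch\. It assumes that a step is sent to the LRM when its fused risk strictly exceeds the threshold, which matches the description that a higher threshold yields fewer LRM calls\. The strict comparison, the function names, and the example risk values are assumptions\.

```python
def route_step(risk, tau):
    """Send the next step to the LRM only when the estimated risk exceeds tau."""
    return "LRM" if risk > tau else "SRM"


def lrm_usage_rate(risks, tau):
    """Fraction of steps routed to the LRM at a given threshold."""
    if not risks:
        return 0.0
    return sum(route_step(r, tau) == "LRM" for r in risks) / len(risks)


# Hypothetical per-step risks for one reasoning trace.
risks = [0.05, 0.20, 0.35, 0.50, 0.80]

# Threshold sweep with a step size of 0.05, as in the grid search.
thresholds = [round(0.05 * i, 2) for i in range(21)]
usage = [lrm_usage_rate(risks, tau) for tau in thresholds]
```

Because the LRM share can only shrink as the threshold grows, the sweep traces the accuracy–FLOPs curve from the LRM\-only end toward the SRM\-only end\.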

### 5\.5 Further Analysis

#### Difficulty awareness\.

We further examine whether EcoTab is sensitive to sample difficulty\. Following prior work \(Ye et al\., [2025](https://arxiv.org/html/2605.29319#bib.bib28)\), we estimate TableBench difficulty using GPT\-5\.4\-High\. For each question, we sample 100 independent answers and use the number of correct responses as a proxy for difficulty\. Based on this score, we group samples into three subsets: *hard* \(0–9 correct\), *medium* \(10–59 correct\), and *easy* \(60–100 correct\)\. We then compute the LRM usage rate of EcoTab at different relative step positions within each group\. Figure [8](https://arxiv.org/html/2605.29319#S6.F8) shows that harder samples consistently trigger more LRM calls than easier ones\. This pattern holds for both correct and incorrect samples, and is especially clear on incorrect samples\. These results suggest that EcoTab can capture sample\-level difficulty and allocate more LRM computation to harder cases\.
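The grouping and the position\-wise usage statistic can be sketched as follows\. The bucket boundaries come from the text above\. The number of position bins, the way a step index is mapped to a relative position, and the function names are assumptions\.

```python
def difficulty_bucket(num_correct):
    """Map the number of correct answers out of 100 samples to a difficulty group."""
    if not 0 <= num_correct <= 100:
        raise ValueError("num_correct must be between 0 and 100")
    if num_correct <= 9:
        return "hard"
    if num_correct <= 59:
        return "medium"
    return "easy"


def usage_by_position(traces, num_bins=5):
    """LRM usage rate per relative-position bin.

    Each trace is a list of per-step labels, "SRM" or "LRM". Step i of an
    n-step trace falls into bin floor(i * num_bins / n). Empty bins yield None.
    """
    hits, totals = [0] * num_bins, [0] * num_bins
    for trace in traces:
        n = len(trace)
        for i, model in enumerate(trace):
            b = i * num_bins // n
            totals[b] += 1
            hits[b] += model == "LRM"
    return [h / t if t else None for h, t in zip(hits, totals)]


# Hypothetical traces for one difficulty group.
rates = usage_by_position([["SRM", "LRM"], ["LRM", "LRM"]], num_bins=2)
```

Running `usage_by_position` separately on the traces of each group gives one curve per difficulty level, which is the kind of comparison the figure reports\.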

![Refer to caption](https://arxiv.org/html/2605.29319v1/x7.png) Figure 7: Threshold robustness of EcoTab on WikiTQ and TableBench\. As the threshold $\tau$ increases, the total FLOPs decrease\.

#### Routing overhead\.

We then compare the routing latency of different methods on the full TableBench benchmark, excluding the reasoning time and measuring only model routing overhead\. Figure [9](https://arxiv.org/html/2605.29319#S6.F9) shows that EcoTab has an overall latency of 48\.6 seconds, which is comparable to STEER at 52\.1 seconds, while remaining far below RSD, SpecCoT, and SpecReason\. This result shows that EcoTab introduces little extra overhead and remains a lightweight routing method in practice\.

## 6 Related Work

#### Table Reasoning\.

Reasoning over tables poses unique challenges for LLMs, as it requires both natural language understanding and structured reasoning over rows, columns, and cell values \(Jin et al\., [2022](https://arxiv.org/html/2605.29319#bib.bib49); Zhang et al\., [2025](https://arxiv.org/html/2605.29319#bib.bib50)\)\. Recent studies \(Chen et al\., [2020b](https://arxiv.org/html/2605.29319#bib.bib51); Deng et al\., [2022](https://arxiv.org/html/2605.29319#bib.bib52); Iida et al\., [2021](https://arxiv.org/html/2605.29319#bib.bib53)\) have explored table reasoning across a variety of downstream tasks, including table QA \(Chen et al\., [2020b](https://arxiv.org/html/2605.29319#bib.bib51)\) and table fact verification \(Parikh et al\., [2020](https://arxiv.org/html/2605.29319#bib.bib54)\)\. Early methods, such as TAPAS \(Herzig et al\., [2020](https://arxiv.org/html/2605.29319#bib.bib55)\) and TaBERT \(Yin et al\., [2020](https://arxiv.org/html/2605.29319#bib.bib56)\), mainly model tables through Transformer\-based encoders\. With the rise of LLMs, later approaches began to improve table reasoning through prompt engineering \(Sui et al\., [2024](https://arxiv.org/html/2605.29319#bib.bib57); Wang et al\., [2024](https://arxiv.org/html/2605.29319#bib.bib58)\) or supervised fine\-tuning \(Su et al\., [2024](https://arxiv.org/html/2605.29319#bib.bib59)\)\. More recent works, such as the Table\-R1 series \(Yang et al\., [2025b](https://arxiv.org/html/2605.29319#bib.bib12); Wu et al\., [2025b](https://arxiv.org/html/2605.29319#bib.bib13); Jin et al\., [2025](https://arxiv.org/html/2605.29319#bib.bib15)\), further enhance reasoning performance by using reinforcement learning to optimize reasoning trajectories\. However, as table size grows and reasoning models become larger and more verbose, inference latency becomes increasingly prohibitive\. EcoTab addresses this issue by introducing a table\-aware step\-level routing framework to better balance efficiency and performance\.

![Refer to caption](https://arxiv.org/html/2605.29319v1/x8.png) Figure 8: LRM usage rate across GPT\-5\.4\-High difficulty levels, comparing correct and incorrect samples\.

#### Efficient Reasoning\.

Scaling test\-time compute in LRMs can substantially improve reasoning performance, but also incurs prohibitive latency \(Wang et al\., [2025b](https://arxiv.org/html/2605.29319#bib.bib16)\)\. To mitigate this cost, recent work has explored dynamically offloading part of the reasoning process to smaller models, mainly at three levels: query\-level routing \(Chen et al\., [2023b](https://arxiv.org/html/2605.29319#bib.bib17); Wang et al\., [2025c](https://arxiv.org/html/2605.29319#bib.bib14); Zhao et al\., [2025](https://arxiv.org/html/2605.29319#bib.bib30)\), step\-level routing \(Shi et al\., [2025](https://arxiv.org/html/2605.29319#bib.bib19); Pan et al\., [2025](https://arxiv.org/html/2605.29319#bib.bib31); Lee et al\., [2025](https://arxiv.org/html/2605.29319#bib.bib22)\), and token\-level speculation \(Leviathan et al\., [2023](https://arxiv.org/html/2605.29319#bib.bib20); Chen et al\., [2023a](https://arxiv.org/html/2605.29319#bib.bib21)\)\. Among them, step\-level routing is particularly well aligned with the multi\-step reasoning nature of LRMs\. Existing methods range from training\-based reward guidance, such as RSD \(Liao et al\., [2025](https://arxiv.org/html/2605.29319#bib.bib40)\), to training\-free paradigms based on multi\-path selection, such as SpecCoT \(Shi et al\., [2025](https://arxiv.org/html/2605.29319#bib.bib19)\), post\-hoc verification, such as SpecReason \(Pan et al\., [2025](https://arxiv.org/html/2605.29319#bib.bib31)\), and methods that route models based on uncertainty signals, such as entropy \(Zeng et al\., [2026](https://arxiv.org/html/2605.29319#bib.bib18)\) or confidence \(Lee et al\., [2025](https://arxiv.org/html/2605.29319#bib.bib22)\)\. However, these methods are mainly designed for free\-form text reasoning and do not account for the structured nature of table reasoning\.

![Refer to caption](https://arxiv.org/html/2605.29319v1/x9.png) Figure 9: Overall routing latency on TableBench, together with the latency breakdown of EcoTab\.

## 7 Conclusion

In this paper, we study efficient stepwise model routing for table reasoning with LRMs\. We show that existing step\-level routing methods designed for free\-form text reasoning are less effective in tabular settings due to the structured nature of tables\. To address this issue, we propose EcoTab, an adaptive table\-aware step\-level routing framework that explicitly distinguishes table tokens and text tokens when estimating step difficulty\. The resulting routing score is computed through a probabilistic fusion mechanism to guide model selection during reasoning\. Experiments on multiple table reasoning benchmarks demonstrate that EcoTab consistently achieves a better trade\-off between reasoning accuracy and computational cost compared with existing stepwise model routing methods\. These results highlight the importance of table\-aware routing for efficient reasoning over structured data\.

## Limitations

Like existing routing methods, EcoTab cannot explicitly control the output length of the reasoning process\. In practice, the routing decision only determines which model generates the next step, but does not decide when the reasoning should stop\. As a result, unnecessarily long reasoning traces may still appear even when the routing is accurate\. A promising direction for future work is to design a table\-aware early stopping mechanism that can terminate reasoning once the required table evidence and logical deductions are sufficient\.

## Ethics Statement

Our work aims to improve the efficiency and reliability of multi\-step table reasoning through stepwise model routing\. However, like any system built on LLMs, it may still produce incorrect intermediate reasoning steps or factually incorrect final answers\. We therefore encourage users to exercise caution and verify critical outputs when deploying such systems in real\-world scenarios\. Furthermore, our research builds upon open\-source models and frameworks, including Qwen3, DeepSeek\-R1\-Distill, PyTorch, and Hugging Face\. We strictly follow their respective licenses and usage policies, and acknowledge their important contributions to the research community\.

## References

- C\. Chen, S\. Borgeaud, G\. Irving, J\. Lespiau, L\. Sifre, and J\. Jumper \(2023a\)Accelerating large language model decoding with speculative sampling\.arXiv preprint arXiv:2302\.01318\.Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px2.p1.1)\.
- L\. Chen, M\. Zaharia, and J\. Zou \(2023b\)Frugalgpt: how to use large language models while reducing cost and improving performance\.arXiv preprint arXiv:2305\.05176\.Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px2.p1.1)\.
- W\. Chen, H\. Wang, J\. Chen, Y\. Zhang, H\. Wang, S\. Li, X\. Zhou, and W\. Y\. Wang \(2020a\)TabFact : a large\-scale dataset for table\-based fact verification\.InInternational Conference on Learning Representations \(ICLR\),Addis Ababa, Ethiopia\.Cited by:[§A\.2](https://arxiv.org/html/2605.29319#A1.SS2.SSS0.Px1),[§5\.1](https://arxiv.org/html/2605.29319#S5.SS1.SSS0.Px2.p1.1)\.
- W\. Chen, H\. Zha, Z\. Chen, W\. Xiong, H\. Wang, and W\. Y\. Wang \(2020b\)HybridQA: a dataset of multi\-hop question answering over tabular and textual data\.InFindings of the Association for Computational Linguistics: EMNLP 2020,pp\. 1026–1036\.Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px1.p1.1)\.
- Z\. Chen, W\. Chen, C\. Smiley, S\. Shah, I\. Borova, D\. Langdon, R\. Moussa, M\. Beane, T\. Huang, B\. Routledge, and W\. Y\. Wang \(2021\)FinQA: a dataset of numerical reasoning over financial data\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,M\. Moens, X\. Huang, L\. Specia, and S\. W\. Yih \(Eds\.\),Online and Punta Cana, Dominican Republic,pp\. 3697–3711\.External Links:[Link](https://aclanthology.org/2021.emnlp-main.300/),[Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.300)Cited by:[§A\.2](https://arxiv.org/html/2605.29319#A1.SS2.SSS0.Px5),[§5\.1](https://arxiv.org/html/2605.29319#S5.SS1.SSS0.Px2.p1.1)\.
- Z\. Cheng, H\. Dong, Z\. Wang, R\. Jia, J\. Guo, Y\. Gao, S\. Han, J\. Lou, and D\. Zhang \(2022\)HiTab: a hierarchical table dataset for question answering and natural language generation\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\),Dublin, Ireland,pp\. 1094–1110\.External Links:[Link](https://aclanthology.org/2022.acl-long.78/),[Document](https://dx.doi.org/10.18653/v1/2022.acl-long.78)Cited by:[§A\.2](https://arxiv.org/html/2605.29319#A1.SS2.SSS0.Px4),[§5\.1](https://arxiv.org/html/2605.29319#S5.SS1.SSS0.Px2.p1.1)\.
- X\. Deng, H\. Sun, A\. Lees, Y\. Wu, and C\. Yu \(2022\)Turl: table understanding through representation learning\.ACM SIGMOD Record51\(1\),pp\. 33–40\.Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px1.p1.1)\.
- N\. Fernandez, B\. Kveton, R\. A\. Rossi, A\. Lan, and Z\. Wang \(2026\)RADAR: reasoning–ability and difficulty\-aware routing in language models\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=CB6Ds5T4ae)Cited by:[§1](https://arxiv.org/html/2605.29319#S1.p2.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi,et al\.\(2025\)Deepseek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.arXiv preprint arXiv:2501\.12948\.Cited by:[§A\.1](https://arxiv.org/html/2605.29319#A1.SS1.p1.2),[§1](https://arxiv.org/html/2605.29319#S1.p1.1),[§5\.1](https://arxiv.org/html/2605.29319#S5.SS1.SSS0.Px1.p1.1)\.
- J\. Herzig, P\. K\. Nowak, T\. Müller, F\. Piccinno, and J\. Eisenschlos \(2020\)TaPas: weakly supervised table parsing via pre\-training\.InProceedings of the 58th annual meeting of the association for computational linguistics,pp\. 4320–4333\.Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px1.p1.1)\.
- H\. Iida, D\. Thai, V\. Manjunatha, and M\. Iyyer \(2021\)Tabbie: pretrained representations of tabular data\.InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,pp\. 3446–3456\.Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px1.p1.1)\.
- N\. Jin, J\. Siebert, D\. Li, and Q\. Chen \(2022\)A survey on table question answering: recent advances\.InChina Conference on Knowledge Graph and Semantic Computing,pp\. 174–186\.Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px1.p1.1)\.
- R\. Jin, Z\. Xin, X\. Xie, Z\. Li, G\. Qi, Y\. Chen, X\. Dai, T\. Wu, and G\. Haffari \(2025\)Table\-r1: self\-supervised and reinforcement learning for program\-based table reasoning in small language models\.arXiv preprint arXiv:2506\.06137\.Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px1.p1.1)\.
- S\. Lee, D\. Kim, H\. Koh, N\. Yang, and K\. Jung \(2025\)Confidence\-guided stepwise model routing for cost\-efficient reasoning\.arXiv preprint arXiv:2511\.06190\.Cited by:[§1](https://arxiv.org/html/2605.29319#S1.p2.1),[§2](https://arxiv.org/html/2605.29319#S2.SS0.SSS0.Px2.p1.11),[§4\.2](https://arxiv.org/html/2605.29319#S4.SS2.p1.1),[§5\.1](https://arxiv.org/html/2605.29319#S5.SS1.SSS0.Px3.p1.1),[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px2.p1.1)\.
- Y\. Leviathan, M\. Kalman, and Y\. Matias \(2023\)Fast inference from transformers via speculative decoding\.InInternational Conference on Machine Learning,pp\. 19274–19286\.Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px2.p1.1)\.
- L\. Li, C\. Ye, W\. Ye, Y\. Sun, Z\. Jiang, H\. Wang, J\. Tian, Y\. Zhang, N\. Wang, X\. Fu,et al\.\(2025\)Table as a modality for large language models\.arXiv preprint arXiv:2512\.00947\.Cited by:[§3](https://arxiv.org/html/2605.29319#S3.p7.1)\.
- B\. Liao, Y\. Xu, H\. Dong, J\. Li, C\. Monz, S\. Savarese, D\. Sahoo, and C\. Xiong \(2025\)Reward\-guided speculative decoding for efficient LLM reasoning\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=AVeskAAETB)Cited by:[§5\.1](https://arxiv.org/html/2605.29319#S5.SS1.SSS0.Px3.p1.1),[§5\.1](https://arxiv.org/html/2605.29319#S5.SS1.SSS0.Px4.p1.4),[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px2.p1.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\)Let’s verify step by step\.arXiv preprint arXiv:2305\.20050\.Cited by:[§3](https://arxiv.org/html/2605.29319#S3.p3.1)\.
- X\. Ma, G\. Wan, R\. Yu, G\. Fang, and X\. Wang \(2025\)Cot\-valve: length\-compressible chain\-of\-thought tuning\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 6025–6035\.Cited by:[§A\.3](https://arxiv.org/html/2605.29319#A1.SS3.p1.4),[§5\.1](https://arxiv.org/html/2605.29319#S5.SS1.SSS0.Px4.p1.4)\.
- B\. Newman, Y\. Lee, A\. Naik, P\. Siangliulue, R\. Fok, J\. Kim, D\. S\. Weld, J\. C\. Chang, and K\. Lo \(2024\)ArxivDIGESTables: synthesizing scientific literature into tables using language models\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 9612–9631\.Cited by:[§1](https://arxiv.org/html/2605.29319#S1.p1.1)\.
- P\. Notin, J\. M\. Hernández\-Lobato, and Y\. Gal \(2021\)Improving black\-box optimization in VAE latent space using decoder uncertainty\.InAdvances in Neural Information Processing Systems,A\. Beygelzimer, Y\. Dauphin, P\. Liang, and J\. W\. Vaughan \(Eds\.\),External Links:[Link](https://openreview.net/forum?id=F7LYy9FnK2x)Cited by:[§3](https://arxiv.org/html/2605.29319#S3.p8.1)\.
- R\. Pan, Y\. Dai, Z\. Zhang, G\. Oliaro, Z\. Jia, and R\. Netravali \(2025\)SpecReason: fast and accurate inference\-time compute via speculative reasoning\.InNeurIPS 2025 Workshop on Efficient Reasoning,External Links:[Link](https://openreview.net/forum?id=UrZgP5DD70)Cited by:[§1](https://arxiv.org/html/2605.29319#S1.p2.1),[§2](https://arxiv.org/html/2605.29319#S2.SS0.SSS0.Px1.p1.6),[§2](https://arxiv.org/html/2605.29319#S2.SS0.SSS0.Px2.p1.11),[§3](https://arxiv.org/html/2605.29319#S3.p3.1),[§5\.1](https://arxiv.org/html/2605.29319#S5.SS1.SSS0.Px3.p1.1),[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px2.p1.1)\.
- A\. Parikh, X\. Wang, S\. Gehrmann, M\. Faruqui, B\. Dhingra, D\. Yang, and D\. Das \(2020\)ToTTo: a controlled table\-to\-text generation dataset\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),pp\. 1173–1186\.Cited by:[§1](https://arxiv.org/html/2605.29319#S1.p1.1),[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px1.p1.1)\.
- P\. Pasupat and P\. Liang \(2015\)Compositional semantic parsing on semi\-structured tables\.arXiv preprint arXiv:1508\.00305\.Cited by:[§A\.2](https://arxiv.org/html/2605.29319#A1.SS2.SSS0.Px2),[§3](https://arxiv.org/html/2605.29319#S3.p3.1),[§5\.1](https://arxiv.org/html/2605.29319#S5.SS1.SSS0.Px2.p1.1)\.
- J\. Pearl \(2014\)Probabilistic reasoning in intelligent systems: networks of plausible inference\.Elsevier\.Cited by:[§4\.3](https://arxiv.org/html/2605.29319#S4.SS3.p1.8)\.
- R\. Pfister and H\. Jud \(2025\)Understanding and benchmarking artificial intelligence: openai’s o3 is not agi\.arXiv preprint arXiv:2501\.07458\.Cited by:[§1](https://arxiv.org/html/2605.29319#S1.p1.1)\.
- N\. Sardana, J\. Portes, S\. Doubov, and J\. Frankle \(2024\)Beyond chinchilla\-optimal: accounting for inference in language model scaling laws\.InInternational Conference on Machine Learning,pp\. 43445–43460\.Cited by:[§5\.1](https://arxiv.org/html/2605.29319#S5.SS1.SSS0.Px4.p1.4)\.
- C\. E\. Shannon \(1948\)A mathematical theory of communication\.The Bell system technical journal27\(3\),pp\. 379–423\.Cited by:[§4\.2](https://arxiv.org/html/2605.29319#S4.SS2.p1.1)\.
- J\. Shi, Y\. Zhu, Z\. Shi, D\. Zhao, Q\. Li, and Y\. Jiang \(2025\)SpecCoT: accelerating chain\-of\-thought reasoning through speculative exploration\.InFindings of the Association for Computational Linguistics: EMNLP 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 24405–24415\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.1326/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.1326),ISBN 979\-8\-89176\-335\-7Cited by:[§1](https://arxiv.org/html/2605.29319#S1.p2.1),[§2](https://arxiv.org/html/2605.29319#S2.SS0.SSS0.Px2.p1.11),[§3](https://arxiv.org/html/2605.29319#S3.p3.1),[§5\.1](https://arxiv.org/html/2605.29319#S5.SS1.SSS0.Px3.p1.1),[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px2.p1.1)\.
- A\. Su, A\. Wang, C\. Ye, C\. Zhou, G\. Zhang, G\. Chen, G\. Zhu, H\. Wang, H\. Xu, H\. Chen,et al\.\(2024\)Tablegpt2: a large multimodal model with tabular data integration\.arXiv preprint arXiv:2411\.02059\.Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px1.p1.1)\.
- Y\. Sui, J\. Zou, M\. Zhou, X\. He, L\. Du, S\. Han, and D\. Zhang \(2024\)Tap4llm: table provider on sampling, augmenting, and packing semi\-structured data for large language model reasoning\.InFindings of the Association for Computational Linguistics: EMNLP 2024,pp\. 10306–10323\.Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px1.p1.1)\.
- L\. Wang, M\. Zheng, H\. Tang, Z\. Lin, Y\. Cao, J\. Wang, X\. Cai, and W\. Wang \(2025a\)NeedleInATable: exploring long\-context capability of large language models towards long\-structured tables\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=z5vZDI2r6J)Cited by:[§3](https://arxiv.org/html/2605.29319#S3.p7.1)\.
- Q\. Wang, R\. Ding, Y\. Zeng, Z\. Chen, L\. Chen, S\. Wang, P\. Xie, F\. Huang, and F\. Zhao \(2025b\)Vrag\-rl: empower vision\-perception\-based rag for visually rich information understanding via iterative reasoning with reinforcement learning\.arXiv preprint arXiv:2505\.22019\.Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px2.p1.1)\.
- X\. Wang, Y\. Liu, W\. Cheng, X\. Zhao, Z\. Chen, W\. Yu, Y\. Fu, and H\. Chen \(2025c\)Mixllm: dynamic routing in mixed large language models\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 10912–10922\.Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px2.p1.1)\.
- X\. Wang and D\. Zhou \(2024\)Chain\-of\-thought reasoning without prompting\.Advances in Neural Information Processing Systems37,pp\. 66383–66409\.Cited by:[§4\.2](https://arxiv.org/html/2605.29319#S4.SS2.p1.1)\.
- Z\. Wang, H\. Zhang, C\. Li, J\. M\. Eisenschlos, V\. Perot, Z\. Wang, L\. Miculicich, Y\. Fujii, J\. Shang, C\. Lee, and T\. Pfister \(2024\)Chain\-of\-table: evolving tables in the reasoning chain for table understanding\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=4L0xnS4GQM)Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px1.p1.1)\.
- X\. Wu, J\. Yang, L\. Chai, G\. Zhang, J\. Liu, X\. Du, D\. Liang, D\. Shu, X\. Cheng, T\. Sun,et al\.\(2025a\)Tablebench: a comprehensive and complex benchmark for table question answering\.InProceedings of the AAAI Conference on Artificial Intelligence,pp\. 25497–25506\.Cited by:[§A\.2](https://arxiv.org/html/2605.29319#A1.SS2.SSS0.Px3),[§3](https://arxiv.org/html/2605.29319#S3.p3.1),[§5\.1](https://arxiv.org/html/2605.29319#S5.SS1.SSS0.Px2.p1.1)\.
- Z\. Wu, J\. Yang, J\. Liu, X\. Wu, C\. Pan, J\. Zhang, Y\. Zhao, S\. Song, Y\. Li, and Z\. Li \(2025b\)Table\-r1: region\-based reinforcement learning for table understanding\.arXiv preprint arXiv:2505\.12415\.Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px1.p1.1)\.
- Y\. Xie, K\. Kawaguchi, Y\. Zhao, X\. Zhao, M\. Kan, J\. He, and Q\. Xie \(2023\)Self\-evaluation guided beam search for reasoning\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=Bw82hwg5Q3)Cited by:[§4\.2](https://arxiv.org/html/2605.29319#S4.SS2.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025a\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§A\.1](https://arxiv.org/html/2605.29319#A1.SS1.p1.2),[§3](https://arxiv.org/html/2605.29319#S3.p3.1),[§5\.1](https://arxiv.org/html/2605.29319#S5.SS1.SSS0.Px1.p1.1)\.
- Z\. Yang, L\. Chen, A\. Cohan, and Y\. Zhao \(2025b\)Table\-r1: inference\-time scaling for table reasoning tasks\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 20616–20635\.Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px1.p1.1)\.
- S\. Ye, Y\. Guo, D\. Jin, Y\. Shen, Y\. Hou, S\. Chen, J\. Yang, and X\. Jiang \(2025\)When tableqa meets noise: a dual denoising framework for complex questions and large\-scale tables\.arXiv preprint arXiv:2509\.17680\.Cited by:[§5\.5](https://arxiv.org/html/2605.29319#S5.SS5.SSS0.Px1.p1.1)\.
- P\. Yin, G\. Neubig, W\. Yih, and S\. Riedel \(2020\)TaBERT: pretraining for joint understanding of textual and tabular data\.InProceedings of the 58th annual meeting of the association for computational linguistics,pp\. 8413–8426\.Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px1.p1.1)\.
- W\. Zeng, X\. Zhang, Y\. Shi, C\. Hu, Y\. Chen, B\. Shen, and X\. Gu \(2026\)Glimprouter: efficient collaborative inference by glimpsing one token of thoughts\.arXiv preprint arXiv:2601\.05110\.Cited by:[§1](https://arxiv.org/html/2605.29319#S1.p1.1),[§2](https://arxiv.org/html/2605.29319#S2.SS0.SSS0.Px1.p1.6),[§2](https://arxiv.org/html/2605.29319#S2.SS0.SSS0.Px2.p1.11),[§3](https://arxiv.org/html/2605.29319#S3.p3.1),[§3](https://arxiv.org/html/2605.29319#S3.p4.1),[§5\.1](https://arxiv.org/html/2605.29319#S5.SS1.SSS0.Px3.p1.1),[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px2.p1.1)\.
- X\. Zhang, D\. Wang, L\. Dou, Q\. Zhu, and W\. Che \(2025\)A survey of table reasoning with large language models\.Frontiers of Computer Science19\(9\),pp\. 199348\.Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px1.p1.1)\.
- B\. Zhao, B\. Kapusuzoglu, K\. Balasubramaniam, S\. Sahu, S\. Chakraborty, and G\. I\. Winata \(2025\)Optimizing reasoning efficiency through prompt difficulty prediction\.arXiv preprint arXiv:2511\.03808\.Cited by:[§6](https://arxiv.org/html/2605.29319#S6.SS0.SSS0.Px2.p1.1)\.
- Y\. Zhao, Y\. Long, H\. Liu, R\. Kamoi, L\. Nan, L\. Chen, Y\. Liu, X\. Tang, R\. Zhang, and A\. Cohan \(2024\)DocMath\-eval: evaluating math reasoning capabilities of llms in understanding long and specialized documents\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 16103–16120\.Cited by:[§1](https://arxiv.org/html/2605.29319#S1.p1.1)\.
- J\. Zou, S\. Roy, V\. K\. Verma, Z\. Wang, D\. Wipf, P\. Lu, S\. Negi, J\. Zou, and J\. He \(2026\)TaTToo: tool\-grounded thinking PRM for test\-time scaling in tabular reasoning\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=zc1ezBrr5m)Cited by:[§A\.4](https://arxiv.org/html/2605.29319#A1.SS4.p1.1),[§3](https://arxiv.org/html/2605.29319#S3.p4.1),[§3](https://arxiv.org/html/2605.29319#S3.p7.1)\.

## Appendix A Additional Experimental Setups

### A\.1 Model Configurations

In our experiments, we use Qwen3\-1\.7B \(Yang et al\., [2025a](https://arxiv.org/html/2605.29319#bib.bib32)\) as the SRM\. For the LRM, we consider two representative settings: Qwen3\-14B \(Yang et al\., [2025a](https://arxiv.org/html/2605.29319#bib.bib32)\) for same\-family collaboration and DeepSeek\-R1\-Distill\-Qwen\-14B \(Guo et al\., [2025](https://arxiv.org/html/2605.29319#bib.bib10)\) for cross\-family collaboration\. This design allows us to evaluate whether EcoTab remains effective under both homogeneous and heterogeneous model pairs\. For EcoTab and all compared baselines, we use the same decoding configuration for fair comparison\. Specifically, we set the temperature to 0\.7, the maximum generation length to 16,384 tokens, and use top\-$p$ sampling with $p=0.95$\.

### A\.2 Dataset Details

#### TabFact \(Chen et al\., [2020a](https://arxiv.org/html/2605.29319#bib.bib37)\)\.

TabFact is a large\-scale benchmark for table\-based fact verification rather than standard table question answering\. It consists of approximately 16K Wikipedia tables paired with 118K human\-annotated natural language statements, where each statement is labeled as either ENTAILED or REFUTED with respect to the corresponding table\. The dataset is challenging because it requires not only semantic understanding of natural language statements, but also symbolic reasoning over semi\-structured tables, such as comparison, counting, and aggregation\.

#### WikiTableQuestions \(WikiTQ\) \(Pasupat and Liang, [2015](https://arxiv.org/html/2605.29319#bib.bib36)\)\.

WikiTQ is a benchmark for answering complex natural language questions over semi\-structured HTML tables\. It contains 22,033 question\-answer pairs over 2,108 Wikipedia tables\. A key characteristic of WikiTQ is that the training and test tables are disjoint, which requires models to generalize to unseen table schemas\. The questions often involve compositional reasoning, including comparison, superlatives, aggregation, and arithmetic operations\. The tables are semi\-structured and non\-normalized, and many cells contain multi\-part values that must be interpreted appropriately during reasoning\.

#### TableBench \(Wu et al\., [2025a](https://arxiv.org/html/2605.29319#bib.bib33)\)\.

TableBench is a comprehensive and challenging benchmark designed to evaluate complex table question answering in more realistic scenarios\. It covers 18 fine\-grained subcategories under four major categories, namely fact checking, numerical reasoning, data analysis, and visualization, with a total of 886 benchmark instances\. The benchmark is built from 3,681 unique tables spanning diverse domains, with an average of 16\.71 rows and 6\.68 columns per table\. In addition, 65\.74% of table cells are numerical, and each instance requires 6\.26 reasoning steps on average, making TableBench substantially more difficult than earlier TableQA benchmarks\.

#### HiTab \(Cheng et al\., [2022](https://arxiv.org/html/2605.29319#bib.bib38)\)\.

HiTab is a benchmark for question answering and natural language generation over hierarchical tables\. Unlike prior datasets that mainly focus on flat tables, HiTab emphasizes hierarchical indexing and implicit semantic and numerical relations induced by table structure\. It is a cross\-domain dataset constructed from statistical reports and Wikipedia pages, and nearly all tables exhibit hierarchical organization\. The dataset contains 10,686 QA pairs and descriptive sentences over 3,597 tables, together with fine\-grained annotations of entity and quantity alignment, which make it suitable for studying complex reasoning over hierarchical tabular data\.

#### FinQA \(Chen et al\., [2021](https://arxiv.org/html/2605.29319#bib.bib39)\)\.

FinQA is a financial\-domain dataset for complex numerical reasoning over heterogeneous evidence\. It contains 8,281 question\-answer pairs annotated by finance professionals, along with gold reasoning programs for explainable evaluation\. The dataset is constructed from earnings reports of S&P 500 companies and requires models to integrate information from both tables and accompanying unstructured text\. Compared with general\-domain TableQA benchmarks, FinQA places greater emphasis on multi\-step numerical reasoning and domain\-specific financial knowledge\.

#### Validation set in Sec\. [4\.3](https://arxiv.org/html/2605.29319#S4.SS3)\.

The held\-out validation set in Sec\. [4\.3](https://arxiv.org/html/2605.29319#S4.SS3) is used only for fitting the offline risk mappings and is strictly separated from the final evaluation set\. For datasets with an official training split, we construct the validation set solely from the training data\. For datasets without a predefined training split, we randomly sample 10% of the original test set as a pseudo\-validation split, and use the remaining 90% for final testing\. In all cases, the validation data are used only for calibration and are never included in the final reported test results\.
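The 10%/90% split for datasets without a training split can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `split_pseudo_validation`, the fixed seed, and the rounding rule are assumptions made for the example.

```python
import random


def split_pseudo_validation(examples, val_fraction=0.10, seed=0):
    """Randomly split a test set into disjoint (validation, test) lists.

    The seed and the rounding of the validation size are illustrative
    assumptions; the paper only states a random 10% / 90% split.
    """
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(examples) * val_fraction))
    val_idx = set(indices[:n_val])
    # Preserve the original ordering inside each split.
    validation = [ex for i, ex in enumerate(examples) if i in val_idx]
    test = [ex for i, ex in enumerate(examples) if i not in val_idx]
    return validation, test
```

Because every index lands in exactly one of the two lists, the calibration examples cannot leak into the reported test results.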

### A\.3 Implementation Details

For the accuracy–FLOPs evaluations in Sec\. [3](https://arxiv.org/html/2605.29319#S3) and Sec\. [5](https://arxiv.org/html/2605.29319#S5), we follow a unified threshold\-sweeping protocol for EcoTab and all compared baselines\. Specifically, for each method, we perform a grid search over the routing threshold $\tau$ with a step size of 0\.05, and compute the corresponding accuracy and average FLOPs under different values of $\tau$\. This process produces the full accuracy–FLOPs trade\-off curve for each method\. For the main results in Table [1](https://arxiv.org/html/2605.29319#S5.T1), we report two representative operating points from the trade\-off curve\. First, we use the accuracy achieved at 60% of the LRM\-only FLOPs as the reported Acc\. Second, we use the FLOPs required to reach 98% of the LRM\-only accuracy as the reported FLOPs\. These two metrics respectively reflect the model quality under a fixed computation budget and the computation required to approach near\-LRM performance\. In addition, we report Accuracy\-per\-FLOPs \(A/F\) as an overall indicator of the effectiveness–efficiency trade\-off \(Ma et al\., [2025](https://arxiv.org/html/2605.29319#bib.bib42)\)\. Following our main experimental setup, FLOPs are estimated using the standard Transformer approximation of $2N$ per generated token for a model with $N$ parameters\. All reported results are averaged over three independent runs\.
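The protocol above can be sketched in a few lines. All function names here are hypothetical, the threshold range $[0, 1]$ is assumed because the routing score is a probability, and the rule for reading an operating point off the curve (best swept point that satisfies the constraint, with no interpolation) is an assumption, since the text does not specify how points between grid values are handled.

```python
def generation_flops(num_params, num_tokens):
    """Approximate generation cost: about 2 * N FLOPs per generated token."""
    return 2 * num_params * num_tokens


def sweep_thresholds(step=0.05):
    """Grid of routing thresholds; the range [0, 1] is an assumption."""
    n = int(round(1 / step))
    return [round(i * step, 2) for i in range(n + 1)]


def acc_at_budget(curve, lrm_flops, budget_ratio=0.60):
    """Best accuracy among swept points whose FLOPs fit within the budget.

    `curve` is a list of (avg_flops, accuracy) pairs, one per threshold.
    """
    feasible = [acc for flops, acc in curve if flops <= budget_ratio * lrm_flops]
    return max(feasible) if feasible else None


def flops_at_target(curve, lrm_acc, target_ratio=0.98):
    """Smallest FLOPs among swept points that reach the accuracy target."""
    feasible = [flops for flops, acc in curve if acc >= target_ratio * lrm_acc]
    return min(feasible) if feasible else None
```

With a curve of `(avg_flops, accuracy)` pairs collected over `sweep_thresholds()`, `acc_at_budget` corresponds to the reported Acc and `flops_at_target` to the reported FLOPs.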

### A\.4 Four Error Types for Failed Steps

Following TATTOO \(Zou et al\., [2026](https://arxiv.org/html/2605.29319#bib.bib24)\), we manually inspect the failed trajectories and categorize each failed step into one of four error types\.

Table Retrieval Step\. This type includes row or column mis\-selection, unit mismatch, and partial aggregation errors\. These errors account for 47\.7% of all failed steps, indicating that a substantial portion of failures arise from difficulty in correctly locating and extracting the relevant table region\.

Table Operation Step\. This type covers miscalculation, grouping mistakes, double counting, and misinterpretation of table semantics\. It represents 34\.3% of all failed steps, suggesting that even after the relevant contents are retrieved, reasoning over structured tabular information remains challenging\.

Inner\-Thinking Step\. This type refers to logical mistakes or self\-contradictory reasoning that are not directly caused by table grounding\. Such errors account for 12\.0% of all failed steps, indicating that LRMs are relatively more reliable on pure logical chains than on table\-centric operations\.

Others\. This category includes failures caused by context omission, incomplete responses, or improper output formatting\.

To provide a more concrete understanding of these error types, Table [3](https://arxiv.org/html/2605.29319#A2.T3) presents representative failure cases from three major categories: Table Retrieval, Table Operation, and Inner\-Thinking\. For each case, we show the first erroneous reasoning step in the trajectory and briefly explain how this mistake propagates to the final incorrect answer\. These examples illustrate that failures in table reasoning may arise from different stages of the reasoning process, including incorrect table grounding, faulty operations over retrieved contents, and purely logical mistakes\.

## Appendix B Table Trie Construction Details

To separate table tokens from text tokens in each reasoning step, EcoTab builds a word\-level Table Trie from the input table\. The trie stores normalized textual entries extracted from the table, including column headers and cell values\. During inference, each reasoning step is normalized in the same way and scanned from left to right with longest\-prefix matching\. The matched spans are then mapped back to token positions to obtain the table\-token mask used in Eq\. \(4\)\. This implementation follows the procedure described in Sec\.[4\.1](https://arxiv.org/html/2605.29319#S4.SS1)\.

### B\.1 Normalization Rules

We apply lightweight normalization to both table contents and reasoning steps before trie construction and matching\. The goal is to improve robustness to surface\-form variation while keeping the procedure simple and efficient\.

Lowercasing and whitespace cleanup\. All text is converted to lowercase\. Consecutive spaces, tabs, and line breaks are collapsed into a single space, and leading or trailing whitespace is removed\.

Punctuation normalization\. Common punctuation variants are standardized into a unified form\. For example, different dashes and quotation marks are mapped to their canonical ASCII forms when possible\. Surrounding punctuation that does not affect semantic identity is ignored during matching\.

Number normalization\. We normalize common number formats to reduce mismatches caused by formatting differences\. For example, “1,200” and “1200” are treated as the same value\. Decimal numbers and percentages are preserved in normalized form when they carry semantic meaning\.

Cell and header insertion\. We insert both column headers and cell values into the trie\. Each entry is split at the word level after normalization\. Multi\-word entries such as “new york” or “gross domestic product” are inserted as complete paths rather than as isolated words\.

Token\-span consistency\. Normalization is applied only for matching\. After a span is matched in normalized text, we map it back to the original token positions in the reasoning step, so the final boolean mask is still defined over the original generated tokens\.
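The normalization rules above could look roughly like the sketch below. The exact character sets and regular expressions are assumptions chosen for illustration (the rules are described only in prose), and `normalize` and `normalize_words` are hypothetical names.

```python
import re

# Map common Unicode dashes and quotes to ASCII (illustrative subset).
_PUNCT_MAP = str.maketrans({
    "\u2013": "-", "\u2014": "-", "\u2212": "-",
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})
# A comma between a digit and a group of exactly three digits: "1,200" -> "1200".
_THOUSANDS = re.compile(r"(?<=\d),(?=\d{3}\b)")
_LEADING = "\"'([{"
_TRAILING = "\"')]}.,;:!?"


def normalize(text):
    """Lowercase, unify punctuation, drop thousands separators, clean whitespace."""
    text = text.lower().translate(_PUNCT_MAP)
    text = _THOUSANDS.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def normalize_words(text):
    """Split normalized text into words, ignoring surrounding punctuation.

    Decimal points and percent signs inside a value are kept, e.g. "12.5%".
    """
    words = []
    for raw in normalize(text).split(" "):
        word = raw.lstrip(_LEADING).rstrip(_TRAILING)
        if word:
            words.append(word)
    return words
```

For example, `normalize("1,200")` and `normalize("1200")` produce the same string, so a cell value written either way matches the same trie path.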

Table 3: Representative error cases for major failed\-step categories\.

### B\.2 Trie Matching Algorithm

Given a reasoning step $s_i = (t_{i,1}, t_{i,2}, \dots, t_{i,k_i})$, we first normalize the step text using the same rules as above\. We then scan the step from left to right and perform longest\-prefix matching over the trie\. At each position, we attempt to extend the current span word by word along the trie\. If multiple matches are possible, we keep the longest valid match\. For example, if both “new” and “new york” exist in the trie, the matcher prefers “new york” whenever the longer span is observed in the step\. Once a match is confirmed, all tokens covered by that span are marked as table\-related, and the scan continues from the end of the matched span\. If no match is found, the scan advances by one token\. Formally, this process returns a boolean mask $\mathbf{m}^{(i)} \in \{0,1\}^{k_i}$, where $\mathbf{m}^{(i)}_{j} = 1$ indicates that token $t_{i,j}$ belongs to a matched table\-related span\. Based on this mask, the step tokens are partitioned into the table\-token set $V_{\text{tab}}^{(i)}$ and the text\-token set $V_{\text{text}}^{(i)}$, which are then used to compute $\Phi_{\text{tab}}^{(i)}$ and $\Phi_{\text{text}}^{(i)}$ in Eq\. \(4\)\. Although this matching is surface\-form based, lightweight normalization already covers many common variations in numbers and punctuation, and we find it sufficient in practice\.
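A minimal sketch of the word-level trie and the left-to-right longest-prefix scan is given below. It makes two simplifying assumptions that are not from the paper: whitespace-separated words stand in for model tokens (so the mapping from matched spans back to subword token positions is omitted), and normalization is reduced to lowercasing. All names are hypothetical.

```python
class _Node:
    """One trie node: child words plus a flag marking the end of a table entry."""
    __slots__ = ("children", "is_entry")

    def __init__(self):
        self.children = {}
        self.is_entry = False


def to_words(text):
    # Stand-in for the full normalization rules: lowercase, split on whitespace.
    return text.lower().split()


def build_trie(entries):
    """Insert each header or cell value as one complete word path."""
    root = _Node()
    for entry in entries:
        words = to_words(entry)
        if not words:
            continue
        node = root
        for word in words:
            node = node.children.setdefault(word, _Node())
        node.is_entry = True
    return root


def longest_match(root, words, start):
    """Length in words of the longest table entry starting at `start` (0 if none)."""
    node, best = root, 0
    for pos in range(start, len(words)):
        node = node.children.get(words[pos])
        if node is None:
            break
        if node.is_entry:
            best = pos - start + 1
    return best


def table_mask(root, words):
    """Boolean mask over words: 1 for table-related words, 0 for text words."""
    mask = [0] * len(words)
    i = 0
    while i < len(words):
        n = longest_match(root, words, i)
        if n > 0:
            mask[i:i + n] = [1] * n
            i += n  # continue from the end of the matched span
        else:
            i += 1  # no match: advance by one word
    return mask
```

Usage: with entries `["New", "New York", "Population"]`, the step "the population of new york is larger than new jersey" marks "population", "new york", and the second "new" as table words, preferring the longer "new york" span where it occurs.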

## Appendix C Failure\-risk Mapping Implementation

### C\.1 Construction of Fitting Data

EcoTab performs next\-step model routing: the routing score computed from the current step $s_i$ is used to decide whether the next step $s_{i+1}$ should be generated by the SRM or the LRM\. To align the supervision target with this objective, we construct step\-level labels by identifying a critical routing boundary through counterfactual suffix replacement\.

For each dataset, we first build a held\-out validation split following Appendix [A\.2](https://arxiv.org/html/2605.29319#A1.SS2), and run both LRM\-only and SRM\-only inference to obtain full reasoning trajectories\. We retain only samples that are correct under LRM\-only but incorrect under SRM\-only, since these cases directly indicate that the stronger model is needed for part of the reasoning process\. Each trajectory is segmented into steps using the delimiter “\\n\\n”\. For every step $s_i$ in the retained LRM trajectory, we compute the table\-token uncertainty $\Phi_{\text{tab}}^{(i)}$ and the text\-token uncertainty $\Phi_{\text{text}}^{(i)}$\. For a retained trajectory with steps $s_1, \dots, s_T$, we progressively replace its suffix with SRM generations, starting from the last step\. For a suffix length $m \in \{1, \dots, M\}$ with $M = \min(T-1, 8)$, we keep the prefix $s_1, \dots, s_{T-m}$ fixed and let the SRM regenerate the remaining suffix\. For each $m$, we repeat generation $k = 5$ times and evaluate the final answer\. A suffix is regarded as causing a stable outcome flip if at least $4$ out of the $5$ runs become incorrect, corresponding to a flip ratio of at least $\gamma = 0.8$\.

We then define $m^{\star}$ as the smallest suffix length that causes a stable outcome flip, and stop the search immediately once it is found\. The corresponding critical routing boundary is $b = T - m^{\star}$\. The step $s_b$ is labeled as a positive sample because its routing score should have triggered the switch to the LRM for generating the next step\. All earlier steps $s_1, \dots, s_{b-1}$ are labeled as negative, and all later steps are discarded\. If no stable outcome flip is found within the scanned range, the sample is excluded from the fitting set\. This construction provides step\-level supervision that is better aligned with the next\-step routing objective than assigning labels directly from the final trajectory outcome\. Since the table\-token and text\-token mappings are fitted on the same retained step set, they share the same binary labels and the same total number of samples\. Table [4](https://arxiv.org/html/2605.29319#A3.T4) reports the total number of retained step\-level samples under this construction\.
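The suffix-replacement search and the resulting labels can be sketched as follows. The callables `srm_continue` (regenerates the suffix with the SRM from a fixed prefix and returns a final answer) and `is_correct` (checks that answer) are hypothetical placeholders for the actual generation and evaluation code, which this section does not specify.

```python
def find_boundary(steps, srm_continue, is_correct, max_suffix=8, k=5, gamma=0.8):
    """Return the 1-indexed critical boundary b = T - m*, or None if no stable flip.

    For each suffix length m = 1..M with M = min(T - 1, max_suffix), the prefix
    s_1..s_{T-m} is kept, the SRM regenerates the rest k times, and the search
    stops at the first m whose flip ratio is at least gamma.
    """
    T = len(steps)
    M = min(T - 1, max_suffix)
    for m in range(1, M + 1):
        prefix = steps[:T - m]
        wrong = sum(1 for _ in range(k) if not is_correct(srm_continue(prefix)))
        if wrong / k >= gamma:
            return T - m
    return None


def label_steps(steps, boundary):
    """Label s_1..s_{b-1} as 0 and s_b as 1; steps after s_b are discarded."""
    return [(steps[i], 1 if i == boundary - 1 else 0) for i in range(boundary)]
```

Trajectories for which `find_boundary` returns `None` are simply left out of the fitting set, matching the exclusion rule described above.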

Table 4: Number of retained step\-level samples used for fitting the offline risk mappings under the suffix\-replacement construction\.

### C\.2 Fitting Failure\-risk Mappings

For each dataset, we fit two independent sigmoid risk mappings, one for table\-token uncertainty and the other for text\-token uncertainty, using the retained step\-level samples constructed above\. For each signal $\Phi_{*}^{(i)}$, where $* \in \{\text{tab}, \text{text}\}$, the risk score is defined as

$$d_{*}^{(i)} = f_{*}\!\left(\Phi_{*}^{(i)}\right) = \sigma\!\left(a_{*}\Phi_{*}^{(i)} + b_{*}\right), \tag{7}$$

where $\sigma(\cdot)$ is the sigmoid function, and $a_{*}$ and $b_{*}$ are learned from the retained validation samples of that dataset\. Let

$$p = \sigma(a_{*}\Phi + b_{*}).$$

The fitting objective is the standard binary cross\-entropy:

$$\mathcal{L}_{*} = -\sum_{(\Phi, y) \in \mathcal{D}_{*}} \left[y \log p + (1 - y) \log(1 - p)\right]. \tag{8}$$

During inference, EcoTab maps $\Phi_{\text{tab}}^{(i)}$ and $\Phi_{\text{text}}^{(i)}$ into two risk scores and combines them using Noisy\-OR:

$$d_{\text{final}}^{(i)} = 1 - \left(1 - d_{\text{tab}}^{(i)}\right)\left(1 - d_{\text{text}}^{(i)}\right). \tag{9}$$

Finally, $d_{\text{final}}^{(i)}$ is compared with the threshold $\tau$ to determine whether the next step $s_{i+1}$ should be generated by the LRM or the SRM\.

### C\.3 Discussion

#### Suffix replacement provides cleaner supervision\.

Trajectory\-level labeling assigns the same positive label to all steps in an incorrect trajectory, even though many early steps may still be handled correctly by the SRM\. This introduces label noise and does not align well with next\-step routing\. In contrast, suffix replacement identifies a critical routing boundary and assigns the positive label only to the step immediately before the shortest suffix whose SRM replacement causes a stable outcome flip\. This yields cleaner step\-level supervision and requires no additional human annotation\.

#### Offline construction introduces no online overhead\.

Suffix replacement and risk\-mapping fitting are performed only once on the held\-out validation set\. They are fully offline and do not participate in online inference, so they introduce no additional token generation overhead during routing\. At test time, EcoTab only computes uncertainty signals and queries the fitted mappings\. As shown in the main paper, EcoTab remains lightweight, with routing overhead comparable to STEER and far lower than RSD, SpecCoT, and SpecReason\.

#### The fitted mapping generalizes across domains\.

The main paper shows that the learned risk mapping transfers well across domains\. In the out\-of\-domain setting, the mapping fitted on WikiTQ still outperforms Random and STEER when applied directly to TableBench on both Acc@60% LRM\-only FLOPs and FLOPs@98% LRM\-only Acc\. Although the in\-domain variant performs slightly better, the gap is small, indicating good cross\-domain generalization\.

## Appendix D Case Study

We further present representative case studies on two key table reasoning skills, namely Table Retrieval and Table Operation, to qualitatively compare SRM\-only, STEER, and EcoTab\. Figures [10](https://arxiv.org/html/2605.29319#A4.F10), [11](https://arxiv.org/html/2605.29319#A4.F11), and [12](https://arxiv.org/html/2605.29319#A4.F12) show Table Retrieval cases in which both SRM\-only and STEER fail, while EcoTab succeeds\. Figures [13](https://arxiv.org/html/2605.29319#A4.F13), [14](https://arxiv.org/html/2605.29319#A4.F14), and [15](https://arxiv.org/html/2605.29319#A4.F15) present analogous cases for Table Operation\. The results show that EcoTab can more accurately identify the critical reasoning step and route it to the LRM for handling, leading to correct final predictions\.

![Refer to caption](https://arxiv.org/html/2605.29319v1/x10.png)
Figure 10: Table Retrieval case with SRM\-only\.

![Refer to caption](https://arxiv.org/html/2605.29319v1/x11.png)
Figure 11: Table Retrieval case with STEER\.

![Refer to caption](https://arxiv.org/html/2605.29319v1/x12.png)
Figure 12: Table Retrieval case with EcoTab\.

![Refer to caption](https://arxiv.org/html/2605.29319v1/x13.png)
Figure 13: Table Operation case with SRM\-only\.

![Refer to caption](https://arxiv.org/html/2605.29319v1/x14.png)
Figure 14: Table Operation case with STEER\.

![Refer to caption](https://arxiv.org/html/2605.29319v1/x15.png)
Figure 15: Table Operation case with EcoTab\.
