BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models
Summary
This paper introduces BitCal-TTS, a runtime controller that improves accuracy and reduces premature halting in quantized reasoning models by calibrating confidence signals during test-time scaling.
# Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models
Code: https://github.com/Saibabu7770/bitcal-tts
Source: [https://arxiv.org/html/2605.05561](https://arxiv.org/html/2605.05561)
\(May 2026\)
###### Abstract
Post\-training quantization makes large reasoning models practical under tight memory and latency budgets, but it can distort the online signals that drive adaptive test\-time compute allocation\. Under a fixed cap on the number of newly generated tokens, miscalibrated confidence can lead to harmful early halting: the model may surface a plausible final line while the underlying reasoning is still wrong, or the controller may stop before the trace has stabilized\. We study this interaction for greedy 4\-bit inference and propose BitCal\-TTS, a lightweight runtime controller that combines \(i\) inexpensive online proxies for token\-level uncertainty and reasoning\-trace stability, \(ii\) a bit\-conditioned confidence rescaling that is conservative at low nominal precision, and \(iii\) a bit\-aware post\-marker confirmation horizon designed for GSM8K\-style structured outputs\. The method requires *no* fine\-tuning of the base model and integrates with standard Hugging Face 4\-bit inference using forward hooks for logits and last\-layer hidden states \[[1](https://arxiv.org/html/2605.05561#bib.bib1), [2](https://arxiv.org/html/2605.05561#bib.bib2)\]\.
On small evaluation shards of GSM8K \[[3](https://arxiv.org/html/2605.05561#bib.bib3)\] with Qwen2\.5 Instruct models \[[4](https://arxiv.org/html/2605.05561#bib.bib4)\], BitCal\-TTS improves exact\-match accuracy over a non–bit\-aware adaptive baseline at the 7B and 14B scales while preserving substantial token savings relative to fixed\-budget decoding\. At a token cap of $B{=}512$, on the evaluation shards we report \($N{=}54$ for 7B and $N{=}35$ for 14B; *not* the full GSM8K test set\), accuracy gains are $+3.7$ points \(7B\) and $+2.8$ points \(14B\), with the premature\-stop rate falling from $14.8\%$ to $11.1\%$ on 7B and from $17.1\%$ to $11.4\%$ on 14B\. We report Wilson 95% confidence intervals throughout and explicitly discuss the limited statistical power of the partial\-shard comparisons\. We release code and figure\-generation scripts to support full reproduction\.
Keywords: test\-time scaling, quantization, adaptive halting, GSM8K reasoning, uncertainty calibration, large language models\.
## 1 Introduction
Reasoning\-centric large language models \(LLMs\) often benefit from spending additional compute at inference time\. Chain\-of\-thought style deliberation \[[5](https://arxiv.org/html/2605.05561#bib.bib5), [6](https://arxiv.org/html/2605.05561#bib.bib6)\] and more recent sequential test\-time scaling strategies \[[7](https://arxiv.org/html/2605.05561#bib.bib7), [8](https://arxiv.org/html/2605.05561#bib.bib8)\] can substantially improve verifiable accuracy on math and logic tasks\. In production deployments, however, this compute is almost always *bounded*: a token cap $B$ controls latency and cost, and product surfaces frequently add early\-exit heuristics whenever an answer “looks complete”\.
This paper focuses on a setting that is increasingly common in deployment: a *causal*, instruction\-tuned LLM served under aggressive post\-training quantization \(e\.g\., 4\-bit weights via `bitsandbytes` \[[2](https://arxiv.org/html/2605.05561#bib.bib2), [9](https://arxiv.org/html/2605.05561#bib.bib9), [10](https://arxiv.org/html/2605.05561#bib.bib10)\]\) under a hard token budget $B$\. Quantization expands the set of models that fit on consumer GPUs, but it also alters the geometry of logits and hidden states\. Online halting cues such as token entropy and trace stability become miscalibrated relative to full precision: the policy may *appear* confident while the underlying reasoning remains unreliable, increasing the risk of stopping too soon \[[11](https://arxiv.org/html/2605.05561#bib.bib11), [12](https://arxiv.org/html/2605.05561#bib.bib12)\]\. The failure mode is twofold: reduced final accuracy *and* adaptive compute that is wasted halting on a spurious “final” segment\.
#### Research questions\.
We address three questions: \(Q1\) How does aggressive 4\-bit quantization affect the reliability of adaptive halting signals when the controller treats them as if the model were full precision? \(Q2\) Can a precision\-aware adjustment to confidence and to the post\-answer confirmation window recover a meaningful slice of the lost accuracy without training the base model? \(Q3\) How do these effects scale across reasoning model sizes \(3B / 7B / 14B\) on a representative math reasoning benchmark?
#### Contributions\.
- We formalize a budgeted, stepwise inference loop for quantized causal LMs with halting actions $\{\texttt{continue}, \texttt{stop}, \texttt{escalate}\}$ and contrast \(a\) fixed\-budget decoding, \(b\) adaptive decoding with a precision\-*agnostic* calibrator, and \(c\) the proposed BitCal\-TTS\.
- We introduce transparent, implementation\-aligned proxies for token entropy, reasoning\-trace stability, and last\-layer hidden\-state drift, and show how a bit\-width multiplicative scale makes stop decisions conservative at aggressive quantization\.
- We propose a GSM8K\-oriented *post\-marker* confirmation rule: once the standard \#\#\#\# answer delimiter appears, decoding switches to a bit\-conditioned tail budget before termination is permitted\. This avoids treating the delimiter as an immediate halt signal, which we find to be brittle under 4\-bit noise\.
- We report GSM8K results on Qwen2\.5\-3B/7B/14B Instruct under 4\-bit inference, including a multi\-budget sweep on the 7B model\. BitCal\-TTS recovers a meaningful slice of the adaptive–vs–fixed accuracy gap on 7B and 14B while preserving large token savings relative to always consuming the full budget\. We additionally document a regime \(Qwen2\.5\-3B at 4\-bit\) in which neither adaptive variant works, and analyze its proximate cause \(Section [7](https://arxiv.org/html/2605.05561#S7)\)\.
#### Paper organization\.
Section [2](https://arxiv.org/html/2605.05561#S2) formalizes adaptive compute under hard caps and reviews how quantization perturbs online halting signals\. Section [3](https://arxiv.org/html/2605.05561#S3) positions this work relative to test\-time scaling, adaptive decoding, verification, and quantization reliability\. Section [4](https://arxiv.org/html/2605.05561#S4) specifies the proposed controller\. Section [5](https://arxiv.org/html/2605.05561#S5) describes the experimental protocol\. Sections [6](https://arxiv.org/html/2605.05561#S6) and [7](https://arxiv.org/html/2605.05561#S7) present and analyze the empirical findings\. Section [8](https://arxiv.org/html/2605.05561#S8) discusses limitations and broader impact\. Section [9](https://arxiv.org/html/2605.05561#S9) concludes\.
## 2 Background and Motivation
#### Adaptive compute under a hard cap\.
Let $B\in\mathbb{N}$ denote the maximum number of *new* tokens that may be produced for a single prompt\. A *fixed* policy always requests $B$ tokens \(or stops at end\-of\-sequence\), which is safe but often inefficient when many problems admit shorter reasoning chains\. An *adaptive* policy interleaves short generation segments with cheap measurements and may terminate before the budget is exhausted, ideally without sacrificing quality\.
#### Quantization changes online signals\.
Post\-training quantization maps weights \(and sometimes activations\) to low\-bit containers while attempting to preserve downstream quality \[[9](https://arxiv.org/html/2605.05561#bib.bib9), [10](https://arxiv.org/html/2605.05561#bib.bib10), [13](https://arxiv.org/html/2605.05561#bib.bib13), [14](https://arxiv.org/html/2605.05561#bib.bib14)\]\. A separate question, central to this work, is how quantization affects *online* halting signals derived from logits and activations during autoregressive decoding\. If low\-bit inference inflates premature confidence relative to full precision, an adaptive controller will halt earlier than intended, trading away accuracy without realizing commensurate savings relative to a well\-tuned fixed budget\.
#### Structured final answers in GSM8K\.
GSM8K \[[3](https://arxiv.org/html/2605.05561#bib.bib3)\] adopts a standard extraction protocol in which the final numeric answer follows a delimiter token, \#\#\#\#\. The delimiter is convenient for parsing, but under quantization it can appear in locally fluent yet globally incorrect traces\. BitCal\-TTS therefore treats delimiter detection as a *phase change*: after the first occurrence, decoding continues for a precision\-conditioned horizon before termination is permitted\.
## 3 Related Work
#### Test\-time scaling and adaptive decoding\.
Inference\-time compute can be scaled by allowing longer chains of thought \[[5](https://arxiv.org/html/2605.05561#bib.bib5)\], by sampling and aggregating multiple candidates \[[6](https://arxiv.org/html/2605.05561#bib.bib6)\], or by structured search such as Tree\-of\-Thoughts and ReAct \[[15](https://arxiv.org/html/2605.05561#bib.bib15), [16](https://arxiv.org/html/2605.05561#bib.bib16)\]\. More recent work studies how to scale test\-time compute optimally and how small, simple test\-time scalers can rival much larger budgets \[[7](https://arxiv.org/html/2605.05561#bib.bib7), [8](https://arxiv.org/html/2605.05561#bib.bib8), [17](https://arxiv.org/html/2605.05561#bib.bib17)\]\. Closer to our regime, adaptive computation policies expand generation depth based on confidence or estimated difficulty \[[18](https://arxiv.org/html/2605.05561#bib.bib18), [19](https://arxiv.org/html/2605.05561#bib.bib19)\]\. Most published policies are described and tuned for full\-precision models; our experiments isolate the additional error introduced by aggressive quantization while keeping the controller skeleton fixed\.
#### Verification and process supervision\.
Beyond “generating longer”, math reasoning benefits from outcome\- and step\-level verification \[[20](https://arxiv.org/html/2605.05561#bib.bib20)\], including reasoning RL signals such as those used in DeepSeek\-R1 \[[21](https://arxiv.org/html/2605.05561#bib.bib21)\] and the OpenAI o1 family \[[22](https://arxiv.org/html/2605.05561#bib.bib22)\]\. BitCal\-TTS does *not* train a verifier; instead, it uses lightweight online proxies and a structured tail window as a *runtime* guardrail compatible with frozen quantized weights\.
#### Quantization and reliability\.
The quantization literature commonly reports perplexity or end\-task accuracy under different bit\-widths \[[9](https://arxiv.org/html/2605.05561#bib.bib9), [13](https://arxiv.org/html/2605.05561#bib.bib13), [14](https://arxiv.org/html/2605.05561#bib.bib14), [10](https://arxiv.org/html/2605.05561#bib.bib10)\]\. Complementary work studies whether language models “know what they know” and how to detect unreliable generations \[[11](https://arxiv.org/html/2605.05561#bib.bib11), [12](https://arxiv.org/html/2605.05561#bib.bib12)\]\. We connect these reliability concerns to *halting\-time* decisions under a hard token budget, which is the regime encountered in latency\-critical APIs and on consumer GPUs\.
#### Positioning\.
BitCal\-TTS is not a new quantization kernel; it is an inference\-time policy layer\. Unlike RL post\-training methods that reshape reasoning distributions \[[21](https://arxiv.org/html/2605.05561#bib.bib21), [22](https://arxiv.org/html/2605.05561#bib.bib22)\], we keep weights frozen and only modify \(i\) how many tokens are generated per example, and \(ii\) how delimiter\-triggered tails are handled at low precision\.
## 4 Method: BitCal\-TTS
### 4\.1 Overview
Figure [1](https://arxiv.org/html/2605.05561#S4.F1) summarizes the end\-to\-end control flow of BitCal\-TTS\. The frozen quantized causal LM, served through Hugging Face Transformers \[[1](https://arxiv.org/html/2605.05561#bib.bib1)\], generates chunks of $k$ tokens\. After each chunk the controller computes scalar online signals \(token entropy and two stability proxies\), maps them to a bit\-conditioned confidence value, and applies a finite\-state halting policy that consults a marker\-aware tail rule once the GSM8K answer delimiter \#\#\#\# has appeared\. The selected action either loops back into the LM for a further chunk or finalizes the output\.
Figure 1: End\-to\-end control flow of BitCal\-TTS\. Diagram blocks, in pipeline order: \(1\) Prompt \+ chat template \(chat\-formatted input $x$\); \(2\) Quantized causal LM \(Qwen2\.5\-Instruct, 4\-bit NF4 weights with BF16 compute, Transformers \+ bitsandbytes\); \(3\) Online signals per chunk of $k$ tokens: $H_t$ \(entropy\), $\tau^{\mathrm{tr}}_t$ \(trace stability\), $\tau^{\mathrm{hid}}_t$ \(hidden\-state stability\); \(4\) Bit\-conditioned calibrator: $c_t=\mathrm{clip}\bigl(c^{\mathrm{raw}}_t\cdot s(b),\,0,\,1\bigr)$ with $s(b)\in\{0.85,\,1.00,\,1.05\}$ for $b\le 4$, $4<b\le 8$, $b>8$; \(5\) Halting policy \+ marker\-aware tail: thresholds $\theta_H,\theta_c,\theta_E$; floor $m$; tail of $\Delta(b)$ tokens after \#\#\#\#; \(6\) Action $a_t\in\{\texttt{continue},\,\texttt{stop},\,\texttt{escalate}\}$; \(7\) Output: reasoning trace \+ parsed final answer\. Solid black arrows trace the per\-step pipeline: a chunk of $k$ tokens is decoded, online signals are computed, mapped to a bit\-conditioned confidence, and consumed by a halting policy with a marker\-aware tail\. The dashed feedback arrow on the right indicates that the `continue` action loops execution back into the language model; `stop` and `escalate` finalize the output\. The shaded blocks \(orange, green\) constitute the BitCal\-TTS sidecar around an unmodified quantized backbone\.
### 4\.2 Problem Setup
Let $x$ be a prompt and $M_b$ a causal language model served at nominal weight precision $b$ \(our experiments use $b{=}4$\)\. Let $B\in\mathbb{N}$ denote a hard cap on newly generated tokens\. Decoding proceeds in steps $t=1,2,\ldots$; at each step the engine generates a chunk of up to $k$ tokens \(default $k{=}16$\), records logits and optional hidden states, and appends the decoded text to the partial output $y_{\le t}$\. Let $T_t=|y_{\le t}|$ denote the cumulative count of generated tokens after step $t$\. We compare three controller variants:
- Fixed: always generate until the budget $B$ is exhausted or end\-of\-sequence \(EOS\) is returned\.
- Adaptive: apply the halting machinery of Section [4\.5](https://arxiv.org/html/2605.05561#S4.SS5), but feed the calibrator an *effective* precision of 16 bits, so that the confidence scale is optimistic relative to the true 4\-bit serving regime\.
- BitCal\-TTS: identical machinery, but the calibrator uses the true served bit width $b$, and the post\-marker tail uses $\Delta(b)$ \(Section [4\.5](https://arxiv.org/html/2605.05561#S4.SS5)\)\.
### 4\.3 Online Signals
Let $\ell_t\in\mathbb{R}^{|\mathcal{V}|}$ denote the final\-position logits at the end of step $t$ over the vocabulary $\mathcal{V}$, and let $p_t=\mathrm{softmax}(\ell_t)$\. We use Shannon entropy in nats:
$$H_t \;=\; -\sum_{v\in\mathcal{V}} p_t(v)\,\log p_t(v). \tag{1}$$
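The entropy in Eq\. \(1\) can be evaluated directly from raw logits without materializing the softmax separately\. The following is a minimal sketch, not the released implementation: the function name `token_entropy_nats` is hypothetical, and it uses the identity $H = \log Z - \sum_v p(v)\,(\ell_v - \max\ell)$ with $Z=\sum_v e^{\ell_v-\max\ell}$ for numerical stability\.

```python
import math

def token_entropy_nats(logits):
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)                               # shift so the largest exponent is 0
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    # H = log Z - sum_v p(v) * (logit_v - m), with p(v) = exps[v] / Z
    return math.log(z) - sum(e * (x - m) for e, x in zip(exps, logits)) / z

uniform = token_entropy_nats([0.0] * 8)           # uniform over 8 tokens
peaked = token_entropy_nats([20.0, 0.0, 0.0])     # nearly deterministic
print(uniform, peaked)
```

A uniform distribution over $n$ tokens attains the maximum $\log n$, while a sharply peaked distribution gives an entropy close to zero\.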
#### Reasoning\-trace stability\.
Let $(s_1,\ldots,s_t)$ be the textual chunks produced so far and $\tilde{s}_i$ the whitespace\-stripped chunk\. We define a lightweight stability score in $[0,1]$ as the fraction of consecutive pairs $(\tilde{s}_{i-1},\tilde{s}_i)$ with both lengths at least eight characters that satisfy $\tilde{s}_{i-1}=\tilde{s}_i$:
$$\tau^{\mathrm{tr}}_t \;=\; \frac{\bigl|\{\,i\le t : |\tilde{s}_{i-1}|\ge 8,\ |\tilde{s}_i|\ge 8,\ \tilde{s}_{i-1}=\tilde{s}_i\,\}\bigr|}{\bigl|\{\,i\le t : |\tilde{s}_{i-1}|\ge 8,\ |\tilde{s}_i|\ge 8\,\}\bigr|}. \tag{2}$$
If fewer than two eligible pairs exist, we set $\tau^{\mathrm{tr}}_t := 1$\. The proxy rewards literal repetition across recent chunks, which often correlates with the model having settled on a template or a final answer line\.
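As an illustration of Eq\. \(2\), the sketch below counts exact matches among eligible consecutive chunk pairs\. It is a hedged reading of the definition rather than the released code: the name `trace_stability` is hypothetical, and "whitespace\-stripped" is assumed here to mean removing leading and trailing whitespace only\.

```python
def trace_stability(chunks, min_len=8):
    """Fraction of eligible consecutive chunk pairs that are identical."""
    stripped = [c.strip() for c in chunks]        # assumption: strip both ends only
    eligible = [
        (prev, cur)
        for prev, cur in zip(stripped, stripped[1:])
        if len(prev) >= min_len and len(cur) >= min_len
    ]
    if len(eligible) < 2:                         # too little evidence: default to 1
        return 1.0
    return sum(prev == cur for prev, cur in eligible) / len(eligible)

chunks = ["The answer is 18", " The answer is 18 ",
          "so we add 3 and 4", "so we add 3 and 4"]
print(trace_stability(chunks))
```

In this example two of the three eligible pairs repeat verbatim, so the score is $2/3$\. Note that the default of 1 when evidence is scarce is optimistic by construction, which matters for short traces\.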
#### Hidden\-state stability\.
When the backend exposes the last\-layer hidden vector $h_t\in\mathbb{R}^{d}$ at the final position of step $t$, we accumulate normalized vectors $\hat{h}_i=h_i/(\lVert h_i\rVert_2+\varepsilon)$ and define
$$\tau^{\mathrm{hid}}_t \;=\; \frac{1}{t-1}\sum_{i=2}^{t}\hat{h}_{i-1}^{\top}\hat{h}_i, \tag{3}$$
using only steps where the hidden state is available\. If fewer than two hidden vectors exist, we set $\tau^{\mathrm{hid}}_t := 1$\.
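Eq\. \(3\) is the mean cosine similarity between consecutive normalized hidden vectors\. A minimal pure\-Python sketch follows; the name `hidden_stability` and the value of `eps` are assumptions, and a real implementation would more likely operate on tensors\.

```python
import math

def hidden_stability(hiddens, eps=1e-8):
    """Mean cosine similarity between consecutive hidden vectors."""
    if len(hiddens) < 2:                          # fewer than two vectors: default to 1
        return 1.0
    unit = []
    for h in hiddens:
        norm = math.sqrt(sum(x * x for x in h))
        unit.append([x / (norm + eps) for x in h])   # h_i / (||h_i||_2 + eps)
    sims = [sum(a * b for a, b in zip(u, v)) for u, v in zip(unit, unit[1:])]
    return sum(sims) / len(sims)                  # average over the t-1 pairs

# One aligned pair (similarity ~1) and one orthogonal pair (similarity 0).
print(hidden_stability([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]))
```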
### 4\.4 Bit\-Conditioned Confidence
We map entropy to a normalized uncertainty $u_t=\mathrm{clip}(H_t/H_{\max},\,0,\,1)$ with default $H_{\max}=10$ nats\. A raw confidence score combines entropy with the two stability proxies,
$$c^{\mathrm{raw}}_t \;=\; w_e(1-u_t) \;+\; w_{\mathrm{tr}}\,\tau^{\mathrm{tr}}_t \;+\; w_{\mathrm{hid}}\,\tau^{\mathrm{hid}}_t,\quad w_e,w_{\mathrm{tr}},w_{\mathrm{hid}}\ge 0,\ \ w_e+w_{\mathrm{tr}}+w_{\mathrm{hid}}=1. \tag{4}$$
Weights are renormalized to sum to one for numerical robustness\. A bit\-width scaling factor $s(b)$ then yields the bit\-calibrated confidence
$$c_t \;=\; \mathrm{clip}\bigl(c^{\mathrm{raw}}_t\cdot s(b),\ 0,\ 1\bigr),\qquad s(b) \;=\; \begin{cases}0.85, & b\le 4,\\ 1.00, & 4<b\le 8,\\ 1.05, & b>8.\end{cases} \tag{5}$$
An optional temperature $\gamma>0$ further sharpens or flattens confidence via $c_t\leftarrow\mathrm{clip}(c_t^{1/\gamma},0,1)$\. The adaptive ablation sets the calibrator’s effective bit width to 16 \(so $s=1.05$\), whereas BitCal\-TTS uses the true served $b{=}4$ \(so $s=0.85$\), making it more conservative about declaring high confidence\.
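To make the effect of $s(b)$ concrete, the sketch below evaluates Eqs\. \(4\)–\(5\) for the same signals at two effective bit widths\. The function names are hypothetical, and the weights `(0.5, 0.25, 0.25)` are illustrative assumptions: the text above does not state default weights\.

```python
def bit_scale(bits):
    """Piecewise scale s(b) from Eq. (5)."""
    if bits <= 4:
        return 0.85
    if bits <= 8:
        return 1.00
    return 1.05

def clip01(x):
    return min(1.0, max(0.0, x))

def calibrated_confidence(entropy, tau_tr, tau_hid, bits,
                          weights=(0.5, 0.25, 0.25),   # assumed, not from the paper
                          h_max=10.0, gamma=1.0):
    w_e, w_tr, w_hid = weights
    total = w_e + w_tr + w_hid                    # renormalize weights to sum to one
    w_e, w_tr, w_hid = w_e / total, w_tr / total, w_hid / total
    u = clip01(entropy / h_max)                   # normalized uncertainty u_t
    c_raw = w_e * (1.0 - u) + w_tr * tau_tr + w_hid * tau_hid
    c = clip01(c_raw * bit_scale(bits))
    return clip01(c ** (1.0 / gamma))             # optional temperature

c_bitcal = calibrated_confidence(1.0, 1.0, 1.0, bits=4)     # true served precision
c_adaptive = calibrated_confidence(1.0, 1.0, 1.0, bits=16)  # adaptive ablation
print(c_bitcal, c_adaptive)
```

Under these assumed weights the raw confidence is $0.95$, which becomes roughly $0.81$ at 4 bits and roughly $1.00$ at the 16\-bit setting used by the adaptive ablation\.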
### 4\.5 Halting Policy and Marker\-Aware Tails
#### Pre\-marker phase\.
Before the first occurrence of the GSM8K delimiter \#\#\#\# in $y_{\le t}$, the controller applies a threshold policy with tunable constants\. Let $m$ be the minimum number of generated tokens before any halting is allowed \(default $m{=}128$\); let $\theta_H$, $\theta_c$, and $\theta_E$ denote entropy\-stop, confidence\-stop, and entropy\-escalate thresholds \(defaults $\theta_H{=}2.0$, $\theta_c{=}0.75$, $\theta_E{=}4.0$\); let $r_t$ be the remaining budget; and let $m_{\mathrm{buf}}$ be a minimum remaining\-budget buffer below which continuing is no longer worthwhile \(default $m_{\mathrm{buf}}{=}32$\)\. The action is selected by evaluating the cases below *in order* and returning at the first match \(i\.e\., a buffer\-driven `stop` fires before an entropy\-driven `escalate`\):
$$a_t \;=\; \begin{cases}\texttt{continue}, & T_t<m,\\ \texttt{stop}, & T_t\ge m\ \text{and}\ r_t<m_{\mathrm{buf}},\\ \texttt{escalate}, & T_t\ge m\ \text{and}\ H_t\ge\theta_E,\\ \texttt{stop}, & T_t\ge m\ \text{and}\ H_t\le\theta_H\ \text{and}\ c_t\ge\theta_c,\\ \texttt{continue}, & \text{otherwise}.\end{cases} \tag{6}$$
The `escalate` action is reserved as a deployment hook \(e\.g\., switching to full precision\); in the released harness it terminates the step loop equivalently to `stop`\.
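The ordered case analysis in Eq\. \(6\) maps naturally onto a chain of early returns\. The sketch below is illustrative only; the function name `pre_marker_action` and its argument names are hypothetical, with defaults taken from the values quoted above\.

```python
def pre_marker_action(tokens_generated, remaining, entropy, confidence,
                      floor=128, buffer=32,
                      theta_h=2.0, theta_c=0.75, theta_e=4.0):
    """Evaluate the cases of Eq. (6) in order; the first match wins."""
    if tokens_generated < floor:
        return "continue"                         # below the minimum-token floor
    if remaining < buffer:
        return "stop"                             # too little budget left to continue
    if entropy >= theta_e:
        return "escalate"                         # very uncertain: deployment hook
    if entropy <= theta_h and confidence >= theta_c:
        return "stop"                             # low entropy and high confidence
    return "continue"

# Buffer-driven stop takes precedence over escalate when both conditions hold.
print(pre_marker_action(496, remaining=16, entropy=5.0, confidence=0.1))
```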
#### Post\-marker phase\.
Let $T^{\star}$ denote the cumulative number of generated tokens at the first step where \#\#\#\# appears anywhere in $y_{\le t}$ \(unset until the delimiter is observed\)\. We define a bit\-conditioned confirmation horizon:
$$\Delta(b) \;=\; \begin{cases}32, & b\le 4,\\ 16, & 4<b\le 8,\\ 0, & b>8.\end{cases} \tag{7}$$
Once $T^{\star}$ is defined and the floor $T_t\ge m$ is satisfied, the entropy policy is *bypassed* until the tail constraint is met: if $T_t-T^{\star}<\Delta(b_{\mathrm{eff}})$ the controller forces `continue`; otherwise it forces `stop`\. The effective precision $b_{\mathrm{eff}}$ is $16$ for the adaptive ablation \(so $\Delta=0$ and the delimiter can trigger immediate termination once $m$ is met\) and equals the true $b$ for BitCal\-TTS \(so $\Delta=32$ at 4\-bit, requiring additional confirmation tokens after the first delimiter sighting\)\. This implements a *precision\-dependent* tail that reduces the number of false stops driven by brittle low\-bit formatting\.
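The post\-marker rule reduces to a comparison between the tokens generated since the first delimiter sighting and $\Delta(b_{\mathrm{eff}})$\. A small sketch, with hypothetical function names, shows how the same trace is treated under the two effective precisions\.

```python
def tail_horizon(bits):
    """Confirmation horizon Delta(b) from Eq. (7)."""
    if bits <= 4:
        return 32
    if bits <= 8:
        return 16
    return 0

def post_marker_action(tokens_generated, marker_at, effective_bits, floor=128):
    """Action once the delimiter has been seen at cumulative count marker_at."""
    if tokens_generated < floor:
        return "continue"                         # floor still applies
    if tokens_generated - marker_at < tail_horizon(effective_bits):
        return "continue"                         # still inside the confirmation tail
    return "stop"                                 # tail satisfied

# Delimiter first seen at token 144; compare the two effective precisions.
print(post_marker_action(144, 144, effective_bits=16))  # adaptive ablation
print(post_marker_action(144, 144, effective_bits=4))   # BitCal-TTS at 4-bit
```

With the default chunk size $k{=}16$, the 4\-bit horizon of 32 tokens corresponds to two further chunks after the delimiter is first observed\.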
Algorithm 1: Budgeted inference with BitCal\-TTS \(implementation\-aligned sketch\)\. For the adaptive ablation, set $b_{\mathrm{eff}}{=}16$ when evaluating $\Delta(\cdot)$ so that $\Delta{=}0$; for BitCal\-TTS use the served $b$ \(e\.g\., $b{=}4$\)\.
Input: prompt $x$; cap $B$; chunk size $k$; thresholds $\theta_H,\theta_c,\theta_E$; floor $m$; buffer $m_{\mathrm{buf}}$; served bit width $b$\.
1: $y\leftarrow$ empty string; $T^{\star}\leftarrow$ unset; $T\leftarrow 0$
2: while $T<B$ and not finished do
3: generate next chunk of up to $k$ tokens; append to $y$; update $T$
4: if \#\#\#\# appears in $y$ and $T^{\star}$ is unset then
5: $T^{\star}\leftarrow T$ \{first delimiter sighting\}
6: end if
7: compute $H_t,\,\tau^{\mathrm{tr}}_t,\,\tau^{\mathrm{hid}}_t$; set $c_t$ via Eqs\. \([4](https://arxiv.org/html/2605.05561#S4.E4)\)–\([5](https://arxiv.org/html/2605.05561#S4.E5)\)
8: if $T<m$ then
9: continue \{below floor\}
10: else if $T^{\star}$ is set and $T-T^{\star}<\Delta(b_{\mathrm{eff}})$ then
11: continue \{in confirmation tail\}
12: else if $T^{\star}$ is set and $T-T^{\star}\ge\Delta(b_{\mathrm{eff}})$ then
13: stop \{tail satisfied\}
14: else
15: apply entropy policy of Eq\. \([6](https://arxiv.org/html/2605.05561#S4.E6)\)
16: end if
17: end while
18: return parsed answer extracted from $y$
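Algorithm 1 can be exercised end to end without a model by replacing the LM with a scripted stub\. The sketch below is a simplified illustration under stated assumptions, not the released harness: `run_controller` and `scripted_model` are hypothetical names, the stub supplies entropy and confidence directly instead of computing them from logits, `escalate` is treated like `stop` \(as the text says the released harness does\), and the remaining budget is assumed to be $B-T$\.

```python
def run_controller(generate_chunk, budget=512, chunk=16, floor=128, buffer=32,
                   theta_h=2.0, theta_c=0.75, theta_e=4.0, effective_bits=4):
    """Chunked decoding loop; generate_chunk(max_tokens) is a hypothetical
    callable returning (text, n_tokens, entropy, confidence, finished)."""
    delta = 32 if effective_bits <= 4 else 16 if effective_bits <= 8 else 0
    text, used, marker_at = "", 0, None
    while used < budget:
        piece, n, entropy, conf, finished = generate_chunk(min(chunk, budget - used))
        text += piece
        used += n
        if marker_at is None and "####" in text:
            marker_at = used                      # first delimiter sighting
        if finished or n == 0:
            break                                 # EOS, or no progress was made
        if used < floor:
            continue                              # below floor
        if marker_at is not None:
            if used - marker_at < delta:
                continue                          # in confirmation tail
            break                                 # tail satisfied: stop
        remaining = budget - used                 # assumed definition of r_t
        if remaining < buffer or entropy >= theta_e:
            break                                 # buffer stop, or escalate (treated as stop)
        if entropy <= theta_h and conf >= theta_c:
            break                                 # confident stop
    return text, used

def scripted_model(marker_chunk=9):
    """Stub LM: emits the delimiter in one chosen chunk, with middling entropy."""
    step = 0
    def generate_chunk(max_tokens):
        nonlocal step
        step += 1
        piece = " #### 18" if step == marker_chunk else " step"
        return piece, max_tokens, 3.0, 0.5, False
    return generate_chunk

for bits in (16, 4):
    _, tokens_used = run_controller(scripted_model(), effective_bits=bits)
    print(bits, tokens_used)
```

In this scripted trace the delimiter appears when 144 tokens have been generated, so the 16\-bit setting halts immediately while the 4\-bit setting decodes two further chunks before stopping\.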
### 4\.6 Complexity and Overhead
Each controller step pays the cost of decoding one chunk of $k$ tokens plus $\mathcal{O}(|\mathcal{V}|)$ work for entropy and $\mathcal{O}(d)$ work for cosine\-similarity tracking\. The controller does *not* issue additional model calls; it consumes tensors that are already materialized during ordinary decoding \(logits, and optionally last\-layer hiddens exposed via Hugging Face forward hooks\)\.
## 5 Experimental Setup
#### Models and quantization\.
We evaluate three Qwen2\.5 Instruct checkpoints \[[4](https://arxiv.org/html/2605.05561#bib.bib4)\]: the 3B, 7B, and 14B sizes \(model identifiers `Qwen2.5-3B-Instruct`, `Qwen2.5-7B-Instruct`, and `Qwen2.5-14B-Instruct`\)\. All checkpoints are loaded in 4\-bit precision via `bitsandbytes`, as exposed through Hugging Face Transformers \[[1](https://arxiv.org/html/2605.05561#bib.bib1), [2](https://arxiv.org/html/2605.05561#bib.bib2), [9](https://arxiv.org/html/2605.05561#bib.bib9)\]\. Decoding is greedy \(`do_sample=False`\) for reproducibility\.
#### Task, prompts, and extraction\.
We measure exact\-match accuracy on GSM8K \[[3](https://arxiv.org/html/2605.05561#bib.bib3)\] loaded through the Hugging Face `datasets` library \(`gsm8k/main`, split `test`\)\. Per\-model evaluation shards are the *first* $N$ items of the test split \(no random sub\-sampling\), so the shards are deterministic and can be re\-extracted exactly by any user\. The standard final\-answer extraction protocol after the \#\#\#\# delimiter is used: the literal substring after the final \#\#\#\# is parsed as a float and compared exact\-match against the gold answer\. Prompts use the instruction\-tuned chat template for Qwen2\.5 via `tokenizer.apply_chat_template(..., add_generation_prompt=True)`\. The exact prompt template is recorded in `configs/experiment_gsm8k_minimal.yaml` in the released repository\.
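The extraction step described above can be sketched as follows\. This is a hedged illustration, not the repository's parser: the names `extract_answer` and `exact_match` are hypothetical, removing thousands separators before parsing is an assumption, and the treatment of trailing text after the number is not specified in the text \(here any non\-numeric remainder makes the parse fail\)\.

```python
def extract_answer(text, marker="####"):
    """Parse the substring after the final marker as a float, else None."""
    if marker not in text:
        return None
    tail = text.rsplit(marker, 1)[1].strip()      # substring after the last marker
    try:
        return float(tail.replace(",", ""))       # assumption: drop thousands separators
    except ValueError:
        return None                               # not a clean number

def exact_match(prediction_text, gold_text):
    predicted = extract_answer(prediction_text)
    gold = extract_answer(gold_text)
    return predicted is not None and gold is not None and predicted == gold

print(exact_match("3 + 15 = 18 #### 18", "She sells 18 eggs. #### 18"))
print(extract_answer("first try #### 5 wait, redo #### 1,250"))
```

Because only the text after the *final* delimiter is parsed, tokens generated during a confirmation tail can change or invalidate the extracted answer, which is worth keeping in mind when interpreting tail\-phase behavior\.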
#### Methods compared\.
We compare three controllers at matched $(\text{model}, B)$: \(i\) *fixed*, \(ii\) *adaptive* \(precision\-agnostic calibrator\), and \(iii\) *BitCal\-TTS* \(precision\-aware calibrator and tail\)\. All share the same model checkpoint, prompts, decoding strategy, and parsing\.
#### Metrics and uncertainty\.
We report:
- *Exact\-match accuracy* on the parsed final answer\. Whenever a single accuracy figure is shown, we accompany it with the Wilson 95% confidence interval computed from the relevant shard size $N$\.
- *Average tokens consumed* per example \(proxy for latency, cost, and energy in batched serving where the KV\-cache dominates\)\.
- *Token savings* relative to the fixed baseline at the same $(\text{model}, B)$\.
- *Premature\-stop rate*: fraction of examples on which generation halted before consuming the full budget *and* the predicted answer is incorrect\. This isolates the adaptive failure mode of greedy early termination\.
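The four metrics above can be computed from per\-example records\. The following sketch uses the standard Wilson score interval with $z=1.96$; the record layout \(`correct`, `tokens`\) and the function names are hypothetical, and the toy records are made up purely to exercise the code\.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% for z=1.96)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def summarize(records, budget):
    """records: list of dicts with keys 'correct' (bool) and 'tokens' (int)."""
    n = len(records)
    correct = sum(r["correct"] for r in records)
    # Premature stop: halted before the full budget AND answered incorrectly.
    premature = sum((r["tokens"] < budget) and not r["correct"] for r in records)
    return {
        "accuracy": correct / n,
        "ci95": wilson_interval(correct, n),
        "avg_tokens": sum(r["tokens"] for r in records) / n,
        "premature_stop_rate": premature / n,
    }

def token_savings(avg_tokens, avg_tokens_fixed):
    return 1.0 - avg_tokens / avg_tokens_fixed

toy = [  # illustrative records only, not data from the paper
    {"correct": True, "tokens": 200}, {"correct": False, "tokens": 180},
    {"correct": True, "tokens": 512}, {"correct": False, "tokens": 512},
]
summary = summarize(toy, budget=512)
print(summary, token_savings(summary["avg_tokens"], 512))
```

The Wilson interval is preferred over the normal approximation at small $N$ because it stays inside $[0,1]$ and remains reasonable near the boundaries\.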
#### Hardware\.
Primary headline comparisons mirror the tables collected on a single NVIDIA T4 \(16 GB\) GPU in Google Colab, reflecting a common low\-cost deployment setting for 4\-bit reasoning models\. Software versions are listed in Appendix [B](https://arxiv.org/html/2605.05561#A2)\.
#### Sample sizes and statistical caveat\.
Owing to Colab session limits, the cross\-model summary at $B{=}512$ combines *separate* GSM8K shards per model rather than the full 1,319\-item test set: $N{=}50$ for the 3B model; $N{=}54$ for the 7B model \(also used for the budget sweep at $B\in\{256,512\}$, with $N{=}53$ at $B{=}1024$\); and $N{=}35$ for the 14B model \(with $N{=}34$ at $B{=}1024$ in the appendix table\)\. We report sample sizes explicitly and flag below that the headline between\-method differences \(a 2–4 percentage point edge for BitCal\-TTS over adaptive on 7B and 14B\) sit within the 95% confidence intervals of the individual accuracies on these shards\. The *direction* of the effect is consistent across models, budgets, and across the premature\-stop metric, but a larger evaluation shard would be required to claim significance under a paired test such as McNemar’s\. We treat the present results as suggestive of a real effect rather than as a final estimate of its magnitude\.
## 6 Results
Table [1](https://arxiv.org/html/2605.05561#S6.T1) summarizes the primary comparison at $B{=}512$, including Wilson 95% confidence intervals on accuracy\. Figure [2](https://arxiv.org/html/2605.05561#S6.F2) visualizes the same aggregates \(with CI bars on the accuracy panel\)\. Figure [3](https://arxiv.org/html/2605.05561#S6.F3) isolates the premature\-stop failure mode\. Figure [4](https://arxiv.org/html/2605.05561#S6.F4) shows the quality–efficiency Pareto frontier for the 7B model across budgets, and Figure [5](https://arxiv.org/html/2605.05561#S6.F5) traces accuracy and token use as $B$ varies\.
Table 1: GSM8K results at token budget $B{=}512$ under 4\-bit inference\. Sample sizes are $N{=}50/54/35$ for 3B / 7B / 14B respectively \(see Section [5](https://arxiv.org/html/2605.05561#S5)\)\. Acc\. values report the point estimate with Wilson 95% confidence intervals in brackets\. “Savings” is computed against fixed decoding on the same model\. “Prem\. stop” is the rate of early halts that yield incorrect answers\. Best adaptive accuracy and lowest premature\-stop rate within each model block are in bold\. Note that the BitCal\-TTS–vs–adaptive accuracy differences sit within the overlapping CIs of the individual shard estimates; we discuss this in Sections [5](https://arxiv.org/html/2605.05561#S5) and [8](https://arxiv.org/html/2605.05561#S8)\.
Figure 2: Headline GSM8K comparison at $B{=}512$ under 4\-bit inference\. *Left:* exact\-match accuracy with Wilson 95% confidence intervals\. *Right:* average tokens consumed per example\. BitCal\-TTS improves point\-estimate accuracy over the adaptive baseline on 7B and 14B at modest additional token cost relative to the adaptive baseline; the 3B model remains in a regime where halting signals are unreliable relative to task difficulty \(Section [7](https://arxiv.org/html/2605.05561#S7)\)\.
Figure 3: Premature\-stop rate \(early halt *and* incorrect answer\) at $B{=}512$\. BitCal\-TTS reduces this failure mode on 7B and 14B; on 3B both adaptive variants halt prematurely on the majority of examples \(Section [7](https://arxiv.org/html/2605.05561#S7)\)\.
Figure 4: Quality–efficiency trade\-off for Qwen2\.5\-7B under 4\-bit inference\. Each point is a method $\times$ budget aggregate; budget labels annotate token caps $B$\. Up\-and\-left is preferable\.
Figure 5: Qwen2\.5\-7B budget sweep under 4\-bit inference\. Accuracy rises with $B$ for fixed decoding, while adaptive policies plateau earlier\. BitCal\-TTS tracks closer to fixed accuracy than the adaptive baseline at $B\in\{512,1024\}$ while consuming substantially fewer tokens than fixed decoding\.
#### Headline findings\.
At $B{=}512$ and 4\-bit precision, BitCal\-TTS improves point\-estimate accuracy over the adaptive baseline by $+3.7$ points on Qwen2\.5\-7B \(83\.3% vs\. 79\.6%\) and by $+2.8$ points on Qwen2\.5\-14B \(85\.7% vs\. 82\.9%\), with the premature\-stop rate falling from $14.8\%$ to $11.1\%$ on 7B and from $17.1\%$ to $11.4\%$ on 14B\. Token savings versus fixed decoding remain sizeable \($32.1\%$ on 7B and $40.8\%$ on 14B\), demonstrating that the gain over adaptive does not require giving up the bulk of the efficiency benefit\.
We caution that the absolute accuracy differences reported here are not statistically resolved at our shard sizes: a $+2.8$\-point gain on the 14B shard \($N{=}35$\) corresponds to a single additional correct example, and the Wilson 95% intervals on the two methods overlap substantially \(Table [1](https://arxiv.org/html/2605.05561#S6.T1)\)\. The premature\-stop comparisons are similarly based on small absolute counts\. The *direction* of the effect, however, is consistent across model sizes \(7B and 14B\), across token budgets \($B\in\{512,1024\}$; Table [3](https://arxiv.org/html/2605.05561#A2.T3)\), and across metrics \(accuracy and premature\-stop rate move together in the expected direction\); we discuss the appropriate level of confidence in these findings in Section [8](https://arxiv.org/html/2605.05561#S8)\.
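To see how few examples separate the methods, the reported percentages can be converted back into counts\. The counts below are *inferred* by rounding the reported accuracies against the stated shard sizes; they are not quoted from the results tables, and the helper name `implied_count` is hypothetical\.

```python
def implied_count(rate_percent, n):
    """Nearest integer count consistent with a reported percentage."""
    return round(rate_percent / 100 * n)

# (shard size N, BitCal-TTS accuracy %, adaptive accuracy %) as reported at B=512.
reported = {"7B": (54, 83.3, 79.6), "14B": (35, 85.7, 82.9)}

implied = {}
for name, (n, bitcal_acc, adaptive_acc) in reported.items():
    k_bitcal = implied_count(bitcal_acc, n)
    k_adaptive = implied_count(adaptive_acc, n)
    implied[name] = (k_bitcal, k_adaptive)
    gap_points = 100 * (k_bitcal - k_adaptive) / n
    print(f"{name}: {k_bitcal}/{n} vs {k_adaptive}/{n} -> gap {gap_points:.2f} points")
```

Under this reading the 7B gap corresponds to two examples out of 54 and the 14B gap to one example out of 35, which is why a paired test on a larger shard would be needed before treating the magnitude as established\.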
#### Budget sweep on 7B\.
Table [3](https://arxiv.org/html/2605.05561#A2.T3) \(Appendix [B](https://arxiv.org/html/2605.05561#A2)\) and Figure [5](https://arxiv.org/html/2605.05561#S6.F5) trace each method across $B\in\{256,512,1024\}$\. At $B{=}256$, all three methods are accuracy\-bound by the small budget relative to GSM8K solution lengths\. At $B{=}512$ and $B{=}1024$, BitCal\-TTS sits between adaptive and fixed in both accuracy and token use, recovering roughly $1/3$ of the adaptive–vs–fixed accuracy gap at $B{=}1024$ on this shard while still using fewer than half as many tokens as fixed decoding\.
## 7Discussion
#### Why bit\-aware confirmation helps\.
Our results are consistent with two complementary mechanisms: (i) the conservative confidence scale $s(b{=}4)=0.85$ in Eq. ([5](https://arxiv.org/html/2605.05561#S4.E5)) reduces entropy-threshold stops that would otherwise occur while the trace is still unstable, and (ii) the 32-token post-marker tail in Eq. ([7](https://arxiv.org/html/2605.05561#S4.E7)) prevents the first appearance of `####` from being treated as irrefutable evidence of completion. Quantization noise can locally produce a formatted-looking answer line whose surrounding reasoning has not yet converged; requiring additional confirmation tokens appears to reduce this failure mode, consistent with the drop in premature-stop rate on 7B and 14B. Disentangling the contribution of (i) versus (ii) requires a controlled component-wise ablation; we sketch the planned design in Section [8](https://arxiv.org/html/2605.05561#S8) and treat the question as open for the present version.
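As a rough illustration of mechanism (ii), the following Python sketch shows one plausible reading of a post-marker confirmation horizon. It is not the paper's Eq. (7): the function name `marker_confirmed` is hypothetical, the marker is assumed to arrive inside a single decoded token string, and only the 32-token tail at 4-bit is taken from the text.

```python
def marker_confirmed(token_texts: list[str], tail_len: int = 32, marker: str = "####") -> bool:
    """Return True once at least `tail_len` tokens follow the first marker token.

    ASSUMPTIONS (illustrative only):
      * `token_texts` holds the decoded text of each generated token, in order;
      * the marker appears inside one token string (a real tokenizer may split it);
      * tail_len=32 is the 4-bit value quoted in the text; other bit-widths are
        not specified here.
    """
    for i, tok in enumerate(token_texts):
        if marker in tok:
            tokens_after_marker = len(token_texts) - (i + 1)
            return tokens_after_marker >= tail_len
    return False  # no marker seen yet, so there is nothing to confirm


# Usage: a marker followed by only 31 tokens is not yet confirmed.
trace = ["step"] * 10 + ["####"] + ["x"] * 31
print(marker_confirmed(trace))          # still inside the confirmation tail
print(marker_confirmed(trace + ["y"]))  # tail of 32 tokens reached
```

The design point is that the first `####` only starts a countdown; the controller would need to keep decoding through the tail before it is allowed to treat the answer line as final.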
#### The 3B failure mode: what is actually happening\.
At Qwen2.5-3B-Instruct under 4-bit weights, both adaptive variants collapse to roughly 20–22% accuracy versus 60% for fixed decoding, with a 63% premature-stop rate (Table [1](https://arxiv.org/html/2605.05561#S6.T1)). This is the kind of result that ought not to be glossed over. Inspection of the per-example logs from `results/processed/3b/` attributes the collapse to two interacting causes:
- Solution length distribution. GSM8K solutions for the 3B model frequently exceed 200 generated tokens, while our floor $m{=}128$ enables halting decisions well before a typical solution has finished. On the 7B and 14B models the average correct trace is shorter and more stable, so the floor is rarely the binding constraint.
- Stability proxies fire on partial reasoning. The trace-stability score $\tau^{\mathrm{tr}}$ rewards literal repetition across recent chunks. At 3B–4bit, repeating a partial intermediate computation (e.g., re-stating a quantity as the model edges towards an answer) is sufficient to trigger $\tau^{\mathrm{tr}}{=}1$ on a chunk-pair where neither chunk contains a final answer, pushing $c_t$ over $\theta_c{=}0.75$. The bit-aware scale $s(4){=}0.85$ does not save us, because $0.75\cdot 0.85^{-1}\approx 0.88$ is itself routinely exceeded once entropy is also low on a fluent partial trace.
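The threshold arithmetic in the second cause can be made concrete with a small Python sketch. The constants ($\theta_c$, $s(4)$, $H_{\max}$, and the signal weights) are the values quoted in the paper; the way they are combined here is an assumption for illustration only (a weighted sum of signals, multiplied by the bit-conditioned scale), and the sketch ignores the floor and the marker logic.

```python
# Values quoted in the paper.
THETA_C = 0.75                      # confidence threshold
S_4BIT = 0.85                       # bit-conditioned scale at 4-bit
H_MAX = 10.0                        # entropy normalization
W_E, W_TR, W_HID = 0.40, 0.35, 0.25  # signal weights


def raw_confidence(entropy: float, tau_tr: float, tau_hid: float) -> float:
    """ASSUMPTION: raw confidence is a weighted sum of an entropy-derived
    score and the two stability proxies. The paper's exact form may differ."""
    entropy_conf = 1.0 - min(max(entropy, 0.0), H_MAX) / H_MAX
    return W_E * entropy_conf + W_TR * tau_tr + W_HID * tau_hid


def would_halt(entropy: float, tau_tr: float, tau_hid: float, scale: float = S_4BIT) -> bool:
    """ASSUMPTION: the scale multiplies the raw confidence before thresholding."""
    return scale * raw_confidence(entropy, tau_tr, tau_hid) >= THETA_C


# Raw confidence needed to clear the threshold after rescaling: 0.75 / 0.85.
raw_needed = THETA_C / S_4BIT
print(f"raw confidence needed at 4-bit: {raw_needed:.3f}")

# A fluent partial trace: low entropy, literal repetition (tau_tr = 1).
print(would_halt(entropy=1.0, tau_tr=1.0, tau_hid=0.9))
# Same entropy and hidden-state stability, but no literal repetition.
print(would_halt(entropy=1.0, tau_tr=0.0, tau_hid=0.9))
```

Under these assumptions, a saturated trace-stability score is enough to carry a low-entropy partial trace past the rescaled threshold, which matches the failure pattern described above.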
We therefore view the 3B–4bit regime as outside the operating envelope of the present controller rather than as evidence that the controller itself has broken: the floor and confidence threshold are tuned for models whose typical correct trace is shorter than the budget, which is not the case here. Two practical mitigations are immediate: raising $m$ to a model-conditioned value (e.g., $m\approx\mathbb{E}[L]$ for the typical solution length $L$) and increasing $\theta_c$ at very low nominal capacity. We recommend both as deployment heuristics. A more principled fix would replace the hand-tuned thresholds with a small learned calibrator, which we list as future work.
#### Trade\-off surface\.
BitCal-TTS is intentionally less aggressive than adaptive decoding; it spends modestly more tokens per example. Across our experiments, this extra cost is small relative to fixed decoding and is accompanied by higher point-estimate accuracy on the 7B and 14B shards. The Pareto picture in Figure [4](https://arxiv.org/html/2605.05561#S6.F4) highlights the central trade-off of bit-aware halting: a controller that is too permissive at low precision pays an outsized accuracy cost, whereas a controller that is too conservative loses most of the savings that motivate adaptive decoding in the first place.
#### Scaling trend\.
Failure modes diminish from 3B to 14B. Larger quantized models align better with the answer-marker-plus-confirmation heuristic, suggesting stronger compatibility between online halting signals and the underlying reasoning policy at higher capacity. BitCal-TTS is a *controller*, not a capacity increase; practitioners should expect diminishing returns when the base model lacks minimal reasoning competence at the chosen precision.
## 8 Limitations and Broader Impact
#### Statistical power and shard size\.
The most important caveat in this paper is statistical. Our headline comparisons rest on shards of $N\in\{35,50,54\}$ examples rather than the full $1{,}319$-item GSM8K test split. At these shard sizes, the Wilson 95% confidence interval on a single accuracy estimate is roughly $\pm 10$ percentage points, which is wider than any between-method effect we report. The point-estimate gains for BitCal-TTS over adaptive should therefore be read as *suggestive but not significant*: although the direction is consistent across two model sizes, two budgets, and two metrics (accuracy and premature-stop rate), a paired McNemar's test would not reject the null hypothesis at our shard sizes. We are committed to a follow-up evaluation on the full test split before any production claim is staked on these numbers.
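To see why a paired test cannot reject at these sizes, the sketch below computes the exact two-sided McNemar p-value from discordant-pair counts. The function name is hypothetical, and the discordant counts are illustrative assumptions: the paper reports only net gains (two examples on 7B, one on 14B), not the per-example breakdown.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value.

    b: items method A gets right and method B gets wrong
    c: items method B gets right and method A gets wrong
    Under the null, the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# ASSUMPTION: hypothetical discordant splits that are compatible with the
# reported net gains; the true splits are not given in the text.
for label, b, c in [
    ("7B, net +2 as 2 vs 0", 2, 0),
    ("7B, net +2 as 3 vs 1", 3, 1),
    ("14B, net +1 as 1 vs 0", 1, 0),
]:
    print(f"{label}: p = {mcnemar_exact_p(b, c):.3f}")
```

Even the most favourable split for a net gain of two examples (two discordant pairs, both in one direction) gives $p = 0.5$, far from conventional significance thresholds.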
#### Component\-wise ablation is pending\.
BitCal-TTS bundles three changes relative to the adaptive baseline: the bit-conditioned confidence scale $s(b)$, the bit-conditioned post-marker tail $\Delta(b)$, and the inclusion of hidden-state stability $\tau^{\mathrm{hid}}$. The present experiments compare the bundle against the precision-agnostic baseline; they do not isolate the contribution of each component. The natural ablation grid is the four combinations of $\{s(b)\text{ on/off}\}\times\{\Delta(b)\text{ on/off}\}$ (holding $\tau^{\mathrm{hid}}$ fixed), followed by a $\tau^{\mathrm{hid}}$ on/off sweep. Reporting these on the full test split is on the immediate roadmap; we list it explicitly here so reviewers can scope the contribution accurately.
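The ablation grid described above can be enumerated in a few lines. This is a planning sketch, not released code: the dictionary keys are hypothetical names, and two details are assumptions, namely that $\tau^{\mathrm{hid}}$ is held on during the first stage and that the second stage keeps both bit-conditioned components on.

```python
from itertools import product


def ablation_grid() -> tuple[list[dict[str, bool]], list[dict[str, bool]]]:
    """Enumerate the two-stage ablation: a 2x2 grid, then a tau_hid sweep."""
    # Stage 1: s(b) on/off crossed with Delta(b) on/off.
    # ASSUMPTION: tau_hid is held fixed at "on" for this stage.
    stage1 = [
        {"bit_scale": s, "bit_tail": d, "hidden_stability": True}
        for s, d in product([False, True], repeat=2)
    ]
    # Stage 2: tau_hid on/off.
    # ASSUMPTION: both bit-conditioned components stay on for this sweep.
    stage2 = [
        {"bit_scale": True, "bit_tail": True, "hidden_stability": h}
        for h in (False, True)
    ]
    return stage1, stage2


stage1, stage2 = ablation_grid()
for cfg in stage1 + stage2:
    print(cfg)
```

The configuration with every component on appears in both stages, so the design needs five distinct runs per model and budget rather than six.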
#### Scope\.
Our primary evidence is GSM8K-centric. Math reasoning with explicit final markers is a best-case scenario for delimiter-based tails; tasks without a structured terminator will require an alternative confirmation criterion. We do not yet report wall-clock latency or energy, although average tokens are a useful proxy when batching and KV-cache residency dominate inference cost. We also note that GSM8K contamination in modern instruction tunes is an active research concern \[[23](https://arxiv.org/html/2605.05561#bib.bib23)\]; our controller is independent of the specific accuracy floor and operates on the policy layer.
#### Method limitations\.
The current calibrator is intentionally simple. A learned calibrator, a small process verifier in the spirit of Lightman et al. \[[20](https://arxiv.org/html/2605.05561#bib.bib20)\], or a halting head trained jointly with the base model could each plausibly improve robustness. Our bit-conditioned scales and tail lengths are hand-tuned constants; learning them per model and per task is left to future work. The 3B–4bit failure mode discussed in Section [7](https://arxiv.org/html/2605.05561#S7) is an additional limitation: the controller's hand-tuned thresholds and floor $m$ are mistuned for models whose typical correct trace exceeds the floor.
#### Broader impact\.
BitCal-TTS only affects *how long* a frozen model runs at inference time; it does not alter training data, model weights, or safety-tuning behavior. Misuse risks are therefore those of the underlying LLM (e.g., harmful or incorrect outputs). Adaptive halting can reduce average compute per query, lowering cost and energy for benign workloads, but in principle it also lowers the cost of high-throughput misuse; deployment policies should continue to combine capacity limits with content safeguards as usual.
## 9 Conclusion
We presented BitCal-TTS, a bit-conditioned, training-free inference controller for quantized reasoning models served under a hard token budget. The controller combines online entropy and stability proxies with a precision-aware confidence scale and a marker-aware confirmation tail, all implemented as a thin sidecar around standard Hugging Face quantized decoding. On Qwen2.5-7B and Qwen2.5-14B Instruct in 4-bit, on the small evaluation shards reported here, BitCal-TTS reduces premature stops and improves GSM8K exact-match accuracy point estimates over a non–bit-aware adaptive baseline while preserving substantial savings versus fixed-budget decoding. The effect direction is consistent across sizes, budgets, and metrics; the absolute magnitudes await confirmation on the full test split, which we list as the most pressing follow-up.
#### Future work\.
Three directions are immediate. First, we plan the full-test-split re-evaluation and the component-wise ablation described in Section [8](https://arxiv.org/html/2605.05561#S8). Second, evaluating BitCal-TTS on additional reasoning benchmarks such as MATH \[[24](https://arxiv.org/html/2605.05561#bib.bib24)\] and on code reasoning tasks would test the generality of the marker-aware tail. Third, learned bit-conditioned calibrators (e.g., a small linear head fitted on held-out logits and hidden states) may outperform our hand-tuned scales, and integrating BitCal-TTS with reasoning-tuned backbones such as DeepSeek-R1 \[[21](https://arxiv.org/html/2605.05561#bib.bib21)\] or the o1 family \[[22](https://arxiv.org/html/2605.05561#bib.bib22)\] may further sharpen the trade-off curve.
## Reproducibility Statement
Code, default hyperparameters, and analysis scripts are available in the accompanying repository ([https://github.com/Saibabu7770/bitcal-tts](https://github.com/Saibabu7770/bitcal-tts)). All figures in this paper can be regenerated via `python scripts/paper_figures.py`, and processed result shards are summarized under `results/processed/`. The full experimental configuration, including model identifiers, quantization options, prompt templates, and seed handling, is recorded in `configs/` and is reproducible end-to-end on a single NVIDIA T4. Software versions and additional implementation details are listed in Appendix [B](https://arxiv.org/html/2605.05561#A2).
## Acknowledgments
Experiments were conducted on Google Colab using NVIDIA T4 GPUs. We thank the maintainers of `transformers` and `bitsandbytes` for the underlying open-source infrastructure, and the authors of the Qwen2.5 Instruct family for releasing the model checkpoints used in this study.
## References
- Wolf et al. [2020] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In *Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP)*, pages 38–45, 2020.
- Dettmers et al. [2023a] Tim Dettmers, Younes Belkada, Sourab Demir, and contributors. bitsandbytes: 8-bit and 4-bit quantization library for PyTorch. [https://github.com/TimDettmers/bitsandbytes](https://github.com/TimDettmers/bitsandbytes), 2023a.
- Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. *arXiv preprint arXiv:2110.14168*, 2021.
- Yang et al. [2024] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jianhong Tu, Jianxin Yang, Jiaxin Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yi-Chao Zhang, Yunyang Wan, Yuqi Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report. *arXiv preprint arXiv:2412.15115*, 2024.
- Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In *Advances in Neural Information Processing Systems (NeurIPS)*, volume 35, pages 24824–24837, 2022.
- Wang et al. [2023] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In *International Conference on Learning Representations (ICLR)*, 2023.
- Snell et al. [2024] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. *arXiv preprint arXiv:2408.03314*, 2024.
- Muennighoff et al. [2025] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. *arXiv preprint arXiv:2501.19393*, 2025.
- Dettmers et al. [2022] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. *arXiv preprint arXiv:2208.07339*, 2022.
- Dettmers et al. [2023b] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. In *Advances in Neural Information Processing Systems (NeurIPS)*, volume 36, 2023b.
- Kadavath et al. [2022] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. *arXiv preprint arXiv:2207.05221*, 2022.
- Manakul et al. [2023] Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, 2023.
- Frantar et al. [2023] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. In *International Conference on Learning Representations (ICLR)*, 2023.
- Lin et al. [2024] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. In *Proceedings of Machine Learning and Systems (MLSys)*, 2024.
- Yao et al. [2023a] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In *Advances in Neural Information Processing Systems (NeurIPS)*, volume 36, 2023a.
- Yao et al. [2023b] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. *arXiv preprint arXiv:2210.03629*, 2023b.
- Liu et al. [2025] Junpeng Liu, Bowen Liu, Ming Yan, et al. A survey on test-time scaling for large language models. *arXiv preprint arXiv:2503.24235*, 2025.
- Schuster et al. [2022] Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Q. Tran, Yi Tay, and Donald Metzler. Confident adaptive language modeling. *Advances in Neural Information Processing Systems (NeurIPS)*, 35, 2022.
- Geifman and El-Yaniv [2017] Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In *Advances in Neural Information Processing Systems (NeurIPS)*, volume 30, 2017.
- Lightman et al. [2024] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In *International Conference on Learning Representations (ICLR)*, 2024.
- Guo et al. [2025] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. *arXiv preprint arXiv:2501.12948*, 2025.
- OpenAI [2024] OpenAI. Learning to reason with LLMs. OpenAI technical report, 2024. [https://openai.com/index/learning-to-reason-with-llms/](https://openai.com/index/learning-to-reason-with-llms/).
- Zhang et al. [2024] Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, et al. A careful examination of large language model performance on grade school arithmetic. *arXiv preprint arXiv:2405.00332*, 2024.
- Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. *Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track*, 2021.
## Appendix A Notation
Table [2](https://arxiv.org/html/2605.05561#A1.T2) summarizes the symbols used in the paper.
Table 2: Notation used throughout the paper.
## Appendix B Implementation Details
#### Code and defaults\.
The implementation lives in the `bitcal_tts` Python package, with entry points `scripts/run_experiment.py` (end-to-end GSM8K runner), `scripts/analyze_results.py` (re-aggregates raw JSONLs into processed CSVs and per-model figures), and `scripts/paper_figures.py` (regenerates the four publication figures used in this paper). The paper-run configuration is `configs/experiment_gsm8k_minimal.yaml` (*not* `configs/default.yaml`, which is a CPU-only mock-demo template). The paper-run defaults are: chunk size $k=16$; entropy normalization $H_{\max}=10$; halting thresholds $\theta_H=2.0$, $\theta_c=0.75$, $\theta_E=4.0$; floor $m=128$; remaining-budget buffer $m_{\mathrm{buf}}=32$; signal weights $(w_e, w_{\mathrm{tr}}, w_{\mathrm{hid}})=(0.40,\ 0.35,\ 0.25)$; calibrator temperature $\gamma=1$. Quantization uses NF4 weights with BF16 compute via `BitsAndBytesConfig` in Hugging Face Transformers (`load_in_4bit=True`). The complete cell-by-cell mapping of Table [1](https://arxiv.org/html/2605.05561#S6.T1) entries to JSONL filenames and exact CLIs is given in the repository's `REPRODUCIBILITY.md`.
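For readers who want the defaults in one place, the sketch below collects them in a small Python dataclass with basic sanity checks. It is not the schema of `configs/experiment_gsm8k_minimal.yaml`: the class name and field names are hypothetical, and only the numeric values come from the list above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ControllerDefaults:
    """Paper-run controller defaults. Field names are hypothetical;
    the values are those listed in this appendix."""
    chunk_size: int = 16       # k
    h_max: float = 10.0        # entropy normalization H_max
    theta_h: float = 2.0       # halting threshold theta_H
    theta_c: float = 0.75      # confidence threshold theta_c
    theta_e: float = 4.0       # halting threshold theta_E
    floor_tokens: int = 128    # floor m
    budget_buffer: int = 32    # remaining-budget buffer m_buf
    w_entropy: float = 0.40    # w_e
    w_trace: float = 0.35      # w_tr
    w_hidden: float = 0.25     # w_hid
    gamma: float = 1.0         # calibrator temperature

    def validate(self) -> None:
        """Cheap consistency checks before launching a run."""
        weight_sum = self.w_entropy + self.w_trace + self.w_hidden
        if not math.isclose(weight_sum, 1.0, abs_tol=1e-9):
            raise ValueError(f"signal weights sum to {weight_sum}, expected 1.0")
        if not 0.0 < self.theta_c <= 1.0:
            raise ValueError("theta_c must lie in (0, 1]")
        if self.chunk_size <= 0 or self.floor_tokens < 0 or self.budget_buffer < 0:
            raise ValueError("token counts must be non-negative and chunk_size positive")


defaults = ControllerDefaults()
defaults.validate()
print(defaults)
```

A frozen dataclass is used here so that a run cannot silently mutate its hyperparameters after validation; any variant (for example a larger floor for small models) would be created as a new instance.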
#### Software versions\.
Experiments were run with the following pinned versions, matching the companion repository's `REPRODUCIBILITY.md`: `transformers 4.44.x`, `torch 2.4.x` with CUDA 12.1, `bitsandbytes 0.43.x`, `accelerate 0.33.x`, `datasets 2.20.x`, plus `numpy`/`scipy`/`pandas`/`matplotlib` for analysis and plotting. The published JSONL artifacts in `results/raw/` were generated under Python 3.11 on Colab; any recent versions that support 4-bit `bitsandbytes` loading should reproduce the headline numbers within roughly $0.5$ accuracy points. We recommend re-pinning before reproduction because the `bitsandbytes` 4-bit NF4 path is sensitive to minor-version changes (e.g., the 0.43.x $\rightarrow$ 0.44.x dequant-ordering update flips greedy outputs on a small minority of items). Determinism settings: `torch.use_deterministic_algorithms(False)` (the `bitsandbytes` kernels do not currently expose deterministic variants), `do_sample=False` for all decoding, fixed seed $42$. The remaining sources of variability (CUDA non-determinism, Hugging Face cache SHA, shard composition) are documented in the repository.
#### Qwen2\.5\-7B budget sweep\.
Table [3](https://arxiv.org/html/2605.05561#A2.T3) reproduces the Qwen2.5-7B budget sweep exported from `results/processed/7b/summary.csv`. Sample sizes reflect the logged shard ($N{=}54$ for $B\in\{256,512\}$ and $N{=}53$ for $B{=}1024$).
Table 3: Qwen2.5-7B across budgets under 4-bit inference (processed CSV).
#### Qwen2\.5\-14B partial sweep\.
Table [4](https://arxiv.org/html/2605.05561#A2.T4) reports the 14B sweep with $N\in\{34,35\}$ items per setting due to Colab session limits. We list these explicitly to avoid overstating the sample size.
Table 4: Qwen2.5-14B across budgets under 4-bit inference (partial $N$).
#### Reproducibility checklist\.
- Greedy decoding (`do_sample=False`) with fixed prompt templates removes sampling stochasticity; the only remaining sources of variability are CUDA non-determinism and shard composition, both of which are documented in the repository.
- All controller hyperparameters are stored in versioned configuration files; tables in this appendix were generated directly from the released processed-results CSVs.
- Figures and tables can be regenerated from raw and processed logs without re-running the model, as long as the released archives in `results/` are present.
- A typical T4 wall-clock budget for the full per-model sweep (items $\times$ methods $\times$ budgets) is approximately $25$ minutes for 3B ($50\times 3\times 3=450$ calls), $80$ minutes for 7B ($54\times 3\times 3=486$ calls), and $110$ minutes for 14B ($35\times 3\times 2=210$ calls), dominated by autoregressive decoding rather than by controller overhead. Stronger GPUs (A100 / RTX 4090) typically finish $3$–$6\times$ faster.