Enhancing Numerical Prediction in LLMs via Smooth MMD Alignment

arXiv cs.CL 06/29/26, 04:00 AM Papers
numerical-prediction llm maximum-mean-discrepancy loss-function smoothness machine-learning fine-tuning
Summary
Introduces Smooth Maximum Mean Discrepancy (SMMD), a loss function that aligns predicted numeric distributions with targets using kernel matching and graph-based smoothness, improving numerical prediction accuracy in LLMs across multiple tasks.
arXiv:2606.27731v1 Announce Type: new
# Enhancing Numerical Prediction in LLMs via Smooth MMD Alignment
Source: [https://arxiv.org/html/2606.27731](https://arxiv.org/html/2606.27731)
###### Abstract

Despite their strong general capabilities, large language models (LLMs) often remain unreliable when outputs must be numerically precise. A key reason is the training objective: standard cross-entropy treats numeric tokens as unstructured categories and ignores the metric structure of their values. We address this mismatch with Smooth Maximum Mean Discrepancy (SMMD), which builds on the classic MMD by incorporating value-distance kernels over numeric tokens and graph-based smoothness. With this kernel defined over a numeric sub-vocabulary, SMMD aligns the predicted numeric distribution to the target via kernel matching and smooths the prediction–target residual over the induced kernel graph to encourage local consistency. We evaluate SMMD on four numeric-target tasks: mathematical reasoning, arithmetic calculation, clock-time recognition, and chart question answering, across multiple open-weight LLM and VLM backbones. SMMD consistently improves accuracy over both cross-entropy and recent numeric-target losses; analyses show complementary effects between MMD and smoothness and underscore the importance of distance-based kernel design. Code is available at [https://github.com/Zuozhuo/smmd-loss](https://github.com/Zuozhuo/smmd-loss).

Machine Learning, Large Language Models, Numerical Prediction, Maximum Mean Discrepancy

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.27731v1/x1.png)

Figure 1: Overview of SMMD. (a) From logits, we restrict to the numeric sub-vocabulary $V_{\mathrm{num}}$ and apply softmax to obtain the numeric distribution $p$, which is compared to the one-hot target $q$. (b) A value-distance-induced kernel $K$ is precomputed by applying an RBF kernel to pairwise gaps $|v_i - v_j|$, so numerically closer tokens have higher similarity. (c) Training combines kernel MMD alignment with a smoothness regularizer on the residual $r = p - q$ to encourage locally coherent errors along the numeric axis. The final objective is $\mathcal{L}_{\mathrm{SMMD}} = \mathcal{L}_{\mathrm{MMD}} + \alpha\,\mathcal{L}_{\mathrm{Smooth}}$, where $\alpha$ is set automatically via degree-based normalization.

Large language models (LLMs) have achieved remarkable progress in natural language generation and reasoning (OpenAI, [2023](https://arxiv.org/html/2606.27731#bib.bib38); Minaee et al., [2025](https://arxiv.org/html/2606.27731#bib.bib37); Guo et al., [2025](https://arxiv.org/html/2606.27731#bib.bib36)), yet they remain unreliable when a task requires *precise numerical outputs* (Spithourakis and Riedel, [2018](https://arxiv.org/html/2606.27731#bib.bib1); Zausinger et al., [2025](https://arxiv.org/html/2606.27731#bib.bib26)).
This weakness extends far beyond basic arithmetic word problems (Cobbe et al., [2021](https://arxiv.org/html/2606.27731#bib.bib3)), manifesting in more complex *numerically grounded* contexts, from visual numerical reasoning (Masry et al., [2022](https://arxiv.org/html/2606.27731#bib.bib4); Methani et al., [2020](https://arxiv.org/html/2606.27731#bib.bib5); Kafle et al., [2018](https://arxiv.org/html/2606.27731#bib.bib6); Saxena et al., [2025](https://arxiv.org/html/2606.27731#bib.bib7)) to specialized scientific and engineering workflows where precise numerical parameters directly determine the output (Zuo et al., [2025](https://arxiv.org/html/2606.27731#bib.bib41); Guo et al., [2026](https://arxiv.org/html/2606.27731#bib.bib42)). In these settings, models frequently produce incorrect numerical outputs even when the surrounding reasoning appears plausible. Such failures are particularly undesirable in scientific, financial, and decision-making pipelines, where numerical errors can propagate and lead to qualitatively different outcomes.

A fundamental reason is a mismatch between the *metric structure* of numeric values and the *training signal* used to model them. In next-token prediction, numerical tokens are treated as categorical labels and optimized with cross-entropy (CE), which ignores ordinal and distance information: confusing “3” with “4” is penalized in the same way as confusing “3” with “7”. Consequently, the objective provides no incentive for the model to express value proximity in its predictive distribution, even though such proximity is often crucial for numerically grounded reasoning and downstream decision making (Spithourakis and Riedel, [2018](https://arxiv.org/html/2606.27731#bib.bib1)).
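The indifference of cross-entropy to value distance can be checked on a toy case. The sketch below is illustrative only: the two hand-made distributions over the digits 0 to 9 and the helper name `cross_entropy` are assumptions for this example and do not come from the paper.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target under a categorical distribution."""
    return -math.log(probs[target_index])

# Toy distributions over the digits 0..9; the ground-truth digit is 3.
near_miss = [0.0] * 10
near_miss[3], near_miss[4] = 0.4, 0.6   # most mass on the neighbouring value 4
far_miss = [0.0] * 10
far_miss[3], far_miss[7] = 0.4, 0.6     # most mass on the distant value 7

# CE only looks at the probability assigned to the target token,
# so both mistakes receive exactly the same loss.
ce_near = cross_entropy(near_miss, 3)
ce_far = cross_entropy(far_miss, 3)
```

Both values equal $-\log 0.4$: where the remaining probability mass sits has no effect on the loss.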

Recently, a growing line of work has begun to incorporate metric structure into supervision, notably through Earth Mover’s Distance (EMD) (Zausinger et al., [2025](https://arxiv.org/html/2606.27731#bib.bib26); Fei et al., [2025](https://arxiv.org/html/2606.27731#bib.bib2)). Concretely, EMD-based supervision penalizes the model by weighting predicted probability mass according to its distance from the ground-truth numeric token. While these principled, transport-based losses are effective, they do not explicitly encourage local smoothness in the resulting training signal. In particular, even when most probability mass is near the target, the per-token error signal can vary unevenly across neighboring numeric tokens, leaving the model’s behavior less stable in the immediate vicinity of the correct value.

Motivated by these gaps, we take a kernel-distribution perspective and introduce Smooth Maximum Mean Discrepancy (SMMD). SMMD adapts the classic kernel MMD framework (Gretton et al., [2012](https://arxiv.org/html/2606.27731#bib.bib8)) to the discrete token distributions of LLMs and, to the best of our knowledge, is the first to use kernel distribution matching to supervise numeric token prediction. Distinct from objectives that directly penalize a transport cost, SMMD takes a holistic *kernel-based* approach: it transforms value distances into a similarity kernel and aligns the predicted and target distributions by matching their mean embeddings in a Reproducing Kernel Hilbert Space (RKHS). Beyond this global alignment, SMMD further promotes *local consistency* by imposing a smoothness constraint on the prediction–target residual across nearby values via the kernel-induced Dirichlet energy. The resulting objective is lightweight, requires no architectural changes, and can be seamlessly combined with cross-entropy during training.

We evaluate SMMD on a diverse suite of numerical\-output tasks spanning mathematical reasoning, arithmetic calculation, clock\-time recognition, and chart question answering\. Our experiments show that SMMD delivers consistent gains in numerical accuracy across a range of language\-only and vision\-language backbones and datasets\. Analysis further indicates that the kernel matching and smoothness regularization contribute in complementary regimes, and that improvements depend on kernels that respect value\-aligned distance structure\. Finally, sensitivity results suggest SMMD is stable under a broad range of hyperparameter choices\.

## 2 Related Work

##### Numeracy in language models\.

While modern LLMs excel at general-purpose tasks, their grasp of numerical values remains surprisingly brittle. Early critiques highlighted that treating numbers as standard text tokens ignores their underlying magnitude, sparking a move toward numeral-aware modeling (Spithourakis and Riedel, [2018](https://arxiv.org/html/2606.27731#bib.bib1)). Much of this effort has focused on representation, ranging from digit-level tokenization to numeracy-specific training signals (Geva et al., [2020](https://arxiv.org/html/2606.27731#bib.bib9)), with evidence suggesting that even subtle tokenization choices can fundamentally shift arithmetic performance (Singh and Strouse, [2024](https://arxiv.org/html/2606.27731#bib.bib10)). Another direction injects more suitable inductive biases through continuous or structured encodings, especially for scientific and property-prediction settings (Golkar et al., [2024](https://arxiv.org/html/2606.27731#bib.bib11)), and connects generation to continuous targets via conditional sequence regression formulations (Born and Manica, [2023](https://arxiv.org/html/2606.27731#bib.bib12)). Recent representation-level work further improves single-token number embeddings through Fourier features, providing a complementary way to encode numerical structure in the model parameters (Zhou et al., [2026](https://arxiv.org/html/2606.27731#bib.bib40)). Complementary work targets the sequential nature of number generation, e.g., changing digit decoding order to better align with arithmetic structure (Zhang-Li et al., [2024](https://arxiv.org/html/2606.27731#bib.bib13)).
Closest to our focus, several methods revise the training signal for numeric outputs without architectural changes: NTL introduces value-aware objectives over number tokens, including a Wasserstein-style loss (Zausinger et al., [2025](https://arxiv.org/html/2606.27731#bib.bib26)), and NTIL further extends EMD-based supervision to encourage numerical integrity at both token and sequence levels (Fei et al., [2025](https://arxiv.org/html/2606.27731#bib.bib2)). DIST2 also injects metric distance into token-level supervision by shaping targets according to numerical proximity (Chung et al., [2026](https://arxiv.org/html/2606.27731#bib.bib39)); in contrast, our SMMD keeps the one-hot target, performs kernel distribution matching in an RKHS, and further regularizes the prediction–target residual with graph smoothness. Orthogonally, inference-time strategies such as verifiers (Cobbe et al., [2021](https://arxiv.org/html/2606.27731#bib.bib3)), chain-of-thought prompting (Wei et al., [2023](https://arxiv.org/html/2606.27731#bib.bib14)), or program-aided execution (Gao et al., [2023](https://arxiv.org/html/2606.27731#bib.bib15)) can improve accuracy, while arithmetic-oriented extended pretraining can also strengthen numeracy (Petrak et al., [2023](https://arxiv.org/html/2606.27731#bib.bib16)). Overall, these threads point to a persistent objective mismatch: numeric mistakes have an inherent metric meaning (how far off the value is), but standard cross-entropy rewards only exact token matches and does not expose that structure to learning.

##### Maximum Mean Discrepancy \(MMD\) and distribution matching\.

MMD is a kernel-based distance between distributions, originally developed for two-sample testing through kernel mean embeddings (Gretton et al., [2012](https://arxiv.org/html/2606.27731#bib.bib8)). In deep learning it is often used as a practical distribution-matching penalty. For domain adaptation, MMD-based losses align source and target feature distributions, with multi-kernel variants improving robustness across scales (Long et al., [2015](https://arxiv.org/html/2606.27731#bib.bib22)). For generative modeling, MMD has served as a likelihood-free training signal, including adversarial variants that learn the feature space in which matching is performed (Li et al., [2015](https://arxiv.org/html/2606.27731#bib.bib24), [2017](https://arxiv.org/html/2606.27731#bib.bib25)). Related kernel discrepancies also appear as regularizers for representation learning and latent-variable models when explicit likelihoods are unavailable (Tolstikhin et al., [2017](https://arxiv.org/html/2606.27731#bib.bib19); Zhao et al., [2018](https://arxiv.org/html/2606.27731#bib.bib20)). Departing from the usual feature-level matching setups, we instantiate MMD as a supervised, per-token objective over a numeric sub-vocabulary, using a distance-induced kernel that reflects value proximity and pairing it with a smoothness bias to encourage locally coherent behavior along the number line.

## 3 Method

We study numeric prediction in an autoregressive language model. Given a context, the model outputs logits $\boldsymbol{\ell} \in \mathbb{R}^{|\mathcal{V}|}$ over the vocabulary $\mathcal{V}$, inducing the full next-token distribution $\tilde{\mathbf{p}} = \mathrm{softmax}(\boldsymbol{\ell})$.

The focus here is on *numeric tokens*: tokens whose string form can be deterministically parsed into a real value (via standard float casting). We precompute a numeric sub-vocabulary $\mathcal{V}_{\mathrm{num}} \subseteq \mathcal{V}$ with size $N = |\mathcal{V}_{\mathrm{num}}|$, and index it by $\{1,\dots,N\}$ via a bijection $\pi: \mathcal{V}_{\mathrm{num}} \to \{1,\dots,N\}$, where index $i$ corresponds to the parsed numeric value $v_i \in \mathbb{R}$. The construction procedure is summarized in Appendix [A](https://arxiv.org/html/2606.27731#A1).
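One way to picture this step is the following minimal sketch. It is not the authors' procedure from Appendix A: the toy vocabulary, the function name `build_numeric_subvocab`, the whitespace filter, the exclusion of non-finite values, and the sort-by-value ordering are all assumptions made for illustration.

```python
import math

def build_numeric_subvocab(vocab):
    """Return (token_ids, values) for tokens that parse to a finite float.

    `vocab` maps token string -> token id (a hypothetical toy tokenizer).
    """
    pairs = []
    for token, token_id in vocab.items():
        if token != token.strip():
            continue  # assumption: ignore tokens with surrounding whitespace
        try:
            value = float(token)
        except ValueError:
            continue  # not parseable as a number
        if math.isfinite(value):  # float() also accepts "nan" and "inf"
            pairs.append((value, token_id))
    pairs.sort()  # assumption: order indices by numeric value
    token_ids = [token_id for _, token_id in pairs]
    values = [value for value, _ in pairs]
    return token_ids, values

toy_vocab = {"the": 0, "7": 1, "3": 2, "inf": 3, "12": 4, "+": 5, "0": 6}
token_ids, values = build_numeric_subvocab(toy_vocab)
```

Here position `i` in `token_ids` plays the role of the index $\pi(\cdot)$ and `values[i]` the role of $v_i$. Python's `float()` is more permissive than one might expect (it accepts strings such as `"inf"`, `"1e3"`, and `"1_000"`), so a real implementation would need to decide which of these count as numeric tokens.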

At any training position whose ground-truth token $y$ lies in $\mathcal{V}_{\mathrm{num}}$, logits are restricted to $\mathcal{V}_{\mathrm{num}}$ and renormalized to form the numeric distribution:

$$
\begin{aligned}
\mathbf{p} &= \mathrm{softmax}\big(\boldsymbol{\ell}[\mathcal{V}_{\mathrm{num}}]\big) \in \Delta^{N}, && \text{(1a)} \\
\mathbf{q} &= \mathbf{e}_{\pi(y)} \in \Delta^{N}. && \text{(1b)}
\end{aligned}
$$

Equivalently, $\mathbf{p}$ is the conditional next-token distribution restricted to $\mathcal{V}_{\mathrm{num}}$ (renormalized on $\mathcal{V}_{\mathrm{num}}$), while the standard cross-entropy $\mathcal{L}_{\mathrm{CE}}$ is computed over the full vocabulary $\mathcal{V}$ using $\tilde{\mathbf{p}}$. For positions with $y \notin \mathcal{V}_{\mathrm{num}}$, our numeric-aware term is set to zero.
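A minimal sketch of Eqs. (1a) and (1b) for a single position, assuming plain Python lists for the logits. The function names, the toy logits, and the token ids are hypothetical and chosen only for illustration.

```python
import math

def numeric_distribution(logits, numeric_ids):
    """Softmax over the logits of the numeric sub-vocabulary only (Eq. 1a)."""
    selected = [logits[i] for i in numeric_ids]
    shift = max(selected)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in selected]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot_target(target_id, numeric_ids):
    """One-hot vector over the numeric sub-vocabulary (Eq. 1b)."""
    return [1.0 if i == target_id else 0.0 for i in numeric_ids]

# Toy full-vocabulary logits; ids 2, 5, 6 are assumed to be the tokens "0", "1", "2".
logits = [4.0, -1.0, 0.5, 3.0, 2.0, 1.5, 0.5, -2.0]
numeric_ids = [2, 5, 6]
p = numeric_distribution(logits, numeric_ids)  # sums to 1 over the three numeric tokens
q = one_hot_target(5, numeric_ids)             # ground-truth token is "1"
```

Note that the large logits of non-numeric tokens (ids 0 and 3 in this toy example) do not influence `p`; they only enter the full-vocabulary cross-entropy.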

Standard cross-entropy only rewards exact matches and treats all incorrect numeric tokens equally. Our goal is to introduce a numeric-aware training loss that respects the metric structure of $\{v_i\}_{i=1}^{N}$: the loss should decrease as the predicted numeric distribution approaches the target distribution, while remaining compatible with token-level autoregressive training.

### 3.1 Distance-induced Kernel over Numeric Tokens

Numeric tokens come with an inherent geometry through their underlying values: for example, $3$ is closer to $4$ than to $9$ in value space. We encode this geometry as a similarity kernel over $\mathcal{V}_{\mathrm{num}}$ by mapping pairwise value distances to kernel weights.

For indices $i, j \in \{1,\dots,N\}$, define the value distance

$$
d(i,j) = |v_i - v_j|. \tag{2}
$$

A PSD kernel matrix is then obtained by converting distances into similarities with a (possibly multi-scale) radial kernel:

$$
K_{ij} = \frac{1}{|\Sigma|} \sum_{\sigma \in \Sigma} \kappa_{\sigma}\big(d(i,j)\big), \tag{3}
$$

where $\Sigma$ is a finite set of bandwidths and $\kappa_{\sigma}(\cdot)$ assigns higher similarity to numerically closer tokens. Throughout this paper, $\kappa_{\sigma}$ is instantiated as the Radial Basis Function (RBF),

$$
\kappa_{\sigma}(d) = \exp\left(-\frac{d^{2}}{2\sigma^{2}}\right). \tag{4}
$$

Intuitively, $\sigma$ controls the locality: smaller $\sigma$ makes similarity decay faster with $|v_i - v_j|$, while larger $\sigma$ couples farther-apart values. Since the RBF kernel is PSD on $\mathbb{R}$ and nonnegative averages preserve positive semidefiniteness, the resulting Gram matrix $\mathbf{K} \in \mathbb{R}^{N \times N}$ is symmetric PSD with $K_{ii} = 1$.
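The kernel construction of Eqs. (2) to (4) can be sketched in a few lines. This is an illustrative pure-Python version, not the released implementation; the function name `rbf_kernel_matrix` is hypothetical, and the single bandwidth $\sigma = 2.0$ is the default reported later in the experiments.

```python
import math

def rbf_kernel_matrix(values, bandwidths):
    """Average of RBF kernels over a set of bandwidths (Eqs. 2-4)."""
    n = len(values)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = abs(values[i] - values[j])  # value distance d(i, j)
            K[i][j] = sum(
                math.exp(-d * d / (2.0 * s * s)) for s in bandwidths
            ) / len(bandwidths)
    return K

digits = [float(v) for v in range(10)]          # digit-level sub-vocabulary, N = 10
K = rbf_kernel_matrix(digits, bandwidths=[2.0])
# K[3][4] = exp(-1/8) is larger than K[3][7] = exp(-2): 4 is "more similar" to 3 than 7 is.
```

Because the matrix depends only on the token values and the bandwidths, it can be built once and reused for every training step.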

##### Practical note on kernel $\mathbf{K}$.

In modern LLMs, $\mathcal{V}_{\mathrm{num}}$ typically consists of digit tokens (so $N = 10$). More generally, even when multi-digit integers are included as single tokens (e.g., $\{0,\dots,999\}$), $N$ is at most $10^{3}$. Thus the kernel can be precomputed once per tokenizer and reused throughout training with negligible overhead.

### 3.2 Kernel MMD Alignment

With the kernel-induced similarity structure over numeric tokens, the goal is to align the predicted numeric distribution $\mathbf{p}$ to the one-hot target $\mathbf{q}$ in a *value-aware* manner. Maximum Mean Discrepancy (MMD) (Gretton et al., [2012](https://arxiv.org/html/2606.27731#bib.bib8)) provides a principled way to compare distributions through their kernel mean embeddings. Given a PSD kernel $k(\cdot,\cdot)$ with RKHS $\mathcal{H}$, the squared MMD between $P$ and $Q$ is

$$
\mathrm{MMD}^{2}(P,Q) = \|\mu_{P} - \mu_{Q}\|_{\mathcal{H}}^{2}, \tag{5}
$$

where $\mu_{P} = \mathbb{E}_{x \sim P}[\phi(x)]$ and $\phi(\cdot)$ is the (implicit) feature map of $k$.

In our setting, the domain is the finite index set $\mathcal{X} = \{1,\dots,N\}$, and both distributions are discrete vectors $\mathbf{p}, \mathbf{q} \in \Delta^{N}$ defined in ([1](https://arxiv.org/html/2606.27731#S3.E1)). Let $\mathbf{K} \in \mathbb{R}^{N \times N}$ denote the Gram matrix, $K_{ij} = k(i,j)$. Introduce the prediction–target residual

$$
\mathbf{r} = \mathbf{p} - \mathbf{q} \in \mathbb{R}^{N}. \tag{6}
$$

Instantiating Eq. ([5](https://arxiv.org/html/2606.27731#S3.E5)) on the finite domain yields a quadratic alignment loss in this residual:

$$
\mathcal{L}_{\mathrm{MMD}} := \sum_{i=1}^{N} \sum_{j=1}^{N} K_{ij}\, r_i r_j = \mathbf{r}^{\top} \mathbf{K} \mathbf{r}. \tag{7}
$$

A detailed derivation from the expectation form Eq. ([5](https://arxiv.org/html/2606.27731#S3.E5)) to the discrete residual form Eq. ([7](https://arxiv.org/html/2606.27731#S3.E7)) is deferred to Appendix [B.1](https://arxiv.org/html/2606.27731#A2.SS1). Moreover, when $\mathbf{K}$ is positive definite, $\mathcal{L}_{\mathrm{MMD}} = 0$ holds if and only if $\mathbf{p} = \mathbf{q}$, so the auxiliary term preserves the supervised optimum.
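To see how the quadratic form in Eq. (7) behaves, the following sketch evaluates it on the digit vocabulary for three point-mass predictions with target digit 3. The helper names and the point-mass predictions are assumptions for illustration; $\sigma = 2.0$ is the paper's reported default bandwidth.

```python
import math

def rbf_kernel_matrix(values, sigma):
    """Single-bandwidth RBF Gram matrix over the token values."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in values] for a in values]

def one_hot(index, n):
    return [1.0 if i == index else 0.0 for i in range(n)]

def mmd_loss(p, q, K):
    """Quadratic form r^T K r on the residual r = p - q (Eq. 7)."""
    r = [pi - qi for pi, qi in zip(p, q)]
    n = len(r)
    return sum(K[i][j] * r[i] * r[j] for i in range(n) for j in range(n))

values = [float(v) for v in range(10)]
K = rbf_kernel_matrix(values, sigma=2.0)
target = one_hot(3, 10)

loss_exact = mmd_loss(one_hot(3, 10), target, K)  # all mass on the correct digit
loss_near = mmd_loss(one_hot(4, 10), target, K)   # all mass on 4: 2 - 2*exp(-1/8)
loss_far = mmd_loss(one_hot(7, 10), target, K)    # all mass on 7: 2 - 2*exp(-2)
```

Unlike cross-entropy, the loss grows with the value gap: the near miss is penalized less than the far miss, and an exact match gives zero.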

### 3.3 Smooth Regularization via Dirichlet Energy

While $\mathcal{L}_{\mathrm{MMD}}$ aligns $\mathbf{p}$ to the one-hot target under the kernel-induced similarity structure, it does not explicitly enforce *local consistency* of the prediction error along nearby numeric values. In particular, the residual can still exhibit sharp, locally oscillatory patterns over the numeric vocabulary. To address this, we introduce an additional smoothness regularizer by viewing the numeric sub-vocabulary as a weighted graph whose edge weights are given by the kernel $\mathbf{K}$.

Specifically, we penalize variations of the residual $\mathbf{r}$ across strongly connected nodes using the Dirichlet energy:

$$
\mathcal{L}_{\mathrm{smooth}} := \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} K_{ij} \big(r_i - r_j\big)^{2}. \tag{8}
$$

This objective has an intuitive interpretation. When two numeric values are close (large $K_{ij}$), the penalty strongly discourages $r_i$ and $r_j$ from disagreeing, thereby promoting a locally coherent error profile over the number line; when they are far apart (small $K_{ij}$), the coupling is weak and the regularizer imposes little constraint.

Let $\deg_i = \sum_{j=1}^{N} K_{ij}$, and define $\mathbf{D} = \mathrm{diag}(\deg_1,\dots,\deg_N)$ and the graph Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{K}$. The Dirichlet energy then reduces to the Laplacian quadratic form

$$
\mathcal{L}_{\mathrm{smooth}} = \mathbf{r}^{\top} \mathbf{L} \mathbf{r}, \tag{9}
$$

highlighting that the regularizer suppresses high-frequency components of $\mathbf{r}$ on the kernel graph. The equivalence between the Dirichlet energy and the Laplacian quadratic form is a well-established identity, detailed in Appendix [B.2](https://arxiv.org/html/2606.27731#A2.SS2) for completeness.
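For readers who want the step without turning to the appendix, the identity follows from expanding the square and using the symmetry of $\mathbf{K}$. The lines below are a short sketch of the standard argument, written in the notation of Eqs. (8) and (9), and are not copied from the paper's appendix:

$$
\begin{aligned}
\frac{1}{2} \sum_{i,j} K_{ij} (r_i - r_j)^{2}
&= \frac{1}{2} \sum_{i,j} K_{ij}\, r_i^{2} + \frac{1}{2} \sum_{i,j} K_{ij}\, r_j^{2} - \sum_{i,j} K_{ij}\, r_i r_j \\
&= \sum_{i} \deg_i\, r_i^{2} - \sum_{i,j} K_{ij}\, r_i r_j \\
&= \mathbf{r}^{\top} \mathbf{D} \mathbf{r} - \mathbf{r}^{\top} \mathbf{K} \mathbf{r}
= \mathbf{r}^{\top} \mathbf{L} \mathbf{r}.
\end{aligned}
$$

The second line uses $K_{ij} = K_{ji}$, so that both squared sums equal $\sum_i \deg_i\, r_i^{2}$. The diagonal entries $K_{ii} = 1$ appear in both $\mathbf{D}$ and $\mathbf{K}$ and therefore cancel in $\mathbf{L}$.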

##### Rationale for smoothing the residual\.

An important design choice is to apply smoothness to the residual $\mathbf{r} = \mathbf{p} - \mathbf{q}$ rather than to $\mathbf{p}$ itself. Doing so preserves consistency with supervision: at perfect prediction $\mathbf{p} = \mathbf{q}$, the residual vanishes and $\mathcal{L}_{\mathrm{smooth}} = 0$ automatically. By contrast, smoothing $\mathbf{p}$ directly would generally impose a nonzero penalty even when the model matches the target exactly, preventing the auxiliary objective from decaying to zero as training converges.
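This contrast is easy to verify numerically. The sketch below evaluates the Dirichlet energy on the residual and on the prediction itself at a perfect prediction. The helper names are hypothetical, and the digit vocabulary with $\sigma = 2.0$ is an assumed toy setting.

```python
import math

def rbf_kernel_matrix(values, sigma):
    """Single-bandwidth RBF Gram matrix over the token values."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in values] for a in values]

def dirichlet_energy(x, K):
    """0.5 * sum_ij K_ij (x_i - x_j)^2, the double-sum form of Eq. (8)."""
    n = len(x)
    return 0.5 * sum(K[i][j] * (x[i] - x[j]) ** 2 for i in range(n) for j in range(n))

values = [float(v) for v in range(10)]
K = rbf_kernel_matrix(values, sigma=2.0)

q = [1.0 if i == 3 else 0.0 for i in range(10)]  # one-hot target on digit 3
p = list(q)                                      # perfect prediction
r = [pi - qi for pi, qi in zip(p, q)]            # residual is all zeros

energy_on_residual = dirichlet_energy(r, K)      # vanishes at the optimum
energy_on_prediction = dirichlet_energy(p, K)    # equals deg_3 - 1, strictly positive
```

A one-hot vector is itself a sharp spike on the graph, so penalizing the roughness of $\mathbf{p}$ would keep pushing against the correct answer.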

### 3.4 Unified Training Objective

We combine kernel MMD alignment and smoothness regularization into a single numeric\-aware objective:

$$
\begin{aligned}
\mathcal{L}_{\mathrm{SMMD}} &= \mathbf{r}^{\top} \mathbf{K} \mathbf{r} + \alpha\, \mathbf{r}^{\top} \mathbf{L} \mathbf{r} \\
&= \mathbf{r}^{\top} \big(\mathbf{K} + \alpha \mathbf{L}\big) \mathbf{r},
\end{aligned}
\tag{10}
$$

where $\alpha$ controls the smoothness regularization strength. In practice, $\alpha$ is *automatically* set by a degree-based normalization, $\alpha = 1/(2\bar{d})$ with $\bar{d} = \frac{1}{N} \sum_{i=1}^{N} \deg_i$.
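Putting Eq. (10) and the degree-based choice of $\alpha$ together gives the following minimal sketch for a single position. It is a plain-Python illustration under assumed toy inputs (digit vocabulary, $\sigma = 2.0$, point-mass predictions), with hypothetical function names; the released code may be organized differently.

```python
import math

def rbf_kernel_matrix(values, sigma):
    """Single-bandwidth RBF Gram matrix over the token values."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in values] for a in values]

def one_hot(index, n):
    return [1.0 if i == index else 0.0 for i in range(n)]

def smmd_loss(p, q, K):
    """r^T (K + alpha L) r with L = D - K and alpha = 1 / (2 * mean degree) (Eq. 10)."""
    n = len(p)
    r = [pi - qi for pi, qi in zip(p, q)]
    degrees = [sum(row) for row in K]                       # deg_i = sum_j K_ij
    alpha = 1.0 / (2.0 * sum(degrees) / n)                  # degree-based normalization
    mmd = sum(K[i][j] * r[i] * r[j] for i in range(n) for j in range(n))
    smooth = sum(degrees[i] * r[i] ** 2 for i in range(n)) - mmd  # r^T D r - r^T K r
    return mmd + alpha * smooth

values = [float(v) for v in range(10)]
K = rbf_kernel_matrix(values, sigma=2.0)
target = one_hot(3, 10)

loss_exact = smmd_loss(one_hot(3, 10), target, K)
loss_near = smmd_loss(one_hot(4, 10), target, K)
loss_far = smmd_loss(one_hot(7, 10), target, K)
```

In this toy setting the combined loss is zero at the exact prediction and smaller for the near miss than for the far miss. Since $\mathbf{K} + \alpha \mathbf{L}$ is fixed once the kernel is chosen, it could equally be precomputed as a single matrix.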

At non-numeric target positions, i.e., $y \notin \mathcal{V}_{\mathrm{num}}$, $\mathcal{L}_{\mathrm{SMMD}}$ is always set to $0$. The overall training objective augments the standard cross-entropy with our SMMD term:

$$
\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\, \mathcal{L}_{\mathrm{SMMD}}, \tag{11}
$$

where $\lambda \geq 0$ controls the weight of the numeric-aware term. Pseudo-code for training with SMMD is provided in Appendix [A](https://arxiv.org/html/2606.27731#A1). Both components are differentiable with respect to logits through $\mathbf{p} = \mathrm{softmax}(\cdot)$ on numeric-target positions.
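The masking and weighting in Eq. (11) can be sketched as follows. The per-position loss values are made up, the function name `combined_loss` is hypothetical, and averaging over positions is an assumption (the paper's reduction may differ); $\lambda = 3.0$ is the default reported in the experiments.

```python
def combined_loss(ce, smmd, is_numeric_target, lam=3.0):
    """Mean over positions of L_CE + lam * L_SMMD, with SMMD zeroed at non-numeric targets."""
    per_position = [
        c + lam * (s if numeric else 0.0)
        for c, s, numeric in zip(ce, smmd, is_numeric_target)
    ]
    return sum(per_position) / len(per_position)  # assumption: simple mean reduction

# Two positions: the first has a numeric target, the second does not.
loss = combined_loss(ce=[1.0, 2.0], smmd=[0.5, 0.7], is_numeric_target=[True, False])
# (1.0 + 3.0 * 0.5 + 2.0 + 0.0) / 2 = 2.25
```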

## 4 Experiments

This section presents a comprehensive evaluation of SMMD\. Our findings show that:

1. SMMD consistently improves numerical prediction across language-only and vision-language tasks, outperforming both cross-entropy and recent numeric-target objectives.
2. Ablations verify that gains come from the value-distance construction of the kernel and the smooth residual regularizer, rather than from adding an MMD-style penalty alone.
3. Sensitivity studies show SMMD is robust under simple default settings, while the best $\lambda$ and $\sigma$ can vary across datasets.

### 4.1 Setup

##### Tasks and datasets\.

We evaluate SMMD on four numerical-output task categories spanning both language-only and vision-language settings. For *Mathematical Reasoning*, we train and evaluate on GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.27731#bib.bib3)), and further assess cross-dataset generalization on SVAMP (Patel et al., [2021](https://arxiv.org/html/2606.27731#bib.bib17)). For *Arithmetic Calculation*, we use the DeepMind-Math suite (Saxton et al., [2019](https://arxiv.org/html/2606.27731#bib.bib18)) (the arithmetic_mixed subset), which consists of short mixed-operator expressions (addition, subtraction, multiplication, and division) with numeric answers. To test grounded numerical prediction with visual inputs, we consider *Clock-Time Recognition* on the Clock-Time dataset (gpiosenka, [2022](https://arxiv.org/html/2606.27731#bib.bib32)), where models predict the corresponding time given a clock image. We also evaluate *Chart Question Answering* on ChartQA (Masry et al., [2022](https://arxiv.org/html/2606.27731#bib.bib4)), which requires extracting numeric values from plots and tables to answer questions. Across all datasets, we report exact-match accuracy for numeric outputs (with additional task-specific metrics reported where applicable). Additional dataset statistics and details are deferred to Appendix [C.1](https://arxiv.org/html/2606.27731#A3.SS1).

##### Models\.

To assess the robustness and model-agnostic nature of SMMD across scales and architectures, we experiment with open-weight backbones ranging from 0.5B to 11B parameters. For LLMs, we consider Qwen2.5-0.5B and Qwen2.5-1.5B (Team, [2025](https://arxiv.org/html/2606.27731#bib.bib27)), SmolLM3-3B (Bakouch et al., [2025](https://arxiv.org/html/2606.27731#bib.bib30)), and Llama3-8B (Grattafiori et al., [2024](https://arxiv.org/html/2606.27731#bib.bib29)). For VLMs, we choose Qwen2.5-VL-3B and Qwen2.5-VL-7B (Bai et al., [2025](https://arxiv.org/html/2606.27731#bib.bib28)), Ministral-3-3B (Liu et al., [2026](https://arxiv.org/html/2606.27731#bib.bib35)), and Llama-3.2-11B-Vision (Meta, [2024](https://arxiv.org/html/2606.27731#bib.bib31)).¹ For ablation and analysis, we use representative backbones to keep the study focused and comparable: Qwen2.5-1.5B for LLM-based tasks and Qwen2.5-VL-3B for VLM-based tasks.

¹ Tokenizer note. Qwen2.5 series and Ministral-3 use digit-level tokenization for integers, so $\mathcal{V}_{\mathrm{num}} = \{0,\dots,9\}$ and $N = 10$. SmolLM3 and Llama3 series include multi-digit integer tokens (e.g., $\{0,\dots,999\}$) as single tokens, yielding $N = 1000$ under our numeric-token construction.

### 4.2 Baselines

Beyond standard cross-entropy (CE), we compare our method against three recent numeric-target methods. The implementation details and hyperparameter settings are provided in Appendix [C.2](https://arxiv.org/html/2606.27731#A3.SS2).

Gaussian Cross Entropy (GCE) (Wang et al., [2025](https://arxiv.org/html/2606.27731#bib.bib23)) replaces the one-hot target for numeric tokens with a Gaussian-shaped soft target centered at the ground-truth numeric value, so that nearby numbers receive partial credit. Using the same notation as our method, let $y \in \mathcal{V}_{\mathrm{num}}$ be the ground-truth numeric token with index $\pi(y)$ and value $v_{\pi(y)}$. GCE defines a soft target $\mathbf{q} \in \Delta^{N}$ over $\mathcal{V}_{\mathrm{num}}$ by

$$
q_i \propto \exp\Big(-\frac{(v_i - v_{\pi(y)})^{2}}{2\sigma_{\mathrm{gce}}^{2}}\Big), \quad \sum_{i=1}^{N} q_i = 1, \tag{12}
$$

and applies cross-entropy between $\mathbf{q}$ and the model distribution restricted to $\mathcal{V}_{\mathrm{num}}$. Following Wang et al. ([2025](https://arxiv.org/html/2606.27731#bib.bib23)), we set $\sigma_{\mathrm{gce}} = 0.5$ for GCE in all experiments.
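As a reading aid, the GCE target of Eq. (12) can be sketched as below. This follows the formula as stated here rather than the baseline's original code; the function names and the uniform toy prediction are assumptions.

```python
import math

def gaussian_soft_target(values, target_index, sigma_gce=0.5):
    """Normalized Gaussian weights centered at the ground-truth value (Eq. 12)."""
    center = values[target_index]
    weights = [math.exp(-((v - center) ** 2) / (2.0 * sigma_gce ** 2)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def soft_cross_entropy(q, p):
    """Cross-entropy between a soft target q and a prediction p."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p) if qi > 0.0)

values = [float(v) for v in range(10)]
q_soft = gaussian_soft_target(values, target_index=3)  # peak at 3, small mass on 2 and 4
uniform = [0.1] * 10
gce = soft_cross_entropy(q_soft, uniform)              # equals log(10) for a uniform prediction
```

With $\sigma_{\mathrm{gce}} = 0.5$, each direct neighbour receives a relative weight of $e^{-2} \approx 0.135$ before normalization, so the target stays sharply peaked.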

Number Token Loss (NTL) (Zausinger et al., [2025](https://arxiv.org/html/2606.27731#bib.bib26)) introduces a regression-like loss on numeric tokens by explicitly penalizing numeric distance. Following the paper, we use the Wasserstein-1 variant as our NTL baseline. In a discrete numeric vocabulary with a one-hot target, the Wasserstein-1 objective reduces to a distance-weighted penalty that sums the predicted probability mass at each numeric token weighted by its absolute distance to the ground-truth value.
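The reduced form described above is short enough to write out. The sketch below implements only that distance-weighted sum, as described in this section, with a hypothetical function name and toy distributions; it is not the NTL reference implementation.

```python
def wasserstein1_to_one_hot(p, values, target_index):
    """Sum of predicted mass weighted by absolute distance to the ground-truth value."""
    target_value = values[target_index]
    return sum(pi * abs(v - target_value) for pi, v in zip(p, values))

values = [float(v) for v in range(10)]

near_miss = [0.0] * 10
near_miss[3], near_miss[4] = 0.4, 0.6  # ground truth 3, most mass on 4
far_miss = [0.0] * 10
far_miss[3], far_miss[7] = 0.4, 0.6    # ground truth 3, most mass on 7

w1_near = wasserstein1_to_one_hot(near_miss, values, target_index=3)  # 0.6 * 1
w1_far = wasserstein1_to_one_hot(far_miss, values, target_index=3)    # 0.6 * 4
```

The penalty is linear in the value gap, in contrast to the kernel-based SMMD loss, which saturates once two values are far apart relative to the bandwidth.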

Numeric Token Integrity Loss (NTIL) (Fei et al., [2025](https://arxiv.org/html/2606.27731#bib.bib2)) extends distance-aware supervision to *sequential* numeric prediction. Beyond token-level objectives, it adds (i) position-dependent weights reflecting place-value importance, and (ii) sequence-level consistency terms that use Gumbel-Softmax to reconstruct a differentiable scalar number and penalize magnitude/absolute errors of the reconstructed value. However, due to the requirement for dynamic sequence parsing and the non-vectorized nature of its number reconstruction process, NTIL incurs a computational latency several times higher than that of standard cross-entropy or plain EMD.

### 4.3 Main Results

Unless otherwise specified, we use a single RBF kernel with bandwidth $\sigma = 2.0$ and set our loss weight to $\lambda = 3.0$ across all experiments in this section. Additional implementation details are provided in Appendix [C.3](https://arxiv.org/html/2606.27731#A3.SS3).

Table 1: Mathematical reasoning results. Metric: exact-match accuracy (Acc, %) ↑. We fine-tune on the GSM8K training set, then evaluate on the GSM8K test set and the full SVAMP set.

| Model | Params | GSM8K CE | GSM8K GCE | GSM8K NTL | GSM8K NTIL | GSM8K Ours | SVAMP CE | SVAMP GCE | SVAMP NTL | SVAMP NTIL | SVAMP Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5 | 0.5B | 29.34 | 28.20 | 30.25 | 30.78 | 31.31 | 39.80 | 37.90 | 43.20 | 40.60 | 43.90 |
| Qwen2.5 | 1.5B | 54.99 | 52.69 | 56.17 | 55.19 | 57.77 | 68.60 | 71.10 | 59.80 | 67.10 | 71.10 |
| SmolLM3 | 3B | 67.78 | 66.56 | 64.13 | 70.13 | 70.74 | 65.80 | 63.70 | 60.50 | 65.80 | 67.10 |
| Llama3 | 8B | 55.72 | 58.15 | 48.12 | 56.63 | 57.70 | 62.20 | 58.40 | 57.50 | 59.50 | 65.30 |

##### Mathematical Reasoning.

Table [1](https://arxiv.org/html/2606.27731#S4.T1) reports exact-match accuracy on GSM8K, and also evaluates the same models on SVAMP (full set) as a cross-dataset check. Across backbones, SMMD improves over CE on both datasets, and often outperforms prior numeric-target objectives. On GSM8K, the gains are consistent: for example, Qwen2.5-1.5B increases from 54.99% to 57.77% Acc, surpassing both NTL (56.17%) and NTIL (55.19%). Importantly, the gains transfer to SVAMP without additional training, indicating improved cross-dataset numerical generalization. For instance, Llama3-8B rises from 62.20% to 65.30% Acc. Overall, these results suggest that numeric-aware supervision translates into end-to-end reasoning gains and better robustness across datasets, rather than only reshaping local token probabilities. Appendix [C.7](https://arxiv.org/html/2606.27731#A3.SS7) provides a statistical error analysis on GSM8K, which further quantifies how SMMD reduces common numerical error patterns.

##### Arithmetic Calculation\.

We evaluate direct arithmetic computation on the DeepMind-Math dataset. Exact-match accuracy (Acc, %) is the primary metric, and we additionally report mean absolute error (MAE) and the coefficient of determination ($R^{2}$) to capture both the magnitude and overall consistency of numerical deviations. As shown in Table [2](https://arxiv.org/html/2606.27731#S4.T2), SMMD achieves the highest Acc among all compared methods across all four backbones. Moreover, for most models it also yields lower MAE and higher $R^{2}$, indicating that SMMD not only increases the likelihood of producing the exact answer, but also reduces large-magnitude mistakes when predictions are incorrect.
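For reference, the three metrics can be computed as in the sketch below. It assumes answers have already been parsed to floats and uses the textbook definition $R^{2} = 1 - \mathrm{SS}_{\mathrm{res}} / \mathrm{SS}_{\mathrm{tot}}$; the paper's evaluation script may parse and compare answers differently. The function names and the toy data are hypothetical.

```python
def exact_match_accuracy(preds, targets):
    """Percentage of predictions equal to the target."""
    return 100.0 * sum(p == t for p, t in zip(preds, targets)) / len(targets)

def mean_absolute_error(preds, targets):
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)

def r_squared(preds, targets):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for p, t in zip(preds, targets))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot

targets = [1.0, 2.0, 3.0, 4.0]
preds = [1.0, 2.0, 4.0, 4.0]   # one answer is off by 1

acc = exact_match_accuracy(preds, targets)  # 75.0
mae = mean_absolute_error(preds, targets)   # 0.25
r2 = r_squared(preds, targets)              # 1 - 1/5 = 0.8
```

Accuracy only counts the single miss, while MAE and $R^{2}$ also reflect how far off it is, which is why the two kinds of metric can rank methods differently.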

Table 2: Arithmetic calculation results. Metrics: exact-match accuracy (Acc, %) ↑, mean absolute error (MAE) ↓, and $R^2$ ↑.

| Model | Params | Loss | Acc | MAE | $R^2$ |
|---|---|---|---|---|---|
| Qwen2.5 | 0.5B | CE | 43.39 | 3.38 | 0.67 |
| | | GCE | 42.04 | 3.32 | 0.73 |
| | | NTL | 44.79 | 3.56 | 0.67 |
| | | NTIL | 43.03 | 3.76 | 0.64 |
| | | Ours | 45.93 | 3.25 | 0.69 |
| Qwen2.5 | 1.5B | CE | 51.83 | 2.62 | 0.76 |
| | | GCE | 52.36 | 2.54 | 0.78 |
| | | NTL | 52.05 | 2.92 | 0.71 |
| | | NTIL | 52.99 | 2.60 | 0.75 |
| | | Ours | 53.83 | 2.66 | 0.77 |
| SmolLM3 | 3B | CE | 61.94 | 2.07 | 0.80 |
| | | GCE | 60.45 | 2.21 | 0.80 |
| | | NTL | 60.02 | 2.36 | 0.79 |
| | | NTIL | 61.28 | 2.14 | 0.81 |
| | | Ours | 64.06 | 1.88 | 0.82 |
| Llama3 | 8B | CE | 70.04 | 1.28 | 0.91 |
| | | GCE | 69.59 | 1.19 | 0.91 |
| | | NTL | 67.01 | 1.66 | 0.86 |
| | | NTIL | 67.91 | 1.66 | 0.85 |
| | | Ours | 71.96 | 1.16 | 0.92 |

Table 3: Clock-Time results. Metrics: exact-match accuracy (Acc, %) ↑ and Time Gap ↓ (absolute deviation in minutes).

| Model | Params | Loss | Acc | Time Gap |
|---|---|---|---|---|
| Qwen2.5-VL | 3B | CE | 46.39 | 86.73 |
| | | GCE | 38.40 | 55.28 |
| | | NTL | 68.68 | 57.73 |
| | | NTIL | 69.46 | 55.43 |
| | | Ours | 74.30 | 50.59 |
| Ministral-3 | 3B | CE | 82.36 | 37.36 |
| | | GCE | 93.82 | 10.39 |
| | | NTL | 89.17 | 25.48 |
| | | NTIL | 90.21 | 23.88 |
| | | Ours | 97.56 | 8.07 |
| Qwen2.5-VL | 7B | CE | 65.69 | 59.11 |
| | | GCE | 69.23 | 42.95 |
| | | NTL | 75.97 | 41.45 |
| | | NTIL | 75.28 | 45.74 |
| | | Ours | 78.26 | 41.07 |
| Llama-3.2-Vision | 11B | CE | 71.80 | 46.64 |
| | | GCE | 82.15 | 28.63 |
| | | NTL | 79.79 | 33.29 |
| | | NTIL | 69.51 | 55.09 |
| | | Ours | 80.76 | 32.45 |

![Refer to caption](https://arxiv.org/html/2606.27731v1/x2.png)

Figure 2: Clock-Time digit distribution. We evaluate model confidence by grouping test examples where the ground-truth first digit is 5. Using Qwen2.5-VL-3B as a representative VLM, we plot the averaged per-digit probability at the first-digit position. Compared to baselines, SMMD yields the most concentrated and target-aligned distribution, successfully assigning dominant probability mass to the correct digit while suppressing competing ones.

##### Clock Time Recognition\.

On the Clock-Time dataset, we evaluate grounded time prediction from clock images. We report exact-match accuracy (Acc, %) and *Time Gap*, defined as the absolute deviation in minutes. Table [3](https://arxiv.org/html/2606.27731#S4.T3) shows that SMMD delivers strong and consistent improvements across VLM backbones: it attains the best Acc on three out of four models and remains competitive on the remaining one, suggesting the gains are not architecture-specific. Notably, Acc increases together with a clear reduction in Time Gap, indicating that SMMD reduces large time mistakes rather than merely shifting borderline cases. For example, on Ministral-3-3B it improves Acc from 82.36% to 97.56% and cuts Time Gap from 37.36 to 8.07 minutes. On Llama-3.2-Vision, SMMD remains highly competitive, trailing the strongest baseline (GCE) by only 1.39 percentage points. A plausible reason is that this backbone is already well-calibrated on numeric tokens for this task, so GCE's confidence shaping fits its error profile and leaves less headroom for SMMD. These improvements are further corroborated by a digit-distribution probe: Figure [2](https://arxiv.org/html/2606.27731#S4.F2) visualizes predicted digit probabilities under a fixed ground-truth digit condition. Compared to CE and other baselines, SMMD yields the sharpest and most target-aligned distribution, placing dominant probability mass on the correct digit while aggressively down-weighting nearby alternatives. This indicates a more decisive and stable numerical belief state: the model commits to the target with higher confidence and exhibits reduced ambiguity among competing digits. The complete per-bucket distributions for all ground-truth digits are deferred to Appendix [C.6](https://arxiv.org/html/2606.27731#A3.SS6).
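To make the Time Gap metric concrete, here is a minimal sketch that scores predicted clock times against ground truth. It assumes times are written as `H:MM` strings and takes the plain absolute difference in minutes; whether the paper wraps differences around the 12-hour dial is not stated in this section, so no wrap-around is applied here. The helper names are hypothetical.

```python
def to_minutes(hhmm):
    """Convert an 'H:MM' string (assumed format) to minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def clock_metrics(preds, targets):
    """Return (exact-match accuracy in %, mean Time Gap in minutes)."""
    gaps = [abs(to_minutes(p) - to_minutes(t)) for p, t in zip(preds, targets)]
    acc = 100.0 * sum(g == 0 for g in gaps) / len(gaps)  # a zero gap counts as a match
    return acc, sum(gaps) / len(gaps)


# Toy predictions (hypothetical data).
preds = ["3:15", "3:45", "5:10", "11:50"]
targets = ["3:15", "3:15", "4:40", "11:55"]
acc, mean_gap = clock_metrics(preds, targets)
print(acc, mean_gap)
```

Two models with the same accuracy can differ widely in mean gap, which is the distinction the paragraph above draws between large mistakes and borderline cases.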

##### Chart Question Answering\.

We evaluate grounded numerical prediction from plots and tables on the ChartQA dataset. Table [4](https://arxiv.org/html/2606.27731#S4.T4) shows that all objectives are broadly comparable on this benchmark, suggesting that performance is often bottlenecked by visual grounding and value extraction rather than the numeric training signal alone. Nevertheless, SMMD remains consistently competitive and achieves the best or tied-best accuracy on three of the four backbones, including a clear gain on Qwen2.5-VL-7B from 77.02% (CE) to 78.38%. While the absolute improvements are modest, the trend is stable: SMMD does not degrade ChartQA performance relative to CE and can provide small, reliable gains in grounded settings.

Table 4: Chart question answering results. Metric: exact-match accuracy (Acc, %) ↑.

| Model | CE | GCE | NTL | NTIL | Ours |
|---|---|---|---|---|---|
| Qwen2.5-VL (3B) | 71.44 | 70.44 | 72.01 | 72.95 | 72.21 |
| Ministral-3 (3B) | 71.64 | 72.32 | 69.66 | 71.95 | 73.21 |
| Qwen2.5-VL (7B) | 77.02 | 75.87 | 76.92 | 77.08 | 78.38 |
| Llama-3.2-Vision (11B) | 64.86 | 64.91 | 63.50 | 64.70 | 64.91 |

### 4\.4Ablation Study and Analysis

This section presents ablations and analyses to understand *how* SMMD improves numerical prediction. We focus on targeted controlled studies that isolate key design choices: we (i) ablate the MMD term and the smoothness regularizer to quantify their respective contributions, (ii) vary the kernel construction to test whether value-induced distance structure is essential beyond the MMD form itself, and (iii) examine robustness to the main hyperparameters ($\sigma$ and $\lambda$). We summarize the main takeaways below.

##### MMD and smoothness contribute in complementary regimes\.

Table [5](https://arxiv.org/html/2606.27731#S4.T5) ablates SMMD by enabling the MMD term and the smoothness regularizer individually or jointly (CE disables both). Overall, the full objective performs best: it achieves the top accuracy on GSM8K, SVAMP, and ChartQA, and is effectively tied on Arithmetic. The MMD term contributes most on language-only reasoning (GSM8K/SVAMP), while smoothness is especially helpful on Arithmetic and Clock-Time, where local numeric consistency matters. Clock-Time is the only notable exception where adding MMD on top of smoothness slightly lowers Acc. This likely reflects that the task favors very sharp bucket decisions, and the additional MMD coupling can mildly blur near-target alternatives once smoothness has already stabilized the residual. Taken together, these results suggest that MMD and smoothness play complementary roles, with their combination offering the most reliable gains across diverse numerical prediction settings.
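The two terms being switched on and off in this ablation are the quadratic forms $\mathbf{r}^{\top}\mathbf{K}\mathbf{r}$ and $\mathbf{r}^{\top}\mathbf{L}\mathbf{r}$ from Algorithm 2. The sketch below evaluates them on a digit vocabulary 0 to 9 with each term toggled by a flag. Several details are assumptions rather than the authors' implementation: the Gaussian form $\exp(-d^2/(2\sigma^2))$ of the kernel, the Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{K}$, the weight `alpha=1.0`, and all helper names.

```python
import math


def gaussian_kernel(values, sigma=2.0):
    # Assumed Gaussian form of the value-distance kernel.
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in values] for a in values]


def laplacian(K):
    # Assumed graph Laplacian L = D - K, with D the diagonal of row sums of K.
    n = len(K)
    return [[(sum(K[i]) if i == j else 0.0) - K[i][j] for j in range(n)] for i in range(n)]


def quad_form(r, A):
    n = len(r)
    return sum(r[i] * A[i][j] * r[j] for i in range(n) for j in range(n))


def smmd_terms(p, target_idx, K, L, use_mmd=True, use_smooth=True, alpha=1.0):
    # Residual between the predicted distribution and the one-hot target.
    r = [p_i - (1.0 if i == target_idx else 0.0) for i, p_i in enumerate(p)]
    loss = 0.0
    if use_mmd:
        loss += quad_form(r, K)
    if use_smooth:
        loss += alpha * quad_form(r, L)
    return loss


values = list(range(10))
K = gaussian_kernel(values)
L = laplacian(K)

# Two wrong predictions for target digit 5: all mass on 6 (near) or on 9 (far).
near = [1.0 if v == 6 else 0.0 for v in values]
far = [1.0 if v == 9 else 0.0 for v in values]
mmd_near = smmd_terms(near, 5, K, L, use_smooth=False)
mmd_far = smmd_terms(far, 5, K, L, use_smooth=False)
print(round(mmd_near, 4), round(mmd_far, 4))
```

Under these assumptions the MMD term is smaller for the near miss than for the far miss, whereas cross-entropy depends only on the probability assigned to the target token and cannot tell the two cases apart.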

![Refer to caption](https://arxiv.org/html/2606.27731v1/x3.png)
(a) Sensitivity to $\sigma$

![Refer to caption](https://arxiv.org/html/2606.27731v1/x4.png)
(b) Sensitivity to $\lambda$

Figure 3: Sensitivity analysis. (a) Sensitivity to $\sigma$ (single-kernel). Accuracy versus the Gaussian bandwidth $\sigma$ in Eq. ([3](https://arxiv.org/html/2606.27731#S3.E3)) with $\Sigma = \{\sigma\}$. Performance is typically highest near $\sigma = 2.0$; overly small or large bandwidths reduce effectiveness by making the kernel too local or too diffuse. (b) Sensitivity to $\lambda$. Accuracy as we vary the SMMD weight $\lambda$ in Eq. ([11](https://arxiv.org/html/2606.27731#S3.E11)). SMMD improves over $\lambda = 0$ across a broad range and typically peaks near $\lambda = 3$. GSM8K benefits from a larger $\lambda$ under its text-heavy reasoning trajectories.

Table 5: Effect of MMD and smoothness terms. We report performance when enabling the MMD term and the smoothness regularizer individually or jointly (SMMD), keeping all other settings fixed. Metric: exact-match accuracy (Acc, %) ↑.

| MMD | Smooth Reg | GSM8K | SVAMP | Arithmetic | Clock-Time | ChartQA |
|---|---|---|---|---|---|---|
| ✗ | ✗ | 54.97 | 68.60 | 51.83 | 46.39 | 71.44 |
| ✓ | ✗ | 57.16 | 70.10 | 52.37 | 68.61 | 71.95 |
| ✗ | ✓ | 56.48 | 68.70 | 53.84 | 75.14 | 71.95 |
| ✓ | ✓ | 57.77 | 71.10 | 53.83 | 74.30 | 72.21 |

Table 6: Kernel-structure ablation (MMD-only). We vary the kernel $\mathbf{K}$ while optimizing only the MMD term in Eq. ([7](https://arxiv.org/html/2606.27731#S3.E7)) (smoothness removed), so differences isolate the effect of distance-induced kernel construction. Metrics are exact-match accuracy (Acc, %) ↑.

| Kernel setting | GSM8K | SVAMP | Arithmetic | Clock-Time | ChartQA |
|---|---|---|---|---|---|
| Random PSD kernel | 55.65 | 69.90 | 51.76 | 62.01 | 71.64 |
| Shuffled mapping | 56.17 | 69.90 | 51.98 | 67.98 | 71.74 |
| Distance-induced (Ours) | 57.16 | 70.10 | 52.37 | 68.61 | 71.95 |

##### Many kernels help somewhat, but distance\-induced kernel helps the most\.

We ask whether the gains come from the *distance-induced structure* encoded in the kernel $\mathbf{K}$, rather than from introducing an MMD-style loss alone. To isolate kernel design, we run all variants with the *MMD-only* objective (Eq. ([7](https://arxiv.org/html/2606.27731#S3.E7))), i.e., we drop the smoothness term so performance differences can be attributed to how $\mathbf{K}$ is constructed. We compare three settings: (i) Distance-induced (Ours), where $K_{ij} = k(|v_i - v_j|)$ is built from true numerical distances; (ii) Random PSD kernel, which preserves the symmetric PSD structure but removes any link to numeric values (detailed in Appendix [C.4](https://arxiv.org/html/2606.27731#A3.SS4)); and (iii) Shuffled mapping, which keeps the same distance-based functional form but permutes the value→token correspondence, breaking semantic alignment while preserving the overall kernel shape. As shown in Table [6](https://arxiv.org/html/2606.27731#S4.T6), the distance-induced kernel is the only variant that is consistently best across all datasets, with particularly clear margins on Clock-Time. This pattern indicates that SMMD benefits from aligning distribution matching to *true numeric proximity*, rather than from a generic MMD regularization effect.
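The three kernel settings can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the Gaussian form of $k$ is assumed, the shuffled variant uses an arbitrary seeded permutation, and the random PSD matrix is built with the generic $AA^{\top}$ construction, which may differ from the procedure in Appendix C.4.

```python
import math
import random


def distance_kernel(values, sigma=2.0):
    # K_ij = k(|v_i - v_j|), here with an assumed Gaussian k.
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in values] for a in values]


def shuffled_kernel(values, sigma=2.0, seed=0):
    # Same functional form, but the value-to-token assignment is permuted,
    # so token i is treated as if it carried some other token's value.
    permuted = list(values)
    random.Random(seed).shuffle(permuted)
    return distance_kernel(permuted, sigma)


def random_psd_kernel(n, seed=0):
    # Generic symmetric PSD matrix A A^T (assumption; see Appendix C.4 for
    # the construction actually used in the paper).
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    return [[sum(A[i][k] * A[j][k] for k in range(n)) for j in range(n)] for i in range(n)]


values = list(range(10))
K_dist = distance_kernel(values)
K_shuf = shuffled_kernel(values)
K_rand = random_psd_kernel(len(values))
```

Only `K_dist` ties similarity to numeric closeness; `K_shuf` contains exactly the same entries rearranged, and `K_rand` shares nothing with the values beyond being symmetric and PSD.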

##### A single well\-chosen bandwidth is more important than multi\-kernel mixtures\.

We analyze the kernel bandwidth $\sigma$ in Eq. ([3](https://arxiv.org/html/2606.27731#S3.E3)), which controls how rapidly similarity decays with value distance and thus how local the induced structure is. Figure [3(a)](https://arxiv.org/html/2606.27731#S4.F3.sf1) reports results for the *single-kernel* setting ($\Sigma = \{\sigma\}$). Across datasets, performance is typically best around $\sigma = 2.0$: smaller values make the kernel overly local (rewarding only near-identical numbers), while larger values make it overly diffuse (blurring meaningful proximity). Although the exact optimum can shift mildly by task, $\sigma = 2.0$ is a strong and stable default that consistently improves over CE, so we use it throughout our main experiments. We additionally evaluated multi-kernel mixtures with bandwidths centered around 2.0 (Appendix [C.5](https://arxiv.org/html/2606.27731#A3.SS5)). However, these bring little benefit and can slightly underperform the best single-$\sigma$ choice, indicating that selecting a well-calibrated bandwidth matters more than increasing kernel complexity.
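The "too local" versus "too diffuse" behaviour can be seen by printing one kernel row for several bandwidths. The sketch assumes a Gaussian kernel of the form $\exp(-d^2/(2\sigma^2))$ over digit values 0 to 9; the exact parameterization of Eq. (3) is not reproduced in this section, so treat the scaling as an assumption.

```python
import math


def kernel_row(target, values, sigma):
    # Similarity of every value to `target` under an assumed Gaussian kernel.
    return [math.exp(-((v - target) ** 2) / (2 * sigma ** 2)) for v in values]


values = list(range(10))
rows = {sigma: kernel_row(5, values, sigma) for sigma in (0.5, 2.0, 8.0)}
for sigma, row in rows.items():
    print(f"sigma={sigma}:", [round(x, 3) for x in row])
```

With a small bandwidth even the adjacent digit receives little similarity, so the loss behaves almost like a categorical one; with a large bandwidth distant digits look nearly as similar as adjacent ones, so the kernel carries little distance information.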

##### SMMD is robust to $\lambda$ with a broad sweet spot.

We examine sensitivity to the SMMD weight $\lambda$ in the training objective in Eq. ([11](https://arxiv.org/html/2606.27731#S3.E11)). Figure [3(b)](https://arxiv.org/html/2606.27731#S4.F3.sf2) shows that SMMD improves over the $\lambda = 0$ baseline across tasks for a wide range of values, indicating that the method is not brittle to precise tuning. Performance typically peaks at a moderate $\lambda$: increasing $\lambda$ strengthens the numeric-target signal, while excessively large $\lambda$ can over-emphasize it and slightly hurt generalization. Across SVAMP and Clock-Time, the best performance is usually achieved around $\lambda = 3$, and we use this value throughout our main experiments on all datasets. GSM8K is a slight exception, where gains continue to accrue at larger $\lambda$, consistent with its longer, text-heavy reasoning trajectories in which numeric targets form a smaller fraction of tokens.
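For reference, the objective being weighted can be written out from Algorithm 2 in Appendix A. The notation follows that algorithm: $M$ is the number of numeric target positions in the minibatch and $\mathbf{r}_t = \mathbf{p}_t - \mathbf{e}_{\pi(y_t)}$ is the prediction-target residual over the numeric sub-vocabulary.

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{SMMD}},
\qquad
\mathcal{L}_{\mathrm{SMMD}} = \frac{1}{\max(1, M)}
\sum_{t \,:\, y_t \in \mathcal{V}_{\mathrm{num}}}
\left( \mathbf{r}_t^{\top}\mathbf{K}\,\mathbf{r}_t
     + \alpha\,\mathbf{r}_t^{\top}\mathbf{L}\,\mathbf{r}_t \right)
```

Setting $\lambda = 0$ recovers plain cross-entropy, which is the baseline in Figure 3(b).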

##### SMMD incurs only modest training overhead\.

Finally, we measure the end-to-end training overhead of SMMD under a controlled GSM8K setup. Although SMMD involves quadratic forms $r^{\top}Kr$ and $r^{\top}Lr$ over the numeric sub-vocabulary, both $K$ and $L$ are precomputed once, and the numeric vocabulary is small in practice. As shown in Table [7](https://arxiv.org/html/2606.27731#S4.T7), SMMD adds only a small wall-clock overhead over CE: 3.39% on Qwen2.5-1.5B with a digit-level tokenizer and 2.18% on SmolLM3-3B with a multi-digit tokenizer. This overhead is substantially lower than NTIL, whose sequence-level reconstruction incurs much higher latency. The memory increase is comparable to other numeric-aware losses, suggesting that SMMD remains practical for downstream fine-tuning.
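The quoted percentages follow directly from the mean step times in Table 7. The short script below reproduces the arithmetic; the dictionary layout and function name are illustrative choices, while the numbers are copied from the table.

```python
# Mean step time in milliseconds, taken from Table 7.
step_ms = {
    "Qwen2.5-1.5B": {"CE": 322.29, "NTL": 357.77, "NTIL": 929.07, "SMMD": 333.21},
    "SmolLM3-3B": {"CE": 439.82, "NTL": 468.98, "NTIL": 924.49, "SMMD": 449.40},
}


def overhead_pct(model, loss):
    """Relative step-time increase of `loss` over CE, in percent."""
    base = step_ms[model]["CE"]
    return 100.0 * (step_ms[model][loss] - base) / base


for model in step_ms:
    for loss in ("NTL", "NTIL", "SMMD"):
        print(f"{model} {loss}: +{overhead_pct(model, loss):.2f}%")
```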

Table 7: End-to-end training overhead on GSM8K. We report average step time and peak allocated GPU memory on a single NVIDIA L20 GPU. For each loss, we run 120 training steps, discard the first 20 warm-up steps, and report mean step time over 3 repeated runs.

| Model | Loss | Step time (ms) ↓ | Mem. (GB) ↓ |
|---|---|---|---|
| Qwen2.5-1.5B | CE | 322.29 ± 2.97 | 8.02 |
| | NTL | 357.77 ± 0.77 | 10.27 |
| | NTIL | 929.07 ± 3.19 | 10.27 |
| | SMMD | 333.21 ± 1.93 | 10.28 |
| SmolLM3-3B | CE | 439.82 ± 0.74 | 7.05 |
| | NTL | 468.98 ± 2.92 | 8.70 |
| | NTIL | 924.49 ± 0.84 | 8.71 |
| | SMMD | 449.40 ± 0.48 | 8.71 |

## 5Conclusion

We studied numerical prediction in LLMs through the lens of objective mismatch: although numbers exhibit an inherent metric structure, standard cross-entropy treats numeric tokens as unstructured categories. To bridge this gap, we introduced SMMD, a plug-and-play training loss that constructs a distance-induced kernel over a numeric sub-vocabulary, aligns the predicted numeric distribution with the target via kernel MMD, and regularizes the prediction–target residual on the induced kernel graph. Across diverse language-only and vision-language tasks with numeric targets, SMMD consistently improves numerical accuracy over cross-entropy and strong numeric-target baselines. Ablations further confirm that our kernel design and smoothness term contribute complementary benefits.

Despite these gains, SMMD has limitations in two respects. First, its notion of distance is defined over token-level numeric units and is therefore constrained by the tokenizer; under digit-level tokenization, nearby integers can differ across multiple digits, and a per-token objective may over-penalize such near-miss cases. This distance-based formulation also assumes that numerical proximity is semantically meaningful for the supervised target, an assumption that may not hold when numbers primarily serve as identifiers or symbolic labels, such as category IDs or codes. Second, SMMD introduces additional hyperparameters: the loss weight $\lambda$ and the kernel bandwidth $\sigma$. While our defaults work well broadly, optimal settings can vary by task, motivating adaptive hyperparameter selection in future work.
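The first limitation is easy to see on a pair such as 199 and 200, which are one unit apart in value but differ in every digit. The sketch below compares the value gap with the per-position digit gaps; it assumes non-negative integers and a one-token-per-digit tokenizer, and the helper name is hypothetical.

```python
def per_digit_gaps(a, b):
    """Absolute difference at each digit position (non-negative integers only)."""
    width = max(len(str(a)), len(str(b)))
    da, db = str(a).zfill(width), str(b).zfill(width)
    return [abs(int(x) - int(y)) for x, y in zip(da, db)]


value_gap = abs(199 - 200)
digit_gaps = per_digit_gaps(199, 200)
print(value_gap, digit_gaps)  # 1 [1, 9, 9]
```

A loss that measures distance per digit token sees two maximally distant digits in the last two positions, even though the numbers themselves are adjacent.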

## Acknowledgements

This work was supported by the National Key R&D Program of China (2024YFB3312503) and the Science and Technology Major Project of Sichuan Province (2024ZDZX0003, 2025ZDZX0140). We also acknowledge the support of Sichuan Province Engineering Technology Research Center of Broadband Electronics Intelligent Manufacturing.

## Impact Statement

This work improves the numerical reliability of large language models by introducing a training objective for numeric tokens, which can benefit applications where accurate quantities matter such as education and data analysis\. While the method reduces common numerical errors, it does not guarantee correctness in all cases and may increase the perceived authority of generated numbers\. Therefore, outputs should be validated in high\-stakes settings and deployed with standard safeguards such as human oversight and monitoring\.

## References

- S\. Bai, K\. Chen, X\. Liu, J\. Wang, W\. Ge, S\. Song, K\. Dang, P\. Wang, S\. Wang, J\. Tang, H\. Zhong, Y\. Zhu, M\. Yang, Z\. Li, J\. Wan, P\. Wang, W\. Ding, Z\. Fu, Y\. Xu, J\. Ye, X\. Zhang, T\. Xie, Z\. Cheng, H\. Zhang, Z\. Yang, H\. Xu, J\. Lin,et al\.\(2025\)Qwen2\.5\-vl technical report\.External Links:2502\.13923,[Link](https://arxiv.org/abs/2502.13923)Cited by:[§4\.1](https://arxiv.org/html/2606.27731#S4.SS1.SSS0.Px2.p1.1)\.
- E\. Bakouch, L\. Ben Allal, A\. Lozhkov, N\. Tazi, L\. Tunstall, C\. M\. Patiño, E\. Beeching, A\. Roucher, A\. J\. Reedi, Q\. Gallouédec, K\. Rasul, N\. Habib, C\. Fourrier, H\. Kydlicek, G\. Penedo, H\. Larcher, M\. Morlon, V\. Srivastav, J\. Lochner, X\. Nguyen, C\. Raffel, L\. von Werra, T\. Wolf,et al\.\(2025\)SmolLM3: smol, multilingual, long\-context reasoner\.Note:[https://huggingface\.co/blog/smollm3](https://huggingface.co/blog/smollm3)Cited by:[§4\.1](https://arxiv.org/html/2606.27731#S4.SS1.SSS0.Px2.p1.1)\.
- J\. Born and M\. Manica \(2023\)Regression transformer enables concurrent sequence regression and generation for molecular language modelling\.Nature Machine Intelligence5\(4\),pp\. 432–444\.Cited by:[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Chung, S\. Kim, Y\. Jo, J\. Park, D\. Min, and Y\. Yu \(2026\)Teaching metric distance to discrete autoregressive language models\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=s0zLtkY7iu)Cited by:[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px1.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.External Links:2110\.14168,[Link](https://arxiv.org/abs/2110.14168)Cited by:[§1](https://arxiv.org/html/2606.27731#S1.p1.1),[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.27731#S4.SS1.SSS0.Px1.p1.1)\.
- M\. H\. Daniel Han and U\. team \(2023\)Unsloth\.External Links:[Link](http://github.com/unslothai/unsloth)Cited by:[§C\.3](https://arxiv.org/html/2606.27731#A3.SS3.p1.4)\.
- X\. Fei, J\. Lu, Q\. Sun, H\. Feng, Y\. Wang, W\. Shi, A\. Wang, J\. Tang, and C\. Huang \(2025\)Advancing sequential numerical prediction in autoregressive models\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),Vienna, Austria,pp\. 562–574\.External Links:[Link](https://aclanthology.org/2025.acl-short.44/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-short.44)Cited by:[§C\.2\.2](https://arxiv.org/html/2606.27731#A3.SS2.SSS2.p1.7),[§1](https://arxiv.org/html/2606.27731#S1.p3.1),[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px1.p1.1),[§4\.2](https://arxiv.org/html/2606.27731#S4.SS2.p4.1)\.
- L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2024\)The language model evaluation harness\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.12608602),[Link](https://zenodo.org/records/12608602)Cited by:[§C\.8](https://arxiv.org/html/2606.27731#A3.SS8.p2.1)\.
- L\. Gao, A\. Madaan, S\. Zhou, U\. Alon, P\. Liu, Y\. Yang, J\. Callan, and G\. Neubig \(2023\)PAL: program\-aided language models\.External Links:2211\.10435,[Link](https://arxiv.org/abs/2211.10435)Cited by:[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Geva, A\. Gupta, and J\. Berant \(2020\)Injecting numerical reasoning skills into language models\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,Online,pp\. 946–958\.External Links:[Link](https://aclanthology.org/2020.acl-main.89/),[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.89)Cited by:[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px1.p1.1)\.
- B\. Gliwa, I\. Mochol, M\. Biesek, and A\. Wawer \(2019\)SAMSum corpus: a human\-annotated dialogue dataset for abstractive summarization\.InProceedings of the 2nd Workshop on New Frontiers in Summarization,L\. Wang, J\. C\. K\. Cheung, G\. Carenini, and F\. Liu \(Eds\.\),Hong Kong, China,pp\. 70–79\.External Links:[Link](https://aclanthology.org/D19-5409/),[Document](https://dx.doi.org/10.18653/v1/D19-5409)Cited by:[§C\.8](https://arxiv.org/html/2606.27731#A3.SS8.p1.2)\.
- S\. Golkar, M\. Pettee, M\. Eickenberg, A\. Bietti, M\. Cranmer, G\. Krawezik, F\. Lanusse, M\. McCabe, R\. Ohana, L\. Parker, B\. R\. Blancard, T\. Tesileanu, K\. Cho, and S\. Ho \(2024\)XVal: a continuous numerical tokenization for scientific language models\.External Links:2310\.02989,[Link](https://arxiv.org/abs/2310.02989)Cited by:[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px1.p1.1)\.
- gpiosenka \(2022\)TIME\.Note:Kaggle DatasetExternal Links:[Link](https://www.kaggle.com/datasets/gpiosenka/time-image-datasetclassification)Cited by:[§4\.1](https://arxiv.org/html/2606.27731#S4.SS1.SSS0.Px1.p1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian,et al\.\(2024\)The llama 3 herd of models\.External Links:2407\.21783,[Link](https://arxiv.org/abs/2407.21783)Cited by:[§4\.1](https://arxiv.org/html/2606.27731#S4.SS1.SSS0.Px2.p1.1)\.
- A\. Gretton, K\. M\. Borgwardt, M\. J\. Rasch, B\. Schölkopf, and A\. Smola \(2012\)A kernel two\-sample test\.The journal of machine learning research13\(1\),pp\. 723–773\.Cited by:[§1](https://arxiv.org/html/2606.27731#S1.p4.1),[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2606.27731#S3.SS2.p1.6)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu,et al\.\(2025\)DeepSeek\-r1 incentivizes reasoning in llms through reinforcement learning\.Nature645\(8081\),pp\. 633–638\.External Links:[Document](https://dx.doi.org/10.1038/s41586-025-09422-z),[Link](https://www.nature.com/articles/s41586-025-09422-z)Cited by:[§1](https://arxiv.org/html/2606.27731#S1.p1.1)\.
- Y\. Guo, M\. Luo, W\. Zhang, P\. Liu, J\. Liu, S\. Huang, J\. Lv, B\. Ke, and X\. Liu \(2026\)Few\-shot molecular property optimization via a domain\-specialized large language model\.Chemical Science17,pp\. 4928–4941\.External Links:[Document](https://dx.doi.org/10.1039/D5SC08859C),[Link](https://doi.org/10.1039/D5SC08859C)Cited by:[§1](https://arxiv.org/html/2606.27731#S1.p1.1)\.
- K\. Kafle, B\. Price, S\. Cohen, and C\. Kanan \(2018\)DVQA: understanding data visualizations via question answering\.In2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition,Vol\.,pp\. 5648–5656\.External Links:[Document](https://dx.doi.org/10.1109/CVPR.2018.00592)Cited by:[§1](https://arxiv.org/html/2606.27731#S1.p1.1)\.
- C\. Li, W\. Chang, Y\. Cheng, Y\. Yang, and B\. Póczos \(2017\)MMD GAN: towards deeper understanding of moment matching network\.Cited by:[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Li, K\. Swersky, and R\. Zemel \(2015\)Generative moment matching networks\.InProceedings of the 32nd International Conference on International Conference on Machine Learning \- Volume 37,ICML’15,pp\. 1718–1727\.Cited by:[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px2.p1.1)\.
- A\. H\. Liu, K\. Khandelwal, S\. Subramanian, V\. Jouault, A\. Rastogi, A\. Sadé,et al\.\(2026\)Ministral 3\.External Links:2601\.08584,[Link](https://arxiv.org/abs/2601.08584)Cited by:[§4\.1](https://arxiv.org/html/2606.27731#S4.SS1.SSS0.Px2.p1.1)\.
- M\. Long, Y\. Cao, J\. Wang, and M\. Jordan \(2015\)Learning transferable features with deep adaptation networks\.InProceedings of the 32nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.37,Lille, France,pp\. 97–105\.External Links:[Link](https://proceedings.mlr.press/v37/long15.html)Cited by:[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Masry, D\. Long, J\. Q\. Tan, S\. Joty, and E\. Hoque \(2022\)ChartQA: a benchmark for question answering about charts with visual and logical reasoning\.InFindings of the Association for Computational Linguistics: ACL 2022,Dublin, Ireland,pp\. 2263–2279\.External Links:[Link](https://aclanthology.org/2022.findings-acl.177),[Document](https://dx.doi.org/10.18653/v1/2022.findings-acl.177)Cited by:[§1](https://arxiv.org/html/2606.27731#S1.p1.1),[§4\.1](https://arxiv.org/html/2606.27731#S4.SS1.SSS0.Px1.p1.1)\.
- Meta \(2024\)Meta\-llama/llama\-3\.2\-11b\-vision: model card\.Note:[https://huggingface\.co/meta\-llama/Llama\-3\.2\-11B\-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision)Cited by:[§4\.1](https://arxiv.org/html/2606.27731#S4.SS1.SSS0.Px2.p1.1)\.
- N\. Methani, P\. Ganguly, M\. M\. Khapra, and P\. Kumar \(2020\)PlotQA: reasoning over scientific plots\.In2020 IEEE Winter Conference on Applications of Computer Vision \(WACV\),Vol\.,pp\. 1516–1525\.External Links:[Document](https://dx.doi.org/10.1109/WACV45572.2020.9093523)Cited by:[§1](https://arxiv.org/html/2606.27731#S1.p1.1)\.
- S\. Minaee, T\. Mikolov, N\. Nikzad, M\. Chenaghlu, R\. Socher, X\. Amatriain, and J\. Gao \(2025\)Large language models: a survey\.External Links:2402\.06196,[Link](https://arxiv.org/abs/2402.06196)Cited by:[§1](https://arxiv.org/html/2606.27731#S1.p1.1)\.
- OpenAI \(2023\)GPT\-4 technical report\.External Links:2303\.08774,[Link](https://arxiv.org/abs/2303.08774)Cited by:[§1](https://arxiv.org/html/2606.27731#S1.p1.1)\.
- A\. Patel, S\. Bhattamishra, and N\. Goyal \(2021\)Are NLP models really able to solve simple math word problems?\.InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,Online,pp\. 2080–2094\.External Links:[Link](https://aclanthology.org/2021.naacl-main.168/),[Document](https://dx.doi.org/10.18653/v1/2021.naacl-main.168)Cited by:[§4\.1](https://arxiv.org/html/2606.27731#S4.SS1.SSS0.Px1.p1.1)\.
- D\. Petrak, N\. S\. Moosavi, and I\. Gurevych \(2023\)Arithmetic\-based pretraining improving numeracy of pretrained language models\.InProceedings of the 12th Joint Conference on Lexical and Computational Semantics \(\*SEM 2023\),A\. Palmer and J\. Camacho\-collados \(Eds\.\),Toronto, Canada,pp\. 477–493\.External Links:[Link](https://aclanthology.org/2023.starsem-1.42/),[Document](https://dx.doi.org/10.18653/v1/2023.starsem-1.42)Cited by:[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Saxena, A\. P\. Gema, and P\. Minervini \(2025\)Lost in time: clock and calendar understanding challenges in multimodal llms\.Cited by:[§1](https://arxiv.org/html/2606.27731#S1.p1.1)\.
- D\. Saxton, E\. Grefenstette, F\. Hill, and P\. Kohli \(2019\)Analysing mathematical reasoning abilities of neural models\.External Links:1904\.01557,[Link](https://arxiv.org/abs/1904.01557)Cited by:[§4\.1](https://arxiv.org/html/2606.27731#S4.SS1.SSS0.Px1.p1.1)\.
- A\. K\. Singh and D\. Strouse \(2024\)Tokenization counts: the impact of tokenization on arithmetic in frontier llms\.External Links:2402\.14903,[Link](https://arxiv.org/abs/2402.14903)Cited by:[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px1.p1.1)\.
- G\. Spithourakis and S\. Riedel \(2018\)Numeracy for language models: evaluating and improving their ability to predict numbers\.InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),I\. Gurevych and Y\. Miyao \(Eds\.\),Melbourne, Australia,pp\. 2104–2115\.External Links:[Link](https://aclanthology.org/P18-1196/),[Document](https://dx.doi.org/10.18653/v1/P18-1196)Cited by:[§1](https://arxiv.org/html/2606.27731#S1.p1.1),[§1](https://arxiv.org/html/2606.27731#S1.p2.1),[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px1.p1.1)\.
- Q\. Team \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§4\.1](https://arxiv.org/html/2606.27731#S4.SS1.SSS0.Px2.p1.1)\.
- I\. Tolstikhin, O\. Bousquet, S\. Gelly, and B\. Schoelkopf \(2017\)Wasserstein auto\-encoders\.arXiv preprint arXiv:1711\.01558\.Cited by:[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Wang, R\. Qin, M\. Wang, M\. Fang, Y\. Zhang, Y\. Zhu, Q\. Su, Q\. Gou, C\. Shen, O\. Zhang,et al\.\(2025\)Token\-mol 1\.0: tokenized drug design with large language models\.Nature Communications16\(1\),pp\. 4416\.Cited by:[§4\.2](https://arxiv.org/html/2606.27731#S4.SS2.p2.5),[§4\.2](https://arxiv.org/html/2606.27731#S4.SS2.p2.8)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. Le, and D\. Zhou \(2023\)Chain\-of\-thought prompting elicits reasoning in large language models\.External Links:2201\.11903,[Link](https://arxiv.org/abs/2201.11903)Cited by:[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Wolf, L\. Debut, V\. Sanh, J\. Chaumond, C\. Delangue, A\. Moi, P\. Cistac, T\. Rault, R\. Louf, M\. Funtowicz, J\. Davison, S\. Shleifer, P\. von Platen, C\. Ma, Y\. Jernite, J\. Plu, C\. Xu, T\. Le Scao, S\. Gugger, M\. Drame, Q\. Lhoest, and A\. Rush \(2020\)Transformers: state\-of\-the\-art natural language processing\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations,Q\. Liu and D\. Schlangen \(Eds\.\),Online,pp\. 38–45\.External Links:[Link](https://aclanthology.org/2020.emnlp-demos.6/),[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-demos.6)Cited by:[§C\.3](https://arxiv.org/html/2606.27731#A3.SS3.p1.4)\.
- J\. Zausinger, L\. Pennig, A\. Kozina, S\. Sdahl, J\. Sikora, A\. Dendorfer, T\. Kuznetsov, M\. Hagog, N\. Wiedemann, K\. Chlodny, V\. Limbach, A\. Ketteler, T\. Prein, V\. M\. Singh, M\. Danziger, and J\. Born \(2025\)Regress, don’t guess – a regression\-like loss on number tokens for language models\.InProc\. of the 42nd International Conference on Machine Learning \(ICML\),External Links:[Link](https://ibm.biz/ntl-main)Cited by:[§C\.2\.1](https://arxiv.org/html/2606.27731#A3.SS2.SSS1.p1.7),[§1](https://arxiv.org/html/2606.27731#S1.p1.1),[§1](https://arxiv.org/html/2606.27731#S1.p3.1),[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px1.p1.1),[§4\.2](https://arxiv.org/html/2606.27731#S4.SS2.p3.1)\.
- D\. Zhang\-Li, N\. Lin, J\. Yu, Z\. Zhang, Z\. Yao, X\. Zhang, L\. Hou, J\. Zhang, and J\. Li \(2024\)Reverse that number\! decoding order matters in arithmetic learning\.External Links:2403\.05845,[Link](https://arxiv.org/abs/2403.05845)Cited by:[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Zhao, J\. Song, and S\. Ermon \(2018\)InfoVAE: information maximizing variational autoencoders\.External Links:1706\.02262,[Link](https://arxiv.org/abs/1706.02262)Cited by:[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px2.p1.1)\.
- T\. Zhou, D\. Fu, M\. Soltanolkotabi, R\. Jia, and V\. Sharan \(2026\)FoNE: precise single\-token number embeddings via fourier features\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=g0vtWmwDDh)Cited by:[§2](https://arxiv.org/html/2606.27731#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Zuo, Y\. Gan, J\. Long, and X\. Liu \(2025\)CAD\-HLLM: generating executable CAD from text with hierarchical LLM planning\.InProceedings of the 17th Asian Conference on Machine Learning,H\. Lee and T\. Liu \(Eds\.\),Proceedings of Machine Learning Research, Vol\.304,pp\. 958–973\.External Links:[Link](https://proceedings.mlr.press/v304/zuo25a.html)Cited by:[§1](https://arxiv.org/html/2606.27731#S1.p1.1)\.

## Appendix A Algorithms

**Algorithm 1** Constructing the numeric sub-vocabulary $\mathcal{V}_{\mathrm{num}}$

**Require:** token vocabulary $\mathcal{V}$; deterministic numeric parser $\mathrm{parse}(\cdot)$ (float casting)

**Ensure:** numeric sub-vocabulary $\mathcal{V}_{\mathrm{num}}$; value map $\{v_i\}_{i=1}^{N}$; index map $\pi$

- $\mathcal{V}_{\mathrm{num}} \leftarrow \emptyset$
- Initialize an empty map $\mathrm{val}[\cdot]$
- **for all** $t \in \mathcal{V}$ **do**
  - **if** $\mathrm{parse}(t)$ succeeds **then**
    - $\mathrm{val}[t] \leftarrow \mathrm{parse}(t)$
    - $\mathcal{V}_{\mathrm{num}} \leftarrow \mathcal{V}_{\mathrm{num}} \cup \{t\}$
  - **end if**
- **end for**
- Choose any bijection $\pi: \mathcal{V}_{\mathrm{num}} \to \{1,\dots,N\}$ with $N = |\mathcal{V}_{\mathrm{num}}|$
- **for** $i = 1$ **to** $N$ **do**
  - $v_i \leftarrow \mathrm{val}\big[\pi^{-1}(i)\big]$
- **end for**
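Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `build_numeric_subvocab` is hypothetical, the finite-value filter is an added assumption (Python's `float` also accepts strings such as `"nan"` and `"inf"`), and sorting by value is just one convenient choice of the bijection $\pi$. Real tokenizers may also attach whitespace markers to tokens, which this sketch does not handle.

```python
import math
from typing import Dict, Iterable, List, Tuple


def build_numeric_subvocab(
    vocab: Iterable[str],
) -> Tuple[List[str], List[float], Dict[str, int]]:
    """Return (numeric tokens, values v_i, index map pi) in the spirit of Algorithm 1."""
    val: Dict[str, float] = {}
    for token in vocab:
        try:
            value = float(token)          # parse(.) realized as float casting
        except ValueError:
            continue                      # parse failed: token is not numeric
        if math.isfinite(value):          # assumption: drop "nan" / "inf" strings
            val[token] = value
    # Any bijection works; sorting by value is an arbitrary but reproducible choice.
    tokens = sorted(val, key=lambda t: (val[t], t))
    pi = {t: i for i, t in enumerate(tokens)}   # 0-based here, 1-based in Algorithm 1
    values = [val[t] for t in tokens]
    return tokens, values, pi


tokens, values, pi = build_numeric_subvocab(["the", "7", "3.5", "cat", "10", "nan"])
```

In this toy vocabulary only `"3.5"`, `"7"`, and `"10"` survive, and `pi` maps each of them to its row and column in the kernel matrix.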

**Algorithm 2** Training with SMMD

**Require:** training data $\mathcal{D}$; LM $p_{\theta}$; numeric sub-vocabulary $\mathcal{V}_{\mathrm{num}}$ with index map $\pi$; precomputed $\mathbf{K}$, $\mathbf{L}$, and $\alpha$; weight $\lambda$

**Ensure:** trained parameters $\theta$

- **while** not converged **do**
  - Sample a minibatch and run a forward pass to obtain logits $\{\boldsymbol{\ell}_t\}$
  - Compute $\mathcal{L}_{\mathrm{CE}}$ over the full vocabulary $\mathcal{V}$
  - Initialize $S \leftarrow 0$, $M \leftarrow 0$ ($S$: loss sum, $M$: number of numeric positions)
  - **for all** token positions $t$ in the minibatch **do**
    - **if** $y_t \in \mathcal{V}_{\mathrm{num}}$ **then**
      - $\mathbf{p}_t \leftarrow \mathrm{softmax}\big(\boldsymbol{\ell}_t[\mathcal{V}_{\mathrm{num}}]\big)$, $\mathbf{r}_t \leftarrow \mathbf{p}_t - \mathbf{e}_{\pi(y_t)}$
      - $S \leftarrow S + \mathbf{r}_t^{\top}\mathbf{K}\mathbf{r}_t + \alpha\,\mathbf{r}_t^{\top}\mathbf{L}\mathbf{r}_t$
      - $M \leftarrow M + 1$
    - **end if**
  - **end for**
  - $\mathcal{L}_{\mathrm{SMMD}} \leftarrow \dfrac{S}{\max(1, M)}$
  - Update $\theta$ by minimizing $\mathcal{L} \leftarrow \mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{SMMD}}$
- **end while**
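The inner loop of Algorithm 2 can be illustrated with NumPy. This is a forward-pass sketch under stated assumptions, not the released implementation: `gaussian_kernel` assumes the form $K_{ij}=\exp(-(v_i-v_j)^2/(2\sigma^2))$, the function names are hypothetical, and `alpha=0.1` is an arbitrary illustrative value. In training, the same computation would be written in an autodiff framework and added to cross-entropy with weight $\lambda$.

```python
import numpy as np


def gaussian_kernel(values, sigma=2.0):
    """Assumed kernel: K_ij = exp(-(v_i - v_j)^2 / (2 sigma^2))."""
    v = np.asarray(values, dtype=float)
    return np.exp(-((v[:, None] - v[None, :]) ** 2) / (2.0 * sigma ** 2))


def smmd_loss(logits, targets, num_ids, K, L, alpha):
    """Average r^T K r + alpha * r^T L r over positions whose target is numeric.

    logits: (T, V) array, targets: (T,) token ids, num_ids: (N,) token ids of V_num.
    """
    index_of = {int(tok): i for i, tok in enumerate(num_ids)}  # index map pi
    total, count = 0.0, 0
    for t, y in enumerate(targets):
        i = index_of.get(int(y))
        if i is None:            # non-numeric target: SMMD is inactive here
            continue
        z = logits[t, num_ids]   # restrict logits to the numeric sub-vocabulary
        p = np.exp(z - z.max())
        p /= p.sum()             # softmax over V_num only
        r = p.copy()
        r[i] -= 1.0              # residual r = p - e_{pi(y)}
        total += r @ K @ r + alpha * (r @ L @ r)
        count += 1
    return total / max(1, count)


values = [0.0, 1.0, 2.0, 3.0]            # values of four hypothetical numeric tokens
num_ids = np.array([1, 2, 3, 4])         # their ids in a toy 6-token vocabulary
K = gaussian_kernel(values)
L = np.diag(K.sum(axis=1)) - K           # graph Laplacian L = D - K
logits = np.zeros((2, 6))
logits[0, 3] = 50.0                      # position 0 is confident in token id 3
loss_good = smmd_loss(logits, np.array([3, 0]), num_ids, K, L, alpha=0.1)
loss_bad = smmd_loss(logits, np.array([1, 0]), num_ids, K, L, alpha=0.1)
```

Here `loss_good` is essentially zero because the confident prediction matches the target, while `loss_bad` is positive; position 1 has a non-numeric target in both calls and is skipped.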

## Appendix B Mathematical Derivations of SMMD Objectives

This appendix provides detailed step-by-step derivations for the two components of our numeric-aware loss: the discrete MMD alignment (Eq. 14) and the smooth regularization via Dirichlet energy (Eq. 10).

### B.1 Derivation of Discrete MMD for Numeric Tokens

We show how the general MMD definition reduces to the quadratic form $\mathbf{r}^{\top}\mathbf{K}\mathbf{r}$ on a finite numeric vocabulary.

##### 1\. General Expectation Form\.

For two distributions $P$ and $Q$ in an RKHS $\mathcal{H}$ with kernel $k(\cdot,\cdot)$, the squared MMD is:

$$\mathrm{MMD}^{2}(P,Q)=\mathbb{E}_{x,x'\sim P}[k(x,x')]-2\,\mathbb{E}_{x\sim P,\,y\sim Q}[k(x,y)]+\mathbb{E}_{y,y'\sim Q}[k(y,y')]. \tag{13}$$

##### 2\. Discrete Summation\.

In our setting, the domain is the finite sub-vocabulary $\mathcal{V}_{\mathrm{num}}$ of size $N$. The distributions are discrete vectors $\mathbf{p},\mathbf{q}\in\Delta^{N}$. The expectations become double summations over indices $i$ and $j$:

$$\mathcal{L}_{\text{MMD}}=\sum_{i=1}^{N}\sum_{j=1}^{N}p_{i}p_{j}K_{ij}-2\sum_{i=1}^{N}\sum_{j=1}^{N}p_{i}q_{j}K_{ij}+\sum_{i=1}^{N}\sum_{j=1}^{N}q_{i}q_{j}K_{ij}. \tag{14}$$

##### 3\. Residual Vectorization\.

Exploiting the linearity of summation and the symmetry of $\mathbf{K}$, we factorize the expression using the prediction–target residual $r_{i}=p_{i}-q_{i}$:

$$\begin{aligned}
\mathcal{L}_{\text{MMD}} &= \sum_{i=1}^{N}\sum_{j=1}^{N}K_{ij}\,(p_{i}p_{j}-p_{i}q_{j}-q_{i}p_{j}+q_{i}q_{j}) && \text{(15)}\\
&= \sum_{i=1}^{N}\sum_{j=1}^{N}K_{ij}\,(p_{i}-q_{i})(p_{j}-q_{j})=\sum_{i=1}^{N}\sum_{j=1}^{N}K_{ij}\,r_{i}r_{j}. && \text{(16)}
\end{aligned}$$

This yields the compact quadratic form:

$$\mathcal{L}_{\text{MMD}}=\mathbf{r}^{\top}\mathbf{K}\mathbf{r}. \tag{17}$$
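As a small worked example of the quadratic form (an illustration with hypothetical values, not taken from the experiments), consider $N=2$ numeric tokens with kernel matrix $\mathbf{K}$ having unit diagonal and off-diagonal entry $\kappa\in[0,1]$, a one-hot target on the first token, and a prediction that leaks mass $\varepsilon$ to the second token:

$$\mathbf{K}=\begin{pmatrix}1&\kappa\\ \kappa&1\end{pmatrix},\qquad \mathbf{q}=\begin{pmatrix}1\\0\end{pmatrix},\qquad \mathbf{p}=\begin{pmatrix}1-\varepsilon\\ \varepsilon\end{pmatrix},\qquad \mathbf{r}=\mathbf{p}-\mathbf{q}=\begin{pmatrix}-\varepsilon\\ \varepsilon\end{pmatrix}.$$

$$\mathbf{r}^{\top}\mathbf{K}\mathbf{r}=\varepsilon^{2}-2\kappa\,\varepsilon^{2}+\varepsilon^{2}=2\,\varepsilon^{2}\,(1-\kappa).$$

The penalty vanishes as $\kappa\to 1$ (the two tokens are treated as numerically interchangeable) and is largest when $\kappa=0$ (the tokens are treated as unrelated), so misplaced mass costs less when it lands on a token the kernel considers close to the target.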

### B.2 Derivation of Graph Laplacian Form for Smooth Regularization

We now derive the matrix representation for the Dirichlet energy $\mathcal{L}_{\text{smooth}}$, which penalizes local variations of the residual $\mathbf{r}$ over the similarity graph $\mathbf{K}$.

##### 1\. Expansion of Dirichlet Energy\.

Starting from the pairwise difference penalty in Eq. (9):

$$\mathcal{L}_{\text{smooth}}=\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}K_{ij}\,(r_{i}-r_{j})^{2}. \tag{18}$$

Expanding the quadratic term $(r_{i}-r_{j})^{2}=r_{i}^{2}-2r_{i}r_{j}+r_{j}^{2}$:

$$\mathcal{L}_{\text{smooth}}=\frac{1}{2}\left(\sum_{i,j}K_{ij}r_{i}^{2}-2\sum_{i,j}K_{ij}r_{i}r_{j}+\sum_{i,j}K_{ij}r_{j}^{2}\right). \tag{19}$$

##### 2\. Relation to the Degree Matrix\.

For the first term, summing over $j$ yields the degree of node $i$: $\mathrm{deg}_{i}=\sum_{j=1}^{N}K_{ij}$. Thus:

$$\sum_{i=1}^{N}\sum_{j=1}^{N}K_{ij}r_{i}^{2}=\sum_{i=1}^{N}r_{i}^{2}\left(\sum_{j=1}^{N}K_{ij}\right)=\sum_{i=1}^{N}\mathrm{deg}_{i}\,r_{i}^{2}=\mathbf{r}^{\top}\mathbf{D}\mathbf{r}, \tag{20}$$

where $\mathbf{D}=\mathrm{diag}(\mathrm{deg}_{1},\dots,\mathrm{deg}_{N})$. Due to the symmetry of $\mathbf{K}$, the third term $\sum_{i,j}K_{ij}r_{j}^{2}$ also equals $\mathbf{r}^{\top}\mathbf{D}\mathbf{r}$.

##### 3\. Final Laplacian Form\.

Substituting these into the expression:

$$\begin{aligned}
\mathcal{L}_{\text{smooth}} &= \frac{1}{2}\left(\mathbf{r}^{\top}\mathbf{D}\mathbf{r}-2\,\mathbf{r}^{\top}\mathbf{K}\mathbf{r}+\mathbf{r}^{\top}\mathbf{D}\mathbf{r}\right) && \text{(21)}\\
&= \mathbf{r}^{\top}\mathbf{D}\mathbf{r}-\mathbf{r}^{\top}\mathbf{K}\mathbf{r}=\mathbf{r}^{\top}(\mathbf{D}-\mathbf{K})\mathbf{r}. && \text{(22)}
\end{aligned}$$

Defining the graph Laplacian $\mathbf{L}=\mathbf{D}-\mathbf{K}$, we obtain the final form:

$$\mathcal{L}_{\text{smooth}}=\mathbf{r}^{\top}\mathbf{L}\mathbf{r}. \tag{23}$$

Comparing the two objectives, while $\mathcal{L}_{\text{MMD}}=\mathbf{r}^{\top}\mathbf{K}\mathbf{r}$ captures global distribution alignment, $\mathcal{L}_{\text{smooth}}=\mathbf{r}^{\top}\mathbf{L}\mathbf{r}$ ensures that the residual is distributed smoothly across numerically similar tokens.
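The equivalence between the pairwise form in Eq. (18) and the Laplacian form in Eq. (23) can be checked numerically. The sketch below is only a sanity check: the Gaussian kernel with $\sigma=2.0$ over the values $0,\dots,7$ and the random residual are illustrative assumptions, and the helper names are hypothetical.

```python
import numpy as np


def dirichlet_energy(K, r):
    """Pairwise form, Eq. (18): 0.5 * sum_ij K_ij (r_i - r_j)^2."""
    diff = r[:, None] - r[None, :]
    return 0.5 * np.sum(K * diff ** 2)


def laplacian(K):
    """Graph Laplacian L = D - K with D = diag(row sums of K)."""
    return np.diag(K.sum(axis=1)) - K


values = np.arange(8, dtype=float)
# Assumed Gaussian value-distance kernel with bandwidth sigma = 2.0.
K = np.exp(-((values[:, None] - values[None, :]) ** 2) / (2.0 * 2.0 ** 2))
rng = np.random.default_rng(0)
r = rng.dirichlet(np.ones(8)) - np.eye(8)[3]   # residual p - q with a one-hot q
L = laplacian(K)
pairwise = dirichlet_energy(K, r)
quadratic = r @ L @ r                          # Laplacian form, Eq. (23)
```

Because every entry of this kernel is non-negative, $\mathbf{L}$ is positive semidefinite, so the smoothness term is never negative.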

## Appendix C Experiment

### C.1 Datasets

Figure [4](https://arxiv.org/html/2606.27731#A3.F4) shows example question–answer pairs across datasets, and Table [8](https://arxiv.org/html/2606.27731#A3.T8) summarizes the train/test sizes. While GSM8K and SVAMP provide intermediate solution rationales, we evaluate only the *final numeric answer* and compute exact-match accuracy based on this final output. Since our focus is numeric prediction, we preprocess ChartQA by filtering out examples whose ground-truth answers are non-numeric, such as yes/no judgments or free-form text. ChartQA contains 28.3k/2.5k train/test examples before filtering, and 22.8k/1.9k after filtering.

![Refer to caption](https://arxiv.org/html/2606.27731v1/x5.png)

Figure 4: Dataset Samples Illustration: Question-Answer Pairs with Numeric Outputs

Table 8: Dataset sizes used in our experiments.

| | GSM8K | SVAMP | Arithmetic | Clock-Time | ChartQA |
|---|---|---|---|---|---|
| Training set | 7.47k | – | 2M | 11.52k | 22.85k |
| Test set | 1.32k | 1k | 10k | 1.44k | 1.92k |

### C.2 Baselines

#### C.2.1 Number Token Loss (NTL)

We implement NTL using the Wasserstein-1 formulation in Zausinger et al. ([2025](https://arxiv.org/html/2606.27731#bib.bib26)) and add it to standard cross-entropy. Let $\mathcal{V}_{\mathrm{num}}=\{u_{i}\}_{i=1}^{N}$ be the numeric-token subset, where each numeric token $u_{i}$ is mapped to a real value $v_{i}\in\mathbb{R}$ (via deterministic parsing of the token string). At a decoding step $t$ with ground-truth token $y_{t}\in\mathcal{V}_{\mathrm{num}}$, we restrict the model distribution to $\mathcal{V}_{\mathrm{num}}$ and denote it by $\mathbf{p}\in\Delta^{N}$.

NTL is defined as a Wasserstein-1 distance between the one-hot target and the predicted numeric-token distribution:

$$\mathcal{L}_{\text{NTL}}=\min_{\gamma\in\Gamma(\mathbf{e}_{\pi(y)},\mathbf{p})}\sum_{i=1}^{N}\sum_{j=1}^{N}\gamma_{ij}\,|v_{i}-v_{j}|, \tag{24}$$

where $y\in\mathcal{V}_{\mathrm{num}}$ is the ground-truth numeric token, $\pi(y)$ is its index in $\mathcal{V}_{\mathrm{num}}$, $\mathbf{e}_{\pi(y)}$ is the one-hot target, and $\Gamma(\cdot,\cdot)$ is the set of couplings with the specified marginals. In our one-dimensional discrete setting with a one-hot target, Eq. ([24](https://arxiv.org/html/2606.27731#A3.E24)) reduces to a directly computable form:

$$\mathcal{L}_{\text{NTL}}=\sum_{i=1}^{N}p_{i}\,|v_{i}-v_{\pi(y)}|, \tag{25}$$

i.e., the expected absolute numeric deviation under $\mathbf{p}$, so probability mass assigned to numerically closer tokens incurs a smaller penalty. We apply $\mathcal{L}_{\text{NTL}}$ only at positions where the ground-truth token is numeric (and set it to $0$ otherwise), and combine it with cross-entropy:

$$\mathcal{L}=\mathcal{L}_{\mathrm{CE}}+\lambda\,\mathcal{L}_{\text{NTL}}. \tag{26}$$
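The closed form in Eq. (25) is a one-line computation. The snippet below is a minimal sketch over the ten digit tokens; the function name `ntl_loss` and the two toy distributions are illustrative and not taken from the NTL codebase.

```python
def ntl_loss(p, values, target_index):
    """Closed-form Wasserstein-1 distance to a one-hot target, Eq. (25)."""
    v_star = values[target_index]
    return sum(p_i * abs(v_i - v_star) for p_i, v_i in zip(p, values))


values = list(range(10))          # digit tokens with values 0..9
near = [0.0] * 10
near[5], near[6] = 0.5, 0.5       # half of the mass one step from the target 5
far = [0.0] * 10
far[5], far[9] = 0.5, 0.5         # half of the mass four steps from the target 5

loss_near = ntl_loss(near, values, target_index=5)   # 0.5 * |6 - 5| = 0.5
loss_far = ntl_loss(far, values, target_index=5)     # 0.5 * |9 - 5| = 2.0
```

Both distributions put the same probability on the correct digit, so cross-entropy cannot tell them apart, whereas NTL charges the distant error four times as much.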
Hyperparameters. We perform a small, representative scan of the NTL loss weight $\lambda$ on two settings: Qwen2.5-1.5B on GSM8K and Qwen2.5-VL-3B on Clock-Time. We sweep a modest grid of $\lambda$ values (Figure [5](https://arxiv.org/html/2606.27731#A3.F5)) and observe that NTL reaches its best performance at $\lambda=2.0$ on both datasets. Accordingly, we use $\lambda=2.0$ as the default NTL setting in all experiments.

![Refer to caption](https://arxiv.org/html/2606.27731v1/x6.png)

(a) $\lambda$ scan on GSM8K

![Refer to caption](https://arxiv.org/html/2606.27731v1/x7.png)

(b) $\lambda$ scan on Clock-Time

Figure 5: $\lambda$ scan for the NTL baseline (with SMMD (Ours) as reference). We vary the NTL loss weight $\lambda$ and report exact-match accuracy (Acc, %) for (a) Qwen2.5-1.5B on GSM8K and (b) Qwen2.5-VL-3B on Clock-Time. NTL peaks at $\lambda=2.0$ in both settings, which we adopt as the default for NTL throughout. The SMMD (Ours) curve is shown for reference under the same $\lambda$ values.

#### C.2.2 Numerical Token Integrity Loss (NTIL)

We implement NTIL following Fei et al. ([2025](https://arxiv.org/html/2606.27731#bib.bib2)). NTIL extends the token-level Wasserstein/EMD objective to *sequential* digit prediction by (i) emphasizing more significant digit positions and (ii) adding sequence-level penalties based on the *constructed* numeric value. Concretely, consider a ground-truth number span consisting of $n$ consecutive digit tokens with digit labels $\{d_{k}\}_{k=0}^{n-1}$, where $d_{k}\in\{0,\dots,9\}$ and $k=0$ is the most significant digit. Let $\mathbf{p}^{(k)}\in\Delta^{10}$ be the model distribution over digit tokens at position $k$. The token-level term is a 1D Wasserstein-1 / EMD with ground cost $c(i,j)=|i-j|$:

$$\mathcal{L}_{\mathrm{EMD}}=\sum_{k=0}^{n-1}w_{k}\cdot\mathrm{EMD}\big(\mathbf{e}_{d_{k}},\mathbf{p}^{(k)}\big)=\sum_{k=0}^{n-1}w_{k}\sum_{i=0}^{9}p^{(k)}_{i}\,|i-d_{k}|,\qquad w_{k}=(1+\tau)^{\,n-k-1}, \tag{27}$$

where the equality uses the one-hot target simplification in the discrete 1D setting, and the exponential weights $w_{k}$ reflect place-value importance.

To enforce sequence-level numeric consistency, NTIL constructs a differentiable estimate of the *entire* predicted number via Gumbel-softmax over digits. Denote the resulting soft digit at position $k$ by $\hat{d}_{k}=\sum_{i=0}^{9}i\,\tilde{y}^{(k)}_{i}$, and aggregate across positions (using the corresponding powers of $10$) to obtain a scalar prediction $X$; the ground-truth numeric value is $Y$. Two additional sequence-level losses are then applied:

$$\mathcal{L}_{\mathrm{rel}}=\frac{|X-Y|}{\max(X,Y)+\epsilon},\qquad\mathcal{L}_{\mathrm{mag}}=\log\Big(\frac{\max(X,Y)}{\min(X,Y)}\Big), \tag{28}$$

with a small $\epsilon$ to avoid division by zero. The NTIL auxiliary objective is the sum of three terms:

$$\mathcal{L}_{\mathrm{NTIL}}=\mathcal{L}_{\mathrm{EMD}}+\alpha\,\mathcal{L}_{\mathrm{rel}}+\beta\,\mathcal{L}_{\mathrm{mag}}, \tag{29}$$

and the final training loss adds it to cross-entropy:

$$\mathcal{L}=\mathcal{L}_{\mathrm{CE}}+\lambda\,\mathcal{L}_{\mathrm{NTIL}}. \tag{30}$$

Hyperparameters. For the overall NTIL weight, we set $\lambda=2.0$ in all experiments for consistency with our representative scan of EMD-based objectives (Figure [5](https://arxiv.org/html/2606.27731#A3.F5)); since NTIL’s primary term is a weighted EMD loss (Eq. ([27](https://arxiv.org/html/2606.27731#A3.E27))), we adopt the same EMD weight under the same computational budget constraints. For the sequence-level terms, we follow the default settings in the original implementation and use $\alpha=\beta=\tau=0.2$.
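The three NTIL terms can be sketched for a single number span as follows. This is a simplified illustration rather than the original NTIL implementation: the Gumbel-softmax relaxation is replaced by plain expected digits (a deterministic stand-in), the function names are hypothetical, and $X, Y > 0$ is assumed so that the logarithm in $\mathcal{L}_{\mathrm{mag}}$ is defined.

```python
import math


def ntil_emd(digit_probs, digits, tau=0.2):
    """Weighted per-digit EMD, Eq. (27); index k = 0 is the most significant digit."""
    n = len(digits)
    total = 0.0
    for k, (p, d) in enumerate(zip(digit_probs, digits)):
        weight = (1.0 + tau) ** (n - k - 1)
        total += weight * sum(p_i * abs(i - d) for i, p_i in enumerate(p))
    return total


def expected_value(digit_probs):
    """Soft number X from expected digits (stand-in for Gumbel-softmax samples)."""
    n = len(digit_probs)
    return sum(
        sum(i * p_i for i, p_i in enumerate(p)) * 10 ** (n - k - 1)
        for k, p in enumerate(digit_probs)
    )


def ntil_loss(digit_probs, digits, alpha=0.2, beta=0.2, tau=0.2, eps=1e-8):
    """L_EMD + alpha * L_rel + beta * L_mag, Eqs. (27)-(29); assumes X, Y > 0."""
    n = len(digits)
    x = expected_value(digit_probs)
    y = sum(d * 10 ** (n - k - 1) for k, d in enumerate(digits))
    rel = abs(x - y) / (max(x, y) + eps)
    mag = math.log(max(x, y) / min(x, y))
    return ntil_emd(digit_probs, digits, tau) + alpha * rel + beta * mag


def one_hot(d):
    return [1.0 if i == d else 0.0 for i in range(10)]


second = [0.0] * 10
second[2], second[3] = 0.5, 0.5          # last digit split between 2 and 3
probs = [one_hot(4), second]             # soft prediction for the target "42"
loss = ntil_loss(probs, [4, 2])
```

For this example the EMD term is $0.5$ and the soft value is $X=42.5$ against $Y=42$, so the two sequence-level terms add only a small correction.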

### C.3 Implementation Details

We fine-tune all backbones on NVIDIA A800 and L20 GPUs using the Unsloth framework (Daniel Han and team, [2023](https://arxiv.org/html/2606.27731#bib.bib33)) and Transformers (Wolf et al., [2020](https://arxiv.org/html/2606.27731#bib.bib21)) (v4.57.3; except Ministral-3, which requires Transformers v5.0.0+). All models are fine-tuned with LoRA (rank $r=16$) and optimized using AdamW with a learning rate of $2\times 10^{-4}$ and weight decay $0.01$; we use a linear warmup over the first 3% of training steps (warmup ratio $0.03$). For the Arithmetic and Clock-Time datasets, we train for 15 epochs, as the training loss continues to decrease throughout training; for all other datasets, we train for 5 epochs. In every setting, we select the checkpoint with the best validation performance and report the corresponding test results. At inference time, we use greedy decoding for reproducibility.

### C.4 Construction of the random PSD kernel

Let $D$ be the size of the numeric sub-vocabulary and let $d$ be a small embedding dimension. We construct a random kernel matrix $\mathbf{K}\in\mathbb{R}^{D\times D}$ that is symmetric, positive semidefinite (PSD), and independent of the underlying numeric values. We sample i.i.d. random embeddings $\{z_{i}\}_{i=1}^{D}$ with $z_{i}\in\{-1,+1\}^{d}$ (Rademacher vectors), normalize each to unit $\ell_{2}$ norm, and stack them into a matrix $Z\in\mathbb{R}^{D\times d}$. In our experiments we use $d=4$, which yields a compact random feature space while ensuring $\|z_{i}\|_{2}=1$.

We first form the Gram matrix

$$\mathbf{K}^{(0)}=ZZ^{\top}, \tag{31}$$

which is PSD by construction and satisfies $K^{(0)}_{ii}=1$. To increase the contrast of these random correlations while preserving PSD, we apply an elementwise (Hadamard) integer power:

$$\mathbf{K}=\bigl(\mathbf{K}^{(0)}\bigr)^{\circ p}. \tag{32}$$

Equivalently, each entry is given by $K_{ij}=(z_{i}^{\top}z_{j})^{p}$. PSD is preserved because $\bigl(\mathbf{K}^{(0)}\bigr)^{\circ p}$ is the Hadamard product of $\mathbf{K}^{(0)}$ with itself $p$ times, and the Schur product theorem guarantees that Hadamard products of PSD matrices remain PSD. Throughout our ablations, we set the polynomial degree to the odd integer $p=5$, which strengthens these value-agnostic correlations while keeping the kernel PSD, providing a stringent control that removes number-line structure without changing the MMD form.
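The construction in Eqs. (31) and (32) takes only a few lines of NumPy. The sketch below follows the steps described above; the function name, the seed, and the vocabulary size `D=50` are illustrative assumptions, and the final lines simply inspect the properties claimed in the text.

```python
import numpy as np


def random_psd_kernel(D, d=4, p=5, seed=0):
    """Value-agnostic PSD kernel: Hadamard power of a Rademacher Gram matrix."""
    rng = np.random.default_rng(seed)
    Z = rng.choice([-1.0, 1.0], size=(D, d))         # i.i.d. Rademacher embeddings
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)    # unit l2 norm per row
    K0 = Z @ Z.T                                     # Gram matrix, Eq. (31)
    return K0 ** p                                   # elementwise power, Eq. (32)


K = random_psd_kernel(D=50)
min_eig = np.linalg.eigvalsh(K).min()   # should be >= 0 up to round-off
```

With $d=4$, the normalized inner products $z_i^{\top}z_j$ can only take the values $-1, -0.5, 0, 0.5, 1$, and the odd power $p=5$ keeps their signs while shrinking the intermediate values toward zero.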

### C.5 Results of Multi-Kernel

Table [9](https://arxiv.org/html/2606.27731#A3.T9) further examines whether multi-kernel mixtures provide additional benefits beyond a well-chosen single bandwidth. Starting from the best-performing single-kernel setting in the main paper ($\{\sigma\}=\{2.0\}$), we evaluate several representative bandwidth sets centered around $2.0$ as well as wider and more diverse mixtures. Overall, the improvements from multi-kernel averaging are limited and inconsistent across tasks: the single bandwidth $\sigma=2.0$ remains the best or second-best choice on most datasets, while mixtures sometimes yield marginal gains on a specific dataset (e.g., SVAMP) but often degrade performance elsewhere. These results suggest that, in our setting, accurately calibrating the kernel locality (via a single $\sigma$) is more important than increasing kernel complexity. Therefore, for simplicity and robustness, we adopt a single-kernel configuration with $\sigma=2.0$ as the default throughout the paper.

Table 9: Multi-kernel bandwidth study. SMMD performance under different Gaussian bandwidth sets $\Sigma$ (single-kernel and multi-kernel mixtures) on the ablation backbones. Best is in bold and second-best is underlined. GSM8K, SVAMP, and Arithmetic use Qwen2.5-1.5B; Clock-Time and ChartQA use Qwen2.5-VL-3B.

| $\{\sigma\}$ | GSM8K | SVAMP | Arithmetic | Clock-Time | ChartQA |
|---|---|---|---|---|---|
| $\{2.0\}$ | 57.77 | 71.10 | 53.83 | 74.30 | 72.22 |
| $\{1.7,\,2.0,\,2.3\}$ | 57.31 | 71.20 | 52.74 | 73.26 | 71.90 |
| $\{1.0,\,2.0,\,3.0\}$ | 57.16 | 70.70 | 53.83 | 72.29 | 72.06 |
| $\{1.0,\,1.5,\,2.0,\,2.5,\,3.0\}$ | 56.78 | 70.90 | 52.25 | 73.47 | 71.59 |
| $\{0.5,\,3.0,\,5.5\}$ | 56.33 | 70.60 | 51.98 | 74.79 | 71.33 |

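One plausible way to build such a mixture is to average Gaussian kernels over the bandwidth set. The sketch below is an assumption-laden illustration: the kernel form $\exp(-(v_i-v_j)^2/(2\sigma^2))$, the uniform averaging, and the function name `multi_kernel` are not confirmed by this appendix, and the released code may differ.

```python
import numpy as np


def multi_kernel(values, sigmas):
    """Uniform average of Gaussian value-distance kernels over a bandwidth set."""
    v = np.asarray(values, dtype=float)
    sq_dist = (v[:, None] - v[None, :]) ** 2
    kernels = [np.exp(-sq_dist / (2.0 * s ** 2)) for s in sigmas]
    return np.mean(kernels, axis=0)


digits = list(range(10))
K_single = multi_kernel(digits, [2.0])               # the default single bandwidth
K_mix = multi_kernel(digits, [1.0, 2.0, 3.0])        # one mixture from Table 9
```

An average of PSD kernels is itself PSD, so a mixture of this kind can replace $\mathbf{K}$ in both the MMD term and the Laplacian without changing the form of the loss.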
### C.6 Numeric distribution analysis

Taking Qwen2.5-VL-3B as an example, we examine how different objectives shape the model’s digit distributions on the Clock-Time dataset. We group examples by the first digit of the ground-truth hour and summarize the average digit distribution (mean $\pm$ std over examples) for each group. As shown in Figure [6](https://arxiv.org/html/2606.27731#A3.F6), our method consistently places more probability on the correct target digit while suppressing competing digits, producing a sharper and more target-aligned distribution across most buckets.

![Refer to caption](https://arxiv.org/html/2606.27731v1/x8.png)

Figure 6: Clock-Time digit distributions. Rows bucket examples by the first digit of the ground-truth hour (1–9), and columns compare training objectives. Overall, our method yields the most concentrated distributions with the highest mass on the correct target digit across buckets.

### C.7 Error Analysis

We conduct a targeted error analysis on GSM8K to understand how our numeric-aware objective changes failure modes beyond exact-match accuracy. For each example, we parse the final numeric answer from the model output (and the ground-truth answer) and compute per-instance errors. All statistics below are reported on the *common-valid* subset where both methods yield a valid parsed number (1319 examples).

##### Overall error distribution\.

Figure [7](https://arxiv.org/html/2606.27731#A3.F7) shows the histogram of $\log_{10}(|\hat{y}-y|+1)$. Our method shifts the error mass toward smaller values and reduces the tail: the 90th percentile absolute error decreases from 168.40 (CE) to 150.00 (Ours), indicating fewer large-magnitude failures.

![Refer to caption](https://arxiv.org/html/2606.27731v1/x9.png)

Figure 7: GSM8K overall error histogram on the common-valid subset, using $\log_{10}(|\hat{y}-y|+1)$. Our distribution concentrates more on the left and exhibits a smaller tail (e.g., lower $\mathrm{p90}(|e|)$), suggesting reduced large-magnitude errors.

##### Digit\-sliced signed error histograms\.

To probe digit-boundary behaviors, we slice GSM8K examples by the last digit of the ground-truth answer (0–9) and plot signed errors $(\hat{y}-y)$. Figure [8](https://arxiv.org/html/2606.27731#A3.F8) shows that for most ending digits, our error mass is more concentrated near zero, aligning with higher exact/near-exact outcomes. We also observe a mild limitation on boundary-like endings (notably 0 and 9), where the two methods are comparable and occasionally CE is slightly better, suggesting that our gains mainly come from reducing off-scale errors rather than uniformly improving digit-boundary exactness.

![Refer to caption](https://arxiv.org/html/2606.27731v1/x10.png)

Figure 8: GSM8K signed error histograms $(\hat{y}-y)$ sliced by the last digit of the ground-truth answer. Each subplot overlays CE vs. Ours on the same subset. Overall, our errors concentrate more around zero for most digits, with mixed trends for boundary endings (0 and 9).

##### Scale\-related failure modes\.

A frequent failure in GSM8K is *off-scale* prediction (e.g., $\times 10$ due to unit/decimal mistakes) or *catastrophic* relative error. Table [10](https://arxiv.org/html/2606.27731#A3.T10) categorizes each prediction on the GSM8K test set into mutually exclusive types: (i) `exact`: $\hat{y}=y$; (ii) `sign_flip`: $\hat{y}\approx -y$; (iii) `scale_x10^k`: $|\hat{y}|\approx|y|\cdot 10^{k}$ for $k\in\{-3,-2,-1,1,2,3\}$ (within 5% relative tolerance); (iv) `near_miss`: $|\hat{y}-y|\leq 1$ or relative error $\leq 1\%$; (v) `catastrophic`: relative error $\geq 50\%$; (vi) `other`: remaining cases. Compared to CE, our method increases exact matches (726 $\rightarrow$ 762) and near-misses (32 $\rightarrow$ 41), while reducing catastrophic errors (331 $\rightarrow$ 283, a $\sim$14.5% relative reduction) and scale errors (16 $\rightarrow$ 10), especially $\times 10$ mistakes (9 $\rightarrow$ 4). This supports our main claim: metric-aligned supervision primarily suppresses off-scale guesses and shifts errors toward numerically plausible neighborhoods.

Table 10: Scale-error category breakdown on GSM8K ($n=1319$). Our method reduces catastrophic relative errors and $\times 10^{k}$ scale mistakes (notably $\times 10$), while increasing exact and near-miss outcomes.

| Category | CE # | CE % | Ours # | Ours % |
|---|---|---|---|---|
| Exact | 726 | 55.04 | 762 | 57.77 |
| Sign flip | 0 | 0.00 | 2 | 0.15 |
| Scale $\times 10^{k}$ (any $k\neq 0$) | 16 | 1.21 | 10 | 0.76 |
| $k=-2$ | 2 | 0.15 | 2 | 0.15 |
| $k=-1$ | 3 | 0.23 | 3 | 0.23 |
| $k=1$ | 9 | 0.68 | 4 | 0.30 |
| $k=2$ | 1 | 0.08 | 1 | 0.08 |
| $k=3$ | 1 | 0.08 | 0 | 0.00 |
| Near-miss | 32 | 2.43 | 41 | 3.11 |
| Catastrophic | 331 | 25.09 | 283 | 21.46 |
| Other | 214 | 16.22 | 221 | 16.76 |
| Total | 1319 | | 1319 | |
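The categorization rules can be written as a small function. This is a sketch under explicit assumptions rather than the authors' evaluation script: the checks are applied in the order listed (first match wins) to make the categories mutually exclusive, the unspecified tolerance for `sign_flip` reuses the 5% scale tolerance, and a zero ground truth skips the relative checks.

```python
def categorize(pred, gold, scale_tol=0.05):
    """Assign one error category to a (prediction, ground truth) pair."""
    if pred == gold:
        return "exact"
    if gold != 0:
        # Assumed tolerance for "approximately -y": same 5% as the scale check.
        if abs(pred + gold) <= scale_tol * abs(gold):
            return "sign_flip"
        for k in (-3, -2, -1, 1, 2, 3):
            scaled = abs(gold) * 10.0 ** k
            if abs(abs(pred) - scaled) <= scale_tol * scaled:
                return f"scale_x10^{k}"
    abs_err = abs(pred - gold)
    rel_err = abs_err / abs(gold) if gold != 0 else float("inf")
    if abs_err <= 1 or rel_err <= 0.01:
        return "near_miss"
    if rel_err >= 0.5:
        return "catastrophic"
    return "other"


examples = [(42, 42), (-5, 5), (100, 10), (101, 100), (40, 100), (120, 100)]
labels = [categorize(pred, gold) for pred, gold in examples]
```

Checking `near_miss` before `catastrophic` matters for small answers: predicting 3 for a ground truth of 2 has a 50% relative error but is off by only 1, so it is counted as a near miss under this ordering.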

### C.8 SMMD is compatible with standard text modeling.

To sanity-check that SMMD does not interfere with general language generation, we first evaluate on SAMSum (Gliwa et al., [2019](https://arxiv.org/html/2606.27731#bib.bib34)), a dialogue summarization benchmark scored by ROUGE. Since SMMD is only activated at positions whose targets fall in $\mathcal{V}_{\mathrm{num}}$, it introduces no additional supervision on the vast majority of purely textual tokens in this task. As shown in Table [11](https://arxiv.org/html/2606.27731#A3.T11), SMMD yields ROUGE scores that are very close to the CE baseline, with differences within $0.6$ absolute across ROUGE variants.

Table 11: SAMSum dialogue summarization. Backbone: Qwen2.5-1.5B. Metrics: ROUGE $\uparrow$.

| Method | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| CE | 51.90 | 27.33 | 43.49 |
| Ours | 51.36 | 26.79 | 43.01 |

Table 12: MMLU evaluation. Backbone: Qwen2.5-1.5B fine-tuned on GSM8K. We report 5-shot accuracy (%) using `lm-eval`.

| Method | STEM | Humanities | Social Sci. | Other | Avg. |
|---|---|---|---|---|---|
| CE | $55.47\pm 0.85$ | $54.94\pm 0.68$ | $71.60\pm 0.80$ | $66.72\pm 0.82$ | $61.32\pm 0.39$ |
| Ours | $55.44\pm 0.86$ | $54.86\pm 0.68$ | $71.79\pm 0.80$ | $66.69\pm 0.82$ | $61.32\pm 0.39$ |

We further evaluate whether the numeric-aware fine-tuning affects broader knowledge and reasoning ability. Using the same Qwen2.5-1.5B checkpoints fine-tuned on GSM8K, we evaluate MMLU under the standard 5-shot protocol with `lm-eval` (Gao et al., [2024](https://arxiv.org/html/2606.27731#bib.bib43)). Table [12](https://arxiv.org/html/2606.27731#A3.T12) shows that SMMD matches CE in average accuracy ($61.32\%$ for both methods), with nearly identical performance across all four category groups. Together, these results suggest that SMMD is broadly compatible with standard text modeling and does not materially degrade non-numeric capabilities in these sanity checks.