CRUMB: Efficient Prior Fitted Network Inference via Distributionally Matched Context Batching

arXiv cs.LG Papers

Summary

This paper proposes CRUMB, a three-stage inference wrapper that clusters test queries and selects a distributionally matched training subset via MMD minimization to enable efficient Prior-Fitted Network inference on large datasets, achieving state-of-the-art context selection on 51 TabArena datasets.

arXiv:2606.11473v1 Announce Type: new Abstract: Prior-fitted networks (PFNs) are a promising class of tabular foundation models that perform in-context learning, whereby the entire labelled training set is supplied as context, and predictions for test queries are produced in a single forward pass. However, the quadratically scaling self-attention mechanism in many PFN architectures makes inference prohibitive for very large training datasets. We propose CRUMB (Clustered Retrieval Using Minimised-MMD Batching), a three-stage inference wrapper that (i) clusters the test queries, (ii) selects a small, distributionally matched training subset for each cluster by greedily minimising the maximum mean discrepancy (MMD), and (iii) runs exact PFN inference on each reduced-context batch. CRUMB is architecture-agnostic and requires no retraining. On the 51-dataset TabArena benchmark, evaluated across three PFN architectures (TabPFNv2, TabICLv1, TabICLv2), we show that CRUMB outperforms similar state-of-the-art context selection strategies. We also show that CRUMB is resilient to covariate drift, as the MMD-minimisation step naturally helps align the training context distribution to match the current test batch distributions.
Original Article
View Cached Full Text

Cached at: 06/11/26, 01:47 PM

# Efficient Prior Fitted Network Inference via Distributionally Matched Context Batching
Source: [https://arxiv.org/html/2606.11473](https://arxiv.org/html/2606.11473)
Jamie Heredge Mattia J\. Villani∗Pranav Deshpande Akshay Seshadri Niraj Kumar

Global Technology Applied Research, JPMorganChase, New York, NY 10001, USA

Equal contribution\. Email:\{jamie\.heredge, mattia\.villani\}@jpmchase\.com\.Principal Investigator\. Email:niraj\.x7\.kumar@jpmchase\.com###### Abstract

Prior\-fitted networks \(PFNs\) are a promising class of tabular foundation models that perform in\-context learning, whereby the entire labelled training set is supplied as context, and predictions for test queries are produced in a single forward pass\. However, the quadratically scaling self\-attention mechanism in many PFN architectures makes inference prohibitive for very large training datasets\. We proposeCRUMB\(ClusteredRetrievalUsingMinimised\-MMDBatching\), a three\-stage inference wrapper that \(i\) clusters the test queries, \(ii\) selects a small, distributionally matched training subset for each cluster by greedily minimising the maximum mean discrepancy \(MMD\), and \(iii\) runs exact PFN inference on each reduced\-context batch\. CRUMB is architecture\-agnostic and requires no retraining\. On the 51\-dataset TabArena benchmark, evaluated across three PFN architectures \(TabPFNv2, TabICLv1, TabICLv2\), we show that CRUMB outperforms similar state\-of\-the\-art context selection strategies\. We also show that CRUMB is resilient to covariate drift, as the MMD\-minimisation step naturally helps align the training context distribution to match the current test batch distributions\.

## 1Introduction

Prior\-fitted networks \(PFNs\)\[[12](https://arxiv.org/html/2606.11473#bib.bib47),[13](https://arxiv.org/html/2606.11473#bib.bib46)\]are an increasingly popular\[[28](https://arxiv.org/html/2606.11473#bib.bib19)\]class of tabular foundation models that solve supervised learning tasks via in\-context learning: the entire labelled training set is provided as context, and the model produces predictions for test points in a single forward pass\. These PFN models have achieved success on various datasets, even outperforming gradient boosting decision tree methods such as CatBoost\[[21](https://arxiv.org/html/2606.11473#bib.bib39)\]and XGBoost\[[3](https://arxiv.org/html/2606.11473#bib.bib40)\], which had previously been considered some of the most competitive models for tabular dataset classification tasks\. One key problem is that the reported PFN victories over current state of the art methods typically only involve small datasets\[[13](https://arxiv.org/html/2606.11473#bib.bib46),[32](https://arxiv.org/html/2606.11473#bib.bib2),[5](https://arxiv.org/html/2606.11473#bib.bib14)\]\. This is partly due to the attention layers scaling quadratically with the size of the training dataset which makes the time and memory costs for PFN models prohibitive for large dataset sizes\.

Multiple different methods have been suggested to overcome this issue\. Examples of possible solutions include architectural changes to the PFN itself, such as switching to linear attention, as found in TabFlex\[[33](https://arxiv.org/html/2606.11473#bib.bib21)\]\. There are also techniques focused on fine\-tuning model weights for a specific dataset\[[27](https://arxiv.org/html/2606.11473#bib.bib56)\]as well as context\-tuning\[[8](https://arxiv.org/html/2606.11473#bib.bib26)\]where the training points themselves are variationally adapted\. In this work we will not focus on architectural adjustments, instead restricting our investigation to TabPFNv2, TabICLv1 and TabICLv2\.

A natural remedy is*context selection*: replace the full training set𝒟train\\mathcal\{D\}\_\{\\text\{train\}\}with a much smaller subset𝒮⊂𝒟train\\mathcal\{S\}\\subset\\mathcal\{D\}\_\{\\text\{train\}\}, chosen so that predictions are minimally affected\. Uniform subsampling is the simplest option, although it effectively randomly discards information in exchange for improving speed\. Query\-dependent methods such askk\-nearest\-neighbour \(kkNN\) retrieval can improve accuracy by tailoring the context to each test point, but they sacrifice the ability to batch test queries: because every test point receives a different context, each requires a separate forward pass, resulting inTTindependent PFN evaluations rather than a small number of batched calls\.

We proposeCRUMB\(ClusteredRetrievalUsingMinimised\-MMDBatching\), a method that resolves this tension between context quality and batching efficiency\. The key idea is to cluster the*test*queries, and to select a training context for each test cluster by minimising the maximum mean discrepancy \(MMD\) between the test cluster and the selected train subset\. This aligns the training context to the test query points that we wish to evaluate, while simultaneously allowing efficient PFN inference on batches of test queries at a time\. An overview of CRUMB is given in Figure[1](https://arxiv.org/html/2606.11473#S1.F1)\.

Our contributions are as follows:

- •MMD\-based context retrieval\.We propose greedy MMD minimisation \(kernel herding\) as the mechanism for selecting training subsets, and demonstrate that MMD\-based selection outperforms centroid\-nearest\-neighbour and Voronoi\-uniform retrieval within the same clustering framework\. We also show that this MMD\-minimised context retrieval provides additional resilience to covariate drift in the test data compared to other methods\.
- •Batched context selection via test\-side clustering\.We show that clustering test queries and sharing a single training context per cluster enables efficient batching while preserving query\-relevant context\.
- •Strong results on TabArena\.On the 51\-dataset TabArena benchmark\[[6](https://arxiv.org/html/2606.11473#bib.bib30)\], evaluated across three PFN architectures, we find that CRUMB is close in performance to per\-querykkNN while requiring a fixedKKnumber of forward passes, instead of a forward pass for every testing point\. CRUMB significantly outperforms both the Mixture of In\-context Prompters \(MICP\) technique and uniform subsampling at identical context budgets\. We also show that under covariate shift, CRUMB’s advantage over MICP grows from\+4\.9%\+4\.9\\%, to\+17\.1%\+17\.1\\%as drift intensifies from no drift, to completely out of sample covariate drift\.

Train𝒟train\\mathcal\{D\}\_\{\\text\{train\}\},NNptsTest𝒟test\\mathcal\{D\}\_\{\\text\{test\}\},TTptsStage 1kk\-means on𝒟test\\mathcal\{D\}\_\{\\text\{test\}\}C1C\_\{1\}C2C\_\{2\}CKC\_\{K\}⋮\\vdotsMMD HerdingMMD HerdingMMD Herding⋮\\vdotsStage 2𝒮1\\mathcal\{S\}\_\{1\}𝒮2\\mathcal\{S\}\_\{2\}𝒮K\\mathcal\{S\}\_\{K\}⋮\\vdotsn≪Nn\\\!\\ll\\\!NPFN\(𝒮1,C1\)\(\{\\mathcal\{S\}\_\{1\},C\_\{1\}\}\)PFN\(𝒮2,C2\)\(\{\\mathcal\{S\}\_\{2\},C\_\{2\}\}\)PFN\(𝒮K,CK\)\(\{\\mathcal\{S\}\_\{K\},C\_\{K\}\}\)⋮\\vdotsStage 3𝒚^C1\\hat\{\\bm\{y\}\}\_\{C\_\{1\}\}𝒚^C2\\hat\{\\bm\{y\}\}\_\{C\_\{2\}\}𝒚^CK\\hat\{\\bm\{y\}\}\_\{C\_\{K\}\}⋮\\vdots𝒮k∗=arg⁡min\|𝒮\|=n⁡MMD2​\(P^Ck,P^𝒮\)\\displaystyle\\mathcal\{S\}\_\{k\}^\{\*\}=\\arg\\min\_\{\|\\mathcal\{S\}\|=n\}\\mathrm\{MMD\}^\{2\}\\\!\\big\(\\hat\{P\}\_\{C\_\{k\}\},\\,\\hat\{P\}\_\{\\mathcal\{S\}\}\\big\)

Figure 1:Overview of CRUMB\.Stage 1:Test queries are partitioned intoKKclusters viakk\-means\.Stage 2:For each clusterCkC\_\{k\}, a training subset𝒮k\\mathcal\{S\}\_\{k\}of sizen≪Nn\\ll Nis selected by greedy MMD minimisation \(kernel herding\), drawing from the full training pool \(dashed blue arrows\)\.Stage 3:The PFN runsKKindependent forward passes, each with a small, geometrically relevant context\. The total attention cost reduces fromT⋅NT\\\!\\cdot\\\!NtoT⋅nT\\\!\\cdot\\\!n, enabling inference on datasets withN\>50,000N\>50\{,\}000training points\.
## 2Related Work

#### Efficient Prior\-Fitted Networks\.

The PFN paradigm was introduced by\[[13](https://arxiv.org/html/2606.11473#bib.bib46)\], who showed that a transformer trained on synthetic datasets sampled from a prior can perform in\-context learning on real tabular data\. TabPFNv2\[[25](https://arxiv.org/html/2606.11473#bib.bib55)\]and TabPFNv2\.5\[[10](https://arxiv.org/html/2606.11473#bib.bib25)\]scale this idea to larger and more diverse priors, achieving strong results on small\-to\-medium datasets\[[19](https://arxiv.org/html/2606.11473#bib.bib15)\]\. The TabICL line of work\[[22](https://arxiv.org/html/2606.11473#bib.bib41),[23](https://arxiv.org/html/2606.11473#bib.bib42)\]explores alternative architectures and training procedures within the same paradigm\. All of these models share the fundamental scaling limitation that motivates our work: their inference\-time cost scales quadratically with the number of training samples included in the context\. Several works introduce architectural changes to improve speed, performance, and token efficiency, including TabFlex\[[33](https://arxiv.org/html/2606.11473#bib.bib21)\], MITRA\[[34](https://arxiv.org/html/2606.11473#bib.bib7)\], and related TabPFN variants\[[15](https://arxiv.org/html/2606.11473#bib.bib6)\]\. Chunked TabPFN\[[25](https://arxiv.org/html/2606.11473#bib.bib55)\]introduces a tiled block\-attention strategy that partitions the attention tensor into chunks to compute full attention incrementally, yielding attention\-computation speedups\. Moreover, some methods attempt to reduce the effective dimensionality of the dataset\[[7](https://arxiv.org/html/2606.11473#bib.bib5)\], providing directions orthogonal to inference\-time context subselection\. Drift\-resilient TabPFN variants have been proposed to improve robustness to temporal distribution shifts\[[11](https://arxiv.org/html/2606.11473#bib.bib9)\]\. These models can handle covariate and concept shift, but primarily by modifying the PFN pretraining prior to incorporate time\-varying data\-generating mechanisms, rather than via inference\-time context selection\.

#### Context selection for in\-context learning\.

Prior work on PFNs shows thatkkNN\-based context selection is an effective choice\[[13](https://arxiv.org/html/2606.11473#bib.bib46),[27](https://arxiv.org/html/2606.11473#bib.bib56)\]\. However, a key drawback ofkkNN\-based approaches is that test queries cannot be efficiently batched, since each query may induce a distinct retrieved context\. MixturePFN\[[30](https://arxiv.org/html/2606.11473#bib.bib29)\]proposes Mixture of In\-Context Prompters \(MICP\), which routes nearby test points to a shared local training context, enabling efficient batching while retaining locality\. Other closely related work on context subselection includes\[[18](https://arxiv.org/html/2606.11473#bib.bib3),[17](https://arxiv.org/html/2606.11473#bib.bib4)\]; however, these methods do not use MMD\-based heuristics\.

## 3Background

We consider a standard tabular supervised learning setting\. Let𝒟train=\{\(𝒙i,yi\)\}i=1N\\mathcal\{D\}\_\{\\text\{train\}\}=\\\{\(\\bm\{x\}\_\{i\},y\_\{i\}\)\\\}\_\{i=1\}^\{N\}denote the labelled training set with𝒙i∈ℝd\\bm\{x\}\_\{i\}\\in\\mathbb\{R\}^\{d\}andyi∈\{1,…,C\}y\_\{i\}\\in\\\{1,\\dots,C\\\}, for classification andyi∈ℝy\_\{i\}\\in\\mathbb\{R\}for regression; let𝒟test=\{𝒙j∗\}j=1T\\mathcal\{D\}\_\{\\text\{test\}\}=\\\{\\bm\{x\}\_\{j\}^\{\*\}\\\}\_\{j=1\}^\{T\}denote the test queries\. A PFN takes the full training set as context and produces predictions for all test queries in a single forward pass𝐲^=PFN​\(𝒟train,𝒟test\)\\mathbf\{\\hat\{y\}\}=\\text\{PFN\}\(\\mathcal\{D\}\_\{\\text\{train\}\},\\mathcal\{D\}\_\{\\text\{test\}\}\), without requiring any dataset specific training on𝒟train\\mathcal\{D\}\_\{\\text\{train\}\}\. Instead PFN models have been previously pre\-trained over a large range of synthetic datasets to approximate the posterior predictive distribution, which allows them to perform in\-context learning in a single forward pass\[[12](https://arxiv.org/html/2606.11473#bib.bib47)\]\. However, the cost of this forward pass is often quadratic in the number of training samples, due to self\-attention between training samples\[[12](https://arxiv.org/html/2606.11473#bib.bib47),[22](https://arxiv.org/html/2606.11473#bib.bib41)\], which becomes prohibitive whenNNis large\.

#### Maximum mean discrepancy\.

The maximum mean discrepancy \(MMD\) is a kernel\-based distance between probability distributions\[[9](https://arxiv.org/html/2606.11473#bib.bib53)\]\. Given two distributionsPPandQQand a reproducing kernel Hilbert space \(RKHS\) with kernelκ\\kappa, the squared MMD is defined as

MMD2​\(P,Q\)=𝔼𝒙,𝒙′∼P​\[κ​\(𝒙,𝒙′\)\]−2​𝔼𝒙∼P,𝒛∼Q​\[κ​\(𝒙,𝒛\)\]\+𝔼𝒛,𝒛′∼Q​\[κ​\(𝒛,𝒛′\)\]\.\\text\{MMD\}^\{2\}\(P,Q\)=\\mathbb\{E\}\_\{\\bm\{x\},\\bm\{x\}^\{\\prime\}\\sim P\}\[\\kappa\(\\bm\{x\},\\bm\{x\}^\{\\prime\}\)\]\-2\\mathbb\{E\}\_\{\\bm\{x\}\\sim P,\\bm\{z\}\\sim Q\}\[\\kappa\(\\bm\{x\},\\bm\{z\}\)\]\+\\mathbb\{E\}\_\{\\bm\{z\},\\bm\{z\}^\{\\prime\}\\sim Q\}\[\\kappa\(\\bm\{z\},\\bm\{z\}^\{\\prime\}\)\]\.\(1\)Intuitively,MMD2​\(P,Q\)=0\\text\{MMD\}^\{2\}\(P,Q\)=0if and only ifP=QP=Q\(for characteristic kernels such as the Gaussian RBF\), making it a natural criterion for measuring how well a selected training subset represents a target distribution\. In our setting,PPis the empirical distribution over a test clusterCkC\_\{k\}andQQis the empirical distribution over a candidate training subset𝒮k⊂𝒟train\\mathcal\{S\}\_\{k\}\\subset\\mathcal\{D\}\_\{\\text\{train\}\}\.

#### Covariate Drift\.

The standard assumption is that training and test inputs are drawn from the same marginal distributionP​\(𝒙\)P\(\\bm\{x\}\)\. Covariate drift refers to the setting where this assumption is violated: the test inputs are drawn from a shifted distributionPtest​\(𝒙\)≠Ptrain​\(𝒙\)P\_\{\\text\{test\}\}\(\\bm\{x\}\)\\neq P\_\{\\text\{train\}\}\(\\bm\{x\}\), while the labelling mechanismP​\(y\|𝒙\)P\(y\|\\bm\{x\}\)remains unchanged\[[26](https://arxiv.org/html/2606.11473#bib.bib24),[31](https://arxiv.org/html/2606.11473#bib.bib23)\]\. Under such a shift, a training context sampled uniformly from𝒟train\\mathcal\{D\}\_\{\\text\{train\}\}will reflectPtrainP\_\{\\text\{train\}\}rather thanPtestP\_\{\\text\{test\}\}, potentially degrading predictions in regions of the feature space where the test queries concentrate\. This motivates context selection methods that explicitly align the training context to the test distribution\.

#### Mixture of In\-Context Prompters \(MICP\)\.

The most closely related baseline to CRUMB is the Mixture of In\-Context Prompters \(MICP\) component of MixturePFN\[[30](https://arxiv.org/html/2606.11473#bib.bib29)\]\. MICP\[[30](https://arxiv.org/html/2606.11473#bib.bib29)\]clusters the training data, constructs a fixed support set per cluster, and routes test points to the nearest training centroid\. The full MixturePFN system additionally includes a Context\-Aware Fine\-Tuning \(CAPFN\) stage that trains per\-dataset adapter layers; we omit CAPFN throughout to isolate the effect of context selection and ensure a fair comparison with CRUMB, which likewise does not modify model weights\.

## 4Methodology

The method we propose has three stages: \(i\) cluster the test queries, \(ii\) retrieve a training subset for each cluster by minimising the maximal mean discrepancy \(MMD\) between train and test points, and \(iii\) run the PFN forward pass on each \(cluster, subset\) pair, where all points in a given pair can be batched together\. We term this methodClusteredRetrievalUsingMinimised\-MMDBatching \(CRUMB\)\. We also describe optional enhancements, such as an adaptive method of training context selection, whereby we select points until we see no significant improvement in MMD minimisation, saving time by avoiding large contexts with limited improvement in performance\.

Stage 1: Clustering the Test Queries\.We partition the test set intoKKclusters\{C1,…,CK\}\\\{C\_\{1\},\\dots,C\_\{K\}\\\}usingkk\-means on the input features\{𝒙j∗\}j=1T\\\{\\bm\{x\}\_\{j\}^\{\*\}\\\}\_\{j=1\}^\{T\}\[[16](https://arxiv.org/html/2606.11473#bib.bib38),[2](https://arxiv.org/html/2606.11473#bib.bib32)\]\. The number of clustersKKis a hyperparameter that trades off between two extremes:K=1K=1means we will consider one shared training context for all test points, whileK=TK=Trecovers per\-query selection, where every test point gets a unique training context \(i\.e\. as is the case in per\-querykkNN\)\. The clustering is performed on standardised features \(zero mean, unit variance\) using Euclidean distance\. We use Lloyd’s algorithm initialised withkk\-means\+\+\.

Stage 2: Training Subset Selection via MMD\.For each clusterCkC\_\{k\}, we want to find a training subset𝒮k⊂𝒟train\\mathcal\{S\}\_\{k\}\\subset\\mathcal\{D\}\_\{\\text\{train\}\}of sizennthat is “close” to the test points inCkC\_\{k\}in a distributional sense\. We use the maximum mean discrepancy \(MMD\) as our notion of closeness, supported by ablations in Appendix[B\.2](https://arxiv.org/html/2606.11473#A2.SS2)\. In our case minimisingMMD2\\text\{MMD\}^\{2\}is equivalent to solving

𝒮k∗=arg​min𝒮k⊂𝒟train,\|𝒮k\|=n⁡\[1n2​∑𝒙i,𝒙i′∈𝒮kκ​\(𝒙i,𝒙i′\)−2n​\|Ck\|​∑𝒙i∈𝒮k∑𝒙j∗∈Ckκ​\(𝒙i,𝒙j∗\)\],\\mathcal\{S\}\_\{k\}^\{\*\}=\\operatorname\*\{arg\\,min\}\_\{\\mathcal\{S\}\_\{k\}\\subset\\mathcal\{D\}\_\{\\text\{train\}\},\\,\|\\mathcal\{S\}\_\{k\}\|=n\}\\left\[\\frac\{1\}\{n^\{2\}\}\\sum\_\{\\bm\{x\}\_\{i\},\\bm\{x\}\_\{i^\{\\prime\}\}\\in\\mathcal\{S\}\_\{k\}\}\\kappa\(\\bm\{x\}\_\{i\},\\bm\{x\}\_\{i^\{\\prime\}\}\)\-\\frac\{2\}\{n\|C\_\{k\}\|\}\\sum\_\{\\bm\{x\}\_\{i\}\\in\\mathcal\{S\}\_\{k\}\}\\sum\_\{\\bm\{x\}^\{\*\}\_\{j\}\\in C\_\{k\}\}\\kappa\(\\bm\{x\}\_\{i\},\\bm\{x\}\_\{j\}^\{\*\}\)\\right\],\(2\)
where the first term penalises redundancy among the selected points \(encouraging diversity\), while the second rewards proximity to the test cluster \(encouraging relevance\)\.

Minimising the MMD objective in Equation[2](https://arxiv.org/html/2606.11473#S4.E2)is a combinatorial optimisation problem; we solve it greedily via kernel herding\[[14](https://arxiv.org/html/2606.11473#bib.bib54),[4](https://arxiv.org/html/2606.11473#bib.bib28)\]\. Starting from𝒮k=∅\\mathcal\{S\}\_\{k\}=\\emptysetatt=0t=0, then at each stepttwe select the training point that minimises the following scoring function:

𝒙i∗=arg​min𝒙i∈𝒟train∖𝒮k⁡\[−2\|Ck\|​∑𝒙j∗∈Ckκ​\(𝒙i,𝒙j∗\)\+2t\+1​∑𝒙i′∈𝒮kκ​\(𝒙i,𝒙i′\)\],\\bm\{x\}\_\{i^\{\*\}\}=\\operatorname\*\{arg\\,min\}\_\{\\bm\{x\}\_\{i\}\\in\\mathcal\{D\}\_\{\\text\{train\}\}\\setminus\\mathcal\{S\}\_\{k\}\}\\left\[\-\\frac\{2\}\{\|C\_\{k\}\|\}\\sum\_\{\\bm\{x\}^\{\*\}\_\{j\}\\in C\_\{k\}\}\\kappa\(\\bm\{x\}\_\{i\},\\bm\{x\}\_\{j\}^\{\*\}\)\+\\frac\{2\}\{t\+1\}\\sum\_\{\\bm\{x\}\_\{i^\{\\prime\}\}\\in\\mathcal\{S\}\_\{k\}\}\\kappa\(\\bm\{x\}\_\{i\},\\bm\{x\}\_\{i^\{\\prime\}\}\)\\right\],\(3\)
where the first term rewards proximity to the test cluster \(precomputed once inO​\(N​\|Ck\|\)O\(N\|C\_\{k\}\|\)time\) and the second penalises redundancy with already\-selected points \(updated incrementally as each point is added\)\. In our application we use a Gaussian RBF kernelκ​\(𝒙,𝒛\)=exp⁡\(−‖𝒙−𝒛‖2/2​σ2\)\\kappa\(\\bm\{x\},\\bm\{z\}\)=\\exp\(\-\\\|\\bm\{x\}\-\\bm\{z\}\\\|^\{2\}/2\\sigma^\{2\}\)with bandwidthσ\\sigmaset via the median heuristic, which avoids introducing additional hyperparameters\. Since the Gaussian RBF kernel satisfiesκ​\(𝒙i,𝒙i\)=1\\kappa\(\\bm\{x\}\_\{i\},\\bm\{x\}\_\{i\}\)=1for allii, the self\-kernel term is constant across candidates and can be dropped from the selection criterion\.

The end result of MMD\-minimisation is that for each batch of test pointsCkC\_\{k\}we have found a training subset𝒮k\\mathcal\{S\}\_\{k\}to use for all test points in the batch\. Note that the subsets𝒮1,…,𝒮K\\mathcal\{S\}\_\{1\},\\dots,\\mathcal\{S\}\_\{K\}are not required to be disjoint\. A training point that lies in a region of feature space relevant to multiple test clusters will naturally be selected for multiple subsets\. This is desirable as it means that informative training points are reused where needed\.

Accelerated greedy MMD minimisation\.In practice we speed up this greedy procedure by selecting the topBBscoring points at each iteration before recalculating the second term of Eq\. \([3](https://arxiv.org/html/2606.11473#S4.E3)\), instead of selecting strictly one point at a time\. We also speed up kernel calculations via random Fourier features \(RFF\)\[[24](https://arxiv.org/html/2606.11473#bib.bib52)\]\. For large datasets we also implement early stopping in the procedure, in which we monitor the relative MMD improvement every5050selected points and halt when improvement falls below10−410^\{\-4\}for five consecutive checks\. This allows the context size to adapt to the difficulty of each cluster: easy clusters terminate early with fewer points, reducing inference cost without a fixed budget\. We use this variant in the large\-dataset experiments reported in Figure[4](https://arxiv.org/html/2606.11473#S5.F4)\. Further details and ablations on greedy MMD minimisation variants are provided in Appendix[D](https://arxiv.org/html/2606.11473#A4)\.

Stage 3: Running Batched PFN Inference\.Given theKKpairs\(Ck,𝒮k\)\(C\_\{k\},\\mathcal\{S\}\_\{k\}\), we run the PFN forward passKKtimes, each time providing𝒮k\\mathcal\{S\}\_\{k\}as the training context for the test points inCkC\_\{k\}\. For each clusterkkwe build a single PFN input consisting of the shared context𝒮k\\mathcal\{S\}\_\{k\}and a query tensor containing all points inCkC\_\{k\}, so that the PFN produces\|Ck\|\\lvert C\_\{k\}\\rvertpredictions in one vectorized forward pass\. This enables efficient GPU utilisation because all queries inCkC\_\{k\}share the same context, whereas per\-query retrieval \(e\.g\.kkNN\) induces distinct contexts and therefore prevents straightforward batching\.

The full procedure is summarised in Algorithm[1](https://arxiv.org/html/2606.11473#alg1)and a visual representation of the preprocessing and context retrieval steps are shown in Figure[2](https://arxiv.org/html/2606.11473#S4.F2)\.

Algorithm 1CRUMB \- Clustering Retrieval Using Minimised\-MMD Batching0:Training set

𝒟train\\mathcal\{D\}\_\{\\text\{train\}\}, test set

𝒟test\\mathcal\{D\}\_\{\\text\{test\}\}, number of clusters

KK, subset size

nn, PFN model

ff
1:Standardise features of

𝒟train\\mathcal\{D\}\_\{\\text\{train\}\}and

𝒟test\\mathcal\{D\}\_\{\\text\{test\}\}
2:

\{C1,…,CK\}←k​\-means​\(𝒟test,K\)\\\{C\_\{1\},\\dots,C\_\{K\}\\\}\\leftarrow k\\text\{\-means\}\(\\mathcal\{D\}\_\{\\text\{test\}\},K\)
3:for

k=1,…,Kk=1,\\dots,Kdo

4:

𝒮k←GreedyMMD​\(𝒟train,Ck,n\)\\mathcal\{S\}\_\{k\}\\leftarrow\\text\{GreedyMMD\}\(\\mathcal\{D\}\_\{\\text\{train\}\},C\_\{k\},n\)⊳\\trianglerightEq\. \([3](https://arxiv.org/html/2606.11473#S4.E3)\)

5:

𝒚^Ck←f​\(𝒮k,Ck\)\\hat\{\\bm\{y\}\}\_\{C\_\{k\}\}\\leftarrow f\(\\mathcal\{S\}\_\{k\},C\_\{k\}\)⊳\\trianglerightPFN forward pass

6:endfor

7:return

\{𝒚^Ck\}k=1K\\\{\\hat\{\\bm\{y\}\}\_\{C\_\{k\}\}\\\}\_\{k=1\}^\{K\}

![Refer to caption](https://arxiv.org/html/2606.11473v1/paper_figures/CRUMB_V8.png)Figure 2:Visual representation of the CRUMB preprocessing and context retrieval steps\.We also provide visualisation of alternative context retrievals, showing how uniform subsampling and MICP could in certain cases lead to a sub\-optimal context selection for a given batch of test points\.
## 5Experiments

We evaluate CRUMB on the TabArena benchmark\[[6](https://arxiv.org/html/2606.11473#bib.bib30)\], which consists of 51 diverse tabular classification and regression datasets spanning various sizes, feature types, and difficulty levels\. We test across three PFN architectures: TabPFNv2\[[13](https://arxiv.org/html/2606.11473#bib.bib46)\], TabICLv1\[[22](https://arxiv.org/html/2606.11473#bib.bib41)\], and TabICLv2\[[23](https://arxiv.org/html/2606.11473#bib.bib42)\]\. To investigate effects only relevant to very large datasets \(beyond the sizes found in TabArena\) we supplemented certain experiments with the Higgs particle physics dataset\[[29](https://arxiv.org/html/2606.11473#bib.bib50)\]\.

#### Baselines\.

We restrict our investigation to context selection only \(no fine\-tuning\) and report on the following context selection methods:Full context: Is our baseline where the entire training set of sizeNNis used \(no context selection used\)\.Uniform subsampling:nntraining points are selected uniformly at random from the training set to be used as the context\.kkNN: for each test point, thennnearest neighbours in the training set are selected\. Each test point receives a \(potentially\) unique training context and therefore they can not be batched\.MICP: the training data is clustered intoK′=⌈γ​N/n⌉K^\{\\prime\}=\\lceil\\gamma N/n\\rceilclusters viakk\-means, and a fixed support set ofnntraining points is constructed per cluster \(by uniformly subsampling large clusters or expanding small ones viakkNN from the centroid if the cluster does not include enough points\)\[[30](https://arxiv.org/html/2606.11473#bib.bib29)\]\.

Each test point is routed to the nearest training\-cluster centroid and batched with other test points assigned to the same cluster\.

#### Metrics\.

We report the ranking of different context selection techniques against each other, where the methods are ranked for a given PFN model and a dataset drawn with a fixed random seed\. For classification tasks the rank is determined by the final accuracy; for regression tasks the rank is determined by RMSE\. Wall\-clock timings are reported only for the large\-dataset experiments \(Section[5\.3](https://arxiv.org/html/2606.11473#S5.SS3)\), where context selection provides a genuine computational saving\. On the smaller TabArena datasets \(N<10,000N<10\{,\}000\), fixed overhead of PFN model initialisations, forward passes, and preprocessing steps can make timing comparisons uninformative\. Wall\-clock inference times are reported using a single NVIDIA A100 GPU\.

### 5\.1Main Results

We evaluate CRUMB against four baselines on all 51 TabArena datasets using TabICL\-v2: full\-context inference, per\-querykk\-NN retrieval, MICP, and uniform subsampling\. Every method except full context operates within a fixed context budgetn=0\.1​Nn=0\.1N; methods are ranked per \(dataset, seed\) pair and compared to CRUMB via Wilcoxon signed\-rank tests with Bonferroni correction\.

Table 1:Average rank of context selection methods on TabICL\-v2 across 51 TabArena datasets \(10 seeds each, 510 paired observations\)\. Methods are ranked per \(dataset, seed\) group using accuracy for classification tasks and RMSE for regression tasks, then averaged across all 510 groups; lower rank is better\. Significance vs\. CRUMB is assessed via Wilcoxon signed\-rank test with Bonferroni correction with adjusted score given bypadjp\_\{\\text\{adj\}\}\. Context budgetn=0\.1​Nn=0\.1Nfor all methods except Full Context\.K=20K\{=\}20test clusters for CRUMB andK′=⌈γ​N/n⌉K^\{\\prime\}=\\lceil\\gamma N/n\\rceiltrain clusters for MICP, whereγ=1\\gamma=1\. In some cases some clusters may not have any test samples routed to them, in this case MICP only runsK∗K^\{\*\}forward passes, whereK∗K^\{\*\}is the number of unique clusters which have test points routed to them\. The scaling of the PFN inference attention with context size is listed to emphasise full context is prohibitive on large datasets due toO​\(N2\)O\(N^\{2\}\)scaling, and to also highlight that per\-querykk\-NN requiresTTforward passes as opposed to CRUMB which requiresKK, which can be much smaller in practice\. A full complexity analysis of methods is provided in Appendix[E](https://arxiv.org/html/2606.11473#A5)\.#### Outcome\.

In Table[1](https://arxiv.org/html/2606.11473#S5.T1)using the full context expectedly performs best, but it uses allNNtraining points and is infeasible for largeNN; it is included only as an upper bound\. Among methods operating within a fixed context budgetn=0\.1​Nn=0\.1N, CRUMB is statistically close to per\-querykk\-NN \(padj=0\.106p\_\{\\text\{adj\}\}=0\.106\) despite requiring onlyK=20K\{=\}20forward passes rather thanT=200T\{=\}200, a10×10\\timesreduction which is also confirmed experimentally \(3535s vs\.361361s average PFN inference time\)\. CRUMB also significantly outperforms both MICP and uniform subsampling \(padj<0\.001p\_\{\\text\{adj\}\}<0\.001\)\.

### 5\.2Consistency Amongst PFN Models and Sampling Sizes

We next vary three axes: \(i\) the PFN backbone \(TabICL\-v1, TabICL\-v2, TabPFN\-v2\) and \(ii\) the context budgetn=p​Nn=pN, sweeping the sampling proportionsp∈\{0\.05,0\.1,0\.2,0\.5,0\.8\}p\\in\\\{0\.05,0\.1,0\.2,0\.5,0\.8\\\}\(iii\) max train sizeN∈\{500,1000,2000\}N\\in\\\{500,1000,2000\\\}\. We restrict this analysis to the 38 classification datasets in TabArena, since TabICL\-v1 does not support regression\.

![Refer to caption](https://arxiv.org/html/2606.11473v1/x1.png)Figure 3:Average Rank Heatmap\.For each dataset, we compute the mean accuracy of each context\-selection method \(CRUMB, Uniform, MICP\) by averaging over all sampling proportions, all train sizes, and five random seeds\. We report the mean rank for the methods for each dataset and PFN model combination\. For each dataset and model colour indicates : Gold = best, Silver = Second, Bronze = Last\.Table 2:Average rank of context selection methods across the 38 TabArena classification datasets, 3 PFN models, 5 sampling proportions \(p∈\{0\.05,0\.1,0\.2,0\.5,0\.8\}p\\in\\\{0\.05,0\.1,0\.2,0\.5,0\.8\\\}\), and 5 seeds\. Methods are ranked per \(dataset, model, proportion, seed\) group using accuracy, then averaged; lower rank is better\. Significance vs\. CRUMB assessed via Wilcoxon signed\-rank test with Bonferroni correction\.#### Outcome\.

Figure[3](https://arxiv.org/html/2606.11473#S5.F3)and Table[2](https://arxiv.org/html/2606.11473#S5.T2)confirm that CRUMB’s advantage is not an artefact of a single model or budget\. Across all three architectures, five sampling proportions and three train set sizes, CRUMB achieves an average rank of1\.809±0\.0271\.809\\pm 0\.027, significantly outperforming both MICP \(2\.022±0\.0232\.022\\pm 0\.023\) and uniform subsampling \(2\.169±0\.0262\.169\\pm 0\.026\)\.

### 5\.3Large Datasets and Early Stopping

We evaluate this method for large datasets where full\-context PFN inference becomes computationally prohibitive due to quadratic attention cost\. We run on the five largest TabArena datasets: Diabetes130US \(57k\), APSFailure \(61k\), SDSS17 \(62k\), customer\_satisfaction\_in\_airline \(104k\), and GiveMeSomeCredit \(120k\); capping the max size to 100k samples\. We set the max contextn=0\.1​Nn=0\.1Nand report results over 5 random seeds\. In this experiment we enable early stopping of the greedy MMD minimisation procedure as detailed in Section[4](https://arxiv.org/html/2606.11473#S4), when the MMD is no longer significantly decreasing\.

![Refer to caption](https://arxiv.org/html/2606.11473v1/x2.png)Figure 4:Large\-dataset experiments \(5 datasets, 5 seeds\)\.\(a\)Number of wins \(dataset×\\timesseed\) per context selection method, grouped by PFN model\.\(b\)Accuracy \(Median\) vs\. Mean context size for CRUMB and MICP across PFN models \(marker shape distinguishes models\); connecting lines link the two methods on the same dataset\. \*\* indicatesp<0\.01p<0\.01\.#### Outcome\.

Figure[4](https://arxiv.org/html/2606.11473#S5.F4)\(a\) shows the number of wins \(across dataset–seed pairs\) for each method: CRUMB with early stopping wins more often than MICP, significantly so for TabPFN\-v2 and TabICL\-v1 \(p<0\.01p<0\.01\)\. Figure[4](https://arxiv.org/html/2606.11473#S5.F4)\(b\) shows that the early stopping criterion terminates context selection before the full budgetnnis reached, yielding smaller contexts that perform comparably to MICP when it is given a larger context size\.

### 5\.4Covariate Drift

We simulate controlled covariate drift on real datasets\. All features are standardised and projected onto the first principal component \(PC1\); each sample receives a quantile rankq∈\[0,1\]q\\in\[0,1\]along PC1\. The training pool is fixed as the bottom half \(q<0\.5q<0\.5;∼25,000\{\\sim\}25\{,\}000points\)\. This amounts to passing the data through a filter function, thus changing their distribution\. The test pool is passed through a similar filter but with an adjustable sliding window such thatq∈\[τ/2,0\.5\+τ/2\]q\\in\[\\tau/2,\\;0\.5\+\\tau/2\]\. This means that atτ=0\\tau\{=\}0the filters are the same for training and testing, hence this corresponds to no drift\. In contrast, atτ=1\\tau\{=\}1the test dataset will be filtered to only includeq\>0\.5q\>0\.5, meaning that the test set will be entirely out\-of\-distribution as the train and test set will be in disjoint spaces \(along the first PCA component axis\)\. We adopted this method as a way to introduce parameterised covariate drift and out of distribution testing onto real datasets\.

For eachτ∈\{0\.0,0\.1,…,1\.0\}\\tau\\in\\\{0\.0,0\.1,\\ldots,1\.0\\\}and random seed,T=200T\{=\}200test points are sampled from the test pool\. Each method receives a context budget ofn=min⁡\(0\.1​N,10,000\)n=\\min\(0\.1N,\\;10\{,\}000\); both CRUMB and MICP useK=20K\{=\}20clusters\. Results are averaged over1010seeds per\(τ,dataset,model\)\(\\tau,\\text\{dataset\},\\text\{model\}\)triple across four datasets withN≥50,000N\\geq 50\{,\}000: two classification tasks from TabArena \(Anneal, Amazon Employee Access\), one regression task \(Airfoil Self Noise\), and the Higgs particle physics dataset\[[29](https://arxiv.org/html/2606.11473#bib.bib50)\]\.

#### Outcome\.

Table[3](https://arxiv.org/html/2606.11473#S5.T3)and Figure[5](https://arxiv.org/html/2606.11473#S5.F5)show that CRUMB’s advantage over MICP grows monotonically with drift intensity\. Atτ=0\\tau\{=\}0, CRUMB leads by\+4\.9\+4\.9%; atτ=1\\tau\{=\}1the gap widens to\+17\.1\+17\.1%, as CRUMB degrades gracefully \(0\.7940\.794to0\.6990\.699;−9\.5\-9\.5%\) while MICP collapses \(0\.7460\.746to0\.5280\.528;−21\.8\-21\.8%\)\. The mechanism is straightforward: MICP clusters the*training*data, so its partitions become misaligned with the shifting test distribution; CRUMB clusters on the test side and adapts automatically, providing a natural robustness mechanism against covariate drift\.

Table 3:Classification accuracy under covariate drift, aggregated across 3 datasets, 3 PFN models, and 10 seeds\. Values are mean±\\pmSEM\.![Refer to caption](https://arxiv.org/html/2606.11473v1/x3.png)
![Refer to caption](https://arxiv.org/html/2606.11473v1/x4.png)
![Refer to caption](https://arxiv.org/html/2606.11473v1/x5.png)
![Refer to caption](https://arxiv.org/html/2606.11473v1/x6.png)

Figure 5:CRUMB advantage under controlled covariate drift\.Each panel shows accuracy \(or RMSE for Airfoil\) versus drift intensityτ\\taufor CRUMB and MICP across PFN models\. CRUMB’s advantage widens asτ\\tauincreases, demonstrating robustness to distribution shift\. TabICLv1 is omitted from the regression panel as it does not support regression\.

## 6Conclusion

We introduced CRUMB, a training\-free inference wrapper that makes prior\-fitted network inference practical on large tabular datasets by clustering test queries, selecting distributionally matched training contexts via greedy MMD minimisation, and running batched PFN forward passes on the resulting reduced\-context pairs\. Across 51 TabArena datasets and three PFN architectures, CRUMB significantly outperforms both uniform subsampling and MICP at identical context budgets\. The source of this advantage is the combination of test\-side clustering with distributional context retrieval: clustering creates focused, localised query groups, and MMD herding ensures each group receives a context that is distributionally aligned rather than merely geometrically proximal\. This same test\-adaptive design provides natural resilience to covariate drift, a regime in which CRUMB’s advantage over MICP grows from\+4\.9\+4\.9% to\+17\.1\+17\.1% as drift intensifies\.

#### Limitations

CRUMB requires the full set of test queries upfront in order to cluster them and does not natively support the online setting where points arrive one at a time\. We discuss a cached\-centroid variant for streaming inference in Appendix[C](https://arxiv.org/html/2606.11473#A3), but this effectively reduces to MICP\-style routing with a different preprocessing step, and the distribution\-matching advantage of CRUMB may be diminished\. The clustering stage relies on Euclideankk\-means, which assumes roughly isotropic cluster geometry and may be poorly suited to high\-dimensional or mixed\-type feature spaces; the MMD objective is subject to the same limitation, using an isotropic Gaussian RBF kernel with a fixed median\-heuristic bandwidth and no adaptation to dataset\-specific structure\. Although CRUMB’s preprocessing scales linearly inNN, the constant inO​\(N​d​\(T\+K​n\)\)O\(Nd\(T\{\+\}Kn\)\)can still be large for million\-scale datasets\. More broadly, as a pure inference wrapper CRUMB inherits the ceiling of the underlying PFN: if the model’s pretraining priors are a poor match for a given dataset, no context selection strategy can fully compensate\. In this work we focused solely on context selection without additional fine\-tuning; methods such as LoCalPFN\[[27](https://arxiv.org/html/2606.11473#bib.bib56)\]and MixturePFN\[[30](https://arxiv.org/html/2606.11473#bib.bib29)\]apply fine\-tuning afterkkNN and MICP respectively\. Investigating whether similar fine\-tuning combined with CRUMB yields further gains is a natural direction for future work\.

## Disclaimer

This paper was prepared for informational purposes by the Global Technology Applied Research center of JPMorgan Chase & Co\. This paper is not a merchandisable/sellable product of the Research Department of JPMorgan Chase & Co\. or its affiliates\. Neither JPMorgan Chase & Co\. nor any of its affiliates makes any explicit or implied representation or warranty and none of them accept any liability in connection with this paper, including, without limitation, with respect to the completeness, accuracy, or reliability of the information contained herein and the potential legal, compliance, tax, or accounting effects thereof\. This document is not intended as investment research or investment advice, or as a recommendation, offer, or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction\.

## References

- \[1\]\(2014\)A survey on nearest neighbor search methods\.International Journal of Computer Applications95\(25\)\.Cited by:[Appendix E](https://arxiv.org/html/2606.11473#A5.p15.3),[Appendix E](https://arxiv.org/html/2606.11473#A5.p16.9),[Appendix E](https://arxiv.org/html/2606.11473#A5.p9.15)\.
- \[2\]D\. Arthur and S\. Vassilvitskii\(2007\-01\)K\-means\+\+: the advantages of careful seeding\.Vol\.8,pp\. 1027–1035\.External Links:[Document](https://dx.doi.org/10.1145/1283383.1283494)Cited by:[§4](https://arxiv.org/html/2606.11473#S4.p2.9)\.
- \[3\]T\. Chen and C\. Guestrin\(2016\-08\)XGBoost: a scalable tree boosting system\.InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,KDD ’16,pp\. 785–794\.External Links:[Link](http://dx.doi.org/10.1145/2939672.2939785),[Document](https://dx.doi.org/10.1145/2939672.2939785)Cited by:[§1](https://arxiv.org/html/2606.11473#S1.p1.1)\.
- \[4\]Y\. Chen, M\. Welling, and A\. Smola\(2012\)Super\-samples from kernel herding\.External Links:1203\.3472,[Link](https://arxiv.org/abs/1203.3472)Cited by:[§4](https://arxiv.org/html/2606.11473#S4.p5.3)\.
- \[5\]Z\. Cheng, Z\. Jia, Z\. Zhou, Y\. Li, and L\. Guo\(2025\)Realistic evaluation of tabpfn v2 in open environments\.arXiv preprint arXiv:2505\.16226\.Cited by:[§1](https://arxiv.org/html/2606.11473#S1.p1.1)\.
- \[6\]N\. Erickson, L\. Purucker, A\. Tschalzev, D\. Holzmüller, P\. M\. Desai, D\. Salinas, and F\. Hutter\(2025\)TabArena: a living benchmark for machine learning on tabular data\.External Links:2506\.16791,[Link](https://arxiv.org/abs/2506.16791)Cited by:[3rd item](https://arxiv.org/html/2606.11473#S1.I1.i3.p1.4),[§5](https://arxiv.org/html/2606.11473#S5.p1.1)\.
- \[7\]B\. Feuer, C\. Hegde, and N\. Cohen\(2023\)Scaling tabpfn: sketching and feature selection for tabular prior\-data fitted networks\.arXiv preprint arXiv:2311\.10609\.Cited by:[§2](https://arxiv.org/html/2606.11473#S2.SS0.SSS0.Px1.p1.1)\.
- \[8\]B\. Feuer, R\. T\. Schirrmeister, V\. Cherepanova, C\. Hegde, F\. Hutter, M\. Goldblum, N\. Cohen, and C\. White\(2024\)TuneTables: context optimization for scalable prior\-data fitted networks\.External Links:2402\.11137,[Link](https://arxiv.org/abs/2402.11137)Cited by:[§1](https://arxiv.org/html/2606.11473#S1.p2.1)\.
- \[9\]A\. Gretton, K\. M\. Borgwardt, M\. J\. Rasch, B\. Schölkopf, and A\. Smola\(2012\-03\)A kernel two\-sample test\.J\. Mach\. Learn\. Res\.13\(null\),pp\. 723–773\.External Links:ISSN 1532\-4435Cited by:[§3](https://arxiv.org/html/2606.11473#S3.SS0.SSS0.Px1.p1.3)\.
- \[10\]L\. Grinsztajn, K\. Flöge, O\. Key, F\. Birkel, P\. Jund, B\. Roof, B\. Jäger, D\. Safaric, S\. Alessi, A\. Hayler, M\. Manium, R\. Yu, F\. Jablonski, S\. B\. Hoo, A\. Garg, J\. Robertson, M\. Bühler, V\. Moroshan, L\. Purucker, C\. Cornu, L\. C\. Wehrhahn, A\. Bonetto, B\. Schölkopf, S\. Gambhir, N\. Hollmann, and F\. Hutter\(2026\)TabPFN\-2\.5: advancing the state of the art in tabular foundation models\.External Links:2511\.08667,[Link](https://arxiv.org/abs/2511.08667)Cited by:[§2](https://arxiv.org/html/2606.11473#S2.SS0.SSS0.Px1.p1.1)\.
- \[11\]K\. Helli, D\. Schnurr, N\. Hollmann, S\. Müller, and F\. Hutter\(2024\)Drift\-resilient tabpfn: in\-context learning temporal distribution shifts on tabular data\.Advances in Neural Information Processing Systems37,pp\. 98742–98781\.Cited by:[§2](https://arxiv.org/html/2606.11473#S2.SS0.SSS0.Px1.p1.1)\.
- \[12\]N\. Hollmann, S\. Müller, K\. Eggensperger, and F\. Hutter\(2023\)TabPFN: a transformer that solves small tabular classification problems in a second\.External Links:2207\.01848,[Link](https://arxiv.org/abs/2207.01848)Cited by:[§1](https://arxiv.org/html/2606.11473#S1.p1.1),[§3](https://arxiv.org/html/2606.11473#S3.p1.8)\.
- \[13\]N\. Hollmann, S\. Müller, L\. Purucker, A\. Krishnakumar, M\. Körfer, S\. B\. Hoo, R\. T\. Schirrmeister, and F\. Hutter\(2025\)Accurate predictions on small data with a tabular foundation model\.Nature637\(8045\),pp\. 319–326\.Cited by:[§1](https://arxiv.org/html/2606.11473#S1.p1.1),[§2](https://arxiv.org/html/2606.11473#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.11473#S2.SS0.SSS0.Px2.p1.2),[§5](https://arxiv.org/html/2606.11473#S5.p1.1)\.
- \[14\]F\. Huszár and D\. Duvenaud\(2016\)Optimally\-weighted herding is bayesian quadrature\.External Links:1204\.1664,[Link](https://arxiv.org/abs/1204.1664)Cited by:[§4](https://arxiv.org/html/2606.11473#S4.p5.3)\.
- \[15\]C\. Kolberg, J\. Kreuer, J\. Huurdeman, S\. Ouaari, K\. Eggensperger, and N\. Pfeifer\(2025\)Tabpfn\-wide: continued pre\-training for extreme feature counts\.arXiv preprint arXiv:2510\.06162\.Cited by:[§2](https://arxiv.org/html/2606.11473#S2.SS0.SSS0.Px1.p1.1)\.
- \[16\]S\. P\. Lloyd\(1982\)Least squares quantization in pcm\.IEEE Trans\. Inf\. Theory28,pp\. 129–136\.External Links:[Link](https://api.semanticscholar.org/CorpusID:10833328)Cited by:[§B\.2](https://arxiv.org/html/2606.11473#A2.SS2.p1.2),[§4](https://arxiv.org/html/2606.11473#S4.p2.9)\.
- \[17\]S\. Lu, J\. Liu, S\. Peters, T\. D\. Le, C\. Xie, L\. Liu, and J\. Li\(2026\)ULEAD\-tabpfn: uncertainty\-aware dependency\-based anomaly detection with tabpfn\.arXiv preprint arXiv:2604\.20255\.Cited by:[§2](https://arxiv.org/html/2606.11473#S2.SS0.SSS0.Px2.p1.2)\.
- \[18\]J\. Ma, V\. Thomas, G\. Yu, and A\. Caterini\(2024\)In\-context data distillation with tabpfn\.arXiv preprint arXiv:2402\.06971\.Cited by:[§2](https://arxiv.org/html/2606.11473#S2.SS0.SSS0.Px2.p1.2)\.
- \[19\]D\. McElfresh, S\. Khandagale, J\. Valverde, V\. Prasad C, G\. Ramakrishnan, M\. Goldblum, and C\. White\(2023\)When do neural nets outperform boosted trees on tabular data?\.Advances in Neural Information Processing Systems36,pp\. 76336–76369\.Cited by:[§2](https://arxiv.org/html/2606.11473#S2.SS0.SSS0.Px1.p1.1)\.
- \[20\]S\. M\. Omohundro\(1989\)Five balltree construction algorithms\.Cited by:[Appendix E](https://arxiv.org/html/2606.11473#A5.p9.15)\.
- \[21\]L\. Prokhorenkova, G\. Gusev, A\. Vorobev, A\. V\. Dorogush, and A\. Gulin\(2019\)CatBoost: unbiased boosting with categorical features\.External Links:1706\.09516,[Link](https://arxiv.org/abs/1706.09516)Cited by:[§1](https://arxiv.org/html/2606.11473#S1.p1.1)\.
- \[22\]J\. Qu, D\. Holzmüller, G\. Varoquaux, and M\. L\. Morvan\(2025\)TabICL: a tabular foundation model for in\-context learning on large data\.External Links:2502\.05564,[Link](https://arxiv.org/abs/2502.05564)Cited by:[§2](https://arxiv.org/html/2606.11473#S2.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2606.11473#S3.p1.8),[§5](https://arxiv.org/html/2606.11473#S5.p1.1)\.
- \[23\]J\. Qu, D\. Holzmüller, G\. Varoquaux, and M\. L\. Morvan\(2026\)TabICLv2: a better, faster, scalable, and open tabular foundation model\.External Links:2602\.11139,[Link](https://arxiv.org/abs/2602.11139)Cited by:[§2](https://arxiv.org/html/2606.11473#S2.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2606.11473#S5.p1.1)\.
- \[24\]A\. Rahimi and B\. Recht\(2007\)Random features for large\-scale kernel machines\.InAdvances in Neural Information Processing Systems,J\. Platt, D\. Koller, Y\. Singer, and S\. Roweis \(Eds\.\),Vol\.20,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2007/file/013a006f03dbc5392effeb8f18fda755-Paper.pdf)Cited by:[Appendix D](https://arxiv.org/html/2606.11473#A4.p1.9),[Appendix D](https://arxiv.org/html/2606.11473#A4.p3.1),[§4](https://arxiv.org/html/2606.11473#S4.p8.3)\.
- \[25\]R\. Sergazinov and S\. Yin\(2025\)Chunked tabpfn: exact training\-free in\-context learning for long\-context tabular data\.External Links:2509\.00326,[Link](https://arxiv.org/abs/2509.00326)Cited by:[§2](https://arxiv.org/html/2606.11473#S2.SS0.SSS0.Px1.p1.1)\.
- \[26\]H\. Shimodaira\(2000\)Improving predictive inference under covariate shift by weighting the log\-likelihood function\.Journal of Statistical Planning and Inference90,pp\. 227–244\.External Links:[Link](https://api.semanticscholar.org/CorpusID:286122993)Cited by:[§3](https://arxiv.org/html/2606.11473#S3.SS0.SSS0.Px2.p1.6)\.
- \[27\]V\. Thomas, J\. Ma, R\. Hosseinzadeh, K\. Golestan, G\. Yu, M\. Volkovs, and A\. Caterini\(2024\)Retrieval and fine\-tuning for in\-context tabular models\.External Links:2406\.05207,[Link](https://arxiv.org/abs/2406.05207)Cited by:[§1](https://arxiv.org/html/2606.11473#S1.p2.1),[§2](https://arxiv.org/html/2606.11473#S2.SS0.SSS0.Px2.p1.2),[§6](https://arxiv.org/html/2606.11473#S6.SS0.SSS0.Px1.p1.4)\.
- \[28\]B\. Van Breugel and M\. Van Der Schaar\(2024\)Why tabular foundation models should be a research priority\.arXiv preprint arXiv:2405\.01147\.Cited by:[§1](https://arxiv.org/html/2606.11473#S1.p1.1)\.
- \[29\]D\. Whiteson\(2014\)HIGGS\.Note:UCI Machine Learning RepositoryDOI: https://doi\.org/10\.24432/C5V312Cited by:[§5\.4](https://arxiv.org/html/2606.11473#S5.SS4.p2.7),[§5](https://arxiv.org/html/2606.11473#S5.p1.1)\.
- \[30\]D\. Xu, O\. Cirit, R\. Asadi, Y\. Sun, and W\. Wang\(2024\)Mixture of in\-context prompters for tabular pfns\.External Links:2405\.16156,[Link](https://arxiv.org/abs/2405.16156)Cited by:[Appendix E](https://arxiv.org/html/2606.11473#A5.p11.2),[Appendix E](https://arxiv.org/html/2606.11473#A5.p15.2),[Appendix E](https://arxiv.org/html/2606.11473#A5.p15.3),[Appendix E](https://arxiv.org/html/2606.11473#A5.p17.1),[Appendix E](https://arxiv.org/html/2606.11473#A5.p9.15),[§2](https://arxiv.org/html/2606.11473#S2.SS0.SSS0.Px2.p1.2),[§3](https://arxiv.org/html/2606.11473#S3.SS0.SSS0.Px3.p1.1),[§5](https://arxiv.org/html/2606.11473#S5.SS0.SSS0.Px1.p1.8),[§6](https://arxiv.org/html/2606.11473#S6.SS0.SSS0.Px1.p1.4)\.
- \[31\]V\. Yarabolu, G\. Waghmare, S\. Gupta, and S\. Asthana\(2024\-12\)A scalable approach to covariate and concept drift management via adaptive data segmentation\.InProceedings of the 8th International Conference on Data Science and Management of Data \(12th ACM IKDD CODS and 30th COMAD\),CODS\-COMAD Dec ’24,pp\. 84–92\.External Links:[Link](http://dx.doi.org/10.1145/3703323.3703337),[Document](https://dx.doi.org/10.1145/3703323.3703337)Cited by:[§3](https://arxiv.org/html/2606.11473#S3.SS0.SSS0.Px2.p1.6)\.
- \[32\]H\. Ye, S\. Liu, and W\. Chao\(2025\)A closer look at tabpfn v2: understanding its strengths and extending its capabilities\.arXiv preprint arXiv:2502\.17361\.Cited by:[§1](https://arxiv.org/html/2606.11473#S1.p1.1)\.
- \[33\]Y\. Zeng, T\. Dinh, W\. Kang, and A\. C\. Mueller\(2025\)TabFlex: scaling tabular learning to millions with linear attention\.External Links:2506\.05584,[Link](https://arxiv.org/abs/2506.05584)Cited by:[§1](https://arxiv.org/html/2606.11473#S1.p2.1),[§2](https://arxiv.org/html/2606.11473#S2.SS0.SSS0.Px1.p1.1)\.
- \[34\]X\. Zhang, D\. C\. Maddix, J\. Yin, N\. Erickson, A\. F\. Ansari, B\. Han, S\. Zhang, L\. Akoglu, C\. Faloutsos, M\. W\. Mahoney,et al\.\(2025\)Mitra: mixed synthetic priors for enhancing tabular foundation models\.arXiv preprint arXiv:2510\.21204\.Cited by:[§2](https://arxiv.org/html/2606.11473#S2.SS0.SSS0.Px1.p1.1)\.

## Appendix AExperimental Details

### A\.1Models

We evaluate three PFN architectures, all loaded from publicly released checkpoints without modification:

- •TabPFNv2\.12\-layer, 6\-head transformer \(d=192d\{=\}192,dh=32d\_\{h\}\{=\}32\), maximum context windowNmax=10,000N\_\{\\max\}\{=\}10\{,\}000\. Supports classification and regression\.
- •TabICLv1\.12\-layer, 4\-head transformer \(d=512d\{=\}512,dh=128d\_\{h\}\{=\}128\),Nmax=4,096N\_\{\\max\}\{=\}4\{,\}096\. Classification only\.
- •TabICLv2\.12\-layer, 8\-head transformer \(d=512d\{=\}512,dh=64d\_\{h\}\{=\}64\),Nmax=4,096N\_\{\\max\}\{=\}4\{,\}096\. Supports classification and regression\.

### A\.2Default Hyperparameters

Table[4](https://arxiv.org/html/2606.11473#A1.T4)lists the default CRUMB hyperparameters used throughout the paper\. Any deviations are stated in the relevant experiment subsection\.

Table 4:Default CRUMB hyperparameters\.
### A\.3Other Experimental Details

All experiments are run on a single NVIDIA A100 \(80 GB\) GPU with 64 CPU cores; clustering and MMD selection run on CPU, PFN forward passes run on GPU\. Random seeds control train/test splits,kk\-means initialisation, RFF projections, and stochastic subsampling\. Each method is subject to a per\-run wall\-clock timeout \(300 s for main experiments; 1 800 s for large\-context experiments\); timed\-out runs are excluded from analysis\.

## Appendix BAblation Studies

We ablate each design choice of CRUMB in isolation: the number of test\-side clustersKK, the per\-cluster context budgetnn, and the within\-cluster retrieval strategy\. All ablations sweep a single hyperparameter while holding the remainder at their default values \(K=20K\{=\}20,n=0\.1​Nn\{=\}0\.1N, MMD herding\) and report the normalised score averaged across all TabArena datasets and 10 seeds\.

### B\.1Number of clustersKK\.

Figure[6](https://arxiv.org/html/2606.11473#A2.F6)plots accuracy against selection time asKKvaries in\{1,5,10,20,30\}\\\{1,5,10,20,30\\\}for fixedn=0\.1​Nn=0\.1N\. SettingK=1K\{=\}1collapses CRUMB to a single global context: accuracy degrades because the selected subset cannot simultaneously represent all regions of the test distribution\. Performance improves asKKincreases, but improvements start to become more marginal afterK=10K=10\.

![Refer to caption](https://arxiv.org/html/2606.11473v1/x7.png)\(a\)TabPFN v2
![Refer to caption](https://arxiv.org/html/2606.11473v1/x8.png)\(b\)TabICL v1
![Refer to caption](https://arxiv.org/html/2606.11473v1/x9.png)\(c\)TabICL v2

Figure 6:Accuracy vs\. selection time as the number of clustersKKvaries\. Each point is one value ofKK; error bars show±1\\pm 1SEM over seeds\.
### B\.2Within\-Cluster Retrieval Strategy

All variants of CRUMB share the same test\-side clustering step viakk\-means\[[16](https://arxiv.org/html/2606.11473#bib.bib38)\]; they differ only in how thenntraining points are chosen for each cluster\. We compare three strategies:

1. 1\.MMD herding\(default\)\. Greedy kernel herding that iteratively selects the training point minimising the empirical MMD between the growing context set and the test cluster \(Algorithm[1](https://arxiv.org/html/2606.11473#alg1)\)\.
2. 2\.Centroid\-NN\.Select thenntraining points nearest \(inℓ2\\ell\_\{2\}\) to the test\-cluster centroid\. This mirrors the routing mechanism of MICP, but applied to test\-side rather than train\-side clusters\.
3. 3\.Voronoi\-Uniform\.TheKKtest\-cluster centroids induce a Voronoi partition of the training set: each training point is assigned to its nearest centroid\. For clusterkk, we then uniformly subsamplennpoints from the Voronoi cell𝒱k=\{xi∈𝒟train:k=arg⁡minj⁡∥xi−cj∥\}\\mathcal\{V\}\_\{k\}=\\\{x\_\{i\}\\in\\mathcal\{D\}\_\{\\mathrm\{train\}\}:k=\\arg\\min\_\{j\}\\lVert x\_\{i\}\-c\_\{j\}\\rVert\\\}\. This guarantees that the selected context lies in the same region of feature space as the test cluster, but without any distributional objective\. We consider this as a benchmark to have a midpoint between thekkNN’s locality and the spread of uniform subsampling\.

All three strategies use the sameK=20K\{=\}20clusters, context budgetn=500n\{=\}500, and training pool \(capped atN=10,000N\{=\}10\{,\}000\), and are evaluated on the full TabArena collection with 10 seeds per dataset\.

#### Results\.

Figures[7](https://arxiv.org/html/2606.11473#A2.F7)and[8](https://arxiv.org/html/2606.11473#A2.F8)show the per\-dataset accuracy andR2R^\{2\}advantage of MMD herding over each alternative, displayed as strip plots with inter\-quartile\-range boxes per model\. MMD herding yields a positive mean advantage over both Centroid\-NN and Voronoi\-Uniform on every model, for both classification and regression tasks\.

Centroid\-NN tends to over\-concentrate the context around the centroid, under\-representing the tails of the test cluster\. Voronoi\-Uniform selects points from the correct region but allocates budget uniformly rather than targeting distributional coverage, leading to redundancy in dense regions and gaps in sparse ones\. MMD herding avoids both failure modes by explicitly minimising the distributional mismatch between context and test cluster\.

Figure[8](https://arxiv.org/html/2606.11473#A2.F8)confirms that the RFF\-accelerated variant \(B=50B\{=\}50,D=64D\{=\}64\) preserves the advantage of exact MMD herding over both alternatives\.

![Refer to caption](https://arxiv.org/html/2606.11473v1/x10.png)\(a\)Accuracy advantage\.
![Refer to caption](https://arxiv.org/html/2606.11473v1/x11.png)\(b\)R2R^\{2\}advantage\.

Figure 7:Per\-dataset advantage of exact MMD herding over Centroid\-NN and Voronoi\-Uniform\. Each dot is one dataset’s mean over 10 seeds; boxes show the IQR; diamonds mark the per\-model mean\. Positive values favour MMD herding\.![Refer to caption](https://arxiv.org/html/2606.11473v1/x12.png)\(a\)Accuracy advantage\.
![Refer to caption](https://arxiv.org/html/2606.11473v1/x13.png)\(b\)R2R^\{2\}advantage\.

Figure 8:Same comparison as Figure[7](https://arxiv.org/html/2606.11473#A2.F7)but using the RFF\-accelerated variant \(B=50B\{=\}50,D=64D\{=\}64\)\. The advantage over both alternatives is preserved\.

### B\.3Disentangling Clustering from Context Retrieval

CRUMB comprises two pre\-inference components: \(i\) clustering the test points into localised batches, and \(ii\) retrieving a context subset for each batch via greedy MMD minimisation\. A natural question is whether the improvement stems from clustering, from context retrieval, or from their interaction\. We isolate each factor with a2×22\\\!\\times\\\!2factorial design:

Case Ais standard uniform subsampling: the full test set is passed as a single batch with a uniformly drawn context\.Case Bclusters the test points but draws each cluster’s context uniformly from the entire training pool; since the context distribution is identical across clusters, we expect this to be statistically equivalent to Case A\.Case Cskips clustering and applies MMD herding directly to the full test set\. When the test set is large enough to approximate the training distribution, the discrepancy is already small and we expect MMD to yield little improvement over uniform sampling\.Case D \(CRUMB\)is the full CRUMB pipeline: each test cluster defines a localised target distribution that typically differs from the training marginal, giving MMD herding the distributional gap it needs to exploit\.

We evaluate all four cases across the 51 TabArena datasets×\\times3 PFN models×\\times10 seeds×\\times3 context sizes \(n∈\{500,1000,2000\}n\\\!\\in\\\!\\\{500,1000,2000\\\}\), with training pools capped at10,00010\{,\}000points andK=20K\\\!=\\\!20clusters\.

![Refer to caption](https://arxiv.org/html/2606.11473v1/x14.png)Figure 9:Per\-trial ranking of ablation cases \(95 % CI\)\.Mean rank across all \(dataset, model, seed\) trials; error bars show 95 % confidence intervals\. The full CRUMB pipeline \(Case D: Cluster\+\+MMD\) consistently achieves the best average rank, confirming that both components are important in improving performance\.Figure[9](https://arxiv.org/html/2606.11473#A2.F9)shows that Case D achieves the best mean rank\. Cases A and B perform similarly, as expected, while Case C offers only marginal improvement\. The interaction between clustering and retrieval is therefore the key mechanism: clustering creates localised, distribution\-shifted targets that MMD herding can meaningfully serve\.

## Appendix COnline setting

In this section, we highlight that in the online setting CRUMB can resolve into a variant of MICP\. CRUMB, as described in the main text, operates in a test batch setting: it requires the full set of test queries upfront in order to cluster them\. However, we show in Algorithm[2](https://arxiv.org/html/2606.11473#alg2)that the method adapts naturally to a streaming setting by caching the outputs of Stages 1 and 2\. After running CRUMB on an initial batch of test points, we store theKKcluster centroids\{𝝁k\}\\\{\\bm\{\\mu\}\_\{k\}\\\}and their associated training contexts\{𝒮k\}\\\{\\mathcal\{S\}\_\{k\}\\\}\. Subsequent test points are then routed to the nearest cached centroid and classified using the corresponding pre\-computed context, with no further MMD computation required\. Once enough points have been routed to a cluster to fill a batch, a single PFN forward pass is executed\.

In this cached regime, CRUMB\-Online*reduces exactly to the MICP inference procedure*: both route each incoming test point to the nearest centroid and use a fixed, pre\-computed training context\. The algorithms are operationally identical at inference time\. The only difference is in the*initialisation*\. Specifically, the algorithms differ in how the cached centroids and contexts are constructed\. MICP’s centroids are derived from clustering the*training*data, and its contexts are constructed by subsampling or expanding training clusters\. CRUMB\-Online’s centroids are derived from clustering recent*test*queries, and its contexts are constructed via MMD minimisation against those test clusters\. Once cached, both methods behave identically: route, look up context, run PFN\.

This distinction becomes consequential under covariate shift\. Because MICP’s cluster structure is anchored to the training distribution, it remains fixed regardless of where future queries fall\. In a similar manner the cached centroids of online CRUMB can become stale over time\. A possible solution is that CRUMB\-Online can periodically re\-run the clustering and MMD selection steps on a sliding window of recent queries, refreshing both centroids and contexts to track distributional drift over time, an option unavailable to MICP with a fixed training set\.

Algorithm 2CRUMB\-Online: Streaming Inference with Cached Contexts0:Training set

𝒟train\\mathcal\{D\}\_\{\\text\{train\}\}, initial test set

𝒟test\(0\)\\mathcal\{D\}\_\{\\text\{test\}\}^\{\(0\)\}, number of clusters

KK, subset size

nn, PFN model

ff
1:// Phase 1: Initialisation \(run CRUMB on initial test set\)

2:Standardise features of

𝒟train\\mathcal\{D\}\_\{\\text\{train\}\}and

𝒟test\(0\)\\mathcal\{D\}\_\{\\text\{test\}\}^\{\(0\)\}
3:

\{C1,…,CK\}←k​\-means​\(𝒟test\(0\),K\)\\\{C\_\{1\},\\dots,C\_\{K\}\\\}\\leftarrow k\\text\{\-means\}\(\\mathcal\{D\}\_\{\\text\{test\}\}^\{\(0\)\},K\)
4:for

k=1,…,Kk=1,\\dots,Kdo

5:

𝝁k←1\|Ck\|​∑𝒙∗∈Ck𝒙∗\\bm\{\\mu\}\_\{k\}\\leftarrow\\frac\{1\}\{\|C\_\{k\}\|\}\\sum\_\{\\bm\{x\}^\{\*\}\\in C\_\{k\}\}\\bm\{x\}^\{\*\}⊳\\trianglerightStore cluster centroids

6:

𝒮k←GreedyMMD​\(𝒟train,Ck,n\)\\mathcal\{S\}\_\{k\}\\leftarrow\\text\{GreedyMMD\}\(\\mathcal\{D\}\_\{\\text\{train\}\},C\_\{k\},n\)⊳\\trianglerightStore training subsets

7:

𝒚^Ck←f​\(𝒮k,Ck\)\\hat\{\\bm\{y\}\}\_\{C\_\{k\}\}\\leftarrow f\(\\mathcal\{S\}\_\{k\},C\_\{k\}\)⊳\\trianglerightPredict initial test points

8:endfor

9:Cache

←\{\(𝝁k,𝒮k\)\}k=1K\\leftarrow\\\{\(\\bm\{\\mu\}\_\{k\},\\mathcal\{S\}\_\{k\}\)\\\}\_\{k=1\}^\{K\}
10:

11:// Phase 2: Streaming inference

12:Initialise buffer

ℬ←∅\\mathcal\{B\}\\leftarrow\\emptyset
13:whilenew test points arrivedo

14:Receive new test point

𝒙new∗\\bm\{x\}^\{\*\}\_\{\\text\{new\}\}\(standardised\)

15:

k∗←arg⁡mink∈\{1,…,K\}⁡‖𝒙new∗−𝝁k‖2k^\{\*\}\\leftarrow\\arg\\min\_\{k\\in\\\{1,\\dots,K\\\}\}\\\|\\bm\{x\}^\{\*\}\_\{\\text\{new\}\}\-\\bm\{\\mu\}\_\{k\}\\\|^\{2\}⊳\\trianglerightAssign to nearest centroid

16:

ℬk∗←ℬk∗∪\{𝒙new∗\}\\mathcal\{B\}\_\{k^\{\*\}\}\\leftarrow\\mathcal\{B\}\_\{k^\{\*\}\}\\cup\\\{\\bm\{x\}^\{\*\}\_\{\\text\{new\}\}\\\}⊳\\trianglerightAdd to cluster buffer

17:if

\|ℬk∗\|≥Bmin\|\\mathcal\{B\}\_\{k^\{\*\}\}\|\\geq B\_\{\\min\}ortimeout reachedthen

18:

𝒚^ℬk∗←f​\(𝒮k∗,ℬk∗\)\\hat\{\\bm\{y\}\}\_\{\\mathcal\{B\}\_\{k^\{\*\}\}\}\\leftarrow f\(\\mathcal\{S\}\_\{k^\{\*\}\},\\mathcal\{B\}\_\{k^\{\*\}\}\)⊳\\trianglerightBatched PFN forward pass

19:emit

𝒚^ℬk∗\\hat\{\\bm\{y\}\}\_\{\\mathcal\{B\}\_\{k^\{\*\}\}\}
20:

ℬk∗←∅\\mathcal\{B\}\_\{k^\{\*\}\}\\leftarrow\\emptyset⊳\\trianglerightFlush buffer

21:endif

22:endwhile

## Appendix DAccelerations to Greedy MMD Minimisation Implementation

In this section, we describe the optimizations we apply to the greedy MMD herding in CRUMB to improve its runtime\. There are three main accelerations: \(1\) we approximate the Gaussian RBF kernel viaRandom Fourier Features\[[24](https://arxiv.org/html/2606.11473#bib.bib52)\], replacingO​\(d\)O\(d\)distance computations and exponentiations withO​\(D\)O\(D\)inner products in a low\-dimensional embedding space; \(2\) we usebatched greedy selection, choosing the top\-BBscoring candidates per iteration before recomputing the redundancy term, reducing the number of score\-recomputation rounds fromnnto⌈n/B⌉\\lceil n/B\\rceil; \(3\) we applyearly stoppingto the herding procedure by monitoring the relative decrease inMMD^2\\widehat\{\\mathrm\{MMD\}\}^\{2\}everyΔ\\Deltaselected points and halting when improvement falls below a thresholdε\\varepsilonforPPconsecutive checks, allowing simpler clusters to terminate early with fewer context points and correspondingly faster inference\.

Random Fourier Features\.The greedy MMD selection in Algorithm 1 requires repeated evaluation of the Gaussian RBF kernelκ​\(x,x′\)=exp⁡\(−‖x−x′‖2/2​σ2\)\\kappa\(x,x^\{\\prime\}\)=\\exp\\\!\\bigl\(\-\\\|x\-x^\{\\prime\}\\\|^\{2\}/2\\sigma^\{2\}\\bigr\)between candidate training points and the running selected set\. Naïvely, each greedy round requires𝒪​\(N\)\\mathcal\{O\}\(N\)kernel evaluations against the newly selected point, and the full selection ofnnpoints costs𝒪​\(N​n\)\\mathcal\{O\}\(Nn\)kernel evaluations \(or𝒪​\(N​⌈n/B⌉\)\\mathcal\{O\}\(N\\lceil n/B\\rceil\)with batch sizeBB\)\. Since each kernel evaluation involves computing a pairwise distance and an exponentiation, this becomes the computational bottleneck for large training pools\.

We accelerate selection using*Random Fourier Features*\(RFF\)\[[24](https://arxiv.org/html/2606.11473#bib.bib52)\]to approximate the RBF kernel via an explicit finite\-dimensional feature map\. Specifically, we construct a randomised embeddingz:ℝd→ℝDz\\colon\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{D\}such that

κ​\(x,x′\)≈z​\(x\)⊤​z​\(x′\),\\kappa\(x,x^\{\\prime\}\)\\approx z\(x\)^\{\\top\}z\(x^\{\\prime\}\),\(4\)where

z​\(x\)=2D​cos⁡\(Ω⊤​x\+b\),Ω∼𝒩​\(0,σ−2​Id\),b∼Uniform​\[0,2​π\]\.z\(x\)=\\sqrt\{\\frac\{2\}\{D\}\}\\cos\(\\Omega^\{\\top\}x\+b\),\\quad\\Omega\\sim\\mathcal\{N\}\\\!\\bigl\(0,\\sigma^\{\-2\}I\_\{d\}\\bigr\),\\quad b\\sim\\mathrm\{Uniform\}\[0,2\\pi\]\.\(5\)HereΩ∈ℝd×D\\Omega\\in\\mathbb\{R\}^\{d\\times D\}andb∈ℝDb\\in\\mathbb\{R\}^\{D\}are sampled once and reused for all points, so the embedding is consistent across training and test sets\.

LetZ∈ℝN×DZ\\in\\mathbb\{R\}^\{N\\times D\}denote the RFF embeddings of theNNtraining points, and letz¯𝒞=1\|𝒞k\|​∑x∗∈𝒞kz​\(x∗\)\\bar\{z\}\_\{\\mathcal\{C\}\}=\\frac\{1\}\{\|\\mathcal\{C\}\_\{k\}\|\}\\sum\_\{x^\{\*\}\\in\\mathcal\{C\}\_\{k\}\}z\(x^\{\*\}\)be the mean embedding of the test cluster𝒞k\\mathcal\{C\}\_\{k\}\. The two key quantities in the greedy objective become simple linear\-algebraic operations in the RFF space:

- •Cross\-term\(affinity of each candidate to the cluster\):1\|𝒞k\|​∑jκ​\(xi,xj∗\)≈z​\(xi\)⊤​z¯𝒞\\frac\{1\}\{\|\\mathcal\{C\}\_\{k\}\|\}\\sum\_\{j\}\\kappa\(x\_\{i\},x\_\{j\}^\{\*\}\)\\approx z\(x\_\{i\}\)^\{\\top\}\\bar\{z\}\_\{\\mathcal\{C\}\}, computed as a single matrix–vector productZ​z¯𝒞∈ℝNZ\\,\\bar\{z\}\_\{\\mathcal\{C\}\}\\in\\mathbb\{R\}^\{N\}\.
- •Running kernel sum\(redundancy with already\-selected points\):∑j∈𝒮κ​\(xi,xj\)≈z​\(xi\)⊤​∑j∈𝒮z​\(xj\)\\sum\_\{j\\in\\mathcal\{S\}\}\\kappa\(x\_\{i\},x\_\{j\}\)\\approx z\(x\_\{i\}\)^\{\\top\}\\\!\\\!\\sum\_\{j\\in\\mathcal\{S\}\}z\(x\_\{j\}\)\. After selecting a batch ofBBpoints, we update the running sumz~𝒮←z~𝒮\+∑j∈batchz​\(xj\)\\tilde\{z\}\_\{\\mathcal\{S\}\}\\leftarrow\\tilde\{z\}\_\{\\mathcal\{S\}\}\+\\sum\_\{j\\in\\text\{batch\}\}z\(x\_\{j\}\), aDD\-dimensional vector addition\. The per\-candidate scores are thenZ​z~𝒮∈ℝNZ\\,\\tilde\{z\}\_\{\\mathcal\{S\}\}\\in\\mathbb\{R\}^\{N\}\.

Each greedy round thus costs𝒪​\(N​D\)\\mathcal\{O\}\(ND\)for the matrix–vector products plus𝒪​\(B​D\)\\mathcal\{O\}\(BD\)for the running\-sum update, compared to𝒪​\(N​B​d\)\\mathcal\{O\}\(NBd\)distance computations and exponentiations in the exact\-kernel version\. SinceD≪ND\\ll Nin practice \(we useD=64D=64\), this yields a substantial speedup\.

#### Batched greedy selection\.

The standard kernel herding procedure selects one point per iteration, recomputing allNNcandidate scores after each addition\. We accelerate this by selecting the top\-BBscoring candidates at each iteration before recomputing the redundancy termz~𝒮\\tilde\{z\}\_\{\\mathcal\{S\}\}\. This reduces the number of score\-recomputation rounds fromnnto⌈n/B⌉\\lceil n/B\\rceil, at the cost of slightly greedier \(less sequential\) decisions within each batch\. In practice we useB=50B\{=\}50\(Table[4](https://arxiv.org/html/2606.11473#A1.T4)\); the ablation in Appendix[D\.1](https://arxiv.org/html/2606.11473#A4.SS1)confirms that this introduces negligible accuracy loss compared toB=1B\{=\}1\.

#### Early Stopping to MMD herding\.

In practice rather than selecting a fixed budget ofnncontext points, the algorithm monitors the squared MMD valueMMD^t2=‖z¯𝒞−1\|St\|​∑i∈Stz​\(xi\)‖2\\widehat\{\\mathrm\{MMD\}\}^\{\\,2\}\_\{t\}=\\bigl\\\|\\bar\{z\}\_\{\\mathcal\{C\}\}\-\\tfrac\{1\}\{\|S\_\{t\}\|\}\\sum\_\{i\\in S\_\{t\}\}z\(x\_\{i\}\)\\bigr\\\|^\{2\}everyΔ=50\\Delta\{=\}50selected points \(after a warm\-up ofmax⁡\(100,n/10\)\\max\(100,\\,n/10\)\) and applies early stopping when the relative improvementrt=\(MMD^t−Δ2−MMD^t2\)/MMD^t−Δ2r\_\{t\}=\(\\widehat\{\\mathrm\{MMD\}\}^\{\\,2\}\_\{t\-\\Delta\}\-\\widehat\{\\mathrm\{MMD\}\}^\{\\,2\}\_\{t\}\)/\\widehat\{\\mathrm\{MMD\}\}^\{\\,2\}\_\{t\-\\Delta\}falls belowε=10−4\\varepsilon\{=\}10^\{\-4\}forP=5P\{=\}5consecutive checks\. The final context size\|St∗\|≤n\|S\_\{t^\{\*\}\}\|\\leq nis thus determined adaptively by the complexity of the local distribution: simpler clusters converge faster, using fewer context points and proportionally faster inference\. The overall cost is𝒪​\(t∗B​N​D\)\\mathcal\{O\}\\\!\\bigl\(\\tfrac\{t^\{\*\}\}\{B\}\\,ND\\bigr\)wheret∗≤nt^\{\*\}\\leq nis the early\-stop point\.

### D\.1RFF and Batching Ablation

We compare 18 selection variants formed by crossing three axes: kernel computation \(exact vs\. RFF withD∈\{16,64,256\}D\\in\\\{16,64,256\\\}\), greedy batch size \(B∈\{1,10,50,100\}B\\in\\\{1,10,50,100\\\}\), and two non\-MMD baselines \(Centroid\-NN, Voronoi\-Uniform\)\. All variants useK=20K\{=\}20clusters, context sizen=500n\{=\}500, and are evaluated on the full TabArena collection \(38 classification \+ 13 regression datasets, 3 models, 10 seeds\)\.

#### Key findings\.

The central result is that*most MMD\-based variants perform comparably*, with only minor separation between exact and some of the approximate configurations\. Non\-MMD baselines \(Centroid\-NN, Voronoi\-Uniform\) rank noticeably worse, confirming that the distributional objective itself, not the precision of the kernel computation, drives the quality gains\.

The practical implication is that RFF and batching can be used to accelerate selection without sacrificing accuracy and with minimal hyperparameter tuning\. Figure[10](https://arxiv.org/html/2606.11473#A4.F10)shows that mean accuracy ranking is reasonably flat across the\(B,D\)\(B,D\)grid, while Figure[12](https://arxiv.org/html/2606.11473#A4.F12)shows that larger batch sizes and small RFF dimensionDDapproximations reduce wall\-clock\. The configurationB=50B\{=\}50,D=64D\{=\}64is our recommended default: it matches exact MMD accuracy to within∼0\.2\{\\sim\}0\.2pp while running∼10×\{\\sim\}10\{\\times\}faster\. Figure[13](https://arxiv.org/html/2606.11473#A4.F13)further corroborates this, showing that the per\-trial approximation gap is tightly concentrated around zero for all RFF configurations\.

![Refer to caption](https://arxiv.org/html/2606.11473v1/x15.png)Figure 10:Average rank of all context\-selection methods across \(dataset, seed, model\) trials\.For each trial, methods are ranked by predictive performance \(accuracy for classification,R2R^\{2\}for regression\) using average tie\-breaking, with rank 1 being best\. Bars show the mean rank; error bars denote 95% confidence intervals\.![Refer to caption](https://arxiv.org/html/2606.11473v1/x16.png)Figure 11:Average rank of RFF approximations by batch size and feature dimension\.For each \(dataset, seed, model\) trial, all methods are ranked by predictive performance \(rank 1 = best\)\. Cells show the mean rank±\\pmSEM of each RFF configuration, aggregated over all trials\.![Refer to caption](https://arxiv.org/html/2606.11473v1/x17.png)Figure 12:Wall\-clock time of RFF approximations by batch size and feature dimension\.Cells show mean total inference time±\\pmstandard deviation \(in seconds\) per model, aggregated over all datasets and seeds\.![Refer to caption](https://arxiv.org/html/2606.11473v1/x18.png)Figure 13:Approximation gap \(Δ=method−exact MMD\\Delta=\\text\{method\}\-\\text\{exact MMD\}\) per trial\. The performance profile \(inset\) reports the fraction of trials within±1%\\pm 1\\%,±2%\\pm 2\\%, and±5%\\pm 5\\%of exact MMD\.

## Appendix EComputational Complexity Analysis

In this section, we compare the computational complexity of the methods used in this paper\. Table[5](https://arxiv.org/html/2606.11473#A5.T5)summarises the computational cost of each inference strategy along two axes: how the total cost scales with the training set sizeNN\(for a fixed context budgetnn\), and how many PFN forward passes are required\.

Table 5:Computational complexity of inference strategies\. We isolate scaling with training set sizeNN\(for fixedn,T,Kn,T,K\) and the number of PFN forward passes \(which determines practical inference time\)\.†MICP usesK′=⌈γ​N/n⌉K^\{\\prime\}=\\lceil\\gamma N/n\\rceilprompters; with a cap onK′K^\{\\prime\}theNN\-scaling reduces toO​\(N\)O\(N\)\. In some cases some clusters may not have any test samples routed to them, in this case MICP only runsK∗K^\{\*\}forward passes, whereK∗K^\{\*\}is the number of unique clusters which have test points routed to them\.Once a context budgetnnand the number of test samplesTTare fixed, the PFN inference cost of all context selection methods becomes independent ofNN, and the onlyNN\-dependent costs are in precomputation \(clustering, retrieval, or MMD herding\)\. CRUMB’s precomputation isO​\(N​d​\(T\+K​n\)\)O\(Nd\(T\+Kn\)\), which is linear inNN\. Full\-context inference, by contrast, isO​\(N2\)O\(N^\{2\}\)and becomes prohibitively expensive for largeNN\.

The distinction between CRUMB and per\-querykkNN is not inNN\-scaling, as both are linear, but in the number of PFN forward passes, which in practice dominates wall\-clock time since each pass involves a full transformer forward pass throughLLlayers\. Per\-querykkNN requiresTTseparate forward passes \(one per test point\), because potentially no two test points share a context\. CRUMB requires onlyKKpasses \(one per test cluster\), because all test points within a cluster share the same training context and can be batched together\. ForT=5,000T=5\{,\}000test points andK=20K=20clusters, this is a250×250\\timesreduction in the number of forward passes\.

We provide details of the computational costs for each method compared in this paper\. LetNNbe the training set size,TTthe test set size,ddthe feature dimension,nnthe context budget per forward pass \(capped at the model windowNmaxN\_\{\\max\}\),KKthe number of test clusters \(CRUMB\), andK∗K^\{\*\}number of training clusters with test points routed to them \(MICP\)\.

PFN Standard Inference\.The standard PFN takes the full training set \(NNpoints\) and all test queries \(TTpoints\) as input\. In all three architectures we consider \(TabPFNv2, TabICLv1, TabICLv2\), the attention pattern is*asymmetric*: training tokens attend to all other training tokens \(self\-attention\), while test tokens attend only to training tokens \(cross\-attention\)\. Test tokens do*not*attend to each other\. While the architectures may have different scalings overall, when considering this self\-attention and cross\-attention over the samples, the attention cost per layer can be broken down into: \(a\) train\-to\-train self\-attention cost ofO​\(N2\)O\(N^\{2\}\), and \(b\) test\-to\-train cross\-attention cost ofO​\(T​N\)O\(TN\)\. Hence any architecture that employs this style of attention across samples has a cost that scales asCattn=O​\(N2\+T​N\)C\_\{\\text\{attn\}\}=O\\bigl\(N^\{2\}\+TN\\bigr\)\. The quadratic scaling in the number of training samples motivates the need for context selection methods, in which we can usen<Nn<Nsamples instead\.

Full Context\.In selecting the full context, we incur aO​\(1\)O\(1\)selection cost, as no pre\-processing is required, and a PFN inference cost, where the attention cost will scale asO​\(N2\+N​T\)O\(N^\{2\}\+NT\)\.

Uniform Subsampling\.Drawnntraining points uniformly at random and use them as context for allTTtest queries in a single forward pass\. This breaks down into a selection cost ofO​\(n\)O\(n\)for random index generation, and a PFN inference cost, where the attention cost will scale asO​\(n2\+n​T\)O\(n^\{2\}\+nT\)\.

Per\-QuerykkNN Selection\.For each test point𝒙j∗\\bm\{x\}\_\{j\}^\{\*\}individually, retrieve itsnnnearest neighbours from𝒟train\\mathcal\{D\}\_\{\\text\{train\}\}and run a separate PFN forward pass with context𝒩j\\mathcal\{N\}\_\{j\}and a single test query\. The procedure is summarised in Algorithm[3](https://arxiv.org/html/2606.11473#alg3)\.

Algorithm 3Per\-QuerykkNN Context Selection for PFN Inference0:Training set

𝒟train=\{\(𝒙i,yi\)\}i=1N\\mathcal\{D\}\_\{\\text\{train\}\}=\\\{\(\\bm\{x\}\_\{i\},y\_\{i\}\)\\\}\_\{i=1\}^\{N\}, test set

𝒟test=\{𝒙j∗\}j=1T\\mathcal\{D\}\_\{\\text\{test\}\}=\\\{\\bm\{x\}\_\{j\}^\{\*\}\\\}\_\{j=1\}^\{T\}, number of neighbours

nn, PFN model

ff
1:Standardise features of

𝒟train\\mathcal\{D\}\_\{\\text\{train\}\}and

𝒟test\\mathcal\{D\}\_\{\\text\{test\}\}
2:Compute pairwise distances between

𝒟test\\mathcal\{D\}\_\{\\text\{test\}\}and

𝒟train\\mathcal\{D\}\_\{\\text\{train\}\}
3:for

j=1,…,Tj=1,\\dots,Tdo

4:

𝒩j←n​\-nearest\-neighbours​\(𝒙j∗,𝒟train\)\\mathcal\{N\}\_\{j\}\\leftarrow n\\text\{\-nearest\-neighbours\}\(\\bm\{x\}\_\{j\}^\{\*\},\\mathcal\{D\}\_\{\\text\{train\}\}\)⊳\\trianglerightRetrievennneighbours

5:

y^j←f​\(𝒩j,\{𝒙j∗\}\)\\hat\{y\}\_\{j\}\\leftarrow f\(\\mathcal\{N\}\_\{j\},\\\{\\bm\{x\}\_\{j\}^\{\*\}\\\}\)⊳\\trianglerightPFN forward pass \(single test point\)

6:endfor

7:return

\{y^j\}j=1T\\\{\\hat\{y\}\_\{j\}\\\}\_\{j=1\}^\{T\}

For finding thenn\-nearest neighbours of each ofTTtest points, we need to compute the distance to allNNtraining points and select thennclosest\. Each distance computation costsO​\(d\)O\(d\)\(whereddis the dimension of the data\), givingO​\(N​d\)O\(Nd\)per query and a total cost ofCretrieve=O​\(T​N​d\)C\_\{\\text\{retrieve\}\}=O\(TNd\)\. We remark that this pre\-computation cost for per\-querykkNN can be improved by using standard techniques like ball tree search\[[20](https://arxiv.org/html/2606.11473#bib.bib101)\], where one can build a ball tree data structure inO​\(N​d​\(log⁡\(N\)\)2\)O\(Nd\(\\log\(N\)\)^\{2\}\)time\. In the literature, the query complexity of ball tree search fornnnearest neighbours is sometimes claimed to beO​\(n​d​log⁡\(N\)\)O\(nd\\log\(N\)\)\(which we call “typical cost" for simplicity\), for example in MICP\[[30](https://arxiv.org/html/2606.11473#bib.bib29)\], while the worst case complexity is given asO​\(N​n​d\)O\(Nnd\)\(see Table 3 in\[[1](https://arxiv.org/html/2606.11473#bib.bib102)\]\)\. Thus, the typical cost can be improved toO​\(N​d​\(log⁡\(N\)\)2\+T​n​d​log⁡\(N\)\)O\(Nd\(\\log\(N\)\)^\{2\}\+Tnd\\log\(N\)\)\. Note that the details of complexity calculation depend on the exact algorithm for tree search being implemented\. While the lookup cost can be improved, this still does not allow batching in general, requiringTTforward passes\.

Since each test point has a unique PFN instance being run separately \(no batching\), and for a single test pointT=1T=1, a forward pass for PFN inference costsO​\(n2\+n\)O\(n^\{2\}\+n\), running this for every test point gives a final inference cost ofO​\(T​n2\+T​n\)O\(Tn^\{2\}\+Tn\)\.

MICP\.MixturePFN\[[30](https://arxiv.org/html/2606.11473#bib.bib29)\]introduces a Sparse Mixture of In\-Context Prompters \(MICP\)\. It clusters the*training*data intoK′K^\{\\prime\}clusters, constructs a fixed\-size “prompt” \(support set\) ofnntraining points per cluster, and routes each test point to the nearest cluster centroid at inference time\.

A critical design choice is that thenumber of promptersgrows withNNas

K′=⌈γ​Nn⌉,K^\{\\prime\}=\\left\\lceil\\frac\{\\gamma N\}\{n\}\\right\\rceil,\(6\)whereγ\>0\\gamma\>0is a hyperparameter trading efficiency for effectiveness\. Sincennis fixed \(the context budget\),K′K^\{\\prime\}grows*linearly*withNN\. MICP incurs a one\-time precomputation cost at initialization\.

The MICP algorithm is composed of three steps: \(a\) clustering the training data, \(b\) building a ball tree over theK′K^\{\\prime\}cluster centroids andNNtraining data points, and \(c\) constructing the prompter support sets onto which we route the test points\. We will compute the cost associated with each step below\.

In computing the clusters, we incur thekk\-means cost of using Lloyd’s algorithm to clusterNNtraining points intoK′=⌈γ​N/n⌉K^\{\\prime\}=\\lceil\\gamma N/n\\rceilclusters\. WithIIiterations this cost is quantified as

Ccluster=O​\(N​K′​d​I\)=O​\(γ​I​N2​dn\)\.C\_\{\\text\{cluster\}\}=O\(NK^\{\\prime\}dI\)=O\\\!\\left\(\\frac\{\\gamma IN^\{2\}d\}\{n\}\\right\)\.SinceK′∝NK^\{\\prime\}\\propto N, this scales quadratically inNN\.

After the clusters have been built, the ball tree construction over theK′K^\{\\prime\}cluster centroids \(for efficient routing\) and over the full training data \(forkkNN expansion of small clusters\) has cost\[[30](https://arxiv.org/html/2606.11473#bib.bib29)\]

Ctree=O​\(K′​d​log⁡\(K′\)\+N​d​log⁡N\)=O​\(N​d​log⁡\(N\)\)\.C\_\{\\text\{tree\}\}=O\(K^\{\\prime\}d\\log\(K^\{\\prime\}\)\+Nd\\log N\)=O\(Nd\\log\(N\)\)\.Note that we use the ball tree construction cost given in\[[30](https://arxiv.org/html/2606.11473#bib.bib29)\]for consistency, instead ofO​\(N​d​\(log⁡\(N\)\)2\)O\(Nd\(\\log\(N\)\)^\{2\}\)cost given in other studies\[[1](https://arxiv.org/html/2606.11473#bib.bib102)\]\. We keep track of the dimension unlike\[[30](https://arxiv.org/html/2606.11473#bib.bib29)\]\.

Finally, for each of theK′K^\{\\prime\}prompters, we form a support set of sizenn\. If the cluster has≥n\\geq npoints, we subsample, and if it has<n<npoints, we expand viann\-nearest\-neighbour search from the centroid\. In the worst case, whereO​\(K′\)O\(K^\{\\prime\}\)clusters need to be expanded, this has cost given as

Cprompt=O​\(K′​n​d​log⁡N\)=O​\(γ​Nn⋅n​d​log⁡N\)=O​\(γ​N​d​log⁡N\),C\_\{\\text\{prompt\}\}=O\(K^\{\\prime\}nd\\log N\)=O\\\!\\left\(\\frac\{\\gamma N\}\{n\}\\cdot nd\\log N\\right\)=O\(\\gamma Nd\\log N\),using the fact that the ball tree search typically takesO​\(n​d​log⁡\(N\)\)O\(nd\\log\(N\)\)time for findingnn\-nearest neighbours\. We remark that in the worst case, ball tree look up costsO​\(N​n​d\)O\(Nnd\)\[[1](https://arxiv.org/html/2606.11473#bib.bib102)\]\.

The total pre\-computation cost then amounts to\[[30](https://arxiv.org/html/2606.11473#bib.bib29)\]

TMixPFN, init=O​\(γ​I​N2​dn\+N​d​log⁡N\)\.T\_\{\\text\{MixPFN, init\}\}=O\\\!\\left\(\\frac\{\\gamma IN^\{2\}d\}\{n\}\+Nd\\log N\\right\)\.\(7\)
At inference, each test point is routed to its nearest cluster centroid via ball\-tree search over theK′K^\{\\prime\}centroids\. This has cost:

Croute=O​\(T​log⁡K′\)=O​\(T​log⁡\(N/n\)\)\.C\_\{\\text\{route\}\}=O\(T\\log K^\{\\prime\}\)=O\(T\\log\(N/n\)\)\.The routing step isO​\(log⁡N\)O\(\\log N\)per test sample\.

Finally, we run PFN inference\. LetK∗∈\{1,…,K′\}K^\{\*\}\\in\\\{1,\\dotsc,K^\{\\prime\}\\\}denote the number of unique clusters that test points are assigned to\. Then, in the worst case, each such cluster is assignedT/K∗T/K^\{\*\}test points, and correspondingly, the PFN inference forK∗K^\{\*\}forward passes have a total cost ofO​\(K∗​\(n2\+n​\(T/K∗\)\)\)=O​\(K∗​n2\+n​T\)O\(K^\{\*\}\(n^\{2\}\+n\(T/K^\{\*\}\)\)\)=O\(K^\{\*\}n^\{2\}\+nT\)\. Note that ifTTis comparable toNN, thenK∗≈K′=O​\(N\)K^\{\*\}\\approx K^\{\\prime\}=O\(N\)in the worst case, and therefore, the number of forward passes for MICP can be large in principle\.

CRUMB\.CRUMB has three stages: \(1\) cluster the test queries, \(2\) greedy MMD selection per cluster, \(3\) batched PFN inference\.

Clustering theTTtest points viakk\-means \(Lloyd’s algorithm\) intoKKclusters inℝd\\mathbb\{R\}^\{d\}has cost

CStage 1=O​\(T​K​d​I\)\.C\_\{\\text\{Stage 1\}\}=O\(TKdI\)\.The next step is greedy MMD selection: for each clusterCkC\_\{k\}\(with\|Ck\|\|C\_\{k\}\|test points\), the greedy kernel herding procedure selectsnntraining points from the full pool ofNNcandidates\. First, we pre\-compute cross\-terms, where for each ofNNtraining points, the computation of the mean kernel similarity to all\|Ck\|\|C\_\{k\}\|test points in the cluster \(first term in Eq\. \([3](https://arxiv.org/html/2606.11473#S4.E3)\)\) incurs the costO​\(N​\|Ck\|​cost​\(κ\)\)O\(N\|C\_\{k\}\|\\text\{cost\}\(\\kappa\)\), wherecost​\(κ\)\\text\{cost\}\(\\kappa\)is the cost of computing the kernel value given two points\. For RBF kernel, we havecost​\(κ\)=O​\(d\)\\textnormal\{cost\}\(\\kappa\)=O\(d\)corresponding to computing‖𝒙i−𝒙j∗‖2\\\|\\bm\{x\}\_\{i\}\-\\bm\{x\}\_\{j\}^\{\*\}\\\|^\{2\}and the exponential\. Thus, for RBF kernel used in this study, the associated cost per cluster for cross\-terms isCcross=O​\(N⋅\|Ck\|⋅d\)C\_\{\\text\{cross\}\}=O\(N\\cdot\|C\_\{k\}\|\\cdot d\)\.

Next, we compute the cost for the second term in Eq\. \([3](https://arxiv.org/html/2606.11473#S4.E3)\) in the greedy loop\. This involves keeping track of the kernel costs between a growing batch𝒮k\\mathcal\{S\}\_\{k\}and the remaining training data\. Specifically, at timett, this amounts to a cost ofO​\(\(N−t\)​\(t\+1\)​d\)O\(\(N\-t\)\(t\+1\)d\)for the RBF kernel\. Overn=\|𝒮k\|n=\|\\mathcal\{S\}\_\{k\}\|steps, we pay a total cost ofCgreedy=∑t=0n−1O​\(\(N−t\)​\(t\+1\)​d\)=O​\(N​n​d\)C\_\{\\textnormal\{greedy\}\}=\\sum\_\{t=0\}^\{n\-1\}O\(\(N\-t\)\(t\+1\)d\)=O\(Nnd\)\. Importantly, since we have already precomputed the cross terms, they do not need to be computed at every step\.

As a result, the total cost for one clusterCkC\_\{k\}isCStage 2\(k\)=O​\(N​\|Ck\|​d\+n​N​d\)=O​\(N​d​\(\|Ck\|\+n\)\)C\_\{\\text\{Stage 2\}\}^\{\(k\)\}=O\\bigl\(N\|C\_\{k\}\|d\+nNd\\bigr\)=O\\bigl\(Nd\(\|C\_\{k\}\|\+n\)\\bigr\)\. Aggregating over allKKclusters, we obtain

CStage 2=∑k=1KO​\(N​d​\(\|Ck\|\+n\)\)=O​\(N​d​\(∑k\|Ck\|⏟=T\+K​n\)\)=O​\(N​d​\(T\+K​n\)\)\.C\_\{\\text\{Stage 2\}\}=\\sum\_\{k=1\}^\{K\}O\\bigl\(Nd\(\|C\_\{k\}\|\+n\)\\bigr\)=O\\bigl\(Nd\\bigl\(\\underbrace\{\\textstyle\\sum\_\{k\}\|C\_\{k\}\|\}\_\{=T\}\+Kn\\bigr\)\\bigr\)=O\\bigl\(Nd\(T\+Kn\)\\bigr\)\.\(8\)Withnncapped atNmaxN\_\{\\max\}andK,T,dK,T,dfixed, this isO​\(N\)O\(N\), linear in the training pool size\.

Now, we look at the final state of batched PFN inference\. Here, we are required to runKKforward passes\. Passkkhas context𝒮k\\mathcal\{S\}\_\{k\}\(sizenn\) and test batchCkC\_\{k\}\(size\|Ck\|\|C\_\{k\}\|\)\. For a single cluster the attention cost across samples for a single forward pass isO​\(n2\+\|Ck\|​n\)O\(n^\{2\}\+\|C\_\{k\}\|n\)\. Therefore, runningKKtotal forward passes gives a cost ofCStage 3=O​\(K​n2\+∑k\|Ck\|​n\)=O​\(K​n2\+T​n\)C\_\{\\textnormal\{Stage 3\}\}=O\(Kn^\{2\}\+\\sum\_\{k\}\|C\_\{k\}\|n\)=O\(Kn^\{2\}\+Tn\)\.

As shown in the previous examples we can see that regardless of how theTTtest points are batched, the cross\-attention contribution in the attention always scalesO​\(T​n\)O\(Tn\)\. However, as each of theKKbatches uses a different training context the self attention for all batches scales asO​\(K​n2\)O\(Kn^\{2\}\)\. For per\-querykkNN we get the worst case scaling ofO​\(T​n2\)O\(Tn^\{2\}\), which motivates CRUMB as a faster method that still preserves performance\. We also note that better methods of finding training contexts can lead to a smaller context being needed for the same result\. This way speedups can be achieved by makingnnsmaller\.

Similar Articles

Online Pandora's Box for Contextual LLM Cascading

arXiv cs.AI

This paper introduces an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs, proposing a learning approach that combines GMM estimation with UCB-style confidence bounds and proving dimension-dependent regret bounds.