From Rashomon Theory to PRAXIS: Efficient Decision Tree Rashomon Sets

arXiv cs.LG Papers

Summary

PRAXIS is a new algorithm that efficiently approximates the Rashomon set of near-optimal decision trees, achieving orders of magnitude improvement in runtime and memory while maintaining near-perfect recall.

arXiv:2606.00202v1 Announce Type: new

Cached at: 06/02/26, 03:40 PM

# From Rashomon Theory to PRAXIS: Efficient Decision Tree Rashomon Sets
Source: [https://arxiv.org/html/2606.00202](https://arxiv.org/html/2606.00202)
###### Abstract

Standard machine learning pipelines often admit many near-optimal models. These “Rashomon sets” pose a range of challenges and opportunities for uncertainty-aware, robust decision making. They allow users to incorporate domain knowledge and preferences that would otherwise be difficult to specify directly in an objective, and they quantify diversity among valid models for a given training dataset and objective function. However, computation of Rashomon sets, even for simple, interpretable model classes such as sparse decision trees, continues to require immense memory and runtime resources. We present PRAXIS, an algorithm to approximate this Rashomon set with orders of magnitude improvement in runtime and memory usage. We validate that PRAXIS regularly recovers almost all of the full Rashomon set. PRAXIS allows researchers and practitioners to scalably model the Rashomon set for real-world datasets. Code for PRAXIS is available at [https://github.com/zakk-h/PRAXIS](https://github.com/zakk-h/PRAXIS).

interpretable machine learning, Rashomon sets, Decision Trees

## 1 Introduction

Model selection is crucial to any machine learning pipeline. The Rashomon effect (Breiman, [1984](https://arxiv.org/html/2606.00202#bib.bib16)) refers to the phenomenon that many models can be well-justified for a given dataset and objective. A recent framework for model selection in the presence of this effect is the Rashomon set paradigm (Xin et al., [2022](https://arxiv.org/html/2606.00202#bib.bib80); Rudin et al., [2024](https://arxiv.org/html/2606.00202#bib.bib68)), whereby an algorithm first enumerates or represents the set of all plausible models that fit the data well (i.e., the Rashomon set), and then the user interacts with the Rashomon set to select the model that aligns best with their needs. This paradigm is particularly useful when users are unable to formalize all of their goals and constraints in advance. Users may want to find a model that maximizes fairness, obeys causal hypotheses, uses certain features, and/or follows specific structural constraints. These modeling goals become simpler when the Rashomon set has been enumerated, because only a simple loop through the set is required to optimize any secondary objective. The method for uncovering this Rashomon set depends on the model class being considered. For instance, Rashomon sets of generalized additive models (GAMs) can be efficiently approximated by sampling around a convex hyperboloid centered at the optimal weight vector (Zhong et al., [2023](https://arxiv.org/html/2606.00202#bib.bib83)). Uncovering the Rashomon set for discrete, non-parametric model classes such as decision trees, however, can be a massive computational undertaking in both runtime and memory, as even finding a single optimal decision tree is NP-hard. As an example, Hu et al. ([2019b](https://arxiv.org/html/2606.00202#bib.bib41)) show that the size of the search space of decision trees of depth 4 with only 20 features is $\approx 8.4\times 10^{18}$ trees.

![Refer to caption](https://arxiv.org/html/2606.00202v1/x1.png)

Figure 1: An illustration of PRAXIS and other Rashomon set algorithms on the News dataset (Fernandes et al., [2015b](https://arxiv.org/html/2606.00202#bib.bib35)), $\lambda=0.02$, $\varepsilon=0.03$, depth $=5$. PRAXIS runs orders of magnitude faster than competitors while ensuring near-perfect recall relative to optimal methods.

Although two recent algorithms (Xin et al., [2022](https://arxiv.org/html/2606.00202#bib.bib80); Arslan et al., [2026](https://arxiv.org/html/2606.00202#bib.bib3)) can exactly uncover the Rashomon set for sparse decision trees, their runtime and memory requirements scale exponentially with the depth budget and number of features. RESPLIT (Babbar et al., [2025](https://arxiv.org/html/2606.00202#bib.bib5)) provides an approximate Rashomon set, at significant cost to approximation quality and still with worst-case exponential cost per tree recovered.

We propose a new family of Rashomon set approximation algorithms with polynomial computation time per member of the Rashomon set\. Our algorithms use a proxy subroutine for evaluating feature splits; this subroutine efficiently computes the objectives of high quality individual trees on a given subproblem\.

Our approach, PRAXIS (Proxy-guided Rashomon set ApproXimatIonS), achieves orders-of-magnitude improvements in both runtime and memory efficiency compared to state-of-the-art exact and approximate approaches while still recovering nearly the full Rashomon set.

Our contributions are as follows\.

\(1\) We introduce a novel, flexible framework for efficient approximation of the Rashomon set of decision trees\.

\(2\) We demonstrate reliable recovery of the Rashomon set with our approximations and discuss theoretical conditions under which this recovery is guaranteed\.

\(3\) We demonstrate that our method is orders of magnitude faster and more memory efficient than the previous state\-of\-the\-art methods and provide asymptotic analysis to match\.

## 2 Related Work

##### Optimal Individual Trees\.

Decision trees have been established for decades as interpretable, scalable classifiers, with widely cited implementations such as CART (Breiman, [1984](https://arxiv.org/html/2606.00202#bib.bib16)) and C4.5 (Quinlan, [2014](https://arxiv.org/html/2606.00202#bib.bib67)). Early decision tree approaches were greedy, building trees from the root, adding splits one at a time according to heuristics, without looking back to see if the splits could be improved. Researchers have substantially improved the accuracy of individual sparse decision trees via global optimization of performance and sparsity, alongside a range of techniques for computational efficiency (Lin et al., [2020](https://arxiv.org/html/2606.00202#bib.bib46); Aglin et al., [2020](https://arxiv.org/html/2606.00202#bib.bib1); Hu et al., [2019a](https://arxiv.org/html/2606.00202#bib.bib40); Demirović et al., [2022](https://arxiv.org/html/2606.00202#bib.bib25); van der Linden et al., [2023](https://arxiv.org/html/2606.00202#bib.bib76); Bertsimas & Dunn, [2017](https://arxiv.org/html/2606.00202#bib.bib12)). A recent survey found that many of the most scalable approaches succeed by leveraging tree-specific branch and bound logic with dynamic programming (Costa & Pedreira, [2023](https://arxiv.org/html/2606.00202#bib.bib24)). Several tree optimization works have observed that such tree-specific approaches can be framed as search over an AND/OR graph (Sullivan et al., [2024](https://arxiv.org/html/2606.00202#bib.bib74); Chaouki et al., [2025](https://arxiv.org/html/2606.00202#bib.bib20)); our algorithm also corresponds to an AND/OR graph search (see Appendix [B.2](https://arxiv.org/html/2606.00202#A2.SS2)).

##### Optimal Rashomon Sets\.

Several approaches for finding optimal single trees can be extended to find all trees within a small multiple of the optimal objective, that is, the Rashomon set of trees (Xin et al., [2022](https://arxiv.org/html/2606.00202#bib.bib80); Arslan et al., [2026](https://arxiv.org/html/2606.00202#bib.bib3)). Additional works exist for special cases of trees (Mata et al., [2022](https://arxiv.org/html/2606.00202#bib.bib49); Ciaperoni et al., [2024](https://arxiv.org/html/2606.00202#bib.bib21); Babbar et al., [2026](https://arxiv.org/html/2606.00202#bib.bib6)). The ability to capture Rashomon sets enables a range of powerful downstream applications, such as adding robustness (Hsu et al., [2026](https://arxiv.org/html/2606.00202#bib.bib39)), estimating variable importance (Donnelly et al., [2023](https://arxiv.org/html/2606.00202#bib.bib28), [2026](https://arxiv.org/html/2606.00202#bib.bib29)), or providing customization and control to domain experts and practitioners (Rudin et al., [2024](https://arxiv.org/html/2606.00202#bib.bib68)). These advances are significant but struggle to scale to larger practical datasets: they are combinatorially complex in memory and runtime, particularly with respect to the number of features in the dataset.

##### Individual Tree Approximations\.

Several approaches have improved the scalability of optimal and approximately optimal decision tree algorithms through better handling of continuous features (Mazumder et al., [2022](https://arxiv.org/html/2606.00202#bib.bib51); Brița et al., [2025](https://arxiv.org/html/2606.00202#bib.bib17)) or incorporating carefully founded heuristics (McTavish et al., [2022](https://arxiv.org/html/2606.00202#bib.bib52); Blanc et al., [2024](https://arxiv.org/html/2606.00202#bib.bib14); Demirović et al., [2023](https://arxiv.org/html/2606.00202#bib.bib26); Kiossou et al., [2022](https://arxiv.org/html/2606.00202#bib.bib44), [2024](https://arxiv.org/html/2606.00202#bib.bib45); Kiossou & Schaus, [2026](https://arxiv.org/html/2606.00202#bib.bib43)). Of particular relevance to our work is the LicketySPLIT algorithm (Babbar et al., [2025](https://arxiv.org/html/2606.00202#bib.bib5)). Starting at the root, this approach considers all possible splits and evaluates each based on completions of the tree with a greedy algorithm. It selects the best initial split conditioned on greedy completions, then recurses. This split-selection heuristic is a rollout procedure (Bertsekas et al., [1997](https://arxiv.org/html/2606.00202#bib.bib11)); consequently, the overall LicketySPLIT algorithm belongs to the class of pilot algorithms (Duin & Voß, [1994](https://arxiv.org/html/2606.00202#bib.bib30); Voß et al., [2005](https://arxiv.org/html/2606.00202#bib.bib77)). Our method PRAXIS can be viewed as a pilot method that outputs not a single solution, but a set of near-optimal solutions. Like any pilot method, it uses a subroutine for heuristic completions. For ease of nomenclature, we call this subroutine a proxy method (defined in Definition [3.1](https://arxiv.org/html/2606.00202#S3.Thmtheorem1)), though this proxy can be quite complicated. Empirically, we find that the objective of a single-tree pilot method like LicketySPLIT serves as an excellent proxy, both for accurate results and for caching efficiency.

##### Rashomon Set Approximations.

Relatively few works have focused on computational benefits for approximate Rashomon sets, which can be complex given the distinct nature of the optimization task. Babbar et al. ([2025](https://arxiv.org/html/2606.00202#bib.bib5)) provide RESPLIT, a scalable Rashomon set approximation based on a hybridization of greedy heuristics and branch-and-bound search. RESPLIT’s optimization strategy is orthogonal to our approach: it solves simpler subproblems optimally and then combines them approximately, whereas we instead approximate the full problem directly. In addition, a commonly used baseline is to take repeated bootstraps of a dataset and run single tree optimization on each bootstrap; this is explored in both exact decision tree Rashomon set works (Arslan et al., [2026](https://arxiv.org/html/2606.00202#bib.bib3); Xin et al., [2022](https://arxiv.org/html/2606.00202#bib.bib80)). Both papers find bootstrapping to be inefficient, leading to many missed Rashomon set members, both when running an optimal method on each bootstrap and when running a greedy method like CART (Breiman, [1984](https://arxiv.org/html/2606.00202#bib.bib16)). We also compare to bootstrapping our proxy algorithm, LicketySPLIT, showing that our approach (PRAXIS) yields a dramatically better approximation of the Rashomon set.

## 3 Methodology

### 3.1 Notation

Let $D=\{(x_i,y_i)\}_{i=1}^{n}$ be a dataset of size $n$, where $y_i\in\{0,1\}$ and each $x_i\in\{0,1\}^{k}$ has $k$ binary features (possibly binarizations of continuous or categorical features). A binary decision tree $t\in\mathcal{T}$ is a function $\{0,1\}^{k}\rightarrow\{0,1\}$. A depth 0 tree, i.e., a leaf, makes a single point prediction for any sample. Any other tree has a simple splitting function (defined as $\textrm{split}_{t}(x_i)=x_{ij}$ for some $j\in[1,k]$), as well as two subtrees $t_{\textrm{left}}\in\mathcal{T}$, $t_{\textrm{right}}\in\mathcal{T}$ s.t. $\textrm{depth}(t)=1+\max\left(\textrm{depth}(t_{\textrm{left}}),\textrm{depth}(t_{\textrm{right}})\right)$; it returns:

$$t(x_i)=t_{\textrm{left}}(x_i)\,\textrm{split}_{t}(x_i)+t_{\textrm{right}}(x_i)\,\bigl(1-\textrm{split}_{t}(x_i)\bigr).$$

That is, a tree returns the prediction of its left or right subtree based on a queried feature. We denote $D_t$ as the subset of dataset $D$ for which tree $t$ is assigned to make predictions; for the root tree, this is equal to $D$, and this is then recursively defined as

$$D_{t_{\textrm{left}}}=\{(x_i,y_i)\in D_t \textrm{ s.t. } \textrm{split}_{t}(x_i)=1\}$$

$$D_{t_{\textrm{right}}}=\{(x_i,y_i)\in D_t \textrm{ s.t. } \textrm{split}_{t}(x_i)=0\}$$
Let $|t|$ denote the number of leaves in tree $t$. We use an objective that penalizes both the number of misclassifications and the number of leaves, with a per-leaf miss penalty, $\gamma$:

$$\text{Obj}(t,D,\gamma)=\gamma|t|+\sum_{i=1}^{n}\mathbb{1}\{t(x_i)\neq y_i\}.$$

This objective is analogous to prior Rashomon set formulations (Xin et al., [2022](https://arxiv.org/html/2606.00202#bib.bib80); Arslan et al., [2026](https://arxiv.org/html/2606.00202#bib.bib3)), which use a regularization parameter $\lambda$; the two are related by a scaling of $|D|$, with $\gamma=\lambda|D|$. In our implementation, we constrain $\gamma$ to be a whole number to avoid floating point calculations, with negligible effect on objective expressiveness (discussed in Appendix [B](https://arxiv.org/html/2606.00202#A2)).
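To make the scaling between $\gamma$ and $\lambda$ concrete, the following worked example uses hypothetical numbers (a dataset of 1000 samples and a tree with 4 leaves and 130 errors; none of these come from the paper). Dividing the objective by $|D|$ gives an error rate plus $\lambda$ per leaf, which is the form implied by the relation $\gamma=\lambda|D|$.

```latex
% Hypothetical numbers, chosen only to illustrate the scaling \gamma = \lambda |D|.
\begin{align*}
|D| = 1000,\quad \lambda = 0.02 \;&\Rightarrow\; \gamma = \lambda |D| = 20, \\
\text{Obj}(t, D, \gamma) &= \gamma |t| + \#\text{errors} = 20 \cdot 4 + 130 = 210, \\
\frac{\text{Obj}(t, D, \gamma)}{|D|} &= \frac{130}{1000} + 0.02 \cdot 4 = 0.21 .
\end{align*}
```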

Let $\mathcal{T}_{d}=\{t\in\mathcal{T}\mid\textrm{depth}(t)\leq d\}$. The Rashomon set for dataset $D$, depth bound $d$, and penalty $\gamma$ can be defined as

$$\mathcal{R}_{\varepsilon_{\mathrm{abs}}}(D,d,\gamma):=\{t\in\mathcal{T}_{d}\mid\mathrm{Obj}(t,D,\gamma)\leq\varepsilon_{\mathrm{abs}}\},$$

with cardinality $|\mathcal{R}_{\varepsilon_{\mathrm{abs}}}(D,d,\gamma)|$ (abbreviated to $|\mathcal{R}_{\varepsilon_{\mathrm{abs}}}|$ for notational simplicity). It is often convenient to parameterize this using a fractional tolerance $\varepsilon_{\mathrm{mult}}$, setting $\varepsilon_{\mathrm{abs}}=(1+\varepsilon_{\mathrm{mult}})\min_{t\in\mathcal{T}_{d}}\mathrm{Obj}(t,D,\gamma)$.
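The definition above can be checked by brute force on a toy problem. The sketch below is not the paper's method: it enumerates every tree of depth at most $d$ (only feasible for tiny $k$ and $d$) and keeps those within the budget. The nested-tuple tree encoding, the function names, and the four-point AND dataset are all assumptions made for illustration.

```python
from itertools import product

def all_trees(k, d):
    """Every tree of depth <= d over k binary features, as nested tuples."""
    trees = [("leaf", 0), ("leaf", 1)]
    if d > 0:
        subtrees = all_trees(k, d - 1)
        for j, left, right in product(range(k), subtrees, subtrees):
            trees.append(("split", j, left, right))
    return trees

def predict(t, x):
    """Follow the left subtree when the queried feature is 1, else the right."""
    while t[0] == "split":
        _, j, left, right = t
        t = left if x[j] == 1 else right
    return t[1]

def leaves(t):
    return 1 if t[0] == "leaf" else leaves(t[2]) + leaves(t[3])

def objective(t, X, y, gamma):
    """gamma * (number of leaves) + number of misclassified samples."""
    errors = sum(predict(t, xi) != yi for xi, yi in zip(X, y))
    return gamma * leaves(t) + errors

def rashomon_set(X, y, d, gamma, eps_mult):
    """Exhaustive Rashomon set with eps_abs = (1 + eps_mult) * optimal objective."""
    scored = [(objective(t, X, y, gamma), t) for t in all_trees(len(X[0]), d)]
    eps_abs = (1 + eps_mult) * min(obj for obj, _ in scored)
    return eps_abs, [(obj, t) for obj, t in scored if obj <= eps_abs]

# Toy data: y is the logical AND of two binary features.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
eps_abs, rset = rashomon_set(X, y, d=2, gamma=1, eps_mult=0.5)
```

On this toy data the optimal tree is the single leaf predicting 0 (objective 2), so the budget is 3 and the set also admits two-leaf stumps with one error and the three-leaf trees that fit AND exactly.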

As in prior work, we maintain a search graph representation of the Rashomon set, from which all valid trees can be enumerated after the algorithm terminates. Algorithms [8](https://arxiv.org/html/2606.00202#alg8) and [9](https://arxiv.org/html/2606.00202#alg9) in [Appendix B](https://arxiv.org/html/2606.00202#A2) specify the operations used to extend this graph with new leaves and splits, respectively, and populate the minimum objective of trees rooted at that subproblem to be used throughout our algorithm.

### 3.2 Proxy Optimizer Framework

Our goal is to approximate the Rashomon set efficiently. To that end, we utilize proxy algorithms (Definition [3.1](https://arxiv.org/html/2606.00202#S3.Thmtheorem1)) as efficient subroutines to find high-quality individual trees. We use the objectives incurred by these trees to prune our search space.

###### Definition 3.1 (Proxy Algorithm).

Given a dataset $D$, depth budget $d$, and regularization parameter $\gamma$, a *proxy algorithm* returns the objective of some tree $t\in\mathcal{T}_{d}$:

$$\textsc{PROXY}(D,d,\gamma)=\mathrm{Obj}(t,D,\gamma).$$

If $t$ is not a leaf, then its children must satisfy:

$$\begin{aligned} &\textsc{PROXY}(D_{t_{\mathrm{left}}},d-1,\gamma)&&\leq\mathrm{Obj}(t_{\mathrm{left}},D_{t_{\mathrm{left}}},\gamma)\\ &\textsc{PROXY}(D_{t_{\mathrm{right}}},d-1,\gamma)&&\leq\mathrm{Obj}(t_{\mathrm{right}},D_{t_{\mathrm{right}}},\gamma)\end{aligned}$$

Many algorithms in the literature can satisfy this property, and therefore can be a proxy algorithm (e.g., Breiman, [1984](https://arxiv.org/html/2606.00202#bib.bib16); Lin et al., [2020](https://arxiv.org/html/2606.00202#bib.bib46)). If a decision tree algorithm does not satisfy this refinement property, we can easily modify it to meet the above criteria by memoizing subtree objectives (see Appendix [B.6](https://arxiv.org/html/2606.00202#A2.SS6)). If an algorithm exactly preserves the left and right subtree objectives, we can save runtime with caching, as also discussed in Appendix [B.6](https://arxiv.org/html/2606.00202#A2.SS6).

Algorithm [1](https://arxiv.org/html/2606.00202#alg1) presents our approach. At each subproblem (starting from the root), we construct nodes to represent each of the possible actions, i.e., leaf predictions or feature splits, for extending a tree while remaining within the budget $\varepsilon_{\textrm{abs}}$. We add any leaf predictions that fit within the budget (lines 2-8), and, if we have not hit a depth limit (lines 9-11), evaluate all possible splits, pruning those whose proxy-completed objectives exceed $\varepsilon_{\textrm{abs}}$ (lines 12-21). For each remaining split, we recursively approximate the sets of feasible left and right subtrees using the budget refinement procedure in Algorithm [3](https://arxiv.org/html/2606.00202#alg3) (lines 22-23).

When the proxy algorithm is itself *exactly* optimal, PRAXIS recovers the Rashomon set in a manner similar to existing state-of-the-art solvers (Xin et al., [2022](https://arxiv.org/html/2606.00202#bib.bib80); Arslan et al., [2026](https://arxiv.org/html/2606.00202#bib.bib3)). Using an approximate proxy algorithm, however, leads to many practical benefits.

Under some non-optimal settings of this proxy optimizer, we can still recover exact guarantees around our approximation. For example, Theorem [A.12](https://arxiv.org/html/2606.00202#A1.Thmtheorem12) in [Appendix A](https://arxiv.org/html/2606.00202#A1) guarantees that, under a small set of modifications to PRAXIS, the returned set contains the full Rashomon set of simple rule lists (trees that have at least one leaf child for each split) for the given features and depth.

In practice, we find that strong, approximate proxy algorithms, as well as stronger pruning, yield efficient and reliable Rashomon approximations. Algorithm [2](https://arxiv.org/html/2606.00202#alg2) presents our default proxy algorithm for the experiments presented in the main body of this paper. It corresponds to a substantially modified version of the LicketySPLIT algorithm (Babbar et al., [2025](https://arxiv.org/html/2606.00202#bib.bib5)). At each subproblem, we return a leaf if the leaf objective and $\gamma$ penalty guarantee no further splits will improve the objective (lines 2-5); otherwise, we find a single split minimizing the objective of a greedy tree completion on each subproblem (line 6) and recursively call our algorithm on the resulting two subproblems (lines 7-9). We then compare that recursive solution to our leaf objective, and return whichever leads to a better tree (line 10). By contrast, LicketySPLIT compares the leaf objective to the greedy objective before recursion. Either choice leads to the same worst-case asymptotic complexity, but avoiding that prepruning step allows us to improve the objective of recovered trees. Several other important distinctions from LicketySPLIT that improve the proxy are discussed in Appendix [B.5](https://arxiv.org/html/2606.00202#A2.SS5). Most significantly, we cache subproblem solutions encountered in $\textsc{Greedy}$, which guarantees some cache reuse in later subproblems, and the greedy solver directly finds the accuracy-maximizing split when the remaining depth limit is one (rather than minimizing weighted child entropy). Appendix [B.5](https://arxiv.org/html/2606.00202#A2.SS5) defines a family of proxy algorithms ranging from greedy to optimal, and Appendix [D.5](https://arxiv.org/html/2606.00202#A4.SS5) provides empirical results for these alternatives.

Algorithm 1 $\textsc{PRAXIS}(D,d,\gamma,\varepsilon_{\textrm{abs}})$

Require: Subproblem dataset $D$, remaining depth $d$, per-leaf penalty $\gamma$, budget $\varepsilon_{\textrm{abs}}$

1: Let $G\leftarrow\textsc{OrNode}(\varepsilon_{\textrm{abs}})$ {Initialize subgraph for subtrees found within budget $\varepsilon_{\textrm{abs}}$ (See Appendix [B.2](https://arxiv.org/html/2606.00202#A2.SS2))}

2: for $b\in\{0,1\}$ do

3: {For each possible leaf prediction $b$, set $C_{b}$ to the corresponding objective: $\gamma$ + misclassification error.}

4: $C_{b}\leftarrow\gamma+\big|\{(x_i,y_i)\in D:y_i\neq b\}\big|$

5: if $C_{b}\leq\varepsilon_{\textrm{abs}}$ then

6: $\textsc{AddLeaf}(G,b,C_{b})$ {See Appendix [B.2](https://arxiv.org/html/2606.00202#A2.SS2)}

7: end if

8: end for

9: if $d=0$ or $\varepsilon_{\textrm{abs}}<2\gamma$ then

10: return $G$ {Either no remaining depth or any split would exceed the budget}

11: end if

12: for each feature $j$ do

13: $(D_{L},D_{R})\leftarrow\textsc{Partition}(D,j)$

14: if $D_{L}=\emptyset$ or $D_{R}=\emptyset$ then

15: continue {Skip degenerate splits}

16: end if

17: $P_{L}\leftarrow\textsc{Proxy}(D_{L},d-1,\gamma)$ {Proxy cost on left}

18: $P_{R}\leftarrow\textsc{Proxy}(D_{R},d-1,\gamma)$ {Proxy cost on right}

19: if $P_{L}+P_{R}>\varepsilon_{\textrm{abs}}$ then

20: continue {Prune split if proxy completions exceed the budget}

21: end if

22: $G_{L},G_{R}\leftarrow\textsc{Solve\_Siblings}(D_{L},D_{R},d-1,\gamma,\varepsilon_{\textrm{abs}},P_{R})$ {Find subgraphs for the left and right subproblems of this split; described in Algorithm [3](https://arxiv.org/html/2606.00202#alg3)}

23: $\textsc{AddSplit}(G,j,G_{L},G_{R})$ {Add $G_{L},G_{R}$ as a split for the current subgraph $G$; see Appendix [B.2](https://arxiv.org/html/2606.00202#A2.SS2)}

24: end for

25: return $G$ {Rashomon graph for the subproblem}

Usage Note: If provided with only $\varepsilon_{\textrm{mult}}$ and not $\varepsilon_{\textrm{abs}}$, set the Rashomon budget relative to the proxy:

$$\varepsilon_{\textrm{abs}}\leftarrow\bigl(1+\varepsilon_{\textrm{mult}}\bigr)\cdot\textsc{Proxy}(D,d,\gamma)$$

Algorithm 2 Default Algorithm for $\textsc{Proxy}(D^{\prime},d,\gamma)$

Require: Subproblem dataset $D^{\prime}$, remaining depth $d$, leaf penalty $\gamma$

1: $n^{\prime}\leftarrow|D^{\prime}|$, $p\leftarrow|\{(x_i,y_i)\in D^{\prime}:y_i=1\}|$

2: $\text{leaf\_obj}\leftarrow\gamma+\min\{p,\,n^{\prime}-p\}$

3: if $d=0$ or $\text{leaf\_obj}\leq 2\gamma$ then

4: return $\text{leaf\_obj}$

5: end if

6: $j^{\star}\leftarrow$ feature whose split minimizes $\textsc{Greedy}(D^{\prime}_{L},d-1,\gamma)+\textsc{Greedy}(D^{\prime}_{R},d-1,\gamma)$

7: Split $D^{\prime}$ by $j^{\star}$ into $(D^{\prime}_{L},D^{\prime}_{R})$

8: $L\leftarrow\textsc{Proxy}(D^{\prime}_{L},d-1,\gamma)$

9: $R\leftarrow\textsc{Proxy}(D^{\prime}_{R},d-1,\gamma)$

10: return $\min\{\text{leaf\_obj},L+R\}$
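Algorithm 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the data layout (a tuple of `(features, label)` pairs), the helper names, and especially the `greedy` routine are assumptions. The paper's greedy solver uses weighted child entropy and caching (Appendix B.5), whereas the stand-in below scores a split by the sum of its children's leaf objectives.

```python
def leaf_obj(rows, gamma):
    """gamma plus the errors made by the majority-class leaf (Algorithm 2, lines 1-2)."""
    positives = sum(label for _, label in rows)
    return gamma + min(positives, len(rows) - positives)

def partition(rows, j):
    """Split rows on binary feature j into (x_j == 1, x_j == 0)."""
    return (tuple(r for r in rows if r[0][j] == 1),
            tuple(r for r in rows if r[0][j] == 0))

def best_split(rows, score):
    """Non-degenerate (left, right) partition minimizing score(left) + score(right)."""
    best = None
    for j in range(len(rows[0][0])):
        left, right = partition(rows, j)
        if left and right:
            total = score(left) + score(right)
            if best is None or total < best[0]:
                best = (total, left, right)
    return None if best is None else best[1:]

def greedy(rows, d, gamma):
    """Assumed greedy completion: pick the split with the best one-step leaf objective."""
    leaf = leaf_obj(rows, gamma)
    if d == 0 or leaf <= 2 * gamma:
        return leaf
    split = best_split(rows, lambda part: leaf_obj(part, gamma))
    if split is None:
        return leaf
    left, right = split
    return min(leaf, greedy(left, d - 1, gamma) + greedy(right, d - 1, gamma))

def proxy(rows, d, gamma):
    """Rollout-style proxy: choose the split by greedy completions, then recurse."""
    leaf = leaf_obj(rows, gamma)
    if d == 0 or leaf <= 2 * gamma:                  # lines 3-5
        return leaf
    split = best_split(rows, lambda part: greedy(part, d - 1, gamma))  # line 6
    if split is None:                                # no usable split on this subproblem
        return leaf
    left, right = split                              # line 7
    total = proxy(left, d - 1, gamma) + proxy(right, d - 1, gamma)     # lines 8-9
    return min(leaf, total)                          # line 10

# Toy data: XOR of two features, each point repeated three times.
rows = tuple(((a, b), a ^ b) for a in (0, 1) for b in (0, 1)) * 3
depth2 = proxy(rows, 2, 1)   # four pure leaves: objective 4
depth1 = proxy(rows, 1, 1)   # no single split helps on XOR: the leaf objective, 7
```

Because the value returned for a split is the sum of the proxy values of its two children, this sketch satisfies the refinement property of Definition 3.1 by construction.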

### 3.3 Algorithm details

##### Budget\.

One of the core details of PRAXIS is the setting of the budget $\varepsilon_{\textrm{abs}}$, which maps to the number of errors (and additional leaves) the algorithm can make while remaining in the Rashomon set. At the root, the budget can be based on a user-specified value, or it can be obtained by treating the proxy tree at the root as the reference solution and applying a multiplicative factor of $(1+\varepsilon_{\textrm{mult}})$. In the latter case, the output can also be filtered at the end to keep only trees whose objective is within $(1+\varepsilon_{\textrm{mult}})$ times that of the best tree recovered (this filtering is trivial since we enumerate the set in sorted order of objective).

A key subroutine in PRAXIS is Algorithm [3](https://arxiv.org/html/2606.00202#alg3), which efficiently propagates the budget to child subproblems using the initial budget and the proxy algorithm’s estimates of split quality. Given the two sibling subproblems resulting from a given split, the algorithm initially allocates budget for the left subproblem so that any solution found can be combined with the proxy algorithm’s solution for the right subproblem while remaining within the parent budget $\varepsilon_{\textrm{abs}}$ (line 3). This is potentially an underestimate: if there are completions of the right subproblem that perform better than the proxy, then additional budget should be provided to the left subproblem. To handle this, the algorithm repeatedly refines (loosens) the left and right subproblem budgets if a better tree is found for the other subproblem (lines 5-12). This iterative budget refinement procedure continues until no better tree is found.
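A short numeric trace may help make the refinement loop concrete. All values below are hypothetical: a parent budget of 10, a right-side proxy cost of 4, and best objectives of 3 found on each side.

```latex
% Hypothetical trace of the sibling budget refinement; all numbers are illustrative.
\begin{align*}
\varepsilon_{\textrm{abs}} = 10,\; P_R = 4 \;&\Rightarrow\; \varepsilon_L = \varepsilon_{\textrm{abs}} - P_R = 6 \\
\text{best left objective found} = 3 \;&\Rightarrow\; \varepsilon_R = 10 - 3 = 7 \\
\text{best right objective found} = 3 < P_R \;&\Rightarrow\; \varepsilon_L^{\textrm{(new)}} = 10 - 3 = 7 > 6 \quad \text{(re-solve the left side)} \\
\text{best left objective still } 3 \;&\Rightarrow\; \varepsilon_R^{\textrm{(new)}} = 7 \not> 7 \quad \text{(stop)}
\end{align*}
```

In this trace the right subproblem turned out to be cheaper than its proxy estimate, so the left budget was loosened once from 6 to 7; no further improvement was found and the loop ended.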

Algorithm 3 $\textsc{Solve\_Siblings}(D_{L},D_{R},d,\gamma,\varepsilon_{\textrm{abs}},P_{R})$

Require: Left/right datasets $D_{L},D_{R}$, remaining depth $d$, regularization $\gamma$, parent budget $\varepsilon_{\textrm{abs}}$, proxy cost $P_{R}$

1: $\varepsilon_{L}\leftarrow-\infty$ {largest budget used for solving $D_{L}$ with PRAXIS; currently we have not yet run on $D_{L}$ at all.}

2: $\varepsilon_{R}\leftarrow-\infty$ {largest budget used for solving $D_{R}$ with PRAXIS; currently we have not yet run on $D_{R}$ at all.}

3: $\varepsilon_{L}^{\textrm{(new)}}\leftarrow\varepsilon_{\textrm{abs}}-P_{R}$ {new budget to use for $D_{L}$ with PRAXIS}

4: while $\varepsilon_{L}^{\textrm{(new)}}>\varepsilon_{L}$ do

5: $\varepsilon_{L}\leftarrow\varepsilon_{L}^{\textrm{(new)}}$

6: $G_{L}\leftarrow\textsc{PRAXIS}(D_{L},d,\gamma,\varepsilon_{L})$

7: $\varepsilon_{R}^{\textrm{(new)}}\leftarrow\varepsilon_{\textrm{abs}}-G_{L}.\textit{min\_objective}$

8: if $\varepsilon_{R}^{\textrm{(new)}}>\varepsilon_{R}$ then

9: $\varepsilon_{R}\leftarrow\varepsilon_{R}^{\textrm{(new)}}$

10: $G_{R}\leftarrow\textsc{PRAXIS}(D_{R},d,\gamma,\varepsilon_{R})$

11: $\varepsilon_{L}^{\textrm{(new)}}\leftarrow\varepsilon_{\textrm{abs}}-G_{R}.\textit{min\_objective}$

12: end if

13: end while

14: return $G_{L},G_{R}$
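Algorithms 1 and 3 can be combined into a small runnable sketch. Everything about the representation is an assumption made for illustration: the graph is a plain dict with `leaves`, `splits`, and `min` fields rather than the AND/OR graph of Appendix B.2, `enumerate_trees` is a hypothetical helper, and the proxy here is an exhaustive optimum standing in for Algorithm 2 (so the result is exact on this toy problem). The sketch also adds a guard before recording a split, so that a split is kept only if some combination fits the budget.

```python
import math

def leaf_cost(rows, b, gamma):
    """Objective of a single leaf predicting b."""
    return gamma + sum(1 for _, label in rows if label != b)

def partition(rows, j):
    return (tuple(r for r in rows if r[0][j] == 1),
            tuple(r for r in rows if r[0][j] == 0))

def proxy(rows, d, gamma):
    """Stand-in proxy (assumption): the exhaustive optimum, not the paper's Algorithm 2."""
    best = min(leaf_cost(rows, 0, gamma), leaf_cost(rows, 1, gamma))
    if d == 0 or best <= 2 * gamma:
        return best
    for j in range(len(rows[0][0])):
        left, right = partition(rows, j)
        if left and right:
            best = min(best, proxy(left, d - 1, gamma) + proxy(right, d - 1, gamma))
    return best

def praxis(rows, d, gamma, eps):
    G = {"leaves": [], "splits": [], "min": math.inf}
    for b in (0, 1):                                  # Algorithm 1, lines 2-8
        cost = leaf_cost(rows, b, gamma)
        if cost <= eps:
            G["leaves"].append((b, cost))
            G["min"] = min(G["min"], cost)
    if d == 0 or eps < 2 * gamma:                     # lines 9-11
        return G
    for j in range(len(rows[0][0])):                  # lines 12-24
        left, right = partition(rows, j)
        if not left or not right:
            continue
        p_left, p_right = proxy(left, d - 1, gamma), proxy(right, d - 1, gamma)
        if p_left + p_right > eps:
            continue
        GL, GR = solve_siblings(left, right, d - 1, gamma, eps, p_right)
        if GL["min"] + GR["min"] <= eps:              # guard added in this sketch
            G["splits"].append((j, GL, GR))
            G["min"] = min(G["min"], GL["min"] + GR["min"])
    return G

def solve_siblings(left, right, d, gamma, eps, p_right):
    eps_l = eps_r = -math.inf                         # Algorithm 3, lines 1-2
    GL = GR = {"leaves": [], "splits": [], "min": math.inf}
    new_l = eps - p_right                             # line 3
    while new_l > eps_l:                              # lines 4-13
        eps_l = new_l
        GL = praxis(left, d, gamma, eps_l)
        new_r = eps - GL["min"]
        if new_r > eps_r:
            eps_r = new_r
            GR = praxis(right, d, gamma, eps_r)
            new_l = eps - GR["min"]
    return GL, GR

def enumerate_trees(G, budget):
    """Yield (objective, tree) for every tree in G whose objective is within budget."""
    for b, cost in G["leaves"]:
        if cost <= budget:
            yield cost, ("leaf", b)
    for j, GL, GR in G["splits"]:
        for a, tl in enumerate_trees(GL, budget - GR["min"]):
            for c, tr in enumerate_trees(GR, budget - a):
                yield a + c, ("split", j, tl, tr)

# Toy data: y is the logical AND of two binary features.
rows = (((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1))
gamma, depth, eps_mult = 1, 2, 0.5
eps_abs = (1 + eps_mult) * proxy(rows, depth, gamma)  # budget relative to the proxy
trees = list(enumerate_trees(praxis(rows, depth, gamma, eps_abs), eps_abs))
```

Note that each sibling graph may hold trees up to its own refined budget, so the enumeration step filters combinations against the parent budget rather than taking the full cross product.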

##### Runtime and Memory Requirements\.

Whereas normally the memory and runtime requirements on Rashomon set construction can be arbitrarily larger than the size of the Rashomon set, PRAXIS takes memory and runtime linear in the size of the Rashomon set approximation it finds. We demonstrate this in the following theorems (which are proven in [Appendix A](https://arxiv.org/html/2606.00202#A1)).

###### Theorem 3.2 (PRAXIS Runtime).

Given a dataset $D$ with $n$ samples and $k$ features, depth limit $d$, leaf penalty $\gamma$, and Rashomon budget $\varepsilon_{\textrm{abs}}$, $\textsc{PRAXIS}(D,d,\gamma,\varepsilon_{\textrm{abs}})$ computes an estimate of the Rashomon set $R_{\varepsilon_{\mathrm{abs}}}(D,d,\gamma)$ in time $\mathcal{O}\!\left(|R_{\varepsilon_{\mathrm{abs}}}|\,k\,d\cdot\textsc{ProxyCompute}(n,k,d)\right)$, assuming the runtime of our proxy algorithm is bounded: $\textsc{ProxyCompute}(n,k,d)\in\Theta\!\bigl(n\,g(k,d)\bigr)$ for some function $g(k,d)\in\Omega(1)$ that is also nondecreasing in $d$.

Theorem [3.2](https://arxiv.org/html/2606.00202#S3.Thmtheorem2) shows that for a polynomial-time proxy algorithm that is linear in $n$, PRAXIS performs only polynomial work per actual tree in the Rashomon set. This is a substantially stronger guarantee than that of TreeFARMS (Xin et al., [2022](https://arxiv.org/html/2606.00202#bib.bib80)), SORTeD (Arslan et al., [2026](https://arxiv.org/html/2606.00202#bib.bib3)), and RESPLIT (Babbar et al., [2025](https://arxiv.org/html/2606.00202#bib.bib5)), which, in the worst case, require exponential time per tree they output. When using our default proxy algorithm (based on LicketySPLIT from Babbar et al., [2025](https://arxiv.org/html/2606.00202#bib.bib5)), we have a runtime of $\mathcal{O}\bigl(|R|\,n\,k^{3}d^{3}\bigr)$, i.e., we spend time that is only a polynomial multiple of the size of the Rashomon set that we need to output.
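The stated bound for the default proxy is what Theorem 3.2 yields under one additional assumption, which this section does not state explicitly: that the default proxy itself runs in $\mathcal{O}(n\,k^{2}d^{2})$, i.e., $g(k,d)=k^{2}d^{2}$. Under that inferred assumption the substitution is:

```latex
% Assumption: the default proxy runs in O(n k^2 d^2); this section does not state that bound.
\begin{align*}
\mathcal{O}\bigl(|R|\, k\, d \cdot \textsc{ProxyCompute}(n,k,d)\bigr)
  &= \mathcal{O}\bigl(|R|\, k\, d \cdot n\, k^{2} d^{2}\bigr) \\
  &= \mathcal{O}\bigl(|R|\, n\, k^{3} d^{3}\bigr).
\end{align*}
```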

By scaling linearly in the output size, Theorem [3.2](https://arxiv.org/html/2606.00202#S3.Thmtheorem2) is a substantial improvement relative to optimal solvers in the often-encountered scenario where the Rashomon set is a tiny fraction of the overall search space of trees. For instance, Semenova et al. ([2022](https://arxiv.org/html/2606.00202#bib.bib72)) show that even large Rashomon sets (allowing for 5% additive error) are on the order of $10^{-37}$% or $10^{-38}$% of the hypothesis space. Corollary [3.3](https://arxiv.org/html/2606.00202#S3.Thmtheorem3) gives a condition on the Rashomon set size such that even finding a single optimal tree (which is a precursor to finding the entire Rashomon set; Xin et al., [2022](https://arxiv.org/html/2606.00202#bib.bib80); Arslan et al., [2026](https://arxiv.org/html/2606.00202#bib.bib3)) with standard branch-and-bound is super-polynomially slower than PRAXIS, even though PRAXIS finds an approximation of the entire Rashomon set.

###### Corollary 3.3 (Super-polynomial speedup over optimal dynamic programming).

Consider a dataset $D$ of size $n\times k$, and let $d$ be the depth budget. Setting $\gamma=0$ for simplicity and assuming there is at least one tree in the Rashomon set, if

$$\forall q,\quad |R_{\varepsilon_{\mathrm{abs}}}(D,d,\gamma)|\in o\!\left(\frac{\binom{k}{d}}{(kd)^{q}}\right),$$

then a full dynamic programming search (i.e., Algorithm 1 from Demirović et al., [2023](https://arxiv.org/html/2606.00202#bib.bib26)) for a single optimum will have worst-case runtime that is a super-polynomial factor of the Rashomon set size $|R_{\varepsilon_{\mathrm{abs}}}|$, but PRAXIS with a polynomial-time proxy algorithm satisfying Theorem [3.2](https://arxiv.org/html/2606.00202#S3.Thmtheorem2) will not, i.e.:

$$\forall q,\quad \frac{\text{Worst-Case Runtime}(\text{OPT})}{\text{Worst-Case Runtime}(\text{PRAXIS})}\in\omega\bigl((kd)^{q}\bigr).$$

In addition to the runtime efficiency discussed above, Theorem [3.4](https://arxiv.org/html/2606.00202#S3.Thmtheorem4) below establishes that PRAXIS’s memory complexity scales linearly with the size of the input and the set of trees it finds, for any memory-efficient proxy algorithm.

###### Theorem 3.4 (PRAXIS Memory Usage).

Let $d$ be a depth budget and $\gamma$ be a leaf penalty. Given a dataset $D$ of size $n$ with $k$ features and a proxy algorithm with $\mathcal{O}(nk)$ memory usage, a memory-efficient implementation of PRAXIS can find an estimate of the Rashomon set $R_{\varepsilon_{\mathrm{abs}}}(D,d,\gamma)$ using $\mathcal{O}\!\left(nk+\sum_{t\in R_{\varepsilon_{\mathrm{abs}}}}|t|\right)$ memory (where $|t|$ is the number of leaves in tree $t$), with the runtime complexity in Theorem [3.2](https://arxiv.org/html/2606.00202#S3.Thmtheorem2).

In practice, our implementation trades additional memory for faster runtime by caching and reusing solutions to repeated subproblems\.

##### Caching\.

Rashomon set algorithms for decision trees make use of caching to allow subproblem reuse. The subproblem representation of Xin et al. ([2022](https://arxiv.org/html/2606.00202#bib.bib80)) requires a bitvector of length $n$ to represent the subset of data points contained in a subproblem (and the subproblem depth), which can be memory intensive. Alternatively, subproblems may be represented by the set of feature splits (and branch directions) used to reach them (Arslan et al., [2026](https://arxiv.org/html/2606.00202#bib.bib3)). This representation is substantially more memory efficient, but it cannot detect when different sets of splits correspond to the same subset of data points, which can lead to redundant work. This creates a space–time tradeoff (see Appendix [B.7](https://arxiv.org/html/2606.00202#A2.SS7)).

In PRAXIS, we combine the memory efficiency of split-based representations (Arslan et al., [2026](https://arxiv.org/html/2606.00202#bib.bib3)) with the reuse benefits of bitvector representations (Xin et al., [2022](https://arxiv.org/html/2606.00202#bib.bib80)). We hash each bitvector representing the samples in a subproblem (and the depth budget) to a 64-bit fingerprint and use this fingerprint as the cache key. Thus, when our modified LicketySPLIT proxy is called, we cache the solution for every recursive subproblem it solves while computing the objective of the tree it implicitly enumerates; we apply the same caching strategy to each subproblem solved by the greedy algorithm. In Appendix [B.7](https://arxiv.org/html/2606.00202#A2.SS7), we discuss and show the benefits of this approach. For well-behaved hash functions, the probability of an erroneous cache hit is vanishingly small; in practice, this saves orders of magnitude of memory without introducing errors.
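The fingerprint-keyed cache described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PRAXIS implementation: the names `fingerprint`, `FingerprintCache`, and `toy_solver` are hypothetical, BLAKE2b (truncated to 8 bytes) stands in for whatever 64-bit hash the authors use, and the bitvector is a plain Python list of booleans.

```python
import hashlib
from typing import Callable, Dict, Sequence


def fingerprint(mask: Sequence[bool], depth: int) -> int:
    """Map (sample bitvector, depth budget) to a 64-bit integer key."""
    packed = 0
    for i, inside in enumerate(mask):
        if inside:
            packed |= 1 << i
    n = len(mask)
    payload = (
        n.to_bytes(8, "little")
        + packed.to_bytes((n + 7) // 8, "little")
        + depth.to_bytes(4, "little")
    )
    # digest_size=8 asks BLAKE2b for an 8-byte (64-bit) digest.
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "little")


class FingerprintCache:
    """Stores one value per 64-bit fingerprint instead of per length-n bitvector."""

    def __init__(self) -> None:
        self._values: Dict[int, float] = {}
        self.hits = 0

    def solve(
        self,
        mask: Sequence[bool],
        depth: int,
        solver: Callable[[Sequence[bool], int], float],
    ) -> float:
        key = fingerprint(mask, depth)
        if key in self._values:
            self.hits += 1
            return self._values[key]
        value = solver(mask, depth)
        self._values[key] = value
        return value


def toy_solver(mask: Sequence[bool], depth: int) -> float:
    # Hypothetical stand-in for a proxy call: fraction of samples in the subproblem.
    return sum(mask) / len(mask)


cache = FingerprintCache()
# Two different split sequences that select the same samples share one cache entry.
via_splits_a = [True, False, True, True, False, False, True, False]
via_splits_b = list(via_splits_a)
first = cache.solve(via_splits_a, 3, toy_solver)
second = cache.solve(via_splits_b, 3, toy_solver)
```

The key is derived from the samples rather than from the splits used to reach them, so the second call is a cache hit, while each stored key costs 8 bytes regardless of $n$. A hash collision would return a wrong cached value; the sketch, like the text, relies on that probability being negligible.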

##### Recovery Conditions\.

All trees returned by PRAXIS are guaranteed to lie within the specified budget $\varepsilon_{\mathrm{abs}}$, meaning precision is easy to guarantee even when using an approximate proxy. However, perfect recall is not guaranteed: a proxy may exceed the budget on a subproblem even when an optimal subtree for that subproblem would remain feasible. In this case, some valid Rashomon set members will not be recovered. Theorem [3.5](https://arxiv.org/html/2606.00202#S3.Thmtheorem5) and Corollary [3.6](https://arxiv.org/html/2606.00202#S3.Thmtheorem6) provide sufficient conditions to guarantee a decision tree is recovered by PRAXIS.

###### Theorem 3.5 (Frontier cut recovery condition).

Let $t$ be a tree within the root budget $\varepsilon_{\mathrm{abs}}$. Fix any root or internal node $n_{d'}$ of $t$ at depth $d'$, and let $n_{1},n_{2},\dots,n_{d'-1},n_{d'}$ be the nodes on the path from the root to $n_{d'}$ (excluding the root $r$). For each $n_{i}$ on this path, let $s_{i}$ denote its sibling in $t$, and let $(n_{d'})_{\mathrm{left}}$ and $(n_{d'})_{\mathrm{right}}$ denote the two children of $n_{d'}$. Let $d_{u}$ denote the remaining depth budget for internal node $u$. Suppose that

$$\sum_{u\in\{s_{1},\ldots,s_{d'},\,(n_{d'})_{\mathrm{left}},\,(n_{d'})_{\mathrm{right}}\}}\textsc{Proxy}\bigl(D_{u},d_{u},\gamma\bigr)\;\leq\;\varepsilon_{\mathrm{abs}}.$$

Then, PRAXIS will not prune exploration of $(n_{d'})_{\mathrm{left}}$ and $(n_{d'})_{\mathrm{right}}$. If this holds for all internal nodes, then $t$ will be in the set returned by PRAXIS.

![Refer to caption](https://arxiv.org/html/2606.00202v1/x2.png)

Figure 2: Frontier cut for internal node $n_{d'}$. The path to the node is shaded in orange. The nodes that are summed in the frontier cut are depicted in blue.

To illustrate the condition in Theorem [3.5](https://arxiv.org/html/2606.00202#S3.Thmtheorem5), we introduce the notion of a *frontier cut*, depicted in Figure [2](https://arxiv.org/html/2606.00202#S3.F2). A frontier cut for a node $N$ of a given tree corresponds to those subproblems not already encountered along the path from the root to node $N$. Theorem [3.5](https://arxiv.org/html/2606.00202#S3.Thmtheorem5) tells us that if, for each internal node of a tree, proxy algorithm completions along that node’s frontier cut remain within $\varepsilon_{\mathrm{abs}}$, then we are guaranteed to include that tree in PRAXIS’s output.
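The frontier cut condition of Theorem 3.5 can be checked mechanically for a given tree. The following is a minimal sketch, assuming each node already carries a precomputed proxy value standing in for $\textsc{Proxy}(D_u,d_u,\gamma)$; the `Node` class and `frontier_cut_violations` function are hypothetical names introduced here for illustration and are not part of the PRAXIS codebase.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    # Assumed to be precomputed: the proxy's objective on this node's subproblem.
    proxy: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_internal(self) -> bool:
        return self.left is not None and self.right is not None


def frontier_cut_violations(root: Node, eps_abs: float) -> List[float]:
    """Return the frontier-cut sums that exceed eps_abs, one per violating node."""
    violations: List[float] = []

    def visit(node: Node, sibling_sum: float) -> None:
        if not node.is_internal:
            return
        # Frontier cut: siblings along the path plus this node's two children.
        cut = sibling_sum + node.left.proxy + node.right.proxy
        if cut > eps_abs:
            violations.append(cut)
        # Descending into one child turns the other child into a path sibling.
        visit(node.left, sibling_sum + node.right.proxy)
        visit(node.right, sibling_sum + node.left.proxy)

    visit(root, 0.0)
    return violations


# Toy tree: the root splits into an internal node and a leaf.
tree = Node(
    proxy=0.5,
    left=Node(proxy=0.25, left=Node(proxy=0.125), right=Node(proxy=0.0625)),
    right=Node(proxy=0.125),
)
loose = frontier_cut_violations(tree, eps_abs=0.375)
tight = frontier_cut_violations(tree, eps_abs=0.35)
```

An empty result means the sufficient condition holds at every internal node. A non-empty result does not by itself mean the tree is lost, since the condition is sufficient rather than necessary.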

Intuitively, we would expect that conditioning on splits from trees that are in the Rashomon set should generally improve the performance of the proxy. Thus, we expect the condition from Theorem [3.5](https://arxiv.org/html/2606.00202#S3.Thmtheorem5) to be satisfied for most trees in the Rashomon set. Even when some tree in the Rashomon set does not satisfy this condition, however, we are often still able to recover the tree, thanks to Algorithm [3](https://arxiv.org/html/2606.00202#alg3). The repeated passes in that subroutine allow PRAXIS to widen the budget provided in subproblem exploration beyond what the initial proxy completions along a frontier cut would allow. Corollary [A.16](https://arxiv.org/html/2606.00202#A1.Thmtheorem16) in the appendix formalizes some of the objective improvements that will occur along sibling nodes as a consequence of these repeated passes.

A related condition for recovering a tree in the Rashomon set comes from Corollary [3.6](https://arxiv.org/html/2606.00202#S3.Thmtheorem6). This corollary implies that if the gap between the proxy and an optimal tree completion decreases (or does not increase too much) further down a tree, then our approach will recover that tree. This aligns with observations about monotonically decreasing optimal-greedy gaps in Rashomon sets (Babbar et al., [2025](https://arxiv.org/html/2606.00202#bib.bib5)). It also provides a multiplicative factor $\sigma$ that, when set appropriately, guarantees recovery of all trees in $\mathcal{R}_{\varepsilon_{\mathrm{mult}}}$, even if the earlier condition does not hold.

###### Corollary 3.6 (Slack needed for recovery under approximate proxies).

For leaf penalty $\gamma$, depth $d$, and dataset $D$, fix a tree $t_{r}\in\mathcal{R}_{\varepsilon_{\mathrm{mult}}}(D,d,\gamma)$ and initialize PRAXIS with

$$\varepsilon_{\mathrm{abs}}\leftarrow\sigma\,(1+\varepsilon_{\mathrm{mult}})\,\textsc{Proxy}(D,d,\gamma).$$

Define

$$\begin{aligned}
\alpha &= \frac{\textsc{Proxy}(D,d,\gamma)}{\min_{t\in\mathcal{T}_{d}}\mathrm{Obj}(t,D,\gamma)},\\
\beta &= \max_{u\in\mathrm{InternalNodes}(t_{r})}\left\{\frac{\textsc{Proxy}(D_{u},d_{u},\gamma)}{\min_{t\in\mathcal{T}_{d_{u}}}\mathrm{Obj}(t,D_{u},\gamma)}\right\},\\
\eta_{r} &= \frac{\mathrm{Obj}(t_{r},D,\gamma)}{(1+\varepsilon_{\mathrm{mult}})\min_{t\in\mathcal{T}_{d}}\mathrm{Obj}(t,D,\gamma)}\in\left[\frac{1}{1+\varepsilon_{\mathrm{mult}}},1\right].
\end{aligned}$$

Then, PRAXIS will return $t_{r}$ with $\sigma=\max\!\left\{1,\frac{\beta}{\alpha}\eta_{r}\right\}$.

In the worst case, $\alpha=1$ and $\beta>1$, meaning the proxy is exact at the root but not later down the tree. In this setting, PRAXIS still recovers every tree in $\mathcal{R}_{\varepsilon_{\mathrm{mult}}}$ provided the Rashomon bound is rescaled by $\sigma\geq\beta$.
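To make the quantities in Corollary 3.6 concrete, the slack factor can be computed directly from its definition. This is a diagnostic sketch only: it assumes the optimal objectives are available (for example from an exact solver on a small dataset), which is not the case in general, and the function name `recovery_slack` and its argument layout are hypothetical.

```python
from typing import Iterable, Tuple


def recovery_slack(
    proxy_root: float,
    opt_root: float,
    node_pairs: Iterable[Tuple[float, float]],
    obj_tree: float,
    eps_mult: float,
) -> float:
    """Compute sigma = max(1, (beta / alpha) * eta_r) from Corollary 3.6.

    node_pairs holds (proxy objective, optimal objective) for each internal
    node of the fixed tree t_r; all optimal objectives must be positive.
    """
    pairs = list(node_pairs)
    if opt_root <= 0 or any(opt_u <= 0 for _, opt_u in pairs):
        raise ValueError("optimal objectives must be positive for the ratios to exist")
    alpha = proxy_root / opt_root                       # proxy quality at the root
    beta = max(proxy_u / opt_u for proxy_u, opt_u in pairs)  # worst ratio inside t_r
    eta_r = obj_tree / ((1.0 + eps_mult) * opt_root)    # how close t_r is to the bound
    return max(1.0, beta / alpha * eta_r)


# Proxy exact at the root (alpha = 1) but off by a factor of 2 at one internal node.
sigma_worst = recovery_slack(0.25, 0.25, [(0.25, 0.25), (0.125, 0.0625)], 0.25, 0.0)
# Proxy exact everywhere: no slack is needed.
sigma_exact = recovery_slack(0.25, 0.25, [(0.25, 0.25)], 0.25, 0.03)
```

In the first call $\alpha=1$, $\beta=2$, and $\eta_r=1$, so the computed slack equals $\beta$, matching the worst-case discussion above.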

Based on the preceding results, recovery can be improved either by introducing slack into the root budget (Corollary [3.6](https://arxiv.org/html/2606.00202#S3.Thmtheorem6)) or by using a stronger proxy algorithm, which yields better estimates of the optimal subproblem objectives ([Theorem 3.5](https://arxiv.org/html/2606.00202#S3.Thmtheorem5)). Both adjustments incur additional runtime cost, but can be applied after the initial execution of PRAXIS, reusing previous computation.

## 4 Experiments

| Dataset | $n$ | $k$ | PRAXIS Time (s) | PRAXIS Peak MB | TreeFARMS Time (s) | TreeFARMS Peak MB | SORTeD Time (s) | SORTeD Peak MB | RESPLIT Time (s) | RESPLIT Peak MB |
|---|---|---|---|---|---|---|---|---|---|---|
| Churn | 5000 | 472 | 34.84 | 279.41 | – | – | 123776.11 | 22012.99 | 2564.30 | 2792.24 |
| Madeline | 3140 | 451 | 2971.65 | 29094.20 | – | – | – | – | 133189.38 | 26621.99 |
| Electricity | 38474 | 264 | 306.94 | 692.10 | – | – | 114325.61 | 23777.65 | 7619.63 | 6402.99 |
| Shopping | 12330 | 243 | 58.88 | 470.00 | – | – | 24151.41 | 7957.33 | 1760.04 | 2195.54 |
| Christine | 5418 | 231 | 944.27 | 10439.32 | – | – | 38970.60 | 12709.88 | 12625.37 | 3096.83 |
| Credit | 30000 | 225 | 26.03 | 249.60 | – | – | 65145.23 | 14300.78 | 2316.15 | 4355.73 |
| Adult | 48842 | 209 | 272.53 | 682.15 | – | – | 56440.23 | 12079.80 | 5494.42 | 5782.34 |
| Jasmine | 2984 | 207 | 22.77 | 451.61 | – | – | 5211.85 | 2705.49 | 517.17 | 745.36 |
| News | 39644 | 196 | 596.47 | 1480.48 | – | – | 114514.40 | 30426.06 | 13544.76 | 6082.23 |
| Bike | 17379 | 164 | 35.40 | 359.94 | – | – | 6533.75 | 3546.60 | 1022.18 | 1302.75 |
| Helena | 65196 | 156 | 4.74 | 290.96 | 38.32 | 1983.32 | 100.29 | 779.31 | 564.13 | 1755.30 |
| Jannis | 57580 | 106 | 327.54 | 1458.63 | – | – | 5277.68 | 5137.22 | 5587.19 | 2462.35 |
| Bank | 45211 | 97 | 2.43 | 195.74 | – | – | 1808.61 | 1189.15 | 164.34 | 1468.21 |
| Covertype | 581012 | 96 | 358.70 | 1301.23 | – | – | 64672.88 | 16107.48 | 10101.87 | 8631.58 |
| Droid | 29332 | 84 | 0.78 | 165.04 | – | – | 882.65 | 697.19 | 78.89 | 730.01 |
| Higgs | 11000000 | 84 | 2374.91 | 21537.79 | – | – | – | – | – | – |
| Magic | 19020 | 80 | 31.05 | 446.82 | – | – | 268.66 | 850.65 | 595.96 | 527.28 |
| Madelon | 2000 | 73 | 8.19 | 292.36 | – | – | 97.75 | 409.84 | 204.01 | 276.30 |
| Heloc | 2502 | 65 | 0.38 | 140.59 | – | – | 52.34 | 313.95 | 8.79 | 205.18 |
| Wine | 6497 | 64 | 0.24 | 135.34 | – | – | 62.98 | 307.63 | 12.54 | 255.48 |
| Compas | 4966 | 44 | 0.09 | 130.47 | 65.08 | 3742.00 | 7.23 | 163.50 | 11.90 | 159.61 |
| Poker | 1025010 | 40 | 13.71 | 952.10 | – | – | 7370.45 | 7475.91 | 657.48 | 5958.07 |
| Diabetes | 253680 | 33 | 0.79 | 291.69 | 553.31 | 28336.29 | 683.31 | 1067.69 | 48.68 | 1049.81 |
| Taxi | 1224158 | 27 | 1.18 | 894.65 | 12.50 | 2874.38 | 364.85 | 3318.65 | 32.35 | 2030.43 |

Table 1: Runtime and peak memory usage at $\lambda=0.02$, $\varepsilon=0.03$, depth $=5$. Peak MB reports the peak resident set size (RSS) during the script to load packages, the dataset, and compute the Rashomon set. ‘–’ indicates that the method did not finish in 90 hours or with 200GB of RAM.

Our experimental evaluation stress-tests PRAXIS across 50 dataset-binarization combinations (with up to two binarizations per dataset), spanning up to 11 million samples and 472 binary features. We primarily binarize continuous features using the threshold-guessing technique in McTavish et al. ([2022](https://arxiv.org/html/2606.00202#bib.bib52)) ([Appendix C](https://arxiv.org/html/2606.00202#A3) has additional information on datasets and binarizations). The goal of these experiments is to assess whether PRAXIS can scale to the most demanding and computationally challenging Rashomon-set problems encountered in practice.

In our experiments, we evaluate three key dimensions of performance: (1) time and memory usage required to generate a Rashomon set; (2) approximation quality, measured primarily through recall;¹ and (3) ability to recover optimal decision trees. We compare PRAXIS to SORTeD (Arslan et al., [2026](https://arxiv.org/html/2606.00202#bib.bib3)) and TreeFARMS (Xin et al., [2022](https://arxiv.org/html/2606.00202#bib.bib80)), the two existing exact Rashomon-set enumeration algorithms, as well as RESPLIT (Babbar et al., [2025](https://arxiv.org/html/2606.00202#bib.bib5)), a state-of-the-art approximation algorithm for Rashomon sets. We also include a baseline that repeatedly trains the proxy on bootstrapped versions of the dataset (for as long as PRAXIS ran) to quantify how much PRAXIS’s search procedure improves on the underlying proxy algorithm’s performance.

¹ We focus on recall rather than precision, because precision in this context amounts to a simple check of whether returned trees’ training objectives are sufficiently low; since we return our trees in sorted order, we can achieve perfect precision without any drop in recall for any user-provided absolute threshold.

##### Timing and Memory\.

Table [1](https://arxiv.org/html/2606.00202#S4.T1) contains timing and memory profiles from a large range of datasets. The results for the full list of over 50 datasets are given in [Appendix D](https://arxiv.org/html/2606.00202#A4).

On all datasets with 100 or fewer binary features (except for Higgs and Covertype, which have large sample sizes), PRAXIS finished in 32 seconds or less. In contrast, SORTeD took up to 18 hours, and TreeFARMS frequently did not finish because it ran out of memory. On Higgs, no other method finished within 90 hours or 200 GB of RAM, yet PRAXIS finished in 40 minutes with no extra memory beyond what was required to load the dataset. The difference becomes more dramatic as the number of features increases: PRAXIS takes 35 seconds on Churn, whereas SORTeD takes nearly 35 hours.

These results demonstrate that PRAXIS improves the scalability of Rashomon set computation by orders of magnitude, outperforming RESPLIT by up to 2 orders of magnitude, TreeFARMS by up to 5 orders of magnitude (and handling many datasets that TreeFARMS cannot), and SORTeD by up to 3 orders of magnitude. In terms of memory consumption, PRAXIS is typically $5\times$ more memory efficient than RESPLIT, up to 2 orders of magnitude more memory efficient than SORTeD, and up to 4 orders of magnitude more memory efficient than TreeFARMS. Appendix [D.9](https://arxiv.org/html/2606.00202#A4.SS9) shows that PRAXIS scales much better with depth than existing approximate methods, achieving up to four orders of magnitude gains in runtime and memory for depth-7 Rashomon sets versus RESPLIT, while improving approximation quality. In this deeper regime, SORTeD fails to complete within 150 hours on these datasets, whereas PRAXIS finishes in only 11 seconds.

##### Approximation Quality\.

![Refer to caption](https://arxiv.org/html/2606.00202v1/x3.png)

Figure 3: Approximation quality of non-optimal methods for the 4 datasets where optimal methods either take $>40$ hours or $>200$ GB of RAM. Each plot shows the number of depth 5 trees with objective value within an $x$ factor of the minimum objective found by any method. The dashed vertical line shows the Rashomon bound estimated as $(1+\epsilon)$ times the minimum objective found by any method.

| Dataset | Recall, $\lambda=0.005$ | Recall, $\lambda=0.02$ |
|---|---|---|
| Churn-472 | $0.997\pm0.005$ | $1.000\pm0.000$ |
| Electricity-264 | $0.994\pm0.003$ | $1.000\pm0.000$ |
| Shopping-243 | $0.984\pm0.034$ | $1.000\pm0.000$ |
| Christine-231 | $1.000\pm0.000$ | $0.994\pm0.011$ |
| Credit-225 | $1.000\pm0.000$ | $1.000\pm0.000$ |
| Adult-209 | $1.000\pm0.000$ | $1.000\pm0.000$ |
| Jasmine-207 | $0.993\pm0.009$ | $1.000\pm0.000$ |
| News-196 | $1.000\pm0.000$ | $1.000\pm0.000$ |
| Bike-164 | $0.986\pm0.010$ | $1.000\pm0.000$ |
| Helena-156 | $1.000\pm0.000$ | $1.000\pm0.000$ |
| Jannis-106 | $0.995\pm0.008$ | $1.000\pm0.000$ |
| Bank-97 | $1.000\pm0.000$ | $1.000\pm0.000$ |
| Covertype-96 | $1.000\pm0.000$ | $1.000\pm0.000$ |
| Droid-84 | $1.000\pm0.000$ | $1.000\pm0.000$ |
| Magic-80 | $0.997\pm0.004$ | $0.998\pm0.004$ |
| Madelon-73 | $1.000\pm0.000$ | $1.000\pm0.000$ |
| Heloc-65 | $0.999\pm0.000$ | $1.000\pm0.000$ |
| Wine-64 | $1.000\pm0.000$ | $1.000\pm0.000$ |
| Compas-44 | $1.000\pm0.000$ | $1.000\pm0.000$ |
| Poker-40 | $1.000\pm0.000$ | $1.000\pm0.000$ |
| Diabetes-33 | $1.000\pm0.000$ | $1.000\pm0.000$ |
| Taxi-27 | $1.000\pm0.000$ | $1.000\pm0.000$ |

Table 2: Rashomon set recall (mean $\pm$ standard deviation over 5 bootstraps) of PRAXIS relative to ground truth enumeration for $\varepsilon_{\mathrm{mult}}=0.03$ and depth $d=5$. Format: Dataset-NumBinaryFeatures.

We now establish that PRAXIS approximates the Rashomon set with near-perfect recall across all tested datasets and far better than existing approximations. Table [2](https://arxiv.org/html/2606.00202#S4.T2) shows the recall of PRAXIS on datasets where we can directly compare to the ground truth Rashomon set. We see that PRAXIS frequently achieves a recall of 1.0 across all bootstraps of datasets. Even when it does not, it is always above 0.98 on average, highlighting that the outcomes of downstream tasks are unlikely to change when using this subset of trees.
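The recall metric reported above is straightforward to compute once each tree has a canonical, hashable identifier. The sketch below assumes such identifiers exist (how trees are canonicalized is not specified here), and the choice of the population standard deviation in `summarize` is an assumption, since the text does not say which estimator was used. Both function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Hashable, Iterable, List, Set, Tuple


def recall(found: Iterable[Hashable], truth: Set[Hashable]) -> float:
    """Fraction of the ground-truth Rashomon set that the approximation recovered."""
    if not truth:
        raise ValueError("ground-truth Rashomon set is empty")
    return len(set(found) & truth) / len(truth)


def summarize(recalls: List[float]) -> Tuple[float, float]:
    """Mean and (population) standard deviation of recall across bootstraps."""
    return mean(recalls), pstdev(recalls)


# Toy example: identifiers are strings standing in for canonical tree encodings.
ground_truth = {"t1", "t2", "t3", "t4"}
approximation = ["t1", "t2", "t3"]
r = recall(approximation, ground_truth)
avg, spread = summarize([r, 1.0])
```

Because every tree PRAXIS returns lies within the budget, the intersection here is simply the returned set, so recall reduces to the ratio of set sizes when the same budget is used for both.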

##### Benchmarking Trees Found\.

To investigate settings where exact Rashomon set computations are prohibitively slow or memory intensive but PRAXIS can still complete within reasonable time and memory, we compare our approximation quality to RESPLIT (run to completion) and bootstrapping our proxy algorithm (LicketySPLIT, run for as long as PRAXIS ran). Figure [3](https://arxiv.org/html/2606.00202#S4.F3) shows the cumulative number of trees found by each method with objectives less than or equal to a given value. These are representative results from large datasets; the results for additional datasets can be found in Appendix [D.11](https://arxiv.org/html/2606.00202#A4.SS11). In the case of the Churn and Electricity datasets, PRAXIS finds more than one million trees that are better than the best tree RESPLIT finds; moreover, none of the trees found by RESPLIT lie in the desired Rashomon set. Across all 4 cases displayed in Figure [3](https://arxiv.org/html/2606.00202#S4.F3), RESPLIT and bootstrapped LicketySPLIT returned 0 or a small fraction of the estimated Rashomon set.
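The cumulative curves described above, and the count of trees inside the estimated Rashomon bound, can be reproduced from per-method lists of objective values. This is a minimal sketch with made-up numbers; `cumulative_counts` and `count_within_bound` are hypothetical helper names, and the bound is taken as $(1+\varepsilon)$ times the best objective found by any method, following the figure caption.

```python
from bisect import bisect_right
from typing import List, Sequence


def cumulative_counts(objectives: Sequence[float], thresholds: Sequence[float]) -> List[int]:
    """For each threshold, count trees whose objective is <= that threshold."""
    ordered = sorted(objectives)
    return [bisect_right(ordered, t) for t in thresholds]


def count_within_bound(objectives: Sequence[float], best_objective: float, eps_mult: float) -> int:
    """Count trees inside the estimated Rashomon bound (1 + eps) * best objective."""
    bound = (1.0 + eps_mult) * best_objective
    return cumulative_counts(objectives, [bound])[0]


# Illustrative objective values only; these are not results from the paper.
method_a = [0.25, 0.25, 0.30, 0.40]
method_b = [0.30, 0.45]
best = min(min(method_a), min(method_b))

curve_a = cumulative_counts(method_a, [0.25, 0.30, 0.50])
inside_a = count_within_bound(method_a, best, eps_mult=0.03)
inside_b = count_within_bound(method_b, best, eps_mult=0.03)
```

Using the best objective across all methods, rather than each method's own minimum, is what lets a method register zero trees inside the bound, as reported for RESPLIT on Churn and Electricity.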

##### Finding Optimal Trees\.

While the primary focus of our framework is to approximate Rashomon sets, our work also functions as a way to find near\-optimal individual decision trees \(by taking the first tree returned by PRAXIS\)\.

Across regularization values $\lambda\in\{0.005,0.01,0.02\}$, and up to 50 dataset–binarization pairs for which both the globally optimal solver STreeD (van der Linden et al., [2023](https://arxiv.org/html/2606.00202#bib.bib76)) and PRAXIS (with $\varepsilon_{\mathrm{mult}}=0.03$) completed successfully (the number of completed runs varies slightly across $\lambda$), we observe exact agreement in the minimum objective value between STreeD and PRAXIS on every dataset. Even when setting $\varepsilon_{\mathrm{mult}}=0$, PRAXIS recovers the optimal tree in the vast majority of cases. Specifically, for $\lambda=0.02$, PRAXIS ($\varepsilon_{\mathrm{mult}}=0$) recovers the optimal tree on all 50 datasets; for $\lambda=0.01$, it fails on only two datasets; and for $\lambda=0.005$, it fails on only three datasets. In the rare failure cases, the returned objective exceeded the optimum by only a very small multiplicative factor (e.g., 1.00069).

Given this, PRAXIS can be used to nearly always recover optimal decision trees and does so up to 3 orders of magnitude faster than GOSDT and STreeD (Lin et al., [2020](https://arxiv.org/html/2606.00202#bib.bib46); van der Linden et al., [2023](https://arxiv.org/html/2606.00202#bib.bib76)) (see Appendix [D.4](https://arxiv.org/html/2606.00202#A4.SS4)). As an illustration, on News with $\lambda=0.005$, PRAXIS computes the optimal tree in under 3 minutes; by contrast, STreeD takes more than twenty hours, and GOSDT runs out of memory.

## 5 Conclusion

We introduce PRAXIS, a fast, memory\-efficient algorithm for estimating Rashomon sets of decision trees\. PRAXIS efficiently prunes candidates using proxy completions as computationally light subroutines for candidate split evaluations\. Leveraging these proxies and a mechanism to refine bounds during search, PRAXIS approximates the Rashomon set with near\-perfect recall and delivers orders\-of\-magnitude speedups over existing methods\. These results allow for high\-quality approximations of Rashomon sets on datasets far larger than those accessible to existing methods, substantially expanding the practicality of Rashomon set computation\. Future work could explore how these ideas extend to single\-tree optimization and to broader classes of tree and partitioning problems in machine learning\.

## Acknowledgments

We acknowledge funding from the U\.S\. Department of Energy under DE\-SC0023194, and the National Institute On Drug Abuse of the National Institutes of Health under R01DA054994\. Content does not necessarily represent official views of the United States Government nor any agency thereof\. We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada \(NSERC\)\. Nous remercions le Conseil de recherches en sciences naturelles et en génie du Canada \(CRSNG\) de son soutien\. We would also like to thank Yixiao Wang and Luke Moffett for comments and suggestions on our work, as well as our reviewers\.

## Impact Statement

This paper presents work on interpretable machine learning, which is essential for many ethical AI applications\.

## References

- Aglin et al\. \(2020\)Aglin, G\., Nijssen, S\., and Schaus, P\.Learning optimal decision trees using caching branch\-and\-bound search\.In*Proceedings of the AAAI Conference on Artificial Intelligence*, volume 34, pp\. 3146–3153, 2020\.
- Aha \(1991\)Aha, D\.Tic\-Tac\-Toe Endgame\.UCI Machine Learning Repository, 1991\.DOI: https://doi\.org/10\.24432/C5688J\.
- Arslan et al\. \(2026\)Arslan, E\., van der Linden, J\. G\. M\., Hoogendoorn, S\., Rinaldi, M\., and Demirović, E\.SORTed rashomon sets of sparse decision trees: Anytime enumeration\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2026\.URL[https://openreview\.net/forum?id=Gibq7Wa7Bq](https://openreview.net/forum?id=Gibq7Wa7Bq)\.
- Audobon Society Field Guide \(1981\)Audobon Society Field Guide\.Mushroom\.UCI Machine Learning Repository, 1981\.DOI: https://doi\.org/10\.24432/C5959T\.
- Babbar et al\. \(2025\)Babbar, V\., McTavish, H\., Rudin, C\., and Seltzer, M\.Near\-optimal decision trees in a SPLIT second\.In*International Conference on Machine Learning*, 2025\.
- Babbar et al\. \(2026\)Babbar, V\., Boner, Z\., Seltzer, M\., and Rudin, C\.Falling trees: A model class for interpretable risk prioritization\.In*International Conference on Machine Learning*, 2026\.Forthcoming\.
- Bain & Hoff \(1994\)Bain, M\. and Hoff, A\.Chess \(King\-Rook vs\. King\)\.UCI Machine Learning Repository, 1994\.DOI: https://doi\.org/10\.24432/C57W2S\.
- Baldi et al\. \(2014\)Baldi, P\., Sadowski, P\., and Whiteson, D\.Searching for exotic particles in high\-energy physics with deep learning\.*Nature Communications*, 5\(1\), July 2014\.ISSN 2041\-1723\.doi:10\.1038/ncomms5308\.URL[http://dx\.doi\.org/10\.1038/ncomms5308](http://dx.doi.org/10.1038/ncomms5308)\.
- Bao et al\. \(2021\)Bao, M\., Zhou, A\., Zottola, A\. S\., Brubach, B\., Desmarais, S\., Horowitz, S\. A\., Lum, K\., and Venkatasubramanian, S\.It’s compaslicated: The messy relationship between rai datasets and algorithmic fairness benchmarks\.In*Proceedings of the Thirty\-Fifth Conference on Neural Information Processing Systems \(NeurIPS\)*, 2021\.Datasets and Benchmarks Track \(Round 1\)\.
- Becker & Kohavi \(1996\)Becker, B\. and Kohavi, R\.Adult\.UCI Machine Learning Repository, 1996\.DOI: https://doi\.org/10\.24432/C5XW20\.
- Bertsekas et al\. \(1997\)Bertsekas, D\. P\., Tsitsiklis, J\. N\., and Wu, C\.Rollout algorithms for combinatorial optimization\.*Journal of Heuristics*, 3\(3\):245–262, 1997\.
- Bertsimas & Dunn \(2017\)Bertsimas, D\. and Dunn, J\.Optimal classification trees\.*Machine Learning*, 106:1039–1082, 2017\.
- Blackard \(1998\)Blackard, J\.Covertype\.UCI Machine Learning Repository, 1998\.DOI: https://doi\.org/10\.24432/C50K5N\.
- Blanc et al\. \(2024\)Blanc, G\., Lange, J\., Pabbaraju, C\., Sullivan, C\., Tan, L\.\-Y\., and Tiwari, M\.Harnessing the power of choices in decision tree learning\.In*Advances in Neural Information Processing Systems*, volume 36, 2024\.
- Bock \(2004\)Bock, R\.MAGIC Gamma Telescope\.UCI Machine Learning Repository, 2004\.DOI: https://doi\.org/10\.24432/C52C8B\.
- Breiman \(1984\)Breiman, L\.*Classification and regression trees*\.Routledge, 1984\.
- Brița et al\. \(2025\)Brița, C\. E\., van der Linden, J\. G\., and Demirović, E\.Optimal classification trees for continuous feature data using dynamic programming with branch\-and\-bound\.In*Proceedings of the AAAI Conference on Artificial Intelligence*, volume 39, pp\. 11131–11139, 2025\.
- Burrows et al\. \(2017\)Burrows, N\. R\., Hora, I\., Geiss, L\. S\., Gregg, E\. W\., and Albright, A\.Incidence of end\-stage renal disease attributed to diabetes among persons with diagnosed diabetes — united states and puerto rico, 2000–2014\.*MMWR\. Morbidity and Mortality Weekly Report*, 66\(43\):1165–1170, 2017\.doi:10\.15585/mmwr\.mm6643a2\.URL[https://doi\.org/10\.15585/mmwr\.mm6643a2](https://doi.org/10.15585/mmwr.mm6643a2)\.
- Cattral & Oppacher \(2002\)Cattral, R\. and Oppacher, F\.Poker Hand\.UCI Machine Learning Repository, 2002\.DOI: https://doi\.org/10\.24432/C5KW38\.
- Chaouki et al\. \(2025\)Chaouki, A\., Read, J\., and Bifet, A\.Branches: Efficiently seeking optimal sparse decision trees via AO\*\.In*Forty\-second International Conference on Machine Learning*, 2025\.
- Ciaperoni et al\. \(2024\)Ciaperoni, M\., Xiao, H\., and Gionis, A\.Efficient exploration of the Rashomon set of rule\-set models\.In*Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining \(KDD 2024\)*, pp\. 478–489, 2024\.
- Cortez \(2008\)Cortez, P\.Student Performance\.UCI Machine Learning Repository, 2008\.DOI: https://doi\.org/10\.24432/C5TG7T\.
- Cortez & Silva \(2008\)Cortez, P\. and Silva, A\. M\. G\.Using data mining to predict secondary school student performance\.2008\.
- Costa & Pedreira \(2023\)Costa, V\. G\. and Pedreira, C\. E\.Recent advances in decision trees: An updated survey\.*Artificial Intelligence Review*, 56\(5\):4765–4800, 2023\.
- Demirović et al\. \(2022\)Demirović, E\., Lukina, A\., Hebrard, E\., Chan, J\., Bailey, J\., Leckie, C\., Ramamohanarao, K\., and Stuckey, P\. J\.Murtree: Optimal decision trees via dynamic programming and search\.*Journal of Machine Learning Research*, 23\(26\):1–47, 2022\.
- Demirović et al\. \(2023\)Demirović, E\., Hebrard, E\., and Jean, L\.Blossom: an anytime algorithm for computing optimal decision trees\.In*International Conference on Machine Learning*, pp\. 7533–7562\. PMLR, 2023\.
- Detrano et al\. \(1989\)Detrano, R\. C\., Jánosi, A\., Steinbrunn, W\., Pfisterer, M\. E\., Schmid, J\.\-J\., Sandhu, S\., Guppy, K\. H\., Lee, S\., and Froelicher, V\.International application of a new probability algorithm for the diagnosis of coronary artery disease\.*The American Journal of Cardiology*, 64 5:304–10, 1989\.
- Donnelly et al\. \(2023\)Donnelly, J\., Katta, S\., Rudin, C\., and Browne, E\. P\.The Rashomon importance distribution: Getting RID of unstable, single model\-based variable importance\.In*Proceedings of Neural Information Processing Systems \(NeurIPS\)*, 2023\.
- Donnelly et al\. \(2026\)Donnelly, J\., Katta, S\., Borgonovo, E\., and Rudin, C\.Doctor rashomon and the UNIVERSE of madness: Variable importance with unobserved confounding and the rashomon effect\.In*Proceedings of Artificial Intelligence and Statistics \(AISTATS\)*, 2026\.
- Duin & Voß \(1994\)Duin, C\. and Voß, S\.Steiner tree heuristics—a survey\.In*Operations Research Proceedings 1993: DGOR/NSOR Papers of the 22nd Annual Meeting of DGOR in Cooperation with NSOR/Vorträge der 22\. Jahrestagung der DGOR zusammen mit NSOR*, pp\. 485–496\. Springer, 1994\.
- Erickson et al\. \(2025\)Erickson, N\., Purucker, L\., Tschalzev, A\., Holzmüller, D\., Desai, P\. M\., Salinas, D\., and Hutter, F\.Tabarena: A living benchmark for machine learning on tabular data, 2025\.URL[https://arxiv\.org/abs/2506\.16791](https://arxiv.org/abs/2506.16791)\.
- Fanaee\-T \(2013\)Fanaee\-T, H\.Bike Sharing\.UCI Machine Learning Repository, 2013\.DOI: https://doi\.org/10\.24432/C5W894\.
- Fanaee\-T & Gama \(2013\)Fanaee\-T, H\. and Gama, J\.Event labeling combining ensemble detectors and background knowledge\.*Progress in Artificial Intelligence*, 2:113 – 127, 2013\.
- Fernandes et al\. \(2015a\)Fernandes, K\., Vinagre, P\., and Cortez, P\.A proactive intelligent decision support system for predicting the popularity of online news\.In*Portuguese Conference on Artificial Intelligence*, 2015a\.
- Fernandes et al\. \(2015b\)Fernandes, K\., Vinagre, P\., Cortez, P\., and Sernadela, P\.Online News Popularity\.UCI Machine Learning Repository, 2015b\.DOI: https://doi\.org/10\.24432/C5NS3V\.
- FICO \(2018\)FICO\.Home equity line of credit \(heloc\) dataset, 2018\.URL[https://community\.fico\.com/s/explainable\-machine\-learning\-challenge](https://community.fico.com/s/explainable-machine-learning-challenge)\.FICO Explainable Machine Learning Challenge\.
- Guyon et al\. \(2019\)Guyon, I\., Sun\-Hosoya, L\., Boullé, M\., Escalante, H\. J\., Escalera, S\., Liu, Z\., Jajetic, D\., Ray, B\., Saeed, M\., Sebag, M\., Statnikov, A\., Tu, W\., and Viegas, E\.Analysis of the AutoML challenge series 2015\-2018\.In*AutoML*, Springer series on Challenges in Machine Learning, 2019\.
- Hopkins et al\. \(1999\)Hopkins, M\., Reeber, E\., Forman, G\., and Suermondt, J\.Spambase\.UCI Machine Learning Repository, 1999\.DOI: https://doi\.org/10\.24432/C53G6X\.
- Hsu et al\. \(2026\)Hsu, E\., Chen, H\., Zhong, C\., and Semenova, L\.The double\-edged nature of the Rashomon set for trustworthy machine learning\.In*International Conference on Machine Learning*, 2026\.
- Hu et al\. \(2019a\)Hu, X\., Rudin, C\., and Seltzer, M\.Optimal sparse decision trees\.In*Advances in Neural Information Processing Systems*, volume 32, 2019a\.
- Hu et al\. \(2019b\)Hu, X\., Rudin, C\., and Seltzer, M\.Optimal sparse decision trees\.In*Advances in Neural Information Processing Systems*, volume 32, pp\. 7265–7273, 2019b\.
- Janosi et al\. \(1989\)Janosi, A\., Steinbrunn, W\., Pfisterer, M\., and Detrano, R\.Heart Disease\.UCI Machine Learning Repository, 1989\.DOI: https://doi\.org/10\.24432/C52P4X\.
- Kiossou & Schaus \(2026\)Kiossou, H\. and Schaus, P\.A generic complete anytime beam search for optimal decision tree\.In*Advances in Intelligent Data Analysis XXIV*, pp\. 97–109\. Springer Nature Switzerland, 2026\.
- Kiossou et al\. \(2022\)Kiossou, H\., Schaus, P\., Nijssen, S\., and Houndji, V\. R\.Time constrained dl8\.5 using limited discrepancy search\.In*Joint European Conference on Machine Learning and Knowledge Discovery in Databases*, pp\. 443–459\. Springer, 2022\.
- Kiossou et al\. \(2024\)Kiossou, H\., Schaus, P\., Nijssen, S\., and Aglin, G\.Efficient lookahead decision trees\.In*International Symposium on Intelligent Data Analysis*, pp\. 133–144\. Springer, 2024\.
- Lin et al\. \(2020\)Lin, J\., Zhong, C\., Hu, D\., Rudin, C\., and Seltzer, M\.Generalized and scalable optimal sparse decision trees\.In*International Conference on Machine Learning*, pp\. 6150–6160\. PMLR, 2020\.
- Malani et al\. \(2019\)Malani, P\. N\., Kullgren, J\., and Solway, E\.National poll on healthy aging \(NPHA\), United States, april 2017, 2019\.URL[https://doi\.org/10\.3886/ICPSR37305\.v1](https://doi.org/10.3886/ICPSR37305.v1)\.Available from the Inter\-university Consortium for Political and Social Research and UCI Machine Learning Repository\.
- Marcoulides \(2005\)Marcoulides, G\. A\.*Discovering Knowledge in Data: An Introduction to Data Mining*\.Wiley, 2005\.Churn dataset\.
- Mata et al\. \(2022\)Mata, K\., Kanamori, K\., and Arimura, H\.Computing the collection of good models for rule lists\.In*Proceedings of the 18th International Conference on Machine Learning and Data Mining \(MLDM 2022\)*, New York, USA, 2022\.
- Mathur et al\. \(2021\)Mathur, A\., Podila, M\., Kulkarni, K\., Niyaz, Q\., and Javaid, A\. Y\.NATICUSdroid: A malware detection framework for android using native and custom permissions\.*Journal of Information Security and Applications*, 58:102696, 2021\.Available in UCI Machine Learning Repository\.
- Mazumder et al\. \(2022\)Mazumder, R\., Meng, X\., and Wang, H\.Quant\-BnB: A scalable branch\-and\-bound method for optimal decision trees with continuous features\.In*International Conference on Machine Learning*, volume 162, pp\. 15255–15277\. PMLR, 17–23 Jul 2022\.
- McTavish et al\. \(2022\)McTavish, H\., Zhong, C\., Achermann, R\., Karimalis, I\., Chen, J\., Rudin, C\., and Seltzer, M\.Fast sparse decision tree optimization via reference ensembles\.In*Proceedings of the AAAI Conference on Artificial Intelligence*, volume 36, pp\. 9604–9613, 2022\.
- McTavish et al\. \(2025\)McTavish, H\., Boner, Z\., Donnelly, J\., Seltzer, M\., and Rudin, C\.Leveraging predictive equivalence in decision trees\.In*International Conference on Machine Learning \(ICML\)*, 2025\.
- Mohammad & McCluskey \(2012\)Mohammad, R\. and McCluskey, L\.Phishing Websites\.UCI Machine Learning Repository, 2012\.DOI: https://doi\.org/10\.24432/C51W2X\.
- Mohammad et al\. \(2012\)Mohammad, R\. M\. A\., Thabtah, F\. A\., and Mccluskey, L\.An assessment of features related to phishing websites using an automated technique\.In*International Conference for Internet Technology and Secured Transactions*, pp\. 492–497, 2012\.
- Moro et al\. \(2014a\)Moro, S\., Cortez, P\., and Rita, P\.A data\-driven approach to predict the success of bank telemarketing\.*Decision Support Systems*, 62:22–31, 2014a\.
- Moro et al\. \(2014b\)Moro, S\., Rita, P\., and Cortez, P\.Bank Marketing\.UCI Machine Learning Repository, 2014b\.DOI: https://doi\.org/10\.24432/C5K306\.
- Motwani & Raghavan \(2013\)Motwani, R\. and Raghavan, P\.*Randomized Algorithms*\.Cambridge University Press, USA, 2013\.
- OpenML \(2018\)OpenML\.christine\.[https://www\.openml\.org/d/41142](https://www.openml.org/d/41142), 2018\.Dataset ID 41142\.
- OpenML \(2018a\)OpenML\.Helena dataset, 2018a\.URL[https://www\.openml\.org/d/41169](https://www.openml.org/d/41169)\.OpenML Dataset ID 41169; dataset from the ChaLearn Automatic Machine Learning \(AutoML\) Challenge\.
- OpenML \(2018b\)OpenML\.Jasmine dataset, 2018b\.URL[https://www\.openml\.org/d/41143](https://www.openml.org/d/41143)\.OpenML Dataset ID 41143; dataset from the ChaLearn Automatic Machine Learning \(AutoML\) Challenge\.
- OpenML \(2018c\)OpenML\.Madeline dataset, 2018c\.URL[https://www\.openml\.org/d/41144](https://www.openml.org/d/41144)\.OpenML Dataset ID 41144; dataset from the ChaLearn Automatic Machine Learning \(AutoML\) Challenge\.
- OpenML \(2018d\)OpenML\.NYC taxi green trips, December 2016, 2018d\.URL[https://www\.openml\.org/d/41255](https://www.openml.org/d/41255)\.OpenML Dataset ID 41255; derived from New York City Taxi and Limousine Commission \(TLC\) Trip Record Data\.
- OpenML \(2022a\)OpenML\.Electricity dataset, 2022a\.URL[https://www\.openml\.org/d/43952](https://www.openml.org/d/43952)\.OpenML Dataset ID 43952; transformed version of the ELEC2 dataset originally collected from the Australian New South Wales Electricity Market\.
- OpenML \(2022b\)OpenML\.Jannis dataset, 2022b\.URL[https://www\.openml\.org/d/43977](https://www.openml.org/d/43977)\.OpenML Dataset ID 43977; dataset used in the tabular data benchmark and derived from the ChaLearn AutoML Challenge\.
- OpenML \(2025\)OpenML\.Wine dataset, 2025\.URL[https://www\.openml\.org/d/47041](https://www.openml.org/d/47041)\.OpenML Dataset ID 47041\.
- Quinlan \(2014\)Quinlan, J\. R\.*C4\.5: Programs for Machine Learning*\.Elsevier, 2014\.
- Rudin et al\. \(2024\)Rudin, C\., Zhong, C\., Semenova, L\., Seltzer, M\., Parr, R\., Liu, J\., Katta, S\., Donnelly, J\., Chen, H\., and Boner, Z\.Amazing things come from having many good models\.In*International Conference on Machine Learning*, 2024\.
- S\. & Nagapadma \(2023\)S\., B\. and Nagapadma, R\.RT\-IoT2022 \.UCI Machine Learning Repository, 2023\.DOI: https://doi\.org/10\.24432/C5P338\.
- Sakar & Kastro \(2018\)Sakar, C\. and Kastro, Y\.Online Shoppers Purchasing Intention Dataset\.UCI Machine Learning Repository, 2018\.DOI: https://doi\.org/10\.24432/C5F88Q\.
- Sakar et al\. \(2018\)Sakar, C\. O\., Polat, S\. O\., Katircioglu, M\., and Kastro, Y\.Real\-time prediction of online shoppers’ purchasing intention using multilayer perceptron and lstm recurrent neural networks\.*Neural Computing and Applications*, 31:6893 – 6908, 2018\.
- Semenova et al\. \(2022\)Semenova, L\., Rudin, C\., and Parr, R\.On the existence of simpler machine learning models\.In*2022 ACM Conference on Fairness Accountability and Transparency*, FAccT ’22, pp\. 1827–1858\. ACM, June 2022\.
- Sharmila & Nagapadma \(2023\)Sharmila, B\. S\. and Nagapadma, R\.Quantized autoencoder \(QAE\) intrusion detection system for anomaly detection in resource\-constrained IoT devices using RT\-IoT2022 dataset\.*Cybersecurity*, 6:1–15, 2023\.URL[https://api\.semanticscholar\.org/CorpusID:261516162](https://api.semanticscholar.org/CorpusID:261516162)\.
- Sullivan et al\. \(2024\)Sullivan, C\., Tiwari, M\., and Thrun, S\.Maptree: Beating “optimal” decision trees with bayesian decision trees\.In*Proceedings of the AAAI Conference on Artificial Intelligence*, volume 38, pp\. 9019–9026, 2024\.
- Thrun \(1991\)Thrun, S\.The MONK’s problems\-a performance comparison of different learning algorithms, CMU\-CS\-91\-197, Sch, 1991\.URL[https://api\.semanticscholar\.org/CorpusID:59699060](https://api.semanticscholar.org/CorpusID:59699060)\.
- van der Linden et al\. \(2023\)van der Linden, J\., de Weerdt, M\., and Demirović, E\.Necessary and sufficient conditions for optimal decision trees using dynamic programming\.In*Advances in Neural Information Processing Systems*, volume 36, pp\. 9173–9212, 2023\.
- Voß et al\. \(2005\)Voß, S\., Fink, A\., and Duin, C\.Looking ahead with the pilot method\.*Annals of Operations Research*, 136\(1\):285–302, 2005\.
- Whiteson \(2014\)Whiteson, D\.HIGGS\.UCI Machine Learning Repository, 2014\.DOI: https://doi\.org/10\.24432/C5V312\.
- Wnek \(1993\)Wnek, J\.MONK’s Problems\.UCI Machine Learning Repository, 1993\.DOI: https://doi\.org/10\.24432/C5R30R\.
- Xin et al\. \(2022\)Xin, R\., Zhong, C\., Chen, Z\., Takagi, T\., Seltzer, M\., and Rudin, C\.Exploring the whole Rashomon set of sparse decision trees\.In*Advances in Neural Information Processing Systems*, volume 35, pp\. 14071–14084, 2022\.
- Yeh \(2009\)Yeh, I\.\-C\.Default of Credit Card Clients\.UCI Machine Learning Repository, 2009\.DOI: https://doi\.org/10\.24432/C55S3H\.
- Yeh & Lien \(2009\)Yeh, I\.\-C\. and Lien, C\.The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients\.*Expert Syst\. Appl\.*, 36:2473–2480, 2009\.
- Zhong et al\. \(2023\)Zhong, C\., Chen, Z\., Liu, J\., Seltzer, M\., and Rudin, C\.Exploring and interacting with the set of good sparse generalized additive models\.In*Advances in Neural Information Processing Systems*, 2023\.

## Appendix Contents

## Appendix ATheoretical Results

### A\.1Proofs of Claims

See[3\.2](https://arxiv.org/html/2606.00202#S3.Thmtheorem2)

###### Proof of Theorem[3\.2](https://arxiv.org/html/2606.00202#S3.Thmtheorem2)\.

We first prove the bound for the simpler case in which iterative budget refinement does not run. We then argue that whenever iterative budget refinement does run, a subproblem has at least one more tree than we previously expected, so the earlier bound was pessimistic, and this accounts for solving the subproblem again with a larger budget.

##### Simpler Case:

First, consider the case where Algorithm [3](https://arxiv.org/html/2606.00202#alg3) does not iteratively refine budgets; that is, we use Algorithm [4](https://arxiv.org/html/2606.00202#alg4) in its place. In this case, we never solve a subproblem identified by a sequence of splits more than once.

Algorithm 4 $\textsc{Singlepass}(D_{L}, D_{R}, \gamma, d, \varepsilon_{\textrm{abs}}, P_{R})$

Input: left/right datasets $D_{L}, D_{R}$; remaining depth $d$; regularization $\gamma$; parent budget $\varepsilon_{\textrm{abs}}$; proxy cost $P_{R}$.

1. $b_{L} \leftarrow \varepsilon_{\textrm{abs}} - P_{R}$
2. $\mathcal{T}_{L} \leftarrow \textsc{PRAXIS}(D_{L}, d, \gamma, b_{L})$
3. $m_{L} \leftarrow \mathcal{T}_{L}.\textit{min\_objective}$
4. $b_{R} \leftarrow \varepsilon_{\textrm{abs}} - m_{L}$
5. $\mathcal{T}_{R} \leftarrow \textsc{PRAXIS}(D_{R}, d, \gamma, b_{R})$
6. return $(\mathcal{T}_{L}, \mathcal{T}_{R})$
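The control flow of Algorithm 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TreeSet`, `toy_praxis`, and the callable signature `praxis(dataset, depth, gamma, budget)` are hypothetical stand-ins introduced here, not the interface of the released PRAXIS code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class TreeSet:
    """Hypothetical stand-in for the set of trees returned for one subproblem."""
    min_objective: float  # smallest objective among the returned trees
    budget: float         # budget the set was enumerated under


# Assumed signature for illustration: (dataset, depth, gamma, budget) -> TreeSet
PraxisFn = Callable[[Sequence, int, float, float], TreeSet]


def singlepass(D_L: Sequence, D_R: Sequence, gamma: float, d: int,
               eps_abs: float, P_R: float,
               praxis: PraxisFn) -> Tuple[TreeSet, TreeSet]:
    """Solve the left child, then the right child, each exactly once."""
    b_L = eps_abs - P_R                # step 1: left budget from the right proxy cost
    T_L = praxis(D_L, d, gamma, b_L)   # step 2: enumerate the left subproblem
    m_L = T_L.min_objective            # step 3: best objective found on the left
    b_R = eps_abs - m_L                # step 4: right budget from the left optimum
    T_R = praxis(D_R, d, gamma, b_R)   # step 5: enumerate the right subproblem
    return T_L, T_R                    # step 6


def toy_praxis(D: Sequence, d: int, gamma: float, budget: float) -> TreeSet:
    # Toy stub: pretend the best tree on D has objective 0.25, whatever D is.
    return TreeSet(min_objective=0.25, budget=budget)


T_L, T_R = singlepass([0, 1, 0], [1, 1], gamma=0.0, d=2,
                      eps_abs=1.0, P_R=0.5, praxis=toy_praxis)
print(T_L.budget, T_R.budget)
```

Because each child is solved once, no subproblem is revisited with a different budget, which is the property the simpler case of the proof relies on.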

Consider the search graph created by PRAXIS\.

In this case, each node corresponds to a subproblem defined by the sequence of splits leading to it, and there is no evaluation of the same subproblem with different budgets. By construction, PRAXIS expands only those splits whose LicketySPLIT completions remain in the Rashomon set (based on the pruning procedure in Algorithm [1](https://arxiv.org/html/2606.00202#alg1)). Fix an AND/OR graph level $d' < d$, and let $r := d - d'$ denote the remaining depth budget at this level. Let $u_{d'}$ be the number of OR nodes (subproblems) expanded at level $d'$, and let $n_{i}$ be the dataset size at the $i^{th}$ such node. At each expanded node, PRAXIS evaluates all $k$ candidate splits to select the next batch of candidates. Because the proxy algorithm is linear in the local subproblem size $n_{i}$, the work at node $i$ is

$$\mathcal{O}\!\left(n_{i}\,k\,\bigl(g(k,r)\bigr)\right).$$

Therefore, the total cost at level $d'$ is

$$\mathcal{O}\!\left(\sum_{i=1}^{u_{d'}} n_{i}\,k\,\bigl(1+g(k,r)\bigr)\right).$$
Since subproblems corresponding only to trees certified to be in the Rashomon set appear in the AND/OR graph, the combined subproblem sizes at any level $d'$ cannot exceed $n|R|$. Indeed,

$$\begin{aligned}
\sum_{i=1}^{u_{d'}} n_{i} &\leq \sum_{\text{tree } t \in R}\ \sum_{\text{nodes of } t \text{ at level } d'} n_{u_{t}} && (1)\\
&= \sum_{\text{tree } t \in R} n && (2)\\
&= n|R|. && (3)
\end{aligned}$$

Substituting this bound yields that the total cost at level $d'$ is

$$\mathcal{O}\!\left(n|R|\,k\,\bigl(g(k,r)\bigr)\right).$$

Summing over all non-leaf levels $d' = 0, 1, \dots, d-1$ (equivalently, $r = 1, 2, \dots, d$), the total runtime is

$$\mathcal{O}\!\left(n|R|\,k \sum_{r=1}^{d} \bigl(g(k,r)\bigr)\right).$$

If $g(k,r)$ is nondecreasing in $r$, then $\sum_{r=1}^{d} g(k,r) \leq d\,g(k,d)$, yielding the simplified bound

$$\mathcal{O}\!\left(n|R|\,k\,d\,\bigl(g(k,d)\bigr)\right).$$
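The simplification from the level-wise sum to $d\,g(k,d)$ can be checked numerically. The sketch below assumes a hypothetical per-sample proxy cost $g(k,r) = k\,r$, chosen only because it is nondecreasing in $r$; the proof itself does not fix a particular $g$.

```python
def g(k: int, r: int) -> int:
    """Hypothetical per-sample proxy cost, nondecreasing in r (an assumption)."""
    return k * r


def level_sum(k: int, d: int) -> int:
    """Sum of g(k, r) over remaining depths r = 1, ..., d."""
    return sum(g(k, r) for r in range(1, d + 1))


k, d = 20, 5
exact = level_sum(k, d)     # sum over all non-leaf levels
simplified = d * g(k, d)    # every term replaced by the largest one
print(exact, simplified)
```

The simplified bound overestimates the sum by at most a factor of $d$, which is why it loses nothing beyond what the $\mathcal{O}(\cdot)$ statement already allows in polynomial factors of $d$.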
At this point, the runtime is expressed in terms of the functiongg, which upper\-bounds the proxy cost for a dataset\. To express the bound directly in terms of the proxy algorithm’s runtime, we require a matching lower bound\.

From the assumption that

$$\textsc{ProxyCompute}(m,k,r) = \Theta\bigl(m\,g(k,r)\bigr),$$

we have that

$$n\,g(k,d) = \mathcal{O}\bigl(\textsc{ProxyCompute}(n,k,d)\bigr).$$

Substituting into the runtime bound gives:

$$\mathcal{O}\!\left(n|R|\,k\,d\,g(k,d)\right) = \mathcal{O}\!\left(|R|\,k\,d \cdot \textsc{ProxyCompute}(n,k,d)\right).$$
Additionally, PRAXIS may perform work at leaf nodes corresponding to $r = 0$ (equivalently, $d' = d$), where no further splits are evaluated and the algorithm considers only leaf predictions. The work performed at such a node is linear in the size of the local subproblem. Summed over all leaf nodes represented in the AND/OR graph, this contributes at most $\mathcal{O}(n|R|)$ additional work and does not affect the overall asymptotic bound.

##### General Case:

Now, consider the case where Algorithm [3](https://arxiv.org/html/2606.00202#alg3) does run; that is, PRAXIS is run at least one more time on one of the child subproblems with a larger budget. If the minimum objective is still the LicketySPLIT objective, the enumeration on the left subproblem is not rerun, and we reduce to the earlier case. Thus, for one side to be rerun with a larger budget, there must exist at least two trees for that child (the LicketySPLIT tree and the new minimum-objective tree) that could be combined with the LicketySPLIT tree for the sibling node to be within the budget for the parent subproblem.

Recall that PRAXIS has a key invariant: any solution to a subproblem not pruned is part of at least one tree in the Rashomon set \(this follows from PRAXIS’s construction, which only expands splits whose LicketySPLIT completions are within the budget\)\. Thus, in our new call of PRAXIS, we know that every subproblem visited is a part of a solution with the new minimum\-objective tree on the other side\. And, every subproblem visited in the original PRAXIS call on one side is part of a solution with the LicketySPLIT tree on the other, so we know we’re not double counting solutions\.

Thus, even though we may solve a subproblem multiple times with different budgets, the number of times a subproblem \(a sequence of splits\) is solved is never more than the number of trees in which it appears\.

This fact implies that the combined subproblem sizes at any level $d'$ cannot exceed $n|R|$, the sum of all subproblem sizes across the Rashomon-set trees found for that depth, because any distinct subproblem is solved once, and any subproblem shared by $t$ trees is solved no more than $t$ times. Thus, even in this case, the total size of the subproblems solved is bounded by the same quantity, and so is the asymptotic runtime.

∎

See[3\.3](https://arxiv.org/html/2606.00202#S3.Thmtheorem3)

###### Proof\.

Starting from the initial condition, we have:

$$\forall q,\quad |R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)| \in o\!\left(\frac{\binom{k}{d}}{(kd)^{q}}\right).$$

Applying the definition of little $o$, we know:

$$\begin{aligned}
&\forall q, \exists c, d_{0}, k_{0}, \forall k \geq k_{0}, d \geq d_{0}, & |R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)| &< c\,\frac{\binom{k}{d}}{(kd)^{q}}\\
&\forall q, \exists c, d_{0}, k_{0}, \forall k \geq k_{0}, d \geq d_{0}, & |R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)|\,(kd)^{q} &< c\,\binom{k}{d}\\
&\forall q, \exists c, d_{0}, k_{0}, \forall k \geq k_{0}, d \geq d_{0}, & (kd)^{q} &< c\,\frac{\binom{k}{d}}{|R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)|}\\
&\forall q, \exists c, d_{0}, k_{0}, \forall k \geq k_{0}, d \geq d_{0}, & \frac{1}{c}(kd)^{q} &< \frac{\binom{k}{d}}{|R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)|}
\end{aligned}$$

Define $c' = \frac{1}{c}$:

$$\begin{aligned}
&\forall q, \exists c', d_{0}, k_{0}, \forall k \geq k_{0}, d \geq d_{0}, & c'(kd)^{q} &< \frac{\binom{k}{d}}{|R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)|}\\
&\forall q, \exists c', d_{0}, k_{0}, \forall k \geq k_{0}, d \geq d_{0}, & \frac{\binom{k}{d}}{|R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)|} &> c'(kd)^{q}
\end{aligned}$$

Divide both sides by $kd$ times the ProxyCompute cost:

$$\forall q, \exists c', d_{0}, k_{0}, \forall k \geq k_{0}, d \geq d_{0},\quad \frac{\binom{k}{d}}{kd\,\textsc{ProxyCompute}(n,k,d)\,|R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)|} > \frac{c'(kd)^{q-1}}{\textsc{ProxyCompute}(n,k,d)} \qquad (4)$$

By assumption, the proxy is polynomial overall (and linear in $n$, to adhere to Theorem [3.2](https://arxiv.org/html/2606.00202#S3.Thmtheorem2)). So we know there must be some constants $z, \tilde{c}, \tilde{d_{0}}, \tilde{k_{0}}, n_{0}$ for which $\textsc{ProxyCompute}(n,k,d) \leq \tilde{c}\,n\,(kd)^{z}$ for $n \geq n_{0}, k \geq \tilde{k_{0}}, d \geq \tilde{d_{0}}$. So, considering the constants from Equation [4](https://arxiv.org/html/2606.00202#A1.E4), defining $k_{0}' = \max(k_{0}, \tilde{k_{0}})$, $d_{0}' = \max(d_{0}, \tilde{d_{0}})$, we have

$$\forall q, \exists c', \tilde{c}, z, d_{0}', k_{0}', \forall k \geq k_{0}', d \geq d_{0}', n \geq n_{0},\quad \frac{\binom{k}{d}}{kd\,\textsc{ProxyCompute}(n,k,d)\,|R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)|} > \frac{c'(kd)^{q-1}}{\tilde{c}\,n\,(kd)^{z}}$$

Define $q' = q - 1 - z$, $c'' = \frac{c'}{\tilde{c}}$:

$$\begin{aligned}
&\forall q', \exists c'', d_{0}', k_{0}', \forall k \geq k_{0}', d \geq d_{0}', n \geq n_{0}, & \frac{\binom{k}{d}}{kd\,\textsc{ProxyCompute}(n,k,d)\,|R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)|} &> \frac{c''(kd)^{q'}}{n}\\
&\forall q', \exists c'', d_{0}', k_{0}', \forall k \geq k_{0}', d \geq d_{0}', n \geq n_{0}, & \frac{n\binom{k}{d}}{kd\,\textsc{ProxyCompute}(n,k,d)\,|R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)|} &> c''(kd)^{q'}
\end{aligned}$$
Now, we lower bound the runtime of an optimal tree algorithm, and upper bound the runtime of PRAXIS. The runtime for full dynamic programming search cited in Demirović et al. ([2023](https://arxiv.org/html/2606.00202#bib.bib26)) is $\Theta\bigl((n+2^{d})\binom{k}{d}\bigr)$ (after translating that result to this paper’s notation for the number of features and depth). So the runtime is also $\Omega\bigl(n\binom{k}{d}\bigr)$, and we can write:

$$\exists \tilde{c}', \tilde{d_{0}}', \tilde{k_{0}}', n_{0}', \forall k \geq \tilde{k_{0}}', d \geq \tilde{d_{0}}', n \geq n_{0}',\quad \textrm{Worst-Case Runtime}(\textsc{OPT}) \geq \tilde{c}'\,n\binom{k}{d}$$

So substituting $c''' = c''\,\tilde{c}'$, $d_{0}'' = \max(d_{0}', \tilde{d_{0}}')$, $k_{0}'' = \max(k_{0}', \tilde{k_{0}}')$, $n_{0}'' = \max(n_{0}, n_{0}')$:

$$\forall q', \exists c''', d_{0}'', k_{0}'', \forall k \geq k_{0}'', d \geq d_{0}'', n \geq n_{0}'',\quad \frac{\textrm{Worst-Case Runtime}(\textsc{OPT})}{kd\,\textsc{ProxyCompute}(n,k,d)\,|R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)|} > c'''(kd)^{q'}$$

Now, using Theorem [3.2](https://arxiv.org/html/2606.00202#S3.Thmtheorem2), we know $\textrm{Worst-Case Runtime}(\textsc{PRAXIS}) \in \mathcal{O}\!\left(|R_{\varepsilon_{\mathrm{abs}}}|\,k\,d \cdot \textsc{ProxyCompute}(n,k,d)\right)$. So,

$$\exists \tilde{c}'', \tilde{d_{0}}'', \tilde{k_{0}}'', \tilde{n_{0}}, \forall k \geq \tilde{k_{0}}'', d \geq \tilde{d_{0}}'', n \geq \tilde{n_{0}},\quad \textrm{Worst-Case Runtime}(\textsc{PRAXIS}) \leq \tilde{c}''\,kd\,\textsc{ProxyCompute}(n,k,d)\,|R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)|.$$

So substituting $c'''' = \frac{c'''}{\tilde{c}''}$, $d_{0}''' = \max(d_{0}'', \tilde{d_{0}}'')$, $k_{0}''' = \max(k_{0}'', \tilde{k_{0}}'')$, $n_{0}''' = \max(n_{0}'', \tilde{n_{0}})$,

$$\forall q', \exists c'''', d_{0}''', k_{0}''', \forall k \geq k_{0}''', d \geq d_{0}''', n \geq n_{0}''',\quad \frac{\textrm{Worst-Case Runtime}(\textsc{OPT})}{\textrm{Worst-Case Runtime}(\textsc{PRAXIS})} > c''''(kd)^{q'}$$

Meaning:

$$\forall q,\quad \frac{\textrm{Worst-Case Runtime}(\textsc{OPT})}{\textrm{Worst-Case Runtime}(\textsc{PRAXIS})} \in \omega\bigl((kd)^{q}\bigr),$$

as required.

Note also the following intermediate consequence. The worst-case runtime of OPT exceeds the Rashomon set size by more than any polynomial factor, as shown below.

$$\begin{aligned}
&\forall q', \exists c''', d_{0}'', k_{0}'', \forall k \geq k_{0}'', d \geq d_{0}'', n \geq n_{0}'', & \frac{\textrm{Worst-Case Runtime}(\textsc{OPT})}{kd\,\textsc{ProxyCompute}(n,k,d)\,|R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)|} &> c'''(kd)^{q'}\\
&\forall q', \exists c''', d_{0}'', k_{0}'', \forall k \geq k_{0}'', d \geq d_{0}'', n \geq n_{0}'', & \frac{\textrm{Worst-Case Runtime}(\textsc{OPT})}{kd\,\textsc{ProxyCompute}(n,k,d)} &> c'''(kd)^{q'}\,|R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)|\\
&\forall q', \exists c''', d_{0}'', k_{0}'', \forall k \geq k_{0}'', d \geq d_{0}'', n \geq n_{0}'', & \textrm{Worst-Case Runtime}(\textsc{OPT}) &> c'''(kd)^{q'}\,|R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)|\\
&\forall q, & \textrm{Worst-Case Runtime}(\textsc{OPT}) &\in \omega\bigl((kd)^{q}\,|R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)|\bigr)\\
&\Rightarrow & \textrm{Worst-Case Runtime}(\textsc{OPT}) &\in \omega\bigl(|R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)|\,\textit{Poly}(k,d)\bigr) \qquad (5)
\end{aligned}$$

In contrast, the worst-case runtime of PRAXIS is at most a polynomial multiple of the Rashomon set size:

$$\begin{aligned}
&\textrm{Worst-Case Runtime}(\textsc{PRAXIS}) \in \mathcal{O}\bigl(|R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)|\,kd\,\textsc{ProxyCompute}(n,k,d)\bigr) && (6)\\
&\Rightarrow \textrm{Worst-Case Runtime}(\textsc{PRAXIS}) \in \mathcal{O}\bigl(|R_{\varepsilon_{\textrm{abs}}}(D,d,\gamma)|\,n\,\textit{Poly}(k,d)\bigr) && (7)
\end{aligned}$$
∎
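To get a feel for the hypothesis of the corollary, it helps to see how quickly $\binom{k}{d}$ outgrows a fixed polynomial in $kd$. The sketch below is only an illustration: the choices $d = k/2$ and $q = 3$ are assumptions made here so that depth scales with the number of features, and they do not come from the paper.

```python
from math import comb


def binomial_over_poly(k: int, d: int, q: int) -> float:
    """Return C(k, d) / (k*d)**q, the comparison quantity in the hypothesis."""
    return comb(k, d) / (k * d) ** q


q = 3                      # fixed polynomial degree (illustrative choice)
feature_counts = (10, 20, 40)
ratios = [binomial_over_poly(k, k // 2, q) for k in feature_counts]

for k, ratio in zip(feature_counts, ratios):
    print(f"k={k:2d}, d={k // 2:2d}: C(k,d)/(kd)^q = {ratio:.4g}")
```

In this regime the quotient keeps growing as $k$ doubles, so a Rashomon set can be very large in absolute terms and still satisfy the little-$o$ condition.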

##### Comments on Corollary[3\.3](https://arxiv.org/html/2606.00202#S3.Thmtheorem3)\.

For ease of discussion, we have incorporated two simplifying conditions in Corollary [3.3](https://arxiv.org/html/2606.00202#S3.Thmtheorem3): (1) $\gamma = 0$, and (2) the absolute Rashomon budget $\varepsilon_{\mathrm{abs}}$ is fixed and does not vary with depth. Both can reasonably be relaxed. $\gamma$ can be nonzero without necessarily changing the worst case behaviour of optimal trees (the applicability of bounds from $\gamma$, i.e., those in GOSDT (Lin et al., [2020](https://arxiv.org/html/2606.00202#bib.bib46)), now depends on search orders and what bounds are admitted by the dataset $D$). $\varepsilon$ can be defined to scale with the depth (since the optimal objective for a fixed dataset monotonically decreases with depth budget), and/or be defined as a multiplicative factor of the proxy algorithm’s call for the current $n, k, d$. Such a relaxation can be useful for defining the scaling of a Rashomon set’s size with depth, as models become more accurate.

See[3\.4](https://arxiv.org/html/2606.00202#S3.Thmtheorem4)

###### Proof\.

By assumption, the proxy algorithm requires only $\mathcal{O}(nk)$ memory to run. This holds for many possible proxy algorithms, including LicketySPLIT ([A.11](https://arxiv.org/html/2606.00202#A1.Thmtheorem11)). Regardless of the proxy algorithm, it takes a single float or integer to store the output: the objective of some tree.

Now, when we run PRAXIS, at each node we visit, we just need to know which splits result in proxy objectives below the epsilon bound. So for each potential split, we need to run the proxy algorithm on the left and right subproblems (with $O(nk)$ memory total), then check if that falls within the budget $\varepsilon_{\textrm{abs}}$. If it does, then we will save the resulting subproblems from this split, and later visit those nodes to continue PRAXIS. This will be an additional two nodes we need to persist in the dependency graph. However, every time this happens (increasing the total storage used by 2), the total sum of nodes in trees across the entire Rashomon set will also increase by at least 2, since at least one tree with this split (the one found by PRAXIS) will fall within the $\varepsilon_{\textrm{abs}}$ bound, and that tree will have at least one internal or leaf node corresponding to each of these two nodes, since it includes this split.

To keep the information stored in each node small, we use the following structure (assuming one global copy of the entire dataset is provided as input). Consider a node to be active only if we are currently visiting that node (that is, we are mid-execution of Algorithm [1](https://arxiv.org/html/2606.00202#alg1) at that node), or if we are visiting one of its child nodes. While a node is active, we persist the row indices of the data subset used in that node. We can still efficiently determine the row indices relevant for any child using only these row indices, the original dataset, and the binary splitting feature, so the runtime is not affected. We also maintain memory efficiency: only $O(d)$ nodes are active at once, so the total memory required to persist these row indices is $O(nd)$; since the depth cannot exceed the number of binary features, this memory is within $O(nk)$ and does not affect the asymptotic complexity. Once we are done visiting a node, we do not need to visit it again and can safely stop persisting its row indices. (Footnote 2: This is slightly complicated by the multiple passes induced by Algorithm [3](https://arxiv.org/html/2606.00202#alg3); however, it does not affect the asymptotic complexity if we reconstruct the relevant row indices when revisiting that side for an additional pass.)

So the total information we persist is just:

1. A single copy of the original dataset.
2. The dependency graph structure, which can be constructed so that each node uses a constant amount of storage, and which has no more nodes than there are nodes (split nodes or leaves) across the whole Rashomon set.
3. Row indices for currently active subproblems in the dependency graph.

Since the total number of leaves and internal nodes of a tree is bounded by twice its number of leaves, object (2) is $O\big(\sum_{t\in R}|t|\big)$. Object (1) is $O(nk)$, so combining them proves the claimed memory complexity. (Note that (3) is also within $O(nk)$, so it does not affect the complexity.)

∎
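The row-index bookkeeping in this proof can be sketched as follows. This is a minimal illustration under stated assumptions, not the PRAXIS implementation: the function name `peak_active_entries` and the toy dataset are hypothetical, and the walk visits every non-degenerate split with no proxy pruning, purely to count how many row indices are held along the current root-to-node path.

```python
import numpy as np


def peak_active_entries(X, rows, depth):
    """Count the most row indices held at once during a depth-first visit.

    X     : (n, k) binary matrix, the single shared copy of the dataset
    rows  : 1-D integer array of row indices for the current node
    depth : remaining depth budget
    """
    held = len(rows)  # this node stays active while its descendants are visited
    if depth == 0:
        return held
    peak = held
    for feature in range(X.shape[1]):
        for side in (0, 1):
            # Child indices are rebuilt from the parent's indices and one column.
            child = rows[X[rows, feature] == side]
            if len(child) == 0 or len(child) == len(rows):
                continue  # degenerate split, nothing to visit
            peak = max(peak, held + peak_active_entries(X, child, depth - 1))
            # `child` is released here, before the next subproblem is built.
    return peak


# Hypothetical toy dataset: 4 rows, 2 binary features.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
n, depth = X.shape[0], 2
peak = peak_active_entries(X, np.arange(n), depth)
print(peak, "<=", n * (depth + 1))
```

Because each child's index array is derived on demand and dropped once its visit finishes, the count never exceeds $n(d+1)$, in line with the $O(nd)$ term in the argument above.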


See [3.5](https://arxiv.org/html/2606.00202#S3.Thmtheorem5)

###### Proof\.

Fix a target tree $t$ of depth at most $d$ whose root-to-leaf paths we want to show are materialized in the AND/OR graph. For any node or tree $u$, denote the depth of that node/tree as $d_u$.

##### Pruning at the Root

Let the root of $t$ split on feature $f$, yielding children $t_{\mathrm{left}}$ and $t_{\mathrm{right}}$. Algorithm [1](https://arxiv.org/html/2606.00202#alg1) will consider feature $f$ and perform the pruning test

$$\textsc{Proxy}\big(D_{t_{\mathrm{left}}},d_{\mathrm{root}}-1,\gamma\big)+\textsc{Proxy}\big(D_{t_{\mathrm{right}}},d_{\mathrm{root}}-1,\gamma\big)\;\leq\;\varepsilon_{\mathrm{abs}}. \tag{8}$$

If ([8](https://arxiv.org/html/2606.00202#A1.E8)) holds, the split on $f$ is not pruned and the algorithm proceeds to build subtries for both children, so the prefix consisting of the root split of $t$ is materialized. (If $t$ is a single leaf, then it is handled before any split-pruning occurs.)

Let $b_L$ and $b_R$ denote the budgets used when recursively solving the left and right child subproblems for the split on $f$. In the worst case (no iterative refinement), these are set by subtracting the proxy estimate of the opposite side (as shown in [Theorem A.3](https://arxiv.org/html/2606.00202#A1.Thmtheorem3) and [A.4](https://arxiv.org/html/2606.00202#A1.Thmtheorem4)):

$$b_L\;=\;\varepsilon_{\mathrm{abs}}-\textsc{Proxy}\big(D_{t_{\mathrm{right}}},d_{\mathrm{root}}-1,\gamma\big),\qquad b_R\;=\;\varepsilon_{\mathrm{abs}}-\textsc{Proxy}\big(D_{t_{\mathrm{left}}},d_{\mathrm{root}}-1,\gamma\big). \tag{9}$$

##### Budget propagation down a path

Let $u$ be any internal node of $t$ at depth $d'\geq 1$ (root at depth $0$). Let $n_0=\mathrm{root},n_1,\dots,n_{d'}=u$ be the nodes on the unique path from the root to $u$ in $t$. For each $i\in\{1,\dots,d'\}$, let $s_i$ denote the sibling of $n_i$ (i.e., $s_i$ is the other child of $n_{i-1}$ not equal to $n_i$). Finally, let $u_{\mathrm{left}}$ and $u_{\mathrm{right}}$ denote the two children of $u$ in $t$.

In the worst case (without iterative budget refinement), we have

$$\varepsilon_u\;=\;\varepsilon_{\mathrm{abs}}\;-\;\sum_{i=1}^{d'}\textsc{Proxy}(D_{s_i},d_{s_i},\gamma), \tag{10}$$

where $\varepsilon_u$ is the budget with which the algorithm solves the subproblem corresponding to node $u$.

##### Proof of ([10](https://arxiv.org/html/2606.00202#A1.E10)) by induction on $d'$.

For $d'=1$, $u=n_1$ is a child of the root and $s_1$ is the opposite child. By ([9](https://arxiv.org/html/2606.00202#A1.E9)), the budget for $u$ is exactly $\varepsilon_{\mathrm{abs}}-\textsc{Proxy}(D_{s_1},d_{s_1},\gamma)$, which matches ([10](https://arxiv.org/html/2606.00202#A1.E10)).

Assume ([10](https://arxiv.org/html/2606.00202#A1.E10)) holds for depth $d'-1$, i.e.,

$$\varepsilon_{n_{d'-1}}\;=\;\varepsilon_{\mathrm{abs}}-\sum_{i=1}^{d'-1}\textsc{Proxy}(D_{s_i},d_{s_i},\gamma).$$

At node $n_{d'-1}$, in the worst case, we pass its chosen child $u=n_{d'}$ a budget that subtracts off the proxy objective of the other side (see [Theorem A.3](https://arxiv.org/html/2606.00202#A1.Thmtheorem3) and [A.4](https://arxiv.org/html/2606.00202#A1.Thmtheorem4)). This gives us

$$\begin{aligned}
\varepsilon_u &=\varepsilon_{n_{d'-1}}-\textsc{Proxy}(D_{s_{d'}},d_{s_{d'}},\gamma)\\
&=\Big(\varepsilon_{\mathrm{abs}}-\sum_{i=1}^{d'-1}\textsc{Proxy}(D_{s_i},d_{s_i},\gamma)\Big)-\textsc{Proxy}(D_{s_{d'}},d_{s_{d'}},\gamma)\\
&=\varepsilon_{\mathrm{abs}}-\sum_{i=1}^{d'}\textsc{Proxy}(D_{s_i},d_{s_i},\gamma),
\end{aligned}$$

establishing ([10](https://arxiv.org/html/2606.00202#A1.E10)).
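As a purely illustrative instance of the induction (the numbers are made up and do not come from any experiment), take $d'=2$, $\varepsilon_{\mathrm{abs}}=10$, and sibling proxy objectives $3$ and $2$. The budget shrinks by one sibling term per level:

```latex
\begin{aligned}
\varepsilon_{n_1} &= \varepsilon_{\mathrm{abs}} - \textsc{Proxy}(D_{s_1}, d_{s_1}, \gamma) = 10 - 3 = 7,\\
\varepsilon_{u} = \varepsilon_{n_2} &= \varepsilon_{n_1} - \textsc{Proxy}(D_{s_2}, d_{s_2}, \gamma) = 7 - 2 = 5
  = \varepsilon_{\mathrm{abs}} - \sum_{i=1}^{2} \textsc{Proxy}(D_{s_i}, d_{s_i}, \gamma).
\end{aligned}
```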

##### The pruning test at $u$.

At node $u$, the algorithm considers the split that $t$ takes at $u$ and performs the pruning check

$$\textsc{Proxy}\big(D_{u_{\mathrm{left}}},d_u-1,\gamma\big)+\textsc{Proxy}\big(D_{u_{\mathrm{right}}},d_u-1,\gamma\big)\;\leq\;\varepsilon_u. \tag{11}$$

Substituting ([10](https://arxiv.org/html/2606.00202#A1.E10)) into ([11](https://arxiv.org/html/2606.00202#A1.E11)) and rearranging yields the *frontier-cut inequality*:

\begin{align}
&\textsc{Proxy}\big(D_{u_{\mathrm{left}}},d_u-1,\gamma\big)+\textsc{Proxy}\big(D_{u_{\mathrm{right}}},d_u-1,\gamma\big)\leq\varepsilon_{\mathrm{abs}}-\sum_{i=1}^{d'}\textsc{Proxy}(D_{s_i},d_{s_i},\gamma) \tag{12}\\
\iff\qquad&\sum_{i=1}^{d'}\textsc{Proxy}(D_{s_i},d_{s_i},\gamma)\;+\;\textsc{Proxy}\big(D_{u_{\mathrm{left}}},d_u-1,\gamma\big)\;+\;\textsc{Proxy}\big(D_{u_{\mathrm{right}}},d_u-1,\gamma\big)\leq\varepsilon_{\mathrm{abs}}. \tag{13}
\end{align}

The set

$$\{s_1,\dots,s_{d'},u_{\mathrm{left}},u_{\mathrm{right}}\}$$

is exactly the frontier cut associated with $u$.

##### Sufficiency: satisfying all frontier cuts implies $t$ is fully represented.

Assume ([13](https://arxiv.org/html/2606.00202#A1.E13)) holds for every internal node $u$ of $t$ (including the root). At the root, the sibling sum is empty and this reduces to ([8](https://arxiv.org/html/2606.00202#A1.E8)), so the root split is not pruned. Inductively, since budgets are initially allocated using proxy objectives and iterative budget refinement can only increase these budgets, the local pruning test ([11](https://arxiv.org/html/2606.00202#A1.E11)) succeeds at every node $u$. Thus, every split used by $t$ is expanded, and every root-to-leaf path of $t$ is materialized in the AND/OR graph.
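The sufficiency check can be sketched in a few lines of Python. This is an illustration only, assuming a hypothetical `Node` record that stores, for each node of the target tree, the proxy objective of its subproblem; none of these names come from the PRAXIS codebase, and the numbers are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    proxy: float                     # assumed proxy objective on this node's subproblem
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def frontier_cuts_hold(node, eps_abs, sibling_sum=0.0):
    """Check the frontier-cut inequality (13) at every internal node below `node`.

    sibling_sum accumulates the proxy objectives of the siblings on the path
    from the root to `node` (the s_i terms).
    """
    if node.left is None or node.right is None:
        return True  # leaves have no split to prune
    cut = sibling_sum + node.left.proxy + node.right.proxy
    if cut > eps_abs:
        return False  # this split would be pruned in the worst case
    return (frontier_cuts_hold(node.left, eps_abs, sibling_sum + node.right.proxy)
            and frontier_cuts_hold(node.right, eps_abs, sibling_sum + node.left.proxy))


# Hypothetical target tree: the root's left child splits again, the right child is a leaf.
example = Node(10, Node(4, Node(1), Node(2)), Node(5))
print(frontier_cuts_hold(example, eps_abs=10))
```

Passing the running sibling sum down the recursion mirrors equation (10): each step down the path subtracts one more sibling's proxy objective from the budget.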

##### On non\-necessity:

The earlier analysis does not account for the fact that we may discover a tree better than the proxy certifies (in particular, [A.16](https://arxiv.org/html/2606.00202#A1.Thmtheorem16) guarantees recovery of trees at least as good as the pilot algorithm induced by the proxy). As a consequence, we may subtract off less than the proxy value, leading to larger child budgets and less pruning.

Because we instead recurse on each child assuming the other side attains the best objective found (which we denote $\mathrm{MinObj}$), and this can be better than the proxy value, a node $u$ at depth $d'$ with path siblings $s_1,\dots,s_{d'}$ has, after refinement,

$$\varepsilon_u\;\geq\;\varepsilon_{\mathrm{abs}}\;-\;\sum_{i=1}^{d'}\mathrm{MinObj}(s_i),$$

so a sufficient frontier-cut condition for expanding the split at $u$ is really

$$\sum_{i=1}^{d'}\mathrm{MinObj}(s_i)\;+\;\textsc{Proxy}\big(u_{\mathrm{left}}\big)\;+\;\textsc{Proxy}\big(u_{\mathrm{right}}\big)\;\leq\;\varepsilon_{\mathrm{abs}},$$

i.e., Solve\_Siblings refines the sibling terms from $\textsc{Proxy}(s_i)$ to $\mathrm{MinObj}(s_i)$, while the two children at $u$ are still conservatively completed using $\textsc{Proxy}(\cdot)$.

∎

See [3.6](https://arxiv.org/html/2606.00202#S3.Thmtheorem6)

###### Proof\.

Fix any internal node $u$ of $t_r$ at depth $q$, and let

$$F(u)=\{s_1,\dots,s_q,u_{\mathrm{left}},u_{\mathrm{right}}\}$$

denote the associated frontier cut as defined in Theorem [3.5](https://arxiv.org/html/2606.00202#S3.Thmtheorem5). By definition of $\beta$, for every $v\in F(u)$,

$$\frac{\textsc{Proxy}(D_v,d_v,\gamma)}{\min_{t\in\mathcal{T}_{d_v}}\mathrm{Obj}(t,D_v,\gamma)}\leq\beta,$$

so

$$\textsc{Proxy}(D_v,d_v,\gamma)\leq\beta\min_{t\in\mathcal{T}_{d_v}}\mathrm{Obj}(t,D_v,\gamma).$$

Summing over $v\in F(u)$ yields

$$\sum_{v\in F(u)}\textsc{Proxy}(D_v,d_v,\gamma)\leq\beta\sum_{v\in F(u)}\min_{t\in\mathcal{T}_{d_v}}\mathrm{Obj}(t,D_v,\gamma). \tag{14}$$

The subtrees of $t_r$ rooted at the nodes in $F(u)$ are disjoint and collectively cover all leaves of $t_r$, so

$$\sum_{v\in F(u)}\min_{t\in\mathcal{T}_{d_v}}\mathrm{Obj}(t,D_v,\gamma)\leq\sum_{v\in F(u)}\mathrm{Obj}(t_r|_v,D_v,\gamma)=\mathrm{Obj}(t_r,D,\gamma),$$

where $t_r|_v$ denotes the subtree of $t_r$ rooted at $v$. Substituting into ([14](https://arxiv.org/html/2606.00202#A1.E14)) gives

$$\sum_{v\in F(u)}\textsc{Proxy}(D_v,d_v,\gamma)\leq\beta\,\mathrm{Obj}(t_r,D,\gamma).$$

Using the definition of $\eta_r$,

$$\mathrm{Obj}(t_r,D,\gamma)=\eta_r\left((1+\varepsilon_{\mathrm{mult}})\min_{t\in\mathcal{T}_d}\mathrm{Obj}(t,D,\gamma)\right),$$

so

$$\sum_{v\in F(u)}\textsc{Proxy}(D_v,d_v,\gamma)\leq\beta\,\eta_r(1+\varepsilon_{\mathrm{mult}})\min_{t\in\mathcal{T}_d}\mathrm{Obj}(t,D,\gamma).$$

By definition of $\alpha$,

$$\min_{t\in\mathcal{T}_d}\mathrm{Obj}(t,D,\gamma)=\frac{\textsc{Proxy}(D,d,\gamma)}{\alpha},$$

hence

$$\sum_{v\in F(u)}\textsc{Proxy}(D_v,d_v,\gamma)\leq\frac{\beta}{\alpha}\eta_r(1+\varepsilon_{\mathrm{mult}})\,\textsc{Proxy}(D,d,\gamma).$$

Now multiply the standard root budget

$$\varepsilon_{\mathrm{abs}}=(1+\varepsilon_{\mathrm{mult}})\,\textsc{Proxy}(D,d,\gamma)$$

by

$$\sigma=\max\left\{1,\frac{\beta}{\alpha}\eta_r\right\}.$$

Then

$$\sigma\,\varepsilon_{\mathrm{abs}}\geq\frac{\beta}{\alpha}\eta_r(1+\varepsilon_{\mathrm{mult}})\,\textsc{Proxy}(D,d,\gamma)\geq\sum_{v\in F(u)}\textsc{Proxy}(D_v,d_v,\gamma).$$

Thus, with the scaled root budget $\sigma\,\varepsilon_{\mathrm{abs}}$, the frontier-cut inequality of Theorem [3.5](https://arxiv.org/html/2606.00202#S3.Thmtheorem5) holds for $u$. Since $u$ was arbitrary, it holds for every internal node of $t_r$. Therefore, by Theorem [3.5](https://arxiv.org/html/2606.00202#S3.Thmtheorem5), PRAXIS will return $t_r$. ∎
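The scaling factor in this proof is simple to evaluate numerically. The sketch below uses hypothetical values for $\alpha$, $\beta$, $\eta_r$, $\varepsilon_{\mathrm{mult}}$, and the optimal root objective (none are taken from the paper's experiments), and the helper name `root_budget_scale` is an assumption introduced for illustration. It checks that the scaled budget covers the upper bound $\beta\,\mathrm{Obj}(t_r,D,\gamma)$ on any frontier-cut sum.

```python
def root_budget_scale(alpha, beta, eta_r):
    """sigma = max{1, (beta / alpha) * eta_r}, as in the proof above."""
    return max(1.0, (beta / alpha) * eta_r)


# Hypothetical quantities, chosen only to make the arithmetic easy to follow.
opt_root = 100.0            # optimal objective at the root
alpha, beta = 1.2, 1.5      # proxy-to-optimal ratios at the root / at subproblems
eps_mult, eta_r = 0.05, 0.9

proxy_root = alpha * opt_root                  # definition of alpha
eps_abs = (1 + eps_mult) * proxy_root          # standard root budget
obj_tr = eta_r * (1 + eps_mult) * opt_root     # definition of eta_r
frontier_bound = beta * obj_tr                 # bound on any frontier-cut sum

sigma = root_budget_scale(alpha, beta, eta_r)
print(sigma, sigma * eps_abs, frontier_bound)
```

In this instance $\frac{\beta}{\alpha}\eta_r > 1$, so the bound is tight up to floating-point rounding; when $\frac{\beta}{\alpha}\eta_r \le 1$ the factor is clamped to $1$ and the standard budget already suffices.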

### A.2 Additional Theorems and Proofs

This section includes additional theoretical guarantees for PRAXIS. We show that PRAXIS can return a set of trees that are all arbitrarily better than pure greedy, prove convergence of iterative budget refinement, and provide bounds that characterize how additional slack in the root budget compensates for proxy optimality gaps. We also prove some algorithm invariants that hold when a proxy algorithm is used (that is, when the proxy adheres to Definition [3.1](https://arxiv.org/html/2606.00202#S3.Thmtheorem1)), and establish that key decision tree algorithms are indeed proxy algorithms.

##### Simplifying Notation\.

For any subproblem node $u$ with corresponding dataset $D_u$ and remaining depth budget $d_u$, define

$$\mathrm{OPT}(u)\;:=\;\min_{t\in\mathcal{T}_{d_u}}\mathrm{Obj}(t,D_u,\gamma).$$

When $u=\mathrm{root}$ we also write $\mathrm{OPT}(\mathrm{root})=\min_{t\in\mathcal{T}_d}\mathrm{Obj}(t,D,\gamma)$.

###### Proposition A.1 (Monotonicity in the budget provided to PRAXIS).

Fix a dataset $D$, depth $d$, and regularization $\gamma$. Let $0\leq A\leq B$ be two absolute Rashomon budgets, and let $\hat{\mathcal{R}}^{\textsc{PRAXIS}(D,d,\gamma,U)}$ denote the set of trees returned by $\textsc{PRAXIS}(D,d,\gamma,U)$ for $U\in\{A,B\}$.

Then

$$\hat{\mathcal{R}}^{\textsc{PRAXIS}(D,d,\gamma,A)}\;\subseteq\;\hat{\mathcal{R}}^{\textsc{PRAXIS}(D,d,\gamma,B)}.$$

That is, increasing the budget from $A$ to $B$ can only expand the returned AND/OR graph and set of returned trees.

###### Proof\.

The proof proceeds by induction on the depth $d$.

Base case ($d=0$): At depth $0$, PRAXIS considers only leaf actions. A leaf is added if and only if its objective fits within the local subproblem budget. Therefore, any leaf added under budget $A$ is also added under budget $B$, and the claim holds.

Inductive Hypothesis: At depth $d$, for all subproblems $D$, all non-negative leaf penalties $\gamma$, and all budgets $0\leq A\leq B$, $\hat{\mathcal{R}}^{\textsc{PRAXIS}(D,d,\gamma,A)}\subseteq\hat{\mathcal{R}}^{\textsc{PRAXIS}(D,d,\gamma,B)}$.

Inductive Step: We want to show that, for all subproblems $D$, all non-negative leaf penalties $\gamma$, and all budgets $0\leq A\leq B$, $\hat{\mathcal{R}}^{\textsc{PRAXIS}(D,d+1,\gamma,A)}\subseteq\hat{\mathcal{R}}^{\textsc{PRAXIS}(D,d+1,\gamma,B)}$. We proceed by showing that each tree in $\hat{\mathcal{R}}^{\textsc{PRAXIS}(D,d+1,\gamma,A)}$ is also in $\hat{\mathcal{R}}^{\textsc{PRAXIS}(D,d+1,\gamma,B)}$. We consider different cases based on the depth of the tree in question.

Case 1 (the tree is a leaf): A tree that is just a leaf at $D$ is in $\hat{\mathcal{R}}^{\textsc{PRAXIS}(D,d+1,\gamma,A)}$ only if it is also in $\hat{\mathcal{R}}^{\textsc{PRAXIS}(D,d+1,\gamma,B)}$, by the same reasoning as in the base case (leaf membership is determined directly by comparing the leaf objective to the budget).

Case 2 (the tree is not a leaf): All non-leaf trees are formed from calls to the Solve\_Siblings routine. In particular, for each choice of initial feature split, the returned trees come from the last call made to PRAXIS on $D_L$ and $D_R$ in Solve\_Siblings for that feature, which corresponds to the call with the most permissive budget.

To handle this case, we need a simple invariant: for each child, its final budget (the last call in Solve\_Siblings) can never be higher for initial budget $A$ than for initial budget $B$. To establish this invariant, we use the following lemma:

###### Lemma A.2 (Higher initial budgets will always result in higher child budgets).

Fix two subproblems $D_L,D_R$ and a depth $d\geq 0$. Fix some $\gamma>0$. Assume that for all budgets $0\leq A\leq B$, $\hat{\mathcal{R}}^{\textsc{PRAXIS}(D_L,d,\gamma,A)}\subseteq\hat{\mathcal{R}}^{\textsc{PRAXIS}(D_L,d,\gamma,B)}$ and $\hat{\mathcal{R}}^{\textsc{PRAXIS}(D_R,d,\gamma,A)}\subseteq\hat{\mathcal{R}}^{\textsc{PRAXIS}(D_R,d,\gamma,B)}$.

Then the final, largest-budget versions of $\varepsilon_L$ and $\varepsilon_R$ in Solve\_Siblings (Algorithm [3](https://arxiv.org/html/2606.00202#alg3)) for initial budget $B$ will be no smaller than $\varepsilon_L$ and $\varepsilon_R$ for initial budget $A$.

###### Proof\.

In Solve\_Siblings, for an initial budget $U$, let $\varepsilon_L^{(i)}(U)$ and $\varepsilon_R^{(i)}(U)$ denote the left/right budgets used on the $i$-th iteration of the while-loop ($i=0,1,2,\dots$). It suffices to show that for all $i\geq 0$ and all $0\leq A\leq B$,

$$\varepsilon_L^{(i)}(A)\leq\varepsilon_L^{(i)}(B)\quad\text{and}\quad\varepsilon_R^{(i)}(A)\leq\varepsilon_R^{(i)}(B). \tag{15}$$

For any dataset $S$ and budget $U$, define

$$m(S,d,\gamma,U):=\min\bigl\{\mathrm{Obj}(t,S,\gamma):\;t\in\hat{\mathcal{R}}^{\textsc{PRAXIS}(S,d,\gamma,U)}\bigr\},$$

with $m(S,d,\gamma,U)=+\infty$ if the set is empty.

By the assumption in this lemma, increasing the budget can only expand the estimated Rashomon set for depth $d$, so $m(S,d,\gamma,U)$ is monotone non-increasing in the budget $U$. That is, if $A\leq B$ then

$$m(S,d,\gamma,B)\leq m(S,d,\gamma,A).$$

We prove ([15](https://arxiv.org/html/2606.00202#A1.E15)) by induction on $i$.

Base case ($i=0$). The initialization sets

$$\varepsilon_L^{(0)}(U)=U-\textsc{Proxy}(D_R,d,\gamma),$$

where the proxy term is independent of $U\in\{A,B\}$. Hence $\varepsilon_L^{(0)}(A)\leq\varepsilon_L^{(0)}(B)$. Next,

$$\varepsilon_R^{(0)}(U)=U-m\bigl(D_L,d,\gamma,\varepsilon_L^{(0)}(U)\bigr).$$

Since $\varepsilon_L^{(0)}(A)\leq\varepsilon_L^{(0)}(B)$ and $m$ is monotone in its budget argument,

$$m\left(D_L,d,\gamma,\varepsilon_L^{(0)}(B)\right)\leq m\left(D_L,d,\gamma,\varepsilon_L^{(0)}(A)\right).$$

Therefore

$$\varepsilon_R^{(0)}(A)=A-m\left(D_L,d,\gamma,\varepsilon_L^{(0)}(A)\right)\leq B-m\left(D_L,d,\gamma,\varepsilon_L^{(0)}(B)\right)=\varepsilon_R^{(0)}(B).$$

Inductive step. Assume ([15](https://arxiv.org/html/2606.00202#A1.E15)) holds at iteration $i$. The updates in Solve\_Siblings are

$$\varepsilon_R^{(i)}(U)=U-m\bigl(D_L,d,\gamma,\varepsilon_L^{(i)}(U)\bigr),\qquad\varepsilon_L^{(i+1)}(U)=U-m\bigl(D_R,d,\gamma,\varepsilon_R^{(i)}(U)\bigr).$$

Using $\varepsilon_L^{(i)}(A)\leq\varepsilon_L^{(i)}(B)$ and monotonicity of $m$ yields $\varepsilon_R^{(i)}(A)\leq\varepsilon_R^{(i)}(B)$. Applying monotonicity again with $\varepsilon_R^{(i)}(A)\leq\varepsilon_R^{(i)}(B)$ gives $\varepsilon_L^{(i+1)}(A)\leq\varepsilon_L^{(i+1)}(B)$.

Thus ([15](https://arxiv.org/html/2606.00202#A1.E15)) holds for $i+1$, completing the induction. Note that if the while loop terminates for budget $A$ but not $B$ (or vice versa), we still apply the update rule for the terminated run (this does not change what is returned, since the subproblem is solved with the same budget as in its last iteration). This case must be handled because the smaller budget could run in fewer iterations, the same number of iterations, or more iterations than the larger budget. ∎

Applying Lemma [A.2](https://arxiv.org/html/2606.00202#A1.Thmtheorem2) at depth $d$ (and using our inductive hypothesis to satisfy the assumption of that lemma) shows that the final calls on $D_L$ and $D_R$ in Solve\_Siblings use a more permissive budget when the initial budget is larger. Therefore, by our inductive hypothesis, the sets of trees found for $D_L$ and $D_R$ with the larger initial budget are supersets of those found with the smaller initial budget. That means any tree found by Solve\_Siblings with the smaller budget will also be found by Solve\_Siblings in the larger budget call, maintaining our invariant.

Since we now know that a tree is in $\hat{\mathcal{R}}^{\textsc{PRAXIS}(D,d+1,\gamma,A)}$ only if it is also in $\hat{\mathcal{R}}^{\textsc{PRAXIS}(D,d+1,\gamma,B)}$, we have shown our inductive step. Accordingly, by induction we have proven Proposition [A.1](https://arxiv.org/html/2606.00202#A1.Thmtheorem1). ∎
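The budget updates analysed in Lemma A.2 can be simulated on a toy problem. The sketch below is an assumption-laden illustration rather than Algorithm 3 itself: `best_found` is a hypothetical stand-in for $m(S,d,\gamma,U)$ that is non-increasing in the budget by construction, the `(budget_needed, objective)` pairs are invented, and the loop's stopping rule (stop once the left budget no longer changes) is assumed.

```python
import math


def best_found(discoveries, budget):
    """Toy stand-in for m(S, d, gamma, U).

    discoveries: list of (budget_needed, objective) pairs; a tree with the given
    objective is assumed to be found once the budget reaches budget_needed.
    """
    found = [obj for needed, obj in discoveries if needed <= budget]
    return min(found) if found else math.inf


def refine_budgets(U, left, right, proxy_right, max_iters=50):
    """Iterate the budget updates from the proof; return the final (eps_L, eps_R)."""
    eps_L = U - proxy_right                     # initial left budget
    eps_R = -math.inf
    for _ in range(max_iters):
        eps_R = U - best_found(left, eps_L)     # right budget from best left tree
        next_eps_L = U - best_found(right, eps_R)
        if next_eps_L == eps_L:                 # assumed stopping rule: no change
            break
        eps_L = next_eps_L
    return eps_L, eps_R


# Hypothetical subproblems: better trees appear only once the budget is large enough.
left = [(5, 5), (8, 3)]
right = [(4, 4), (7, 2)]
for U in (10, 12):
    print(U, refine_budgets(U, left, right, proxy_right=4))
```

With these numbers, raising the initial budget never lowers either final child budget, which is the behaviour the lemma establishes for any $m$ that is non-increasing in its budget argument.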

###### Theorem A.3 (PRAXIS recovers the proxy tree).

Fix any subproblem $(D,d)$ and let the proxy return an objective for a tree $f^{\mathrm{px}}\in\mathcal{T}_d$, which we call

$$P(D,d)\;:=\;\textsc{Proxy}(D,d,\gamma)\;=\;\mathrm{Obj}(f^{\mathrm{px}},D,\gamma).$$

If *PRAXIS* is called with budget $\varepsilon_{\mathrm{abs}}\geq P(D,d)$, i.e.,

$$G\;\leftarrow\;\textsc{PRAXIS}(D,d,\gamma,\varepsilon_{\mathrm{abs}}),$$

then:

1. the returned AND/OR graph $G$ contains (represents) the proxy tree $f^{\mathrm{px}}$ (so $G$ is nonempty);
2. the minimum objective stored at the root OR-node satisfies $G.\mathrm{min\_objective}\;\leq\;P(D,d)$.

In particular, the root node satisfies

$$G.\mathrm{min\_objective}\leq P(D,d).$$

###### Proof\.

We argue by induction on the depth $d$.

Base case ($d=0$). Then $f^{\mathrm{px}}$ is a leaf predicting some label $b'\in\{0,1\}$, hence

$$P(D,0)=C_{b'}, \tag{16}$$

one of the leaf costs computed in Algorithm [1](https://arxiv.org/html/2606.00202#alg1), before any pruning. Since $\varepsilon_{\mathrm{abs}}\geq P(D,0)$, that leaf is added and thus

$$G.\mathrm{min\_objective}\leq C_{b'}=P(D,0). \tag{17}$$

Inductive step ($d\geq 1$). Let the proxy tree at $(D,d)$ split at the root on feature $j^{\star}$, inducing $(D_{f^{\mathrm{px}}_{\mathrm{left}}},D_{f^{\mathrm{px}}_{\mathrm{right}}})$ and proxy subtrees $f^{\mathrm{px}}_{\mathrm{left}},f^{\mathrm{px}}_{\mathrm{right}}$ of depth at most $d-1$. Define the proxy calls made by PRAXIS at this split:

$$P_L\;:=\;\textsc{Proxy}(D_{f^{\mathrm{px}}_{\mathrm{left}}},d-1,\gamma),\qquad P_R\;:=\;\textsc{Proxy}(D_{f^{\mathrm{px}}_{\mathrm{right}}},d-1,\gamma). \tag{18}$$

By the refinement property (Definition [3.1](https://arxiv.org/html/2606.00202#S3.Thmtheorem1)) applied to the proxy tree,

$$P_L\;\leq\;\mathrm{Obj}(f^{\mathrm{px}}_{\mathrm{left}},D_{f^{\mathrm{px}}_{\mathrm{left}}},\gamma),\qquad P_R\;\leq\;\mathrm{Obj}(f^{\mathrm{px}}_{\mathrm{right}},D_{f^{\mathrm{px}}_{\mathrm{right}}},\gamma). \tag{19}$$

Because the objective is additive across a split,

\begin{align}
P(D,d)&=\mathrm{Obj}(f^{\mathrm{px}},D,\gamma) \tag{20}\\
&=\mathrm{Obj}(f^{\mathrm{px}}_{\mathrm{left}},D_{f^{\mathrm{px}}_{\mathrm{left}}},\gamma)+\mathrm{Obj}(f^{\mathrm{px}}_{\mathrm{right}},D_{f^{\mathrm{px}}_{\mathrm{right}}},\gamma) \notag\\
&\geq P_L+P_R. \tag{21}
\end{align}

Since $\varepsilon_{\mathrm{abs}}\geq P(D,d)$, we have

$$P_L+P_R\leq\varepsilon_{\mathrm{abs}}, \tag{22}$$

so Algorithm [1](https://arxiv.org/html/2606.00202#alg1) does not prune split $j^{\star}$, and it calls Solve\_Siblings on $(D_{f^{\mathrm{px}}_{\mathrm{left}}},D_{f^{\mathrm{px}}_{\mathrm{right}}})$.

Inside Solve\_Siblings, the first left budget is set to

$$\varepsilon_L\;=\;\varepsilon_{\mathrm{abs}}-P_R. \tag{23}$$

Using $\varepsilon_{\mathrm{abs}}\geq\mathrm{Obj}(f^{\mathrm{px}}_{\mathrm{left}},D_{f^{\mathrm{px}}_{\mathrm{left}}},\gamma)+\mathrm{Obj}(f^{\mathrm{px}}_{\mathrm{right}},D_{f^{\mathrm{px}}_{\mathrm{right}}},\gamma)$ and $P_R\leq\mathrm{Obj}(f^{\mathrm{px}}_{\mathrm{right}},D_{f^{\mathrm{px}}_{\mathrm{right}}},\gamma)$, we obtain

\begin{align}
\varepsilon_L&=\varepsilon_{\mathrm{abs}}-P_R \tag{24}\\
&\geq\varepsilon_{\mathrm{abs}}-\mathrm{Obj}(f^{\mathrm{px}}_{\mathrm{right}},D_{f^{\mathrm{px}}_{\mathrm{right}}},\gamma) \notag\\
&\geq\mathrm{Obj}(f^{\mathrm{px}}_{\mathrm{left}},D_{f^{\mathrm{px}}_{\mathrm{left}}},\gamma) \notag\\
&\geq P_L. \tag{25}
\end{align}

Therefore the recursive call

$$G_L\leftarrow\textsc{PRAXIS}(D_{f^{\mathrm{px}}_{\mathrm{left}}},d-1,\gamma,\varepsilon_L) \tag{26}$$

satisfies the inductive hypothesis, hence $G_L$ represents $f^{\mathrm{px}}_{\mathrm{left}}$ and

$$G_L.\mathrm{min\_objective}\leq P_L\leq\mathrm{Obj}(f^{\mathrm{px}}_{\mathrm{left}},D_{f^{\mathrm{px}}_{\mathrm{left}}},\gamma). \tag{27}$$

Next, Solve\_Siblings sets the right budget to

$$\varepsilon_R\;=\;\varepsilon_{\mathrm{abs}}-G_L.\mathrm{min\_objective}. \tag{28}$$

Since $G_L.\mathrm{min\_objective}\leq\mathrm{Obj}(f^{\mathrm{px}}_{\mathrm{left}},D_{f^{\mathrm{px}}_{\mathrm{left}}},\gamma)$,

\begin{align}
\varepsilon_R&=\varepsilon_{\mathrm{abs}}-G_L.\mathrm{min\_objective} \tag{29}\\
&\geq\varepsilon_{\mathrm{abs}}-\mathrm{Obj}(f^{\mathrm{px}}_{\mathrm{left}},D_{f^{\mathrm{px}}_{\mathrm{left}}},\gamma) \notag\\
&\geq\mathrm{Obj}(f^{\mathrm{px}}_{\mathrm{right}},D_{f^{\mathrm{px}}_{\mathrm{right}}},\gamma) \notag\\
&\geq P_R. \tag{30}
\end{align}

Thus the call

$$G_R\leftarrow\textsc{PRAXIS}(D_{f^{\mathrm{px}}_{\mathrm{right}}},d-1,\gamma,\varepsilon_R) \tag{31}$$

also satisfies the inductive hypothesis, so it represents $f^{\mathrm{px}}_{\mathrm{right}}$ and

$$G_R.\mathrm{min\_objective}\leq P_R. \tag{32}$$

Finally, Algorithm [1](https://arxiv.org/html/2606.00202#alg1) adds this split, attaching $G_L$ and $G_R$. Therefore $G$ represents the proxy tree $f^{\mathrm{px}}$. Moreover, since $f^{\mathrm{px}}$ is a feasible tree in $G$ with objective $P(D,d)$,

$$G.\mathrm{min\_objective}\leq P(D,d). \tag{33}$$

The while-loop in Solve\_Siblings only increases budgets when a better (lower objective) subtree is found, so the above lower bounds on $\varepsilon_L$ and $\varepsilon_R$ are never violated by subsequent refinements (Proposition [A.1](https://arxiv.org/html/2606.00202#A1.Thmtheorem1) establishes this monotonicity). ∎
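To see the two inequality chains of the inductive step with concrete numbers (all values here are invented for illustration), suppose the proxy subtrees have objectives $4$ on the left and $6$ on the right, so $P(D,d)=10$, and that the proxy calls on the children return $P_L=3$ and $P_R=5$. With $\varepsilon_{\mathrm{abs}}=11$ and, for the sake of the example, $G_L.\mathrm{min\_objective}=3$:

```latex
\begin{aligned}
\varepsilon_L &= \varepsilon_{\mathrm{abs}} - P_R = 11 - 5 = 6 \;\ge\; 11 - 6 = 5 \;\ge\; 4 \;\ge\; 3 = P_L,\\
\varepsilon_R &= \varepsilon_{\mathrm{abs}} - G_L.\mathrm{min\_objective} = 11 - 3 = 8 \;\ge\; 11 - 4 = 7 \;\ge\; 6 \;\ge\; 5 = P_R.
\end{aligned}
```

Both child calls therefore receive a budget at least as large as their own proxy objective, which is the condition needed to apply the inductive hypothesis.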

We note that Theorem [A.3](https://arxiv.org/html/2606.00202#A1.Thmtheorem3) applies directly to PRAXIS, because the root budget is set relative to the proxy objective at the root:

$$\varepsilon_{\mathrm{abs}}\leftarrow\bigl(1+\varepsilon_{\mathrm{mult}}\bigr)\cdot\textsc{Proxy}(D,d,\gamma).$$

Beyond [Theorem A.3](https://arxiv.org/html/2606.00202#A1.Thmtheorem3), which guarantees we find the proxy tree at the root, we also show below (Corollary [A.4](https://arxiv.org/html/2606.00202#A1.Thmtheorem4)) that we recover the proxy tree at all explored subproblems.

###### Corollary A.4 (The AND/OR subgraph for every explored subproblem contains the proxy tree).

Assume the root call to PRAXIS satisfies

$$\varepsilon_{\mathrm{abs}} \geq P(D,d).$$

Then every unpruned OR node in the returned AND/OR graph, with subproblem $(D', d')$, is explored with budget

$$\varepsilon'_{\mathrm{abs}} \geq \textsc{PROXY}(D', d', \gamma).$$

Consequently, by Theorem [A.3](https://arxiv.org/html/2606.00202#A1.Thmtheorem3), the subgraph rooted at every OR node contains the proxy tree for that subproblem, and the minimum objective stored at that node is at most $P(D', d')$.

###### Proof\.

As in Theorem [A.3](https://arxiv.org/html/2606.00202#A1.Thmtheorem3), we use the notation

$$P(D,d) \;:=\; \textsc{PROXY}(D,d,\gamma) \;=\; \mathrm{Obj}(f^{\mathrm{px}}, D, \gamma).$$

Let a split produce child subproblems with proxy values

$$P_{L} := P(D_{L}, d-1), \qquad P_{R} := P(D_{R}, d-1),$$

and suppose this split is not pruned, i.e.

$$P_{L} + P_{R} \leq \varepsilon_{\mathrm{abs}}.$$

Then the initial left budget used by SolveSiblings is

$$\varepsilon_{L} := \varepsilon_{\mathrm{abs}} - P_{R}.$$

Therefore,

$$\varepsilon_{L} = \varepsilon_{\mathrm{abs}} - P_{R} \geq P_{L}.$$

So the recursive call on the left subproblem is made with budget at least its proxy objective, which by Theorem [A.3](https://arxiv.org/html/2606.00202#A1.Thmtheorem3) is sufficient to recover the proxy tree on the left.

Now let $G_{L}$ denote the returned left subgraph. Since the left proxy tree is contained in $G_{L}$, its minimum objective satisfies

$$G_{L}.\mathrm{min\_objective} \leq P_{L}.$$

The right budget is then set to

$$\varepsilon_{R} := \varepsilon_{\mathrm{abs}} - G_{L}.\mathrm{min\_objective}.$$

Using $G_{L}.\mathrm{min\_objective} \leq P_{L}$, we obtain

$$\varepsilon_{R} = \varepsilon_{\mathrm{abs}} - G_{L}.\mathrm{min\_objective} \geq \varepsilon_{\mathrm{abs}} - P_{L} \geq P_{R},$$

since $P_{L} + P_{R} \leq \varepsilon_{\mathrm{abs}}$. Hence the recursive call on the right subproblem is also made with budget at least its proxy objective, so the proxy tree on the right is recovered as well.

Finally, SolveSiblings only increases budgets during subsequent refinement steps (by [A.2](https://arxiv.org/html/2606.00202#A1.Thmtheorem2)). Thus once a child budget is at least the corresponding proxy value, this inequality is never lost (due to the monotonicity of budgets shown in [A.1](https://arxiv.org/html/2606.00202#A1.Thmtheorem1)).

Repeated application of this logic down the AND/OR graph shows the corollary. ∎
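The budget arithmetic in this proof can be checked numerically. The following is a minimal sketch, not the PRAXIS implementation: the function name `sibling_budgets` and its arguments are hypothetical, and the left subgraph's minimum objective is passed in as a plain number instead of being computed by a recursive call.

```python
def sibling_budgets(eps_abs, p_left, p_right, left_min_objective):
    """Return (eps_left, eps_right) for an unpruned split, or None if pruned.

    Hypothetical helper mirroring the proof of Corollary A.4:
      eps_left  = eps_abs - P_R
      eps_right = eps_abs - G_L.min_objective
    """
    if p_left + p_right > eps_abs:
        return None  # proxy completions exceed the budget, so the split is pruned
    if left_min_objective > p_left:
        # The left subgraph contains the left proxy tree, so this cannot happen.
        raise ValueError("left minimum objective cannot exceed P_L")
    eps_left = eps_abs - p_right
    eps_right = eps_abs - left_min_objective
    return eps_left, eps_right


eps_left, eps_right = sibling_budgets(eps_abs=10, p_left=4, p_right=5,
                                      left_min_objective=3)
# Both child budgets are at least the corresponding proxy objective.
print(eps_left >= 4, eps_right >= 5)
```

With these illustrative numbers the left budget is 5 and the right budget is 7, so each child call starts with enough budget to admit its own proxy tree.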

###### Theorem A.5 (All Trees returned by PRAXIS can be Arbitrarily Better than Greedy).

Fix a depth budget $d$ and let $\varepsilon_{\mathrm{mult}} \geq 0$ denote the multiplicative slack used in the root budget. Assume $\lambda = \gamma = 0$ (i.e., the objective is just misclassification error) and use a proxy algorithm satisfying

$$\forall S,\quad \mathrm{PROXY}(S,d,\gamma) \;\leq\; \mathrm{Obj}(\textsc{LicketySPLIT}(S,d,\gamma)),$$

where LicketySPLIT refers to Algorithm [16](https://arxiv.org/html/2606.00202#alg16). Then for every $\delta > 0$ there exist a data distribution $\mathcal{D}$ and a sample size $n$ such that, with high probability over $S \sim \mathcal{D}^{n}$, setting $\varepsilon_{\mathrm{abs}} = (1+\varepsilon_{\mathrm{mult}})\,\textsc{Proxy}(S,d,\gamma)$:

1. A pure greedy (information-gain) depth-$d$ tree achieves accuracy at most $\tfrac{1}{2}+\delta$.
2. Every tree $T$ returned by $\textsc{PRAXIS}(S,d,\gamma,\varepsilon_{\mathrm{abs}})$ achieves accuracy at least $1-\delta$.

###### Proof\.

Fix $\delta > 0$ and define $\eta := \delta/(1+\varepsilon_{\mathrm{mult}})$. By Theorem A.7 of Babbar et al. ([2025](https://arxiv.org/html/2606.00202#bib.bib5)), for this $\eta$ and depth $d$ there exist $\mathcal{D}$ and $n$ such that, with high probability over $S \sim \mathcal{D}^{n}$: (i) LicketySPLIT returns a depth-$d$ tree with error at most $\eta$, and (ii) pure greedy achieves error at least $\tfrac{1}{2}-\eta$.

For (1), since $\eta \leq \delta$, we have $\mathrm{acc}_{\textsc{greedy}} \leq \tfrac{1}{2}+\eta \leq \tfrac{1}{2}+\delta$.

For (2), let $P := \mathrm{PROXY}(S,d,\gamma)$. By assumption,

$$P \;\leq\; \mathrm{Obj}(\textsc{LicketySPLIT}(S,d,\gamma)).$$

Applying result (i) above, $\mathrm{Obj}(\textsc{LicketySPLIT}(S,d,\gamma)) \leq \eta$, so

$$P \;\leq\; \eta.$$

PRAXIS is run with root budget $\varepsilon_{\mathrm{abs}} := (1+\varepsilon_{\mathrm{mult}})P$, hence

$$\varepsilon_{\mathrm{abs}} \leq (1+\varepsilon_{\mathrm{mult}})\,\eta = \delta.$$

Because every materialized tree respects the budget at the root, every materialized tree $T$ satisfies $\mathrm{Obj}(T,S,\gamma) \leq \varepsilon_{\mathrm{abs}} \leq \delta$. Since $\gamma = 0$, $\mathrm{Obj}(T,S,\gamma)$ equals the misclassification error of $T$, so $\mathrm{acc}(T) \geq 1-\delta$. ∎
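As a concrete instance of this chain of inequalities, take $\delta = 0.05$ and $\varepsilon_{\mathrm{mult}} = 0.25$. These numbers are chosen purely for illustration and do not come from the paper's experiments.

```latex
\eta = \frac{\delta}{1+\varepsilon_{\mathrm{mult}}} = \frac{0.05}{1.25} = 0.04,
\qquad
\varepsilon_{\mathrm{abs}} = (1+\varepsilon_{\mathrm{mult}})\,P
  \;\leq\; 1.25 \cdot 0.04 = 0.05 = \delta .
```

Under these values every returned tree has error at most $0.05$, while pure greedy has accuracy at most $\tfrac{1}{2} + 0.04 = 0.54$.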

Theorem [A.5](https://arxiv.org/html/2606.00202#A1.Thmtheorem5) establishes that PRAXIS can return a set of trees that are all arbitrarily better than greedy. Additionally, using any proxy algorithm that performs at least as well as greedy, the trees returned by PRAXIS will never be more than $\varepsilon$ worse than greedy, and will always include one at least as good as greedy.

###### Corollary A.6 (Recovery guarantee for c-approximation proxy algorithms).

Assume Proxy is a $c$-approximation algorithm, i.e., for all $(D', d')$,

$$\textsc{Proxy}(D', d', \gamma) \;\leq\; c \min_{t \in \mathcal{T}_{d'}} \mathrm{Obj}(t, D', \gamma).$$

Then, for any budget $\varepsilon_{\mathrm{abs}}$, PRAXIS returns every tree $t_{r} \in \mathcal{T}_{d}$ such that

$$\mathrm{Obj}(t_{r}, D, \gamma) \;\leq\; \frac{\varepsilon_{\mathrm{abs}}}{c}.$$

###### Proof.

Fix any tree $t_{r} \in \mathcal{T}_{d}$ with

$$\mathrm{Obj}(t_{r}, D, \gamma) \;\leq\; \frac{\varepsilon_{\mathrm{abs}}}{c}.$$

Let $u$ be any internal node of $t_{r}$, let its depth be $d'$, and let

$$\{s_{1}, \dots, s_{d'}, u_{\mathrm{left}}, u_{\mathrm{right}}\}$$

be the associated frontier cut as in [Theorem 3.5](https://arxiv.org/html/2606.00202#S3.Thmtheorem5). Since these subtrees partition the leaves of $t_{r}$, the sum of their optimal objectives is at most the objective of $t_{r}$:

$$\sum_{i=1}^{d'} \min_{t \in \mathcal{T}_{d_{s_i}}} \mathrm{Obj}(t, D_{s_i}, \gamma) \;+\; \min_{t \in \mathcal{T}_{d_u - 1}} \mathrm{Obj}(t, D_{u_{\mathrm{left}}}, \gamma) \;+\; \min_{t \in \mathcal{T}_{d_u - 1}} \mathrm{Obj}(t, D_{u_{\mathrm{right}}}, \gamma) \;\leq\; \mathrm{Obj}(t_{r}, D, \gamma).$$

By the $c$-approximation property of Proxy,

$$\sum_{i=1}^{d'} \textsc{Proxy}(D_{s_i}, d_{s_i}, \gamma) \;+\; \textsc{Proxy}(D_{u_{\mathrm{left}}}, d_u - 1, \gamma) \;+\; \textsc{Proxy}(D_{u_{\mathrm{right}}}, d_u - 1, \gamma) \;\leq\; c\,\mathrm{Obj}(t_{r}, D, \gamma) \;\leq\; \varepsilon_{\mathrm{abs}}.$$

Thus the frontier-cut condition of [Theorem 3.5](https://arxiv.org/html/2606.00202#S3.Thmtheorem5) holds at every internal node of $t_{r}$, so PRAXIS returns $t_{r}$. ∎

###### Lemma A.7 (Convergence of iterative budget refinement).

Fix a split that has been evaluated and not pruned in Algorithm [1](https://arxiv.org/html/2606.00202#alg1). Let $P_{L}, P_{R}$ be the proxy objectives and $\mathrm{OPT}_{L}, \mathrm{OPT}_{R}$ be the optimal subtree objectives for the left and right subproblems (under the same depth constraint).

Then the iterative budget refinement procedure in Algorithm [3](https://arxiv.org/html/2606.00202#alg3) performs at most

$$2\,\min\big(P_{L} - \mathrm{OPT}_{L},\; P_{R} - \mathrm{OPT}_{R}\big) + 2$$

recursive calls to Algorithm [1](https://arxiv.org/html/2606.00202#alg1).

###### Proof\.

The Solve_Siblings procedure alternates between solving the left and right subproblems under budgets derived from the current estimate of the opposite side. Initially, the left budget is set to $\varepsilon_{\mathrm{abs}} - P_{R}$. The right budget is then set to $\varepsilon_{\mathrm{abs}} - \mathrm{min}_{L}$, where $\mathrm{min}_{L} \leq P_{L}$. After this point, we make at most one PRAXIS call for each improvement to these budgets.

Because objectives are integral, each successful refinement strictly decreases the current best objective on the active side by at least $1$. Moreover, since the proxy objective corresponds to a realizable tree, no side can improve beyond its optimal objective, so the budgets can never grow past

$$\varepsilon_{L} \leq \varepsilon_{\mathrm{abs}} - \mathrm{OPT}_{R}, \qquad \varepsilon_{R} \leq \varepsilon_{\mathrm{abs}} - \mathrm{OPT}_{L}.$$

Therefore, the left side can be refined at most $P_{L} - \mathrm{OPT}_{L}$ times (and the 1 initial call), and the right side at most $P_{R} - \mathrm{OPT}_{R}$ times (and the 1 initial call).

To obtain the tighter bound, observe that once one side ceases to improve, at most one additional recursive call is made on the other side. Suppose the left side converges first. In this case, we incur one initial call, at most

$$2\,\min\big(P_{L} - \mathrm{OPT}_{L},\; P_{R} - \mathrm{OPT}_{R}\big)$$

refinement calls as the two sides alternate, and a final call to the right side with the optimal budget decremented. An analogous argument applies if the right side converges first.

Summing these contributions (the initial call, the alternating refinement calls, and the final recursive call) yields the stated bound. ∎

###### Theorem A.8 (Slack needed for perfect approximation if the proxy has additive optimality gaps).

Let $t$ be any tree. Let $D$ be a dataset, $d$ be a depth limit, and $\gamma$ be a per-leaf penalty. For each internal node $u$ of $t$ (denoted $u \in \mathrm{Int}(t)$), let $C(u)$ denote its frontier cut. Let $\{\Delta_{v}\}$ be nonnegative errors such that for every subproblem $v$,

$$\textsc{PROXY}(D_{v}, d_{v}, \gamma) \;\leq\; \mathrm{OPT}(v) + \Delta_{v},$$

and define $\eta$ such that $t$ is $\eta$-suboptimal at the root:

$$\mathrm{Obj}(t, D, \gamma) \;\leq\; \mathrm{OPT}(\mathrm{root}) + \eta.$$

If the budget is set with

$$\varepsilon_{\mathrm{abs}} \geq \mathrm{OPT}(\mathrm{root}) + \eta \;+\; \max_{u \in \mathrm{Int}(t)} \sum_{v \in C(u)} \Delta_{v},$$

then $t$ is fully represented in the AND/OR graph.

###### Proof.

We show that for every internal node $u$ of $t$, the frontier cut is within budget. First,

$$
\begin{aligned}
\sum_{v \in C(u)} \textsc{PROXY}(D_{v}, d_{v}, \gamma) &\leq \sum_{v \in C(u)} \mathrm{OPT}(v) \;+\; \sum_{v \in C(u)} \Delta_{v} && \text{(34)}\\
&\leq \sum_{v \in C(u)} \mathrm{OPT}(v) \;+\; \max_{u \in \mathrm{Int}(t)} \sum_{v \in C(u)} \Delta_{v}. && \text{(35)}
\end{aligned}
$$

For the tree $t$, the frontier cut $C(u)$ partitions the remaining work below the already fixed prefix, so the sum of optimal subtree objectives below the cut is at most the objective of $t$. Hence,

$$\sum_{v \in C(u)} \mathrm{OPT}(v) \;\leq\; \mathrm{Obj}(t, D, \gamma) \;\leq\; \mathrm{OPT}(\mathrm{root}) + \eta. \tag{36}$$

Combining ([35](https://arxiv.org/html/2606.00202#A1.E35)) and ([36](https://arxiv.org/html/2606.00202#A1.E36)), we obtain

$$
\begin{aligned}
\sum_{v \in C(u)} \textsc{PROXY}(D_{v}, d_{v}, \gamma) &\leq \mathrm{OPT}(\mathrm{root}) + \eta \;+\; \max_{u \in \mathrm{Int}(t)} \sum_{v \in C(u)} \Delta_{v} && \text{(37)}\\
&\leq \varepsilon_{\mathrm{abs}}. && \text{(38)}
\end{aligned}
$$

Hence the frontier-cut condition of [Theorem 3.5](https://arxiv.org/html/2606.00202#S3.Thmtheorem5) holds at every internal node of $t$, so $t$ is fully represented. ∎
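The budget in Theorem A.8 is straightforward to compute once the per-subproblem gaps are known. Below is a small sketch under stated assumptions: `required_budget` and the dictionary layout are hypothetical, and the gaps $\Delta_v$ are supplied by hand (in practice they are generally unknown, since they depend on the optimal objectives).

```python
def required_budget(opt_root, eta, frontier_gaps):
    """Smallest budget satisfying the condition of Theorem A.8.

    frontier_gaps maps each internal node u of the tree to the list of
    additive proxy gaps Delta_v for the subproblems v in its frontier cut C(u).
    A tree with no internal nodes (a single leaf) contributes no slack.
    """
    worst_cut_gap = max((sum(gaps) for gaps in frontier_gaps.values()), default=0)
    return opt_root + eta + worst_cut_gap


# Illustrative values only: two internal nodes with total gaps 1 and 3.
gaps = {"root": [1, 0], "left_child": [2, 1, 0]}
budget = required_budget(opt_root=12, eta=2, frontier_gaps=gaps)
print(budget)
```

Here the worst frontier cut has total gap 3, so the sketch reports a budget of 12 + 2 + 3 = 17.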

###### Proposition A.9 (A greedy tree algorithm is a proxy algorithm).

Let $\textsc{Greedy}(D,d,\gamma)$ be the output of the greedy algorithm in Algorithm [15](https://arxiv.org/html/2606.00202#alg15) or [17](https://arxiv.org/html/2606.00202#alg17) that returns the objective of a tree $t \in \mathcal{T}_{d}$. Define

$$\mathrm{PROXY}_{0}(D,d,\gamma) \;:=\; \textsc{Greedy}(D,d,\gamma).$$

Then $\mathrm{PROXY}_{0}$ is a proxy algorithm (Definition [3.1](https://arxiv.org/html/2606.00202#S3.Thmtheorem1)).

###### Proof.

By recursive construction, when Greedy builds $t$, the subtree $t_{u}$ at any node $u$ is exactly the tree returned by calling Greedy on $(D_{u}, d_{u}, \gamma)$. Hence $\mathrm{PROXY}_{0}(D_{u}, d_{u}, \gamma) = \mathrm{Obj}(t_{u}, D_{u}, \gamma)$, which implies the stated inequality. ∎

###### Proposition A.10 (LicketySPLIT is a proxy algorithm).

Let $\textsc{LicketySPLIT}(D,d,\gamma)$ be Algorithm [16](https://arxiv.org/html/2606.00202#alg16) (or Algorithm [18](https://arxiv.org/html/2606.00202#alg18) with $\ell = 1$) that returns the objective of a tree $t \in \mathcal{T}_{d}$, and define

$$\mathrm{PROXY}_{1}(D,d,\gamma) \;:=\; \textsc{LicketySPLIT}(D,d,\gamma).$$

Then $\mathrm{PROXY}_{1}$ is a proxy algorithm.

###### Proof.

Algorithm [16](https://arxiv.org/html/2606.00202#alg16) or [18](https://arxiv.org/html/2606.00202#alg18) selects a split at the root (via greedy completions) and then *recurses by calling itself* on each child subproblem to build the left and right subtrees, finally returning the objective of that recursively built tree. Notably, these are identical recursive calls (except for updating the depth budget to pass down the constraint). Therefore, for any node $u$ in the returned tree, rerunning LicketySPLIT on $(D_{u}, d_{u}, \gamma)$ reproduces the same subtree $f_{u}$, so $\mathrm{PROXY}_{1}(D_{u}, d_{u}, \gamma) = \mathrm{Obj}(f_{u}, D_{u}, \gamma)$, implying the refinement inequality. ∎

Likewise, this holds for our generalization of LicketySPLIT, detailed in Algorithm [18](https://arxiv.org/html/2606.00202#alg18) of [subsection B.5](https://arxiv.org/html/2606.00202#A2.SS5). Because the algorithm recurses with the same $\ell$ (the only added parameter), just one depth lower, the refinement inequality holds with equality for this entire family of decision tree algorithms.

###### Lemma A.11 (LicketySPLIT is a memory-efficient proxy algorithm).

Given a dataset of size $n$ with $k$ binary features, the memory cost of the LicketySPLIT (Babbar et al., [2025](https://arxiv.org/html/2606.00202#bib.bib5)) algorithm can be limited to $\mathcal{O}(nk)$.

###### Proof.

We can show this by induction.

Inductive hypothesis: Let $c$ be some fixed constant. Given that LicketySPLIT called with remaining depth $d-1$ and any dataset of dimension $n_{1} \times k$, with $n_{1} \leq n$, takes memory no greater than $c(n_{1}k+1)$, we want to show that LicketySPLIT with depth $d$ and any dataset of dimension $n_{2} \times k$, with $n_{2} \leq n$, takes memory no greater than $c(n_{2}k+1)$.

Pick $c$ such that the original dataset size is $\leq cnk$ (and all subsets of size $n_{i}$ are similarly of size $\leq cn_{i}k$) and the storage required for a few constants is $\leq c$.

Base case: When the remaining depth budget is 0, LicketySPLIT requires no additional memory beyond its input dataset (size $\leq cn_{2}k$) and a constant (size $\leq c$); all that is required is to compute the leaf objective, corresponding to the minimum of the number of positive vs. negative entries in the label $y$. So the memory cost is no greater than $c(n_{2}k+1) \leq c(nk+1)$.

Inductive step: When depth is $> 0$, LicketySPLIT must call a greedy subroutine for the left and right children of every possible binary feature split. For a given potential split, the required memory is no more than $cn_{2}k + c$: we need only provide the left and right training data subsets to the greedy algorithms, and store the sum of the resulting objectives. By going through one split at a time and tracking the best objective so far and the corresponding feature, LicketySPLIT can do this without persisting more than constant memory. Once the optimal split is known, LicketySPLIT then constructs the left and right subproblems corresponding to that split (still with total size matching the original dataset size, which is $\leq cn_{2}k$). LicketySPLIT then must run LicketySPLIT with one fewer depth for the left and right subproblems of the selected split. Note that each of the two subproblems has a number of samples no greater than $n_{2}-1$, because the optimal split must place at least one sample in each subproblem. So, by the inductive hypothesis, each subproblem requires $\leq c((n_{2}-1)k+1) \leq c(n_{2}k)$ memory to be solved individually. After one subproblem is solved, the memory used for it can be freed except for a single constant, and we can solve the other subproblem with $\leq c(n_{2}k)$ memory. In total, we use $\leq c(n_{2}k+1)$ memory, as required. (Note that once these two LicketySPLIT calls are ready to be made, no other information needs to be persisted in memory; LicketySPLIT will just return the sum of these two calls. So the total amount of information in memory remains bounded by $cnk + c$.)

Therefore, by induction, the total memory use for any LicketySPLIT call is bounded by $cnk + c$, and thus is in $O(nk)$. ∎
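The evaluation order used in this proof (score one candidate split at a time, keep only the best sum and its feature, then recurse on the chosen split) can be sketched as follows. This is an illustrative reading of the description above and not the reference LicketySPLIT code: all function names are hypothetical, the greedy subroutine is an assumed information-gain stand-in, and returning the smaller of the leaf objective and the split objective is likewise an assumption. The sketch stores the dataset once and passes index lists, so it illustrates the access pattern without enforcing the exact constant in the lemma.

```python
from math import log2


def leaf_objective(y, idx, gamma):
    """Per-leaf penalty plus the minority-class count among the points in idx."""
    positives = sum(y[i] for i in idx)
    return gamma + min(positives, len(idx) - positives)


def split(X, idx, j):
    """Partition the index list idx by binary feature j."""
    left = [i for i in idx if X[i][j] == 0]
    right = [i for i in idx if X[i][j] == 1]
    return left, right


def entropy_mass(y, idx):
    """len(idx) times the binary label entropy of the points in idx."""
    n = len(idx)
    p = sum(y[i] for i in idx) / n
    if p == 0.0 or p == 1.0:
        return 0.0
    return -n * (p * log2(p) + (1 - p) * log2(1 - p))


def greedy_objective(X, y, idx, depth, gamma):
    """Objective of an information-gain greedy tree (assumed stand-in)."""
    leaf = leaf_objective(y, idx, gamma)
    if depth == 0:
        return leaf
    best = None  # (weighted child entropy, feature)
    for j in range(len(X[0])):
        left, right = split(X, idx, j)
        if not left or not right:
            continue  # skip degenerate splits
        score = entropy_mass(y, left) + entropy_mass(y, right)
        if best is None or score < best[0]:
            best = (score, j)
    if best is None:
        return leaf
    left, right = split(X, idx, best[1])
    total = greedy_objective(X, y, left, depth - 1, gamma)
    total += greedy_objective(X, y, right, depth - 1, gamma)
    return min(leaf, total)


def lickety_split_objective(X, y, idx, depth, gamma):
    """Pick the split with the best greedy completions, then recurse on it."""
    leaf = leaf_objective(y, idx, gamma)
    if depth == 0:
        return leaf
    best_sum, best_feature = float("inf"), None
    for j in range(len(X[0])):
        # Only one candidate partition is alive at a time; between iterations
        # we persist just the best sum and the corresponding feature.
        left, right = split(X, idx, j)
        if not left or not right:
            continue
        s = (greedy_objective(X, y, left, depth - 1, gamma)
             + greedy_objective(X, y, right, depth - 1, gamma))
        if s < best_sum:
            best_sum, best_feature = s, j
    if best_feature is None:
        return leaf
    left, right = split(X, idx, best_feature)
    # Solve one side, keep only its objective, then solve the other side.
    left_obj = lickety_split_objective(X, y, left, depth - 1, gamma)
    right_obj = lickety_split_objective(X, y, right, depth - 1, gamma)
    return min(leaf, left_obj + right_obj)


# Toy XOR data: no single split helps, but a depth-2 tree is perfect.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]
print(lickety_split_objective(X, y, list(range(4)), depth=2, gamma=0))
```

On the toy XOR data the sketch returns an objective of 0 at depth 2 and 2 at depth 0 or 1, since one split cannot separate the labels.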

### A\.3Rashomon Set of Rule Lists

###### Theorem A.12 (Modified PRAXIS returns a superset of the Rashomon set of rule lists).

Define a simple rule list as a tree in which each split has at least one leaf child; that is, every non-leaf tree $t$ satisfies $\min(\mathrm{depth}(t_{\mathrm{left}}), \mathrm{depth}(t_{\mathrm{right}})) = 0$. Suppose the proxy algorithm in Algorithm [1](https://arxiv.org/html/2606.00202#alg1) is any algorithm whose performance is guaranteed to at least match the objective of the majority leaf prediction, and the pruning from Algorithm [1](https://arxiv.org/html/2606.00202#alg1) is adjusted to take a min instead of a sum, as in Algorithm [5](https://arxiv.org/html/2606.00202#alg5). Then the resulting Rashomon set includes all rule lists within the depth budget whose objective is at most $\varepsilon_{\mathrm{abs}}$.

###### Proof.

First note that a few other adjustments to the algorithm are needed to keep its behaviour fully specified under the new pruning condition (it is now possible to explore some subproblems that do not include any trees within the budget). These changes are detailed in Algorithms [5](https://arxiv.org/html/2606.00202#alg5) and [6](https://arxiv.org/html/2606.00202#alg6). Intuitively, these changes slightly adjust the conditions of Theorem [3.5](https://arxiv.org/html/2606.00202#S3.Thmtheorem5) so that rule lists are not pruned.

Now, we must show that a valid rule list within the depth budget $d$ is never pruned. We will show this via induction.

Consider any rule list $t$ on a dataset $D$ with

$$\mathrm{Obj}(t, D, \gamma) \leq \varepsilon_{\mathrm{abs}}. \tag{39}$$

##### Base Case \(Leaf\)

If $t$ is a leaf with $\mathrm{Obj}(t, D, \gamma) \leq \varepsilon_{\mathrm{abs}}$, then note that adding leaves to the AND/OR graph (if they are within budget) happens before (and is unrelated to) any pruning of splits. Thus, $t$ is trivially recovered.

##### Inductive Step \(Non\-Leaf\)

Suppose $t$ is not a leaf. Let the root split of $t$ induce subproblem datasets $(D_{t_{\mathrm{left}}}, D_{t_{\mathrm{right}}})$ and subtrees $(t_{\mathrm{left}}, t_{\mathrm{right}})$. Since $t$ is a rule list, at least one child is a leaf. Without loss of generality, assume $t_{\mathrm{right}}$ is a leaf and $t_{\mathrm{left}}$ is a rule list (that could be a leaf as well). There is no loss of generality because the modified pruning condition ($\min(P_{L}, P_{R})$) is symmetric. Moreover, Algorithm [6](https://arxiv.org/html/2606.00202#alg6) never subtracts more than a single leaf objective from the opposite branch, regardless of which side contains the leaf.

Because $t$ is feasible,

$$\mathrm{Obj}(t, D, \gamma) = \mathrm{Obj}(t_{\mathrm{left}}, D_{t_{\mathrm{left}}}, \gamma) + \mathrm{Obj}(t_{\mathrm{right}}, D_{t_{\mathrm{right}}}, \gamma) \leq \varepsilon_{\mathrm{abs}}. \tag{40}$$

Since $t_{\mathrm{right}}$ is a leaf, its objective is at least the optimal leaf objective on $D_{t_{\mathrm{right}}}$, which (by assumption) is at least the proxy objective. Defining $P_{R} = \textsc{Proxy}(D_{t_{\mathrm{right}}}, d-1, \gamma)$ and $P_{L} = \textsc{Proxy}(D_{t_{\mathrm{left}}}, d-1, \gamma)$,

$$P_{R} \leq \mathrm{MajorityLeafObj}(D_{t_{\mathrm{right}}}) \leq \mathrm{Obj}(t_{\mathrm{right}}, D_{t_{\mathrm{right}}}, \gamma). \tag{41}$$

Combining ([40](https://arxiv.org/html/2606.00202#A1.E40)) and ([41](https://arxiv.org/html/2606.00202#A1.E41)) gives

$$
\begin{aligned}
\mathrm{Obj}(t_{\mathrm{left}}, D_{t_{\mathrm{left}}}, \gamma) &\leq \varepsilon_{\mathrm{abs}} - \mathrm{Obj}(t_{\mathrm{right}}, D_{t_{\mathrm{right}}}, \gamma) && \text{(42)}\\
&\leq \varepsilon_{\mathrm{abs}} - P_{R}. && \text{(43)}
\end{aligned}
$$

Now, consider the new pruning condition for a split. Because $t_{\mathrm{right}}$ is a leaf, and because $t_{\mathrm{left}}$ contains at least one leaf and therefore has objective at least $\gamma$, we know the following:

$$
\begin{aligned}
\min(P_{L}, P_{R}) &\leq P_{R} && \text{(44)}\\
&\leq \mathrm{Obj}(t_{\mathrm{right}}, D_{t_{\mathrm{right}}}, \gamma) && \text{(45)}\\
&\leq \mathrm{Obj}(t, D_{t}, \gamma) - \gamma && \text{(46)}\\
&\leq \varepsilon_{\mathrm{abs}} - \gamma. && \text{(47)}
\end{aligned}
$$

Thus the modified pruning rule $\min(P_{L}, P_{R}) > \varepsilon_{\mathrm{abs}} - \gamma$ does not trigger, and the root split of $t$ is explored.

By ([43](https://arxiv.org/html/2606.00202#A1.E43)), the recursive call on dataset $D_{t_{\mathrm{left}}}$ with budget $(\varepsilon_{\mathrm{abs}} - P_{R})$ admits $t_{\mathrm{left}}$ as a feasible solution. By the inductive hypothesis, PRAXIS fully recovers $t_{\mathrm{left}}$. Since $t_{\mathrm{right}}$ is a feasible leaf, the modified sibling solver marks a valid subtree as existing and the split is retained. Hence $t$ is represented in the search graph.

Algorithm 5 Modified_PRAXIS$(D, d, \gamma, \varepsilon_{\mathrm{abs}})$, for Theorem [A.12](https://arxiv.org/html/2606.00202#A1.Thmtheorem12).

Require: Subproblem dataset $D$, remaining depth $d$, per-leaf penalty $\gamma$, budget $\varepsilon_{\mathrm{abs}}$

1: valid_tree_exists $\leftarrow$ False

2: Let $G \leftarrow \textsc{OrNode}(\varepsilon_{\mathrm{abs}})$ {Initialize subgraph for subtrees found within budget $\varepsilon_{\mathrm{abs}}$ (see Appendix [B.2](https://arxiv.org/html/2606.00202#A2.SS2))}

3: for $b \in \{0, 1\}$ do

4: {For each possible leaf prediction $b$, set $C_{b}$ to the objective if all points were in a single leaf: $\gamma$ + misclassification error.}

5: $C_{b} \leftarrow \gamma + \big|\{(x_{i}, y_{i}) \in D : y_{i} \neq b\}\big|$

6: if $C_{b} \leq \varepsilon_{\mathrm{abs}}$ then

7: valid_tree_exists $\leftarrow$ True

8: AddLeaf$(G, b, C_{b})$ {See Appendix [B.2](https://arxiv.org/html/2606.00202#A2.SS2)}

9: end if

10: end for

11: if $d = 0$ or $\varepsilon_{\mathrm{abs}} < 2\gamma$ then

12: return $G$ {No depth for splits or any split would exceed the budget}

13: end if

14: for each feature $j$ do

15: $(D_{L}, D_{R}) \leftarrow \textsc{Partition}(D, j)$

16: if $D_{L} = \emptyset$ or $D_{R} = \emptyset$ then

17: continue {Skip degenerate splits}

18: end if

19: $P_{L} \leftarrow \textsc{Proxy}(D_{L}, d-1, \gamma)$ {Proxy cost on left}

20: $P_{R} \leftarrow \textsc{Proxy}(D_{R}, d-1, \gamma)$ {Proxy cost on right}

21: if $\min(P_{L}, P_{R}) > \varepsilon_{\mathrm{abs}} - \gamma$ then

22: continue {Prune split if proxy completions exceed the budget}

23: end if

24: valid_subtree_exists, $G_{L}, G_{R} \leftarrow \textsc{Modified\_Solve\_Siblings}(D_{L}, D_{R}, \gamma, d-1, \varepsilon_{\mathrm{abs}}, P_{L}, P_{R})$ {Find subgraphs for the left and right subproblems of this split; described in Algorithm [6](https://arxiv.org/html/2606.00202#alg6)}

25: if valid_subtree_exists then

26: valid_tree_exists $\leftarrow$ True

27: $\textsc{AddSplit}(G, j, G_{L}, G_{R})$ {Add $G_{L}, G_{R}$ as a split for the current subgraph $G$; see Appendix [B.2](https://arxiv.org/html/2606.00202#A2.SS2)}

28: end if

29: end for

30: return $G$ {Rashomon graph for the subproblem}

Usage Note: If provided with only $\varepsilon_{\mathrm{mult}}$ and not $\varepsilon_{\mathrm{abs}}$, set the Rashomon budget relative to the proxy:

$$\varepsilon_{\mathrm{abs}} \leftarrow \bigl(1+\varepsilon_{\mathrm{mult}}\bigr) \cdot \textsc{Proxy}(D, d, \gamma)$$

Algorithm 6 Modified_Solve_Siblings$(D_{L}, D_{R}, \gamma, d, \varepsilon_{\mathrm{abs}}, P_{L}, P_{R})$, modified for Theorem [A.12](https://arxiv.org/html/2606.00202#A1.Thmtheorem12).

Require: Left/right datasets $D_{L}, D_{R}$, remaining depth $d$, regularization $\gamma$, parent budget $\varepsilon_{\mathrm{abs}}$, proxy costs $P_{L}, P_{R}$

1: valid_tree_exists $\leftarrow$ False

2: $\varepsilon_{L} \leftarrow -\infty$ {largest budget used for solving $D_{L}$ with PRAXIS; currently we have not run on $D_{L}$ at all.}

3: $\varepsilon_{R} \leftarrow -\infty$ {largest budget used for solving $D_{R}$ with PRAXIS; currently we have not run on $D_{R}$ at all.}

4: $\varepsilon_{L}^{\mathrm{(new)}} \leftarrow \varepsilon_{\mathrm{abs}} - P_{R}$ {new budget to use for $D_{L}$ with PRAXIS}

5: $\varepsilon_{R}^{\mathrm{(new)}} \leftarrow \varepsilon_{\mathrm{abs}} - P_{L}$ {new budget to use for $D_{R}$ with PRAXIS}

6: while $\varepsilon_{L}^{\mathrm{(new)}} > \varepsilon_{L}$ do

7: $\varepsilon_{L} \leftarrow \varepsilon_{L}^{\mathrm{(new)}}$

8: if $\varepsilon_{L} > 0$ then

9: valid_subtree_exists, $G_{L} \leftarrow \textsc{Modified\_PRAXIS}(D_{L}, d, \gamma, \varepsilon_{L})$

10: if valid_subtree_exists then

11: valid_tree_exists $\leftarrow$ True

12: $\varepsilon_{R}^{\mathrm{(new)}} \leftarrow \varepsilon_{\mathrm{abs}} - G_{L}.\mathit{min\_objective}$

13: end if

14: end if

15: if $\varepsilon_{R}^{\mathrm{(new)}} > \varepsilon_{R}$ then

16: $\varepsilon_{R} \leftarrow \varepsilon_{R}^{\mathrm{(new)}}$

17: if $\varepsilon_{R} > 0$ then

18: valid_subtree_exists, $G_{R} \leftarrow \textsc{Modified\_PRAXIS}(D_{R}, d, \gamma, \varepsilon_{R})$

19: if valid_subtree_exists then

20: valid_tree_exists $\leftarrow$ True

21: $\varepsilon_{L}^{\mathrm{(new)}} \leftarrow \varepsilon_{\mathrm{abs}} - G_{R}.\mathit{min\_objective}$

22: end if

23: end if

24: end if

25: end while

26: return valid_tree_exists, $G_{L}, G_{R}$

∎

Although we design PRAXIS with the intention of approximating the full Rashomon set of sparse decision trees, we note that one could modify this proxy approach to guarantee that the exact Rashomon set of rule lists is contained in the solution. Then, any valid cross product of these one-sided decision tree branches would also be a valid solution, along with others.

In practice, the decision tree algorithms perform much better than a single leaf, so we are able to deploy more aggressive pruning to approximate the full Rashomon set faster and better than this approach (see [subsection D.8](https://arxiv.org/html/2606.00202#A4.SS8)).

### A\.4Refining Proxy Guesses

Beyond being guaranteed to recover the proxy tree at each subproblem in the AND/OR graph, we are also guaranteed to find trees that would be generated by stronger decision tree algorithms\. In Algorithm[7](https://arxiv.org/html/2606.00202#alg7), we give a single decision tree algorithm that recursively selects splits according to proxy\-completed objectives; this is precisely the pilot algorithm induced by the proxy\. If the proxy is a greedy tree algorithm, then this wrapper corresponds to LicketySPLIT \(withℓ=1\\ell=1in our generalization\)\. If the proxy is our default of LicketySPLIT \(ℓ=1\\ell=1\), then this wrapper corresponds to LicketySPLIT \(ℓ=2\\ell=2\)\. Having this guarantee speeds up the convergence of iterative budget refinement and improves the recovery conditions of PRAXIS\.

Algorithm 7 ProxyWrapper\(D′, d, γ\)

Input: subproblem dataset D′ ⊆ D, remaining depth d, leaf penalty γ

1: n′ ← \|D′\|, p ← \# positives in D′

2: leaf\_loss ← γ \+ min\(p, n′ − p\)

3: if d = 0 or leaf\_loss ≤ 2γ then

4: return leaf\_loss \{early escape: either no depth, or split can’t beat leaf\}

5: end if

6: best\_sum ← \+∞, best\_feature ← ⊥

7: for each feature j do

8: Partition D′ into \(D′\_L, D′\_R\) by split x\_j; continue if D′\_L = ∅ or D′\_R = ∅

9: s\_j ← PROXY\(D′\_L, d − 1, γ\) \+ PROXY\(D′\_R, d − 1, γ\)

10: if s\_j < best\_sum then

11: best\_sum ← s\_j, best\_feature ← j

12: end if

13: end for

14: ans ← leaf\_loss

15: if best\_feature ≠ ⊥ then

16: Partition D′ by best\_feature into \(D′\_L, D′\_R\)

17: L ← ProxyWrapper\(D′\_L, d − 1, γ\) \{recurse only on the best split\}

18: R ← ProxyWrapper\(D′\_R, d − 1, γ\)

19: ans ← min\(ans, L \+ R\) \{take better of leaf and recursive completion\}

20: end if

21: return ans
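The wrapper can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: rows are assumed to be `(binary_feature_tuple, binary_label)` pairs, the names `leaf_loss`, `leaf_proxy`, and `proxy_wrapper` are hypothetical, and `leaf_proxy` is a deliberately weak stand-in proxy (it always answers with the majority leaf) rather than LicketySPLIT.

```python
from typing import Callable, List, Sequence, Tuple

Row = Tuple[Sequence[int], int]            # (binary features, binary label)
Proxy = Callable[[List[Row], int, int], int]


def leaf_loss(data: List[Row], gamma: int) -> int:
    """Objective of the majority leaf: gamma + min(#positives, #negatives)."""
    p = sum(y for _, y in data)
    return gamma + min(p, len(data) - p)


def leaf_proxy(data: List[Row], depth: int, gamma: int) -> int:
    """Hypothetical stand-in proxy: always returns the majority-leaf objective."""
    return leaf_loss(data, gamma)


def proxy_wrapper(data: List[Row], depth: int, gamma: int, proxy: Proxy) -> int:
    """Pick the split with the best proxy-completed score, then recurse on it."""
    ll = leaf_loss(data, gamma)
    if depth == 0 or ll <= 2 * gamma:      # early escape, as in lines 3-5
        return ll
    best_sum, best_feature = None, None
    for j in range(len(data[0][0])):
        left = [r for r in data if r[0][j] == 0]
        right = [r for r in data if r[0][j] == 1]
        if not left or not right:          # skip splits that separate nothing
            continue
        s = proxy(left, depth - 1, gamma) + proxy(right, depth - 1, gamma)
        if best_sum is None or s < best_sum:
            best_sum, best_feature = s, j
    ans = ll
    if best_feature is not None:           # recurse only on the best split
        left = [r for r in data if r[0][best_feature] == 0]
        right = [r for r in data if r[0][best_feature] == 1]
        ans = min(ans,
                  proxy_wrapper(left, depth - 1, gamma, proxy)
                  + proxy_wrapper(right, depth - 1, gamma, proxy))
    return ans


# XOR labels, each point repeated three times; gamma = 1, depth budget 2.
xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)] * 3
p_value = leaf_proxy(xor, 2, 1)                    # majority leaf: 1 + 6
w_value = proxy_wrapper(xor, 2, 1, leaf_proxy)     # four pure leaves: 4 * 1
# Wrapping the wrapper itself (it is again usable as a proxy).
w2_value = proxy_wrapper(
    xor, 2, 1, lambda d, k, g: proxy_wrapper(d, k, g, leaf_proxy))
```

On this toy dataset the wrapped value (4) improves on the stand-in proxy value (7), and wrapping a second time does not make the objective worse.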

We first note that Algorithm[7](https://arxiv.org/html/2606.00202#alg7)is guaranteed to have an objective at least as good as the proxy algorithm that it wraps\. This is formalized by the following theorem\.

###### Theorem A\.13\(A pilot algorithm using the proxy is no worse than the proxy\)\.

Fix any subproblem\(D′,d\)\(D^\{\\prime\},d\)and leaf penaltyγ\\gamma\. Lettpx∈𝒯dt^\{\\mathrm\{px\}\}\\in\\mathcal\{T\}\_\{d\}be the tree whose objective corresponds to the proxy call:

PROXY​\(D′,d,γ\)=Obj​\(tpx,D′,γ\)\.\\textsc\{PROXY\}\(D^\{\\prime\},d,\\gamma\)\\;=\\;\\mathrm\{Obj\}\(t^\{\\mathrm\{px\}\},D^\{\\prime\},\\gamma\)\.LetProxyWrapper​\(D′,d,γ\)\\textsc\{ProxyWrapper\}\(D^\{\\prime\},d,\\gamma\)be as in Algorithm[7](https://arxiv.org/html/2606.00202#alg7)\.

Then

ProxyWrapper​\(D′,d,γ\)≤PROXY​\(D′,d,γ\)\\textsc\{ProxyWrapper\}\(D^\{\\prime\},d,\\gamma\)\\;\\leq\\;\\textsc\{PROXY\}\(D^\{\\prime\},d,\\gamma\)

###### Proof\.

Write

W​\(D′,d\):=ProxyWrapper​\(D′,d,γ\),P​\(D′,d\):=PROXY​\(D′,d,γ\)\.W\(D^\{\\prime\},d\)\\;:=\\;\\textsc\{ProxyWrapper\}\(D^\{\\prime\},d,\\gamma\),\\qquad P\(D^\{\\prime\},d\)\\;:=\\;\\textsc\{PROXY\}\(D^\{\\prime\},d,\\gamma\)\.We prove by induction onddthat for every datasetD′D^\{\\prime\},

W​\(D′,d\)≤P​\(D′,d\)\.W\(D^\{\\prime\},d\)\\;\\leq\\;P\(D^\{\\prime\},d\)\.
Base case \(d=0d=0\)\.ProxyWrapperreturns the objective of the majority leaf onD′D^\{\\prime\}\. SinceP​\(D′,0\)P\(D^\{\\prime\},0\)is the objective of*some*depth0tree \(a leaf\) by Definition[3\.1](https://arxiv.org/html/2606.00202#S3.Thmtheorem1), we haveW​\(D′,0\)≤P​\(D′,0\)W\(D^\{\\prime\},0\)\\leq P\(D^\{\\prime\},0\)\.

Inductive step\.Fixd≥1d\\geq 1and assume that for all datasetsAA,

W​\(A,d−1\)≤P​\(A,d−1\)\.W\(A,d\-1\)\\;\\leq\\;P\(A,d\-1\)\.Fix an arbitrary datasetD′D^\{\\prime\}, and lettpx∈𝒯dt^\{\\mathrm\{px\}\}\\in\\mathcal\{T\}\_\{d\}be a tree such that

P​\(D′,d\)=Obj​\(tpx,D′,γ\)\.P\(D^\{\\prime\},d\)\\;=\\;\\mathrm\{Obj\}\(t^\{\\mathrm\{px\}\},D^\{\\prime\},\\gamma\)\.
Iftpxt^\{\\mathrm\{px\}\}is a leaf, thenP​\(D′,d\)P\(D^\{\\prime\},d\)equals a leaf objective onD′D^\{\\prime\}, andProxyWrapperalways returns at most the majority leaf objective, soW​\(D′,d\)≤P​\(D′,d\)W\(D^\{\\prime\},d\)\\leq P\(D^\{\\prime\},d\)\.

Otherwise, letkkbe the root split oftpxt^\{\\mathrm\{px\}\}, inducing a partition\(DL′,DR′\)\(D^\{\\prime\}\_\{L\},D^\{\\prime\}\_\{R\}\)and subtreestLpx,tRpxt^\{\\mathrm\{px\}\}\_\{L\},t^\{\\mathrm\{px\}\}\_\{R\}\. By additivity of the objective,

P​\(D′,d\)=Obj​\(tpx,D′,γ\)=Obj​\(tLpx,DL′,γ\)\+Obj​\(tRpx,DR′,γ\)\.P\(D^\{\\prime\},d\)=\\mathrm\{Obj\}\(t^\{\\mathrm\{px\}\},D^\{\\prime\},\\gamma\)=\\mathrm\{Obj\}\(t^\{\\mathrm\{px\}\}\_\{L\},D^\{\\prime\}\_\{L\},\\gamma\)\+\\mathrm\{Obj\}\(t^\{\\mathrm\{px\}\}\_\{R\},D^\{\\prime\}\_\{R\},\\gamma\)\.By Definition[3\.1](https://arxiv.org/html/2606.00202#S3.Thmtheorem1),

P​\(DL′,d−1\)≤Obj​\(tLpx,DL′,γ\),P​\(DR′,d−1\)≤Obj​\(tRpx,DR′,γ\),P\(D^\{\\prime\}\_\{L\},d\-1\)\\leq\\mathrm\{Obj\}\(t^\{\\mathrm\{px\}\}\_\{L\},D^\{\\prime\}\_\{L\},\\gamma\),\\qquad P\(D^\{\\prime\}\_\{R\},d\-1\)\\leq\\mathrm\{Obj\}\(t^\{\\mathrm\{px\}\}\_\{R\},D^\{\\prime\}\_\{R\},\\gamma\),so

P​\(DL′,d−1\)\+P​\(DR′,d−1\)≤P​\(D′,d\)\.P\(D^\{\\prime\}\_\{L\},d\-1\)\+P\(D^\{\\prime\}\_\{R\},d\-1\)\\;\\leq\\;P\(D^\{\\prime\},d\)\.
Now consider the featurejjselected byProxyWrapperat\(D′,d\)\(D^\{\\prime\},d\)\(if all splits are invalid, thenProxyWrapperreturns the leaf and we are done asPROXYalso cannot split without increasing the objective beyond a majority leaf\)\. Let\(DL,DR\)\(D\_\{L\},D\_\{R\}\)be the partition induced byjj\. By the selection rule inProxyWrapper,

P​\(DL,d−1\)\+P​\(DR,d−1\)≤P​\(DL′,d−1\)\+P​\(DR′,d−1\)≤P​\(D′,d\)\.P\(D\_\{L\},d\-1\)\+P\(D\_\{R\},d\-1\)\\;\\leq\\;P\(D^\{\\prime\}\_\{L\},d\-1\)\+P\(D^\{\\prime\}\_\{R\},d\-1\)\\;\\leq\\;P\(D^\{\\prime\},d\)\.Moreover,ProxyWrappertakes the minimum of a majority leaf and a recursive completion\. Thus,

W​\(D′,d\)≤W​\(DL,d−1\)\+W​\(DR,d−1\)\.W\(D^\{\\prime\},d\)\\;\\leq\\;W\(D\_\{L\},d\-1\)\+W\(D\_\{R\},d\-1\)\.Applying the induction hypothesis toDLD\_\{L\}andDRD\_\{R\}gives

W​\(DL,d−1\)\+W​\(DR,d−1\)≤P​\(DL,d−1\)\+P​\(DR,d−1\)≤P​\(D′,d\),W\(D\_\{L\},d\-1\)\+W\(D\_\{R\},d\-1\)\\;\\leq\\;P\(D\_\{L\},d\-1\)\+P\(D\_\{R\},d\-1\)\\;\\leq\\;P\(D^\{\\prime\},d\),henceW​\(D′,d\)≤P​\(D′,d\)W\(D^\{\\prime\},d\)\\leq P\(D^\{\\prime\},d\)\.

∎

We note that wrapping a proxy algorithm in this way has additional clean properties: the wrapped procedure itself becomes a proxy algorithm whose refinement property holds with equality\.

###### Proposition A\.14\(A pilot algorithm using the proxy satisfies the refinement property with equality\)\.

Let

W​\(D,d,γ\):=ProxyWrapper​\(D,d,γ\)\.W\(D,d,\\gamma\):=\\textsc\{ProxyWrapper\}\(D,d,\\gamma\)\.ThenWWis a proxy algorithm\. Moreover, if the tree returned byProxyWrapper​\(D,d,γ\)\\textsc\{ProxyWrapper\}\(D,d,\\gamma\)splits into left and right subtreestL,tRt\_\{L\},t\_\{R\}on child subproblems\(DL,d−1\)\(D\_\{L\},d\-1\)and\(DR,d−1\)\(D\_\{R\},d\-1\), then

W​\(DL,d−1,γ\)=Obj​\(tL,DL,γ\),W​\(DR,d−1,γ\)=Obj​\(tR,DR,γ\)\.W\(D\_\{L\},d\-1,\\gamma\)=\\mathrm\{Obj\}\(t\_\{L\},D\_\{L\},\\gamma\),\\qquad W\(D\_\{R\},d\-1,\\gamma\)=\\mathrm\{Obj\}\(t\_\{R\},D\_\{R\},\\gamma\)\.That is,WWsatisfies the refinement property with equality\.

###### Proof\.

Consider the tree thatProxyWrapperimplicitly enumerates\. Now consider callingProxyWrapperon the left and right subproblems of the root node\. Note that these are exactly the recursive calls that were made during the implicit construction of the tree\.

Thus, the same left and right subtrees are reproduced whenProxyWrapperis applied to these subproblems\. In particular, the proxy values returned on the child subproblems equal the objectives of the corresponding subtrees of the original tree\. Hence, the refinement property holds with equality\. Notably, this conclusion does not require the original proxy algorithm to satisfy refinement with equality\.

∎

With these properties in hand, we now show that whenever PRAXIS explores a subproblem with a budget at least the proxy value \(such as in the default initialization to PRAXIS\), it necessarily recovers the correspondingProxyWrappertree at the root and at every subproblem in the AND/OR graph\.

###### Theorem A\.15\(A pilot algorithm using the proxy maps to a tree found by PRAXIS\)\.

Fix any subproblem\(D,d\)\(D,d\)and leaf penaltyγ\\gamma\.

Let the proxy algorithm return a treefpx∈𝒯df^\{\\mathrm\{px\}\}\\in\\mathcal\{T\}\_\{d\}with objective

P​\(D,d\):=PROXY​\(D,d,γ\)=Obj​\(fpx,D,γ\)\.P\(D,d\)\\;:=\\;\\textsc\{PROXY\}\(D,d,\\gamma\)\\;=\\;\\mathrm\{Obj\}\(f^\{\\mathrm\{px\}\},D,\\gamma\)\.
Letfpw​\(D,d\)f^\{\\mathrm\{pw\}\}\(D,d\)denote the tree implicitly enumerated by ProxyWrapper\(D,d,γ\)\(D,d,\\gamma\)\(Algorithm[7](https://arxiv.org/html/2606.00202#alg7)\)\.

Then:

1. 1\.the returned AND/OR graphGGcontains the ProxyWrapper treefpw​\(D,d\)f^\{\\mathrm\{pw\}\}\(D,d\);
2. 2\.In particular, G\.min​\_​objective≤Obj​\(fpw​\(D,d\),D,γ\)≤Obj​\(fpx,D,γ\)\.G\.\\mathrm\{min\\\_objective\}\\;\\leq\\;\\mathrm\{Obj\}\(f^\{\\mathrm\{pw\}\}\(D,d\),D,\\gamma\)\\;\\leq\\;\\mathrm\{Obj\}\(f^\{\\mathrm\{px\}\},D,\\gamma\)\.

###### Proof\.

Claim\.If PRAXIS is invoked on\(D,d\)\(D,d\)withεabs≥P​\(D,d\)\\varepsilon\_\{\\mathrm\{abs\}\}\\geq P\(D,d\), then the returned graph containsfpw​\(D,d\)f^\{\\mathrm\{pw\}\}\(D,d\)\.

Base case:d=0d=0\.

By construction,ProxyWrapperreturns a leaf \(in fact, the majority leaf\)\. Sinceεabs≥P​\(D,0\)\\varepsilon\_\{\\mathrm\{abs\}\}\\geq P\(D,0\)and PRAXIS inserts all feasible leaf actions whose objective is≤εabs\\leq\\varepsilon\_\{\\mathrm\{abs\}\}, the leaf corresponding tofpw​\(D,0\)f^\{\\mathrm\{pw\}\}\(D,0\)is included \(if it fits within budget\)\. The claim holds\.

Inductive step\.Assume the claim holds for depthd−1d\-1\. Fix\(D,d\)\(D,d\)\.

There are two cases\.

*Case 1: ProxyWrapper selects a leaf\.*

Thenfpw​\(D,d\)f^\{\\mathrm\{pw\}\}\(D,d\)is a leaf with objectiveleaf\_loss\. Sinceεabs≥P​\(D,d\)\\varepsilon\_\{\\mathrm\{abs\}\}\\geq P\(D,d\), PRAXIS includes that leaf option \(by the same reasoning as the base case\)\. Thusfpw​\(D,d\)f^\{\\mathrm\{pw\}\}\(D,d\)is contained inGG\.

*Case 2: ProxyWrapper selects a splitj⋆j^\{\\star\}\.*

Let\(DL​\(j⋆\),DR​\(j⋆\)\)\(D\_\{L\}\(j^\{\\star\}\),D\_\{R\}\(j^\{\\star\}\)\)be the partition ofDDinduced byj⋆j^\{\\star\}\.

By definition ofProxyWrapper,

j⋆∈arg⁡minj⁡\(P​\(DL​\(j\),d−1\)\+P​\(DR​\(j\),d−1\)\)\.j^\{\\star\}\\in\\arg\\min\_\{j\}\\big\(P\(D\_\{L\}\(j\),d\-1\)\+P\(D\_\{R\}\(j\),d\-1\)\\big\)\.
Letkkdenote the root split used by the proxy treefpxf^\{\\mathrm\{px\}\}\. By Definition[3\.1](https://arxiv.org/html/2606.00202#S3.Thmtheorem1), iffpxf^\{\\mathrm\{px\}\}splits onkkwith childrentLpx,tRpxt^\{\\mathrm\{px\}\}\_\{L\},t^\{\\mathrm\{px\}\}\_\{R\}, then

P​\(DL​\(k\),d−1\)≤Obj​\(tLpx,DL​\(k\),γ\),P​\(DR​\(k\),d−1\)≤Obj​\(tRpx,DR​\(k\),γ\)\.P\(D\_\{L\}\(k\),d\-1\)\\leq\\mathrm\{Obj\}\(t^\{\\mathrm\{px\}\}\_\{L\},D\_\{L\}\(k\),\\gamma\),\\qquad P\(D\_\{R\}\(k\),d\-1\)\\leq\\mathrm\{Obj\}\(t^\{\\mathrm\{px\}\}\_\{R\},D\_\{R\}\(k\),\\gamma\)\.Summing yields

P​\(DL​\(k\),d−1\)\+P​\(DR​\(k\),d−1\)≤Obj​\(fpx,D,γ\)=P​\(D,d\)\.P\(D\_\{L\}\(k\),d\-1\)\+P\(D\_\{R\}\(k\),d\-1\)\\leq\\mathrm\{Obj\}\(f^\{\\mathrm\{px\}\},D,\\gamma\)=P\(D,d\)\.
Sincej⋆j^\{\\star\}minimizes the proxy\-completed split score,

P​\(DL​\(j⋆\),d−1\)\+P​\(DR​\(j⋆\),d−1\)≤P​\(DL​\(k\),d−1\)\+P​\(DR​\(k\),d−1\)≤P​\(D,d\)≤εabs\.P\(D\_\{L\}\(j^\{\\star\}\),d\-1\)\+P\(D\_\{R\}\(j^\{\\star\}\),d\-1\)\\leq P\(D\_\{L\}\(k\),d\-1\)\+P\(D\_\{R\}\(k\),d\-1\)\\leq P\(D,d\)\\leq\\varepsilon\_\{\\mathrm\{abs\}\}\.
Therefore PRAXIS does not prune splitj⋆j^\{\\star\}, because its pruning condition is

P​\(DL,d−1\)\+P​\(DR,d−1\)\>εabs\.P\(D\_\{L\},d\-1\)\+P\(D\_\{R\},d\-1\)\>\\varepsilon\_\{\\mathrm\{abs\}\}\.
Budget passed to children\.

PRAXIS initializes the left budget by subtracting the proxy value on the right:

εL=εabs−P​\(DR,d−1\)\.\\varepsilon\_\{L\}=\\varepsilon\_\{\\mathrm\{abs\}\}\-P\(D\_\{R\},d\-1\)\.Using

P​\(DL,d−1\)\+P​\(DR,d−1\)≤εabs,P\(D\_\{L\},d\-1\)\+P\(D\_\{R\},d\-1\)\\leq\\varepsilon\_\{\\mathrm\{abs\}\},we obtain

εL≥P​\(DL,d−1\)\.\\varepsilon\_\{L\}\\geq P\(D\_\{L\},d\-1\)\.
Thus, PRAXIS is invoked on\(DL,d−1\)\(D\_\{L\},d\-1\)with a budget at least as large as the proxy value at the subproblem\. Therefore, by the inductive hypothesis, the graph returned for the left child containsfpw​\(DL,d−1\)f^\{\\mathrm\{pw\}\}\(D\_\{L\},d\-1\)\. Moreover,

GL\.min​\_​objective≤Obj​\(fpw​\(DL,d−1\),DL,γ\)≤P​\(DL,d−1\)\.G\_\{L\}\.\\mathrm\{min\\\_objective\}\\leq\\mathrm\{Obj\}\(f^\{\\mathrm\{pw\}\}\(D\_\{L\},d\-1\),D\_\{L\},\\gamma\)\\leq P\(D\_\{L\},d\-1\)\.
The right budget is then set to

εR=εabs−GL\.min​\_​objective≥εabs−P​\(DL,d−1\)≥P​\(DR,d−1\)\.\\varepsilon\_\{R\}=\\varepsilon\_\{\\mathrm\{abs\}\}\-G\_\{L\}\.\\mathrm\{min\\\_objective\}\\geq\\varepsilon\_\{\\mathrm\{abs\}\}\-P\(D\_\{L\},d\-1\)\\geq P\(D\_\{R\},d\-1\)\.
Thus the recursive call on\(DR,d−1\)\(D\_\{R\},d\-1\)also receives a budget that is at least its proxy value, and by induction, the graph containsfpw​\(DR,d−1\)f^\{\\mathrm\{pw\}\}\(D\_\{R\},d\-1\)\.

Since PRAXIS attaches these child graphs under splitj⋆j^\{\\star\}, the graph for\(D,d\)\(D,d\)containsfpw​\(D,d\)f^\{\\mathrm\{pw\}\}\(D,d\)\. We note that by[Theorem A\.13](https://arxiv.org/html/2606.00202#A1.Thmtheorem13),fpw​\(D,d\)f^\{\\mathrm\{pw\}\}\(D,d\)will always fit within the budget, since the proxy tree itself, which is no better, always fits within the budget of an explored subproblem \([Theorem A\.3](https://arxiv.org/html/2606.00202#A1.Thmtheorem3)\)\.

Importantly, we note that attaching these left and rightProxyWrappersubtrees \(corresponding to the proxy rollout after selecting splitjj\) yields theProxyWrappertree for\(D,d\)\(D,d\)\. This follows becauseProxyWrappergreedily selects the best split according to the proxy algorithm and does not revisit earlier decisions \(by[A\.14](https://arxiv.org/html/2606.00202#A1.Thmtheorem14), rerunningProxyWrapperon any subproblem appearing in aProxyWrappertree returns the same subtree\)\.

Iterative refinement\.

The sibling procedure only increases budgets supplied to recursive calls\. Thus, the above argument continues to hold throughout refinement\.

∎

Following very similar logic to[A\.4](https://arxiv.org/html/2606.00202#A1.Thmtheorem4), we show that this fact does not just hold for the root node, but the hypothesis is true for all nodes in the AND/OR graph\.

###### Corollary A\.16\(The tree corresponding to the pilot algorithm is recovered at every explored subproblem\)\.

Assume the root call to PRAXIS satisfies

εabs≥P​\(D,d\)\.\\varepsilon\_\{\\mathrm\{abs\}\}\\geq P\(D,d\)\.Then every OR node in the returned AND/OR graph is explored with budget

εabs′≥P​\(D′,d′\)\.\\varepsilon^\{\\prime\}\_\{\\mathrm\{abs\}\}\\geq P\(D^\{\\prime\},d^\{\\prime\}\)\.Consequently, by Theorem[A\.15](https://arxiv.org/html/2606.00202#A1.Thmtheorem15), the subgraph rooted at every OR node contains theProxyWrappertree for that subproblem, and the minimum objective stored at that node is at most the objective of thatProxyWrappertree, which is itself at mostP​\(D′,d′\)P\(D^\{\\prime\},d^\{\\prime\}\)\.

###### Proof\.

Let a split produce child subproblems with proxy values

PL:=P​\(DL,d−1\),PR:=P​\(DR,d−1\),P\_\{L\}:=P\(D\_\{L\},d\-1\),\\qquad P\_\{R\}:=P\(D\_\{R\},d\-1\),and suppose this split is not pruned, i\.e\.

PL\+PR≤εabs\.P\_\{L\}\+P\_\{R\}\\leq\\varepsilon\_\{\\mathrm\{abs\}\}\.Then the initial left budget used bySolveSiblingsis

εL:=εabs−PR\.\\varepsilon\_\{L\}:=\\varepsilon\_\{\\mathrm\{abs\}\}\-P\_\{R\}\.Therefore,

εL=εabs−PR≥PL\.\\varepsilon\_\{L\}=\\varepsilon\_\{\\mathrm\{abs\}\}\-P\_\{R\}\\geq P\_\{L\}\.Thus, the recursive call on the left subproblem is made with budget at least its proxy objective, which by Theorem[A\.15](https://arxiv.org/html/2606.00202#A1.Thmtheorem15)is sufficient to recover theProxyWrappertree on the left\.

Now letGLG\_\{L\}denote the returned left subgraph\. Since the leftProxyWrappertree is contained inGLG\_\{L\}, its minimum objective satisfies

GL\.min​\_​objective≤PL\.G\_\{L\}\.\\mathrm\{min\\\_objective\}\\leq P\_\{L\}\.The right budget is then set to

εR:=εabs−GL\.min​\_​objective\.\\varepsilon\_\{R\}:=\\varepsilon\_\{\\mathrm\{abs\}\}\-G\_\{L\}\.\\mathrm\{min\\\_objective\}\.UsingGL\.min​\_​objective≤PLG\_\{L\}\.\\mathrm\{min\\\_objective\}\\leq P\_\{L\}, we obtain

εR=εabs−GL\.min​\_​objective≥εabs−PL≥PR,\\varepsilon\_\{R\}=\\varepsilon\_\{\\mathrm\{abs\}\}\-G\_\{L\}\.\\mathrm\{min\\\_objective\}\\geq\\varepsilon\_\{\\mathrm\{abs\}\}\-P\_\{L\}\\geq P\_\{R\},sincePL\+PR≤εabsP\_\{L\}\+P\_\{R\}\\leq\\varepsilon\_\{\\mathrm\{abs\}\}\. Hence, the recursive call on the right subproblem is also made with budget at least its proxy objective, so theProxyWrappertree on the right is recovered as well\.

Finally,SolveSiblingsonly increases budgets during subsequent refinement steps \(by[A\.2](https://arxiv.org/html/2606.00202#A1.Thmtheorem2)\)\. By[A\.1](https://arxiv.org/html/2606.00202#A1.Thmtheorem1), no trees will ever be removed from the AND/OR graph during this budget refinement, so the hypothesis holds for the final attached subgraph\. Repeated application of this argument down the AND/OR graph establishes the corollary\. ∎
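The budget arithmetic in this proof can be checked on concrete numbers. The sketch below is only an illustration: `sibling_budgets` is a hypothetical helper, and the value passed as `left_min_objective` stands in for the minimum objective that the left recursive call would return.

```python
def sibling_budgets(eps_abs, proxy_left, proxy_right, left_min_objective):
    """Initial child budgets for an unpruned split, following the proof above.

    eps_left  = eps_abs - P_R
    eps_right = eps_abs - G_L.min_objective
    """
    if proxy_left + proxy_right > eps_abs:
        raise ValueError("split would be pruned: P_L + P_R > eps_abs")
    if left_min_objective > proxy_left:
        raise ValueError("expected G_L.min_objective <= P_L")
    eps_left = eps_abs - proxy_right
    eps_right = eps_abs - left_min_objective
    return eps_left, eps_right


# Illustrative numbers: eps_abs = 20, P_L = 8, P_R = 9, G_L.min_objective = 6.
eps_left, eps_right = sibling_budgets(20, 8, 9, 6)
# eps_left = 11 >= P_L = 8 and eps_right = 14 >= P_R = 9
```

Both child budgets are at least the corresponding proxy values, which is the condition needed to apply Theorem A.15 to each child.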

With[Theorem A\.13](https://arxiv.org/html/2606.00202#A1.Thmtheorem13)and[Theorem A\.15](https://arxiv.org/html/2606.00202#A1.Thmtheorem15)in place, we note how conditioning on the ProxyWrapper objective in addition to the Proxy objective changes earlier theoretical results\. For[Theorem 3\.5](https://arxiv.org/html/2606.00202#S3.Thmtheorem5), because the left and right AND/OR graph solutions are guaranteed to contain the ProxyWrapper tree, when we re\-solve the other side in Algorithm[3](https://arxiv.org/html/2606.00202#alg3)the minimum objective we subtract is at most the ProxyWrapper objective on the other side \(as opposed to the Proxy objective\)\. Thus, while the true recovery condition is \(for each internal nodeuuof the desired tree\)

∑i=1tMinObj​\(si\)\+PROXY​\(uleft\)\+PROXY​\(uright\)≤εabs,\\sum\_\{i=1\}^\{t\}\\mathrm\{MinObj\}\(s\_\{i\}\)\\;\+\\;\\textsc\{PROXY\}\\\!\\big\(u\_\{\\textrm\{left\}\}\\big\)\\;\+\\;\\textsc\{PROXY\}\\\!\\big\(u\_\{\\textrm\{right\}\}\\big\)\\;\\leq\\;\\varepsilon\_\{\\textrm\{abs\}\},
we can give a tighter sufficient condition than was given in Theorem[3\.5](https://arxiv.org/html/2606.00202#S3.Thmtheorem5):

∑i=1tProxyWrapper​\(si\)\+PROXY​\(uleft\)\+PROXY​\(uright\)≤εabs,\\sum\_\{i=1\}^\{t\}\\textsc\{ProxyWrapper\}\(s\_\{i\}\)\\;\+\\;\\textsc\{PROXY\}\\\!\\big\(u\_\{\\textrm\{left\}\}\\big\)\\;\+\\;\\textsc\{PROXY\}\\\!\\big\(u\_\{\\textrm\{right\}\}\\big\)\\;\\leq\\;\\varepsilon\_\{\\textrm\{abs\}\},
Likewise, for Lemma[A\.7](https://arxiv.org/html/2606.00202#A1.Thmtheorem7), we know that the first minimum objective query will already beProxyWrapper, as opposed to simplyProxy\.

## Appendix BImplementation, Data Structures, and Caching

### B\.1Integer Objectives

Throughout this work, we minimize a regularized empirical risk objective of the form

Obj​\(f,D,γ\)=γ​\|f\|\+∑i=1\|D\|𝟏​\{f​\(xi\)≠yi\},\\mathrm\{Obj\}\(f,D,\\gamma\)\\;=\\;\\gamma\\,\|f\|\\;\+\\;\\sum\_\{i=1\}^\{\|D\|\}\\mathbf\{1\}\\\{f\(x\_\{i\}\)\\neq y\_\{i\}\\\},\(48\)
This is equivalent to minimizing the normalized objective \(withγ=λ​\|D\|\\gamma=\\lambda\|D\|\)

Objnorm​\(f,D,λ\)=λ​\|f\|\+1\|D\|​∑i=1\|D\|𝟏​\{f​\(xi\)≠yi\},\\mathrm\{Obj\_\{\\textrm\{norm\}\}\}\(f,D,\\lambda\)\\;=\\;\\lambda\\,\|f\|\\;\+\\;\\frac\{1\}\{\|D\|\}\\sum\_\{i=1\}^\{\|D\|\}\\mathbf\{1\}\\\{f\(x\_\{i\}\)\\neq y\_\{i\}\\\},
For algorithmic stability and efficiency, we work exclusively with an equivalent integer\-valued objective obtained by requiringγ\\gammato be an integer \(equivalently,λ\\lambdato be an integer multiple of1\|D\|\\frac\{1\}\{\|D\|\}\)\. As withλ\\lambda, we require it to be non\-negative\.

This mild assumption eliminates floating\-point drift, which can become problematic when extracting a tree with a given objective and decomposing that objective into an exact sum over its left and right children\. Moreover, integer objectives allow auxiliary data in the AND/OR graph \(such as a histogram of objectives at each subproblem\) to be stored more efficiently, since there are fewer unique values, and they yield quicker theoretical convergence of iterative budget refinement \(see[A\.7](https://arxiv.org/html/2606.00202#A1.Thmtheorem7)\)\. Beyond these structural advantages, integer arithmetic is faster\.

One advantage to defining objectives in this form is that we can decompose the objective via a simple sum\.

Objint​\(f,D,γ\)=Objint​\(fL,DL,γ\)\+Objint​\(fR,DR,γ\)\.\\mathrm\{Obj\}\_\{\\mathrm\{int\}\}\(f,D,\\gamma\)\\;=\\;\\mathrm\{Obj\}\_\{\\mathrm\{int\}\}\(f\_\{L\},D\_\{L\},\\gamma\)\\;\+\\;\\mathrm\{Obj\}\_\{\\mathrm\{int\}\}\(f\_\{R\},D\_\{R\},\\gamma\)\.\(49\)
One consequence of \([49](https://arxiv.org/html/2606.00202#A2.E49)\) is that the regularization parameterγ\\gammais unchanged under recursive decomposition\. It therefore behaves as a global constant, and we omit it from function arguments when it is unambiguous\.
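The additive decomposition in (49) can be verified on a toy example. The sketch below assumes a hypothetical nested-tuple tree encoding, `("leaf", label)` or `("split", feature, left, right)`, with rows given as `(binary_feature_tuple, label)` pairs; none of these names come from the PRAXIS code base.

```python
def num_leaves(tree):
    """Count the leaves |f| of a nested-tuple tree."""
    if tree[0] == "leaf":
        return 1
    return num_leaves(tree[2]) + num_leaves(tree[3])


def predict(tree, x):
    """Route x to a leaf: feature value 0 goes left, 1 goes right."""
    while tree[0] == "split":
        _, j, left, right = tree
        tree = right if x[j] else left
    return tree[1]


def objective(tree, data, gamma):
    """Integer objective: gamma * (#leaves) + (#misclassified rows)."""
    errors = sum(predict(tree, x) != y for x, y in data)
    return gamma * num_leaves(tree) + errors


data = [((0,), 0), ((0,), 1), ((1,), 1), ((1,), 1)]
tree = ("split", 0, ("leaf", 0), ("leaf", 1))
gamma = 2

left_data = [(x, y) for x, y in data if x[0] == 0]
right_data = [(x, y) for x, y in data if x[0] == 1]

total = objective(tree, data, gamma)              # 2 leaves * 2 + 1 error
left_part = objective(tree[2], left_data, gamma)  # 1 leaf * 2 + 1 error
right_part = objective(tree[3], right_data, gamma)  # 1 leaf * 2 + 0 errors
```

Because every quantity is an integer, the parent objective equals the sum of the child objectives exactly, with no rounding tolerance needed.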

If a user wishes to run our method withγ\\gammavalues mapping to someλ\\lambdathat is not an integer multiple of1\|D\|\\frac\{1\}\{\|D\|\}, one can extend the algorithms from this work directly to accommodate non\-integerγ\\gamma\. Another option is to scale up the objective, i\.e\.

Obj​\(f,D,γ\)=γ​\|f\|\+C×∑i=1\|D\|𝟏​\{f​\(xi\)≠yi\},\\mathrm\{Obj\}\(f,D,\\gamma\)\\;=\\;\\gamma\\,\|f\|\\;\+\\;C\\times\\sum\_\{i=1\}^\{\|D\|\}\\mathbf\{1\}\\\{f\(x\_\{i\}\)\\neq y\_\{i\}\\\},for some integerCC, allowing a more precise match\. We also observe that roundingλ\\lambdato the nearest multiple of1\|D\|\\frac\{1\}\{\|D\|\}, for reasonably sized datasets, should be of minor consequence\.

We rigorously formalize the relation between the rounded and unroundedγ\\gammabelow\.

##### Effect of snappingλ\\lambda\.

Letλ∈ℝ≥0\\lambda\\in\\mathbb\{R\}\_\{\\geq 0\}, and letλ~\\tilde\{\\lambda\}be the nearest multiple of1\|D\|\\tfrac\{1\}\{\|D\|\}, i\.e\.λ~=1\|D\|​round​\(λ​\|D\|\)\\tilde\{\\lambda\}=\\tfrac\{1\}\{\|D\|\}\\mathrm\{round\}\(\\lambda\|D\|\)\. Then

\|λ~−λ\|≤12​\|D\|\.\|\\tilde\{\\lambda\}\-\\lambda\|\\leq\\frac\{1\}\{2\|D\|\}\.\(50\)

##### Effect on tree objectives\.

For any tree withLLleaves,

\|Objnorm​\(f,D,λ~\)−Objnorm​\(f,D,λ\)\|=\|λ~−λ\|​L≤L2​\|D\|≤2d−1\|D\|\.\\big\|\\mathrm\{Obj\}\_\{\\mathrm\{norm\}\}\(f,D,\\tilde\{\\lambda\}\)\-\\mathrm\{Obj\}\_\{\\mathrm\{norm\}\}\(f,D,\\lambda\)\\big\|=\|\\tilde\{\\lambda\}\-\\lambda\|\\,L\\;\\leq\\;\\frac\{L\}\{2\|D\|\}\\leq\\frac\{2^\{d\-1\}\}\{\|D\|\}\.\(51\)

##### Enumeratingλ\\lambdausing the Rashomon Set ofλ~\\tilde\{\\lambda\}\.

By \([51](https://arxiv.org/html/2606.00202#A2.E51)\), we can add a small amount of additive slack \(2d−1\|D\|\\frac\{2^\{d\-1\}\}\{\|D\|\}\) to the Rashomon bound and recover the Rashomon set defined byλ\\lambdaby solving withλ~\\tilde\{\\lambda\}\.
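A small numerical check of the snapping bounds (50) and (51) is sketched below. It uses exact rational arithmetic via `fractions.Fraction` so the check itself introduces no floating-point error; the helper names `snap_lambda` and `rashomon_slack` are hypothetical. Python's `round` breaks ties toward the even integer, which does not affect the bound.

```python
from fractions import Fraction


def snap_lambda(lam: Fraction, n: int):
    """Return (gamma, lam_tilde): integer gamma and the snapped lambda = gamma / n."""
    gamma = round(lam * n)          # nearest integer to lambda * |D|
    return gamma, Fraction(gamma, n)


def rashomon_slack(depth: int, n: int) -> Fraction:
    """Additive slack 2^(d-1) / |D| from (51), written as 2^d / (2 |D|)."""
    return Fraction(2 ** depth, 2 * n)


lam = Fraction("0.01")
n = 130                              # lambda * |D| = 1.3, so gamma snaps to 1
gamma, lam_tilde = snap_lambda(lam, n)
gap = abs(lam_tilde - lam)           # 3/1300, below the bound 1/(2 * 130)
slack = rashomon_slack(3, n)         # depth-3 trees have at most 8 leaves
```

For a tree with `L` leaves the normalized objective moves by `gap * L`, which for depth 3 is at most `8 * gap` and hence no larger than `slack`.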

##### Scaling objectives to be lossless\.

Supposeλ\\lambdais specified toppdecimal places \(commonlyp≤4p\\leq 4, asλ\\lambdais frequently chosen to be in\{0\.04,0\.02,0\.015,0\.01,0\.0075,0\.005,0\.0025,0\.001,0\}\\\{0\.04,0\.02,0\.015,0\.01,0\.0075,0\.005,0\.0025,0\.001,0\\\}\)\. Then we can write it asλ=a/10p\\lambda=a/10^\{p\}for some integeraa, soγ=λ​\|D\|=a10p​\|D\|=a​\|D\|10p\\gamma=\\lambda\|D\|=\\frac\{a\}\{10^\{p\}\}\|D\|=\\frac\{a\|D\|\}\{10^\{p\}\}\. If we scale the objective by10pgcd⁡\(a​\|D\|,10p\)≤10p\\frac\{10^\{p\}\}\{\\gcd\(a\|D\|,10^\{p\}\)\}\\leq 10^\{p\}, the scaled objective is integer\-valued, so the integerization loses nothing\.
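The scale factor can be computed directly with the standard library. In the sketch below, `lossless_scale` is a hypothetical helper that takes the integer numerator `a`, the number of decimal places `p`, and the dataset size `n`, and returns the multiplier together with the resulting integer leaf penalty.

```python
from fractions import Fraction
from math import gcd


def lossless_scale(a: int, p: int, n: int):
    """For lambda = a / 10**p and |D| = n, return (C, C * gamma) as integers.

    C = 10**p / gcd(a * n, 10**p) is the smallest power-of-ten divisor
    quotient that makes C * gamma = C * a * n / 10**p an integer.
    """
    denom = 10 ** p
    c = denom // gcd(a * n, denom)
    scaled_gamma = Fraction(a * n, denom) * c
    assert scaled_gamma.denominator == 1   # integer by construction
    return c, int(scaled_gamma)


# lambda = 0.0075 (a = 75, p = 4) on 1000 samples: gamma = 7.5, C = 2.
c, scaled_gamma = lossless_scale(75, 4, 1000)
```

With these numbers the scaled objective is `15 * |f| + 2 * (number of errors)`, which ranks trees exactly as the unscaled objective with `gamma = 7.5` does.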

### B\.2AND/OR Graph Representation

Much like prior work\(Arslan et al\.,[2026](https://arxiv.org/html/2606.00202#bib.bib3); Xin et al\.,[2022](https://arxiv.org/html/2606.00202#bib.bib80)\), we build out an AND/OR graph that sufficiently explores the search space to embed the Rashomon set in the graph\. In this section, we provide information about this AND/OR graph, which encodes all of the information needed to materialize trees in the Rashomon set\.

##### Subproblems and choices\.

At any point during the execution of PRAXIS, the algorithm operates on a subproblem, defined by a datasetDD, a remaining depthdd, and a remaining objective budget \(unlike the proxy algorithm, whose subproblems do not depend on the budget\)\. For a fixed subproblem, there are multiple possible ways to construct a tree within the budget: the algorithm may choose a leaf prediction, or it may choose to split on one of several features and recursively construct trees for the left and right children\.

##### OR nodes \(choices at a subproblem\):

We represent each subproblem by an*OR node*\. Each OR node stores a collection of alternative choices, any one of which yields a valid tree for that subproblem\. These choices consist of:

- •Leaf choices, corresponding to predicting a constant labelb∈\{0,1\}b\\in\\\{0,1\\\}; and
- •Split choices, corresponding to splitting on a featurejj, which can be used to recursively choose trees within that lowered budget for the left and right child subproblems\. The AND structure arises because, once a split is chosen, both child subproblems must be recursively handled for that split\.

##### AND nodes \(joint requirements of a split\):

A split choice introduces an*AND node*: choosing a split on featurejjrequires both a valid left subtree and a valid right subtree\. Thus, a split is feasible if and only if feasible solutions exist for both children within the remaining budget\.

##### Graph structure:

The resulting structure is an AND/OR graph:

- •OR nodes represent subproblems and store alternative choices;
- •split choices introduce AND structure by linking to two child OR nodes;
- •leaf choices terminate the recursion\.

A decision tree corresponds to selecting exactly one outgoing choice at each OR node and, for every split choice, recursively selecting choices in both children\.

##### Feasibility of combinations:

For a given OR node, not every possible combination of left and right subtrees necessarily yields a feasible tree under the budget\. However, our algorithm ensures a useful invariant for this representation: for every split choice stored at an OR node,*each*feasible subtree on one side of the split can be paired with*at least one*feasible subtree on the other side to form a valid tree within budget\. Otherwise, that split would have been pruned by the proxy test and never added to the AND/OR graph\.

##### Incremental construction:

Algorithm 8 AddLeaf\(G, b, C\_b\)

Input: ORNODE G, leaf prediction b ∈ \{0,1\}, leaf objective C\_b

1: G\.leaves ← G\.leaves ∪ \{\(b, C\_b\)\}

2: if C\_b < G\.min\_objective then

3: G\.min\_objective ← C\_b

4: end if

Algorithm 9 AddSplit\(G, j, G\_L, G\_R\)

Input: ORNODE G, feature j, ORNODES G\_L, G\_R

1: G\.splits ← G\.splits ∪ \{\(j, G\_L, G\_R\)\}

2: m\_sum ← G\_L\.min\_objective \+ G\_R\.min\_objective

3: if m\_sum < G\.min\_objective then

4: G\.min\_objective ← m\_sum

5: end if
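One way to realize the OR node and the two update routines in Python is sketched below. The `ORNode` dataclass, its field names, and the use of lists instead of sets are assumptions made for illustration; the actual PRAXIS data structures may differ.

```python
from dataclasses import dataclass, field
from math import inf
from typing import List, Tuple


@dataclass
class ORNode:
    """A subproblem: alternative leaf and split choices plus the best objective."""
    leaves: List[Tuple[int, int]] = field(default_factory=list)   # (label, objective)
    splits: List[Tuple[int, "ORNode", "ORNode"]] = field(default_factory=list)
    min_objective: float = inf


def add_leaf(node: ORNode, b: int, cost: int) -> None:
    """Record leaf choice (b, cost) and update the node's minimum objective."""
    node.leaves.append((b, cost))
    if cost < node.min_objective:
        node.min_objective = cost


def add_split(node: ORNode, j: int, left: ORNode, right: ORNode) -> None:
    """Record a split on feature j; its best tree pairs the two child minima."""
    node.splits.append((j, left, right))
    m_sum = left.min_objective + right.min_objective
    if m_sum < node.min_objective:
        node.min_objective = m_sum


left, right, root = ORNode(), ORNode(), ORNode()
add_leaf(left, 0, 3)
add_leaf(right, 1, 2)
add_leaf(root, 1, 7)                 # root as a single leaf costs 7
add_split(root, 0, left, right)      # split costs 3 + 2 = 5, the new minimum
```

Keeping `min_objective` up to date at insertion time means a parent can read a child's best objective in constant time when it decrements the sibling's budget.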

Algorithms[8](https://arxiv.org/html/2606.00202#alg8)and[9](https://arxiv.org/html/2606.00202#alg9)incrementally construct this AND/OR graph during the execution of PRAXIS\. Our AND/OR graph is significantly smaller than those of prior work, because the proxy algorithms have already pruned a substantial portion of the search space\. A subproblem is included in our AND/OR graph if and only if it is used in some tree in the Rashomon setRεabs​\(𝒯,D\)R\_\{\\varepsilon\_\{\\textrm\{abs\}\}\}\(\\mathcal\{T\},D\)\. PRAXIS returns an AND/OR graph by attaching smaller AND/OR graphs using these methods\. When PRAXIS considers a split at a subproblem, the multipass framework in Algorithm[3](https://arxiv.org/html/2606.00202#alg3)recurses on the child subproblems with a refined budget\. After iterative budget refinement, we attach the AND/OR graphs of the two children to the parent with Algorithm[9](https://arxiv.org/html/2606.00202#alg9)\. Analogously, if a leaf fits within the budget, we add it with Algorithm[8](https://arxiv.org/html/2606.00202#alg8)\.

One crucial value that these methods store in each node ismin\_objective\\textit\{min\\\_objective\}\. This is used in Algorithm[3](https://arxiv.org/html/2606.00202#alg3)to decrement budgets according to the best solution that was found\.

These pieces of information are sufficient to construct the full AND/OR graph, which represents our Rashomon set approximation\. However, to efficiently extract trees one at a time \(without materializing the entire Rashomon set\), we require additional information: specifically, a list or histogram of objectives at each subproblem\. Section[B\.3](https://arxiv.org/html/2606.00202#A2.SS3)describes this process\. We do not store this information during the execution of PRAXIS, because iterative budget refinement may rebuild the AND/OR graph multiple times per subproblem\. Instead, we defer this work to a single postprocessing step, as it is not needed during the algorithm’s execution\.

### B\.3Postprocessing Algorithms

To efficiently extract trees from the AND/OR graph, we must know how many trees on one side of a split can be paired with trees on the other side, so that we can build a one\-to\-one map from a tree index at the root to an actual tree in the Rashomon set\. We construct this map to be monotonic, in the sense that a smaller index maps to a tree with no larger objective value \(i\.e\., we can index into the trees in sorted order and output them one at a time\)\. Algorithm [10](https://arxiv.org/html/2606.00202#alg10) shows the histogram implementation we use: a list of $(\text{objective}, \text{count})$ pairs, maintained in increasing order of objective\.

Algorithm [12](https://arxiv.org/html/2606.00202#alg12) builds histograms of tree objectives that can be achieved within the subproblem budget\. At a given subproblem \(starting at the root\), it recursively processes the left and right subproblems of every split node and adds any leaves to the histogram of the current subproblem\. Note that we do not check whether leaves fit within the budget, as this check was already performed during the execution of PRAXIS\.

Once histograms have been computed for all descendant subproblems that appear in the Rashomon set $R_{\varepsilon_{\text{abs}}}(\mathcal{T}, D)$, and leaf contributions have been added to the current histogram, the remaining step is to propagate histogram information upward through split nodes\. Algorithm [11](https://arxiv.org/html/2606.00202#alg11) performs a filtered Cartesian product of the histograms from the left and right children of a split node\. Because objectives are additive, we sum objective values and multiply their multiplicities to obtain the total number of trees with that objective arising from the split\.

The resulting $(\text{objective}, \text{count})$ pairs are then merged into the parent histogram\. This procedure is repeated for all splits until all information has been fully propagated to the root \(and then we have the distribution over objectives for the entire Rashomon set\)\.
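The filtered Cartesian product and the merge step described above can be sketched in a few lines of Python. This is an illustrative sketch, not the PRAXIS implementation: the function names `combine_histograms` and `merge_histograms` are hypothetical, histograms are assumed to be sorted lists of `(objective, count)` tuples, and objectives are assumed to be integers so that equality tests are exact.

```python
from collections import defaultdict

def combine_histograms(left_hist, right_hist, budget):
    """Filtered Cartesian product of two sorted (objective, count) lists.

    Keeps only pairs whose summed objective fits within `budget`.
    """
    sum_counts = defaultdict(int)
    for l_obj, l_cnt in left_hist:
        rem = budget - l_obj  # objective budget left for the right subtree
        for r_obj, r_cnt in right_hist:
            if r_obj > rem:
                break  # right_hist is sorted, so every later entry is too large
            # Objectives add; counts multiply (each left tree pairs with each right tree).
            sum_counts[l_obj + r_obj] += l_cnt * r_cnt
    return sorted(sum_counts.items())

def merge_histograms(hist_a, hist_b):
    """Merge two histograms, summing counts for equal objectives."""
    merged = defaultdict(int)
    for obj, cnt in list(hist_a) + list(hist_b):
        merged[obj] += cnt
    return sorted(merged.items())

# Hypothetical child histograms and a parent budget of 6.
left = [(2, 1), (3, 2)]
right = [(1, 1), (4, 3)]
split_hist = combine_histograms(left, right, budget=6)
# A parent that already holds one leaf with objective 4.
parent_hist = merge_histograms([(4, 1)], split_hist)
```

In this toy input the pair with objectives 3 and 4 sums to 7 and is filtered out by the budget, which is the only difference from a plain Cartesian product.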

Algorithm 10 $\textsc{AddHist}(N, \textit{obj}, \textit{add\_cnt})$

Require: ORNODE $N$, objective value $\textit{obj}$, increment $\textit{add\_cnt} \in \mathbb{N}$

1: Find the position of $\textit{obj}$ in $N.\textit{hist}$, keeping the list sorted by objective

2: if $\textit{obj}$ already appears in $N.\textit{hist}$ then

3: Increase its count by $\textit{add\_cnt}$

4: else

5: Insert a new entry $(\textit{obj}, \textit{add\_cnt})$ in sorted order

6: end if

Algorithm 11 $\textsc{AddSplitAndBuild}(N, j, L, R)$

Require: ORNODE $N$ with budget $N.\textit{budget}$, feature index $j$, left child ORNODE $L$, right child ORNODE $R$

1: Initialize a new split record $s$ with $s.\textit{feature} \leftarrow j$, $s.\textit{left} \leftarrow L$, $s.\textit{right} \leftarrow R$

2: $\textit{sum\_counts} \leftarrow$ empty map from objective $\to$ count

3: for each $(\ell_{\text{obj}}, \ell_{\text{cnt}})$ in $L.\textit{hist}$ do

4: $\textit{rem} \leftarrow N.\textit{budget} - \ell_{\text{obj}}$

5: for each $(r_{\text{obj}}, r_{\text{cnt}})$ in $R.\textit{hist}$ with $r_{\text{obj}} \leq \textit{rem}$ do

6: $\textit{tot} \leftarrow \ell_{\text{obj}} + r_{\text{obj}}$

7: $\textit{sum\_counts}[\textit{tot}] \leftarrow \textit{sum\_counts}[\textit{tot}] + \ell_{\text{cnt}} \cdot r_{\text{cnt}}$

8: end for

9: end for

10: if $\textit{sum\_counts}$ is not empty then

11: Let $\textit{tmp}$ be the list of pairs $(\textit{obj}, \textit{cnt})$ from $\textit{sum\_counts}$, sorted by $\textit{obj}$

12: Merge the sorted lists $N.\textit{hist}$ and $\textit{tmp}$ into a new sorted list, summing counts for equal objectives

13: Replace $N.\textit{hist}$ by this merged list

14: end if

15: $s.\textit{num\_valid\_trees} \leftarrow \sum_{(\textit{obj},\, \textit{cnt}) \in \textit{sum\_counts}} \textit{cnt}$

16: Append $s$ to $N.\textit{splits}$

Algorithm 12 $\textsc{BuildHistogramsPost}(N)$

Require: ORNODE $N$

1: if $N.\textit{hist\_built}$ is true then

2: return

3: end if

4: for each split $s$ in $N.\textit{splits}$ do

5: $\textsc{BuildHistogramsPost}(s.\textit{left})$

6: $\textsc{BuildHistogramsPost}(s.\textit{right})$

7: end for

8: $\textit{saved\_splits} \leftarrow N.\textit{splits}$

9: $N.\textit{splits} \leftarrow \emptyset$

10: $N.\textit{hist} \leftarrow \emptyset$

11: for each leaf $(b, \textit{loss})$ in $N.\textit{leaves}$ do

12: $\textsc{AddHist}(N, \textit{loss}, 1)$

13: end for

14: for each split $s$ in $\textit{saved\_splits}$ do

15: $\textsc{AddSplitAndBuild}(N, s.\textit{feature}, s.\textit{left}, s.\textit{right})$

16: end for

17: $N.\textit{hist\_built} \leftarrow \textbf{true}$

With histograms of objectives available at each subproblem, we can efficiently extract trees in sorted order\. This procedure follows a similar indexing scheme to that of Xin et al\. \([2022](https://arxiv.org/html/2606.00202#bib.bib80)\), except that we explicitly guarantee a sorted order \(Arslan et al\., [2026](https://arxiv.org/html/2606.00202#bib.bib3)\)\. The high\-level idea is to convert each histogram into a cumulative count of trees\. Given a request for the $x$\-th tree, we first consult the root histogram to determine the corresponding objective value and the index of the tree within that objective bucket\. We then implement a procedure to retrieve the $y$\-th tree with objective $z$\.

There are many possible ways to break ties among trees with the same objective\. We use a deterministic scheme without any meaningful interpretation, although this could be replaced with a secondary criterion\.

Tie\-breaking is not the focus of our work; instead, our goal is to produce trees in sorted order, allowing users to choose whether to enumerate only the top $(1+\varepsilon)$\-factor of trees or to extract the additional trees returned by PRAXIS that are within the initial budget\. Either is a reasonable choice, since [Table 12](https://arxiv.org/html/2606.00202#A4.T12) shows that PRAXIS approximates the set of trees within the initial budget almost as well as the true Rashomon set\.

See [subsection B\.4](https://arxiv.org/html/2606.00202#A2.SS4) for further details on extracting the trees in sorted order\.

### B\.4 Tree Extraction in sorted order

Once histograms of objective values have been computed for every subproblem via Algorithm [12](https://arxiv.org/html/2606.00202#alg12), we can extract individual trees from the AND/OR graph in monotone \(nondecreasing\) objective order without materializing the entire Rashomon set\.

Algorithm [13](https://arxiv.org/html/2606.00202#alg13) is a wrapper method for tree extraction, which calls the main recursive extraction method\. Given an index $i$, it scans the root histogram in increasing order of objective value, identifying the smallest objective $z$ such that the cumulative number of trees with objective at most $z$ exceeds $i$\. This determines both the target objective $z$ and the within\-bucket index $k$ corresponding to the $i^{\text{th}}$ tree\.

Algorithm [14](https://arxiv.org/html/2606.00202#alg14) then recursively materializes the $k^{\text{th}}$ tree with objective exactly $z$ from the AND/OR graph\.

At an ORNODE $N$, the procedure first enumerates any leaf options whose loss equals $z$ \(which defines a deterministic tie\-breaking scheme within an objective bucket\)\. If the desired index is not exhausted by leaf solutions, the algorithm decrements $k$ and proceeds through the split records stored at $N$\. For a split $(j, L, R)$, it computes how many trees under that split achieve objective $z$ by pairing objective values from $L.\textit{hist}$ and $R.\textit{hist}$ whose sums equal $z$\. If the desired index falls within this split, the algorithm deterministically selects the corresponding pair of left and right subtrees and recurses on the children, constructing a prediction node with feature $j$\. Otherwise, $k$ is further decremented to skip all trees induced by that split, continuing until the target position within the bucket is reached\.

Algorithm 13 $\textsc{GetIthTree}(\mathrm{root}, i)$

Require: Root ORNODE $\mathrm{root}$, index $i \in \{0, 1, \dots\}$

1: $\textsc{BuildHistogramsPost}(\mathrm{root})$

2: $\textit{cum} \leftarrow 0$

3: for each $(\textit{obj}, \textit{ct})$ in $\mathrm{root}.\textit{hist}$ in increasing $\textit{obj}$ do

4: if $i < \textit{cum} + \textit{ct}$ then

5: $\textit{target\_obj} \leftarrow \textit{obj}$

6: $k \leftarrow i - \textit{cum}$ {index within this objective bucket}

7: return $\textsc{GetKthTreeWithObjective}(\mathrm{root}, \textit{target\_obj}, k)$

8: end if

9: $\textit{cum} \leftarrow \textit{cum} + \textit{ct}$

10: end for

In Algorithm [14](https://arxiv.org/html/2606.00202#alg14), $\textsc{MakeLeaf}(b)$ denotes a leaf predicting label $b$, and $\textsc{MakeSplit}(j, T_{L}, T_{R})$ denotes the decision tree whose root splits on feature $j$, with left subtree $T_{L}$ and right subtree $T_{R}$\.

Algorithm 14 $\textsc{GetKthTreeWithObjective}(N, z, k)$

Require: ORNODE $N$, target objective $z$, index $k$ within objective\-$z$ bucket

1: $\textsc{BuildHistogramsPost}(N)$

{\(1\) Leaf trees at $N$ \(tie\-breaker: leaves are enumerated before splits\)\.}

2: for each leaf $(b, \ell)$ in $N.\textit{leaves}$ do

3: if $\ell = z$ then

4: if $k = 0$ then

5: return $\textsc{MakeLeaf}(b)$

6: else

7: $k \leftarrow k - 1$

8: end if

9: end if

10: end for

{\(2\) Split trees at $N$\.}

11: for each split record $s$ in $N.\textit{splits}$ do

12: $L \leftarrow s.\textit{left}$, $R \leftarrow s.\textit{right}$

13: $\textsc{BuildHistogramsPost}(L)$; $\textsc{BuildHistogramsPost}(R)$

{Compute how many trees under this split achieve objective $z$\.}

14: $\textit{total\_here} \leftarrow 0$

15: for each $(\ell_{\text{obj}}, \ell_{\text{cnt}})$ in $L.\textit{hist}$ do

16: $r_{\text{obj}} \leftarrow z - \ell_{\text{obj}}$

17: if $r_{\text{obj}}$ appears in $R.\textit{hist}$ with count $r_{\text{cnt}}$ then

18: $\textit{total\_here} \leftarrow \textit{total\_here} + \ell_{\text{cnt}} \cdot r_{\text{cnt}}$

19: end if

20: end for

21: if $k < \textit{total\_here}$ then

22: {The desired tree lies under split $s$; select the $(\ell_{\text{obj}}, r_{\text{obj}})$ pair and recurse\.}

23: $\textit{running} \leftarrow 0$

24: for each $(\ell_{\text{obj}}, \ell_{\text{cnt}})$ in $L.\textit{hist}$ do

25: $r_{\text{obj}} \leftarrow z - \ell_{\text{obj}}$

26: if $r_{\text{obj}}$ appears in $R.\textit{hist}$ with count $r_{\text{cnt}}$ then

27: $\textit{pairs} \leftarrow \ell_{\text{cnt}} \cdot r_{\text{cnt}}$

28: if $\textit{running} + \textit{pairs} > k$ then

29: $\textit{rel} \leftarrow k - \textit{running}$

30: $\textit{left\_idx} \leftarrow \left\lfloor \textit{rel} / r_{\text{cnt}} \right\rfloor$

31: $\textit{right\_idx} \leftarrow \textit{rel} \bmod r_{\text{cnt}}$

32: $T_{L} \leftarrow \textsc{GetKthTreeWithObjective}(L, \ell_{\text{obj}}, \textit{left\_idx})$

33: $T_{R} \leftarrow \textsc{GetKthTreeWithObjective}(R, r_{\text{obj}}, \textit{right\_idx})$

34: return $\textsc{MakeSplit}(s.\textit{feature}, T_{L}, T_{R})$

35: end if

36: $\textit{running} \leftarrow \textit{running} + \textit{pairs}$

37: end if

38: end for

39: else

40: $k \leftarrow k - \textit{total\_here}$ {skip all objective\-$z$ trees from this split}

41: end if

42: end for
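To make the indexing scheme concrete, the following Python sketch combines the histogram construction with the index-based extraction described in this subsection. It is a simplified illustration under stated assumptions, not the PRAXIS code: the `OrNode` class, the tuple encodings of leaves, splits, and trees, and all function names are hypothetical; histograms are stored as dictionaries rather than sorted lists; objectives are assumed to be integers; and the counting and selection loops of the extraction procedure are fused into a single pass, which selects the same tree.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class OrNode:
    budget: int
    leaves: list = field(default_factory=list)  # (label, loss) pairs
    splits: list = field(default_factory=list)  # (feature, left OrNode, right OrNode)
    hist: Optional[dict] = None                 # objective -> number of trees

def build_hist(node):
    """Count trees per objective at `node`, keeping only sums within its budget."""
    if node.hist is not None:
        return node.hist
    hist = defaultdict(int)
    for _, loss in node.leaves:
        hist[loss] += 1
    for _, left, right in node.splits:
        lh, rh = build_hist(left), build_hist(right)
        for lo, lc in lh.items():
            for ro, rc in rh.items():
                if lo + ro <= node.budget:
                    hist[lo + ro] += lc * rc
    node.hist = dict(hist)
    return node.hist

def kth_tree(node, z, k):
    """Return the k-th tree (0-based) with objective exactly z at `node`."""
    build_hist(node)
    for label, loss in node.leaves:  # leaves are enumerated before splits
        if loss == z:
            if k == 0:
                return ("leaf", label)
            k -= 1
    for feature, left, right in node.splits:
        for lo in sorted(left.hist):
            rc = right.hist.get(z - lo, 0)
            pairs = left.hist[lo] * rc
            if k < pairs:
                # Mixed-radix decoding: the right index varies fastest.
                left_idx, right_idx = divmod(k, rc)
                return ("split", feature,
                        kth_tree(left, lo, left_idx),
                        kth_tree(right, z - lo, right_idx))
            k -= pairs  # skip every tree in this (lo, z - lo) pairing
    raise IndexError("k is out of range for this objective bucket")

def ith_tree(root, i):
    """Return (objective, tree) for the i-th tree in nondecreasing objective order."""
    hist = build_hist(root)
    cum = 0
    for obj in sorted(hist):
        if i < cum + hist[obj]:
            return obj, kth_tree(root, obj, i - cum)
        cum += hist[obj]
    raise IndexError("i exceeds the number of stored trees")
```

Because only counts are stored at each node, `ith_tree(root, i)` touches one root-to-leaf path of pairings per request instead of enumerating every tree.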

### B\.5 LicketySPLIT modifications

We use the LicketySPLIT algorithm from Babbar et al\. \([2025](https://arxiv.org/html/2606.00202#bib.bib5)\) as our default proxy algorithm in PRAXIS\. However, we provide a multitude of implementation and algorithm changes that often allow PRAXIS to run much faster than even a single call of the implementation detailed in Babbar et al\. \([2025](https://arxiv.org/html/2606.00202#bib.bib5)\)\.

Below are the implementations of LicketySPLIT and the greedy tree methods that it calls that are faithful to Babbar et al\. \([2025](https://arxiv.org/html/2606.00202#bib.bib5)\)\. Though Babbar et al\. \([2025](https://arxiv.org/html/2606.00202#bib.bib5)\)’s implementation used a modified version of the GOSDT \(Lin et al\., [2020](https://arxiv.org/html/2606.00202#bib.bib46)\) algorithm, here we represent it in a pure form\. We do not present the implementation as returning an explicit tree \(although Babbar et al\. \([2025](https://arxiv.org/html/2606.00202#bib.bib5)\) does so\)\. Instead, we return only the objective value of the corresponding tree, which is sufficient for our purposes\.

LicketySPLIT \(Algorithm [16](https://arxiv.org/html/2606.00202#alg16)\) introduces a one\-step lookahead relative to a greedy tree algorithm\. Rather than selecting a feature using information gain alone, LicketySPLIT evaluates each candidate split by completing both children greedily and selecting the split that minimizes the resulting summed objective\. After selecting this feature, the algorithm recurses using LicketySPLIT itself on the children\.

Algorithm 15 $\textsc{GreedyTree}(D', d, \gamma)$

Require: Subproblem dataset $D'$, remaining depth $d$, per\-leaf penalty $\gamma$

1: Let $n' \leftarrow |D'|$, $p \leftarrow$ number of positive labels in $D'$

2: $\textit{leaf\_loss} \leftarrow \gamma + \min(p,\, n' - p)$

3: if $d = 0$ or $\textit{leaf\_loss} \leq 2\gamma$ then

4: return $\textit{leaf\_loss}$

5: end if

6: Select feature $j$ maximizing information gain on $D'$

7: Partition $D'$ into $(D'_{L}, D'_{R})$ using feature $j$

8: if $D'_{L} = \emptyset$ or $D'_{R} = \emptyset$ then

9: return $\textit{leaf\_loss}$

10: end if

11: $L \leftarrow \textsc{GreedyTree}(D'_{L}, d - 1, \gamma)$

12: $R \leftarrow \textsc{GreedyTree}(D'_{R}, d - 1, \gamma)$

13: return $\min(\textit{leaf\_loss},\, L + R)$

Algorithm 16 $\textsc{LicketySPLIT}(D', d, \gamma)$ \(from Babbar et al\. \([2025](https://arxiv.org/html/2606.00202#bib.bib5)\)\)

Require: Subproblem dataset $D'$, remaining depth $d$, per\-leaf penalty $\gamma$\.

1: Let $n' \leftarrow |D'|$, $p \leftarrow$ number of positive labels in $D'$

2: $\textit{leaf\_loss} \leftarrow \gamma + \min(p,\, n' - p)$

3: if $d = 0$ or $\textit{leaf\_loss} \leq 2\gamma$ then

4: return $\textit{leaf\_loss}$

5: end if

6: $\textit{best\_sum} \leftarrow +\infty$, $\textit{best\_feature} \leftarrow \bot$

7: for each feature $j$ do

8: Partition $D'$ into $(D'_{L}, D'_{R})$ using feature $j$

9: if $D'_{L} = \emptyset$ or $D'_{R} = \emptyset$ then

10: continue

11: end if

12: $s_{j} \leftarrow \textsc{GreedyTree}(D'_{L}, d - 1, \gamma) + \textsc{GreedyTree}(D'_{R}, d - 1, \gamma)$

13: if $s_{j} < \textit{best\_sum}$ then

14: $\textit{best\_sum} \leftarrow s_{j}$

15: $\textit{best\_feature} \leftarrow j$

16: end if

17: end for

18: if $\textit{best\_feature} = \bot$ then

19: return $\textit{leaf\_loss}$

20: end if

21: if $\textit{best\_sum} \geq \textit{leaf\_loss}$ then

22: return $\textit{leaf\_loss}$

23: end if

24: Partition $D'$ using $\textit{best\_feature}$ into $(D'_{L}, D'_{R})$

25: $L \leftarrow \textsc{LicketySPLIT}(D'_{L}, d - 1, \gamma)$

26: $R \leftarrow \textsc{LicketySPLIT}(D'_{R}, d - 1, \gamma)$

27: return $L + R$
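The two procedures above translate fairly directly into Python. The sketch below is a plain, uncached rendering for small binary datasets, written only to illustrate the one-step lookahead; it is not the implementation of Babbar et al. or of PRAXIS. The data layout (`X` as a list of 0/1 feature rows, `y` as a list of 0/1 labels), the helper names, and the choice of Shannon entropy in bits for information gain are all assumptions made for this example.

```python
import math

def leaf_loss(y, gamma):
    """Per-leaf penalty plus the errors of a majority-label leaf."""
    p = sum(y)
    return gamma + min(p, len(y) - p)

def entropy(y):
    if not y:
        return 0.0
    q = sum(y) / len(y)
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def partition(X, y, j):
    """Split rows on binary feature j: left has x_j == 0, right has x_j == 1."""
    left = [i for i, row in enumerate(X) if row[j] == 0]
    right = [i for i, row in enumerate(X) if row[j] == 1]
    take = lambda idx: ([X[i] for i in idx], [y[i] for i in idx])
    return take(left), take(right)

def info_gain(X, y, j):
    (_, yl), (_, yr) = partition(X, y, j)
    n = len(y)
    return entropy(y) - len(yl) / n * entropy(yl) - len(yr) / n * entropy(yr)

def greedy_tree(X, y, d, gamma):
    """Objective of the greedy (information-gain) tree of depth at most d."""
    ll = leaf_loss(y, gamma)
    if d == 0 or ll <= 2 * gamma:
        return ll
    j = max(range(len(X[0])), key=lambda f: info_gain(X, y, f))
    (Xl, yl), (Xr, yr) = partition(X, y, j)
    if not yl or not yr:
        return ll
    return min(ll, greedy_tree(Xl, yl, d - 1, gamma) + greedy_tree(Xr, yr, d - 1, gamma))

def lickety_split(X, y, d, gamma):
    """Objective of the tree chosen by one-step lookahead with greedy completion."""
    ll = leaf_loss(y, gamma)
    if d == 0 or ll <= 2 * gamma:
        return ll
    best_sum, best_feature = math.inf, None
    for j in range(len(X[0])):
        (Xl, yl), (Xr, yr) = partition(X, y, j)
        if not yl or not yr:
            continue
        s = greedy_tree(Xl, yl, d - 1, gamma) + greedy_tree(Xr, yr, d - 1, gamma)
        if s < best_sum:
            best_sum, best_feature = s, j
    if best_feature is None or best_sum >= ll:
        return ll  # early exit kept here, as in the faithful version
    (Xl, yl), (Xr, yr) = partition(X, y, best_feature)
    return lickety_split(Xl, yl, d - 1, gamma) + lickety_split(Xr, yr, d - 1, gamma)
```

On data where the label is the XOR of two features and a third feature is only weakly correlated with the label, the greedy root split follows the weak feature, while the lookahead sees that either XOR feature leads to a perfect greedy completion.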

Now, we present changes to LicketySPLIT and its accompanying greedy subroutine in three categories: optimal solvers at shallow depth, caching to guarantee speed\-ups, and generalizations to interpolate between greedy and optimal\.

We generalize LicketySPLIT via a hierarchical rollout scheme for split selection dictated by a lookahead parameter $\ell$\. Taking $\ell = 1$ in our generalization recovers LicketySPLIT, and $\ell = 0$ is taken to mean the greedy tree algorithm\.

While LicketySPLIT \($\ell = 1$\) evaluates candidate splits using greedy completion \($\ell = 0$\), our approach generalizes split selection by defining it recursively through a hierarchy of heuristics\. For a given $\ell$, candidate splits are evaluated using completions generated by a lower\-tier heuristic \($\ell - 1$\); for instance, the proxy algorithm at $\ell = 2$ selects splits based on LicketySPLIT completions\. This construction yields a hierarchical rollout scheme in which each heuristic rolls out a cheaper heuristic, ultimately terminating at greedy induction\.

We explain the modifications to LicketySPLIT, fixing $\ell = 1$ for simplicity, as it is our preferred proxy algorithm\.

Algorithm 17 $\textsc{GreedyTree}(D', d, \gamma)$ \(with shared depth\-1 exact solver \+ caching\)

Require: Subproblem dataset $D' \subseteq D$, remaining depth $d$, regularization $\gamma$

1: if $d = 0$ then

2: $\textit{ans} \leftarrow \textsc{Depth\_d\_Exact}(D', 0, \gamma)$

3: return $\textit{ans}$

4: end if

5: if $d = 1$ then

6: return $\textsc{Depth\_d\_Exact}(D', 1, \gamma)$ {last step is optimal misclassification \(shared with LicketySPLIT\)}

7: end if

8: $k \leftarrow \textsc{Key}(D')$ {subproblem identifier \(64\-bit fingerprint\)}

9: if $k \in \mathcal{C}_{\mathrm{greedy}}[d]$ then

10: return $\mathcal{C}_{\mathrm{greedy}}[d][k]$

11: end if

12: $\textit{leaf\_loss} \leftarrow \textsc{Depth\_d\_Exact}(D', 0, \gamma)$

13: if $\textit{leaf\_loss} \leq 2\gamma$ then

14: Optionally cache: $\mathcal{C}_{\mathrm{greedy}}[d][k] \leftarrow \textit{leaf\_loss}$

15: return $\textit{leaf\_loss}$

16: end if

17: Choose feature $j^{\star}$ maximizing information gain on $D'$

18: Split $D'$ into $(D'_{L}, D'_{R})$ by $j^{\star}$; if either side is empty, return $\textit{leaf\_loss}$

19: $L \leftarrow \textsc{GreedyTree}(D'_{L}, d - 1, \gamma)$

20: $R \leftarrow \textsc{GreedyTree}(D'_{R}, d - 1, \gamma)$

21: $\textit{split\_loss} \leftarrow L + R$

22: $\textit{ans} \leftarrow \min(\textit{leaf\_loss},\, \textit{split\_loss})$

23: Cache: $\mathcal{C}_{\mathrm{greedy}}[d][k] \leftarrow \textit{ans}$

24: return $\textit{ans}$

Algorithm 18 $\textsc{LicketySPLIT}(D', d, \gamma, \ell)$ \(low depth exact solvers \+ caching \+ clamped lookahead \+ generalization\)

Require: Subproblem dataset $D' \subseteq D$, remaining depth $d$, regularization $\gamma$, lookahead $\ell \geq 0$

1: if $d = 0$ then

2: $\textit{ans} \leftarrow \textsc{Depth\_d\_Exact}(D', 0, \gamma)$

3: return $\textit{ans}$

4: end if

5: $\ell \leftarrow \min(\ell,\, d - 1)$ {clamp lookahead by remaining depth to allow more caching; $\ell = d - 1$ is now optimal at depth $d$}

6: if $\ell = d - 1$ then

7: return $\textsc{Depth\_d\_Exact}(D', d, \gamma)$ {shared with GreedyTree}

8: end if

9: $k \leftarrow \textsc{Key}(D')$

10: if $k \in \mathcal{C}_{\mathrm{lickety}}[d, \ell]$ then

11: return $\mathcal{C}_{\mathrm{lickety}}[d, \ell][k]$

12: end if

13: $\textit{leaf\_loss} \leftarrow \textsc{Depth\_d\_Exact}(D', 0, \gamma)$

14: if $\textit{leaf\_loss} \leq 2\gamma$ then

15: Optionally cache: $\mathcal{C}_{\mathrm{depth0}}[k] \leftarrow \textit{leaf\_loss}$

16: return $\textit{leaf\_loss}$

17: end if

18: $\textit{best\_sum} \leftarrow +\infty$, $\textit{best\_feature} \leftarrow \bot$

19: for each feature $j$ do

20: Split $D'$ into $(D'_{L}, D'_{R})$ by $j$; continue if either side is empty

21: if $\ell = 1$ then

22: $s_{j} \leftarrow \textsc{GreedyTree}(D'_{L}, d - 1, \gamma) + \textsc{GreedyTree}(D'_{R}, d - 1, \gamma)$ {one\-step lookahead: greedy completion}

23: else

24: $s_{j} \leftarrow \textsc{LicketySPLIT}(D'_{L}, d - 1, \gamma, \ell - 1) + \textsc{LicketySPLIT}(D'_{R}, d - 1, \gamma, \ell - 1)$ {general $\ell$: recurse with $\ell - 1$ during split evaluation}

25: end if

26: if $s_{j} < \textit{best\_sum}$ then

27: $\textit{best\_sum} \leftarrow s_{j}$, $\textit{best\_feature} \leftarrow j$

28: end if

29: end for

30: $\textit{ans} \leftarrow \textit{leaf\_loss}$

31: if $\textit{best\_feature} \neq \bot$ then

32: Split $D'$ by $\textit{best\_feature}$ into $(D'_{L}, D'_{R})$

33: $L \leftarrow \textsc{LicketySPLIT}(D'_{L}, d - 1, \gamma, \ell)$ {recurse with constant $\ell$ in recursion}

34: $R \leftarrow \textsc{LicketySPLIT}(D'_{R}, d - 1, \gamma, \ell)$

35: $\textit{ans} \leftarrow \min(\textit{ans},\, L + R)$

36: end if

37: Cache: $\mathcal{C}_{\mathrm{lickety}}[d, \ell][k] \leftarrow \textit{ans}$

38: return $\textit{ans}$

Algorithm 19 $\textsc{Depth\_d\_Exact}(D', d, \gamma)$ \(optimal depth\-$d$ solver, shared for $d = 1$\), \(all new\)

Require: Subproblem dataset $D' \subseteq D$, remaining depth $d$, regularization $\gamma$

Ensure: $\mathrm{OPT}(D', d)$, the minimum integer objective among all trees of depth at most $d$ on $D'$

1: $k \leftarrow \textsc{Key}(D')$

2: if $k \in \mathcal{C}_{\mathrm{opt}}[d]$ then

3: return $\mathcal{C}_{\mathrm{opt}}[d][k]$

4: end if

5: $n' \leftarrow |D'|$, $p \leftarrow$ number of positive labels in $D'$

6: $\textit{leaf\_loss} \leftarrow \gamma + \min(p,\, n' - p)$

7: if $d = 0$ then

8: Optionally cache: $\mathcal{C}_{\mathrm{depth0}}[k] \leftarrow \textit{leaf\_loss}$

9: return $\textit{leaf\_loss}$

10: end if

11: $\textit{best} \leftarrow \textit{leaf\_loss}$

12: for each feature $j$ do

13: Split $D'$ into $(D'_{L}, D'_{R})$ by $j$; continue if either side is empty

14: $s_{j} \leftarrow \textsc{Depth\_d\_Exact}(D'_{L}, d - 1, \gamma) + \textsc{Depth\_d\_Exact}(D'_{R}, d - 1, \gamma)$

15: $\textit{best} \leftarrow \min(\textit{best},\, s_{j})$

16: end for

17: Cache: $\mathcal{C}_{\mathrm{opt}}[d][k] \leftarrow \textit{best}$

18: return $\textit{best}$

##### Optimal solvers at shallow depth\.

The greedy tree solver was previously not optimal at depth 1 \(the best split for information gain is not necessarily the one that minimizes misclassification error\)\. Now, with an optimal depth 1 solver, greedy is optimal at depth 1, and LicketySPLIT is optimal at depth 2 without any additional time complexity \(and in practice, computing additive objectives is cheaper than information gain\)\.

Because we delegate shallow subproblems to a dedicated subroutine, the proxy algorithm LicketySPLIT and its subroutine, the greedy tree algorithm, can share a cache\. For instance, if both are called at depth 1, they now use the same cache\. Additionally, whenever the lookahead parameter satisfies $\ell \geq d - 1$, the proxy algorithm is already optimal with $\ell = d - 1$, so we clamp $\ell$ to $d - 1$ to share more caching\. We elaborate more on caching beyond the shared optimal solvers in [subsection B\.6](https://arxiv.org/html/2606.00202#A2.SS6)\.
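As a rough illustration of the shared, cached exact solver, the Python sketch below memoizes the optimal objective for each pair of remaining depth and subproblem. It is a sketch under assumptions rather than the PRAXIS implementation: the closure `make_exact_solver` is a hypothetical name, features and labels are assumed binary, and a `frozenset` of row indices stands in for the 64-bit subproblem fingerprint.

```python
def make_exact_solver(X, y, gamma):
    """Build a memoized optimal depth-d solver over row-index subsets of (X, y)."""
    cache = {}  # (depth, frozenset of row indices) -> optimal objective

    def solve(rows, d):
        key = (d, rows)
        if key in cache:
            return cache[key]  # reused by any caller that reaches this subproblem
        p = sum(y[i] for i in rows)
        best = gamma + min(p, len(rows) - p)  # objective of a single leaf
        if d > 0:
            for j in range(len(X[0])):
                left = frozenset(i for i in rows if X[i][j] == 0)
                right = rows - left
                if not left or not right:
                    continue  # degenerate split
                best = min(best, solve(left, d - 1) + solve(right, d - 1))
        cache[key] = best
        return best

    return solve, cache
```

Any routine that needs an optimal depth-0 or depth-1 value for the same rows, whether it is the greedy subroutine or the lookahead procedure, would hit the same cache entry; that is the sharing the paragraph above describes.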

##### Further exploration of search space:

In the original implementation of LicketySPLIT \(see lines 21–23 of Algorithm [16](https://arxiv.org/html/2606.00202#alg16); also the condition in line 1 of Algorithm 3 of \(Babbar et al\., [2025](https://arxiv.org/html/2606.00202#bib.bib5)\)\), the algorithm includes an early stopping condition: if the best greedy tree completion over all candidate splits is no better than a leaf, the algorithm terminates and returns the leaf loss for the subproblem\. In our implementation, we remove this early exit\. While this change appears trivial, the early stopping condition is implicitly required in the implementation of Babbar et al\. \([2025](https://arxiv.org/html/2606.00202#bib.bib5)\) as a consequence of their use of GOSDT to implement the algorithm\. This modification allows the algorithm to potentially find better trees \(and never a worse tree as we take the minimum over it and the leaf loss\) at no additional asymptotic cost\.

##### Caching and shared subproblem reuse within LicketySPLIT

We use a comprehensive caching strategy within LicketySPLIT: we cache every subproblem that it or its subroutine \(a greedy tree algorithm\) solves to report the proxy solution\. This will be helpful within the larger PRAXIS algorithm, but we first observe that the LicketySPLIT algorithm benefits from caching greedy recursive subproblems as it itself recurses\.

###### Theorem B\.1 \(Caching greedy solutions at subproblems gives provable cache reuse\)\.

Run $\textsc{LicketySPLIT}(D, d, \ell = 1)$ with remaining depth $d \geq 1$\. Assume that no early stopping occurs for LicketySPLIT and its greedy subroutine \(i\.e\. there is always a split that partitions the data and that $2\lambda < \text{leaf\_objective}$ at all depths $d > 0$\)\. Then, the total number of greedy\-cache hits incurred during the execution is at least $2^{d+1} - 4$\.

###### Proof\.

Under the stated assumption that no early stopping occurs, $\textsc{LicketySPLIT}(D, d, 1)$ always selects a non\-degenerate split at each node with positive remaining depth\. Thus, the tree returned by LicketySPLIT is a full binary tree of depth $d$\. Equivalently, the recursion tree of the execution is a full binary tree of depth $d$\. Additionally, as no early stopping occurs, the greedy tree algorithm will also return a full binary tree of depth $d$ at every subproblem with remaining depth $d$\.

Now fix any non\-root internal node $u$ of this recursion tree, with induced subproblem dataset $D_{u}$ and remaining depth $r \geq 1$ at $u$\. Let $p$ be the parent of $u$\. In the call at $p$, LicketySPLIT evaluates candidate splits by calling its greedy subroutine on the child subproblems\. In particular, the candidate split at $p$ that eventually leads to $u$ necessarily triggered a greedy call on $D_{u}$ with remaining depth $r$\. Therefore, the value for this greedy call is present in the greedy cache before the algorithm begins processing node $u$ itself\. However, it isn’t the greedy cache for this subproblem that matters – it is the fact that left and right greedy solutions are cached for some split at $u$ \(because the greedy continues; there is no early stopping\)\.

When LicketySPLIT processes node $u$, it again iterates over candidate splits and, for each split, requests two greedy completions: one for the left child subproblem and one for the right child subproblem\. Consider the specific feature $j^{\star}$ chosen by the cached greedy run on $D_{u}$\. By definition of the greedy recursion, that greedy call computed and cached the two recursive greedy subcalls on the children induced by $j^{\star}$\.

When LicketySPLIT later evaluates feature $j^{\star}$ among its candidate splits at node $u$, it requests exactly those two greedy child values\. Both are cache hits\. Thus, every non\-root internal node contributes at least two greedy\-cache reuses\.

It remains to count the non\-root internal nodes\. A full binary tree of depth $d$ has $2^{d} - 1$ internal nodes, including the root\. Hence it has $2^{d} - 2$ non\-root internal nodes\. Since each non\-root internal node contributes at least two greedy\-cache hits, the total number of greedy\-cache hits incurred during the execution is at least

$$2(2^{d} - 2) = 2^{d+1} - 4.$$ ∎
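As a quick sanity check of the count, the bound can be instantiated for small depths. The values below follow directly from the node counts used in the proof; the specific depths are chosen only for illustration.

```latex
% d = 1: the root is the only internal node, so there are no non-root internal nodes.
2\,(2^{1} - 2) = 0 = 2^{2} - 4
% d = 3: 2^3 - 1 = 7 internal nodes, of which 6 are not the root.
2\,(2^{3} - 2) = 12 = 2^{4} - 4
% d = 5: 2^5 - 1 = 31 internal nodes, of which 30 are not the root.
2\,(2^{5} - 2) = 60 = 2^{6} - 4
```

The bound is vacuous at depth 1 and roughly doubles with each additional level of depth.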

Though we do not prove it here, we note that the proof for $\ell \geq 2$ is essentially identical\. Calling LicketySPLIT\($\ell = 2$\) recursively calls LicketySPLIT\($\ell = 1$\), and $\Omega(2^{d})$ of those calls will be reused in the execution of LicketySPLIT\($\ell = 2$\)\. This is in addition to the greedy reuse within each LicketySPLIT\($\ell = 1$\) call\.

Beyond caching within the proxy algorithm \(which was strictly an improvement to LicketySPLIT\), this caching can also help reuse work in the recursive calls of PRAXIS as it explores subproblems within the budget $\varepsilon_{\text{abs}}$\. This does not hold for every proxy algorithm, but because LicketySPLIT\($\ell$\) recurses with the same parameters \(except decreasing the depth budget\), we can reuse the work it did if PRAXIS explores its initial split\. We discuss this more and quantify the benefits saved in [subsection B\.6](https://arxiv.org/html/2606.00202#A2.SS6)\.

### B\.6 Proxy Algorithm Choice for Efficient Caching

In Algorithm [18](https://arxiv.org/html/2606.00202#alg18), we defined a generalization of LicketySPLIT with a lookahead parameter $\ell$, with $\ell = 1$ yielding the behavior of our modified LicketySPLIT algorithm\. LicketySPLIT\($\ell = 1$\) repeatedly selects the split whose greedy completion yields the lowest objective value\. LicketySPLIT\($\ell \geq 2$\) repeatedly selects the split whose LicketySPLIT\($\ell - 1$\) completion yields the lowest objective value\. Combined with $\ell = 0$ \(which we define to be greedy\), this allows us to interpolate between linear time proxy algorithms and optimal ones\.

We choose LicketySPLIT for its favorable time–accuracy trade\-off, further improving its runtime via caching and its accuracy via optimal solvers at shallow depths \(the improvements are detailed in [subsection B\.5](https://arxiv.org/html/2606.00202#A2.SS5)\)\. We showed in [Theorem B\.1](https://arxiv.org/html/2606.00202#A2.Thmtheorem1) that our modified version of LicketySPLIT allows for substantial caching within a single call of LicketySPLIT\. Importantly, this proxy also exhibits a useful structure where each subtree of a LicketySPLIT tree corresponds to a LicketySPLIT call on the corresponding data subset\. This property, which also holds for all values of $\ell$, allows for caching across LicketySPLIT calls in PRAXIS\.

##### Interaction with PRAXIS\.

Consider an execution of PRAXIS at some subproblem $(D,d)$. Suppose a split is not pruned, which occurs whenever the proxy completions for its children satisfy the budget constraint. To evaluate that split, PRAXIS invokes the proxy on both child subproblems $(D_{L},d-1)$ and $(D_{R},d-1)$.

After budget refinement, PRAXIS recurses on these child subproblems\. At such a child, PRAXIS again evaluates candidate splits by invoking the proxy\. One of these candidate splits corresponds exactly to the split chosen by the proxy when it was originally applied to that child subproblem\. Thus, PRAXIS requests a proxy completion for a subproblem that the proxy itself has already constructed internally\. However, calling the proxy algorithm on these subproblems is not guaranteed to solve the same problem\.

For example, consider SPLIT with lookahead depth 2 from Babbar et al. ([2025](https://arxiv.org/html/2606.00202#bib.bib5)) without any postprocessing. In this configuration, SPLIT chooses the optimal first two splits, conditioned on the tree being completed greedily afterwards (and does not alter the greedy completion). If SPLIT is called again on a child after taking the first split of the SPLIT tree, it solves a slightly different problem. As a consequence, the SPLIT call on the two children is not fully cached (though one could imagine that many of the greedy calls would be).

To save fully on caching, we need the proxy algorithm, when implemented recursively, to recurse with the same parameters, just one depth lower (that is, it recurses with exactly the parameter set one would use to call the algorithm). This holds for LicketySPLIT($\ell$): although split selection depends on $\ell-1$, recursive calls are made with $\ell$ unchanged. This behavior corresponds precisely to the equality case of the refinement property in [3.1](https://arxiv.org/html/2606.00202#S3.Thmtheorem1).
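The recursion pattern described here can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy data, the objective (misclassification rate plus a per-leaf penalty `LAM`), the greedy scoring rule used for $\ell=0$, the comparison against stopping at a leaf, and the names `proxy`, `choose_split`, `split`, and `leaf_obj` are all hypothetical. The point it tries to show is that split selection uses lookahead $\ell-1$, while the recursion on the chosen children keeps $\ell$ unchanged, so memoized child results are exactly what a later direct call on those children would request.

```python
from functools import lru_cache

# Hypothetical toy data: (binary feature tuple, label).
ROWS = (
    ((0, 0, 1), 0),
    ((0, 1, 1), 1),
    ((1, 0, 0), 1),
    ((1, 1, 0), 0),
    ((0, 0, 0), 0),
    ((1, 1, 1), 0),
)
N, K = len(ROWS), len(ROWS[0][0])
LAM = 0.01  # assumed per-leaf penalty


def leaf_obj(idx):
    """Misclassification rate of a majority-vote leaf plus the leaf penalty."""
    ones = sum(ROWS[i][1] for i in idx)
    return min(ones, len(idx) - ones) / N + LAM


def split(idx, j):
    """Partition row indices by the value of binary feature j."""
    left = tuple(i for i in idx if ROWS[i][0][j] == 0)
    right = tuple(i for i in idx if ROWS[i][0][j] == 1)
    return left, right


def choose_split(idx, depth, ell):
    """Feature whose lookahead-(ell-1) completion is best (None if no valid split)."""
    best_j, best_score = None, None
    for j in range(K):
        left, right = split(idx, j)
        if not left or not right:
            continue
        if ell == 0:  # assumed greedy rule: score children as leaves
            score = leaf_obj(left) + leaf_obj(right)
        else:  # score by completing both children with lookahead ell - 1
            score = proxy(left, depth - 1, ell - 1) + proxy(right, depth - 1, ell - 1)
        if best_score is None or score < best_score:
            best_j, best_score = j, score
    return best_j


@lru_cache(maxsize=None)
def proxy(idx, depth, ell):
    """Objective reached by the lookahead-ell heuristic on subproblem (idx, depth)."""
    as_leaf = leaf_obj(idx)
    if depth == 0:
        return as_leaf
    j = choose_split(idx, depth, ell)
    if j is None:
        return as_leaf
    left, right = split(idx, j)
    # Same parameters, one depth lower: ell is NOT decremented here.
    return min(as_leaf, proxy(left, depth - 1, ell) + proxy(right, depth - 1, ell))


root = tuple(range(N))
greedy_obj = proxy(root, 3, 0)
lookahead_obj = proxy(root, 3, 1)
```

After `proxy(root, 3, 1)` has run, calling `proxy` on either child of the chosen root split with depth 2 and $\ell=1$ is answered from the memo table, which mirrors the reuse PRAXIS obtains when it explores the proxy's initial split.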

##### Role of the refinement property\.

The refinement condition

$$
\begin{aligned}
\textsc{PROXY}(D_{L},\gamma,d-1) &\leq \text{Obj}(f_{L},\gamma,D_{L}) \\
\wedge\quad \textsc{PROXY}(D_{R},\gamma,d-1) &\leq \text{Obj}(f_{R},\gamma,D_{R})
\end{aligned}
\tag{52}
$$

ensures that when the proxy is reapplied to a subproblem induced by its own tree, the resulting objective does not worsen. If equality holds, the subtree is preserved; if the inequality is strict, the subtree is refined to a better one. In either case, [Theorem A.3](https://arxiv.org/html/2606.00202#A1.Thmtheorem3) shows that all invariants of PRAXIS are maintained, including feasibility and monotonicity of minimum objectives. However, we prefer the equality case for additional caching benefits.

##### Failure without refinement\.

If a proxy algorithm violates the refinement condition, that is, if

$$
\neg\!\left(\textsc{PROXY}(D_{L},\gamma,d-1)\leq\text{Obj}(f_{L},\gamma,D_{L})\;\wedge\;\textsc{PROXY}(D_{R},\gamma,d-1)\leq\text{Obj}(f_{R},\gamma,D_{R})\right),
\tag{53}
$$

so that reapplying the proxy to a subtree can worsen its objective, then key invariants of PRAXIS may fail. In particular, a split chosen by the proxy at a parent node may later be pruned because the sum of proxy objectives conditioned on that split exceeds the allowed budget. This breaks the invariant that the minimum objective returned by each recursive call of PRAXIS is at least as good as the proxy objective for that subproblem, as PRAXIS is not guaranteed to recover the proxy's tree.

##### Giving other decision tree algorithms the refinement property via caching\.

Any decision tree algorithm that lacks the refinement property can be modified to satisfy it by caching all subtree solutions that it implicitly constructs in yielding the top-level objective. For each subproblem $(D,d)$, one would maintain a set of candidate completions produced during execution, both from direct calls to the algorithm and from subtrees constructed as part of larger trees. The proxy value is then defined as the minimum objective over this set:

$$
\textsc{PROXY}(D,\lambda,d)\;:=\;\min\{\mathrm{Obj}(f,D) : f \text{ constructed for } (D,d)\}.
$$

With this modification, reapplying the proxy can only preserve or improve the subtree objectives (because it always minimizes over a set that includes the old subtree), ensuring the refinement property and restoring all invariants required by PRAXIS.
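A wrapper of this kind might look as follows. This is a hedged sketch, not code from the PRAXIS repository: the class `RefiningProxy`, the `record` callback protocol, and the toy solver are all hypothetical, and subproblems are identified by an arbitrary hashable `key`. The wrapped heuristic is assumed to report every subtree it builds through `record`, so that the wrapper can return the minimum objective seen so far for each `(key, depth)`.

```python
class RefiningProxy:
    """Wrap a heuristic so that re-queries never return a worse objective."""

    def __init__(self, solve):
        # solve(key, depth, record) -> objective. It is assumed to call
        # record(sub_key, sub_depth, sub_objective) for each subtree it builds.
        self._solve = solve
        self._best = {}

    def record(self, key, depth, objective):
        """Keep the minimum objective observed for subproblem (key, depth)."""
        old = self._best.get((key, depth))
        if old is None or objective < old:
            self._best[(key, depth)] = objective

    def __call__(self, key, depth):
        objective = self._solve(key, depth, self.record)
        self.record(key, depth, objective)
        return self._best[(key, depth)]


def toy_solver(key, depth, record):
    # Hypothetical heuristic without the refinement property: as part of the
    # "root" tree it builds a subtree with objective 0.10 for "child", but a
    # direct call on "child" only reaches 0.15.
    if key == "root":
        record("child", depth - 1, 0.10)
        return 0.30
    return 0.15


wrapped = RefiningProxy(toy_solver)
root_value = wrapped("root", 2)
child_value = wrapped("child", 1)  # min over {0.10 (old subtree), 0.15 (direct call)}
```

The design choice is to store only objectives, not trees; an implementation that needs the refined subtree itself would store it alongside the minimum.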

##### Comparing proxy algorithms with and without these caching benefits

In [Table 3](https://arxiv.org/html/2606.00202#A2.T3), we compare the resource usage of PRAXIS when using two different proxy algorithms with the same asymptotic time complexity, $\mathcal{O}(nk^{3}d^{3})$.

The first proxy is our generalization of LicketySPLIT, characterized by a lookahead parameter $\ell=2$. This proxy recursively selects the best split using LicketySPLIT completions, where each completion is itself obtained by choosing the best split according to greedy completions.

We compare this to a second decision tree algorithm, which we refer to as BlockSPLIT (we define this algorithm to make a direct comparison). BlockSPLIT is obtained by recursively applying the SPLIT procedure of Babbar et al. ([2025](https://arxiv.org/html/2606.00202#bib.bib5)) without postprocessing, using a block size (lookahead depth) of 2. Concretely, BlockSPLIT chooses the first two splits conditioned on greedy completions, then chooses the next two splits conditioned on greedy completions, and so on.

BlockSPLIT does not satisfy the equality case of the proxy algorithm refinement property\. This is because it recurses with an internal bit that tracks whether there is one remaining split to choose in the current block, or whether the algorithm is restarting a new block of two splits\. Furthermore, if one refines a tree produced by BlockSPLIT, the block optimization may be offset by one level relative to how the tree was originally constructed\. In this case, calling the algorithm on a node of its own tree can produce a strictly worse solution than the subtree already used\.

To address this, we define a wrapper that turns BlockSPLIT into a valid proxy algorithm\. Instead of directly returning the objective produced by BlockSPLIT, the wrapper checks whether an objective for the same subproblem was previously computed with the internal bit in the opposite position\. Since the algorithm may encounter the same subproblem in either state – once when entered externally \(with the bit indicating two more splits must be done in the block\) and once internally with the bit flipped – the wrapper returns the better of the two objectives\.

The results in [Table 3](https://arxiv.org/html/2606.00202#A2.T3) show that, although both proxy algorithms have the same asymptotic time complexity, using $\ell=2$ leads to substantially better runtime and memory usage. This is because $\ell=2$ makes much better use of caching, as motivated earlier.

Table 3: Runtime and peak memory comparison between PRAXIS with $\ell=2$ and a proxy algorithm of the same time complexity without caching benefits. $\lambda=0.01$, $\varepsilon=0.01$, depth $=5$. Time (s) and Peak Memory (MB) are reported for the entire script.

| Dataset | $n$ | $k$ | PRAXIS ($\ell=2$): Time (s) | PRAXIS ($\ell=2$): Peak MB | Less cache-friendly proxy: Time (s) | Less cache-friendly proxy: Peak MB |
|---|---|---|---|---|---|---|
| Adult | 48,842 | 209 | 2,393.00 | 5,026.95 | 4,730.00 | 21,316.95 |
| Bank | 45,211 | 217 | 962.20 | 2,061.88 | 3,130.70 | 19,391.88 |
| Bike | 17,379 | 164 | 478.80 | 2,760.59 | 1,113.30 | 14,300.59 |
| Christine | 5,418 | 231 | 9,946.00 | 139,833.82 | – | – |
| Churn | 5,000 | 472 | 13,645.90 | 76,252.34 | – | – |
| Covertype | 581,012 | 96 | 1,979.60 | 1,301.23 | 2,760.50 | 1,669.83 |
| Credit | 30,000 | 225 | 1,401.40 | 3,826.30 | 5,158.40 | 39,868.30 |
| Diabetes | 253,680 | 121 | 945.40 | 1,037.27 | 2,604.20 | 5,166.67 |
| Electricity | 38,474 | 264 | 18,210.60 | 45,175.44 | 22,793.00 | 80,906.44 |
| Helena | 65,196 | 156 | 540.50 | 1,913.96 | 1,090.60 | 7,849.96 |
| Higgs | 11,000,000 | 84 | 54,014.00 | 21,537.79 | 110,377.20 | 21,537.79 |
| Jannis | 57,580 | 247 | 43,428.00 | 150,035.83 | – | – |
| Jasmine | 2,984 | 207 | 257.00 | 4,314.29 | 1,211.70 | 30,794.29 |
| Madeline | 3,140 | 76 | 27.30 | 988.02 | 54.54 | 2,019.72 |
| Madelon | 2,000 | 186 | 929.70 | 16,384.99 | 3,809.50 | 117,292.99 |
| Magic | 19,020 | 167 | 2,349.60 | 16,237.19 | 3,032.50 | 28,859.19 |
| News | 39,644 | 196 | 7,942.00 | 20,999.32 | – | – |
| Poker | 1,025,010 | 40 | 1,129.40 | 952.10 | 1,136.10 | 952.10 |
| Shopping | 12,330 | 243 | 2,058.60 | 14,315.35 | 5,081.00 | 73,797.35 |

### B\.7Subproblem Representation

A subproblem in a decision tree search is most naturally identified by the subset of samples it contains (together with the remaining depth). This representation is used by Xin et al. ([2022](https://arxiv.org/html/2606.00202#bib.bib80)), who encode subproblems as bitvectors of length $n$, but this can be prohibitively memory-intensive for large datasets.

An alternative representation, supported by Arslan et al. ([2026](https://arxiv.org/html/2606.00202#bib.bib3)), identifies a subproblem by the set of feature splits (and branch directions) taken to reach it. While this representation does not grow with the sample size, distinct sequences of splits can induce the same subset of samples, causing identical subproblems to be treated as different and solved repeatedly.

This representation is commonly implemented by storing a canonical sorted list of literals. Each literal encodes a feature index $i$ and branch direction as a 16-bit integer $2i+b$, where $b\in\{0,1\}$ indicates whether the path takes the true or false branch. Sorting enforces a canonical order, since the conjunction of split conditions is commutative and permutations of splits induce the same subproblem.

This representation is more compact than an alternative encoding using two feature-sized bitvectors (one indicating which features are split on and one encoding branch directions) because the number of literals is bounded by the depth. For a depth budget of $d=5$, the representation stores at most five 16-bit literals, requiring 80 bits per subproblem. This is more space-efficient than the feature-bitvector encoding whenever the number of features exceeds 40, which is the case for nearly all of our datasets.
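The literal encoding and the size comparison can be made concrete with a short sketch. The function names are hypothetical, and the convention that $b=1$ marks the true branch is an assumption (the text does not say which branch maps to which bit).

```python
def literal_key(path):
    """Canonical itemset key for a path of (feature_index, branch_bit) pairs.

    Each literal is the integer 2*i + b; sorting makes the key independent
    of the order in which the splits were taken.
    """
    return tuple(sorted(2 * i + b for i, b in path))


def literal_key_bits(depth, bits_per_literal=16):
    """Worst-case key size: at most `depth` literals of 16 bits each."""
    return depth * bits_per_literal


def feature_bitvector_bits(num_features):
    """Two feature-sized bitvectors: 'is split on' and 'branch direction'."""
    return 2 * num_features


# Two orderings of the same splits map to the same key.
key_a = literal_key([(7, 1), (3, 0)])
key_b = literal_key([(3, 0), (7, 1)])
```

With depth 5 the literal key needs at most `literal_key_bits(5)` = 80 bits, while the two-bitvector encoding needs `feature_bitvector_bits(k)` = $2k$ bits, so the literal key is smaller exactly when $k > 40$.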

Our approach combines the advantages of both representations\. We define subproblems canonically by the subset of samples they contain, but store this information implicitly via a 64\-bit fingerprint computed from the corresponding bitvector \(together with the remaining depth\)\. This compression is possible because the dominant combinatorial growth in the search space comes from the number of features and the remaining depth, rather than from the number of samples\.
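As an illustration of this idea, the sketch below packs a sample subset into a bitvector and reduces it to a 64-bit fingerprint. The source does not specify which hash function PRAXIS uses; BLAKE2b truncated to 8 bytes is an assumed stand-in here, and `sample_mask`, `fingerprint64`, and `cache_key` are hypothetical names.

```python
import hashlib


def sample_mask(indices, n):
    """Pack a subset of sample indices into an n-bit bitvector (as bytes)."""
    mask = 0
    for i in indices:
        mask |= 1 << i
    return mask.to_bytes((n + 7) // 8, "little")


def fingerprint64(mask_bytes):
    """64-bit fingerprint of a bitvector (assumed hash: BLAKE2b, 8-byte digest)."""
    digest = hashlib.blake2b(mask_bytes, digest_size=8).digest()
    return int.from_bytes(digest, "little")


def cache_key(indices, n, depth):
    """Constant-size cache key: (fingerprint of the sample subset, remaining depth)."""
    return (fingerprint64(sample_mask(indices, n)), depth)


# The key depends only on which samples are present, not on how they were reached.
key_1 = cache_key([0, 5, 9], n=16, depth=3)
key_2 = cache_key([9, 0, 5], n=16, depth=3)
```

The key stays at 64 bits plus the depth regardless of $n$, whereas the exact bitvector key grows linearly with the number of samples.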

The only trade\-off is a vanishingly small probability of hash collisions\. In practice, this probability is negligible\. Moreover, even in the unlikely event of a collision, the correctness of the AND/OR graph is unaffected in the sense that any tree materialized from the graph will still be valid and lie within the specified budget\. A collision may cause either additional trees to be explored or some trees to be missed, but it does not invalidate the feasibility of any returned tree\. This is because we do not cache subgraphs, only proxy algorithm evaluations\.

For our default proxy, a modified version of LicketySPLIT, we maintain two separate caches: one for the greedy algorithm and one for LicketySPLIT. The greedy cache is substantially larger, storing up to five million entries per depth for medium to medium-challenging Rashomon set queries. With a 64-bit fingerprint, one would expect roughly 1 in 1.5 million runs of the algorithm to have a collision (we provide a derivation at the end of the section). In contrast, collisions are virtually guaranteed with a 32-bit fingerprint at this scale. For substantially deeper queries or higher-dimensional feature spaces, the fingerprint size can be increased accordingly (noting that representations based on split sequences would also grow with either the number of features or the depth of the search space, so this is not unique to our approach).

[Table 4](https://arxiv.org/html/2606.00202#A2.T4) and [Table 5](https://arxiv.org/html/2606.00202#A2.T5) compare the three representations on datasets whose runtimes exceed one second, using $\lambda=0.007$, $\varepsilon=0.02$, and depth $d=5$. We display the number of binarized features after each dataset name.

| Dataset | Bitvector (Exact): Time (s) | Bitvector (Exact): Memory (MB) | Bitvector (Exact): Cache | Literal (Itemset): Time (s) | Literal (Itemset): Memory (MB) | Literal (Itemset): Cache | Hash (Us): Time (s) | Hash (Us): Memory (MB) | Hash (Us): Cache |
|---|---|---|---|---|---|---|---|---|---|
| Adult-209 | 358.7 | 15,964 | 2,620,799 | 603.0 | 1,117 | 6,458,004 | 339.1 | 393 | 2,620,799 |
| Bank-97 | 13.2 | 1,759 | 300,462 | 18.2 | 248 | 523,445 | 11.8 | 203 | 300,462 |
| Bank-217 | 191.9 | 11,709 | 2,001,930 | 327.4 | 794 | 4,275,476 | 183.6 | 352 | 2,001,930 |
| Bike-43 | 1.3 | 354 | 112,903 | 1.4 | 162 | 171,325 | 1.1 | 149 | 112,903 |
| Bike-164 | 171.3 | 9,507 | 4,504,463 | 292.9 | 1,500 | 10,262,367 | 157.0 | 441 | 4,504,463 |
| Chess-50 | 1.1 | 304 | 52,036 | 3.2 | 175 | 213,184 | 0.9 | 157 | 52,036 |
| Christine-80 | 31.7 | 2,848 | 3,376,062 | 29.9 | 694 | 4,085,162 | 25.6 | 373 | 3,376,062 |
| Christine-231 | 2,580.9 | 89,977 | 108,754,093 | 2,672.2 | 20,062 | 147,725,802 | 2,298.4 | 7,924 | 108,754,093 |
| Churn-81 | 2.0 | 347 | 302,935 | 5.9 | 293 | 1,200,148 | 1.5 | 154 | 302,935 |
| Churn-472 | 657.5 | 14,979 | 19,435,936 | 4,639.5 | 30,121 | 210,901,558 | 612.1 | 1,341 | 19,435,936 |
| Covertype-96 | 1,073.2 | 75,423 | 1,217,146 | 2,477.0 | 1,364 | 3,812,675 | 1,020.8 | 1,362 | 1,217,146 |

Table 4: Subproblem-key comparison at depth $d=5$, $\lambda=0.007$, and $\varepsilon=0.02$. Peak memory is reported in MB. Cache is the total number of cached subproblems (cache size).

| Dataset | Literal/BV Cache | BV/Hash MB | Lit/Hash MB | BV/Hash Time | Lit/Hash Time |
|---|---|---|---|---|---|
| Adult-209 | 2.464 | 40.631 | 2.842 | 1.058 | 1.778 |
| Bank-97 | 1.742 | 8.646 | 1.217 | 1.117 | 1.543 |
| Bank-217 | 2.136 | 33.233 | 2.253 | 1.045 | 1.783 |
| Bike-43 | 1.517 | 2.379 | 1.089 | 1.211 | 1.312 |
| Bike-164 | 2.278 | 21.552 | 3.401 | 1.091 | 1.865 |
| Chess-50 | 4.096 | 1.938 | 1.120 | 1.146 | 3.499 |
| Christine-80 | 1.210 | 7.642 | 1.862 | 1.238 | 1.166 |
| Christine-231 | 1.358 | 11.354 | 2.532 | 1.123 | 1.162 |
| Churn-81 | 3.962 | 2.257 | 1.909 | **1.301** | 3.810 |
| Churn-472 | **10.847** | 11.163 | **22.451** | 1.074 | **7.579** |
| Covertype-96 | 3.132 | **55.376** | 1.001 | 1.051 | 2.426 |

Table 5: Ratio comparison between the Literal, Bitvector (BV), and Hash subproblem representations at depth $d=5$, $\lambda=0.007$, and $\varepsilon=0.02$. The maximal ratio in each column is bolded.

These tables make the core trade-off very clear. The bitvector representation can be catastrophically memory-intensive for larger datasets such as Covertype. There, it uses 75 GB, versus 1.3 GB for either alternative, and the gap would only grow if we considered a dataset with $20\times$ more samples, such as Higgs.

The literal/itemset representation eliminates the dependence on the sample size, but [Table 5](https://arxiv.org/html/2606.00202#A2.T5) shows that this comes at the cost of substantial cache inflation. Although each individual key is compact, distinct split sequences can correspond to the same subset of samples, requiring multiple itemset representations, whereas a single bitvector would suffice. As a result, the literal representation can actually lead to increased runtime and memory consumption.

Quantitatively, the literal representation uses more cache entries than bitvector/hash (e.g., Adult-209: $2.46\times$; Bike-164: $2.28\times$; Churn-472: $10.85\times$), reflecting that multiple distinct split sequences can map to the same subset of samples and therefore prevent reuse. This difference grows as the number of features increases.

In contrast, the hash fingerprint preserves the canonical identity while keeping keys constant in size \(and thus reducing memory consumption\)\. In these experiments, it also matches the bitvector cache sizes exactly, as we have never observed a collision that impacted the output of PRAXIS\.

The core takeaway from the experiments is that the hash fingerprint is more memory efficient than both exact bitvector and itemset representations, and faster than using itemsets. In this ablation, we are also slightly faster than exact bitvectors, though this difference is not major and was not used to inform our choice of subproblem representation. The reason for this minor speedup is our use of interning for the ablative exact bitvector baseline. Interning is a memory-saving technique in which each distinct object is stored only once, and subsequent uses of the same object refer to that shared copy via a compact identifier. In our setting, this means storing each distinct bitvector mask exactly once and using a small integer ID in cache keys, rather than repeatedly copying the full bitvector. This approach decreases the memory consumption of the bitvector subproblem representation at a slight cost in speed. This time-memory trade-off via interning is insignificant compared to the larger gains we see from introducing the hash fingerprint representation.
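A minimal sketch of interning as described here, assuming bitvector masks are held as `bytes`. The class name `MaskInterner` and its interface are hypothetical and are not taken from the PRAXIS code.

```python
class MaskInterner:
    """Store each distinct bitvector mask once and hand out small integer IDs."""

    def __init__(self):
        self._ids = {}  # mask bytes -> integer ID

    def intern(self, mask: bytes) -> int:
        # A new mask receives the next unused ID; a known mask reuses its ID.
        return self._ids.setdefault(mask, len(self._ids))

    def __len__(self):
        return len(self._ids)


interner = MaskInterner()
greedy_cache = {}

# Cache keys carry the small ID instead of a copy of the full mask.
mask_a = bytes([0b00000101])
mask_b = bytes([0b00000011])
greedy_cache[(interner.intern(mask_a), 3)] = 0.12
greedy_cache[(interner.intern(mask_b), 3)] = 0.20
greedy_cache[(interner.intern(mask_a), 2)] = 0.15  # same mask, different depth
```

The extra dictionary lookup on every cache access is the speed cost mentioned above; the saving is that a mask shared by several cache entries is stored only once.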

##### Derivation of hash collisions result \(under uniform hashing assumptions\):

When using the modified LicketySPLIT proxy algorithm, PRAXIS maintains two types of subproblem caches: one for subproblems encountered by LicketySPLIT and one for subproblems encountered by the greedy subroutine. Each cache key includes both a subproblem identifier and the remaining depth, i.e., $(\mathrm{subproblem},\mathrm{depth})$. Equivalently, we can view this as maintaining a separate cache for each pair consisting of a cache type and a remaining depth. Since there are two cache types and a depth limit of $d$, we analyze up to $2d$ caches. Each of these caches maps a subproblem (a 64-bit fingerprint, a bitvector, or an itemset) to the corresponding objective returned by either LicketySPLIT or greedy for that subproblem and remaining depth. We do not assume that these caches are independent.

We analyze collisions under the standard idealized model in which the 64-bit hash function behaves as a uniform random mapping from distinct subproblem identifiers (bitvectors) to $\{0,\dots,2^{64}-1\}$. Let $c\in\{1,\dots,2d\}$ index the caches, with cache $c$ storing $n_{c}$ unique subproblem identifiers (bitvectors), each hashed uniformly to one of $M=2^{64}$ different values. Let $A_{c}$ denote the event that there exists a collision among the $n_{c}$ bitvectors stored in cache $c$. We are interested in bounding the probability that a collision occurs anywhere, so we define the event

$$
A\;:=\;\bigcup_{c=1}^{2d}A_{c}.
\tag{54}
$$
By the union bound,

$$
\Pr(A)\;=\;\Pr\!\left(\bigcup_{c=1}^{2d}A_{c}\right)\;\leq\;\sum_{c=1}^{2d}\Pr(A_{c}).
\tag{55}
$$
Now fix a cache $c$. Under standard hashing assumptions, the probability that no collision occurs among the $n_{c}$ bitvectors is

$$
\Pr(A_{c}^{c})\;=\;\prod_{i=0}^{n_{c}-1}\left(1-\frac{i}{M}\right).
\tag{56}
$$

Consequently,

$$
\Pr(A_{c})\;=\;1-\prod_{i=0}^{n_{c}-1}\left(1-\frac{i}{M}\right).
\tag{57}
$$
Substituting ([57](https://arxiv.org/html/2606.00202#A2.E57)) into the union bound ([55](https://arxiv.org/html/2606.00202#A2.E55)) yields

$$
\Pr(A)\;\leq\;\sum_{c=1}^{2d}\left(1-\prod_{i=0}^{n_{c}-1}\left(1-\frac{i}{2^{64}}\right)\right).
\tag{58}
$$
As long as $n_{c}\ll\sqrt{M}=2^{32}$ (which holds for all Rashomon set problems considered), the collision probability admits the standard birthday approximation:

$$
\begin{aligned}
1-\prod_{i=0}^{n_{c}-1}\left(1-\frac{i}{M}\right) &\approx 1-\exp\!\left(-\frac{n_{c}(n_{c}-1)}{2M}\right) \\
&\leq \frac{n_{c}(n_{c}-1)}{2M}.
\end{aligned}
\tag{59}
$$

The approximation follows from the classical analysis of the birthday paradox (see, e.g., Chapter 3 of Motwani & Raghavan ([2013](https://arxiv.org/html/2606.00202#bib.bib58))), while the upper bound on the approximation uses the bound $1-e^{-x}\leq x$, which follows from $x-(1-e^{-x})$ taking only non-negative values.

Applying ([59](https://arxiv.org/html/2606.00202#A2.E59)) to ([58](https://arxiv.org/html/2606.00202#A2.E58)) gives the final bound

$$
\Pr(A)\;\lesssim\;\sum_{c=1}^{2d}\frac{n_{c}(n_{c}-1)}{2\cdot 2^{64}}.
\tag{60}
$$

Before plugging in numbers, we note two things. First, we do not cache subproblems with remaining depth $0$ as a design choice, since these cases are fast to compute. Second, the number of cached subproblems near the root of the search is small and insignificant compared to the number of subproblems cached at later depths, so the bound is usually dominated by one term.

We evaluate this bound using the three runs with the largest cache sizes shown in [Table 4](https://arxiv.org/html/2606.00202#A2.T4), along with the smallest run. For Rashomon set tasks with 50 binary features (one of the largest feature counts considered in Arslan et al. ([2026](https://arxiv.org/html/2606.00202#bib.bib3))), we expect approximately one collision in every 27.3 billion runs of PRAXIS. Scaling up to Churn with 472 binary features, the odds increase to about one in 154,500. Even for the Christine dataset, which had the largest cache size, the probability of a collision is still only about one in 3,583. While these odds could be further reduced by using a larger fingerprint, we find a 64-bit fingerprint to be more than sufficient for the problem sizes considered here.

| Run | Nonzero cache sizes $(n_{c})$ | $\Pr(\text{collision anywhere})$ | Odds |
|---|---|---|---|
| Bike-164 | $\{3.68\times 10^{6},\;7.20\times 10^{5},\;5.29\times 10^{4},\;3.74\times 10^{4},\;8.81\times 10^{3},\;318,\;318,\;1\}$ | $3.82\times 10^{-7}$ | $1$ in $2.6\times 10^{6}$ |
| Chess-50 | $\{3.41\times 10^{4},\;1.34\times 10^{4},\;2.36\times 10^{3},\;1.45\times 10^{3},\;518,\;74,\;74,\;1\}$ | $3.67\times 10^{-11}$ | $1$ in $2.7\times 10^{10}$ |
| Christine-231 | $\{1.01\times 10^{8},\;6.69\times 10^{6},\;6.62\times 10^{5},\;1.03\times 10^{5},\;4.08\times 10^{4},\;462,\;462,\;1\}$ | $2.79\times 10^{-4}$ | $1$ in $3.6\times 10^{3}$ |
| Churn-472 | $\{1.49\times 10^{7},\;4.24\times 10^{6},\;2.24\times 10^{5},\;8.83\times 10^{4},\;2.43\times 10^{4},\;790,\;790,\;1\}$ | $6.47\times 10^{-6}$ | $1$ in $1.5\times 10^{5}$ |

Table 6: Probability of at least one 64-bit fingerprint collision across all $2d$ distinct caches
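The bound in (60) is straightforward to evaluate numerically. The sketch below plugs in the rounded cache sizes listed in Table 6; because those sizes are rounded to three significant figures, the results agree with the tabulated probabilities only up to rounding. The function name `collision_bound` is hypothetical.

```python
def collision_bound(cache_sizes, bits=64):
    """Union bound with the birthday approximation: sum_c n_c (n_c - 1) / (2 * 2^bits)."""
    m = 2 ** bits
    return sum(n * (n - 1) / (2 * m) for n in cache_sizes)


# Rounded nonzero cache sizes as listed in Table 6.
TABLE6 = {
    "Bike-164": [3.68e6, 7.20e5, 5.29e4, 3.74e4, 8.81e3, 318, 318, 1],
    "Chess-50": [3.41e4, 1.34e4, 2.36e3, 1.45e3, 518, 74, 74, 1],
    "Christine-231": [1.01e8, 6.69e6, 6.62e5, 1.03e5, 4.08e4, 462, 462, 1],
    "Churn-472": [1.49e7, 4.24e6, 2.24e5, 8.83e4, 2.43e4, 790, 790, 1],
}

bounds = {run: collision_bound(sizes) for run, sizes in TABLE6.items()}
for run, p in bounds.items():
    print(f"{run}: Pr <= {p:.2e}, about 1 in {1 / p:,.0f}")

# A single cache of five million entries, as discussed for the greedy cache.
single_cache = collision_bound([5_000_000])
```

In each run the largest cache contributes almost the entire sum, which is the sense in which the bound is dominated by one term.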

### B\.8Caching Ablation

We now evaluate the importance of caching (with the 64-bit fingerprint) via an ablation. Note that the runtime result detailed in [Theorem 3.2](https://arxiv.org/html/2606.00202#S3.Thmtheorem2) still holds without caching, but caching provably yields practical savings within our modified LicketySPLIT proxy algorithm and within the AND/OR graph expansion.

[Table 7](https://arxiv.org/html/2606.00202#A2.T7) reports the results. Across datasets with small Rashomon sets (Diabetes, Credit, Bank), PRAXIS is able to run without any additional memory beyond the peak memory used when loading the dataset. In contrast, with aggressive caching of proxy algorithms, PRAXIS could use up to 4 GB of memory. In these cases, PRAXIS typically runs an order of magnitude faster with caching. Given that 4 GB of memory is not prohibitively large, we use caching in our default implementation.

For datasets with larger Rashomon sets (such as Christine), caching still leads to more memory use, but it is dwarfed by the size of the AND/OR graph needed to encode these trees. With the 64-bit fingerprint, the amount of memory needed specifically for caching is unlikely to determine the feasibility of the run; it would alter feasibility with other subproblem representations, such as a full bitvector.

| Dataset | Trees | Cache OFF: Time (s) | Cache OFF: PeakMem (GB) | Cache ON: Time (s) | Cache ON: PeakMem (GB) | Speedup ($\times$) |
|---|---|---|---|---|---|---|
| Bank-217 | 215,823 | 72,239 | 0.28 | 6,213 | 3.05 | 11.6 |
| Christine-231 | $9.92\times 10^{10}$ | 4,006 | 96.58 | 855 | 97.68 | 4.7 |
| Churn-472 | $5.78\times 10^{9}$ | 44,208 | 5.59 | 2,270 | 8.60 | 19.5 |
| Covertype-45 | $5.04\times 10^{9}$ | 4,465 | 2.41 | 984 | 2.42 | 4.5 |
| Credit-225 | 36,140 | 62,044 | 0.23 | 5,977 | 4.24 | 10.4 |
| Diabetes-121 | 124 | 17,471 | 0.68 | 2,168 | 0.76 | 8.1 |
| Jasmine-207 | $3.11\times 10^{9}$ | 14,378 | 10.40 | 988 | 14.45 | 14.6 |
| Wine-64 | 5,756,601 | 531 | 0.52 | 37 | 0.75 | 14.4 |

Table 7: PRAXIS with and without caching subproblems solved by proxy algorithms. $\lambda=0.003$, $\varepsilon=0.03$, $d=5$

## Appendix CDatasets and Binarization

We provide a table of all 51 datasets and binarizations used in our experiments. For all datasets, we remove rows containing missing values and one-hot encode all categorical variables. Since all existing Rashomon set algorithms (including ours) operate on binary features, all continuous features are binarized prior to training. For most datasets, we use the threshold-guessing binarization procedure of McTavish et al. ([2022](https://arxiv.org/html/2606.00202#bib.bib52)), which can handle both categorical and continuous features. This procedure trains a gradient-boosted decision tree ensemble with a specified number of estimators and depth cutoffs. It then collects all of the thresholds generated, orders them by Gini variable importance, and removes the least important thresholds iteratively. To remove any dependence on the construction of the binary dataset, we use a large range of GBDT parameters: from depth 1 to depth 7, and from 15 estimators to 500 estimators. On smaller datasets such as Monk2, we took all thresholds, as it was manageable to solve using all of them. On very large datasets such as Higgs, we used quantile binarization on numerical features to reduce the computation cost of training a reference ensemble.

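| Dataset | Samples | Features |
|---|---|---|
| Adult | 48842 | 14 |
| Adult | 48842 | 209 |
| Aging | 714 | 57 |
| Bank | 45211 | 97 |
| Bank | 45211 | 217 |
| Bike | 17379 | 43 |
| Bike | 17379 | 164 |
| Chess | 28056 | 50 |
| Christine | 5418 | 80 |
| Christine | 5418 | 231 |
| Churn | 5000 | 81 |
| Churn | 5000 | 472 |
| Compas | 4966 | 44 |
| Covertype | 581012 | 45 |
| Covertype | 581012 | 96 |
| Credit | 30000 | 134 |
| Credit | 30000 | 225 |
| Diabetes | 253680 | 33 |
| Diabetes | 253680 | 121 |
| Droid | 29332 | 84 |
| Electricity | 38474 | 94 |
| Electricity | 38474 | 264 |
| Heart | 297 | 42 |
| Helena | 65196 | 84 |
| Helena | 65196 | 156 |
| Heloc | 2502 | 65 |
| Higgs | 11000000 | 84 |
| IOT | 123117 | 86 |
| Jannis | 57580 | 106 |
| Jannis | 57580 | 247 |
| Jasmine | 2984 | 51 |
| Jasmine | 2984 | 207 |
| Madeline | 3140 | 76 |
| Madeline | 3140 | 451 |
| Madelon | 2000 | 73 |
| Madelon | 2000 | 186 |
| Magic | 19020 | 80 |
| Magic | 19020 | 167 |
| Monk2 | 601 | 17 |
| Mushroom | 8124 | 13 |
| News | 39644 | 196 |
| Phishing | 11055 | 44 |
| Poker | 1025010 | 40 |
| Shopping | 12330 | 112 |
| Shopping | 12330 | 243 |
| Spambase | 4601 | 24 |
| Spambase | 4601 | 67 |
| Student | 649 | 48 |
| Taxi | 1224158 | 27 |
| TicTacToe | 958 | 26 |
| Wine | 6497 | 64 |

Table 8: Datasets with number of features in their binarizations

##### Adult (Becker & Kohavi, [1996](https://arxiv.org/html/2606.00202#bib.bib10)) (48,842 samples, 14 and 209 features)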

Predict whether an individual earns more than 50,000 per year based on demographic and occupational attributes\. Two binarizations are used, produced via threshold guessing with \(i\) 40 depth\-1 estimators and \(ii\) 400 depth\-2 estimators\.

##### Aging (Malani et al., [2019](https://arxiv.org/html/2606.00202#bib.bib47)) (714 samples, 57 features)

Predict whether an individual has visited at least two doctors\. Binarized using threshold guessing with 100 estimators of depth 7\.

##### Bank (Moro et al., [2014a](https://arxiv.org/html/2606.00202#bib.bib56), [b](https://arxiv.org/html/2606.00202#bib.bib57)) (45,211 samples, 97 and 217 features)

Predict whether a client subscribes to a term deposit following a marketing campaign\. Two binarizations are generated using \(i\) 100 depth\-3 estimators and \(ii\) 400 depth\-1 estimators\.

##### Bike (Fanaee-T & Gama, [2013](https://arxiv.org/html/2606.00202#bib.bib33); Fanaee-T, [2013](https://arxiv.org/html/2606.00202#bib.bib32)) (17,379 samples, 43 and 164 features)

Predict whether daily bike rental demand exceeds the median\. Two binarizations are used: \(i\) depth\-3 trees with 250 estimators and \(ii\) depth\-1 trees with 250 estimators\.

##### Chess (Bain & Hoff, [1994](https://arxiv.org/html/2606.00202#bib.bib7)) (28,056 samples, 50 features)

Determine whether White can force a win within a fixed number of plies, chosen to balance class labels\. Binarized using threshold guessing with a 100\-estimator ensemble of depth 10\.

##### Christine (OpenML, [2018](https://arxiv.org/html/2606.00202#bib.bib59); Guyon et al., [2019](https://arxiv.org/html/2606.00202#bib.bib37)) (5,418 samples, 80 and 231 features)

Binary classification using the provided target column\. Two binarizations are constructed using 40 estimators with depths 2 and 3\.

##### Churn (Erickson et al., [2025](https://arxiv.org/html/2606.00202#bib.bib31); Marcoulides, [2005](https://arxiv.org/html/2606.00202#bib.bib48)) (5,000 samples, 81 and 472 features)

Predict whether a customer will churn\. Two binarizations are used, generated with 40 estimators of depth 5 and depth 3\.

##### COMPAS (Bao et al., [2021](https://arxiv.org/html/2606.00202#bib.bib9)) (4,966 samples, 44 features)

Predict whether a defendant will recidivate within two years\. Binarized using 400 depth\-1 estimators\.

##### Covertype (Blackard, [1998](https://arxiv.org/html/2606.00202#bib.bib13)) (581,012 samples, 45 and 96 features)

Predict whether a forest plot belongs to cover type 2\. Two binarizations are used: 52 estimators of depth 1 and 46 estimators of depth 2\.

##### Credit (Yeh, [2009](https://arxiv.org/html/2606.00202#bib.bib81); Yeh & Lien, [2009](https://arxiv.org/html/2606.00202#bib.bib82)) (30,000 samples, 134 and 225 features)

Predict whether a client will default on their credit card payment in the following month\. Two binarizations are generated using 40 estimators with depths 4 and 3\.

##### Diabetes (Burrows et al., [2017](https://arxiv.org/html/2606.00202#bib.bib18)) (253,680 samples, 33 and 121 features)

Predict whether an individual is diabetic\. Continuous features are binarized using quantile thresholds, with two configurations using approximately 6 and 500 candidate quantiles per feature\.

##### Droid (Mathur et al., [2021](https://arxiv.org/html/2606.00202#bib.bib50)) (29,332 samples, 84 features)

Predict whether an Android application is malicious\. Binarized using threshold guessing with 500 estimators of depth 5\.

##### Electricity (OpenML, [2022a](https://arxiv.org/html/2606.00202#bib.bib64)) (38,474 samples, 94 and 264 features)

Predict whether electricity prices increase relative to a 24\-hour moving average\. Two binarizations are used, generated with 40 estimators of depths 3 and 4\.

##### Heart (Janosi et al., [1989](https://arxiv.org/html/2606.00202#bib.bib42); Detrano et al., [1989](https://arxiv.org/html/2606.00202#bib.bib27)) (297 samples, 42 features)

Predict the presence of heart disease\. Binarized using 70 estimators of depth 2\.

##### Helena (OpenML, [2018a](https://arxiv.org/html/2606.00202#bib.bib60); Guyon et al., [2019](https://arxiv.org/html/2606.00202#bib.bib37)) (65,196 samples, 84 and 156 features)

Predict whether an instance belongs to the most frequent class\. Two binarizations are generated using \(i\) 35 depth\-3 estimators and \(ii\) 40 depth\-2 estimators\.

##### Heloc (FICO, [2018](https://arxiv.org/html/2606.00202#bib.bib36)) (2,502 samples, 65 features)

Predict whether an individual is high\- or low\-risk for a home equity line of credit\. Binarized using 200 depth\-1 estimators\.

##### Higgs \(Whiteson, [2014](https://arxiv.org/html/2606.00202#bib.bib78); Baldi et al\., [2014](https://arxiv.org/html/2606.00202#bib.bib8)\) \(11,000,000 samples, 84 features\)

Predict whether a particle collision corresponds to signal or background noise\. Each continuous feature is binarized at the 25th, 50th, and 75th percentiles, estimated using a 200,000\-sample subsample and applied to the full dataset\.
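A minimal sketch of this percentile binarization for a single continuous feature is given below\. It estimates the three quartile cut points on a random subsample and then applies them to every row\. The function name `binarize_by_quartiles`, the `value <= threshold` indicator convention, and the interpolation rule used by `statistics.quantiles` are assumptions made for illustration; the text above does not specify them\.

```python
import random
import statistics


def binarize_by_quartiles(values, subsample_size, seed=0):
    """Return (thresholds, rows); each row holds three 0/1 indicators.

    Thresholds are quartile cut points estimated on a random subsample.
    The indicator convention (value <= threshold) is an assumption.
    """
    rng = random.Random(seed)
    size = min(subsample_size, len(values))
    subsample = rng.sample(values, size)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    thresholds = statistics.quantiles(subsample, n=4)
    rows = [[int(v <= t) for t in thresholds] for v in values]
    return thresholds, rows


# Toy feature (not data from the paper): the values 1.0, 2.0, ..., 100.0.
toy_feature = [float(i) for i in range(1, 101)]
toy_thresholds, toy_rows = binarize_by_quartiles(toy_feature, subsample_size=40)
print(toy_thresholds)
print(toy_rows[0], toy_rows[-1])
```

Estimating the cut points on a subsample keeps the sort cheap; applying them to the full data is a single pass per feature\.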

##### IOT \(S\. & Nagapadma, [2023](https://arxiv.org/html/2606.00202#bib.bib69); Sharmila & Nagapadma, [2023](https://arxiv.org/html/2606.00202#bib.bib73)\) \(123,117 samples, 86 features\)

Predict whether a network\-traffic record corresponds to an attack or normal behavior\. Each numerical feature is discretized into five binary indicators using evenly spaced thresholds\.

##### Jannis \(OpenML, [2022b](https://arxiv.org/html/2606.00202#bib.bib65); Guyon et al\., [2019](https://arxiv.org/html/2606.00202#bib.bib37)\) \(57,580 samples, 106 and 247 features\)

Binary classification using the provided target column\. Two binarizations are generated using 40 estimators with depths 3 and 2\.

##### Jasmine \(OpenML, [2018b](https://arxiv.org/html/2606.00202#bib.bib61); Guyon et al\., [2019](https://arxiv.org/html/2606.00202#bib.bib37)\) \(2,984 samples, 51 and 207 features\)

Binary classification using the provided target column\. Two binarizations are generated using 40 estimators with depths 3 and 4\.

##### Madeline \(OpenML, [2018c](https://arxiv.org/html/2606.00202#bib.bib62)\) \(3,140 samples, 76 and 451 features\)

Binary classification using the provided target column\. Two binarizations are generated using 40 estimators with depths 2 and 4\.

##### Madelon \(Guyon et al\., [2019](https://arxiv.org/html/2606.00202#bib.bib37)\) \(2,000 samples, 73 and 186 features\)

Synthetic dataset where points are labeled based on proximity to selected corners of a hypercube\. Two binarizations are generated using 40 estimators with depths 2 and 3\.

##### Magic \(Bock, [2004](https://arxiv.org/html/2606.00202#bib.bib15)\) \(19,020 samples, 80 and 167 features\)

Predict whether a Cherenkov telescope image corresponds to a gamma ray or background noise\. Two binarizations are generated using 35 estimators with depths 3 and 2\.

##### Monk2 \(Thrun, [1991](https://arxiv.org/html/2606.00202#bib.bib75); Wnek, [1993](https://arxiv.org/html/2606.00202#bib.bib79)\) \(601 samples, 17 features\)

Predict the logical rule where the label is 1 if exactly two of six attributes take value 1\. All possible thresholds are included\.

##### Mushroom \(Audubon Society Field Guide, [1981](https://arxiv.org/html/2606.00202#bib.bib4)\) \(8,124 samples, 13 features\)

Predict whether a mushroom is poisonous\. Binarized using 400 depth\-1 estimators\.

##### News \(Fernandes et al\., [2015b](https://arxiv.org/html/2606.00202#bib.bib35), [a](https://arxiv.org/html/2606.00202#bib.bib34)\) \(39,644 samples, 196 features\)

Predict whether an online news article’s number of shares exceeds the median\. Each feature is discretized into five binary indicators using quantile thresholds\.

##### Phishing \(Mohammad & McCluskey, [2012](https://arxiv.org/html/2606.00202#bib.bib54); Mohammad et al\., [2012](https://arxiv.org/html/2606.00202#bib.bib55)\) \(11,055 samples, 44 features\)

Predict whether a website is phishing or legitimate\. Binarized using 200 estimators of depth 2\.

##### Poker \(Cattral & Oppacher, [2002](https://arxiv.org/html/2606.00202#bib.bib19)\) \(1,025,010 samples, 40 features\)

Predict whether a five\-card hand is at least a pair\. Binarized using 40 estimators of depth 4\.

##### Shopping \(Sakar & Kastro, [2018](https://arxiv.org/html/2606.00202#bib.bib70); Sakar et al\., [2018](https://arxiv.org/html/2606.00202#bib.bib71)\) \(12,330 samples, 112 and 243 features\)

Predict whether an online shopping session ends in a purchase\. Two binarizations are generated using 40 estimators with depths 3 and 4\.

##### Spambase \(Hopkins et al\., [1999](https://arxiv.org/html/2606.00202#bib.bib38)\) \(4,601 samples, 24 and 67 features\)

Predict whether an email is spam\. Two binarizations are used: \(i\) 15 estimators of depth 4 and \(ii\) 40 estimators of depth 1\.

##### Student \(Cortez, [2008](https://arxiv.org/html/2606.00202#bib.bib22); Cortez & Silva, [2008](https://arxiv.org/html/2606.00202#bib.bib23)\) \(649 samples, 48 features\)

Predict whether a student passes a course \(final grade $\geq 10$\)\. Binarized using 500 depth\-1 estimators\.

##### Taxi \(OpenML, [2018d](https://arxiv.org/html/2606.00202#bib.bib63)\) \(1,224,158 samples, 27 features\)

Predict whether a taxi ride includes a tip\. Binarized using 40 estimators of depth 2\.

##### Tic\-Tac\-Toe \(Aha, [1991](https://arxiv.org/html/2606.00202#bib.bib2)\) \(958 samples, 26 features\)

Predict whether a player has won given a terminal board configuration\. Binarized using 20 estimators of depth 7\.

##### Wine \(OpenML, [2025](https://arxiv.org/html/2606.00202#bib.bib66)\) \(6,497 samples, 64 features\)

Predict whether wine quality is at least 7\. Binarized using 200 depth\-1 estimators\.

## Appendix D Experiments

##### Experimental Setup

All Rashomon set algorithms are run with a 90\-hour time limit and 200 GB of allocated memory, unless otherwise noted\. Whenever we use bootstrapping, such as in experiments evaluating the mean and standard deviation of PRAXIS’s recall or in computing the Rashomon Importance Distribution \(Donnelly et al\., [2023](https://arxiv.org/html/2606.00202#bib.bib28)\), we generate 10 bootstrap resamples of the binarized dataset\. For consistency with the three other Rashomon set algorithms and with PRAXIS, we allow trivial extensions of trees to be included in the Rashomon set\. A trivial extension is a split in which the leaf and its sibling make identical predictions; it is thus a special case of predictive equivalence \(McTavish et al\., [2025](https://arxiv.org/html/2606.00202#bib.bib53)\)\. These extensions can be removed with minimal additional handling of the AND/OR graph or by post\-processing the Rashomon set to discard trees that are predictively equivalent to another tree already in the set\.
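One simple way to draw such bootstrap resamples is to sample row indices with replacement, as sketched below\. The helper name `bootstrap_indices`, the seed handling, and the choice to resample to the original number of rows are assumptions for illustration, not details taken from the PRAXIS code\.

```python
import random


def bootstrap_indices(n_rows, n_resamples=10, seed=0):
    """Return n_resamples lists of row indices, each drawn with replacement."""
    rng = random.Random(seed)
    return [
        [rng.randrange(n_rows) for _ in range(n_rows)]
        for _ in range(n_resamples)
    ]


# Example: 10 resamples of a 297-row binarized dataset (the Heart sample count).
resamples = bootstrap_indices(297)
print(len(resamples), len(resamples[0]))
```

Each index list can then be used to select rows of the binarized feature matrix and the label vector together, so that features and labels stay aligned\.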

PRAXIS and SORTeD \(Arslan et al\., [2026](https://arxiv.org/html/2606.00202#bib.bib3)\) use a depth convention in which the root is at depth 0, whereas RESPLIT and TreeFARMS index the root at depth 1; we adjust the depth passed to each method accordingly\. For RESPLIT \(Babbar et al\., [2025](https://arxiv.org/html/2606.00202#bib.bib5)\), we set the lookahead depth parameter to exactly half of the overall tree depth\. Thus, when training depth\-5 and depth\-7 trees \(corresponding to depths 6 and 8 under RESPLIT’s convention\), we use lookahead depths of 3 and 4, respectively\.
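The depth bookkeeping above can be written down in a few lines\. The function names are hypothetical, and raising an error for odd converted depths is our choice, since the text only covers the two even cases\.

```python
def to_root_at_one(depth_root_at_zero: int) -> int:
    """Convert a depth counted with the root at 0 to one counted with the root at 1."""
    return depth_root_at_zero + 1


def resplit_lookahead(depth_root_at_zero: int) -> int:
    """Half of the tree depth under the root-at-depth-1 convention."""
    converted = to_root_at_one(depth_root_at_zero)
    if converted % 2 != 0:
        # The text does not say how an odd depth would be halved.
        raise ValueError("converted depth is odd; halving is not defined here")
    return converted // 2


print(resplit_lookahead(5), resplit_lookahead(7))
```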

Because PRAXIS operates on integer\-valued objectives, all reported values of $\lambda$ are snapped to the nearest multiple of $1/|D|$ for compatibility\.
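As an illustration of this snapping, the sketch below rounds $\lambda$ to the nearest multiple of $1/|D|$, assuming that $|D|$ denotes the number of training samples\. The function name `snap_lambda` is hypothetical, and the tie\-breaking rule \(Python's round\-half\-to\-even\) is an assumption; the text does not state how ties are resolved\.

```python
def snap_lambda(lam: float, n_samples: int) -> float:
    """Round lam to the nearest multiple of 1 / n_samples."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    # Per-leaf penalty expressed as a whole number of samples.
    penalty_in_samples = round(lam * n_samples)
    return penalty_in_samples / n_samples


# With 297 samples, 0.02 * 297 = 5.94 rounds to 6, so lambda becomes 6 / 297.
print(snap_lambda(0.02, 297))
```

On large datasets the adjustment is negligible, while on small ones it can be visible in the third decimal place of $\lambda$\.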

For all quantitative recall comparisons, we compute ground truth using PRAXIS configured with an optimal proxy and an exact subproblem representation \(i\.e\., without 64\-bit fingerprinting\)\. This configuration is used only to define the reference Rashomon set and eliminates any theoretical risk of hash collisions or proxy\-induced pruning errors\. All reported experimental results for our method, including runtime, memory usage, and recall, use the modified LicketySPLIT proxy \(unless stated otherwise\) with a 64\-bit fingerprint used to key the subproblem caches\. Consequently, any observed recall differences reflect algorithmic behavior rather than numerical differences across implementations\.

##### Machines Used

All experiments were performed on an institutional computing cluster\. Each experiment was executed on a single compute node equipped with an AMD EPYC 9554 processor \(2\.75 GHz\), with 64 physical cores\. Resources were restricted to 1 CPU core and 200 GB of RAM\.

### D\.1 Rashomon Set Timing

Across all three Rashomon set configurations shown \([Table 9](https://arxiv.org/html/2606.00202#A4.T9)–[Table 11](https://arxiv.org/html/2606.00202#A4.T11)\), PRAXIS is consistently the fastest method, often by orders of magnitude: on most datasets it completes in seconds to minutes, whereas SORTeD and RESPLIT frequently require hours to days, and TreeFARMS fails to finish the majority of tasks under the 200 GB memory limit\.

PRAXIS records the lowest runtime for every dataset, binarization, and parameter setting in these tables, with a single sub\-second exception: on Monk2 in Table 11, SORTeD reports 0\.14 against 0\.16 for PRAXIS\. In a small number of cases, however, PRAXIS uses more memory than competing methods\. When this excess exceeds a few hundred megabytes, it is always associated with particularly challenging datasets\. The main example is Madelon, a synthetic dataset in which labels are determined by proximity to selected corners of a hypercube\. Such structure is difficult for heuristics, so the initial budget is set much more loosely than the optimal objective would allow, leading to increased exploration\. This behavior does not lead to degraded approximation quality \(see [subsection D\.2](https://arxiv.org/html/2606.00202#A4.SS2)\)\.

This issue can be mitigated within the PRAXIS framework\. As shown in[subsubsection D\.5\.1](https://arxiv.org/html/2606.00202#A4.SS5.SSS1), using a stronger proxy algorithm both reduces memory usage and shortens runtime on Madelon\. Nevertheless, we retain the modified LicketySPLIT proxy as the default, since synthetic adversarial datasets such as Madelon are not typical real\-world use cases\. Instead, we include them to stress\-test the algorithms and characterize worst\-case behavior\.

Finally, we note that in the few settings where RESPLIT uses less memory than PRAXIS, this comes at the cost of substantially worse approximation quality \(see[subsection D\.11](https://arxiv.org/html/2606.00202#A4.SS11)\)\.

Table 9: Runtime and peak memory usage at $\lambda = 0.02$, $\varepsilon_{\textrm{mult}} = 0.03$, depth $= 5$\. Peak MB reports peak resident set size \(RSS\), including imports, dataset loading, and algorithm execution\. – indicates that the method did not finish in 90 hours or with 200 GB of RAM\.

| Dataset | $n$ | $k$ | PRAXIS Time | PRAXIS Peak MB | TreeFARMS Time | TreeFARMS Peak MB | SORTeD Time | SORTeD Peak MB | RESPLIT Time | RESPLIT Peak MB |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Adult | 48842 | 14 | 0.04 | 144.27 | 0.44 | 220.77 | 2.27 | 242.91 | 4.37 | 218.53 |
| Adult | 48842 | 209 | 272.53 | 682.15 | – | – | 56440.23 | 12079.80 | 5494.42 | 5782.34 |
| Aging | 714 | 57 | 0.02 | 126.92 | 1433.52 | 9208.92 | 2.01 | 147.50 | 0.96 | 139.97 |
| Bank | 45211 | 97 | 2.43 | 195.74 | – | – | 1808.61 | 1189.15 | 164.34 | 1468.21 |
| Bank | 45211 | 217 | 19.79 | 281.88 | – | – | 39822.12 | 9455.45 | 1861.18 | 5860.19 |
| Bike | 17379 | 43 | 0.78 | 146.38 | – | – | 24.83 | 284.10 | 26.07 | 256.52 |
| Bike | 17379 | 164 | 35.40 | 359.94 | – | – | 6533.75 | 3546.60 | 1022.18 | 1302.75 |
| Chess | 28056 | 50 | 0.33 | 151.67 | – | – | 26.66 | 288.12 | 32.75 | 325.80 |
| Christine | 5418 | 80 | 31.28 | 974.84 | – | – | 252.29 | 801.67 | 373.98 | 551.52 |
| Christine | 5418 | 231 | 944.27 | 10439.32 | – | – | 38970.60 | 12709.88 | 12625.37 | 3096.83 |
| Churn | 5000 | 81 | 0.22 | 136.17 | – | – | 42.26 | 292.94 | 13.96 | 243.64 |
| Churn | 5000 | 472 | 34.84 | 279.41 | – | – | 123776.11 | 22012.99 | 2564.30 | 2792.24 |
| Compas | 4966 | 44 | 0.09 | 130.47 | 65.08 | 3742.00 | 7.23 | 163.50 | 11.90 | 159.61 |
| Covertype | 581012 | 45 | 13.11 | 579.30 | 519.70 | 27925.92 | 3123.76 | 2986.86 | 1640.53 | 2518.53 |
| Covertype | 581012 | 96 | 358.70 | 1301.23 | – | – | 64672.88 | 16107.48 | 10101.87 | 8631.58 |
| Credit | 30000 | 134 | 5.74 | 190.12 | – | – | 6194.55 | 3472.01 | 447.43 | 1728.17 |
| Credit | 30000 | 225 | 26.03 | 249.60 | – | – | 65145.23 | 14300.78 | 2316.15 | 4355.73 |
| Diabetes | 253680 | 33 | 0.79 | 291.69 | 553.31 | 28336.29 | 683.31 | 1067.69 | 48.68 | 1049.81 |
| Diabetes | 253680 | 121 | 26.59 | 677.67 | – | – | 50986.77 | 10723.03 | 2410.23 | 8037.53 |
| Droid | 29332 | 84 | 0.79 | 165.04 | – | – | 882.65 | 697.19 | 78.89 | 730.01 |
| Electricity | 38474 | 94 | 7.83 | 192.63 | – | – | 1333.20 | 1186.57 | 216.98 | 1011.17 |
| Electricity | 38474 | 264 | 306.94 | 692.10 | – | – | 114325.61 | 23777.65 | 7619.63 | 6402.99 |
| Heart | 297 | 42 | 0.05 | 130.01 | 11792.01 | 20541.30 | 3.46 | 165.31 | 4.59 | 139.74 |
| Helena | 65196 | 84 | 0.64 | 214.39 | 2.53 | 664.27 | 16.31 | 447.14 | 68.20 | 634.65 |
| Helena | 65196 | 156 | 4.74 | 290.96 | 38.32 | 1983.32 | 100.29 | 779.31 | 564.13 | 1755.30 |
| Heloc | 2502 | 65 | 0.38 | 140.59 | – | – | 52.34 | 313.95 | 8.79 | 205.18 |
| Higgs | 11000000 | 84 | 2374.91 | 21537.79 | – | – | – | – | – | – |
| IOT | 123117 | 86 | 0.63 | 302.86 | 3.73 | 936.54 | 18.29 | 688.02 | 5.33 | 805.53 |
| Jannis | 57580 | 106 | 327.54 | 1458.63 | – | – | 5277.68 | 5137.22 | 5587.19 | 2462.35 |
| Jannis | 57580 | 247 | 15352.04 | 42067.00 | – | – | 215642.66 | 57931.75 | 237926.22 | 11297.20 |
| Jasmine | 2984 | 51 | 0.35 | 142.87 | – | – | 13.72 | 211.14 | 16.78 | 193.93 |
| Jasmine | 2984 | 207 | 22.77 | 451.61 | – | – | 5211.85 | 2705.49 | 517.17 | 745.36 |
| Madeline | 3140 | 76 | 2.22 | 196.45 | – | – | 136.18 | 518.20 | 444.67 | 460.68 |
| Madeline | 3140 | 451 | 2971.65 | 29094.20 | – | – | – | – | 133189.38 | 26621.99 |
| Madelon | 2000 | 73 | 8.19 | 292.36 | – | – | 97.75 | 409.84 | 204.01 | 276.30 |
| Madelon | 2000 | 186 | 1420.33 | 29392.03 | – | – | 8760.55 | 4460.80 | 4196.68 | 1242.15 |
| Magic | 19020 | 80 | 31.05 | 446.82 | – | – | 268.66 | 850.65 | 595.96 | 527.28 |
| Magic | 19020 | 167 | 368.48 | 2481.18 | – | – | 6605.22 | 5146.42 | 8074.65 | 1771.25 |
| Monk2 | 601 | 17 | 0.00 | 125.96 | 3.81 | 414.67 | 0.09 | 133.14 | 0.14 | 130.55 |
| Mushroom | 8124 | 13 | 0.00 | 128.88 | 0.18 | 160.31 | 0.07 | 145.33 | 1.46 | 138.40 |
| News | 39644 | 196 | 596.47 | 1480.48 | – | – | 114514.40 | 30426.06 | 13544.76 | 6082.23 |
| Phishing | 11055 | 44 | 0.07 | 138.56 | 645.57 | 87245.93 | 20.98 | 191.67 | 10.20 | 195.34 |
| Poker | 1025010 | 40 | 13.71 | 952.10 | – | – | 7370.45 | 7475.91 | 657.48 | 5958.07 |
| Shopping | 12330 | 112 | 4.09 | 187.69 | – | – | 646.54 | 980.08 | 137.95 | 606.73 |
| Shopping | 12330 | 243 | 58.88 | 470.00 | – | – | 24151.41 | 7957.33 | 1760.04 | 2195.54 |
| Spambase | 4601 | 24 | 0.05 | 130.40 | 106.68 | 10617.66 | 0.98 | 155.85 | 4.46 | 148.48 |
| Spambase | 4601 | 67 | 3.67 | 232.30 | – | – | 53.30 | 351.89 | 40.33 | 223.32 |
| Student | 649 | 48 | 0.02 | 127.51 | 25801.79 | 36353.51 | 3.42 | 159.91 | 3.05 | 142.59 |
| Taxi | 1224158 | 27 | 1.18 | 894.65 | 12.50 | 2874.38 | 364.85 | 3318.65 | 32.35 | 2030.43 |
| TicTacToe | 958 | 26 | 0.17 | 145.15 | 216.40 | 9333.29 | 0.79 | 147.81 | 0.54 | 138.73 |
| Wine | 6497 | 64 | 0.24 | 135.34 | – | – | 62.98 | 307.63 | 12.54 | 255.48 |

Table 10: Runtime and peak memory usage at $\lambda = 0.01$, $\varepsilon_{\textrm{mult}} = 0.02$, depth $= 5$\. Peak MB reports peak resident set size \(RSS\), including imports, dataset loading, and algorithm execution\. – indicates that the method did not finish in 90 hours or with 200 GB of RAM\.

| Dataset | $n$ | $k$ | PRAXIS Time | PRAXIS Peak MB | TreeFARMS Time | TreeFARMS Peak MB | SORTeD Time | SORTeD Peak MB | RESPLIT Time | RESPLIT Peak MB |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Adult | 48842 | 14 | 0.03 | 142.79 | 1.45 | 356.50 | 2.44 | 237.79 | 2.29 | 212.25 |
| Adult | 48842 | 209 | 213.43 | 329.20 | – | – | 43716.00 | 13414.49 | 5409.87 | 5889.31 |
| Aging | 714 | 57 | 0.02 | 126.82 | 3052.93 | 14589.09 | 2.12 | 153.57 | 1.24 | 140.83 |
| Bank | 45211 | 97 | 3.69 | 194.56 | – | – | 2057.39 | 1651.47 | 246.53 | 1806.91 |
| Bank | 45211 | 217 | 30.26 | 281.51 | – | – | 41199.79 | 14633.89 | 2653.47 | 7062.95 |
| Bike | 17379 | 43 | 0.87 | 145.73 | – | – | 35.19 | 304.32 | 171.08 | 378.37 |
| Bike | 17379 | 164 | 45.83 | 402.90 | – | – | 8686.16 | 4317.58 | 6840.29 | 3007.27 |
| Chess | 28056 | 50 | 0.84 | 151.67 | – | – | 31.13 | 325.01 | 255.75 | 771.57 |
| Christine | 5418 | 80 | 60.96 | 1854.50 | – | – | 318.82 | 943.47 | 176.95 | 578.80 |
| Christine | 5418 | 231 | 2688.49 | 36909.23 | – | – | 50600.24 | 15805.86 | 4764.45 | 3184.38 |
| Churn | 5000 | 81 | 1.51 | 160.04 | – | – | 65.26 | 385.35 | 55.45 | 302.81 |
| Churn | 5000 | 472 | 672.04 | 2673.30 | – | – | 207785.97 | 35154.42 | 65161.25 | 4495.50 |
| Compas | 4966 | 44 | 0.28 | 135.31 | 309.37 | 13473.63 | 7.20 | 181.21 | 12.82 | 160.25 |
| Covertype | 581012 | 45 | 16.00 | 579.65 | 3242.37 | 138008.74 | 3199.01 | 3423.12 | 1199.73 | 2586.74 |
| Covertype | 581012 | 96 | 411.17 | 1301.31 | – | – | 51888.28 | 19681.58 | 14090.82 | 9014.84 |
| Credit | 30000 | 134 | 8.49 | 193.99 | – | – | 8578.20 | 4990.32 | 623.32 | 1930.66 |
| Credit | 30000 | 225 | 37.61 | 265.08 | – | – | 78946.90 | 20408.63 | 3144.97 | 4616.44 |
| Diabetes | 253680 | 33 | 1.41 | 291.29 | – | – | 1003.21 | 1346.43 | 79.66 | 1353.52 |
| Diabetes | 253680 | 121 | 49.56 | 679.43 | – | – | 54065.58 | 17048.95 | 3854.99 | 10305.41 |
| Droid | 29332 | 84 | 1.32 | 164.86 | – | – | 905.42 | 781.80 | 99.92 | 734.73 |
| Electricity | 38474 | 94 | 167.97 | 853.92 | – | – | 1554.07 | 1714.89 | 384.58 | 1052.48 |
| Electricity | 38474 | 264 | 15226.18 | 41489.45 | – | – | 137252.76 | 31984.60 | 22809.66 | 6679.00 |
| Heart | 297 | 42 | 0.15 | 138.07 | 15562.35 | 21285.08 | 4.36 | 167.92 | 8.86 | 305.46 |
| Helena | 65196 | 84 | 4.09 | 214.65 | 3149.22 | 142928.84 | 1023.98 | 1345.18 | 140.12 | 1719.96 |
| Helena | 65196 | 156 | 53.02 | 343.79 | – | – | 19191.46 | 7918.21 | 1027.52 | 5564.16 |
| Heloc | 2502 | 65 | 1.71 | 187.75 | – | – | 63.12 | 361.53 | 17.97 | 236.79 |
| Higgs | 11000000 | 84 | 9354.19 | 21536.63 | – | – | – | – | – | – |
| IOT | 123117 | 86 | 0.69 | 305.50 | 3.73 | 939.38 | 20.49 | 688.75 | 5.34 | 809.26 |
| Jannis | 57580 | 106 | 232.05 | 1214.11 | – | – | 6107.02 | 6181.51 | 10247.63 | 2501.41 |
| Jannis | 57580 | 247 | 10085.61 | 35068.73 | – | – | 238250.82 | 71731.88 | 225254.53 | 11419.48 |
| Jasmine | 2984 | 51 | 1.76 | 184.79 | – | – | 16.98 | 224.13 | 15.75 | 188.03 |
| Jasmine | 2984 | 207 | 142.81 | 2348.67 | – | – | 6355.12 | 3245.50 | 573.81 | 745.05 |
| Madeline | 3140 | 76 | 29.26 | 1063.57 | – | – | 176.20 | 524.15 | – | – |
| Madelon | 2000 | 73 | 103.71 | 3816.83 | – | – | 116.89 | 418.05 | – | – |
| Madelon | 2000 | 186 | 4731.53 | 94685.84 | – | – | 11299.92 | 4768.32 | – | – |
| Magic | 19020 | 80 | 34.69 | 467.59 | – | – | 370.13 | 1041.29 | 370.55 | 532.23 |
| Magic | 19020 | 167 | 707.41 | 4875.07 | – | – | 9429.27 | 6714.69 | 4113.34 | 1947.57 |
| Monk2 | 601 | 17 | 0.02 | 128.17 | 3.80 | 419.89 | 0.12 | 133.92 | 0.16 | 130.15 |
| Mushroom | 8124 | 13 | 0.00 | 128.55 | 0.25 | 169.00 | 0.11 | 145.63 | 1.71 | 138.34 |
| News | 39644 | 196 | 1151.19 | 2658.81 | – | – | 136627.16 | 41543.94 | 10672.60 | 9491.36 |
| Phishing | 11055 | 44 | 0.13 | 138.57 | – | – | 22.91 | 210.77 | 14.30 | 202.65 |
| Poker | 1025010 | 40 | 770.12 | 990.07 | – | – | 10703.73 | 30587.01 | 44109.02 | 21182.09 |
| Shopping | 12330 | 112 | 5.76 | 194.20 | – | – | 1021.69 | 1438.49 | 183.90 | 723.99 |
| Shopping | 12330 | 243 | 91.16 | 575.52 | – | – | 34442.13 | 11823.67 | 2395.57 | 2560.67 |
| Spambase | 4601 | 24 | 0.03 | 128.77 | 164.15 | 14223.42 | 1.22 | 153.60 | 1.63 | 148.00 |
| Spambase | 4601 | 67 | 1.87 | 182.17 | – | – | 59.83 | 318.11 | 26.15 | 225.06 |
| Student | 649 | 48 | 0.05 | 129.01 | 26873.72 | 43952.53 | 4.71 | 168.66 | 3.23 | 143.88 |
| Taxi | 1224158 | 27 | 1.42 | 894.16 | 95.59 | 9040.93 | 1987.10 | 3496.16 | 138.84 | 2334.88 |
| TicTacToe | 958 | 26 | 0.57 | 178.92 | 243.16 | 9455.00 | 1.03 | 148.10 | 44.54 | 1135.14 |
| Wine | 6497 | 64 | 0.47 | 140.97 | – | – | 75.13 | 371.18 | 20.55 | 333.90 |

Table 11: Runtime and peak memory usage at $\lambda = 0.005$, $\varepsilon_{\textrm{mult}} = 0.01$, depth $= 5$\. Peak MB reports peak resident set size \(RSS\), including imports, dataset loading, and algorithm execution\. – indicates that the method did not finish in 90 hours or with 200 GB of RAM\.

| Dataset | $n$ | $k$ | PRAXIS Time | PRAXIS Peak MB | TreeFARMS Time | TreeFARMS Peak MB | SORTeD Time | SORTeD Peak MB | RESPLIT Time | RESPLIT Peak MB |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Adult | 48842 | 14 | 0.04 | 142.95 | 14.27 | 3726.86 | 3.23 | 239.91 | 2.09 | 208.89 |
| Adult | 48842 | 209 | 103.67 | 287.67 | – | – | 50692.90 | 16083.11 | 3389.43 | 5828.01 |
| Aging | 714 | 57 | 0.02 | 126.96 | 3787.89 | 15215.65 | 2.31 | 158.90 | 1.38 | 141.26 |
| Bank | 45211 | 97 | 14.84 | 195.26 | – | – | 1631.56 | 2136.69 | 604.20 | 1901.05 |
| Bank | 45211 | 217 | 258.36 | 308.94 | – | – | 50481.86 | 19501.40 | 9870.89 | 7499.14 |
| Bike | 17379 | 43 | 0.34 | 143.97 | – | – | 43.12 | 310.59 | 644.18 | 14718.25 |
| Bike | 17379 | 164 | 193.43 | 392.34 | – | – | 10165.04 | 4818.43 | – | – |
| Chess | 28056 | 50 | 0.56 | 151.82 | – | – | 42.15 | 334.95 | 311.71 | 1337.80 |
| Christine | 5418 | 80 | 9.11 | 340.74 | – | – | 366.60 | 965.29 | 59.01 | 475.03 |
| Christine | 5418 | 231 | 740.80 | 9069.65 | – | – | 58533.87 | 16757.12 | 1747.80 | 2822.89 |
| Churn | 5000 | 81 | 1.56 | 167.24 | – | – | 83.37 | 412.25 | 70.71 | 422.36 |
| Churn | 5000 | 472 | 373.36 | 668.72 | – | – | 300644.51 | 40201.41 | 57447.85 | 22535.59 |
| Compas | 4966 | 44 | 0.14 | 131.78 | 357.00 | 15292.30 | 6.95 | 172.83 | 12.91 | 165.10 |
| Covertype | 581012 | 45 | 27.02 | 577.12 | – | – | 2926.56 | 8743.22 | 800.52 | 2686.79 |
| Covertype | 581012 | 96 | 1175.58 | 1301.56 | – | – | 45014.11 | 36188.30 | 9721.09 | 9161.04 |
| Credit | 30000 | 134 | 14.08 | 187.86 | – | – | 8981.24 | 6006.17 | 788.48 | 2041.50 |
| Credit | 30000 | 225 | 65.09 | 231.32 | – | – | 96518.88 | 24726.79 | 3844.08 | 5090.96 |
| Diabetes | 253680 | 33 | 2.39 | 288.94 | – | – | 1296.70 | 1583.26 | 116.81 | 1401.60 |
| Diabetes | 253680 | 121 | 69.09 | 676.90 | – | – | 61324.21 | 24343.17 | 5082.61 | 11350.65 |
| Droid | 29332 | 84 | 1.83 | 165.44 | – | – | 864.82 | 857.08 | 111.37 | 734.25 |
| Electricity | 38474 | 94 | 33.11 | 203.75 | – | – | 1677.05 | 1896.45 | 328.27 | 1058.29 |
| Electricity | 38474 | 264 | 3936.78 | 10709.56 | – | – | 152529.97 | 35515.06 | 22561.25 | 6885.98 |
| Heart | 297 | 42 | 3.80 | 741.07 | – | – | 5.27 | 162.44 | 144.55 | 14249.28 |
| Helena | 65196 | 84 | 4.85 | 214.45 | – | – | 1582.83 | 1867.96 | 223.49 | 1864.16 |
| Helena | 65196 | 156 | 71.38 | 288.56 | – | – | 31017.18 | 11336.79 | 1639.32 | 6105.35 |
| Heloc | 2502 | 65 | 2.08 | 155.22 | – | – | 69.14 | 386.42 | 14.31 | 229.00 |
| Higgs | 11000000 | 84 | 13787.22 | 21538.12 | – | – | – | – | – | – |
| IOT | 123117 | 86 | 0.78 | 304.72 | 3.66 | 938.48 | 12.73 | 688.69 | 5.29 | 806.43 |
| Jannis | 57580 | 106 | 389.75 | 1980.20 | – | – | 5752.37 | 7342.77 | 7888.34 | 4098.71 |
| Jannis | 57580 | 247 | 23348.15 | 82284.37 | – | – | 301779.90 | 86424.74 | 63050.14 | 22226.05 |
| Jasmine | 2984 | 51 | 1.28 | 148.84 | – | – | 18.84 | 244.91 | 8.07 | 191.91 |
| Jasmine | 2984 | 207 | 248.60 | 1136.76 | – | – | 6700.77 | 3744.66 | 354.05 | 751.01 |
| Madeline | 3140 | 76 | 4.29 | 242.38 | – | – | 190.28 | 510.69 | – | – |
| Madelon | 2000 | 73 | 11.09 | 1751.20 | – | – | 130.92 | 436.17 | – | – |
| Magic | 19020 | 80 | 24.47 | 378.42 | – | – | 313.07 | 941.38 | 216.06 | 534.37 |
| Magic | 19020 | 167 | 453.49 | 682.59 | – | – | 9165.80 | 6907.26 | 2277.17 | 2181.43 |
| Monk2 | 601 | 17 | 0.16 | 207.18 | – | – | 0.14 | 133.68 | 0.16 | 129.73 |
| Mushroom | 8124 | 13 | 0.00 | 127.17 | 0.36 | 172.73 | 0.19 | 144.43 | 1.47 | 136.66 |
| News | 39644 | 196 | 1432.08 | 3948.98 | – | – | 162757.59 | 47046.81 | 5987.42 | 11158.40 |
| Phishing | 11055 | 44 | 0.15 | 138.13 | – | – | 18.30 | 216.72 | 13.78 | 201.65 |
| Poker | 1025010 | 40 | 450.94 | 989.50 | – | – | 8472.44 | 14107.55 | 61478.45 | 112157.77 |
| Shopping | 12330 | 112 | 9.37 | 161.82 | – | – | 1137.70 | 1835.37 | 226.42 | 769.48 |
| Shopping | 12330 | 243 | 121.89 | 303.47 | – | – | 43659.86 | 15684.98 | 2416.49 | 2734.83 |
| Spambase | 4601 | 24 | 0.03 | 128.27 | 179.34 | 15334.18 | 1.69 | 155.17 | 4338.20 | 92571.09 |
| Spambase | 4601 | 67 | 1.00 | 136.66 | – | – | 70.93 | 334.90 | 26.00 | 223.70 |
| Student | 649 | 48 | 0.12 | 127.36 | 24661.38 | 44943.45 | 5.54 | 174.04 | 8.23 | 144.50 |
| Taxi | 1224158 | 27 | 1.65 | 893.99 | – | – | 1672.81 | 3575.89 | 146.76 | 2709.54 |
| TicTacToe | 958 | 26 | 0.08 | 133.65 | 208.98 | 9563.60 | 1.27 | 144.92 | 214.33 | 8655.65 |
| Wine | 6497 | 64 | 0.61 | 142.35 | – | – | 78.07 | 428.43 | 22.27 | 348.78 |
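For readers who want to collect a comparable Peak MB figure for their own runs, one possible way to read the peak resident set size of the current process on a Unix system is sketched below\. This is an assumption about tooling, not a description of how the numbers in Tables 9 to 11 were measured; the helper name `peak_rss_mb` and the use of 1 MB = 1024 KB are ours\.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident set size of the current process, in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


print(f"peak RSS so far: {peak_rss_mb():.2f} MB")
```

Because the counter covers the whole process lifetime, calling it once at the end of a run includes imports and dataset loading as well as the algorithm itself\.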
### D\.2 Rashomon Set Recall

We now report approximation quality over a broader range of parameter settings than in [Table 2](https://arxiv.org/html/2606.00202#S4.T2)\. In particular, we vary the sparsity parameter from $\lambda = 0.0025$ to $\lambda = 0.02$ and evaluate recall under two regimes: Multiplicative, which considers trees within a $1 + \varepsilon_{\textrm{mult}}$ factor of the optimal objective, and Absolute, which considers trees within a $1 + \varepsilon_{\textrm{mult}}$ factor of the objective attained by the proxy algorithm\. Because trees are enumerated in sorted order, the multiplicative set can always be recovered and is a subset of \(or equal to\) the full set returned by PRAXIS\. Regardless, if recall remains high on the larger absolute set, these additional trees may be useful for downstream applications\.
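The relationship between the two regimes can be illustrated with a short sketch\. It assumes integer\-scaled objectives listed in ascending order and a proxy objective that is at least as large as the optimum; the function name and the toy numbers are hypothetical\.

```python
def rashomon_subsets(sorted_objectives, optimal_objective, proxy_objective, eps_mult):
    """Split an ascending list of objectives into the two sets used for recall."""
    mult_bound = (1 + eps_mult) * optimal_objective
    abs_bound = (1 + eps_mult) * proxy_objective
    multiplicative = [v for v in sorted_objectives if v <= mult_bound]
    absolute = [v for v in sorted_objectives if v <= abs_bound]
    return multiplicative, absolute


# Toy objectives (not from the paper), optimum 100, proxy solution 102.
mult_set, abs_set = rashomon_subsets(
    [100, 101, 102, 104, 105, 107],
    optimal_objective=100,
    proxy_objective=102,
    eps_mult=0.03,
)
print(mult_set)
print(abs_set)
```

Since the list is sorted and the proxy objective cannot be below the optimum, the multiplicative set is always a prefix of the absolute set, which is why it can be recovered from the larger enumeration\.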

[Table 12](https://arxiv.org/html/2606.00202#A4.T12) reports Rashomon set recall \(mean $\pm$ standard deviation over bootstraps\)\. Across most datasets and $\lambda$ values, PRAXIS achieves essentially perfect recall \(often $1.000 \pm 0$\) under both variants, indicating close agreement with ground\-truth enumeration\. In cases where multiplicative recall falls below 1, absolute recall differs only slightly and may be marginally higher or lower; these differences are not substantial\. This suggests that trees farther from optimal are not necessarily harder to recover, even though their higher objectives leave less slack before a completion exceeds the pruning budget\. Additionally, the datasets with a recall below 1 tend to be smaller datasets, which can be solved exactly with little computational cost\.

In [subsubsection D\.5\.3](https://arxiv.org/html/2606.00202#A4.SS5.SSS3), we report recall when a greedy algorithm is used as the proxy; its recall is reasonable but lower than that obtained with modified LicketySPLIT as the proxy algorithm\.

In the main body of the paper, we calculate recall relative to two algorithms: PRAXIS with an optimal proxy and an exact subproblem representation, which corresponds to full recall of the Rashomon set, and SORTeD \(Arslan et al\., [2026](https://arxiv.org/html/2606.00202#bib.bib3)\)\. For each method, we compute the histogram of tree objectives within the Rashomon set\. Then, for each objective value, we take the maximum count reported by either method\. We use these maximum counts as our estimate of the ground\-truth number of Rashomon\-set trees at each objective value\. This approach is more conservative, yielding lower recall for our method, than using either PRAXIS with an optimal proxy alone or SORTeD alone as the ground truth\. We use this procedure because, in some cases, the two existing exact solvers \(TreeFARMS \(Xin et al\., [2022](https://arxiv.org/html/2606.00202#bib.bib80)\) and SORTeD \(Arslan et al\., [2026](https://arxiv.org/html/2606.00202#bib.bib3)\)\) disagree even when run on identical datasets with identical parameter settings\. Additionally, TreeFARMS does not always finish within our 90\-hour time limit and 200 GB memory limit\.
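The histogram\-maximum procedure can be sketched as follows\. Each method is represented by a list of integer objectives, one per tree in its Rashomon set\. The function name is hypothetical, and capping the credit at the estimated ground\-truth count per objective value is our assumption about how recall would be scored\.

```python
from collections import Counter


def conservative_recall(found, reference_a, reference_b):
    """Recall of `found` against the per-objective maximum of two references."""
    hist_found = Counter(found)
    hist_a = Counter(reference_a)
    hist_b = Counter(reference_b)
    # Estimated ground truth: the larger count at each objective value.
    truth = {
        obj: max(hist_a[obj], hist_b[obj])
        for obj in set(hist_a) | set(hist_b)
    }
    total = sum(truth.values())
    if total == 0:
        return 1.0
    # Cap credit at the estimated truth so surplus trees cannot inflate recall.
    recovered = sum(min(hist_found[obj], count) for obj, count in truth.items())
    return recovered / total


# Toy example: the two references disagree at objectives 11 and 12.
print(conservative_recall(
    found=[10, 10, 11, 12],
    reference_a=[10, 10, 11, 12, 12],
    reference_b=[10, 10, 11, 11, 12],
))
```

In the toy example, scoring against `reference_a` alone would give 4/5, whereas the combined reference gives 4/6, which shows why the combined estimate is the more conservative choice\.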

We believe that the trees missed by SORTeD are caused solely by its column deduplication\. For example, if

$$\texttt{years\_of\_education} \geq 16 \qquad \text{and} \qquad \texttt{income} \geq 70000$$
induce identical bitvectors \(this can happen with quantile binarization or Threshold Guessing \(McTavish et al\., [2022](https://arxiv.org/html/2606.00202#bib.bib52)\)\), SORTeD arbitrarily chooses one binary column to keep, even though the two splits represent different features and should both be included in the Rashomon set \(their inclusion can affect downstream analyses such as variable\-importance conclusions\)\. We have not observed any cases where PRAXIS with optimal parameters finds fewer trees than SORTeD\.
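The effect of deduplicating on bitvectors can be seen in a toy example\. The sketch below keeps the first column seen for each distinct bitvector; this keep\-first policy, the function name, and the data are illustrative assumptions and do not reproduce SORTeD's implementation\.

```python
def deduplicate_columns(columns):
    """Keep one column per distinct bitvector.

    Returns (kept_names, dropped), where dropped maps each removed column
    to the retained column that has the same bitvector.
    """
    first_seen = {}
    dropped = {}
    for name, bits in columns.items():
        key = tuple(bits)
        if key in first_seen:
            dropped[name] = first_seen[key]
        else:
            first_seen[key] = name
    return list(first_seen.values()), dropped


toy_columns = {
    "years_of_education>=16": [1, 0, 1, 1, 0],
    "income>=70000": [1, 0, 1, 1, 0],  # same support, different feature
    "age>=40": [0, 0, 1, 1, 1],
}
kept, dropped = deduplicate_columns(toy_columns)
print(kept)
print(dropped)
```

Every tree that splits on the retained column has a twin that splits on the dropped one with an identical objective, so removing the column removes those twins from the enumerated set\.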

Note that the occurrence of this kind of collision is not indicative of a flaw in binarization when we are working with Rashomon sets rather than trying to find a single optimum\. Consider, for example, the exhaustive binarization needed to evaluate the full Rashomon set for continuous features: each individual feature becomes, in the worst case, $n$ binary features\. If any two features are sufficiently correlated, it is quite likely that there exist thresholds for these two features with the same support\. Even for fully decorrelated features, each continuous feature will have a split with support size one \(actually two disjoint ones: where all but one sample is below a threshold, or all but one is above\), and now we have the birthday paradox with $n$ possibilities and $k$ assignments\.

Using the same birthday paradox approximation as in Section [B\.7](https://arxiv.org/html/2606.00202#A2.SS7), we have a probability of approximately $\frac{k(k-1)}{2n}$ of a collision among the support\-one splits of a given kind; for $k = 20$ continuous features and $n = 5000$ samples, this is $380/10000 = 3.8\%$ for each of the two kinds of support\-one split, which gives roughly a 7\.6% chance of at least one collision among just this subset of features with support 1\. When the collision occurs, every instance of the preserved split can be replaced with the removed split while maintaining Rashomon membership, leading to a substantial number of missed trees\.
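The arithmetic behind the 7\.6% figure can be checked numerically\. The sketch below evaluates the approximation $k(k-1)/(2n)$ and, for comparison, the exact birthday probability under the assumption that each feature's singleton sample is uniform and independent\. Reading 7\.6% as twice the one\-sided value \(one for each kind of support\-one split\) is our interpretation of the text\.

```python
def collision_probability_approx(k: int, n: int) -> float:
    """Birthday approximation k(k-1)/(2n) for k draws among n possibilities."""
    return k * (k - 1) / (2 * n)


def collision_probability_exact(k: int, n: int) -> float:
    """Exact probability that k uniform draws from n values are not all distinct."""
    p_all_distinct = 1.0
    for i in range(k):
        p_all_distinct *= (n - i) / n
    return 1.0 - p_all_distinct


one_sided = collision_probability_approx(20, 5000)
print(one_sided)                              # approximation for one kind of split
print(2 * one_sided)                          # both kinds of support-one split
print(collision_probability_exact(20, 5000))  # exact one-sided value
```

The exact one\-sided value is slightly below the approximation, so the quoted figure is a mild overestimate under these assumptions\.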

Table 12: Recall comparison between the Absolute\-$\varepsilon$ \(Abs\) and Multiplicative\-$\varepsilon$ \(Mult\) Rashomon sets across $\lambda \in \{0.02, 0.01, 0.005, 0.0025\}$ \(depth $= 5$, $\varepsilon = 0.03$\)\. Values are mean $\pm$ std over 10 bootstraps \(rounded to 3 decimals for means and 2 decimals for std\)\. If some but not all bootstraps run out of memory, the mean and std reported are for those runs that finished\. If a dataset has more than one binarization, the smaller one is displayed in this table\.

| Dataset | $\lambda=0.02$ Abs | $\lambda=0.02$ Mult | $\lambda=0.01$ Abs | $\lambda=0.01$ Mult | $\lambda=0.005$ Abs | $\lambda=0.005$ Mult | $\lambda=0.0025$ Abs | $\lambda=0.0025$ Mult |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Adult-14 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 |
| Aging-57 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 0.998 ± 0.00 | 0.999 ± 0.00 |
| Bank-97 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 |
| Bike-43 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 0.979 ± 0.03 | 0.978 ± 0.03 | 0.994 ± 0.01 | 0.993 ± 0.01 |
| Christine-80 | 0.993 ± 0.01 | 0.993 ± 0.01 | 0.992 ± 0.01 | 0.994 ± 0.01 | 0.997 ± 0.00 | 0.998 ± 0.01 | 1.000 ± 0.00 | 1.000 ± 0.00 |
| Churn-81 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 0.977 ± 0.03 | 0.979 ± 0.03 | 0.991 ± 0.02 | 0.989 ± 0.03 |
| Compas-44 | 1.000 ± 0.00 | 1.000 ± 0.00 | 0.999 ± 0.00 | 0.999 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 |
| Covertype-45 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 |
| Diabetes-33 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 |
| Droid-84 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 0.975 ± 0.03 | 1.000 ± 0.00 | 0.987 ± 0.01 | 0.993 ± 0.01 |
| Electricity-94 | 1.000 ± 0.00 | 1.000 ± 0.00 | 0.921 ± 0.04 | 0.963 ± 0.06 | 0.982 ± 0.03 | 0.980 ± 0.03 | 0.950 ± 0.00 | 0.987 ± 0.02 |
| Heart-42 | 0.995 ± 0.02 | 1.000 ± 0.00 | 0.946 ± 0.13 | 0.917 ± 0.22 | 0.948 ± 0.14 | 0.952 ± 0.13 | 0.948 ± 0.14 | 0.952 ± 0.13 |
| Helena-84 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 |
| Heloc-65 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 |
| IOT-86 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 |
| Jasmine-51 | 1.000 ± 0.00 | 1.000 ± 0.00 | 0.985 ± 0.02 | 0.997 ± 0.01 | 0.994 ± 0.01 | 0.995 ± 0.01 | 0.992 ± 0.02 | 0.992 ± 0.02 |
| Madeline-76 | 0.991 ± 0.01 | 1.000 ± 0.00 | 0.986 ± 0.04 | 0.997 ± 0.01 | 0.999 ± 0.00 | 1.000 ± 0.00 | 0.866 ± 0.26 | 0.905 ± 0.23 |
| Madelon-73 | 0.975 ± 0.01 | 0.997 ± 0.01 | 0.996 ± 0.01 | 0.999 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 0.995 ± 0.01 | 0.999 ± 0.00 |
| Magic-80 | 0.983 ± 0.03 | 0.991 ± 0.02 | 0.985 ± 0.01 | 0.991 ± 0.01 | 0.990 ± 0.01 | 0.992 ± 0.01 | 0.999 ± 0.00 | 0.999 ± 0.00 |
| Monk2-17 | 1.000 ± 0.00 | 1.000 ± 0.00 | 0.901 ± 0.17 | 0.943 ± 0.15 | 0.874 ± 0.26 | 0.914 ± 0.20 | 0.888 ± 0.13 | 0.887 ± 0.15 |
| Mushroom-13 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 |
| Phishing-44 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 0.987 ± 0.04 | 0.987 ± 0.04 |
| Shopping-112 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 0.996 ± 0.01 | 0.996 ± 0.01 |
| Spambase-24 | 0.992 ± 0.02 | 0.992 ± 0.02 | 0.993 ± 0.02 | 0.998 ± 0.01 | 0.967 ± 0.08 | 0.963 ± 0.09 | 0.986 ± 0.03 | 0.992 ± 0.02 |
| Student-48 | 1.000 ± 0.00 | 1.000 ± 0.00 | 0.985 ± 0.04 | 0.987 ± 0.04 | 0.991 ± 0.01 | 0.992 ± 0.01 | 0.986 ± 0.04 | 0.987 ± 0.04 |
| Taxi-27 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 0.990 ± 0.01 | 0.991 ± 0.01 |
| TicTacToe-26 | 1.000 ± 0.00 | 1.000 ± 0.00 | 0.967 ± 0.06 | 0.988 ± 0.03 | 0.967 ± 0.08 | 0.965 ± 0.09 | 0.982 ± 0.06 | 0.995 ± 0.02 |
| Wine-64 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.00 | 0.998 ± 0.01 | 0.998 ± 0.00 |
### D.3 What Trees Are Missing?

We next examine the datasets from [Table 12](https://arxiv.org/html/2606.00202#A4.T12) for which PRAXIS does not achieve perfect recall on every bootstrap. Our goal is to understand which Rashomon-set trees are missed.

Fix a $(\lambda, d, \varepsilon_{\textrm{mult}})$ configuration. For each bootstrap, consider a tree that belongs to the true Rashomon set (within a $1+\varepsilon_{\textrm{mult}}$ factor of the minimum objective) but is not returned by our approximation. We normalize its objective value by dividing by the optimal objective on that bootstrap, mapping it into the interval $[1, 1+\varepsilon_{\textrm{mult}}]$. We then aggregate these normalized objectives across all bootstraps and analyze the distribution of missed trees over this interval.

We summarize the distribution of missing objectives using three statistics. The *top third* reports the proportion of missing trees whose normalized objective lies in $[1, 1+\varepsilon_{\textrm{mult}}/3)$, that is, among the best objectives, while the *bottom third* reports the proportion in $[1+2\varepsilon_{\textrm{mult}}/3, 1+\varepsilon_{\textrm{mult}}]$. In addition, we report the proportion of missing trees that occur at the *very end*. Unlike the interval-based summaries, this corresponds to a single objective value: the largest objective observed in the full Rashomon set on that bootstrap. This statistic captures whether the trees we miss lie (essentially) right at the Rashomon boundary, where there is no slack to absorb proxy error.
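The three statistics above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' evaluation code: the function name, the dictionary keys (`missed`, `optimal`, `max_rashomon`), and the exact-equality test for the very end are all assumptions made for the example.

```python
from typing import Dict, List


def summarize_missed_trees(bootstraps: List[dict], eps_mult: float) -> Dict[str, float]:
    """Aggregate where missed trees fall in [1, 1 + eps_mult] across bootstraps.

    Each bootstrap is a dict with hypothetical keys:
      "missed":       objectives of Rashomon-set trees that were not returned
      "optimal":      minimum objective on that bootstrap
      "max_rashomon": largest objective in the full Rashomon set
    """
    top = bottom = very_end = total = 0
    for boot in bootstraps:
        optimal = boot["optimal"]
        boundary = boot["max_rashomon"]
        for objective in boot["missed"]:
            ratio = objective / optimal  # normalized objective
            total += 1
            if ratio < 1 + eps_mult / 3:
                top += 1  # top third: [1, 1 + eps/3)
            elif ratio >= 1 + 2 * eps_mult / 3:
                bottom += 1  # bottom third: [1 + 2*eps/3, 1 + eps]
            if objective == boundary:
                very_end += 1  # exactly at the largest Rashomon objective
    if total == 0:
        return {"top_third": 0.0, "bottom_third": 0.0, "very_end": 0.0}
    return {
        "top_third": top / total,
        "bottom_third": bottom / total,
        "very_end": very_end / total,
    }


# Toy input (made-up numbers, not taken from the paper's experiments).
example = [{"optimal": 100.0, "max_rashomon": 103.0,
            "missed": [100.5, 102.5, 103.0, 103.0]}]
summary = summarize_missed_trees(example, eps_mult=0.03)
print(summary)
```

Note that the proportions are taken over missed trees only, so they describe where misses occur rather than the miss rate inside each region.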

[Table 13](https://arxiv.org/html/2606.00202#A4.T13) aggregates these statistics across all datasets and values of $\lambda$ (with depth fixed to $d=5$ and $\varepsilon_{\textrm{mult}}=0.03$). Across all configurations, PRAXIS rarely misses trees in the best third of the objective interval, consistent with the fact that it can be used as an optimal tree finder (see [subsection D.4](https://arxiv.org/html/2606.00202#A4.SS4)). Instead, the missed trees are overwhelmingly concentrated near the Rashomon boundary, and in many cases a substantial fraction occur exactly at the worst possible objective. This suggests that the observed recall losses are driven primarily by trees whose objectives are very close to the Rashomon threshold, which motivates adding slack to PRAXIS's budget to improve recall for a slightly smaller budget (see [subsection D.6](https://arxiv.org/html/2606.00202#A4.SS6)). However, this is typically unnecessary, as PRAXIS frequently recovers all or nearly all of the Rashomon set, yielding a strong approximation anyway.

Table 13: Distribution of the *missed* trees for various $\lambda$ configurations ($d=5$, $\varepsilon_{\textrm{mult}}=0.03$). The number line illustrates the regions used in the table. For each missed tree, we normalize its objective by the optimal objective, so values lie in $[1, 1+\varepsilon_{\textrm{mult}}]$. The entries report *where the missed trees occur*, not the miss rate within each region: for example, $0.90$ in Bottom Third means that $90\%$ of the missed trees lie in the bottom third, not that $90\%$ of bottom-third trees are missed. Top Third denotes $[1, 1+\varepsilon_{\textrm{mult}}/3)$, Bottom Third denotes $[1+2\varepsilon_{\textrm{mult}}/3, 1+\varepsilon_{\textrm{mult}}]$, and Very End denotes missed trees at the largest objective in the Rashomon set. See Appendix [D.2](https://arxiv.org/html/2606.00202#A4.SS2) for overall approximation quality; very few trees are missed.

| $\lambda$ | Dataset | Top Third | Bottom Third | Very End |
| --- | --- | --- | --- | --- |
| 0.0025 | Aging-57 | 0.0000 | 1.0000 | 0.9780 |
| 0.0025 | Bank-97 | 0.0000 | 1.0000 | 0.1316 |
| 0.0025 | Bike-43 | 0.0000 | 0.9882 | 0.1188 |
| 0.0025 | Christine-80 | 0.0000 | 0.9997 | 0.3870 |
| 0.0025 | Churn-81 | 0.0000 | 0.9514 | 0.3918 |
| 0.0025 | Compas-44 | 0.0000 | 0.9992 | 0.2661 |
| 0.0025 | Covertype-45 | 0.0000 | 0.9996 | 0.0050 |
| 0.0025 | Droid-84 | 0.0000 | 0.9333 | 0.0000 |
| 0.0025 | Electricity-94 | 0.0000 | 0.9949 | 0.0453 |
| 0.0025 | Heart-42 | 0.0000 | 1.0000 | 1.0000 |
| 0.0025 | Heloc-65 | 0.0000 | 0.9997 | 0.6642 |
| 0.0025 | Jasmine-51 | 0.0000 | 0.9937 | 0.5966 |
| 0.0025 | Madeline-76 | 0.0000 | 0.9944 | 0.5591 |
| 0.0025 | Madelon-73 | 0.0000 | 0.9940 | 0.6998 |
| 0.0025 | Magic-80 | 0.0000 | 0.9951 | 0.1204 |
| 0.0025 | Monk2-17 | 0.0004 | 0.9843 | 0.8440 |
| 0.0025 | Phishing-44 | 0.0000 | 0.8452 | 0.0476 |
| 0.0025 | Shopping-112 | 0.0000 | 0.9399 | 0.1796 |
| 0.0025 | Spambase-24 | 0.0000 | 1.0000 | 0.4434 |
| 0.0025 | Student-48 | 0.0000 | 0.9993 | 0.9713 |
| 0.0025 | Taxi-27 | 0.0000 | 1.0000 | 0.0625 |
| 0.0025 | Tictactoe-26 | 0.0000 | 0.9923 | 0.8454 |
| 0.0025 | Wine-64 | 0.0000 | 0.9929 | 0.3229 |
| 0.005 | Bike-43 | 0.0000 | 0.9581 | 0.1321 |
| 0.005 | Christine-80 | 0.0000 | 0.9980 | 0.2632 |
| 0.005 | Churn-81 | 0.0048 | 0.9286 | 0.3360 |
| 0.005 | Compas-44 | 0.0000 | 1.0000 | 0.1123 |
| 0.005 | Covertype-45 | 0.0000 | 1.0000 | 0.0000 |
| 0.005 | Electricity-94 | 0.0023 | 0.9698 | 0.0490 |
| 0.005 | Heart-42 | 0.0000 | 1.0000 | 1.0000 |
| 0.005 | Heloc-65 | 0.0000 | 0.9903 | 0.3829 |
| 0.005 | Jasmine-51 | 0.0000 | 0.9767 | 0.3878 |
| 0.005 | Madeline-76 | 0.0000 | 1.0000 | 0.4695 |
| 0.005 | Madelon-73 | 0.0000 | 0.9899 | 0.5539 |
| 0.005 | Magic-80 | 0.0003 | 0.9773 | 0.0949 |
| 0.005 | Monk2-17 | 0.0000 | 0.9959 | 0.7115 |
| 0.005 | Spambase-24 | 0.0000 | 0.9091 | 0.5152 |
| 0.005 | Student-48 | 0.0000 | 0.9881 | 0.9486 |
| 0.005 | Tictactoe-26 | 0.0313 | 0.8504 | 0.6104 |
| 0.005 | Wine-64 | 0.0000 | 1.0000 | 0.6364 |
| 0.01 | Christine-80 | 0.0000 | 0.9846 | 0.1919 |
| 0.01 | Compas-44 | 0.0000 | 1.0000 | 0.3636 |
| 0.01 | Electricity-94 | 0.0113 | 0.9016 | 0.0160 |
| 0.01 | Heart-42 | 0.0000 | 0.9925 | 0.9925 |
| 0.01 | Heloc-65 | 0.0000 | 1.0000 | 1.0000 |
| 0.01 | Jasmine-51 | 0.0000 | 1.0000 | 0.0000 |
| 0.01 | Madeline-76 | 0.0000 | 0.9952 | 0.3692 |
| 0.01 | Madelon-73 | 0.0000 | 0.9558 | 0.4575 |
| 0.01 | Magic-80 | 0.0000 | 0.9880 | 0.0901 |
| 0.01 | Monk2-17 | 0.0000 | 0.9981 | 0.5102 |
| 0.01 | Spambase-24 | 0.0000 | 0.0000 | 0.0000 |
| 0.01 | Student-48 | 0.0000 | 1.0000 | 1.0000 |
| 0.01 | Tictactoe-26 | 0.0000 | 1.0000 | 0.7812 |
| 0.02 | Christine-80 | 0.0000 | 0.8000 | 0.2000 |
| 0.02 | Madelon-73 | 0.0000 | 1.0000 | 0.1000 |
| 0.02 | Magic-80 | 0.0000 | 0.9712 | 0.0481 |
| 0.02 | Spambase-24 | 0.0000 | 0.6000 | 0.0000 |

### D.4 Optimal Tree Finder

Across all three sparsity settings ($\lambda \in \{0.02, 0.01, 0.005\}$), PRAXIS almost always achieves the same minimum objective as STreeD (and GOSDT when available) while typically running 2-3 orders of magnitude faster. In the rare failure cases, the returned objective exceeds the optimal value by a small multiplicative factor and by an even smaller additive amount.
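To make the two notions of gap concrete, the sketch below computes both for a single run. It assumes objectives are expressed on a normalized scale below 1 (the tables report values on a different scale), in which case the additive gap is the multiplicative gap times the optimal objective and is therefore the smaller of the two. The function name and the example numbers are hypothetical.

```python
def optimality_gaps(returned: float, optimal: float) -> tuple:
    """Return (multiplicative_gap, additive_gap) of a returned objective.

    multiplicative_gap = returned / optimal - 1
    additive_gap       = returned - optimal
    """
    if optimal <= 0:
        raise ValueError("optimal objective must be positive")
    return returned / optimal - 1.0, returned - optimal


# Hypothetical normalized objectives, chosen only for illustration.
mult_gap, add_gap = optimality_gaps(returned=0.253, optimal=0.250)
print(f"multiplicative gap: {mult_gap:.4f}, additive gap: {add_gap:.4f}")
```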

Table 14: Minimum objective values at $\lambda=0.005$, depth $=5$, and $\varepsilon_{\textrm{mult}} \in \{0.0, 0.01, 0.03\}$, with runtimes in seconds.

| Dataset | STreeD | PRAXIS ($\varepsilon_{\textrm{mult}}{=}0.0$) | PRAXIS ($\varepsilon_{\textrm{mult}}{=}0.01$) | PRAXIS ($\varepsilon_{\textrm{mult}}{=}0.03$) | Time (GOSDT) | Time (STreeD) | Time (PRAXIS, $\varepsilon_{\textrm{mult}}{=}0.0$) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Adult-14 | 8677 | 8677 | 8677 | 8677 | 1.16 | 2.17 | 0.03 |
| Adult-209 | 8677 | 8677 | 8677 | 8677 | – | 49434.59 | 43.85 |
| Aging-57 | 135 | 135 | 135 | 135 | – | 1.88 | 0.02 |
| Bank-97 | 5273 | 5273 | 5273 | 5273 | – | 3142.87 | 6.35 |
| Bank-217 | 5273 | 5273 | 5273 | 5273 | – | 48896.44 | 83.77 |
| Bike-43 | 2977 | 2977 | 2977 | 2977 | 430.16 | 51.90 | 0.32 |
| Bike-164 | 2889 | 2889 | 2889 | 2889 | – | 5333.04 | 68.07 |
| Chess-50 | 6321 | 6321 | 6321 | 6321 | 445.12 | 20.32 | 0.59 |
| Christine-80 | 1706 | 1706 | 1706 | 1706 | – | 346.66 | 2.05 |
| Christine-231 | 1709 | 1709 | 1709 | – | – | 16139.66 | 63.66 |
| Churn-81 | 580 | 580 | 580 | 580 | 1648.47 | 81.57 | 0.55 |
| Churn-472 | 580 | 580 | 580 | 580 | – | 61895.81 | 101.92 |
| Compas-44 | 1724 | 1724 | 1724 | 1724 | 41.20 | 1.34 | 0.09 |
| Covertype-45 | 164668 | 164668 | 164668 | 164668 | 1423.34 | 849.66 | 4.17 |
| Covertype-96 | 164668 | 164668 | 164668 | 164668 | – | 19457.18 | 74.15 |
| Credit-134 | 5712 | 5712 | 5712 | 5712 | – | 3665.61 | 10.54 |
| Credit-225 | 5712 | 5712 | 5712 | 5712 | – | 28223.63 | 68.63 |
| Diabetes-33 | 36614 | 36614 | 36614 | 36614 | 1565.15 | 608.10 | 2.00 |
| Diabetes-121 | 36614 | 36614 | 36614 | 36614 | – | 23273.75 | 102.40 |
| Droid-84 | 2715 | 2715 | 2715 | 2715 | – | 305.52 | 1.59 |
| Electricity-94 | 9406 | 9519 | 9406 | 9406 | – | 536.49 | 6.35 |
| Electricity-264 | 9376 | 9376 | 9376 | 9376 | – | 44936.91 | 672.79 |
| Heart-42 | 39 | 39 | 39 | 39 | 301.39 | 1.20 | 1.32 |
| Helena-84 | 3246 | 3246 | 3246 | 3246 | – | 404.41 | 4.67 |
| Helena-156 | 3246 | 3246 | 3246 | 3246 | – | 16671.62 | 30.47 |
| Heloc-65 | 770 | 770 | 770 | 770 | 6502.01 | 19.70 | 0.61 |
| Higgs-84 | – | 4159657 | 4159657 | 4159657 | – | – | 4466.14 |
| IOT-86 | 1356 | 1356 | 1356 | 1356 | 0.27 | 1.06 | 0.74 |
| Jannis-106 | 17339 | 17351 | 17339 | 17339 | – | 3787.99 | 46.88 |
| Jannis-247 | 17311 | 17311 | 17311 | – | – | 120205.06 | 14468.23 |
| Jasmine-51 | 659 | 659 | 659 | 659 | 352.79 | 17.23 | 0.73 |
| Jasmine-207 | 659 | 659 | 659 | 659 | – | 1620.51 | 97.47 |
| Madeline-76 | 679 | 679 | 679 | 679 | 5488.74 | 49.20 | 1.17 |
| Madeline-451 | 652 | – | – | – | – | 183383.52 | – |
| Madelon-73 | 456 | 456 | 456 | 456 | 5573.99 | 30.73 | 25.52 |
| Madelon-186 | 435 | 435 | – | – | – | 2374.22 | 6330.09 |
| Magic-80 | 3903 | 3903 | 3903 | 3903 | – | 183.60 | 9.82 |
| Magic-167 | 3903 | 3915 | 3905 | 3903 | – | 4171.23 | 39.28 |
| Monk2-17 | 162 | 162 | 162 | 162 | 1.02 | 0.06 | 0.18 |
| Mushroom-13 | 253 | 253 | 253 | 253 | 0.18 | 0.06 | 0.00 |
| News-196 | 15716 | 15716 | 15716 | – | – | 72221.13 | 177.13 |
| Phishing-44 | 1136 | 1136 | 1136 | 1136 | 199.00 | 7.93 | 0.16 |
| Poker-40 | 473585 | 473585 | 473585 | 473585 | – | 5270.71 | 61.15 |
| Shopping-112 | 1456 | 1456 | 1456 | 1456 | – | 545.95 | 4.29 |
| Shopping-243 | 1456 | 1456 | 1456 | 1456 | – | 12060.33 | 40.47 |
| Spambase-24 | 564 | 564 | 564 | 564 | 9.39 | 0.57 | 0.02 |
| Spambase-67 | 570 | 570 | 570 | 570 | 1038.04 | 18.88 | 0.39 |
| Student-48 | 115 | 115 | 115 | 115 | 255.05 | 1.78 | 0.07 |
| Taxi-27 | 100623 | 100623 | 100623 | 100623 | 268.61 | 427.25 | 1.69 |
| TicTacToe-26 | 168 | 168 | 168 | 168 | 18.45 | 0.44 | 0.03 |
| Wine-64 | 1260 | 1260 | 1260 | 1260 | 1492.83 | 29.50 | 0.52 |

Table 15: Minimum objective values at $\lambda=0.01$, depth $=5$, and $\varepsilon_{\textrm{mult}} \in \{0.0, 0.01, 0.03\}$, with runtimes in seconds.

| Dataset | STreeD | PRAXIS ($\varepsilon_{\textrm{mult}}{=}0.0$) | PRAXIS ($\varepsilon_{\textrm{mult}}{=}0.01$) | PRAXIS ($\varepsilon_{\textrm{mult}}{=}0.03$) | Time (GOSDT) | Time (STreeD) | Time (PRAXIS, $\varepsilon_{\textrm{mult}}{=}0.0$) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Adult-14 | 9653 | 9653 | 9653 | 9653 | 0.31 | 1.06 | 0.03 |
| Adult-209 | 9653 | 9653 | 9653 | 9653 | – | 9611.06 | 37.43 |
| Aging-57 | 138 | 138 | 138 | 138 | 56.40 | 2.24 | 0.02 |
| Bank-97 | 5741 | 5741 | 5741 | 5741 | – | 2265.54 | 2.80 |
| Bank-217 | 5741 | 5741 | 5741 | 5741 | – | 33338.40 | 28.43 |
| Bike-43 | 3454 | 3454 | 3454 | 3454 | 304.76 | 43.90 | 0.36 |
| Bike-164 | 3454 | 3454 | 3454 | 3454 | – | 5459.65 | 12.53 |
| Chess-50 | 7133 | 7133 | 7133 | 7133 | 356.90 | 20.73 | 0.46 |
| Christine-80 | 1851 | 1851 | 1851 | 1851 | – | 305.20 | 4.06 |
| Christine-231 | 1860 | 1860 | 1860 | 1860 | – | 13646.96 | 80.84 |
| Churn-81 | 736 | 736 | 736 | 736 | 1005.89 | 77.91 | 0.69 |
| Churn-472 | 736 | 736 | 736 | 736 | – | 48780.84 | 149.35 |
| Compas-44 | 1845 | 1845 | 1845 | 1845 | 29.27 | 1.24 | 0.05 |
| Covertype-45 | 173383 | 173383 | 173383 | 173383 | 923.09 | 535.43 | 3.33 |
| Covertype-96 | 173383 | 173383 | 173383 | 173383 | – | 16413.62 | 60.55 |
| Credit-134 | 6012 | 6012 | 6012 | 6012 | – | 3483.04 | 8.42 |
| Credit-225 | 6012 | 6012 | 6012 | 6012 | – | 24738.49 | 37.78 |
| Diabetes-33 | 37883 | 37883 | 37883 | 37883 | 618.10 | 426.37 | 1.40 |
| Diabetes-121 | 37883 | 37883 | 37883 | 37883 | – | 20655.07 | 49.53 |
| Droid-84 | 3285 | 3285 | 3285 | 3285 | – | 177.58 | 1.15 |
| Electricity-94 | 10526 | 10526 | 10526 | 10526 | – | 524.95 | 32.23 |
| Electricity-264 | 10526 | 10526 | 10526 | 10526 | – | 38287.22 | 2732.79 |
| Heart-42 | 62 | 62 | 62 | 62 | 211.61 | 1.14 | 0.09 |
| Helena-84 | 4224 | 4224 | 4224 | 4224 | 153.80 | 18.99 | 1.84 |
| Helena-156 | 4224 | 4224 | 4224 | 4224 | – | 1634.69 | 14.12 |
| Heloc-65 | 818 | 818 | 818 | 818 | 5534.76 | 17.18 | 0.61 |
| Higgs-84 | – | 4449359 | 4449359 | 4449359 | – | – | 2285.68 |
| IOT-86 | 2586 | 2586 | 2586 | 2586 | 0.26 | 2.23 | 0.67 |
| Jannis-106 | 18789 | 18789 | 18789 | 18789 | – | 3182.88 | 26.58 |
| Jannis-247 | 18771 | 18771 | 18771 | 18771 | – | 97224.37 | 321.67 |
| Jasmine-51 | 725 | 725 | 725 | 725 | 257.59 | 6.07 | 0.45 |
| Jasmine-207 | 725 | 725 | 725 | 725 | – | 1289.89 | 30.33 |
| Madeline-76 | 889 | 889 | 889 | 889 | 4159.87 | 49.33 | 4.51 |
| Madeline-451 | 858 | 858 | 858 | 858 | – | 174318.25 | 19950.96 |
| Madelon-73 | 577 | 577 | 577 | 577 | 4212.66 | 30.28 | 40.14 |
| Madelon-186 | 560 | 560 | 560 | 560 | – | 2235.10 | 1697.94 |
| Magic-80 | 4498 | 4500 | 4498 | 4498 | – | 155.68 | 3.21 |
| Magic-167 | 4490 | 4490 | 4490 | 4490 | – | 3412.59 | 28.78 |
| Monk2-17 | 208 | 212 | 209 | 208 | 0.81 | 0.06 | 0.00 |
| Mushroom-13 | 444 | 444 | 444 | 444 | 0.16 | 0.03 | 0.00 |
| News-196 | 16496 | 16496 | 16496 | 16496 | – | 65250.51 | 131.71 |
| Phishing-44 | 1360 | 1360 | 1360 | 1360 | 143.64 | 7.70 | 0.12 |
| Poker-40 | 509567 | 509567 | 509567 | 509567 | – | 5076.40 | 49.22 |
| Shopping-112 | 1578 | 1578 | 1578 | 1578 | – | 349.46 | 2.85 |
| Shopping-243 | 1578 | 1578 | 1578 | 1578 | – | 8196.00 | 27.35 |
| Spambase-24 | 702 | 702 | 702 | 702 | 5.18 | 0.44 | 0.02 |
| Spambase-67 | 708 | 708 | 708 | 708 | 762.07 | 17.37 | 0.30 |
| Student-48 | 125 | 125 | 125 | 125 | 163.02 | 1.45 | 0.04 |
| Taxi-27 | 112865 | 112865 | 112865 | 112865 | 5.38 | 140.52 | 1.41 |
| TicTacToe-26 | 244 | 244 | 244 | 244 | 12.80 | 0.45 | 0.36 |
| Wine-64 | 1326 | 1326 | 1326 | 1326 | 1008.05 | 25.99 | 0.36 |

Table 16: Minimum objective values at $\lambda=0.02$, depth $=5$, and $\varepsilon_{\textrm{mult}} \in \{0.0, 0.01, 0.03\}$, with runtimes in seconds.

| Dataset | STreeD | PRAXIS ($\varepsilon_{\textrm{mult}}{=}0.0$) | PRAXIS ($\varepsilon_{\textrm{mult}}{=}0.01$) | PRAXIS ($\varepsilon_{\textrm{mult}}{=}0.03$) | Time (GOSDT) | Time (STreeD) | Time (PRAXIS, $\varepsilon_{\textrm{mult}}{=}0.0$) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Adult-14 | 11557 | 11557 | 11557 | 11557 | 0.08 | 0.26 | 0.02 |
| Adult-209 | 11557 | 11557 | 11557 | 11557 | – | 32301.16 | 28.89 |
| Aging-57 | 145 | 145 | 145 | 145 | 24.88 | 1.64 | 0.02 |
| Bank-97 | 6193 | 6193 | 6193 | 6193 | 204.54 | 1324.78 | 1.92 |
| Bank-217 | 6193 | 6193 | 6193 | 6193 | – | 14266.24 | 19.94 |
| Bike-43 | 4138 | 4138 | 4138 | 4138 | 140.38 | 28.09 | 0.20 |
| Bike-164 | 4138 | 4138 | 4138 | 4138 | – | 2071.78 | 6.63 |
| Chess-50 | 8253 | 8253 | 8253 | 8253 | 251.54 | 18.39 | 0.31 |
| Christine-80 | 2053 | 2053 | 2053 | 2053 | 3960.06 | 237.72 | 4.10 |
| Christine-231 | 2054 | 2054 | 2054 | 2054 | – | 27891.23 | 63.42 |
| Churn-81 | 807 | 807 | 807 | 807 | 63.67 | 44.96 | 0.22 |
| Churn-472 | 807 | 807 | 807 | 807 | – | 30923.96 | 34.95 |
| Compas-44 | 1992 | 1992 | 1992 | 1992 | 3.85 | 0.69 | 0.04 |
| Covertype-45 | 190813 | 190813 | 190813 | 190813 | 216.55 | 6.80 | 2.83 |
| Covertype-96 | 190813 | 190813 | 190813 | 190813 | – | 10531.68 | 50.26 |
| Credit-134 | 6612 | 6612 | 6612 | 6612 | – | 5998.63 | 5.78 |
| Credit-225 | 6612 | 6612 | 6612 | 6612 | – | 17247.67 | 26.36 |
| Diabetes-33 | 40420 | 40420 | 40420 | 40420 | 3.54 | 29.18 | 0.79 |
| Diabetes-121 | 40420 | 40420 | 40420 | 40420 | – | 40707.22 | 26.85 |
| Droid-84 | 4167 | 4167 | 4167 | 4167 | – | 245.28 | 0.80 |
| Electricity-94 | 11769 | 11769 | 11769 | 11769 | – | 430.66 | 2.62 |
| Electricity-264 | 11769 | 11769 | 11769 | 11769 | – | 29572.30 | 61.22 |
| Heart-42 | 77 | 77 | 77 | 77 | 147.28 | 0.88 | 0.03 |
| Helena-84 | 5309 | 5309 | 5309 | 5309 | 0.29 | 0.45 | 0.64 |
| Helena-156 | 5309 | 5309 | 5309 | 5309 | 0.82 | 1.00 | 4.78 |
| Heloc-65 | 875 | 875 | 875 | 875 | 3731.96 | 14.37 | 0.25 |
| Higgs-84 | – | 4763119 | 4763119 | 4763119 | – | – | 1073.52 |
| IOT-86 | 5048 | 5048 | 5048 | 5048 | 0.27 | 0.97 | 0.63 |
| Jannis-106 | 20963 | 20963 | 20963 | 20963 | – | 2502.86 | 19.77 |
| Jannis-247 | 20986 | 20986 | 20986 | 20986 | – | 71091.52 | 275.90 |
| Jasmine-51 | 828 | 828 | 828 | 828 | 151.62 | 5.07 | 0.09 |
| Jasmine-207 | 828 | 828 | 828 | 828 | – | 1192.10 | 3.43 |
| Madeline-76 | 1110 | 1110 | 1110 | 1110 | 2783.98 | 110.12 | 0.65 |
| Madeline-451 | 1104 | 1104 | 1104 | 1104 | – | 159124.31 | 405.31 |
| Madelon-73 | 734 | 734 | 734 | 734 | 2402.67 | 28.34 | 7.01 |
| Madelon-186 | 730 | 730 | 730 | 730 | – | 2258.98 | 214.28 |
| Magic-80 | 5226 | 5226 | 5226 | 5226 | 2206.13 | 123.31 | 4.69 |
| Magic-167 | 5225 | 5225 | 5225 | 5225 | – | 2341.04 | 25.46 |
| Monk2-17 | 218 | 218 | 218 | 218 | 0.55 | 0.05 | 0.00 |
| Mushroom-13 | 768 | 768 | 768 | 768 | 0.12 | 0.02 | 0.00 |
| News-196 | 17675 | 17675 | 17675 | 17675 | – | 47814.68 | 75.18 |
| Phishing-44 | 1670 | 1670 | 1670 | 1670 | 1.71 | 5.76 | 0.06 |
| Poker-40 | 531808 | 531808 | 531808 | 531808 | – | 4081.42 | 13.91 |
| Shopping-112 | 1826 | 1826 | 1826 | 1826 | 152.18 | 123.01 | 1.38 |
| Shopping-243 | 1826 | 1826 | 1826 | 1826 | – | 3321.37 | 13.68 |
| Spambase-24 | 950 | 950 | 950 | 950 | 1.31 | 0.25 | 0.01 |
| Spambase-67 | 945 | 945 | 945 | 945 | 501.69 | 42.46 | 0.37 |
| Student-48 | 143 | 143 | 143 | 143 | 84.21 | 1.08 | 0.02 |
| Taxi-27 | 137347 | 137347 | 137347 | 137347 | 1.82 | 4.85 | 1.17 |
| TicTacToe-26 | 304 | 304 | 304 | 304 | 6.57 | 0.35 | 0.07 |
| Wine-64 | 1407 | 1407 | 1407 | 1407 | 532.82 | 19.85 | 0.25 |

### D.5 Other Proxy Algorithms

#### D.5.1 Other Proxy Algorithm Timings

In [subsection B.6](https://arxiv.org/html/2606.00202#A2.SS6), we demonstrated how less cache-friendly proxies can hurt runtime and memory usage. Here, we evaluate three proxy regimes for PRAXIS: (i) a greedy tree builder ($\ell{=}0$), (ii) our default modified LicketySPLIT algorithm ($\ell{=}1$), and (iii) our LicketySPLIT generalization ($\ell{=}2$, detailed in [subsection B.5](https://arxiv.org/html/2606.00202#A2.SS5)), all of which have cache-friendly properties. Each increase in $\ell$ adds a factor of $\mathcal{O}(kd)$ to the runtime of each proxy call. In practice, this means that each increase in $\ell$ makes PRAXIS roughly an order of magnitude slower, though this is not a universal rule. [Table 17](https://arxiv.org/html/2606.00202#A4.T17)–[Table 19](https://arxiv.org/html/2606.00202#A4.T19) show these observations. For instance, on the Christine, Electricity, Jannis, and Madelon datasets, using a greedy algorithm as the proxy can lead PRAXIS to run out of memory because the initial bound is set too loose (see [Table 19](https://arxiv.org/html/2606.00202#A4.T19)). These cases motivate our choice of the modified LicketySPLIT algorithm as the default proxy: it is only ever slower than the $\ell=2$ generalization on Madelon, a very challenging dataset that pushes beyond typical real-world uses.

Additionally, increasing $\ell$ is generally expected to yield better recall for the set of trees within $\varepsilon_{\textrm{mult}}$ of optimal. While this does not always have to hold due to the budget initialization in PRAXIS, we find that it holds in practice (see [subsubsection D.5.2](https://arxiv.org/html/2606.00202#A4.SS5.SSS2)). In particular, there are some cases where using a greedy proxy leads to significantly worse recall than using modified LicketySPLIT (which achieves perfect recall the majority of the time). One explanation is that choosing the optimal split based on more information is more robust, so the proxy objective at the root better reflects the performance of the proxy objective at other subproblems, and [3.6](https://arxiv.org/html/2606.00202#S3.Thmtheorem6) and other theoretical results are more likely to hold.
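As a rough check of the "order of magnitude per increase in $\ell$" observation, the sketch below computes the slowdown between consecutive proxy regimes for three rows of Table 17 ($\lambda=0.02$, $\varepsilon_{\textrm{mult}}=0.03$). The runtimes are copied from that table; the helper name and the choice of rows are illustrative assumptions, and other rows behave differently.

```python
# Runtimes for (l=0, l=1, l=2), copied from three rows of Table 17.
TABLE_17_TIMES = {
    "Bank-97": (0.05, 2.43, 30.17),
    "Credit-134": (0.10, 5.74, 142.21),
    "Diabetes-33": (0.09, 0.79, 5.53),
}


def slowdown_factors(times):
    """Return (l=1 time / l=0 time, l=2 time / l=1 time)."""
    t0, t1, t2 = times
    return t1 / t0, t2 / t1


factors = {name: slowdown_factors(t) for name, t in TABLE_17_TIMES.items()}
for name, (first, second) in factors.items():
    print(f"{name}: l=0 -> l=1 is {first:.1f}x, l=1 -> l=2 is {second:.1f}x")
```

On these rows the ratios range from about 7x to about 57x, which is in line with the hedged "roughly an order of magnitude" description rather than a fixed multiplier.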

Table 17: Runtime and peak memory usage at $\lambda=0.02$, $\varepsilon_{\textrm{mult}}=0.03$, depth $=5$. Peak MB reports peak resident set size (RSS), including imports, dataset loading, and algorithm execution. – indicates that the method did not finish.

| Dataset | $n$ | $k$ | Time ($\ell{=}0$) | Peak MB ($\ell{=}0$) | Time ($\ell{=}1$) | Peak MB ($\ell{=}1$) | Time ($\ell{=}2$) | Peak MB ($\ell{=}2$) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Adult | 48842 | 14 | 0.01 | 144.27 | 0.04 | 144.27 | 0.11 | 144.27 |
| Adult | 48842 | 209 | 7.84 | 286.95 | 272.53 | 682.15 | 4709.93 | 14651.75 |
| Aging | 714 | 57 | 0.00 | 126.92 | 0.02 | 126.92 | 0.17 | 134.71 |
| Bank | 45211 | 97 | 0.05 | 195.74 | 2.43 | 195.74 | 30.17 | 298.87 |
| Bank | 45211 | 217 | 0.19 | 281.88 | 19.79 | 281.88 | 627.68 | 1476.60 |
| Bike | 17379 | 43 | 1.17 | 149.96 | 0.78 | 146.38 | 5.02 | 206.43 |
| Bike | 17379 | 164 | 42.20 | 332.17 | 35.40 | 359.94 | 605.43 | 4125.13 |
| Chess | 28056 | 50 | 0.03 | 151.67 | 0.33 | 151.67 | 2.64 | 159.85 |
| Christine | 5418 | 80 | 66.84 | 1417.77 | 31.28 | 974.84 | 199.96 | 5516.96 |
| Christine | 5418 | 231 | 2285.95 | 21360.87 | 944.27 | 10439.32 | – | – |
| Churn | 5000 | 81 | 0.01 | 136.17 | 0.22 | 136.17 | 2.92 | 199.82 |
| Churn | 5000 | 472 | 0.11 | 162.34 | 34.84 | 279.41 | 2139.13 | 14412.20 |
| Compas | 4966 | 44 | 0.01 | 130.47 | 0.09 | 130.47 | 0.42 | 137.01 |
| Covertype | 581012 | 45 | 2.51 | 579.30 | 13.11 | 579.30 | 41.02 | 579.30 |
| Covertype | 581012 | 96 | 21.69 | 1301.23 | 358.70 | 1301.23 | 2760.71 | 1350.85 |
| Credit | 30000 | 134 | 0.10 | 189.42 | 5.74 | 190.12 | 142.21 | 764.59 |
| Credit | 30000 | 225 | 0.25 | 230.30 | 26.03 | 249.60 | 985.63 | 3665.34 |
| Diabetes | 253680 | 33 | 0.09 | 291.69 | 0.79 | 291.69 | 5.53 | 291.69 |
| Diabetes | 253680 | 121 | 0.54 | 677.67 | 26.59 | 677.67 | 545.65 | 848.44 |
| Droid | 29332 | 84 | 0.03 | 165.04 | 0.79 | 165.04 | 18.15 | 254.45 |
| Electricity | 38474 | 94 | 0.28 | 183.05 | 7.83 | 192.63 | 72.14 | 590.03 |
| Electricity | 38474 | 264 | 4.22 | 290.44 | 306.94 | 692.10 | 7059.97 | 16147.89 |
| Heart | 297 | 42 | 0.01 | 126.41 | 0.05 | 130.01 | 0.44 | 158.48 |
| Helena | 65196 | 84 | 0.04 | 214.39 | 0.64 | 214.39 | 8.91 | 218.62 |
| Helena | 65196 | 156 | 0.11 | 290.96 | 4.74 | 290.96 | 126.88 | 644.31 |
| Heloc | 2502 | 65 | 0.23 | 133.98 | 0.38 | 140.59 | 5.08 | 335.07 |
| Higgs | 11000000 | 84 | 71.33 | 21537.79 | 2374.91 | 21537.79 | 37609.94 | 21537.79 |
| IOT | 123117 | 86 | 0.06 | 302.86 | 0.63 | 302.86 | 9.43 | 302.86 |
| Jannis | 57580 | 106 | 14.44 | 221.44 | 327.54 | 1458.63 | 2283.14 | 14792.75 |
| Jannis | 57580 | 247 | 485.68 | 757.78 | 15352.04 | 42067.00 | – | – |
| Jasmine | 2984 | 51 | 0.02 | 129.66 | 0.35 | 142.87 | 3.00 | 254.38 |
| Jasmine | 2984 | 207 | 0.37 | 141.29 | 22.77 | 451.61 | 460.97 | 8297.02 |
| Madeline | 3140 | 76 | 115.00 | 9451.59 | 2.22 | 196.45 | 19.29 | 710.83 |
| Madeline | 3140 | 451 | – | – | 2971.65 | 29094.20 | – | – |
| Madelon | 2000 | 73 | 40.80 | 1640.36 | 8.19 | 175.66 | 8.15 | 166.89 |
| Madelon | 2000 | 186 | 2448.54 | 53989.45 | 1420.33 | 29392.03 | 1967.57 | 35821.65 |
| Magic | 19020 | 80 | 21.23 | 299.13 | 31.05 | 446.82 | 115.34 | 1355.47 |
| Magic | 19020 | 167 | 404.96 | 2118.31 | 368.48 | 2481.18 | 3190.99 | 29738.90 |
| Monk2 | 601 | 17 | 0.00 | 125.96 | 0.00 | 125.96 | 0.01 | 126.65 |
| Mushroom | 8124 | 13 | 0.00 | 128.88 | 0.00 | 128.88 | 0.00 | 128.88 |
| News | 39644 | 196 | 6.49 | 249.32 | 596.47 | 1480.48 | 16288.08 | 61919.68 |
| Phishing | 11055 | 44 | 0.01 | 138.56 | 0.07 | 138.56 | 0.79 | 146.90 |
| Poker | 1025010 | 40 | 0.62 | 952.10 | 13.71 | 952.10 | 150.82 | 952.10 |
| Shopping | 12330 | 112 | 2.51 | 162.80 | 4.09 | 187.69 | 57.09 | 1001.80 |
| Shopping | 12330 | 243 | 36.44 | 308.27 | 58.88 | 470.00 | 1609.78 | 14590.64 |
| Spambase | 4601 | 24 | 0.01 | 128.88 | 0.05 | 130.40 | 0.19 | 137.54 |
| Spambase | 4601 | 67 | 0.36 | 136.48 | 3.67 | 232.30 | 22.89 | 754.88 |
| Student | 649 | 48 | 0.00 | 125.95 | 0.02 | 127.51 | 0.23 | 141.20 |
| Taxi | 1224158 | 27 | 0.29 | 894.65 | 1.18 | 894.65 | 4.72 | 894.65 |
| TicTacToe | 958 | 26 | 0.02 | 128.40 | 0.17 | 145.15 | 0.16 | 142.17 |
| Wine | 6497 | 64 | 0.01 | 135.34 | 0.24 | 135.34 | 3.18 | 206.77 |

Table 18: Runtime and peak memory usage at $\lambda=0.01$, $\varepsilon_{\textrm{mult}}=0.02$, depth $=5$. Peak MB reports peak resident set size (RSS), including imports, dataset loading, and algorithm execution. – indicates that the method did not finish.

| Dataset | $n$ | $k$ | Time ($\ell{=}0$) | Peak MB ($\ell{=}0$) | Time ($\ell{=}1$) | Peak MB ($\ell{=}1$) | Time ($\ell{=}2$) | Peak MB ($\ell{=}2$) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Adult | 48842 | 14 | 0.04 | 142.79 | 0.03 | 142.79 | 0.11 | 142.79 |
| Adult | 48842 | 209 | 68.23 | 287.65 | 213.43 | 329.20 | 4202.96 | 9429.84 |
| Aging | 714 | 57 | 0.00 | 126.82 | 0.02 | 126.82 | 0.22 | 138.23 |
| Bank | 45211 | 97 | 0.10 | 194.56 | 3.69 | 194.56 | 46.95 | 374.95 |
| Bank | 45211 | 217 | 0.37 | 281.51 | 30.26 | 281.51 | 969.46 | 2103.91 |
| Bike | 17379 | 43 | 0.41 | 144.00 | 0.87 | 145.73 | 5.68 | 207.70 |
| Bike | 17379 | 164 | 13.67 | 225.51 | 45.83 | 402.91 | 707.02 | 4586.06 |
| Chess | 28056 | 50 | 0.10 | 151.67 | 0.84 | 151.67 | 4.95 | 175.17 |
| Christine | 5418 | 80 | 240.18 | 5638.40 | 60.96 | 1854.50 | 218.22 | 5511.67 |
| Christine | 5418 | 231 | 8532.86 | 95097.52 | 2688.49 | 36909.23 | – | – |
| Churn | 5000 | 81 | 0.43 | 137.02 | 1.51 | 160.04 | 14.76 | 413.08 |
| Churn | 5000 | 472 | 79.49 | 399.20 | 672.04 | 2673.30 | 22093.47 | 127553.72 |
| Compas | 4966 | 44 | 0.17 | 133.18 | 0.28 | 135.31 | 0.91 | 145.99 |
| Covertype | 581012 | 45 | 6.12 | 579.65 | 16.00 | 579.65 | 49.36 | 579.65 |
| Covertype | 581012 | 96 | 59.59 | 1301.31 | 411.17 | 1301.31 | 2953.70 | 1556.26 |
| Credit | 30000 | 134 | 0.14 | 189.78 | 8.49 | 194.00 | 203.41 | 1014.44 |
| Credit | 30000 | 225 | 0.35 | 231.85 | 37.61 | 265.08 | 1407.14 | 3913.87 |
| Diabetes | 253680 | 33 | 0.11 | 291.29 | 1.41 | 291.29 | 9.97 | 291.29 |
| Diabetes | 253680 | 121 | 1.50 | 679.43 | 49.56 | 679.43 | 960.79 | 1039.69 |
| Droid | 29332 | 84 | 0.05 | 164.86 | 1.32 | 164.86 | 23.14 | 285.92 |
| Electricity | 38474 | 94 | 64.11 | 635.72 | 167.97 | 853.92 | 206.93 | 1074.71 |
| Electricity | 38474 | 264 | 5691.44 | 27635.33 | 15226.18 | 41489.45 | 24916.21 | 69089.77 |
| Heart | 297 | 42 | 0.02 | 127.28 | 0.15 | 138.06 | 0.97 | 182.99 |
| Helena | 65196 | 84 | 1.17 | 214.65 | 4.09 | 214.65 | 46.84 | 392.11 |
| Helena | 65196 | 156 | 12.65 | 290.91 | 53.02 | 343.79 | 986.02 | 3744.70 |
| Heloc | 2502 | 65 | 0.05 | 129.84 | 1.71 | 187.75 | 18.10 | 666.60 |
| Higgs | 11000000 | 84 | 393.56 | 21536.63 | 9354.19 | 21536.63 | 73165.26 | 21536.63 |
| IOT | 123117 | 86 | 0.06 | 305.50 | 0.69 | 305.50 | 10.90 | 305.50 |
| Jannis | 57580 | 106 | 310.42 | 1347.28 | 232.05 | 1214.11 | 1690.06 | 8275.83 |
| Jannis | 57580 | 247 | 11780.88 | 38579.31 | 10085.61 | 35068.73 | – | – |
| Jasmine | 2984 | 51 | 0.24 | 134.40 | 1.76 | 184.79 | 4.23 | 253.73 |
| Jasmine | 2984 | 207 | 11.70 | 268.94 | 142.81 | 2348.67 | 584.55 | 8660.03 |
| Madeline | 3140 | 76 | – | – | 29.26 | 1063.57 | 35.10 | 1115.99 |
| Madelon | 2000 | 73 | – | – | 103.71 | 3816.83 | 34.16 | 1026.64 |
| Madelon | 2000 | 186 | – | – | 4731.53 | 94685.84 | 1150.60 | 21008.90 |
| Magic | 19020 | 80 | 1.69 | 155.94 | 34.69 | 467.59 | 157.95 | 1492.90 |
| Magic | 19020 | 167 | 167.85 | 1377.71 | 707.41 | 4875.07 | 4224.45 | 32126.71 |
| Monk2 | 601 | 17 | 0.00 | 125.15 | 0.02 | 128.17 | 0.02 | 127.72 |
| Mushroom | 8124 | 13 | 0.00 | 128.55 | 0.00 | 128.55 | 0.01 | 128.55 |
| News | 39644 | 196 | 52.38 | 302.44 | 1151.19 | 2658.81 | 25423.37 | 120160.33 |
| Phishing | 11055 | 44 | 0.01 | 138.57 | 0.13 | 138.57 | 1.48 | 159.73 |
| Poker | 1025010 | 40 | 457.95 | 990.07 | 770.12 | 990.07 | 1509.56 | 990.07 |
| Shopping | 12330 | 112 | 5.18 | 175.07 | 5.76 | 194.20 | 87.15 | 1188.32 |
| Shopping | 12330 | 243 | 76.70 | 415.36 | 91.16 | 575.52 | 2516.95 | 16797.06 |
| Spambase | 4601 | 24 | 0.05 | 130.68 | 0.03 | 128.77 | 0.13 | 133.84 |
| Spambase | 4601 | 67 | 1.49 | 174.85 | 1.87 | 182.17 | 13.96 | 556.00 |
| Student | 649 | 48 | 0.00 | 126.16 | 0.05 | 129.01 | 0.41 | 151.35 |
| Taxi | 1224158 | 27 | 0.31 | 894.16 | 1.42 | 894.16 | 5.63 | 894.16 |
| TicTacToe | 958 | 26 | 0.01 | 127.37 | 0.57 | 178.93 | 0.20 | 144.43 |
| Wine | 6497 | 64 | 0.01 | 135.97 | 0.47 | 140.97 | 5.71 | 258.30 |

Table 19: Runtime and peak memory usage at $\lambda=0.005$, $\varepsilon_{\textrm{mult}}=0.01$, depth $=5$ across 3 proxy options. Peak MB reports peak resident set size (RSS), including imports, dataset loading, and algorithm execution. – indicates that the method did not finish.

| Dataset | $n$ | $k$ | Time ($\ell{=}0$) | Peak MB ($\ell{=}0$) | Time ($\ell{=}1$) | Peak MB ($\ell{=}1$) | Time ($\ell{=}2$) | Peak MB ($\ell{=}2$) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Adult | 48842 | 14 | 0.04 | 142.95 | 0.04 | 142.95 | 0.13 | 142.95 |
| Adult | 48842 | 209 | 32.36 | 287.67 | 103.67 | 287.67 | 2428.60 | 4861.11 |
| Aging | 714 | 57 | 0.00 | 126.96 | 0.02 | 126.96 | 0.25 | 138.39 |
| Bank | 45211 | 97 | 12.10 | 195.26 | 14.84 | 195.26 | 144.98 | 671.35 |
| Bank | 45211 | 217 | 224.29 | 299.06 | 258.36 | 308.94 | 4200.01 | 7761.75 |
| Bike | 17379 | 43 | 0.02 | 143.97 | 0.34 | 143.97 | 3.22 | 184.31 |
| Bike | 17379 | 164 | 0.22 | 170.06 | 193.43 | 392.35 | 386.16 | 2333.43 |
| Chess | 28056 | 50 | 1.24 | 151.82 | 0.56 | 151.82 | 4.57 | 171.81 |
| Christine | 5418 | 80 | 2823.78 | 93080.18 | 9.11 | 340.74 | 148.76 | 3683.37 |
| Christine | 5418 | 231 | – | – | 740.80 | 9069.65 | – | – |
| Churn | 5000 | 81 | 1.15 | 160.91 | 1.56 | 167.24 | 14.48 | 405.19 |
| Churn | 5000 | 472 | 313.91 | 1614.43 | 373.36 | 668.72 | 16226.72 | 120081.82 |
| Compas | 4966 | 44 | 0.83 | 171.84 | 0.14 | 131.78 | 0.56 | 141.74 |
| Covertype | 581012 | 45 | 203.48 | 577.12 | 27.02 | 577.12 | 73.76 | 577.12 |
| Covertype | 581012 | 96 | 1983.47 | 1394.22 | 1175.58 | 1301.56 | 5246.79 | 2086.85 |
| Credit | 30000 | 134 | 0.21 | 187.86 | 14.08 | 187.86 | 254.82 | 1120.93 |
| Credit | 30000 | 225 | 0.44 | 231.32 | 65.09 | 231.32 | 1715.57 | 4348.59 |
| Diabetes | 253680 | 33 | 0.15 | 288.94 | 2.39 | 288.94 | 14.38 | 288.94 |
| Diabetes | 253680 | 121 | 1.35 | 676.90 | 69.09 | 676.90 | 1370.85 | 1290.02 |
| Droid | 29332 | 84 | 0.22 | 165.44 | 1.83 | 165.44 | 29.04 | 310.22 |
| Electricity | 38474 | 94 | – | – | 33.11 | 203.75 | 363.33 | 1528.72 |
| Electricity | 38474 | 264 | – | – | 3936.78 | 10709.56 | 17162.87 | 39449.11 |
| Heart | 297 | 42 | 41.15 | 17781.65 | 3.80 | 741.07 | 0.65 | 164.29 |
| Helena | 65196 | 84 | 1.46 | 214.45 | 4.85 | 214.45 | 63.93 | 427.16 |
| Helena | 65196 | 156 | 24.05 | 288.56 | 71.38 | 288.56 | 1135.65 | 3719.30 |
| Heloc | 2502 | 65 | 0.13 | 129.70 | 2.08 | 155.22 | 24.39 | 786.97 |
| Higgs | 11000000 | 84 | 9941.68 | 21538.12 | 13787.22 | 21538.12 | 65312.94 | 21538.12 |
| IOT | 123117 | 86 | 0.06 | 304.72 | 0.78 | 304.72 | 12.29 | 304.72 |
| Jannis | 57580 | 106 | – | – | 389.75 | 1980.20 | 1174.44 | 4765.17 |
| Jannis | 57580 | 247 | – | – | 23348.15 | 82284.37 | – | – |
| Jasmine | 2984 | 51 | 4.15 | 383.18 | 1.28 | 148.84 | 5.36 | 269.31 |
| Jasmine | 2984 | 207 | 916.47 | 13211.78 | 248.60 | 1136.76 | 710.45 | 10160.48 |
| Madeline | 3140 | 76 | – | – | 4.29 | 242.38 | 43.02 | 1052.55 |
| Madelon | 2000 | 73 | 452.75 | 77155.76 | 11.09 | 1751.20 | 22.42 | 773.55 |
| Magic | 19020 | 80 | 1.18 | 150.76 | 24.47 | 378.41 | 138.15 | 1336.18 |
| Magic | 19020 | 167 | 19.67 | 192.93 | 453.49 | 682.60 | 3298.74 | 22249.79 |
| Monk2 | 601 | 17 | 0.31 | 289.94 | 0.16 | 207.18 | 0.03 | 128.70 |
| Mushroom | 8124 | 13 | 0.00 | 127.17 | 0.00 | 127.17 | 0.01 | 127.17 |
| News | 39644 | 196 | 770.88 | 520.43 | 1432.08 | 3948.98 | 26120.80 | 120072.61 |
| Phishing | 11055 | 44 | 0.12 | 138.13 | 0.15 | 138.13 | 1.64 | 160.26 |
| Poker | 1025010 | 40 | 4874.99 | 2703.79 | 450.94 | 989.50 | 1251.19 | 989.50 |
| Shopping | 12330 | 112 | 1.17 | 152.06 | 9.37 | 161.83 | 124.63 | 1402.99 |
| Shopping | 12330 | 243 | 13.05 | 171.57 | 121.89 | 303.47 | 2774.69 | 17096.23 |
| Spambase | 4601 | 24 | 0.07 | 134.62 | 0.03 | 128.27 | 0.16 | 134.46 |
| Spambase | 4601 | 67 | 3.93 | 310.89 | 1.00 | 136.66 | 14.37 | 554.75 |
| Student | 649 | 48 | 0.01 | 124.62 | 0.12 | 127.36 | 0.88 | 175.46 |
| Taxi | 1224158 | 27 | 0.32 | 893.99 | 1.65 | 893.99 | 6.75 | 893.99 |
| TicTacToe | 958 | 26 | 0.03 | 134.17 | 0.08 | 133.65 | 0.27 | 149.93 |
| Wine | 6497 | 64 | 1.61 | 141.50 | 0.61 | 142.35 | 7.67 | 285.83 |

#### D.5.2 Trees found using various proxy algorithms

We display approximation results comparing the same three proxy algorithms for $\lambda=0.005$, $\varepsilon_{\textrm{mult}}=0.03$, and depth $d=5$. We exclude cases where the Rashomon set size is believed to be small (fewer than 10 trees).

![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x4.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x5.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x6.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x7.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x8.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x9.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x10.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x11.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x12.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x13.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x14.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x15.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x16.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x17.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x18.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x19.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x20.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x21.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x22.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x23.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x24.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x25.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x26.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x27.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x28.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x29.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x30.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x31.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x32.png)![[Uncaptioned 
image]](https://arxiv.org/html/2606.00202v1/x33.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x34.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x35.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x36.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x37.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x38.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x39.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x40.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x41.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x42.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x43.png)
#### D.5.3 Recall with greedy proxy algorithm

[Table 20](https://arxiv.org/html/2606.00202#A4.T20) shows that using the greedy proxy algorithm in PRAXIS can perform significantly worse than using modified LicketyRESPLIT, whose results were shown in [Table 12](https://arxiv.org/html/2606.00202#A4.T12). While the majority of datasets had recall above $0.85$, on some datasets (Madelon and TicTacToe) PRAXIS recovered only a small fraction, or none, of the true Rashomon set. Additionally, some datasets (particularly at lower $\lambda$ values) were unable to complete using the greedy proxy algorithm due to the suboptimality of the objectives it returned on them.

Table 20: Multiplicative-$\varepsilon_{\textrm{mult}}$ (Mult) Rashomon recall across $\lambda \in \{0.02, 0.01, 0.005, 0.0025\}$ (depth $=5$, $\varepsilon_{\textrm{mult}}=0.03$) using the greedy proxy algorithm. Values are mean $\pm$ std over 10 bootstraps. Missing (–) entries indicate no completed runs. For the Bank dataset at $\lambda=0.0025$, no standard deviation is reported because only a single bootstrap run completed within the 200 GB memory limit.

| Dataset | $\lambda=0.02$ | $\lambda=0.01$ | $\lambda=0.005$ | $\lambda=0.0025$ |
| --- | --- | --- | --- | --- |
| Adult-14 | 0.987 ± 0.021 | 1.000 ± 0.000 | 0.979 ± 0.011 | 0.996 ± 0.003 |
| Aging-57 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.912 ± 0.158 | 0.967 ± 0.057 |
| Bank-97 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.742 ± 0.164 | 0.997 ± – |
| Bike-43 | 0.955 ± 0.142 | 0.972 ± 0.044 | 0.847 ± 0.127 | 0.902 ± 0.122 |
| Christine-80 | 0.694 ± 0.180 | 0.933 ± 0.099 | 0.981 ± 0.015 | – |
| Churn-81 | 1.000 ± 0.000 | 0.946 ± 0.083 | 0.979 ± 0.048 | 0.982 ± 0.051 |
| Compas-44 | 0.958 ± 0.059 | 0.944 ± 0.085 | 0.990 ± 0.009 | 0.995 ± 0.012 |
| Covertype-45 | 1.000 ± 0.000 | 0.990 ± 0.014 | 0.998 ± 0.001 | 1.000 ± 0.000 |
| Diabetes-33 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.995 ± 0.016 |
| Droid-84 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Electricity-94 | 1.000 ± 0.000 | 0.867 ± 0.097 | – | – |
| Heart-42 | 0.874 ± 0.214 | 0.871 ± 0.295 | 0.863 ± 0.304 | 0.863 ± 0.304 |
| Helena-84 | 1.000 ± 0.000 | 0.787 ± 0.120 | 0.669 ± 0.241 | 0.562 ± 0.267 |
| Heloc-65 | 0.983 ± 0.036 | 0.942 ± 0.066 | 0.959 ± 0.061 | – |
| IOT-86 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Jasmine-51 | 0.861 ± 0.110 | 0.911 ± 0.132 | 0.912 ± 0.121 | 0.939 ± 0.042 |
| Madeline-76 | 0.861 ± 0.321 | – | – | – |
| Madelon-73 | 0.867 ± 0.238 | 0.014 ± 0.020 | 0.000 ± – | 0.000 ± 0.000 |
| Magic-80 | 0.865 ± 0.157 | 0.971 ± 0.042 | 0.948 ± 0.043 | – |
| Monk2-17 | 1.000 ± 0.000 | 0.612 ± 0.443 | 0.813 ± 0.277 | 0.783 ± 0.359 |
| Mushroom-13 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.943 ± 0.074 |
| Phishing-44 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Shopping-112 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.913 ± 0.100 | 0.953 ± 0.045 |
| Spambase-24 | 0.955 ± 0.078 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.996 ± 0.005 |
| Student-48 | 1.000 ± 0.000 | 0.900 ± 0.162 | 0.908 ± 0.180 | 0.990 ± 0.031 |
| Taxi-27 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.975 ± 0.050 |
| TicTacToe-26 | 0.240 ± 0.382 | 0.324 ± 0.393 | 0.692 ± 0.438 | 0.730 ± 0.434 |
| Wine-64 | 1.000 ± 0.000 | 0.945 ± 0.127 | 0.985 ± 0.017 | 0.984 ± 0.007 |

#### D.5.4 Proxies with Feature Selection

An immediate implication of [Theorem 3.2](https://arxiv.org/html/2606.00202#S3.Thmtheorem2) is that PRAXIS scales linearly in the number of binary features, outside of the work performed by the proxy algorithm. While an ideal Rashomon set approximation would retain a large feature representation to fully capture the Rashomon effect, any single sparse decision tree uses only a small fraction of those features. Given this, it may not be necessary for the proxy algorithm itself to operate over the entire feature set to obtain accurate subproblem estimates.

We therefore study a hybrid regime in which the AND/OR graph is constructed using a large binarization (on the order of hundreds of thresholds), while the proxy algorithm is restricted to a much smaller subset of features. In our experiments, we obtain both the small and large binarizations using Threshold Guessing (McTavish et al., [2022](https://arxiv.org/html/2606.00202#bib.bib52)), which prunes binary threshold features to the minimal subset sufficient to match the predictions of a strong reference ensemble (a gradient boosted decision tree). To be precise, for the datasets where we generated two binarizations (with two different parameter sets for GBDTs), we pass the dataset with more columns into PRAXIS, but we modify our proxy algorithm (modified LicketySPLIT) to use only those features in the intersection of the two binarizations. By limiting the proxy to this reduced feature set while preserving the full binarization for search, our goal is to approximate the Rashomon set equally well with an even smaller computational burden.
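The restriction of the proxy to the intersection of the two binarizations can be sketched as follows. This is only an illustration: the function name and the threshold feature names are hypothetical, and how PRAXIS actually passes a feature subset to its proxy is not shown here.

```python
def proxy_feature_indices(large_columns, small_columns):
    """Indices into the large binarization of features that also appear
    in the small binarization (i.e. the intersection of the two)."""
    shared = set(small_columns)
    return [i for i, name in enumerate(large_columns) if name in shared]


# Hypothetical threshold features produced by two binarization runs.
large_columns = ["age<=25", "age<=40", "age<=60", "hours<=30", "hours<=45"]
small_columns = ["age<=40", "hours<=45", "capital_gain<=5000"]

# The search would see all of large_columns; the proxy only these indices.
proxy_idx = proxy_feature_indices(large_columns, small_columns)
print(proxy_idx)  # [1, 4]
```

Features that appear only in the smaller binarization (here the hypothetical `capital_gain<=5000`) are dropped, since the proxy can only use columns that exist in the dataset passed to the search.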

In [Table 21](https://arxiv.org/html/2606.00202#A4.T21) and [Table 22](https://arxiv.org/html/2606.00202#A4.T22), we see that using feature selection in the proxy algorithm is frequently an order of magnitude faster than using all the features in the larger binarization, while rarely degrading the approximation of the Rashomon set. This does come with some nuance, however. For datasets such as Christine (with $d=7$), the proxy solution at the root was significantly worse when using feature selection, resulting in a much longer runtime and higher memory consumption. Interestingly, on Christine with $d=5$, the feature selection version actually yielded more trees within the bound (a consequence of the initial budget having more slack).

However, there are some risks to this approach. If the number of features in the intersection is too small (e.g., Spambase), PRAXIS with a restricted proxy will run out of memory ($d=7$, due to a very poor budget initialization) or take longer than the full variant ($d=5$).

We include this experiment to illustrate that PRAXIS can accommodate extremely large feature spaces \(giving much better handling of continuous features, something that is not possible in any other Rashomon set work\), provided the proxy is constrained in a principled way\.

Table 21: Effect of feature selection on runtime, memory, and Rashomon set size. Memory is reported as peak RSS usage (Peak MB). $\lambda=0.01$, $\varepsilon_{\textrm{mult}}=0.03$, $d=5$. Trees are counted only if they are within a $1+\varepsilon_{\textrm{mult}}$ factor of the best tree either method found.

| Dataset | Features (FS) | Features (Full) | FS: Time (s) | FS: Peak MB | FS: Trees | Full: Time (s) | Full: Peak MB | Full: Trees |
|---|---|---|---|---|---|---|---|---|
| Adult | 14 | 209 | 4.26 | 144.27 | 128 | 290.24 | 675.64 | 128 |
| Bank | 78 | 217 | 16.27 | 235.65 | 6 | 56.24 | 323.10 | 6 |
| Bike | 42 | 164 | 11.16 | 209.24 | 594 | 67.22 | 571.40 | 594 |
| Covertype | 45 | 96 | 233.35 | 579.30 | 1,592 | 748.10 | 1,301.23 | 1,592 |
| Christine | 61 | 231 | 1,291.73 | 11,056.51 | 110,338 | 8,670.74 | 137,812.63 | 108,301 |
| Churn | 48 | 472 | 29.99 | 505.58 | 18,756 | 1,294.29 | 5,493.56 | 19,461 |
| Credit | 86 | 225 | 11.92 | 189.42 | 2 | 37.13 | 263.52 | 2 |
| Electricity | 76 | 264 | 4,721.27 | 17,737.12 | 240,818 | 28,839.68 | 77,920.45 | 240,818 |
| Helena | 32 | 156 | 5.38 | 249.57 | 74 | 60.02 | 361.47 | 74 |
| Jannis | 45 | 247 | 5,548.03 | 8,489.09 | 502,754 | 29,538.39 | 120,010.49 | 503,636 |
| Jasmine | 40 | 207 | 16.02 | 449.68 | 310 | 253.28 | 4,734.81 | 310 |
| Madeline | 41 | 451 | – | – | – | – | – | – |
| Madelon | 43 | 186 | – | – | – | 8,075.54 | 183,972.04 | 423,264 |
| Magic | 54 | 167 | 264.80 | 1,287.62 | 630,949 | 1,554.83 | 10,724.28 | 755,584 |
| Shopping | 74 | 243 | 20.74 | 259.70 | 15 | 116.08 | 675.98 | 15 |
| Spambase | 10 | 67 | 13.33 | 2,374.39 | 3,911 | 2.93 | 209.73 | 4,901 |

Table 22: Effect of feature selection on runtime, memory, and Rashomon set size. Memory is reported as peak RSS usage (Peak MB). $\lambda=0.0075$, $\varepsilon_{\textrm{mult}}=0.01$, $d=7$. Trees are counted only if they are within a $1+\varepsilon_{\textrm{mult}}$ factor of the best tree either method found.

| Dataset | Features (FS) | Features (Full) | FS: Time (s) | FS: Peak MB | FS: Trees | Full: Time (s) | Full: Peak MB | Full: Trees |
|---|---|---|---|---|---|---|---|---|
| Adult | 14 | 209 | 3.64 | 144.27 | 22 | 438.45 | 491.71 | 22 |
| Bank | 78 | 217 | 19.75 | 246.61 | 1 | 110.42 | 302.35 | 1 |
| Bike | 42 | 164 | 80.66 | 649.51 | 264 | 671.85 | 1,542.75 | 264 |
| Covertype | 45 | 96 | 171.48 | 579.30 | 114 | 1,083.31 | 1,301.23 | 114 |
| Christine | 61 | 231 | 6,570.90 | 57,737.20 | 893 | 7,088.70 | 21,987.82 | 893 |
| Churn | 48 | 472 | 23.54 | 432.37 | 4,800 | 919.79 | 1,810.98 | 4,800 |
| Credit | 86 | 225 | 35.90 | 238.98 | 2 | 170.45 | 325.19 | 2 |
| Electricity | 76 | 264 | 2,996.05 | 5,528.33 | 140,517 | 22,176.67 | 12,875.72 | 140,517 |
| Helena | 32 | 156 | 3.07 | 242.51 | 8 | 64.56 | 309.66 | 8 |
| Jannis | 45 | 247 | 84,690.03 | 133,152.96 | 3,553,370 | – | – | – |
| Jasmine | 40 | 207 | 109.08 | 2,259.58 | 76 | 1,400.28 | 7,145.45 | 76 |
| Madeline | 41 | 451 | – | – | – | – | – | – |
| Madelon | 43 | 186 | – | – | – | – | – | – |
| Magic | 54 | 167 | 227.91 | 1,432.90 | 11,547 | 1,578.62 | 3,224.09 | 11,547 |
| Shopping | 74 | 243 | 28.07 | 337.92 | 7 | 227.76 | 762.61 | 7 |
| Spambase | 10 | 67 | – | – | – | 2.72 | 163.99 | 216 |

We additionally compute the Rashomon Importance Distribution (RID; Donnelly et al., [2023](https://arxiv.org/html/2606.00202#bib.bib28)) using PRAXIS to assess the effect of binarization size. [Table 23](https://arxiv.org/html/2606.00202#A4.T23) reports, for each dataset feature, whether it receives zero or nonzero importance under the RID when using a larger versus smaller binarization in the proxy algorithm. We group all thresholds corresponding to the same continuous feature into a single importance score (through the RID binning map), while treating each one-hot encoded categorical variable as a distinct feature.

Across both datasets, the larger binarization identifies a substantial number of features that receive nonzero importance in the Rashomon set but are absent under the smaller binarization\. At the same time, it also reveals that some features highlighted by the smaller binarization are in fact irrelevant to all near\-optimal models\. As a result, disagreements in variable importance between the RID computed on the smaller and larger binarizations comprise a substantial fraction of the features in the union of the two binarizations \(22\.3% for Christine and 29\.9% for Jannis\)\.

Table 23: Feature-level agreement between binarization thresholds.

(a) Christine (157 features in the union of both binarizations)

| Smaller binarization (rows) / Larger binarization (columns) | Not Important | Yes Important |
|---|---|---|
| Not Important | 86 | 24 |
| Yes Important | 11 | 36 |

| Agreement metric | Value |
|---|---|
| Disagreement | 35 (22.3%) |
| Both nonzero | 36 features |
| Jaccard similarity | 0.507 |

(b) Jannis (127 features in the union of both binarizations)

| Smaller binarization (rows) / Larger binarization (columns) | Not Important | Yes Important |
|---|---|---|
| Not Important | 69 | 34 |
| Yes Important | 4 | 20 |

| Agreement metric | Value |
|---|---|
| Disagreement | 38 (29.9%) |
| Both nonzero | 20 features |
| Jaccard similarity | 0.345 |
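The agreement metrics in Table 23 follow directly from the four cell counts. The sketch below recomputes them, assuming that the Jaccard similarity is taken between the two sets of features with nonzero importance (intersection over union); that assumption reproduces the reported values. The function and argument names are illustrative only.

```python
def agreement_metrics(both_zero, only_larger, only_smaller, both_nonzero):
    """Agreement statistics from the 2x2 importance table.

    only_larger:  nonzero importance under the larger binarization only
    only_smaller: nonzero importance under the smaller binarization only
    """
    total = both_zero + only_larger + only_smaller + both_nonzero
    disagreement = only_larger + only_smaller
    union_nonzero = only_larger + only_smaller + both_nonzero
    return {
        "total": total,
        "disagreement": disagreement,
        "disagreement_pct": 100.0 * disagreement / total,
        "jaccard": both_nonzero / union_nonzero,
    }


christine = agreement_metrics(86, 24, 11, 36)
jannis = agreement_metrics(69, 34, 4, 20)

for name, m in [("Christine", christine), ("Jannis", jannis)]:
    print(name, m["total"], m["disagreement"],
          round(m["disagreement_pct"], 1), round(m["jaccard"], 3))
# Christine 157 35 22.3 0.507
# Jannis 127 38 29.9 0.345
```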

### D.6 Extracting smaller Rashomon sets

In [subsection D.4](https://arxiv.org/html/2606.00202#A4.SS4), we showed that PRAXIS recovers exactly optimal trees nearly all of the time when the budget initialization is set with $\varepsilon_{\textrm{mult}}=0$, and in every case tested when one sets $\varepsilon_{\textrm{mult}}=0.03$. Here, we show that the set returned for a larger $\varepsilon_{\textrm{mult}}$ contains a close approximation of the Rashomon set corresponding to a smaller $\varepsilon_{\textrm{mult}}$. In particular, we run PRAXIS with $\varepsilon_{\textrm{mult}}=0.03$ and evaluate the recall of the Rashomon set characterized by $\varepsilon_{\textrm{mult}}=0.01$ across a range of $\lambda$ values.
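One way to picture this evaluation is the following sketch. It assumes each tree has a hashable identifier and a scalar objective, that a ground-truth enumeration is available as a reference, and that recall is the fraction of reference trees inside the smaller bound that also appear in the returned set. The exact recall definition and data structures used in the experiments may differ; all names and numbers here are hypothetical.

```python
def subset_recall(returned, reference, eps_small):
    """Recall of the eps_small Rashomon set inside a larger returned set.

    returned, reference: dict mapping tree id -> objective value.
    """
    best = min(reference.values())
    bound = (1.0 + eps_small) * best
    # Trees that define the smaller Rashomon set.
    target = {t for t, obj in reference.items() if obj <= bound}
    # Trees from the larger run that fall inside the smaller bound.
    found = {t for t, obj in returned.items() if obj <= bound}
    return len(target & found) / len(target)


# Toy objectives: the best reference objective is 0.100, so the bound for
# eps_small = 0.01 is about 0.101 and the target set is {a, b, c}.
reference = {"a": 0.1000, "b": 0.1005, "c": 0.1008, "d": 0.1020}
returned = {"a": 0.1000, "b": 0.1005, "d": 0.1020, "e": 0.1029}

recall = subset_recall(returned, reference, eps_small=0.01)
print(round(recall, 3))  # 0.667
```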

Table 24: Running PRAXIS with $\varepsilon_{\textrm{mult}}=0.03$ but evaluating recall for the Rashomon set characterized by $\varepsilon_{\textrm{mult}}=0.01$. Each column reports subset recall for the given $\lambda$.

| Dataset | $\lambda=0.02$ | $\lambda=0.01$ | $\lambda=0.005$ | $\lambda=0.0025$ |
|---|---|---|---|---|
| Adult-14 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Aging-57 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Bank-97 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Bike-43 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Christine-80 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Churn-81 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.998 ± 0.008 | 0.998 ± 0.007 |
| Compas-44 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Covertype-45 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Diabetes-33 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Droid-84 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Electricity-94 | 1.000 ± 0.000 | 0.993 ± 0.015 | 0.981 ± 0.045 | 0.990 ± 0.018 |
| Heart-42 | 1.000 ± 0.000 | 0.999 ± 0.004 | 0.900 ± 0.316 | 0.900 ± 0.316 |
| Helena-84 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Heloc-65 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| IOT-86 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Jasmine-51 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.996 ± 0.013 |
| Madeline-76 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.890 ± 0.272 |
| Madelon-73 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.998 ± 0.005 |
| Magic-80 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.999 ± 0.002 | 1.000 ± 0.000 |
| Monk2-17 | 1.000 ± 0.000 | 0.967 ± 0.104 | 0.997 ± 0.010 | 0.871 ± 0.203 |
| Mushroom-13 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Phishing-44 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.950 ± 0.158 |
| Shopping-112 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Spambase-24 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Student-48 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.992 ± 0.027 | 0.991 ± 0.028 |
| Taxi-27 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| TicTacToe-26 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.986 ± 0.043 | 1.000 ± 0.000 |
| Wine-64 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |

### D.7 Allowing Non-Majority Leaf Predictions

All existing Rashomon set algorithms enumerate decision trees under the restriction that each leaf predicts the majority class of the samples it contains\. While this convention simplifies enumeration, it excludes trees that are still within the Rashomon bound\.

PRAXIS removes this restriction by constructing a more flexible AND/OR graph representation that supports enumeration and extraction of trees with any feasible leaf predictions (for a finite number of classes), provided that the resulting tree remains within the specified objective budget. This additional flexibility is important for downstream analyses that depend on the full diversity of feasible models, such as the Rashomon Importance Distribution (Donnelly et al., [2023](https://arxiv.org/html/2606.00202#bib.bib28)).

[Table 25](https://arxiv.org/html/2606.00202#A4.T25) compares the size of the Rashomon set returned by PRAXIS when restricting leaves to majority-class predictions versus allowing any leaf prediction that satisfies the budget. Across datasets, permitting non-majority leaf predictions expands the Rashomon set by up to roughly 41% (ratios of 1.414 on Madelon-73 and 1.410 on Poker-40), while leaving it unchanged on many of the smaller sets.

Importantly, enabling non-majority leaf predictions leaves runtime essentially unchanged: alternative leaf predictions are evaluated only at already-discovered subproblems, so no additional subproblems are explored. Moreover, any non-majority leaf prediction is guaranteed to have an objective no better than the majority-class prediction at that node, so allowing these predictions cannot introduce new splits or alter the structure of the AND/OR search graph.
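The reasoning above can be made concrete with a small sketch. It assumes the usual sparse-tree objective in which a leaf contributes its misclassified samples divided by the dataset size plus a per-leaf penalty $\lambda$; the function name, the per-leaf budget, and the numbers are hypothetical and are not taken from the PRAXIS code.

```python
def feasible_leaf_labels(class_counts, n_total, lam, budget):
    """Objective of every leaf label that stays within the budget.

    class_counts: dict mapping class label -> number of samples in the leaf.
    Assumed leaf objective: (misclassified samples) / n_total + lam.
    """
    n_leaf = sum(class_counts.values())
    feasible = {}
    for label, count in class_counts.items():
        objective = (n_leaf - count) / n_total + lam
        if objective <= budget:
            feasible[label] = objective
    return feasible


# A leaf holding 30 samples of class 0 and 20 of class 1, out of 1000 total.
counts = {0: 30, 1: 20}

loose = feasible_leaf_labels(counts, n_total=1000, lam=0.01, budget=0.045)
tight = feasible_leaf_labels(counts, n_total=1000, lam=0.01, budget=0.035)

print(sorted(loose))  # [0, 1]  both labels fit under the looser budget
print(sorted(tight))  # [0]     only the majority label fits
```

Because the majority label misclassifies the fewest samples, its objective is the smallest among all labels at that leaf, which is why checking the extra labels never changes which subproblems are worth exploring.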

We note that all comparisons with existing methods are conducted using the majority\-leaf\-only setting to ensure fairness, though PRAXIS will allow non\-majority leaves by default\.

| Dataset | Trees (Majority Leaves) | Trees (Any Leaves) | Ratio |
|---|---|---|---|
| Adult-14 | 374 | 388 | 1.037 |
| Adult-209 | 40,445 | 46,707 | 1.155 |
| Aging-57 | 58 | 58 | 1.000 |
| Bank-97 | 275 | 275 | 1.000 |
| Bank-217 | 1,209 | 1,209 | 1.000 |
| Bike-43 | 1,185 | 1,216 | 1.026 |
| Bike-164 | 28,051,413 | 33,375,316 | 1.190 |
| Chess-50 | 13,060 | 13,844 | 1.060 |
| Christine-80 | 4,455,530 | 5,023,805 | 1.128 |
| Churn-81 | 1,022,472 | 1,022,472 | 1.000 |
| Churn-472 | 84,800,256 | 84,800,256 | 1.000 |
| Compas-44 | 502,296 | 631,411 | 1.257 |
| Covertype-45 | 3,887,020 | 4,277,705 | 1.101 |
| Covertype-96 | 111,933,980 | 118,746,673 | 1.061 |
| Credit-134 | 299 | 337 | 1.127 |
| Credit-225 | 1,018 | 1,166 | 1.145 |
| Diabetes-33 | 1 | 1 | 1.000 |
| Diabetes-121 | 1 | 1 | 1.000 |
| Droid-84 | 21 | 21 | 1.000 |
| Electricity-94 | 45,425,460 | 50,308,970 | 1.108 |
| Electricity-264 | 20,085,242,492 | 22,739,283,641 | 1.132 |
| Heart-42 | 295,467,637,808,347 | 325,114,015,524,404 | 1.100 |
| Helena-84 | 114 | 114 | 1.000 |
| Helena-156 | 228 | 228 | 1.000 |
| Heloc-65 | 392,314 | 459,669 | 1.172 |
| Higgs-84 | 8,021,852 | 8,607,868 | 1.073 |
| IOT-86 | 1 | 1 | 1.000 |
| Jannis-106 | 1,749,341,033 | 2,088,723,863 | 1.194 |
| Jasmine-51 | 9,813,787 | 12,282,077 | 1.252 |
| Jasmine-207 | 513,101,630 | 624,323,888 | 1.217 |
| Madeline-76 | 596,704,313 | 627,765,128 | 1.052 |
| Madelon-73 | 715,653,012,747,000 | 1,012,247,433,898,600 | 1.414 |
| Magic-80 | 17,957,431 | 19,585,383 | 1.091 |
| Magic-167 | 1,120,242,116 | 1,198,581,794 | 1.070 |
| Monk2-17 | 1,531,085,513,715,984 | 1,933,476,627,234,832 | 1.263 |
| Mushroom-13 | 18 | 18 | 1.000 |
| Phishing-44 | 18 | 18 | 1.000 |
| Poker-40 | 10,470,399,019 | 14,767,465,903 | 1.410 |
| Shopping-112 | 59 | 59 | 1.000 |
| Shopping-243 | 105 | 105 | 1.000 |
| Spambase-24 | 108 | 108 | 1.000 |
| Spambase-67 | 9,048 | 9,048 | 1.000 |
| Student-48 | 25,784 | 26,996 | 1.047 |
| Taxi-27 | 3 | 3 | 1.000 |
| TicTacToe-26 | 63,198,720 | 64,643,200 | 1.023 |
| Wine-64 | 268 | 300 | 1.119 |

Table 25: Effect of majority-leaf-only restriction on Rashomon set size.

### D.8 Enumerating the full set of Rule Lists

We now compare PRAXIS with the rule-list variant from [Theorem A.12](https://arxiv.org/html/2606.00202#A1.Thmtheorem12), which guarantees that the full Rashomon set of rule lists is contained as a subset of the returned decision trees when the proxy is at least as good as a majority leaf prediction.

We run PRAXIS with a greedy proxy using

$$\lambda=0.01,\quad \varepsilon_{\textrm{mult}}=0.1,\quad d=5,$$

and record the resulting budget. We then rerun the algorithm in rule-list mode. We consider four rule-list variants, arising from two binary choices. First, the proxy can be either a majority-leaf predictor or a greedy tree algorithm. Second, because the result of [Theorem A.12](https://arxiv.org/html/2606.00202#A1.Thmtheorem12) does not require iterative budget refinement, we can either keep that refinement or simply subtract the leaf objective for the other side when we recurse.

[Table 26](https://arxiv.org/html/2606.00202#A4.T26) reports results on eight representative datasets without iterative budget refinement. Using a greedy proxy instead of a majority-leaf predictor recovers substantially more non–rule-list decision trees. Moreover, for several datasets (Bike, Chess, and Covertype, where the majority-leaf variant returns zero trees), no rule lists exist within 10% of the greedy tree objective.

Overall, the rule-list variant is not competitive with the default PRAXIS configuration. The default configuration can be two orders of magnitude faster, although on some datasets the difference narrows to only a few-fold. We also observe no advantage to using the greedy rule-list variant (without iterative budget refinement) over directly running PRAXIS. The larger number of trees returned by PRAXIS arises from its iterative budget refinement. Adding iterative budget refinement to the rule-list variants (see [Table 27](https://arxiv.org/html/2606.00202#A4.T27)) increases the number of trees returned, but at the cost of additional runtime.

Table 26: Comparison of rule list enumeration under a fixed budget with $\lambda=0.01$, $\varepsilon_{\mathrm{mult}}=0.1$, and depth $d=5$. Budgets are defined relative to the greedy tree objective. The final two columns report the ratio of the number of trees enumerated by each rule list variant relative to PRAXIS.

| Dataset | Greedy Rule List: Trees | Greedy Rule List: Time (s) | Leaf Rule List: Trees | Leaf Rule List: Time (s) | PRAXIS (Greedy): Trees | PRAXIS (Greedy): Time (s) | Tree Ratio vs. PRAXIS: Greedy RL | Tree Ratio vs. PRAXIS: Leaf RL |
|---|---|---|---|---|---|---|---|---|
| Adult-14 | 103,481 | 1.606 | 498 | 0.914 | 108,844 | 0.321 | 0.951 | 0.005 |
| Aging-57 | 5,768 | 72.892 | 5,768 | 67.056 | 5,768 | 0.095 | 1.000 | 1.000 |
| Bike-43 | 21,493,743 | 138.512 | 0 | 32.365 | 23,485,289 | 13.986 | 0.915 | 0.000 |
| Chess-50 | 6,867,464 | 322.358 | 0 | 89.991 | 6,883,172 | 7.528 | 0.998 | 0.000 |
| Churn-81 | 25,248,640 | 995.635 | 2,691,075 | 766.393 | 25,917,923 | 18.556 | 0.974 | 0.104 |
| Compas-44 | 11,456,202,024 | 115.098 | 1,066,678,125 | 50.406 | 11,926,699,093 | 62.703 | 0.961 | 0.089 |
| Covertype-45 | 271,299,290 | 5,025.532 | 0 | 3,393.202 | 279,368,958 | 2,686.531 | 0.971 | 0.000 |
| Helena-84 | 26,968 | 824.717 | 1,898 | 638.739 | 30,079 | 16.277 | 0.897 | 0.063 |

Table 27: Rule list enumeration with iterative budget refinement (RLIBR) under a fixed budget with $\lambda=0.01$, $\varepsilon_{\mathrm{mult}}=0.1$, and depth $d=5$. Budgets are defined relative to the greedy tree objective. The final two columns report the ratio of the number of trees enumerated by RLIBR relative to PRAXIS (Greedy).

| Dataset | RLIBR Greedy: Trees | RLIBR Greedy: Time (s) | RLIBR Leaf: Trees | RLIBR Leaf: Time (s) | PRAXIS (Greedy): Trees | PRAXIS (Greedy): Time (s) | Tree Ratio vs. PRAXIS: Greedy | Tree Ratio vs. PRAXIS: Leaf |
|---|---|---|---|---|---|---|---|---|
| Adult-14 | 109,147 | 1.772 | 109,079 | 1.751 | 108,844 | 0.321 | 1.003 | 1.002 |
| Aging-57 | 5,768 | 76.568 | 5,768 | 68.720 | 5,768 | 0.095 | 1.000 | 1.000 |
| Bike-43 | 23,644,937 | 188.044 | 23,321,274 | 79.218 | 23,485,289 | 13.986 | 1.007 | 0.993 |
| Chess-50 | 6,883,708 | 327.658 | 6,847,876 | 143.788 | 6,883,172 | 7.528 | 1.000 | 0.995 |
| Churn-81 | 26,094,667 | 1,122.606 | 26,062,803 | 1,161.384 | 25,917,923 | 18.556 | 1.007 | 1.006 |
| Compas-44 | 11,926,719,436 | 133.500 | 11,866,219,859 | 133.909 | 11,926,699,093 | 62.703 | 1.000 | 0.996 |
| Covertype-45 | 279,650,349 | 5,971.026 | 271,114,007 | 4,607.987 | 279,368,958 | 2,686.531 | 1.001 | 0.970 |
| Helena-84 | 31,622 | 863.677 | 31,622 | 656.378 | 30,079 | 16.277 | 1.051 | 1.051 |

### D.9 Depth 7 Rashomon Sets

Even for approximate algorithms, Rashomon set computation becomes increasingly challenging as depth grows, since the search space expands exponentially\. Despite this, PRAXIS remains practical at depths where the approximation offered by RESPLIT struggles\. For example, on Magic with 167 binary features, PRAXIS completes in under 7 hours, whereas RESPLIT was projected to require over 150 days before timing out\.

More broadly, PRAXIS enables depth\-7 Rashomon sets to be approximated efficiently across a wide range of datasets\. For instance, Bank with 97 binary features completes in just over 2 minutes \(compared to nearly 70 hours for RESPLIT\), while Churn with 472 binary features finishes in under 2 hours, where RESPLIT fails to complete at all\. Similar behavior is observed across many other datasets\.

Beyond being up to four orders of magnitude more efficient than RESPLIT in both runtime and memory, PRAXIS also yields substantially better Rashomon set approximations. In Table [28](https://arxiv.org/html/2606.00202#A4.T28), on 100% of datasets with a sufficient number of features (and where both methods finish), PRAXIS returned at least as many trees within the estimated Rashomon bound as RESPLIT, and strictly more whenever the set contained more than a few hundred trees. RESPLIT frequently returns zero trees within the bound, even when hundreds of thousands of feasible trees exist and were found by PRAXIS.

We additionally ran SORTeD on each of the datasets shown in the tables for 148 hours. Many datasets ran out of time; the full list is shown in Table [30](https://arxiv.org/html/2606.00202#A4.T30). This list includes Rashomon sets that PRAXIS approximated in 156 seconds (Bike), 386 seconds (Covertype), 55 seconds (Credit), 11 seconds (Droid) and 54 seconds (Helena).

Table 28: Number of trees within the shared Rashomon bound ($\lambda=0.003$, $\varepsilon_{\textrm{mult}}=0.01$, depth $=7$). The bound is computed as $(1+\varepsilon_{\textrm{mult}})\cdot\min$ objective across PRAXIS and RESPLIT for each dataset. We display only the datasets where both methods finished and that have at least 40 binary features (to allow the algorithm to build out a reasonably full tree, as depth 7 trees have up to 127 splits).

| Dataset | PRAXIS Count | RESPLIT Count | RESPLIT / PRAXIS |
|---|---|---|---|
| Aging-57 | 1 | 1 | 1.00 |
| Bank-97 | 237 | 237 | 1.00 |
| Christine-80 | 97,383 | 0 | 0.00 |
| Churn-81 | 444,984 | 0 | 0.00 |
| Compas-44 | 11,777 | 1,058 | 0.09 |
| Covertype-45 | 651,749 | 82,042 | 0.13 |
| Droid-84 | 20 | 4 | 0.20 |
| Electricity-94 | 338,384 | 0 | 0.00 |
| Helena-84 | 90 | 90 | 1.00 |
| Heloc-65 | 8,955 | 58 | 0.01 |
| IOT-86 | 1 | 1 | 1.00 |
| Jasmine-51 | 6,306 | 352 | 0.06 |
| Jasmine-207 | 251,240 | 8,328 | 0.03 |
| Magic-80 | 332,550 | 0 | 0.00 |
| Phishing-44 | 44 | 32 | 0.73 |
| Shopping-112 | 118 | 46 | 0.39 |
| Spambase-67 | 383 | 0 | 0.00 |
| Student-48 | 56,920,256 | 0 | 0.00 |
| Wine-64 | 255 | 47 | 0.18 |

Table 29: Runtime and peak memory at $\lambda=0.003$, $\varepsilon_{\textrm{mult}}=0.01$, depth $=7$. Here $n$ is the number of samples and $k$ the number of binary features.

| Dataset | $n$ | $k$ | PRAXIS: Time (s) | PRAXIS: Peak MB | RESPLIT: Time (s) | RESPLIT: Peak MB |
|---|---|---|---|---|---|---|
| Adult | 48842 | 14 | 0.19 | 145.69 | 18.11 | 415.54 |
| Adult | 48842 | 209 | 7461.51 | 2774.41 | – | – |
| Aging | 714 | 57 | 0.07 | 130.70 | 80.95 | 342.25 |
| Bank | 45211 | 97 | 141.28 | 290.55 | 247881.42 | 91926.96 |
| Bank | 45211 | 217 | 3535.03 | 1661.61 | – | – |
| Bike | 17379 | 43 | 2.55 | 146.74 | – | – |
| Bike | 17379 | 164 | 156.06 | 376.99 | – | – |
| Chess | 28056 | 50 | 6.12 | 159.23 | – | – |
| Christine | 5418 | 80 | 3460.69 | 19184.12 | 35727.41 | 31740.47 |
| Churn | 5000 | 81 | 7.09 | 204.37 | 20299.75 | 113299.24 |
| Churn | 5000 | 472 | 6761.06 | 15688.87 | – | – |
| Compas | 4966 | 44 | 1.15 | 147.32 | 1859.00 | 459.10 |
| Covertype | 581012 | 45 | 385.79 | 591.83 | 47552.42 | 22275.11 |
| Covertype | 581012 | 96 | 107591.75 | 24887.65 | – | – |
| Credit | 30000 | 134 | 54.93 | 218.25 | – | – |
| Credit | 30000 | 225 | 241.45 | 323.15 | – | – |
| Diabetes | 253680 | 33 | 7.87 | 288.39 | 6797.93 | 24142.36 |
| Diabetes | 253680 | 121 | 373.83 | 678.28 | – | – |
| Droid | 29332 | 84 | 11.08 | 166.18 | 17210.01 | 32359.53 |
| Electricity | 38474 | 94 | 1212.86 | 2107.44 | 26917.56 | 36508.97 |
| Helena | 65196 | 84 | 53.73 | 221.91 | 27537.87 | 103029.52 |
| Helena | 65196 | 156 | 945.48 | 585.24 | – | – |
| Heloc | 2502 | 65 | 92.73 | 1291.25 | 4953.27 | 5481.90 |
| IOT | 123117 | 86 | 0.75 | 306.32 | 5.35 | 808.59 |
| Jasmine | 2984 | 51 | 273.50 | 9100.83 | 767.51 | 2055.16 |
| Jasmine | 2984 | 207 | 1651.24 | 6359.39 | 249459.32 | 78873.87 |
| Madeline | 3140 | 76 | 20.29 | 329.56 | – | – |
| Magic | 19020 | 80 | 2607.54 | 5061.03 | 40294.21 | 17221.75 |
| Magic | 19020 | 167 | 24503.25 | 14662.51 | – | – |
| Monk2 | 601 | 17 | 0.10 | 137.16 | – | – |
| Mushroom | 8124 | 13 | 0.01 | 130.81 | 1.79 | 145.43 |
| News | 39644 | 196 | 61652.99 | 24006.25 | – | – |
| Phishing | 11055 | 44 | 0.91 | 139.52 | 968.78 | 2222.81 |
| Poker | 1025010 | 40 | 9883.75 | 990.20 | – | – |
| Shopping | 12330 | 112 | 393.77 | 662.75 | 69335.79 | 43121.48 |
| Shopping | 12330 | 243 | 11203.95 | 6963.14 | – | – |
| Spambase | 4601 | 24 | 0.17 | 130.68 | 39.61 | 377.48 |
| Spambase | 4601 | 67 | 53.20 | 433.59 | 2271.98 | 3803.36 |
| Student | 649 | 48 | 2.43 | 197.89 | 362.59 | 1410.65 |
| Taxi | 1224158 | 27 | 18.39 | 846.56 | 13034.83 | 14111.06 |
| TicTacToe | 958 | 26 | 17.45 | 4087.46 | – | – |
| Wine | 6497 | 64 | 41.15 | 400.14 | 5735.99 | 11738.09 |

Table 30: PRAXIS and SORTeD runtime at $\lambda=0.003$, $\varepsilon_{\textrm{mult}}=0.01$, depth $=7$. Entries of the form >532800 indicate that SORTeD exceeded the time limit (in seconds); entries of the form >200 GB indicate that it exceeded the memory limit.

| Dataset | $n$ | $k$ | PRAXIS Runtime (s) | SORTeD Runtime (s) |
|---|---|---|---|---|
| Adult | 48842 | 14 | 0.19 | 39.53 |
| Adult | 48842 | 209 | 7461.51 | >532800 |
| Aging | 714 | 57 | 0.07 | 414.39 |
| Bank | 45211 | 97 | 141.28 | >200 GB |
| Bank | 45211 | 217 | 3535.03 | >200 GB |
| Bike | 17379 | 43 | 2.55 | 4103.58 |
| Bike | 17379 | 164 | 156.06 | >532800 |
| Chess | 28056 | 50 | 6.12 | 5841.20 |
| Christine | 5418 | 80 | 3460.69 | >200 GB |
| Churn | 5000 | 81 | 7.09 | 41797.70 |
| Churn | 5000 | 472 | 6761.06 | >532800 |
| Compas | 4966 | 44 | 1.15 | 1352.45 |
| Covertype | 581012 | 45 | 385.79 | >532800 |
| Covertype | 581012 | 96 | 107591.75 | >532800 |
| Credit | 30000 | 134 | 54.93 | >532800 |
| Credit | 30000 | 225 | 241.45 | >532800 |
| Diabetes | 253680 | 33 | 7.87 | 104729.81 |
| Diabetes | 253680 | 121 | 373.83 | >532800 |
| Droid | 29332 | 84 | 11.08 | >532800 |
| Electricity | 38474 | 94 | 1212.86 | >532800 |
| Helena | 65196 | 84 | 53.73 | >532800 |
| Helena | 65196 | 156 | 945.48 | >532800 |
| Heloc | 2502 | 65 | 92.73 | 54344.65 |
| IOT | 123117 | 86 | 0.75 | 24.84 |
| Jasmine | 2984 | 51 | 273.50 | 7167.92 |
| Jasmine | 2984 | 207 | 1651.24 | >532800 |
| Madeline | 3140 | 76 | 20.29 | 150090.54 |
| Magic | 19020 | 80 | 2607.54 | 152162.28 |
| Magic | 19020 | 167 | 24503.25 | >532800 |
| Monk2 | 601 | 17 | 0.10 | 2.68 |
| Mushroom | 8124 | 13 | 0.01 | 0.96 |
| News | 39644 | 196 | 61652.99 | >532800 |
| Phishing | 11055 | 44 | 0.91 | 6195.96 |
| Poker | 1025010 | 40 | 9883.75 | >200 GB |
| Shopping | 12330 | 112 | 393.77 | >200 GB |
| Shopping | 12330 | 243 | 11203.95 | >200 GB |
| Spambase | 4601 | 24 | 0.17 | 79.16 |
| Spambase | 4601 | 67 | 53.20 | 61253.28 |
| Student | 649 | 48 | 2.43 | 1545.37 |
| Taxi | 1224158 | 27 | 18.39 | 138645.19 |
| TicTacToe | 958 | 26 | 17.45 | 137.47 |
| Wine | 6497 | 64 | 41.15 | 43200.34 |

### D.10 Results on Fully Binarized Datasets

In [Table 31](https://arxiv.org/html/2606.00202#A4.T31), we also evaluate Rashomon set computation on fully binarized versions of several datasets. For each continuous feature, we generate binary threshold features using the midpoints between consecutive unique values. This produces a fully binarized feature space. As in our earlier experiments, PRAXIS achieves near-perfect approximation quality while requiring substantially less time than prior methods. We additionally ran TreeFARMS on these datasets, but it did not complete on any instance within the prescribed time and memory limits.
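The midpoint binarization described above can be sketched in a few lines. The direction of the comparison (`value <= threshold`) and the function names are assumptions made for illustration; the preprocessing code used in the experiments may encode thresholds differently.

```python
def midpoint_thresholds(values):
    """Midpoints between consecutive unique values of one continuous feature."""
    unique = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(unique, unique[1:])]


def binarize_feature(values):
    """One binary column per midpoint threshold: 1 if value <= threshold."""
    thresholds = midpoint_thresholds(values)
    rows = [[int(v <= t) for t in thresholds] for v in values]
    return thresholds, rows


thresholds, rows = binarize_feature([3, 1, 2, 2, 5])
print(thresholds)  # [1.5, 2.5, 4.0]
print(rows)        # [[0, 0, 1], [1, 1, 1], [0, 1, 1], [0, 1, 1], [0, 0, 0]]
```

A feature with $u$ unique values yields $u-1$ binary columns, which is why fully binarized datasets have far more features than the pruned binarizations used elsewhere in this appendix.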

DatasetPRAXISRESPLITSORTeDTimeRecallTimeRecallTimeλ=0\.02,ε=0\.03\\lambda=0\.02,\\ \\varepsilon=0\.03Bike\-279187\.314434\.10\.285767188\.5Compas\-1081\.44175\.761277\.9Diabetes\-185106\.8—7622\.2——Droid\-850\.89175\.51933\.7Heloc\-14963996—98764\.5——Student\-1010\.27118\.321163\.8λ=0\.005,ε=0\.01\\lambda=0\.005,\\ \\varepsilon=0\.01Bike\-279938\.21——93207\.2Compas\-1082\.24179\.40\.0254239\.5Diabetes\-185273\.4—15121\.1——Droid\-852\.361136\.90\.125737\.5Heloc\-149634842\.1—94000\.7——Student\-1012\.080\.981746\.860\.2073321\.8λ=0\.005,ε=0\.03\\lambda=0\.005,\\ \\varepsilon=0\.03Bike\-2792809\.41——104518\.2Compas\-10834\.30\.9999830\.20\.0023308\.1Diabetes\-185273\.6—15138\.9——Droid\-852\.641139\.60\.0476739\.6Student\-10113\.510\.9992786\.80\.1134340\.7Table 31:Timing and approximation results on fully binarized continuous datasets at depth55\. Time is reported in seconds\. Results are shown only for runs that completed within 144,000 seconds and 200GB RAM\. TreeFARMS did not complete on any of these instances within the same resource budget\.
### D.11 Approximation Quality Comparisons

| Dataset | PRAXIS Recall ($\lambda=0.005$) | PRAXIS Recall ($\lambda=0.02$) | RESPLIT Recall ($\lambda=0.005$) | RESPLIT Recall ($\lambda=0.02$) |
|---|---|---|---|---|
| Adult-209 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.001 ± 0.001 | 0.265 ± 0.039 |
| Bank-97 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.511 ± 0.202 | 1.000 ± 0.000 |
| Bike-164 | 0.986 ± 0.010 | 1.000 ± 0.000 | 0.617 ± 0.000 | 0.299 ± 0.075 |
| Christine-231 | 1.000 ± 0.000 | 0.994 ± 0.011 | 0.617 ± 0.000 | 0.435 ± 0.091 |
| Churn-472 | 0.997 ± 0.005 | 1.000 ± 0.000 | 0.000 ± 0.000 | 0.955 ± 0.071 |
| Compas-44 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.007 ± 0.007 | 0.916 ± 0.177 |
| Covertype-96 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.003 ± 0.001 | 1.000 ± 0.000 |
| Credit-225 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Diabetes-33 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Droid-84 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.065 ± 0.027 | 1.000 ± 0.000 |
| Electricity-264 | 0.994 ± 0.003 | 1.000 ± 0.000 | 0.067 ± 0.000 | 1.000 ± 0.000 |
| Helena-156 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.202 ± 0.122 | 1.000 ± 0.000 |
| Heloc-65 | 0.999 ± 0.000 | 1.000 ± 0.000 | 0.001 ± 0.002 | 0.905 ± 0.211 |
| Jannis-106 | 0.995 ± 0.008 | 1.000 ± 0.000 | 0.001 ± 0.000 | 0.232 ± 0.021 |
| Jasmine-207 | 0.993 ± 0.009 | 1.000 ± 0.000 | 0.001 ± 0.001 | 0.489 ± 0.469 |
| Madelon-73 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.000 ± 0.000 | 0.009 ± 0.016 |
| Magic-80 | 0.997 ± 0.004 | 0.998 ± 0.004 | 0.001 ± 0.001 | 0.301 ± 0.022 |
| News-196 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.003 ± 0.002 | 0.938 ± 0.043 |
| Poker-40 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.111 ± 0.000 | 1.000 ± 0.000 |
| Shopping-243 | 0.984 ± 0.034 | 1.000 ± 0.000 | 0.091 ± 0.080 | 0.987 ± 0.030 |
| Taxi-27 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Wine-64 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.313 ± 0.278 | 1.000 ± 0.000 |

Table 32: Rashomon set recall (mean $\pm$ standard deviation over 5 bootstraps) comparing PRAXIS and RESPLIT across datasets for $\varepsilon_{\text{mult}}=0.03$ and depth $d=5$. Format: Dataset-NumBinaryFeatures.

##### Recall compared to RESPLIT.

In [Table 2](https://arxiv.org/html/2606.00202#S4.T2), we show recall results for PRAXIS across two sparsity values: $\lambda=0.02$ and $\lambda=0.005$. In [Table 32](https://arxiv.org/html/2606.00202#A4.T32), we additionally provide the recall results for RESPLIT. We note that RESPLIT substantially degrades in performance for smaller values of $\lambda$; this is in stark contrast to PRAXIS. We also show approximation results for PRAXIS under an even smaller value, $\lambda=0.0025$, in [Table 33](https://arxiv.org/html/2606.00202#A4.T33).
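To make the reported quantity concrete, the following minimal sketch computes recall for one bootstrap replicate and summarizes it as mean ± standard deviation. It assumes recall means the fraction of ground-truth Rashomon set members that the approximation also returns, and it uses the sample standard deviation; neither detail is restated in this appendix. The tree identifiers and the name `rashomon_recall` are hypothetical.

```python
from statistics import mean, stdev


def rashomon_recall(found, truth):
    """Fraction of the ground-truth Rashomon set that also appears in `found`.

    Assumption: trees are represented by hashable identifiers, so set
    intersection decides whether a ground-truth tree was recovered.
    """
    truth = set(truth)
    if not truth:
        raise ValueError("ground-truth Rashomon set is empty")
    return len(truth & set(found)) / len(truth)


# Hypothetical (found, truth) pairs, one per bootstrap replicate.
replicates = [
    ({"t1", "t2", "t3", "t4"}, {"t1", "t2", "t3", "t4"}),
    ({"t1", "t2", "t3"}, {"t1", "t2", "t3", "t4"}),
    ({"t1", "t2", "t5"}, {"t1", "t2"}),  # extra trees do not lower recall
]

recalls = [rashomon_recall(found, truth) for found, truth in replicates]

# Sample standard deviation (n - 1 denominator) is an assumption here.
summary = f"{mean(recalls):.3f} ± {stdev(recalls):.3f}"
print(summary)
```

Note that trees returned outside the ground-truth set do not affect recall under this definition; they would only matter for a precision-style measure.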

| Dataset | Recall |
| --- | --- |
| Adult-209 | $1.000 \pm 0.000$ |
| Bank-97 | $1.000 \pm 0.001$ |
| Bike-164 | $1.000 \pm 0.001$ |
| Churn-472 | $0.995 \pm 0.009$ |
| Compas-44 | $1.000 \pm 0.000$ |
| Credit-225 | $1.000 \pm 0.000$ |
| Diabetes-33 | $1.000 \pm 0.000$ |
| Droid-84 | $0.994 \pm 0.010$ |
| Helena-156 | $1.000 \pm 0.000$ |
| Heloc-65 | $1.000 \pm 0.000$ |
| Madelon-73 | $0.997 \pm 0.006$ |
| Magic-80 | $0.999 \pm 0.001$ |
| Poker-40 | $1.000 \pm 0.000$ |
| Shopping-243 | $0.997 \pm 0.004$ |
| Taxi-27 | $0.989 \pm 0.007$ |
| Wine-64 | $0.997 \pm 0.005$ |

Table 33: PRAXIS recall (mean $\pm$ standard deviation over 5 bootstraps) for $\lambda=0.0025$, $\varepsilon=0.003$. We exclude rows where the ground truth could not be enumerated.

##### Qualitative Comparison of Rashomon Set Approximations\.

We display results for Rashomon set approximations for $\lambda=0.005$, $\varepsilon_{\textrm{mult}}=0.01$, and $d=5$. We show a subset of the 51 datasets and binarizations here because we exclude cases where the Rashomon set is believed to be small (on the order of $10^{1}$ trees) or where neither PRAXIS nor RESPLIT ran. Across all figures, blue denotes PRAXIS using the modified LicketySPLIT proxy, green denotes RESPLIT, and orange denotes bootstrapped LicketySPLIT run for as long as PRAXIS ran. As in [Figure 3](https://arxiv.org/html/2606.00202#S4.F3), the dashed vertical line marks the estimated Rashomon bound, defined as $(1+\varepsilon)$ times the minimum objective found by any method. Trees returned beyond this threshold should not be counted as improving coverage of the target Rashomon set, since they have objectives outside the requested quality range. If such lower-quality trees are of interest, this should instead be reflected by requesting a larger Rashomon set, i.e., by increasing $\varepsilon$.
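The estimated Rashomon bound described above can be sketched as follows. This is an illustration only: the objective values are made up, and the helper names `rashomon_bound` and `within_bound` are hypothetical rather than part of any released code.

```python
def rashomon_bound(objectives_by_method, eps):
    """Return (1 + eps) times the smallest objective found by any method."""
    best = min(min(objs) for objs in objectives_by_method.values() if objs)
    return (1.0 + eps) * best


def within_bound(objectives, bound):
    """Keep only the trees whose objective does not exceed the bound."""
    return [obj for obj in objectives if obj <= bound]


# Hypothetical objective values of the trees returned by each method.
found = {
    "PRAXIS": [0.2000, 0.2010, 0.2015],
    "RESPLIT": [0.2010, 0.2040],
    "bootstrapped LicketySPLIT": [0.2019, 0.2100],
}

bound = rashomon_bound(found, eps=0.01)  # roughly 0.202, set by PRAXIS's best tree
counted = {method: within_bound(objs, bound) for method, objs in found.items()}

for method, objs in counted.items():
    print(f"{method}: {len(objs)} of {len(found[method])} trees inside the bound")
```

Because the bound is anchored to the best objective found by any method, a method that never finds a near-optimal tree can have all of its trees fall outside the bound.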

![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x44.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x45.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x46.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x47.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x48.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x49.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x50.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x51.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x52.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x53.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x54.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x55.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x56.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x57.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x58.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x59.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x60.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x61.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x62.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x63.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x64.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x65.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x66.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x67.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x68.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x69.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x70.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x71.png)![[Uncaptioned image]](https://arxiv.org/html/2606.00202v1/x72.png)![[Uncaptioned 
image]](https://arxiv.org/html/2606.00202v1/x73.png)

### D.12 Stopping Algorithms Early

One algorithm we compare against, SORTeD (Arslan et al., [2026](https://arxiv.org/html/2606.00202#bib.bib3)), supports saving intermediate solutions. Because it enumerates the Rashomon set in sorted order of objective value, truncating this process is equivalent to computing the Rashomon set for a smaller $\varepsilon_{\mathrm{mult}}$. However, before any solutions can be returned, SORTeD must first identify the optimal tree, which constitutes a major computational bottleneck. As shown in [Figure 4](https://arxiv.org/html/2606.00202#A4.F4), computing the optimal tree dominates the runtime; consequently, early termination still incurs most of the computational cost while recovering only a small fraction of the Rashomon set.

![Refer to caption](https://arxiv.org/html/2606.00202v1/figs/stop_early.png)

Figure 4: Runtime vs. fraction of the Rashomon set ($\lambda=0.005$, $\varepsilon_{\mathrm{mult}}=0.03$) recovered when stopping SORTeD early. The majority of time is spent computing the optimal tree, while early stopping yields only a small fraction of the set.

Similarly, PRAXIS could be run with a smaller $\varepsilon_{\mathrm{mult}}$ and then reuse all of the caches from the proxy algorithm to approximate a Rashomon set for a larger $\varepsilon_{\mathrm{mult}}$.
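The claim that truncating a sorted enumeration corresponds to a smaller multiplicative tolerance can be checked with a short calculation. The sketch below is not SORTeD itself; it only takes a list of objective values that is assumed to be sorted in ascending order (with made-up numbers) and reports the tolerance that the first `k` trees would satisfy. The name `effective_eps` is hypothetical.

```python
def effective_eps(sorted_objectives, k):
    """Smallest multiplicative eps whose Rashomon set contains the first k trees.

    Assumes `sorted_objectives` is in ascending order, so its first entry is
    the optimal objective. If several trees tie at position k, a truncation
    inside the tie is not exactly a Rashomon set for any eps.
    """
    if not 1 <= k <= len(sorted_objectives):
        raise ValueError("k must be between 1 and the number of trees")
    optimum = sorted_objectives[0]
    return sorted_objectives[k - 1] / optimum - 1.0


# Hypothetical objectives for a full Rashomon set at eps_mult = 0.03.
objectives = [0.200, 0.201, 0.202, 0.204, 0.206]

k = 3  # stop after three trees have been emitted
eps_reached = effective_eps(objectives, k)   # about 0.01
fraction_recovered = k / len(objectives)     # 0.6

print(f"eps reached: {eps_reached:.4f}, fraction recovered: {fraction_recovered:.2f}")
```

The calculation only applies once the optimum is known, which is why the time spent finding the optimal tree cannot be avoided by stopping early.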

### D.13 Example Decision Trees

We present example decision trees discovered by PRAXIS for several datasets used in the paper. Note that the $\gamma$ (or equivalently, $\lambda$) penalty on the number of leaves encourages sparse trees rather than fully grown trees. We set $\varepsilon_{\textrm{mult}}=0.03$ and $\lambda=0.005$ for these examples. The examples highlight the variety of rules and tree sizes that can arise while achieving similar objective values, illustrating the Rashomon effect in practice.
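As a rough illustration of how the leaf penalty trades off against accuracy, the sketch below assumes the common sparse-tree objective of misclassification rate plus $\lambda$ times the number of leaves. That form is not restated in this section, and plugging in the accuracies from the Figure 5 (Adult) captions as the loss term is a further assumption, since the captions do not say on which data the accuracy was measured. The name `penalized_objective` is hypothetical.

```python
def penalized_objective(accuracy, n_leaves, lam):
    """Assumed objective: misclassification rate plus lam per leaf."""
    return (1.0 - accuracy) + lam * n_leaves


lam = 0.005

# Accuracy and leaf counts taken from the Figure 5 captions (a) and (c).
small_tree = penalized_objective(accuracy=0.842, n_leaves=4, lam=lam)
large_tree = penalized_objective(accuracy=0.850, n_leaves=6, lam=lam)

# Two extra leaves cost 2 * 0.005 = 0.010, i.e. one percentage point of accuracy.
ratio = large_tree / small_tree

print(f"4 leaves: {small_tree:.3f}, 6 leaves: {large_tree:.3f}, ratio: {ratio:.4f}")
```

Under these assumptions the two trees differ by roughly one percent in objective, which is how trees of different sizes can sit side by side in the same Rashomon set.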

At each internal node, samples are routed to the left child when the binary feature evaluates to True and to the right child otherwise.
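The routing convention can be written out as a small traversal. This is a minimal sketch with a hypothetical tree representation (`Leaf`, `Split`) and a made-up three-leaf tree; it is not the data structure used by PRAXIS.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    prediction: int


@dataclass
class Split:
    feature: int    # index of the binary feature tested at this node
    left: "Node"    # followed when the feature is True
    right: "Node"   # followed when the feature is False


Node = Union[Leaf, Split]


def predict(node: Node, x: Sequence[bool]) -> int:
    """Route a sample down the tree: left on True, right otherwise."""
    while isinstance(node, Split):
        node = node.left if x[node.feature] else node.right
    return node.prediction


def count_leaves(node: Node) -> int:
    """Number of leaves, the quantity penalized by the sparsity term."""
    if isinstance(node, Leaf):
        return 1
    return count_leaves(node.left) + count_leaves(node.right)


# Made-up tree: test feature 0 at the root, then feature 1 on the right branch.
tree = Split(feature=0, left=Leaf(1), right=Split(feature=1, left=Leaf(0), right=Leaf(1)))

print(predict(tree, [True, False]))   # feature 0 is True, so go left
print(predict(tree, [False, True]))   # right at the root, then left
print(count_leaves(tree))
```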

![Refer to caption](https://arxiv.org/html/2606.00202v1/x74.png) (a) Accuracy: 84.2%, Leaves: 4
![Refer to caption](https://arxiv.org/html/2606.00202v1/x75.png) (b) Accuracy: 84.5%, Leaves: 5
![Refer to caption](https://arxiv.org/html/2606.00202v1/x76.png) (c) Accuracy: 85.0%, Leaves: 6
![Refer to caption](https://arxiv.org/html/2606.00202v1/x77.png) (d) Accuracy: 84.8%, Leaves: 6
![Refer to caption](https://arxiv.org/html/2606.00202v1/x78.png) (e) Accuracy: 84.8%, Leaves: 6
![Refer to caption](https://arxiv.org/html/2606.00202v1/x79.png) (f) Accuracy: 84.7%, Leaves: 6

Figure 5: Adult. Example near-optimal trees with accuracy and number of leaves shown below each tree.

![Refer to caption](https://arxiv.org/html/2606.00202v1/x80.png) (a) Accuracy: 89.8%, Leaves: 4
![Refer to caption](https://arxiv.org/html/2606.00202v1/x81.png) (b) Accuracy: 90.0%, Leaves: 5
![Refer to caption](https://arxiv.org/html/2606.00202v1/x82.png) (c) Accuracy: 89.5%, Leaves: 5
![Refer to caption](https://arxiv.org/html/2606.00202v1/x83.png) (d) Accuracy: 90.1%, Leaves: 7

Figure 6: Bank. Example near-optimal trees with accuracy and number of leaves shown below each tree.

![Refer to caption](https://arxiv.org/html/2606.00202v1/x84.png) (a) Accuracy: 86.4%, Leaves: 7
![Refer to caption](https://arxiv.org/html/2606.00202v1/x85.png) (b) Accuracy: 87.4%, Leaves: 10
![Refer to caption](https://arxiv.org/html/2606.00202v1/x86.png) (c) Accuracy: 86.1%, Leaves: 7
![Refer to caption](https://arxiv.org/html/2606.00202v1/x87.png) (d) Accuracy: 87.3%, Leaves: 10
![Refer to caption](https://arxiv.org/html/2606.00202v1/x88.png) (e) Accuracy: 86.3%, Leaves: 7
![Refer to caption](https://arxiv.org/html/2606.00202v1/x89.png) (f) Accuracy: 86.7%, Leaves: 8

Figure 7: Bike. Example near-optimal trees with accuracy and number of leaves shown below each tree.

![Refer to caption](https://arxiv.org/html/2606.00202v1/x90.png) (a) Accuracy: 93.4%, Leaves: 10
![Refer to caption](https://arxiv.org/html/2606.00202v1/x91.png) (b) Accuracy: 93.5%, Leaves: 11
![Refer to caption](https://arxiv.org/html/2606.00202v1/x92.png) (c) Accuracy: 93.4%, Leaves: 11
![Refer to caption](https://arxiv.org/html/2606.00202v1/x93.png) (d) Accuracy: 93.4%, Leaves: 11
![Refer to caption](https://arxiv.org/html/2606.00202v1/x94.png) (e) Accuracy: 92.3%, Leaves: 9
![Refer to caption](https://arxiv.org/html/2606.00202v1/x95.png) (f) Accuracy: 91.8%, Leaves: 8

Figure 8: Churn. Example near-optimal trees with accuracy and number of leaves shown below each tree.

![Refer to caption](https://arxiv.org/html/2606.00202v1/x96.png) (a) Accuracy: 67.8%, Leaves: 7
![Refer to caption](https://arxiv.org/html/2606.00202v1/x97.png) (b) Accuracy: 67.7%, Leaves: 8
![Refer to caption](https://arxiv.org/html/2606.00202v1/x98.png) (c) Accuracy: 67.7%, Leaves: 8
![Refer to caption](https://arxiv.org/html/2606.00202v1/x99.png) (d) Accuracy: 67.6%, Leaves: 8

Figure 9: Compas. Example near-optimal trees with accuracy and number of leaves shown below each tree.

![Refer to caption](https://arxiv.org/html/2606.00202v1/x100.png) (a) Accuracy: 75.5%, Leaves: 10
![Refer to caption](https://arxiv.org/html/2606.00202v1/x101.png) (b) Accuracy: 73.9%, Leaves: 7
![Refer to caption](https://arxiv.org/html/2606.00202v1/x102.png) (c) Accuracy: 73.2%, Leaves: 3
![Refer to caption](https://arxiv.org/html/2606.00202v1/x103.png) (d) Accuracy: 74.5%, Leaves: 7

Figure 10: Covertype. Example near-optimal trees with accuracy and number of leaves shown below each tree.

![Refer to caption](https://arxiv.org/html/2606.00202v1/x104.png) (a) Accuracy: 92.7%, Leaves: 4
![Refer to caption](https://arxiv.org/html/2606.00202v1/x105.png) (b) Accuracy: 93.4%, Leaves: 6
![Refer to caption](https://arxiv.org/html/2606.00202v1/x106.png) (c) Accuracy: 93.3%, Leaves: 6
![Refer to caption](https://arxiv.org/html/2606.00202v1/x107.png) (d) Accuracy: 92.8%, Leaves: 5
![Refer to caption](https://arxiv.org/html/2606.00202v1/x108.png) (e) Accuracy: 93.3%, Leaves: 6
![Refer to caption](https://arxiv.org/html/2606.00202v1/x109.png) (f) Accuracy: 93.2%, Leaves: 6

Figure 11: Droid. Example near-optimal trees with accuracy and number of leaves shown below each tree.

![Refer to caption](https://arxiv.org/html/2606.00202v1/x110.png) (a) Accuracy: 71.6%, Leaves: 6
![Refer to caption](https://arxiv.org/html/2606.00202v1/x111.png) (b) Accuracy: 71.6%, Leaves: 6

Figure 12: Heloc. Example near-optimal trees with accuracy and number of leaves shown below each tree.

![Refer to caption](https://arxiv.org/html/2606.00202v1/x112.png) (a) Accuracy: 82.9%, Leaves: 7
![Refer to caption](https://arxiv.org/html/2606.00202v1/x113.png) (b) Accuracy: 83.3%, Leaves: 8
![Refer to caption](https://arxiv.org/html/2606.00202v1/x114.png) (c) Accuracy: 83.6%, Leaves: 9
![Refer to caption](https://arxiv.org/html/2606.00202v1/x115.png) (d) Accuracy: 83.0%, Leaves: 8

Figure 13: Magic. Example near-optimal trees with accuracy and number of leaves shown below each tree.

![Refer to caption](https://arxiv.org/html/2606.00202v1/x116.png) (a) Accuracy: 89.2%, Leaves: 2
![Refer to caption](https://arxiv.org/html/2606.00202v1/x117.png) (b) Accuracy: 89.9%, Leaves: 4
![Refer to caption](https://arxiv.org/html/2606.00202v1/x118.png) (c) Accuracy: 89.6%, Leaves: 4
![Refer to caption](https://arxiv.org/html/2606.00202v1/x119.png) (d) Accuracy: 90.2%, Leaves: 5

Figure 14: Shopping. Example near-optimal trees with accuracy and number of leaves shown below each tree.

![Refer to caption](https://arxiv.org/html/2606.00202v1/x120.png) (a) Accuracy: 90.6%, Leaves: 6
![Refer to caption](https://arxiv.org/html/2606.00202v1/x121.png) (b) Accuracy: 90.8%, Leaves: 8
![Refer to caption](https://arxiv.org/html/2606.00202v1/x122.png) (c) Accuracy: 91.3%, Leaves: 9
![Refer to caption](https://arxiv.org/html/2606.00202v1/x123.png) (d) Accuracy: 91.7%, Leaves: 10

Figure 15: Spambase. Example near-optimal trees with accuracy and number of leaves shown below each tree.

![Refer to caption](https://arxiv.org/html/2606.00202v1/x124.png) (a) Accuracy: 86.0%, Leaves: 9
![Refer to caption](https://arxiv.org/html/2606.00202v1/x125.png) (b) Accuracy: 84.1%, Leaves: 4
![Refer to caption](https://arxiv.org/html/2606.00202v1/x126.png) (c) Accuracy: 85.2%, Leaves: 7
![Refer to caption](https://arxiv.org/html/2606.00202v1/x127.png) (d) Accuracy: 85.5%, Leaves: 8
![Refer to caption](https://arxiv.org/html/2606.00202v1/x128.png) (e) Accuracy: 86.3%, Leaves: 10
![Refer to caption](https://arxiv.org/html/2606.00202v1/x129.png) (f) Accuracy: 84.9%, Leaves: 7

Figure 16: Student. Example near-optimal trees with accuracy and number of leaves shown below each tree.
