Learning High Coverage Discriminative Parsimonious Rulesets
Summary
This paper introduces CDPR, a new approach for learning highly accurate and interpretable rule sets for classification using submodular maximization, achieving over 2.5x improvement in coverage compared to existing methods.
View Cached Full Text
Cached at: 06/15/26, 09:11 AM
# Learning High Coverage Discriminative Parsimonious Rulesets
Source: [https://arxiv.org/html/2606.14156](https://arxiv.org/html/2606.14156)
11institutetext:Indian Institute of Science, CV Raman Rd, Bengaluru, Karnataka 56001222institutetext:Compass, Bengaluru, Karnataka 560012###### Abstract
Learning systems based on IF\-THEN rule representations readily offer interpretability, making them a crucial focus in contemporary AI research\. A key objective for such rule sets is to achieve both high discriminative power and interpretability\. While existing state\-of\-the\-art algorithms implicitly prioritize predictive accuracy, they often fall short on one or more quality metrics that ensure interpretability, such as coverage and parsimony of rule sets\. Motivated by this, this paper propose the development of CDPR, which aims to create highly accurate and interpretable rule sets for classification problems\. To the best of our knowledge, this represents the first attempt to establish such an approach\. In this study, we introduce two algorithms rooted in sub\-modular maximization, which not only provide provable guarantees on coverage but also yield rule sets that are both discriminative and parsimonious\. We empirically demonstrate that rule sets learned through our approaches achieve higher accuracy and interpretability and has more than a 2\.5\-fold improvement in average coverage rates when compared to the next best algorithm\.
## 1INTRODUCTION
There is currently a growing demand for the development of interpretable models in critical domains, such as healthcare\[[18](https://arxiv.org/html/2606.14156#bib.bib8)\], criminal justice\[[17](https://arxiv.org/html/2606.14156#bib.bib10)\], finance\[[23](https://arxiv.org/html/2606.14156#bib.bib9)\]\. This demand has been further fueled by the Right to Explanation as outlined in the GDPR Law, which was approved by the European Parliament\. This legislation grants individuals the right to receive an explanation for the automated inferences made by models and allows them to question and challenge associated recommendations, especially when these model decisions significantly impact an individual’s life\[[29](https://arxiv.org/html/2606.14156#bib.bib89),[10](https://arxiv.org/html/2606.14156#bib.bib92)\]\. Rule\-based models, known for their interpretability, offer practitioners and experts the opportunity to examine the logic behind the models and validate or enhance them\. While post\-hoc methods can provide interpretations for black\-box models, they do not always guarantee the fidelity of explanations for such models\.
IFthalassemia = normalandnumber of major vessels \(0\-3\) colored by fluoroscopy = 0andresting blood pressure < 127THENNo Heart DiseaseIFchest pain type = asymptomaticandage\>41andmaximum heart rate achieved <= 159THENHeart Disease …
Figure 1:Rule set consisting of attributes from Heart disease dataset\[[6](https://arxiv.org/html/2606.14156#bib.bib77)\]for diagnosing Heart diseaseThis paper focuses on developing rule\-sets111also known as DNF or AND\-of\-ORs\[[16](https://arxiv.org/html/2606.14156#bib.bib14),[5](https://arxiv.org/html/2606.14156#bib.bib6),[31](https://arxiv.org/html/2606.14156#bib.bib15),[36](https://arxiv.org/html/2606.14156#bib.bib16),[34](https://arxiv.org/html/2606.14156#bib.bib87),[24](https://arxiv.org/html/2606.14156#bib.bib86),[12](https://arxiv.org/html/2606.14156#bib.bib54)\]as interpretable models for classification\. A rule\-set consists of an independent collection ofIF\-THENrules that assign an example to a specific class label if the example meets the condition\-value pairs of the rule\. A comprehensive comparative study conducted on several datasets reveals that while rule\-sets learned by existing algorithms maintain high accuracy, they often fail to meet the desired criteria for interpretability\. Earlier rule\-set algorithms, which were based on sequential covering and associative classification, suffered from the problem of having a large number of rules, which negatively impacted interpretability\[[16](https://arxiv.org/html/2606.14156#bib.bib14)\]\. However, recent algorithms for learning rule\-sets face a different challenge: they exhibit low coverage even when predictive power is high\. This challenge is referred to as the*high accuracy\-low coverage problem*\.
A rule\-set with low\-coverage only represents a fraction of the input space\. In most scenarios, it fails to provide individuals with decisions based on verified rules from practitioners or experts\. Instead, decisions are made using a default rule, the logic of which is unknown to the individual\. Consequently, the model’s decision lacks transparency and is unavailable for explanation, which is expected from interpretable models\. This limitation hinders the adoption of rule\-sets in critical applications where transparency and faithfulness are crucial concerns\. While the importance of rule\-set accuracy has been extensively discussed in the literature, the significance of developing rule\-sets that cover the entire input space has not, to the best of our knowledge, been adequately addressed\.
This paper primarily aims to address the limitation of existing algorithms and produce*high quality*rules, measured by accuracy and interpretability metrics such as accuracy, coverage and parsimony\. This paper introduces the problem of learninghigh Coverage Discriminative Parsimonious Rule sets \(CDPR\), which aims to learn rule\-sets that meet certain quality metrics\. While submodular approaches have been applied to various problems, they are seldom used in rule learning\. The problem of learning CDPR can be formulated as submodular maximization with several constraints, some of which are sub\-modular\. The paper proposes a novelGraph Rules Algorithm\(GRA\) to solve the sub\-modular optimization problem\. Additionally, the paper introduces a variant of CDPR known asCDPRL\-MOand presents aGreedy algorithm\(GDY\) to solveCDPRL\-MO\.
The main contributions related to this problem are summarized below\.
1. 1\.Existing algorithms for learning rule\-sets often do not yieldhigh qualityrules, measured by accuracy and interpretability metrics such as coverage, accuracy and parsimony as summarized in Table[4](https://arxiv.org/html/2606.14156#S6.T4)in Section[6\.2](https://arxiv.org/html/2606.14156#S6.SS2)\. Coverage is most critically impacted \(often the coverage is as low as less than 10%\) while accuracy is very high, resulting in what we refer to as*high accuracy\-low coverage problem*\. Motivated by this problem we introduce the problem of learning CDPR\. This is a new problem which does not appear to have been studied\.
2. 2\.The main contribution of this paper is the introduction of a novelGraphRules Algorithm\(GRA\)\.GRAis shown to have\(1−\(1−\(1−β\)γ\(1\+dβ\)\)γ\)\(1\-\(1\-\\frac\{\(1\-\\beta\)\}\{\\gamma\(1\+d\\beta\)\}\)^\{\\gamma\}\)\- approximation guarantee \(Theorem[5\.1](https://arxiv.org/html/2606.14156#S5.Thmtheorem1)\), whereβ\\betaandγ\\gammaare the maximum threshold of overlap\-rate and number of rules in the rule\-set andddis the minimum in\-degree of rules added to rule set\. Denoting byBα,δB\_\{\\alpha,\\delta\}, the set of rules which have accuracy greater thanα\\alphaand width less thanδ\\delta,GRAhas a running time that is quadratic in\|Bα,δ\|\|B\_\{\\alpha,\\delta\}\|\.
3. 3\.With the goal of developing algorithms with reduced runtime, we introduce a relaxed variant of CDPR, namedCDPRL\-MO, which is a sub\-modular problem over a large setBα,δB\_\{\\alpha,\\delta\}\. This problem can be approximated using aGreedy Algorithm\(GDY\) to a constant factor \(Theorem[5\.2](https://arxiv.org/html/2606.14156#S5.Thmtheorem2)\)\. The run time ofGDYis nearly linear in\|Bα,δ\|\|B\_\{\\alpha,\\delta\}\|\.
4. 4\.Comprehensive experiments were conducted using ten publicly available datasets\. The experimental results indicate thatGRAandGDYachieve more than a 2\.5 times improvement in average coverage rate compared to IDS, the next best algorithm\. Therefore,GRAandGDYachieve better coverage than the state\-of\-the\-art baselines while remaining competitive in both discriminative power and interpretability\.
## 2RELATED WORK
Rule\-based models have been extensively studied in various fields over the years, with notable works including those by\[[30](https://arxiv.org/html/2606.14156#bib.bib93),[2](https://arxiv.org/html/2606.14156#bib.bib96),[8](https://arxiv.org/html/2606.14156#bib.bib94),[14](https://arxiv.org/html/2606.14156#bib.bib95),[27](https://arxiv.org/html/2606.14156#bib.bib98),[13](https://arxiv.org/html/2606.14156#bib.bib97),[28](https://arxiv.org/html/2606.14156#bib.bib99)\]\. In sequential covering techniques such as RIPPER\[[4](https://arxiv.org/html/2606.14156#bib.bib23)\], and CN2\[[3](https://arxiv.org/html/2606.14156#bib.bib102)\], a rule is generated at each stage from uncovered examples using heuristics and newly covered examples are subsequently removed\. In associative classification techniques\[[21](https://arxiv.org/html/2606.14156#bib.bib21),[20](https://arxiv.org/html/2606.14156#bib.bib25),[35](https://arxiv.org/html/2606.14156#bib.bib24)\], a large set of rules is initially generated through association rule mining\. Subsequently, a rule set is constructed by ranking rules based on specific interestingness criteria, followed by heuristic pruning to eliminate redundant rules\. However, these techniques commonly suffer from the issue of producing a large number of rules in the rule set, which can have a detrimental impact on interpretability\.
The initial algorithms for learning interpretable rule\-sets primarily emphasize rule\-sets with reduced model complexity, measured by having fewer conditions for improved interpretability\. For instance, Bayesian Rule Sets \(BRS\), as proposed by\[[31](https://arxiv.org/html/2606.14156#bib.bib15)\], presents a probabilistic approach for learning rule\-sets from a pre\-mined set of rules, employing a simulated annealing procedure\. Alternatively, in a different approach, the problem of learning rule\-sets is framed as a group testing problem, as described in\[[22](https://arxiv.org/html/2606.14156#bib.bib104)\]\. This problem can be effectively addressed using the linear programming \(LP\) relaxation technique derived from the field of compressed sensing literature\.
Many recent works define interpretability in terms of having a smaller rule size or fewer rules\. For instance, LIBRE,\[[24](https://arxiv.org/html/2606.14156#bib.bib86)\], employs boolean function synthesis to learn rules from data\. Another approach, DefragTrees, as proposed by\[[12](https://arxiv.org/html/2606.14156#bib.bib54)\], utilizes a probabilistic model representation to merge adjacent regions in the input space, reducing the overall number of regions\. However, a notable limitation of DefragTrees is that the process of merging different regions to create fewer rules often results in more complex rules that are challenging to interpret\. In contrast, the linear programming algorithm, discussed in works by\[[5](https://arxiv.org/html/2606.14156#bib.bib6),[36](https://arxiv.org/html/2606.14156#bib.bib16)\], generates rule sets without the need for pre\-mining of rules, using integer programming\. Nevertheless, LP\-based methods suffer from the challenge of computational intractability when applied to large datasets and are generally applicable only to datasets with binarized feature values and binarized labels\.
DRS\[[12](https://arxiv.org/html/2606.14156#bib.bib54)\], defines interpretability as the presence of parsimonious rule\-sets with minimal overlap\. In a manner akin to our approach, DRS also focuses on learning high\-quality rules\. DRS employs a rule\-discovery process in each iteration, where it samples candidate rules exclusively from the pool of uncovered records, similar to a sequential covering approach\. It’s worth noting that DRS assumes that all the features and class labels are binary\. In contrast, IDS, as presented by\[[16](https://arxiv.org/html/2606.14156#bib.bib14)\], optimize a global objective to learn rule\-sets from a pre\-mined set of rules\. Their approach seeks to strike a balance between accuracy and interpretability when constructing rule\-sets\.
## 3A comparative study on the quality of rules generated by state\-of\-the\-art algorithms for rule learning
To comprehend the quality of rules generated by current state\-of\-the\-art algorithms, we conduct an experimental study\. In this section, we present the results of the study\.
### 3\.1Formal setup and Definitions
Consider a datasetD=\{\(Xi,yi\)\|i∈1,…,N\}D=\\left\\\{\\left\(X\_\{i\},y\_\{i\}\\right\)\|i\\in 1,\\dots,N\\right\\\}consisting of N samples, whereXiX\_\{i\}andyiy\_\{i\}denote the input features and the class label respectively for data\-instanceii\. LetAAbe the universal set of rules, where each ruler∈Ar\\in Ais a pair\(s,c\)\(s,c\), withssbeing a conjunction of conditions\-value pairs, andccindicating the class label\. In our investigation, we obtainedAAfrom the decision tree ensembles \(Section[6\.1](https://arxiv.org/html/2606.14156#S6.SS1)\)\. We recall the following definitions which are standard\[[16](https://arxiv.org/html/2606.14156#bib.bib14)\]\.
- •Coverage\.Coverof a ruler=\(s,c\)r=\(s,c\)is the set of data points inDDthat satisfy the condition\-value pairs inss\. Coverage of a rule\-setSSisf\(S\)=coverage\(S\)=\|⋃r∈Scover\(r\)\|f\(S\)=coverage\(S\)=\|\\bigcup\_\{r\\in S\}cover\(r\)\|
- •Rule\-size\.Rule\-size\(SS\) is the number of rulesr∈Sr\\in S
- •Rule\-width\.Rule\-width represents the number of condition\-value pairs in a rule\. For example, a rulerr=IF \(cond1≤\\leqvalue1\) and \(cond2≤\\leqvalue2\) THEN label1hasrule\-width\(r\)=2rule\\mbox\{\-\}width\(r\)=2\.
- •Overlap\-rate\.Overlap\(ri,rjr\_\{i\},r\_\{j\}\)between two rulesri,rjr\_\{i\},r\_\{j\}whereri=\(si,ci\)r\_\{i\}=\(s\_\{i\},c\_\{i\}\)andrj=\(sj,cj\)r\_\{j\}=\(s\_\{j\},c\_\{j\}\)is the number of data points whose attributes satisfy bothsis\_\{i\}andsjs\_\{j\}\.overlap\(ri,rj\)=\|cover\(ri\)⋂cover\(rj\)\|overlap\(r\_\{i\},r\_\{j\}\)=\|cover\(r\_\{i\}\)\\bigcap cover\(r\_\{j\}\)\|\.overlap\-rate\(ri,rj\)=overlap\(ri,rj\)\|cover\(ri\)\|overlap\\mbox\{\-\}rate\(r\_\{i\},r\_\{j\}\)=\\frac\{overlap\(r\_\{i\},r\_\{j\}\)\}\{\|cover\(r\_\{i\}\)\|\}
- •Accuracy\.Correct\-coverof a ruler=\(s,c\)r=\(s,c\)represents the set of data points inDDthat satisfy bothssandcc\. Theincorrect\-coverof a ruler=\(s,c\)r=\(s,c\)are the set of data points inDDthat satisfyssbut do not satisfycc\. Accuracy of a rulerris defined asaccuracy\(r\)=\|correct\-cover\(r\)\|\|cover\(r\)\|accuracy\(r\)=\\frac\{\|correct\\mbox\{\-\}cover\(r\)\|\}\{\|cover\(r\)\|\}
The following metrics are considered for measuring the quality of rule\-sets:
- •Rule Set Accuracy \(RSA\)\.RSA is the accuracy of the entire rule\-set\. It represents the fraction of examples in the test dataset that the rule\-set correctly classifies\.
- •Coverage\-rate\.This metric measures the coverage of all the rules in the rule\-set relative to the total number of data points\.coverage\-rate\(S\)=\|∪r∈Scover\(r\)\|Ncoverage\\mbox\{\-\}rate\(S\)=\\frac\{\|\\cup\_\{r\\in S\}cover\(r\)\|\}\{N\}\. We want coverage as large as possible\.
- •Maximum rule\-width\.This refers to the maximum width among all rules within the rule\-set\.Max\.width\(S\)=maxr∈Swidth\(r\)Max\.width\(S\)=max\_\{r\\in S\}width\(r\)\. A smaller value on this metric indicates more meaningful rules in the rule\-set\.
### 3\.2Comparison of Existing Benchmarks
We conduct comprehensive experiments to assess the quality of rule\-sets learned by the existing algorithms using ten real\-world datasets from the UCI machine learning repository\[[6](https://arxiv.org/html/2606.14156#bib.bib77)\]as well as two real\-world Alzheimer’s disease datasets obtained from NACC Uniform Data Set\[[33](https://arxiv.org/html/2606.14156#bib.bib80),[1](https://arxiv.org/html/2606.14156#bib.bib82),[32](https://arxiv.org/html/2606.14156#bib.bib108)\]\. Detailed experiment information is provided in the Experiments section \([6\.1](https://arxiv.org/html/2606.14156#S6.SS1)\), and the results are summarized in Table[4](https://arxiv.org/html/2606.14156#S6.T4),[1](https://arxiv.org/html/2606.14156#S6.T1),[2](https://arxiv.org/html/2606.14156#S6.T2)and[3](https://arxiv.org/html/2606.14156#S6.T3)\.
- •It’s worth noting that IDS\[[16](https://arxiv.org/html/2606.14156#bib.bib14)\]and RIPPER\[[4](https://arxiv.org/html/2606.14156#bib.bib23)\]exhibit limitations\. The mean coverage\-rate over the test set is only 35% for IDS and 31% for RIPPER \(see Table[2](https://arxiv.org/html/2606.14156#S6.T2)\)\. This suggests that the corresponding rules are ineffective for over 65% of the examples, which is a significant concern for critical applications\.
- •While DefragTrees\[[12](https://arxiv.org/html/2606.14156#bib.bib54)\]achieves high coverage, it faces challenges related to complex rule\-sets with large rule\-width \(Table[3](https://arxiv.org/html/2606.14156#S6.T3)\)\. Rules learned by DefragTrees have a rule\-width greater than or equal to 8 in seven out of the twelve datasets, indicating increased model complexity\.
- •In terms of Rule Set Accuracy \(RSA\), DefragTrees exhibits variations of more than 10% compared to Random Forests \(RF\), which serves as an additional benchmark that, while highly accurate, lacks interpretability \(Table 1\)\.
These observations collectively highlight that existing state\-of\-the\-art \(SOTA\) algorithms often do not produce high\-quality rule\-sets as measured by various metrics\. Coverage is one of the most severely affected metrics\. To this end, the first observation points to a significant challenge, which we refer to as thehigh accuracy\-low coverage problem\.
Rule sets with high accuracy often have low coverage\.
The objective of this paper is to investigate algorithmic solutions to address this problem\.
## 4CDPR : High Coverage Discriminative Parsimonious Rule Sets
Motivated by the findings of the comparative study reported in the previous section, we introduce the problem of learning CDPR\. CDPR is a rule\-set that aims to maximizes coverage while satisfying quality constraints related torule\-width,rule\-size,rule\-accuracyandoverlapbetween rules as shown in Figure[2](https://arxiv.org/html/2606.14156#S4.F2)\.
accuracy\(r\)≥α,∀r∈S,α∈\[0,1\]accuracy\(r\)\\geq\\alpha,\\qquad\\qquad\\forall r\\in S,\\alpha\\in\[0,1\]overlap\-rate\(ri,rj\)≤β,∀ri,rj∈S,ri≠rj,β∈\[0,1\]overlap\\mbox\{\-\}rate\(r\_\{i\},r\_\{j\}\)\\leq\\beta,\\quad\\forall r\_\{i\},r\_\{j\}\\in S,r\_\{i\}\\neq r\_\{j\},\\beta\\in\[0,1\]rule\-size\(S\)≤γ,γ≥1rule\\mbox\{\-\}size\(S\)\\leq\\gamma,\\qquad\\qquad\\gamma\\geq 1rule\-width\(r\)≤δ,∀r∈S,δ≥1rule\\mbox\{\-\}width\(r\)\\leq\\delta,\\qquad\\quad\\forall r\\in S,\\delta\\geq 1
Figure 2:Quality constraints inCDPR\(α,β,γ,δ\\alpha,\\beta,\\gamma,\\delta\)Throughout the subsequent discussion,, we omit the parametersα,β,γ\\alpha,\\beta,\\gamma, andδ\\deltafromCDPR\(α,β,γ,δ\\alpha,\\beta,\\gamma,\\delta\)for the sake of simplicity\. The set of rules with high discriminative ability, referred to asdiscriminative candidate setand denoted byBα,δB\_\{\\alpha,\\delta\}, is defined as follows:
Bα,δ=\{r\|accuracy\(r\)≥α&width\(r\)≤δ,∀r∈A\},B\_\{\\alpha,\\delta\}=\\\{r\|\\ accuracy\(r\)\\geq\\alpha\\ \\&\\ width\(r\)\\leq\\delta,\\forall r\\in A\\\},Using this set we define the problem ofhigh coverage, discriminative, parsimonious rule set learning\(CDPRL\) as follows:
maximizeS⊆Bα,δcoverage\(S\),s\.t\.\\displaystyle\\underset\{S\\subseteq B\_\{\\alpha,\\delta\}\}\{\\text\{maximize\}\}\\quad\\mathrm\{coverage\}\(S\),\\;\\text\{s\.t\.\}\(CDPRL\)rule\-size\(S\)≤γ,overlap\-rate\(ri,rj\)≤β,ri,rj∈S,∀i≠j\\displaystyle rule\\mbox\{\-\}size\(S\)\\leq\\gamma,\\;overlap\\mbox\{\-\}rate\(r\_\{i\},r\_\{j\}\)\\leq\\beta,r\_\{i\},r\_\{j\}\\in S,\\forall i\\neq j
The optimization problem for learning CDPR is presented above\.To the best of our knowledge, CDPR is a novel concept that has not been previously explored in the literature\. CDPR ensures that we learn rule\-set that directly maximizes coverage without compromising accuracy and interpretability metrics\. The quality thresholdsα,β,γ\\alpha,\\beta,\\gamma, andδ\\deltaare easily interpretable, making it straightforward to align them with specific business requirements, unlike approaches such as IDS which lack interpretability in their hyperparameters\. It’s worth noting that existing algorithms are unable to directly solve the CDPR problem due to the constraint onoverlap\-rateoverlap\\mbox\{\-\}rate, which is not a matroid or knapsack constraint\[[7](https://arxiv.org/html/2606.14156#bib.bib29),[19](https://arxiv.org/html/2606.14156#bib.bib28)\]\. This implies that custom solvers are needed to learn CDPR\.
## 5Two Algorithms based on Sub\-modular Maximization
In this section, we present two approaches for learning CDPR as a cardinality\-constrained sub\-modular maximization problem: \(1\)Graph Rules Algorithm\(GRA\): This approach is designed to solveCDPRLdirectly, and \(2\)Greedy Algorithm\(GDY\): This method is employed to address a relaxed version ofCDPRLknown asCDPRL with Mean Overlap \(CDPRL\-MO\), which is a sub\-modular function with a cardinality constraint\.
### 5\.1GraphRules Algorithm \(GRA\)
In this subsection, we introduce theGraphRules Algorithm\(GRA\) for learning CDPR\. We also establish the approximation guarantee ofGRAand examine its computational complexity\.GRArelies on a novel approach that transformsCDPRLinto a submodular optimization problem over a graph data structure\. This transformation is achieved by constructing a discrete directed graph, known asoverlap\-graphover thediscriminative candidate setBα,δB\_\{\\alpha,\\delta\}\.GRAmaps the rules inBα,δB\_\{\\alpha,\\delta\}as nodes of the graph and establishes edges in a manner that satisfies the overlap\-rate constraints ofCDPRL\. As a result, we frameCDPRLas a sub\-modular function with cardinality constraint over theoverlap\-graph\.GRA, as depicted in Algorithm[1](https://arxiv.org/html/2606.14156#alg1), mainly comprises two steps:
1. 1\.Overlap\-graph Construction:In this step, we create an overlap graphG=\(V,E\)G=\(V,E\)where the verticesVVcorrespond toBα,δB\_\{\\alpha,\\delta\}, and an edge\(ri\(r\_\{i\},rj\)∈Er\_\{j\}\)\\in Eexists if and only if theoverlap\-rateoverlap\\mbox\{\-\}rate\(ri,rj\)\(r\_\{i\},r\_\{j\}\)\>β\>\\beta, withri,rj∈V,i≠jr\_\{i\},r\_\{j\}\\in V,i\\neq j\. The vertices in the overlap graph represent the rules in the discriminative candidate setBα,δB\_\{\\alpha,\\delta\}\. A directed edge from one vertexrir\_\{i\}to another vertexrjr\_\{j\}exists if more thanβ\\betapercentage of data points covered byrir\_\{i\}are also covered byrjr\_\{j\}\.
2. 2\.Find Rule\-sets from Overlap\-graph:We begin with an empty rule set, denoted asSS\. In each iteration, we select the vertex with the highest coverage in theoverlap\-graph \(G\)and add it toSS\. Subsequently, we rearrangeGGby removing the selected vertex and any vertices connected by a directed edge to the selected vertex\. If multiple vertices share the same highest coverage, we arbitrarily select one of the vertices with maximum coverage\. Vertices are added to setSSuntil it contains a maximum ofγ\\gammarules, or the overlap\-graph is empty\.
Algorithm 1GraphRulesInput:Dataset
D=\{X,y\}i=1ND=\\\{X,y\\\}\_\{i=1\}^\{N\}, set of rules in
Bα,δB\_\{\\alpha,\\delta\}, parameters
β,γ\\beta,\\gamma
Output:CDPR
SS
G←\(V,E\)G\\leftarrow\(V,E\):
V←Bα,δV\\leftarrow B\_\{\\alpha,\\delta\},
E←\{\}E\\leftarrow\\\{\\\}
foreach
ri∈Bα,δr\_\{i\}\\in B\_\{\\alpha,\\delta\}do
foreach
rj∈Bα,δr\_\{j\}\\in B\_\{\\alpha,\\delta\}do
if
overlap\-rate\(ri,rj\)≥βoverlap\\mbox\{\-\}rate\(r\_\{i\},r\_\{j\}\)\\geq\\betathen
E←E∪\(ri,rj\)E\\leftarrow E\\cup\(r\_\{i\},r\_\{j\}\)⊳\\qquad\\ \\trianglerightoverlap\-graph construction
S←\{\}S\\leftarrow\\\{\\\}
while
GGis non\-empty or size\(
S\)≤γS\)\\leq\\gammado
v←v\\leftarrowmax
coverage\(v\),∀v∈V\\ coverage\(v\),\\ \\forall v\\in V
S←S∪vS\\leftarrow S\\ \\cup\\ v
foreach
ri∈V∖vr\_\{i\}\\in V\\setminus vdo
if
\(ri,v\)∈E\(r\_\{i\},v\)\\in Ethen
V←V∖\{ri\}V\\leftarrow V\\setminus\\\{r\_\{i\}\\\}
V←V∖vV\\leftarrow V\\setminus\{v\}
ReturnS
In the context of multi\-class classification, rules are generated for each class, andGRAis applied\. Once the rule set is created, class labels are assigned as follows: If a data point does not satisfy any rules in the rule\-set, it is assigned to the default rule, and it is the user’s responsibility to determine the default rule\. If a data point satisfies exactly one rule, denoted asr=\(s,c\)r=\(s,c\), the data point is assigned to classcc\. In cases where a data point satisfies more than one rule, the user has the discretion to determine the assignment criteria\. One option is to assign the class corresponding to the rule with the highest coverage or highest accuracy on the training data, or to use a majority vote\.
##### Relevant Definitions\.
Given a setXX, a discrete functionf:S⊆Xf:S\\subseteq Xis said to be \-Submodular:If∀C⊆D⊆X\\forall C\\subseteq D\\subseteq Xande∈\(X−D\)e\\in\(X\-D\),f\(C∪e\)−f\(C\)≥f\(D∪e\)−f\(D\)f\(C\\cup e\)\-f\(C\)\\geq f\(D\\cup e\)\-f\(D\)\.Monotone:If∀C⊆D⊆X,f\(C\)≤f\(D\)\\forall C\\subseteq D\\subseteq X,f\(C\)\\leq f\(D\)\.
### 5\.2Analysis of GraphRules Algorithm
In this section, we establish the approximation guarantees forGRA\.
###### Theorem 5\.1
Consider the problemCDPRL, and letS∗S^\{\*\}be its optimal solution\. LetS=GraphRules\(D,Bα,δ,β,γ\)S=GraphRules\(D,B\_\{\\alpha,\\delta\},\\beta,\\gamma\)\(Algorithm \([1](https://arxiv.org/html/2606.14156#alg1)\)\)\. Letf\(S\)f\(S\)andf\(S∗\)f\(S^\{\*\}\)are the coverage ofSSandS∗S^\{\*\}, respectively\. We have the following result:f\(S\)≥\(1−\(1−\(1−β\)γ\(1\+dβ\)\)γ\)f\(S∗\)f\(S\)\\geq\(1\-\(1\-\\frac\{\(1\-\\beta\)\}\{\\gamma\(1\+d\\beta\)\}\)^\{\\gamma\}\)f\(S^\{\*\}\), whered=minu∈Sindegree\(u\)d=min\_\{u\\in S\}indegree\(u\)\.
###### Proof
LetS∗S^\{\*\}be the optimal solution to the problemCDPRL\. LetSkS\_\{k\}be the set ofkkrules chosen by algorithm tillkthk^\{th\}step\. Similarly, assume that the optimal solutionS∗S^\{\*\}haskkrules given bySk∗S^\{\*\}\_\{k\}, wherek≤γk\\leq\\gamma\. Letvkv\_\{k\}be the rule added inkthk^\{th\}iteration inSkS\_\{k\}\. Data\-points that are not covered bySk∗−Sk−1S^\{\*\}\_\{k\}\-S\_\{k\-1\}are covered by some of thekkrules inSk∗S^\{\*\}\_\{k\}\. One of thevi∗∈Sk∗v^\{\*\}\_\{i\}\\in S^\{\*\}\_\{k\},i=1,…,ki=1,\.\.\.,kmust cover atleastf\(Sk∗\)−f\(Sk−1\)k\\frac\{f\(S^\{\*\}\_\{k\}\)\-f\(S\_\{k\-1\}\)\}\{k\}data\-points\. Letddbe the indegree of nodevkv\_\{k\}\(which corresponds to a rule\) in overlap\-graphGG\. Letβ\\betais the overlap\-rate\. The datapoints that are covered by a rulevkv\_\{k\}might be covered by some other rules inSk−1S\_\{k\-1\}\. This means that the coverage of nodevkv\_\{k\}is given as\.f\(vk\)≥f\(Sk∗\)−f\(Sk−1\)k\(1\+dβ\)f\(v\_\{k\}\)\\geq\\frac\{f\(S^\{\*\}\_\{k\}\)\-f\(S\_\{k\-1\}\)\}\{k\(1\+d\\beta\)\}
From the above equation, coverage ofv1v\_\{1\}is given byf\(v1\)≥f\(Sk∗\)k\(1\+dβ\)≥\(1−β\)f\(Sk∗\)k\(1\+dβ\)f\(v\_\{1\}\)\\geq\\frac\{f\(S^\{\*\}\_\{k\}\)\}\{k\(1\+d\\beta\)\}\\geq\\frac\{\(1\-\\beta\)f\(S^\{\*\}\_\{k\}\)\}\{k\(1\+d\\beta\)\}sinceβ∈\(0,1\)\\beta\\in\(0,1\)\. By GraphRules algorithmv1v\_\{1\}is the node with highest coverage in overlap\-graph\. Nodevkv\_\{k\}may have some datapoints overlapping with rulesvi∈Sk−1v\_\{i\}\\in S\_\{k\-1\}\. The number of datapoints overlapping is less thanβf\(vk\)\\beta f\(v\_\{k\}\), by virtue of our algorithm\. This means thatf\(Sk−1∩vk\)≤βf\(vk\)f\(S\_\{k\-1\}\\cap v\_\{k\}\)\\leq\\beta f\(v\_\{k\}\)\.−f\(Sk−1∩vk\)≥−βf\(vk\)\-f\(S\_\{k\-1\}\\cap v\_\{k\}\)\\geq\-\\beta f\(v\_\{k\}\)\. We have the following:
f\(Sk\)=f\(Sk−1\)\+f\(vk\)−f\(Sk−1∩vk\)≥f\(Sk−1\)\+f\(vk\)−βf\(vk\)\\begin\{split\}f\(S\_\{k\}\)&=f\(S\_\{k\-1\}\)\+f\(v\_\{k\}\)\-f\(S\_\{k\-1\}\\cap v\_\{k\}\)\\\\ &\\geq f\(S\_\{k\-1\}\)\+f\(v\_\{k\}\)\-\\beta f\(v\_\{k\}\)\\end\{split\}\(1\)Substituting the value off\(vk\)f\(v\_\{k\}\)in the above equation\.
f\(Sk\)≥f\(Sk−1\)\+\(1−β\)\(f\(Sk∗\)−f\(Sk−1\)\)k\(1\+dβ\)f\(S\_\{k\}\)\\geq f\(S\_\{k\-1\}\)\+\(1\-\\beta\)\\frac\{\(f\(S^\{\*\}\_\{k\}\)\-f\(S\_\{k\-1\}\)\)\}\{k\(1\+d\\beta\)\}\\\\\(2\)For brevity, we replace1\+dβ1−β\\frac\{1\+d\\beta\}\{1\-\\beta\}by a constantCC, we get,
f\(Sk\)≥f\(Sk−1\)\+\(f\(Sk∗\)−f\(Sk−1\)\)kC≥\(1−1kC\)f\(Sk−1\)\+f\(Sk∗\)Ck\\begin\{split\}f\(S\_\{k\}\)&\\geq f\(S\_\{k\-1\}\)\+\\frac\{\(f\(S^\{\*\}\_\{k\}\)\-f\(S\_\{k\-1\}\)\)\}\{kC\}\\\\ &\\geq\(1\-\\frac\{1\}\{kC\}\)f\(S\_\{k\-1\}\)\+\\frac\{f\(S^\{\*\}\_\{k\}\)\}\{Ck\}\\\\ \\end\{split\}\(3\)Sincef\(Sk∗\)k≤f\(Sk−1∗\)k−1\\frac\{f\(S^\{\*\}\_\{k\}\)\}\{k\}\\leq\\frac\{f\(S^\{\*\}\_\{k\-1\}\)\}\{k\-1\},
f\(Sk\)≥\(1−1kC\)f\(Sk−1\)\+f\(Sk∗\)kC≥\(1−\(1−1kC\)k\)f\(Sk∗\)\\begin\{split\}f\(S\_\{k\}\)&\\geq\(1\-\\frac\{1\}\{kC\}\)f\(S\_\{k\-1\}\)\+\\frac\{f\(S^\{\*\}\_\{k\}\)\}\{kC\}\\\\ &\\geq\(1\-\(1\-\\frac\{1\}\{kC\}\)^\{k\}\)f\(S^\{\*\}\_\{k\}\)\\end\{split\}\(4\)Sincek≤γk\\leq\\gamma,\(1−\(1−1kC\)k\)≥\(1−\(1−1γC\)γ\)\(1\-\(1\-\\frac\{1\}\{kC\}\)^\{k\}\)\\geq\(1\-\(1\-\\frac\{1\}\{\\gamma C\}\)^\{\\gamma\}\)\. Therefore,f\(S\)≥\(1−\(1−1γC\)γ\)f\(S∗\)f\(S\)\\geq\(1\-\(1\-\\frac\{1\}\{\\gamma C\}\)^\{\\gamma\}\)f\(S^\{\*\}\)orf\(S\)≥\(1−\(1−\(1−β\)γ\(1\+dβ\)\)γ\)f\(S∗\)f\(S\)\\geq\(1\-\(1\-\\frac\{\(1\-\\beta\)\}\{\\gamma\(1\+d\\beta\)\}\)^\{\\gamma\}\)f\(S^\{\*\}\)
Now, let us consider the values ofβ\\betafor which the overlap\-graph denoted asG=\(V,E\)G=\(V,E\)is completely disconnected\. This means thatoverlap\-rate\(ri,rj\)=0overlap\\mbox\{\-\}rate\(r\_\{i\},r\_\{j\}\)=0for allri,rj∈V,i≠jr\_\{i\},r\_\{j\}\\in V,i\\neq j\. In this scenario, the optimization problemCDPRLsimplifies tomaximizeS⊆Bα,δ,\|S\|≤γcoverage\(S\)maximize\_\{S\\subseteq B\_\{\\alpha,\\delta\},\|S\|\\leq\\gamma\}coverage\(S\)\. This can be viewed as a monotone sub\-modular function with cardinality constraint\. For such cases, a greedy algorithm with a\(1−e−1\)\(1\-e^\{\-1\}\)\-approximation for maximizing monotone submodular functions with a cardinality constraint is available, as described in\[[25](https://arxiv.org/html/2606.14156#bib.bib46)\]\. Interestingly, this approach is equivalent toGRAwhen theoverlap\-graphis completely disconnected\. When theoverlap\-graphis completely disconnected,GRAselects the vertex with the maximum coverage in theoverlap\-graphand adds it to the rule\-set\.
###### Corollary 1
Consider the problemCDPRL, and letS∗S^\{\*\}be its optimal solution\. LetG=\(V,E\)G=\(V,E\)denote the overlap\-graph constructed fromBα,δB\_\{\\alpha,\\delta\}and letS=GraphRules\(D,Bα,δ,β,γ\)S=GraphRules\(D,B\_\{\\alpha,\\delta\},\\beta,\\gamma\)\(Algorithm[1](https://arxiv.org/html/2606.14156#alg1)\)\. IfGGis completely disconnected, i\.e\.,E=ϕE=\\phi,f\(S\)≥\(1−\(1−1γ\)γ\)f\(S∗\)f\(S\)\\geq\(1\-\(1\-\\frac\{1\}\{\\gamma\}\)^\{\\gamma\}\)f\(S^\{\*\}\), wheref\(S\)f\(S\)andf\(S∗\)f\(S^\{\*\}\)denotes the coverage ofSSandS∗S^\{\*\}respectively\.
In the limiting case, whenγ→∞\\gamma\\rightarrow\\infty, the approximation guarantee ofGRAwhen the overlap\-graph is completely disconnected is the same as the greedy algorithm proposed in\[[25](https://arxiv.org/html/2606.14156#bib.bib46)\]\.
#### 5\.2\.1Computational Complexity of GRA
LetNNbe the number of records in the dataset andAAbe the original rules obtained from the rule mining technique\. The computational complexity of obtainingBα,δB\_\{\\alpha,\\delta\}is𝒪\(N\|A\|\)\\mathcal\{O\}\(N\|A\|\), where\|A\|\|A\|represents the number of rules in setAA\. The computational complexity ofGRAis𝒪\(\|Bα,δ\|2\)\\mathcal\{O\}\(\|B\_\{\\alpha,\\delta\}\|^\{2\}\), where\|Bα,δ\|\|B\_\{\\alpha,\\delta\}\|is the number of rules in setBα,δB\_\{\\alpha,\\delta\}\. Notably asα\\alphaincreases, resulting in fewer rules in setBα,δB\_\{\\alpha,\\delta\}, the time complexity decreases\. Similarly, ifδ\\deltadecreases, leading to a reduced number of rules inBα,δB\_\{\\alpha,\\delta\}, the time complexity also decreases\. SinceGRAexhibits a running time that is quadratic in size ofBα,δB\_\{\\alpha,\\delta\}, in the next section, we will exploreGDY, which has a running time that is nearly linear in the size ofBα,δB\_\{\\alpha,\\delta\}\.
### 5\.3Greedy Algorithm \(GDY\)
In this section, we delve into the Greedy algorithm\[[15](https://arxiv.org/html/2606.14156#bib.bib88)\]designed for sub\-modular maximization with a runtime that is nearly linear in the size ofBα,δB\_\{\\alpha,\\delta\}, to learn CDPR\. Since, theoverlap\-rateconstraint inCDPRLdoes not fit the framework of a matroid or a knapsack constraint, we take a more relaxed approach to the problem\.
In this relaxed problem, we aim to minimizemean\-overlapof rule\-setS¯\\overline\{S\}while simultaneously maximizing coverage\. Themean\-overlap\(S¯\)mean\\mbox\{\-\}overlap\(\\overline\{S\}\)is defined asmean\-overlap\(S¯\)=∑ri,rj∈S¯,i≠joverlap\(ri,rj\)\|S¯\|2mean\\mbox\{\-\}overlap\(\\overline\{S\}\)=\\frac\{\\sum\_\{r\_\{i\},r\_\{j\}\\in\\overline\{S\},i\\neq j\}overlap\(r\_\{i\},r\_\{j\}\)\}\{\|\\overline\{S\}\|^\{2\}\}\.
g\(S¯\)=coverage\(S¯\)\+λ\(N−mean\-overlap\(S¯\)\)g\(\\overline\{S\}\)=\\mathrm\{coverage\}\(\\overline\{S\}\)\+\\lambda\\left\(N\-mean\\mbox\{\-\}overlap\(\\overline\{S\}\)\\right\)\(5\)
The relaxed problem referred to asHigh Coverage Discriminative Parsimonious Rule set Learning using Mean OverlapCDPRL\-MOis as follows:
maximizeS¯⊆Bα,δ,\|S¯\|≤γg\(S¯\)\\displaystyle\\underset\{\\overline\{S\}\\subseteq B\_\{\\alpha,\\delta\},\|\\overline\{S\}\|\\leq\\gamma\}\{\\text\{maximize\}\}\\ g\(\\overline\{S\}\)\(CDPRL\-MO\)InCDPRL\-MO, we optimize over the discriminative candidate setBα,δB\_\{\\alpha,\\delta\}, ensuring that the quality constraints onaccuracyandrule\-widthare satisfied\. InCDPRL, theoverlap\-rateoverlap\\mbox\{\-\}rateis strictly≤β\\leq\\beta, while inCDPRL\-MO, themean\-overlapis upper bounded byNN, whereNNbe the total number of data\-points\. We demonstrate thatCDPRL\-MOis a non\-monotone sub\-modular optimization function with a cardinality constraint\. To solve this, we employ the Greedy algorithm designed for non\-monotone submodular optimization functions with a cardinality constraint\.
Let us consider the following proposition before proceeding with the solution to this problem\.
###### Proposition 1
Letg\(S¯\)=g1\(S¯\)\+λg2\(S¯\)g\(\\overline\{S\}\)=g\_\{1\}\(\\overline\{S\}\)\+\\lambda g\_\{2\}\(\\overline\{S\}\)where,g1\(S¯\)=coverage\(S¯\)g\_\{1\}\(\\overline\{S\}\)=coverage\(\\overline\{S\}\)andg2\(S¯\)=N−∑ri,rj∈S¯,i≠joverlap\(ri,rj\)\|S¯\|2g\_\{2\}\(\\overline\{S\}\)=N\-\\sum\_\{r\_\{i\},r\_\{j\}\\in\\overline\{S\},i\\neq j\}\\frac\{overlap\(r\_\{i\},r\_\{j\}\)\}\{\|\\overline\{S\}\|^\{2\}\}\. For allλ≥0,g\(S¯\)\\lambda\\geq 0,g\(\\overline\{S\}\)is a non\-monotone sub\-modular function\.
###### Proof
Note thatg\(S¯\)g\(\\overline\{S\}\)is not monotone, if eitherg1g\_\{1\}org2g\_\{2\}is non\-monotone\. We can see thatg2g\_\{2\}is non\-monotone\. Therefore,ggis non\-monotone\.
SupposeC⊆DC\\subseteq Dande∈X−De\\in X\-D\. Let us assume thatg1\(C\)=∑ri∈Ccoverage\(ri\)=xg\_\{1\}\(C\)=\\sum\_\{r\_\{i\}\\in C\}coverage\(r\_\{i\}\)=xandg1\(D\)=∑ri∈Dcoverage\(ri\)=x\+ϵg\_\{1\}\(D\)=\\sum\_\{r\_\{i\}\\in D\}coverage\(r\_\{i\}\)=x\+\\epsilon, whereϵ\\epsilonis the number of datapoints that are covered by rules inD−CD\-Cand not any rules inCC\. Then,g1\(C∪e\)=x\+ϵ′g\_\{1\}\(C\\cup e\)=x\+\\epsilon^\{\\prime\}, whereϵ′\\epsilon^\{\\prime\}denote number of data\-points that are covered byeeand not any rules inCC\.g1\(D∪e\)=x\+ϵ\+\(ϵ′−ϵ′′\)g\_\{1\}\(D\\cup e\)=x\+\\epsilon\+\(\\epsilon^\{\\prime\}\-\\epsilon^\{\\prime\\prime\}\), whereϵ′′\\epsilon^\{\\prime\\prime\}is the number of data\-points that are covered by botheeand rules inD−CD\-C\.
g1\(C∪e\)−g1\(C\)=x\+ϵ′−x=ϵ′g1\(D∪e\)−g1\(D\)=x\+ϵ\+\(ϵ′−ϵ′′\)−\(x\+ϵ\)=ϵ′−ϵ′′⟹g1\(C∪e\)−g1\(C\)≥g1\(D∪e\)−g1\(D\)\\begin\{split\}g\_\{1\}\(C\\cup e\)\-g\_\{1\}\(C\)&=x\+\\epsilon^\{\\prime\}\-x=\\epsilon^\{\\prime\}\\\\ g\_\{1\}\(D\\cup e\)\-g\_\{1\}\(D\)&=x\+\\epsilon\+\(\\epsilon^\{\\prime\}\-\\epsilon^\{\\prime\\prime\}\)\-\(x\+\\epsilon\)=\\epsilon^\{\\prime\}\-\\epsilon^\{\\prime\\prime\}\\\\ \\implies g\_\{1\}\(C\\cup e\)\-g\_\{1\}\(C\)&\\geq g\_\{1\}\(D\\cup e\)\-g\_\{1\}\(D\)\\\\ \\end\{split\}\(6\)Hence,g1g\_\{1\}is sub\-modular\. A similar argument demonstrates thatg2g\_\{2\}is sub\-modular\. Thereforeggis sub\-modular\.
We employ the Greedy algorithm presented in\[[15](https://arxiv.org/html/2606.14156#bib.bib88)\]to compute the solution forCDPRL\-MO\. The Greedy algorithm from\[[15](https://arxiv.org/html/2606.14156#bib.bib88)\]is an efficient, nearly linear time, deterministic approximation algorithm designed for optimizing non\-monotone sub\-modular functions with a cardinality constraint\. The following theorem provides an approximation guarantee forGDY:
###### Theorem 5\.2
Letggbe a non\-monotone sub\-modular function with cardinality constraint\. LetS∗¯=argmaxS¯⊆Bα,δ,\|S¯\|≤γg\(S¯\)\\overline\{S^\{\*\}\}=\\text\{argmax\}\_\{\\overline\{S\}\\subseteq B\_\{\\alpha,\\delta\},\|\\overline\{S\}\|\\leq\\gamma\}g\(\\overline\{S\}\)\. LetS¯=Greedy\(g,Bα,δ,ρ,γ\)\\overline\{S\}=Greedy\(g,B\_\{\\alpha,\\delta\},\\rho,\\gamma\)\. Then,g\(S¯\)≥14g\(S∗¯\)g\(\\overline\{S\}\)\\geq\\frac\{1\}\{4\}g\(\\overline\{S^\{\*\}\}\)and Greedy makes𝒪\(\|Bα,δ\|log\(γ\)\)\\mathcal\{O\}\(\|B\_\{\\alpha,\\delta\}\|log\(\\gamma\)\)queries toff, where\|Bα,δ\|\|B\_\{\\alpha,\\delta\}\|is the number of rules inBα,δB\_\{\\alpha,\\delta\}\.
###### Proof
Kuhnle et al\.\[[15](https://arxiv.org/html/2606.14156#bib.bib88)\]states the following theorem:Letg:2\[n\]→ℝ\+g:2^\{\[n\]\}\\rightarrow\\mathbb\{R\}^\{\+\}be sub\-modular, letk∈\[n\]k\\in\[n\]andϵ\>0\\epsilon\>0\. LetO=argmaxS¯≤kg\(S¯\)O=argmax\_\{\\overline\{S\}\\leq k\}g\(\\overline\{S\}\)\. Chooseρ\\rhosuch that\(1−6ρ\)/4\>1/4−ϵ\(1\-6\\rho\)/4\>1/4\-\\epsilonand letC=FIG\(g,k,ρ\)C=FIG\(g,k,\\rho\)\. Then,g\(C\)≥\(1−6ρ\)g\(O\)/4≥\(1/4−ϵ\)g\(O\)g\(C\)\\geq\(1\-6\\rho\)g\(O\)/4\\geq\(1/4\-\\epsilon\)g\(O\)and FIG algorithm makes𝒪\(nρlogkρ\)\\mathcal\{O\}\(\\frac\{n\}\{\\rho\}log\\frac\{k\}\{\\rho\}\)queries toff\.This theorem applies since the objective g is sub\-modular with cardinality constraint as shown in Proposition 1\.
## 6EXPERIMENTS
In this section, we assess the performance of CDPR learned usingGRAandGDY\. We provide an overview of the baselines and experimental setup\. Our analysis encompasses the classification performance and the quality constraints of rule sets learned from ten real\-world datasets from the UCI machine learning repository\[[6](https://arxiv.org/html/2606.14156#bib.bib77)\]as well as two real\-world Alzheimer’s disease datasets obtained from the NACC Uniform Data Set\[[33](https://arxiv.org/html/2606.14156#bib.bib80),[1](https://arxiv.org/html/2606.14156#bib.bib82),[32](https://arxiv.org/html/2606.14156#bib.bib108)\]\.
### 6\.1Baselines and Experimental Setup
Table 1:Rule Set Accuracy\(RSA\) of rule sets generated by different algorithms on test data\-set\. From the Table, it is evident that GRA has the highest accuracy when compared with rule set based baselinesWe compare the CDPR learned usingGRAandGDYwith rule\-sets learned using state\-of\-the\-art algorithms :IDS222https://github\.com/lvhimabindu/interpretable\_decision\_sets\[[16](https://arxiv.org/html/2606.14156#bib.bib14)\], DefragTrees333https://github\.com/sato9hara/defragTrees\(DT\)\[[12](https://arxiv.org/html/2606.14156#bib.bib54)\], andRIPPER444https://github\.com/imoscovitz/wittgenstein\[[4](https://arxiv.org/html/2606.14156#bib.bib23)\]\. Additionally, we compare them with standard models such as logistic regression \(LR\) and random forest \(RF\)555We used scikit learn implementation\[[26](https://arxiv.org/html/2606.14156#bib.bib74)\]of logistic regression \(LR\) and random forest \(RF\)\.to demonstrate the competitiveness of rule\-set models\.
##### Initial rules\.
GRA,GDY, andIDSrequire a pre\-mined set of rules as input\. These rules are pre\-mined from gradient\-boosting decision tree ensembles that are fitted to the training data\[[9](https://arxiv.org/html/2606.14156#bib.bib107)\]\. Each rule is derived from the decision tree ensembles by concatenating condition\-value pairs from the root node to the leaf node, with the leaf node specifying the label\. In our experiments, we employed the same set of rules as input forGRA,GDY, andIDS\. DefragTrees\[[12](https://arxiv.org/html/2606.14156#bib.bib54)\]andRIPPER\[[4](https://arxiv.org/html/2606.14156#bib.bib23)\]do not require the pre\-mining rules\. If a datapoint does not satisfy any of the rules in a rule\-set, it is assigned a default rule\.
##### Hyperparameters\.
ForGRAandGDY, we sampleα∈\[0\.60,0\.95\],β∈\[0\.40,0\.80\],γ∈\[3,12\],δ∈\[3,5\],λ∈\[1,500\]\\alpha\\in\[0\.60,0\.95\],\\beta\\in\[0\.40,0\.80\],\\gamma\\in\[3,12\],\\delta\\in\[3,5\],\\lambda\\in\[1,500\]\. ForIDS, we chooseλi∈\[1,500\],i=\[1,7\]\\lambda\_\{i\}\\in\[1,500\],i=\[1,7\]\. ForDefragTrees, we chose the hyper\-parameterKmax∈\[3,10\]K\_\{max\}\\in\[3,10\]\. ForRIPPER, we set the maximum\-rules to1010\. For random forest \(RF\), we chose the number of estimators in the range \[10, 500\] and max\-depth in the range \[3, 7\]\. For logistic regression \(LR\), we considered l2 regularization with hyper\-parameterCCin the range \[0\.01, 100\]\.
### 6\.2Numerical Results
#### 6\.2\.1Discriminative Power
The disciminative power of rule\-sets is quantified by theRule Set Accuracy \(RSA\)metric, as defined in Section 3\.1\. The empirical evaluation reveals thatGRAexhibits significantly higher discriminative power when compared toGDYand other baseline methods\.GRAachieves an average RSA of 0\.85, whileGDYattains an average RSA of 0\.84 \(Table[4](https://arxiv.org/html/2606.14156#S6.T4)\)\. This is notably higher than the average RSA values forIDS, Defrag andRIPPER, as presented in Table[4](https://arxiv.org/html/2606.14156#S6.T4)\. The RSA for each dataset, along with dataset size and rule\-count, can be found in Table[1](https://arxiv.org/html/2606.14156#S6.T1)\.
The results demonstrate that the rule\-set learned byGRAeither matches or surpasses the performance of Random Forest\(RF\) in77out of1212datasets and falls within a 3% margin ofRF’s RSA in the remaining five datasets\.GRAconsistently outperforms other interpretable models, includingIDS,RIPPER, and DefragTrees\. WhileGDY, achieves comparable performance toRFin 5 out of 12 datasets, it performs similar to or better thanIDS, which is the next best algorithm\. This clearly illustrates thatGRAandGDYoutperforms other interpretable models while achieving performance comparable to that of complex models likeRF\.
Table 2:Coverage\-rate of rule\-sets on test data\-set\. From Table, it is evident that GRA and GDY has atleast20%20\\%increase in coverage\-rate compared toIDSandRIPPER\.Table 3:Maximum Rule\-Width&Rule size of rule\-sets on test data\-set\. From Table, it is evident that CDPR has low value of maximum rule\-width or low rule count or both\.
#### 6\.2\.2Interpretability
Inteepretability is determined by coverage and parsimony of rule\-sets\. As defined in Section 3\.1, we have established metrics for evaluating the interpretability of rule\-sets, includingcoverage\-rate, maximum rule\-widthandrule\-size\. Table[2](https://arxiv.org/html/2606.14156#S6.T2)provides the coverage\-rate of our algorithms and baselines for different datasets\. The results indicate thatGRAexhibits a higher coverage\-rate compared to state\-of\-the\-art baselines\.GRAconsistently achieves a coverage\-rate of more than 90% across all datasets, with an average coverage\-rate of 97%\. In contrast,GDYachieves an average coverage\-rate of 91% \(Table[4](https://arxiv.org/html/2606.14156#S6.T4)\)\.IDS, which ranks as the next best baseline in terms of discriminative power, lags behind with an average coverage\-rate of 35% \(Table[4](https://arxiv.org/html/2606.14156#S6.T4)\)\.
The parsimony of rule\-sets is gauged by maximum rule\-width and rule\-size, as outlined in Table[3](https://arxiv.org/html/2606.14156#S6.T3)\. When examining the maximum rule\-width and rule\-size, it becomes apparent thatGRAandGDY’s values are comparable to those ofIDSandRIPPER\. These algorithms have substantially lower maximum rule widths when compared to DefragTrees from Table[3](https://arxiv.org/html/2606.14156#S6.T3)\. On average, the maximum rule\-width of DefragTrees is more than twice that ofGRAandGDY\(Table[4](https://arxiv.org/html/2606.14156#S6.T4)\), negatively impacting interpretability\.
Table 4:Summary of the results obtained by averaging across the twelve datasets : RSA \(Table[1](https://arxiv.org/html/2606.14156#S6.T1)\), Coverage\-rate \(Table[2](https://arxiv.org/html/2606.14156#S6.T2)\), Maximum Rule Width and Rule size \(Table[3](https://arxiv.org/html/2606.14156#S6.T3)\)\.GRAperforms better than baselines in most of the metrics\.
### 6\.3Case Study : Designing Neurocognitive Tests through Ruleset Learning
In this case study, we attempt to design Neurocognitive Tests as an application of rule\-set learning\. Designing medical diagnostic tests using rule\-sets has not been studied in the literature\. Manual design of tests is time intensive and may fail to be fully exhaustive, potentially excluding important relationships\. There are some previous attempts\[[11](https://arxiv.org/html/2606.14156#bib.bib79)\]to reduce the overall number of Neurocognitive tests\. It is important to design tests to diagnose various diseases such as Alzheimer’s Disease\. Alzheimer’s Disease is characterized by initial memory impairment and cognitive decline and affects over 30 million people worldwide, with a high social burden for patients and caregivers\. Neurocognitive tests are regarded as the most accurate indication of early impairment\.
Neurocognitive tests comprise a set of tasks performed by a subject, designed to assess the subject’s state in the cognitive domains of orientation, memory, attention, visuospatial, language, and executive function\. Various attributes \(such as time taken, error rate, etc\.\) are extracted for each task\. Each attribute corresponds primarily to a single cognitive domain and is considered a feature in the classifier\. A collection of features, together with its limiting values, corresponds to a rule\. The duration of the testing is directly proportional to the number of tests conducted on the patient and the number of attributes\.
Since Neurocognitive tests are long and tedious, having a rule\-based system is helpful, where the entire battery of tests need not be administered to every subject\. Hence, to reduce the testing duration while improving its effectiveness, it is essential to obtain the minimum set of rules covering all the cognitive domains yielding as high an accuracy as possible\. We believe that CDPR learning is a good fit for designing accurate and interpretable diagnostis tests and can generate efficient Neurocognitive tests in our case study\.
##### Dataset and Methods
The study population comprised 145k subjects from the NACC Uniform Data Set\[[33](https://arxiv.org/html/2606.14156#bib.bib80),[1](https://arxiv.org/html/2606.14156#bib.bib82),[32](https://arxiv.org/html/2606.14156#bib.bib108)\]from approximately 32 Alzheimer’s Disease Research Centers \(ADRCs\) between 2005 and 2019\. Each record consists of 69 attributes from 11 chosen standardized neuropsychological tests\. The details of tests are in supplementary material\. Clinical assessments grade the disease into categories measured by Clinical Dementia Rating \(CDR®\), which ranges from a value of 0 for cognitively normal individuals, 0\.5 for Mild Cognitive Impairment \(MCI\), a prodrome of Alzheimer’s to 1, and higher values which correspond to Alzheimer’s Disease of various grades\.
We have proposed two classifiers in tandem for diagnosing Alzheimer’s Disease \-Dementia Screener \(Dem\.Sc\.\)andMCI Screener \(MCI\.Sc\.\)\.Dementia Screenersegregates a subject as definitively suffering from Dementia \(CDR = 1, 2, 3\) or not \(CDR = 0, 0\.5\)\. TheMCI Screenerdistinguishes between subjects with CDR = 0 and CDR = 0\.5 \(Normal vs\. Mild Cognitive Impairment \(MCI\)\)\. In subjects suspected of cognitive impairment, the state of Mild Cognitive Impairment \(CDR = 0\.5\) is the most difficult to classify\[[11](https://arxiv.org/html/2606.14156#bib.bib79)\], since they might fall into either of CDR = 0 or CDR = 1 category\. Hence to diagnose Alzheimer’s Disease, we first pass a subject through the Dementia Screener to screen if the subject is definitively suffering from Dementia or not\. If the subject is not definitively suffering from Dementia, we pass the subject through MCI Screener to diagnose between Normal and MCI\. The clinical interventions for the Dementia and MCI cases could be different\.
Table 5:Rule Set Accuracy\(RSA\) of rule sets generated by different algorithms on Alzheimer’s Disease test data\-set\. From the Table, it is evident that GRA has the highest accuracy when compared with rule set based baselines
##### Results and Discussion
Table[5](https://arxiv.org/html/2606.14156#S6.T5)reports RSA of rule\-sets learned byGRA,GDY, and baselines for Dementia Screener and MCI Screener\. The RSA of the rule\-set learned byGRAis higher than that baselines and is the same as that of RF \(Table[5](https://arxiv.org/html/2606.14156#S6.T5)\)\. We consider rule\-sets learned by algorithms for Dementia and MCI Screener for which the coverage\-rate is atleast 70% \. AlthoughIDShas reasonable accuracy, the coverage\-rate is low\. Among the algorithms considered,GRA,GDYandDThas more than 70% coverage\-rate\. From the results,GRAandGDYexhibits atleast 18% reduction in number of features compared toDT\. ThoughGDYis not as accurate compared toGRA, it is a good candidate for designing Neurocognitive tests\.GDYhas time complexity nearly linear in number of rules inBα,δB\_\{\\alpha,\\delta\}\.
It is evident from the results thatGRAlearn rule\-set for Dementia Screener and MCI Screener with high coverage\-rate, discriminative power and parsimony\. To our best knowledge, this work is the first attempt to make use of the fact that early onset of Alzheimer’s \(consisting of the transition between Normal and MCI individuals\) follows different rules from later progression \(CDR = 1 and onwards\) to arrive at a reduced number of tests\. Dementia Screener can help quickly detect cases that are not yet fully afflicted with the disease\. The MCI Screener can focus on the rules for people not yet afflicted with the disease, thereby reducing the overall testing time\. This will reduce the cognitive stress of the individuals undergoing the cognitive tests for Alzheimer’s disease and make the screening process efficient\. Clinical validation of the rule\-sets will be seperately conducted\.
## 7CONCLUSION
We believe that developing mechanisms to learn accurate and interpretable rule\-based models will improve the adoption of machine learning models in critical applications, where the model is liable to justify the decision to the stakeholders\. This paper presents a novel approach for learning rule\-sets that are accurate and interpretable by proposing CDPR\. We have developed two algorithms for learning CDPR, theGRA, and theGDY, and obtained theoretical guarantees for both approaches\. The empirical results show that our techniques perform better than existing rule learning algorithms\. There are multiple exciting lines of future work to explore\. It would be interesting to develop an objective function to learn CDPRL, where it is possible to relax at least one of the constraints\. Consequently, it will be interesting to develop a soft\-version of overlap\-graph for the GRA\.
## References
- \[1\]D\. L\. Beekly, E\. M\. Ramos, W\. W\. Lee, W\. D\. Deitrich, M\. E\. Jacka, J\. Wu, J\. L\. Hubbard, T\. D\. Koepsell, J\. C\. Morris, and W\. A\. Kukull\(2007\-07\)The national alzheimer’s coordinating center \(nacc\) database: the uniform data set\.Alzheimer Disease and Associated Disorders21\(3\),pp\. 249–258\.External Links:[Document](https://dx.doi.org/10.1097/wad.0b013e318142774e),[Link](https://doi.org/10.1097/wad.0b013e318142774e)Cited by:[§3\.2](https://arxiv.org/html/2606.14156#S3.SS2.p1.1),[§6\.3](https://arxiv.org/html/2606.14156#S6.SS3.SSS0.Px1.p1.1),[§6](https://arxiv.org/html/2606.14156#S6.p1.1)\.
- \[2\]N\. H\. Bshouty and J\. C\. Jackson\(1998\)Learning dnf over the uniform distribution using a quantum example oracle\.SIAM Journal on Computing28\(3\),pp\. 1136–1153\.External Links:[Document](https://dx.doi.org/10.1137/S0097539795293123),[Link](https://doi.org/10.1137/S0097539795293123),https://doi\.org/10\.1137/S0097539795293123Cited by:[§2](https://arxiv.org/html/2606.14156#S2.p1.1)\.
- \[3\]P\. Clark and T\. Niblett\(1989\)The cn2 induction algorithm\.Machine learning3\(4\),pp\. 261–283\.Cited by:[§2](https://arxiv.org/html/2606.14156#S2.p1.1)\.
- \[4\]W\. W\. Cohen\(1995\)Fast effective rule induction\.InProceedings of the Twelfth International Conference on International Conference on Machine Learning,ICML’95,San Francisco, CA, USA,pp\. 115–123\.Cited by:[§2](https://arxiv.org/html/2606.14156#S2.p1.1),[1st item](https://arxiv.org/html/2606.14156#S3.I3.i1.p1.1),[§6\.1](https://arxiv.org/html/2606.14156#S6.SS1.SSS0.Px1.p1.1),[§6\.1](https://arxiv.org/html/2606.14156#S6.SS1.p1.1)\.
- \[5\]S\. Dash, O\. Günlük, and D\. Wei\(2018\)Boolean decision rules via column generation\.InProceedings of the 32nd International Conference on Neural Information Processing Systems,NIPS’18,Red Hook, NY, USA,pp\. 4660–4670\.Cited by:[§1](https://arxiv.org/html/2606.14156#S1.p2.1),[§2](https://arxiv.org/html/2606.14156#S2.p3.1)\.
- \[6\]D\. Dua and C\. Graff\(2017\)UCI machine learning repository\.University of California, Irvine, School of Information and Computer Sciences\.External Links:[Link](http://archive.ics.uci.edu/ml)Cited by:[Figure 1](https://arxiv.org/html/2606.14156#S1.F1),[§3\.2](https://arxiv.org/html/2606.14156#S3.SS2.p1.1),[§6](https://arxiv.org/html/2606.14156#S6.p1.1)\.
- \[7\]U\. Feige, V\. S\. Mirrokni, and J\. Vondrák\(2011\)Maximizing non\-monotone submodular functions\.SIAM\.Cited by:[§4](https://arxiv.org/html/2606.14156#S4.p4.3)\.
- \[8\]V\. Feldman\(2012\)Learning dnf expressions from fourier spectrum\.InConference on Learning Theory,pp\. 17–1\.Cited by:[§2](https://arxiv.org/html/2606.14156#S2.p1.1)\.
- \[9\]J\. Friedman and B\. Popescu\(2005\)Predictive learning via rule ensembles stanford univ\.Technical reportTech\. Rep\.Cited by:[§6\.1](https://arxiv.org/html/2606.14156#S6.SS1.SSS0.Px1.p1.1)\.
- \[10\]B\. Goodman and S\. Flaxman\(2017\)European union regulations on algorithmic decision\-making and a “right to explanation”\.AI magazine38\(3\),pp\. 50–57\.Cited by:[§1](https://arxiv.org/html/2606.14156#S1.p1.1)\.
- \[11\]A\. Gupta and B\. Kahali\(2020\-01\)Machine learning\-based cognitive impairment classification with optimal combination of neuropsychological tests\.Alzheimer’s and Dementia: Translational Research and Clinical Interventions6\(1\)\.Cited by:[§6\.3](https://arxiv.org/html/2606.14156#S6.SS3.SSS0.Px1.p2.1),[§6\.3](https://arxiv.org/html/2606.14156#S6.SS3.p1.1)\.
- \[12\]S\. Hara and K\. Hayashi\(2018\-09–11 Apr\)Making tree ensembles interpretable: a bayesian model selection approach\.InProceedings of the Twenty\-First International Conference on Artificial Intelligence and Statistics,Proceedings of Machine Learning Research, Vol\.84,pp\. 77–85\.External Links:[Link](https://proceedings.mlr.press/v84/hara18a.html)Cited by:[§1](https://arxiv.org/html/2606.14156#S1.p2.1),[§2](https://arxiv.org/html/2606.14156#S2.p3.1),[§2](https://arxiv.org/html/2606.14156#S2.p4.1),[2nd item](https://arxiv.org/html/2606.14156#S3.I3.i2.p1.1),[§6\.1](https://arxiv.org/html/2606.14156#S6.SS1.SSS0.Px1.p1.1),[§6\.1](https://arxiv.org/html/2606.14156#S6.SS1.p1.1)\.
- \[13\]J\. C\. Jackson, H\. K\. Lee, R\. A\. Servedio, and A\. Wan\(2008\)Learning random monotone dnf\.InApproximation, Randomization and Combinatorial Optimization\. Algorithms and Techniques,pp\. 483–497\.Cited by:[§2](https://arxiv.org/html/2606.14156#S2.p1.1)\.
- \[14\]A\. R\. Klivans and R\. A\. Servedio\(2004\)Learning dnf in time 2o \(n1/3\)\.Journal of Computer and System Sciences68\(2\),pp\. 303–318\.Cited by:[§2](https://arxiv.org/html/2606.14156#S2.p1.1)\.
- \[15\]A\. Kuhnle\(2019\)Interlaced greedy algorithm for maximization of submodular functions in nearly linear time\.Advances in neural information processing systems32\.Cited by:[§5\.3](https://arxiv.org/html/2606.14156#S5.SS3.p1.1),[§5\.3](https://arxiv.org/html/2606.14156#S5.SS3.p5.1),[§5\.3](https://arxiv.org/html/2606.14156#Thmproofx3.p1.10)\.
- \[16\]H\. Lakkaraju, S\. H\. Bach, and J\. Leskovec\(2016\)Interpretable decision sets: a joint framework for description and prediction\.InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,KDD ’16,New York, NY, USA,pp\. 1675–1684\.Cited by:[§1](https://arxiv.org/html/2606.14156#S1.p2.1),[§2](https://arxiv.org/html/2606.14156#S2.p4.1),[1st item](https://arxiv.org/html/2606.14156#S3.I3.i1.p1.1),[§3\.1](https://arxiv.org/html/2606.14156#S3.SS1.p1.10),[§6\.1](https://arxiv.org/html/2606.14156#S6.SS1.p1.1)\.
- \[17\]H\. Lakkaraju and C\. Rudin\(2016\)Learning Cost\-Effective and Interpretable Treatment Regimes for Judicial Bail Decisions\.InNIPS Symposium on Machine Learning and the Law,NIPS Symposium ’16\.Cited by:[§1](https://arxiv.org/html/2606.14156#S1.p1.1)\.
- \[18\]H\. Lakkaraju and C\. Rudin\(2017\)Learning Cost\-Effective and Interpretable Treatment Regimes\.InProceedings of the 20th International Conference on Artificial Intelligence and Statistics,Proceedings of Machine Learning Research, Vol\.54,pp\. 166–175\.Cited by:[§1](https://arxiv.org/html/2606.14156#S1.p1.1)\.
- \[19\]J\. Lee, V\. S\. Mirrokni, V\. Nagarajan, and M\. Sviridenko\(2009\)Non\-monotone submodular maximization under matroid and knapsack constraints\.InProceedings of the Forty\-First Annual ACM Symposium on Theory of Computing,STOC ’09,New York, NY, USA,pp\. 323–332\.External Links:[Link](https://doi.org/10.1145/1536414.1536459),[Document](https://dx.doi.org/10.1145/1536414.1536459)Cited by:[§4](https://arxiv.org/html/2606.14156#S4.p4.3)\.
- \[20\]W\. Li, J\. Han, and J\. Pei\(2001\-12\-01\)CMAR: accurate and efficient classification based on multiple class\-association rules\.InProceedings \- 2001 IEEE International Conference on Data Mining, ICDM’01,Proceedings \- IEEE International Conference on Data Mining, ICDM,pp\. 369–376\.Cited by:[§2](https://arxiv.org/html/2606.14156#S2.p1.1)\.
- \[21\]B\. Liu, W\. Hsu, and Y\. Ma\(1998\)Integrating classification and association rule mining\.InProceedings of the Fourth International Conference on Knowledge Discovery and Data Mining,KDD’98,pp\. 80–86\.Cited by:[§2](https://arxiv.org/html/2606.14156#S2.p1.1)\.
- \[22\]D\. Malioutov and K\. Varshney\(2013\-17–19 Jun\)Exact rule learning via boolean compressed sensing\.InProceedings of the 30th International Conference on Machine Learning,S\. Dasgupta and D\. McAllester \(Eds\.\),Proceedings of Machine Learning Research, Vol\.28,Atlanta, Georgia, USA,pp\. 765–773\.External Links:[Link](https://proceedings.mlr.press/v28/malioutov13.html)Cited by:[§2](https://arxiv.org/html/2606.14156#S2.p2.1)\.
- \[23\]R\. Mc Grath, L\. Costabello, C\. Le Van, P\. Sweeney, F\. Kamiab, Z\. Shen, and F\. Lecue\(2018\)Interpretable Credit Application Predictions With Counterfactual Explanations\.InNIPS 2018 Workshop on Challenges and Opportunities for AI in Financial Services:the Impact of Fairness, Explainability, Accuracy, and Privacy,NIPS Workshop ’18\.Cited by:[§1](https://arxiv.org/html/2606.14156#S1.p1.1)\.
- \[24\]G\. Mita, P\. Papotti, M\. Filippone, and P\. Michiardi\(2020\-26–28 Aug\)LIBRE: learning interpretable boolean rule ensembles\.InProceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics,Proceedings of Machine Learning Research, Vol\.108,pp\. 245–255\.Cited by:[§1](https://arxiv.org/html/2606.14156#S1.p2.1),[§2](https://arxiv.org/html/2606.14156#S2.p3.1)\.
- \[25\]G\. L\. Nemhauser, L\. A\. Wolsey, and M\. L\. Fisher\(1978\-12\)An analysis of approximations for maximizing submodular set functions–i\.Math\. Program\.14\(1\),pp\. 265–294\.External Links:ISSN 0025\-5610,[Link](https://doi.org/10.1007/BF01588971),[Document](https://dx.doi.org/10.1007/BF01588971)Cited by:[§5\.2](https://arxiv.org/html/2606.14156#S5.SS2.p2.6),[§5\.2](https://arxiv.org/html/2606.14156#S5.SS2.p3.1)\.
- \[26\]F\. Pedregosa, G\. Varoquaux, A\. Gramfort, V\. Michel, B\. Thirion, O\. Grisel, M\. Blondel, P\. Prettenhofer, R\. Weiss, V\. Dubourg, J\. Vanderplas, A\. Passos, D\. Cournapeau, M\. Brucher, M\. Perrot, and E\. Duchesnay\(2011\)Scikit\-learn: machine learning in Python\.JMLR12,pp\. 2825–2830\.Cited by:[footnote 5](https://arxiv.org/html/2606.14156#footnote5)\.
- \[27\]L\. Sellie\(2008\)Learning random monotone dnf under the uniform distribution\.\.InCOLT,pp\. 181–192\.Cited by:[§2](https://arxiv.org/html/2606.14156#S2.p1.1)\.
- \[28\]R\. A\. Servedio\(2001\)On learning monotone dnf under product distributions\.InInternational Conference on Computational Learning Theory,pp\. 558–573\.Cited by:[§2](https://arxiv.org/html/2606.14156#S2.p1.1)\.
- \[29\]E\. Thelisson, K\. Padh, and L\. E\. Celis\(2017\)Regulatory mechanisms and algorithms towards trust in ai/ml\.InProceedings of the IJCAI 2017 workshop on explainable artificial intelligence \(XAI\), Melbourne, Australia,pp\. 19–21\.Cited by:[§1](https://arxiv.org/html/2606.14156#S1.p1.1)\.
- \[30\]L\. G\. Valiant\(1984\-11\)A theory of the learnable\.Commun\. ACM27\(11\),pp\. 1134–1142\.External Links:ISSN 0001\-0782,[Link](https://doi.org/10.1145/1968.1972),[Document](https://dx.doi.org/10.1145/1968.1972)Cited by:[§2](https://arxiv.org/html/2606.14156#S2.p1.1)\.
- \[31\]T\. Wang, C\. Rudin, F\. Doshi\-Velez, Y\. Liu, E\. Klampfl, and P\. MacNeille\(2017\-01\)A bayesian framework for learning rule sets for interpretable classification\.J\. Mach\. Learn\. Res\.18\(1\),pp\. 2357–2393\.External Links:ISSN 1532\-4435Cited by:[§1](https://arxiv.org/html/2606.14156#S1.p2.1),[§2](https://arxiv.org/html/2606.14156#S2.p2.1)\.
- \[32\]S\. Weintraub, L\. Besser, H\. H\. Dodge, M\. Teylan, S\. Ferris, F\. C\. Goldstein, B\. Giordani, J\. Kramer, D\. Loewenstein, D\. Marson, D\. Mungas, D\. Salmon, K\. Welsh\-Bohmer, X\. Zhou, S\. D\. Shirk, A\. Atri, W\. A\. Kukull, C\. Phelps, and J\. C\. Morris\(2018\-01\)Version 3 of the alzheimer disease centers’ neuropsychological test battery in the uniform data set \(UDS\)\.Alzheimer Disease & Associated Disorders32\(1\),pp\. 10–17\.External Links:[Document](https://dx.doi.org/10.1097/wad.0000000000000223),[Link](https://doi.org/10.1097/wad.0000000000000223)Cited by:[§3\.2](https://arxiv.org/html/2606.14156#S3.SS2.p1.1),[§6\.3](https://arxiv.org/html/2606.14156#S6.SS3.SSS0.Px1.p1.1),[§6](https://arxiv.org/html/2606.14156#S6.p1.1)\.
- \[33\]S\. Weintraub, D\. Salmon, N\. Mercaldo, S\. Ferris, N\. R\. Graff\-Radford, H\. Chui, J\. Cummings, C\. DeCarli, N\. L\. Foster, D\. Galasko, E\. Peskind, W\. Dietrich, D\. L\. Beekly, W\. A\. Kukull, and J\. C\. Morris\(2009\-04\)The alzheimer’s disease centers’ uniform data set \(UDS\) : the neuropsychologic test battery\.Alzheimer Disease & Associated Disorders23\(2\),pp\. 91–101\.External Links:[Document](https://dx.doi.org/10.1097/wad.0b013e318191c7dd),[Link](https://doi.org/10.1097/wad.0b013e318191c7dd)Cited by:[§3\.2](https://arxiv.org/html/2606.14156#S3.SS2.p1.1),[§6\.3](https://arxiv.org/html/2606.14156#S6.SS3.SSS0.Px1.p1.1),[§6](https://arxiv.org/html/2606.14156#S6.p1.1)\.
- \[34\]F\. Yang, K\. He, L\. Yang, H\. Du, J\. Yang, B\. Yang, and L\. Sun\(2021\)Learning interpretable decision rule sets: a submodular optimization approach\.InAdvances in Neural Information Processing Systems,A\. Beygelzimer, Y\. Dauphin, P\. Liang, and J\. W\. Vaughan \(Eds\.\),External Links:[Link](https://openreview.net/forum?id=pZHGKM9mAp)Cited by:[§1](https://arxiv.org/html/2606.14156#S1.p2.1)\.
- \[35\]X\. Yin and J\. Han\(2003\-03\)CPAR: classification based on predictive association rules\.In: SDM’033,pp\.\.External Links:[Document](https://dx.doi.org/10.1137/1.9781611972733.40)Cited by:[§2](https://arxiv.org/html/2606.14156#S2.p1.1)\.
- \[36\]G\. Zhang and A\. Gionis\(2020\)Diverse rule sets\.InProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery amp; Data Mining,pp\. 1532–1541\.Cited by:[§1](https://arxiv.org/html/2606.14156#S1.p2.1),[§2](https://arxiv.org/html/2606.14156#S2.p3.1)\.Similar Articles
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
This paper challenges the assumption that RL teaches new reasoning capabilities to LLMs, arguing instead that it performs sparse policy selection at high-entropy decision points. It introduces ReasonMaxxer, an RL-free method that matches full RL performance with significantly lower training costs.
ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization
Introduces ODRPO, a framework that decomposes discrete rewards into ordinal binary indicators to improve robustness of policy optimization in RLAIF for LLMs, achieving up to 14.8% relative improvement with minimal overhead.
Distributional Reinforcement Learning via the Cram\'er Distance
This paper introduces C-DSAC, a new distributional reinforcement learning algorithm that uses the Cramér distance to improve performance and stability in robotic benchmarks compared to standard SAC.
Rethinking the Divergence Regularization in LLM RL
This paper introduces DRPO, which replaces the hard mask in DPPO with a smooth advantage-weighted quadratic regularizer to improve stability and efficiency in LLM reinforcement learning by providing continuous gradient corrections beyond trust-region boundaries.
Debiased Model-based Representations for Sample-efficient Continuous Control
This paper introduces the DR.Q algorithm, which improves model-based representations for Q-learning by maximizing mutual information and using faded prioritized experience replay to reduce bias and overfitting in continuous control tasks.