Worse than Random: The Importance of a Baseline for Unsupervised Feature Selection
Summary
This paper proposes using random feature selection as a baseline and empirically shows that many state-of-the-art unsupervised feature selection methods are outperformed by random selection in both performance and efficiency.
# Worse than Random: The Importance of a Baseline for Unsupervised Feature Selection
Source: [https://arxiv.org/html/2605.22973](https://arxiv.org/html/2605.22973)
###### Abstract
Many novel unsupervised feature selection methods are proposed each year, yet their empirical evaluation is limited to supervised and unsupervised evaluation metrics computed on selected datasets, along with comparisons to existing methods\. However, in the absence of an established evaluation baseline, it is difficult to determine the value added to the existing literature by each of these methods, and how effective their underlying approaches are\. We propose using random feature selection as a baseline for evaluating unsupervised feature selection methods\. We empirically show that many of the state\-of\-the\-art methods in unsupervised feature selection are outperformed by random feature selection in both performance and efficiency\. Accordingly, we emphasize the strict requirement of considering random feature selection as a baseline in the development process of novel unsupervised feature selection methods, to ensure a consistent improvement over random feature selection\.
*Preprint submitted to Elsevier Pattern Recognition Letters\.*
###### keywords:
Evaluation Baseline, Unsupervised Feature Selection, Random Feature Selection, Comparative Analysis
Journal: Pattern Recognition Letters
Affiliations: \[1\] University of Southern Denmark, Odense, Denmark; \[2\] New Jersey Institute of Technology, Newark, NJ, United States; \[3\] Oratio Technologies, Tunis, Tunisia
## 1 Introduction
Evaluation baselines are essential for machine learning applications to provide a point of comparison against which the performance of a complex model can be measured\. Typically, evaluation baselines are simple non\-ML models, or very basic ML models, that help assess how a realistic naïve model would perform on a specific task and dataset\. For instance, in binary classification, a constant function which always predicts the majority class can be considered as a baseline to evaluate classification models\. In regression, a simple linear regression model often acts as a baseline\.
With ongoing research over time, it is unavoidable that different studies evaluate different sets of methods against each other on different collections of datasets\. It is important to note, however, that performance comparisons are not transitive if the datasets are not the same\. Method $B$ can perform better than method $A$ on dataset collection $X$, and method $C$ can perform better than method $B$ on dataset collection $Y$, yet we cannot conclude anything about the comparison between methods $C$ and $A$ on $X$, $Y$, or other datasets\. Therefore, without a proper baseline, it is practically impossible to judge whether a complex, opaque model is truly adding value\. If a complex model is better than a trivial baseline only by a small margin, its complexity and computational cost might be considered unjustified\. A proper baseline helps identify the difficulty of the task and ensures that the achieved performance gains are actually meaningful and not just the result of random coincidence or of a simplistic dataset structure\.
Feature selection is the process of selecting the most important and relevant features, those that best represent the data, and of removing redundancies\. Feature selection is often employed in a supervised setting, where either statistical values are calculated as feature importance \(filter methods\) or feature importance is assessed based on the impact of the feature on a classifier’s prediction \(wrapper methods\) \(Dhal and Azad, [2022](https://arxiv.org/html/2605.22973#bib.bib15)\)\. Recently, unsupervised feature selection methods have gained attention for their ability to select the most informative features without requiring labels \(Guo et al\., [2024](https://arxiv.org/html/2605.22973#bib.bib4); Shang et al\., [2023](https://arxiv.org/html/2605.22973#bib.bib43); Wang et al\., [2024](https://arxiv.org/html/2605.22973#bib.bib3)\)\.
For supervised feature selection, simple information\-theoretic methods such as Mutual Information \(MI\) \(Peng et al\., [2005](https://arxiv.org/html/2605.22973#bib.bib26); Zhou et al\., [2022](https://arxiv.org/html/2605.22973#bib.bib27)\), which measure the information gain about the class label obtained by observing the values of a specific feature, can serve as a suitable baseline\. Variance\-based \(Guyon and Elisseeff, [2003](https://arxiv.org/html/2605.22973#bib.bib157)\) and Correlation\-based \(Hall, [1999](https://arxiv.org/html/2605.22973#bib.bib28); Mitra et al\., [2002](https://arxiv.org/html/2605.22973#bib.bib30)\) methods are similarly simple methods for unsupervised feature selection\. However, these methods may not be suitable as baselines for feature selection in many data types, such as images, where the variance of each feature \(pixel\) appears equal, or data with signals from independent sources\. Moreover, in very high\-dimensional spaces, arguably one of the most important domains for applying feature selection algorithms, their performance drops significantly, turning them into an extremely naïve baseline, similar to using random chance \(50%\) as a baseline for binary classification when the majority of the samples belong to one of the classes\.
In this paper, we propose using random feature selection as a baseline for the evaluation of unsupervised feature selection methods\. Random feature selection is the process of randomly sorting the features to provide a notion of feature ranking and importance for conducting the feature selection task\. Clearly, random feature selection does not require any labels and hence can be deemed unsupervised\. It also does not require any expensive computational steps and is therefore very efficient\. Moreover, in very high\-dimensional spaces where a small proportion of features can still discriminate data points from each other, it is expected to offer an acceptable overall performance, which makes it a proper candidate baseline for the evaluation of unsupervised feature selection methods\.
The rest of the paper is structured as follows: In Section[2](https://arxiv.org/html/2605.22973#S2), we provide a brief literature review on unsupervised feature selection, focusing on recent methods and their evaluation strategies\. In Section[3](https://arxiv.org/html/2605.22973#S3), we discuss the methodology behind using random feature selection as a baseline for the evaluation of unsupervised feature selection methods\. In Section[4](https://arxiv.org/html/2605.22973#S4), we provide the details of the experimental setup used for the empirical evaluation conducted in the paper\. In Section[5](https://arxiv.org/html/2605.22973#S5), we present the experimental results of an in\-depth experiment on established and state\-of\-the\-art unsupervised feature selection methods in comparison with the random baseline, discussing the results and highlighting the shortcomings of recent work in the literature\. In Section[6](https://arxiv.org/html/2605.22973#S6), we conclude the paper\.
## 2 Literature Review
Unsupervised feature selection is an important topic in data mining and machine learning which has been studied over many decades\. On the more established and traditional side, methods such as Variance\-based \(Guyon and Elisseeff, [2003](https://arxiv.org/html/2605.22973#bib.bib157)\) and Correlation\-based \(Hall, [1999](https://arxiv.org/html/2605.22973#bib.bib28); Mitra et al\., [2002](https://arxiv.org/html/2605.22973#bib.bib30)\) are considered the simplest ways of conducting unsupervised feature selection\. The unsupervised general\-purpose forward\-filter Laplacian Score \(LS\) feature selection method \(He et al\., [2005](https://arxiv.org/html/2605.22973#bib.bib106)\) is one of the most popular\. It belongs to the general framework of spectral feature selection \(Zhao and Liu, [2007](https://arxiv.org/html/2605.22973#bib.bib107)\), and has a computational complexity of $O(dn^2)$, where $d$ is the dimensionality \(number of features\) and $n$ is the number of samples\. Unsupervised feature selection can also be based on how well the selected features preserve the cluster structure of the data\. This is the approach of Multi\-Cluster Feature Selection \(MCFS\) \(Cai et al\., [2010](https://arxiv.org/html/2605.22973#bib.bib139)\)\.
Although feature selection is one of the classical problems in machine learning and data mining, it is still the subject of many research papers in recent years\. Subspace learning, cluster analysis and sparse learning are utilized for unsupervised feature selection \(SCFS\) \(Parsa et al\., [2019](https://arxiv.org/html/2605.22973#bib.bib36)\), and a self\-expressive model is employed to learn cluster similarities\. A regularized regression approach is used to capture the existing correlations among features and clusters sparsely\. SOGFS \(Wu and Cheng, [2021](https://arxiv.org/html/2605.22973#bib.bib59)\) performs feature selection and local structure learning simultaneously\. An exponential weighting mechanism is introduced to adjust feature weight distribution \(LLSRFS\) \(Wang et al\., [2024](https://arxiv.org/html/2605.22973#bib.bib3)\)\.
Neural Networks Embedded Self\-expression \(NNSE\) \(You et al\., [2023](https://arxiv.org/html/2605.22973#bib.bib45)\) utilizes neural networks and embeds them into the self\-expression model in order to enhance the representative ability by preserving the local structure with an adaptive graph regularization module\. Variance–covariance subspace distance \(VCSDFS\) \(Karami et al\., [2023](https://arxiv.org/html/2605.22973#bib.bib158)\) utilizes the correlation of information included in the features of data, thus determining all the feature subsets whose corresponding Variance–Covariance matrix has the minimum norm property\. Robust, Adaptive and Flexible Graph \(RAFG\) \(Jiang et al\., [2024](https://arxiv.org/html/2605.22973#bib.bib147)\) is a graph\-learning framework proposed for unsupervised feature selection\. The $L_{2,1}$\-norm is imposed on the flexible regression term to alleviate the adverse effects of both noisy features and outliers, and an $L_{2,1}$\-norm regularization term is incorporated to ensure that the selected transformation matrix is sufficiently sparse\. Most of the proposed methods in recent years, however, have focused on the unsupervised multi\-view feature selection problem \(Yang et al\., [2025](https://arxiv.org/html/2605.22973#bib.bib148); Cao and Xie, [2024](https://arxiv.org/html/2605.22973#bib.bib151); Wu et al\., [2024](https://arxiv.org/html/2605.22973#bib.bib152)\)\.
The evaluation of unsupervised feature selection methods is often based on a limited selection of datasets\. Moreover, the whole concept of evaluation rests on a comparison with other existing methods, and the lack of an evaluation baseline is the common shortcoming of these papers\. Some papers, such as the work of Cai et al\. \([2010](https://arxiv.org/html/2605.22973#bib.bib139)\), use the performance with all of the features as a baseline\. However, in very high\-dimensional spaces, there is arguably a high amount of redundancy among features and many of the features act as noise\. Therefore, it is not difficult to beat such a baseline\.
Only a few papers focus on the evaluation of feature selection algorithms\. Nogueira and Brown \([2016](https://arxiv.org/html/2605.22973#bib.bib51)\) propose a method to measure the stability of feature selection algorithms in the presence of noise\. Baseline Fitness Index \(BFI\) \(Mostert et al\., [2021](https://arxiv.org/html/2605.22973#bib.bib153)\) combines the amount of feature selection and the performance into a single measure\. Rajabinasab et al\. \([2024](https://arxiv.org/html/2605.22973#bib.bib145)\) propose a way to evaluate the overall quality of the feature selection process and the stability of feature selection algorithms based on the gain of adding more features\. None of these methods, however, offers a baseline to guide the development and the evaluation of unsupervised feature selection methods\.
## 3 Methodology
We propose using random feature selection as a baseline to guide the development and evaluation of unsupervised feature selection algorithms\. We run random feature selection 100 times and take the average value of the evaluation metrics as the ground value for the random baseline\. Highlighting the mean and the standard deviation of these values \(e\.g\., in the visualization\) also helps with assessing how well an unsupervised feature selection algorithm is performing in comparison\.
Given a dataset $\mathcal{D}=\{\mathbf{x}_i\}_{i=1}^{n}$ with $n$ instances and $d$ features, the feature selection objective is to select a subset of $k$ features, $\mathcal{F}_k \subset \mathcal{F}_d$, where $\mathcal{F}_d$ is the set of all $d$ features and $k<d$\. Random feature selection operates by assigning a feature importance score $s_j$ to each feature $j \in \{1,\dots,d\}$, drawn from a uniform random distribution $\mathcal{U}$\. The vector of scores is $\mathbf{s}=[s_1,s_2,\dots,s_d]$, where $s_j \sim \mathcal{U}$\. The $k$ features are selected by taking the top $k$ features corresponding to the largest scores in $\mathbf{s}$\. The set of selected feature indices $\mathcal{I}_k$ is defined as:
$$\mathcal{I}_k=\text{Top}_k\left(\{j \mid s_j\}_{j=1}^{d}\right) \tag{1}$$
Footnote 1: Technically, a simple shuffling and selection of the first $k$ features is all that is needed for random feature selection\. However, we present the method based on feature scores to match the framework of regular feature selection algorithms\.
We anticipate the presence of many redundant features in very high\-dimensional spaces\. Hence, even when features are removed at random, the overall process is still expected to be successful\. Random feature selection is also clearly efficient, as it only requires generating random values as feature importance scores\. As the most naïve yet still logical solution to the unsupervised feature selection problem, random feature selection can be considered a suitable baseline for the evaluation and development of unsupervised feature selection algorithms\.
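The score\-based procedure of this section can be sketched in a few lines of Python\. This is a minimal illustration, not the implementation used in the experiments: the function name `random_feature_selection`, the `seed` argument, and the choice of the unit interval for the uniform distribution are assumptions made here for the example\.

```python
import random


def random_feature_selection(d, k, seed=None):
    """Return k feature indices chosen by ranking uniform random scores.

    d    -- total number of features
    k    -- number of features to keep (k < d)
    seed -- optional seed for reproducibility (an assumption of this sketch)
    """
    if not 0 < k < d:
        raise ValueError("k must satisfy 0 < k < d")
    rng = random.Random(seed)
    # One importance score per feature, s_j ~ U(0, 1).
    scores = [rng.random() for _ in range(d)]
    # Rank the feature indices by descending score and keep the top k.
    ranked = sorted(range(d), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Hypothetical example: keep 50 out of 1000 features.
selected = random_feature_selection(d=1000, k=50, seed=0)
```

Repeating the call with different seeds and averaging the downstream metric over the resulting subsets gives the mean and standard deviation of the random baseline\.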
## 4 Experimental Setup
In this section, we present the experimental setup to evaluate the feature selection performance of various unsupervised feature selection methods in comparison with the random baseline\. The experiments include traditional methods such as Variance\-based \(Guyon and Elisseeff, [2003](https://arxiv.org/html/2605.22973#bib.bib157)\), Correlation\-based \(Hall, [1999](https://arxiv.org/html/2605.22973#bib.bib28); Mitra et al\., [2002](https://arxiv.org/html/2605.22973#bib.bib30)\), Laplacian Score \(He et al\., [2005](https://arxiv.org/html/2605.22973#bib.bib106)\), and MCFS \(Cai et al\., [2010](https://arxiv.org/html/2605.22973#bib.bib139)\), as well as recent state\-of\-the\-art methods including SCFS \(Parsa et al\., [2019](https://arxiv.org/html/2605.22973#bib.bib36)\), SOGFS \(Wu and Cheng, [2021](https://arxiv.org/html/2605.22973#bib.bib59)\), LLSRFS \(Wang et al\., [2024](https://arxiv.org/html/2605.22973#bib.bib3)\), and VCSDFS \(Karami et al\., [2023](https://arxiv.org/html/2605.22973#bib.bib158)\)\.
### 4\.1 Benchmark Datasets
We conduct extensive experiments on large and high\-dimensional datasets from the scikit\-feature repository \(Li et al\., [2018](https://arxiv.org/html/2605.22973#bib.bib56)\)\. An overview of the characteristics of the datasets included in the experiments is presented in Tab\. [1](https://arxiv.org/html/2605.22973#S4.T1)\.
Table 1: Characteristics of the high\-dimensional benchmark datasets\.
### 4\.2 Evaluation Metrics and Methods
To thoroughly evaluate the feature selection methods across both supervised and unsupervised downstream tasks, four key metrics are employed: Accuracy \(ACC\), Area Under the Curve \(AUC\), Clustering Accuracy \(CLSACC\), and Normalized Mutual Information \(NMI\)\. We also use FSDEM and Z\-score for further analysis and insights\.
*Accuracy \(ACC\)\.* Accuracy is a fundamental metric for supervised classification tasks, representing the proportion of the total number of predictions that were correct\. For a set of $N$ instances, it is calculated as:
$$\text{ACC}=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}} \tag{2}$$
where TP \(True Positives\), TN \(True Negatives\), FP \(False Positives\), and FN \(False Negatives\) are derived from the confusion matrix\.
*Area Under the Curve \(AUC\)\.* AUC measures the ability of a classifier to distinguish between classes\. It is the area under the Receiver Operating Characteristic \(ROC\) curve, which plots the True Positive Rate \(Sensitivity\) against the False Positive Rate \(1 \- Specificity\) at various threshold settings\. AUC ranges from 0 to 1, with 0\.5 indicating random performance of the classifier and a higher value indicating better performance\.
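For the binary case, the area under the ROC curve equals the fraction of \(positive, negative\) pairs in which the positive sample receives the higher score, with ties counted as one half\. The sketch below computes AUC through this pairwise view rather than by tracing the curve; the function name `auc_from_scores` and the toy inputs are hypothetical and serve only as an illustration\.

```python
def auc_from_scores(labels, scores):
    """Binary AUC as the fraction of correctly ordered (positive, negative) pairs.

    labels -- iterable of 0/1 class labels
    scores -- classifier scores, higher meaning "more likely positive"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # pair ordered correctly
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy illustration (made-up scores, not experimental results).
toy_auc = auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The double loop is quadratic in the number of samples, which is acceptable for an illustration; a rank\-based formulation would be preferable for large datasets\.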
*Clustering Accuracy \(CLSACC\)\.* Clustering Accuracy is used to assess how well the clusters found by the algorithm match the true labels of the data\. It requires finding the best one\-to\-one mapping \(permutation $\pi$\) between the cluster assignments \($c_i$\) and the true labels \($l_i$\):
$$\text{CLSACC}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\left(l_i=\pi(c_i)\right) \tag{3}$$
where $N$ is the number of data points and $\mathbb{I}(\cdot)$ is the indicator function\. The optimal mapping $\pi$ is typically found using the Hungarian algorithm \(Kuhn, [1955](https://arxiv.org/html/2605.22973#bib.bib154)\)\.
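The following sketch illustrates Eq\. \(3\) on small inputs\. To stay self\-contained it does not use the Hungarian algorithm; instead it tries every one\-to\-one mapping with `itertools.permutations`, which gives the same optimum but is only practical for a handful of clusters\. The function name `clustering_accuracy` is an assumption of this example\.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Best-match accuracy between cluster assignments and true labels.

    Exhaustively searches all one-to-one mappings from clusters to classes.
    This brute force replaces the Hungarian algorithm for illustration only.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")
    best_hits = 0
    # Each permutation assigns a distinct class to every cluster.
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(1 for l, c in zip(true_labels, cluster_ids) if mapping[c] == l)
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


# Toy illustration: cluster ids are a relabelling of the classes with one error.
toy_clsacc = clustering_accuracy([0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0])
```

For realistic numbers of clusters, an assignment solver such as the Hungarian algorithm applied to the cluster\-by\-class contingency table is the appropriate replacement for the exhaustive search\.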
*Normalized Mutual Information \(NMI\)\.* NMI is another metric for unsupervised tasks that quantifies the degree of dependence between the clustering \($C$\) and the true labels \($L$\) by normalizing the Mutual Information \(MI\), making it comparable across different cluster numbers\. NMI is calculated as:
$$\text{NMI}(C,L)=\frac{\text{MI}(C,L)}{\sqrt{H(C)H(L)}} \tag{4}$$
where $\text{MI}(C,L)$ is the Mutual Information between $C$ and $L$, and $H(C)$ and $H(L)$ are the entropies of $C$ and $L$, respectively\.
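Eq\. \(4\) can be computed directly from label counts\. The sketch below is a plain\-Python illustration with hypothetical function names; returning 0 when either entropy is zero is a convention assumed here, since the formula is undefined in that case\.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label sequences of equal length."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts.
    return sum(
        (n_xy / n) * log(n * n_xy / (count_a[x] * count_b[y]))
        for (x, y), n_xy in count_ab.items()
    )


def nmi(clusters, labels):
    """NMI with geometric-mean normalisation, as in Eq. (4)."""
    h_c, h_l = entropy(clusters), entropy(labels)
    if h_c == 0.0 or h_l == 0.0:
        return 0.0  # assumed convention for the degenerate case
    return mutual_information(clusters, labels) / sqrt(h_c * h_l)


# Toy illustration: a pure relabelling gives NMI = 1, independence gives 0.
nmi_relabelled = nmi([0, 0, 1, 1], [1, 1, 0, 0])
nmi_independent = nmi([0, 1, 0, 1], [0, 0, 1, 1])
```

Unlike CLSACC, this measure needs no explicit matching between cluster ids and class labels\.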
*Feature Selection Dynamic Evaluation Metric \(FSDEM\)\.* In order to evaluate the overall process of feature selection, we use FSDEM \(Rajabinasab et al\., [2024](https://arxiv.org/html/2605.22973#bib.bib145)\) combined with the aforementioned metrics\. FSDEM is calculated as:
$$\text{FSDEM}=\frac{\int_{a}^{b}g(x)\,dx}{b-a} \tag{5}$$
where $g(x)$ indicates the approximated function based on different observations of an arbitrary evaluation metric, and $a$ and $b$ indicate the specific range of values in which the performance is evaluated\. FSDEM allows us to assess the quality of the overall feature selection process, instead of the individual points\.
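Eq\. \(5\) is the average value of $g$ over $[a,b]$\. The sketch below evaluates it numerically under one explicit assumption: $g$ is taken to be the piecewise\-linear interpolation of the observed metric values, so the integral reduces to the trapezoidal rule\. The text does not specify how $g$ is approximated, and the function name `fsdem` is hypothetical\.

```python
def fsdem(xs, ys):
    """Average height of the piecewise-linear curve through (xs, ys).

    xs -- proportions (or numbers) of selected features
    ys -- metric values observed at those points
    Assumes g is the linear interpolation of the observations (trapezoid rule).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two matching observations")
    points = sorted(zip(xs, ys))
    a, b = points[0][0], points[-1][0]
    if b == a:
        raise ValueError("the evaluation range [a, b] must have positive length")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)  # trapezoid between neighbours
    return area / (b - a)


# Toy illustration (made-up values): a flat curve and a linearly rising one.
flat = fsdem([0.05, 0.10, 0.15], [0.8, 0.8, 0.8])
rising = fsdem([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])
```

Because the result is normalised by $b-a$, scores from the full range and from a restricted range share the same scale as the underlying metric\.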
*Z\-score\.* To further analyze the magnitude of the difference between the benchmark methods and the random baseline, we calculate the $Z$\-score for each method relative to the distribution of the random baseline’s results\. The Z\-score is defined as:
$$Z=\frac{P_{\text{method}}-\mu_{\text{random}}}{\sigma_{\text{random}}} \tag{6}$$
where $P_{\text{method}}$ is the performance metric of a specific feature selection algorithm, and $\mu_{\text{random}}$ and $\sigma_{\text{random}}$ are the mean and standard deviation of the random baseline’s performance, respectively\. This transformation allows us to observe how many standard deviations above or below the expected random performance a method performs\.
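A short sketch of Eq\. \(6\), using only the Python standard library\. The function name `z_score` and the toy numbers are hypothetical; the use of the sample standard deviation \(`statistics.stdev`\) is an assumption, as the text does not state whether the sample or the population estimate is used\.

```python
from statistics import mean, stdev


def z_score(method_score, random_scores):
    """Standardise a method's score against repeated random-baseline runs.

    method_score  -- metric value achieved by the evaluated method
    random_scores -- metric values from repeated random feature selection
    """
    mu = mean(random_scores)
    sigma = stdev(random_scores)  # sample standard deviation (assumption)
    if sigma == 0:
        raise ValueError("the random baseline has zero variance")
    return (method_score - mu) / sigma


# Toy illustration with made-up values, not results from the experiments.
toy_random_runs = [0.70, 0.72, 0.74]
toy_z = z_score(0.76, toy_random_runs)
```

A negative value means the method scores below the average random subset at that number of selected features\.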
### 4\.3 Evaluation Configuration
We perform our experiments using the FSEVAL benchmarking suite \(Rajabinasab and Zimek, [2026](https://arxiv.org/html/2605.22973#bib.bib159)\), a specialized framework designed for the extensive and comprehensive evaluation of feature selection algorithms\. The supervised evaluation is done using a Random Forest \(Breiman, [2001](https://arxiv.org/html/2605.22973#bib.bib155)\) classifier with 5\-fold stratified cross\-validation to provide a robust evaluation\. For the unsupervised evaluation, we use the average of 10 runs of $k$\-means\. For the random baseline, feature selection is conducted 100 times to clearly reflect the standard deviation\. The average value is used for further analysis, and the standard deviation is used to calculate the Z\-score for competing methods\.
We conduct two different experiments to evaluate the performance of the feature selection process\. First, we study the overall feature selection process by selecting from 5% to 100% of features with a step size of 5%\. As the datasets are very high\-dimensional, we also conduct a second experiment by selecting from 0\.5% to 10% of features with a step size of 0\.5%\. This gives a better and more realistic view of the performance of feature selection algorithms and ensures that a high number of selected features relative to the number of instances does not obscure the results\. In realistic feature selection problems, we usually aim to select a very small number of features for the final downstream task\. The second experiment allows us to assess the performance of a feature selection method in extreme dimensionality reduction cases\.
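The two selection grids can be turned into concrete feature counts as sketched below\. The helper name `selection_sizes`, the rounding rule \(Python's built\-in `round`, with at least one feature kept\) and the example dimensionality of 1000 are assumptions of this illustration; the text does not state how fractional counts are handled\.

```python
def selection_sizes(d, start_pct, stop_pct, step_pct):
    """Number of features to select at each percentage level of a grid.

    d         -- total number of features in the dataset
    start_pct -- first percentage of the grid (e.g. 5 or 0.5)
    stop_pct  -- last percentage of the grid, inclusive (e.g. 100 or 10)
    step_pct  -- step between consecutive levels (e.g. 5 or 0.5)
    """
    levels = round((stop_pct - start_pct) / step_pct) + 1
    sizes = []
    for i in range(levels):
        pct = start_pct + i * step_pct
        # Rounding rule is an assumption; always keep at least one feature.
        sizes.append(max(1, round(d * pct / 100)))
    return sizes


# Hypothetical dataset with 1000 features.
full_range = selection_sizes(1000, 5, 100, 5)     # 5% to 100% in steps of 5%
low_range = selection_sizes(1000, 0.5, 10, 0.5)   # 0.5% to 10% in steps of 0.5%
```

Both grids contain 20 levels, so the two experiments produce curves with the same number of evaluation points\.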
## 5 Experimental Results
We conduct experiments on the selected unsupervised feature selection methods from both performance and efficiency perspectives to highlight the gain and the cost of using complex methods for unsupervised feature selection\.
### 5\.1 Efficiency Analysis
The first experiment presents the runtime of the unsupervised feature selection algorithms\. For this experiment, we consider two variables, the number of instances and the number of features\. We present two results in which one of the variables is fixed at 100 and the other variable varies from 1000 to 20000 with a step size of 500\. For each method, the runtime is capped at one hour\. The first time a method takes more than one hour to finish the feature selection process, the time is recorded and further investigation is skipped\. The experimental results are presented in Fig\. [1](https://arxiv.org/html/2605.22973#S5.F1)\. The $y$\-axis is presented in logarithmic scale to enhance readability\.
Figure 1: Runtime analysis of the feature selection methods\. The $y$\-axis is presented in logarithmic scale\.
As expected, the random baseline is the most efficient method by a large margin, as it does not involve any significant computational operations\. It is followed by the traditional methods such as Variance\-based and Correlation\-based feature selection\. Recent state\-of\-the\-art methods are computationally more expensive\. SOGFS and LLSRFS are the most expensive methods, and their runtime reaches the cap at an early stage\. As we aim to provide consistent and comprehensive experiments on large high\-dimensional datasets, we omit them from the rest of the experiments\.
### 5\.2 Performance Analysis
We conduct an extensive performance analysis based on our evaluation metrics and on all 23 benchmark datasets\. In Fig\. [2](https://arxiv.org/html/2605.22973#S5.F2), evaluation results of the metrics on the Isolet dataset are shown as an example for the 10% range experiment\. Clearly, the evaluation metrics assign a very good performance figure to the random baseline, despite its naïve approach and low computational complexity\. Many of the feature selection methods, including the state\-of\-the\-art algorithms, fail to provide a better result than the random baseline\. The same holds for the second case, in which the range of the number of selected features is up to 100%\. The corresponding figure is included in the supplementary material\.
Footnote 2: Supplementary material, including all figures for all datasets, is presented on: [https://fseval\.imada\.sdu\.dk/random/](https://fseval.imada.sdu.dk/random/)
Figure 2: Comparison of the feature selection performance of unsupervised feature selection methods with the random baseline on the Isolet dataset over the first 10%\. It is evident that the random baseline outperforms state\-of\-the\-art methods in most cases\.
We also present the Z\-score results on the Isolet dataset as an example based on the 10% range experiment in Fig\. [3](https://arxiv.org/html/2605.22973#S5.F3)\. By centering the results on the random baseline \($Z=0$\), the relative failure of several state\-of\-the\-art methods becomes even more apparent\. In many instances, most of the methods exhibit significantly negative Z\-scores, indicating that they consistently underperform the random baseline\. Conversely, the Z\-score plots highlight that, while methods like SCFS may occasionally achieve positive scores, they rarely deviate far enough from the zero line to suggest a robust, non\-trivial advantage over random feature selection\. The exact same phenomenon occurs in the second case, in which the range of the number of selected features is up to 100%, as included in the supplementary material\. It is noteworthy that SCFS is not a fully unsupervised method, as it uses the number of classes as an input, which affects the feature selection process\.
Figure 3: Z\-score performance relative to the random baseline on the Isolet dataset over the extreme dimensionality reduction experiment \(0\.5% to 10%\)\.
Clearly, the feature selection performance varies across datasets\. We selected the Isolet dataset for demonstration, as it shows a pattern commonly observed in many other datasets\. However, we also analyze the overall performance over all 23 benchmark datasets used for the experiments with critical difference diagrams, which illustrate the rank statistics based on Wilcoxon\-Holm and sort the methods by their performance from right \(best\) to left \(worst\)\. The critical difference diagrams follow the methodology of Demsar \([2006](https://arxiv.org/html/2605.22973#bib.bib156)\)\. Figures [4](https://arxiv.org/html/2605.22973#S5.F4) and [5](https://arxiv.org/html/2605.22973#S5.F5) depict the critical difference diagrams for different metrics over the full range of the number of features and the 10% range, respectively\. This analysis is done using the average performance measured by the FSDEM \(Rajabinasab et al\., [2024](https://arxiv.org/html/2605.22973#bib.bib145)\) score for all of the metrics\.
Figure 4: Critical difference diagram over the full range of features for different metrics based on the average performance measured by the FSDEM score\.
Figure 5: Critical difference diagram over the first 10% of features for different metrics based on the average performance measured by the FSDEM score\.
Clearly, the random baseline is consistently among the best\-performing unsupervised feature selection methods\. For supervised tasks, only SCFS \(Parsa et al\., [2019](https://arxiv.org/html/2605.22973#bib.bib36)\), which is not completely unsupervised, as mentioned earlier, ranks better than the random baseline\. For unsupervised tasks, Laplacian Score \(He et al\., [2005](https://arxiv.org/html/2605.22973#bib.bib106)\) is the top performer\. However, there is no statistically significant difference in the performance of these methods, and their overall rank is not very different from that of the random baseline\. Given their computational cost, this makes these methods questionable choices for unsupervised feature selection\. Many methods, such as the recent state\-of\-the\-art VCSDFS \(Karami et al\., [2023](https://arxiv.org/html/2605.22973#bib.bib158)\), are consistently outperformed by the random baseline, which raises the critical question of whether their performance is acceptable at all\.
Overall, no method consistently outperforms the random baseline on both supervised and unsupervised downstream tasks, and in the specific cases where one method appears to be better, the difference is not statistically significant and the overall rank is very close to that of the random baseline\. This clearly reflects the importance of a random baseline in the evaluation of unsupervised feature selection algorithms\.
## 6 Discussion and Conclusion
Let us note that the established evaluation measures reflect the performance of the classifier or clustering method in the downstream task, and hence only very indirectly the performance of the feature selection process as such\. For example, using AUC, if the classifier randomly guesses the classes in a binary classification scenario, an AUC of $0.5$ would be expected\. However, a classifier is expected to perform better than random, even with a randomly selected subset of features\. For ACC, CLSACC, and NMI, determining random behavior of the classifier or cluster assignment is not always equally straightforward, but in any case, what is measured is the *downstream performance of the classifier or clustering method* on the selected features, *not the selection of the features as such*\. This might explain why close to \(or worse than\) random behavior of feature selection methods has not been noticed in the literature so far\.
*Therefore, any supervised or unsupervised feature selection method should provide a feature subset which is considerably better than any randomly selected subset to be realistically deemed effective\.*
The findings of this paper clearly indicate that no method significantly improves the performance of the unsupervised feature selection task in comparison with the random baseline\. In most cases, the random baseline appears to be the best or the second\-best choice\. When a method outperformed the random baseline, there was no significant statistical difference observed, and the overall rank value was very close to the random baseline\. Also in terms of efficiency, as expected, the random baseline is superior, followed by the classic algorithms like variance\-based and correlation\-based methods\. Recent methods showed a significantly higher computational complexity while offering no significant improvement over the random baseline\.
This paper aims to emphasize the critical lack of a baseline for the evaluation of unsupervised feature selection algorithms\. A baseline is required to guide the design and development of new methods, ensuring that the computational cost imposed by these algorithms is justified and that the proposed method can offer a significantly better performance compared to the baseline and other approaches\. We demonstrated that random feature selection can play the role of the baseline as it offers a consistently good downstream performance with almost no computational cost\.
## Acknowledgments
This study was funded by the Innovation Fund Denmark project “PREPARE: Personalized Risk Estimation and Prevention of Cardiovascular Disease”\.