Objective-Induced Bias and Search Dynamics in Multiobjective Unsupervised Feature Selection

arXiv cs.LG Papers

Summary

This paper systematically studies how different evaluation objectives (accuracy, silhouette score, PCA reconstruction loss) and subset-size regularization directions affect search dynamics and solution quality in multiobjective unsupervised feature selection, showing that silhouette-based formulations bias toward trivial low-cardinality solutions while PCA loss yields compact subsets with competitive accuracy.

arXiv:2605.21561v1 Announce Type: new

Cached at: 05/22/26, 08:49 AM

# Objective-Induced Bias and Search Dynamics in Multiobjective Unsupervised Feature Selection
Source: [https://arxiv.org/html/2605.21561](https://arxiv.org/html/2605.21561)
Thomas Bäck¹, Martijn R. Tannemaat², Anna V. Kononova¹

¹ LIACS, Leiden University, Leiden, The Netherlands; ² LUMC, Leiden University, Leiden, The Netherlands

###### Abstract

Unsupervised feature selection is commonly formulated as a multiobjective optimisation problem that jointly optimises subset quality and subset size\. Yet the behaviour of this formulation depends critically on the choice of evaluation objective, the direction of subset\-size regularisation, and the initialisation strategy\. We study these factors in a controlled setting using a synthetic dataset with known informative, redundant, and irrelevant feature types\. Six formulations are compared by combining three evaluation objectives: accuracy, silhouette score, and PCA reconstruction loss with subset\-size minimisation or maximisation\. The results show that formulation strongly affects both search dynamics and the quality of the resulting Pareto front\. Silhouette\-based formulations exhibit a strong bias toward trivial low\-cardinality solutions and remain weak proxies for predictive performance\. In contrast, the proposed PCA loss objective produces compact subsets with test accuracy comparable to subsets obtained by directly optimising supervised accuracy\. Overall, the study shows that objective design is central to effective multiobjective unsupervised feature selection\.

## 1 Introduction

Feature selection (FS) aims to identify a compact subset of features that preserves the information content of a dataset. This process is particularly beneficial for datasets with many features, which may lead to complexity issues due to the curse of dimensionality or result in long extraction times if computed over raw data, e.g., in time series. By removing potentially irrelevant and redundant features, FS lowers computational requirements and usually enhances the ability of a model to generalise to unseen data. By restricting the model to more informative features, it is less likely to mistake random noise in the training data for meaningful patterns. The FS problem is inherently combinatorial, with a search space of size $2^{d}$, where $d$ denotes the number of features. Exhaustive search is therefore infeasible, even for a moderate number of features, motivating heuristic optimisation strategies such as sequential selection methods and evolutionary algorithms\[[2](https://arxiv.org/html/2605.21561#bib.bib4)\]. While sequential approaches iteratively construct subsets based on greedy criteria, evolutionary algorithms explore a broader set of candidate subsets, allowing a wider variety of feature combinations to be considered.

Candidate feature subsets are typically evaluated using either unsupervised criteria, such as clustering quality, or supervised performance measures when labels are available. However, both types of objectives are strongly influenced by subset cardinality, which can lead to degenerate solutions if not explicitly controlled\[[12](https://arxiv.org/html/2605.21561#bib.bib23)\]. A widely adopted strategy to mitigate this issue is to formulate feature selection as a multiobjective optimisation problem, jointly optimising an evaluation objective and a subset-size regulariser (e.g., maximising accuracy while minimising subset size)\[[8](https://arxiv.org/html/2605.21561#bib.bib3)\]. Despite its widespread use, this formulation introduces several underexplored design choices that can substantially influence optimisation behaviour. In particular, the direction of the subset-size regulariser (minimisation versus maximisation) and the initial distribution of subset cardinalities in the population can affect the search and determine which regions of the Pareto front are reachable under a limited evaluation budget. Moreover, in real-world datasets where the true structure of features is unknown, Pareto fronts are typically interpreted only through objective values, making it difficult to understand how different objectives relate to underlying feature types or redundancy structures.

To address these limitations, we propose a controlled experimental framework based on a synthetic dataset with an explicit feature taxonomy\. The dataset is designed to include informative features, linearly and non\-linearly redundant features, and multiple forms of noise, enabling direct inspection of the composition of selected subsets\. This allows us to move beyond purely objective\-based analysis and study how different optimisation choices affect both Pareto optimality and the structural properties of solutions in terms of feature content\. Within this framework, we investigate multiobjective feature selection using three evaluation objectives: two unsupervised objectives, namely silhouette score and PCA reconstruction loss that we introduce, and a supervised classification objective used as a baseline objective\. These objectives exhibit different sensitivities to subset cardinality, making them suitable for analysing objective\-induced bias in multiobjective feature selection\.

Our study makes three main contributions. First, we analyse how subset-size regularisation and initial population sampling strategies shape the structure of the Pareto front under different evaluation objectives. We show how these design choices determine which regions of the search space are explored under limited computational budgets, and, for both accuracy and silhouette-based objectives, how this affects the attainable trade-offs. Second, using a synthetic dataset with a known feature taxonomy, we analyse the composition of approximations of Pareto-optimal solutions in terms of informative, redundant, and irrelevant controlled features. This enables us to characterise how different objectives select feature types. Finally, we introduce an unsupervised objective, PCA loss, which to the best of our knowledge has not been applied in this context, and analyse its behaviour under subset-size regularisation, addressing its cardinality bias. We evaluate it in terms of approximated Pareto front structure, feature composition, and alignment with downstream feature-based classification performance.

## 2 Related Work

### 2.1 Unsupervised Feature Selection

Unsupervised feature selection (UFS) aims to identify informative subsets of features without access to target labels. The independence of UFS methods from known labels makes them applicable when targets are unknown or unreliable. By not relying on the target for selecting features, UFS reduces the risks of overfitting to the training data and of information leakage. In some cases, unsupervised selection methods have been shown to achieve performance comparable to supervised approaches\[[1](https://arxiv.org/html/2605.21561#bib.bib25),[10](https://arxiv.org/html/2605.21561#bib.bib61)\] while having a higher potential for generalising to unseen data.

UFS methods can be broadly categorised into four main strategies\[[6](https://arxiv.org/html/2605.21561#bib.bib59)\]. Wrapper methods evaluate candidate feature subsets using a downstream unsupervised objective, such as clustering quality\[[12](https://arxiv.org/html/2605.21561#bib.bib23)\]. Filter methods rank or score features based on intrinsic data properties, such as similarity preservation\[[21](https://arxiv.org/html/2605.21561#bib.bib18)\], spectral structure\[[13](https://arxiv.org/html/2605.21561#bib.bib21)\], or variance and redundancy measures\[[9](https://arxiv.org/html/2605.21561#bib.bib17)\]. Embedded methods integrate feature selection directly into the learning process, while hybrid approaches combine multiple strategies, for example by using filter methods to initialise or guide a wrapper-based search.

A common wrapper approach for unsupervised feature selection involves the use of a clustering algorithm to assess whether the subset of interest exposes clear clusters in the data. Such methods generally rely on clustering quality metrics such as the silhouette score\[[23](https://arxiv.org/html/2605.21561#bib.bib16)\] or the Davies-Bouldin index\[[5](https://arxiv.org/html/2605.21561#bib.bib15)\]. The former measures both intra-cluster cohesion and inter-cluster separation, while the latter measures the ratio of within-cluster scatter to between-cluster separation.

Despite offering an intuitive way to assess whether a reduced feature set preserves meaningful data structure by exposing clearly separated groups, the natural form of these objectives does not account for the induced cardinality bias. In feature selection settings, good clustering scores can often be achieved with very few features, as reduced dimensionality may artificially enhance cluster contrast or suppress noise. As a result, these measures are sensitive to feature set cardinality and can favour small, potentially uninformative subsets. Optimisation based on such criteria may lead to trivial solutions unless this bias is explicitly controlled\[[12](https://arxiv.org/html/2605.21561#bib.bib23)\]. Additionally, their use in a wrapper approach can lead to significant computational cost, as a clustering algorithm such as $k$-means must be rerun for every candidate subset during the optimisation process. Furthermore, finding the right hyperparameter values for the clustering algorithm can be difficult and may have a large impact on the perceived quality of the subset. For many clustering techniques that require the number of clusters to be specified, it has been shown that a dynamic number of clusters is preferable\[[7](https://arxiv.org/html/2605.21561#bib.bib62)\].

This subset dimensionality bias in feature selection is not unique to unsupervised settings\. Even supervised objectives such as classification accuracy often exhibit weak sensitivity to the inclusion of irrelevant and redundant features, meaning that optimising accuracy alone may favour large feature subsets\. To address this bias, two main strategies have been proposed: \(i\) modifying the objective function to account for subset size, for example by normalising by the subset size\[[7](https://arxiv.org/html/2605.21561#bib.bib62),[16](https://arxiv.org/html/2605.21561#bib.bib41)\], or \(ii\) explicitly considering feature subset cardinality as a separate objective and solving the resulting multiobjective optimisation problem\[[25](https://arxiv.org/html/2605.21561#bib.bib28),[18](https://arxiv.org/html/2605.21561#bib.bib14),[11](https://arxiv.org/html/2605.21561#bib.bib13)\]\. In the rest of the paper, we focus on the latter\.

### 2.2 Multiobjective Feature Selection

The transformation of a single-objective problem into a multiobjective one by means of a helper objective is known as multiobjectivisation\[[17](https://arxiv.org/html/2605.21561#bib.bib70)\]. In feature selection, it typically involves optimising an evaluation objective $f_{1}$ (e.g., accuracy, clustering quality or reconstruction error) jointly with a subset-size regulariser objective $f_{2}$, which either penalises or rewards the number of selected features depending on the chosen direction. The resulting multiobjective formulation aims to explicitly control the cardinality bias inherent in many evaluation objectives by exposing the trade-off between solution quality and subset size.

Beyond bias control, multiobjectivisation has been shown to offer several potential advantages. It can reduce the number of local optima and reshape the fitness landscape, making it easier to efficiently explore the solution space\[[17](https://arxiv.org/html/2605.21561#bib.bib70)\]. It also introduces regions of incomparability between solutions, which can promote population diversity and, thus, improve exploration. While it is possible to keep the selection strategy single-objective by combining the evaluation objective and the regulariser objective into a scalar objective function, this requires careful selection of weights and does not offer the advantages of multiobjectivisation mentioned above. By approximating a set of Pareto-optimal solutions, multiobjective strategies provide a more complete representation of the trade-offs between the objectives\[[19](https://arxiv.org/html/2605.21561#bib.bib75)\]. As a result, multiobjective feature selection has become a widely studied and promising direction for feature selection, with numerous evolutionary approaches proposed in the literature\[[14](https://arxiv.org/html/2605.21561#bib.bib76)\].

## 3 Problem formulation

We consider feature selection as a multiobjective optimisation problem over the space of all possible feature subsets. Let $X \in \mathbb{R}^{n \times d}$ denote the original dataset containing $n$ samples and $d$ features. A candidate solution is then represented by a binary decision vector $x \in \{0,1\}^{d}$, where each element $x_{i}$ acts as an indicator for the $i$-th feature. To evaluate the fitness of a candidate solution, we define the filtered dataset $X_{x}$ as the submatrix of $X$ consisting of the columns where $x_{i} = 1$. The multiobjective feature selection (MOFS) problem is defined as the simultaneous minimisation of two competing objectives:

$$\min_{x \in \{0,1\}^{d}} \bigl(f_{1}(X_{x}),\, f_{2}(x)\bigr) \tag{1}$$

where $f_{1}(X_{x})$ is an evaluation objective measuring the quality of the selected subset, while $f_{2}(x)$ is a regularisation objective measuring the subset cardinality.
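To make the formulation concrete, the following is a minimal Python sketch of how a binary mask selects columns and how Pareto dominance between objective vectors can be checked when both objectives are minimised. The helper names (`filter_columns`, `dominates`, `non_dominated`) and the toy objective values are illustrative assumptions, not part of the original study.

```python
import numpy as np

def filter_columns(X, x):
    """Return X_x: the columns of X where the binary mask x equals 1."""
    return X[:, np.asarray(x, dtype=bool)]

def dominates(u, v):
    """True if objective vector u Pareto-dominates v (all objectives minimised)."""
    u, v = np.asarray(u), np.asarray(v)
    return bool(np.all(u <= v) and np.any(u < v))

def non_dominated(points):
    """Indices of the points that no other point dominates."""
    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]

# Toy objective vectors (f1, f2) = (negative accuracy, subset size).
# The numbers are made up purely for illustration.
points = [(-0.90, 5), (-0.85, 3), (-0.85, 4), (-0.60, 1), (-0.95, 8)]
front = non_dominated(points)
print(front)  # (-0.85, 4) is dominated by (-0.85, 3); the rest are incomparable
```

In this toy example only the third point is excluded, because another point matches its first objective with a smaller subset. The remaining points trade quality against size and are mutually incomparable.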

## 4 Objectives

### 4.1 Silhouette objective

The silhouette objective evaluates the quality of clustering induced by the selected feature subset. Given a candidate solution $x$, a clustering algorithm is applied to the filtered dataset $X_{x}$. The silhouette score\[[23](https://arxiv.org/html/2605.21561#bib.bib16)\] measures the cohesion within clusters and the separation between clusters based on this subspace. For each sample $i$ in $X_{x}$, let $a_{i}(X_{x})$ denote the mean distance between $i$ and all other samples in the same cluster, and let $b_{i}(X_{x})$ denote the minimum mean distance between $i$ and all samples in any other cluster. The silhouette coefficient for sample $i$ is defined as:

$$s_{i}(X_{x}) = \frac{b_{i}(X_{x}) - a_{i}(X_{x})}{\max\{a_{i}(X_{x}),\, b_{i}(X_{x})\}} \tag{2}$$

The silhouette score of the filtered dataset $X_{x}$ is then given by the average over all samples:

$$s(X_{x}) = \frac{1}{n} \sum_{i=1}^{n} s_{i}(X_{x}) \tag{3}$$

We use the implementation of the silhouette score provided by scikit-learn\[[22](https://arxiv.org/html/2605.21561#bib.bib10)\]. In this work, cluster labels are obtained using $k$-means. For each subset $x$, the number of clusters $k$ is selected by maximising the silhouette score over a predefined range of candidate values. The silhouette score lies in $[-1, 1]$, where higher values indicate better clustering quality. The negative of the silhouette score is then used as the evaluation objective, to be minimised.
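The procedure described above could be sketched as follows. This is a minimal illustration under stated assumptions: the candidate range for $k$, the `n_init` and seed values, the handling of empty subsets, and the toy data are all hypothetical choices, not the settings used in the paper. Only `KMeans` and `silhouette_score` are actual scikit-learn functions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_objective(X, x, k_values=(2, 3, 4), seed=0):
    """Negative of the best silhouette score over k_values (to be minimised)."""
    X_x = X[:, np.asarray(x, dtype=bool)]
    if X_x.shape[1] == 0:
        return 1.0  # assumption: an empty subset receives the worst value
    best = -1.0
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=5, random_state=seed).fit_predict(X_x)
        best = max(best, silhouette_score(X_x, labels))
    return -best

# Toy data (made up): two clustered columns followed by two pure-noise columns.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
clustered = np.vstack([c + rng.standard_normal((40, 2)) for c in centres])
noise = rng.standard_normal((120, 2))
X = np.hstack([clustered, noise])

f_clustered = silhouette_objective(X, [1, 1, 0, 0])  # clustered columns only
f_noise = silhouette_objective(X, [0, 0, 1, 1])      # noise columns only
print(f_clustered, f_noise)
```

Because $k$-means is refitted for every candidate $k$ and every candidate subset, the cost per fitness evaluation grows with the size of the $k$ range, which is the computational overhead noted in Section 2.1.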

### 4.2 Accuracy objective

Although the primary focus of this work is unsupervised feature selection, we include classification accuracy as a reference objective to provide a supervised performance baseline. For a given subset $x$, a Random Forest (RF) classifier is trained on the filtered dataset $X_{x}$ and used to predict class labels. The classification accuracy is defined as the proportion of correctly classified samples:

$$\mathrm{acc}(X_{x}) = \frac{\text{number of correct predictions}}{\text{total number of predictions}} \tag{4}$$

The accuracy takes values in $[0, 1]$, where higher values indicate better predictive performance. The negative of accuracy is then used as the evaluation objective, to be minimised.
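A minimal sketch of such an objective is shown below. The text does not specify how predictions are obtained during optimisation, so the internal hold-out split, the number of trees, the seed, and the handling of empty subsets are assumptions made here for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def accuracy_objective(X, y, x, seed=0):
    """Negative validation accuracy of a Random Forest on the selected columns."""
    X_x = X[:, np.asarray(x, dtype=bool)]
    if X_x.shape[1] == 0:
        return 0.0  # assumption: an empty subset is scored as zero accuracy
    # Assumption: accuracy is measured on a hold-out part of the training data.
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_x, y, test_size=0.3, stratify=y, random_state=seed
    )
    clf = RandomForestClassifier(n_estimators=50, random_state=seed).fit(X_tr, y_tr)
    return -clf.score(X_va, y_va)

# Toy three-class data (made up): two class-dependent columns, two noise columns.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
signal = np.vstack([c + rng.standard_normal((40, 2)) for c in centres])
y = np.repeat([0, 1, 2], 40)
X = np.hstack([signal, rng.standard_normal((120, 2))])

f_signal = accuracy_objective(X, y, [1, 1, 0, 0])
f_noise = accuracy_objective(X, y, [0, 0, 1, 1])
print(f_signal, f_noise)
```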

### 4.3 PCA loss objective

The Principal Component Analysis (PCA) loss objective evaluates the capacity of the filtered dataset $X_{x}$ to preserve the global variance and structural characteristics of the original dataset $X$. This objective is based on the hypothesis that a truly informative feature subset should act as a sufficient basis to linearly reconstruct the latent manifold of the full feature space.

First, PCA is applied to the full dataset $X \in \mathbb{R}^{n \times d}$ to get the top $k$ principal components and form a target projection matrix $Z \in \mathbb{R}^{n \times k}$. This step is performed once as a pre-processing task, ensuring that the target $Z$ remains constant throughout the optimisation process. Then, for a given candidate solution $x$, we train a multivariate linear regression model to predict the projection matrix from the filtered dataset $X_{x}$: $\hat{Z} = X_{x} W$, where $\hat{Z}$ is the projection matrix predicted from $X_{x}$ and $W$ contains the weights learned during training. The objective function then minimises the mean squared error (the loss) between $Z$ and $\hat{Z}$:

$$f_{\mathrm{pca\_loss}}(X_{x}) = \frac{1}{n} \sum_{i=1}^{n} \lVert Z_{i} - \hat{Z}_{i} \rVert^{2} \tag{5}$$

While this approach may appear to be a complicated proxy for ranking features by PCA loadings, the reconstruction loss explicitly addresses feature redundancy. Traditional PCA loading analysis identifies features with the highest individual contribution to variance but treats each feature independently. Consequently, loading-based ranking often places highly correlated features at the top, leading to the selection of redundant subsets. In contrast, the PCA reconstruction loss evaluates the collective information of the subset. If a candidate feature is highly correlated with features already present in $x$, its contribution towards reducing the reconstruction loss will be negligible, as its variance is already "explained" by the existing subset, even if its individual PCA loading is high.

The choice of PCA for defining the latent manifold $Z$ is motivated by the fact that principal components are linear combinations of the original features. Consequently, a multivariate linear regression model should be sufficient to recover this structure if the selected subset $X_{x}$ contains the necessary information. While non-linear dimensionality reduction techniques could help define $Z$, they would require a reconstruction model that accounts for non-linearity (e.g., neural networks), significantly increasing the computational overhead per fitness evaluation. Thus, within an evolutionary framework used for optimisation, the efficiency of linear reconstruction allows for a more careful exploration of the search space.

For this objective, a lower value indicates a better reconstruction of the latent space from the subset, with a minimal value of 0. The evaluation objective is then defined as $f_{\mathrm{pca\_loss}}(X_{x})$ directly, to be minimised.
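The two steps described above (a one-off PCA target, then a linear regression per candidate subset) could be sketched as follows, using Eq. (5) as the loss. The function names, the use of an intercept in the regression, the handling of empty subsets, and the toy data are assumptions for illustration. The paper's preprocessing (for example, whether features are standardised) is not specified here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def make_pca_target(X, k):
    """Project the full dataset onto its top-k principal components (done once)."""
    return PCA(n_components=k).fit_transform(X)

def pca_loss(X, x, Z):
    """Mean squared reconstruction error of Z from the selected columns, as in Eq. (5)."""
    X_x = X[:, np.asarray(x, dtype=bool)]
    if X_x.shape[1] == 0:
        # Assumption: with no features, predict the (zero) mean of the PCA scores.
        return float(np.mean(np.sum(Z ** 2, axis=1)))
    # Assumption: the regression includes an intercept, which absorbs the PCA centring.
    Z_hat = LinearRegression().fit(X_x, Z).predict(X_x)
    return float(np.mean(np.sum((Z - Z_hat) ** 2, axis=1)))

# Toy data (made up): three independent columns plus a linear copy of column 0.
rng = np.random.default_rng(0)
base = rng.standard_normal((100, 3))
X = np.hstack([base, 2.0 * base[:, :1]])
Z = make_pca_target(X, k=2)

loss_single = pca_loss(X, [1, 0, 0, 0], Z)     # column 0 only
loss_redundant = pca_loss(X, [1, 0, 0, 1], Z)  # column 0 plus its linear copy
loss_full = pca_loss(X, [1, 1, 1, 1], Z)       # all columns
print(loss_single, loss_redundant, loss_full)
```

In this sketch, adding the linear copy of an already selected column leaves the loss essentially unchanged, which mirrors the redundancy argument made above. Selecting all columns drives the loss to numerical zero because the principal components are linear combinations of the original features.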

### 4.4 Subset size regulariser objective

The regulariser objective $f_{2}$ controls the cardinality of the selected feature subset, defined as:

$$|x| = \sum_{i=1}^{d} x_{i} \tag{6}$$

To fit the minimisation framework, we use $f_{2}(x) = |x|$ when minimising subset size and $f_{2}(x) = -|x|$ when maximising subset size.

## 5 Setup

### 5.1 Synthetic Dataset Generation

To better understand the behaviour of feature selection methods under different objective formulations, we adopt a controlled experimental setup based on a synthetic dataset with a known feature taxonomy. A central challenge in feature selection is the absence of a ground truth for the optimal feature subset. While supervised accuracy on a held-out test set is often used as an indirect evaluation criterion, it is dependent on the available labels and does not uniquely define an optimal subset. To address this limitation, we construct a three-class classification problem composed of synthetic features with controlled properties and known types. The detailed generation procedures are provided in the supplementary material; they produce the following feature types:

![Refer to caption](https://arxiv.org/html/2605.21561v1/figures/dataset_heatmap.png)

Figure 1: Correlation heatmap of the synthetic dataset used in this study.

- Informative Features are sampled from clusters designed to separate the samples into classes, with added Gaussian noise to introduce variability. Features within each cluster are further linearly combined to induce covariance. They are generated using the `make_classification` function from scikit-learn\[[22](https://arxiv.org/html/2605.21561#bib.bib10)\] and result in a set that is informative but does not allow perfect separation of the samples due to noise and random label flips.
- Linear Redundant Features consist of linear combinations of informative features. Their purpose is to create redundancy and correlation in the dataset. This allows us to assess whether feature selection methods can avoid selecting redundant features together.
- Non-Linear Redundant Features introduce non-linear transformations of informative features. These features remain fully determined by the informative set but exhibit more complex relationships, testing the ability of methods to handle non-linear dependencies between features.
- Gaussian Noise Features are sampled independently from a standard normal distribution and are completely unrelated to the target. They serve as pure distractors, enabling the evaluation of a method's robustness to irrelevant features.
- Structured Noise Features exhibit some internal structure: samples can be separated into groups, but the groups are generated independently of the target labels and their number differs from the number of classes. Unlike Gaussian noise, they may appear informative due to their organisation, allowing us to test whether methods are misled by structure that is not predictive of the task.
- Sweep Features are constructed by progressively perturbing individual informative features with varying levels of noise. This creates a collection of features with controlled correlation to the original signal, allowing us to evaluate how feature selection methods behave under gradual signal degradation and whether they can correctly prioritise stronger signals over weaker ones.
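The feature types listed above can be illustrated with a small Python sketch. This is not the generation procedure from the supplementary material: the feature counts, weights, transformations, noise levels, and the `lineage` dictionary layout are all hypothetical choices made here to show one way such a taxonomy could be assembled. Only `make_classification` is taken from the text.

```python
import numpy as np
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)
n = 600

# Informative block: 5 features, 3 classes, 5% random label flips (hypothetical values).
informative, y = make_classification(
    n_samples=n, n_features=5, n_informative=5, n_redundant=0, n_repeated=0,
    n_classes=3, n_clusters_per_class=1, flip_y=0.05, shuffle=False, random_state=0,
)

# Linear redundant: a weighted sum of informative features 0, 2 and 4.
linear_red = informative[:, [0, 2, 4]] @ np.array([0.5, -1.0, 2.0])

# Non-linear redundant: fully determined by informative features 1 and 3.
nonlinear_red = np.sin(informative[:, 1]) * informative[:, 3] ** 2

# Gaussian noise: drawn independently of the labels.
gauss_noise = rng.standard_normal((n, 3))

# Structured noise: 4 groups assigned independently of the 3 class labels.
groups = rng.integers(0, 4, size=n)
struct_noise = 6.0 * groups + rng.standard_normal(n)

# Sweep: informative feature 0 perturbed with increasing noise levels.
sigmas = [0.5, 2.0, 8.0]
sweep = np.column_stack([informative[:, 0] + s * rng.standard_normal(n) for s in sigmas])

X = np.column_stack(
    [informative, linear_red, nonlinear_red, gauss_noise, struct_noise, sweep]
)

# Lineage: the informative columns each derived column was built from.
lineage = {5: [0, 2, 4], 6: [1, 3], 11: [0], 12: [0], 13: [0]}

# Correlation of each sweep column with its source feature.
sweep_corr = [abs(np.corrcoef(X[:, 0], X[:, j])[0, 1]) for j in (11, 12, 13)]
print(X.shape, sweep_corr)
```

Recording which informative columns feed each derived column is what makes it possible to inspect the composition of a selected subset afterwards, rather than only its objective values.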

A summary of the feature groups and their corresponding index ranges in the dataset generated for this study is provided in Table [3](https://arxiv.org/html/2605.21561#A2.T3) in the appendix. For features derived from informative variables, we record their lineage: the set of informative features used in their construction (for example, linear redundant feature 10 is a combination of informative features 1, 3, and 5), see Figure [1](https://arxiv.org/html/2605.21561#S5.F1). This enables analysis of selection behaviour, in particular explaining cases where the removal of an informative feature does not degrade supervised performance due to the presence of redundant features that preserve similar information content.

Although synthetic data generation approaches for feature selection have been proposed in the literature\[[16](https://arxiv.org/html/2605.21561#bib.bib41),[15](https://arxiv.org/html/2605.21561#bib.bib7)\], existing methods do not typically provide non-linear combinations and lineage tracking. These limitations motivate the design of the dataset used in this study. The resulting correlation heatmap of the dataset is shown in Figure [1](https://arxiv.org/html/2605.21561#S5.F1).

While this generation procedure does not define a unique optimal feature subset (since certain redundant transformations may be more compact or predictive than the original informative features), it provides an intuitive reference solution consisting solely of the informative features, which we refer to as the naive ground truth. Although the optimal choice among redundant features is not known, we assume that valid solutions should at least exclude purely noisy features, which, by construction, contain no information about the target.

### 5.2 Initialisation Strategies

In population\-based evolutionary algorithms, the initialisation strategy refers to the method used to sample the initial population\. Since these methods iteratively evolve a population, its initial location in the search space affects both convergence speed and the reachable regions\. In this study, we investigate three different strategies:

#### 5.2.1 Binary Random Sampling

A common approach for sampling the initial population for feature selection is independent binary random sampling, i.e. Bernoulli trials. For each feature, a number is drawn from a uniform distribution on $[0, 1]$ and compared against a threshold value $p$, corresponding to the probability that a feature is selected. If the sampled number is lower than $p$, the feature is set to 1 (selected); otherwise it is set to 0 (not selected). With this initialisation strategy, the expected proportion of selected features in a solution is $p$, and the resulting subset cardinalities are concentrated around this value across the initial population\[[24](https://arxiv.org/html/2605.21561#bib.bib26)\]. When subset cardinality is included as an objective, the distribution of cardinalities in the initial population has a large impact on the subsequent search dynamics. In particular, the interaction between this initial distribution and the direction of the subset-size regulariser (minimisation or maximisation) can bias exploration towards specific regions of the search space, potentially reducing coverage of other regions under a limited evaluation budget. For example, if the initial population is centred around a proportion of $p = 0.5$ selected features and the regulariser promotes larger subsets, regions corresponding to smaller subset sizes will be under-explored during the search.
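The thresholding rule described above can be sketched in a few lines of NumPy. The function name `bernoulli_init` and the population size, dimensionality, and seed are illustrative assumptions.

```python
import numpy as np

def bernoulli_init(pop_size, d, p, rng):
    """Sample a binary population: each bit is 1 when a uniform draw falls below p."""
    return (rng.random((pop_size, d)) < p).astype(int)

rng = np.random.default_rng(0)
population = bernoulli_init(pop_size=200, d=100, p=0.5, rng=rng)

# Proportion of selected features |x| / d for each individual.
proportions = population.sum(axis=1) / population.shape[1]
print(proportions.mean(), proportions.std())
```

The spread of `proportions` shrinks as the number of features grows, which is why the initial cardinalities end up concentrated around $p$.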

#### 5.2.2 Segmented Sampling

To mitigate this limitation, \[[24](https://arxiv.org/html/2605.21561#bib.bib26)\] proposed a segmented initialisation strategy designed to increase diversity along the cardinality axis. The population is divided into three equally sized sub-populations, each generated using a different selection probability (e.g., $p \in \{0.25, 0.5, 0.75\}$). The final initial population is obtained by concatenating these sub-populations. This results in a broader and more diverse distribution of subset cardinalities compared to standard binary random sampling.
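A minimal sketch of this segmented strategy, assuming the example probabilities given above, is shown below. The function name `segmented_init` and the population size are hypothetical.

```python
import numpy as np

def segmented_init(pop_size, d, probs, rng):
    """Concatenate sub-populations, each Bernoulli-sampled with its own probability."""
    # Split the population into (nearly) equally sized segments, one per probability.
    seg_sizes = [len(part) for part in np.array_split(np.arange(pop_size), len(probs))]
    segments = [(rng.random((m, d)) < p).astype(int) for m, p in zip(seg_sizes, probs)]
    return np.vstack(segments)

rng = np.random.default_rng(0)
population = segmented_init(pop_size=90, d=100, probs=(0.25, 0.5, 0.75), rng=rng)

# Proportion of selected features |x| / d for each individual.
proportions = population.mean(axis=1)
print(proportions[:30].mean(), proportions[30:60].mean(), proportions[60:].mean())
```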

#### 5.2.3 Fixed Cardinality Sampling

An alternative strategy used in the feature selection literature is fixed\-cardinality sampling\[[12](https://arxiv.org/html/2605.21561#bib.bib23)\]\. In this approach, all individuals in the initial population are constrained to have the same number of selected features\. While the subset size is fixed, diversity is introduced through variation in which features are selected\. This removes variability along the cardinality axis in the initial population and focuses exploration entirely on feature composition\. As a result, movement across different subset sizes depends entirely on the evolutionary operators and selection pressure induced by the optimisation process\. This can introduce a limitation when the optimal regions of the search space lie far from the initial cardinality level\. For example, if the population is initialised with very small subsets \(e\.g\., one selected feature\) but high\-quality solutions require large subsets, reaching these regions may require many generations, depending on the strength of the selection pressure and the ability of variation operators to increase cardinality\.
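Fixed-cardinality sampling can be sketched as drawing, for every individual, a random set of exactly `k` feature indices without replacement. The function name `fixed_cardinality_init` and the parameter values are illustrative assumptions.

```python
import numpy as np

def fixed_cardinality_init(pop_size, d, k, rng):
    """Sample a binary population in which every individual selects exactly k features."""
    population = np.zeros((pop_size, d), dtype=int)
    for row in population:
        # Choose k distinct feature indices for this individual.
        row[rng.choice(d, size=k, replace=False)] = 1
    return population

rng = np.random.default_rng(0)
population = fixed_cardinality_init(pop_size=50, d=100, k=1, rng=rng)
cardinalities = population.sum(axis=1)
print(cardinalities.min(), cardinalities.max())
```

With `k=1` every individual starts as a single-feature subset, so any movement towards larger subsets has to come from the variation operators and the selection pressure.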

In this work, we compare how these initialisation strategies influence search dynamics, Pareto front structure, and the regions of the search space explored under different problem formulations. This analysis aims to provide guidance on selecting an appropriate initialisation strategy based on the interaction between the evaluation objective and the subset-size regulariser.

Table 1: Overview of optimisation settings used in the experiments.

### 5.3 Experimental Setup

We evaluate multiobjective unsupervised feature selection across different combinations of evaluation objectives, subset-size regularisers, and initialisation strategies, while keeping optimisation parameters fixed. Each configuration corresponds to a unique combination of $f_{1}$, $f_{2}$ and initial population sampling strategy. For each configuration, an independent optimisation process is executed, and both intermediate and final populations are recorded. The optimisation parameters shared across all configurations are listed in Table [1](https://arxiv.org/html/2605.21561#S5.T1). Each evaluation objective introduces additional hyperparameter choices; the values used in the experiments are reported in Table [2](https://arxiv.org/html/2605.21561#A1.T2) in Appendix A. All configurations are evaluated on the same synthetic dataset described in Section 5.1. The optimisation process is performed exclusively on the training set, and all candidate solutions are evaluated using training data only. The final Pareto-optimal solutions are then assessed on the held-out test set using classification accuracy. For visual comparability across figures, subset size is reported in plots as the proportion of selected features, $|x|/d$, although the optimisation objective itself is defined in terms of the absolute cardinality $|x|$.
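The structure of the configuration grid can be sketched with the standard library alone. The string labels below are hypothetical names for the three evaluation objectives, two regulariser directions, and three initialisation strategies described in the text. If every combination were run, this would give 18 configurations, though the exact set of runs is defined by the paper's tables rather than by this sketch.

```python
from itertools import product

# Hypothetical labels for the design choices described in the text.
objectives = ["accuracy", "silhouette", "pca_loss"]
size_directions = ["min_size", "max_size"]
init_strategies = ["binary_random", "segmented", "fixed_cardinality"]

configurations = [
    {"f1": f1, "f2": f2, "init": init}
    for f1, f2, init in product(objectives, size_directions, init_strategies)
]

def size_objective(x, direction):
    """f2(x) = |x| when minimising subset size, -|x| when maximising it."""
    cardinality = sum(x)
    return cardinality if direction == "min_size" else -cardinality

def reported_size(x):
    """Proportion of selected features |x| / d, as used on plot axes."""
    return sum(x) / len(x)

x = [1, 0, 1, 1]
print(len(configurations), size_objective(x, "max_size"), reported_size(x))
```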

## 6 Results

### 6.1 Effects of regularisation and initialisation on the search

The standard approach in supervised MOFS involves the simultaneous maximisation of classification accuracy and the minimisation of subset cardinality. Since the starting point for feature selection is the hypothesis that there exists a more compact subset of features that preserves the information content of the dataset, it is intuitive to attempt to reduce the subset size in order to find optimal trade-offs between size and accuracy. As illustrated in Figure [2](https://arxiv.org/html/2605.21561#S6.F2), this formulation allows for well-distributed Pareto fronts and shows that the initialisation strategy has a strong effect on the explored search space and the quality of the trade-offs found.

Intuitively, since the subset\-size minimisation regulariser pushes the population toward smaller subsets, one might expect that sampling the initial population with relatively large or centrally distributed subset sizes would encourage exploration across the cardinality objective in search of a compact information\-preserving subset\. However, our results suggest otherwise\. Given the high computational cost of evaluating objectives like Random Forest accuracy, the optimiser may fail to traverse the vast combinatorial space between high and low cardinalities within the allocated budget\. Conversely, when the initial population is sampled with low cardinality \(e\.g\., k=1\), the multiobjective formulation ensures that, even in the presence of a “minimize size” regulariser, selection pressure from the accuracy objective drives the search toward larger subsets when they offer true gains in performance\.
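
The contrast between initialisation strategies can be sketched as follows. This is an illustrative sketch only: `sample_fixed_cardinality` and `sample_uniform_bits` are hypothetical helper names, the dimensionality `d=100` and population size are placeholder values, and independent uniform bits are used here merely as a stand-in for a sampler whose subset sizes concentrate around the centre of the cardinality range.

```python
import random


def sample_fixed_cardinality(pop_size, d, k, rng):
    """Each individual is a 0/1 mask with exactly k selected features."""
    population = []
    for _ in range(pop_size):
        chosen = set(rng.sample(range(d), k))
        population.append([1 if j in chosen else 0 for j in range(d)])
    return population


def sample_uniform_bits(pop_size, d, rng):
    """Each feature is selected independently with probability 0.5,
    so subset sizes concentrate around d/2."""
    return [[rng.randint(0, 1) for _ in range(d)] for _ in range(pop_size)]


rng = random.Random(0)  # fixed seed for reproducibility
pop_k1 = sample_fixed_cardinality(pop_size=20, d=100, k=1, rng=rng)
pop_uniform = sample_uniform_bits(pop_size=20, d=100, rng=rng)

# Every individual in the k=1 population selects exactly one feature.
print(sorted({sum(mask) for mask in pop_k1}))
# Mean subset size of the uniform population (close to d/2 on average).
print(sum(sum(mask) for mask in pop_uniform) / len(pop_uniform))
```

With the fixed-cardinality sampler the search starts at the low-cardinality end of the size axis, so any growth in subset size has to be justified by a gain in the evaluation objective.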

![Refer to caption](https://arxiv.org/html/2605.21561v1/figures/search_accuracy_min_size.png)

Figure 2: Search history, initial population location and found Pareto front for simultaneous minimisation of subset size and maximisation of classification accuracy under three different initial population sampling strategies. In a scenario where only a small portion of features are truly informative and non-redundant, our experiment showed better results when sampling the initial population with just one active feature.

Naturally, this effect is amplified by the fact that our synthetic dataset contains only a small proportion of truly informative features. As a result, initialising the population with single-feature subsets places the initial population closer to the informative set than the other sampling strategies do.

![Refer to caption](https://arxiv.org/html/2605.21561v1/figures/search_silhouette_min_size.png)

Figure 3: Search history, initial population location and found Pareto front for simultaneous minimisation of subset size and maximisation of silhouette score under three different initial population sampling strategies.

![Refer to caption](https://arxiv.org/html/2605.21561v1/figures/composition_silhouette_max_size.png) (a) Feature composition across the Pareto front, clustered into groups C1 to C4.
![Refer to caption](https://arxiv.org/html/2605.21561v1/figures/frontclust_silhouette_max_size.png) (b) Pareto front distribution by cluster.
![Refer to caption](https://arxiv.org/html/2605.21561v1/figures/boxclust_silhouette_max_size.png) (c) Accuracy distribution.

Figure 4: Analysis of Pareto-optimal solutions for maximising subset size while maximising silhouette score with the fixed-cardinality initialisation: (a) selected feature types per cluster; (b) mapping of clusters onto the found trade-offs; (c) distribution of classification performance for each group on a held-out test set.

Although intuitive and, as demonstrated, beneficial for maximising accuracy, minimising the subset size is not always the ideal formulation for every evaluation objective. As demonstrated in Figure [3](https://arxiv.org/html/2605.21561#S6.F3), pairing silhouette maximisation with a size-minimisation regulariser can lead the search towards trivial solutions consisting of uninformative subsets. The silhouette score measures intra-cluster cohesion and inter-cluster separation, but it is inherently biased toward lower subset sizes \[[20](https://arxiv.org/html/2605.21561#bib.bib60)\]. In high-dimensional spaces, the "curse of dimensionality" compresses distance metrics, whereas very small subsets can artificially inflate the silhouette score by isolating features that form tight, distinct clusters by chance. This dimensionality bias has been described previously \[[20](https://arxiv.org/html/2605.21561#bib.bib60)\], where it is noted that clustering-based objectives often favour small, uninformative subsets unless the cardinality bias is explicitly controlled. When both the evaluation and regulariser objectives favour low cardinality, the search collapses toward the origin of the search space. As seen in the "Fixed-cardinality ($k=1$)" panel of Figure [3](https://arxiv.org/html/2605.21561#S6.F3), the optimiser rapidly converges to trivial solutions. Despite achieving high silhouette scores, these subsets are far from the naive ground truth and fail to capture the informative structure of the synthetic taxonomy.
Thus, the silhouette score objective, and clustering quality metrics in general, should be paired with subset-size maximisation, however unintuitive this may seem.
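
The cardinality bias discussed above can be illustrated with a small toy computation. The sketch below implements the standard silhouette definition, $s(i) = (b(i) - a(i)) / \max(a(i), b(i))$, in plain Python and compares a one-feature subset with a two-feature subset. It assumes fixed cluster labels and Euclidean distances, which may differ from the clustering procedure used in the experiments, and the four data points are invented for illustration.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette s(i) = (b - a) / max(a, b) with Euclidean distance."""
    def dist(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    scores = []
    for i, p in enumerate(points):
        # Collect distances from point i to every other point, per cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.get(labels[i])
        if not own:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean intra-cluster distance
        b = min(sum(v) / len(v)  # mean distance to the nearest other cluster
                for c, v in by_cluster.items() if c != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def select(rows, columns):
    """Project each row onto the chosen feature columns."""
    return [tuple(row[c] for c in columns) for row in rows]


# Invented toy data: column 0 separates the two clusters,
# column 1 varies independently of the cluster labels.
rows = [(0.0, 0.0), (1.0, 8.0), (10.0, 0.0), (11.0, 8.0)]
labels = [0, 0, 1, 1]

s_one = silhouette_score(select(rows, [0]), labels)
s_two = silhouette_score(select(rows, [0, 1]), labels)
print(s_one > s_two)  # the smaller subset scores higher on this toy data
```

On this toy example, adding a feature that is unrelated to the cluster structure lowers the score, so an optimiser that maximises silhouette already has an incentive to drop features before any size regulariser is applied.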

![Refer to caption](https://arxiv.org/html/2605.21561v1/figures/composition_pca_loss_min_size.png) (a) Feature composition across the Pareto front, clustered into groups C1 to C4.
![Refer to caption](https://arxiv.org/html/2605.21561v1/figures/frontclust_pca_loss_min_size.png) (b) Pareto front distribution by cluster.
![Refer to caption](https://arxiv.org/html/2605.21561v1/figures/boxclust_pca_loss_min_size.png) (c) Accuracy distribution.

Figure 5: Analysis of Pareto-optimal solutions for minimising subset size while minimising PCA loss with the fixed-cardinality initialisation: (a) selected feature types per cluster; (b) mapping of clusters onto the found trade-offs; (c) distribution of classification performance for each group on a held-out test set.

### 6.2 Composition of Pareto-optimal solutions

We investigate the feature composition of the solutions on the Pareto front found when maximising the silhouette score paired with the appropriate regularisation objective: subset-size maximisation. Figure [4(a)](https://arxiv.org/html/2605.21561#S6.F4.sf1) illustrates the selected features across the front. To compare different groups of solutions, we used dendrogram clustering (visible on the left) based on the selected features to split the solutions into groups C1 to C4. The results show that this objective formulation, although correctly addressing the cardinality bias, leads to the selection of redundant and even some irrelevant features (e.g., 23, 26, 29). Feature 1, for example, is selected multiple times in almost all found solutions, inducing redundancy. Figure [4(b)](https://arxiv.org/html/2605.21561#S6.F4.sf2) shows that although groups C1 and C2 selected more features than are present in the naive ground truth, some of the information was lost in the selection, leading to the lower classification accuracies seen in Figure [4(b)](https://arxiv.org/html/2605.21561#S6.F4.sf2). Furthermore, we observe in Figure [4(b)](https://arxiv.org/html/2605.21561#S6.F4.sf2) that our naive ground truth is clearly dominated by the found front, but none of the dominating solutions achieves a higher classification accuracy (Figure [4(c)](https://arxiv.org/html/2605.21561#S6.F4.sf3)), suggesting that this formulation of the silhouette score evaluation objective is not a good proxy for classification accuracy on the considered synthetic dataset.
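
The dominance relation used in this comparison can be sketched as follows. Both objectives are maximised in this formulation, so the sketch negates them and applies the usual minimisation form of Pareto dominance. All numeric values are invented for illustration and do not correspond to the reported results; `dominates` and `non_dominated` are hypothetical helper names.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimised):
    a is no worse in every objective and strictly better in at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]


# Invented (silhouette, subset size) pairs, both to be maximised,
# stored in negated form so that smaller is better.
front = [(-0.60, -12), (-0.55, -15)]
ground_truth = (-0.50, -10)

print(any(dominates(p, ground_truth) for p in front))  # True
print(non_dominated(front + [ground_truth]))           # ground truth is filtered out
```

Dominance in the optimised objectives says nothing about a third, unoptimised quantity such as test accuracy, which is why a dominated reference solution can still classify better than the solutions that dominate it.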

### 6.3 PCA loss as an unsupervised objective

The solutions obtained on the Pareto front using the proposed PCA loss objective, combined with subset-size minimisation, are illustrated in Figure [5(a)](https://arxiv.org/html/2605.21561#S6.F5.sf1). As in previous analyses, the solutions are grouped into four clusters according to the selected feature subsets. Inspection of clusters C1 and C2 indicates that a substantial proportion of the informative features has been successfully retained. Although Feature 1, despite being informative, is not explicitly selected, its derived counterpart (Feature 80) is included, so its information content is preserved. Some degree of redundancy persists: for instance, information associated with Feature 6 appears multiple times across certain solutions. As shown in Figure [5(b)](https://arxiv.org/html/2605.21561#S6.F5.sf2), the Pareto front exhibits a clear trade-off between subset size and latent space reconstruction loss. Notably, the minimal ground truth solution lies in close proximity to several of the identified solutions, indicating that the method effectively approximates near-optimal feature subsets. The performance distribution of the solutions, presented in Figure [5(c)](https://arxiv.org/html/2605.21561#S6.F5.sf3), further supports these observations. Clusters C1, C2, and C3 achieve performance levels that are close to the minimal ground truth, whereas cluster C4 is characterised by overly sparse subsets with diminished performance. Overall, the proposed unsupervised objective shows strong potential to preserve accuracy while selecting compact feature subsets.

![Refer to caption](https://arxiv.org/html/2605.21561v1/figures/fronts_comparison.png)

Figure 6: Pareto fronts of feature subsets under different MOFS formulations, evaluated on the test set (accuracy vs. subset size).

### 6.4 Multiobjective feature selection problem formulations

We compared six multiobjective feature selection formulations by combining three evaluation objectives (accuracy, silhouette score, and PCA loss) with two regularisation directions (subset-size minimisation and maximisation). The resulting Pareto fronts were evaluated on a held-out test set in terms of the size–accuracy trade-off (see Figure [6](https://arxiv.org/html/2605.21561#S6.F6)). Optimising directly for accuracy produced the strongest-performing subsets. The PCA loss objective, although unsupervised, yielded competitive solutions close to the supervised formulation, whereas silhouette-based formulations consistently underperformed, even with appropriate regularisation. Additional figures for the remaining formulations and sampling strategies are available in the supplementary Zenodo archive \[[4](https://arxiv.org/html/2605.21561#bib.bib1)\].

## 7 Conclusion

In this work, we analysed multiobjective feature selection across different combinations of evaluation objectives, subset\-size regularisation strategies, and initialisation schemes on a controlled synthetic dataset\. Our results show that problem formulation plays a decisive role in the quality of the selected subsets\. In particular, the interaction between the evaluation objective and the regularisation direction strongly shapes the search behaviour and resulting Pareto front\. Clustering\-based objectives such as silhouette score can induce undesirable biases toward trivial solutions, whereas reconstruction\-based objectives provide a more stable signal\. Notably, PCA loss emerges as a promising unsupervised alternative, achieving a favourable balance between subset compactness and predictive performance\. Future work should verify these findings on real\-world datasets, explore a broader range of unsupervised objectives, and investigate whether more complex latent representations with non\-linear reconstruction models can further improve performance\.

## Acknowledgements

This work was supported by the ARISE\-NMD project and funded by the Dutch Research Council \(NWO\) under the Open Technology Programme \(project number 20852\)\.

## Appendix A

### A.1 Objective-specific parameters

Table 2: Values of objective-specific parameters.

## Appendix B Synthetic feature ranges

Table 3: Feature groups and corresponding index ranges in the synthetic dataset (with $n=1000$ samples).

## References

- \[1\] J. C. Ang, A. Mirzal, H. Haron, and H. N. A. Hamed (2016-09). Supervised, Unsupervised, and Semi-Supervised Feature Selection: A Review on Gene Selection. IEEE/ACM Transactions on Computational Biology and Bioinformatics 13(5), pp. 971–989. ISSN 1557-9964. [Link](https://ieeexplore.ieee.org/document/7264992), [Document](https://dx.doi.org/10.1109/TCBB.2015.2478454).
- \[2\] T. H. W. Bäck, A. V. Kononova, B. Van Stein, H. Wang, K. A. Antonov, R. T. Kalkreuth, J. De Nobel, D. Vermetten, R. De Winter, and F. Ye (2023-06). Evolutionary Algorithms for Parameter Optimization—Thirty Years Later. Evolutionary Computation 31(2), pp. 81–122. ISSN 1530-9304. [Link](https://direct.mit.edu/evco/article/31/2/81/115462/Evolutionary-Algorithms-for-Parameter-Optimization), [Document](https://dx.doi.org/10.1162/evco%5Fa%5F00325).
- \[3\] J. Blank and K. Deb (2020). Pymoo: Multi-objective Optimization in Python. IEEE Access 8, pp. 89497–89509. arXiv:2002.04504 \[cs\]. ISSN 2169-3536. [Link](http://arxiv.org/abs/2002.04504), [Document](https://dx.doi.org/10.1109/ACCESS.2020.2990567).
- \[4\] M. Cherpitel, T. Bäck, M. R. Tannemaat, and A. V. Kononova (2026). Supplementary material for “objective-induced bias and search dynamics in multiobjective unsupervised feature selection”. Zenodo. [Document](https://dx.doi.org/10.5281/zenodo.19642829).
- \[5\] D. Davies and D. Bouldin (1979-05). A Cluster Separation Measure. Pattern Analysis and Machine Intelligence, IEEE Transactions on PAMI-1, pp. 224–227. [Document](https://dx.doi.org/10.1109/TPAMI.1979.4766909).
- \[6\] R. Dwivedi, A. Tiwari, N. Bharill, M. Ratnaparkhe, and A. K. Tiwari (2024-11). A taxonomy of unsupervised feature selection methods including their pros, cons, and challenges. The Journal of Supercomputing 80(16), pp. 24212–24240. ISSN 1573-0484. [Link](https://doi.org/10.1007/s11227-024-06368-3), [Document](https://dx.doi.org/10.1007/s11227-024-06368-3).
- \[7\] J. G. Dy. Feature Selection for Unsupervised Learning. [Document](https://dx.doi.org/10.1007/978-1-4419-1428-6%5F97).
- \[8\] C. Emmanouilidis, A. Hunter, and J. Macintyre (2000-02). A multiobjective evolutionary setting for feature selection and a commonality-based crossover operator. Vol. 1. Note: Proceedings of the 2000 Congress on Evolutionary Computation; pages: 316 vol.1. ISBN 978-0-7803-6375-5. [Document](https://dx.doi.org/10.1109/CEC.2000.870311).
- \[9\] A. J. Ferreira and M. A. T. Figueiredo (2012-09). An unsupervised approach to feature discretization and selection. Pattern Recognition 45(9), pp. 3048–3060. ISSN 0031-3203. [Link](https://www.sciencedirect.com/science/article/pii/S0031320311005097), [Document](https://dx.doi.org/10.1016/j.patcog.2011.12.008).
- \[10\] L. Haar, K. Anding, K. Trambitckii, and G. Notni (2019). Comparison between Supervised and Unsupervised Feature Selection Methods. In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods, Prague, Czech Republic, pp. 582–589. ISBN 978-989-758-351-3. [Link](http://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0007385305820589), [Document](https://dx.doi.org/10.5220/0007385305820589).
- \[11\] E. Hancer, B. Xue, M. Zhang, D. Karaboga, and B. Akay (2018-01). Pareto front feature selection based on artificial bee colony optimization. Information Sciences 422, pp. 462–479. ISSN 0020-0255. [Link](https://www.sciencedirect.com/science/article/pii/S0020025516312609), [Document](https://dx.doi.org/10.1016/j.ins.2017.09.028).
- \[12\] J. Handl and J. Knowles (2006). Feature Subset Selection in Unsupervised Learning via Multiobjective Optimization. International Journal of Computational Intelligence Research 2(3). ISSN 09741259. [Document](https://dx.doi.org/10.5019/j.ijcir.2006.64).
- \[13\] X. He, D. Cai, and P. Niyogi (2005). Laplacian Score for Feature Selection. In Advances in Neural Information Processing Systems, Vol. 18. [Link](https://papers.nips.cc/paper_files/paper/2005/hash/b5b03f06271f8917685d14cea7c6c50a-Abstract.html), [Document](https://dx.doi.org/10.5555/2976248.2976312).
- \[14\] R. Jiao, B. H. Nguyen, B. Xue, and M. Zhang (2024-08). A Survey on Evolutionary Multiobjective Feature Selection in Classification: Approaches, Applications, and Challenges. IEEE Transactions on Evolutionary Computation 28(4), pp. 1156–1176. ISSN 1941-0026. [Link](https://ieeexplore.ieee.org/document/10173647), [Document](https://dx.doi.org/10.1109/TEVC.2023.3292527).
- \[15\] F. Kamalov, H. Sulieman, and A. K. Cherukuri (2022-11). Synthetic Data for Feature Selection. arXiv. arXiv:2211.03035 \[cs\]. [Link](http://arxiv.org/abs/2211.03035), [Document](https://dx.doi.org/10.48550/arXiv.2211.03035).
- \[16\] Y. Kim, N. Street, and F. Menczer (2002-12). Evolutionary model selection in unsupervised learning. Intell. Data Anal. 6, pp. 531–556. [Document](https://dx.doi.org/10.3233/IDA-2002-6605).
- \[17\] J. D. Knowles, R. A. Watson, and D. W. Corne (2001). Reducing Local Optima in Single-Objective Problems by Multi-objectivization. In Evolutionary Multi-Criterion Optimization, E. Zitzler, L. Thiele, K. Deb, C. A. Coello Coello, and D. Corne (Eds.), Berlin, Heidelberg, pp. 269–283. ISBN 978-3-540-44719-1. [Document](https://dx.doi.org/10.1007/3-540-44719-9%5F19).
- \[18\] N. Kozodoi, S. Lessmann, K. Papakonstantinou, Y. Gatsoulis, and B. Baesens (2019-05). A multi-objective approach for profit-driven feature selection in credit scoring. Decision Support Systems 120, pp. 106–117. ISSN 0167-9236. [Link](https://www.sciencedirect.com/science/article/pii/S0167923619300570), [Document](https://dx.doi.org/10.1016/j.dss.2019.03.011).
- \[19\] X. Ma, Z. Huang, X. Li, Y. Qi, L. Wang, and Z. Zhu (2023-06). Multiobjectivization of Single-Objective Optimization in Evolutionary Computation: A Survey. IEEE Transactions on Cybernetics 53(6), pp. 3702–3715. ISSN 2168-2275. [Link](https://ieeexplore.ieee.org/document/9660767/), [Document](https://dx.doi.org/10.1109/TCYB.2021.3120788).
- \[20\] I. Mierswa and M. Wurst (2006). Information preserving multi-objective feature selection for unsupervised learning. In Proceedings of the 8th annual conference on Genetic and evolutionary computation, GECCO ’06, New York, NY, USA, pp. 1545–1552. ISBN 978-1-59593-186-3. [Link](https://dl.acm.org/doi/10.1145/1143997.1144248), [Document](https://dx.doi.org/10.1145/1143997.1144248).
- \[21\] P. Mitra, C. Murthy, and S. Pal (2002-04). Unsupervised feature selection using feature similarity. Pattern Analysis and Machine Intelligence, IEEE Transactions on 24, pp. 301–312. [Document](https://dx.doi.org/10.1109/34.990133).
- \[22\] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, A. Müller, J. Nothman, G. Louppe, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay (2018-06). Scikit-learn: Machine Learning in Python. arXiv. arXiv:1201.0490 \[cs\]. [Link](http://arxiv.org/abs/1201.0490), [Document](https://dx.doi.org/10.48550/arXiv.1201.0490).
- \[23\] P. Rousseeuw (1987-11). Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics 20, pp. 53–65. [Document](https://dx.doi.org/10.1016/0377-0427%2887%2990125-7).
- \[24\] H. Xu, B. Xue, and M. Zhang (2020). Segmented initialization and offspring modification in evolutionary algorithms for bi-objective feature selection. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference, GECCO ’20, New York, NY, USA, pp. 444–452. ISBN 978-1-4503-7128-5. [Link](https://dl.acm.org/doi/10.1145/3377930.3390192), [Document](https://dx.doi.org/10.1145/3377930.3390192).
- \[25\] B. Xue, M. Zhang, and W. N. Browne (2013-12). Particle Swarm Optimization for Feature Selection in Classification: A Multi-Objective Approach. IEEE Transactions on Cybernetics 43(6), pp. 1656–1671. ISSN 2168-2275. [Link](https://ieeexplore.ieee.org/document/6381531/), [Document](https://dx.doi.org/10.1109/TSMCB.2012.2227469).
