MADS: Model-Aware Diverse Core Set Selection for Instruction Tuning
Summary
This paper proposes MADS, a method that leverages neural activation states from LLMs to select diverse core sets for instruction tuning, showing that a 15% subset can outperform full-dataset fine-tuning on multiple benchmarks.
# MADS: Model-Aware Diverse Core Set Selection for Instruction Tuning
Source: [https://arxiv.org/html/2605.30857](https://arxiv.org/html/2605.30857)
Yi Bai¹, Wenhao Zhang¹, Yao Chen², Jiao Xue², Zhumin Chen¹, Pengjie Ren¹

¹Shandong University, Qingdao, China; ²Inspurcloud, Jinan, China

{202235147, zhangwenhao}@mail.sdu.edu.cn, {chenyao, xuejiao02}@inspur.com, {chenzhumin, renpengjie}@sdu.edu.cn
###### Abstract
Instruction fine\-tuning is employed to enhance the instruction\-following ability of large language models \(LLMs\)\. As the amount of instruction fine\-tuning data increases, selecting the optimal core set becomes particularly important\. However, ensuring the diversity of the core set remains a significant challenge\. Existing methods predominantly distinguish different training data based on the text features themselves, decoupled from LLMs’ own understanding and representation of the data\. To address this issue, we propose a Model\-Aware Diverse Core Set Selection method, which distinguishes data features based on the neural activation states during LLM inference\. This approach serves as an efficient instantiation of coverage\-based selection using model\-intrinsic activation features to ensure the diversity in the core set\. We extensively evaluate our method on six benchmarks that cover five distinct tasks\. In our method, the core set selected by the 3B\-parameter LLM performs effectively when utilized to fine\-tune larger models with 7B, 8B, and 13B parameters\. Experimental results on the Alpaca\-GPT4 dataset, which comprises 52K instruction–response pairs, show that the core set, sized at 15% of the original dataset and selected by Llama\-3\.2\-3B\-Instruct, achieves an average improvement of 2\.5% when fine\-tuning four larger base models compared with training on the full dataset\. The experimental results demonstrate that our method enhances model performance on multiple downstream tasks while reducing data requirements\.
## 1 Introduction
With the rapid advancement of artificial intelligence technology, large language models \(LLMs\) such as GPT\-4\(Achiamet al\.,[2023](https://arxiv.org/html/2605.30857#bib.bib1)\), Mistral\(Jianget al\.,[2023](https://arxiv.org/html/2605.30857#bib.bib2)\), Llama 3\(Dubeyet al\.,[2024](https://arxiv.org/html/2605.30857#bib.bib3)\), and Qwen\(Baiet al\.,[2023](https://arxiv.org/html/2605.30857#bib.bib4)\)have demonstrated outstanding performance in various tasks through large\-scale training data and powerful computing capabilities\. During the pre\-training phase, LLMs are trained on large corpora to acquire general language knowledge and logical reasoning skills\. The fine\-tuning phase aims to enhance the model’s ability to follow instructions and align with human preferences\(Sanhet al\.,[2022](https://arxiv.org/html/2605.30857#bib.bib7); Ouyanget al\.,[2022](https://arxiv.org/html/2605.30857#bib.bib5)\)\. Therefore, carefully curated fine\-tuning data is crucial for optimizing model performance\.
Early approaches to obtaining high\-quality instruction data involved collecting instruction\-response pairs through crowdsourcing or leveraging powerful LLMs to generate instruction datasets\(Sanhet al\.,[2022](https://arxiv.org/html/2605.30857#bib.bib7); Taoriet al\.,[2023](https://arxiv.org/html/2605.30857#bib.bib6); Wanget al\.,[2023](https://arxiv.org/html/2605.30857#bib.bib8)\)\. However, as the volume of available training data increases, using all of it for fine\-tuning becomes impractical\. Moreover, some studies indicate that increasing the amount of instruction data does not always enhance model performance\(Shiet al\.,[2024](https://arxiv.org/html/2605.30857#bib.bib9); Wuet al\.,[2024](https://arxiv.org/html/2605.30857#bib.bib10)\)\.Zhouet al\.\([2024](https://arxiv.org/html/2605.30857#bib.bib11)\)found that a small but well\-chosen instruction dataset can outperform a larger one\. Some studies have attempted to select data based on instance\-level quality\(Caoet al\.,[2024](https://arxiv.org/html/2605.30857#bib.bib25); Panget al\.,[2024](https://arxiv.org/html/2605.30857#bib.bib27); Liet al\.,[2024a](https://arxiv.org/html/2605.30857#bib.bib14); Zhanget al\.,[2025](https://arxiv.org/html/2605.30857#bib.bib54)\)\. In contrast, our approach focuses on dataset\-level features, such as diversity and coverage, which have been shown to play a more significant role in data selection than individual data quality\(Xiaet al\.,[2024b](https://arxiv.org/html/2605.30857#bib.bib60)\)\. Methods based on diversity and coverage select new data points that are most distinct from the already selected ones\. 
Subsets selected by these methods often achieve performance superior or comparable to that achieved using the full dataset during fine\-tuning\(Chenet al\.,[2023](https://arxiv.org/html/2605.30857#bib.bib21); Luet al\.,[2024](https://arxiv.org/html/2605.30857#bib.bib17); Das and Khetan,[2024](https://arxiv.org/html/2605.30857#bib.bib22); Shaoet al\.,[2024](https://arxiv.org/html/2605.30857#bib.bib19)\)\.
However, measuring and ensuring the diversity and coverage of data remains a significant challenge. We categorize existing methods for selecting instruction data into two types: data-aware methods and model-aware methods. (1) Data-aware methods either use pre-trained language models like BERT (Devlin et al., [2019](https://arxiv.org/html/2605.30857#bib.bib20)) to extract data representations and enforce a uniform data distribution through k-means clustering, or directly assign category labels using powerful LLMs (Chen et al., [2023](https://arxiv.org/html/2605.30857#bib.bib21); Lu et al., [2024](https://arxiv.org/html/2605.30857#bib.bib17); Das and Khetan, [2024](https://arxiv.org/html/2605.30857#bib.bib22); Shao et al., [2024](https://arxiv.org/html/2605.30857#bib.bib19)). When selecting diverse data, these methods do not fully utilize the internal representations of LLMs to guide the data selection. (2) Model-aware methods use the LLM to be fine-tuned, or one that has already been fine-tuned, to assess the necessity of each data instance, selecting data beneficial to the model (Li et al., [2024b](https://arxiv.org/html/2605.30857#bib.bib50); Liu et al., [2024a](https://arxiv.org/html/2605.30857#bib.bib18); Li et al., [2024a](https://arxiv.org/html/2605.30857#bib.bib14); Zhang et al., [2025](https://arxiv.org/html/2605.30857#bib.bib54); Hu et al., [2025](https://arxiv.org/html/2605.30857#bib.bib58); Ranaldi and Freitas, [2024a](https://arxiv.org/html/2605.30857#bib.bib75), [b](https://arxiv.org/html/2605.30857#bib.bib76)). These methods tend to select data that is challenging or learnable for the current model rather than data that is diverse.
Motivated by the aforementioned challenges, we propose a novel method, Model-Aware Diverse Core Set Selection for Instruction Tuning (MADS; code available at [https://anonymous.4open.science/r/MADS-5711/](https://anonymous.4open.science/r/MADS-5711/)), which leverages the internal representations of LLMs to select a core set that is both diverse and highly representative. The core idea of MADS is to use the neuron activation states generated by LLMs during inference as representations of instruction data to select a diverse data subset. Previous research has shown that LLM neurons exhibit different activation states for different input data features (Elhage et al., [2022](https://arxiv.org/html/2605.30857#bib.bib40); Bricken et al., [2023](https://arxiv.org/html/2605.30857#bib.bib39); Bills et al., [2023](https://arxiv.org/html/2605.30857#bib.bib51); Cunningham et al., [2023](https://arxiv.org/html/2605.30857#bib.bib74); Luo et al., [2025](https://arxiv.org/html/2605.30857#bib.bib61); Helff et al., [2025](https://arxiv.org/html/2605.30857#bib.bib62); Shafran et al., [2025](https://arxiv.org/html/2605.30857#bib.bib63)).
A potential concern is whether activation\-based representations can reliably capture semantic features, given that individual neurons in LLMs are known to be polysemantic—a single neuron may respond to multiple unrelated concepts\(Elhageet al\.,[2022](https://arxiv.org/html/2605.30857#bib.bib40)\)\. However, recent interpretability research reveals that neural networks encode independent features through*linear combinations of neurons*rather than individual neurons\(Brickenet al\.,[2023](https://arxiv.org/html/2605.30857#bib.bib39)\)\. This insight motivates our approach: instead of tracking which individual neurons are activated, we record the*set of neurons*that are strongly co\-activated by each instruction as its activation tag\. This group\-level representation naturally captures the compositional feature structure of LLMs and mitigates the polysemanticity issue\.
To empirically validate the correlation between activation tags and semantic features, we conduct a preliminary experiment using 1,000 randomly selected instructions from each of five domains\. Through PCA visualization and pairwise similarity analysis of activation tags across multiple layers of Llama\-3\.2\-3B\-Instruct, we find that same\-domain instructions share significantly more activation tags than cross\-domain pairs, confirming that activation tags capture domain\-specific semantic features\. The full empirical analysis, including visualization figures and detailed domain discussion, is presented in Section[4\.4](https://arxiv.org/html/2605.30857#S4.SS4)\. Inspired by the correlation between neuron activation patterns and data features, we are the first to use neuron activation states of LLMs as data representations for diverse instruction data selection\. Specifically, MADS calculates all neuron activations in the original dataset and filters a subset covering all activation patterns as the core set\. During core set selection, we prioritize complex instructions that activate a richer set of neurons, as complex instructions more effectively enhance the comprehension and reasoning capabilities of LLMs\(Luet al\.,[2024](https://arxiv.org/html/2605.30857#bib.bib17)\)\. Compared with existing methods, MADS extracts model\-level data features, ensuring the selected data subset has superior diversity, coverage, and complexity\. Moreover, neuron activation can be extracted in a single inference, eliminating the need for additional training and reducing computational and time costs\.
We conduct extensive experiments on instruction\-following benchmarks, demonstrating that fine\-tuning LLMs with data selected by MADS achieves superior instruction\-following performance compared to existing methods\. We also perform further analysis to verify the coverage and robustness of our method\. Our contributions can be summarized as follows:
- We introduce a novel model-aware diverse instruction data selection method, utilizing neuron activation states of LLMs for the first time to achieve diverse and complex instruction data selection.
- Our method extracts data representations in a single inference step, without the need for manual data category definition or gradient calculation, improving the efficiency of instruction data selection.
- Our experiments with Alpaca show that our method improves LLMs across various tasks using just 15% of the data, outperforming other methods with more significant and balanced improvements.
## 2 Related Work
### 2.1 Instruction Data Selection
Data\-aware methods\.Data\-aware methods focus on the quality, diversity, and importance of the instruction during data selection\(Qinet al\.,[2024](https://arxiv.org/html/2605.30857#bib.bib30)\)\. To ensure the quality of the instruction,Caoet al\.\([2024](https://arxiv.org/html/2605.30857#bib.bib25)\)designed a metric system to evaluate the text quality such as lexical diversity and dialogue coherence\. Moreover,Xuet al\.\([2023](https://arxiv.org/html/2605.30857#bib.bib26)\); Liuet al\.\([2024b](https://arxiv.org/html/2605.30857#bib.bib29)\); Panget al\.\([2024](https://arxiv.org/html/2605.30857#bib.bib27)\)leverage powerful LLMs like GPT\-4 to measure data quality based on various aspects such as instruction complexity and response accuracy\. In terms of diversity, the most common approach involves utilizing pre\-trained language models like BERT to embed data into a high\-dimensional space, followed by clustering with methods such as k\-means or k\-center to select a subset with a uniform distribution\(Chenet al\.,[2023](https://arxiv.org/html/2605.30857#bib.bib21); Das and Khetan,[2024](https://arxiv.org/html/2605.30857#bib.bib22); Shaoet al\.,[2024](https://arxiv.org/html/2605.30857#bib.bib19)\)\.Luet al\.\([2024](https://arxiv.org/html/2605.30857#bib.bib17)\)utilizes GPT\-4 to classify instructions, thereby selecting a data subset that covers multiple categories\. Importance is also considered a criterion, referring to the difficulty level of an instruction\-response pair for LLMs\. To identify hard instructions,Duet al\.\([2023](https://arxiv.org/html/2605.30857#bib.bib15)\)employs a reward model to assess whether LLMs could generate correct responses to the given instructions\.Songet al\.\([2024](https://arxiv.org/html/2605.30857#bib.bib24)\)trains BERT as a classifier to distinguish between easy and hard instructions\. 
These methods typically rely on additional models to classify data, making the data selection process independent of the internal representations of LLMs\.
Model\-aware methods\.Model\-aware methods typically treat the data to be selected as input and use probability distributions, loss, gradients, or other model\-related metrics generated by LLMs for data selection\(Zhanget al\.,[2025](https://arxiv.org/html/2605.30857#bib.bib54); Daiet al\.,[2025](https://arxiv.org/html/2605.30857#bib.bib55); Zhaoet al\.,[2025](https://arxiv.org/html/2605.30857#bib.bib56); Zhouet al\.,[2025](https://arxiv.org/html/2605.30857#bib.bib57)\)\. For instance,Liet al\.\([2024a](https://arxiv.org/html/2605.30857#bib.bib14)\)compares the loss generated by LLMs with and without instruction context to estimate the difficulty of the instruction\.Huet al\.\([2025](https://arxiv.org/html/2605.30857#bib.bib58)\)developed two model\-parameter\-based metrics to filter out noisy, unlearnable, and generalization\-impairing samples\. Similarly,Liuet al\.\([2024a](https://arxiv.org/html/2605.30857#bib.bib18)\)uses different grain uncertainties of LLMs to improve the accuracy of the data\. In addition to these methods that only require LLMs to perform inference, there are approaches that use gradients from backpropagation as data selection criteria\(San Joaquinet al\.,[2024](https://arxiv.org/html/2605.30857#bib.bib32); Panet al\.,[2024](https://arxiv.org/html/2605.30857#bib.bib34)\)\.Yanget al\.\([2024](https://arxiv.org/html/2605.30857#bib.bib33)\)leverages training trajectories to select mathematical data\. These model\-aware methods tend to select data that is more difficult for LLMs rather than more diverse data\.
### 2.2 Core Set Selection
The goal of core set selection is to select a subset from all training data such that the model trained on this subset achieves performance similar to that of the model trained on the full dataset\. Core set selection has deep roots in classical machine learning\. Early theoretical work established foundational algorithms based on geometric methods such as k\-median and k\-means clustering\(Har\-Peled and Kushal,[2005](https://arxiv.org/html/2605.30857#bib.bib35)\), while subsequent research extended these ideas to logistic regression\(Munteanuet al\.,[2018](https://arxiv.org/html/2605.30857#bib.bib38)\), gradient\-based selection\(Mirzasoleimanet al\.,[2020](https://arxiv.org/html/2605.30857#bib.bib37)\), and deep learning scenarios\(Paulet al\.,[2021](https://arxiv.org/html/2605.30857#bib.bib36)\)\. Closely related to core set selection is Active Learning, which iteratively selects the most informative samples for labeling\(Settles,[2009](https://arxiv.org/html/2605.30857#bib.bib64); Sener and Savarese,[2018](https://arxiv.org/html/2605.30857#bib.bib65)\)\. Methods such as uncertainty sampling and diversity\-based selection from Active Learning share similar goals with core set selection in reducing data requirements while maintaining model performance\(Ashet al\.,[2019](https://arxiv.org/html/2605.30857#bib.bib66); Margatinaet al\.,[2021](https://arxiv.org/html/2605.30857#bib.bib67)\)\. However, existing methods either are computationally intensive\(Xiaet al\.,[2024a](https://arxiv.org/html/2605.30857#bib.bib68)\), making them infeasible for large\-scale LLM data selection, or rely on predefined concepts to categorize data for diversity\. In contrast, our method leverages the LLM’s own activation states to efficiently distinguish data types and ensure diversity\.
## 3 Methodology
### 3.1 Problem Formulation
Given an initial full training dataset, $D_{full}$, which contains $n$ instruction-response pairs, our task is to select a subset $D_c$ from $D_{full}$ such that $|D_c| \ll |D_{full}|$.

The objective of our method is to leverage the neuron activation states of LLMs to select a core subset $D_c$ that can maximally cover the features present in $D_{full}$. This core subset $D_c$ is designed to significantly reduce the amount of training data required for fine-tuning while enhancing downstream performance of LLMs.
### 3.2 Overview
Previous studies on the interpretability of LLMs have demonstrated that the activation states generated during inference can represent the features of input data\(Elhageet al\.,[2022](https://arxiv.org/html/2605.30857#bib.bib40); Brickenet al\.,[2023](https://arxiv.org/html/2605.30857#bib.bib39); Billset al\.,[2023](https://arxiv.org/html/2605.30857#bib.bib51)\)\. In particular,Brickenet al\.\([2023](https://arxiv.org/html/2605.30857#bib.bib39)\)propose that a neural network represents features of the data by assigning each feature its own linear combination of neurons, such that a corresponding set of neurons activates when the feature is present\. Inspired by this, we define the neurons activated by the training data during inference as their tags, which represent the features in the data\. Based on these activation tags, we perform core set selection by covering as many tags as possible, thereby ensuring a diverse set of features in the core set\. Our method comprises three steps: extracting activation tags, filtering activation tags, and sampling maximum complexity activation tags\. Figure[1](https://arxiv.org/html/2605.30857#S3.F1)provides an overview of our method\.
Figure 1: The overall framework of MADS, consisting of three stages. (1) Activation Tags Extracting: Instructions are fed into the LLM, and neuron activation values exceeding the threshold of 1 are extracted from each MLP layer and converted into activation tags. (2) Activation Tags Filtering: Representative layers are selected based on the proportion of strongly activated neurons, and low-frequency activation tags are filtered out to reduce noise. (3) Full-Coverage Core Set Selection with Complexity Priority: For each unique activation pattern $(tag, v_{tag})$, the instruction with the highest complexity (most distinct activation tags) is selected to ensure comprehensive coverage while prioritizing complex instructions.
### 3.3 Activation Tags Extracting
First, we derive the activation tags of the instructions in $D_{full}$. We utilize a fine-tuned LLM to extract the output of the activation function for all instructions in each Multilayer Perceptron (MLP) layer during inference and convert these outputs into activation tags.

Before inference, instructions are encoded into multiple tokens $\{token_1, \ldots, token_N\}$. These tokens are sequentially fed into the LLM with $L$ layers, each containing an activation function in the MLP submodule. For $token_i$, we extract the output of the activation function from each of the $L$ layers:

$$\mathbf{a}_i^1, \mathbf{a}_i^2, \ldots, \mathbf{a}_i^L = \mathrm{LLM_{act}}(token_i) \tag{1}$$

where $\mathbf{a}_i^l$ is the output of the activation function for $token_i$ at the $l$-th layer, with $\mathbf{a}_i^l \in \mathbb{R}^d$, and $d$ denotes the dimension of the output of the LLM activation function.
We then transform $\{\mathbf{a}_i^1, \mathbf{a}_i^2, \ldots, \mathbf{a}_i^L\}$ into activation tags that reflect the features of $token_i$. Each feature corresponds to a linear combination of neurons that are strongly activated. The activation level of neurons indicates the model's "confidence" that some feature is present, while weak activations may sometimes be erroneous (Elhage et al., [2022](https://arxiv.org/html/2605.30857#bib.bib40); Bricken et al., [2023](https://arxiv.org/html/2605.30857#bib.bib39)). Therefore, we keep only the neurons with high activation levels as activation tags. Referring to experimental results from previous studies (Bricken et al., [2023](https://arxiv.org/html/2605.30857#bib.bib39)) concerning the relationship between activation levels and their corresponding text, we set the activation threshold to 1. We provide a series of examples in Appendix [A](https://arxiv.org/html/2605.30857#A1) to illustrate this point.
To further validate this threshold, we analyze 39 billion activation values in our experiments and find that 99\.70% of them are less than or equal to 1, indicating that only approximately 0\.30% of activations exceed this threshold\. Referring to recent work\(Shafranet al\.,[2025](https://arxiv.org/html/2605.30857#bib.bib63)\)that retains the top 1% of activations as strong activations in Llama 3\.1 models, we believe that setting this threshold to 1 is sufficient to filter out erroneous activations\.
Let $\text{Idx}(\mathbf{a}_i^l)$ denote the set of dimensions in $\mathbf{a}_i^l$ with activation value greater than 1:

$$\text{Idx}(\mathbf{a}_i^l) = \{ j \mid \mathbf{a}_i^l[j] > 1 \} \tag{2}$$

where $j = 1, 2, \ldots, d$ is the index of the dimension in $\mathbf{a}_i^l$.

Considering the varying degrees of activation in the dimensions of $\text{Idx}(\mathbf{a}_i^l)$, we retain this information by sorting the dimensions in descending order according to their corresponding activation values and using them as the activation tag $tag_i^l$:

$$tag_i^l = \text{Sort}(\text{Idx}(\mathbf{a}_i^l), \text{key} = \mathbf{a}_i^l[j]) \tag{3}$$
In addition to extracting activation tags, we also retain their corresponding activation values: inspired by Bricken et al. ([2023](https://arxiv.org/html/2605.30857#bib.bib39)), we hypothesize that activation values are correlated with data characteristics. The activation tag value $v_{tag_i^l}$ is defined as the maximum activation value among the neurons indexed by $tag_i^l$:

$$v_{tag_i^l} = \max_{j \in tag_i^l} \mathbf{a}_i^l[j] \tag{4}$$

To investigate our hypothesis, we analyze 260K activation tags (with $v_{tag_i^l}$ rounded to one decimal place, using standard rounding, i.e., round half up) and find that 98.14% have only one unique activation value, while 1.86% have multiple values. A sampling study on the 1.86% of tags with multiple activation values reveals a large number of samples exhibiting the following phenomenon: while instructions sharing the same activation tag possess common characteristics, those with different activation values within the tag exhibit finer-grained distinctions, with partial examples presented in Table [6](https://arxiv.org/html/2605.30857#S5.T6). This observation motivates us to incorporate activation values into the core set selection process to achieve finer-grained data partitioning.
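Equations (2) to (4) can be illustrated with a minimal sketch for a single token at a single layer. The function name `activation_tag`, the 0-based neuron indices, the tie-break by index, and returning `None` when no neuron exceeds the threshold are all assumptions made for illustration; they are not taken from the paper. `Decimal` is used because Python's built-in `round` does not round half up.

```python
from decimal import Decimal, ROUND_HALF_UP

THRESHOLD = 1.0  # activation threshold from Eq. (2)


def activation_tag(activations, threshold=THRESHOLD):
    """Return (tag, value) for one token at one layer, or None (assumption)
    when no neuron is strongly activated.

    tag   : neuron indices (0-based here) with activation > threshold,
            sorted by activation value in descending order (Eqs. 2-3).
    value : largest activation among those neurons, rounded half up to
            one decimal place (Eq. 4 plus the rounding described above).
    """
    strong = [j for j, a in enumerate(activations) if a > threshold]
    if not strong:
        return None
    # Descending activation; ties broken by index for determinism (assumption).
    tag = tuple(sorted(strong, key=lambda j: (-activations[j], j)))
    peak = activations[tag[0]]
    value = Decimal(str(peak)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
    return tag, float(value)


# Hypothetical activation vector with d = 6.
print(activation_tag([0.2, 1.7, -0.3, 3.25, 1.0, 1.2]))  # ((3, 1, 5), 3.3)
```

Note that the neuron with activation exactly 1.0 is excluded, since Eq. (2) uses a strict inequality.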
Based on the above, the activation tags of $token_i$ are defined as $T(token_i) = \{tag_i^1, tag_i^2, \ldots, tag_i^L\}$, and the activation tags of the instruction $ins$ are the union of the tags of all tokens it contains:

$$\begin{aligned} T(ins) &= T(token_1) \cup T(token_2) \cup \cdots \cup T(token_N) \\ &= \left\{ tag_i^l \mid i = 1, 2, \ldots, N;\ l = 1, 2, \ldots, L \right\} \end{aligned} \tag{5}$$
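The union in Eq. (5) can be sketched as follows. The input layout (one list of per-layer tags per token), the use of `None` for a token with no strongly activated neuron, and keying each tag by its layer so that identical index tuples from different layers remain distinct are assumptions for illustration only.

```python
def instruction_tags(token_layer_tags):
    """Union the per-token activation tags of one instruction (Eq. 5).

    token_layer_tags[i][l] is the tag of token i at layer l: a tuple of
    neuron indices, or None when no neuron exceeded the threshold.
    """
    tags = set()
    for per_layer in token_layer_tags:
        for layer, tag in enumerate(per_layer):
            if tag is not None:
                tags.add((layer, tag))  # layer-qualified tag (assumption)
    return tags


# Hypothetical instruction with two tokens and two layers.
tokens = [
    [(3, 1), None],   # token 1: tag at layer 0 only
    [(3, 1), (7,)],   # token 2: same tag at layer 0, new tag at layer 1
]
print(instruction_tags(tokens) == {(0, (3, 1)), (1, (7,))})  # True
```

Because the result is a set, a tag shared by several tokens of the same instruction is counted once.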
### 3.4 Activation Tags Filtering
In the previous section, we introduced the activation tags of the instruction, $T(ins)$. Here, considering the following two factors: (1) the activation tags between adjacent layers may have significant overlap, and (2) different layers of LLMs capture different features (Belinkov et al., [2017](https://arxiv.org/html/2605.30857#bib.bib41); Peters et al., [2018](https://arxiv.org/html/2605.30857#bib.bib42); Blevins et al., [2018](https://arxiv.org/html/2605.30857#bib.bib43)), we propose filtering the activation tags. The filtering process consists of two steps: selecting representative layers based on activation patterns, and removing low-frequency activation tags to reduce noise.
Step 1: Layer Selection Based on Activation Patterns. Assuming we need to select $M$ of the $L$ layers to provide the final activation tags, our principle is to spread the selected layers as evenly as possible across the network depth. Following this principle, we further design a specific layer selection strategy, which selects $M$ layers with a high proportion of strongly activated neurons. For the $l$-th layer, we first extract the activation vectors of all instructions in $D_{full}$. We then calculate the proportion of activation values greater than 1 within each activation vector. Averaging these proportion values gives us the overall proportion of strongly activated neurons in the $l$-th layer. Formally, for an instruction $ins$ at the $l$-th layer, let $\mathbf{a}_{ins}^l \in \mathbb{R}^d$ denote its activation vector. The proportion of strongly activated neurons for this instruction at layer $l$ is defined as:

$$p_l(ins) = \frac{|\{ j \mid \mathbf{a}_{ins}^l[j] > 1,\ j = 1, \ldots, d \}|}{d} \tag{6}$$

The overall proportion for layer $l$ is computed as the average across all instructions:

$$p_l = \frac{1}{|D_{full}|} \sum_{ins \in D_{full}} p_l(ins) \tag{7}$$

This process is repeated for each layer in the LLM, allowing us to track the proportion of strongly activated neurons across layers. As depicted in Figure [2](https://arxiv.org/html/2605.30857#S3.F2), this proportion increases in a wave-like pattern starting from the first layer. A higher proportion may indicate that the layer contains more information, with the peak of the wave representing the local maximum proportion of strongly activated neurons. We select layers corresponding to these peaks as candidates for final activation tags. However, when multiple peaks occur in close proximity (i.e., separated by only one layer), selecting all of them would violate our principle of even distribution across the network depth. In such cases, we retain only the peak with the higher proportion value. Consequently, we select the layers corresponding to the remaining peaks for final activation tags, denoted as $SL = \{SL_1, SL_2, \ldots, SL_M\}$.
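The layer selection described above can be sketched in a few lines. This is an illustrative reading, not the authors' implementation: the function names, the `min_gap` parameter, and the choice to treat the first and last layers as peaks when they exceed their single neighbour are assumptions. How a per-instruction activation vector is obtained from token-level activations is not specified here, so `layer_proportion` simply takes a list of vectors.

```python
def layer_proportion(activation_vectors, threshold=1.0):
    """Average share of neurons with activation > threshold (Eqs. 6-7)."""
    shares = [sum(a > threshold for a in vec) / len(vec)
              for vec in activation_vectors]
    return sum(shares) / len(shares)


def select_layers(proportions, min_gap=2):
    """Pick peak layers from the per-layer proportions p_l.

    proportions[k] is p_l for layer l = k + 1 (layers are 1-indexed in the
    text). Peaks whose indices differ by at most `min_gap` (i.e. separated
    by only one layer) are merged, keeping the higher proportion.
    """
    n = len(proportions)
    peaks = []
    for k, p in enumerate(proportions):
        left = proportions[k - 1] if k > 0 else float("-inf")
        right = proportions[k + 1] if k < n - 1 else float("-inf")
        if p > left and p > right:
            peaks.append(k)

    selected = []
    for k in peaks:
        if selected and k - selected[-1] <= min_gap:
            # Two nearby peaks: keep only the one with the higher proportion.
            if proportions[k] > proportions[selected[-1]]:
                selected[-1] = k
        else:
            selected.append(k)
    return [k + 1 for k in selected]  # back to 1-indexed layer numbers


# Hypothetical proportions for a 10-layer model (not measured values).
p = [0.1, 0.3, 0.2, 0.25, 0.4, 0.35, 0.5, 0.3, 0.2, 0.6]
print(select_layers(p))  # [2, 7, 10]
```

In this toy example layers 5 and 7 are both local maxima separated by one layer, so only layer 7, which has the higher proportion, is kept.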
Figure 2: The proportion of strongly activated neurons (activation value $> 1$) in each layer of Llama-3.2-3B-Instruct. The x-axis represents the layer index (1–28), and the y-axis represents the average proportion of neurons with activation values exceeding 1. Red flags indicate selected layers (2nd, 9th, 16th, 21st, and 28th), corresponding to local maxima in the wave-like pattern. The 7th layer is excluded despite being a local maximum because it is separated from the 9th layer by only one layer, and the 9th layer has a higher proportion value. This selection strategy ensures even distribution across network depth while prioritizing information-rich layers.

Step 2: Low-Frequency Activation Tags Filtering. After selecting $M$ layers, we further filter out the activation tags that appear infrequently. This is because activation tags with very low frequency lack generality and can even be noise. For each unique activation tag $tag$, we define its frequency as:

$$f(tag) = |\{ ins \in D_{full} \mid tag \in T(ins) \}| \tag{8}$$

which counts the number of instructions in $D_{full}$ that contain the activation tag $tag$. To filter out low-frequency activation tags, we first set a basic frequency filtering threshold $\theta_{base}$. However, because the number of distinct activation tags varies significantly between lower and higher layers, it is unreasonable to set the same threshold for all layers. To address this issue, we propose calculating the filtering weight for each layer based on the number of distinct activation tags contained in each layer. For all activation tags of the instructions in $D_{full}$, we calculate the number of distinct activation tags contained in the $M$ layers we selected, denoted as $TN = \{TN_1, TN_2, \ldots, TN_M\}$. Then, the filtering weight for the $l$-th selected layer is:

$$w_l = \frac{\max(TN_1, TN_2, \ldots, TN_M)}{TN_l} \tag{9}$$

We calculate the filtering threshold of the $l$-th layer:

$$\theta_l = \theta_{base} \times w_l \tag{10}$$

Then, we filter out the low-frequency activation tags in the $l$-th layer with frequencies lower than the corresponding threshold $\theta_l$, for all selected layers $l \in SL$, to obtain the final activation tags of the instructions. The activation tags of the instructions after filtering are:

$$T^f(ins) = \{ tag_i^l \mid l \in SL,\ tag_i^l \notin Tags_f \}, \tag{11}$$

where $Tags_f = \{ tag \mid f(tag) < \theta_l \}$ is the set of low-frequency activation tags to be filtered out.
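The layer-weighted frequency filter of Eqs. (8) to (11) can be sketched as below. The function name, the representation of each tag as a `(layer, tag)` pair, and the toy data are assumptions for illustration; the input is assumed to be already restricted to the selected layers $SL$.

```python
from collections import Counter


def filter_low_frequency(tags_per_instruction, theta_base):
    """Drop tags whose frequency is below the layer-specific threshold.

    tags_per_instruction: list of sets of (layer, tag) pairs, one set per
    instruction. Returns the filtered sets T^f(ins) in the same order.
    """
    # f(tag): number of instructions containing the tag (Eq. 8).
    freq = Counter(t for tags in tags_per_instruction for t in tags)
    if not freq:
        return [set() for _ in tags_per_instruction]

    # TN_l: number of distinct tags observed in each selected layer.
    distinct = Counter(layer for layer, _ in freq)
    largest = max(distinct.values())

    # theta_l = theta_base * max(TN) / TN_l (Eqs. 9-10).
    theta = {layer: theta_base * largest / n for layer, n in distinct.items()}

    # Keep a tag only if its frequency reaches its layer's threshold (Eq. 11).
    return [
        {(layer, tag) for layer, tag in tags if freq[(layer, tag)] >= theta[layer]}
        for tags in tags_per_instruction
    ]


# Hypothetical tags from two selected layers (2 and 9).
data = [
    {(2, (5,)), (9, (1, 4))},
    {(2, (5,)), (9, (1, 4))},
    {(2, (5,)), (9, (3,))},
    {(2, (8,)), (9, (6,))},
]
filtered = filter_low_frequency(data, theta_base=1)
print(filtered[3] == {(9, (6,))})  # True
```

Here layer 2 has two distinct tags and layer 9 has three, so the thresholds are 1.5 and 1.0. The tag `(8,)` in layer 2 appears once and is removed, while the equally rare tags in layer 9 survive, which shows how layers with fewer distinct tags are filtered more strictly.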
Algorithm 1: Full-Coverage Core Set Selection with Complexity Priority

Input: Full dataset $D_{full}$, filtered activation tags $T^f(ins)$ for each instruction, activation values $v_{tag}$ for each tag

Output: Core subset $D_c$

1. // Step 1: Building Activation Pattern Mapping
2. Initialize $\mathcal{P} \leftarrow \emptyset$ ▷ Set of all activation patterns
3. Initialize $\mathcal{I} \leftarrow \emptyset$ ▷ Mapping from patterns to instructions
4. for each $ins \in D_{full}$ do
5. for each $tag \in T^f(ins)$ do
6. $v_{tag} \leftarrow$ activation value of $tag$ in $ins$
7. $v_{tag} \leftarrow \text{Round}(v_{tag}, 1)$ ▷ Standard rounding (round half up) to one decimal place, enabling grouping of tags with similar activation intensity
8. $\mathcal{P} \leftarrow \mathcal{P} \cup \{(tag, v_{tag})\}$ ▷ Add activation pattern
9. Append $ins$ to $\mathcal{I}(tag, v_{tag})$ ▷ Build pattern-to-instructions mapping
10. end for
11. end for
12. // Step 2: Complexity-Based Representative Selection
13. Compute complexity $C(ins) = |T^f(ins)|$ for each $ins \in D_{full}$
14. Initialize $D_c \leftarrow \emptyset$
15. for each $(tag, v_{tag}) \in \mathcal{P}$ do ▷ Iterate over each unique activation pattern
16. $\text{candidates} \leftarrow \mathcal{I}(tag, v_{tag})$
17. $ins^{*} \leftarrow \arg\max_{ins \in \text{candidates}} C(ins)$ ▷ Select most complex instruction
18. if $ins^{*} \notin D_c$ then
19. $D_c \leftarrow D_c \cup \{ins^{*}\}$ ▷ Ensure pattern coverage
20. end if
21. end for
22. return $D_c$
### 3\.5Full\-Coverage Core Set Selection with Complexity Priority
In this stage, we describe how to select the core set $D_{c}$ from $D_{full}$ based on the filtered activation tags. Our goal is to ensure that $D_{c}$ covers all activation patterns in $D_{full}$ while prioritizing complex instructions. The selection process consists of two steps: (1) building a mapping from activation patterns to instructions, and (2) selecting representative instructions based on complexity.
Step 1: Building Activation Pattern Mapping. For each instruction, we extract not only the activation tags but also their corresponding activation values. We use $(tag,v_{tag})$ pairs to represent distinct activation patterns because the same neuron may exhibit different activation intensities for different features.
For each instruction $ins$, we define its activation-pattern set as:
$$P^{f}(ins)=\{(tag,v_{tag})\mid tag\in T^{f}(ins)\} \tag{12}$$
The set of all distinct activation patterns is defined as the union of the activation-pattern sets of all instructions in $D_{full}$:
$$\mathcal{P}=\bigcup_{ins\in D_{full}}P^{f}(ins) \tag{13}$$
For each activation pattern $(tag,v_{tag})\in\mathcal{P}$, we build a mapping to the set of instructions containing this pattern:
$$\mathcal{I}(tag,v_{tag})=\{ins\in D_{full}\mid(tag,v_{tag})\in P^{f}(ins)\} \tag{14}$$
Step 2: Complexity-Based Representative Selection. For each instruction $ins$, we define its complexity $C(ins)$ as the number of distinct activation tags it contains:
$$C(ins)=|T^{f}(ins)| \tag{15}$$
Instructions with higher complexity values are considered more complex and are preferred as representatives, as they can potentially cover more activation patterns.
For each activation pattern $(tag,v_{tag})\in\mathcal{P}$, we select the instruction with the highest complexity from $\mathcal{I}(tag,v_{tag})$ as the representative:
$$ins^{*}=\arg\max_{ins\in\mathcal{I}(tag,v_{tag})}C(ins) \tag{16}$$
The final core set $D_{c}$ is the union of all selected representatives:
$$D_{c}=\bigcup_{(tag,v_{tag})\in\mathcal{P}}\{ins^{*}\} \tag{17}$$
The detailed algorithm is presented in Algorithm [1](https://arxiv.org/html/2605.30857#alg1).
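For readers who prefer code, the two steps of Algorithm 1 can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the input layout (`{instruction_id: {tag: activation value}}`), the function names, and the tie-breaking rule (the first candidate encountered wins) are choices made for the example. Python's built-in `round` rounds halves to even, so the sketch uses the `decimal` module to obtain the round-half-up behavior the algorithm specifies.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value):
    """Round to one decimal place with ties rounded up (for positive values)."""
    quantized = Decimal(str(value)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
    return float(quantized)


def select_core_set(activations):
    """activations: {instruction_id: {tag: activation value}} after filtering."""
    # Step 1: map each (tag, rounded value) pattern to the instructions showing it
    pattern_to_ins = defaultdict(list)
    for ins, tag_values in activations.items():
        for tag, value in tag_values.items():
            pattern_to_ins[(tag, round_half_up(value))].append(ins)

    # Step 2: complexity C(ins) is the number of filtered tags, Eq. (15)
    complexity = {ins: len(tag_values) for ins, tag_values in activations.items()}

    core = set()
    for candidates in pattern_to_ins.values():
        # max() returns the first maximal candidate; this tie rule is an assumption
        core.add(max(candidates, key=complexity.__getitem__))
    return core


example = {
    "a": {"t1": 1.04, "t2": 1.25},
    "b": {"t1": 0.96},
    "c": {"t1": 1.52, "t2": 1.31, "t3": 2.0},
}
core_set = select_core_set(example)
```

Here instructions `a` and `b` share the pattern `("t1", 1.0)`, and `a` is kept because it carries more tags; `b` introduces no pattern of its own and is dropped.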
## 4Experiments
### 4\.1Datasets
Training Datasets. We utilize the Alpaca-GPT4 dataset (Peng et al., [2023](https://arxiv.org/html/2605.30857#bib.bib53)), which comprises 52,002 instruction-response pairs. The responses are generated by the GPT-4 model (Achiam et al., [2023](https://arxiv.org/html/2605.30857#bib.bib1)), resulting in higher data quality than the original Alpaca dataset (Taori et al., [2023](https://arxiv.org/html/2605.30857#bib.bib6)) from Stanford University. We also use the WizardLM dataset (Xu et al., [2024](https://arxiv.org/html/2605.30857#bib.bib59)), which leverages the Evol-Instruct algorithm to generate high-quality instruction data. Specifically, we use its WizardLM-7b subset comprising 70,000 instruction-response pairs.
Evaluation Datasets. To ensure a comprehensive and unbiased evaluation, we use six evaluation datasets covering five distinct tasks.
- Factual Knowledge: We use the Massive Multitask Language Understanding dataset (MMLU; Hendrycks et al., [2020](https://arxiv.org/html/2605.30857#bib.bib44)), which assesses how well LLMs understand factual knowledge across 57 disciplines, including STEM, humanities, and social sciences.
- Math Reasoning: We use the Grade School Math dataset (GSM; Cobbe et al., [2021](https://arxiv.org/html/2605.30857#bib.bib45)), which evaluates the mathematical ability of LLMs on 1,319 grade school math test questions.
- Code Generation: We use the HumanEval dataset (Chen et al., [2021](https://arxiv.org/html/2605.30857#bib.bib46)), which we refer to as CodeX. It contains 164 programming questions that evaluate the language understanding and code-writing capabilities of LLMs.
- Natural Language Inference: We evaluate this ability using two widely used datasets: HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2605.30857#bib.bib47)) and TruthfulQA (Lin et al., [2022](https://arxiv.org/html/2605.30857#bib.bib48)). HellaSwag contains 70K multiple-choice questions related to commonsense inference. TruthfulQA contains 817 questions spanning 38 categories, designed to elicit incorrect answers that stem from common misconceptions.
- Knowledge Reasoning: We use the ARC Challenge dataset (ARC-C; Clark et al., [2018](https://arxiv.org/html/2605.30857#bib.bib49)). ARC-C includes 2590 science exam questions from grade 3 to grade 9, which require substantial knowledge and reasoning to answer.
For each evaluation dataset, the number of few-shot examples and the evaluation metric are shown in Table [1](https://arxiv.org/html/2605.30857#S4.T1).
Table 1: Detailed information of our evaluation settings. For each evaluation dataset, we provide the few-shot number and metric used for evaluation.
### 4\.2Baselines
We compare our method with the following baselines:
- DEITA (Liu et al., [2024b](https://arxiv.org/html/2605.30857#bib.bib29)) proposes a data-efficient instruction tuning approach that automatically evaluates instruction complexity and response quality using LLM-based scorers, and applies diversity-based sampling to select high-quality instruction data.
- MoDS (Du et al., [2023](https://arxiv.org/html/2605.30857#bib.bib15)) introduces a model-oriented data selection method that combines quality, coverage, and necessity dimensions to select instruction data by evaluating whether LLMs can correctly respond to given instructions.
- IFD (Li et al., [2024a](https://arxiv.org/html/2605.30857#bib.bib14)) develops a self-guided metric for data selection, namely the Instruction-Following Difficulty (IFD) metric.
- NUGGETS (Li et al., [2024b](https://arxiv.org/html/2605.30857#bib.bib50)) constructs a scoring system based on a predefined task set to evaluate whether the data can significantly improve performance across diverse tasks.
- ClusterClip (Shao et al., [2024](https://arxiv.org/html/2605.30857#bib.bib19)) uses clustering to reflect the distribution of data and balances common samples against rare samples.
- SelectIT (Liu et al., [2024a](https://arxiv.org/html/2605.30857#bib.bib18)) leverages the intrinsic uncertainty present in LLMs with different parameter sizes to select high-quality data.
- InsTag (Lu et al., [2024](https://arxiv.org/html/2605.30857#bib.bib17)) uses ChatGPT to obtain 6.6K tags to comprehensively describe user queries and selects diverse and complex samples based on the tags.
### 4\.3Experiments Setup
Core Set Selection. When selecting the core set from the Alpaca-GPT4 dataset using our method, we employ two different LLMs to extract the activation tags: Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct (Dubey et al., [2024](https://arxiv.org/html/2605.30857#bib.bib3)).
Table 2: The overall results on the benchmark tasks. “Full” denotes training with the complete dataset, while all other methods use 15% of the full dataset for training. $\mathrm{MADS}_{3B}$ denotes core set selection using Llama-3.2-3B-Instruct, while $\mathrm{MADS}_{8B}$ denotes core set selection using Llama-3.1-8B-Instruct. "Imp." denotes the average improvement across all tasks.
Table 3: Performance on downstream tasks of Llama-2-7B fine-tuned with core sets selected by activation tags from each layer chosen through our strategy. "All Layers" denotes using the activation tags from all 28 layers of Llama-3.2-3B-Instruct for core set selection.
Training Details. We fine-tune the Llama-2-7B (Touvron et al., [2023](https://arxiv.org/html/2605.30857#bib.bib52)) base model and the Llama-3-8B (Dubey et al., [2024](https://arxiv.org/html/2605.30857#bib.bib3)) base model. For the Llama-2-7B base model, we fine-tune for 3 epochs with a batch size of 128, using the Adam optimizer with a learning rate of $1\times 10^{-5}$ and a warm-up ratio of 0.01. For the Llama-3-8B base model, we fine-tune for 6 epochs with a batch size of 256, using the Adam optimizer with a learning rate of $2\times 10^{-6}$ and a warm-up ratio of 0.06.
### 4\.4Preliminary Validation of Activation\-Tag Representations
In this subsection, we empirically validate that neuron activation tags capture domain-specific semantic features, providing the foundational motivation for MADS. We conduct experiments using 1,000 randomly selected instructions from each of five domains: code ([ise-uiuc/Magicoder-OSS-Instruct-75K](https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K)), mathematics ([openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)), legal analysis ([cais/mmlu, professional_law](https://huggingface.co/datasets/cais/mmlu/tree/main/professional_law)), medical consultation ([hongzhouyu/FineMed-SFT](https://huggingface.co/datasets/hongzhouyu/FineMed-SFT)), and historical Q&A ([nielsprovos/world-history-1500-qa](https://huggingface.co/datasets/nielsprovos/world-history-1500-qa)). We extract activation tags at five layers of Llama-3.2-3B-Instruct and perform two analyses. (1) PCA Visualization: For each instruction, we construct a $c$-dimensional one-hot vector, where $c$ is the total number of distinct activation tags across all five domains. Each dimension corresponds to an activation tag and is set to 1 if the tag is present in the instruction and 0 otherwise. We apply PCA to reduce these vectors to 2D for visualization, revealing domain-specific clustering (Figure [3](https://arxiv.org/html/2605.30857#S4.F3)). (2) Similarity Analysis: We compute the average number of shared activation tags between instruction pairs, comparing intra-domain pairs with cross-domain pairs. Results show that same-domain instructions share significantly more tags than different-domain pairs (Figure [4](https://arxiv.org/html/2605.30857#S4.F4)).
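The two encodings used in these analyses can be sketched with the standard library alone. The sketch below is illustrative and rests on assumptions: tags are represented as strings, intra-domain averages exclude pairing an instruction with itself, and the helper names are hypothetical. The PCA step itself is omitted; any standard PCA implementation could be applied to the indicator vectors.

```python
from itertools import combinations, product


def indicator_vectors(tag_sets):
    """Encode each instruction as a 0/1 vector over the sorted tag vocabulary."""
    vocab = sorted(set().union(*tag_sets))
    return [[1 if tag in tags else 0 for tag in vocab] for tags in tag_sets]


def mean_shared_tags(tags_a, tags_b, same_domain=False):
    """Average number of tags shared by instruction pairs from two domains."""
    if same_domain:
        pairs = list(combinations(tags_a, 2))  # unordered pairs, no self-pairs
    else:
        pairs = list(product(tags_a, tags_b))
    if not pairs:
        return 0.0
    return sum(len(a & b) for a, b in pairs) / len(pairs)


def shared_tag_matrix(domain_tags):
    """domain_tags: {domain: [tag set per instruction]} -> {(d1, d2): mean}."""
    return {
        (d1, d2): mean_shared_tags(domain_tags[d1], domain_tags[d2], d1 == d2)
        for d1 in domain_tags
        for d2 in domain_tags
    }


toy = {
    "code": [{"a", "b"}, {"a", "b", "c"}],
    "math": [{"c"}, {"d"}],
}
matrix = shared_tag_matrix(toy)
vectors = indicator_vectors([{"a", "c"}, {"b", "d"}])
```

In the toy input, the two code instructions share two tags, while code and math pairs share 0.25 tags on average, mirroring the diagonal-dominant pattern the heatmap is meant to reveal.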
For brevity, we display the results for three representative layers here; the results for the other two layers are provided in Appendix [B](https://arxiv.org/html/2605.30857#A2). Both experimental results indicate a correlation between activation tags and data features. Moreover, both experiments reveal a consistent pattern: code and historical Q&A instructions exhibit the largest separation from other domains. Mathematics, legal, and medical instructions also differ from one another in their activation tags, but the differences among these three categories are smaller than those separating code and historical instructions from the rest. This aligns with intuition: code instructions contain syntactically unique programming languages, and historical Q&A focuses on factual knowledge retrieval, whereas the other three domains require logical reasoning to derive answers.
To provide a more grounded interpretation beyond domain labels alone, we further examine the dataset-internal characteristics of the sampled instructions (see Appendix [C](https://arxiv.org/html/2605.30857#A3) for representative examples). For the code domain, 60.4% of instructions contain embedded code blocks (marked by backtick delimiters), and 74.3% include programming-specific tokens (class, def, import, return, etc.) that are virtually absent in other domains; this lexically unique structure produces highly distinctive activation patterns. For the historical Q&A domain, we note that although individual historical questions may indeed involve multi-factor reasoning (e.g., analyzing political, economic, or cultural dimensions of historical events), all sampled instructions share a uniform document-comprehension format: each presents a primary source passage and asks the model to select the correct interpretation from multiple-choice options, without requiring the model to retrieve or generate knowledge from memory. This highly consistent task structure, and not merely the domain-level content, likely contributes substantially to the distinct activation clustering observed. By contrast, mathematical, legal, and medical instructions exhibit greater internal heterogeneity in task format and reasoning demands: mathematical instructions predominantly require procedural step-by-step arithmetic reasoning (98.6% following explicit chain-of-thought computation notation); legal instructions span constitutional analysis (16.7%), criminal law scenarios (23.8%), and civil procedure cases (11.8%), encompassing both rule-retrieval and multi-step legal reasoning; and medical instructions mix symptom-based clinical consultations (40.6%) and open-ended knowledge explanation (26.6%).
This greater task diversity within these three domains, combined with their shared reliance on natural language analytical reasoning, leads to more overlapping activation patterns and explains the relatively smaller inter\-domain separation observed among them\.



Figure 3: PCA visualization of activation tag vectors at layers 1, 8, and 15 of Llama-3.2-3B-Instruct for five instruction categories: code (orange), math (purple), law (green), medical (red), and history (blue). Instructions are encoded as $c$-dimensional one-hot vectors, where $c$ is the total number of distinct activation tags, and reduced to 2D using PCA. The clustering demonstrates that activation tags capture domain-specific semantic features.


Figure 4: Heatmap of the average number of shared activation tags between instruction categories at layers 1, 8, and 15 of Llama-3.2-3B-Instruct. Each cell $(i,j)$ represents the average number of activation tags shared between instruction pairs from domain $i$ and domain $j$. Most diagonal values (intra-domain similarity) are higher than off-diagonal values (cross-domain similarity), with darker colors indicating more shared tags. This pattern confirms that same-domain instructions share more tags than different-domain pairs, validating the correlation between activation patterns and data features.
## 5Results
### 5\.1Main Results
We applied our method separately to neuron activation data from Llama\-3\.2\-3B\-Instruct and Llama\-3\.1\-8B\-Instruct, using a core set size of 15% of the full dataset\. The experimental results are presented in Table[2](https://arxiv.org/html/2605.30857#S4.T2)\. From the results, we derive the following observations:
Our method significantly outperforms the full dataset and other baselines in terms of average improvement across all tasks\. While some methods excel in specific tasks, they underperform in others, indicating that our method enhances LLM capabilities without sacrificing specific abilities\. This is attributed to our method’s focus on ensuring diversity and coverage within the core set\.
Our method shows robustness in core set selection across different models\. Comparing the core set selection results of Llama\-3\.2\-3B\-Instruct and Llama\-3\.1\-8B\-Instruct, we find that our method effectively performs the core set selection task and both achieve the best results, regardless of whether the model has a similar parameter size to the base model or a smaller parameter size\. This indicates that our method provides an effective approach to leverage smaller LLMs to enhance the capabilities of larger LLMs, which is important for selecting datasets under resource constraints\.
### 5\.2Analysis
We conducted detailed analysis experiments to answer the following questions: (1) How does the layer selection strategy affect our method’s effectiveness? (§[5.2.1](https://arxiv.org/html/2605.30857#S5.SS2.SSS1)) (2) How do different selected layers and their interactions influence task-specific improvements? (§[5.2.2](https://arxiv.org/html/2605.30857#S5.SS2.SSS2)) (3) How does filtering low-frequency activation tags affect downstream tasks? (§[5.2.3](https://arxiv.org/html/2605.30857#S5.SS2.SSS3)) (4) How robust is MADS across different base models? (§[5.2.4](https://arxiv.org/html/2605.30857#S5.SS2.SSS4)) (5) How does activation-based coverage differ from embedding-based coverage in terms of redundancy and feature coverage? (§[5.2.10](https://arxiv.org/html/2605.30857#S5.SS2.SSS10)) (6) How does the rounding granularity of $v_{tag}$ affect the grouping mechanism and downstream task performance? (§[5.2.7](https://arxiv.org/html/2605.30857#S5.SS2.SSS7))
#### 5\.2\.1Effect of Layer Selection Strategy
In our method, during the activation tag filtering stage, we select $M$ layers where the proportion of strongly activated neurons is locally maximal. To verify this strategy’s effectiveness, we compared it with a uniform selection of $M$ layers, which is a more direct and straightforward selection approach. Specifically, when using the activation tags of Llama-3.2-3B-Instruct, which consists of 28 layers, our method selects the 2nd, 9th, 16th, 21st, and 28th layers. The uniform selection strategy selects the 1st, 8th, 15th, 22nd, and 28th layers. We fine-tuned Llama-2-7B with core sets from both strategies. Results in Figure [5](https://arxiv.org/html/2605.30857#S5.F5) show that both strategies can improve the model’s capabilities, but our strategy results in an average improvement of 5.17%, compared to 3.59% with the uniform strategy.
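A peak-based layer choice of this kind could be sketched as follows. This is only a rough illustration: the exact rule the method uses to resolve nearby peaks is defined outside this section, so the greedy highest-peak-first policy, the `min_gap` parameter, the treatment of the first and last layers, and the function names are all assumptions made for the example.

```python
def local_maxima(ratios):
    """Return 1-based layer indices whose ratio is strictly above both neighbors.

    The first and last layers are compared with their single neighbor only.
    """
    n = len(ratios)
    peaks = []
    for i, ratio in enumerate(ratios):
        left = ratios[i - 1] if i > 0 else float("-inf")
        right = ratios[i + 1] if i < n - 1 else float("-inf")
        if ratio > left and ratio > right:
            peaks.append(i + 1)
    return peaks


def pick_layers(ratios, m, min_gap=3):
    """Greedily keep up to m peaks, highest first, at least min_gap layers apart."""
    ranked = sorted(local_maxima(ratios), key=lambda layer: ratios[layer - 1], reverse=True)
    chosen = []
    for layer in ranked:
        if all(abs(layer - kept) >= min_gap for kept in chosen):
            chosen.append(layer)
        if len(chosen) == m:
            break
    return sorted(chosen)


# Hypothetical per-layer proportions of strongly activated neurons (10 layers)
toy_ratios = [0.1, 0.5, 0.2, 0.1, 0.3, 0.2, 0.4, 0.3, 0.6, 0.2]
selected = pick_layers(toy_ratios, m=3)
```

In the toy profile, layers 7 and 9 are both peaks but sit only two layers apart, so the lower one is skipped in favor of a more distant peak.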
Figure 5: Comparison of two layer selection strategies on Llama-2-7B fine-tuning performance. The x-axis shows six benchmark tasks, and the y-axis shows relative improvement over the full dataset baseline. Green bars represent our peak-based strategy (selecting layers 2, 9, 16, 21, 28), achieving 5.17% average improvement. Blue bars represent the uniform selection strategy (selecting layers 1, 8, 15, 22, 28), achieving 3.59% average improvement. Our strategy outperforms uniform selection by 1.58 percentage points on average.
#### 5\.2\.2Effect of Different Layers
The activation tags used for core set selection are sampled from different layers\. To validate \(1\) the impact of the number of selected layers on LLM capabilities and \(2\) the effect of different layers’ activation tags, we applied our method using activation tags from different layers and then fine\-tuned Llama\-2\-7B\.
From the results in Table[3](https://arxiv.org/html/2605.30857#S4.T3), we observed that as the number of sampled layers increased, the overall capability of the LLM generally improved, and the best results were obtained when using the 2nd, 9th, 16th, 21st, and 28th layers\. An exception was found with the 2nd, 9th, 16th, and 21st layers\. To verify whether the 21st layer has a negative impact on the final results, we further used the 2nd, 9th, 16th, and 28th layers, and found that the results were significantly worse than when using these five layers\. We hypothesize that the 21st layer can work in conjunction with the 28th layer to exert its effect, suggesting potential synergistic interactions between layers\.
In order to further study whether the activation tags of different layers have different effects on improving LLM capabilities, we performed core set selection based on the activation tags of individual layers\. From the results in Table[3](https://arxiv.org/html/2605.30857#S4.T3), it can be observed that different layers exhibit distinct enhancement effects for different tasks\. For instance, the 16th layer showed significant improvements in TruthfulQA and ARC\-C tasks, but performed poorly in other tasks, such as MMLU and GSM\. Our method’s multi\-layer core set selection allows mutual compensation, achieving balanced LLM enhancement\. Furthermore, we conducted an experiment using all layers for core set selection\. As shown in Table[3](https://arxiv.org/html/2605.30857#S4.T3), utilizing all layers achieves a 4\.12% average improvement, which is still lower than the 5\.10% improvement achieved by our layer selection strategy\. This result confirms that while incorporating more layers provides more information, our selective approach effectively identifies the most informative layers and avoids redundant information\.
To validate our layer selection strategy for handling adjacent peaks, we conducted additional experiments on layer 7, which is a local maximum but was excluded from our final selection due to its proximity to layer 9 \(separated by only one layer\)\. As shown in Table[3](https://arxiv.org/html/2605.30857#S4.T3), when using layer 7 alone for core set selection, the average improvement is only 0\.72%, which is significantly lower than layer 9’s improvement of 2\.22%\. Additionally, when including layer 7 in the multi\-layer selection \(2\-7\-16\-21\-28\), the average improvement drops to 2\.71%, compared to 5\.10% when using layer 9 instead\. Furthermore, to directly examine the impact of including both adjacent peaks simultaneously, we conducted an experiment using layers 2\-7\-9\-16\-21\-28\. As shown in Table[3](https://arxiv.org/html/2605.30857#S4.T3), although using both layer 7 and layer 9 achieves strong performance on TruthfulQA and ARC\-C tasks, it exhibits notable degradation on MMLU and GSM tasks\. The results indicate that violating the even distribution principle by selecting adjacent peaks impedes balanced enhancement across multiple capabilities, leading to uneven improvements on different tasks\. This experiment provides direct empirical evidence that excluding adjacent peaks is essential for optimal performance\.
These results are consistent with prior findings on layerwise processing in Transformers: representations are hierarchically organized (earlier layers are more lexical/syntactic and middle-to-late layers become increasingly semantic/discourse or prediction-related) (Li and Subramani, [2025](https://arxiv.org/html/2605.30857#bib.bib70); He et al., [2024](https://arxiv.org/html/2605.30857#bib.bib71)), while adjacent layers tend to be more representationally similar than distant ones (Jiang et al., [2024](https://arxiv.org/html/2605.30857#bib.bib72)). Therefore, our choice of (2, 9, 16, 21, 28) can be viewed as a lightweight way to sample heterogeneous feature regimes across layers and reduce redundancy from neighboring layers.
#### 5\.2\.3Effect of Activation Tags Filtering
Figure 6: Long-tail distribution of activation tag frequencies at layer 15 of Llama-3.2-3B-Instruct. The x-axis represents activation tags sorted by frequency in descending order, and the y-axis represents the frequency (number of instructions containing each tag). The green region indicates high-frequency tags retained in the core set, while the red region indicates low-frequency tags filtered out as noise. This filtering strategy removes rare activation patterns that lack generality while preserving representative activation patterns.
To demonstrate the rationality of filtering out low-frequency activation tags in our method, we conducted a statistical analysis of the frequency $f(tag)$ of each activation tag. We found that the frequencies follow a long-tail distribution, where a few activation tags appear frequently, while most appear infrequently. Our method filters out low-frequency activation tags, retaining only those at the head of the long-tail distribution, as illustrated in Figure [6](https://arxiv.org/html/2605.30857#S5.F6).
We further explored the impact of activation tag filtering on the performance of LLMs on different tasks. In our method, by adjusting the frequency filtering threshold $\theta_{base}$, we can control the number of filtered activation tags, thus controlling the size of the core set. When using Llama-3.2-3B-Instruct for core set selection, we set the size of the core set to 5%, 10%, 15%, 20%, and 25% of the original dataset, with corresponding $\theta_{base}$ values of 58, 28, 17, 12, and 10. We then fine-tuned Llama-2-7B using these core sets and evaluated the performance on different tasks, as shown in Figure [7](https://arxiv.org/html/2605.30857#S5.F7). The experiments show that the optimal $\theta_{base}$ varies for different tasks; for example, GSM achieved the best results when $\theta_{base}=17$ (15% of the data), while ARC-C achieved the best results when $\theta_{base}=12$ (20% of the data). Moreover, the sensitivity to $\theta_{base}$ varies across different tasks. For instance, the performance on the HellaSwag task remains relatively stable regardless of changes in $\theta_{base}$, while the performance on the GSM task exhibits significant fluctuations with variations in $\theta_{base}$.
Figure 7: Performance of Llama-2-7B on six benchmark tasks with varying core set sizes. The x-axis represents the percentage of Alpaca-GPT4 data used (5%–25%), controlled by threshold $\theta_{base}$ (58, 28, 17, 12, 10 respectively). The y-axis shows accuracy on each task. The optimal data size varies by task, with 15% achieving the best average improvement.
Table 4: Performance of MADS for core set selection on the WizardLM Dataset
#### 5\.2\.4Robustness across Models and Datasets
To verify the robustness of MADS, we tested its performance on various base models. We used the full Alpaca-GPT4 dataset and the core set containing 15% of the data selected based on Llama-3.2-3B-Instruct to fine-tune different base models, as shown in Figure [8](https://arxiv.org/html/2605.30857#S5.F8). The experimental results demonstrate that, in terms of average performance improvement across all benchmark tasks, MADS enhances the capabilities of both Llama 2 and Llama 3 models, regardless of their parameter sizes. Additionally, although the core set was selected by Llama-3.2-3B-Instruct, it was also applicable to Mistral-7B (Jiang et al., [2023](https://arxiv.org/html/2605.30857#bib.bib2)). This experiment demonstrates the generalizability of MADS, which does not rely on a specific base model but can effectively improve the capabilities of LLMs of different series or scales.
Furthermore, to validate the robustness of our method across different datasets, we performed core set selection on the WizardLM dataset based on Llama\-3\.2\-3B\-Instruct and subsequently fine\-tuned Llama\-2\-7B\. As shown in Table[4](https://arxiv.org/html/2605.30857#S5.T4), we experimented with both 10% and 15% of the WizardLM data\. The results demonstrate that both settings achieve performance comparable to that obtained using the full dataset, which validates the robustness of our method\. Notably, using 10% of the data yields the best improvement, which is different from the results on the Alpaca\-GPT4 dataset where 15% of the data performed best\. This discrepancy may be attributed to the larger size of the WizardLM dataset compared to the Alpaca\-GPT4 dataset, which provides greater potential for compression when selecting diverse data, in addition to differences in the inherent quality of the datasets\.
Figure 8:Performance improvement of different base models fine\-tuned with the core set selected by Llama\-3\.2\-3B\-Instruct compared to the full dataset\. The x\-axis shows four base models: Llama\-2\-7B, Llama\-2\-13B, Llama\-3\-8B, and Mistral\-7B\. Bars represent improvement on individual tasks, and horizontal lines indicate average improvement across all six benchmark tasks\. All models show positive average improvement, demonstrating that the core set selected by Llama\-3\.2\-3B\-Instruct generalizes effectively across different model families and scales\.
#### 5\.2\.5Case Study
Our method assumes that different neuron activations correspond to different data features. To verify this, we present data cases corresponding to different activation tags in Table [5](https://arxiv.org/html/2605.30857#S5.T5). Data with identical activation tags exhibit similar features, validating the rationale of our method.
Table 5: Examples of activation tags and corresponding instructions.
| Tag | $v_{tag}$ | Instruction Examples |
| --- | --- | --- |
| layer1-4665: Activated when computing the median | 1.0 (Median computation on sequence data) | (1) Use the given data to calculate the median. [2, 3, 7, 8, 10] (2) How do you calculate the median from the given data? 1, 2, 8, 9, 12, 13 (3) Create a Python program to calculate the median of an array of numbers. 5, 15, 20, 2, 10 |
| | 1.1 (Median computation on non-sequence data) | (1) Given a table of data, calculate the mean, median, and range of the data. (2) For the given input, … Calculating the median cost of gas (3) Predict the median age of London. |
| layer1-7955: Activated when the instruction requires the text to be reorganized. | 1.0 (Re-arrange tasks) | (1) Re-arrange the following letters to form a meaningful word. vhics (2) Re-arrange this sentence: “Happy are those who dream” (3) Re-arrange the given words to make it into a valid sentence. the students best performed |
| | 1.1 (Rewrite tasks) | (1) Rewrite the following sentence in the third person. I am anxious (2) Rewrite the sentence with more descriptive words. The game is fun. (3) Rewrite the given sentence using a different but similar word. She partook in the event. |
| layer8-819,4424,7062,4990: Activated when multiplicative operations are present | 1.5 (The sentence pattern is "Calculate the product…") | (1) Calculate the product of 5 and 3. (2) Calculate the product given two numbers. 4 and 8 (3) Calculate the product of the two values. 3 and 5 |
| | 1.2 (The sentence pattern is "Find the product of…") | (1) Find the product of the numbers. 5 and 8 (2) Find the product of 29 and 32 (3) Find the product of 1.8 and 5. |
| layer8-5712,4567,465: Activated when append actions or addition operations are present | 1.3 (The added quantity is 3) | (1) Add 3 words to make the sentence more vivid. The teacher gave a speech. (2) Add 3 more animals to the following list. Dogs, Cats, Monkeys (3) Add 3 new ingredients to a pasta dish |
| | 1.5 (The added quantity is 5) | (1) Add 5 items to a grocery shopping list. (2) Add 5 eights to the number 9. (3) Add 5px to each of the current margin values. margin-left: 20px; margin-top: 30px; |
| layer8-3221,6071,6082,2225: Activated when the instruction requires describing something. | 1.8 (Describe the ’scenario.’) | (1) Describe a scenario where someone could be accused of plagiarism. (2) Describe a scenario where a student’s choices can either cause success or failure. (3) Describe a scenario where a GPT language model could be used for task completion. |
| | 2.2 (Describe the ’situation.’) | (1) Describe a situation where you had to demonstrate teamwork. (2) Describe a situation where body language can help facilitate understanding. (3) Describe a situation where the use of solar energy is beneficial. |
| layer8-4818,5557,3221,6052: Activated when the instruction requires providing something. | 1.6 (The quantity is three) | (1) Provide three example sentences that use the word “redundant” (2) Provide three steps to solve a particular problem. How to create a budget (3) Provide three example words for the following category: fruits |
| | 1.1 (The quantity is four) | (1) Provide four examples of data visualizations. (2) Provide four key advantages of using a cloud-based system (3) Provide four ideas to boost employee morale. |

Table 6: Case study of activation values.
Table 7: Ablation study results on the four components of our framework. w/o Neuron Activation: replacing neuron activation states with hidden states, specifically the final output representation of each layer, for tag extraction. w/o Filtering: removing low-frequency activation filtering ($\theta_{base}=0$). w/o Complexity-Priority: randomly selecting one instruction per activation tag instead of using complexity-priority selection. w/o Activation Values: selecting one instruction per unique tag instead of per $(tag,v_{tag})$ pair.
#### 5.2.6 Ablation Study
To verify the effectiveness of the four key components in our framework, we conducted four ablation experiments: (1) Effect of Activation Tag Extraction: We replaced neuron activation states with hidden states, specifically the final output representation of each layer, to verify the role of neuron activation states. (2) Effect of Activation Tag Filtering: We did not filter low-frequency activations, i.e., we set $\theta_{base}$ to 0. In this case, the core set contains 81.87% of the original data. (3) Effect of Full-Coverage Core Set Selection with Complexity Priority: For each activation tag, instead of selecting the instruction containing the most distinct tags, we randomly selected one instruction from those containing that tag. (4) Effect of Activation Values: We ignored activation values during core set selection, selecting the single highest-complexity instruction for each unique $tag$ instead of for each $(tag, v_{tag})$ pair.
The experimental results are shown in Table [7](https://arxiv.org/html/2605.30857#S5.T7). We observe the following: (1) Replacing neuron activation states with hidden states causes a significant decrease in model performance, with the average improvement dropping to 0.93%. This indicates that neuron activation states represent the features of the data better than hidden states do. (2) When low-frequency activations are not filtered, model performance is the worst of all variants, with an average improvement of only 0.40%, even though the core set grows to 81.87% of the data. This suggests that low-frequency activations may be noise and that retaining them introduces a large amount of redundant or low-quality data. (3) Randomly selecting instructions for each activation tag yields an average improvement of 2.02%, well below the 5.10% achieved by the full MADS method. This demonstrates that selecting instructions with higher complexity, i.e., those containing more diverse activation tags, is more effective for constructing a high-quality core set. (4) When activation values are ignored, the average improvement decreases to 3.43%, which is 1.67 percentage points lower than the full MADS method. This indicates that treating $(tag, v_{tag})$ pairs as distinct activation patterns enables finer-grained data partitioning, thereby improving the diversity and quality of the selected core set.
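The ablated components can be made concrete with a small sketch. The Python snippet below is a simplified illustration and not the authors' implementation: it assumes each instruction has already been mapped to a dictionary from activation tag to rounded activation value, it assumes a pattern survives filtering when it occurs in at least `theta_base` instructions, and the names `select_core_set`, `theta_base`, and `use_values` are hypothetical. Setting `use_values=False` mimics the "w/o Activation Values" variant.

```python
from collections import Counter

def select_core_set(tagged, theta_base=2, use_values=True):
    """tagged: dict mapping instruction id -> {tag: rounded activation value}."""
    def patterns(tags):
        # A pattern is a (tag, v_tag) pair, or only the tag when values are ignored.
        return {(t, v) if use_values else t for t, v in tags.items()}

    # Count in how many instructions each pattern occurs, then drop rare ones
    # (assumed rule: keep patterns with frequency >= theta_base).
    freq = Counter(p for tags in tagged.values() for p in patterns(tags))
    kept = {p for p, n in freq.items() if n >= theta_base}

    # Surviving patterns per instruction; their count serves as the complexity.
    filtered = {i: patterns(tags) & kept for i, tags in tagged.items()}

    core = set()
    for p in kept:
        candidates = sorted(i for i, ps in filtered.items() if p in ps)
        # Complexity priority: keep the most complex candidate (ties: smallest id).
        core.add(max(candidates, key=lambda i: len(filtered[i])))
    return core

# Toy data with made-up activation values.
tagged = {
    0: {"layer1-4665": 1.0, "layer1-7955": 1.5},
    1: {"layer1-4665": 1.0},
    2: {"layer1-4665": 1.1, "layer1-7955": 1.5},
    3: {"layer1-4665": 1.1},
}
print(select_core_set(tagged))                    # per (tag, v_tag) pair
print(select_core_set(tagged, use_values=False))  # per unique tag
```

On this toy input, selecting per $(tag, v_{tag})$ pair keeps two instructions while selecting per unique tag keeps only one, which illustrates the finer partitioning discussed in point (4).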
#### 5.2.7 Analysis of $v_{tag}$ Rounding Granularity
The rounding operation $\text{Round}(v_{tag}, 1)$ in Algorithm [1](https://arxiv.org/html/2605.30857#alg1) is central to the $(tag, v_{tag})$ grouping mechanism. The choice of rounding precision directly controls the granularity of this sub-grouping. To analyze how rounding precision affects the grouping structure and downstream task performance, we compare three settings: integer precision (0 decimal places), one decimal place (our default), and two decimal places. All experiments use Llama-3.2-3B-Instruct for activation extraction and fine-tune Llama-2-7B with a 15% core set from Alpaca-GPT4.
Table [8](https://arxiv.org/html/2605.30857#S5.T8) reports, for each precision setting, the total number of unique activation tags retained in the core set, the average number of distinct $v_{tag}$ values per unique activation tag, and the $\theta_{base}$ required to achieve a 15% core set size. At integer precision, each unique activation tag has on average only 1.25 distinct integer $v_{tag}$ values, so the activation value adds little discrimination beyond the tag itself. At two decimal places, the high precision creates many fine-grained sub-groups per tag, requiring an aggressive filtering threshold of $\theta_{base}=108$ to reach the 15% target; this over-filtering leaves only 817 activation tags, resulting in insufficient coverage of the activation space. Compared to these two extremes, one decimal place achieves a more reasonable balance: $\theta_{base}=17$ retains 5253 diverse activation tags, with each activation tag split into an average of 3.28 sub-groups, providing fine-grained intensity discrimination while preserving broad activation-space coverage.
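How rounding precision changes the number of sub-groups can be illustrated with a toy computation. The snippet below is only a sketch on made-up activation values, not the paper's data or code; `group_stats` and `toy` are hypothetical names, and Python's built-in `round` stands in for the paper's $\text{Round}(v_{tag}, \cdot)$ operation.

```python
from collections import defaultdict

def group_stats(activations, ndigits):
    """activations: iterable of (tag, raw activation value) pairs.

    Returns (number of distinct (tag, v_tag) patterns,
             average number of distinct v_tag values per tag).
    """
    values_per_tag = defaultdict(set)
    for tag, value in activations:
        values_per_tag[tag].add(round(value, ndigits))
    n_patterns = sum(len(vals) for vals in values_per_tag.values())
    return n_patterns, n_patterns / len(values_per_tag)

# Made-up raw activation values for two hypothetical tags.
toy = [("t1", 1.04), ("t1", 1.12), ("t1", 1.13), ("t1", 1.48),
       ("t2", 2.21), ("t2", 2.23), ("t2", 2.61)]

for ndigits in (0, 1, 2):
    print(ndigits, group_stats(toy, ndigits))
```

Coarser rounding merges nearby values into one pattern, while finer rounding splits each tag into more patterns. With a fixed core set budget, more patterns means a higher frequency threshold is needed, which is the trade-off described above.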
Table 8: Statistics of $(tag, v_{tag})$ patterns under different $v_{tag}$ rounding precisions, with a 15% core set from Alpaca-GPT4. “#Act. Patterns” denotes the total number of unique activation tags in the final core set. “Avg. $v_{tag}$ per Tag” denotes the average number of distinct $v_{tag}$ values per unique activation tag type. “$\theta_{base}$” denotes the uniform filtering threshold applied to each group to achieve the 15% target core set size.

Table [9](https://arxiv.org/html/2605.30857#S5.T9) presents the downstream task performance under each precision setting. Integer precision achieves a 1.60% average improvement, confirming that when the grouping granularity is excessively coarse, the activation value information provides limited benefit. Two decimal places yield a 4.76% improvement, as the overly fine-grained grouping of $v_{tag}$ leaves too few activation tags to ensure diverse coverage. One decimal place achieves the best performance of 5.10%. This indicates that one-decimal granularity strikes the best balance among the three settings: it captures meaningful variations in activation intensity for fine-grained sub-group diversity, while also preventing excessive pattern fragmentation, thereby preserving comprehensive and representative coverage of the activation space.
Table 9: Performance of Llama-2-7B fine-tuned with 15% core sets selected under different $v_{tag}$ rounding precisions.
#### 5.2.8 Analysis of Length Bias
One potential concern is whether our greedy algorithm exhibits length bias, i.e., whether it simply selects longer instructions because they naturally activate more neurons. To investigate this, we analyzed the instruction length distribution of the core set selected by $\mathrm{MADS}_{3B}$ and compared it with the two best-performing baselines, SelectIT and NUGGETS.
Figure [9](https://arxiv.org/html/2605.30857#S5.F9) presents the frequency distribution of instruction lengths (measured in tokens) for the three methods. Although $\mathrm{MADS}_{3B}$ includes some longer instructions, the majority of selected instructions still fall within a moderate length range. Table [10](https://arxiv.org/html/2605.30857#S5.T10) provides detailed statistics of the instruction lengths. While $\mathrm{MADS}_{3B}$ has a higher average and median instruction length than SelectIT and NUGGETS, the mode values are similar across all methods, indicating that the most frequently selected instructions are of comparable length. Importantly, as shown in Figure [9](https://arxiv.org/html/2605.30857#S5.F9), the length distributions of all three methods remain concentrated in similar ranges, with the majority of instructions containing fewer than 50 tokens. The minimum instruction length is identical for all methods. These results indicate that while our method tends to select more complex instructions, it does not overlook shorter instructions. Additionally, compared to the baselines, the instruction lengths we select are more balanced.
While the above descriptive statistics show that MADS does not exclusively select long instructions, a natural concern is whether the instruction complexity metric $C(ins)=|T^{f}(ins)|$ simply acts as a proxy for instruction length, since longer instructions contain more tokens and thus have more opportunities to activate neurons. To disentangle the effects of length and activation complexity on MADS’s selection behavior, we perform a logistic regression analysis on all instructions in the Alpaca-GPT4 dataset. For each instruction $ins$, we compute two features: (1) the instruction length $L(ins)$ (number of tokens), and (2) the activation complexity $C(ins)=|T^{f}(ins)|$ (number of distinct activation tags). The binary outcome variable indicates whether the instruction is selected by MADS ($P(ins)=1$) or not ($P(ins)=0$). We fit the following logistic regression model:
$$P(ins)=\sigma\left(\beta_{0}+\beta_{1}\cdot L(ins)+\beta_{2}\cdot C(ins)\right) \tag{18}$$

where $\sigma(\cdot)$ denotes the sigmoid function.
The regression results are presented in Table [11](https://arxiv.org/html/2605.30857#S5.T11). Both instruction length ($\beta_{1}=0.042$, $p<0.001$) and activation complexity ($\beta_{2}=0.161$, $p<0.001$) are statistically significant predictors. Notably, if MADS merely selected longer instructions, the complexity coefficient would become non-significant when length is controlled; instead, it remains highly significant, confirming that complexity contributes independently to selection. More importantly, the standardized coefficients reveal that activation complexity is a much stronger predictor than length. The effect size for complexity is 1.041, about 1.9 times the 0.551 for length, indicating that complexity is the dominant factor driving MADS selection. These results provide evidence that MADS captures semantic complexity beyond merely reflecting instruction length.
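The analysis above can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the authors' code: the data are synthetic (selection is made to depend on complexity only), standardized coefficients are obtained by z-scoring each predictor before fitting (one common convention; the paper does not state which procedure was used), and `standardize` and `fit_logistic` are hypothetical helper names. The model fitted is the one in Equation (18).

```python
import math
import random
import statistics

def standardize(column):
    """Z-score a list of numbers using the population standard deviation."""
    mu, sd = statistics.mean(column), statistics.pstdev(column)
    return [(v - mu) / sd for v in column]

def fit_logistic(X, y, lr=0.5, steps=500):
    """Fit P = sigmoid(b0 + b1*x1 + b2*x2 + ...) by batch gradient ascent
    on the mean log-likelihood. Returns [b0, b1, b2, ...]."""
    n, d = len(X), len(X[0])
    beta = [0.0] * (d + 1)
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for row, target in zip(X, y):
            z = beta[0] + sum(b * v for b, v in zip(beta[1:], row))
            err = target - 1.0 / (1.0 + math.exp(-z))
            grad[0] += err
            for j, v in enumerate(row):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Synthetic data (NOT the paper's): complexity is correlated with length,
# but the selection label depends on complexity only.
rng = random.Random(0)
lengths = [rng.randint(5, 60) for _ in range(300)]
complexity = [0.2 * L + rng.gauss(0, 3) for L in lengths]
selected = [1 if c + rng.gauss(0, 2) > 8 else 0 for c in complexity]

X = list(zip(standardize(lengths), standardize(complexity)))
b0, b_len, b_cmp = fit_logistic(X, selected)
print(f"std. coef: length={b_len:.3f}, complexity={b_cmp:.3f}")
```

Because both predictors are on the same scale after z-scoring, their coefficients can be compared directly, which is the kind of comparison behind the reported 1.041 versus 0.551 (a ratio of roughly 1.9).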
Figure 9: Frequency distribution of instruction lengths (measured in tokens) for core sets selected by three methods: $\mathrm{MADS}_{3B}$, SelectIT, and NUGGETS. The x-axis represents instruction length, and the y-axis represents frequency count. While $\mathrm{MADS}_{3B}$ includes some longer instructions, the majority of selected instructions remain within moderate length ranges similar to the baselines. The distributions show that MADS does not exclusively favor long instructions but maintains balanced length diversity.

Table 10: Statistics of instruction token counts for different core set selection methods.

Table 11: Logistic regression results predicting MADS selection. “Coef.” denotes the original-scale coefficient, and “Std. Coef.” denotes the standardized coefficient for effect size comparison. Pseudo $R^{2}$ is a goodness-of-fit measure analogous to the proportion of variance explained by the model, where higher values indicate better explanatory power. $^{***}p<0.001$ indicates that the probability of observing such results under the null hypothesis is less than 0.1%.

To further validate whether MADS favors higher-quality long instructions rather than arbitrary lengthy instructions, we conduct a length-controlled ablation study. We construct a length-matched random set by stratified sampling from the Alpaca-GPT4 dataset, ensuring that its instruction-length distribution fully aligns with that of the $\mathrm{MADS}_{3B}$ core set. As shown in Table [12](https://arxiv.org/html/2605.30857#S5.T12), this length-matched random set achieves only a 2.28% average improvement, substantially lower than the 5.10% achieved by $\mathrm{MADS}_{3B}$. This result indicates that MADS selects higher-quality long instructions rather than merely long ones.
Table 12: Length-controlled ablation on Llama-2-7B. "Length-Matched" refers to a random set created by stratified sampling from the Alpaca-GPT4 dataset, ensuring that its instruction-length distribution fully aligns with that of the $\mathrm{MADS}_{3B}$ core set.
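A length-matched random baseline of this kind could be built roughly as follows. The sketch is an assumption-laden illustration rather than the authors' procedure: it stratifies by exact token length when `bin_width=1`, samples from the whole pool without excluding the core set, and the function name `length_matched_sample` and its parameters are hypothetical.

```python
import random
from collections import Counter, defaultdict

def length_matched_sample(pool_lengths, core_lengths, bin_width=1, seed=0):
    """Return indices into pool_lengths whose binned length histogram
    equals that of core_lengths (stratified random sampling)."""
    rng = random.Random(seed)

    # Group pool indices by length bin.
    by_bin = defaultdict(list)
    for idx, length in enumerate(pool_lengths):
        by_bin[length // bin_width].append(idx)

    # How many instructions the reference core set has in each bin.
    quota = Counter(length // bin_width for length in core_lengths)

    sample = []
    for b, k in quota.items():
        # random.sample raises ValueError if the pool has fewer than k items in this bin.
        sample.extend(rng.sample(by_bin[b], k))
    return sample

# Toy token counts (made up).
pool_lengths = [5, 5, 5, 12, 12, 30, 30, 30, 48]
core_lengths = [5, 12, 30, 30]
idx = length_matched_sample(pool_lengths, core_lengths)
print(sorted(pool_lengths[i] for i in idx))
```

Any performance gap between such a random set and the MADS core set cannot be attributed to instruction length alone, since both share the same length histogram.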
#### 5.2.9 Computational Cost Analysis
In addition to downstream task performance, computational efficiency during data selection is an important practical consideration\. In this section, we extract a 15% core set from the Alpaca\-GPT4 dataset, and the experiment for each method is independently conducted on a single NVIDIA A800 GPU\. Table[13](https://arxiv.org/html/2605.30857#S5.T13)summarizes the key computational factors across different methods\.
| Method | GPT | Warmup | Pre-def. | LLM | LLM #Fwd | Small Model | Small Model #Fwd | Time | Mem. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DEITA | ✓ | ✓ | - | Llama-1-7B | 2 | - | - | 4h45m | 28GB |
| MoDS | - | ✓ | - | Llama-2-7B | 1 | DeBERTa (435M) | 2 | 87h25m | 70GB |
| IFD | - | ✓ | - | Llama-2-7B | 2 | - | - | 1h11m | 15GB |
| NUGGETS | - | - | ✓ | Llama-2-7B | $m$ | - | - | 223h45m | 69GB |
| ClusterClip | - | - | ✓ | - | - | JinaBERT (137M) | 1 | 2m20s | 5GB |
| SelectIT | - | - | - | Llama-2-7B/13B/70B | 15 | - | - | -‡ | -‡ |
| InsTag | ✓ | ✓ | - | Llama-2-7B | 1 | - | - | 29m57s | 74GB |
| MADS | - | - | - | Llama-3.2-3B | 1 | - | - | 1h09m | 15GB |

Table 13: Comparison of computational requirements for data selection methods. The LLM and small-model columns together describe representation extraction. “GPT” indicates whether ChatGPT/GPT-4 annotation is required. “Warmup” indicates whether model training is required before data selection. “Pre-def.” indicates whether task categories or cluster numbers must be specified in advance. “LLM” shows the large language model used (either directly or as a base for training) for representation extraction or scoring. “#Fwd” denotes the number of forward passes required per instruction, where $m$ denotes the number of predefined tasks in NUGGETS. “Small Model” indicates additional small models required, with parameter counts in parentheses. Time and memory are measured on a single NVIDIA A800 GPU using each method’s official implementation, reflecting actual runtime rather than the theoretical minimum (as some methods implement parallelization while others do not). ‡ SelectIT requires 7B, 13B, and 70B models with 5 forward passes each, exceeding our available computational resources.

We briefly describe each baseline method’s computational requirements:
- DEITA (Liu et al., [2024b](https://arxiv.org/html/2605.30857#bib.bib29)) employs ChatGPT to score instruction complexity and response quality, then trains a scoring model based on Llama-1-7B. Data selection requires two forward passes: one for complexity scoring and one for quality scoring.
- MoDS (Du et al., [2023](https://arxiv.org/html/2605.30857#bib.bib15)) employs a reward model ([https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2](https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2)) for both quality evaluation and necessity evaluation. The method first uses the reward model to filter high-quality data (one DeBERTa forward pass), then fine-tunes Llama-2-7B on seed data to obtain an initial model. Finally, it uses the initial model to generate responses for all high-quality instructions (one LLM forward pass) and applies the reward model again to identify necessary data (another DeBERTa forward pass).
- IFD (Li et al., [2024a](https://arxiv.org/html/2605.30857#bib.bib14)) requires a warmup model fine-tuned on a seed dataset, then computes Instruction-Following Difficulty scores by comparing losses with and without instruction context, requiring two forward passes per instruction.
- NUGGETS (Li et al., [2024b](https://arxiv.org/html/2605.30857#bib.bib50)) designs a one-shot learning metric using Llama-2-7B. For each candidate instruction, it requires $m$ forward passes to compute one-shot scores on each of the $m$ predefined tasks. In our experiments, we use the official implementation with $m=100$ tasks sampled from the Alpaca-GPT4 dataset via K-Means clustering.
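- ClusterClip (Shao et al., [2024](https://arxiv.org/html/2605.30857#bib.bib19)) uses JinaBERT (137M) to extract text embeddings followed by k-means clustering. It relies solely on a small embedding model without any LLM inference, but the number of clusters must be specified in advance.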
- SelectIT (Liu et al., [2024a](https://arxiv.org/html/2605.30857#bib.bib18)) computes uncertainty-based metrics across three granularities (token, sentence, model) using multiple LLMs (7B, 13B, 70B), requiring 15 forward passes per instruction (5 passes per model for uncertainty estimation).
- InsTag (Lu et al., [2024](https://arxiv.org/html/2605.30857#bib.bib17)) uses ChatGPT to generate semantic tags for instructions, then trains a tagging model based on Llama-2-7B for tag prediction, requiring a single forward pass per instruction.
From Table [13](https://arxiv.org/html/2605.30857#S5.T13), we analyze the computational efficiency of MADS. In terms of both time and memory consumption, only ClusterClip outperforms our method, as it relies solely on a small embedding model without requiring any LLM inference. Regarding time efficiency alone, InsTag is also faster than MADS, benefiting from its pre-trained tagging model that requires only a single forward pass for tag prediction. Overall, however, MADS demonstrates competitive computational efficiency compared to most baseline methods while offering several practical advantages: (1) No external API dependency: Unlike DEITA and InsTag, MADS does not require ChatGPT annotation, eliminating API costs and potential data privacy concerns. (2) No warmup training: Unlike DEITA, MoDS, and IFD, MADS operates directly on the original dataset without requiring preliminary model training, significantly reducing total computation time. (3) No pre-defined categories: Unlike NUGGETS and ClusterClip, MADS does not require specifying task categories or cluster numbers in advance, as data categories are naturally derived from the neuron activation patterns of LLMs. (4) Single forward pass: MADS requires only one forward pass per instruction using a 3B-parameter model, which is more efficient than NUGGETS and SelectIT.
#### 5.2.10 Redundancy Quantification: Activation Tags vs. Text Embeddings
A natural question is whether the optimization landscape induced by activation tags meaningfully differs from that of embedding-based coverage methods. To investigate this, we conduct a redundancy quantification analysis comparing the core sets selected by MADS and ClusterClip (Shao et al., [2024](https://arxiv.org/html/2605.30857#bib.bib19)), a representative embedding-based coverage method that uses JinaBERT to extract text embeddings followed by k-means clustering. We evaluate redundancy from two complementary perspectives: (1) Embedding-space redundancy: We use JinaBERT to extract text embeddings for each instruction in both core sets and compute the average pairwise cosine similarity, where lower similarity indicates less redundancy and greater diversity in the embedding space. (2) Activation-space coverage: We use Llama-3.2-3B-Instruct to extract activation tags for each instruction in both core sets and count the total number of distinct activation tags covered, where a higher count indicates broader coverage of the model’s internal feature space.
Table 14: Redundancy quantification of core sets selected by ClusterClip and $\mathrm{MADS}_{3B}$. “Avg. Cosine Sim.” denotes the average pairwise cosine similarity of JinaBERT text embeddings within each core set (lower indicates less redundancy). “#Act. Tags” denotes the total number of distinct activation tags covered by each core set (higher indicates broader coverage).

The results are presented in Table [14](https://arxiv.org/html/2605.30857#S5.T14). We observe that the two core sets exhibit comparable embedding-space redundancy, with average cosine similarities of 0.2414 (ClusterClip) and 0.2508 ($\mathrm{MADS}_{3B}$). This near-parity is expected: ClusterClip explicitly optimizes for embedding-space diversity via k-means clustering on JinaBERT embeddings, so it naturally achieves low redundancy in that space. The fact that MADS achieves a similar level of embedding-space diversity *without* directly optimizing for it suggests that activation-based selection implicitly induces comparable textual diversity.
More importantly, the two methods differ dramatically in activation-space coverage: MADS covers 757K distinct activation tags, which is 78.1% more than the 425K tags covered by ClusterClip. Text embeddings from models like JinaBERT encode general semantic similarity, but they do not reflect the fine-grained feature distinctions that arise within the LLM’s own representational space. This broader activation-space coverage provides a plausible explanation for the superior downstream performance of MADS over ClusterClip (Table [2](https://arxiv.org/html/2605.30857#S4.T2)): by covering more of the LLM’s internal feature space, the fine-tuned model is exposed to a more comprehensive set of training signals, leading to more balanced and effective capability improvement.
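The two redundancy measures used in this analysis are simple to compute once embeddings and activation tags are available. The following Python sketch shows one way to do so on toy inputs; it is not the authors' code, the vectors and tag sets are made up, and `avg_pairwise_cosine` and `distinct_tag_count` are hypothetical names. Extracting real JinaBERT embeddings or Llama-3.2-3B-Instruct activation tags is outside its scope.

```python
import math
from itertools import combinations

def avg_pairwise_cosine(embeddings):
    """Mean cosine similarity over all unordered pairs of non-zero vectors."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    pairs = list(combinations(embeddings, 2))
    return sum(cos(u, v) for u, v in pairs) / len(pairs)

def distinct_tag_count(tag_sets):
    """Number of distinct activation tags covered by a core set."""
    return len(set().union(*tag_sets))

# Toy core set: three instructions with 2-d embeddings and small tag sets.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_tags = [{"a", "b"}, {"b", "c"}, {"c"}]

print(avg_pairwise_cosine(toy_embeddings))
print(distinct_tag_count(toy_tags))
```

A lower first value means less embedding-space redundancy, and a higher second value means broader activation-space coverage. The reported tag counts are consistent with the stated gap, since 757K / 425K is approximately 1.781.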
## 6 Conclusion
In this paper, we introduce MADS, which uses neuron activation states during LLM inference as data tags to select a diverse instruction fine-tuning dataset, achieving better fine-tuning performance with only a fraction of the training data. Our method exploits the inherent ability of LLMs to distinguish instructions with different features, and achieves diversity and coverage in core set selection in a self-guided manner. We evaluated MADS on multiple benchmarks, and the results indicate that MADS improves the performance of LLMs across a range of downstream tasks.
## Limitations
Despite achieving effective diverse data selection to enhance LLM performance, our method has several limitations. First, the frequency filtering threshold in MADS is a hyper-parameter that requires manual setting. Future work could explore automatic threshold determination based on the distribution of activation tags, inspired by Xiao et al. ([2025](https://arxiv.org/html/2605.30857#bib.bib69)). Second, the Full-Coverage Core Set Selection with Complexity Priority algorithm used in MADS is a greedy algorithm. Future work could explore other activation-tag-based core set selection algorithms to further enhance the performance of LLMs. Third, our experiments indicate that different layers excel in different tasks. Future work could investigate whether MADS can be extended to task-specific data selection, allowing the optimal core set to be selected for a specific task.
## References
- J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- J. T. Ash, C. Zhang, A. Krishnamurthy, J. Langford, and A. Agarwal (2019) Deep batch active learning by diverse, uncertain gradient lower bounds. In International Conference on Learning Representations.
- J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al. (2023) Qwen technical report. arXiv preprint arXiv:2309.16609.
- Y. Belinkov, L. Màrquez, H. Sajjad, N. Durrani, F. Dalvi, and J. Glass (2017) Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1–10.
- S. Bills, N. Cammarata, D. Mossing, H. Tillman, L. Gao, G. Goh, I. Sutskever, J. Leike, J. Wu, and W. Saunders (2023) Language models can explain neurons in language models. [https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html](https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html)
- T. Blevins, O. Levy, and L. Zettlemoyer (2018) Deep RNNs encode soft hierarchical syntax. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 14–19.
- T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah (2023) Towards monosemanticity: decomposing language models with dictionary learning. [https://transformer-circuits.pub/2023/monosemantic-features/index.html](https://transformer-circuits.pub/2023/monosemantic-features/index.html)
- Y. Cao, Y. Kang, C. Wang, and L. Sun (2024) Instruction mining: instruction data selection for tuning large language models. In First Conference on Language Modeling.
- H. Chen, Y. Zhang, Q. Zhang, H. Yang, X. Hu, X. Ma, Y. Yanggong, and J. Zhao (2023) Maybe only 0.5% data is needed: a preliminary exploration of low training data instruction tuning. arXiv preprint arXiv:2305.09246.
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey (2023) Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2309.08600), [Link](https://arxiv.org/abs/2309.08600)
- Q. Dai, D. Zhang, J. W. Ma, and H. Peng (2025) Improving influence-based instruction tuning data selection for balanced learning of diverse capabilities. arXiv preprint arXiv:2501.12147.
- D. Das and V. Khetan (2024) DEFT-UCS: data efficient fine-tuning for pre-trained language models via unsupervised core-set selection for text-editing. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 20296–20312.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), Minneapolis, Minnesota, pp. 4171–4186. External Links: [Document](https://dx.doi.org/10.18653/v1/N19-1423), [Link](https://aclanthology.org/N19-1423)
- Q. Du, C. Zong, and J. Zhang (2023) MoDS: model-oriented data selection for instruction tuning. arXiv preprint arXiv:2311.15653.
- A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. (2024) The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, R. Grosse, S. McCandlish, J. Kaplan, D. Amodei, M. Wattenberg, and C. Olah (2022) Toy models of superposition. [https://transformer-circuits.pub/2022/toy_model/index.html](https://transformer-circuits.pub/2022/toy_model/index.html)
- S. Har-Peled and A. Kushal (2005) Smaller coresets for k-median and k-means clustering. In Proceedings of the Twenty-First Annual Symposium on Computational Geometry, pp. 126–134.
- L. He, P. Chen, E. Nie, Y. Li, and J. R. Brennan (2024) Decoding probing: revealing internal linguistic structures in neural language models using minimal pairs. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 4488–4497.
- L. Helff, R. Härle, W. Stammer, F. Friedrich, M. Brack, A. Wüst, H. Shindo, P. Schramowski, and K. Kersting (2025) ActivationReasoning: logical reasoning in latent activation spaces. arXiv preprint arXiv:2510.18184.
- D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2020) Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- J. Hu, S. Yang, D. Zhou, and L. Wu (2025) DONOD: robust and generalizable instruction fine-tuning for LLMs via model-intrinsic dataset pruning. arXiv preprint arXiv:2504.14810.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. d\. l\. Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier,et al\.\(2023\)Mistral 7b\.arXiv preprint arXiv:2310\.06825\.Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p1.1),[§5\.2\.4](https://arxiv.org/html/2605.30857#S5.SS2.SSS4.p1.1)\.
- J\. Jiang, J\. Zhou, and Z\. Zhu \(2024\)Tracing representation progression: analyzing and enhancing layer\-wise similarity\.arXiv preprint arXiv:2406\.14479\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2406.14479),[Link](https://arxiv.org/abs/2406.14479)Cited by:[§5\.2\.2](https://arxiv.org/html/2605.30857#S5.SS2.SSS2.p5.1)\.
- M\. Li, Y\. Zhang, Z\. Li, J\. Chen, L\. Chen, N\. Cheng, J\. Wang, T\. Zhou, and J\. Xiao \(2024a\)From quantity to quality: boosting LLM performance with self\-guided data selection for instruction tuning\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 7595–7628\.Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p2.1),[§1](https://arxiv.org/html/2605.30857#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.30857#S2.SS1.p2.1),[3rd item](https://arxiv.org/html/2605.30857#S4.I2.i3.p1.1),[3rd item](https://arxiv.org/html/2605.30857#S5.I1.i3.p1.1)\.
- M\. Li and N\. Subramani \(2025\)Echoes of BERT: do modern language models rediscover the classical NLP pipeline?\.arXiv preprint arXiv:2506\.02132\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2506.02132),[Link](https://arxiv.org/abs/2506.02132)Cited by:[§5\.2\.2](https://arxiv.org/html/2605.30857#S5.SS2.SSS2.p5.1)\.
- Y\. Li, B\. Hui, X\. Xia, J\. Yang, M\. Yang, L\. Zhang, S\. Si, L\.\-H\. Chen, J\. Liu, T\. Liu, F\. Huang, and Y\. Li \(2024b\)One\-shot learning as instruction data prospector for large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\.\-W\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 4586–4601\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.252),[Link](https://aclanthology.org/2024.acl-long.252)Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p3.1),[4th item](https://arxiv.org/html/2605.30857#S4.I2.i4.p1.1),[4th item](https://arxiv.org/html/2605.30857#S5.I1.i4.p1.3)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\)TruthfulQA: measuring how models mimic human falsehoods\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 3214–3252\.Cited by:[4th item](https://arxiv.org/html/2605.30857#S4.I1.i4.p1.1)\.
- L\. Liu, X\. Liu, D\. F\. Wong, D\. Li, Z\. Wang, B\. Hu, and M\. Zhang \(2024a\)SelectIT: selective instruction tuning for large language models via uncertainty\-aware self\-reflection\.arXiv preprint arXiv:2401\.03938\.Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.30857#S2.SS1.p2.1),[6th item](https://arxiv.org/html/2605.30857#S4.I2.i6.p1.1),[6th item](https://arxiv.org/html/2605.30857#S5.I1.i6.p1.1)\.
- W\. Liu, W\. Zeng, K\. He, Y\. Jiang, and J\. He \(2024b\)What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning\.InThe Twelfth International Conference on Learning Representations,Cited by:[§2\.1](https://arxiv.org/html/2605.30857#S2.SS1.p1.1),[1st item](https://arxiv.org/html/2605.30857#S4.I2.i1.p1.1),[1st item](https://arxiv.org/html/2605.30857#S5.I1.i1.p1.1)\.
- K\. Lu, H\. Yuan, Z\. Yuan, R\. Lin, J\. Lin, C\. Tan, C\. Zhou, and J\. Zhou \(2024\)\#InsTag: instruction tagging for analyzing supervised fine\-tuning of large language models\.InThe Twelfth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p2.1),[§1](https://arxiv.org/html/2605.30857#S1.p3.1),[§1](https://arxiv.org/html/2605.30857#S1.p6.1),[§2\.1](https://arxiv.org/html/2605.30857#S2.SS1.p1.1),[7th item](https://arxiv.org/html/2605.30857#S4.I2.i7.p1.1),[7th item](https://arxiv.org/html/2605.30857#S5.I1.i7.p1.1)\.
- Y\. Luo, Z\. Zhou, and B\. Dong \(2025\)InverseScope: scalable activation inversion for interpreting large language models\.arXiv preprint arXiv:2506\.07406\.Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p4.1)\.
- K\. Margatina, G\. Vernikos, L\. Barrault, and N\. Aletras \(2021\)Active learning by acquiring contrastive examples\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,pp\. 650–663\.Cited by:[§2\.2](https://arxiv.org/html/2605.30857#S2.SS2.p1.1)\.
- B\. Mirzasoleiman, J\. Bilmes, and J\. Leskovec \(2020\)Coresets for data\-efficient training of machine learning models\.InInternational Conference on Machine Learning,pp\. 6950–6960\.Cited by:[§2\.2](https://arxiv.org/html/2605.30857#S2.SS2.p1.1)\.
- A\. Munteanu, C\. Sohler, C\. Schwiegelshohn, D\. P\. Woodruff,et al\.\(2018\)On coresets for logistic regression\.Advances in Neural Information Processing Systems,pp\. 6561–6570\.Cited by:[§2\.2](https://arxiv.org/html/2605.30857#S2.SS2.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray,et al\.\(2022\)Training language models to follow instructions with human feedback\.Advances in Neural Information Processing Systems35,pp\. 27730–27744\.Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p1.1)\.
- X\. Pan, L\. Huang, L\. Kang, Z\. Liu, Y\. Lu, and S\. Cheng \(2024\)G\-DIG: towards gradient\-based DIverse and hiGh\-quality instruction data selection for machine translation\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\.\-W\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 15395–15406\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.821),[Link](https://aclanthology.org/2024.acl-long.821)Cited by:[§2\.1](https://arxiv.org/html/2605.30857#S2.SS1.p2.1)\.
- J\. Pang, J\. Wei, A\. P\. Shah, Z\. Zhu, Y\. Wang, C\. Qian, Y\. Liu, Y\. Bao, and W\. Wei \(2024\)Improving data efficiency via curating LLM\-driven rating systems\.arXiv preprint arXiv:2410\.10877\.Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.30857#S2.SS1.p1.1)\.
- M\. Paul, S\. Ganguli, and G\. K\. Dziugaite \(2021\)Deep learning on a data diet: finding important examples early in training\.Advances in Neural Information Processing Systems34,pp\. 20596–20607\.Cited by:[§2\.2](https://arxiv.org/html/2605.30857#S2.SS2.p1.1)\.
- B\. Peng, C\. Li, P\. He, M\. Galley, and J\. Gao \(2023\)Instruction tuning with GPT\-4\.arXiv preprint arXiv:2304\.03277\.Cited by:[§4\.1](https://arxiv.org/html/2605.30857#S4.SS1.p1.1)\.
- M\. E\. Peters, M\. Neumann, L\. Zettlemoyer, and W\. Yih \(2018\)Dissecting contextual word embeddings: architecture and representation\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,pp\. 1499–1509\.Cited by:[§3\.4](https://arxiv.org/html/2605.30857#S3.SS4.p1.1)\.
- Y\. Qin, Y\. Yang, P\. Guo, G\. Li, H\. Shao, Y\. Shi, Z\. Xu, Y\. Gu, K\. Li, and X\. Sun \(2024\)Unleashing the power of data tsunami: a comprehensive survey on data assessment and selection for instruction tuning of language models\.Transactions on Machine Learning Research\.Cited by:[§2\.1](https://arxiv.org/html/2605.30857#S2.SS1.p1.1)\.
- L\. Ranaldi and A\. Freitas \(2024a\)Aligning large and small language models via chain\-of\-thought reasoning\.InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),Y\. Graham and M\. Purver \(Eds\.\),St\. Julian’s, Malta,pp\. 1812–1827\.External Links:[Link](https://aclanthology.org/2024.eacl-long.109/),[Document](https://dx.doi.org/10.18653/v1/2024.eacl-long.109)Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p3.1)\.
- L\. Ranaldi and A\. Freitas \(2024b\)Self\-refine instruction\-tuning for aligning reasoning in language models\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 2325–2347\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.139/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.139)Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p3.1)\.
- A\. San Joaquin, B\. Wang, Z\. Liu, P\. Muller, N\. Asher, B\. Y\. Lim, and N\. F\. Chen \(2024\)In2Core: leveraging influence functions for coreset selection in instruction finetuning of large language models\.InFindings of the Association for Computational Linguistics: EMNLP 2024,pp\. 10324–10335\.Cited by:[§2\.1](https://arxiv.org/html/2605.30857#S2.SS1.p2.1)\.
- V\. Sanh, A\. Webson, C\. Raffel, S\. H\. Bach, L\. Sutawika, Z\. Alyafeai, A\. Chaffin, A\. Stiegler, T\. L\. Scao, M\. Dey,et al\.\(2022\)Multitask prompted training enables zero\-shot task generalization\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p1.1),[§1](https://arxiv.org/html/2605.30857#S1.p2.1)\.
- O\. Sener and S\. Savarese \(2018\)Active learning for convolutional neural networks: a core\-set approach\.InInternational Conference on Learning Representations,Cited by:[§2\.2](https://arxiv.org/html/2605.30857#S2.SS2.p1.1)\.
- B\. Settles \(2009\)Active learning literature survey\.Technical Report 1648,University of Wisconsin–Madison\.Cited by:[§2\.2](https://arxiv.org/html/2605.30857#S2.SS2.p1.1)\.
- O\. Shafran, A\. Geiger, and M\. Geva \(2025\)Decomposing MLP activations into interpretable features via semi\-nonnegative matrix factorization\.arXiv preprint arXiv:2506\.10920\.Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p4.1),[§3\.3](https://arxiv.org/html/2605.30857#S3.SS3.p4.1)\.
- Y\. Shao, L\. Li, Z\. Fei, H\. Yan, D\. Lin, and X\. Qiu \(2024\)Balanced data sampling for language model training with clustering\.InFindings of the Association for Computational Linguistics: ACL 2024,L\.\-W\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 14012–14023\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.833),[Link](https://aclanthology.org/2024.findings-acl.833)Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p2.1),[§1](https://arxiv.org/html/2605.30857#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.30857#S2.SS1.p1.1),[5th item](https://arxiv.org/html/2605.30857#S4.I2.i5.p1.1),[5th item](https://arxiv.org/html/2605.30857#S5.I1.i5.p1.1),[§5\.2\.10](https://arxiv.org/html/2605.30857#S5.SS2.SSS10.p1.1)\.
- H\. Shi, Z\. Xu, H\. Wang, W\. Qin, W\. Wang, Y\. Wang, Z\. Wang, S\. Ebrahimi, and H\. Wang \(2024\)Continual learning of large language models: a comprehensive survey\.arXiv preprint arXiv:2404\.16789\.Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p2.1)\.
- J\. Song, S\. Liu, B\. Zhu, and Y\. Rao \(2024\)IterSelectTune: an iterative training framework for efficient instruction\-tuning data selection\.arXiv preprint arXiv:2410\.13464\.Cited by:[§2\.1](https://arxiv.org/html/2605.30857#S2.SS1.p1.1)\.
- R\. Taori, I\. Gulrajani, T\. Zhang, Y\. Dubois, X\. Li, C\. Guestrin, P\. Liang, and T\. B\. Hashimoto \(2023\)Stanford Alpaca: an instruction\-following LLaMA model\.Note:[https://github\.com/tatsu\-lab/stanford\_alpaca](https://github.com/tatsu-lab/stanford_alpaca)Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.30857#S4.SS1.p1.1)\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale,et al\.\(2023\)LLaMA 2: open foundation and fine\-tuned chat models\.arXiv preprint arXiv:2307\.09288\.Cited by:[§4\.3](https://arxiv.org/html/2605.30857#S4.SS3.p2.2)\.
- Y\. Wang, Y\. Kordi, S\. Mishra, A\. Liu, N\. A\. Smith, D\. Khashabi, and H\. Hajishirzi \(2023\)Self\-Instruct: aligning language models with self\-generated instructions\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 13484–13508\.Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p2.1)\.
- T\. Wu, L\. Luo, Y\.\-F\. Li, S\. Pan, T\.\-T\. Vu, and G\. Haffari \(2024\)Continual learning for large language models: a survey\.arXiv preprint arXiv:2402\.01364\.Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p2.1)\.
- M\. Xia, S\. Malladi, S\. Gururangan, S\. Arora, and D\. Chen \(2024a\)LESS: selecting influential data for targeted instruction tuning\.InProceedings of the 41st International Conference on Machine Learning,pp\. 54104–54132\.Cited by:[§2\.2](https://arxiv.org/html/2605.30857#S2.SS2.p1.1)\.
- T\. Xia, B\. Yu, K\. Dang, A\. Yang, Y\. Wu, Y\. Tian, Y\. Chang, and J\. Lin \(2024b\)Rethinking data selection at scale: random selection is almost all you need\.arXiv preprint arXiv:2410\.09335\.Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p2.1)\.
- Y\. Xiao, H\. Ye, L\. Chen, H\. T\. Ng, L\. Bing, X\. Li, and R\. K\. Lee \(2025\)Finding the sweet spot: preference data construction for scaling preference optimization\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 12538–12552\.External Links:[Link](https://aclanthology.org/2025.acl-long.615/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.615),ISBN 979\-8\-89176\-251\-0Cited by:[Limitations](https://arxiv.org/html/2605.30857#Sx1.p1.1)\.
- C\. Xu, Q\. Sun, K\. Zheng, X\. Geng, P\. Zhao, J\. Feng, C\. Tao, Q\. Lin, and D\. Jiang \(2024\)WizardLM: empowering large pre\-trained language models to follow complex instructions\.InThe Twelfth International Conference on Learning Representations,Cited by:[§4\.1](https://arxiv.org/html/2605.30857#S4.SS1.p1.1)\.
- Y\. Xu, Y\. Yao, Y\. Huang, M\. Qi, M\. Wang, B\. Gu, and N\. Sundaresan \(2023\)Rethinking the instruction quality: lift is what you need\.arXiv preprint arXiv:2310\.07317\.Cited by:[§2\.1](https://arxiv.org/html/2605.30857#S2.SS1.p1.1)\.
- Y\. Yang, S\. Mishra, J\. Chiang, and B\. Mirzasoleiman \(2024\)SmalltoLarge \(S2L\): scalable data selection for fine\-tuning large language models by summarizing training trajectories of small models\.Advances in Neural Information Processing Systems37,pp\. 83465–83496\.Cited by:[§2\.1](https://arxiv.org/html/2605.30857#S2.SS1.p2.1)\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\)HellaSwag: can a machine really finish your sentence?\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,pp\. 4791–4800\.Cited by:[4th item](https://arxiv.org/html/2605.30857#S4.I1.i4.p1.1)\.
- D\. Zhang, Q\. Dai, and H\. Peng \(2025\)The best instruction\-tuning data are those that fit\.arXiv preprint arXiv:2502\.04194\.Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p2.1),[§1](https://arxiv.org/html/2605.30857#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.30857#S2.SS1.p2.1)\.
- Y\. Zhao, L\. Du, X\. Ding, Y\. Ouyang, H\. Wang, K\. Xiong, J\. Gao, Z\. Sun, D\. Xu, Y\. Qing,et al\.\(2025\)Beyond similarity: a gradient\-based graph method for instruction tuning data selection\.arXiv preprint arXiv:2502\.11062\.Cited by:[§2\.1](https://arxiv.org/html/2605.30857#S2.SS1.p2.1)\.
- C\. Zhou, P\. Liu, P\. Xu, S\. Iyer, J\. Sun, Y\. Mao, X\. Ma, A\. Efrat, P\. Yu, L\. Yu,et al\.\(2024\)LIMA: less is more for alignment\.Advances in Neural Information Processing Systems36\.Cited by:[§1](https://arxiv.org/html/2605.30857#S1.p2.1)\.
- H\. Zhou, T\. Liu, Q\. Ma, Y\. Zhang, J\. Yuan, P\. Liu, Y\. You, and H\. Yang \(2025\)DAVIR: data selection via implicit reward for large language models\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 9220–9237\.Cited by:[§2\.1](https://arxiv.org/html/2605.30857#S2.SS1.p2.1)\.
## Appendix A Activation Level of Neurons
We draw on results from previous work \(Bricken et al\., [2023](https://arxiv.org/html/2605.30857#bib.bib39)\) on the relationship between activation levels and the corresponding text\. Table [15](https://arxiv.org/html/2605.30857#A1.T15) shows examples of strongly activated neurons alongside cases where the activation value is around 1, the threshold we use to filter activation values\.
Table 15: Examples of activation levels and their corresponding text
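The thresholding described above can be illustrated with a minimal sketch\. It assumes that an "activation tag" is simply the index of a neuron whose activation exceeds the threshold of 1; the function name `activation_tags`, the strict `>` comparison, and the toy values are hypothetical and are not taken from the paper's implementation\.

```python
from typing import Sequence, Set

# Threshold stated in this appendix; everything else here is an assumption.
ACTIVATION_THRESHOLD = 1.0


def activation_tags(activations: Sequence[float],
                    threshold: float = ACTIVATION_THRESHOLD) -> Set[int]:
    """Return the indices of neurons whose activation exceeds the threshold.

    Assumed rule: a neuron contributes a tag only if its activation is
    strictly greater than the threshold.
    """
    return {i for i, value in enumerate(activations) if value > threshold}


if __name__ == "__main__":
    # Hypothetical activation values for five neurons on one instruction.
    toy_activations = [0.2, 1.7, 0.99, 3.4, 1.0]
    print(sorted(activation_tags(toy_activations)))
```

Under this assumed rule, only the neurons at indices 1 and 3 in the toy input are kept, since a value of exactly 1\.0 does not exceed the threshold\.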
## Appendix B Additional Visualization Results
This section provides the PCA visualizations and similarity heatmaps for layers 20 and 27, which were omitted from the main text because of space limitations\.


Figure 10: PCA visualization of activation tag vectors at layers 20 and 27 of Llama\-3\.2\-3B\-Instruct for five instruction categories: code \(orange\), math \(purple\), law \(green\), medical \(red\), and history \(blue\)\. The code and history categories are more clearly separated from the remaining three categories\. This observation is intuitively reasonable: code\-related instructions involve distinct programming syntax and history\-related queries primarily require factual recall, whereas the other three domains emphasize logical reasoning processes\.

Figure 11: Heatmap of average shared activation tags between instruction categories at layers 20 and 27 of Llama\-3\.2\-3B\-Instruct\. Consistent with Figure [4](https://arxiv.org/html/2605.30857#S4.F4), diagonal values \(intra\-domain\) exceed off\-diagonal values \(cross\-domain\), confirming the robustness of the correlation between activation patterns and data features across deeper layers\.
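The quantity plotted in such a heatmap can be sketched as follows\. This is a minimal illustration, assuming each instruction is represented by a set of activation tags and that each cell is the mean size of the pairwise tag intersection; the function name `mean_shared_tags` and the toy data are hypothetical, and the paper's exact averaging procedure may differ\.

```python
from itertools import combinations, product
from typing import Dict, List, Set, Tuple


def mean_shared_tags(
    tags_by_category: Dict[str, List[Set[int]]],
) -> Dict[Tuple[str, str], float]:
    """Average number of shared tags for every ordered pair of categories.

    Intra-category cells average over distinct pairs of instructions;
    cross-category cells average over all pairs drawn from the two categories.
    """
    heatmap: Dict[Tuple[str, str], float] = {}
    categories = list(tags_by_category)
    for a in categories:
        for b in categories:
            if a == b:
                pairs = list(combinations(tags_by_category[a], 2))
            else:
                pairs = list(product(tags_by_category[a], tags_by_category[b]))
            overlaps = [len(x & y) for x, y in pairs]
            heatmap[(a, b)] = sum(overlaps) / len(overlaps) if overlaps else 0.0
    return heatmap


if __name__ == "__main__":
    # Hypothetical tag sets for two instructions per category.
    toy = {
        "code": [{1, 2, 3}, {1, 2, 4}],
        "history": [{7, 8, 9}, {7, 8, 3}],
    }
    for cell, value in mean_shared_tags(toy).items():
        print(cell, value)
```

On this toy input the diagonal cells are 2\.0 and the cross\-category cells are 0\.25, which mirrors the intra\-domain versus cross\-domain pattern described in the caption\.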
## Appendix C Dataset\-Internal Characteristics: Representative Examples
Table [16](https://arxiv.org/html/2605.30857#A3.T16) presents representative instruction examples from each of the five domains used in the preliminary experiment, illustrating the dataset\-internal structural differences discussed in Section [4\.4](https://arxiv.org/html/2605.30857#S4.SS4)\.
| Domain | Internal Structure | Representative Instruction |
| --- | --- | --- |
| Code | Programming syntax: code blocks, `def`/`class`/`import` tokens | You are given a Python class method that processes a string `sval` and extracts certain keywords based on the positions stored in the `cut` list\. Your task is to implement a function that takes the input string `sval` and the `cut` list, and returns the extracted keywords as a dictionary… Function signature: `def extract_keywords(sval: str, cut: List[int]) -> Dict[str, Union[int, str]]` … |
| History | Uniform passage\-based MCQ: read excerpt → select answer | This question refers to the following information\. \[Primary source excerpt on a historical event…\] Which of the following best describes the author’s view? \[Multiple options\] |
| Math | Arithmetic reasoning | If Clover walks 1\.5 miles in the morning and 1\.5 miles in the evening every day, how many miles does he walk in 30 days? \(Solution: 1\.5 \+ 1\.5 = 3 miles/day; 3 × 30 = 90 miles\) |
| Law | Mixed task types: rule\-retrieval from statutes vs\. multi\-step legal scenario reasoning | \[Rule\-retrieval\] Congress enacts a $100 tax on handgun sales to private individuals\. Will this survive a constitutional challenge? \(A\) Yes, if Congress could have banned possession of handguns outright\. \(B\) Yes, if… \[Scenario reasoning\] As part of his defense to a murder charge, a defendant offered testimony that he was committing a bank robbery in another state on the day the victim was killed\. The testimony is: \(A\) admissible as not hearsay\. \(B\) admissible as an admission\. … |
| Medical | Mixed task types: knowledge explanation vs\. symptom\-based consultation | \[Knowledge\] What does “androgenic” mean in the context of anabolic\-androgenic steroids? \[Consultation\] \[Clinical case description: patient history, symptoms, treatment course…\] What clinical management approach is most appropriate? |

Table 16: Representative instructions from each domain in the preliminary experiment\. Code and history domains exhibit highly uniform task structures \(programming syntax and passage\-based MCQ, respectively\), driving their pronounced separation in activation space\. Mathematics, law, and medical domains exhibit greater task diversity and share a reliance on natural language analytical reasoning, leading to overlapping activation patterns and smaller inter\-domain separation\.