SLAP: Stratified Loss-based Pruning for On-Policy Data-Efficient Instruction Tuning

arXiv cs.CL Papers

Summary

Proposes SLAP, a novel data selection framework for efficient instruction tuning of large language models that evaluates batch learnability and uses stratified sampling to achieve superior performance with 20-40% less training data.

arXiv:2605.23969v1 Announce Type: new Abstract: Instruction tuning has optimized the specialized capabilities of large language models (LLMs), but it often requires extensive datasets and prolonged training times. The challenge lies in developing specific capabilities by identifying useful data and efficiently fine-tuning. High-quality and diverse pruned data can help models achieve lossless performance at a lower cost. In this paper, we propose SLAP, a novel batch-aware data selection framework that evaluates the learnability of entire batch compositions rather than individual samples. SLAP ensures comprehensive data distribution coverage through distribution-aware stratified sampling while maximizing intra-batch diversity through relative distance optimization. By leveraging Hessian-approximated gradient information for dynamic batch selection, SLAP significantly outperforms existing state-of-the-art methods across multiple model architectures (LLaMA, ChatGLM) and diverse downstream tasks including multi-turn dialogue, multilingual translation, and question answering. Most notably, SLAP achieves superior performance with 20-40% less training data compared to full dataset training, substantially reducing computational costs while maintaining or improving model capabilities. These results establish SLAP as a powerful approach for efficient and effective instruction tuning of large language models.

Cached at: 05/26/26, 08:59 AM

# SLAP: Stratified Loss-based Pruning for On-Policy Data-Efficient Instruction Tuning
Source: [https://arxiv.org/html/2605.23969](https://arxiv.org/html/2605.23969)
1 Alibaba International Digital Commerce Group, HangZhou, China

2 Hangzhou Lingju Intelligence AI Lab, Hangzhou, China

3 Hangzhou Dianzi University, HangZhou, China

Email: renshugu@hdu.edu.cn

###### Abstract

Instruction tuning has optimized the specialized capabilities of large language models (LLMs), but it often requires extensive datasets and prolonged training times. The challenge lies in developing specific capabilities by identifying useful data and efficiently fine-tuning. High-quality and diverse pruned data can help models achieve lossless performance at a lower cost. In this paper, we propose SLAP, a novel batch-aware data selection framework that evaluates the learnability of entire batch compositions rather than individual samples. SLAP ensures comprehensive data distribution coverage through distribution-aware stratified sampling while maximizing intra-batch diversity through relative distance optimization. By leveraging Hessian-approximated gradient information for dynamic batch selection, SLAP significantly outperforms existing state-of-the-art methods across multiple model architectures (LLaMA, ChatGLM) and diverse downstream tasks including multi-turn dialogue, multilingual translation, and question answering. Most notably, SLAP achieves superior performance with 20-40% less training data compared to full dataset training, substantially reducing computational costs while maintaining or improving model capabilities. These results establish SLAP as a powerful approach for efficient and effective instruction tuning of large language models.

## 1 Introduction

Instruction tuning has become essential for enhancing LLM capabilities \[[18](https://arxiv.org/html/2605.23969#bib.bib27)\]. While recent research focuses on high-quality data collection and on-policy training strategies \[[10](https://arxiv.org/html/2605.23969#bib.bib14),[31](https://arxiv.org/html/2605.23969#bib.bib28),[6](https://arxiv.org/html/2605.23969#bib.bib29)\], the challenge of data quality persists, as duplicate and low-quality data can degrade model performance \[[9](https://arxiv.org/html/2605.23969#bib.bib36)\].

Current data selection methods fall into two categories: off-policy and on-policy approaches \[[36](https://arxiv.org/html/2605.23969#bib.bib13),[10](https://arxiv.org/html/2605.23969#bib.bib14),[37](https://arxiv.org/html/2605.23969#bib.bib26)\]. Off-policy methods rely on static features such as loss \[[12](https://arxiv.org/html/2605.23969#bib.bib6),[29](https://arxiv.org/html/2605.23969#bib.bib7)\], influence scores \[[14](https://arxiv.org/html/2605.23969#bib.bib8),[34](https://arxiv.org/html/2605.23969#bib.bib24)\], or embedding-based metrics, and therefore cannot adapt to model updates. On-policy methods \[[16](https://arxiv.org/html/2605.23969#bib.bib43),[4](https://arxiv.org/html/2605.23969#bib.bib44),[10](https://arxiv.org/html/2605.23969#bib.bib14),[20](https://arxiv.org/html/2605.23969#bib.bib23),[21](https://arxiv.org/html/2605.23969#bib.bib5)\] calculate importance scores in real time but require substantial computational resources \[[16](https://arxiv.org/html/2605.23969#bib.bib43)\]. While Feng \[[10](https://arxiv.org/html/2605.23969#bib.bib14)\] improved batch selection through orthogonal representativeness, data learnability remains unexplored.

This paper proposes an on-policy data selection framework, dubbed SLAP, that considers batch learnability, data coverage, data diversity, and computational efficiency. SLAP evaluates the learnability of entire batch compositions rather than individual samples. To achieve comprehensive data distribution coverage, SLAP approximates the NP-hard global search in coreset selection by distribution-aware stratified sampling. We provide a theoretical analysis for coreset selection in Natural Language Processing (NLP) tasks from the perspective of geometric coverage. Meanwhile, SLAP maximizes intra-batch diversity through relative distance optimization. This prevents selecting samples that contain redundant information \[[10](https://arxiv.org/html/2605.23969#bib.bib14)\] and increases the learnability of the data. Inspired by the Adam algorithm \[[13](https://arxiv.org/html/2605.23969#bib.bib34)\], we integrate second-moment cumulative gradient updates to reduce fluctuations from random sampling, helping the model recognize key features consistently across batches.

Through extensive experiments, we demonstrate that SLAP achieves optimal loss across various pruning rates and LLMs \(LLaMa3, ChatGLM3\), consistently maintaining or improving performance while reducing computational costs by 20\-40%\. Our results show particular strength in handling multi\-turn dialogues, multilingual translation, and complex question\-answering tasks, suggesting broad applicability across different Natural Language Processing \(NLP\) domains\. Furthermore, SLAP exhibits robust generalization capabilities, maintaining consistent performance even with reduced training data, making it particularly valuable for resource\-constrained scenarios\.

Our contributions can be summarized as follows:

1. We propose an on-policy batch-aware data pruning strategy that preserves data coverage through distribution-aware stratified sampling and maximizes data diversity within batches.
2. We propose a Hessian-approximated gradient optimization method to maximize sample distance in high-dimensional feature space, which is more precise and more dynamic than embedding-based features.
3. We provide a theoretical analysis for coreset selection in instruction tuning of large language models. Further, we show how to solve the NP-hard problem in practice using an efficient approximation.
4. We evaluate our approach on three diverse downstream datasets: llama3-Chinese-chat (LLaMaQA), WikiMatrix, and our net literature dialogue dataset (NetLit). Results demonstrate that SLAP consistently achieves superior performance across various pruning rates and LLMs (LLaMa3 and ChatGLM3) with lower computational cost.

## 2 Related work

Coreset selection. Existing methods \[[8](https://arxiv.org/html/2605.23969#bib.bib31),[35](https://arxiv.org/html/2605.23969#bib.bib32)\] focus on creating representative data subsets for efficient training. While traditional approaches prioritize difficult samples \[[26](https://arxiv.org/html/2605.23969#bib.bib30)\], this can lead to including outliers and noise. Although \[[33](https://arxiv.org/html/2605.23969#bib.bib25)\] addresses this by selecting median-difficulty samples, the method does not consider data diversity. Zheng's \[[36](https://arxiv.org/html/2605.23969#bib.bib13)\] stratified approach with K strata improves distribution coverage but fails to guarantee the learning value of selected samples.

On-policy batch selection. Recent approaches fall into two categories: those that depend on reference models \[[5](https://arxiv.org/html/2605.23969#bib.bib47),[17](https://arxiv.org/html/2605.23969#bib.bib48)\] and those that do not \[[10](https://arxiv.org/html/2605.23969#bib.bib14),[22](https://arxiv.org/html/2605.23969#bib.bib20)\]. While Feng \[[10](https://arxiv.org/html/2605.23969#bib.bib14)\] optimizes directional diversity and Qin \[[22](https://arxiv.org/html/2605.23969#bib.bib20)\] achieves acceleration through selective pruning, both have limitations: they either neglect data learnability or rely on fixed thresholds. SLAP overcomes these limitations by combining stratified loss sampling with distance-based diversity control.

Feature Selection. Traditional embedding-based approaches in image processing \[[26](https://arxiv.org/html/2605.23969#bib.bib30),[36](https://arxiv.org/html/2605.23969#bib.bib13),[33](https://arxiv.org/html/2605.23969#bib.bib25)\] fail to capture dynamic training contributions and model changes. Recent gradient-based methods \[[32](https://arxiv.org/html/2605.23969#bib.bib19),[27](https://arxiv.org/html/2605.23969#bib.bib33)\] show promise in dynamic feature capture and influence estimation. SLAP builds upon this direction, utilizing gradients for both coverage and diversity assessment.

## 3 Method

![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/chap03/method1.png) Figure 1: The workflow of SLAP. Step 1: We divide a batch of data into $K$ strata based on loss. Then, we select $|S|$ samples according to the probability of normalized exp(loss) and count the number of selected samples in each stratum. Step 2: We calculate the Hessian-approximated gradient $H_t$ of the data as features. Step 3: For stratum 1, we randomly initialize a point. We calculate the $L_2$ distance to the first point and select the farthest point from the same stratum as the second point. We update the minimum distance from the remaining points to the selected points and repeat until $|S_i|$ samples are chosen. For strata 2 and 3, selecting points for the new stratum also takes into account the points already selected from the previous strata. Finally, we obtain a diversified subset whose points are relatively far from each other.

Here, we describe how SLAP adapts the gradient and stratified sampling to select samples that effectively induce a target capability. In [subsection 3.1](https://arxiv.org/html/2605.23969#S3.SS1) we begin with a theoretical analysis, based on geometric coverage, of how to select a coreset that can approximately replace the full set. Our analysis extends traditional coverage and diversity metrics to high-dimensional gradient spaces through rigorous mathematical derivation. Given the significant convergence and precision of the Hessian when approaching the optimal solution, we utilize the Hessian-approximated gradient throughout the optimization process ([subsection 3.2](https://arxiv.org/html/2605.23969#S3.SS2)). In [subsection 3.3](https://arxiv.org/html/2605.23969#S3.SS3), we detail the SLAP framework, which approximates the theory presented in [subsection 3.1](https://arxiv.org/html/2605.23969#S3.SS1) through stratified sampling, complemented by dynamic on-policy batch selection with $O(n)$ computational complexity.

### 3.1 Geometric coverage of Coreset Selection

Ozan Sener \[[24](https://arxiv.org/html/2605.23969#bib.bib21)\] explained coreset selection from the perspective of geometric spatial distribution, and proved that the assumption of zero training error holds when the model loss function satisfies Lipschitz continuity. The zero training error assumption means that for a model trained on a coreset, a bounded risk can be attained across the entire training dataset. Zheng \[[36](https://arxiv.org/html/2605.23969#bib.bib13)\] analyzed computer vision tasks based on this theory.

Below we provide an analysis for NLP tasks to explain why an LLM trained on a coreset that covers the full set with radius $r$ has a bounded training risk. We represent the full dataset of an NLP task as $S=\{(x_i,y_i)\}_{i=1}^{N}$, where $x_i=(x_i^1,x_i^2,\ldots,x_i^t)$ is the input token sequence and $y_i=(y_i^1,y_i^2,\ldots,y_i^t)$ is the prediction sequence, with $y_i^t\in[\mathcal{C}]$ the vocabulary label of a token. A model trained on a coreset $S'$ has a training risk bounded in terms of the covering radius $r$ if the loss function $l(x,y;h_S)$ is $\lambda_l$-Lipschitz continuous for all $y$ and bounded by $L$ (Lipschitz constant), and the cross-entropy loss function $l(x,y;h_{S'})=-\sum_{i\in S'}y_i\log(\hat{y}_i)$ is $\lambda_\eta$-Lipschitz continuous. Assume further that $h_{S'}$ is an $r$-cover of $h_S$ and that $|l(x,y;h_S)-l(x,y;h_{S'})|=0$ for all $(x,y)\in S'$. Applying Hoeffding's bound, we conclude that with probability at least $1-\gamma$:

$$\left|\frac{1}{|S|}\sum_{i\in S}l\left(x_i,y_i;h_{S}\right)-\frac{1}{|S'|}\sum_{j\in S'}l\left(x_j,y_j;h_{S'}\right)\right|\leq r\left(\lambda_l+\lambda_\eta LC\right)+\sqrt{\frac{L^{2}\log\left(\frac{1}{\gamma}\right)}{2n}}\tag{1}$$

This means that given a full set $S$ and a cover radius $r$ for $S$, we obtain a coverage percentage $p$ on the original distribution $P\mu$. Intuitively, each sample is a point in a high-dimensional space, and balls of radius $r$ centered at the coreset points cover the full set.
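To make the covering radius concrete, here is a minimal sketch that computes $r$ for a toy point set and evaluates the right-hand side of Eq. (1). The helper names `covering_radius` and `bound_rhs`, the Euclidean metric, and all numeric constants are assumptions chosen for illustration and are not taken from the paper.

```python
import math

def covering_radius(full_set, coreset):
    """Smallest r such that every point of full_set lies within r of some coreset point."""
    return max(min(math.dist(p, c) for c in coreset) for p in full_set)

def bound_rhs(r, lam_l, lam_eta, L, C, gamma, n):
    """Right-hand side of Eq. (1): r(lam_l + lam_eta*L*C) + sqrt(L^2 log(1/gamma) / (2n))."""
    return r * (lam_l + lam_eta * L * C) + math.sqrt(L ** 2 * math.log(1.0 / gamma) / (2 * n))

# Toy 2-D "feature" points; real features would be high-dimensional gradients.
full_set = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
coreset = [(0.0, 0.0), (3.0, 4.0)]
r = covering_radius(full_set, coreset)

# Illustrative constants only: a smaller radius gives a tighter bound.
loose = bound_rhs(r, lam_l=1.0, lam_eta=1.0, L=1.0, C=10, gamma=0.05, n=1000)
tight = bound_rhs(r / 2, lam_l=1.0, lam_eta=1.0, L=1.0, C=10, gamma=0.05, n=1000)
```

The first term scales linearly with $r$, which is why a coreset that spreads its points over the data distribution tightens the bound.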

### 3.2 Hessian-approximated gradient optimization

Common sample representations typically rely on embeddings \[[26](https://arxiv.org/html/2605.23969#bib.bib30),[36](https://arxiv.org/html/2605.23969#bib.bib13),[33](https://arxiv.org/html/2605.23969#bib.bib25)\], which primarily capture the intrinsic features of samples. However, this approach tends to overlook the model's influence during the training process. In contrast, we adopt gradient representations \[[32](https://arxiv.org/html/2605.23969#bib.bib19),[27](https://arxiv.org/html/2605.23969#bib.bib33)\], as gradients offer a dynamic measure of sample relevance, reflecting each sample's influence on model updates. Moreover, in high-dimensional space, the distances between gradient features differentiate the relevance among samples more clearly.

In NLP tasks, we adopt a sequence-level gradient approach and use the sum of the token gradients from the last layer (lm_head) of the LLM to represent the entire sample sequence \[[36](https://arxiv.org/html/2605.23969#bib.bib13)\]. We use summation because it preserves the feature weights of important tokens. The gradient at lm_head captures highly abstracted features of the sample. The gradients from this layer are defined as follows:

$$\text{gradient}_{\text{lm\_head}}=\nabla\mathcal{L}_{\text{output}}(\theta;h_{\text{last\_layer}})\cdot h_{\text{last\_layer}}^{T}\tag{2}$$

The term $\nabla\mathcal{L}_{\text{output}}(\theta;h_{\text{last\_layer}})$ signifies the gradient of the output layer with the model parameters $\theta$, and $h_{\text{last\_layer}}^{T}$ is the last-layer hidden states. In addition, the $L_2$ norm \[[11](https://arxiv.org/html/2605.23969#bib.bib50),[2](https://arxiv.org/html/2605.23969#bib.bib49)\] is used for the gradient from the lm_head layer.
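As a rough illustration of Eq. (2), the sketch below computes the summed lm_head gradient for one sequence in plain Python. It assumes a bias-free linear output layer trained with cross-entropy, for which the gradient with respect to the logits of one token is `softmax(logits) - onehot(target)`. The function names and the tiny dimensions are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def lm_head_gradient(W, hidden_states, targets):
    """Sum over tokens of (softmax(W h) - onehot(y)) h^T, as a [vocab][hidden] matrix."""
    vocab, hidden = len(W), len(W[0])
    grad = [[0.0] * hidden for _ in range(vocab)]
    for h, y in zip(hidden_states, targets):
        logits = [sum(w * x for w, x in zip(row, h)) for row in W]
        p = softmax(logits)
        for v in range(vocab):
            err = p[v] - (1.0 if v == y else 0.0)  # gradient w.r.t. logit v
            for k in range(hidden):
                grad[v][k] += err * h[k]  # outer product with the hidden state
    return grad

def l2_norm(grad):
    """L2 (Frobenius) norm of the gradient matrix."""
    return math.sqrt(sum(g * g for row in grad for g in row))

# One-token sequence, vocabulary of 2, hidden size 2, zero-initialized weights.
W = [[0.0, 0.0], [0.0, 0.0]]
grad = lm_head_gradient(W, hidden_states=[[1.0, 0.0]], targets=[0])
norm = l2_norm(grad)
```

In a real model one would read this gradient from the autograd engine rather than recompute it by hand; the explicit loops are only meant to show where the outer product in Eq. (2) comes from.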

Given the significant convergence and precision of the Hessian when approaching the optimal solution, we substitute Hessian-approximated gradient optimization for the original gradient norm. Hessian-approximated gradients $H_t$ are derived from the norm of the gradient $\mathbf{g}_t=\nabla\mathcal{L}_{\text{output}}(\theta;h_{\text{last\_layer}})$, adjusted by the second moment:

$$H_t=\left\|\frac{\mathbf{g}_t}{\sqrt{\hat{\mathbf{v}}_t}}\right\|_2\tag{3}$$

where $H_t\in\mathbb{R}^{D}$, $D$ denotes the vocabulary size of the model, and $\hat{\mathbf{v}}_t$ is the second-moment estimate.
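The second-moment scaling in Eq. (3) can be sketched as follows. The paper does not spell out how $\hat{\mathbf{v}}_t$ is maintained, so this sketch assumes the standard Adam recurrence with bias correction; the class `SecondMomentScaler`, its `eps` term, and the default `beta2` are assumptions for illustration.

```python
import math

class SecondMomentScaler:
    """Tracks an Adam-style running second moment and rescales gradients by it."""

    def __init__(self, dim, beta2=0.999, eps=1e-8):
        self.v = [0.0] * dim
        self.beta2 = beta2
        self.eps = eps
        self.t = 0

    def update(self, g):
        """v_t = beta2 * v_{t-1} + (1 - beta2) * g^2, elementwise."""
        self.t += 1
        self.v = [self.beta2 * v + (1 - self.beta2) * gi * gi
                  for v, gi in zip(self.v, g)]

    def scale(self, g):
        """Return g / sqrt(v_hat) elementwise; call update() at least once first."""
        corr = 1 - self.beta2 ** self.t  # bias correction, as in Adam
        return [gi / (math.sqrt(v / corr) + self.eps)
                for gi, v in zip(g, self.v)]

def h_norm(scaled):
    """L2 norm of the rescaled gradient, matching the form of Eq. (3)."""
    return math.sqrt(sum(s * s for s in scaled))

scaler = SecondMomentScaler(dim=2)
scaler.update([3.0, 4.0])
scaled = scaler.scale([3.0, 4.0])  # each coordinate is close to 1 after one step
value = h_norm(scaled)
```

Because the running estimate accumulates across steps, coordinates with persistently large gradients are damped, which is one way to reduce batch-to-batch fluctuation.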

### 3.3 SLAP: Optimizing On-Policy Batch Selection

Minimizing the upper bound on the coreset loss derived in [subsection 3.1](https://arxiv.org/html/2605.23969#S3.SS1) is an NP-hard \[[24](https://arxiv.org/html/2605.23969#bib.bib21),[3](https://arxiv.org/html/2605.23969#bib.bib4)\] problem. We employ a stratified sampling method to effectively approximate a search across the entire range of the global set. As illustrated in [Figure 1](https://arxiv.org/html/2605.23969#S3.F1), SLAP first computes the loss for each individual sample through forward propagation \[[12](https://arxiv.org/html/2605.23969#bib.bib6),[29](https://arxiv.org/html/2605.23969#bib.bib7)\] and employs stratified sampling to select a representative subset from a batch of samples $B$ in each training step. Specifically, we select $n_i$ samples from each stratum with the objective of maximizing the distance between the Hessian-approximated gradients of the selected samples.

Distribution-aware stratified sampling based on loss probability. To align our sampling with the overall data distribution and account for both challenging and easy samples, we use a loss probability-based stratification approach. This method is detailed in [algorithm 1](https://arxiv.org/html/2605.23969#algorithm1). We partition the batch into $K$ strata based on loss values. The width of each stratum is $(\text{max\_loss}-\text{min\_loss})/K$. Subsequently, we calculate the size of the selected sample set as $|S|=|B|\times\alpha$, where $\alpha$ denotes the pruning rate. In our sampling process, we use normalized exp(loss) as the selection probability for each sample and sample from the batch without replacement to create the initial set. These samples are then assigned to the $K$ strata, which gives the number of samples to select in each stratum.
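The stratification step above can be sketched in a few lines of plain Python. This is an illustrative reading of the description, not the authors' code: `stratum_quotas` is a hypothetical helper, `alpha` is used exactly as in $|S|=|B|\times\alpha$, and weighted sampling without replacement is done by drawing one index at a time.

```python
import math
import random

def stratum_quotas(losses, alpha, K, rng):
    """Return (strata, quotas): the batch indices in each equal-width loss stratum,
    and how many samples drawn with probability proportional to exp(loss) fall in each."""
    n_select = max(1, round(alpha * len(losses)))
    m = max(losses)
    weights = [math.exp(l - m) for l in losses]  # softmax numerator, shifted for stability

    # Weighted sampling without replacement, one draw at a time.
    remaining = list(range(len(losses)))
    chosen = []
    for _ in range(n_select):
        pick = rng.choices(remaining, weights=[weights[i] for i in remaining], k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)

    lo, hi = min(losses), max(losses)
    width = (hi - lo) / K or 1.0  # guard against a batch with identical losses

    def stratum(l):
        return min(int((l - lo) / width), K - 1)  # the max loss goes in the last stratum

    strata = [[] for _ in range(K)]
    for i, l in enumerate(losses):
        strata[stratum(l)].append(i)
    quotas = [0] * K
    for i in chosen:
        quotas[stratum(losses[i])] += 1
    return strata, quotas

losses = [0.1, 0.2, 1.0, 1.1, 2.0, 2.1, 3.0, 3.1]
strata, quotas = stratum_quotas(losses, alpha=0.5, K=3, rng=random.Random(0))
```

Because the draw is without replacement from the batch itself, each quota can never exceed the number of samples available in its stratum.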

Maximizing $L_2$ Hessian-approximated gradient distance within the batch. In each stratum, we aim to maximize the $L_2$ Hessian-approximated gradient distance among points to ensure greater separation within the batch, as illustrated in [Figure 2](https://arxiv.org/html/2605.23969#S3.F2). Specifically, given the number of points to be sampled in stratum $i$, denoted $|S_i|$, we begin by randomly selecting one point. We then compute the distances from the remaining points in the stratum to this selected point, referred to as $distance$, and take the farthest point as the second point. Next, we calculate the distances $distance'$ from the remaining points to this second point and update $distance=\min(distance,distance')$. This iterative process continues until we have selected the required number of points.
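The greedy max-min procedure for a single stratum can be sketched as below. The function name `farthest_point_select` and the `seed_index` argument (used here in place of a random start so the result is reproducible) are assumptions for illustration; features stand in for the Hessian-approximated gradients.

```python
import math

def farthest_point_select(features, n_select, seed_index=0):
    """Greedy max-min selection: repeatedly add the point whose distance
    to the already-selected set is largest."""
    selected = [seed_index]
    # distance[i] = distance from point i to its nearest selected point
    distance = [math.dist(f, features[seed_index]) for f in features]
    while len(selected) < min(n_select, len(features)):
        candidates = [i for i in range(len(features)) if i not in selected]
        nxt = max(candidates, key=lambda i: distance[i])
        selected.append(nxt)
        # distance = min(distance, distance') as in the text above
        distance = [min(d, math.dist(f, features[nxt]))
                    for d, f in zip(distance, features)]
    return selected

features = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
picked = farthest_point_select(features, n_select=3, seed_index=0)
```

Starting from the point at the origin, the sketch first picks the point at 10 (farthest away), then the point at 5, which is the remaining point farthest from both selected points.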

![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/chap03/kmeans++1.png) Figure 2: Maximizing $L_2$ Distance Within the Batch. Step 1: For stratum 1, randomly initialize a point and calculate the $L_2$ distance from the points in the same stratum to the first point. Step 2-3: Select the point that is farthest from the first point as the second point, then update the minimum distance from the remaining points to the selected points and iteratively choose $|S_i|$ (e.g. 3) samples. Steps 4 and 7: For stratum 2, consider the selected points from the previous strata. Calculate the minimum distance from the points in stratum 2 to the selected points and take the maximum one as the first point. Then, iterate to complete the selection for stratum 2. Step 9: Finally, obtain a subset that is relatively far apart and diverse within the batch.

![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/chap01/hard.png)

![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/chap01/CCS.png)

![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/chap01/stratified.png)

Figure 3: The data distribution under hard sampling, CCS, and SLAP.

We further compare the coreset distribution of our data selection strategy with existing strategies. As shown in [Figure 3](https://arxiv.org/html/2605.23969#S3.F3), hard sampling prioritizes the selection of samples starting from the most challenging (highest scoring) ones. Conversely, CCS emphasizes coverage, ensuring nearly equal representation of samples across different score ranges. SLAP integrates these approaches, balancing the consideration of both difficult and simple samples. This strategy prevents the model from experiencing excessive initial difficulty, thereby avoiding a significant increase in loss.

Input: Batch $B$; current model $f_\theta$; pruning rate $\alpha$; the number of strata $K$; Hessian-approximated gradient $H_t$.

Output: coreset $S$

1. loss $\leftarrow f_\theta(B)$;
2. coreset size $|S|=\alpha * |B|$;
3. coreset $S\leftarrow$ Sample $|S|$ samples without replacement according to the probability exp(loss);
4. $B_0,B_1,\ldots,B_{K-1}\leftarrow$ Split loss in $|B|$ into $K$ ranges with an even range width;
5. $|S_0|,|S_1|,\ldots,|S_{K-1}|\leftarrow$ Number of points selected in each stratum computed by $exp(loss)$;
6. $S_{00}\leftarrow$ Randomly select the first point in $B_0$;
7. for $i$ in range(0, $K-1$) do
   1. $distance\leftarrow\min(L_2$ Hessian gradient distances of $B_i$ to selected points $S)$;
   2. for $j$ in range(0, $|S_i|$) do
      1. $S_{ij}\leftarrow\arg\max(distance)$;
      2. $distance'\leftarrow$ Calculate the distances of the remaining points to $S_{ij}$;
      3. $distance\leftarrow\min(distance,distance')$;
   3. end for
8. end for
9. return coreset $S$;

Algorithm 1: SLAP
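The cross-stratum part of Algorithm 1 (the nested loops) can be sketched as follows. This is one possible reading, with several assumptions: `slap_select` is a hypothetical name, the loop visits all $K$ strata, the random starting point is taken from the first stratum with a non-zero quota, and `features` stands in for the Hessian-approximated gradient features.

```python
import math
import random

def slap_select(features, strata, quotas, rng):
    """Pick quotas[i] points from strata[i], each time choosing the candidate
    farthest from every point selected so far (including earlier strata)."""
    selected = []
    for members, quota in zip(strata, quotas):
        if quota == 0 or not members:
            continue
        candidates = list(members)
        if not selected:
            # Very first point: random initialization.
            first = rng.choice(candidates)
            selected.append(first)
            candidates.remove(first)
            quota -= 1
        # Minimum distance from each candidate to the current selection.
        distance = {i: min(math.dist(features[i], features[s]) for s in selected)
                    for i in candidates}
        for _ in range(min(quota, len(candidates))):
            nxt = max(distance, key=distance.get)  # arg max of the min-distance
            selected.append(nxt)
            del distance[nxt]
            for i in distance:  # distance = min(distance, distance')
                distance[i] = min(distance[i], math.dist(features[i], features[nxt]))
    return selected

features = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
strata = [[0, 1, 2], [3, 4, 5]]
quotas = [2, 1]
coreset = slap_select(features, strata, quotas, rng=random.Random(0))
```

In this toy example the second stratum always contributes index 5, since the point at 12 is the farthest from anything chosen in the first stratum regardless of the random start.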

## 4 Experiment

### 4.1 Experimental Setup

Datasets. We conduct our fine-tuning on the following instruction tuning datasets: (1) Net Literature Dialogue Dataset (NetLit); (2) llama3-Chinese-chat (LLaMaQA; https://modelscope.cn/datasets/baicai003/Llama3-Chinese-dataset); (3) WikiMatrix (https://github.com/facebookresearch/LASER/tree/main/tasks/WikiMatrix) \[[23](https://arxiv.org/html/2605.23969#bib.bib17)\]. NetLit consists of multi-turn dialogues related to web literature. Each entry is 1000 characters long, with a total of 100 million training samples. LLaMaQA is the dataset for the Chinese version of LLaMa3, which includes system instructions, queries, and GPT-4 responses, amounting to 1.69 million training samples. WikiMatrix extracts parallel sentences in all possible language pairs from the content of Wikipedia. We use 10 language pairs with 28 million training samples.

Baselines. We compare SLAP with several baseline methods: random selection, online hard, CCS, and InfoBatch. Random selection randomly selects data from the training datasets. Online hard starts by selecting the data points with the highest loss and continues downwards until enough data points are chosen. CCS \[[36](https://arxiv.org/html/2605.23969#bib.bib13)\] stratifies the data based on loss and randomly selects a fixed number of data points from each stratum. It uses importance scores derived from the computer vision domain. We ran the same experimental analysis for the number of strata $k$ ([Figure 5](https://arxiv.org/html/2605.23969#S4.F5)). Because the number of strata is not a sensitive hyperparameter, we use $k=8$ instead of the CCS default of 50. InfoBatch \[[22](https://arxiv.org/html/2605.23969#bib.bib20)\] randomly removes samples with low information content based on the loss distribution, then adjusts the gradients of the remaining samples to match the original gradients. The threshold is set to the average loss. However, this approach proves inadequate at high pruning rates, so we raise the threshold appropriately for higher pruning rates.
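For orientation, two of the baselines can be sketched in a few lines. These are simplified readings of the descriptions above, not the original implementations: `online_hard` and `infobatch_like` are hypothetical names, and the `1 / (1 - prune_prob)` reweighting of surviving below-threshold samples is an assumption about how the expected gradient is kept unchanged.

```python
import random

def online_hard(losses, n_select):
    """Indices of the n_select highest-loss samples, highest first."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:n_select]

def infobatch_like(losses, prune_prob, rng):
    """Keep every sample at or above the mean loss; drop each below-mean sample
    with probability prune_prob (must be < 1) and upweight the survivors."""
    threshold = sum(losses) / len(losses)
    kept, weights = [], []
    for i, loss in enumerate(losses):
        if loss >= threshold:
            kept.append(i)
            weights.append(1.0)
        elif rng.random() >= prune_prob:
            kept.append(i)
            weights.append(1.0 / (1.0 - prune_prob))  # compensates for pruned peers
    return kept, weights

losses = [0.5, 2.0, 1.0, 3.0]
hard = online_hard(losses, n_select=2)
kept, weights = infobatch_like(losses, prune_prob=0.5, rng=random.Random(0))
```

Online hard concentrates entirely on the high-loss tail, while the InfoBatch-style rule only thins out the low-loss region, which is the contrast the comparison with SLAP relies on.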

Implementation details. We train with two base models: Meta-Llama-3-8B-Instruct \[[1](https://arxiv.org/html/2605.23969#bib.bib16)\] and ChatGLM-3-6B \[[7](https://arxiv.org/html/2605.23969#bib.bib18)\]. The hyper-parameters are as follows. We fine-tune the base model for 1 epoch with AdamW using a cosine learning rate schedule, with $\beta_1=0.9$, $\beta_2=0.999$, and $\epsilon=1e-8$. The initial learning rate is set to $1e-5$. The batch size is set to 320, the context window's maximum length is 2048 tokens, and longer samples are truncated to fit. We use 16 $\times$ A800 80G GPUs for training.

Evaluation Metrics. Through extensive empirical analysis spanning ten consecutive A/B experiments with a substantial user base (n > 200,000), we discovered a statistically significant negative correlation between model loss and user engagement metrics, specifically conversation turn length. As illustrated in [Figure 5](https://arxiv.org/html/2605.23969#S4.F5), lower loss values consistently correspond to higher user retention rates, manifesting as increased conversation duration. This robust correlation, validated at a significance level below 0.01 in all experimental iterations, provides a strong empirical justification for adopting validation loss as our primary evaluation metric. Furthermore, for the sake of objectivity, we also report common metrics such as GPT-4 evaluation, human expert evaluation \[[25](https://arxiv.org/html/2605.23969#bib.bib40),[28](https://arxiv.org/html/2605.23969#bib.bib46)\], BLEU, and ROUGE \[[19](https://arxiv.org/html/2605.23969#bib.bib42),[15](https://arxiv.org/html/2605.23969#bib.bib41)\].

![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/chap04/k_cur.png) Figure 4: Evaluation with different k on NetLit using ChatGLM3 model.

![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/chap03/loss_and_round.png) Figure 5: Loss comparison based on ChatGLM model flywheel training.

### 4.2 Main results

Performance under different datasets. Based on the comparisons in [Figure 6](https://arxiv.org/html/2605.23969#S4.F6), we make the following observations: (1) SLAP exhibits superior performance across a variety of datasets. In particular, on both the NetLit and LLaMaQA datasets, SLAP consistently achieves a significantly lower loss than other methods. (2) SLAP strikes a balance between diversity coverage and task difficulty, thereby achieving enhanced performance across all three downstream tasks. The performance of CCS and InfoBatch varies significantly across datasets and shows contrasting trends. CCS outperforms InfoBatch on both the NetLit and WikiMatrix datasets, suggesting that diversity coverage is particularly crucial for web literature and translation tasks. Conversely, the LLaMaQA dataset places a greater emphasis on task difficulty.

![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/chap04/Onliter20performance.png)

![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/chap04/Wikimatrix20performance.png)

![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/chap04/LLaMaQA20performance_LLaMa1.png)

Figure 6: Evaluation on different datasets using ChatGLM3 model with pruning rate 70%.

Performance under different pruning rates. In [Figure 7](https://arxiv.org/html/2605.23969#S4.F7), SLAP consistently demonstrates the lowest loss across various pruning rates. At a pruning rate of $\alpha=50\%$, all methods achieve their lowest loss. This is primarily because both hard data and easy data are retained to a certain extent. Notably, the losses at $\alpha=90\%$ and $\alpha=20\%$ are remarkably similar, likely due to the high redundancy in the web text domain. Therefore, in practice, we can train the model with less data to reduce computational costs.

![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/chap04/Onliter10performance2.png)

![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/chap04/Onliter20performance2.png)

![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/chap04/Onliter40performance2.png)

![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/chap04/Onliter80performance2.png)

Figure 7: Evaluation with different pruning rates on NetLit using ChatGLM3 model.

Performance between different models. In [Figure 8](https://arxiv.org/html/2605.23969#S4.F8), we find that SLAP exhibits superior stability across various models, consistently achieving the lowest loss values in both cases examined. In contrast, InfoBatch and CCS demonstrate varying degrees of instability across the same models. Specifically, SLAP is particularly effective for the ChatGLM3 model when applied to large datasets, while InfoBatch is more advantageous for smaller datasets. Conversely, the LLaMa3 model shows a stronger alignment with the Online Hard and SLAP methodologies.

![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/chap04/LLaMaQA-1.png)
![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/chap04/LLaMaQA-2.png)

Figure 8: Evaluation on LLaMaQA using ChatGLM3 and LLaMa3 with pruning rate 70%.

GPT-4 Evaluation. For our net literature dataset NetLit, we are more concerned with how well LLMs generate responses under the various pruning methods, similar to the capabilities of LLMs in role-playing scenarios \[[25](https://arxiv.org/html/2605.23969#bib.bib40),[28](https://arxiv.org/html/2605.23969#bib.bib46)\]. We evaluate personality and speaking style as the character's primary features. We instruct GPT-4 to score the generated responses, and the Chain of Thought \[[30](https://arxiv.org/html/2605.23969#bib.bib45)\] process allows us to evaluate the performance of different pruning methods effectively. We follow the prompt template of \[[25](https://arxiv.org/html/2605.23969#bib.bib40)\]. GPT-4 scores each response from 1 to 5, with 5 indicating strong alignment with the character's personality and speaking style, and 1 indicating poor alignment. In the evaluation results from GPT-4, CCS, InfoBatch, and our method all achieve commendable outcomes. For example, in [Figure 9](https://arxiv.org/html/2605.23969#S4.F9)(a), SLAP has the highest percentage of high-score samples, reaching 60.5%. In comparison, the CCS method achieves 52.6%, while the InfoBatch method stands at 43.5%. Additionally, SLAP produces fewer low-score examples than the other methods.

Human Evaluation\. Human judgment is still the most thorough and realistic assessment of whether a generated response is character\-aligned, and we found some poor GPT\-4 annotations during our task\. We therefore invited annotators to correct the GPT\-4 scores for each test sample, which yields the human evaluation results illustrated in [Figure 9](https://arxiv.org/html/2605.23969#S4.F9)\(b\)\. Due to space limitations, detailed prompts for response generation and scores on other datasets are omitted here\.

![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/appendix/gpt_eva_net.png)\(a\)GPT\-4 Evaluation\.
![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/appendix/hum_eva_net.png)\(b\)Human Evaluation\.

Figure 9: GPT\-4 and human evaluation scores for LLM\-generated responses on the NetLit dataset\.

Reference\-based Metrics\. We further employ the widely used metrics Bleu\-4 \[[19](https://arxiv.org/html/2605.23969#bib.bib42)\], Rouge\-1, Rouge\-2, and Rouge\-L \[[15](https://arxiv.org/html/2605.23969#bib.bib41)\] to validate performance on the different datasets\. Consistent with the previous experiments, we use the LLaMa3 model for training and validation on the WikiMatrix and LLaMaQA datasets, while the ChatGLM3 model is employed for the NetLit dataset\. All experiments are conducted at a pruning rate of 70%\. As shown in [Table 1](https://arxiv.org/html/2605.23969#S4.T1), SLAP obtains the best score in nine of the twelve metric\-dataset combinations and remains close to the best method in the other three, highlighting its versatility and efficacy across a range of tasks and models\.

Table 1: Quantitative comparison on the NetLit \| WikiMatrix \| LLaMaQA datasets\.

| Method | Bleu\-4 | Rouge\-1 | Rouge\-2 | Rouge\-L |
| --- | --- | --- | --- | --- |
| Random | 13\.40 \| 31\.52 \| 22\.09 | 34\.82 \| 44\.57 \| 40\.92 | 18\.10 \| 21\.69 \| 21\.79 | 30\.03 \| 36\.56 \| 35\.38 |
| Online Hard | 10\.43 \| 24\.46 \| 27\.04 | 28\.10 \| 39\.62 \| 43\.52 | 11\.86 \| 13\.84 \| 22\.74 | 24\.11 \| 30\.22 \| 38\.77 |
| InfoBatch | 13\.16 \| 26\.76 \| 25\.31 | 35\.30 \| 42\.67 \| 40\.73 | 18\.05 \| 19\.45 \| 22\.62 | 30\.22 \| 34\.17 \| 37\.88 |
| CCS | 13\.03 \| 32\.94 \| 26\.17 | 34\.60 \| 46\.17 \| 42\.57 | 17\.71 \| 22\.84 \| 22\.33 | 30\.05 \| 38\.24 \| 37\.31 |
| SLAP | 13\.73 \| 32\.24 \| 27\.11 | 35\.02 \| 47\.51 \| 44\.07 | 18\.25 \| 22\.56 \| 22\.81 | 30\.53 \| 38\.97 \| 39\.52 |

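To make the overlap metrics in Table 1 more concrete, the following is a minimal sketch of Rouge\-N and Rouge\-L as F1 scores over whitespace tokens\. It is an illustration only and not the evaluation code used in the experiments: the function names `rouge_n` and `rouge_l`, the whitespace tokenization, and the choice of F1 \(rather than recall\) are all assumptions\.

```python
from collections import Counter


def _f1(overlap, n_candidate, n_reference):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    precision = overlap / n_candidate
    recall = overlap / n_reference
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """Rouge-N F1 based on clipped n-gram overlap (whitespace tokens, assumed)."""
    cand, ref = candidate.split(), reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand_ngrams & ref_ngrams).values())
    return _f1(overlap, sum(cand_ngrams.values()), sum(ref_ngrams.values()))


def rouge_l(candidate, reference):
    """Rouge-L F1 based on the longest common subsequence (LCS) of tokens."""
    cand, ref = candidate.split(), reference.split()
    prev = [0] * (len(ref) + 1)  # LCS lengths for the previous candidate token
    for c in cand:
        curr = [0]
        for j, r in enumerate(ref, start=1):
            curr.append(prev[j - 1] + 1 if c == r else max(prev[j], curr[j - 1]))
        prev = curr
    return _f1(prev[-1], len(cand), len(ref))


# Hypothetical sentences, chosen only for illustration.
candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
print(rouge_n(candidate, reference, 1))
print(rouge_n(candidate, reference, 2))
print(rouge_l(candidate, reference))
```

Published implementations typically add stemming, tokenization rules, and multi\-reference handling, so their scores can differ from this sketch\.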
### 4\.3 Efficient Performance

To quantify the computational savings achieved by our method, we compare the FLOPs \(Floating Point Operations\) required for both full batch data and pruned batch data using SLAP\. For full batch processing, the computation is represented by the formula:

$$\text{Full-Batch FLOPs} = B \times (L_{i} + L_{o}) \times F_{f} + B \times L_{o} \times F_{b} \tag{4}$$

where $B$ is the batch size, $L_{i}$ is the input sentence length, $L_{o}$ is the output sentence length, $F_{f}$ denotes the FLOPs for the forward pass of one token, and $F_{b}$ represents the FLOPs for the backward pass, which generally requires approximately twice the FLOPs of the forward pass\. In the case of SLAP, which incorporates pruning, the computational requirement is reduced by a factor corresponding to the pruning rate $\alpha$\. The computation for SLAP can be expressed as:

$$\text{SLAP FLOPs} = B \times (L_{i} + L_{o}) \times F_{f} + (1 - \alpha) \times B \times L_{o} \times F_{b} \tag{5}$$

The backward pass computation is scaled down by $(1 - \alpha)$\. For example, with a 70% pruning rate, only 30% of the backward pass computations need to be performed\. The computational requirements of the SLAP, Random, and Full Batch methods at a 70% pruning rate are compared in [Figure 10](https://arxiv.org/html/2605.23969#S4.F10)\. Our method reduces computational requirements by at least 30% while maintaining equivalent loss\. This demonstrates that SLAP significantly lowers the computational burden without compromising performance\.
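As a rough illustration of Equations \(4\) and \(5\), the sketch below evaluates both expressions in Python\. The function names, the batch size, the sequence lengths, and the per\-token forward cost are hypothetical values chosen only for illustration; the backward cost is set to twice the forward cost, following the approximation stated for Equation \(4\)\.

```python
def full_batch_flops(batch_size, len_in, len_out, flops_fwd, flops_bwd):
    """Equation (4): forward pass over all tokens plus backward pass over output tokens."""
    forward = batch_size * (len_in + len_out) * flops_fwd
    backward = batch_size * len_out * flops_bwd
    return forward + backward


def slap_flops(batch_size, len_in, len_out, flops_fwd, flops_bwd, pruning_rate):
    """Equation (5): same forward term, backward term scaled by (1 - pruning_rate)."""
    forward = batch_size * (len_in + len_out) * flops_fwd
    backward = (1 - pruning_rate) * batch_size * len_out * flops_bwd
    return forward + backward


# Hypothetical values, not taken from the experiments.
B, L_i, L_o = 8, 256, 256
F_f = 1.0        # forward FLOPs per token, in arbitrary units
F_b = 2.0 * F_f  # backward pass assumed to cost about twice the forward pass
alpha = 0.7      # pruning rate

full = full_batch_flops(B, L_i, L_o, F_f, F_b)
pruned = slap_flops(B, L_i, L_o, F_f, F_b, alpha)
saving = 1 - pruned / full
print(f"full={full:.0f} pruned={pruned:.0f} saving={saving:.1%}")
```

Under these assumed values the relative saving is 35%\. More generally, with $F_{b} = 2F_{f}$ the saving implied by Equations \(4\) and \(5\) is $2\alpha L_{o} / (L_{i} + 3L_{o})$, so it grows as the output sequence becomes longer relative to the input\.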

![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/chap04/LLaMaQA_combined-1.png)
![Refer to caption](https://arxiv.org/html/2605.23969v1/Template/figure/chap04/LLaMaQA_combined-2.png)

Figure 10: Comparison of FLOPs for pruned and full data\.

## 5Conclusion

In conclusion, our proposed method, SLAP, significantly enhances the efficiency of instruction tuning for LLMs by focusing on the learnability of whole batches rather than individual samples\. By employing distribution\-aware stratified sampling to ensure coverage of the data distribution and by maximizing the relative distance between batch samples for diversity, SLAP effectively curates high\-quality, diverse data\. Additionally, it uses Hessian\-approximated gradient information to guide the batch selection strategy, leading to superior performance compared to previous state\-of\-the\-art methods\. Our experiments demonstrate that SLAP achieves robust generalization across various downstream tasks and models, reduces computational cost by 20\-40%, and consistently delivers strong results in multilingual translation, QA, and multi\-turn dialogue evaluations\.

## References

- \[1\]AI@Meta\(2024\)Llama 3 model card\.External Links:[Link](https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md)Cited by:[§4\.1](https://arxiv.org/html/2605.23969#S4.SS1.p3.5)\.
- \[2\]A\. Babenko and V\. Lempitsky\(2015\)Aggregating deep convolutional features for image retrieval\.arXiv preprint\.Cited by:[§3\.2](https://arxiv.org/html/2605.23969#S3.SS2.p2.7)\.
- \[3\]W\. J\. Cook, W\. H\. Cunningham, W\. R\. Pulleyblank, and A\. Schrijver\(1998\)Combinatorial optimization\.Vol\.605,Springer\.Cited by:[§3\.3](https://arxiv.org/html/2605.23969#S3.SS3.p1.2)\.
- \[4\]Z\. Deng, P\. Cui, and J\. Zhu\(2023\)Towards accelerated model training via bayesian data selection\.Advances in Neural Information Processing Systems36,pp\. 8513–8527\.Cited by:[§1](https://arxiv.org/html/2605.23969#S1.p2.1)\.
- \[5\]T\. Evans, N\. Parthasarathy, H\. Merzic, and O\. J\. Henaff\(2024\)Data curation via joint example selection further accelerates multimodal learning\.External Links:2406\.17711,[Link](https://arxiv.org/abs/2406.17711)Cited by:[§2](https://arxiv.org/html/2605.23969#S2.p2.1)\.
- \[6\]D\. Everaert and C\. Potts\(2023\)Gio: gradient information optimization for training dataset selection\.arXiv preprint arXiv:2306\.11670\.Cited by:[§1](https://arxiv.org/html/2605.23969#S1.p1.1)\.
- \[7\]T\. GLM, A\. Zeng, B\. Xu, B\. Wang, C\. Zhang, D\. Yin, D\. Rojas, G\. Feng, H\. Zhao, H\. Lai, H\. Yu, H\. Wang, J\. Sun, J\. Zhang, J\. Cheng, J\. Gui, J\. Tang, J\. Zhang, J\. Li, L\. Zhao, L\. Wu, L\. Zhong, M\. Liu, M\. Huang, P\. Zhang, Q\. Zheng, R\. Lu, S\. Duan, S\. Zhang, S\. Cao, S\. Yang, W\. L\. Tam, W\. Zhao, X\. Liu, X\. Xia, X\. Zhang, X\. Gu, X\. Lv, X\. Liu, X\. Liu, X\. Yang, X\. Song, X\. Zhang, Y\. An, Y\. Xu, Y\. Niu, Y\. Yang, Y\. Li, Y\. Bai, Y\. Dong, Z\. Qi, Z\. Wang, Z\. Yang, Z\. Du, Z\. Hou, and Z\. Wang\(2024\)ChatGLM: a family of large language models from glm\-130b to glm\-4 all tools\.External Links:2406\.12793Cited by:[§4\.1](https://arxiv.org/html/2605.23969#S4.SS1.p3.5)\.
- \[8\]C\. Guo, B\. Zhao, and Y\. Bai\(2022\)Deepcore: a comprehensive library for coreset selection in deep learning\.InInternational Conference on Database and Expert Systems Applications,pp\. 181–195\.Cited by:[§2](https://arxiv.org/html/2605.23969#S2.p1.1)\.
- \[9\]D\. Hernandez, T\. Brown, T\. Conerly, N\. DasSarma, D\. Drain, S\. El\-Showk, N\. Elhage, Z\. Hatfield\-Dodds, T\. Henighan, T\. Hume,et al\.\(2022\)Scaling laws and interpretability of learning from repeated data\.arXiv preprint arXiv:2205\.10487\.Cited by:[§1](https://arxiv.org/html/2605.23969#S1.p1.1)\.
- \[10\]F\. Hong, Y\. Lyu, J\. Yao, Y\. Zhang, I\. W\. Tsang, and Y\. Wang\(2024\)Diversified batch selection for training acceleration\.arXiv preprint arXiv:2406\.04872\.Cited by:[§1](https://arxiv.org/html/2605.23969#S1.p1.1),[§1](https://arxiv.org/html/2605.23969#S1.p2.1),[§1](https://arxiv.org/html/2605.23969#S1.p3.1),[§2](https://arxiv.org/html/2605.23969#S2.p2.1)\.
- \[11\]K\. Hsieh, G\. Ananthanarayanan, P\. Bodik, P\. Bahl, M\. Philipose, P\. B\. Gibbons, and O\. Mutlu\(2018\)Focus: querying large video datasets with low latency and low cost\.arXiv preprint\.Cited by:[§3\.2](https://arxiv.org/html/2605.23969#S3.SS2.p2.7)\.
- \[12\]L\. Jiang, Z\. Zhou, T\. Leung, L\. Li, and L\. Fei\-Fei\(2018\)Mentornet: learning data\-driven curriculum for very deep neural networks on corrupted labels\.InInternational conference on machine learning,pp\. 2304–2313\.Cited by:[§1](https://arxiv.org/html/2605.23969#S1.p2.1),[§3\.3](https://arxiv.org/html/2605.23969#S3.SS3.p1.2)\.
- \[13\]D\. P\. Kingma and J\. Ba\(2017\)Adam: a method for stochastic optimization\.External Links:1412\.6980,[Link](https://arxiv.org/abs/1412.6980)Cited by:[§1](https://arxiv.org/html/2605.23969#S1.p3.1)\.
- \[14\]P\. W\. Koh and P\. Liang\(2017\)Understanding black\-box predictions via influence functions\.InInternational conference on machine learning,pp\. 1885–1894\.Cited by:[§1](https://arxiv.org/html/2605.23969#S1.p2.1)\.
- \[15\]C\. Lin\(2004\)Rouge: a package for automatic evaluation of summaries\.InText summarization branches out,pp\. 74–81\.Cited by:[§4\.1](https://arxiv.org/html/2605.23969#S4.SS1.p4.1),[§4\.2](https://arxiv.org/html/2605.23969#S4.SS2.p6.1)\.
- \[16\]S\. Mindermann, J\. M\. Brauner, M\. T\. Razzak, M\. Sharma, A\. Kirsch, W\. Xu, B\. Höltgen, A\. N\. Gomez, A\. Morisot, S\. Farquhar,et al\.\(2022\)Prioritized training on points that are learnable, worth learning, and not yet learnt\.InInternational Conference on Machine Learning,pp\. 15630–15649\.Cited by:[§1](https://arxiv.org/html/2605.23969#S1.p2.1)\.
- \[17\]S\. Mindermann, J\. Brauner, M\. Razzak, M\. Sharma, A\. Kirsch, W\. Xu, B\. Höltgen, A\. N\. Gomez, A\. Morisot, S\. Farquhar, and Y\. Gal\(2022\)Prioritized training on points that are learnable, worth learning, and not yet learnt\.External Links:2206\.07137,[Link](https://arxiv.org/abs/2206.07137)Cited by:[§2](https://arxiv.org/html/2605.23969#S2.p2.1)\.
- \[18\]L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray,et al\.\(2022\)Training language models to follow instructions with human feedback\.Advances in neural information processing systems35,pp\. 27730–27744\.Cited by:[§1](https://arxiv.org/html/2605.23969#S1.p1.1)\.
- \[19\]K\. Papineni, S\. Roukos, T\. Ward, and W\. Zhu\(2002\)Bleu: a method for automatic evaluation of machine translation\.InProceedings of the 40th annual meeting of the Association for Computational Linguistics,pp\. 311–318\.Cited by:[§4\.1](https://arxiv.org/html/2605.23969#S4.SS1.p4.1),[§4\.2](https://arxiv.org/html/2605.23969#S4.SS2.p6.1)\.
- \[20\]M\. Paul, S\. Ganguli, and G\. K\. Dziugaite\(2021\)Deep learning on a data diet: finding important examples early in training\.Advances in neural information processing systems34,pp\. 20596–20607\.Cited by:[§1](https://arxiv.org/html/2605.23969#S1.p2.1)\.
- \[21\]O\. Pooladzandi, D\. Davini, and B\. Mirzasoleiman\(2022\)Adaptive second order coresets for data\-efficient machine learning\.InInternational Conference on Machine Learning,pp\. 17848–17869\.Cited by:[§1](https://arxiv.org/html/2605.23969#S1.p2.1)\.
- \[22\]Z\. Qin, K\. Wang, Z\. Zheng, J\. Gu, X\. Peng, Z\. Xu, D\. Zhou, L\. Shang, B\. Sun, X\. Xie, and Y\. You\(2023\)InfoBatch: lossless training speed up by unbiased dynamic data pruning\.External Links:2303\.04947,[Link](https://arxiv.org/abs/2303.04947)Cited by:[§2](https://arxiv.org/html/2605.23969#S2.p2.1),[§4\.1](https://arxiv.org/html/2605.23969#S4.SS1.p2.1.6)\.
- \[23\]H\. Schwenk, V\. Chaudhary, S\. Sun, H\. Gong, and F\. Guzmán\(2019\)WikiMatrix: mining 135m parallel sentences in 1620 language pairs from wikipedia\.External Links:1907\.05791,[Link](https://arxiv.org/abs/1907.05791)Cited by:[§4\.1](https://arxiv.org/html/2605.23969#S4.SS1.p1.1)\.
- \[24\]O\. Sener and S\. Savarese\(2018\)Active learning for convolutional neural networks: a core\-set approach\.arXiv preprint arXiv:1708\.00489\.Cited by:[§3\.1](https://arxiv.org/html/2605.23969#S3.SS1.p1.1),[§3\.3](https://arxiv.org/html/2605.23969#S3.SS3.p1.2)\.
- \[25\]Y\. Shao, L\. Li, J\. Dai, and X\. Qiu\(2023\)Character\-llm: a trainable agent for role\-playing\.arXiv preprint arXiv:2310\.10158\.Cited by:[§4\.1](https://arxiv.org/html/2605.23969#S4.SS1.p4.1),[§4\.2](https://arxiv.org/html/2605.23969#S4.SS2.p4.1)\.
- \[26\]B\. Sorscher, R\. Geirhos, S\. Shekhar, S\. Ganguli, and A\. Morcos\(2022\)Beyond neural scaling laws: beating power law scaling via data pruning\.Advances in Neural Information Processing Systems35,pp\. 19523–19536\.Cited by:[§2](https://arxiv.org/html/2605.23969#S2.p1.1),[§2](https://arxiv.org/html/2605.23969#S2.p3.1),[§3\.2](https://arxiv.org/html/2605.23969#S3.SS2.p1.1)\.
- \[27\]X\. Wang, W\. Zhou, Q\. Zhang, J\. Zhou, S\. Gao, J\. Wang, M\. Zhang, X\. Gao, Y\. Chen, and T\. Gui\(2023\)Farewell to aimless large\-scale pretraining: influential subset selection for language model\.arXiv preprint arXiv:2305\.12816\.Cited by:[§2](https://arxiv.org/html/2605.23969#S2.p3.1),[§3\.2](https://arxiv.org/html/2605.23969#S3.SS2.p1.1)\.
- \[28\]Z\. M\. Wang, Z\. Peng, H\. Que, J\. Liu, W\. Zhou, Y\. Wu, H\. Guo, R\. Gan, Z\. Ni, M\. Zhang,et al\.\(2023\)Rolellm: benchmarking, eliciting, and enhancing role\-playing abilities of large language models\.arXiv preprint arXiv:2310\.00746\.Cited by:[§4\.1](https://arxiv.org/html/2605.23969#S4.SS1.p4.1),[§4\.2](https://arxiv.org/html/2605.23969#S4.SS2.p4.1)\.
- \[29\]H\. Wei, L\. Feng, X\. Chen, and B\. An\(2020\)Combating noisy labels by agreement: a joint training method with co\-regularization\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 13726–13735\.Cited by:[§1](https://arxiv.org/html/2605.23969#S1.p2.1),[§3\.3](https://arxiv.org/html/2605.23969#S3.SS3.p1.2)\.
- \[30\]J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[§4\.2](https://arxiv.org/html/2605.23969#S4.SS2.p4.1)\.
- \[31\]M\. Xia, S\. Malladi, S\. Gururangan, S\. Arora, and D\. Chen\(2024\)Less: selecting influential data for targeted instruction tuning\.arXiv preprint arXiv:2402\.04333\.Cited by:[§1](https://arxiv.org/html/2605.23969#S1.p1.1)\.
- \[32\]M\. Xia, S\. Malladi, S\. Gururangan, S\. Arora, and D\. Chen\(2024\)LESS: selecting influential data for targeted instruction tuning\.External Links:2402\.04333,[Link](https://arxiv.org/abs/2402.04333)Cited by:[§2](https://arxiv.org/html/2605.23969#S2.p3.1),[§3\.2](https://arxiv.org/html/2605.23969#S3.SS2.p1.1)\.
- \[33\]X\. Xia, J\. Liu, J\. Yu, X\. Shen, B\. Han, and T\. Liu\(2022\)Moderate coreset: a universal method of data selection for real\-world data\-efficient deep learning\.InThe Eleventh International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.23969#S2.p1.1),[§2](https://arxiv.org/html/2605.23969#S2.p3.1),[§3\.2](https://arxiv.org/html/2605.23969#S3.SS2.p1.1)\.
- \[34\]S\. Yang, Z\. Xie, H\. Peng, M\. Xu, M\. Sun, and P\. Li\(2022\)Dataset pruning: reducing training data by examining generalization influence\.arXiv preprint arXiv:2205\.09329\.Cited by:[§1](https://arxiv.org/html/2605.23969#S1.p2.1)\.
- \[35\]J\. Yoon, D\. Madaan, E\. Yang, and S\. J\. Hwang\(2021\)Online coreset selection for rehearsal\-based continual learning\.arXiv preprint arXiv:2106\.01085\.Cited by:[§2](https://arxiv.org/html/2605.23969#S2.p1.1)\.
- \[36\]H\. Zheng, R\. Liu, F\. Lai, and A\. Prakash\(2023\)Coverage\-centric coreset selection for high pruning rates\.arXiv preprint arXiv:2210\.15809\.Cited by:[§1](https://arxiv.org/html/2605.23969#S1.p2.1),[§2](https://arxiv.org/html/2605.23969#S2.p1.1),[§2](https://arxiv.org/html/2605.23969#S2.p3.1),[§3\.1](https://arxiv.org/html/2605.23969#S3.SS1.p1.1),[§3\.2](https://arxiv.org/html/2605.23969#S3.SS2.p1.1),[§3\.2](https://arxiv.org/html/2605.23969#S3.SS2.p2.2),[§4\.1](https://arxiv.org/html/2605.23969#S4.SS1.p2.1.5)\.
- \[37\]H\. Zheng, E\. Tsai, Y\. Lu, J\. Sun, B\. R\. Bartoldson, B\. Kailkhura, and A\. Prakash\(2024\)ELFS: enhancing label\-free coreset selection via clustering\-based pseudo\-labeling\.arXiv preprint arXiv:2406\.04273\.Cited by:[§1](https://arxiv.org/html/2605.23969#S1.p2.1)\.
