Dynamic Sampling that Adapts: Self-Aware Iterative Data Persistent Optimization for Mathematical Reasoning
Summary
SAI-DPO introduces a dynamic sampling framework that adapts training data to a model's evolving capabilities during mathematical reasoning tasks, using self-aware difficulty metrics and knowledge semantic alignment to achieve state-of-the-art efficiency with less data on benchmarks like AIME24 and AMC23.
# Self-Aware Iterative Data Persistent Optimization for Mathematical Reasoning
Source: https://arxiv.org/html/2505.16176
Jun Rao1, Xuebo Liu1, Hexuan Deng1, Zepeng Lin1, Zixiong Yu2, Jiansheng Wei2, Xiaojun Meng2∗, and Min Zhang1
1Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China
2Huawei Large Model Data Technology Lab
{rao7jun,zepenglin11,hxuandeng}@gmail.com, {liuxuebo,zhangmin2021}@hit.edu.cn
[email protected],{weijiansheng,xiaojun.meng}@huawei.com
###### Abstract
In mathematical reasoning, data selection strategies predominantly rely on static, externally defined metrics, which fail to adapt to the evolving capabilities of models during training. This misalignment limits the efficiency of Supervised Fine-Tuning and Reinforcement Learning. To bridge this gap, we introduce SAI-DPO (Self-Aware Iterative Data Persistent Optimization), a dynamic sampling framework that aligns training data with the model's intrinsic competence. SAI-DPO operationalizes two novel metrics: Knowledge Semantic Alignment for targeting domain weaknesses, and Self-Aware Difficulty, derived from pass rates and reasoning path characteristics, to gauge instance complexity relative to the model's current state. By iteratively recalibrating the data distribution based on real-time feedback, SAI-DPO dynamically aligns training samples with the model's evolving competence, ensuring the data remains strictly relevant to the model's current capability level. Extensive experiments on eight benchmarks (including AIME24 and AMC23) demonstrate that SAI-DPO outperforms static baselines by nearly 6 points, achieving state-of-the-art efficiency with significantly less data.
## 1 Introduction
Recent advances in Large Language Models (LLMs), particularly in reasoning tasks, highlight the critical role of high-quality data. However, current data selection paradigms remain largely static, relying on fixed datasets or external difficulty scorers. This creates a fundamental disconnection: as a model learns, what was once "hard" becomes "easy", rendering static datasets progressively inefficient. Continued training on trivial samples yields diminishing returns, while overly complex samples may induce hallucinations.
Existing work focuses mainly on Supervised Fine-Tuning (SFT) after data filtering or on online reinforcement learning algorithms. Most of these methods are static: they do not adaptively select data for continued training based on the model's current capabilities, which limits sustained improvement of its reasoning abilities. As illustrated in Figure 1, models differ in capability, so they also differ in which questions they find difficult. Although some existing works have examined the impact of difficulty on models, the difficulty metrics they rely on remain unclear.
To address the lack of dynamic, adaptive training for reasoning data, we propose SAI-DPO (Self-Aware Iterative Data Persistent Optimization), an algorithm for mathematical reasoning. It dynamically selects training data that matches the model's current competence (Self-Aware Difficulty) and weaknesses (Knowledge Semantic Alignment), improving its reasoning abilities over successive iterations. Using these metrics, the algorithm also filters low-quality inputs to improve training efficiency.
We conducted extensive experiments to study the defined metrics, the data acquisition strategy, and the gradual improvement from iterative training. The experiments cover 8 public mathematical test sets and 4 public models (Qwen2.5-7B-Math-Base, Qwen2.5-Math-7B-SFT, Llama3.1-8B-Instruct and Qwen3-8B). Our approach not only outperforms the original DPO but also accelerates training. It also yields better results than common strategies such as externally defined difficulty and curriculum learning. Our results show that externally defined difficulty does not align with what is difficult for the model, and that training with model-defined difficulty works better.
Our main contributions are as follows:
- We propose a Dynamic Data Acquisition strategy that clusters knowledge tags to systematically target specific weakness domains.
- We formulate a Self-Aware Difficulty Metric that integrates statistical priors (pass rate) with cognitive load indicators (step count and length), providing a nuanced view of model competence.
- We demonstrate through extensive experimentation that aligning data difficulty with model capability yields superior performance, improving accuracy on competition-level benchmarks (AIME24 and AMC23) by nearly 4 points over strong baselines.
## 2 Related Work
### 2.1 Post-training Preference Optimization
In the post-training stage, many RL algorithms improve model performance by aligning the model's outputs with human preferences, specifically by increasing the probability of generating high-quality responses and decreasing the probability of producing low-quality ones. A common algorithm is Proximal Policy Optimization (PPO), which has been applied in multiple current LLM systems. Recently, more powerful reasoning models such as Kimi K1.5, DeepSeek-V3, and DeepSeek-R1 have modified PPO, giving rise to algorithms like GRPO and REINFORCE++. Although these algorithms perform well, their practical deployment is often complicated by the time-consuming online exploration they require.
In contrast, some offline methods are simpler to deploy. Direct Preference Optimization (DPO) efficiently trains large models for knowledge alignment using preference rankings instead of reward models. DPO optimizes a classification loss on preference data, which makes it simpler to implement than RL from human feedback. A line of follow-up work advances LLM alignment by shifting from static datasets to iterative self-improvement, showing that dynamic, online feedback loops and repeated preference optimization significantly boost both general instruction following and complex reasoning. SPHERE and IDPO apply a self-evolving, iterative data augmentation approach, called Online DPO, to mathematical reasoning. Unlike existing work, we improve effectiveness through the model's own judgment of which data to select, rather than by changing the optimization algorithm.
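The "classification loss" mentioned above can be illustrated with the standard per-pair DPO objective, which is the negative log-sigmoid of a scaled difference between log-probability ratios. The sketch below is a minimal illustration, not the authors' training code; the function name, argument names, and the default `beta` value are assumptions chosen for the example.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss (illustrative; beta=0.1 is an assumed default).

    loss = -log sigmoid(beta * ((chosen log-ratio) - (rejected log-ratio)))
    where each log-ratio is log pi_policy(y|x) - log pi_ref(y|x).
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is zero.
neutral = dpo_pair_loss(-1.0, -1.0, -1.0, -1.0)
# Raising the chosen response and lowering the rejected one reduces the loss.
improved = dpo_pair_loss(-1.0, -3.0, -2.0, -2.0)
```

Because the loss depends only on log-probabilities of a fixed preference pair, no reward model or online rollout is needed at training time, which is the deployment advantage the paragraph describes.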
### 2.2 Post-training Data Strategies
Data plays a crucial role in unlocking the capabilities of models. Early on, LIMA found that a small amount of data could activate the relevant capabilities of a model and improve results across multiple tasks. More recently, data selection work in mathematics has also demonstrated the importance of data quality and diversity. For instance, S1 and LIMO used a small amount of data to elicit the mathematical reasoning capabilities of models. Kimi K1.5 adopted curriculum learning and constructed a curriculum-based data training strategy. Pangu Ultra also assigned quality and difficulty labels to the data and used a curriculum-based sampling strategy throughout its three pre-training stages.
In this work, we explore dynamic data selection during training, aiming to improve final RL performance by choosing training data that is aligned with the model's own competence.
## 3 Methods
### 3.1 Overview
The system operates in cycles. At iteration t, the current model M_t acts as a probe to evaluate the training data pool D. Using two distinct metrics, Knowledge Semantic Alignment and Self-Aware Difficulty, we construct a dynamic curriculum D_train^(t) that targets the model's specific weaknesses. The model is then updated via preference optimization to yield M_{t+1}, shifting the difficulty frontier for the subsequent cycle.
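The cycle described above can be sketched as a short loop. This is a schematic only: `probe`, `select`, and `update` are hypothetical callables standing in for the metric computation, the curriculum construction, and the preference-optimization step, and the toy "model" is just an integer skill level used to make the example runnable.

```python
def iterative_training(model, pool, probe, select, update, num_iterations):
    """Probe the pool with M_t, build D_train^(t), update to M_{t+1}."""
    history = []
    for _ in range(num_iterations):
        stats = [probe(model, item) for item in pool]  # M_t evaluates pool D
        train_set = select(pool, stats)                # build D_train^(t)
        history.append(train_set)
        model = update(model, train_set)               # yields M_{t+1}
    return model, history


# Toy stand-ins (assumptions for illustration only):
# a problem is an integer difficulty, the model is an integer skill level.
def toy_probe(skill, difficulty):
    return difficulty - skill  # gap between problem and current skill


def toy_select(pool, gaps):
    # keep problems slightly beyond the current skill level
    return [item for item, gap in zip(pool, gaps) if 0 < gap <= 2]


def toy_update(skill, train_set):
    return skill + 1 if train_set else skill


final_model, history = iterative_training(
    model=3, pool=list(range(1, 11)),
    probe=toy_probe, select=toy_select, update=toy_update,
    num_iterations=2,
)
```

In the toy run the selected set moves from problems just above skill 3 to problems just above skill 4, which mirrors the "shifting difficulty frontier" in the text.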
### 3.2 Metric Definition
To operationalize dynamic data selection, we introduce two complementary metrics: one for semantic coverage (what the problem is about) and one for intrinsic complexity (how hard it is for the current model).
#### 3.2.1 Knowledge Semantic Alignment
Effective training requires diversity across mathematical concepts. We treat knowledge identification as a Latent Semantic Clustering problem.
**Annotation:** We employ an expert model to generate explicit knowledge tags T(x) for each instance x (e.g., Geometry, Sequence Summation).
**Embedding and Clustering:** These tags are mapped into a vector space using Sentence-Transformers (all-MiniLM-L6-v2). We then apply K-Means clustering to partition the dataset into n semantic domains C = {C_1, C_2, ..., C_n}. This granular partitioning allows us to detect and up-sample specific domains where the model exhibits high error rates.
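A minimal sketch of the clustering and weakness-detection step follows. It assumes the tag embeddings have already been computed (the paper uses Sentence-Transformers for this); here toy 2-D vectors stand in for them. The K-Means routine is a plain Lloyd's iteration with a naive "first k points" initialization, and all function names are hypothetical.

```python
from collections import defaultdict


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; seeds centroids with the first k points (toy init)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_error_rates(labels, is_correct):
    """Fraction of incorrectly answered instances per cluster."""
    totals, errors = defaultdict(int), defaultdict(int)
    for label, ok in zip(labels, is_correct):
        totals[label] += 1
        errors[label] += 0 if ok else 1
    return {c: errors[c] / totals[c] for c in totals}


# Toy embeddings: two well-separated groups of tag vectors.
embeddings = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
solved = [True, False, True, False, True, True]

labels = kmeans(embeddings, k=2)
rates = cluster_error_rates(labels, solved)
weakest_cluster = max(rates, key=rates.get)  # candidate domain to up-sample
```

A real pipeline would likely use a library K-Means with better initialization; the point here is only that per-cluster error rates give a direct signal for which domain to up-sample.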
#### 3.2.2 Self-Aware Difficulty Calibration
Unlike external difficulty scorers which are static, we define difficulty as a function of the model's interaction with the data. We propose a Hierarchical Difficulty Metric composed of three dimensions:
**1) Probabilistic Solvability (NoP, Primary):** We perform K explorations for each query. The Number of Passes (NoP), defined as the count of correct responses, serves as the primary proxy for difficulty. A lower NoP indicates higher aleatoric uncertainty and difficulty. We define the "solvable range" as problems where the model is neither consistently correct nor consistently incorrect. Specifically, this includes instances with a NoP in a particular range.
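The NoP computation and the solvable-range filter can be sketched as below. The bounds `low` and `high` are hypothetical parameters: the text above does not state the exact range, so the example simply excludes the two extremes (never correct, always correct). The exact-match checker is likewise an assumption for illustration.

```python
def number_of_passes(responses, reference, is_correct):
    """Count how many of the K sampled responses are judged correct."""
    return sum(1 for response in responses if is_correct(response, reference))


def in_solvable_range(nop, low, high):
    """True if NoP lies in [low, high]; the bounds are assumed, not from the paper."""
    return low <= nop <= high


def exact_match(response, reference):
    # Hypothetical checker: compares final answers as stripped strings.
    return response.strip() == reference.strip()


# K = 8 sampled final answers for one query whose reference answer is "4".
sampled = ["4", "4", "5", "4", "3", "4", "4", "4"]
K = len(sampled)
nop = number_of_passes(sampled, "4", exact_match)

# Assumed range: drop queries the model always fails (0) or always solves (K).
keep = in_solvable_range(nop, low=1, high=K - 1)
```

Because NoP is recomputed with the current model at every iteration, the same query can move in or out of the solvable range as training progresses.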
#### 3.2.3 Knowledge Point Identification
Knowledge points are identified by prompting an expert model with the following instruction:
"What knowledge points need to be involved in solving the following questions? Answer should be output in the following format, no need to output the answer, reply in English. ### Knowledge Points: {}
Please output the results directly, reducing the thought process."
As shown in the examples, the second and third examples both involve trigonometry, indicating repetitive knowledge points. By leveraging this tagging, we can better identify data with similar knowledge points, thereby enabling self-learning by locating example problems for knowledge points where the model currently has weaknesses.
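Before tags can be embedded and clustered, the expert model's reply has to be turned into a list of knowledge points. The sketch below parses a reply that follows the "### Knowledge Points:" format from the prompt. It assumes the tags appear on one line after the marker and are comma-separated; the source does not specify the separator, so that choice and the function name are hypothetical.

```python
MARKER = "### Knowledge Points:"


def parse_knowledge_points(reply):
    """Extract tags following the marker; returns [] if the marker is absent.

    Assumes comma-separated tags on the first non-empty line after the marker.
    """
    _, found, tail = reply.partition(MARKER)
    if not found:
        return []
    tail = tail.strip()
    if not tail:
        return []
    first_line = tail.splitlines()[0]
    return [tag.strip() for tag in first_line.split(",") if tag.strip()]


tags_a = parse_knowledge_points("### Knowledge Points: Trigonometry, Law of Sines")
tags_b = parse_knowledge_points("### Knowledge Points: Trigonometry, Unit Circle")

# Shared tags hint at overlapping knowledge points between two problems.
shared = sorted(set(tags_a) & set(tags_b))
```

Exact string overlap is only a rough signal; the embedding-based clustering described in Section 3.2.1 is what groups differently worded but related tags.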
### A.3 Examples of Problem-Solving Steps
Several models can automatically continue to generate replies in this format by adding the field "Step: 1".