Dynamic Sampling that Adapts: Self-Aware Iterative Data Persistent Optimization for Mathematical Reasoning

arXiv cs.CL Papers

Summary

SAI-DPO introduces a dynamic sampling framework that adapts training data to a model's evolving capabilities during mathematical reasoning tasks, using self-aware difficulty metrics and knowledge semantic alignment to achieve state-of-the-art efficiency with less data on benchmarks like AIME24 and AMC23.

arXiv:2505.16176v2 Abstract: In mathematical reasoning, data selection strategies predominantly rely on static, externally defined metrics, which fail to adapt to the evolving capabilities of models during training. This misalignment limits the efficiency of Supervised Fine-Tuning and Reinforcement Learning. To bridge this gap, we introduce SAI-DPO (Self-Aware Iterative Data Persistent Optimization), a dynamic sampling framework that aligns training data with the model's intrinsic competence. SAI-DPO operationalizes two novel metrics: Knowledge Semantic Alignment for targeting domain weaknesses, and Self-Aware Difficulty, derived from pass rates and reasoning path characteristics, to gauge instance complexity relative to the model's current state. By iteratively recalibrating the data distribution based on real-time feedback, SAI-DPO dynamically aligns training samples with the model's evolving competence, ensuring the data remains strictly relevant to the model's current capability level. Extensive experiments on eight benchmarks (including AIME24 and AMC23) demonstrate that SAI-DPO outperforms static baselines by nearly 6 points, achieving state-of-the-art efficiency with significantly less data.


# Self-Aware Iterative Data Persistent Optimization for Mathematical Reasoning

Source: https://arxiv.org/html/2505.16176

## Dynamic Sampling that Adapts: Self-Aware Iterative Data Persistent Optimization for Mathematical Reasoning

Jun Rao1, Xuebo Liu1, Hexuan Deng1, Zepeng Lin1, Zixiong Yu2, Jiansheng Wei2, Xiaojun Meng2∗, and Min Zhang1

1Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China
2Huawei Large Model Data Technology Lab

{rao7jun,zepenglin11,hxuandeng}@gmail.com, {liuxuebo,zhangmin2021}@hit.edu.cn
[email protected], {weijiansheng,xiaojun.meng}@huawei.com

###### Abstract

In mathematical reasoning, data selection strategies predominantly rely on static, externally defined metrics, which fail to adapt to the evolving capabilities of models during training. This misalignment limits the efficiency of Supervised Fine-Tuning and Reinforcement Learning. To bridge this gap, we introduce SAI-DPO (Self-Aware Iterative Data Persistent Optimization), a dynamic sampling framework that aligns training data with the model's intrinsic competence. SAI-DPO operationalizes two novel metrics: Knowledge Semantic Alignment for targeting domain weaknesses, and Self-Aware Difficulty, derived from pass rates and reasoning path characteristics, to gauge instance complexity relative to the model's current state. By iteratively recalibrating the data distribution based on real-time feedback, SAI-DPO dynamically aligns training samples with the model's evolving competence, ensuring the data remains strictly relevant to the model's current capability level. Extensive experiments on eight benchmarks (including AIME24 and AMC23) demonstrate that SAI-DPO outperforms static baselines by nearly 6 points, achieving state-of-the-art efficiency with significantly less data.

## 1 Introduction

Recent advances in Large Language Models (LLMs), particularly in reasoning tasks, highlight the critical role of high-quality data. However, current data selection paradigms remain largely static, relying on fixed datasets or external difficulty scorers. This creates a fundamental disconnect: as a model learns, what was once "hard" becomes "easy", rendering static datasets progressively less efficient. Continued training on trivial samples yields diminishing returns, while overly complex samples may induce hallucinations.

Current work mainly focuses on Supervised Fine-Tuning (SFT) after data filtering or on online reinforcement learning algorithms. Most of these methods are static: they do not adaptively select suitable data for continued training based on the model's current capabilities, which limits sustained improvement of its reasoning abilities. As illustrated in Figure 1, different models have different capabilities and therefore differ in how difficult they find the same questions. Although some existing works have studied the impact of difficulty on models, the metrics they rely on remain unclear.

To address the lack of dynamic, adaptive training for reasoning data, we propose SAI-DPO (Self-Aware Iterative Data Persistent Optimization), an algorithm for mathematical reasoning. It dynamically selects training data that matches the model's current competence (Self-Aware Difficulty) and weaknesses (Knowledge Semantic Alignment), enhancing its reasoning abilities over successive iterations. Using these metrics, the algorithm also filters low-quality inputs to improve training efficiency.

We conducted extensive experiments to explore the defined metrics, the data acquisition strategy, and the gradual improvement from iterative training. The experiments were carried out on 8 public mathematical test sets and 4 public models (Qwen2.5-7B-Math-Base, Qwen2.5-Math-7B-SFT, Llama3.1-8B-Instruct and Qwen3-8B). Our approach not only outperforms the original DPO but also accelerates training. It also yields better results than common strategies such as externally defined difficulty and curriculum learning. Our results show that externally defined difficulty does not align with what is difficult for the model, and that training with model-defined difficulty is more effective.

Our main contributions are as follows:

- We propose a Dynamic Data Acquisition strategy that clusters knowledge tags to systematically target specific weakness domains.
- We formulate a Self-Aware Difficulty Metric that integrates statistical priors (pass rate) with cognitive load indicators (step count and length), providing a nuanced view of model competence.
- We demonstrate through extensive experimentation that aligning data difficulty with model capability yields superior performance, improving accuracy on competition-level benchmarks (AIME24 and AMC23) by nearly 4 points over strong baselines.

## 2 Related Work

### 2.1 Post-training Preference Optimization

In the post-training stage, many RL algorithms improve model performance by aligning the model's outputs with human preferences, specifically by increasing the probability of generating high-quality responses and decreasing the probability of producing low-quality ones. A common algorithm is Proximal Policy Optimization (PPO), which has been applied in multiple current LLM systems. Recently, more powerful reasoning models such as KIMI K1.5, DeepSeek V3, and R1 have made modifications to PPO, giving rise to algorithms like GRPO and REINFORCE++. Although these algorithms perform well, their practical deployment is often complicated by the time-consuming online exploration they require.

In contrast, some offline methods are simpler to deploy. Direct Preference Optimization (DPO) efficiently trains large models for knowledge alignment using preference rankings instead of reward models. DPO optimizes a classification loss over preference data, which makes it simpler to implement than RL from human feedback. A line of follow-up work advances LLM alignment by shifting from static datasets to iterative self-improvement, demonstrating that dynamic, online feedback loops and repeated preference optimization significantly boost both general instruction following and complex reasoning capabilities. SPHERE and IDPO employ a self-evolving, iterative data augmentation approach for mathematical reasoning, called Online DPO. Unlike existing work, we improve effectiveness through the model's own judgment of which data to select, rather than by changing the optimization algorithm.
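To make the "classification loss" concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. The function name `dpo_pair_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions and are not taken from the paper; the inputs are assumed to be sequence-level log-probabilities under the policy and a frozen reference model.

```python
import math


def dpo_pair_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All inputs are sequence log-probabilities; `beta` is an assumed default.
    """
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference the margin is zero and the loss is log 2; the loss falls as the policy assigns relatively more probability to the chosen response.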

### 2.2 Post-training Data Strategies

Data plays a crucial role in unlocking the capabilities of models. Earlier, LIMA found that a small amount of data could activate the relevant capabilities of a model and improve results on multiple tasks. More recently, several data selection efforts in mathematics have also demonstrated the importance of data quality and diversity. For instance, S1 and LIMO used a small amount of data to elicit the mathematical reasoning capabilities of models. KIMI K1.5 adopted curriculum learning and constructed a curriculum-based data training strategy. Pangu Ultra also assigned quality and difficulty labels to the data and used a curriculum-based sampling strategy throughout its three pre-training stages.

In this work, we explore dynamic data selection during training, aiming to enhance the final RL performance by selecting training data that is aligned with the model's own competency.

## 3 Methods

### 3.1 Overview

The system operates in cycles. At iteration t, the current model M_t acts as a probe to evaluate the training data pool D. By leveraging two distinct metrics, Knowledge Semantic Alignment and Self-Aware Difficulty, we construct a dynamic curriculum D_train^(t) that targets the model's specific weaknesses. The model is then updated via preference optimization to yield M_{t+1}, shifting the difficulty frontier for the subsequent cycle.
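The cycle described above can be sketched as a short loop. This is only an outline under stated assumptions: `run_iterations`, `select`, and `update` are hypothetical names, with `select` standing in for the metric-based construction of D_train^(t) and `update` standing in for the preference optimization step.

```python
from typing import Callable, List, Sequence, TypeVar

Model = TypeVar("Model")
Example = TypeVar("Example")


def run_iterations(
    model: Model,
    pool: Sequence[Example],
    select: Callable[[Model, Sequence[Example]], List[Example]],
    update: Callable[[Model, List[Example]], Model],
    num_iterations: int,
) -> Model:
    """Alternate between probing the pool with M_t and updating to M_{t+1}."""
    for _ in range(num_iterations):
        # D_train^(t): chosen by the *current* model, so it changes every cycle.
        train_subset = select(model, pool)
        # Preference optimization step (hypothetical callable) yields M_{t+1}.
        model = update(model, train_subset)
    return model
```

Because `select` receives the current model on every pass, the selected subset moves as the model improves, which is the "shifting difficulty frontier" in the text.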

### 3.2 Metric Definition

To operationalize dynamic data selection, we introduce two complementary metrics: one for semantic coverage (what the problem is about) and one for intrinsic complexity (how hard it is for the current model).

#### 3.2.1 Knowledge Semantic Alignment

Effective training requires diversity across mathematical concepts. We treat knowledge identification as a Latent Semantic Clustering problem.

**Annotation:** We employ an expert model to generate explicit knowledge tags T(x) for each instance x (e.g., Geometry, Sequence Summation).

**Embedding and Clustering:** These tags are mapped into a vector space using Sentence-Transformers (all-MiniLM-L6-v2). We then apply K-Means clustering to partition the dataset into n semantic domains C = {C_1, C_2, ..., C_n}. This granular partitioning allows us to detect and up-sample specific domains where the model exhibits high error rates.
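As a rough illustration of the up-sampling step, the sketch below assumes cluster assignments have already been computed (for example by the embedding and K-Means stage) and that a per-instance correctness flag is available from the current model. The function names, the `smoothing` term, and sampling with replacement are assumptions made for this example, not details from the paper.

```python
import random
from collections import defaultdict


def cluster_error_rates(cluster_ids, is_correct):
    """Return {cluster_id: fraction of instances the model got wrong}."""
    totals, errors = defaultdict(int), defaultdict(int)
    for cid, ok in zip(cluster_ids, is_correct):
        totals[cid] += 1
        errors[cid] += 0 if ok else 1
    return {cid: errors[cid] / totals[cid] for cid in totals}


def upsample_weak_clusters(cluster_ids, is_correct, k, smoothing=0.05, seed=0):
    """Sample k instance indices (with replacement), weighted by cluster error rate."""
    rates = cluster_error_rates(cluster_ids, is_correct)
    # Smoothing (an assumption) keeps well-solved clusters from vanishing entirely.
    weights = [rates[cid] + smoothing for cid in cluster_ids]
    rng = random.Random(seed)
    return rng.choices(range(len(cluster_ids)), weights=weights, k=k)
```

With `smoothing=0`, instances from clusters the model solves perfectly are never drawn; a small positive value retains some coverage of every domain.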

#### 3.2.2 Self-Aware Difficulty Calibration

Unlike external difficulty scorers, which are static, we define difficulty as a function of the model's interaction with the data. We propose a Hierarchical Difficulty Metric composed of three dimensions:

**1) Probabilistic Solvability (NoP, Primary):** We perform K explorations for each query. The Number of Passes (NoP), defined as the count of correct responses, serves as the primary proxy for difficulty. A lower NoP indicates higher aleatoric uncertainty and difficulty. We define the "solvable range" as problems where the model is neither consistently correct nor consistently incorrect. Specifically, this includes instances with a NoP in a particular range.
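A minimal sketch of NoP and the solvable-range filter follows. Exact string matching against a reference answer is a simplifying assumption (a real pipeline would need a proper answer checker), and the `low` and `high` bounds are hypothetical parameters: the text does not give the range, so the defaults only exclude the two extremes.

```python
def number_of_passes(sampled_answers, reference_answer):
    """Count how many of the K sampled answers match the reference."""
    return sum(1 for answer in sampled_answers if answer == reference_answer)


def in_solvable_range(nop, k, low=1, high=None):
    """True if the model is neither always wrong nor always right.

    `low` and `high` are hypothetical bounds; the defaults only exclude
    NoP == 0 and NoP == K.
    """
    if high is None:
        high = k - 1
    return low <= nop <= high


def filter_solvable(samples_by_query, references, k):
    """Keep query ids whose NoP falls inside the default solvable range."""
    kept = []
    for qid, answers in samples_by_query.items():
        nop = number_of_passes(answers, references[qid])
        if in_solvable_range(nop, k):
            kept.append(qid)
    return kept
```

For example, with K = 4 explorations, a query answered correctly 3 times is kept, while queries answered correctly 0 or 4 times are dropped.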

#### 3.2.3 Knowledge Point Identification

Knowledge points are identified by prompting an expert model with the following instruction:

"What knowledge points need to be involved in solving the following questions? Answer should be output in the following format, no need to output the answer, reply in English. ### Knowledge Points: {}

Please output the results directly, reducing the thought process."

As shown in the examples, the second and third examples both involve trigonometry, indicating repetitive knowledge points. By leveraging this tagging, we can better identify data with similar knowledge points, thereby enabling self-learning by locating example problems for knowledge points where the model currently has weaknesses.
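One possible way to consume the expert model's reply is sketched below. It only relies on the `### Knowledge Points:` marker from the prompt; the delimiters (commas, semicolons, newlines), the lowercasing, and the helper names are assumptions made for illustration.

```python
import re


def parse_knowledge_points(reply):
    """Extract tags following the '### Knowledge Points:' marker."""
    marker = "### Knowledge Points:"
    _, found, tail = reply.partition(marker)
    if not found:
        return []
    # Assumed delimiters: commas, semicolons, or newlines; list bullets are stripped.
    parts = re.split(r"[,;\n]", tail)
    tags = [part.strip(" \t-*{}").lower() for part in parts]
    return [tag for tag in tags if tag]


def shared_tags(tags_a, tags_b):
    """Tags that two instances have in common, sorted for determinism."""
    return sorted(set(tags_a) & set(tags_b))
```

Two problems whose parsed tags both contain `trigonometry` would then be reported as sharing that knowledge point, which is the kind of overlap the tagging is meant to surface.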

### A.3 Examples of Problem-Solving Steps

Several models can automatically continue to generate replies in this format by adding the field "Step: 1".
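A small sketch of how the "Step: 1" field might be used is given below. Appending the field to the end of the prompt and counting distinct `Step: n` markers at line starts are both assumptions; the helper names are hypothetical.

```python
import re

# Matches 'Step: <number>' at the start of a line (leading spaces/tabs allowed).
STEP_PATTERN = re.compile(r"^[ \t]*Step:\s*(\d+)", re.MULTILINE)


def seed_step_format(prompt):
    """Append the 'Step: 1' field so the model continues in stepwise format."""
    return prompt.rstrip() + "\nStep: 1"


def count_steps(response):
    """Number of distinct 'Step: n' markers found at line starts."""
    return len({int(number) for number in STEP_PATTERN.findall(response)})
```

A count obtained this way could serve as the step-count signal mentioned in the contributions, alongside response length.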
