What properties of reasoning supervision are associated with improved downstream model quality?

arXiv cs.AI Papers

Summary

This paper investigates intrinsic data metrics to predict the utility of reasoning supervision before costly fine-tuning, finding that smaller models benefit from alignment-focused metrics while larger models gain from verbose traces, thus establishing a scale-aware framework for validating reasoning datasets.

arXiv:2605.13290v1 Announce Type: new Abstract: Validating training data for reasoning models typically requires expensive trial-and-error fine-tuning cycles. In this work, we investigate whether the utility of a reasoning dataset can be reliably predicted prior to training using intrinsic data metrics. We propose a suite of quantitative measures and evaluate their predictive power by fine-tuning 8B and 11B models on semantically distinct variants of a Polish reasoning dataset. Our analysis reveals that these intrinsic metrics demonstrate strong and significant correlations with downstream model performance. Crucially, we find that the predictors of utility are scale-dependent: smaller models rely on alignment-focused metrics to ensure precision, whereas larger models benefit from high redundancy, utilizing verbose traces to solve complex tasks. These findings establish a scale-aware framework for validating reasoning data, enabling practitioners to select effective training sets without the need for exhaustive empirical testing.

# What properties of reasoning supervision are associated with improved downstream model quality?
Source: [https://arxiv.org/html/2605.13290](https://arxiv.org/html/2605.13290)
Dzmitry Pihulski, Jan Eliasz, Michał Rajkowski, Przemysław Kazienko, Maciej Piasecki, Jan Kocoń, Teddy Ferdinan

Wroclaw Tech, 50\-370 Wrocław, Poland

Email: \{mikolaj\.langner, dzmitry\.pihulski, jan\.eliasz, michal\.rajkowski, kazienko, maciej\.piasecki, jan\.kocon, teddy\.ferdinan\}@pwr\.edu\.pl

###### Abstract

Validating training data for reasoning models typically requires expensive trial\-and\-error fine\-tuning cycles\. In this work, we investigate whether the utility of a reasoning dataset can be reliably predicted prior to training using intrinsic data metrics\. We propose a suite of quantitative measures and evaluate their predictive power by fine\-tuning 8B and 11B models on semantically distinct variants of a Polish reasoning dataset\. Our analysis reveals that these intrinsic metrics demonstrate strong and significant correlations with downstream model performance\. Crucially, we find that the predictors of utility are scale\-dependent: smaller models rely on alignment\-focused metrics to ensure precision, whereas larger models benefit from high redundancy, utilizing verbose traces to solve complex tasks\. These findings establish a scale\-aware framework for validating reasoning data, enabling practitioners to select effective training sets without the need for exhaustive empirical testing\.

## 1 Introduction

Explicit reasoning strategies \[[36](https://arxiv.org/html/2605.13290#bib.bib3)\] and specialized models \[[26](https://arxiv.org/html/2605.13290#bib.bib24),[12](https://arxiv.org/html/2605.13290#bib.bib25)\] have transformed the capabilities of Large Language Models \(LLMs\)\. Consequently, fine\-tuning on datasets enriched with reasoning traces has become the standard paradigm for imbuing these models with such skills\. However, while the importance of high\-quality data is universally acknowledged, the definition of quality for reasoning traces remains ambiguous\.

Currently, validating a reasoning dataset is an inefficient process that relies on post\-hoc evaluation: researchers must fine\-tune a model to discover whether their data is effective\. This training\-as\-validation approach is computationally prohibitive and unscalable\. To democratize the development of robust reasoning models, the community requires objective and computable metrics that can validate the utility of training data before the expensive fine\-tuning process begins\.

In this paper, we address this gap by establishing a link between intrinsic data characteristics and downstream model performance\. We leverage a controlled set of Polish reasoning variants from our previous work \[[29](https://arxiv.org/html/2605.13290#bib.bib5)\] and the corresponding fine\-tuned 8B and 11B models\. By subjecting these known reasoning variants to a rigorous set of quantitative measurements, from linguistic complexity to semantic alignment, we determine which metrics serve as reliable predictors of a model’s final reasoning ability\.

Our analysis is guided by the following research questions:

1. RQ1: Is it feasible to validate the utility prior to fine\-tuning?
2. RQ2: Which specific quantitative measures provide the most meaningful signal for validating training data quality?

The contributions of this work are as follows: \(1\) a systematic evaluation of validation metrics for reasoning datasets, distinguishing between superficial statistics and deep semantic indicators; \(2\) a correlation analysis linking data scores computed before training with downstream performance; and \(3\) a scale\-aware framework for selecting reasoning data, allowing researchers to estimate model performance without incurring the full cost of training\.

### 1\.1 Related Work

The precise utility of generated reasoning traces remains a subject of active debate\. Shojaee et al\. \[[32](https://arxiv.org/html/2605.13290#bib.bib11)\] argue that reasoning\-augmented models often exhibit illusionary improvements, failing catastrophically on complex tasks while overthinking simple ones\. Although Lawsen et al\. \[[20](https://arxiv.org/html/2605.13290#bib.bib12)\] challenged these findings based on methodological discrepancies, the consensus remains that reasoning traces are not a guaranteed panacea\. Furthermore, studies indicate that LLM reasoning often diverges from genuine logical inference \[[7](https://arxiv.org/html/2605.13290#bib.bib32),[5](https://arxiv.org/html/2605.13290#bib.bib33),[17](https://arxiv.org/html/2605.13290#bib.bib34),[38](https://arxiv.org/html/2605.13290#bib.bib56),[3](https://arxiv.org/html/2605.13290#bib.bib53),[10](https://arxiv.org/html/2605.13290#bib.bib57),[6](https://arxiv.org/html/2605.13290#bib.bib55),[29](https://arxiv.org/html/2605.13290#bib.bib5)\], with models frequently omitting premises or generating hallucinated reasoning steps that do not correlate with the accuracy of the final answer\.

Recent work has attempted to isolate specific attributes of reasoning data that drive performance, particularly sequence length\. Jin et al\. \[[16](https://arxiv.org/html/2605.13290#bib.bib38)\] posit that extending the length of reasoning, regardless of quality, can boost performance\. In contrast, Wu et al\. \[[39](https://arxiv.org/html/2605.13290#bib.bib30)\] demonstrate an inverted U\-shaped relationship, suggesting that excessive length introduces error accumulation\.

Collectively, these conflicting findings suggest that neither length nor the presence of reasoning alone is a sufficient proxy for the utility of training data\. Although prior work largely evaluates reasoning quality by analyzing model outputs, there is insufficient research on validation methods that evaluate reasoning data before committing computational resources to fine\-tuning\. Our work addresses this gap by correlating intrinsic data metrics with downstream performance established in our prior experiments\.

![Refer to caption](https://arxiv.org/html/2605.13290v1/x1.png)

Figure 1: We translated a subset of Mixture\-of\-Thoughts \[[13](https://arxiv.org/html/2605.13290#bib.bib13),[23](https://arxiv.org/html/2605.13290#bib.bib15),[27](https://arxiv.org/html/2605.13290#bib.bib14),[2](https://arxiv.org/html/2605.13290#bib.bib16)\] into Polish, and split it into a training set \(MoT\-PL\) and an evaluation set \(MoT\-PL\-eval\)\. Three additional variants of MoT\-PL were created by paraphrasing only the reasoning part of each example: the Summarized style made the reasoning much more concise, the BabyThink style greatly simplified the reasoning, and the Lengthy style prolonged the reasoning\. Afterwards, we fine\-tuned PLLuM\-8B\-instruct and Bielik\-11B\-v2\.6\-Instruct on these datasets separately and evaluated them\.

## 2 Experimental Setup

### 2\.1 Datasets

To rigorously evaluate the efficacy of pre\-training validation metrics, we used four distinct reasoning datasets derived from the Polish Mixture\-of\-Thoughts \(MoT\-PL\)\. The original MoT\-PL dataset was created by sampling approximately 32,000 examples from the English Mixture\-of\-Thoughts collection \[[13](https://arxiv.org/html/2605.13290#bib.bib13)\] and translating them into Polish using DeepSeek\-V3 \[[9](https://arxiv.org/html/2605.13290#bib.bib17)\]\. After filtering for errors and context length, the final dataset contained 22,571 examples spanning three domains: Mathematics \(28%\), Programming \(17%\), and Science \(55%\)\. To ensure the generated traces exhibited natural, human\-like fluency rather than rigid machine translation artifacts, a randomly sampled subset of the DeepSeek\-V3 outputs was manually verified by native Polish speakers\.

From this foundational dataset, we derived four semantically distinct variants to serve as our controlled variables: the unmodified original and three paraphrased versions \(see Figure [1](https://arxiv.org/html/2605.13290#S1.F1)\)\. These datasets, Detailed, Summarized, BabyThink, and Lengthy, share identical user prompts and final answers but differ significantly in the style, length, and semantic density of their reasoning traces\. The general statistics of the dataset variants are shown in Table [1](https://arxiv.org/html/2605.13290#S2.T1)\. The three paraphrased variants were generated automatically using DeepSeek\-V3, resulting in the following profiles:

- Detailed: The unmodified MoT\-PL dataset, representing high\-quality standard reasoning\. The traces mimic the depth of the original English Mixture\-of\-Thoughts, serving as our control for standard reasoning density\.
- Summarized: A concise variant in which reasoning traces were compressed to retain essential logic while stripping stylistic fluff\. This dataset tests the hypothesis that a higher information density correlates with efficiency\.
- BabyThink: A variant paraphrased into "childlike" language\. Rather than merely reducing statistical readability, the prompt intentionally obfuscates specific details and calculations with vague filler\. The original train of thought and structure are strictly preserved to avoid injecting artificial hallucinations or new reasoning fallacies\.
- Lengthy: An artificially prolonged variant designed to be approximately twice as long as the Detailed version\. It preserves the original logic but introduces verbosity, allowing us to test if metrics favoring longer chains are misleading\.

Table 1: Statistical profile of the dataset variants used for metric validation\. All variants share identical question/answer pairs; variations occur strictly within the reasoning trace\. The first value of a token count comes from using the PLLuM\-8B\-instruct tokenizer, while the second value comes from using the Bielik\-11B\-v2\.6\-Instruct tokenizer\. All examples exceeding the context window limit \(32k tokens\) were filtered out prior to statistical analysis and training to ensure a consistent evaluation across all variants\.

### 2\.2 Target Models

To establish a robust performance baseline across different architectures, we utilized two state\-of\-the\-art Polish\-centric LLMs as a backbone for our experiments:

- PLLuM\-8B\-instruct \[[18](https://arxiv.org/html/2605.13290#bib.bib8),[28](https://arxiv.org/html/2605.13290#bib.bib9),[19](https://arxiv.org/html/2605.13290#bib.bib7)\]: A derivative of Llama\-3\.1\-8B \[[11](https://arxiv.org/html/2605.13290#bib.bib10)\], adapted via continual pre\-training and instruction tuning on a massive Polish corpus;
- Bielik\-11B\-v2\.6\-Instruct \[[25](https://arxiv.org/html/2605.13290#bib.bib39)\]: Built upon Mistral 7B v0\.2 \[[15](https://arxiv.org/html/2605.13290#bib.bib40)\], similarly enhanced with Polish\-specific pre\-training and fine\-tuning\.

Since neither model possesses native reasoning capabilities, we adapted them by introducing the special tokens <think\> and </think\> and expanding their embedding layers accordingly\. The models were fine\-tuned separately on the four dataset variants \(Section [2\.1](https://arxiv.org/html/2605.13290#S2.SS1)\), resulting in a diverse set of checkpoints with varying reasoning behaviors\. The technical specifications are detailed in Appendix [8](https://arxiv.org/html/2605.13290#S8)\.
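The token\-and\-embedding adaptation described above can be sketched as follows. This is a minimal illustration, assuming the Hugging Face `transformers` interface (`add_special_tokens` on the tokenizer and `resize_token_embeddings` on the model); the paper does not state here which library was used, and the helper name `add_think_tokens` is hypothetical.

```python
# Sketch only: assumes tokenizer/model objects that follow the Hugging Face
# `transformers` interface. The helper name is hypothetical.
THINK_TOKENS = ["<think>", "</think>"]


def add_think_tokens(tokenizer, model):
    """Register the reasoning delimiters and grow the embedding matrix.

    Returns the number of tokens that were actually new to the vocabulary.
    """
    # add_special_tokens returns how many tokens were added to the vocabulary.
    num_added = tokenizer.add_special_tokens(
        {"additional_special_tokens": THINK_TOKENS}
    )
    if num_added > 0:
        # New vocabulary entries need matching rows in the embedding layer.
        model.resize_token_embeddings(len(tokenizer))
    return num_added
```

With `transformers` installed, the objects would typically come from `AutoTokenizer.from_pretrained(...)` and `AutoModelForCausalLM.from_pretrained(...)`; the newly added embedding rows start untrained, so they only become meaningful after fine\-tuning.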

### 2\.3 Downstream Performance Benchmarks

To measure the utility of the training data variants, we evaluated the fine\-tuned models on a comprehensive suite of benchmarks\. These evaluation scores serve as ground truth labels against which we correlate our pre\-training data metrics\.

We selected four diverse benchmarks to capture different aspects of reasoning and language understanding:

- MoT\-PL\-eval: The held\-out test split of our MoT\-PL dataset \(see Section [2\.1](https://arxiv.org/html/2605.13290#S2.SS1)\), serving as the primary in\-domain metric for Polish reasoning\.
- Belebele \[[1](https://arxiv.org/html/2605.13290#bib.bib35)\]: A challenging multilingual reading comprehension benchmark testing the models’ ability to extract information from complex passages\.
- Aya Collection \[[33](https://arxiv.org/html/2605.13290#bib.bib36)\]: A broad instruction\-following suite covering summarization, classification, and QA, used to verify general capability retention\.
- LightR1 \[[37](https://arxiv.org/html/2605.13290#bib.bib37)\]: An English\-language benchmark for high\-difficulty logical tasks, included to assess cross\-lingual transfer of reasoning\.

### 2\.4 Evaluation Protocol

To obtain the ground\-truth performance scores needed for our correlation analysis, we evaluated all fine\-tuned models on the four benchmarks described above\. For each benchmark, we sampled a stratified test set of 900 examples to ensure balanced coverage of reasoning lengths and task types\. We report each model’s performance using two primary metrics: Absolute Accuracy and Relative Percentage Change compared to the base model, to isolate the specific impact of training data\.
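The second metric can be written down in a few lines. The sketch below assumes the conventional definition of relative change, 100 × \(fine\-tuned − base\) / base; the paper does not spell out the formula in this section, and the function name and example accuracies are hypothetical.

```python
def relative_change_pct(base_acc: float, tuned_acc: float) -> float:
    """Relative percentage change of a fine-tuned model versus its base model.

    Assumes the conventional definition: 100 * (tuned - base) / base.
    """
    if base_acc <= 0:
        raise ValueError("base accuracy must be positive")
    return 100.0 * (tuned_acc - base_acc) / base_acc


# Hypothetical accuracies, not values from the paper.
gain = relative_change_pct(base_acc=0.50, tuned_acc=0.55)
loss = relative_change_pct(base_acc=0.50, tuned_acc=0.40)
```

Because the change is normalized by the base accuracy, a model with a low starting score can show a large relative gain from a small absolute improvement, which is worth keeping in mind when comparing the two model sizes.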

Given the scale of evaluation, we adopted the LLM\-as\-a\-judge paradigm\. We used DeepSeek\-R1\-0528 \[[12](https://arxiv.org/html/2605.13290#bib.bib25)\] as an oracle judge\. The judge was strictly prompted to assess the correctness of the final answer \(ignoring intermediate reasoning steps\) against the ground truth\. This binary decision process was applied across all benchmarks\.

To ensure the reliability of these generated scores, we conducted a manual audit on a subset of 100 random samples from the MoT\-PL\-eval dataset\. A human expert annotated these samples blindly \(without seeing the model’s judgment\)\. The agreement rate between the human annotator and DeepSeek\-R1\-0528 was 95%, with a Cohen’s Kappa score of 0\.886\. This strong alignment confirms that our automated ground\-truth labels are a reliable proxy for human evaluation\.
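For readers who want to run the same kind of audit, the sketch below computes the raw agreement rate and Cohen’s Kappa for two annotators. The label vectors are hypothetical and are not the audit data from the paper, so the resulting Kappa is not expected to match 0\.886\.

```python
from collections import Counter


def agreement_rate(labels_a, labels_b):
    """Fraction of items on which the two annotators agree."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = agreement_rate(labels_a, labels_b)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the product of each annotator's label marginals.
    expected = sum(
        freq_a[label] * freq_b[label] for label in set(freq_a) | set(freq_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical audit: 100 binary verdicts with 5 disagreements.
human = [1] * 50 + [0] * 50
judge = [1] * 45 + [0] * 5 + [0] * 50
rate = agreement_rate(human, judge)
kappa = cohens_kappa(human, judge)
```

Kappa falls below the raw agreement rate whenever some agreement is expected by chance, which is why the paper reports both numbers.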

During evaluation, the judge was provided with the query, reference answer, and model prediction, and instructed to output a binary decision in a constrained JSON format\. The exact prompt templates for all benchmarks are available in our public repository: [https://github\.com/DzmitryPihulski/prompts](https://github.com/DzmitryPihulski/prompts)\.
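Turning constrained JSON verdicts into an accuracy score might look like the sketch below. The field name `correct` is a hypothetical placeholder, since the actual schema is defined in the authors’ prompt templates, and dropping unparsable outputs from the denominator is an assumption rather than the paper’s documented policy.

```python
import json
from typing import Iterable, Optional


def parse_verdict(raw: str) -> Optional[bool]:
    """Extract a binary verdict from one judge output, or None if malformed.

    The "correct" key is a hypothetical schema, not the authors' actual one.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    verdict = payload.get("correct")
    return verdict if isinstance(verdict, bool) else None


def accuracy(raw_outputs: Iterable[str]) -> float:
    """Share of parsable verdicts that are positive (assumed policy)."""
    verdicts = [parse_verdict(raw) for raw in raw_outputs]
    valid = [v for v in verdicts if v is not None]
    if not valid:
        raise ValueError("no parsable verdicts")
    return sum(valid) / len(valid)
```

An alternative policy is to count malformed judge outputs as incorrect; which choice is appropriate depends on how often the judge breaks the format.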

## 3 Methodology

To systematically evaluate the utility of reasoning data prior to training, we propose a multi\-dimensional validation framework\. We categorize our metrics into two distinct groups: Model\-based Metrics and Analytical Metrics\. With Model\-based Metrics, we aim to assess the logical integrity of the reasoning trace\. We adopted the FVCU \(Factuality, Validity, Coherence, Utility\) taxonomy proposed by \[[21](https://arxiv.org/html/2605.13290#bib.bib18)\] for these metrics\. Meanwhile, we designed our Analytical Metrics to measure statistical and structural properties of the text\.

### 3\.1 Model\-based Metrics

To assess the intrinsic quality of the reasoning steps beyond binary correctness, we implement an automated evaluation pipeline based on the FVCU taxonomy \(Factuality, Validity, Coherence, Utility\) \[[21](https://arxiv.org/html/2605.13290#bib.bib18)\]\. This approach verifies whether the reasoning process itself is sound at the atomic level\.

We utilize a two\-stage pipeline consisting of an Atomizer and a Judge, both powered by Qwen3\-235B\-A22B\-Instruct\-2507\-FP8 \[[30](https://arxiv.org/html/2605.13290#bib.bib22)\]\.

1. Atomizer: Decomposes raw reasoning traces into atomic steps using a strict verbatim extraction strategy\. This preserves the original density and style of the text, aligning with process supervision standards \[[22](https://arxiv.org/html/2605.13290#bib.bib21)\]\.
2. Judge: Evaluates each step, one\-by\-one, against the FVCU taxonomy\.

##### Metric Definitions

- Factuality \($F$\): Assesses the consistency with premises and external truths using the Principal Knowledge Grounding method \[[14](https://arxiv.org/html/2605.13290#bib.bib19)\], ensuring that steps are supported by explicit problem statements rather than hallucinated constraints\.
- Validity \($V$\): Evaluates the mathematical and inferential correctness of the derivation\. It distinguishes between calculation errors and logical fallacies\.
- Coherence \($C$\): Checks if the step logically follows the preceding one without gaps, satisfying the Markov property of the chain \[[35](https://arxiv.org/html/2605.13290#bib.bib20)\]\.
- Utility \($U$\): Measures whether the step contributes effective progress towards the solution, distinguishing constructive decomposition from "reasoning loops"\.
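The two\-stage pipeline and the four criteria can be tied together as in the sketch below. The `atomize` and `judge` callables are injected placeholders for the LLM\-backed stages, and reporting each criterion as the percentage of steps that pass it is an assumption about the aggregation, which the text does not specify.

```python
from typing import Callable, Dict, Iterable, List

CRITERIA = ("factuality", "validity", "coherence", "utility")


def fvcu_scores(
    traces: Iterable[str],
    atomize: Callable[[str], List[str]],
    judge: Callable[[List[str], str], Dict[str, bool]],
) -> Dict[str, float]:
    """Percentage of atomic steps passing each FVCU criterion.

    atomize: splits one trace into verbatim steps (stage 1).
    judge:   rates one step given the preceding steps (stage 2).
    Both are hypothetical stand-ins for the LLM-backed components.
    """
    passed = {criterion: 0 for criterion in CRITERIA}
    total_steps = 0
    for trace in traces:
        steps = atomize(trace)
        for index, step in enumerate(steps):
            # The judge sees earlier steps so it can assess coherence.
            verdict = judge(steps[:index], step)
            for criterion in CRITERIA:
                passed[criterion] += bool(verdict[criterion])
            total_steps += 1
    if total_steps == 0:
        raise ValueError("no reasoning steps to score")
    return {c: 100.0 * passed[c] / total_steps for c in CRITERIA}
```

Passing the stages in as callables keeps the aggregation logic testable with simple stubs before any model calls are wired in.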

### 3\.2 Analytical Metrics

To complement the computationally expensive FVCU, we compute scalable structural metrics across the full training dataset:

- Semantic Alignment: Cosine similarity between query and reasoning trace embeddings \(using mmlw\-roberta\-large \[[8](https://arxiv.org/html/2605.13290#bib.bib4)\]\), serving as a proxy for instruction adherence \[[40](https://arxiv.org/html/2605.13290#bib.bib52)\]\.
- Semantic Flow: Average cosine similarity between consecutive sentences, quantifying narrative smoothness and transitional logic \[[40](https://arxiv.org/html/2605.13290#bib.bib52)\]\.
- Redundancy Ratio: An inverse proxy for information density, calculated as $1-\frac{\text{len}_{compressed}}{\text{len}_{original}}$ using zlib compression\. Higher values indicate repetitive patterns or verbosity \[[31](https://arxiv.org/html/2605.13290#bib.bib51),[4](https://arxiv.org/html/2605.13290#bib.bib50)\]\.
- Syntactic Depth: Average maximum depth of dependency trees \(computed via the spacy library\), indicating linguistic complexity and cognitive load \[[40](https://arxiv.org/html/2605.13290#bib.bib52)\]\.
- Symbolic Fraction: Ratio of non\-alphanumeric characters to total text, capturing the density of mathematical or code\-like notation \[[31](https://arxiv.org/html/2605.13290#bib.bib51),[4](https://arxiv.org/html/2605.13290#bib.bib50)\]\.
- Perplexity: Exponentiated average negative log\-likelihood per token taken from Qwen3\-4B \[[30](https://arxiv.org/html/2605.13290#bib.bib22)\], measuring the text’s conformity to general knowledge \[[31](https://arxiv.org/html/2605.13290#bib.bib51),[4](https://arxiv.org/html/2605.13290#bib.bib50)\]\.
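Several of the metrics above reduce to a few lines of standard\-library Python, sketched below. The embeddings and token log\-probabilities are taken as plain inputs, so obtaining them from mmlw\-roberta\-large or Qwen3\-4B is outside this sketch, as is Syntactic Depth. Measuring lengths in UTF\-8 bytes, using the default zlib level, and counting whitespace as non\-alphanumeric are assumptions, since the text does not fix these details.

```python
import math
import zlib
from typing import List, Sequence


def redundancy_ratio(text: str) -> float:
    """1 - compressed/original length, in UTF-8 bytes (assumed unit)."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("empty text")
    return 1.0 - len(zlib.compress(raw)) / len(raw)


def symbolic_fraction(text: str) -> float:
    """Share of characters that are not alphanumeric (whitespace included)."""
    if not text:
        raise ValueError("empty text")
    return sum(not ch.isalnum() for ch in text) / len(text)


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative natural-log likelihood per token."""
    if not token_log_probs:
        raise ValueError("no tokens")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; serves as Semantic Alignment for (query, trace)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("zero-length embedding")
    return dot / norm


def semantic_flow(sentence_embeddings: List[Sequence[float]]) -> float:
    """Mean cosine similarity between consecutive sentence embeddings."""
    pairs = list(zip(sentence_embeddings, sentence_embeddings[1:]))
    if not pairs:
        raise ValueError("need at least two sentences")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

Note that `redundancy_ratio` can be negative for very short strings, because the zlib header and checksum overhead can exceed the savings; the metric is only meaningful on traces of realistic length.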

A core motivation of our framework is replacing costly trial\-and\-error fine\-tuning with efficient pre\-training validation\. While brute\-force empirical validation requires heavy forward and backward passes across all candidate datasets, incurring massive computational debt, our analytical pipeline bypasses gradient updates entirely\. By relying strictly on lightweight processing and single\-pass embedding extraction, we reduce the validation footprint from dozens of multi\-GPU hours to negligible compute time\.

## 4 Results and Analysis

In this section, we present the empirical findings of our validation study\. We begin by analyzing the intrinsic quality of the datasets using our framework: first, the model\-based evaluation on a subsample of 1,000 examples per variant, and second, the analytical profiling of the full training corpora\. Finally, we report the performance of the fine\-tuned models on downstream tasks and correlate it with these metrics to identify the most reliable predictors of success\.

### 4\.1 Model\-based Metrics

Due to the prohibitive computational cost of model\-based judging, FVCU metrics were evaluated on a single subsample of 1,000 examples per variant\. To mitigate the variance inherent in single\-batch evaluation, we employed rigorous stratified sampling, ensuring the subset accurately preserves the domain and complexity distribution of the full dataset\. While the lack of multiple independent batches precludes formal variance calculations, this stratified design yields a highly representative estimate\. Consequently, we frame these FVCU scores not as absolute statistical bounds, but as robust directional indicators of the reasoning trade\-offs between our dataset variants\. Table [2](https://arxiv.org/html/2605.13290#S4.T2) presents these results\.
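Proportional stratified sampling of this kind can be sketched as below. The stratum function, the largest\-remainder rounding, and the fixed seed are illustrative assumptions; the paper states only that the subsample preserves the domain and complexity distribution.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence, TypeVar

T = TypeVar("T")


def stratified_sample(
    examples: Sequence[T],
    stratum_of: Callable[[T], Hashable],
    k: int,
    seed: int = 0,
) -> List[T]:
    """Draw k examples so each stratum keeps its share of the full dataset."""
    n = len(examples)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the dataset size")
    groups = defaultdict(list)
    for example in examples:
        groups[stratum_of(example)].append(example)
    # Largest-remainder allocation: floor each quota, then hand the leftover
    # slots to the strata with the largest fractional parts.
    quotas = {g: k * len(items) / n for g, items in groups.items()}
    alloc = {g: int(q) for g, q in quotas.items()}
    leftover = k - sum(alloc.values())
    ranked = sorted(groups, key=lambda g: quotas[g] - alloc[g], reverse=True)
    for g in ranked[:leftover]:
        alloc[g] += 1
    rng = random.Random(seed)  # fixed seed for a reproducible subsample
    sample: List[T] = []
    for g, items in groups.items():
        sample.extend(rng.sample(items, alloc[g]))
    return sample
```

To stratify on two properties at once, `stratum_of` can return a tuple such as a domain label paired with a length bucket.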

Table 2: Model\-based metrics evaluation on 1,000 MoT\-PL subsamples\.

The Summarized variant maximizes Utility \(90\.5%\) but at the expense of Validity \(87\.3%\)\. This expected drop in Validity occurs because our LLM judge strictly evaluates explicit step\-by\-step derivation, penalizing the intentional omission of intermediate steps as logical gaps even when the final conclusion remains factual and highly useful\. In contrast, Lengthy achieves the highest Validity \(95\.2%\) and Coherence \(98\.5%\), indicating that granular, explicit derivations are essential for stabilizing the reasoning process\. Finally, the baseline BabyThink demonstrates that high Coherence \(91\.4%\) is insufficient for reasoning quality; its low Validity \(65\.8%\) confirms that the model can generate linguistically smooth but factually ungrounded chains\.

These findings highlight a critical trade\-off in reasoning data curation: while stripping intermediate steps \(Summarized\) increases immediate task utility, it degrades the rigorous logical grounding required for out\-of\-distribution generalization\. Conversely, verbosity \(Lengthy\) acts as a safeguard against hallucination by enforcing strict state\-tracking, which is essential for complex reasoning but requires sufficient model capacity to process\.

### 4\.2 Analytical Metrics

We extended our analysis to the entire training dataset using computationally efficient metrics\. Table [3](https://arxiv.org/html/2605.13290#S4.T3) summarizes the profiles of each variant\.

Table 3: Analytical metrics calculated on the full training MoT\-PL datasets\.

The Summarized dataset emerges as the most information\-dense, exhibiting the highest Symbolic Fraction \(0\.201\) and Syntactic Depth \(4\.92\) while maintaining the lowest Redundancy Ratio \(0\.441\)\. In contrast, the Lengthy and Detailed variants share nearly identical redundancy scores \($\sim$0\.62\), suggesting that the Lengthy variant scales volume without altering the fundamental compression rate of the text\. Notably, the BabyThink variant, despite its simplified vocabulary, yields the highest Perplexity \(2\.42\) and the lowest Semantic Alignment \(0\.916\)\.

These structural differences imply that reasoning quality is not merely a function of length but of information pacing\. The high Perplexity and low Semantic Alignment of the BabyThink variant suggest that artificially simplifying vocabulary disrupts the natural language distribution the model expects, paradoxically making the reasoning harder to learn from despite its simpler syntax\.

### 4\.3 Downstream Model Performance

We evaluate the fine\-tuned models across four benchmarks to establish the ground truth for our correlation analysis\. We present the results in three stages: Absolute Accuracy, Relative Performance Change, and finally a domain\-specific breakdown\.

Table [4](https://arxiv.org/html/2605.13290#S4.T4) presents the absolute accuracy\. Consistent with the difference in model size, Bielik\-11B significantly outperforms PLLuM\-8B\. For PLLuM\-8B, the Detailed variant achieves the highest average performance \(0\.513\), showing particular strength in Polish reasoning tasks on MoT\-PL\-eval \(0\.374\)\. For Bielik\-11B, the Lengthy variant emerges as the superior specialist in reasoning overall, achieving the highest absolute scores on both MoT\-PL\-eval \(0\.701\) and LightR1 \(0\.599\)\.

Table 4: Absolute Accuracy on downstream tasks\. Avg\. is the macro\-average\.

Table [5](https://arxiv.org/html/2605.13290#S4.T5) reports the Relative Percentage Change to normalize for the base model capabilities\. The Detailed model proved to be the safest strategy, delivering consistent gains for PLLuM\-8B on most benchmarks\. Including general NLP benchmarks in the average demonstrates that high\-quality reasoning \(Detailed\) improves standard tasks\. However, isolated reasoning benchmarks reveal an expected trade\-off: fine\-tuning exclusively on MoT\-PL boosts our target Polish reasoning but degrades English reasoning \(LightR1\) due to mild catastrophic forgetting of English chain\-of\-thought capabilities\. Finally, the Lengthy dataset exhibits a volatile profile: on the smaller PLLuM\-8B, it caused catastrophic forgetting, but in the larger Bielik\-11B model, it unlocked significant reasoning capabilities, increasing performance on MoT\-PL\-eval by \+12\.3% and LightR1 by \+15\.0%\.

Table 5: Relative Percentage Change \(%\) on downstream tasks between original and fine\-tuned models\.

The domain\-specific breakdown in Table [6](https://arxiv.org/html/2605.13290#S4.T6) exposes a critical dependency between model capacity and reasoning density\. In MATH, we observe a striking inversion of preferences: the smaller PLLuM\-8B benefits exclusively from the Summarized variant \(\+26\.2%\), likely succumbing to context drift in longer chains, whereas the larger Bielik\-11B effectively utilizes the "thinking space" of Lengthy derivations \(\+12\.5%\) to navigate complex logic\. This capacity gap is most acute in CODE, where verbose reasoning acts as a crucial scaffold for Bielik\-11B \(\+131\.4%\) but induces catastrophic forgetting in PLLuM\-8B \(\-73% to \-96%\)\. In contrast, in SCIENCE, the smaller model sees the largest relative gains \(\+28\.6%\), suggesting that reasoning traces help unlock latent knowledge, while the larger model hits a performance ceiling with only marginal improvements \(\+5\.0%\)\.

Table 6: MoT\-PL\-eval performance by domain\. Cells show Accuracy followed by \(Relative Gain %\) compared to the Original baseline\. Bold indicates the best result per model/domain\.

### 4\.4 Correlation Analysis: Drivers of Reasoning Performance

To understand the mechanisms behind the observed performance changes, we analyzed the relationship between our training data metrics \(defined in Sections [3\.1](https://arxiv.org/html/2605.13290#S3.SS1) and [3\.2](https://arxiv.org/html/2605.13290#S3.SS2)\) and the downstream performance\. We calculated the Spearman Rank Correlation \($\rho$\) between each metric and the relative performance gain\.
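A dependency\-free sketch of the Spearman computation is given below: both series are converted to ranks \(ties receive the average rank\) and the Pearson correlation of the ranks is returned. The metric values and gains in the example are hypothetical and are not taken from the paper’s tables.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, made 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long series with >= 2 points")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical metric values and relative gains for four dataset variants.
metric = [0.44, 0.62, 0.58, 0.63]
gain = [-3.0, 2.0, 5.0, 12.0]
rho = spearman_rho(metric, gain)
```

If each correlation is computed over the four dataset variants, as the setup suggests, then without ties $\rho = 1 - \sum d^2 / 10$ with $\sum d^2$ even, so $\rho$ can only move in steps of 0\.2\. A value of 1\.0 therefore means only that the metric orders the four variants exactly as the benchmark does.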

Table [7](https://arxiv.org/html/2605.13290#S4.T7) highlights a distinct divergence in how training data characteristics translate to downstream performance\. For PLLuM\-8B, performance is primarily driven by Semantic Alignment \($\rho_{avg}=0.75$\), Semantic Flow \($\rho_{avg}=0.65$\) and Factuality \($\rho_{avg}=0.45$\)\. This suggests that the smaller model relies heavily on clear, instruction\-compliant data\. In particular, Utility shows a strong negative correlation with the complex LightR1 benchmark \($\rho=-0.74$\)\. This indicates that data optimized for high utility, typically concise summaries, deprives the model of the intermediate reasoning tokens necessary to learn complex logic steps\. In contrast, Bielik\-11B demonstrates a strong dependence on reasoning volume and correctness\. Although the Redundancy Ratio perfectly predicts success on LightR1 \($\rho=1.0$\), the model\-based metrics clarify the nature of this redundancy\. Validity and Coherence show near\-perfect correlations with reasoning tasks, confirming that the model leverages redundant tokens effectively only when they form logically valid reasoning\. In contrast, Semantic Flow correlates negatively with hard reasoning \($\rho=-0.80$\), reinforcing that narrative smoothness is less critical than a rigorous step\-by\-step derivation for the larger model\.

Table 7: Spearman’s $\rho$ between training dataset metrics and downstream performance on general benchmarks\. The metrics are divided into Analytical \(on full dataset\) and Model\-based \(on a stratified subsample of 1,000 examples\)\.

Table [8](https://arxiv.org/html/2605.13290#S4.T8) differentiates the drivers for procedural logic versus knowledge retrieval\. In the Code and Math domains, Semantic Flow correlates negatively for Bielik\-11B \(reaching $\rho=-0.80$\), indicating that narrative smoothness often impedes strict logical derivation\. For Bielik\-11B, performance in these domains depends on a combination of reasoning volume and correctness\. The model shows a perfect correlation with Redundancy Ratio \($\rho=1.0$\) alongside strong correlations with Validity and Coherence \($\rho=0.80$\)\. This suggests that the benefit of verbose reasoning comes from the generation of valid and coherent intermediate steps rather than redundancy alone\. PLLuM\-8B shows a divergent pattern in Math, where performance correlates perfectly with Symbolic Fraction \($\rho=1.0$\) and strongly with Utility \($\rho=0.80$\), but weakly with Validity \($\rho=0.20$\)\. This implies a reliance on formal notation and concise answers rather than the verification of the logical chain\. However, in Code, PLLuM\-8B aligns with the larger model, showing strong correlations with both Redundancy \($\rho=1.0$\) and Validity \($\rho=0.80$\)\. Finally, Science is distinct; here, Bielik\-11B exhibits a perfect correlation with Factuality \($\rho=1.0$\), identifying factual accuracy as the sole critical driver, while PLLuM\-8B relies primarily on Semantic Flow and Semantic Alignment \($\rho=0.80$\)\.

Table 8: Spearman’s $\rho$ on Reasoning Domains for MoT\-PL dataset comparison\. Left side: PLLuM\-8B, Right side: Bielik\-11B\.

## 5 Discussion

##### RQ1\. Is it feasible to validate the utility prior to fine\-tuning?

Yes, but the predictive signal of the metrics depends on the model size\. Our analysis confirms that dataset metrics are reliable performance predictors \($\rho\geq 0.75$\), yet there is no universal quality profile\. For example, Redundancy Ratio acts as a decisive positive signal for the Bielik\-11B model in reasoning tasks \($\rho=1.0$\) but remains neutral or negative for the PLLuM\-8B model\. Similarly, while Semantic Alignment universally benefits general instruction following, it fails to predict success in complex reasoning for larger models\. This indicates that pre\-validation of training data requires a scale\-based calibration; small models benefit more from semantic coherence with less redundancy, while larger models can more effectively leverage redundancy in longer reasoning traces\.

##### RQ2\. Which specific quantitative measures provide the most meaningful signal for validating training data quality?

We observe a fundamental dichotomy in metric efficacy driven by the complexity threshold of the model\.

ForPLLuM\-8B, performance is driven bySemantic Alignment\(ρ=0\.75\\rho=0\.75\) andFactuality\. However, we observe a distinct negative correlation betweenUtilityand complex reasoning \(ρ=−0\.74\\rho=\-0\.74in LightR1\)\. This suggests that data optimized for high human utility deprives smaller models of the intermediate tokens necessary to learn logic\. Thus, for smaller models, the most critical signal is the directness and factual grounding of the data, rather than its reasoning depth\.

For Bielik\-11B, Redundancy Ratio is the strongest predictor of reasoning success \(ρ=1\.0\), provided that it is supported by high Validity \(ρ=0\.80\)\. Crucially, Semantic Flow correlates negatively with Math and Code performance \(ρ=−0\.80\)\. This indicates that the larger model benefits from verbose, rigorous derivation steps, even if repetitive, rather than smooth narrative explanations\. In knowledge\-heavy domains like Science, this shifts entirely to Factuality \(ρ=1\.0\), rendering structural metrics less relevant\.
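The per-model contrast can be summarized by filtering the correlations quoted in the text against the ρ ≥ 0.75 reliability level mentioned under RQ1. The sketch below transcribes only values stated in the prose of Sections 4 and 5 (the full tables are not reproduced here); the domain labels are simplified, and applying the threshold to the absolute value, so that strong negative predictors are also kept, is an assumption.

```python
# Correlations transcribed from the prose; keys are (model, domain, metric).
REPORTED_RHO = {
    ("Bielik-11B", "Math/Code", "Redundancy Ratio"): 1.0,
    ("Bielik-11B", "Math/Code", "Validity"): 0.80,
    ("Bielik-11B", "Math/Code", "Semantic Flow"): -0.80,
    ("Bielik-11B", "Science", "Factuality"): 1.0,
    ("PLLuM-8B", "Math", "Symbolic Fraction"): 1.0,
    ("PLLuM-8B", "Math", "Utility"): 0.80,
    ("PLLuM-8B", "Math", "Validity"): 0.20,
}


def strong_predictors(model, threshold=0.75):
    """Return (domain, metric, rho) with |rho| >= threshold, strongest first."""
    selected = [
        (domain, metric, rho)
        for (name, domain, metric), rho in REPORTED_RHO.items()
        if name == model and abs(rho) >= threshold
    ]
    # Python's sort is stable, so ties keep their transcription order.
    return sorted(selected, key=lambda item: -abs(item[2]))


for model in ("PLLuM-8B", "Bielik-11B"):
    print(model, strong_predictors(model))
```

Keeping the sign in the output matters: a metric such as Semantic Flow passes the threshold for Bielik-11B but should be minimized rather than maximized.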

## 6 Conclusions

This study establishes that effective data validation requires calibrating metrics to model capacity\. By analyzing the correlation between intrinsic data properties and downstream performance, we identified distinct optimization requirements for different parameter scales\.

For the smaller model \(PLLuM\-8B\), we observed a negative correlation between metrics favoring conciseness \(Utility\) and reasoning performance\. Such models rely primarily on Semantic Alignment and Factuality to prevent hallucinations, suggesting that training data should prioritize direct instruction adherence over complex reasoning chains\. In contrast, the larger model \(Bielik\-11B\) demonstrated a strong positive correlation with Redundancy Ratio in formal domains\. This indicates that verbose iterative derivation steps are essential for performance at this scale\. Consequently, data curation must distinguish between knowledge\-intensive tasks, which benefit from factual density, and reasoning tasks, which require structural redundancy\.

Building on these findings, future work will focus on extending this capacity\-aware validation framework across a wider spectrum of model sizes to pinpoint the exact parameter threshold for reasoning verbosity\. Additionally, we aim to employ instance\-level influence functions to establish a direct causal link between specific structural data patterns and inference\-time logical robustness\. Simultaneously, we will investigate how much the reasoning setup affects other LLM properties, such as the tendency towards hallucination \[[24](https://arxiv.org/html/2605.13290#bib.bib2)\] or in\-context learning \[[34](https://arxiv.org/html/2605.13290#bib.bib1)\]\.

## 7 Limitations

Our findings suggest that 8B and 11B models use verbose reasoning data differently, but because we did not test intermediate or much larger models, we cannot tell whether this shift is gradual or appears at a specific scale\. We also used disjoint train and test sets, which supports rigorous system\-level correlation analysis but obscures instance\-level effects, so we cannot identify how particular reasoning patterns affect individual predictions\. In addition, our evaluation depends on an LLM judge, which may introduce bias despite strong agreement with human raters; reasoning variants generated with DeepSeek\-V3 may contain paraphrasing artifacts that models can exploit; and testing Polish\-fine\-tuned models on English benchmarks introduces cross\-lingual effects that make it harder to isolate reasoning ability from language processing limitations\.

#### 7\.0\.1 Acknowledgements

This work was supported by: \(1\) the National Science Center, Poland, grant no\. 2021/41/B/ST6/04471; \(2\) CLARIN\-PL: Common Language Resources and Technology Infrastructure \(POIR\.04\.02\.00\-00C002/19, 2024/WK/01, FENG\.02\.04\-IP\.040004/24\); \(3\) Digital Research Infrastructure for the Arts and Humanities DARIAH\-PL: POIR\.04\.02\.00\-00\-D006/20, KPOD\.01\.18\-IW\.03\-0013/23; \(4\) the statutory funds of the Dept\. of AI, Wroclaw Tech; \(5\) Polish Ministry of Education and Science: “International Projects Co\-Funded”; \(6\) the EU under Horizon Europe, grant no\. 101086321 \(OMINO\)\. The views expressed are those of the authors and do not necessarily reflect those of the EU or the European Research Executive Agency\.

#### 7\.0\.2 Disclosure of Interests

All authors have received funding from the Ministry of Digital Affairs of Poland, Polish National Science Center, and the European Union\.

## References

- \[1\] L\. Bandarkar et al\. \(2024\) The belebele benchmark: a parallel reading comprehension dataset in 122 language variants\. In ACL, pp\. 749–775\.
- \[2\] A\. Bercovich et al\. \(2025\) Llama\-nemotron: efficient reasoning models\. arXiv:2505\.00949\.
- \[3\] T\. A\. Chang et al\. \(2025\) Global piqa: evaluating physical commonsense reasoning across 100\+ languages and cultures\. arXiv preprint arXiv:2510\.24081\.
- \[4\] D\. Chen et al\. \(2024\) Data\-juicer: a one\-stop data processing system for large language models\. In Proceedings of SIGMOD, pp\. 120–134\.
- \[5\] Y\. Chen et al\. \(2025\) Reasoning models don’t always say what they think\. arXiv preprint arXiv:2505\.05410\.
- \[6\] G\. Chodak et al\. \(2025\) Typology of image crises using large language models: a novel approach to crisis classification\. J\. of Contingencies and Crisis Management\.
- \[7\] J\. Chua and O\. Evans \(2025\) Are deepseek r1 and other reasoning models more faithful?\. In ICLR Workshop on Foundation Models in the Wild\.
- \[8\] S\. Dadas et al\. \(2024\) PIRB: a comprehensive benchmark of Polish dense and hybrid text retrieval methods\. In Proceedings of LREC\-COLING, pp\. 12761–12774\.
- \[9\] DeepSeek\-AI \(2024\) DeepSeek\-v3 technical report\. arXiv:2412\.19437\.
- \[10\] T\. Ferdinan et al\. \(2025\) Architectural concepts for integrating fundamental drives and emotions into artificial intelligence\. IEEE Intelligent Systems\.
- \[11\] A\. Grattafiori et al\. \(2024\) The llama 3 herd of models\. arXiv:2407\.21783\.
- \[12\] D\. Guo et al\. \(2025\) Deepseek\-r1: incentivizing reasoning capability in llms via reinforcement learning\. arXiv preprint arXiv:2501\.12948\.
- \[13\] HuggingFace \(2025\) Open r1: a fully open reproduction of deepseek\-r1\. [https://github\.com/huggingface/open\-r1](https://github.com/huggingface/open-r1)
- \[14\] H\. Hwang et al\. \(2025\) Assessing LLM reasoning steps via principal knowledge grounding\. In Findings of EMNLP, pp\. 19925–19948\.
- \[15\] A\. Q\. Jiang et al\. \(2023\) Mistral 7b\. arXiv:2310\.06825\.
- \[16\] M\. Jin et al\. \(2024\) The impact of reasoning step length on large language models\. In Findings of ACL, pp\. 1830–1842\.
- \[17\] S\. Kambhampati et al\. \(2025\) Stop anthropomorphizing intermediate tokens as reasoning/thinking traces\!\. arXiv preprint arXiv:2504\.09762\.
- \[18\] J\. Kocoń et al\. \(2025\) PLLuM: A Family of Polish Large Language Models\. arXiv preprint arXiv:2511\.03823\.
- \[19\] M\. Langner et al\. \(2025\) Divide, cache, conquer: dichotomic prompting for efficient multi\-label llm\-based classification\. In 2025 IEEE International Conference on Data Mining Workshops \(ICDMW\)\.
- \[20\] A\. Lawsen \(2025\) Comment on the illusion of thinking: understanding the strengths and limitations of reasoning models via the lens of problem complexity\. arXiv:2506\.09250\.
- \[21\] J\. Lee and J\. Hockenmaier \(2025\) Evaluating step\-by\-step reasoning traces: a survey\. In Findings of EMNLP, pp\. 1789–1814\.
- \[22\] H\. Lightman et al\. \(2024\) Let’s verify step by step\. In ICLR\.
- \[23\] A\. Lozhkov et al\. \(2025\) OpenR1\-math\-220k\. [https://huggingface\.co/datasets/open\-r1/OpenR1\-Math\-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k)
- \[24\] P\. Matys et al\. \(2025\) AggTruth: Contextual Hallucination Detection using Aggregated Attention Scores in LLMs\. In ICCS’2025, pp\. 227–243\.
- \[25\] K\. Ociepa et al\. \(2025\) Bielik 11b v2 technical report\. arXiv:2505\.02410\.
- \[26\] OpenAI \(2025\) Introducing OpenAI o3 and o4\-mini\. [https://openai\.com/index/introducing\-o3\-and\-o4\-mini](https://openai.com/index/introducing-o3-and-o4-mini)
- \[27\] G\. Penedo et al\. \(2025\) CodeForces cots\. [https://huggingface\.co/datasets/open\-r1/codeforces\-cots](https://huggingface.co/datasets/open-r1/codeforces-cots)
- \[28\] P\. Pęzik et al\. \(2025\) The PLLuM Instruction Corpus\. arXiv preprint arXiv:2511\.17161\.
- \[29\] D\. Pihulski et al\. \(2026\) Breaking the illusion of reasoning in Polish LLMs: quality over quantity of thought\. In Findings of EACL, pp\. 1796–1811\.
- \[30\] Qwen Team \(2025\) Qwen3 technical report\. arXiv:2505\.09388\.
- \[31\] J\. W\. Rae et al\. \(2021\) Scaling language models: methods, analysis & insights from training gopher\. arXiv preprint arXiv:2112\.11446\.
- \[32\] P\. Shojaee et al\. \(2025\) The illusion of thinking: understanding the strengths and limitations of reasoning models via the lens of problem complexity\. arXiv:2506\.06941\.
- \[33\] S\. Singh et al\. \(2024\) Aya dataset: an open\-access collection for multilingual instruction tuning\. In Proceedings of ACL\.
- \[34\] A\. Szczęsny et al\. \(2025\) Leveraging positional bias of llm in\-context learning with class\-few\-shot and maj\-min alternating ordering\. In ICCS’2025, pp\. 54–62\.
- \[35\] F\. Teng et al\. \(2025\) Atom of thoughts for markov llm test\-time scaling\. arXiv:2502\.12018\.
- \[36\] J\. Wei et al\. \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. NeurIPS 35, pp\. 24824–24837\.
- \[37\] L\. Wen et al\. \(2025\) Light\-r1: curriculum sft, dpo and rl for long cot from scratch and beyond\. arXiv:2503\.10460\.
- \[38\] S\. Woźniak et al\. \(2024\) Personalized large language models\. In 2024 IEEE International Conference on Data Mining Workshops \(ICDMW\)\.
- \[39\] Y\. Wu et al\. \(2025\) When more is less: understanding chain\-of\-thought length in llms\. arXiv preprint arXiv:2502\.07266\.
- \[40\] S\. E\. Zanotto and S\. Aroyehun \(2025\) Linguistic and embedding\-based profiling of texts generated by humans and large language models\. In Proceedings of EMNLP\.

## 8 Appendix

Experiments were conducted on the WCSS LEM cluster \([https://www\.wcss\.pl/en/](https://www.wcss.pl/en/)\) using nodes equipped with 4× NVIDIA H100\-94GB GPUs and Intel Xeon Platinum 8462Y\+ CPUs\. We utilized the trl library with DeepSpeed ZeRO Stage\-3\. Table [9](https://arxiv.org/html/2605.13290#S8.T9) details the hyperparameters for both model families\. We used the AdamW optimizer \(β₁=0\.9, β₂=0\.999, ϵ=10⁻⁸\)\. All model outputs were generated using fixed decoding settings: temperature=0\.6, top\-p=0\.95, top\-k=20, min\-p=0\.1, and a repetition penalty of 1\.2\.

Table 9: Fine\-tuning hyperparameters used in our experiments\.
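For reproduction purposes, the optimizer and decoding values stated in the appendix text can be collected in one place. The sketch below uses plain dataclasses; the class and field names are hypothetical (they follow naming that is common in open-source training and inference libraries), and the source does not state which inference engine was used.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AdamWSettings:
    """AdamW values quoted in the appendix text."""
    beta1: float = 0.9
    beta2: float = 0.999
    eps: float = 1e-8


@dataclass(frozen=True)
class DecodingSettings:
    """Sampling values quoted in the appendix text; field names are assumptions."""
    temperature: float = 0.6
    top_p: float = 0.95
    top_k: int = 20
    min_p: float = 0.1
    repetition_penalty: float = 1.2

    def __post_init__(self):
        # Basic range checks so that a mistyped override fails early.
        if self.temperature <= 0:
            raise ValueError("temperature must be positive")
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in (0, 1]")
        if not 0 <= self.min_p <= 1:
            raise ValueError("min_p must be in [0, 1]")
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")


# Convert to a plain dict, e.g. for passing as keyword arguments or logging.
print(asdict(DecodingSettings()))
print(asdict(AdamWSettings()))
```

Freezing the dataclasses keeps the settings fixed across all evaluated variants, which matches the text's statement that decoding was held constant.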
