Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning
Summary
Introduces Aryabhata 2, a reasoning-focused language model for competitive STEM exams, trained via reinforcement learning on PhysicsWallah's question banks, outperforming its base model with fewer tokens.
# Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning
Source: [https://arxiv.org/html/2605.28829](https://arxiv.org/html/2605.28829)
Ritvik Rastogi \(PhysicsWallah, ritvik\.rastogi@pw\.live\), Vishal Singh \(PhysicsWallah, vishal\.singh16@pw\.live\), Tejas Chaudhari \(PhysicsWallah, tejas\.chaudhari@pw\.live\), Sandeep Varma \(PhysicsWallah, sandeep\.varma@pw\.live\)
###### Abstract
Correspondence to ritvik\.rastogi@pw\.live\.
Competitive STEM examinations such as JEE and NEET require multi\-step symbolic reasoning, precise numerical computation, and deep conceptual understanding across physics, chemistry, and mathematics\. Recent large language models perform strongly on common reasoning benchmarks, yet they remain difficult to deploy at scale, where millions of student doubts demand domain\-specific, consistently structured problem solving\.
We introduce Aryabhata 2, a reasoning\-focused language model for competitive STEM examinations, trained via reinforcement\-learning post\-training\. Using PhysicsWallah’s internal question banks, we construct a high\-quality training curriculum and post\-train GPT\-OSS\-20B through reinforcement learning with verifiable rewards\. Training combines prolonged reinforcement learning with broadened exploration via progressively larger rollout group sizes\.
We evaluate Aryabhata 2 on competitive examination benchmarks, including JEE Main, JEE Advanced, and NEET, as well as out\-of\-distribution reasoning datasets such as AIME, HMMT, MMLU\-Pro, MMLU\-Redux 2\.0, and GPQA\. Results show that Aryabhata 2 outperforms its base model GPT\-OSS\-20B on competitive STEM reasoning while requiring substantially fewer output tokens \(up to 64% fewer\)\.
## 1 Introduction
Competitive examinations such as the Joint Entrance Examination \(JEE\) and the National Eligibility cum Entrance Test \(NEET\) \[[10](https://arxiv.org/html/2605.28829#bib.bib13),[11](https://arxiv.org/html/2605.28829#bib.bib14)\] represent some of the most demanding reasoning tasks encountered in large\-scale educational systems\. Solving these problems requires multi\-step symbolic manipulation, precise numerical reasoning, and deep conceptual understanding across physics, chemistry, and mathematics\. Unlike many standard reasoning benchmarks, competitive exam questions are carefully constructed to test conceptual depth and often require chaining multiple reasoning steps under strict constraints\.
PhysicsWallah’s online classes generate millions of student doubts, creating a large\-scale requirement for fast, reliable, and clear STEM reasoning support\. A substantial fraction of these doubts are tied to competitive examination preparation, where students expect not just final answers but step\-by\-step explanations that are accurate, concise, and aligned with exam\-solving strategies\.
This setting exposes a practical gap in current large language models \(LLMs\)\. Although recent models perform strongly on internet\-scale and benchmark\-style evaluations\[[4](https://arxiv.org/html/2605.28829#bib.bib2),[14](https://arxiv.org/html/2605.28829#bib.bib9)\], they remain infeasible to deploy at scale due to high inference costs\. For open\-source models, these costs are driven by large model sizes and long chain\-of\-thought reasoning, while frontier models incur prohibitively high per\-token pricing\. This challenge is particularly acute for real student doubts, which require domain\-specific syllabus coverage, multi\-step symbolic reasoning, and strict correctness across physics, chemistry, and mathematics\.
In this work, we introduce Aryabhata 2, a reasoning\-focused language model trained specifically for competitive STEM examinations\. Aryabhata 2 is obtained by post\-training GPT\-OSS\-20B \[[13](https://arxiv.org/html/2605.28829#bib.bib10)\], an open\-source 20B\-parameter Mixture of Experts model with 3\.6B active parameters, using reinforcement learning on a curated dataset derived from PhysicsWallah’s internal question banks covering Physics, Chemistry, Mathematics, and General Reasoning\.
We evaluate Aryabhata 2 on competitive examination benchmarks including JEE Main, JEE Advanced, and NEET, as well as out\-of\-distribution reasoning datasets including AIME\[[9](https://arxiv.org/html/2605.28829#bib.bib15)\], HMMT\[[5](https://arxiv.org/html/2605.28829#bib.bib16)\], MMLU\-Pro\[[19](https://arxiv.org/html/2605.28829#bib.bib18)\], MMLU\-Redux 2\.0\[[2](https://arxiv.org/html/2605.28829#bib.bib19)\], and GPQA\[[15](https://arxiv.org/html/2605.28829#bib.bib20)\]\. Our results demonstrate that targeted reinforcement learning on exam\-style curricula can substantially improve reasoning performance on competitive STEM problems\.
## 2 Related Work
Recent progress in reasoning\-focused language models has increasingly relied on reinforcement learning \(RL\) during post\-training\. While supervised fine\-tuning \(SFT\) remains the foundation for instruction\-following behavior, reinforcement learning with verifiable rewards \(RLVR\) has proven particularly effective for domains where correctness can be programmatically evaluated, such as mathematics, coding, and symbolic reasoning\. In such settings, reward signals derived from automatic verifiers enable scalable optimization beyond what is achievable with purely supervised datasets\[[16](https://arxiv.org/html/2605.28829#bib.bib1),[4](https://arxiv.org/html/2605.28829#bib.bib2)\]\.
However, general reasoning spans heterogeneous domains with different supervision signals and verification costs\. Mathematical problems often allow fast symbolic verification, while coding tasks require execution environments and subjective tasks rely on preference\-based reward models\. As a result, several distinct RL post\-training paradigms have emerged\.
### 2\.1 Sequential Reinforcement Learning
Sequential RL pipelines apply reinforcement learning in staged curricula across domains\. The Nemotron\-Cascade framework \[[20](https://arxiv.org/html/2605.28829#bib.bib3)\] proposes such a strategy, where models are optimized through a sequence of reinforcement learning stages including alignment, instruction following, mathematical reasoning, and coding tasks\.
This approach treats post\-training as a curriculum where general behaviors are learned before specialized reasoning capabilities\. Sequential pipelines also offer practical engineering benefits since different domains can be trained with domain\-specific infrastructure and verifier latency constraints\. However, performance may depend on the ordering of stages, and poorly aligned objectives may lead to regressions in previously learned capabilities\.
### 2\.2 Decentralized Training via Model Merging
Another paradigm decomposes general reasoning into capability\-specific experts that are trained independently and later combined through parameter merging\. This approach has been explored in systems such as Command A \[[17](https://arxiv.org/html/2605.28829#bib.bib4)\], where multiple domain\-specialized models are trained separately and merged via linear weight averaging\.
Model merging enables parallel development of domain expertise and allows post\-hoc rebalancing of capabilities without retraining\. However, merged models may exhibit behavioral inconsistencies, often requiring additional alignment stages to produce a coherent policy\.
### 2\.3 Unified Multi\-Domain Reinforcement Learning
Unified reinforcement learning approaches train models across multiple domains simultaneously within a single RL loop\. The Nemotron 3 Nano \[[18](https://arxiv.org/html/2605.28829#bib.bib5)\] training pipeline demonstrates this paradigm by exposing models to a mixture of reasoning environments including mathematics, coding, structured output tasks, and tool use\.
Joint optimization across tasks helps mitigate catastrophic forgetting, since no domain is absent from training for extended periods\. However, unified RL requires more complex infrastructure to handle heterogeneous reward functions and verification environments\.
### 2\.4 Scaling Reinforcement Learning
Recent work has also explored scaling reinforcement learning along new dimensions\. Prolonged Reinforcement Learning \(ProRL\) \[[8](https://arxiv.org/html/2605.28829#bib.bib6)\] shows that reasoning performance can continue improving when RL training is extended to thousands of optimization steps, challenging the assumption that RL quickly reaches a performance plateau\.
Complementary work on Broadened Reinforcement Learning \(BroRL\) \[[7](https://arxiv.org/html/2605.28829#bib.bib7)\] demonstrates that increasing the number of sampled rollouts per prompt can significantly improve exploration during training\. By expanding the set of candidate reasoning trajectories, larger rollout groups increase the probability of discovering high\-reward reasoning strategies\.
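One way to see why larger rollout groups help exploration is a simple independence argument\. The sketch below is illustrative only and is not taken from the BroRL paper: it assumes each rollout solves a prompt independently with the same probability `p`, and the function name `prob_any_success` is hypothetical\.

```python
def prob_any_success(p: float, n: int) -> float:
    """Probability that at least one of n rollouts succeeds.

    Assumption (ours, for illustration): every rollout succeeds
    independently with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # A hard prompt with a 2% per-rollout success rate (hypothetical value).
    for group_size in (8, 16, 64, 128):
        print(group_size, round(prob_any_success(0.02, group_size), 3))
```

Under these assumptions, a prompt solved by 2% of rollouts yields at least one rewarded trajectory in roughly 15% of groups of size 8, but in more than 90% of groups of size 128\.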
## 3 Methodology
This section describes the full training pipeline used for Aryabhata 2, including data curation, answer verification, curriculum construction, and reinforcement learning\. Our goal is to build a model for advanced STEM reasoning under practical compute constraints while maintaining strong data quality and stable optimization\.
Our approach uses a unified RL\-only training pipeline\. We begin with rigorous data cleaning and answer verification to ensure reliable reward supervision, then organize the verified corpus into a difficulty\-aware curriculum\. Training proceeds as phased on\-policy reinforcement learning: an initial format\-alignment stage, a prolonged optimization stage for sustained capability gains, and a broadened exploration stage with larger rollout groups\. This design combines the stability benefits of careful data curation with the performance gains of prolonged and broadened RL under constrained compute\.
### 3\.1 Base Model
Aryabhata 2 is built on top of GPT\-OSS\-20B, an open\-source 20B\-parameter model released by OpenAI \[[13](https://arxiv.org/html/2605.28829#bib.bib10)\]\. This model serves as the initial policy and is adapted using reinforcement learning\.
To keep training efficient on limited hardware, we use parameter\-efficient fine\-tuning with Low\-Rank Adaptation \(LoRA\) \[[6](https://arxiv.org/html/2605.28829#bib.bib8)\] instead of updating all model parameters\. LoRA adapters are inserted into attention projection layers and the token embedding layer\. This design substantially reduces memory usage while preserving enough adaptation capacity to improve reasoning behavior through reinforcement learning\.
### 3\.2 Data Preparation
#### 3\.2\.1 Source Dataset
The training dataset is constructed from PhysicsWallah’s internal question banks covering multiple STEM domains relevant to competitive examinations in India\. These include questions from Physics, Chemistry, Mathematics, and General Reasoning, reflecting the distribution typically encountered in examinations such as JEE Main, JEE Advanced, and NEET\.
The raw dataset initially contains 1\.78M questions\. After applying a multi\-stage cleaning pipeline and answer verification procedure, the dataset is reduced to 1\.25M high\-quality questions that form the basis of our reinforcement learning curriculum\.
To reduce benchmark contamination risk, we enforce a knowledge cutoff of mid\-2024 for the training corpus and apply decontamination against all held\-out evaluation suites used in this work\.
The distribution of questions across subjects at different stages of the pipeline is summarized in Table [1](https://arxiv.org/html/2605.28829#S3.T1)\.
Table 1: Dataset distribution across different stages of the preprocessing pipeline\.
#### 3\.2\.2 Question Cleaning Pipeline
To ensure dataset reliability, we design a deterministic cleaning pipeline that removes malformed or unsuitable questions before the reinforcement learning stage\.
First, HTML artifacts are removed from the dataset\. In particular, questions containing <img\> tags are discarded, since such questions typically depend on diagrams or figures that cannot be interpreted reliably by a text\-only language model\.
Second, we perform LaTeX validation\. Many questions contain mathematical expressions written in LaTeX\. To detect malformed expressions, we attempt to render all questions using pdflatex\. Questions that fail compilation are discarded\. This step ensures that the model receives syntactically valid mathematical content during training\.
Third, we detect incomplete or ill\-posed questions using a language model classifier\. Specifically, we prompt Qwen/Qwen3\-30B\-A3B\-Thinking\-2507 \[[14](https://arxiv.org/html/2605.28829#bib.bib9)\] to determine whether a question lacks sufficient information to be solved\. Questions classified as incomplete are removed\.
Finally, we apply domain filtering to remove non\-STEM questions that do not belong to the Physics, Chemistry, Mathematics, or General Reasoning categories, using an instruction\-tuned filtering model\[[14](https://arxiv.org/html/2605.28829#bib.bib9)\]\.
Across all stages, approximately 24% of the original dataset is removed\.
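The two rule\-based stages above \(image\-tag removal and LaTeX compilation\) can be sketched as follows\. This is a minimal illustration, not the pipeline used in the paper: the LaTeX preamble, the timeout, and all function names are assumptions, and the LLM\-based completeness and domain filters are left out\.

```python
import re
import shutil
import subprocess
import tempfile
from pathlib import Path

IMG_TAG = re.compile(r"<img\b", re.IGNORECASE)

# Hypothetical minimal wrapper; a real pipeline would likely load more packages.
PREAMBLE = "\\documentclass{article}\n\\usepackage{amsmath,amssymb}\n\\begin{document}\n"
POSTAMBLE = "\n\\end{document}\n"


def has_img_tag(question: str) -> bool:
    """True if the question embeds an <img> tag (likely diagram-dependent)."""
    return IMG_TAG.search(question) is not None


def compiles_with_pdflatex(question: str, timeout_s: int = 30) -> bool:
    """Compile the question in a scratch directory; True only on a clean exit."""
    if shutil.which("pdflatex") is None:
        raise RuntimeError("pdflatex is not installed or not on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        tex_path = Path(tmp) / "question.tex"
        tex_path.write_text(PREAMBLE + question + POSTAMBLE, encoding="utf-8")
        try:
            result = subprocess.run(
                ["pdflatex", "-interaction=nonstopmode", "-halt-on-error",
                 tex_path.name],
                cwd=tmp, capture_output=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


def passes_rule_based_filters(question: str) -> bool:
    """Image check first (cheap), then LaTeX compilation (expensive)."""
    return not has_img_tag(question) and compiles_with_pdflatex(question)
```

Running the cheap regular\-expression check first means questions with images are rejected without ever invoking the compiler\.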
### 3\.3 Answer Verification
During preliminary inspection of the dataset, we observed that some of the original answer keys were incorrect or inconsistent\. Since reinforcement learning relies on reward signals derived from answer correctness, such errors can significantly degrade training quality\.
To address this issue, we design an automated answer verification pipeline based on two large language models\.
The verification system consists of:
- Policy model: GPT\-OSS\-120B\[[13](https://arxiv.org/html/2605.28829#bib.bib10)\]
- Judge model: Qwen/Qwen3\-30B\-A3B\-Thinking\-2507\[[14](https://arxiv.org/html/2605.28829#bib.bib9)\]
The policy model generates chain\-of\-thought \(CoT\) solutions \[[21](https://arxiv.org/html/2605.28829#bib.bib21)\] with temperature set to 1\.0, while the judge model evaluates whether the final answer of the generated solution matches the ground truth\.
#### 3\.3\.1 Multi\-Pass Verification
To maximize verification coverage while controlling computational cost, we employ a multi\-stage sampling procedure\.
##### Single\-Sample Stage
A single chain\-of\-thought solution is generated\. If the judge model marks the answer as correct, the question\-answer pair is accepted\. Approximately 80% of the dataset is verified at this stage\.
##### Four\-Sample Stage
For the remaining 20% of questions, we generate four independent chain\-of\-thought solutions with different random seeds\. A question is accepted if any of the generated solutions is judged correct\. This stage verifies an additional 8% of the dataset\.
##### Sixteen\-Sample Stage
For the remaining unresolved questions, we generate sixteen independent solutions and again accept the question if any solution is judged correct\. This final stage verifies an additional 4% of the dataset\.
The judge model provides binary correctness signals and questions are accepted if at least one valid reasoning trajectory exists\.
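The escalating 1 → 4 → 16 sampling schedule can be sketched as below\. The callables `generate` and `judge` are hypothetical stand\-ins for the policy\-model and judge\-model calls, and the early exit inside a stage is our simplification \(it gives the same accept/reject outcome as judging every sample in the stage\)\.

```python
from typing import Callable, Optional

# Sample budgets for the three stages described above.
STAGE_BUDGETS = (1, 4, 16)


def verify_answer_key(
    question: str,
    gold_answer: str,
    generate: Callable[[str, int], str],
    judge: Callable[[str, str], bool],
) -> Optional[int]:
    """Return the stage budget (1, 4, or 16) at which the key was verified.

    Returns None if no sampled solution is judged correct in any stage.
    `generate(question, seed)` and `judge(solution, gold_answer)` are
    hypothetical interfaces, not APIs from the paper.
    """
    seed = 0
    for budget in STAGE_BUDGETS:
        for _ in range(budget):
            solution = generate(question, seed)  # fresh seed per sample
            seed += 1
            if judge(solution, gold_answer):
                return budget  # one correct trajectory is enough
    return None
```

The three stages cumulatively accept about 80% \+ 8% \+ 4% = 92% of the questions\. If these percentages are taken relative to the cleaned set \(an assumption on our part\), 1\.78M × 0\.76 × 0\.92 ≈ 1\.24M, which is close to the reported 1\.25M given that the stage percentages are approximate\.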
### 3\.4 Curriculum Construction
Once answer verification is complete, we construct a training curriculum based on empirical difficulty estimation\.
For each question, we sample four independent generations at temperature 1\.0 using different random seeds\. Based on the number of correct generations, questions are categorized into three difficulty levels:
- Trivial \(4 out of 4 correct generations\)
- Learnable \(1–3 out of 4 correct generations\)
- Challenging \(0 out of 4 correct generations\)
This categorization provides a natural measure of question difficulty relative to the base model’s capabilities\. We exclude trivial questions from the prolonged and broadened reasoning phases because they provide minimal learning signal\. A small trivial subset is retained only for Phase 1 format alignment, and we retain challenging questions for the third phase of training\.
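The three\-way bucketing can be written as a small function\. The names `Difficulty` and `categorize` are hypothetical; only the thresholds come from the text above\.

```python
from enum import Enum


class Difficulty(Enum):
    TRIVIAL = "trivial"          # every sampled generation is correct
    LEARNABLE = "learnable"      # some, but not all, are correct
    CHALLENGING = "challenging"  # none are correct


def categorize(num_correct: int, num_samples: int = 4) -> Difficulty:
    """Bucket a question by how many of its sampled generations were correct."""
    if not 0 <= num_correct <= num_samples:
        raise ValueError("num_correct must be between 0 and num_samples")
    if num_correct == num_samples:
        return Difficulty.TRIVIAL
    if num_correct == 0:
        return Difficulty.CHALLENGING
    return Difficulty.LEARNABLE
```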
During initial experiments, we observed that the base model performed significantly worse on chemistry questions\. To mitigate this imbalance, chemistry questions are upsampled within the reinforcement learning curriculum\.
### 3\.5 Unified Reinforcement Learning
To improve reasoning ability, Aryabhata 2 is trained using an on\-policy reinforcement learning framework built upon Group Relative Policy Optimization \(GRPO\)\[[16](https://arxiv.org/html/2605.28829#bib.bib1)\]\.
#### 3\.5\.1 Parameter\-Efficient Adaptation
Training uses Low\-Rank Adaptation \(LoRA\), which enables parameter\-efficient fine\-tuning by introducing a small set of trainable parameters while keeping the base model weights frozen\. This approach significantly reduces memory footprint and training cost while maintaining strong adaptation performance on domain\-specific tasks\.
LoRA configuration: Table [2](https://arxiv.org/html/2605.28829#S3.T2) summarizes the key hyperparameters\.
Parameter efficiency: Table [3](https://arxiv.org/html/2605.28829#S3.T3) reports the total and trainable parameter counts\.
In early ablation experiments, adding adapters to the token embedding layer significantly improved learning capacity\.
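For intuition about why LoRA shrinks the trainable parameter count, the sketch below counts adapter parameters for a single weight matrix\. The rank and layer dimensions are placeholder values chosen for illustration; they are not the settings reported in Table 2 or Table 3\.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter.

    The frozen weight W (d_out x d_in) is augmented with B @ A, where
    A is (rank x d_in) and B is (d_out x rank).
    """
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole matrix is fine-tuned."""
    return d_in * d_out


if __name__ == "__main__":
    d_in, d_out, rank = 4096, 4096, 16  # placeholder values, not from the paper
    adapter = lora_param_count(d_in, d_out, rank)
    full = full_param_count(d_in, d_out)
    print(f"adapter={adapter}, full={full}, fraction={adapter / full:.4%}")
```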
Table 2: LoRA hyperparameters used for reinforcement learning post\-training\.
Table 3: Total and trainable parameter counts for Aryabhata 2 with LoRA adapters\.
#### 3\.5\.2 RL Algorithm
We use GRPO as the base reinforcement learning algorithm\. For each prompt, we sample a group of responses and compute advantages relative to group reward statistics\.
Compared to standard token\-level GRPO implementations, we make the following modifications:
- KL\-free training: We remove KL regularization and do not use a reference model\. This design is motivated by GPU memory limits: keeping both policy and reference models exceeded the available memory on our two\-H100 setup\.
- DAPO\-style clipped objective: We optimize a clipped policy\-ratio objective with an asymmetric upper clipping threshold\[[4](https://arxiv.org/html/2605.28829#bib.bib2)\]\.
- No variance normalization: Advantages are computed by subtracting the mean reward within each sampled group, without standard\-deviation scaling\.
- Truncation masking: Completions that reach the maximum generation length are masked during optimization to avoid learning from incomplete trajectories\.
- Multiplicative reward composition: We use a product\-form reward, $R = R_{accuracy} \times R_{format}$, to jointly enforce correctness and response\-format quality\.
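The modifications listed above can be sketched in plain Python\. This is a minimal illustration under stated assumptions, not the training code used for Aryabhata 2: the clipping thresholds `eps_low` and `eps_high` are placeholder values, the function names are hypothetical, the loss is averaged over tokens, and the group mean is assumed to be taken over all rollouts, including truncated ones\.

```python
from typing import List, Sequence


def group_advantages(rewards: Sequence[float]) -> List[float]:
    """Mean-centred advantages for one rollout group (no std-dev scaling)."""
    mean_reward = sum(rewards) / len(rewards)
    return [r - mean_reward for r in rewards]


def clipped_term(ratio: float, advantage: float,
                 eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """Per-token clipped surrogate with a looser upper bound than lower bound.

    eps_low and eps_high are illustrative placeholders, not paper values.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


def group_objective(ratios: Sequence[Sequence[float]],
                    rewards: Sequence[float],
                    truncated: Sequence[bool]) -> float:
    """Token-averaged objective for one group; no KL penalty is added.

    ratios[i][t] is the new/old policy probability ratio for token t of
    completion i. Truncated completions contribute no tokens.
    """
    advantages = group_advantages(rewards)
    total, num_tokens = 0.0, 0
    for token_ratios, adv, is_truncated in zip(ratios, advantages, truncated):
        if is_truncated:
            continue  # truncation masking
        for ratio in token_ratios:
            total += clipped_term(ratio, adv)
            num_tokens += 1
    return total / num_tokens if num_tokens else 0.0
```

Because advantages are only mean\-centred, a group whose rollouts all receive the same reward produces zero advantage for every rollout and therefore contributes no gradient signal\.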
#### 3\.5\.3 Reward Function
As mentioned above, the final reward is computed multiplicatively:
$$R = R_{accuracy} \times R_{format}$$
##### Accuracy reward\.
The accuracy term is computed using an ordered matching cascade and depends on the question type\. The cascade proceeds through the following base matchers:
1. Case\-insensitive string equality after trimming whitespace\.
2. Numeric matching with tolerance: For numeric answers, let $a$ denote the parsed model prediction and $b$ denote the parsed gold value after whitespace stripping, LaTeX wrapper removal, and scientific\-notation normalization\. We mark a numeric match when $|a-b| \leq \max\left(0.01 \cdot \max(|a|,|b|),\, 0.01\right)$\.
3. Symbolic equivalence via math\-verify after light LaTeX normalization \(with numeric fallback when symbolic verification fails\)\.
These matchers are applied sequentially, with the first successful match determining correctness\. Their usage varies slightly depending on the question format, as described below:
- True/false questions: We apply raw and LaTeX\-stripped string matching\.
- Numerical, fill\-in, and type\-in questions: We apply the full string–numeric–symbolic cascade\.
- Choice\-style questions \(single\-correct, multiple\-correct, assertion–reasoning, and matching\-list\): Gold options are normalized to canonical labels and matched accordingly\. For single\-correct MCQs, if label matching fails but the predicted option value matches the gold option text, we assign partial credit of 0\.5\. Thus, $R_{accuracy} \in \{0, 0.5, 1\}$\.
- Reward system design: In initial experiments, the deterministic rule\-based reward system defined above covered almost all cases, so it is used as the sole reward\-evaluation mechanism during RL training\.
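The string and numeric stages of the cascade can be sketched as follows\. The normalization in `parse_number` is a simplified guess at what whitespace stripping, LaTeX wrapper removal, and scientific\-notation normalization might involve; the symbolic stage and the choice\-style partial credit are omitted, and all function names are hypothetical\.

```python
import re
from typing import Optional

SCI_NOTATION = re.compile(r"(?:\\times|\\cdot|x|×)10\^\{?(-?\d+)\}?")


def parse_number(text: str) -> Optional[float]:
    """Parse a numeric answer after light normalisation; None if not numeric."""
    cleaned = re.sub(r"\s+", "", text).strip("$")
    # e.g. 1.2\times10^{3} or 1.2x10^3 becomes 1.2e3
    cleaned = SCI_NOTATION.sub(r"e\1", cleaned)
    try:
        return float(cleaned)
    except ValueError:
        return None


def string_match(pred: str, gold: str) -> bool:
    """Case-insensitive equality after trimming whitespace."""
    return pred.strip().lower() == gold.strip().lower()


def numeric_match(a: float, b: float) -> bool:
    """1% relative tolerance with an absolute floor of 0.01."""
    return abs(a - b) <= max(0.01 * max(abs(a), abs(b)), 0.01)


def accuracy_reward(pred: str, gold: str) -> float:
    """First successful matcher decides; symbolic stage not included here."""
    if string_match(pred, gold):
        return 1.0
    a, b = parse_number(pred), parse_number(gold)
    if a is not None and b is not None and numeric_match(a, b):
        return 1.0
    return 0.0
```

The absolute floor of 0\.01 matters for answers near zero, where a purely relative tolerance would be vanishingly small\.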
##### Format reward\.
In early experiments with just the accuracy reward, we observed that the model often terminated with a correct answer immediately after completing its reasoning, without providing a sufficiently detailed explanation for the student\. Conversely, unconstrained reasoning could lead to excessively long outputs or reasoning loops\. To balance these behaviors, we design a format reward that encourages sufficiently informative final answers while maintaining a controlled proportion between reasoning and solution length\.
The format term is computed from output structure using character\-level heuristics\. Each output is split at the end\-of\-thinking delimiter into a reasoning segment and a final\-answer segment\. Let $c_{tot}$ be the total number of output characters, $c_{sol}$ be the number of characters in the final\-answer segment, and
$$\rho = \frac{c_{sol}}{c_{tot}}.$$
We define
$$R_{format} = S_{len}(c_{sol}) \times S_{ratio}(\rho),$$
with
$$S_{len}(c_{sol}) = \begin{cases} 0, & c_{sol} < 100 \\ 0.6, & 100 \leq c_{sol} < 250 \\ 0.8, & 250 \leq c_{sol} < 500 \\ 1.0, & c_{sol} \geq 500, \end{cases}$$
$$S_{ratio}(\rho) = \begin{cases} \rho/0.3, & \rho < 0.3 \\ 1.0, & 0.3 \leq \rho \leq 0.7 \\ (1-\rho)/0.3, & \rho > 0.7. \end{cases}$$
If parsing fails \(e\.g\., missing delimiter\), we set $R_{format} = 0$\.
Intuitively, $S_{len}$ rewards sufficiently detailed final answers by increasing the score with solution length, while $S_{ratio}$ encourages a balanced allocation between reasoning and answer segments\. Together, these terms discourage both overly brief responses and disproportionately long reasoning chains\.
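The format reward translates directly into code\. In the sketch below the delimiter string `</think>` is a placeholder, since the actual end\-of\-thinking token is not specified here, and counting the delimiter itself toward the total length is our assumption\.

```python
END_OF_THINKING = "</think>"  # hypothetical delimiter, not confirmed by the paper


def s_len(c_sol: int) -> float:
    """Step function rewarding longer final-answer segments."""
    if c_sol < 100:
        return 0.0
    if c_sol < 250:
        return 0.6
    if c_sol < 500:
        return 0.8
    return 1.0


def s_ratio(rho: float) -> float:
    """Trapezoid: full score when the answer is 30-70% of the output."""
    if rho < 0.3:
        return rho / 0.3
    if rho <= 0.7:
        return 1.0
    return (1.0 - rho) / 0.3


def format_reward(output: str, delimiter: str = END_OF_THINKING) -> float:
    """R_format = S_len(c_sol) * S_ratio(c_sol / c_tot); 0 if parsing fails."""
    if delimiter not in output:
        return 0.0
    _, solution = output.split(delimiter, 1)
    c_tot = len(output)  # assumption: delimiter characters count toward c_tot
    c_sol = len(solution)
    return s_len(c_sol) * s_ratio(c_sol / c_tot)
```

For example, a 1,200\-character output whose final answer takes 600 characters scores 1\.0, whereas a 1,500\-character output with a 150\-character answer scores 0\.6 × \(0\.1 / 0\.3\) = 0\.2\.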
### 3\.6 Training Phases
Training proceeds in three sequential phases\.
#### 3\.6\.1 Phase 1: Format Alignment
The first phase consists of 300 reinforcement learning steps with a group size of 8\. Training is performed on a format\-mixed dataset with chemistry questions upsampled\.
The goal of this stage is to align the model with the desired answering format before scaling reasoning difficulty\.
#### 3\.6\.2 Phase 2: Prolonged Reinforcement Learning
The second phase runs for approximately 5,000 steps\. The group size is gradually increased from 8 to 16\.
During this phase, the dataset mixture is adaptively adjusted based on model evaluation results\. Question difficulty is increased when the model sustains an accuracy reward greater than 0\.7 for around 20 consecutive optimization steps\.
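The difficulty trigger can be sketched as a small stateful scheduler\. The class name, the integer difficulty levels, and `max_level` are hypothetical; the paper states only the 0\.7 threshold and the window of roughly 20 consecutive steps\.

```python
class DifficultyScheduler:
    """Raise curriculum difficulty after a sustained run of high accuracy reward."""

    def __init__(self, threshold: float = 0.7, patience: int = 20,
                 max_level: int = 3):
        self.threshold = threshold
        self.patience = patience
        self.max_level = max_level  # hypothetical number of difficulty levels
        self.level = 0
        self._streak = 0

    def update(self, mean_accuracy_reward: float) -> int:
        """Record one optimization step and return the current level."""
        if mean_accuracy_reward > self.threshold:
            self._streak += 1
        else:
            self._streak = 0  # the run must be consecutive
        if self._streak >= self.patience and self.level < self.max_level:
            self.level += 1
            self._streak = 0  # start counting afresh at the new level
        return self.level
```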
To stabilize training, we perform EMA\-based checkpoint merging when reward improvements plateau\. Multiple previous checkpoints are merged using exponential moving averaging to reduce instability\.
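A minimal sketch of exponential\-moving\-average merging over a list of checkpoints is shown below\. Plain Python lists stand in for weight tensors, and the decay value, the oldest\-first ordering, and the function name are assumptions; the paper does not specify these details\.

```python
from typing import Dict, List, Sequence

StateDict = Dict[str, List[float]]  # toy stand-in for named weight tensors


def ema_merge(checkpoints: Sequence[StateDict], decay: float = 0.5) -> StateDict:
    """Fold checkpoints (oldest first) into one exponential moving average.

    Update rule: merged <- decay * merged + (1 - decay) * checkpoint.
    The decay of 0.5 is an illustrative placeholder.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    merged = {name: list(values) for name, values in checkpoints[0].items()}
    for ckpt in checkpoints[1:]:
        for name, values in ckpt.items():
            merged[name] = [
                decay * m + (1.0 - decay) * v
                for m, v in zip(merged[name], values)
            ]
    return merged
```

With this update rule, more recent checkpoints receive exponentially larger weight, so the merged policy stays close to the latest one while damping step\-to\-step noise\.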
#### 3\.6\.3 Phase 3: Broadened Reinforcement Learning
The final phase focuses on exploration and generalization\. This stage runs for approximately 700 steps and increases the group size from 64 to 128\.
Larger group sizes enable broader exploration of reasoning trajectories, allowing the model to discover alternative solution strategies\.
##### Training configuration across phases\.
We summarize the key hyperparameters across different training phases in Table [4](https://arxiv.org/html/2605.28829#S3.T4)\.
Table 4: Training hyperparameters across the three reinforcement learning phases\.
### 3\.7 Training Infrastructure
All experiments are conducted on two NVIDIA H100 NVL GPUs\. Reinforcement learning is performed using on\-policy sampling with generation temperature set to 1\.0 during both training and evaluation\.
Model evaluation is performed every 50 steps using a held\-out validation set\. The final checkpoint is selected based on the highest validation accuracy\. We use stochastic decoding and sample $k=4$ responses per question, and compute Pass@1 as the mean correctness across sampled responses\. Majority voting is not used in reported metrics\.
## 4 Evaluation
We evaluate Aryabhata 2 on a suite of competitive examination datasets and established reasoning benchmarks\. The evaluation is designed to measure performance both on in\-distribution exam\-style problems and out\-of\-distribution reasoning benchmarks\.
### 4\.1 Metrics
We default to stochastic pass@$k$ evaluation \(rather than greedy decoding\) and report Pass@1 using non\-zero\-temperature sampling\[[1](https://arxiv.org/html/2605.28829#bib.bib22)\]\. For each question, we sample $k$ responses \(with $k=4$ in our main evaluation\), and compute
$$\mathrm{Pass@1} = \frac{1}{k}\sum_{i=1}^{k} p_i,$$
where $p_i \in \{0,1\}$ denotes the correctness of the $i$\-th response\.
In addition to raw accuracy, we report the accuracy–token trade\-off, which measures accuracy relative to the number of output tokens generated during inference\. This metric captures practical deployment efficiency for real\-world, large\-scale tutoring and question\-answering systems\. We compute accuracy per 1K output tokens as
$$\text{Acc./1K tokens} = \frac{\text{Pass@1 (4-sample mean)}}{\text{output tokens}} \times 1000,$$
where Pass@1 and output tokens are averaged across benchmarks within each split\.
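Both metrics are simple averages and can be computed as in the sketch below\. The function names are hypothetical, and Pass@1 is expressed in percent when forming the per\-1K\-token ratio, which is the convention that reproduces the figures reported in the paper\.

```python
from typing import Sequence


def pass_at_1(correct: Sequence[bool]) -> float:
    """Mean correctness over the k sampled responses for one question."""
    return sum(correct) / len(correct)


def benchmark_pass_at_1(per_question: Sequence[Sequence[bool]]) -> float:
    """Average per-question Pass@1 over a benchmark, in percent."""
    scores = [pass_at_1(samples) for samples in per_question]
    return 100.0 * sum(scores) / len(scores)


def acc_per_1k_tokens(pass_at_1_pct: float, mean_output_tokens: float) -> float:
    """Accuracy (in percent) obtained per 1,000 generated output tokens."""
    return pass_at_1_pct / mean_output_tokens * 1000.0
```

Using the in\-distribution averages reported in the paper \(88\.95 Pass@1 at 2,102\.25 output tokens\), this gives 88\.95 / 2102\.25 × 1000 ≈ 42\.31\.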
### 4\.2 Benchmarks
#### 4\.2\.1 In\-Distribution Benchmarks
Evaluation is conducted on recent competitive examinations\. These exams require a combination of conceptual understanding, multi\-step reasoning, and precise numerical computation across physics, chemistry, and mathematics\. All benchmarks are restricted to text\-only questions, excluding those that require diagrams or visual interpretation\. The benchmark composition is summarized in Table [5](https://arxiv.org/html/2605.28829#S4.T5)\. NEET includes Biology questions; since Biology is not part of the RL curriculum described in Section [3\.2](https://arxiv.org/html/2605.28829#S3.SS2), we treat this as a partial distribution shift within the in\-distribution exam suite and report aggregate results for operational comparability\.
Table 5: In\-distribution benchmark datasets and question counts used for evaluation\.
#### 4\.2\.2 Out\-of\-Distribution Benchmarks
To measure generalization beyond the training distribution, we evaluate on a diverse set of widely used reasoning benchmarks spanning olympiad\-style mathematics and broad STEM knowledge: AIME\[[9](https://arxiv.org/html/2605.28829#bib.bib15)\], HMMT\[[5](https://arxiv.org/html/2605.28829#bib.bib16)\], MMLU\-Pro\[[19](https://arxiv.org/html/2605.28829#bib.bib18)\], MMLU\-Redux 2\.0\[[2](https://arxiv.org/html/2605.28829#bib.bib19)\], and GPQA\[[15](https://arxiv.org/html/2605.28829#bib.bib20)\]\. The dataset breakdown is provided in Table [6](https://arxiv.org/html/2605.28829#S4.T6)\.
Table 6: Out\-of\-distribution benchmark datasets and question counts used to evaluate generalization\.
The evaluation spans both competition\-level mathematics \(AIME, HMMT\) and broad STEM reasoning \(MMLU\-Pro\[[19](https://arxiv.org/html/2605.28829#bib.bib18)\], MMLU\-Redux 2\.0\[[2](https://arxiv.org/html/2605.28829#bib.bib19)\], and GPQA\[[15](https://arxiv.org/html/2605.28829#bib.bib20)\]\), providing a comprehensive measure of generalization\.
### 4\.3 Baselines
We compare Aryabhata 2 against a mixture of open\-source reasoning models and frontier proprietary models\.
##### Open\-source models\.
We include recent open\-weight models with strong reasoning capabilities: Qwen3\-30B\-A3B \(Thinking\)\[[14](https://arxiv.org/html/2605.28829#bib.bib9)\], Nemotron 3 Nano 30B A3B\[[18](https://arxiv.org/html/2605.28829#bib.bib5)\], GPT\-OSS\-20B, and GPT\-OSS\-120B\[[13](https://arxiv.org/html/2605.28829#bib.bib10)\]\.
##### Frontier models\.
We additionally evaluate against frontier proprietary models included in our result tables: GPT\-5 Mini and GPT\-5 Nano\[[12](https://arxiv.org/html/2605.28829#bib.bib11)\], and Gemini 2\.5 Flash\[[3](https://arxiv.org/html/2605.28829#bib.bib12)\]\.
### 4\.4 Answer Verification
To determine answer correctness, we apply a multi\-stage answer extraction and matching pipeline:
1. String Matching: case\-insensitive exact matching after whitespace normalization\.
2. Numeric Matching: tolerance\-based numerical equivalence checks after whitespace normalization, LaTeX wrapper stripping, and scientific\-notation normalization\.
3. Symbolic Matching: symbolic equivalence using the math\-verify library after light LaTeX normalization\.
4. Option Matching: canonical\-label matching for choice\-style questions\.
5. LLM\-as\-Judge: if previous steps fail, an LLM extracts the final answer from the response and compares it to the ground truth while ignoring intermediate reasoning steps\.
This pipeline ensures robust evaluation across numeric, symbolic, and multiple\-choice answers while minimizing extraction errors from generated reasoning traces\.
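The ordered fallback structure of this evaluation pipeline can be sketched as a list of matchers tried in sequence\. The two example stages, the label normalization, and the `llm_judge` callable are hypothetical simplifications; the numeric and symbolic stages are left out of this sketch\.

```python
from typing import Callable, Iterable, Optional, Tuple

Matcher = Callable[[str, str], bool]


def string_stage(pred: str, gold: str) -> bool:
    """Case-insensitive exact match after collapsing whitespace."""
    return " ".join(pred.split()).lower() == " ".join(gold.split()).lower()


def option_stage(pred: str, gold: str) -> bool:
    """Compare option labels such as "(b)", "B." and "b" as the label "B"."""
    def normalise(label: str) -> str:
        return label.strip().strip("()[]. ").upper()
    return normalise(pred) == normalise(gold)


def grade(
    prediction: str,
    gold: str,
    matchers: Iterable[Tuple[str, Matcher]],
    llm_judge: Optional[Matcher] = None,
) -> Tuple[bool, str]:
    """Try rule-based matchers in order, then the LLM judge as a last resort.

    Returns (is_correct, name_of_stage_that_decided).
    """
    for name, matcher in matchers:
        if matcher(prediction, gold):
            return True, name
    if llm_judge is not None:
        return llm_judge(prediction, gold), "llm_judge"
    return False, "none"
```

Recording which stage decided each answer makes it possible to audit how often the evaluation relies on the LLM judge rather than on deterministic rules\.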
### 4\.5 Results
#### 4\.5\.1 In\-Distribution Results
Table [7](https://arxiv.org/html/2605.28829#S4.T7) reports overall Pass@1 \(4\-sample mean\) on the in\-distribution exam benchmarks\. Aryabhata 2 achieves the strongest open\-source aggregate performance, with an average of 88\.95, compared with 88\.28 for GPT\-OSS\-120B, 88\.55 for Qwen3\-30B\-A3B \(Thinking\), and 83\.00 for GPT\-OSS\-20B\.
In addition to accuracy gains, Aryabhata 2 is substantially more token\-efficient than GPT\-OSS\-20B on in\-distribution exams, reducing output tokens by approximately 52–64% across the four datasets\.
Aryabhata 2 achieves 42\.31 Acc\./1K tokens\. This is substantially higher than GPT\-OSS\-20B \(15\.68\), GPT\-OSS\-120B \(26\.66\), Nemotron 3 Nano 30B A3B \(14\.41\), and Qwen3\-30B\-A3B \(Thinking\) \(19\.44\)\. Detailed in\-distribution accuracy–token values for all models are listed in Appendix Table [9](https://arxiv.org/html/2605.28829#A1.T9)\.
Table 7: In\-distribution Pass@1 \(4\-sample mean, %\) overall accuracy\. Avg\. is the average across the four in\-distribution benchmarks\.
#### 4\.5\.2 Out\-of\-Distribution Results
Table [8](https://arxiv.org/html/2605.28829#S4.T8) summarizes OOD performance on Olympiad\-style mathematics and broad STEM reasoning benchmarks using Pass@1 \(4\-sample mean\)\. Aryabhata 2 attains an OOD average of 87\.64, improving over GPT\-OSS\-20B \(84\.95\) and Nemotron 3 Nano 30B A3B \(83\.48\), while remaining below GPT\-5 Mini \(88\.85\), Gemini 2\.5 Flash \(89\.13\), Qwen3\-30B\-A3B \(Thinking\) \(89\.42\), and GPT\-OSS\-120B \(89\.50\)\.
Compared with GPT\-OSS\-20B, Aryabhata 2 matches AIME performance \(86\.67\), improves on HMMT \(\+1\.54\), GPQA \(\+4\.35\), and MMLU\-Pro \(\+3\.07\), and shows a small decline on MMLU\-Redux 2\.0 \(\-0\.40\)\. Against Qwen3\-30B\-A3B\-Thinking, Aryabhata 2 shows a large gain on HMMT \(\+27\.08\), indicating stronger robustness on this harder Olympiad\-style benchmark\.
Aryabhata 2 is also more token\-efficient than GPT\-OSS\-20B across all OOD benchmarks, reducing output tokens by approximately 24–71%\.
Aryabhata 2 attains39\.58Acc\./1K tokens\. This improves over GPT\-OSS\-20B \(17\.48\), GPT\-OSS\-120B \(24\.44\), Nemotron 3 Nano 30B A3B \(13\.61\), and Qwen3\-30B\-A3B \(Thinking\) \(20\.80\)\. The full OOD accuracy–token summary is reported in Appendix Table[10](https://arxiv.org/html/2605.28829#A1.T10)\.
Figures [1](https://arxiv.org/html/2605.28829#S4.F1) and [2](https://arxiv.org/html/2605.28829#S4.F2) show accuracy versus output tokens for in\-distribution and out\-of\-distribution averages, respectively\.
Table 8: Out\-of\-distribution Pass@1 \(4\-sample mean, %\) accuracy\. Avg\. is the average across five OOD benchmarks\.
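As a minimal sketch of how the reported numbers could be aggregated, the code below assumes that Pass@1 \(4\-sample mean\) is the fraction of correct answers among four independent samples per problem, averaged over problems, and that Avg\. is the unweighted mean of per\-benchmark scores\. The function names and the toy data are hypothetical\.

```python
from statistics import mean


def pass_at_1_sample_mean(correct, n_samples=4):
    """Percent Pass@1, averaged over n_samples generations per problem.

    correct[i][j] is True when sample j for problem i matches the reference.
    """
    per_problem = []
    for samples in correct:
        if len(samples) != n_samples:
            raise ValueError(f"expected {n_samples} samples per problem")
        # Fraction of this problem's samples that are correct.
        per_problem.append(sum(1 for ok in samples if ok) / n_samples)
    return 100.0 * mean(per_problem)


def benchmark_average(scores):
    """Unweighted mean of per-benchmark scores (assumed meaning of 'Avg.')."""
    return mean(list(scores.values()))


# Toy data (hypothetical): three problems, four samples each.
toy = [
    [True, True, True, True],
    [True, False, True, False],
    [False, False, False, False],
]
toy_score = pass_at_1_sample_mean(toy)  # per-problem fractions: 1.0, 0.5, 0.0
toy_avg = benchmark_average({"bench_a": 80.0, "bench_b": 90.0})
print(toy_score, toy_avg)
```

Averaging per problem first keeps every problem equally weighted, which is one reasonable reading of a "4\-sample mean"\.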
![[Uncaptioned image]](https://arxiv.org/html/2605.28829v1/uploads/accuracy_token_tradeoff_in_distribution.png)
Figure 1: In\-distribution accuracy\-token trade\-off \(y\-axis: Pass@1 \(4\-sample mean\), x\-axis: output tokens\)\. Each point is a model\-level average across the in\-distribution benchmarks\.
![[Uncaptioned image]](https://arxiv.org/html/2605.28829v1/uploads/accuracy_token_tradeoff_ood.png)
Figure 2: Out\-of\-distribution accuracy\-token trade\-off \(y\-axis: Pass@1 \(4\-sample mean\), x\-axis: output tokens\)\. Each point is a model\-level average across the OOD benchmarks\.
## Conclusion
In this work, we presented Aryabhata 2, a reinforcement\-learning post\-trained 20B model designed for advanced competitive STEM reasoning\. Our pipeline combines rigorous data cleaning and answer verification with phased reinforcement learning that includes format alignment, prolonged optimization, and broadened exploration\. This design enables stable training under constrained compute while preserving strong reasoning quality on exam\-style problems\.
Across in\-distribution benchmarks, Aryabhata 2 achieves strong performance, including a 92\.99 score on JEE Main 2026\. Relative to the base GPT\-OSS\-20B model, Aryabhata 2 improves overall accuracy on all in\-distribution exams and delivers substantially shorter generations\. On out\-of\-distribution benchmarks, Aryabhata 2 improves over GPT\-OSS\-20B in Pass@1 \(4\-sample mean\) while remaining competitive with larger baselines\. The accuracy\-token analysis further shows favorable deployment efficiency, particularly on in\-distribution tasks, where Aryabhata 2 attains 88\.95 Pass@1 at 2,102\.25 output tokens and 42\.31 Acc\./1K tokens\.
These results suggest that targeted RL on domain\-specific curricula is an effective strategy for scaling practical STEM reasoning systems for real educational workloads\.
## References
- \[1\] M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. de Oliveira Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, et al\. \(2021\) Evaluating large language models trained on code\. External Links: 2107\.03374 Cited by: [§4\.1](https://arxiv.org/html/2605.28829#S4.SS1.p1.3)\.
- \[2\] \(2025\) MMLU\-redux: a dynamically debiased multi\-task language understanding benchmark\. External Links: 2502\.17578 Cited by: [§1](https://arxiv.org/html/2605.28829#S1.p5.1), [§4\.2\.2](https://arxiv.org/html/2605.28829#S4.SS2.SSS2.p1.1), [§4\.2\.2](https://arxiv.org/html/2605.28829#S4.SS2.SSS2.p2.1)\.
- \[3\] Google DeepMind \(2025\) Gemini 2\.5 flash\. Note: [https://deepmind\.google/en/models/gemini/flash/](https://deepmind.google/en/models/gemini/flash/) Accessed: 2026\-04\-08 Cited by: [§4\.3](https://arxiv.org/html/2605.28829#S4.SS3.SSS0.Px2.p1.1)\.
- \[4\] D\. Guo, D\. Yang, H\. Zhang, J\. Song, R\. Zhang, R\. Xu, Q\. Zhu, S\. Ma, P\. Wang, X\. Bi, et al\. \(2025\) DeepSeek\-r1: incentivizing reasoning capability in llms via reinforcement learning\. External Links: 2501\.12948 Cited by: [§1](https://arxiv.org/html/2605.28829#S1.p3.1), [§2](https://arxiv.org/html/2605.28829#S2.p1.1), [2nd item](https://arxiv.org/html/2605.28829#S3.I3.i2.p1.1)\.
- \[5\] HMMT Organizing Committee \(2026\) Harvard\-mit mathematics tournament \(hmmt\)\. Note: [https://www\.hmmt\.org/](https://www.hmmt.org/) Accessed: 2026\-04\-06 Cited by: [§1](https://arxiv.org/html/2605.28829#S1.p5.1), [§4\.2\.2](https://arxiv.org/html/2605.28829#S4.SS2.SSS2.p1.1)\.
- \[6\] E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2021\) LoRA: low\-rank adaptation of large language models\. External Links: 2106\.09685 Cited by: [§3\.1](https://arxiv.org/html/2605.28829#S3.SS1.p2.1)\.
- \[7\] P\. Hu, W\. Chen, Z\. Zhao, W\. Wang, L\. Lu, J\. Gao, X\. Huang, S\. Bubeck, S\. Surya, S\. Sahoo, et al\. \(2025\) BroRL: recovering exploration in rl for llms\. External Links: 2510\.01180 Cited by: [§2\.4](https://arxiv.org/html/2605.28829#S2.SS4.p2.1)\.
- \[8\] H\. Liu, R\. Lee, S\. Bubeck, S\. Surya, J\. Lee, S\. Sahoo, X\. Huang, A\. Jain, Z\. Li, et al\. \(2025\) ProRL: prolonged reinforcement learning expands reasoning boundaries in large language models\. External Links: 2501\.12895 Cited by: [§2\.4](https://arxiv.org/html/2605.28829#S2.SS4.p1.1)\.
- \[9\] Mathematical Association of America \(2026\) American invitational mathematics examination \(aime\)\. Note: [https://maa\.org/student\-programs/amc/american\-invitational\-mathematics\-examination\-aime/](https://maa.org/student-programs/amc/american-invitational-mathematics-examination-aime/) Accessed: 2026\-04\-06 Cited by: [§1](https://arxiv.org/html/2605.28829#S1.p5.1), [§4\.2\.2](https://arxiv.org/html/2605.28829#S4.SS2.SSS2.p1.1)\.
- \[10\] National Testing Agency \(2026\) Joint entrance examination \(main\)\. Note: [https://jeemain\.nta\.nic\.in/](https://jeemain.nta.nic.in/) Accessed: 2026\-04\-06 Cited by: [§1](https://arxiv.org/html/2605.28829#S1.p1.1)\.
- \[11\] National Testing Agency \(2026\) National eligibility cum entrance test \(ug\)\. Note: [https://neet\.nta\.nic\.in/](https://neet.nta.nic.in/) Accessed: 2026\-04\-06 Cited by: [§1](https://arxiv.org/html/2605.28829#S1.p1.1)\.
- \[12\] OpenAI \(2025\) Introducing gpt\-5\. Note: [https://openai\.com/index/introducing\-gpt\-5/](https://openai.com/index/introducing-gpt-5/) Accessed: 2026\-04\-08 Cited by: [§4\.3](https://arxiv.org/html/2605.28829#S4.SS3.SSS0.Px2.p1.1)\.
- \[13\] OpenAI \(2025\) Introducing gpt\-oss\. Note: [https://openai\.com/index/introducing\-gpt\-oss/](https://openai.com/index/introducing-gpt-oss/) Accessed: 2026\-04\-06 Cited by: [§1](https://arxiv.org/html/2605.28829#S1.p4.1), [1st item](https://arxiv.org/html/2605.28829#S3.I1.i1.p1.1), [§3\.1](https://arxiv.org/html/2605.28829#S3.SS1.p1.1), [§4\.3](https://arxiv.org/html/2605.28829#S4.SS3.SSS0.Px1.p1.1)\.
- \[14\] Qwen Team \(2025\) Qwen3 technical report\. External Links: 2505\.09388 Cited by: [§1](https://arxiv.org/html/2605.28829#S1.p3.1), [2nd item](https://arxiv.org/html/2605.28829#S3.I1.i2.p1.1), [§3\.2\.2](https://arxiv.org/html/2605.28829#S3.SS2.SSS2.p4.1), [§3\.2\.2](https://arxiv.org/html/2605.28829#S3.SS2.SSS2.p5.1), [§4\.3](https://arxiv.org/html/2605.28829#S4.SS3.SSS0.Px1.p1.1)\.
- \[15\] D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Pang, J\. Balis, J\. Tang, K\. Wampler, et al\. \(2023\) GPQA: a graduate\-level google\-proof q&a benchmark\. External Links: 2311\.12022 Cited by: [§1](https://arxiv.org/html/2605.28829#S1.p5.1), [§4\.2\.2](https://arxiv.org/html/2605.28829#S4.SS2.SSS2.p1.1), [§4\.2\.2](https://arxiv.org/html/2605.28829#S4.SS2.SSS2.p2.1)\.
- \[16\] Z\. Shao, P\. Wang, Q\. Zhu, R\. Yang, J\. Sun, A\. Li, Y\. K\. Xu, D\. Wu, H\. Zhang, Y\. Li, et al\. \(2024\) DeepSeekMath: pushing the limits of mathematical reasoning in open language models\. External Links: 2402\.03300 Cited by: [§2](https://arxiv.org/html/2605.28829#S2.p1.1), [§3\.5](https://arxiv.org/html/2605.28829#S3.SS5.p1.1)\.
- \[17\] C\. Team \(2025\) Command a: an enterprise\-focused large language model for diverse business applications\. External Links: 2504\.00698 Cited by: [§2\.2](https://arxiv.org/html/2605.28829#S2.SS2.p1.1)\.
- \[18\] N\. Team \(2025\) Nemotron\-3 nano: a compact, efficient, and high\-performing hybrid mamba\-transformer language model\. External Links: 2512\.20848 Cited by: [§2\.3](https://arxiv.org/html/2605.28829#S2.SS3.p1.1), [§4\.3](https://arxiv.org/html/2605.28829#S4.SS3.SSS0.Px1.p1.1)\.
- \[19\] Y\. Wang, X\. Ma, G\. Zhang, A\. Ni, R\. Chandra, S\. Guo, W\. Ren, Y\. Arulrajah, X\. Jiang, Z\. Wang, et al\. \(2024\) MMLU\-pro: a more robust and challenging multi\-task language understanding benchmark\. External Links: 2406\.01574 Cited by: [§1](https://arxiv.org/html/2605.28829#S1.p5.1), [§4\.2\.2](https://arxiv.org/html/2605.28829#S4.SS2.SSS2.p1.1), [§4\.2\.2](https://arxiv.org/html/2605.28829#S4.SS2.SSS2.p2.1)\.
- \[20\] Z\. Wang, R\. Agarwal, S\. Mukherjee, B\. Samineni, N\. Duan, H\. Kedia, et al\. \(2025\) Nemotron\-cascade: enhancing llm reasoning through sequential reinforcement learning\. External Links: 2512\.13607 Cited by: [§2\.1](https://arxiv.org/html/2605.28829#S2.SS1.p1.1)\.
- \[21\] J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. Le, and D\. Zhou \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. External Links: 2201\.11903 Cited by: [§3\.3](https://arxiv.org/html/2605.28829#S3.SS3.p5.1)\.
## Appendix A Additional Evaluation Tables
### A\.1 Accuracy–Token Trade\-off Tables
Table 9: In\-distribution accuracy–token trade\-off summary\. Pass@1 and output tokens are averaged across JEE Advanced 2025, NEET 2025, JEE Main 2025, and JEE Main 2026\.
Table 10: Out\-of\-distribution accuracy–token trade\-off summary\. Pass@1 and output tokens are averaged across AIME, HMMT, GPQA, MMLU\-Pro, and MMLU\-Redux 2\.0\.
### A\.2 Subject\-wise In\-Distribution Accuracy
Table 11: JEE Advanced 2025 subject\-wise Pass@1 \(4\-sample mean, %\)\.
Table 12: NEET 2025 subject\-wise Pass@1 \(4\-sample mean, %\)\.
Table 13: JEE Main 2025 subject\-wise Pass@1 \(4\-sample mean, %\)\.
Table 14: JEE Main 2026 subject\-wise Pass@1 \(4\-sample mean, %\)\.