@f14bertolotti: Stellar performance from a 3B model. These results were achieved primarily through post-training refinements on Qwen2.5…

X AI KOLs Timeline 06/16/26, 05:20 AM Papers

Summary

This technical report introduces VibeThinker-3B, a 3B parameter model that achieves frontier-level verifiable reasoning performance through post-training refinements on Qwen2.5-Coder, including curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation, matching or exceeding much larger models like DeepSeek V3.2.

Stellar performance from a 3B model. These results were achieved primarily through post-training refinements on Qwen2.5-Coder. The paper doesn't provide many details, but it appears they distill from RL ckpts and then do a final RL-based instruct RL. 🔗https://t.co/FmdRwGNMOg https://t.co/QPez8Ddbgp


Cached at: 06/16/26, 11:53 AM


VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Source: https://arxiv.org/html/2606.16140

Abstract

This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building upon the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through an optimized pipeline that includes curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. Experimental evaluations demonstrate that VibeThinker-3B achieves frontier-level performance on highly demanding verifiable tasks. Specifically, it attains a score of 94.3 on AIME26 (improving to 97.1 with claim-level test-time scaling), an 80.2 Pass@1 on LiveCodeBench v6, and exhibits strong out-of-distribution generalization with a 96.1% acceptance rate on recent unseen LeetCode contests. This effectively places it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Furthermore, a score of 93.4 on IFEval confirms that this extreme reasoning enhancement does not compromise strict instruction controllability. Extending our previous 1.5B work, these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. This perspective suggests that compact models are not merely deployment-efficient substitutes, but a complementary path toward frontier-level performance in parameter-dense capability regimes.

Figure 1: VibeThinker-3B reaches frontier reasoning performance at 3B scale. CLR denotes Claim-Level Reliability Assessment, a claim-level test-time scaling strategy.

Figure 2: Parameter efficiency on IMO-AnswerBench, a highly demanding benchmark comprising 400 IMO-level problems, among open-source reasoning models with disclosed parameter counts. VibeThinker-3B achieves 76.4 with only 3B parameters and reaches 80.6 with CLR, demonstrating that a model within a strictly small-model regime can reach the performance range of substantially larger models such as DeepSeek V3.2 (78.3, 671B), GLM-5 (82.5, 744B), and Kimi K2.5 (81.8, 1T).

1 Introduction

As reinforcement learning [32,44,47,18] has become increasingly integrated into the post-training stage of language models, the complex logical reasoning abilities of large models have improved substantially. At present, the field commonly relies on increasing parameter scale, following scaling laws, to cross the threshold required by difficult reasoning tasks. As a result, frontier reasoning ability is often concentrated in models with tens or hundreds of billions of parameters. In contrast, small language models (SLMs) with 3B parameters or fewer offer clear advantages in deployment cost, inference efficiency, and broader accessibility for academic research, but they are generally considered to face inherent bottlenecks when handling difficult mathematical derivations or complex programming tasks.

Our previous work on VibeThinker-1.5B [42] demonstrated that even models with extremely small parameter counts can be elicited to produce stable and basic chains of logic. This was an initial attempt to challenge the common belief that small models struggle with long-horizon reasoning. However, the 1.5B model mainly demonstrated the feasibility of reasoning in small models, while its upper bound remained to be explored. This led us to a further question: instead of treating SLMs simply as compute-saving fallbacks, what is their true capability boundary? Can a strictly 3B model actually achieve frontier-level performance comparable to top-tier LLMs? Therefore, in this report, we present empirical observations on VibeThinker-3B to further examine the limits of complex verifiable reasoning at the 3B scale.

To further unlock the reasoning capacity of a 3B model, we systematically upgrade the post-training pipeline built upon the Spectrum-to-Signal Principle introduced in VibeThinker-1.5B. In the SFT stage, we strengthen data synthesis, quality filtering, and curriculum learning, allowing the model to first acquire broad coverage across mathematics, code, STEM, general dialogue, and instruction following, and then focus on harder long-horizon reasoning samples. In the RL stage, we retain the core idea of MGPO [42] while extending training to multiple verifiable domains and adopting a more stable long-context strategy to preserve complete reasoning trajectories. We further introduce Long2Short Math RL to improve reasoning efficiency by reducing redundant tokens without sacrificing accuracy. Finally, offline self-distillation and Instruct RL consolidate the capabilities elicited at different stages into a unified model and improve its controllability under complex, constraint-heavy user instructions. Compared with VibeThinker-1.5B, VibeThinker-3B therefore represents not only a moderate increase in parameter scale, but also a more complete post-training system that jointly addresses capability construction, reasoning amplification, efficiency optimization, and instruction alignment.

Extensive evaluations across multiple independent competition systems, under strict data decontamination, confirm the exceptional parameter efficiency of VibeThinker-3B and ensure our findings are not isolated to a single benchmark. While it consistently outperforms existing small and mid-sized reasoning models, its most significant achievement is demonstrating competitiveness against top-tier systems that are orders of magnitude larger. Despite having only 3B parameters, VibeThinker-3B achieves a score of 94.3 on AIME26 [24], matching the performance of much larger models such as DeepSeek V3.2 [20] (671B) and Kimi K2.5 [34] (1T). It also scores 89.3 on HMMT25 [17] and achieves an 80.2 Pass@1 on LiveCodeBench [19] v6, closely trailing the performance of GPT-OSS-120B and DeepSeek V3.2. Furthermore, we employ Claim-Level Reliability Assessment (CLR), a test-time scaling strategy, which yields additional gains on answer-verifiable mathematics benchmarks, elevating its AIME26 score to 97.1, HMMT25 to 95.4, and BruMO25 [8] to 99.2. Beyond standard benchmarks, VibeThinker-3B exhibits strong out-of-distribution generalization, achieving a 96.1% acceptance rate on recent LeetCode weekly and biweekly contests (2026.04.25–2026.05.31), a pass rate comparable to industry-leading models such as GPT-5.2 [27] and Gemini 3 Flash [14]. Extending the technical lineage of the VibeThinker series, these achievements illustrate that a strict 3B parameter budget is sufficient to approach the performance range of leading reasoning models such as Gemini 3 Pro, GLM-5, and Kimi K2.5, and that the reasoning capacity of compact models far exceeds conventional expectations.

Motivated by these findings, we introduce the Parametric Compression-Coverage Hypothesis, which posits that foundational model capabilities differ not only in the amount of parameter capacity they require, but also in the structural form of their parameter demands. Under this view, they can be broadly divided into parameter-dense capabilities and parameter-expansive capabilities. Verifiable reasoning exemplifies the former: its core challenge lies not in memorizing vast open-domain facts, but in performing search, constraint satisfaction, error correction, and multi-step composition within a structured solution space. Consequently, this class can be highly compressed into a compact and reusable reasoning core. In contrast, knowledge-intensive and general-purpose abilities align more closely with the latter, as they require broad coverage over open-domain facts, domain-specific concepts, semantic associations, and long-tail scenarios. Their parameter demands therefore resemble a coverage problem rather than the compression of a reusable reasoning core. This perspective elucidates why VibeThinker-3B achieves performance comparable to top-tier systems on verifiable tasks, such as mathematics and coding, while still exhibiting a gap relative to larger models on knowledge-intensive benchmarks such as GPQA-Diamond.

While parameter scaling remains a fundamental driver of broad model capabilities, we propose the Reasoning-Knowledge Decoupling Paradigm to reveal the highly specialized potential of smaller models. Under this paradigm, large-scale models continue to serve as natural vehicles for expansive knowledge breadth, as absorbing diverse semantics and long-tail distributions inherently requires massive parameter capacity. Conversely, provided with structurally constrained spaces and reliable training signals, smaller models are already sufficient to encapsulate high-density reasoning depth. Therefore, the true significance of VibeThinker-3B does not lie in proving that a 3B model can replace large-scale generalists, but rather in providing a concrete empirical signal: the development of compact models is no longer merely a passive compromise for deployment efficiency or cost control; it emerges as a promising research trajectory that is fundamentally complementary to the traditional parameter scaling paradigm.

Figure 3: Overall training pipeline of VibeThinker-3B.

2 Methods

Overall Pipeline. VibeThinker-3B is developed through a staged post-training pipeline built upon Qwen2.5-Coder-3B base, a compact 3B dense foundation model. Our focus is on systematically eliciting and consolidating reasoning capabilities through data synthesis, diversity-oriented supervised fine-tuning, multi-domain reinforcement learning, offline self-distillation, and instruction-oriented alignment. The overall post-training framework continues the Spectrum-to-Signal Principle (SSP) introduced in VibeThinker-1.5B [42]. Building upon the core methodology of our previous work, we continue to employ Diversity-Exploring Distillation in the SFT stage to construct a broad solution space (the “Spectrum”), and utilize MaxEnt-Guided Policy Optimization (MGPO) in the RL stage to amplify high-value reasoning signals (the “Signal”).

For this 3B iteration, we have comprehensively optimized the data construction and overall training pipeline based on our original foundation. As depicted in Fig. 3, the complete post-training framework unfolds sequentially in stages. First, in the Supervised Fine-Tuning (SFT) stage, we have upgraded the data synthesis and filtering pipeline, which supports the introduction of a two-stage curriculum learning strategy. This enables the model to transition smoothly from broad capability coverage to deep, long-horizon reasoning. Subsequently, in the Reinforcement Learning (RL) stage, we apply MGPO to multi-domain reasoning tasks with a significantly expanded context window; furthermore, in the mathematical RL phase, we introduce a Long2Short stage designed to optimize reasoning efficiency without compromising accuracy. Once the core reasoning RL is complete, the pipeline proceeds to an Offline Self-Distillation phase that feeds the newly elicited capabilities back into a single model, and concludes with an Instruct RL stage to further reinforce the model’s strict adherence to complex, multi-step instructions. The following subsections describe the implementation of each stage.

2.1 Supervised Fine-tuning

2.1.1 Data Construction

During the SFT phase, we construct a multi-domain mixed supervised dataset for the base model to provide a stable cold-start policy for the subsequent RL phase. The dataset encompasses various tasks, including math, code, STEM reasoning, general chat, and instruction following.

Data Synthesis and Query Expansion. VibeThinker-3B introduces an automated data synthesis pipeline during the SFT phase to broaden the coverage of training queries. We only select queries with reliable supervision signals from existing datasets as seed queries: mathematical queries must possess explicit and credible final answers or solving rationales, while competitive programming queries must be equipped with reliable unit tests or executable evaluation rules. Based on these high-confidence seed samples, we rewrite and expand the queries across multiple dimensions, e.g., concept composition, problem-solving skeletons, constraints, and evaluation objectives, yielding derivative queries that encompass a wider array of knowledge configurations and reasoning patterns. For the initially filtered synthetic queries, we further perform multiple independent samplings using strong teacher models and generate pseudo-labels via majority voting, establishing the foundation for subsequent distillation and training.

Multi-path Reasoning Distillation. For reasoning-intensive samples in mathematics, code, and STEM, we adopt a multi-path distillation approach to construct SFT responses. Specifically, we employ strong teacher models to sample multiple candidate reasoning traces for each query, retaining the complete intermediate reasoning steps rather than keeping only a single standard solution. This design inherits the Spectrum-to-Signal paradigm from VibeThinker-1.5B, under which the SFT phase is tasked with constructing a solution spectrum that covers diverse valid methods, offering a broader candidate solution space for subsequent RL. By explicitly preserving this multi-solution structure, the model learns various decomposition methods, derivation paths, and verification strategies, thereby improving exploration diversity during subsequent on-policy sampling.

Multi-level Quality Control. The quality of SFT data directly determines the performance upper bound of subsequent RL. Consequently, we implement a more rigorous, multi-level quality control process:

1) N-gram-based filtering. We discard samples containing anomalous repetitive segments, templated degeneration patterns, or n-gram overlaps with evaluation sets, to remove low-quality generations and benchmark contamination.
2) LLM-based Query Quality Filtering. We utilize capable LLMs to assess query quality, filtering out samples with incomplete descriptions, unreasonable conditions, invalid logic, or an inability to effectively assess target knowledge points.
3) Trace Correctness Filtering. At the distilled response level, we screen reasoning traces through a combination of answer verification, code sandbox execution, and LLM majority voting. Traces with incorrect final answers, failed execution results, or evidently invalid reasoning steps are filtered out.
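
The n-gram step in the list above can be illustrated with a small sketch. The report does not state the n-gram size, the tokenizer, or the repetition threshold, so whitespace tokenization, `n`, and `max_repeats` below are assumptions chosen only for illustration, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of word-level n-grams (whitespace tokenization is an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that appears in any evaluation-set problem."""
    index: Set[NGram] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(sample: str, index: Set[NGram], n: int) -> bool:
    """A sample is flagged if it shares at least one n-gram with the benchmark index."""
    return not ngrams(sample, n).isdisjoint(index)


def has_degenerate_repetition(sample: str, n: int, max_repeats: int) -> bool:
    """Flag samples in which some n-gram occurs more than max_repeats times."""
    tokens = sample.lower().split()
    counts: Dict[NGram, int] = {}
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] > max_repeats:
            return True
    return False
```

A training sample would be dropped if either `is_contaminated` or `has_degenerate_repetition` returns `True`; a production pipeline would more likely operate on tokenizer IDs and normalized text.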

The quality-controlled data is then stratified based on reasoning chain length and problem difficulty, establishing the data foundation for the subsequent curriculum SFT.

2.1.2 Training Process

Curriculum-based two-stage SFT strategy. Compared with VibeThinker-1.5B, VibeThinker-3B adopts a curriculum-based two-stage SFT procedure. The first stage focuses on broad capability coverage and behavioral cold start. We utilize the entire quality-filtered reasoning dataset for training to maximize the diversity of task types and reasoning patterns. Given the substantial variance in Chain-of-Thought (CoT) lengths within the first stage data, we employ sequence packing to enhance training efficiency. For optimization, we use a global batch size of 128 and set the initial learning rate to $5\times 10^{-5}$. The learning rate follows a cosine annealing schedule and decays to a minimum value of $8\times 10^{-8}$. The first stage is trained for 5 epochs with a 5% linear warmup.
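
As a rough illustration of the schedule described above, the sketch below combines a 5% linear warmup with cosine annealing from $5\times 10^{-5}$ down to $8\times 10^{-8}$. The report does not say where the warmup starts or how steps are counted, so ramping from near zero and the step indexing here are assumptions; `learning_rate` is a hypothetical helper, not the authors' code.

```python
import math


def learning_rate(step: int, total_steps: int, peak_lr: float = 5e-5,
                  min_lr: float = 8e-8, warmup_frac: float = 0.05) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed shape)."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Assumption: warmup ramps linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With `total_steps=1000`, the rate reaches the peak at the end of the first 50 steps and decays to the minimum value at step 1000.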

Upon acquiring a stable, broad-coverage SFT model, we proceed to the second stage, shifting the training data distribution toward higher-difficulty and longer-horizon reasoning samples. Initialized from the final checkpoint of the first stage, this phase continues training on a hard-reasoning subset generated through a joint length-difficulty filtration. Specifically, we first discard samples with reasoning traces shorter than 5K tokens. Subsequently, using VibeThinker-1.5B as a reference model, we perform 8 independent rollouts per query, filtering out relatively easy problems that yield an error rate below 0.75. This filtering strategy effectively reduces the proportion of shallow reasoning data, compelling the stage-two SFT to concentrate on long-horizon logical derivation, complex constraint satisfaction, and advanced problem-solving. Retaining the exact hyperparameter configuration from the first stage, this phase undergoes an additional 2 epochs of training on the hard-sample subset.
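
The joint length-difficulty filter described above can be sketched as follows. The thresholds come from the text (traces of at least 5K tokens, 8 reference rollouts, error rate of at least 0.75); reading "5K" as 5000 tokens is an assumption, and the `Sample` structure and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    query: str
    trace_tokens: int                      # length of the distilled reasoning trace
    reference_rollout_correct: List[bool]  # outcomes of the reference-model rollouts


def error_rate(sample: Sample) -> float:
    """Fraction of reference rollouts that failed on this query."""
    outcomes = sample.reference_rollout_correct
    return 1.0 - sum(outcomes) / len(outcomes)


def is_hard(sample: Sample, min_trace_tokens: int = 5000,
            min_error_rate: float = 0.75) -> bool:
    """Keep only long traces whose queries the reference model mostly fails."""
    if sample.trace_tokens < min_trace_tokens:
        return False
    return error_rate(sample) >= min_error_rate
```

Under these thresholds, a query survives only if the reference model solves it in at most 2 of its 8 rollouts.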

Diversity-Exploring Distillation. Following VibeThinker-1.5B [42], we apply Diversity-Exploring Distillation in both SFT stages to mitigate potential gradient interference in multi-domain training and preserve the reasoning diversity of model outputs. This method follows the Spectrum-to-Signal Principle: the SFT stage does not aim for optimal imitation along a single solution path, but instead prioritizes the construction of a broader candidate solution space, providing a richer exploration basis for subsequent RL.

Specifically, we periodically save intermediate checkpoints during training and evaluate their Pass@K performance on domain-specific probing sets. For each domain, we select the checkpoint that produces the most valid solutions, as measured by Pass@K, as the corresponding specialist model, rather than simply choosing the checkpoint with the lowest validation loss or the highest Pass@1. These domain specialist models are then merged at the parameter level to obtain a unified SFT model. The resulting merged model preserves domain-specific reasoning capabilities while maintaining high output diversity, thereby providing a wider solution spectrum for subsequent training stages.
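
A minimal sketch of this selection-and-merge procedure is given below. Two points are assumptions rather than details from the report: Pass@K is computed with the standard unbiased estimator $1-\binom{n-c}{k}/\binom{n}{k}$, and the merge is a uniform parameter average (the report only says the specialists are merged "at the parameter level"). Parameters are represented as plain lists of floats to keep the example self-contained.

```python
from math import comb
from typing import Dict, List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def select_specialist(results: Dict[str, List[Tuple[int, int]]], k: int) -> str:
    """Pick the checkpoint with the highest mean Pass@K on one domain's probing set.

    results maps a checkpoint name to (n_samples, n_correct) pairs, one per problem.
    """
    def mean_pass_at_k(stats: List[Tuple[int, int]]) -> float:
        return sum(pass_at_k(n, c, k) for n, c in stats) / len(stats)

    return max(results, key=lambda name: mean_pass_at_k(results[name]))


def merge_uniform(state_dicts: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Assumed merge rule: element-wise mean of the specialists' parameters."""
    count = len(state_dicts)
    return {
        key: [sum(vals) / count for vals in zip(*(sd[key] for sd in state_dicts))]
        for key in state_dicts[0]
    }
```

For example, a checkpoint that solves two problems in 2 of 8 samples each has the same mean Pass@1 (0.25) as one that solves a single problem in 4 of 8 samples and never solves the other, but a much higher Pass@4, so it would be preferred under this criterion.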

2.2 Reinforcement Learning

2.2.1 Algorithm Backbone

We reuse MaxEnt-Guided Policy Optimization (MGPO), introduced in VibeThinker-1.5B [42], as the core RL algorithm. Under the Spectrum-to-Signal Principle, SFT constructs a diverse solution space, and RL is responsible for amplifying the correct reasoning signals within it. MGPO serves this role by dynamically selecting prompts near the model’s current capability boundary.

For each prompt $q$, we sample $G$ responses from the old policy and evaluate them with verifiable rewards. The empirical group accuracy is computed as:

$$p(q)=\frac{1}{G}\sum_{i=1}^{G}\mathbb{I}(r_{i}=1). \tag{1}$$

Prompts with $p(q)\approx 0$ are too difficult and provide sparse positive signals, while prompts with $p(q)\approx 1$ are already saturated. Therefore, MGPO assigns higher weights to prompts with intermediate correctness:

$$w(q)=\exp\left(-\gamma D_{\mathrm{ME}}(\,p(q)\,\|\,p_{0}\,)\right),\quad\mathrm{where}\,\,p_{0}=0.5,\,\gamma>0. \tag{2}$$

Here, $D_{\mathrm{ME}}(p(q)\|p_{0})$ measures how far the empirical correctness $p(q)$ deviates from the maximum-entropy point $0.5$. A smaller value indicates that the prompt lies closer to the model’s current capability boundary, where correct and incorrect rollouts coexist. This weight is applied to the group-relative advantage inside a GRPO-style clipped objective:

$$\mathcal{J}_{\mathrm{MGPO}}(\theta)=\mathbb{E}_{q,\{y_{i}\}}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_{i}|}\sum_{t=1}^{|y_{i}|}\min\!\bigl(\rho_{i,t}(\theta)\,w(q)A_{i},\;\mathrm{clip}(\rho_{i,t}(\theta),1-\varepsilon,1+\varepsilon)\,w(q)A_{i}\bigr)\right], \tag{3}$$

where $A_{i}$ is the group-relative advantage, $\rho_{i,t}(\theta)$ is the token-level probability ratio between the current and old policies, and $\varepsilon$ is the clipping coefficient. Inspired by the maximum-entropy principle, this weighting mechanism encourages RL updates to focus on prompts with sufficient uncertainty, thereby producing more stable and healthy gradient signals. It also mitigates over-optimization on high-probability tokens and reduces the negative impact of noisy tokens during policy updates.
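
To make the weighting concrete, the sketch below computes the prompt weight and the weighted group-relative advantages for one prompt group. This section does not spell out the form of the deviation term, so the code assumes it is the KL divergence between Bernoulli distributions with success probabilities $p(q)$ and $p_0$; the value of `gamma`, the clipping constant `eps`, and the standardized GRPO-style advantage are likewise assumptions made for illustration.

```python
import math
from typing import List


def max_entropy_deviation(p: float, p0: float = 0.5, eps: float = 1e-6) -> float:
    """Assumed form of the deviation: KL(Bernoulli(p) || Bernoulli(p0))."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0) for all-correct or all-wrong groups
    return p * math.log(p / p0) + (1.0 - p) * math.log((1.0 - p) / (1.0 - p0))


def mgpo_weight(rewards: List[int], gamma: float = 1.0, p0: float = 0.5) -> float:
    """Prompt weight: exp(-gamma * deviation of group accuracy from p0)."""
    accuracy = sum(1 for r in rewards if r == 1) / len(rewards)
    return math.exp(-gamma * max_entropy_deviation(accuracy, p0))


def weighted_advantages(rewards: List[int], gamma: float = 1.0) -> List[float]:
    """Group-relative (standardized) advantages scaled by the prompt weight."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    weight = mgpo_weight(rewards, gamma)
    return [weight * (r - mean) / (std + 1e-6) for r in rewards]
```

Under these assumptions a group with half of its rollouts correct receives weight 1, and the weight decays symmetrically as the group accuracy moves toward 0 or 1.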

In VibeThinker-3B, we keep the core MGPO formulation unchanged, while making several adjustments to improve the training stability. During training, we observe that as the rollout engine becomes increasingly optimized for inference throughput, the training-inference probability mismatch is gradually amplified by multiple implementation factors. This mismatch can destabilize or even collapse RL training. To mitigate this issue, we adopt the stabilization strategy from [13,43] and perform all RL stages in an on-policy manner.

2.2.2 Multi-domain Reasoning RL

We apply MGPO to multi-domain verifiable reasoning tasks, including mathematics, code, and STEM reasoning. These domains share the same policy optimization framework, but use different reward sources and verification mechanisms: mathematical tasks mainly rely on final-answer verification, code tasks rely on sandbox execution and test cases, and STEM tasks combine answer matching with option verification.

Training Data. For all domains, the training sets comprise data with reliable supervision signals and have undergone strict benchmark decontamination. Additionally, before training commences, we filter out samples yielding an accuracy of exactly 0.0 or 1.0 as evaluated by the starting checkpoint of each respective phase.

Single Long-context Learning. [21] introduce a multi-stage RL strategy based on progressive context-window expansion, improving both training efficiency and final reasoning performance. We observed a similar phenomenon in VibeThinker-1.5B, where progressively expanding the context window led to better reasoning performance with lower training cost.

However, this conclusion does not hold in VibeThinker-3B. We find that a high-truncation early stage weakens the model’s long-thinking capability and biases the policy toward incomplete or overly shortened reasoning trajectories. We hypothesize that this reversal is related to the stronger RL initialization checkpoint: compared with VibeThinker-1.5B, VibeThinker-3B undergoes stricter SFT data quality control and contains fewer invalid reasoning patterns. As a result, high-truncation warm-up may no longer mainly remove noisy thinking traces, but instead disrupt existing high-quality long-horizon reasoning behaviors. Even after the context window is expanded later, this degradation is difficult to fully recover. Therefore, we directly conduct RL with a single 64K long-context window, reducing rollout truncation and better preserving long-horizon reasoning trajectories.

Training Strategy. As illustrated in Fig. 3, we adopt a sequential multi-domain Reasoning RL pipeline. Training starts with Math RL, which strengthens the model’s long-horizon symbolic derivation, complex condition composition, and multi-step search capabilities. It then smoothly transitions to Code RL, focusing on improving the rigor of executable logic, boundary-case handling, and program constraint satisfaction. Finally, we conduct STEM RL to generalize the underlying logical reasoning ability to multidisciplinary scientific scenarios, enhancing knowledge utilization and cross-domain reasoning. The checkpoint obtained after each RL stage is preserved and used in the subsequent offline self-distillation phase, where high-quality reasoning trajectories elicited at different stages are collected to further consolidate the model’s overall reasoning capability.

Long2Short Math RL. Different from our previous work, VibeThinker-3B adopts a 'from accuracy to efficiency' two-stage reinforcement learning strategy. In the first stage, we optimize for accuracy using standard MGPO, allowing the model to fully unfold its reasoning process and explore diverse solution paths. Subsequently, we introduce a Long2Short stage in Math RL, extending the optimization objective from pure accuracy improvement to token-efficiency optimization. The goal of this stage is to reduce redundant reasoning and improve output efficiency while preserving validation-set performance. In Long2Short RL, we redistribute rewards only among correct trajectories in each prompt group according to response length, increasing the rewards of shorter correct responses and decreasing those of longer correct responses. After obtaining the binary correctness reward $r_{i}\in\{0,1\}$ for each sampled trajectory $y_{i}$, we keep all incorrect trajectories unchanged. For the correct set $\mathcal{C}=\{i\mid r_{i}=1\}$, we define a brevity score $s_{i}=1/L_{i}$, where $L_{i}$ denotes the response length, and apply a centered length-aware reward shift:

$$r_{i}^{\prime}=r_{i}+\lambda\cdot\frac{s_{i}-\bar{s}}{\max_{j\in\mathcal{C}}|s_{j}-\bar{s}|},\qquad i\in\mathcal{C},$$

where $\bar{s}$ is the mean brevity score over correct trajectories and $\lambda$, set to 0.2, controls the maximum redistribution magnitude. If all correct trajectories have the same length, the rewards are left unchanged. Since the reward shifts are centered within $\mathcal{C}$, their sum is zero:

$$\sum_{i\in\mathcal{C}}(r_{i}^{\prime}-r_{i})=0.$$

Therefore, the mean reward before and after redistribution remains unchanged for the correct subset and, since incorrect rewards are also unchanged, for the whole prompt group. This zero-sum design avoids introducing a systematic shift to the group-level reward baseline used in advantage estimation, while still reshaping the relative preference among correct trajectories toward more concise reasoning paths.
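
The redistribution rule above is simple enough to check numerically. The following sketch applies it to one prompt group; the function name is hypothetical, lengths are assumed to be token counts, and the equal-length case is detected by comparing lengths directly rather than brevity scores to avoid floating-point noise.

```python
from typing import List


def long2short_rewards(rewards: List[int], lengths: List[int],
                       lam: float = 0.2) -> List[float]:
    """Shift rewards among correct trajectories toward shorter responses (zero-sum)."""
    shaped = [float(r) for r in rewards]
    correct = [i for i, r in enumerate(rewards) if r == 1]
    # Nothing to redistribute if there are no correct trajectories
    # or if all correct trajectories have the same length.
    if not correct or len({lengths[i] for i in correct}) == 1:
        return shaped
    brevity = {i: 1.0 / lengths[i] for i in correct}          # s_i = 1 / L_i
    mean_brevity = sum(brevity.values()) / len(correct)        # s-bar over the correct set
    max_dev = max(abs(s - mean_brevity) for s in brevity.values())
    for i in correct:
        shaped[i] += lam * (brevity[i] - mean_brevity) / max_dev
    return shaped
```

For rewards `[1, 1, 0, 1]` with lengths `[1000, 2000, 500, 4000]`, the shaped rewards are approximately `[1.2, 0.96, 0.0, 0.84]`: the shortest correct response gains the full $\lambda$, the incorrect response is untouched, and the group total is still 3.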

2.3 Offline Self-Distillation

After completing multi-domain Reasoning RL, we use the checkpoints from the Math, Code, and STEM RL stages, together with data filtering, to extract offline trajectories that contain high-quality reasoning patterns. These trajectories are then distilled back into a unified student model through supervised fine-tuning, enabling more stable integration of multi-domain reasoning capabilities.

Learning-potential Filtering. We first perform rejection sampling with domain-specific verifiers to remove incorrect trajectories. After obtaining verified teacher trajectories, we further introduce a learning potential score to estimate the distillation value of each correct trace for the student model. Specifically, for an input $q$ and a verified teacher trajectory $y$, we compute the length-normalized negative log-likelihood under the student model:

$$S_{\mathrm{LP}}(q,y)=-\frac{1}{|y|}\sum_{t=1}^{|y|}\log\pi_{\theta_{\mathrm{stu}}}(y_{t}\mid q,y_{<t}). \tag{4}$$

A higher score indicates that the trace, although successfully generated and verified by the teacher, is not yet well modeled by the student, and therefore carries higher distillation value.

To prevent this score from being biased by sequence length or abnormal tokens, we do not rank traces globally. Instead, we compute priorities within domain-specific length buckets. Extremely short traces are excluded from score-based selection, as their average score can be dominated by a few abnormal tokens; extreme high-score outliers are also filtered to reduce the impact of format errors, distributional shifts, or noisy samples. Finally, we prioritize verified traces from the middle-to-high score range and mix the selected data across Math, Code, and STEM to construct the offline self-distillation dataset.
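
The scoring and bucketed selection described above might look like the sketch below. The score follows Eq. (4); everything else is an assumption made for illustration, since the report gives no concrete thresholds: the minimum length, the bucket width, and the quantile range (`low_q`, `high_q`) that stands in for "middle-to-high scores with extreme outliers removed" are hypothetical values, as are the `Trace` structure and function names.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Trace:
    domain: str
    token_logprobs: List[float]  # student log-probability of each teacher token


def learning_potential(trace: Trace) -> float:
    """Length-normalized negative log-likelihood under the student (Eq. 4)."""
    return -sum(trace.token_logprobs) / len(trace.token_logprobs)


def select_for_distillation(traces: List[Trace], min_tokens: int = 256,
                            bucket_width: int = 2048, low_q: float = 0.5,
                            high_q: float = 0.9) -> List[Trace]:
    """Rank within (domain, length-bucket) groups and keep a middle-to-high score band."""
    buckets: Dict[Tuple[str, int], List[Trace]] = defaultdict(list)
    for trace in traces:
        n_tokens = len(trace.token_logprobs)
        if n_tokens < min_tokens:
            continue  # very short traces are excluded from score-based selection
        buckets[(trace.domain, n_tokens // bucket_width)].append(trace)

    selected: List[Trace] = []
    for members in buckets.values():
        ranked = sorted(members, key=learning_potential)
        lo = int(low_q * len(ranked))
        hi = int(high_q * len(ranked))
        # Traces above high_q are treated as outliers; keep at least one per bucket.
        selected.extend(ranked[lo:max(hi, lo + 1)])
    return selected
```

Ranking inside each bucket, rather than globally, keeps long traces from being compared directly against short ones, which is the bias the paragraph above is guarding against.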

2.4 Instruct RL

We finally apply Instruct RL to convert the reasoning-enhanced checkpoint into a more reliable user-facing model. We train on a mixed instruction dataset containing format-sensitive prompts, long-context instructions, and general alignment examples. For samples with explicit constraints, rewards are computed by rule-based validators that check format, ordering, item count, keyword constraints, and task completion. For open-ended prompts, we use rubric-based reward models to evaluate helpfulness, coherence, instruction adherence, and redundancy. By combining constraint checking with rubric-based rewards under the same on-policy RL framework, Instruct RL reinforces strict controllability while preserving the reasoning ability obtained from previous stages.

3 Evaluation

3.1 Evaluation Setup

Benchmarks. We evaluate VibeThinker-3B on a broad set of verifiable and instruction-oriented benchmarks that cover mathematical reasoning, code generation, scientific knowledge, and instruction following. For mathematics, we use AIME25 [5], AIME26 [24], HMMT25 [17], BruMO25 [8], and IMO-AnswerBench [22] (abbreviated as IMO-Ans in tables), which together include recent competition-style problems with different formats and difficulty profiles. For coding, we report LiveCodeBench [19] v6 and OJBench [38] as standard executable-code benchmarks. We further include GPQA-Diamond [31] to measure graduate-level scientific reasoning, and IFEval and IFBench to evaluate whether the reasoning-enhanced model can still follow explicit user constraints. In addition to these standard benchmarks, we evaluate recent LeetCode weekly and biweekly contests as a practical out-of-distribution test for algorithmic problem solving.

Evaluation protocol. All VibeThinker-3B evaluations are performed with vLLM as the inference backend. Unless otherwise specified, we use temperature 1.0, top-$p=0.95$, and top-$k=-1$ for benchmark evaluation. We do not impose an additional output length cap beyond the model’s maximum generation length, allowing the model to complete long reasoning trajectories when needed. For mathematical tasks, unlike the evaluation in VibeThinker-1.5B, we jointly use math verify and LLM-as-judge to evaluate answer consistency. This is particularly important for benchmarks such as IMO-AnswerBench, where final answers can take more complex forms and rule-based symbolic verification alone may produce unreliable judgments. For code tasks, correctness is determined by executing the generated solution against the corresponding tests.

To ensure the stability of the evaluation metrics, we adopt different repeated sampling strategies based on the problem scale of various benchmarks. Specifically, for mathematical benchmarks, we report the mean Pass@1 over 64 independent generations, except for IMO-AnswerBench where 16 independent generations are used. For knowledge and coding benchmarks, we calculate the average performance using 16 and 8 independent generations, respectively. Scores of comparison models are collected from their released reports, public leaderboards, or official evaluation records when available.

Test-time scaling with claim-level reliability assessment.We additionally evaluate VibeThinker-3B with Claim-Level Reliability Assessment (CLR), a test-time scaling strategy for answer-verifiable tasks. Unlike most test-time scaling methods that aggregate whole reasoning traces, CLR focuses on the important claims that affect key decisions during problem solving. It follows a streamlined two-stage procedure. First, using the exact same sampling parameters as our standard evaluation, the model generatesK=32K=32candidate trajectories per problem and extractsM=5M=5decision-relevant claims alongside the final answer for each trajectory. Second, the model acts as its own self-verifier, attempting to falsify or validate these extracted claims to yield binary verdictsvk,m∈{0,1}v_{k,m}\in\{0,1\}. To heavily penalize trajectories containing flawed intermediate logic, CLR maps these verdicts into a nonlinear trajectory-level reliability scorerkr_{k}:

$$r_{k}=\left(\frac{1}{M}\sum_{m=1}^{M}v_{k,m}\right)^{M} \tag{5}$$

Finally, candidate answers are clustered by equivalence, and the answer maximizing the reliability-weighted aggregation is selected:

$$\mathrm{Score}(G)=\sum_{\{k\mid y_{k}\in G\}}r_{k} \tag{6}$$

This claim-level assessment effectively reduces noise from long traces without updating model parameters. Compared to trace-level self-verification methods that require processing the entire verbose trajectory, CLR isolates critical logical anchors to significantly reduce token consumption while consistently improving Pass@1 performance. In our experiments, we independently execute this entire test-time scaling flow $8$ times and report the averaged Pass@1 performance as “+ CLR” in Table 2.
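Equations (5) and (6) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the verdicts are given as input rather than produced by a self-verifier, and plain string equality stands in for the equivalence clustering of answers.

```python
from collections import defaultdict


def reliability(verdicts: list[int]) -> float:
    """Eq. (5): r_k = (mean of the M binary verdicts) ** M."""
    m = len(verdicts)
    return (sum(verdicts) / m) ** m


def select_answer(answers: list[str], verdicts: list[list[int]]) -> str:
    """Eq. (6): sum r_k within each answer cluster and return the best cluster.

    Assumption: answers are already normalized, so string equality stands in
    for the equivalence clustering described in the text.
    """
    scores: dict[str, float] = defaultdict(float)
    for y_k, v_k in zip(answers, verdicts):
        scores[y_k] += reliability(v_k)
    return max(scores, key=scores.get)


# Toy example with K = 3 trajectories and M = 5 claims each.
answers = ["42", "17", "17"]
verdicts = [
    [1, 1, 1, 1, 1],  # r = 1.0
    [1, 1, 1, 1, 0],  # r = 0.8 ** 5 = 0.32768
    [1, 1, 1, 0, 0],  # r = 0.6 ** 5 = 0.07776
]
best = select_answer(answers, verdicts)  # "42": 1.0 beats 0.32768 + 0.07776
```

In this toy case a plain majority vote would pick "17", while the reliability-weighted score picks "42". The exponent $M$ makes the score drop steeply as soon as a single claim fails verification, which is the penalty on flawed intermediate logic that the text describes.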

3.2 Evaluation Results

Overview. The central question of this evaluation follows directly from our previous VibeThinker-1.5B study. That model demonstrated that small models can perform reasoning tasks well, rather than merely producing shallow or unstable reasoning traces. VibeThinker-3B takes the next step: instead of asking whether a small model can reason at all, we ask how much parameter capacity is needed for a small model to enter the performance band of first-tier reasoning systems. After increasing the scale from 1.5B to 3B while preserving the diversity-driven post-training paradigm, the results below suggest that the reasoning capability of compact models is not linearly bounded by their parameter scale, and a 3B model can move from “strong for its size” toward genuine first-tier competitiveness.

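Table 1: Performance of VibeThinker-3B on Core Benchmarks. Column groups: Mathematics (AIME25, AIME26, HMMT25, BruMO25, IMO-Ans), Coding (LCBv6, OJBench), Knowledge (GPQA-D), Instruction (IFEval, IFBench).

| Model | Params | AIME25 | AIME26 | HMMT25 | BruMO25 | IMO-Ans | LCBv6 | OJBench | GPQA-D | IFEval | IFBench |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Small Reasoning Models | | | | | | | | | | | |
| SmolLM3 | 3B | 36.7 | 41.0 | 26.0 | 49.2 | 28.7 | 29.1 | 5.2 | 41.7 | 71.2 | 27.6 |
| Hunyuan-4B-Instruct | 4B | 66.5 | 57.7 | 35.2 | 62.7 | 39.6 | 46.8 | 12.1 | 61.1 | 76.6 | 26.5 |
| Qwen3-4B-Thinking-2507 | 4B | 81.3 | 79.0 | 55.5 | 77.7 | 51.6 | 55.2 | 17.9 | 65.8 | 87.4 | 52.9 |
| Qwen3.5-4B | 4B | 79.8 | 84.0 | 73.8 | 83.5 | 48.7 | 62.0 | 23.5 | 76.2 | 89.8 | 59.2 |
| Olmo-3-Think | 7B | 67.9 | 69.1 | 43.8 | 69.0 | 49.4 | 52.6 | 15.6 | 46.2 | 77.9 | 30.0 |
| Mimo7B-RL-0530 | 7B | 70.2 | 76.0 | 48.5 | 79.8 | 53.9 | 52.2 | 20.2 | 60.6 | 59.7 | 31.6 |
| OpenReasoning-Nemotron | 7B | 78.2 | 80.2 | 63.5 | 78.8 | 60.6 | 64.9 | 25.9 | 61.1 | 44.0 | 31.3 |
| Gemma-4-it | 12B | 72.9 | 77.5 | 63.3 | 80.4 | 54.9 | 72.0 | – | 78.8 | 88.4 | 45.2 |
| Phi4-Reasoning-Plus | 14B | 68.4 | 73.6 | 50.3 | 66.5 | 46.2 | 56.8 | 14.4 | 81.9 | 84.9 | 51.7 |
| Ministral-3-Reasoning-2512 | 14B | 82.9 | 85.0 | 67.1 | 86.7 | 63.4 | 66.0 | 15.1 | 71.2 | 73.9 | 32.3 |
| Large Reasoning Models | | | | | | | | | | | |
| GPT-OSS-20B (high) | 20B | 91.7 | 90.2 | 76.7 | 86.7 | 61.9 | 61.0 | – | 71.5 | 92.8 | 65.0 |
| Nemotron-3-Nano | 30B | 89.1 | 90.1 | – | – | 70.4 | 68.3 | – | 73.0 | 92.8 | 71.5 |
| GLM-4.5-Air | 106B | 83.3 | – | 69.2 | 90.0 | – | 70.7 | – | 75.0 | 86.3 | 37.6 |
| Qwen3-235B-A22B-Thinking | 235B | 92.3 | – | 83.9 | – | 70.5 | 74.1 | 32.5 | 81.1 | 87.8 | 51.2 |
| LongCat Flash | 560B | 90.6 | – | 83.7 | – | – | 79.4 | 40.7 | 81.5 | 86.9 | – |
| GPT-5 Nano (high) | N/A | 85.2 | – | 75.6 | 80.8 | – | – | – | 71.2 | – | – |
| VibeThinker-3B | 3B | 91.4 | 94.3 | 89.3 | 93.8 | 76.4 | 80.2 | 38.6 | 70.2 | 93.4 | 74.5 |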

Core benchmark performance. Table 1 summarizes the main evaluation results across mathematics, coding, knowledge, and instruction following. The upper block of the table compares VibeThinker-3B with small and mid-sized reasoning models, including SmolLM3 [6], Hunyuan-4B-Instruct [37], Qwen3.5-4B [29], Olmo-3-Think [25], Mimo7B-RL-0530 [41], OpenReasoning-Nemotron [23], Gemma-4-12B-it [16], and Phi4-Reasoning-Plus [1]. On the mathematics suite, VibeThinker-3B reaches 91.4 on AIME25, 94.3 on AIME26, 89.3 on HMMT25, 93.8 on BruMO25, and 76.4 on IMO-AnswerBench. Compared to strong small reasoning baselines ($<$14B), our model establishes a substantial performance lead.

Crucially, the coding and instruction-following results demonstrate that this enhancement is not confined to a specific family of mathematical benchmarks. VibeThinker-3B reaches 80.2 on LiveCodeBench v6 and 38.6 on OJBench, surpassing all models in Table 1 on LiveCodeBench v6. Furthermore, it achieves 93.4 on IFEval and 74.5 on IFBench, confirming that the reasoning optimization process does not compromise instruction controllability. This is particularly significant, as a practical small reasoning model must not only solve competitive problems but also maintain reliable user-facing alignment after undergoing long-context reasoning RL and self-distillation.

The lower block of Table 1 extends the comparison to substantially larger reasoning models, namely GPT-OSS-20B (high) [2], Nemotron-3-Nano [7], GLM-4.5-Air [46], Qwen3-235B-A22B-Thinking [36], and LongCat Flash [35]. This serves as a more rigorous test for the central claim articulated in the Introduction: if reasoning ability on verifiable tasks depends primarily on abstract search, constraint satisfaction, and error correction rather than parametric memorization, then a carefully optimized 3B model should be capable of challenging much larger systems on these tasks. Our empirical results substantiate this hypothesis. VibeThinker-3B achieves leading performance across multiple benchmarks when compared to reasoning models several times its size, outperforming or matching several 30B–560B open models. This demonstrates that the 3B parameter budget is already sufficient to support highly compressed, long-horizon mathematical reasoning, provided the model is trained with appropriate exploration and verification signals.

At the same time, the table reveals a useful boundary. The gap to the strongest large models is more visible on broad knowledge-heavy evaluation, especially GPQA-Diamond, than on competition mathematics or executable coding. This echoes the observation from VibeThinker-1.5B: compact models can acquire strong reasoning procedures, but knowledge-intensive benchmarks still expose a clear gap to large-parameter general models. This pattern is consistent with our hypothesis that reasoning and knowledge storage are only partially coupled. Compact models may still face capacity limits when broad domain knowledge must be recalled directly, but they can nevertheless host a highly effective reasoning engine for tasks with verifiable goals and structured solution spaces.

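Table 2: Performance of VibeThinker-3B on Core Benchmarks (Top-Tier Reasoning Models). Column groups: Mathematics (AIME25, AIME26, HMMT25, BruMO25, IMO-Ans), Coding (LCBv6, OJBench), Knowledge (GPQA-D), Instruction (IFEval, IFBench).

| Model | Params | AIME25 | AIME26 | HMMT25 | BruMO25 | IMO-Ans | LCBv6 | OJBench | GPQA-D | IFEval | IFBench |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-Source Models | | | | | | | | | | | |
| GPT-OSS (high) | 120B | 92.5 | 93.2 | 90.0 | 92.5 | 75.6 | 81.9 | 41.5 | 80.1 | 89.5 | 69.5 |
| MiMo v2 Flash | 309B | 94.1 | – | 84.4 | – | – | 80.6 | – | 83.7 | – | 64.2 |
| MiniMax M2.7 | 229B | – | 89.8 | – | – | 66.3 | – | – | 87.0 | – | 76.0 |
| DeepSeek R1 0528 | 671B | 87.5 | – | 79.4 | 92.5 | 60.8 | 68.7 | 33.6 | 81.0 | 79.1 | 39.6 |
| Qwen3.5-397B-A17B | 397B | – | 91.3 | 94.8 | – | 80.9 | 83.6 | – | 88.4 | 92.6 | 76.5 |
| DeepSeek V3.2 | 671B | 93.1 | 94.2 | 90.2 | 96.7 | 78.3 | 80.8 | 48.4 | 82.4 | 92.6 | 60.7 |
| Kimi K2.5 | 1T | 96.1 | 93.3 | 95.4 | 98.3 | 81.8 | 85.0 | 54.7 | 87.6 | 93.9 | 70.0 |
| GLM-5 | 744B | 96.7 | 95.8 | 97.9 | – | 82.5 | 85.5 | 55.0 | 86.0 | 92.6 | 76.5 |
| Proprietary Models | | | | | | | | | | | |
| Gemini 2.5 Flash | N/A | 72.0 | – | 64.2 | 83.3 | – | 61.2 | 23.5 | 82.8 | 89.8 | 36.1 |
| OpenAI o3 (high) | N/A | 88.9 | – | 77.5 | 95.8 | 61.1 | 75.8 | 25.4 | 83.3 | 92.1 | 69.3 |
| Gemini 2.5 Pro | N/A | 86.7 | – | 82.5 | 90.0 | 68.2 | 72.5 | 38.9 | 86.4 | 90.8 | 48.7 |
| Grok 4 | N/A | 91.7 | – | 90.0 | 95.0 | 73.1 | – | – | 87.5 | 88.0 | 53.7 |
| Claude Opus 4.5 | N/A | 92.8 | 95.1 | 92.9 | – | 78.5 | 84.8 | – | 87.0 | – | 58.0 |
| GPT-5 (high) | N/A | 94.6 | – | 88.3 | 91.7 | 76.0 | 84.5 | – | 85.7 | – | 73.1 |
| Qwen3.6 Plus | N/A | 93.3 | 95.3 | 96.7 | – | 83.8 | 87.1 | – | 90.4 | 94.3 | 74.2 |
| Gemini 3 Pro | N/A | 96.0 | 91.7 | 97.5 | 98.3 | 83.1 | 87.4 | 58.8 | 91.9 | – | 70.4 |
| VibeThinker-3B | 3B | 91.4 | 94.3 | 89.3 | 93.8 | 76.4 | 80.2 | 38.6 | 70.2 | 93.4 | 74.5 |
| + CLR | 3B | 96.7 | 97.1 | 95.4 | 99.2 | 80.6 | – | – | 72.9 | – | – |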

Comparison with top-tier reasoning models. Table 2 then raises the comparison bar from general baselines to top-tier reasoning models, including open-source models such as GPT-OSS-120B (high) [2], MiMo v2 Flash [40], MiniMax M2.7 [10], DeepSeek R1 0528 [12], Qwen3.5-397B-A17B [29], DeepSeek V3.2 [20], Kimi K2.5 [34], GLM-5 [45], and proprietary models such as Gemini 2.5 Flash [11], OpenAI o3 [26], Gemini 2.5 Pro [11], Grok 4 [39], Claude Opus 4.5 [3], GPT-5 (high) [33], Qwen3.6 Plus [30], and Gemini 3 Pro [15], making the comparison even more demanding. These systems represent the current flagship regime in reasoning capability and are backed by substantially larger model and training budgets.

VibeThinker-3B still performs competitively in this setting. Without CLR reasoning enhancement, its AIME26 score of 94.3 is comparable to DeepSeek V3.2 and Kimi K2.5, while its 93.8 on BruMO25 exceeds several much larger parameter models. We also report the effect of CLR on answer-verifiable mathematics benchmarks and GPQA-Diamond. With CLR, VibeThinker-3B improves to 96.7 on AIME25, 97.1 on AIME26, 95.4 on HMMT25, 99.2 on BruMO25, 80.6 on IMO-AnswerBench, and 72.9 on GPQA-Diamond. After using CLR, the model enters the top cluster of Table 2 on competition-style mathematics: it matches or exceeds many flagship open-source and proprietary systems on AIME25, AIME26, HMMT25, BruMO25, and IMO-AnswerBench.
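The size of the CLR improvement on each benchmark follows directly from the reported scores. The short sketch below only does that subtraction; the scores are copied from the VibeThinker-3B and “+ CLR” rows of Table 2, and the variable names are illustrative.

```python
# Scores copied from Table 2 (VibeThinker-3B row and the "+ CLR" row).
BASE = {"AIME25": 91.4, "AIME26": 94.3, "HMMT25": 89.3,
        "BruMO25": 93.8, "IMO-Ans": 76.4, "GPQA-D": 70.2}
WITH_CLR = {"AIME25": 96.7, "AIME26": 97.1, "HMMT25": 95.4,
            "BruMO25": 99.2, "IMO-Ans": 80.6, "GPQA-D": 72.9}

# Absolute gain in points, rounded to the table's one-decimal precision.
gains = {name: round(WITH_CLR[name] - BASE[name], 1) for name in BASE}
for name, delta in gains.items():
    print(f"{name}: +{delta:.1f}")
```

The gains range from 2.7 points on GPQA-Diamond to 6.1 points on HMMT25, so the smallest improvement falls on the knowledge-heavy benchmark.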

This result does not imply that a 3B model has matched leading general-purpose systems in comprehensive capabilities (such as broad encyclopedic knowledge or open-domain instruction following). Rather, it provides concrete evidence: on well-constrained, verifiable reasoning tasks, first-tier performance is no longer the exclusive domain of ultra-large models, and a compact model of merely 3B parameters can also earn its place. In this sense, VibeThinker-3B acts as a “parameter-scale probe” built upon the conclusion of VibeThinker-1.5B: if the 1.5B version showed that a small model could produce complete and logically coherent reasoning trajectories, then the 3B version addresses the critical question of “what parameter threshold is actually required to enter the top reasoning tier”. This is the core empirical signal that VibeThinker-3B aims to deliver: under appropriate post-training optimization, extreme reasoning capability is not strictly bounded by raw parameter scale.

The GPQA-Diamond result should be interpreted more conservatively. CLR raises VibeThinker-3B from 70.2 to 72.9, but the model still trails the strongest large-parameter systems by a visible margin on this knowledge-heavy benchmark. This is consistent with our claim rather than a contradiction to it: the main finding is not that a 3B model has fully replaced leading general-purpose models, but that a small model can reach first-tier performance on many verifiable reasoning tasks. These results suggest that, within such domains, the decisive bottleneck is not always raw parameter count; high-quality post-training, diverse solution exploration, reliable verification, and effective test-time reasoning can jointly push a compact model into a much higher capability regime.

Table 3: OOD Generalization Test: LeetCode Weekly & Biweekly Contests (Apr 25–May 31, 2026)

Generalization test on recent LeetCode contests. To further test coding generalization beyond curated benchmark suites, we evaluate VibeThinker-3B on recent LeetCode weekly and biweekly contests from Apr. 25 to May 31, 2026. As shown in Table 3, each model is evaluated using Python-only one-shot generation. Each contest column contains four problems, and each problem is sampled with four independent rollouts, yielding 16 first-attempt submissions per contest. A cell of $x/16$ therefore means that $x$ of the 16 independent Python submissions passed all hidden tests on their first submission. Weekly contests are denoted by “W” and biweekly contests by “BW”; for example, W504 refers to LeetCode Weekly Contest 504. We omit W501 because it does not have public LeetCode LLM ranking data, and include W504 as the latest available weekly contest in the public leaderboard used here.

Overall, VibeThinker-3B passes 123 out of 128 first-attempt Python submissions, corresponding to an overall acceptance rate of 96.1%. This result is higher than GPT-5.2 [27], Doubao Seed 2.0 Pro [9], Qwen3-Max [28], Kimi K2.5 [34], Qwen3.5-397B-A17B [29], and the Claude 4.6 [4] models under the same contest aggregation. It is also close to Gemini 3 Flash [14] and remains below only the strongest entries in the table. The contest results are useful because the tasks are recent, diverse, and execution-verified. They therefore provide a complementary view to LiveCodeBench and OJBench: VibeThinker-3B is not merely fitting a static coding benchmark distribution, but can solve fresh algorithmic problems under a realistic competitive-programming evaluation protocol. This confirms the model’s robust out-of-distribution (OOD) generalization capabilities on unseen algorithms.
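The aggregate figure can be checked with a few lines of arithmetic. The number of contest columns is not stated in the text; it is inferred here from the 128 total submissions and the 16 submissions per contest, and the variable names are illustrative.

```python
PROBLEMS_PER_CONTEST = 4
ROLLOUTS_PER_PROBLEM = 4
SUBMISSIONS_PER_CONTEST = PROBLEMS_PER_CONTEST * ROLLOUTS_PER_PROBLEM  # 16

TOTAL_SUBMISSIONS = 128  # reported total of first-attempt submissions
PASSED = 123             # reported number that passed all hidden tests

# Inferred, not stated in the text: 128 / 16 = 8 contest columns.
num_contests = TOTAL_SUBMISSIONS // SUBMISSIONS_PER_CONTEST
acceptance_rate = 100.0 * PASSED / TOTAL_SUBMISSIONS  # 96.09375
print(f"{num_contests} contests, acceptance rate {acceptance_rate:.1f}%")
```

Only 5 of the 128 submissions failed, so the 96.1% figure rests on a small number of misses spread over the whole contest window.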

4 Conclusion

In this report, we present VibeThinker-3B, a compact reasoning model comprising only 3 billion parameters. On challenging verifiable reasoning benchmarks, including AIME26, HMMT25, IMO-AnswerBench, and LiveCodeBench v6, it delivers strong results and further demonstrates robust generalization on out-of-distribution LeetCode evaluations. Taken together, these evaluations show that VibeThinker-3B reaches a performance band comparable to representative frontier LLMs, such as GLM-5, Kimi K2.5, Gemini 3 Pro, and Claude Opus 4.5, providing evidence that small language models can effectively approximate frontier reasoning capabilities on highly complex verifiable tasks despite much smaller parameter scales.

Based on these findings, we propose the Parametric Compression-Coverage Hypothesis, suggesting a structural divergence in how foundational capabilities are encoded within the parameter space. Specifically, verifiable reasoning aligns more closely with parameter-dense core compression, whereas open-domain knowledge and general-purpose capabilities rely more heavily on the broad coverage afforded by model scale. Consequently, the potential for small models to achieve top-tier performance within specific capability domains has long been underestimated. Therefore, the development of SLMs should no longer be viewed merely as a passive choice driven by deployment efficiency. Instead, it serves as an efficient and complementary evolutionary trajectory alongside traditional scaling laws, offering novel insights for the design of future reasoning systems.

References

[1] M. Abdin, S. Agarwal, A. Awadallah, V. Balachandran, H. Behl, L. Chen, G. de Rosa, S. Gunasekar, M. Javaheripi, N. Joshi, et al. (2025). Phi-4-reasoning technical report. arXiv preprint arXiv:2504.21318.
[2] S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y. Bai, B. Baker, H. Bao, et al. (2025). Gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925.
[3] Anthropic (2025-11). Claude Opus 4.5 System Card. Anthropic. System card.
[4] Anthropic (2026-02). Claude Opus 4.6 System Card. Anthropic. System card.
[5] Art of Problem Solving (2025). AIME Problems and Solutions. Accessed: 2025.
[6] E. Bakouch, L. Ben Allal, A. Lozhkov, N. Tazi, L. Tunstall, C. M. Patiño, E. Beeching, A. Roucher, A. J. Reedi, Q. Gallouédec, K. Rasul, N. Habib, C. Fourrier, H. Kydlicek, G. Penedo, H. Larcher, M. Morlon, V. Srivastav, J. Lochner, X. Nguyen, C. Raffel, L. von Werra, and T. Wolf (2025). SmolLM3: smol, multilingual, long-context reasoner. https://huggingface.co/blog/smollm3
[7] A. Blakeman, A. Grattafiori, A. Basant, A. Gupta, A. Khattar, A. Renduchintala, A. Vavre, A. Shukla, A. Bercovich, A. Ficek, et al. (2025). Nemotron 3 nano: open, efficient mixture-of-experts hybrid mamba-transformer model for agentic reasoning. arXiv preprint arXiv:2512.20848.
[8] BRUMO (2025). Brown University Math Olympiad 2025. Accessed: 2025.
[9] ByteDance Seed Team (2026-02). Seed 2.0 Official Launch. Official release page.
[10] A. Chen, A. Li, B. Zhou, B. Gong, B. Jiang, B. Dan, C. Yu, C. Wang, C. Ma, C. Zhong, et al. (2026). The minimax-m2 series: mini activations unleashing max real-world intelligence. arXiv preprint arXiv:2605.26494.
[11] G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. (2025). Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
[12] DeepSeek-AI (2025). DeepSeek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
[13] Q. Fang and D. Khazi (2025). Mismatch praxis: rollout settings and is corrections. The LLM Data Company Blog.
[14] Google DeepMind (2025-12). Gemini 3 Flash Model Card. Google DeepMind. Model card.
[15] Google DeepMind (2026-05). Gemini 3 Pro Model Card. Google DeepMind.
[16] Google (2026-06). Bringing Gemma 4 12B to your Laptop: Unlocking Local Agentic Workflows with Google AI Edge. Google Developers Blog.
[17] HMMT (2025). HMMT 2025. Accessed: 2025.
[18] J. Hu (2025). Reinforce++: a simple and efficient approach for aligning large language models. arXiv e-prints, pp. arXiv–2501.
[19] N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica (2024). LiveCodeBench: holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.
[20] A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong, et al. (2025). Deepseek-v3.2: pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556.
[21] M. Luo, S. Tan, J. Wong, X. Shi, W. Y. Tang, M. Roongta, C. Cai, J. Luo, T. Zhang, L. E. Li, et al. (2025). Deepscaler: surpassing o1-preview with a 1.5b model by scaling rl. Notion Blog 3(5).
[22] T. Luong, D. Hwang, H. H. Nguyen, G. Ghiasi, Y. Chervonyi, I. Seo, J. Kim, G. Bingham, J. Lee, S. Mishra, A. Zhai, C. H. Hu, H. Michalewski, J. Kim, J. Ahn, J. Bae, X. Song, T. H. Trinh, Q. V. Le, and J. Jung (2025). Towards robust mathematical reasoning. arXiv preprint arXiv:2511.01846.
[23] S. Majumdar, I. Gitman, S. Toshniwal, A. Ficek, and NVIDIA (2025-07). OpenReasoning-Nemotron: A Family of State-of-the-Art Distilled Reasoning Models. Hugging Face community article.
[24] Mathematical Association of America (2026). American Invitational Mathematics Examination. Accessed: 2026-04-09.
[25] T. Olmo, A. Ettinger, A. Bertsch, B. Kuehl, D. Graham, D. Heineman, D. Groeneveld, F. Brahman, F. Timbers, H. Ivison, J. Morrison, J. Poznanski, K. Lo, L. Soldaini, M. Jordan, M. Chen, M. Noukhovitch, N. Lambert, P. Walsh, P. Dasigi, R. Berry, S. Malik, S. Shah, S. Geng, S. Arora, S. Gupta, T. Anderson, T. Xiao, T. Murray, T. Romero, V. Graf, A. Asai, A. Bhagia, A. Wettig, A. Liu, A. Rangapur, C. Anastasiades, C. Huang, D. Schwenk, H. Trivedi, I. Magnusson, J. Lochner, J. Liu, L. J. V. Miranda, M. Sap, M. Morgan, M. Schmitz, M. Guerquin, M. Wilson, R. Huff, R. L. Bras, R. Xin, R. Shao, S. Skjonsberg, S. Z. Shen, S. S. Li, T. Wilde, V. Pyatkin, W. Merrill, Y. Chang, Y. Gu, Z. Zeng, A. Sabharwal, L. Zettlemoyer, P. W. Koh, A. Farhadi, N. A. Smith, and H. Hajishirzi (2025). Olmo 3. arXiv preprint arXiv:2512.13961.
[26] OpenAI (2025-04). OpenAI o3 and o4-mini System Card. OpenAI. April 16, 2025.
[27] OpenAI (2025-12). Update to GPT-5 System Card: GPT-5.2. OpenAI. System card update.
[28] Qwen Team (2025-09). Qwen3-Max: just scale it.
[29] Qwen Team (2026-02). Qwen3.5: towards native multimodal agents.
[30] Qwen Team (2026-04). Qwen3.6-Plus: towards real world agents.
[31] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2023). GPQA: a graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022.
[32] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024). Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
[33] A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. (2025). Openai gpt-5 system card. arXiv preprint arXiv:2601.03267.
[34] K. Team, T. Bai, Y. Bai, Y. Bao, S. Cai, Y. Cao, Y. Charles, H. Che, C. Chen, G. Chen, et al. (2026). Kimi k2.5: visual agentic intelligence. arXiv preprint arXiv:2602.02276.
[35] M. L. Team, A. Gui, B. Li, B. Tao, B. Zhou, B. Chen, C. Zhang, C. Han, C. Yang, C. Zhang, et al. (2025). Introducing longcat-flash-thinking: a technical report. arXiv preprint arXiv:2509.18883.
[36] Q. Team (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.
[37] Tencent Hunyuan Team (2025-07). Hunyuan-4B-Instruct. Hugging Face model card.
[38] Z. Wang, Y. Liu, Y. Wang, W. He, B. Gao, M. Diao, Y. Chen, K. Fu, F. Sung, Z. Yang, T. Liu, and W. Xu (2025). OJBench: a competition level code benchmark for large language models. arXiv preprint arXiv:2506.16395.
[39] xAI (2025-08). Grok 4 Model Card. xAI. Last updated: August 20, 2025.
[40] B. Xiao, B. Xia, B. Yang, B. Gao, B. Shen, C. Zhang, C. He, C. Lou, F. Luo, G. Wang, et al. (2026). Mimo-v2-flash technical report. arXiv preprint arXiv:2601.02780.
[41] L. Xiaomi (2025). MiMo: unlocking the reasoning potential of language model – from pretraining to posttraining. arXiv preprint arXiv:2505.07608.
[42] S. Xu, Y. Zhou, W. Wang, J. Min, Z. Yin, Y. Dai, S. Liu, L. Pang, Y. Chen, and J. Zhang (2025). Tiny model, big logic: diversity-driven optimization elicits large-model reasoning ability in vibethinker-1.5b. arXiv preprint arXiv:2511.06221.
[43] F. Yao, L. Liu, D. Zhang, C. Dong, J. Shang, and J. Gao (2025-08). Your efficient rl framework secretly brings you off-policy rl training.
[44] Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. (2025). Dapo: an open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.
[45] A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, C. Huang, C. Xie, et al. (2026). Glm-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763.
[46] A. Zeng, X. Lv, Q. Zheng, Z. Hou, B. Chen, C. Xie, C. Wang, D. Yin, H. Zeng, J. Zhang, et al. (2025). Glm-4.5: agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471.
[47] C. Zheng, S. Liu, M. Li, X. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, et al. (2025). Group sequence policy optimization. arXiv preprint arXiv:2507.18071.

