Scaling the earlier VibeThinker-1.5B to 3B: now it reaches frontier math & coding performance

Reddit r/LocalLLaMA 06/16/26, 01:44 PM Models

small-language-model reasoning math coding performance scaling

Summary

The VibeThinker-3B model reports state-of-the-art math and coding reasoning performance for its size, scoring 94.3 on AIME'26 and passing 96.1% of first-attempt submissions on unseen LeetCode contest problems. The authors present this as evidence that small models can reach frontier-level reasoning in verifiable domains.

https://preview.redd.it/obgodr9dfn7h1.png?width=1796&format=png&auto=webp&s=b5fd95e2b7e6f8ed7704e3de66778e970d34a1dd

1. We trained VibeThinker-3B to test how far verifiable reasoning can be pushed in a strict small-model regime.
2. It gets 94.3 on AIME'26, 80.2 on LiveCodeBench v6, 76.4 on IMO-AnswerBench, and 93.4 on IFEval.
3. On recent unseen LeetCode weekly/biweekly contests, it passes 123/128 first-attempt Python submissions, or 96.1% overall.
4. Small models are not just cheaper substitutes. In parameter-dense domains with clear verification signals, SLMs offer a path to frontier-level reasoning that complements traditional scaling laws.

It still has limitations in broader practical and general-purpose use cases, and we will keep improving these areas in future versions. We'd love for the community to test it on your own math/coding/OOD tasks and share failures or feedback.

Paper: [paper link](https://huggingface.co/papers/2606.16140)

Eval setting in the report: vLLM/SGLang, temp=1.0, top_p=0.95, top_k=-1.
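For anyone who wants to reproduce the numbers, here is a minimal sketch that writes down the sampling settings quoted in the post and rechecks the LeetCode pass-rate arithmetic. The names `EVAL_SAMPLING` and `pass_rate` are made up for illustration and do not come from the report. The dict keys are chosen to mirror the field names of vLLM's `SamplingParams`, which is an assumption about how the settings would be wired up, not the authors' actual eval harness.

```python
# Sampling settings as quoted in the post ("Eval setting in the report").
# Assumption: key names mirror vLLM's SamplingParams fields; there,
# top_k=-1 means top-k filtering is disabled.
EVAL_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": -1}


def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


# LeetCode contest figure from the post: 123 of 128 first-attempt submissions.
leetcode_rate = pass_rate(123, 128)
print(f"{leetcode_rate:.1f}%")  # prints 96.1%
```

With vLLM installed, the same dict could likely be passed as `SamplingParams(**EVAL_SAMPLING)`. Note that at temp=1.0 individual runs will vary, so a single local run may not match the reported scores exactly.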

Similar Articles

WeiboAI/VibeThinker-3B

Hugging Face Models Trending

VibeThinker-3B is a 3B-parameter model that achieves frontier-level reasoning performance on math, coding, and STEM benchmarks by optimizing the Spectrum-to-Signal Principle (SSP) post-training pipeline, reaching performance comparable to much larger models.

VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Hugging Face Daily Papers

VibeThinker-3B is a compact 3B parameter model that achieves frontier-level performance on verifiable reasoning tasks through a specialized training pipeline, matching larger models like DeepSeek V3.2 and Gemini 3 Pro.

Why Weibo's tiny VibeThinker-3B has the AI world arguing over benchmarks again (15 minute read)

TLDR AI

Weibo's VibeThinker-3B, a 3B parameter model, claims to match or exceed the reasoning performance of much larger models like DeepSeek V3.2 and Gemini 3 Pro on math and coding benchmarks, sparking debate over benchmark reliability and the necessity of scaling.

VibeThinker-3B: what is this witchcraft? Killing it at MathQA like it has ~30B parameters

Reddit r/LocalLLaMA

VibeThinker-3B is a small 3B parameter model that achieves performance comparable to ~30B parameter models on the MathQA benchmark, demonstrating significant efficiency.

@f14bertolotti: Stellar performance from a 3B model. These results were achieved primarily through post-training refinements on Qwen2.5…

X AI KOLs Timeline

This technical report introduces VibeThinker-3B, a 3B parameter model that achieves frontier-level verifiable reasoning performance through post-training refinements on Qwen2.5-Coder, including curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation, matching or exceeding much larger models like DeepSeek V3.2.