Scaling former VibeThinker-1.5B to 3B — now it reaches frontier math & coding performance

Reddit r/LocalLLaMA Models

Summary

The VibeThinker-3B model achieves state-of-the-art math and coding reasoning performance, scoring 94.3 on AIME'26 and 96.1% on unseen LeetCode problems, demonstrating that small models can reach frontier-level reasoning in verifiable domains.

https://preview.redd.it/obgodr9dfn7h1.png?width=1796&format=png&auto=webp&s=b5fd95e2b7e6f8ed7704e3de66778e970d34a1dd 1. We trained VibeThinker-3B to test how far verifiable reasoning can be pushed in a strict small-model regime. 2. It gets 94.3 on AIME'26, 80.2 on LiveCodeBench v6, 76.4 on IMO-AnswerBench, and 93.4 on IFEval. 3. On recent unseen LeetCode weekly/biweekly contests, it passes 123/128 first-attempt Python submissions, or 96.1% overall. 4. Small models are not just cheaper substitutes. In parameter-dense domains with clear verification signals, SLMs offer a path to frontier-level reasoning that complements traditional Scaling Law. Though it still has limitations in broader practical and general-purpose use cases, we will keep improving these areas in future versions. We’d love for the community to test it on your own math/coding/OOD tasks and share failures or feedback. Paper: [paper link](https://huggingface.co/papers/2606.16140) Eval setting in the report: vLLM/Sglang, temp=1.0, top\_p=0.95, top\_k=-1.
Original Article

Similar Articles

WeiboAI/VibeThinker-3B

Hugging Face Models Trending

VibeThinker-3B is a 3B-parameter model that achieves frontier-level reasoning performance on math, coding, and STEM benchmarks by optimizing the Spectrum-to-Signal Principle (SSP) post-training pipeline, reaching performance comparable to much larger models.

@f14bertolotti: Stellar performance from a 3B model. These results were achieved primarily through post-training refinements on Qwen2.5…

X AI KOLs Timeline

This technical report introduces VibeThinker-3B, a 3B parameter model that achieves frontier-level verifiable reasoning performance through post-training refinements on Qwen2.5-Coder, including curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation, matching or exceeding much larger models like DeepSeek V3.2.