@cjzafir: A 3B parameter SLM: VibeThinker (fine-tuned on Qwen 2.5) matches Claude Opus 4.5 performance. Same performance as: > De…

Summary

VibeThinker, a 3B-parameter model fine-tuned from Qwen 2.5, is reported to match Claude Opus 4.5 and much larger models such as DeepSeek V3. The post credits its post-training recipe: sampling multiple reasoning paths per problem and staged training on math, then coding, then science.


Cached at: 06/18/26, 12:05 AM

A 3B parameter SLM: VibeThinker (fine-tuned on Qwen 2.5) matches Claude Opus 4.5 performance.

Same performance as:

  • DeepSeek V3 (671B parameters), but about 220x smaller.

  • Kimi K2.5 (1T parameters), but about 330x smaller.

  • GLM-5 (744B parameters), but 248x smaller.
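The size ratios above are simple division, and a quick check shows how the quoted figures were rounded. The sketch below uses the parameter counts exactly as stated in the post (they are not independently verified here). The names `SLM_PARAMS_B`, `LARGE_MODELS_B`, and `size_ratio` are made up for illustration.

```python
# Parameter counts in billions, as quoted in the post (not independently verified).
SLM_PARAMS_B = 3
LARGE_MODELS_B = {"DeepSeek V3": 671, "Kimi K2.5": 1000, "GLM-5": 744}


def size_ratio(large_b: float, small_b: float = SLM_PARAMS_B) -> float:
    """How many times more parameters the large model has than the small one."""
    return large_b / small_b


for name, params_b in LARGE_MODELS_B.items():
    print(f"{name}: {size_ratio(params_b):.0f}x the parameters of a 3B model")
```

By this arithmetic the first two ratios come out nearer 224x and 333x, so the post's 220x and 330x are rounded down slightly. The 248x figure is exact.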

You’ll be able to run this model on your Macs.
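To put the "run it on your Mac" claim in rough numbers, here is a back-of-the-envelope estimate of the memory needed just to hold 3B weights at a few common precisions. This is a sketch under stated assumptions: it counts weights only (no KV cache, activations, or runtime overhead), and the post does not say which precision or quantization the model ships with. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


# Assumed precisions for illustration; the actual release format is not stated in the post.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(3, bits):.1f} GB")
```

At 16 bits per parameter this comes to roughly 6 GB of weights, which is why a model of this size is plausible on consumer laptops with unified memory.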

It's not a fluke. Their post-training work is very interesting.

  • They trained in a smart order: math first → coding second → science third.

  • For each problem, they made the model think in multiple different ways before picking the best answer.

  • They trained in two stages: first on many normal problems, then on hard, long reasoning problems.

  • They focused on a verifiable, high-quality synthetic dataset and heavily filtered out bad examples.

  • They focused on long-horizon tasks, training on long text all at once (instead of gradually increasing the length) so the model can think for a long time without getting confused.

  • At the end, they added training to make the model give shorter but still correct answers (more efficient).
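Several of the points above (multiple attempts per problem, verifiable answers, a preference for shorter correct outputs) can be illustrated with a generic sample-and-verify selection step. This is a toy sketch, not the authors' pipeline, since the post does not describe how the best answer is actually chosen. The helpers `final_answer` and `pick_best`, the example candidates, and the "shortest verified answer wins" rule are all assumptions made for illustration.

```python
from typing import Callable, Optional


def final_answer(text: str) -> str:
    """Toy answer extractor: treat the last whitespace-separated token as the answer."""
    tokens = text.split()
    return tokens[-1] if tokens else ""


def pick_best(candidates: list[str], is_correct: Callable[[str], bool]) -> Optional[str]:
    """Keep only candidates the verifier accepts, then prefer the shortest one.

    Returns None when no candidate passes, which is also a natural point
    to filter a problem out of a training set.
    """
    verified = [c for c in candidates if is_correct(c)]
    return min(verified, key=len) if verified else None


# Hypothetical reasoning attempts for the problem "2 + 3".
candidates = [
    "2 plus 3 is 6",
    "Start at 2 and count up three: 3, 4, 5. So the answer is 5",
    "2 + 3 = 5",
]


def verifier(candidate: str) -> bool:
    """Accept a candidate only if its extracted answer matches the known result."""
    return final_answer(candidate) == "5"


print(pick_best(candidates, verifier))
```

Here the wrong attempt is rejected by the verifier, and of the two correct attempts the shorter one is selected. In a real system the verifier would be something like a unit-test runner or a math answer checker.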

Post-training (fine-tuning) innovation is very important, and what happened to Fable 5 should make you realize how important it is to own your intelligence.

I’ll be testing this model and will share my findings.

Francesco Bertolotti (@f14bertolotti): Stellar performance from a 3B model. These results were achieved primarily through post-training refinements on Qwen2.5-Coder. The paper doesn’t provide many details, but it appears they distill from RL checkpoints and then do a final RL-based instruct stage.
