Mellum2 Technical Report
Summary
Mellum 2 is a 12B-parameter open-weight MoE language model by JetBrains with 2.5B active parameters, specialized in software engineering tasks and optimized for efficient inference on commodity GPUs.
Cached at: 06/01/26, 03:20 PM
Source: https://huggingface.co/papers/2605.31268
Abstract
Mellum 2 is an open-weight 12B-parameter Mixture-of-Experts language model with 2.5B active parameters per token, specialized in software engineering tasks and optimized for inference efficiency on commodity GPUs.
We present Mellum 2, an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token. Mellum 2 is a general-purpose language model specialized in software engineering, spanning code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance, and it is the successor to the completion-focused 4B dense Mellum model. The architecture builds on a Mixture-of-Experts design (64 experts, 8 active) and combines Grouped-Query Attention with 4 KV heads, Sliding Window Attention on three of every four layers, and a single Multi-Token Prediction head that serves as both an auxiliary pre-training objective and a built-in draft model for speculative decoding; each choice was validated by ablation with inference efficiency on commodity GPUs as a design constraint. Pre-training spans approximately 10.6 trillion tokens through a three-phase curriculum that progressively shifts the mixture from diverse web data toward curated code and mathematical content, optimized with Muon under FP8 hybrid precision and a Warmup-Hold-Decay schedule with linear decay to zero. The pre-trained base is extended to a 128K context window via a layer-selective YaRN and then post-trained in two stages (supervised fine-tuning followed by RLVR), yielding two released variants: an Instruct model that answers directly and a Thinking model that emits an explicit reasoning trace before its final answer. Across code generation, math and reasoning, tool use, knowledge, and safety benchmarks, Mellum 2 is competitive with open-weight baselines in the 4B-14B range while running at the per-token compute of a 2.5B dense model. We release the base, instruct, and thinking checkpoints, together with this report on the architecture decisions, data pipeline, and training recipe behind them, under the Apache 2.0 license.
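The Warmup-Hold-Decay schedule with linear decay to zero mentioned in the abstract can be illustrated with a minimal sketch. The function name `whd_lr`, its parameters, and the step counts used below are hypothetical and chosen only for illustration; the abstract does not state the actual warmup length, decay length, or peak learning rate, and a linear warmup is assumed here.

```python
def whd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps):
    """Learning rate at `step` for a Warmup-Hold-Decay schedule (illustrative).

    Assumed shape: linear warmup to `peak_lr`, a constant hold phase, then
    a linear decay that reaches exactly zero at `total_steps`.
    """
    if decay_steps < 1 or warmup_steps < 0:
        raise ValueError("decay_steps must be >= 1 and warmup_steps >= 0")
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup and decay do not fit in total_steps")
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")

    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup: ramp linearly, reaching the peak on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Hold: stay at the peak learning rate.
        return peak_lr
    # Decay: fall linearly from the peak to zero at total_steps.
    return peak_lr * (total_steps - step) / decay_steps


if __name__ == "__main__":
    # Hypothetical values, not taken from the report.
    for s in (0, 9, 50, 80, 90, 100):
        print(s, whd_lr(s, total_steps=100, peak_lr=1.0,
                        warmup_steps=10, decay_steps=20))
```

Compared with a cosine schedule, this shape keeps the learning rate at its peak for most of training and concentrates the annealing in a final phase whose length can be chosen independently.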
Models citing this paper (6)
#### JetBrains/Mellum2-12B-A2.5B-Thinking
Text Generation • 12B • Updated about 4 hours ago • 80 • 13
#### JetBrains/Mellum2-12B-A2.5B-Instruct
Text Generation • 12B • Updated about 4 hours ago • 28 • 6
#### JetBrains/Mellum2-12B-A2.5B-Thinking-SFT
Text Generation • 12B • Updated about 4 hours ago • 5
#### JetBrains/Mellum2-12B-A2.5B-Base
Text Generation • 12B • Updated about 4 hours ago • 28 • 4
Similar Articles
JetBrains's Mellum 2 (49 minute read)
JetBrains releases Mellum 2, a 12B-parameter open-weight Mixture-of-Experts language model specialized in software engineering, with competitive performance in code generation, reasoning, and tool use, available under Apache 2.0.
Mellum 2 12B A2.5B
JetBrains released Mellum 2 12B A2.5B, a coding-focused small MoE model with reasoning performance comparable to Qwen 3.5 9B but weaker in other tasks.
Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains
JetBrains introduces Mellum2, a 12B parameter Mixture-of-Experts model optimized for code generation and reasoning tasks, with a focus on private deployment and integration into development workflows.
Mellum2 Goes Open Source: A Fast Model for AI Workflows | The JetBrains AI Blog
JetBrains open-sources Mellum2, a fast 12B Mixture-of-Experts model designed for low-latency AI workflows in software engineering, available under Apache 2.0 license.
JetBrains/Mellum2-12B-A2.5B-Thinking
JetBrains releases Mellum2-12B-A2.5B-Thinking, an open-source Mixture-of-Experts reasoning model with 131k context length, trained with RLVR for explicit chain-of-thought reasoning.