Mellum2 Technical Report

Hugging Face Daily Papers

Summary

Mellum 2 is a 12B-parameter open-weight MoE language model by JetBrains with 2.5B active parameters, specialized in software engineering tasks and optimized for efficient inference on commodity GPUs.

We present Mellum 2, an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token. Mellum 2 is a general-purpose language model specialized in software engineering, spanning code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance. It is the successor to the completion-focused 4B dense Mellum model. The architecture builds on a Mixture-of-Experts design (64 experts, 8 active) and combines Grouped-Query Attention with 4 KV heads, Sliding Window Attention on three of every four layers, and a single Multi-Token Prediction head that serves as both an auxiliary pre-training objective and a built-in draft model for speculative decoding; each choice was validated by ablation with inference efficiency on commodity GPUs as a design constraint. Pre-training spans approximately 10.6 trillion tokens through a three-phase curriculum that progressively shifts the mixture from diverse web data toward curated code and mathematical content, optimized with Muon under FP8 hybrid precision and a Warmup-Hold-Decay schedule with linear decay to zero. The pre-trained base is extended to a 128K context window via layer-selective YaRN and then post-trained in two stages (supervised fine-tuning followed by RLVR), yielding two released variants: an Instruct model that answers directly and a Thinking model that emits an explicit reasoning trace before its final answer. Across code generation, math and reasoning, tool use, knowledge, and safety benchmarks, Mellum 2 is competitive with open-weight baselines in the 4B-14B range while running at the per-token compute of a 2.5B dense model. We release the base, instruct, and thinking checkpoints, together with this report on the architecture decisions, data pipeline, and training recipe behind them, under the Apache 2.0 license.
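
To make the "64 experts, 8 active" figure concrete, here is a minimal sketch of top-k expert routing for a single token. Only the expert count and the number of active experts come from the abstract. The gating function (softmax over all experts, then renormalization over the selected ones), the scalar toy experts, and the names `route` and `moe_forward` are assumptions for illustration and may not match what the report actually does.

```python
import math
import random

NUM_EXPERTS = 64  # total experts, from the abstract
TOP_K = 8         # experts active per token, from the abstract


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k=TOP_K):
    """Select the top_k experts for one token and renormalize their gates.

    Assumption: softmax over all experts followed by renormalization over
    the chosen ones. The report may use a different gating scheme.
    """
    probs = softmax(router_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = order[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(x, experts, router_logits, top_k=TOP_K):
    """Gate-weighted sum of the selected experts' outputs.

    Only the selected experts are evaluated; the rest are skipped.
    """
    gates = route(router_logits, top_k)
    return sum(w * experts[i](x) for i, w in gates.items())


# Toy experts (hypothetical): expert i multiplies its input by (i + 1).
# Real experts are feed-forward networks.
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
gates = route(logits)
output = moe_forward(1.0, experts, logits)
```

Under this sketch each token evaluates 8 of 64 experts, or 12.5% of the expert weights. The abstract's ratio of active to total parameters (2.5B of 12B, about 21%) is higher, presumably because non-expert parameters such as attention and embeddings are used for every token.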

Cached at: 06/01/26, 03:20 PM

Source: https://huggingface.co/papers/2605.31268

Get this paper in your agent:

hf papers read 2605.31268

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 6 (four listed on this page)

- JetBrains/Mellum2-12B-A2.5B-Thinking (Text Generation, 12B)
- JetBrains/Mellum2-12B-A2.5B-Instruct (Text Generation, 12B)
- JetBrains/Mellum2-12B-A2.5B-Thinking-SFT (Text Generation, 12B)
- JetBrains/Mellum2-12B-A2.5B-Base (Text Generation, 12B)

No datasets, Spaces, or collections currently link to this paper.

Similar Articles

JetBrains's Mellum 2 (49 minute read)

TLDR AI

JetBrains releases Mellum 2, a 12B-parameter open-weight Mixture-of-Experts language model specialized in software engineering, with competitive performance in code generation, reasoning, and tool use, available under Apache 2.0.

Mellum 2 12B A2.5B

Reddit r/LocalLLaMA

JetBrains released Mellum 2 12B A2.5B, a coding-focused small MoE model with reasoning performance comparable to Qwen 3.5 9B but weaker in other tasks.

JetBrains/Mellum2-12B-A2.5B-Thinking

Hugging Face Models Trending

JetBrains releases Mellum2-12B-A2.5B-Thinking, an open-source Mixture-of-Experts reasoning model with 131k context length, trained with RLVR for explicit chain-of-thought reasoning.