JetBrains's Mellum 2 (49 minute read)
Summary
JetBrains releases Mellum 2, a 12B-parameter open-weight Mixture-of-Experts language model specialized in software engineering, with competitive performance in code generation, reasoning, and tool use, available under Apache 2.0.
View Cached Full Text
Cached at: 06/02/26, 03:42 PM
# 1 Introduction
Source: [https://arxiv.org/html/2605.31268](https://arxiv.org/html/2605.31268)
![[Uncaptioned image]](https://arxiv.org/html/2605.31268v1/x1.png)
\\jbsans
Mellum 2
\\jbsans
Technical Report
v1\.0 · May 2026
Marko Kojic1Ivan Bondyrev1Aral de Moor1Joseph Shtok1 Petr Borovlev1,2Kseniia Lysaniuk1,2Madeeswaran Kannan1Ivan Dolgov1 Nikita Pavlichenko1
1JetBrains2Constructor University, Bremen, Germany
\{keybox\}
\[frametitle=Abstract\] We presentMellum 2, an open\-weight 12B\-parameter Mixture\-of\-Experts \(MoE\) language model with 2\.5B active parameters per token\.Mellum 2is a general\-purpose language model specialized in software engineering, spanning code generation and editing, debugging, multi\-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance, and it is the successor to the completion\-focused 4B denseMellummodel\. The architecture builds on the Mixture\-of\-Experts \(64 experts, 8 active\) and combines Grouped\-Query Attention with 4 KV heads, Sliding Window Attention on three of every four layers, and a single Multi\-Token Prediction head that doubles as both an auxiliary pre\-training objective and a built\-in draft model for speculative decoding; each choice was validated by ablation with inference efficiency on commodity GPUs as a design constraint\. Pre\-training spans approximately 10\.6 trillion tokens through a three\-phase curriculum that progressively shifts the mixture from diverse web data toward curated code and mathematical content \(code ratio 23 %→\\to42 %→\\to59 %\), optimized with Muon under FP8 hybrid precision and a Warmup\-Hold\-Decay schedule with linear decay to zero\. The pre\-trained base is extended to a 128K context window via a layer\-selective YaRN and then post\-trained in two stages \(supervised fine\-tuning followed by reinforcement learning with verifiable rewards\), yielding two released variants: an*Instruct*model that answers directly and a*Thinking*model that emits an explicit reasoning trace before its final answer\. Across code generation, math and reasoning, tool use, knowledge, and safety benchmarks,Mellum 2is competitive with open\-weight baselines in the 4B–14B range while running at the per\-token compute of a 2\.5B dense model\. We release the base, instruct, and thinking checkpoints, together with this report on the architecture decisions, data pipeline, and training recipe behind them, under the Apache 2\.0 license\.
[![[Uncaptioned image]](https://arxiv.org/html/2605.31268v1/x2.png)Hugging Face](https://huggingface.co/collections/JetBrains/mellum-2)[Blog post](https://blog.jetbrains.com/ai/2026/05/mellum2-goes-open-source-a-fast-model-for-ai-workflows)Apache 2\.0
Correspondence:mellum@jetbrains\.comReleased under the Apache 2\.0 license\.
Large language models \(LLMs\) have reshaped how developers work with code\. What began as inline autocomplete\[[undefam](https://arxiv.org/html/2605.31268#bib.bibx40)\]has broadened into a much wider task surface: writing whole functions from natural\-language specifications, editing and debugging existing code, reasoning through multi\-step engineering tasks, calling external tools, navigating repositories as an agent, and serving as a conversational collaborator throughout the development loop\. The competitive coding models today must do all of this at once, and at a serving cost that makes them practical to deploy at scale\.
Two regimes dominate the open\-weights landscape on the quality\-versus\-cost trade\-off\. Dense models in the 4–14B range are cheap to serve but plateau on harder coding and reasoning workloads; very large Mixture\-of\-Experts \(MoE\) models\[[undefq](https://arxiv.org/html/2605.31268#bib.bibx18),[undefm](https://arxiv.org/html/2605.31268#bib.bibx14)\]reach frontier quality but at deployment costs that strain everyday use\. To strike a balance between knowledge scope and serving cost, we aim to extend the recent line of small MoE coding models—among them Qwen3\-Coder\-30B\-A3B\[[undefaau](https://arxiv.org/html/2605.31268#bib.bibx74)\]and Ling\-Coder\-Lite\[[undefl](https://arxiv.org/html/2605.31268#bib.bibx13)\]: sufficient parameters to absorb the long tail of programming language and reasoning knowledge but with enough sparsity to allow for deployment on commodity hardware \(per\-token compute in the 2–3B\-dense range\)\.
We introduceMellum 2, an open\-weight 12B\-parameter MoE language model with 2\.5B active parameters per token, a general\-purpose successor toMellum\[[undefaab](https://arxiv.org/html/2605.31268#bib.bibx55)\]— the 4B dense code\-completion model previously deployed in JetBrains IDEs\. While the originalMellumwas trained to fill single completions inside an editor,Mellum 2is a full\-fledged coding assistant: it generates and edits code, calls tools, plans and executes multi\-step agentic workflows, holds long conversations about code, and, in its thinking variant, produces explicit reasoning traces before answering\. The model is built on the Qwen3\-MoE recipe\[[undefaau](https://arxiv.org/html/2605.31268#bib.bibx74)\]\(64 experts, 8 active\) with three deployment\-oriented modifications: Grouped\-Query Attention\[[undef](https://arxiv.org/html/2605.31268#bib.bibx1)\]with only 4 KV heads, Sliding Window Attention\[[undefe](https://arxiv.org/html/2605.31268#bib.bibx6)\]on three of every four layers, and a single Multi\-Token Prediction \(MTP\)\[[undefu](https://arxiv.org/html/2605.31268#bib.bibx22)\]head that is used both as an auxiliary pre\-training objective and as a built\-in draft for speculative decoding\.
Our key contributions are:
- •An efficiency\-aware architecture\.We systematically ablate dense versus MoE backbones, Grouped\-Query Attention configurations, Multi\-head Latent Attention\[[undefn](https://arxiv.org/html/2605.31268#bib.bibx15)\], Sliding Window Attention patterns, and expert sparsity ratios\. The resulting 12B/2\.5B\-active configuration matches or exceeds the throughput of a 7B dense baseline while occupying a substantially larger total\-parameter envelope\.
- •A three\-phase pre\-training curriculum on∼\{\\sim\}10\.6T tokens\.Following the “web early, curated late” paradigm\[[undefv](https://arxiv.org/html/2605.31268#bib.bibx23)\], the data mixture progressively shifts from diverse web content toward curated code and mathematical content \(code ratio 23 %→\\to42 %→\\to59 %\), with batch\-size doubling and an extended capability\-sharpening phase that decays the learning rate linearly to zero\.
- •A Muon \+ FP8 training recipe at production scale\.We adopt the Muon optimizer\[[undefag](https://arxiv.org/html/2605.31268#bib.bibx34),[undefar](https://arxiv.org/html/2605.31268#bib.bibx45)\]for large\-scale MoE pre\-training, combine it with FP8 hybrid mixed precision\[[undefav](https://arxiv.org/html/2605.31268#bib.bibx49)\]and a Warmup\-Hold\-Decay schedule\[[undefac](https://arxiv.org/html/2605.31268#bib.bibx30),[undefx](https://arxiv.org/html/2605.31268#bib.bibx25)\]with linear decay to zero, and report training\-stability observations across the full ten\-trillion\-token run\.
- •Long\-context extension to 128K\.We extend the pre\-trained base to 131,072 tokens following the layer\-selective scaling recipe of Gemma 3\[[undeft](https://arxiv.org/html/2605.31268#bib.bibx21)\]and OLMo 3\[[undefaar](https://arxiv.org/html/2605.31268#bib.bibx71)\]with YaRN\[[undefaad](https://arxiv.org/html/2605.31268#bib.bibx57)\]as the scaling method, and report empirical findings on data\-mix transfer and MoE router dynamics during this stage\.
- •Two post\-trained variants from a shared base\.From the same long\-context checkpoint we produce an*Instruct*model that answers directly and a*Thinking*model that emits an explicit reasoning trace, each refined further by reinforcement learning with verifiable rewards \(RLVR\) on math and executable coding tasks\.
- •Open release\.We release base, instruct, and thinking checkpoints under the Apache 2\.0 license, together with this report documenting the architecture decisions, data pipeline, and training recipe behind them\. In addition, we release a base model before the long context extension and SFT checkpoints\.
Across a panel of code generation, math and reasoning, tool use, knowledge, and safety benchmarks,Mellum 2is competitive with open\-weight baselines in the 4–14B range despite running at the per\-token compute of a 2\.5B dense model, and matches or exceeds the inference throughput of Qwen2\.5\-7B\[[undefaae](https://arxiv.org/html/2605.31268#bib.bibx58)\]on a single H100\. The remainder of this report follows the contributions above:[Section2](https://arxiv.org/html/2605.31268#S2)traces the architecture design and ablations,[Section3](https://arxiv.org/html/2605.31268#S3)details the pre\-training data and recipe,[Section4](https://arxiv.org/html/2605.31268#S4)describes the 128K context extension,[Section5](https://arxiv.org/html/2605.31268#S5)covers SFT, RL, and post\-training evaluation, and[Section6](https://arxiv.org/html/2605.31268#S6)reports our inference benchmarks\.
## 2Model Architecture
Mellum 2is a decoder\-only Transformer that closely follows the Qwen3\-MoE recipe\[[undefaau](https://arxiv.org/html/2605.31268#bib.bibx74)\]: a Mixture\-of\-Experts \(MoE\) feed\-forward network in every layer, Grouped\-Query Attention \(GQA\)\[[undef](https://arxiv.org/html/2605.31268#bib.bibx1)\]with QK\-Norm\[[undefaa](https://arxiv.org/html/2605.31268#bib.bibx28)\], SiLU\-gated MLPs\[[undefaaj](https://arxiv.org/html/2605.31268#bib.bibx63)\], RMSNorm\[[undefaay](https://arxiv.org/html/2605.31268#bib.bibx78)\], and Rotary Position Embeddings \(RoPE\)\[[undefaao](https://arxiv.org/html/2605.31268#bib.bibx68)\]\. On top of this backbone we add two latency\- and quality\-oriented modifications: Sliding Window Attention \(SWA\) on a fraction of the layers, and a single Multi\-Token Prediction \(MTP\) head trained as an auxiliary objective\.
### 2\.1Architecture Design Decisions
AsMellum 2is meant to be deployed in JetBrains IDEs, we approached the design space from the perspective of efficient inference\. We targeted the latency and throughput budget of a Qwen2\.5\-7B dense model on a single H100 GPU as our baseline and conducted several architectural ablations to match it\.
#### 2\.1\.1Dense vs\. Sparse
We first evaluated whether a dense architecture could outperform the baseline under our latency constraint\. We explored multiple Qwen3\-based dense configurations—varying depth \(24–40 layers\) and width \(hidden sizes 2304–4096\), as well as DeepSeek\-style models with Multi\-head Latent Attention \(MLA\)\[[undefn](https://arxiv.org/html/2605.31268#bib.bibx15)\]\. None of the dense configurations consistently outperformed Qwen2\.5\-7B on our evaluation benchmarks within the latency budget\. MLA allowed scaling to approximately 5\.5B parameters at equivalent speed, but the quality gains were insufficient to justify the additional training complexity, and the supported latent rank was too large for our model scale\.
We therefore adopted a Mixture\-of\-Experts \(MoE\) architecture, which enabled scaling to∼\{\\sim\}12B total parameters while keeping the per\-token compute comparable to a 2\.5B dense model\.
#### 2\.1\.2Expert Configuration
Starting from the Qwen3\-30B\-A3B architecture\[[undefaau](https://arxiv.org/html/2605.31268#bib.bibx74)\], we scaled down the model proportionally to fit within a single H100 GPU \(<<18B total parameters\)\. We fixed the number of experts at 64 as larger expert counts exceeded GPU memory constraints\.
We evaluated different sparsity levels \(number of active experts\) and found that higher sparsity \(fewer active experts\) yielded better inference performance\. For example, 2 active experts achieved∼\{\\sim\}1\.5×\\timeslower latency than 8 active experts\. However, consistent with prior work suggesting that high sparsity can be detrimental at smaller scales\[[undefi](https://arxiv.org/html/2605.31268#bib.bibx10),[undefah](https://arxiv.org/html/2605.31268#bib.bibx35)\], our benchmark evaluations confirmed that models with lower sparsity \(more active experts\) produced better quality\. We settled on8 active out of 64 total expertsas the optimal quality–latency trade\-off\. Under this configuration, the model supports up to∼\{\\sim\}15B total parameters while matching Qwen2\.5\-7B latency\.[Figure1](https://arxiv.org/html/2605.31268#S2.F1)shows iso\-latency maps for MoE configurations with 8 active experts, illustrating the feasible design space\.
\(a\)Throughput mode\.
\(b\)Sync mode\.
Figure 1:Iso\-latency maps for Qwen3\-MoE architectures \(64 experts, 8 active\) across different hidden dimensions and layer counts\. Each grid point is labelled with*T*\(total parameters\) and*A*\(active parameters\), both in billions\. Dashed lines show the latency contours of Mellum 4B \(orange\) and Qwen2\.5\-7B \(blue\); configurations below these lines are faster than the corresponding reference model\.
#### 2\.1\.3Grouped\-Query Attention
The number of KV heads is the most significant factor affecting inference throughput under high\-concurrency conditions\. While the effect is negligible in synchronous \(single\-request\) mode where KV\-cache utilization is low, it becomes substantial in throughput\-dominant serving scenarios\. For instance, Qwen2\.5\-7B with 4 KV heads achieves roughly the same throughput as our predecessor Mellum\-4B with 8 KV heads despite being nearly twice the size\.
We selected4 KV headsas the optimal trade\-off: 8 heads caused significant throughput degradation, while 2 heads yielded insufficient quality on evaluation benchmarks\.[Figure2](https://arxiv.org/html/2605.31268#S2.F2)shows iso\-latency maps for Qwen3\-based dense architectures with 4 KV heads, with dashed lines indicating the latency of Mellum 4B and Qwen2\.5\-7B\. In throughput mode, the KV\-cache bottleneck is clearly visible: wider models \(larger hidden dimension\) are disproportionately penalized\. In sync mode, where the KV cache is underutilized, the effect is much smaller and latency is dominated by model depth\.
\(a\)Throughput mode \(concurrent requests\)\.
\(b\)Sync mode \(sequential requests\)\.
Figure 2:Iso\-latency maps for dense Qwen3 architectures with 4 KV heads\. Each grid point is labelled with the model’s total parameter count in billions \(e\.g\.,*4\.20B*\); circle size encodes the same quantity\. Dashed lines show the latency contours of Mellum 4B \(orange\) and Qwen2\.5\-7B \(blue\); configurations below these lines are faster than the corresponding reference model\.
#### 2\.1\.4Sliding Window Attention
We adopted Sliding Window Attention \(SWA\)\[[undefaf](https://arxiv.org/html/2605.31268#bib.bibx33),[undefe](https://arxiv.org/html/2605.31268#bib.bibx6)\]as a latency optimization\. Experiments on both dense and MoE architectures confirmed that SWA reduces inference latency by limiting the attention span of most layers\. We apply SWA to3 out of every 4 layers\(the remaining layers use full attention\) with a window size of 1,024 tokens\. This pattern preserves long\-range context capability through the full\-attention layers while reducing compute in the majority of layers\. Consistent with findings from the Gemma model family\[[undeft](https://arxiv.org/html/2605.31268#bib.bibx21)\], a window size of 1,024 outperforms one of size 512 on quality benchmarks\.[Figure3](https://arxiv.org/html/2605.31268#S2.F3)shows that MoE models with SWA achieve latency comparable to Qwen2\.5\-7B even at double the context length, providing a significant advantage in workflows requiring larger context\.
Figure 3:Latency comparison of MoE models with Sliding Window Attention \(window sizes 512 and 1,024, applied to 3/4 of attention layers\) against Qwen2\.5\-7B across different input lengths\.
#### 2\.1\.5Multi\-Token Prediction
We augment the standard next\-token prediction objective with a Multi\-Token Prediction \(MTP\) head\[[undefu](https://arxiv.org/html/2605.31268#bib.bibx22)\]that predicts one additional future token\. The MTP head is a single additional transformer layer that receives the hidden states from the main model and is trained with a scaled loss \(α=0\.1\\alpha=0\.1\)\. The MTP head is removed during evaluation and inference \(it does not affect the main model’s predictions\), but provides a natural draft model for speculative decoding\.
In ablation studies involving a 14B MoE model trained on 105B data tokens, MTP yielded significant benchmark improvements at a cost of only 7% additional training time\. The validation loss curves of runs with and without MTP head were nearly identical, suggesting that MTP does not degrade the primary next\-token prediction objective\. Rather, benchmark evaluation \([Table1](https://arxiv.org/html/2605.31268#S2.T1)\) reveals substantial improvements on key tasks: HumanEval \+10\.4, MMLU \+3\.6, MMLU\-Pro \+3\.3, and GSM8K \+3\.0\.
Table 1:Benchmark comparison between baseline and MTP models \(14B MoE, 105B tokens\)\.
### 2\.2Final Architecture
Bringing the design decisions together, we castMellum 2as a Qwen3\-MoE\-style decoder\-only Transformer with the following components:
- •Backbone:28 transformer layers, hidden dimension 2,304, with pre\-RMSNorm\[[undefaay](https://arxiv.org/html/2605.31268#bib.bibx78)\]\(ϵ=10−6\\epsilon=10^\{\-6\}\) and SiLU\-gated MLPs\[[undefaaj](https://arxiv.org/html/2605.31268#bib.bibx63)\]\.
- •Attention:32 query heads and 4 KV heads \(GQA\[[undef](https://arxiv.org/html/2605.31268#bib.bibx1)\], head dimension 128\), QK\-Norm\[[undefaa](https://arxiv.org/html/2605.31268#bib.bibx28)\]applied to the query and key projections, and RoPE\[[undefaao](https://arxiv.org/html/2605.31268#bib.bibx68)\]with baseθ=500,000\\theta=500\{,\}000\.
- •Sliding window attention:a 3:1 SWA\[[undefe](https://arxiv.org/html/2605.31268#bib.bibx6)\]pattern in which 3 out of every 4 layers use a sliding window of 1,024 tokens and the remaining layer uses full attention\.
- •Mixture\-of\-Experts:64 routed experts per layer with 8 active per token \(top\-8 routing\), expert intermediate size 896, and no shared expert\.
- •Multi\-Token Prediction:a single MTP\[[undefu](https://arxiv.org/html/2605.31268#bib.bibx22)\]transformer layer trained with loss weightα=0\.1\\alpha=0\.1, used as a draft model for speculative decoding\[[undefak](https://arxiv.org/html/2605.31268#bib.bibx38)\]and removed at evaluation time\.
- •Embeddings:untied input/output embeddings over a 98,304\-token vocabulary; native context length 8,192 tokens \(extended to 131,072 in long\-context training, see[Section4](https://arxiv.org/html/2605.31268#S4)\)\.
This configuration totals∼\{\\sim\}12B parameters with∼\{\\sim\}2\.5B active per token\.111All matrix dimensions—hidden size 2,304, head dimension 128, expert intermediate size 896—are kept divisible by 128 or higher powers of two; violations of this alignment can cost up to a 2×\\timesslowdown in GPU kernel execution, so the constraint was treated as binding throughout the search\.[Table2](https://arxiv.org/html/2605.31268#S2.T2)summarizes the full set of hyperparameters\.
Table 2:Architecture configuration ofMellum 2\.ScaleTotal parameters∼\{\\sim\}12BActive parameters∼\{\\sim\}2\.5BVocabulary size98,304Context length8,192 / 131,072⋆Tied embeddingsNoBackboneLayers28Hidden dimension2,304ActivationSiLU \(gated\)NormalizationRMSNorm \(ϵ=10−6\\epsilon\{=\}10^\{\-6\}\)Position encodingRoPE \(θ=500,000\\theta\{=\}500\{,\}000\)
AttentionQuery heads32KV heads \(GQA\)4Head dimension128QK\-NormYes \(RMSNorm\)Sliding window1,024 \(3:1 SWA\)Mixture\-of\-Experts & MTPExperts \(total\)64Experts \(active\)8 \(top\-8\)Expert MLP size896Shared expertNoneMTP layers1 \(α=0\.1\\alpha\{=\}0\.1\)
⋆After the long\-context extension stage \([Section4](https://arxiv.org/html/2605.31268#S4)\)\.
## 3Pre\-Training
### 3\.1Training Data
Our pre\-training corpus comprises approximately 10\.6 trillion tokens drawn from diverse sources\. We organize the data into three broad categories: web and general knowledge, source code, and mathematical content\.
#### 3\.1\.1Source Code
The code portion of our corpus includes raw, permissively licensed source code files collected from public repositories and deduplicated at the file level, web pages containing code extracted from Common Crawl, and a suite of synthetic and derived code datasets\. The derived datasets augment raw code with natural language annotations—including code summarizations, functionality extensions, translations between programming languages, test generation, commit messages, and task descriptions\. We also include synthetic code datasets covering question answering, code rewriting, code review, transpilation, and educational explanations\. Consistent with\[[undefad](https://arxiv.org/html/2605.31268#bib.bibx31)\], we find that synthetic code data can effectively complement raw code, particularly for smaller\-scale MoE models where data diversity is crucial\.
#### 3\.1\.2Web and General Knowledge
The web data component includes large\-scale synthetic web corpora derived from Common Crawl\[[undefaan](https://arxiv.org/html/2605.31268#bib.bibx67)\], educational web content\[[undefaac](https://arxiv.org/html/2605.31268#bib.bibx56)\], educational PDFs, multilingual reasoning and QA datasets, and curated knowledge sources including SFT data, STEM instruction data, rewrites of Wikipedia pages, and synthetically generated encyclopedic articles\.
#### 3\.1\.3Mathematical Data
Mathematical data includes math\-focused SFT data, math\-oriented web content at multiple quality tiers, permissively licensed math textbooks, and math instruction\-tuning data\.
#### 3\.1\.4Tokenizer
We use a custom tokenizer with a vocabulary size of 98,304 tokens, identical to the tokenizer used in Mellum\-4B\[[undefaab](https://arxiv.org/html/2605.31268#bib.bibx55)\]\. The vocabulary is designed to provide strong coverage of programming language tokens and technical terminology\.
### 3\.2Three\-Phase Training Curriculum
Following the “web early, curated late” paradigm established by Llama 3\.1\[[undefv](https://arxiv.org/html/2605.31268#bib.bibx23)\], DeepSeek\-V3\[[undefo](https://arxiv.org/html/2605.31268#bib.bibx16)\], and SmolLM2\[[undefa](https://arxiv.org/html/2605.31268#bib.bibx2)\], and most recently adopted by Arcee Trinity\[[undefaal](https://arxiv.org/html/2605.31268#bib.bibx65)\], we organize pre\-training into three phases that progressively shift from diverse web content toward high\-quality code and mathematical data\. The phase boundaries are aligned with the Warmup\-Hold\-Decay \(WHD\) learning rate schedule\[[undefac](https://arxiv.org/html/2605.31268#bib.bibx30),[undefx](https://arxiv.org/html/2605.31268#bib.bibx25)\]\.
Table 3:Three\-phase pre\-training curriculum\. The data mix progressively shifts toward code and math as training progresses\.Phase 1: Foundation Building\(∼\{\\sim\}6\.18T tokens, 58%\)\. The first phase establishes broad linguistic capabilities and foundational code understanding using predominantly web data\. The mix is approximately 70% web and general knowledge, 23% code, and 6% math\. This phase covers the learning rate warmup and the beginning of the hold period\.
Phase 2: Quality Uplift\(∼\{\\sim\}2\.79T tokens, 26\.2%\)\. The second phase shifts toward higher\-quality data, with significant code upsampling to 42%\. High\-quality curated datasets, including SFT data, reasoning QA, STEM instruction data, and knowledge\-aligned articles, are introduced in this phase rather than Phase 1, as curated data is more effective during stable learning rate than during warmup\. New synthetic code datasets covering question answering, code rewriting, and educational explanations are added\. The raw code corpus enters its second epoch\.
Phase 3: Capability Sharpening\(∼\{\\sim\}1\.69T tokens, 15\.9%\)\. The final phase maximizes coding and mathematical capability during learning rate decay, when the model is most sensitive to data quality\. Code reaches 59% of the mix\. Additional synthetic code datasets covering code review and cross\-language transpilation are introduced\. The raw code corpus enters its third epoch\. Web content is reduced to only the highest\-quality curated sources\.
#### 3\.2\.1Data Repetition Strategy
High\-quality data is scarce, so we repeat it\. Small curated code datasets \(summarization, test generation, translation, commit messages, algorithmic solutions\) are shown across all three phases, and the raw code corpus is seen for three epochs, contributing roughly 958B tokens\. No dataset is repeated more than 4×\\timesover the full run, which we find to be the point where further repetition stops yielding gains\. Repetition is particularly valuable for MoE training: high\-quality data seen multiple times sharpens expert specialization in a way that a single pass over noisier data does not\.
#### 3\.2\.2Fill\-in\-the\-Middle Objective
In addition to standard left\-to\-right next\-token prediction, we trainMellum 2with a Fill\-in\-the\-Middle \(FIM\) objective\[[undefd](https://arxiv.org/html/2605.31268#bib.bibx5)\], which is essential for in\-IDE code completion where the model must condition on both the prefix and the suffix of the current cursor position\. Documents selected for FIM are split into a \(prefix, middle, suffix\) triple at two uniformly sampled positions and reformatted with sentinel tokens\. We use a 50/50 split between the Prefix–Suffix–Middle \(PSM\) and Suffix–Prefix–Middle \(SPM\) orderings in all phases\.
The fraction of training documents transformed into FIM examples varies across the curriculum to match the data composition of each phase\. In Phase 1, the FIM rate is 50% and is applied to all data, exposing the model to bidirectional context early when the mix is dominated by web and general\-knowledge text\. In Phase 2, the FIM rate is reduced to 10% so that the high\-quality curated code, reasoning, and instruction data introduced in this phase is consumed primarily under the standard left\-to\-right objective\. In Phase 3, the FIM rate is restored to 50%, but the transformation is restricted to source\-code files only; non\-code data \(curated web, math, reasoning\) continues to be trained with next\-token prediction\. This schedule concentrates FIM training on the data distribution that most closely matches the downstream completion setting, while preserving generative quality on natural\-language inputs\.
### 3\.3Quality Filtering and Deduplication
We apply a multi\-stage quality filtering pipeline to the raw data:
1. 1\.Heuristic filtering\.We apply checks on line length, entropy, comment ratio, and AST parseability checks for code data\. We filter samples with fewer than 82 unique tokens \(1% of context size\) to eliminate degenerate sequences with abnormally low lexical diversity, which we identify as a source of periodic training loss drops\.
2. 2\.Classifier\-based filtering\.Quality classifiers at multiple tiers are used to stratify web data by quality, enabling phase\-appropriate data selection\.
3. 3\.Deduplication\.MinHash\-based near\-deduplication\[[undefaj](https://arxiv.org/html/2605.31268#bib.bibx37)\]at the file level for code data\. For web data, intra\-phase deduplication is applied, while cross\-phase repetition is intentional and aligned with the curriculum design\.
### 3\.4Training Setup
#### 3\.4\.1Optimizer
We use the Muon optimizer\[[undefag](https://arxiv.org/html/2605.31268#bib.bibx34)\]with the distributed configuration described in Moonlight\[[undefar](https://arxiv.org/html/2605.31268#bib.bibx45)\]\. Muon applies orthogonalization\-based updates to hidden layers while using Adam for embedding and output layers\.
We compared AdamW\[[undefat](https://arxiv.org/html/2605.31268#bib.bibx47)\]and Muon on both a dense Qwen2\.5\-7B model and our Qwen3\-MoE\-14B architecture, each trained for 105B tokens\. We evaluated two Muon configurations: Megatron defaults \(extra scale factor 1\.0\) and the Moonlight setup \(extra scale factor 0\.2\)\.
On the dense 7B architecture \([Figure4\(a\)](https://arxiv.org/html/2605.31268#S3.F4.sf1)\), Megatron defaults caused immediate divergence, while the Moonlight setup beat AdamW by a large margin, reducing validation loss by 0\.028 \(∼\{\\sim\}2\.5%\)\. On the MoE\-14B \([Figure4\(b\)](https://arxiv.org/html/2605.31268#S3.F4.sf2)\), both Muon configurations converged successfully, with Megatron defaults achieving slightly better final loss \(−\-0\.026,∼\{\\sim\}2\.4%\) and Moonlight close behind\. We selected the Moonlight configuration for its stability across both dense and MoE architectures\.
\(a\)Qwen2\.5\-7B \(dense\)\.
\(b\)Qwen3\-MoE\-14B\.
Figure 4:Optimizer comparison on 105B\-token ablation runs\.Table 4:Optimizer and training hyperparameters\.Our investigation of the Adamϵ\\epsilonparameter revealed that values as large as10−510^\{\-5\}\(the value used by LLaMA 2\[[undefaas](https://arxiv.org/html/2605.31268#bib.bibx72)\]\) cause disproportionate dampening of updates\. We confirmed thatϵ=10−8\\epsilon=10^\{\-8\}provides the best trade\-off between training stability and optimization effectiveness\.
#### 3\.4\.2Learning Rate Schedule
We employ a Warmup\-Hold\-Decay \(WHD\) schedule\[[undefac](https://arxiv.org/html/2605.31268#bib.bibx30),[undefx](https://arxiv.org/html/2605.31268#bib.bibx25)\]\. The learning rate warms up linearly over 2,000 steps to a peak of3×10−43\\times 10^\{\-4\}, holds at peak through Phases 1 and 2, then decays linearly to zero over Phase 3 \(∼\{\\sim\}49,306 steps, approximately 15% of total training\)\. The linear decay to zero follows recent findings showing that it outperforms cosine decay to a non\-zero minimum, providing equivalent loss at lower effective compute cost\.[Figure5](https://arxiv.org/html/2605.31268#S3.F5)illustrates the full training schedule with learning rate, batch size rampup, and phase boundaries\.
Steps \(K\)LR010−410^\{\-4\}2×10−42\{\\times\}10^\{\-4\}3×10−43\{\\times\}10^\{\-4\}050K100K150K200K250K274K323KPhase 1: Foundation \(6\.18T\)Phase 2: QualityUplift \(2\.79T\)Phase 3\(1\.69T\)Batch: 4096Batch: 2048Learning rateBatch sizeFigure 5:Training schedule forMellum 2showing the Warmup\-Hold\-Decay \(WHD\) learning rate schedule, batch size rampup, and three\-phase data curriculum boundaries\.
#### 3\.4\.3Batch Size Rampup
The global batch size ramps linearly from 2,048 to 4,096 sequences during the initial phase of training\. At full batch size, each step processes approximately 33\.6M tokens \(4,096×8,1924\{,\}096\\times 8\{,\}192\)\.
#### 3\.4\.4Precision
We use BF16 as the base precision with FP8 hybrid mixed\-precision training\[[undefav](https://arxiv.org/html/2605.31268#bib.bibx49)\], using tensorwise FP8 recipe with the most\-recent amax algorithm\. Gradient reduction is performed in FP32 to maintain numerical stability\.
#### 3\.4\.5MoE\-Specific Training
For the MoE routing, we use global auxiliary load\-balancing loss\[[undefq](https://arxiv.org/html/2605.31268#bib.bibx18)\]with a coefficient of10−310^\{\-3\}, combined with a router z\-loss of10−310^\{\-3\}for training stability\[[undefaaaa](https://arxiv.org/html/2605.31268#bib.bibx80)\]\. The router operates in FP32 precision\. We explored both per\-sequence and global\-batch balancing strategies and chose global\-batch balancing for its flexibility, despite per\-sequence balancing producing slightly better loss on short runs\.
We adoptdropless routing\[[undefr](https://arxiv.org/html/2605.31268#bib.bibx19)\]\(no expert capacity factor\), which avoids token dropping entirely\. In short\-run experiments, we found no meaningful quality difference between capacity factors of 1\.0–1\.5\. Dropless routing was initially slower than routing with a capacity factor of 1\.5 in our tests\. However, this was before accounting for the effect of router balancing on throughput: as the router learns a proper load balance during training, dropless routing throughput improves and approaches that of capacity\-limited routing\. In the early stages of training, when routing is less balanced, the overhead is more pronounced\. We observe approximately 15% higher initial iteration step time compared to capacity factor 1\.5\. Dropless routing also eliminates information loss from dropped tokens and allows full micro\-batch utilization\.
#### 3\.4\.6Sequence Packing
Documents are combined into fixed\-length 8,192\-token training sequences using best\-fit packing\[[undefp](https://arxiv.org/html/2605.31268#bib.bibx17)\], which minimizes intra\-document truncation relative to the standard concatenate\-and\-chunk approach and reduces hallucinations caused by spurious cross\-document context\.
#### 3\.4\.7Infrastructure
Training is conducted on 32 nodes, each equipped with 8 H200 GPUs, using a Megatron\-LM\[[undefaak](https://arxiv.org/html/2605.31268#bib.bibx64)\]\-based training framework\. We employ expert parallelism with a degree of 8 \(each GPU hosts 8 of 64 experts\), with tensor and pipeline parallelism degrees of 1\. Gradient reduction and parameter gather are overlapped with computation for efficiency\.
All data processing is performed offline on a MapReduce\-like distributed storage and compute system\. Each raw example is tokenized and then assembled into fixed\-length training\-ready shards that are stored alongside the raw corpora\. At training time, a background streamer running on the master node pulls these shards from the storage cluster and writes them into an in\-memory Redis queue; all data\-parallel workers consume batches from this queue over the internal network\. This design fully decouples dataset storage and offline processing from the training fleet: the two systems share no filesystem and communicate only through the streaming queue, which lets us place them in geographically separate data centers \(in our setup, the storage and processing cluster is hosted in France while the training fleet runs on a GPU cluster in the United States\) without exposing transatlantic latency to the training loop\.
### 3\.5Training Curves
[Figure6](https://arxiv.org/html/2605.31268#S3.F6)shows the training loss curves from the ongoing production run\. The LM loss decreases steadily across phases, with visible phase transitions at the data mix boundaries\. The MTP loss tracks the LM loss closely but at a higher magnitude, consistent with the increased difficulty of predicting tokens further ahead\. The global load\-balancing loss reflects the router’s learning dynamics: it stabilizes as training progresses, indicating that the router learns an effective expert assignment\.
\(a\)LM loss \(next\-token prediction\)\.
\(b\)MTP\-1 loss \(1\-step\-ahead prediction\)\.
\(c\)Global load\-balancing loss\.
Figure 6:Training loss curves for theMellum 2production run\. Shaded regions indicate the three training phases; the dotted line marks the batch size doubling \(2,048→\\to4,096\)\.
### 3\.6Training Stability
During pre\-training, we identified and resolved several stability issues:
Loss spikes from low\-diversity sequences\.Two loss spikes visible at the very beginning of training were traced to data segments containing sequences with abnormally low lexical diversity \(e\.g\., a single repeated token spanning the entire context\)\. We mitigated this by filtering samples with fewer than 82 unique tokens \(1% of the 8,192 context length\)\.
Residual periodic loss spikes from hash\-sorted duplicates\.Our data preparation pipeline sorts samples by a hash of the token sequence\. Some source documents were long enough that, when sliced into 8,192\-token chunks, multiple chunks became exact duplicates\. Hash\-based sorting placed these duplicates at the same position within each data shard\. Since each training phase is composed of 16 uniform shards, the duplicates appear at roughly the same offset in every shard, producing 16 periodic downward loss spikes per phase\. These spikes are visible in[Figure6\(a\)](https://arxiv.org/html/2605.31268#S3.F6.sf1)as faint periodic dips\. We verified that they are modest in magnitude, isolated, and have no measurable effect on training dynamics—including no impact on the MoE load\-balancing loss \([Figure6\(c\)](https://arxiv.org/html/2605.31268#S3.F6.sf3)\)\. Since removing these duplicates from the already\-prepared data was technically non\-trivial, we chose to continue training with them in place\.
Cluster migration and load\-balancing loss shift\.Approximately halfway through training, we migrated from 32 nodes to a smaller cluster of 16 nodes while keeping the effective global batch size fixed\. As visible in[Figure6\(c\)](https://arxiv.org/html/2605.31268#S3.F6.sf3), the global load\-balancing loss decreased noticeably after this transition\. This is not a change in model behavior but rather a consequence of how Megatron\-LM implements the global auxiliary loss\. The implementation maintains a running average of per\-expert token counts across microbatches within each optimizer step, resetting the accumulator only at gradient finalization\. The loss at each microbatch is computed against this running estimate rather than against a true global count\. When the number of data\-parallel ranks changes \(here, halved\), the microbatch decomposition of the same effective batch changes: fewer ranks means more gradient\-accumulation microbatches per step, which allows the running average to converge more closely to the true distribution before reset\. The resulting loss is therefore systematically lower, even though the effective optimization signal is comparable\. This is an accumulation\-semantics artifact rather than a precision issue \(all auxiliary\-loss computations use FP32\) and did not materially affect training quality\.
### 3\.7Pre\-Training Evaluation
We evaluate the base model ofMellum 2on a broad suite of benchmarks spanning general knowledge, reasoning, mathematics, and code\. We compare against OLMo\-3\-7B\[[undefaar](https://arxiv.org/html/2605.31268#bib.bibx71)\], Qwen2\.5\-7B\[[undefaae](https://arxiv.org/html/2605.31268#bib.bibx58)\], Qwen3\-4B\-Base\[[undefaau](https://arxiv.org/html/2605.31268#bib.bibx74)\], and Qwen3\.5\-4B\-Base\[[undefaaq](https://arxiv.org/html/2605.31268#bib.bibx70)\]\.
The evaluation suite consists of 18 benchmarks grouped into three categories:
- •General Knowledge & Reasoning:MMLU\[[undefy](https://arxiv.org/html/2605.31268#bib.bibx26)\], MMLU\-Pro\[[undefaat](https://arxiv.org/html/2605.31268#bib.bibx73)\], BBH\[[undefaap](https://arxiv.org/html/2605.31268#bib.bibx69)\], ARC\-Challenge\[[undefj](https://arxiv.org/html/2605.31268#bib.bibx11)\], HellaSwag\[[undefaax](https://arxiv.org/html/2605.31268#bib.bibx77)\], WinoGrande\[[undefaah](https://arxiv.org/html/2605.31268#bib.bibx61)\], and TruthfulQA\[[undefan](https://arxiv.org/html/2605.31268#bib.bibx41)\]\.
- •Math & Science:GSM8K\[[undefk](https://arxiv.org/html/2605.31268#bib.bibx12)\], MATH\[[undefz](https://arxiv.org/html/2605.31268#bib.bibx27)\], and GPQA \(Main and Diamond splits\)\[[undefaaf](https://arxiv.org/html/2605.31268#bib.bibx59)\]\.
- •Code Generation:HumanEval and HumanEval\+\[[undefh](https://arxiv.org/html/2605.31268#bib.bibx9),[undefaq](https://arxiv.org/html/2605.31268#bib.bibx44)\], MBPP and MBPP\+\[[undefc](https://arxiv.org/html/2605.31268#bib.bibx4),[undefaq](https://arxiv.org/html/2605.31268#bib.bibx44)\], MultiPL\-E\[[undefg](https://arxiv.org/html/2605.31268#bib.bibx8)\], and CRUXEval \(input and output prediction\)\[[undefw](https://arxiv.org/html/2605.31268#bib.bibx24)\]\.
[Table5](https://arxiv.org/html/2605.31268#S3.T5)summarizes performance across all benchmark groups\. Despite activating only 2\.5B parameters per token,Mellum 2is competitive with 7B dense models on many benchmarks and exceeds them on several reasoning and code tasks \(MMLU\-Pro, BBH, GSM8K, MBPP, CRUXEval\)\.
Table 5:Pre\-training evaluation results\. All values are reported as percentages\. TheMellum 2column is shaded for grouping\.Key observations:
- •MMLU\-Pro:Mellum 2achieves 59\.3%, surpassing all comparison models including Qwen3\.5\-4B \(52\.4%\) and Qwen2\.5\-7B \(48\.6%\)\.
- •BBH: At 74\.9%,Mellum 2outperforms OLMo\-3\-7B \(63\.6%\), Qwen2\.5\-7B \(69\.0%\), and Qwen3\-4B \(71\.3%\)\.
- •GSM8K:Mellum 2\(81\.7%\) is on par with Qwen2\.5\-7B \(81\.9%\) and Qwen3\-4B \(82\.0%\) despite significantly fewer active parameters\.
- •MBPP / MBPP\+: Strong code generation with 62\.4% / 61\.4%, outperforming OLMo\-3\-7B and Qwen3\.5\-4B\.
- •HumanEval: At 41\.5%, this remains a growth area; we observed significant performance lift on HumanEval after the post\-training\.
- •GPQA Main:Mellum 2\(35\.0%\) outperforms OLMo\-3\-7B \(27\.9%\) and Qwen2\.5\-7B \(34\.2%\)\.
These results demonstrate that the MoE architecture with 2\.5B active parameters can match or exceed 4–7B dense models on reasoning\-heavy benchmarks\.
## 4Long Context Extension
Following the main pre\-training run, we performed a dedicated long\-context extension stage to extend the effective context length ofMellum 2from the 8,192\-token training context to 131,072 tokens \(128K\)\.
### 4\.1Layer\-Selective YaRN
We adopt YaRN\[[undefaad](https://arxiv.org/html/2605.31268#bib.bibx57)\]for context extension, but apply it selectively rather than uniformly across the network\. Specifically, the YaRN frequency re\-mapping is applied only to the global \(full\-attention\) layers, leaving the sliding window layers with their original RoPE parameters\. This layer\-selective recipe was first reported in the Gemma 3 technical report\[[undeft](https://arxiv.org/html/2605.31268#bib.bibx21)\]\(with positional interpolation rather than YaRN as the scaling method\) and was subsequently adopted by OLMo 3\[[undefaar](https://arxiv.org/html/2605.31268#bib.bibx71)\]\. Our ablations \([Figure7](https://arxiv.org/html/2605.31268#S4.F7)\) are consistent with their findings: applying YaRN only to the global layers outperforms both \(i\) a uniform RoPE base \(θ\\theta\) bump on all layers and \(ii\) leavingθ\\thetaunchanged\. Intuitively, the sliding window layers operate on a fixed local span and therefore do not require frequency re\-mapping, while the global layers are the only ones that must extrapolate to the new sequence length\.
Concretely, at a 64K evaluation context the layer\-selective recipe reaches a RULER\[[undefab](https://arxiv.org/html/2605.31268#bib.bibx29)\]score of0\.640\.64, compared with0\.520\.52for the uniformθ\\theta\-bump and0\.330\.33for the unchanged\-θ\\thetabaseline\. The gap between recipes*widens*with context length: the unchanged\-θ\\thetarun never adapts the full\-attention layers to longer sequences and collapses past 32K, while the uniform bump unnecessarily perturbs the sliding\-window layers that were already operating correctly at the base context length\. The absolute RULER numbers here are conservative because of a prompt\-formatting issue that depressed scores on the QA subsets throughout the extension stage; we discuss this in[SectionC\.1](https://arxiv.org/html/2605.31268#A3.SS1)and read[Figure7](https://arxiv.org/html/2605.31268#S4.F7)as a*within*\-recipe ranking rather than as RULER’s final word on absolute long\-context capability\.
Figure 7:RULER score versus evaluation context length for the three long\-context recipes we ablated, each scored at its best checkpoint along the extension run\. The uniformθ\\theta\-bump and unchanged\-θ\\thetaevaluation runs were capped at a 64K training context, hence the missing 128K points\. See[SectionC\.1](https://arxiv.org/html/2605.31268#A3.SS1)for caveats on the absolute scores\.
### 4\.2Data Mix
The training data for the extension stage combines a rebalanced version of the Phase 3 pre\-training mix with a portion of agentic SFT data, which naturally contains long\-context examples\. The Phase 3 mix was rebalanced to subsample long reasoning traces, which we found to dominate the long\-context tail and to skew the model toward reasoning\-style outputs at the expense of more general long\-context behaviors\.
We also experimented with reproducing OLMo 3’s Longmino mix\[[undefaar](https://arxiv.org/html/2605.31268#bib.bibx71)\]and several other mixtures, but were unable to replicate the data\-mix gains reported there\. In a head\-to\-head with everything else held constant \(same model, optimizer, YaRN configuration, and iteration budget\), adding the Longmino mix on top of our base mix*lowered*RULER by roughly 2–3 percentage points at every measured context length, rather than improving it—consistent with the broader pattern that, across the configurations we tested, different mixtures produced very similar benchmark numbers, with our base mix narrowly on top\. We also observed essentially no further quality improvement beyond∼\{\\sim\}30B tokens of long\-context training \([Figure8](https://arxiv.org/html/2605.31268#S4.F8)\)\.
To preserve the in\-IDE completion capability at long contexts, we also inject FIM\-formatted examples with repository\-level context into the extension mix, following the construction used forMellum1\[[undefaab](https://arxiv.org/html/2605.31268#bib.bibx55)\]\. Each example concatenates a set of related files from the same repository as additional context preceding the \(prefix, middle, suffix\) target file, so that the cross\-file dependencies relevant to completing the middle span appear at distances representative of real project layouts\. This ensures that the model learns to attend across repository\-scale spans while learning a FIM objective that drives in\-IDE completion, similarly toMellum1\.
### 4\.3Training Schedule
[Figure8](https://arxiv.org/html/2605.31268#S4.F8)plots RULER scores against the number of long\-context training tokens for the chosen recipe\. By the end of the first∼\{\\sim\}30B tokens, RULER at every measured context length is already within∼\{\\sim\}1 pp of the final value reached at 117B tokens; the subsequent∼\{\\sim\}3×\\timesincrease in token budget yields only marginal improvements\. Beyond the 30B\-token point, the only quantity that continued to change meaningfully was the MoE router’s load\-balancing loss, which decreased substantially as the router adapted to the new sequence\-length regime \([Figure9](https://arxiv.org/html/2605.31268#S4.F9)\)\. On the strength of this signal, we extended the run to 3,500 iterations \(∼\{\\sim\}117B tokens\) using a Warmup\-Hold\-Decay \(WHD\) schedule\[[undefac](https://arxiv.org/html/2605.31268#bib.bibx30),[undefx](https://arxiv.org/html/2605.31268#bib.bibx25)\]with 500 decay iterations and a peak learning rate of3×10−53\\times 10^\{\-5\}, allowing the router to fully equilibrate before annealing\.
Figure 8:RULER score versus training tokens during the long\-context extension stage, for the chosen layer\-selective YaRN recipe\. See[SectionC\.1](https://arxiv.org/html/2605.31268#A3.SS1)for a comment on absolute RULER scores\.Figure 9:Global MoE load\-balancing loss during the long\-context extension stage\.
## 5Post\-Training
Post\-training ofMellum 2starts from the long\-context YaRN checkpoint described in[Section4](https://arxiv.org/html/2605.31268#S4)and proceeds in two stages: supervised fine\-tuning \(SFT\) and reinforcement learning\.
### 5\.1Supervised Fine\-Tuning
We train two SFT variants ofMellum 2from the same long\-context base checkpoint and the same data mix, differing in their chat templates and in how reasoning traces and loss masking are handled:
- •Instruct \(no\-thinking\)\.A general\-purpose assistant that produces answers directly, without an externalized chain of thought\. Loss is computed on every assistant turn in the conversation, with all other tokens masked, and any reasoning fields present in the source data are discarded\.
- •Thinking\.A reasoning\-augmented assistant that emits an internal chain of thought before its final answer\. Only the final assistant turn, together with its reasoning trace, contributes to the loss; preceding turns serve as conditioning context, and conversations lacking a reasoning trace are excluded\. To amplify the effective signal on multi\-turn data, each multi\-turn conversation is unfolded by sliding the loss target across successive assistant turns, producing up to five training samples per source conversation\.
After tokenization, sequences are packed to the full 131,072\-token training length; samples that would not fit cleanly into a pack are dropped rather than truncated\. Both variants reuse the pre\-training optimizer and precision stack and keep the Multi\-Token Prediction head active throughout SFT\.
#### 5\.1\.1Data Composition
The SFT corpus is assembled from a number of sources covering the capabilities we wantMellum 2to provide at deployment time\. The dataset mix can be grouped into the following broad categories:
- •General chat and instruction\-following\.Single\- and multi\-turn conversational data covering open\-domain questions, reading\-comprehension QA, multiple\-choice items, and short\-form instruction\-following\.
- •Single\-turn coding\.Code generation, editing, explanation, and translation prompts spanning multiple programming languages, with dedicated splits for C\+\+, Python, C\#, JavaScript and TypeScript competitive programming\.
- •Agentic coding\.Long\-horizon interactive agent trajectories \(early and revised generations\), including SWE\-style repository\-level edit tasks\. These supply the model with patterns for navigating a codebase, planning multi\-step edits, and verifying intermediate results\.
- •Tool use and function calling\.Tool\-augmented conversations covering general function\-calling formats, Bash execution, a clarification tool, and search tools\. The mix teaches both schema\-faithful tool invocation and recovery from tool errors\.
- •Reasoning traces\.Chain\-of\-thought\-bearing examples that populate thereasoningfield used by the thinking variant\. These cover math, code, and general reasoning; they are filtered out at processing time for the instruct variant\.
- •Safety\.Refusal and safe\-response data drawn from a permissively licensed safety corpus, included to reduce harmful completions without degrading helpfulness on benign code prompts\.
- •Identity examples\.A small set of self\-identification dialogues is oversampled \(3×\\times\) so that the model reliably introduces itself asMellum 2rather than its upstream architectures\. Interestingly, in initial runs without identity data, the model consistently identified itself as an AI assistant developed by Google, even though no Google models were used for synthetic data generation\.
Every example is stored in a unified schema with amessageslist \(role/content turns\), an optionaltoolslist describing available function\-call signatures, and an optionalreasoningfield holding the chain\-of\-thought associated with the final assistant turn\.
#### 5\.1\.2Training Setup
Both SFT runs initialize from the long\-context YaRN checkpoint \([Section4](https://arxiv.org/html/2605.31268#S4)\), use the same distributed Muon optimizer as pre\-training, and run for three epochs over their respective packed datasets\. The learning rate peaks at3×10−53\{\\times\}10^\{\-5\}—a tenth of the pre\-training peak—warms up linearly over 100 iterations, and then decays cosine\-style to3×10−63\{\\times\}10^\{\-6\}\(10 % of peak\) over the remainder of training\. We keep BF16 with FP8 hybrid mixed precision, the dropless MoE router, and the MTP head with loss weightα=0\.1\\alpha=0\.1unchanged from pre\-training\. The MoE auxiliary load\-balancing coefficient is reduced from10−310^\{\-3\}to10−410^\{\-4\}, since the router is already well\-balanced after pre\-training and a smaller coefficient avoids over\-constraining expert utilization on the narrower SFT distribution\.
We train at a global batch size of 64 packed sequences of length 131,072—roughly 8\.4M tokens per optimizer step—on 16 nodes of 8 H200 GPUs each\. The run uses expert parallelism of 8 and context parallelism of 8\. The instruct run consumes≈\\approx47B tokens and the thinking run≈\\approx167B tokens, matching the three\-epoch budget on each packed dataset\.[Table6](https://arxiv.org/html/2605.31268#S5.T6)summarizes the shared and variant\-specific hyperparameters\.
Table 6:Supervised fine\-tuning configuration\. Shared rows apply to both runs; rows below the rule differ between Instruct and Thinking\.
### 5\.2Reinforcement Learning
Post\-training ofMellum 2finishes with a Reinforcement Learning \(RL\) stage that refines each SFT checkpoint against programmatically verifiable rewards \(RLVR\)\. We use RLVR rather than RLHF because every prompt in our training corpus admits an unambiguous, programmatic correctness check, so we never have to train a separate reward model whose noise could dominate the gradient signal\.
We run RL twice, once per SFT variant\. TheInstructstage starts from the SFT\-instruct checkpoint and trains on the data mix for the Instruct model\. TheThinkingstage is a cold restart from the SFT\-thinking checkpoint on the data mix for the Thinking model, and its tasks are more difficult for the model than the Instruct mix because it adds a more challenging long\-form math subset\. Each stage produces its own deployable checkpoint; the two runs never share weights\.
Both stages use a variation of GRPO\[[undefaai](https://arxiv.org/html/2605.31268#bib.bibx62)\]with a few adjustments that we describe later in this section\.
#### 5\.2\.1Infrastructure
RL runs on a single Kubernetes cluster of H200 GPU nodes\. The cluster is split into two roles at launch time: a small group of*training*nodes that owns the policy weights and runs the gradient updates, and a larger group of*inference*nodes that hosts the generation engines and produces the rollouts\. The split is fixed for the duration of a run\.
##### Training stack\.
The trainer is built on NeMo\-RL\[[undefay](https://arxiv.org/html/2605.31268#bib.bibx52)\], which already provides the asynchronous GRPO loop we use\. Model parallelism, optimizer state, and the policy backward pass go through Megatron\-Bridge, configured with the same MoE routing, attention layout, and BF16 / FP8 hybrid precision recipe used during pre\-training \([Section3\.4\.4](https://arxiv.org/html/2605.31268#S3.SS4.SSS4)\)\. Generation runs in vLLM\[[undefai](https://arxiv.org/html/2605.31268#bib.bibx36)\]\. The whole pipeline is orchestrated by Ray and scheduled by Kubernetes\.
##### Async actor topology\.
[Figure10](https://arxiv.org/html/2605.31268#S5.F10)summarises the actor topology\. Trajectory collectors stream completed rollouts into a global buffer; the trainer pulls batches from it, runs the GRPO update, and pushes new weights back to the inference engines\. A trajectory may span two consecutive policy versions, which we cap to a small staleness window\. After every weight push the inference engines recompute the KV cache so that prefix logits stay consistent with the new policy\.
Training nodesMegatron\-Bridge trainerInference nodesvLLM engine×N\\times NTrajectory bufferrolloutsbatchesweight sync\+ KV\-cache recomputeFigure 10:Async GRPO actor topology\.
##### Verification stack\.
Reward computation is decoupled from the training loop and runs as a separate set of microservices \([Figure11](https://arxiv.org/html/2605.31268#S5.F11)\)\. The trainer’s environment workers issue HTTP calls into a verification gateway, which routes each request to the appropriate backend based on the verifier type carried with each prompt\. This decoupling lets us run the entire verification stack on a separate cluster, so it never competes for GPUs or memory with the trainer and the generation engines, and it makes scaling and monitoring each backend independent of the training job\. Backends used duringMellum 2RL include a code execution sandbox for unit\-test based rewards on code, a math answer verifier that performs symbolic and numeric comparison, an LLM\-as\-a\-Judge service for grading free\-form outputs, and a number of other environments that back the remaining tasks\. Some of those other environments need extra state, for example session management for stateful tool conversations, so they sit behind their own dedicated services\. The gateway distinguishes between two kinds of failures during a verification call: the model’s response was un\-scoreable, or the verifier itself was transiently unavailable\. We keep these separate so the trainer sees a clean reward signal: un\-scoreable responses produce a zero reward and the model is shown the error string on its next rollout, while infrastructure failures are retried\.
Trainer env workersVerification gatewayCode ExecutionMath AnswerVerificationLLM as a JudgeOtherHTTPFigure 11:Verification stack\.
#### 5\.2\.2Data
We build two RL data mixes, one per stage\. Each is assembled from a combination of public RLVR releases and a small set of our own additions, organized into six capability domains: code, math, agentic tool use, instruction following, reasoning, and knowledge\. Both mixes total roughly 260,000 training prompts and 3,600 validation prompts, and[Table7](https://arxiv.org/html/2605.31268#S5.T7)summarises the per\-domain breakdown\. The two mixes share most sources and are roughly the same size; the only meaningful difference is that the Thinking mix replaces part of the pure\-math share with a difficulty\-filtered long\-form math subset, making it the harder mix overall\.
Table 7:RL data mix composition by capability domain, in number of training prompts and share of total\.##### Code\.
The code domain combines three sources\. We use a dataset with competitive programming problems and tests\[[undefax](https://arxiv.org/html/2605.31268#bib.bibx51)\]\. We also use a public math\-with\-code dataset\[[undefax](https://arxiv.org/html/2605.31268#bib.bibx51)\], which pairs a hard math prompt with a Jupyter\-style Python execution tool: the model generates Python code, reads back the tool’s stdout, and emits a final answer \(this dataset is also counted under Math in[Table7](https://arxiv.org/html/2605.31268#S5.T7)\)\. On top of these two public sources, we add our own collection of realistic multi\-language coding tasks covering twelve target languages \(Python, Java, PHP, TypeScript, C\#, JavaScript, JSX, Rust, Kotlin, Go, C\+\+, and CSS\) and grouped by the kind of work the model has to do: greenfield implementation, debugging from a stack trace, test generation, behaviour modification, filesystem and API integration, and security hardening\. Each task in this collection ships with a test suite, and the fraction of passing tests defines the reward signal\.
##### Math\.
Math is the largest single block in both mixes \(60,000 prompts / 23% in Instruct, 72,000 prompts / 28% in Thinking\) and is built from three complementary styles\. The first is pure math with no tools, where the model must do the work in its own context and emit a final answer that a strict\-match verifier compares against the ground truth\. For the Instruct mix we take this subset from the math portion of OLMo\-3’s Instruct RL release\[[undefaar](https://arxiv.org/html/2605.31268#bib.bibx71)\]; for the Thinking mix we swap in the math portion of OLMo\-3’s Thinking RL release\[[undefaar](https://arxiv.org/html/2605.31268#bib.bibx71)\], which is harder than its Instruct counterpart and the primary contributor to its difficulty\. The second style is math with calculator tools, taken from Nemotron’s math\-advanced\-calculations release\[[undefax](https://arxiv.org/html/2605.31268#bib.bibx51)\], where the model issues calculator\-tool calls and folds the returned values into its answer\. The third style is math with code execution, the math\-with\-code dataset already described under Code, where the model uses the Python execution tool to compute intermediate quantities\. The three styles together cover the main ways the deployed model attacks hard math problems at inference time\.
##### Agentic tool use\.
The math subsets already exercise the tool\-use channel, since both the calculator\-tool dataset and math\-with\-code involve issuing tool calls and reading back their results\. On top of that we add two dedicated agentic sources\. The first is xLAM\-style function\-calling RLVR data\[[undefax](https://arxiv.org/html/2605.31268#bib.bibx51)\], where the model picks and parameterises a tool from an OpenAI\-format tool registry in a single step\. The second is a stateful workplace\-assistant benchmark\[[undefax](https://arxiv.org/html/2605.31268#bib.bibx51)\]in which the model uses an evolving set of personal\-assistant tools \(calendar, email, customer\-relations, project\-management, and analytics queries\) inside a session\-managed environment; the verifier replays the resulting trajectory against a ground\-truth state to score it\. These two sources account for 14% of the Instruct mix and 12% of the Thinking mix\.
##### Instruction following\.
The instruction\-following block exercises format adherence and rule\-based constraints\. We include a generic verifiable IF dataset graded by machine\-checkable instructions, a structured\-output dataset graded by JSON\-schema validation, and a small calendar\-scheduling agent, all from Nemotron’s public RLVR release\[[undefax](https://arxiv.org/html/2605.31268#bib.bibx51)\]\. Together they contribute 19% of the Instruct mix and 21% of the Thinking mix\.
##### Reasoning\.
We include a large slice ofreasoning\-gym\[[undefaam](https://arxiv.org/html/2605.31268#bib.bibx66)\], a public library of roughly a hundred procedurally generated reasoning tasks \(logic puzzles, sequence completion, spatial reasoning, simple games\) each with its own task\-specific verifier\.reasoning\-gymkeeps the mix’s reasoning footprint broad without committing to any single benchmark format and contributes about 13% to both mixes\.
##### Knowledge\.
A multi\-domain MCQA pool covers physics, biology, mathematics, humanities, computer science, engineering, chemistry, and several other subjects\. It is the smallest domain in both mixes \(9% of Instruct, 4% of Thinking\) and is intentionally downsampled because we have observed that excessive MCQA exposure can hurt instruction\-following quality\.
#### 5\.2\.3RL algorithm
Both stages train the policy with a variant of GRPO\[[undefaai](https://arxiv.org/html/2605.31268#bib.bibx62)\]adapted for asynchronous rollouts and equipped with stability mechanisms that handle the train↔\\leftrightarrowinference mismatch we see on BF16 \+ MoE policies\.
##### GRPO loss\.
We use the GRPO recipe with the modifications that have become standard across recent open RL systems\. The loss is token\-level: every valid generated token contributes equally to the gradient, as recommended by DAPO\[[undefaaw](https://arxiv.org/html/2605.31268#bib.bibx76)\]and Dr\. GRPO\[[undefas](https://arxiv.org/html/2605.31268#bib.bibx46)\]\. Advantages are computed per prompt group with a leave\-one\-out baseline and*without*standard\-deviation normalization, again following Dr\. GRPO\. We sampleGGresponses per prompt, oversample by roughly1\.5×1\.5\\times, and discard prompt groups whose within\-group reward variance is zero, an approximate version of the dynamic\-sampling step from DAPO\. The PPO surrogate uses an asymmetric clip range\[1−ϵlow,1\+ϵhigh\]\[1\-\\epsilon\_\{\\text\{low\}\},\\,1\+\\epsilon\_\{\\text\{high\}\}\], the “clip\-higher” setting introduced by DAPO, which lets positive\-advantage updates flow more freely than negative ones\. We do not anchor the policy to the SFT reference with a KL term; recent large\-scale open RL systems have converged on omitting this term\[[undefaar](https://arxiv.org/html/2605.31268#bib.bibx71),[undefaau](https://arxiv.org/html/2605.31268#bib.bibx74),[undefaz](https://arxiv.org/html/2605.31268#bib.bibx53),[undefaaw](https://arxiv.org/html/2605.31268#bib.bibx76)\]\.
##### Asynchronous rollouts\.
Rollouts and gradient updates run on different GPUs \([Section5\.2\.1](https://arxiv.org/html/2605.31268#S5.SS2.SSS1)\); the trainer pulls a batch from a continuously\-filling trajectory buffer rather than waiting for generation\. Trajectory staleness is bounded so that a rollout’s tokens are at most two training steps older than the policy used in the gradient update\.
##### Train versus inference importance sampling\.
Even when the inference policy and the trainer’s recomputed policy are nominally the same model, the two forward passes can disagree on per\-token log\-probabilities\. The principal source of this non\-determinism in an MoE policy is the router itself: for the same hidden state, the inference\-time router may dispatch a token to a different expert than the trainer\-side router, and the resulting logits and log\-probabilities differ even though the weights are identical\. BF16 numerical stability contributes additional noise\. We track this disparity through the train\-versus\-inference ratio:
ρt=πtrain\(yt∣y<t;θold\)πinfer\(yt∣y<t;θold\),\\rho\_\{t\}\\;=\\;\\frac\{\\pi\_\{\\text\{train\}\}\(y\_\{t\}\\mid y\_\{<t\};\\,\\theta\_\{\\text\{old\}\}\)\}\{\\pi\_\{\\text\{infer\}\}\(y\_\{t\}\\mid y\_\{<t\};\\,\\theta\_\{\\text\{old\}\}\)\},which is not exactly11even before any gradient update\. Left unbounded in the loss,ρt\\rho\_\{t\}would let a small number of drifted tokens dominate the gradient\. This is distinct from the standard PPO ratio between the current and pre\-step training policies introduced below; PPO clipping handles the latter, IcePop handlesρt\\rho\_\{t\}\.
We use per\-token IcePop truncation\[[undefao](https://arxiv.org/html/2605.31268#bib.bibx42)\]to guard against this\. For each generated token we keep its contribution to the loss only whenρt∈\[α,β\]\\rho\_\{t\}\\in\[\\alpha,\\beta\]; the contribution is set to zero outside the band\. Unlike the PPO clip, which caps an out\-of\-band ratio at the clip edge, IcePop drops the token entirely\. This is the safer default when the cause of a largeρt\\rho\_\{t\}is an expert flip rather than a real on\-policy update we want to apply\.
Putting the pieces together, the per\-step loss minimised by the trainer is
Ai\\displaystyle A\_\{i\}=Ri−1G−1∑j≠iRj,\\displaystyle=\\;R\_\{i\}\\;\-\\;\\frac\{1\}\{G\-1\}\\sum\_\{j\\neq i\}R\_\{j\},ri,t\\displaystyle r\_\{i,t\}=πtrain\(yi,t∣yi,<t;θ\)πtrain\(yi,t∣yi,<t;θold\),ρi,t=πtrain\(yi,t∣yi,<t;θold\)πinfer\(yi,t∣yi,<t;θold\),\\displaystyle=\\;\\frac\{\\pi\_\{\\text\{train\}\}\(y\_\{i,t\}\\mid y\_\{i,<t\};\\,\\theta\)\}\{\\pi\_\{\\text\{train\}\}\(y\_\{i,t\}\\mid y\_\{i,<t\};\\,\\theta\_\{\\text\{old\}\}\)\},\\qquad\\rho\_\{i,t\}\\;=\\;\\frac\{\\pi\_\{\\text\{train\}\}\(y\_\{i,t\}\\mid y\_\{i,<t\};\\,\\theta\_\{\\text\{old\}\}\)\}\{\\pi\_\{\\text\{infer\}\}\(y\_\{i,t\}\\mid y\_\{i,<t\};\\,\\theta\_\{\\text\{old\}\}\)\},M\(ρ\)\\displaystyle M\(\\rho\)=\{ρifα≤ρ≤β,0otherwise,\\displaystyle=\\;ℒGRPO\\displaystyle\\mathcal\{L\}\_\{\\text\{GRPO\}\}=−1Ntok∑i,tM\(ρi,t\)min\(ri,tAi,clip\(ri,t,1−ϵlow,1\+ϵhigh\)Ai\),\\displaystyle=\\;\-\\,\\frac\{1\}\{N\_\{\\text\{tok\}\}\}\\,\\sum\_\{i,t\}M\(\\rho\_\{i,t\}\)\\,\\min\\\!\\Big\(r\_\{i,t\}\\,A\_\{i\},\\;\\mathrm\{clip\}\\\!\\big\(r\_\{i,t\},1\-\\epsilon\_\{\\text\{low\}\},1\+\\epsilon\_\{\\text\{high\}\}\\big\)\\,A\_\{i\}\\Big\),whereri,tr\_\{i,t\}is the standard PPO ratio between the trainer’s current and pre\-step policies,ρi,t\\rho\_\{i,t\}is the train\-versus\-inference disparity that IcePop calibrates,GGis the number of responses per prompt, andNtokN\_\{\\text\{tok\}\}is the total number of valid generated tokens in the batch\. The four choices that distinguish this recipe from textbook GRPO are visible in the formula:
1. 1\.a leave\-one\-out baseline without standard\-deviation normalization;
2. 2\.the IcePop calibrationM\(ρi,t\)M\(\\rho\_\{i,t\}\)that zeroes the contribution of any token whose train\-versus\-inference ratio falls outside\[α,β\]\[\\alpha,\\beta\];
3. 3\.token\-level normalization by the total valid\-token count;
4. 4\.the asymmetric clip\-higher rangeϵlow<ϵhigh\\epsilon\_\{\\text\{low\}\}<\\epsilon\_\{\\text\{high\}\}\.
##### Reward shaping\.
We add two reward\-shaping rules on top of the verifier’s raw score\.
The first is the soft overlong penalty from DAPO\[[undefaaw](https://arxiv.org/html/2605.31268#bib.bibx76)\]\. Rewards inside a buffer region just below the maximum response length interpolate linearly between the raw score at the buffer’s lower edge and a configured floor at the length cap; rollouts that exceed the cap are dropped from the loss entirely, also following DAPO\. This avoids training on samples that simply ran out of budget while preserving the gradient signal on shorter samples\.
The second is a concision penalty applied selectively to non\-thinking responses\. During an early Instruct run we observed that the policy began producing inline reasoning without the<think\>delimiters used by the Thinking variant, contradicting the deployment contract of a brief Instruct model\. Late\-training math rollouts looked like the following:
\{rolloutquoteframe\}
\[…\]But wait, I recall that in some similar problems, the answer is more than33\.Wait, let me check online or think again\.Wait, perhaps I missed a case\.Wait, what if the number is of the formp4q2p^\{4\}q^\{2\}, but with the same prime? No, then it would bep6p^\{6\}, which has77divisors, not1515\. So no\.Wait, but let’s checkn=144n=144,400400,324324, all less than500500\. \[…\]
Models tend to mark such reasoning with a fairly stable lexicon of trigger words \(*wait*,*actually*,*hmm*,*let me think*, and similar markers\); we follow the ARLCP\-style penalty of\[[undefb](https://arxiv.org/html/2605.31268#bib.bibx3)\]and multiplicatively shrink the reward on correct rollouts in proportion to the number of trigger words present in the response\. The multiplier is bucketed into three tiers of increasing strength as the trigger count grows, and we apply the penalty only on tasks where the lexicon is not legitimately part of the output, so that thinking\-mode responses on math and reasoning tasks are not penalised\. The penalty drives the leakage down sharply at the population level: in math rollouts sampled near the end of training, the average rollout in the no\-concision run carried7\.37\.3reflection\-trigger words \(0\.750\.75per10001000characters of response\), against0\.60\.6\(0\.210\.21per10001000characters\) in the production Instruct run with the penalty enabled\.
#### 5\.2\.4Training Setup
Both stages share the optimizer recipe and overall training loop\. The trainer uses distributed AdamW with peak learning rate1×10−61\\\!\\times\\\!10^\{\-6\}, decaying to1×10−71\\\!\\times\\\!10^\{\-7\}, with a linear warmup over the first 50 iterations and a constant schedule for the remainder of the run\. We keep the BF16 / FP8 hybrid precision recipe from pre\-training \([Section3\.4\.4](https://arxiv.org/html/2605.31268#S3.SS4.SSS4)\), and clip gradients at norm1\.01\.0\.[Table8](https://arxiv.org/html/2605.31268#S5.T8)lists the per\-stage hyperparameters; the dominant differences between the two runs are the sequence budget and the number of training steps\.
Table 8:Per\-stage RL hyperparameters\. Shared rows apply to both runs; rows below the rule differ between Instruct and Thinking\.##### Instruct\.
The Instruct stage starts from the SFT\-Instruct checkpoint \([Section5\.1](https://arxiv.org/html/2605.31268#S5.SS1)\) and trains on the Instruct data mix \([Table7](https://arxiv.org/html/2605.31268#S5.T7)\) for 500 steps\. The shorter response budget allows two rollouts per trainer micro\-batch and a maximum total sequence length of 16,384 tokens\.[Figure12](https://arxiv.org/html/2605.31268#S5.F12)shows the train and validation accuracy curves for this run\.
Figure 12:Training and validation accuracy \(macro\-averaged across tasks\) for the Instruct RL run\. The smoothed train curve is shown in black with the raw per\-step values rasterised underneath; validation is sampled every 50 steps\.
##### Thinking\.
The Thinking stage is a cold restart from the SFT\-Thinking checkpoint \([Section5\.1](https://arxiv.org/html/2605.31268#S5.SS1)\) and trains on the Thinking data mix \([Table7](https://arxiv.org/html/2605.31268#S5.T7)\) for 100 steps\. To accommodate long chains of thought we lift the maximum total sequence length to 40,960 tokens, which forces the trainer’s micro\-batch size down to one\.
### 5\.3Post\-Training Evaluation
We evaluate post\-trained variants ofMellum 2against a panel of open\-weight models in the 4B–14B range: Qwen3\.5\-4B and Qwen3\.5\-9B\[[undefaau](https://arxiv.org/html/2605.31268#bib.bibx74)\], OLMo\-3\-7B\[[undefaar](https://arxiv.org/html/2605.31268#bib.bibx71)\], Ministral\-3\-14B\[[undefap](https://arxiv.org/html/2605.31268#bib.bibx43)\], and Seed\-Coder\-8B\[[undeff](https://arxiv.org/html/2605.31268#bib.bibx7)\]\. We report two tables: one comparing the*instruct*\(no\-thinking\) variants in[Table9](https://arxiv.org/html/2605.31268#S5.T9), and one comparing the*thinking*/reasoning variants in[Table10](https://arxiv.org/html/2605.31268#S5.T10)\.
The post\-training evaluation suite covers seven capability areas:
- •Coding:LiveCodeBench v6\[[undefae](https://arxiv.org/html/2605.31268#bib.bibx32)\]\(scored over all v1\-6 cohorts\), EvalPlus \(the average of HumanEval\+ and MBPP\+\)\[[undefaq](https://arxiv.org/html/2605.31268#bib.bibx44)\], and MultiPL\-E\[[undefg](https://arxiv.org/html/2605.31268#bib.bibx8)\]\(restricted to 7 of the 18 languages in the original suite:C\+\+,Java,PHP,TypeScript,C\#,Shell,JavaScript\)\.
- •Tool Use:BFCL v3 focuses onmulti\-turnfunction\-calling, and v4 extends this withagenticcapabilities consisting of web\-search and memory tools\[[undefaaa](https://arxiv.org/html/2605.31268#bib.bibx54)\]\.
- •Math:AIME \(average of AIME 2025 and 2026, 30 questions each\) and GSM\-Plus\[[undefal](https://arxiv.org/html/2605.31268#bib.bibx39)\]\.
- •Knowledge:MMLU\-Redux\[[undefs](https://arxiv.org/html/2605.31268#bib.bibx20)\]and GPQA Diamond\[[undefaaf](https://arxiv.org/html/2605.31268#bib.bibx59)\]\.
- •Conversational:IFEval\[[undefaaz](https://arxiv.org/html/2605.31268#bib.bibx79)\]\(prompt\-level strict accuracy\), MixEval\[[undefaw](https://arxiv.org/html/2605.31268#bib.bibx50)\], BS\-Bench \(false premise detection rate\), and a JetBrains internal pairwise win rate against Qwen2\.5\-7B\-Instruct\.
- •Safety:HarmBench\[[undefau](https://arxiv.org/html/2605.31268#bib.bibx48)\]\(harmful rate, lower is better\) and XSTest\[[undefaag](https://arxiv.org/html/2605.31268#bib.bibx60)\]\(safe compliance rate\)\.
LLM\-as\-a\-Judge benchmarks \(BS\-Bench, JetBrains pairwise, HarmBench, and XSTest\) use GPT\-5\.2 as a judge model\. All benchmarks run at 0\.0 temperature, except for BFCL at 0\.01 and LiveCodeBench at 0\.2\. All models use greedy decoding\.
Table 9:Post\-training evaluation,instruct \(no\-thinking\)variants\. All values are percentages; higher is better except HarmBench \(lower is better\)\. EvalPlus is the average of HumanEval\+ and MBPP\+\. AIME is the average of AIME 2025 and AIME 2026 \(30 questions each\)\. BFCL v4 is the macro\-average of its five subtasks \(v1, v2, v3, web search, memory\)\. JetBrains internal scores are pairwise win rates against Qwen2\.5\-7B\-Instruct\. Em\-dashes \(—\) indicate lacking native tool calling for Seed\-Coder\-8B\.Table 10:Post\-training evaluation,thinking/reasoning variants\. Same metric conventions as[Table9](https://arxiv.org/html/2605.31268#S5.T9)\. OLMo\-3\-7B\-Thinking does not support native tool calling\.##### Overall profile\.
The seven capability areas reveal a consistent picture:Mellum 2is strongest where the domain aligns with our training mix \(function\-level code synthesis and JetBrains\-style developer interaction\), competitive on tool use and math once RL is applied, and weakest on broad world knowledge\. With only 2\.5B active parameters drawn from a 12B MoE backbone, the model is competing against dense baselines that range from 4B \(Qwen3\.5\-4B\) to 14B \(Ministral\-3\-14B\); we contextualize the results in that light below\.
##### Coding\.
The three coding benchmarks measure different abilities and the results separate cleanly\.EvalPlus– the augmented HumanEval\+/MBPP\+ pair that probes robust function\-level synthesis – is led byMellum 2\-RL at 78\.4%, ahead of every baseline including Qwen3\.5\-9B \(71\.8\) and the code\-specialized Seed\-Coder\-8B \(73\.8\)\. This is the regime our pre\-training mix targets directly\.LiveCodeBench v6, by contrast, draws on contamination\-resistant competitive\-programming problems that demand multi\-step algorithmic reasoning over relatively few tokens; the instruct variant lags the Qwen3\.5 series \(37\.2 vs\. 51\.0 / 63\.7\) but matches or beats the other 7–14B baselines\. The gap closes dramatically in the thinking configuration:Mellum 2\-SFT\-Thinking reaches 75\.1, the top score in our panel and 6\.8 points ahead of Qwen3\.5\-9B\-Thinking\. We read this as evidence that algorithmic reasoning is in the model’s reach but requires an explicit thinking budget to be unlocked, whereas function synthesis transfers from pre\-training without one\.MultiPL\-E, restricted here to seven of the eighteen native languages, is mid\-pack: Seed\-Coder\-8B \(77\.0\) and Ministral\-3\-14B \(71\.5\) edge ahead on cross\-lingual breadth\.
##### Tool use, math, and reasoning\.
RL is where the largest single\-step jumps appear\. BFCL v3 climbs from 43\.1 to 66\.3 \(instruct\) and 60\.5 to 69\.4 \(thinking\), with the thinking variant overtaking Qwen3\.5\-9B\-Thinking \(68\.5\)\. On BFCL v4, which adds agentic web\-search and memory subtasks,Mellum 2\-RL\-Thinking leads the panel at 45\.6, against 42\.9 / 42\.7 for the Qwen3\.5 family — a sign that our function\-calling RL recipe transfers usefully to held\-out agentic settings\. Math follows a similar arc: AIME goes from 29\.9 \(SFT instruct\) to 41\.7 \(RL instruct\) and from 20\.0 to 58\.4 in thinking mode\. The SFT\-Thinking AIME score is below its SFT\-instruct counterpart, a quirk we attribute to the thinking head requiring RL\-stage exposure to mathematical reasoning before its reasoning trace is well\-calibrated for that task family\. GSM\-Plus reaches 87\.0 in RL\-Thinking, within a few points of Qwen3\.5\-9B\-Thinking \(90\.7\)\.
##### Knowledge: the principal weakness\.
MMLU\-Redux and GPQA Diamond are the area where the Qwen3\.5 series is dominant: 91\.1 / 79\.8 at 9B against our 78\.1 / 40\.9 \(instruct\) and 86\.2 / 57\.6 \(thinking\)\. GPQA in particular — graduate\-level science QA — is essentially a probe of factual depth outside computer science, and the gap reflects a deliberate tradeoff in our training mix toward code and developer documentation rather than broad encyclopedic coverage\. For a code\-assistant model this profile is acceptable, but it bounds the off\-domain use ofMellum 2and is worth surfacing explicitly to deployers\.
##### Conversational: JetBrains\-relative leadership, generic mid\-pack\.
On the internal JetBrains pairwise win\-rate against Qwen2\.5\-7B\-Instruct,Mellum 2\-RL\-Thinking leads the panel at 69\.5%, above both Ministral\-3\-14B\-Thinking \(63\.8\) and Qwen3\.5\-9B\-Thinking \(56\.7\), while on the generic conversational benchmarks \(IFEval, MixEval\) the model sits in the middle of the pack\. The asymmetry is informative: the pairwise judge sees code\-aware, developer\-flavored prompts where domain familiarity pays off, whereas the generic benchmarks reward broad\-coverage chat behavior that benefits from the Qwen3\.5 post\-training mix\. BS\-Bench is the conversational outlier:Mellum 2scores 14–24 against 56–70 for the Qwen3\.5 series\. This benchmark rewards push\-back against false premises rather than helpful task completion; the gap suggests our SFT/RL signal leans toward compliance, and we leave tightening this trade\-off for future iterations\.
##### Safety\.
On HarmBench \(lower is better\),Mellum 2\-SFT is the safest model in the instruct table at 8\.4%, with Ministral\-3\-14B \(56\.5\) and Seed\-Coder\-8B \(40\.0\) substantially worse\. The RL variant regresses to 23\.1%, consistent with the well\-documented tendency of preference\-optimization stages to relax some refusal behaviors; this is a known alignment tax in our RL recipe and a target for future iterations\. On XSTest,Mellum 2trails the largest baselines by roughly ten points, indicating that a subset of safe prompts are over\-refused; we view this as the symmetric counterpart to the HarmBench regression and an item for joint optimization in subsequent releases\.
## 6Efficiency and Deployment
Practical deployment in latency\-sensitive IDE environments is a core design goal ofMellum 2\. The architecture was designed from the outset to match or exceed the inference speed of Qwen2\.5\-7B\[[undefaae](https://arxiv.org/html/2605.31268#bib.bibx58)\]\.
We built a dedicated inference benchmarking pipeline with fixed hardware, software dependencies, and Docker containers to ensure reproducibility across all architectural candidates\. Benchmarks use representative input/output sizes from production code completion workloads \(mean input length of 2,304 tokens, mean output length of 256 tokens\) and evaluate in two regimes:*sync mode*, which measures sequential single\-request latency, and*throughput mode*, which measures sustained tokens/s under concurrent high\-load requests\. Throughput mode uses no fixed request rate: the client issues requests back\-to\-back to keep the server saturated, and the sustained rates we measure are 20\.2 req/s forMellum 2, 16\.7 req/s for Qwen2\.5\-7B, and 11\.3 req/s for Qwen3\-8B\. All measurements use a single H100 GPU \(80 GB\) with vLLM\[[undefai](https://arxiv.org/html/2605.31268#bib.bibx36)\]serving and dynamic FP8 model quantization on a host with 192 GB of system RAM and 48 CPU cores\.
[Figure13](https://arxiv.org/html/2605.31268#S6.F13)comparesMellum 2against two dense baselines, Qwen2\.5\-7B\[[undefaae](https://arxiv.org/html/2605.31268#bib.bibx58)\]and Qwen3\-8B\[[undefaau](https://arxiv.org/html/2605.31268#bib.bibx74)\]\. In sync mode,Mellum 2matches the 193 tokens/s of Qwen2\.5\-7B—the architectural target set in[Section2\.1](https://arxiv.org/html/2605.31268#S2.SS1)—to within a single token\. In throughput mode, it pulls 21% ahead of Qwen2\.5\-7B and 79% ahead of Qwen3\-8B\.
Figure 13:Output tokens/s on a single H100, vLLM FP8 serving, at the benchmark workload shape \(ISL/OSL = 2,304/256\)\.Mellum 2matches the sync latency of Qwen2\.5\-7B while delivering 21% higher sustained throughput\.
## 7Conclusion
We have presentedMellum 2, an open\-weight 12B\-parameter Mixture\-of\-Experts model with 2\.5B active parameters, released as matched*Instruct*and*Thinking*variants under the Apache 2\.0 license\. As the general\-purpose successor to the 4B denseMellumcompletion model, it is built to generate and edit code, reason through engineering tasks, call tools, and drive agentic workflows inside the IDE at a per\-token cost that is practical to deploy at scale\.
Every architectural decision, including MoE versus dense, 8\-of\-64 expert sparsity, 4\-KV\-head GQA, the 3:1 Sliding Window Attention pattern, and the single MTP head, was selected by ablation under a fixed inference budget: matching the single\-H100 speed of Qwen2\.5\-7B\. The resulting model meets that target in single\-request decoding \(192 vs\. 193 tokens/s\) and exceeds it by 21 % under concurrent serving \(5,179 tokens/s\)\. On top of this, we ran a three\-phase pre\-training curriculum on∼\{\\sim\}10\.65T tokens with a Muon \+ FP8\-hybrid stack, extended context to 131,072 tokens via layer\-selective YaRN, and applied a two\-stage post\-training pipeline \(SFT followed by RLVR on math and executable coding\)\. Across code, math, tool use, knowledge, conversational, and safety benchmarks,Mellum 2is competitive with open\-weight baselines in the 4–14B range while running at the per\-token compute of a 2\.5B dense model\.
Natural directions to explore from here include:
1. 1\.pushingMellum 2further into SWE RL—training directly on repository\-level software\-engineering tasks and toward competitive small SWE agents;
2. 2\.broader scaling of RL infrastructure and environment coverage;
3. 3\.revisiting the long\-context mid\-training mix\.
Looking further out, the same recipe of selecting architecture by ablation against a fixed inference budget also opens the door to a larger, similarly inference\-aware Mellum\.
We release the base, instruct, and thinking checkpoints together with this report, with the aim of giving the community both an open recipe and an inference\-aware design point for small\-MoE coding models\.
## References
- \[undef\]Joshua Ainslie et al\.“GQA: Training Generalized Multi\-Query Transformer Models from Multi\-Head Checkpoints”In*arXiv preprint arXiv:2305\.13245*, 2023
- \[undefa\]Loubna Ben Allal and Anton Lozhkov“SmolLM2: When Smol Goes Big – Data\-Centric Training of a Small Language Model”In*arXiv preprint arXiv:2502\.02737*, 2025
- \[undefb\]undef Anonymous“Stop Unnecessary Reflection: ARLCP for Concision\-Aware Reward Shaping in Reasoning Models”In*arXiv preprint arXiv:2602\.12113*, 2026
- \[undefc\]Jacob Austin et al\.“Program Synthesis with Large Language Models”In*arXiv preprint arXiv:2108\.07732*, 2021
- \[undefd\]Mohammad Bavarian, Heewoo Jun and Nikolas Tezak“Efficient Training of Language Models to Fill in the Middle”In*arXiv preprint arXiv:2207\.14255*, 2022
- \[undefe\]Iz Beltagy, Matthew E Peters and Arman Cohan“Longformer: The Long\-Document Transformer”In*arXiv preprint arXiv:2004\.05150*, 2020
- \[undeff\]undef ByteDance Seed et al\.“Seed\-Coder: Let the Code Model Curate Data for Itself”In*arXiv preprint arXiv:2506\.03524*, 2025
- \[undefg\]Federico Cassano et al\.“MultiPL\-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation”In*IEEE Transactions on Software Engineering*49\.7, 2023, pp\. 3675–3691
- \[undefh\]Mark Chen, Jerry Tworek and Heewoo Jun“Evaluating Large Language Models Trained on Code”In*arXiv preprint arXiv:2107\.03374*, 2021
- \[undefi\]Aidan Clark et al\.“Unified Scaling Laws for Routed Language Models”In*arXiv preprint arXiv:2202\.01169*, 2022
- \[undefj\]Peter Clark et al\.“Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge”In*arXiv preprint arXiv:1803\.05457*, 2018
- \[undefk\]Karl Cobbe et al\.“Training Verifiers to Solve Math Word Problems”In*arXiv preprint arXiv:2110\.14168*, 2021
- \[undefl\]undef Codefuse et al\.“Every Sample Matters: Leveraging Mixture\-of\-Experts and High\-Quality Data for Efficient and Accurate Code LLM”In*arXiv preprint arXiv:2503\.17793*, 2025
- \[undefm\]Damai Dai, Chengqi Deng and Chenggang Zhao“DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture\-of\-Experts Language Models”In*arXiv preprint arXiv:2401\.06066*, 2024
- \[undefn\]undef DeepSeek\-AI“DeepSeek\-V2: A Strong, Economical, and Efficient Mixture\-of\-Experts Language Model”In*arXiv preprint arXiv:2405\.04434*, 2024
- \[undefo\]undef DeepSeek\-AI“DeepSeek\-V3 Technical Report”In*arXiv preprint arXiv:2412\.19437*, 2025
- \[undefp\]Hantian Ding et al\.“Fewer Truncations Improve Language Modeling”In*Proceedings of the 41st International Conference on Machine Learning \(ICML\)*, 2024arXiv:[2404\.10830 \[cs\.CL\]](https://arxiv.org/abs/2404.10830)
- \[undefq\]William Fedus, Barret Zoph and Noam Shazeer“Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”In*Journal of Machine Learning Research*23\.120, 2022, pp\. 1–40
- \[undefr\]Trevor Gale, Deepak Narayanan, Cliff Young and Matei Zaharia“MegaBlocks: Efficient Sparse Training with Mixture\-of\-Experts”In*Proceedings of the Sixth Conference on Machine Learning and Systems \(MLSys\)*, 2023
- \[undefs\]Aryo Pradipta Gema et al\.“Are We Done with MMLU?”In*arXiv preprint arXiv:2406\.04127*, 2024
- \[undeft\]undef Gemma Team“Gemma 3 Technical Report”In*arXiv preprint arXiv:2503\.19786*, 2025
- \[undefu\]Fabian Gloeckle et al\.“Better & Faster Large Language Models via Multi\-token Prediction”In*arXiv preprint arXiv:2404\.19737*, 2024
- \[undefv\]Aaron Grattafiori and Abhimanyu Dubey“The Llama 3 Herd of Models”In*arXiv preprint arXiv:2407\.21783*, 2024
- \[undefw\]Alex Gu et al\.“CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution”In*arXiv preprint arXiv:2401\.03065*, 2024
- \[undefx\]Alexander Hägele et al\.“Scaling Laws and Compute\-Optimal Training Beyond Fixed Training Durations”In*arXiv preprint arXiv:2405\.18392*, 2024
- \[undefy\]Dan Hendrycks et al\.“Measuring Massive Multitask Language Understanding”In*arXiv preprint arXiv:2009\.03300*, 2021
- \[undefz\]Dan Hendrycks et al\.“Measuring Mathematical Problem Solving With the MATH Dataset”In*arXiv preprint arXiv:2103\.03874*, 2021
- \[undefaa\]Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar and Yuxuan Chen“Query\-Key Normalization for Transformers”In*Findings of the Association for Computational Linguistics: EMNLP 2020*Association for Computational Linguistics, 2020, pp\. 4246–4253
- \[undefab\]Cheng\-Ping Hsieh et al\.“RULER: What’s the Real Context Size of Your Long\-Context Language Models?”In*arXiv preprint arXiv:2404\.06654*, 2024
- \[undefac\]Shengding Hu, Yuge Tu and Xu Han“MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies”In*arXiv preprint arXiv:2404\.06395*, 2024
- \[undefad\]Binyuan Hui, Jian Yang and Zeyu Cui“Qwen2\.5\-Coder Technical Report”In*arXiv preprint arXiv:2409\.12186*, 2024
- \[undefae\]Naman Jain et al\.“LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code”In*arXiv preprint arXiv:2403\.07974*, 2024
- \[undefaf\]Albert Q Jiang, Alexandre Sablayrolles and Arthur Mensch“Mistral 7B”In*arXiv preprint arXiv:2310\.06825*, 2023
- \[undefag\]Keller Jordan et al\.“Muon: An optimizer for hidden layers in neural networks”,[https://kellerjordan\.github\.io/posts/muon/](https://kellerjordan.github.io/posts/muon/), 2024
- \[undefah\]Jakub Krajewski et al\.“Scaling Laws for Fine\-Grained Mixture of Experts”In*arXiv preprint arXiv:2402\.07871*, 2024
- \[undefai\]Woosuk Kwon et al\.“Efficient Memory Management for Large Language Model Serving with PagedAttention”In*Proceedings of the 29th Symposium on Operating Systems Principles \(SOSP\)*ACM, 2023, pp\. 611–626
- \[undefaj\]Katherine Lee et al\.“Deduplicating Training Data Makes Language Models Better”In*Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*Association for Computational Linguistics, 2022, pp\. 8424–8445
- \[undefak\]Yaniv Leviathan, Matan Kalman and Yossi Matias“Fast Inference from Transformers via Speculative Decoding”, 2023arXiv:[https://arxiv\.org/abs/2211\.17192](https://arxiv.org/abs/2211.17192)
- \[undefal\]Qintong Li et al\.“GSM\-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers”In*arXiv preprint arXiv:2402\.19255*, 2024
- \[undefam\]Raymond Li“StarCoder: May the Source Be with You\!”In*arXiv preprint arXiv:2305\.06161*, 2023
- \[undefan\]Stephanie Lin, Jacob Hilton and Owain Evans“TruthfulQA: Measuring How Models Mimic Human Falsehoods”In*Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*Association for Computational Linguistics, 2022, pp\. 3214–3252
- \[undefao\]undef Ling Team“Ring\-1T Technical Report”In*arXiv preprint arXiv:2510\.18855*, 2025
- \[undefap\]Alexander H\. Liu“Ministral 3”In*arXiv preprint arXiv:2601\.08584*, 2026
- \[undefaq\]Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang and Lingming Zhang“Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation”In*arXiv preprint arXiv:2305\.01210*, 2023
- \[undefar\]Jingyuan Liu, Jianlin Su and Xingcheng Yao“Muon is Scalable for LLM Training”In*arXiv preprint arXiv:2502\.16982*, 2025
- \[undefas\]Zichen Liu et al\.“Understanding R1\-Zero\-Like Training: A Critical Perspective”In*Conference on Language Modeling \(COLM\)*, 2025
- \[undefat\]Ilya Loshchilov and Frank Hutter“Decoupled Weight Decay Regularization”In*International Conference on Learning Representations \(ICLR\)*, 2019URL:[https://openreview\.net/forum?id=Bkg6RiCqY7](https://openreview.net/forum?id=Bkg6RiCqY7)
- \[undefau\]Mantas Mazeika et al\.“HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal”In*arXiv preprint arXiv:2402\.04249*, 2024
- \[undefav\]Paulius Micikevicius, Dusan Stosic and Neil Burgess“FP8 Formats for Deep Learning”In*arXiv preprint arXiv:2209\.05433*, 2022
- \[undefaw\]Jinjie Ni et al\.“MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures”In*arXiv preprint arXiv:2406\.06565*, 2024
- \[undefax\]undef NVIDIA“NeMo Gym: An Open Source Framework for Scaling Reinforcement Learning Environments for LLM” GitHub repository,[https://github\.com/NVIDIA\-NeMo/Gym](https://github.com/NVIDIA-NeMo/Gym), 2025
- \[undefay\]undef NVIDIA“NeMo RL: A Scalable and Efficient Post\-Training Library” GitHub repository,[https://github\.com/NVIDIA\-NeMo/RL](https://github.com/NVIDIA-NeMo/RL), 2025
- \[undefaz\]undef NVIDIA“NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba\-Transformer Reasoning Model”In*arXiv preprint arXiv:2508\.14444*, 2025
- \[undefaaa\]Shishir G\. Patil et al\.“The Berkeley Function Calling Leaderboard \(BFCL\): From Tool Use to Agentic Evaluation of Large Language Models”In*Proceedings of the 42nd International Conference on Machine Learning*, 2025, pp\. 48371–48392
- \[undefaab\]Nikita Pavlichenko et al\.“Mellum: Production\-Grade in\-IDE Contextual Code Completion with Multi\-File Project Understanding”In*arXiv preprint arXiv:2510\.05788*, 2025
- \[undefaac\]Guilherme Penedo, Hynek Kydlicek and Loubna Ben Allal“The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale”In*arXiv preprint arXiv:2406\.17557*, 2024
- \[undefaad\]Bowen Peng, Jeffrey Quesnelle, Honglu Fan and Enrico Shippole“YaRN: Efficient Context Window Extension of Large Language Models”In*arXiv preprint arXiv:2309\.00071*, 2024
- \[undefaae\]undef Qwen Team“Qwen2\.5 Technical Report”In*arXiv preprint arXiv:2412\.15115*, 2024
- \[undefaaf\]David Rein et al\.“GPQA: A Graduate\-Level Google\-Proof Q&A Benchmark”In*arXiv preprint arXiv:2311\.12022*, 2023
- \[undefaag\]Paul Röttger et al\.“XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models”In*Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\)*Association for Computational Linguistics, 2024, pp\. 5377–5400
- \[undefaah\]Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula and Yejin Choi“WinoGrande: An Adversarial Winograd Schema Challenge at Scale”In*Communications of the ACM*64\.9, 2021, pp\. 99–106
- \[undefaai\]Zhihong Shao et al\.“DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models”In*arXiv preprint arXiv:2402\.03300*, 2024
- \[undefaaj\]Noam Shazeer“GLU Variants Improve Transformer”In*arXiv preprint arXiv:2002\.05202*, 2020
- \[undefaak\]Mohammad Shoeybi et al\.“Megatron\-LM: Training Multi\-Billion Parameter Language Models Using Model Parallelism”In*arXiv preprint arXiv:1909\.08053*, 2020
- \[undefaal\]Varun Singh et al\.“Arcee Trinity Large Technical Report”In*arXiv preprint arXiv:2602\.17004*, 2026
- \[undefaam\]Zafir Stojanovski et al\.“Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards”, 2025arXiv:[https://arxiv\.org/abs/2505\.24760](https://arxiv.org/abs/2505.24760)
- \[undefaan\]Dan Su et al\.“Nemotron\-CC: Transforming Common Crawl into a Refined Long\-Horizon Pretraining Dataset”In*arXiv preprint arXiv:2412\.02595*, 2024
- \[undefaao\]Jianlin Su et al\.“RoFormer: Enhanced Transformer with Rotary Position Embedding”In*Neurocomputing*568, 2024, pp\. 127063
- \[undefaap\]Mirac Suzgun et al\.“Challenging BIG\-Bench Tasks and Whether Chain\-of\-Thought Can Solve Them”In*arXiv preprint arXiv:2210\.09261*, 2022
- \[undefaaq\]Qwen Team“Qwen3\.5: Towards Native Multimodal Agents”, 2026
- \[undefaar\]undef Team Olmo, Allyson Ettinger, Amanda Bertsch and Bailey Kuehl“Olmo 3”In*arXiv preprint arXiv:2512\.13961*, 2025
- \[undefaas\]Hugo Touvron, Louis Martin and Kevin Stone“Llama 2: Open Foundation and Fine\-Tuned Chat Models”In*arXiv preprint arXiv:2307\.09288*, 2023
- \[undefaat\]Yubo Wang et al\.“MMLU\-Pro: A More Robust and Challenging Multi\-Task Language Understanding Benchmark”In*arXiv preprint arXiv:2406\.01574*, 2024
- \[undefaau\]An Yang, Anfeng Yang and Baosong Yang“Qwen3 Technical Report”In*arXiv preprint arXiv:2505\.09388*, 2025
- \[undefaav\]Songlin Yang, Jan Kautz and Ali Hatamizadeh“Gated Delta Networks: Improving Mamba2 with Delta Rule” arXiv:2412\.06464In*International Conference on Learning Representations \(ICLR\)*, 2025
- \[undefaaw\]Qiying Yu et al\.“DAPO: An Open\-Source LLM Reinforcement Learning System at Scale”In*arXiv preprint arXiv:2503\.14476*, 2025
- \[undefaax\]Rowan Zellers et al\.“HellaSwag: Can a Machine Really Finish Your Sentence?”In*Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics*Association for Computational Linguistics, 2019, pp\. 4791–4800
- \[undefaay\]Biao Zhang and Rico Sennrich“Root Mean Square Layer Normalization”In*Advances in Neural Information Processing Systems*32, 2019, pp\. 12360–12371
- \[undefaaz\]Jeffrey Zhou et al\.“Instruction\-Following Evaluation for Large Language Models”In*arXiv preprint arXiv:2311\.07911*, 2023
- \[undefaaaa\]Barret Zoph et al\.“ST\-MoE: Designing Stable and Transferable Sparse Expert Models”In*arXiv preprint arXiv:2202\.08906*, 2022
## Appendix AArchitecture Exploration Details
This appendix provides additional detail on the architecture exploration experiments summarized in[Section2\.1](https://arxiv.org/html/2605.31268#S2.SS1)\.
### A\.1Dense Architecture Exploration
We evaluated dense architectures based on Qwen3\[[undefaau](https://arxiv.org/html/2605.31268#bib.bibx74)\]variations across two axes:
Deeper variants\(32–40 layers, hidden size 3072–4096\): None consistently outperformed Qwen2\.5\-7B\[[undefaae](https://arxiv.org/html/2605.31268#bib.bibx58)\]on evaluation benchmarks under the latency constraint\. Deeper architectures suffer from more sequential operations, degrading inference performance\.
Wider variants\(24–28 layers, hidden size 3584–4096\): Wider and shallower architectures exhibited better inference performance, as expected, but still failed to consistently exceed the Qwen2\.5\-7B quality baseline\.
Multi\-head Latent Attention \(MLA\)\[[undefn](https://arxiv.org/html/2605.31268#bib.bibx15)\]: We adapted the DeepSeek architecture by removing MoE layers and enabling MLA\. With a latent rank of 512 \(the only rank supported by the vLLM\[[undefai](https://arxiv.org/html/2605.31268#bib.bibx36)\]inference backend at the time\), MLA allowed scaling to approximately 5\.5B parameters at Qwen2\.5\-7B latency\. However, quality improvements were insufficient, and the latent rank was overly large for our model scale, limiting the potential KV\-cache savings\.
### A\.2MoE Architecture Exploration
We scaled down the Qwen3\-30B\-A3B\[[undefaau](https://arxiv.org/html/2605.31268#bib.bibx74)\]architecture proportionally while preserving the ratios between hidden size, intermediate size, and expert size\. Key findings:
- •Expert count: Fixed at 64 \(maximum that fits in GPU memory\)\.
- •Active experts: 2 active experts achieved∼\{\\sim\}1\.5×\\timeslower latency than 8, but quality was substantially worse at our model scale\. 8 active experts provided the best quality–latency trade\-off\.
- •Total parameters: Up to∼\{\\sim\}15B total parameters were feasible while matching Qwen2\.5\-7B latency with 8 active experts\.
- •Shared expert: Adding a shared expert\[[undefm](https://arxiv.org/html/2605.31268#bib.bibx14)\]\(always active in addition to the routed top\-kk\) yielded no measurable quality gain at our scale and consistently hurt inference performance because of the extra always\-on FFN compute per token\. We dropped it from the final design\.
- •Dense/sparse interleaving: Replacing a subset of MoE layers with dense FFN layers \(in the spirit of recent interleaved\-MoE designs\) similarly hurt inference performance without a matching quality improvement, so all FFN layers inMellum 2are MoE\.
- •Auxiliary\-loss\-free load balancing: We were strongly tempted to adopt the auxiliary\-loss\-free, bias\-based load\-balancing scheme popularised by DeepSeek\-V3\[[undefo](https://arxiv.org/html/2605.31268#bib.bibx16)\]: it simplifies the training stack by removing an extra loss term and its coefficient, and in our short\-run experiments it matched or slightly improved expert utilisation\. We ultimately stayed with the auxiliary\-loss formulation in order to fit cleanly into the Qwen3\-MoE module layout, which is what every major open\-source inference framework already implements; this made integration ofMellum 2into the existing ecosystem essentially free\. We plan to switch to auxiliary\-loss\-free balancing in the next iteration, once the loss\-free variant is equally well supported downstream\.
### A\.3Hybrid Architecture Exploration
In parallel with the dense and MoE sweeps above, we also explored*hybrid*attention designs that interleave standard softmax attention with linear\-recurrent token mixers\. Concretely, we built variants based on the Qwen3\-Next recipe\[[undefaaq](https://arxiv.org/html/2605.31268#bib.bibx70)\]\(later adopted in the Qwen3\.5 family\), which replaces a large fraction of attention layers with Gated DeltaNet\[[undefaav](https://arxiv.org/html/2605.31268#bib.bibx75)\]layers, keeping only every fourth layer as full attention\.
On long\-context, large\-batch workloads are very attractive for such hybrids: the fixed\-size recurrent state of Gated DeltaNet eliminates the linearly growing KV cache and gives near\-constant per\-token decode cost\. ForMellum 2, however, the dominant deployment target is*short context, single batch*in\-IDE inference, where the scenario is inverted\. At the time we ran our architecture search, every hybrid variant we benchmarked exhibited a substantial latency regression on short input/output lengths compared with a pure\-attention baseline of the same parameter budget\. The reasons are at least partly structural: the recurrent state update is more arithmetically heavy than a standard attention step at small sequence lengths, decode is memory\-bound on the state matrix rather than on a tiny KV cache, and the relevant kernels were significantly less optimised in mainstream inference backends than the long\-standing softmax attention path\.
Because none of these issues are fundamental — they reflect kernel and framework maturity rather than the underlying algorithm — we expect the short\-context inference gap to shrink as hybrid architectures see wider adoption and dedicated optimisation in inference engines, and we intend to revisit hybrid designs for futureMellum 2iterations\.
### A\.4MoE Training Hyperparameters
We conducted preliminary experiments on MoE\-specific hyperparameters before the main training sweeps:
- •Balancing strategy: Per\-sequence auxiliary loss produced slightly better test loss than global\-batch balancing on short runs\. We selected global\-batch balancing for its flexibility with variable batch sizes\.
- •Auxiliary loss coefficient:10−210^\{\-2\}performed better on short runs, but we chose10−310^\{\-3\}for full pre\-training to avoid over\-constraining expert utilization\.
- •Token dropping: Experiments with expert capacity factors of 1\.0–1\.5 showed no meaningful quality difference\. We adopted dropless routing, which was initially slower but improved in throughput as the router learned to balance load during training\. The residual overhead is∼\{\\sim\}15% at the time of writing\.
## Appendix BTraining Hyperparameters \(Full\)
Table 11:Complete training hyperparameters forMellum 2pre\-training\.OptimizerOptimizerDistributed MuonMuon momentum0\.95Muon Newton–Schulz iterations5Muon scale modespectralMuon TP modeblockwiseMuon extra scale factor0\.2Nesterov momentumYesAdamβ1,β2\\beta\_\{1\},\\beta\_\{2\}0\.9, 0\.95Adamϵ\\epsilon10−810^\{\-8\}SGD momentum0\.9Weight decay0\.1Gradient clipping1\.0Learning RatePeak learning rate3×10−43\\times 10^\{\-4\}Minimum learning rate0ScheduleWHDWarmup steps2,000 \(linear\)Decay steps49,306 \(linear\)Decay styleLinearBatch & SequenceSequence length8,192Global batch size4,096 sequencesMicro batch size2Batch rampup2,048→\\to4,096Total training steps323,459
PrecisionBase precisionBF16FP8 modeHybridFP8 recipeTensorwiseFP8 amax algorithmMost recentFP8 parameter gatherYesGradient reductionFP32MoEAuxiliary loss typeGlobal batchAuxiliary loss coefficient10−310^\{\-3\}Z\-loss coefficient10−310^\{\-3\}Router bias update rate10−310^\{\-3\}Router precisionFP32Token droppingDisabledGrouped GEMMYesRouter fusionYesPermute fusionYesParallelismExpert parallelism8Tensor parallelism1Pipeline parallelism1Multi\-Token PredictionAdditional prediction layers1MTP loss scaling factor0\.1
## Appendix CEvaluation Notes and Lessons Learned
This appendix collects two evaluation\-time observations that shaped how we report numbers in[Section5\.3](https://arxiv.org/html/2605.31268#S5.SS3)and that we believe are useful for other groups running similar pipelines\.
### C\.1RULER QA Subsets and Prompt Formatting
Throughout the long\-context extension stage \([Section4](https://arxiv.org/html/2605.31268#S4)\), we used RULER\[[undefab](https://arxiv.org/html/2605.31268#bib.bibx29)\]at 128K as the primary long\-context benchmark\. Early in the run, we observed that the model scored approximately zero on the QA subsets while behaving normally on the retrieval and aggregation tasks\. The failure mode was not a capability gap: the model was*continuing*the question \(generating plausible follow\-up questions in the same style\) rather than answering it, and the exact\-match scorer counted every such response as wrong\.
The lower quality resulted from a prompt\-formatting issue rather than an actual capability gap\. We deliberately did not add RULER\-style QA prompts to the long\-context data mix, since doing so would have amounted to optimizing for the benchmark rather than for the underlying capability\.
### C\.2Reasoning Budgets for Qwen3 and Qwen3\.5 Thinking Variants
While evaluating the*thinking*variants of Qwen3\-4B and Qwen3\.5\-4B \(reported in[Table10](https://arxiv.org/html/2605.31268#S5.T10)\), we encountered a consistent failure mode on a non\-trivial fraction of prompts: the model would not emit a closing</think\>tag and continued to reason indefinitely\. Running these models without a generation cap is both expensive and produces near\-zero benchmark scores, because the model would rather fill its context window with a reasoning trace than answer the benchmark question\.
Recent vLLM\[[undefai](https://arxiv.org/html/2605.31268#bib.bibx36)\]releases expose a configurable reasoning budget that forces the model out of the thinking phase after a chosen number of tokens\. Qwen does not publish an official threshold for the 4B/9B thinking variants, so we used a generous but arbitrary budget of 32K tokens for every thinking model in our evaluation\. This is sufficient to admit long but bounded chains of thought while preventing the pathological non\-terminating cases from dominating the average\.
We note that, from a downstream\-user perspective, the small thinking variants of Qwen3 and Qwen3\.5 are difficult to deploy in their thinking regime without such a cap\. We do not have a definitive explanation for this behavior, but we suspect a lack of on\-policy reinforcement\-learning training at the smallest scales in those families, since the larger models in the same families appear to terminate reasoning much more reliably\.Similar Articles
Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains
JetBrains introduces Mellum2, a 12B parameter Mixture-of-Experts model optimized for code generation and reasoning tasks, with a focus on private deployment and integration into development workflows.
Mellum2 Technical Report
Mellum 2 is a 12B-parameter open-weight MoE language model by JetBrains with 2.5B active parameters, specialized in software engineering tasks and optimized for efficient inference on commodity GPUs.
Mellum2 Goes Open Source: A Fast Model for AI Workflows | The JetBrains AI Blog
JetBrains open-sources Mellum2, a fast 12B Mixture-of-Experts model designed for low-latency AI workflows in software engineering, available under Apache 2.0 license.
Mellum 2 12B A2.5B
JetBrains released Mellum 2 12B A2.5B, a coding-focused small MoE model with reasoning performance comparable to Qwen 3.5 9B but weaker in other tasks.
JetBrains/Mellum2-12B-A2.5B-Thinking
JetBrains releases Mellum2-12B-A2.5B-Thinking, an open-source Mixture-of-Experts reasoning model with 131k context length, trained with RLVR for explicit chain-of-thought reasoning.