Model Merging Scaling Laws in Large Language Models
Summary
This paper establishes empirical scaling laws for language model merging, identifying power-law relationships between model size, expert count, and performance to enable predictive planning for optimal model composition.
Source: https://huggingface.co/papers/2509.24244
Abstract
We study empirical scaling laws for language model merging, measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k and links the floor and tail to properties of the base model and the diversity across domains. This law enables predictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget, turning merging from a heuristic practice into a computationally efficient, plannable alternative to multitask training. This suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.
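A minimal sketch of how such a law could support the predictive planning described in the abstract, assuming the simplest floor-plus-1/k form of the tail. The function name merging_law, the measured cross-entropy values, and the target loss below are illustrative assumptions, not numbers or code from the paper.

```python
# Sketch: fit a floor + tail/k merging law to measured cross-entropy and
# use it to plan how many experts are needed to reach a target loss.
import numpy as np
from scipy.optimize import curve_fit

def merging_law(k, floor, tail):
    """Hypothesized form: cross-entropy after merging k experts decays
    toward a size-dependent floor with a roughly 1/k tail."""
    return floor + tail / k

# Illustrative measurements: cross-entropy after merging k experts.
k_observed = np.array([1, 2, 4, 8, 16])
ce_observed = np.array([2.31, 2.12, 2.03, 1.98, 1.96])

# Fit the floor and tail for this base model size.
(floor, tail), _ = curve_fit(merging_law, k_observed, ce_observed, p0=(1.9, 0.4))

# Predictive planning: how many experts would reach a target loss?
target_ce = 1.97
if target_ce <= floor:
    print(f"Target {target_ce:.2f} is below the fitted floor {floor:.3f}; "
          "scaling the base model is the only way to reach it.")
else:
    k_needed = int(np.ceil(tail / (target_ce - floor)))
    print(f"Fitted law: CE(k) ~ {floor:.3f} + {tail:.3f}/k -> "
          f"about {k_needed} experts needed for CE <= {target_ce:.2f}.")
```

Because the tail decays as 1/k, the same fit also tells you when to stop: once tail/k is small relative to the gap between the current loss and the floor, further experts buy little and the budget is better spent scaling the base model.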
Similar Articles
Scaling Laws for Neural Language Models
Foundational empirical study demonstrating power-law scaling relationships between language model performance and model size, dataset size, and compute budget, with implications for optimal training allocation and sample efficiency.
InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition
InfoLaw is a data-aware scaling framework that predicts model loss based on token consumption, model size, data mixture weights, and repetition, enabling efficient data-recipe selection under varying compute budgets.
Scaling Laws for Reward Model Overoptimization
OpenAI researchers empirically study how reward model overoptimization affects performance, establishing scaling laws showing that the relationship between proxy reward optimization and ground-truth performance varies by optimization method and scales predictably with model size.
Data Mixing for Large Language Models Pretraining: A Survey and Outlook
This paper presents a comprehensive survey of data mixing methods for LLM pretraining, formalizing the problem as bilevel optimization and introducing a taxonomy that distinguishes static (rule-based and learning-based) from dynamic (adaptive and externally guided) mixing approaches. The authors analyze trade-offs, identify cross-cutting challenges, and outline future research directions including finer-grained domain partitioning and pipeline-aware designs.
Evolution through Large Models
This paper demonstrates that large language models trained on code can significantly enhance genetic programming mutation operators, enabling the generation of hundreds of thousands of functional Python programs for robot design in the Sodarace domain without prior training data. The approach, called Evolution through Large Models (ELM), combines LLMs with MAP-Elites to bootstrap new conditional models for context-specific artifact generation.