Model Merging Scaling Laws in Large Language Models

Hugging Face Daily Papers 05/11/26, 12:00 AM Papers

Summary

This paper establishes empirical scaling laws for language model merging, identifying power-law relationships between model size, expert count, and performance to enable predictive planning for optimal model composition.

We study empirical scaling laws for language model merging measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k and links the floor and tail to properties of the base model and the diversity across domains. This law enables predictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget--turning merging from heuristic practice into a computationally efficient, planable alternative to multitask training. This suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.
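The planning use described above can be sketched in a few lines. This is only an illustration under an assumed functional form, loss(k) = floor + tail / k, chosen to match the abstract's statement that gains fall roughly as 1/k. The exact fitted law is not given on this page, and the names `predicted_loss`, `marginal_gain`, and `experts_needed`, along with the parameters `floor` and `tail`, are hypothetical.

```python
import math


def predicted_loss(k, floor, tail):
    """Loss after merging k experts under the ASSUMED form floor + tail / k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return floor + tail / k


def marginal_gain(k, tail):
    """Loss reduction from adding one more expert (k -> k + 1) under the same assumption."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return tail / (k * (k + 1))


def experts_needed(target_loss, floor, tail):
    """Smallest integer k whose predicted loss is <= target_loss.

    Returns None when the target is at or below the floor, because the
    assumed curve only approaches the floor as k grows and never reaches it.
    """
    if target_loss <= floor:
        return None
    if tail <= 0:
        return 1
    return max(1, math.ceil(tail / (target_loss - floor)))


if __name__ == "__main__":
    # Illustrative numbers only, not values from the paper.
    floor, tail = 2.0, 0.5
    print(experts_needed(2.125, floor, tail))
    print(marginal_gain(4, tail))
```

Under this assumed form, the marginal gain shrinks like 1/k^2, which is one way to read "most gains arrive early": a stopping rule can compare `marginal_gain` against the cost of training one more expert.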


Cached at: 05/12/26, 07:32 AM

Source: https://huggingface.co/papers/2509.24244

Abstract

Empirical scaling laws for language model merging reveal power-law relationships between model size, expert count, and cross-entropy performance, enabling predictive planning for optimal model composition.

We study empirical scaling laws for language model merging measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k and links the floor and tail to properties of the base model and the diversity across domains. This law enables predictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget--turning merging from heuristic practice into a computationally efficient, planable alternative to multitask training. This suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.
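To make the floor and tail concrete, here is a minimal sketch of estimating both from measured (expert count, loss) pairs for one fixed model size. It assumes the simplified form loss(k) = floor + tail / k, which is linear in x = 1/k and so can be fitted by ordinary least squares. This form, the function name `fit_floor_and_tail`, and the sample data are illustrative assumptions, not the paper's fitted law or its measurements.

```python
def fit_floor_and_tail(ks, losses):
    """Least-squares fit of loss = floor + tail * (1 / k).

    ks: expert counts (each >= 1); losses: measured cross-entropy at each k.
    Returns (floor, tail). The 1/k form is an assumption for illustration.
    """
    if len(ks) != len(losses) or len(ks) < 2:
        raise ValueError("need at least two (k, loss) pairs of equal length")
    xs = [1.0 / k for k in ks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct expert counts")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    tail = sxy / sxx                 # slope with respect to 1/k
    floor = mean_y - tail * mean_x   # intercept: predicted loss as k grows large
    return floor, tail


if __name__ == "__main__":
    # Synthetic points generated from floor=2.0, tail=0.5 (not real measurements).
    ks = [1, 2, 4, 8]
    losses = [2.0 + 0.5 / k for k in ks]
    print(fit_floor_and_tail(ks, losses))
```

Repeating such a fit at several model sizes would give one floor per size, which is where a size-dependent floor that decreases with capacity would show up.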

Similar Articles

Scaling laws for neural language models

InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition

Scaling laws for reward model overoptimization

Data Mixing for Large Language Models Pretraining: A Survey and Outlook

Evolution through large models
