SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization
Summary
This paper introduces SimReg, a regularization technique for LLM pretraining that uses embedding similarity to improve training convergence by over 30% and boost zero-shot performance.
# SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization
Source: [https://arxiv.org/html/2605.08809](https://arxiv.org/html/2605.08809)
Yan Sun1,Guoxia Wang1,Jinle Zeng1,JiaBin Yang1,Shuai Li1 Li Shen3,Dacheng Tao4,DianHai Yu1,Haifeng Wang1 1Baidu Inc\.2Sun Yat\-sen University3Nanyang Technological University \{sunyan25,wangguoxia,zengjinle,yangjiabin01,yudianhai,wanghaifeng\}@baidu\.com lishuai\_math@163\.com, mathshenli@gmail\.com, dacheng\.tao@ntu\.edu\.sg
###### Abstract
Pretraining large language models (LLMs) with next-token prediction has led to remarkable advances, yet the context-dependent nature of token embeddings in such models results in high intra-class variance and inter-class similarity, which hinders the efficiency of representation learning. While similarity-based regularization has demonstrated benefits in supervised fine-tuning and classification tasks, its application and efficacy in large-scale LLM pretraining remain underexplored. In this work, we propose SimReg, an embedding similarity regularization loss that explicitly encourages token representations with the same ground-truth label within each sequence to be more similar, while enforcing separation from different-label tokens via a contrastive loss. Our analysis reveals that this mechanism introduces gains by enlarging multi-classification margins, thereby enabling more efficient classification. Extensive experiments across dense and Mixture-of-Experts (MoE) architectures demonstrate that SimReg consistently accelerates training convergence by over 30% and improves average zero-shot downstream performance by over 1% across standard benchmarks. Further ablation studies and analyses offer practical insights into hyperparameter tuning and loss effectiveness.
## 1Introduction
LLMs have emerged as a cornerstone of modern artificial intelligence and have demonstrated remarkable capabilities across a wide range of domains such as natural language understanding\(Radfordet al\.,[2019](https://arxiv.org/html/2605.08809#bib.bib10)\), reasoning\(Weiet al\.,[2022](https://arxiv.org/html/2605.08809#bib.bib11)\), and multimodal interaction\(Linet al\.,[2025b](https://arxiv.org/html/2605.08809#bib.bib12)\)\. While LLMs are advancing along diverse directions, they all fundamentally share a consistent underlying principle, i\.e\., next\-token prediction\. The essential mechanism of LLMs is to predict the categorical distribution of the next token from the embeddings of the prior context, which can also be viewed as a classification problem defined over the combined representations of the preceding context\. By leveraging enormous model parameters and vast training data, it exhibits exceptional generalization capability, introduces novel solutions in diverse research domains, and further drives the adoption of a wide range of applications\(Topsakal and Akinci,[2023](https://arxiv.org/html/2605.08809#bib.bib13)\)with growing challenges in efficiency\(Shenet al\.,[2024](https://arxiv.org/html/2605.08809#bib.bib50)\)\. Both data\-specific\(Fanet al\.,[2025](https://arxiv.org/html/2605.08809#bib.bib53); Denget al\.,[2026](https://arxiv.org/html/2605.08809#bib.bib55)\)and weight\-specific\(Liet al\.,[2024a](https://arxiv.org/html/2605.08809#bib.bib57); Sunet al\.,[2025](https://arxiv.org/html/2605.08809#bib.bib54); Linet al\.,[2025a](https://arxiv.org/html/2605.08809#bib.bib56)\)approaches have attracted considerable research interest\.
Unlike conventional classification, language model prediction does not rely on a stable object strictly tied to its label\. In image classification, for instance, a cat image is consistently associated with its label, leading to highly consistent embeddings within the same class\. In contrast, on language tasks, the representation used to predict a token is composed of diverse contextual features, many unrelated to the label itself\. As a result, embeddings predicting the same token can vary significantly\. For example, the representations for “walls” in “The cat jumps over walls” and “A child paints near walls” originate from entirely different contexts, making the classification process more challenging\.
Recent advances in consistency learning for fine-tuning language models shed light on potential solutions to this challenge (Huang et al., [2021](https://arxiv.org/html/2605.08809#bib.bib15); Gunel et al., [2021](https://arxiv.org/html/2605.08809#bib.bib16); Yin et al., [2023](https://arxiv.org/html/2605.08809#bib.bib17)). However, this line of research has not yet been extended to pretraining and has not been widely adopted in large-scale pretraining practice. Post-training is typically performed with a small learning rate and limited datasets, which makes it difficult to significantly modify the geometric structure of the learned parameters. These insights motivate us to extend this approach to large-scale pretraining.
Figure 1: (Left) Workflow of the SimReg loss. (Right) Cosine similarity of token embeddings in a sample on the LLaMA-7B model trained with “CrossEntropy only” and with “CrossEntropy+SimReg”. Training with “CrossEntropy only” fails to enforce sufficient separability among token features: the cosine values of all token pairs exceed 0.5. With SimReg, feature separability is generally enhanced (the averaged cosine value is reduced by at least 0.1), thereby providing stronger support for classification training. More results are given in Appendix [A.4](https://arxiv.org/html/2605.08809#A1.SS4).

In this work, we show that large-scale pretraining with cross-entropy alone fails to impose strong consistency on token embeddings. To address this, we add a consistency regularization term, SimReg, to strengthen the representational capacity of large models during pretraining. For each token in a sequence, all tokens are partitioned into a positive group and a negative group. The objective penalizes similarity across groups, pulling embeddings toward same-class samples and pushing them away from different-class samples. To ensure valid contrastive pairs for every token, SimReg introduces self-sample similarity in each positive group and computes the loss with group-level rather than sample-level averaging. This balances the contributions of different tokens and preserves a high level of stability over long pretraining runs. We also provide a theoretical analysis that explains how SimReg helps reduce the cross-entropy loss. Extensive evaluations are conducted on both dense and MoE models, including LLaMA-350M, 1.3B, 3B, 7B (Touvron et al., [2023](https://arxiv.org/html/2605.08809#bib.bib3)), and Mixtral-8×1B (Jiang et al., [2024](https://arxiv.org/html/2605.08809#bib.bib4)). The SimReg loss consistently accelerates convergence by over 30% in pretraining. When training with over 52B tokens, it yields an improvement of more than 1% in average performance across general downstream tasks. We investigate the hyperparameter sensitivity of SimReg and find that it maintains a wide range of applicability. We summarize our main contributions as follows:
- We explore the advantages of employing consistency regularization in large-scale pretraining tasks and propose a series of improvements to address the training instabilities of existing methods, thereby enabling stable performance gains throughout long-term pretraining.
- We provide a detailed theoretical analysis of the benefits of the SimReg loss for the cross-entropy loss, and of how it improves the multi-classification margins.
- We conduct extensive experiments to validate its substantial improvements for pretraining tasks, achieving an average training acceleration of over 30% and gains of over 1% on downstream tasks, and report detailed empirical insights for the community.
## 2Related Work
Contrastive learning\. The systematic exploration of feature similarity constraints in machine learning can be traced back to their early development in computer vision \(CV\) tasks and contrastive learning\(Oordet al\.,[2018](https://arxiv.org/html/2605.08809#bib.bib19); Khoslaet al\.,[2020](https://arxiv.org/html/2605.08809#bib.bib20)\)\. They enhance the training of baseline classification models by constructing virtual data pairs and incorporating additional supervised loss signals, which helped the models extract more discriminative features\. It is typically employed to counteract noise perturbations at the input level, thereby improving generalization ability\(Genget al\.,[2021](https://arxiv.org/html/2605.08809#bib.bib23); Shiet al\.,[2022](https://arxiv.org/html/2605.08809#bib.bib21); Huang and Gong,[2022](https://arxiv.org/html/2605.08809#bib.bib25); Zhouet al\.,[2024](https://arxiv.org/html/2605.08809#bib.bib22); Wanget al\.,[2024](https://arxiv.org/html/2605.08809#bib.bib24)\)\. Generally, a data pair is constructed from a raw sample and its perturbed counterpart, and the model is trained to minimize their representation similarity\. Subsequently, supervised contrastive learning has been extended to incorporate class information\. By leveraging available labels to construct class\-consistent data pairs, the model is trained not only to pull together samples from the same class but also to push apart samples from different classes\(Wang and Liu,[2021](https://arxiv.org/html/2605.08809#bib.bib29); Wen and Li,[2021](https://arxiv.org/html/2605.08809#bib.bib26); Yeet al\.,[2022](https://arxiv.org/html/2605.08809#bib.bib27); Denizeet al\.,[2023](https://arxiv.org/html/2605.08809#bib.bib28)\)\. 
Recent studies have revealed that contrastive learning can also achieve more efficient feature extraction across tasks and data originating from different domains\(Vermaet al\.,[2021](https://arxiv.org/html/2605.08809#bib.bib31); Wanget al\.,[2022](https://arxiv.org/html/2605.08809#bib.bib30); Xieet al\.,[2022](https://arxiv.org/html/2605.08809#bib.bib32); Azumaet al\.,[2023](https://arxiv.org/html/2605.08809#bib.bib33)\)\. In multimodal large model training, this learning paradigm is often employed to align the mapping of knowledge across domains and to capture the representation capacity of the same knowledge under different modalities\(Yuanet al\.,[2021](https://arxiv.org/html/2605.08809#bib.bib34); Maiet al\.,[2022](https://arxiv.org/html/2605.08809#bib.bib35); Liuet al\.,[2024b](https://arxiv.org/html/2605.08809#bib.bib36); Sunet al\.,[2024](https://arxiv.org/html/2605.08809#bib.bib37)\)\. In summary, contrastive learning offers an efficient and general paradigm for representation learning to the machine learning community\.
Embedding Consistency in LLMs\.The study of feature similarity has also been considered as compositional generalization\(Lake,[2019](https://arxiv.org/html/2605.08809#bib.bib39); Wiedemeret al\.,[2023](https://arxiv.org/html/2605.08809#bib.bib38)\)and embedding consistency regularization\(Yinet al\.,[2023](https://arxiv.org/html/2605.08809#bib.bib17)\)\.Gaoet al\.\([2021](https://arxiv.org/html/2605.08809#bib.bib40)\)learn the sentence embeddings and achieve higher generalization efficiency\. Then it is widely expanded to the token\-level\(Gaoet al\.,[2023](https://arxiv.org/html/2605.08809#bib.bib45); Wang and Yu,[2023](https://arxiv.org/html/2605.08809#bib.bib41)\), word\-level\(Kenter and De Rijke,[2015](https://arxiv.org/html/2605.08809#bib.bib42); Antoniak and Mimno,[2018](https://arxiv.org/html/2605.08809#bib.bib44)\), context\-level\(Laskaret al\.,[2020](https://arxiv.org/html/2605.08809#bib.bib43)\)\. Most of these tasks have primarily focused on small\-scale or fine\-tuning settings\. As the cornerstone of modern language models, the next\-token prediction paradigm has been widely applied across various downstream tasks\(Liet al\.,[2024b](https://arxiv.org/html/2605.08809#bib.bib46); Chenet al\.,[2024](https://arxiv.org/html/2605.08809#bib.bib47)\)\. Recent research has further investigated the similarity and dispersion of token embeddings, which highlights the separability of embeddings to be a key direction\(de Andradeet al\.,[2023](https://arxiv.org/html/2605.08809#bib.bib48); Taoet al\.,[2024](https://arxiv.org/html/2605.08809#bib.bib49); Huet al\.,[2024](https://arxiv.org/html/2605.08809#bib.bib51)\)\.
## 3Problem Setup and Methodology
In this section, we introduce how SimReg can be incorporated into the pretraining of LLMs and explain why it helps improve performance. Before proceeding, we formalize the overall pretraining setup of LLMs and introduce the notation used throughout the subsequent analysis.
General Pretraining. Before introducing the training framework, we first define the notation used in this work. We view LLM pretraining as learning the optimal weights $\mathbf{w}$ by minimizing the cross-entropy loss $\ell$ under a general distribution $\mathcal{D}$. We decompose the model into two cascaded functions $f_P \circ f_E$, where $f_P$ (the logits generation module) is parameterized by $\mathbf{w}_P$ and $f_E$ (the embedding generation module) is parameterized by $\mathbf{w}_E$, with the overall parameters denoted as $\mathbf{w} = \left[\mathbf{w}_P, \mathbf{w}_E\right]$. Based on this decomposition, the general pretraining objective of language models can be formulated as:

$$\min_{\mathbf{w}} \mathbb{E}_{\left(\mathbf{x}_i, y_i\right) \sim \mathcal{D}} \left[ \ell\left( f_P \circ f_E\left(\mathbf{x}_i\right), y_i \right) \right], \tag{1}$$

where $\left(\mathbf{x}_i, y_i\right)$ is the (data, label) pair sampled from the distribution $\mathcal{D}$. Here, the choice of $f_E$ and $f_P$ is entirely flexible, meaning that the SimReg loss can in principle be applied to any valid token embedding across the network. We further explore the optimal placement of this component in the experiments of Sec. [5.3](https://arxiv.org/html/2605.08809#S5.SS3).
Cross-entropy loss serves as the fundamental training objective in language modeling. It measures the discrepancy between the predicted token distribution and the ground-truth one-hot distribution, thereby guiding the model to maximize the likelihood of the correct next token. The models typically employ large-scale feature extractors to obtain separable representations. By denoting the token embedding as $\mathbf{e}_i = f_E(\mathbf{x}_i)$ and the corresponding logits as $\mathbf{z}_i = f_P(\mathbf{e}_i)$, the empirical risk of the sample-wise cross-entropy loss over $n$ samples is:

$$L^{\text{ce}} = \frac{1}{n} \sum_{i} \left( -\mathbf{z}_{i,y_i} + \log\left( \sum_{j} \exp\left( \mathbf{z}_{i,j} \right) \right) \right). \tag{2}$$
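As a minimal sketch of Eq. (2), the averaged cross-entropy can be evaluated directly from a matrix of logits. The function name `cross_entropy_from_logits` and the toy inputs are illustrative assumptions rather than part of the original work; the shift by the row maximum is added only for numerical stability and does not change the value.

```python
import numpy as np

def cross_entropy_from_logits(logits, labels):
    """Mean over rows of -z[i, y_i] + log(sum_j exp(z[i, j])), as in Eq. (2)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    # Shift by the row maximum so that exp() cannot overflow.
    row_max = logits.max(axis=1, keepdims=True)
    log_norm = np.log(np.exp(logits - row_max).sum(axis=1)) + row_max[:, 0]
    # Logit of the ground-truth class for every row.
    correct = logits[np.arange(len(labels)), labels]
    return float(np.mean(log_norm - correct))

# Toy example (hypothetical values): 2 tokens, 3 classes.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy_labels = [0, 1]
toy_loss = cross_entropy_from_logits(toy_logits, toy_labels)
```

With uniform logits over $C$ classes the per-sample loss equals $\log C$, which is a convenient sanity check.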
Figure 2: (a) Token ID distribution over 1B training samples from the C4 dataset: only about 2% of tokens occur with extremely high frequency, resulting in a pronounced long-tail effect in the classification data. (b) The contrastive similarity loss of embeddings stops decreasing after reaching a basic threshold, after which feature similarity is no longer optimized. Simply increasing the model size does not improve this behavior.

Generally, larger separability enhances the distinction between different samples, leading to more robust and discriminative representations. Although Eq. ([2](https://arxiv.org/html/2605.08809#S3.E2)) averages over samples, the unique characteristics of language tasks introduce a challenge: the distribution of words (tokens) is highly imbalanced, so frequent tokens dominate the loss while rare but informative ones contribute disproportionately little, yielding a heavily long-tailed dataset. When training classification tasks on such a dataset, the inter-class margin is greatly influenced by the number of samples per class. As shown in Figure [2](https://arxiv.org/html/2605.08809#S3.F2)(a), we empirically investigate the token distribution of the C4 dataset and the behavior of contrastive similarity. A primary challenge we investigate in LLM pretraining is:
cross-entropy stops driving stronger representation learning once tokens reach a basic level of separability.
Figure [2](https://arxiv.org/html/2605.08809#S3.F2)(b) indicates that during the early stage of cross-entropy training, the model rapidly constrains the contrastive similarity of embeddings. However, once the contrastive diversity becomes sufficient to sustain classification training, the model no longer enforces heterogeneity among token embeddings. Subsequently, even though the cross-entropy loss continues to decrease, the contrastive similarity exhibits little further change. Another interesting phenomenon we observe is that, even as the model depth increases and the embedding dimension grows, the contrastive similarity of token embeddings under cross-entropy supervision remains at nearly the same level. This limits the potential for further improvement in classification tasks, and motivates us to explicitly regularize the contrastive similarity.
Embedding Similarity Regularization. Here, we introduce the generalized form of our similarity regularization. For each token $\mathbf{x}$, its embedding is denoted by $\mathbf{e} = f_E(\mathbf{x})$. For each data sample $\left(\mathbf{x}_i, y_i\right)$, we define a positive embedding set $\mathcal{P}_i = \left\{k : y_k = y_i\right\}$ and a negative embedding set $\mathcal{N}_i = \left\{k : y_k \neq y_i\right\}$. The consistency loss aims to minimize the distance between embeddings of positive pairs, while simultaneously maximizing the separation between negative pairs:

$$L_i^{\text{sr}} \triangleq \log \sum_{j \in \mathcal{N}_i} \phi_{i,j} - \log \sum_{j \in \mathcal{P}_i} \phi_{i,j}, \tag{3}$$

where $L_i^{\text{sr}}$ is the similarity loss of the $i$-th token and $\phi$ denotes a similarity function between two embeddings. We explore two primary forms: the exponential of the inner product $\left\langle \mathbf{e}_i, \mathbf{e}_j \right\rangle$ and that of the cosine similarity $\frac{\left\langle \mathbf{e}_i, \mathbf{e}_j \right\rangle}{\|\mathbf{e}_i\| \cdot \|\mathbf{e}_j\|}$. Both similarity measures provide effective supervision for feature similarity, yet their applicable scenarios differ. The inner product often yields stronger statistical constraints, since it supervises both the geometric structure and the feature norms. However, this advantage may also introduce ambiguity: for instance, when an embedding has an abnormally large norm, the inner-product value becomes dominated by the magnitude, rendering the loss function almost insensitive to angular differences. In such cases, the optimization may overly rely on vector norms while neglecting the discriminative power of directional alignment. Therefore, for numerical stability, we adopt cosine similarity as the similarity measure in Eq. ([3](https://arxiv.org/html/2605.08809#S3.E3)) and introduce a constant temperature coefficient $\tau$ to adjust the sharpness of the distribution. Since words in natural language are distributed in an imbalanced manner, a sequence may contain only a single occurrence of a particular token type. We therefore add the self-similarity to $\mathcal{P}_i$ so that at least one positive pair always exists. Moreover, to ensure that the regularization loss is non-negative, we apply the softplus function to further scale it. The final form of the loss is therefore computed as $L_i = L_i^{\text{ce}} + \lambda \cdot \text{softplus}\left(L_i^{\text{sr}}\right)$. The entire optimization process involves two hyperparameters, $\tau$ and $\lambda$. We discuss them in Sec. [5.2](https://arxiv.org/html/2605.08809#S5.SS2).
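The regularizer described above can be sketched in a few lines of NumPy. This is an illustrative reading rather than the authors' implementation: it assumes $\phi_{i,j} = \exp(\cos(\mathbf{e}_i, \mathbf{e}_j)/\tau)$ as the way the temperature enters, treats one sequence at a time, and returns per-token values without reproducing the group-level averaging. The names `simreg_per_token` and `logsumexp_masked` and the default `tau=0.1` are hypothetical.

```python
import numpy as np

def logsumexp_masked(scores, mask):
    """Row-wise log(sum_j exp(scores[i, j])) over entries where mask is True.

    Rows whose mask is empty return -inf (the log of an empty sum).
    """
    masked = np.where(mask, scores, -np.inf)
    row_max = np.max(masked, axis=1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    with np.errstate(divide="ignore"):
        return np.log(np.sum(np.exp(masked - row_max), axis=1)) + row_max[:, 0]

def simreg_per_token(emb, labels, tau=0.1):
    """softplus(L_i^sr) for every token of one sequence (Eq. 3 with cosine similarity).

    emb:    (n, d) token embeddings e_i
    labels: (n,)   ground-truth next-token labels y_i
    tau:    temperature (assumed to divide the cosine similarity)
    """
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    scores = unit @ unit.T / tau
    # Positive set P_i: same label; the diagonal supplies the self-similarity,
    # so P_i is never empty. Negative set N_i: every other token.
    pos = labels[:, None] == labels[None, :]
    neg = ~pos
    l_sr = logsumexp_masked(scores, neg) - logsumexp_masked(scores, pos)
    # softplus(x) = log(1 + exp(x)) keeps the regularizer non-negative.
    return np.logaddexp(0.0, l_sr)
```

In this sketch a token with no negatives in the sequence contributes exactly zero, because the log of an empty sum is $-\infty$ and softplus maps it to 0. The per-token values would then be scaled by $\lambda$ and added to the per-token cross-entropy, as in the final loss above.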
Chunk-wise SimReg for Sequence Parallelism. The computation of SimReg is centered on the embedding of each token within a sequence sample, and its complexity is $\mathcal{O}(n^2)$. It naturally supports parallelization strategies such as data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP). However, during long-text training, sequence parallelism (SP) splits each sequence across different nodes, which introduces additional redundant communication. To alleviate this issue, we divide SimReg into $b$ chunks, where every $\frac{n}{b}$ tokens form a chunk within which the SimReg loss is computed. The losses across different nodes are then weighted according to the ratio of positive and negative samples, and the overall computational complexity is reduced to $\mathcal{O}(n^2/b)$. Moreover, there exists a fundamental trade-off between the strength of supervision and the expressive capacity of feature representations with respect to the choice of chunk size. When a chunk contains a larger number of tokens, the estimation of SimReg becomes more accurate. However, its constraining power on each individual token is weakened, as the loss must balance relationships among a larger set of tokens. Therefore, selecting an appropriate chunk size is critical, which we explore empirically in Sec. [5.1](https://arxiv.org/html/2605.08809#S5.SS1).
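The chunking idea can be illustrated with a small NumPy sketch. It assumes equal-size contiguous chunks and simply concatenates the per-token values; the weighting by the ratio of positive and negative samples mentioned above is not reproduced, since its exact form is not given here. The names `simreg_tokens` and `chunked_simreg` are hypothetical, and $\phi_{i,j} = \exp(\cos(\mathbf{e}_i, \mathbf{e}_j)/\tau)$ is an assumed form.

```python
import numpy as np

def simreg_tokens(emb, labels, tau=0.1):
    """Per-token softplus(L_i^sr) within one block of tokens (assumed cosine/tau form)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    phi = np.exp(unit @ unit.T / tau)
    pos = labels[:, None] == labels[None, :]  # diagonal = self-similarity
    with np.errstate(divide="ignore"):
        log_pos = np.log(np.sum(np.where(pos, phi, 0.0), axis=1))
        log_neg = np.log(np.sum(np.where(pos, 0.0, phi), axis=1))
    return np.logaddexp(0.0, log_neg - log_pos)

def chunked_simreg(emb, labels, num_chunks, tau=0.1):
    """Compute the regularizer independently inside each of `num_chunks` chunks.

    Returns the per-token values and the number of similarity pairs evaluated,
    which is num_chunks * (n / num_chunks)^2 = n^2 / num_chunks.
    """
    n = emb.shape[0]
    if n % num_chunks != 0:
        raise ValueError("sequence length must be divisible by num_chunks")
    size = n // num_chunks
    parts = [
        simreg_tokens(emb[s:s + size], labels[s:s + size], tau)
        for s in range(0, n, size)
    ]
    return np.concatenate(parts), num_chunks * size * size

# Toy usage with random data (hypothetical sizes).
rng = np.random.default_rng(0)
toy_emb = rng.normal(size=(8, 4))
toy_labels = rng.integers(0, 3, size=8)
full_loss, full_pairs = chunked_simreg(toy_emb, toy_labels, num_chunks=1)
chunk_loss, chunk_pairs = chunked_simreg(toy_emb, toy_labels, num_chunks=4)
```

Because each chunk only sees its own tokens, no embeddings need to be exchanged between the nodes that hold different chunks, at the cost of fewer positive and negative candidates per token.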
## 4Theoretical Analysis
In this section, we demonstrate how SimReg improves classification margins. All proofs are provided in Appendix [B](https://arxiv.org/html/2605.08809#A2). We first introduce the classification margin, i.e., $m_i = \mathbf{z}_{i,y_i} - \max_{j \neq y_i} \mathbf{z}_{i,j}$, which is the gap between the logit of the correct class and the largest competing logit. Then the cross-entropy loss can be upper bounded by:

$$\ell_i = \log\left(1 + \sum_{j \neq y_i} \exp\left(\mathbf{z}_{i,j} - \mathbf{z}_{i,y_i}\right)\right) \leq C \exp\left(-m_i\right), \tag{4}$$

where $C$ is the total number of classes. The above formula explicitly characterizes the relationship between the classification margin and the training loss: enlarging the margin $m$ leads to a further reduction in the loss. Our supervision on the embedding variable $\mathbf{e}$ is propagated through a function $f_P\left(\cdot\right)$ to the logits $\mathbf{z}$ used for classification with cross-entropy, i.e., $\mathbf{z} = f_P\left(\mathbf{e}\right)$. This mapping can take the form of a simple linear projection (e.g., the LM head) or several intermediate layers of the LLM. Without loss of generality, we assume it to be a general smooth and non-convex function with smoothness coefficient $L_P$. Thus, we consider the margin. By defining $\mathbb{I}_j$ as the standard basis vector with $1$ in the $j$-th coordinate and $0$ elsewhere, we measure the pair-wise gap in logits by $g_{y_k,j}(\mathbf{e}_k) = \left(\mathbb{I}_{y_k} - \mathbb{I}_j\right)^{\top} f_P(\mathbf{e}_k)$, which is also smooth and non-convex. Furthermore, we can transfer the smoothness by:

$$|g_{y_i,j}(\mathbf{e}_p) - g_{y_i,j}(\mathbf{e}_q)| \leq \|\mathbb{I}_{y_i} - \mathbb{I}_j\| \, \|f_P(\mathbf{e}_p) - f_P(\mathbf{e}_q)\| \leq \sqrt{2} L_P \|\mathbf{e}_p - \mathbf{e}_q\|.$$

To investigate their relationships, we have the following lemma.
###### Lemma 1
For each token $\mathbf{x}_i$ with embedding $\mathbf{e}_i = f_E\left(\mathbf{x}_i\right)$, we define weighted centers of the embeddings in the original space, where the positive and negative centers are $\overline{\mathbf{e}}_k^{+} = \frac{\sum_{i \in \mathcal{P}_k} \alpha_{k,i} \mathbf{e}_i}{\sum_{i \in \mathcal{P}_k} \alpha_{k,i}}$ and $\overline{\mathbf{e}}_k^{-} = \frac{\sum_{i \in \mathcal{N}_k} \alpha_{k,i} \mathbf{e}_i}{\sum_{i \in \mathcal{N}_k} \alpha_{k,i}}$, with $\alpha_{k,i} \propto \exp\left(\mathbf{e}_k^{\top} \mathbf{e}_i\right)$. The averaged group margins are $\overline{m}_k^{+} = \min_{c \neq y_k} g_{y_k,c}(\overline{\mathbf{e}}_k^{+})$ and $\overline{m}_k^{-} = \min_{c \neq y_k} g_{y_k,c}(\overline{\mathbf{e}}_k^{-})$. Then the classification margin $m_k$ of each token admits the following Central–Eccentric bound within the group margins:

$$\overline{m}_k^{+} - \sqrt{2} L_P \|\mathbf{e}_k - \overline{\mathbf{e}}_k^{+}\| \leq m_k \leq \overline{m}_k^{-} + \sqrt{2} L_P \|\mathbf{e}_k - \overline{\mathbf{e}}_k^{-}\|. \tag{5}$$
Intuitively, $\overline{m}$ can be regarded as an idealized margin, obtained by evaluating the logit of the correct class at the positive center and that of the strongest competing class at the negative center. A key point is that it separates the upper and lower bounds of the classification margin for each individual sample, showing that the lower bound is influenced by the distance to positive samples $\|\mathbf{e}_k - \overline{\mathbf{e}}_k^{+}\|$ and $\overline{m}_k^{+}$, while the upper bound is determined by the distance to negative samples $\|\mathbf{e}_k - \overline{\mathbf{e}}_k^{-}\|$ and $\overline{m}_k^{-}$. Thus we have:
- The central distance to the positive set decreases over training: $\frac{d}{dt} \|\mathbf{e}_k - \overline{\mathbf{e}}_k^{+}\|^2 \leq 0$;
- The classification margin at the positive center increases: there exists a positive constant $\delta$ such that

  $$g_{y_k,j}(\mathbf{e}_k^{+} + \epsilon_{+}) - g_{y_k,j}(\mathbf{e}_k^{+}) \geq \delta \|\epsilon_{+}\|, \tag{6}$$

  where $\epsilon_{+}$ is the perturbation caused by the similarity loss.
By minimizing the objective $L^{\text{sr}}$, the positive center $\overline{\mathbf{e}}_k^{+}$ shifts its weights toward same-class samples that are more similar to the anchor $\mathbf{e}_k$, causing $\|\mathbf{e}_k - \overline{\mathbf{e}}_k^{+}\|$ to decrease by $\gamma$. Simultaneously, the positive group margin increases by at least $\delta \|\epsilon_{+}\|$. Therefore, the classification margin of the $k$-th token improves to at least $m_k^{\prime} \geq m_k + \delta \|\epsilon_{+}\| + \sqrt{2} L_P \gamma$. Consequently, the cross-entropy loss decreases to at most $\ell_k^{\prime} \leq \ell_k \cdot \exp\left(-\left(\delta \|\epsilon_{+}\| + \sqrt{2} L_P \gamma\right)\right)$, which can also accelerate the pretraining process.
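The bound in Eq. (4) and the lower bound in Eq. (5) can be traced in a few steps. The following is a sketch under the assumptions already stated in this section, in particular that the pair-wise gap $g$ satisfies the Lipschitz-type inequality with constant $\sqrt{2} L_P$ used above; it is not the authors' proof, which is deferred to Appendix B.

```latex
% Eq. (4): start from the per-sample term of the cross-entropy in Eq. (2).
\begin{align*}
\ell_i &= -\mathbf{z}_{i,y_i} + \log \sum_{j} \exp(\mathbf{z}_{i,j})
        = \log\Big(1 + \sum_{j \neq y_i} \exp(\mathbf{z}_{i,j} - \mathbf{z}_{i,y_i})\Big) \\
       &\leq \sum_{j \neq y_i} \exp(\mathbf{z}_{i,j} - \mathbf{z}_{i,y_i})
        \qquad \text{since } \log(1+x) \leq x \text{ for } x \geq 0 \\
       &\leq (C-1)\exp(-m_i) \leq C \exp(-m_i)
        \qquad \text{since } \mathbf{z}_{i,j} - \mathbf{z}_{i,y_i} \leq -m_i \text{ for all } j \neq y_i.
\end{align*}

% Lower bound in Eq. (5): note that m_k = \min_{c \neq y_k} g_{y_k,c}(\mathbf{e}_k).
% For every c \neq y_k, the Lipschitz-type inequality and the definition of
% \overline{m}_k^{+} as a minimum over c give
\begin{align*}
g_{y_k,c}(\mathbf{e}_k)
  \geq g_{y_k,c}(\overline{\mathbf{e}}_k^{+})
       - \sqrt{2} L_P \|\mathbf{e}_k - \overline{\mathbf{e}}_k^{+}\|
  \geq \overline{m}_k^{+}
       - \sqrt{2} L_P \|\mathbf{e}_k - \overline{\mathbf{e}}_k^{+}\|.
\end{align*}
% Taking the minimum over c \neq y_k on the left-hand side yields
% m_k \geq \overline{m}_k^{+} - \sqrt{2} L_P \|\mathbf{e}_k - \overline{\mathbf{e}}_k^{+}\|.
```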
## 5Experiments
In this section, we present empirical studies of the proposed SimReg loss. We primarily investigate its advantages in pretraining tasks, including its acceleration of the training loss, its improvements on downstream evaluations, and its influence on the dynamics of embedding similarity during training. We also examine its sensitivity to hyperparameters and its behavior. Moreover, we explore the practical effects of inserting the SimReg loss at different positions within the model. These experiments can provide useful technical guidance for the community.
Model Backbones. We mainly select LLaMA (Touvron et al., [2023](https://arxiv.org/html/2605.08809#bib.bib3)) and Mixtral (Jiang et al., [2024](https://arxiv.org/html/2605.08809#bib.bib4)) as the dense and MoE backbones for pretraining. These include the core modules of mainstream models in the current community, e.g., RoPE (Su et al., [2024](https://arxiv.org/html/2605.08809#bib.bib7)), RMSNorm (Zhang and Sennrich, [2019](https://arxiv.org/html/2605.08809#bib.bib8)), and SwiGLU (Shazeer, [2020](https://arxiv.org/html/2605.08809#bib.bib9)). We conduct experiments on dense models with 350M, 1.3B, 3B, and 7B parameters, and on the MoE model with 8B parameters.
Training Hyperparameters. We follow the experimental setups reported in several recent LLM pretraining studies (Touvron et al., [2023](https://arxiv.org/html/2605.08809#bib.bib3); Liu et al., [2024a](https://arxiv.org/html/2605.08809#bib.bib1); Jiang et al., [2024](https://arxiv.org/html/2605.08809#bib.bib4); Baidu-ERNIE-Team, [2025](https://arxiv.org/html/2605.08809#bib.bib52)) to configure the baseline hyperparameters. We employ the AdamW optimizer (Loshchilov and Hutter, [2019](https://arxiv.org/html/2605.08809#bib.bib2)) with $\beta_1 = 0.9$, $\beta_2 = 0.95$, and a weight decay of $0.1$. The standard deviation of the weight initialization is set to $0.01$. The global batch size is set to $512$ for the 350M and MoE-7B models, and $2048$ for the 1.3B, 3B, and 7B dense models. The input sequence length is fixed to $2048$. For the learning rate schedule, we adopt a $2000$-step warm-up phase to linearly increase the learning rate from $0$ to $3 \times 10^{-4}$, followed by a cosine decay strategy that gradually reduces it to one-tenth of its peak value. For dense models, we train about 13B tokens for the 350M model and 52B tokens for the larger dense models. For MoE models, we train approximately 52B tokens. To avoid loss spikes, we adopt AdaGC (Wang et al., [2025](https://arxiv.org/html/2605.08809#bib.bib5)) to clip gradients for all methods. Other details are stated in the Appendix.
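The learning rate schedule described above (linear warm-up followed by cosine decay to one-tenth of the peak) can be written as a small function. This is a sketch under stated assumptions: the function name `learning_rate` is hypothetical, and `total_steps` is not given in the text. The value used in the example is only a rough estimate obtained by dividing about 52B tokens by 2048 sequences of 2048 tokens per step.

```python
import math

def learning_rate(step, total_steps, peak_lr=3e-4, warmup_steps=2000, final_ratio=0.1):
    """Linear warm-up from 0 to peak_lr, then cosine decay to final_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    min_lr = peak_lr * final_ratio
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run length: roughly 52e9 / (2048 * 2048) optimizer steps.
example_total_steps = 12_400
schedule = [learning_rate(s, example_total_steps) for s in (0, 1000, 2000, 7200, 12_400)]
```

The schedule starts at 0, reaches the peak of $3 \times 10^{-4}$ at step 2000, and ends at $3 \times 10^{-5}$ when `step` equals `total_steps`.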
Baselines. We select the Simple Contrastive Sentence Embedding Gao et al. ([2021](https://arxiv.org/html/2605.08809#bib.bib40)) (SimCSE), Contrastive Pretraining Neelakantan et al. ([2022](https://arxiv.org/html/2605.08809#bib.bib18)) (CPretrain), Consistency Regularization Yin et al. ([2023](https://arxiv.org/html/2605.08809#bib.bib17)) (CReg), and Similarity Contrastive Estimation Denize et al. ([2023](https://arxiv.org/html/2605.08809#bib.bib28)) (SCE). SimCSE adopts the contrastive loss on the sentence embedding. CPretrain minimizes the similarity distribution. CReg treats each token pair as an independent negative example. SCE adopts a weighted similarity via latent distributions. Not all of these works were designed for pretraining; however, they share certain conceptual similarities. In our experiments, we uniformly adapt them to the pretraining framework.
### 5.1 Empirical Studies on Performance
Table 1: Generalization performance comparisons: zero-shot evaluations on the downstream tasks.

Figure 3: Cross-entropy loss acceleration (upper) and contrastive similarity improvements (lower) in pretraining. Panels: (a) LLaMA-1.3B, (b) LLaMA-3B, (c) LLaMA-7B. "CE" denotes the cross-entropy loss and "SR" denotes the similarity regularization loss. The SimReg loss helps to further reduce the contrastive similarity.

Table 2: Generalization efficiency: zero-shot evaluations on the general downstream tasks.

| Model | Arc-E | Arc-C | BoolQ | HellaS. | Obqa | Piqa | Mmlu | WinoG. | Sciq | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-350M | 38.64 | 22.95 | 57.09 | 36.51 | 28.40 | 66.49 | 22.95 | 51.30 | 63.20 | 43.06 |
| ∘ SimReg | 40.15 | 24.49 | 57.55 | 37.64 | 29.40 | 68.26 | 22.92 | 52.07 | 64.40 | 44.10 |
| ∘ SimReg-Chunk | 39.77 | 24.23 | 58.14 | 37.25 | 29.40 | 67.59 | 23.02 | 51.84 | 64.40 | 43.96 |
| LLaMA-1.3B | 46.21 | 25.09 | 58.01 | 49.60 | 31.80 | 72.14 | 23.07 | 52.80 | 68.90 | 47.51 |
| ∘ SimReg | 46.51 | 26.79 | 61.01 | 52.51 | 30.40 | 72.91 | 24.06 | 54.14 | 69.50 | 48.65 |
| ∘ SimReg-Chunk | 46.80 | 26.11 | 59.17 | 51.94 | 31.80 | 72.25 | 23.12 | 54.78 | 69.00 | 48.33 |
| LLaMA-3B | 48.91 | 27.30 | 58.29 | 55.67 | 33.00 | 74.16 | 23.65 | 55.49 | 73.50 | 50.00 |
| ∘ SimReg | 50.59 | 28.07 | 58.65 | 57.65 | 33.40 | 74.32 | 23.95 | 56.67 | 75.30 | 50.96 |
| ∘ SimReg-Chunk | 50.80 | 27.39 | 62.48 | 58.49 | 33.60 | 73.88 | 22.95 | 55.64 | 73.20 | 50.94 |
| LLaMA-7B | 53.07 | 28.84 | 54.07 | 60.41 | 33.80 | 76.12 | 23.79 | 57.30 | 75.70 | 51.45 |
| ∘ SimReg | 52.57 | 29.01 | 59.79 | 62.01 | 35.80 | 75.14 | 24.47 | 59.04 | 76.20 | 52.67 |
| ∘ SimReg-Chunk | 51.60 | 29.69 | 62.39 | 61.80 | 35.80 | 75.46 | 23.51 | 58.72 | 76.00 | 52.77 |
| Mixtral-8×1B | 48.86 | 29.18 | 54.62 | 59.57 | 34.00 | 73.88 | 24.17 | 56.99 | 72.40 | 50.41 |
| ∘ SimReg | 51.81 | 28.75 | 60.03 | 62.53 | 35.00 | 75.08 | 23.59 | 54.30 | 74.10 | 51.69 |
| ∘ SimReg-Chunk | 52.04 | 28.98 | 60.26 | 62.76 | 35.23 | 73.88 | 23.82 | 54.53 | 73.10 | 51.62 |

Higher Generalization. Table [1](https://arxiv.org/html/2605.08809#S5.T1) shows the zero-shot generalization results on a range of downstream tasks. Overall, existing consistency-based baselines bring only marginal or inconsistent improvements in the averaged performance across tasks, and in some cases even lead to slight degradation. In contrast, our method achieves the most consistent and significant gains in average accuracy for both model scales, improving the mean score by +1.04% for LLaMA-350M and +1.14% for LLaMA-1.3B. This trend indicates that our approach provides more effective downstream transfer and stronger generalization than prior methods on the general downstream tasks.
Faster Convergence. We first demonstrate the training acceleration of SimReg in large-scale pretraining. As shown in the upper part of Figure [3](https://arxiv.org/html/2605.08809#S5.F3), on the 1.3B model the speedup reaches nearly 33%, and after training on 52B tokens the cross-entropy loss is reduced by about 0.05. On the larger 3B and 7B models, SimReg achieves more than 37% speedup when training reaches 52B tokens, with the final training loss reduced by about 0.03. The lower part shows the SimReg loss. Cross-entropy does not impose an explicit constraint on feature similarity: when training with cross-entropy alone, the feature similarity declines rapidly in the early stage and then gradually stabilizes. From that point on, the network no longer learns to accelerate classification training by enhancing feature separability. Interestingly, when trained solely with cross-entropy, the similarity regularization value for almost all networks eventually converges to around 0.01, which implies that the average angle between words of different classes is approximately 61.3 degrees. After introducing the SimReg loss, the embedding similarity decreases significantly, with the regularization loss converging to about 0.00001, indicating that the average angle among tokens reaches approximately 74 degrees.
Chunk-wise SimReg vs. Full SimReg. Table [2](https://arxiv.org/html/2605.08809#S5.T2) shows that chunk-wise SimReg achieves performance comparable to that of full SimReg, and even outperforms it on the 7B model. Under large chunks, the expressive capacity of the SimReg loss becomes limited. When dealing with an excessively large number of tokens, the effective supervisory signal for each individual token is weakened. There is a trade-off between the expressive capacity of the loss and its strength of supervision. This phenomenon becomes more pronounced as the parameter scale increases. As the model scales up, the dimensionality of the hidden states grows proportionally, which naturally leads to larger angles between embeddings. When computing similarity regularization in high-dimensional spaces, the number of participating tokens has a stronger influence on the evaluation quality for each individual token. Thus, chunk-wise SimReg can be considered an effective alternative to full SimReg for large-scale model pretraining.
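To make the chunk-wise idea concrete, here is a minimal sketch in plain Python. The exact SimReg loss is defined earlier in the paper and is not reproduced here; as a stand-in, this sketch uses a generic supervised-contrastive (SupCon-style) loss over tokens that share a ground-truth label. The function names `supcon_token_loss` and `chunked_supcon_loss` are hypothetical, and the only point illustrated is the structural difference between computing the loss over a whole sequence and over fixed-size chunks.

```python
import math


def _normalize(v):
    # Scale a vector to unit length (assumes a nonzero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_token_loss(embeddings, labels, tau=0.01):
    """SupCon-style loss over one group of tokens.

    Illustrative stand-in only, NOT the paper's exact SimReg loss.
    Tokens with the same label are positives; all other tokens in the
    group act as negatives.
    """
    z = [_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # anchors without a same-label partner are skipped
        logits = {j: sum(a * b for a, b in zip(z[i], z[j])) / tau for j in others}
        # Numerically stable log-sum-exp over all non-anchor tokens.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def chunked_supcon_loss(embeddings, labels, chunk_size, tau=0.01):
    """Average the loss over consecutive chunks instead of the full sequence."""
    losses = []
    for start in range(0, len(labels), chunk_size):
        losses.append(
            supcon_token_loss(
                embeddings[start:start + chunk_size],
                labels[start:start + chunk_size],
                tau,
            )
        )
    return sum(losses) / len(losses)
```

With chunking, each token is compared with at most `chunk_size - 1` other tokens rather than with the whole sequence, so the pairwise similarity computation per chunk is smaller and each token competes against fewer negatives.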
Scaling to Large Models. Table [2](https://arxiv.org/html/2605.08809#S5.T2) reports the performance on larger-scale models. Overall, introducing the SimReg loss consistently improves the average performance from 350M to 7B. SimReg brings a +1.14% average improvement on LLaMA-1.3B, +0.96% on LLaMA-3B, +1.22% on LLaMA-7B, and +1.28% on Mixtral-8×1B. These results highlight that SimReg provides stable and non-trivial gains as the model scale increases. Moreover, SimReg achieves its largest single-task gain of +5.72% on BoolQ with LLaMA-7B. Besides BoolQ, we also observe clear improvements on HellaSwag, WinoGrande, and SciQ across multiple scales, showing that it is particularly effective for reasoning-heavy and multiple-choice tasks. These consistent improvements further suggest that SimReg is a simple yet broadly applicable strategy for enhancing generalization in large-scale pretraining.
### 5.2 Hyperparameter Sensitivity
Figure 4: (a) Grid search over hyperparameters $\tau$ and $\lambda$. The blue blocks indicate the combinations $(\tau,\lambda)$ whose final training loss is lower than the baseline, with darker colors representing lower losses. (b) A fine-grained search over different $\lambda$ values at the generally optimal $\tau=0.01$, using an approximate $2\times$ scaling ratio. (c) Trends of $\lambda$ across different model sizes (the red line indicates the optimal trend).

We first grid search $(\tau,\lambda)$ on the 350M model to identify a valid range, followed by a fine-grained search to determine their optimal combinations. Subsequently, we conduct scaling experiments on the 1.3B and 7B models to examine how the optimal choices vary as the model size increases and the corresponding token embedding dimension grows. As shown in Figure [4](https://arxiv.org/html/2605.08809#S5.F4)(a), to obtain stable results, we grid search the temperature coefficient $\tau$ over $[0.001, 0.003, 0.01, 0.03, 0.1]$ with a $3\times$ step, and coarsely choose the coefficient $\lambda$ from $[0.01, 0.1, 1, 10, 100]$ with a $10\times$ step. The valid range for $\tau$ is relatively limited, with $0.01$ proving to be a robust selection for all models. $\lambda$ spans a broad effective range from $0.1$ to $100$. Figure [4](https://arxiv.org/html/2605.08809#S5.F4)(b) presents a fine-grained exploration of $\lambda$, varying it from $0.1$ to $100$ with roughly $2\times$ resolution. The results reveal a stable region between $2$ and $20$. In Figure [4](https://arxiv.org/html/2605.08809#S5.F4)(c), we explore the scaling of hyperparameters and infer from results across different model sizes how to select optimal hyperparameters.
Specifically, when the embedding dimension increases, each token is represented in a higher-dimensional space. Therefore, it becomes necessary to increase $\lambda$ to maintain training efficiency. Our experiments confirm this trend, and current results suggest that every time the embedding dimension doubles, the optimal $\lambda$ increases by approximately a factor of $\sqrt{2}$. The optimal $\tau$ can be fixed as $0.01$ for all models.
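The empirical rule above (a factor of $\sqrt{2}$ per doubling of the embedding dimension) is equivalent to scaling $\lambda$ with the square root of the dimension. A minimal sketch follows; the helper name `scale_lambda` and the reference point used in the usage note are illustrative assumptions, not values reported in the paper.

```python
import math


def scale_lambda(base_lambda, base_dim, target_dim):
    """Scale lambda by sqrt(2) for every doubling of the embedding dimension.

    Equivalent to base_lambda * sqrt(target_dim / base_dim).
    base_lambda and base_dim are a reference point chosen by the user
    (hypothetical here, not taken from the paper).
    """
    doublings = math.log2(target_dim / base_dim)
    return base_lambda * math.sqrt(2) ** doublings
```

For example, if $\lambda=10$ were tuned at a hypothetical dimension of 1024, the rule would suggest roughly $\lambda\approx 14.1$ at dimension 2048 and $\lambda\approx 20$ at dimension 4096.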
### 5.3 Optimal Position of Adopting SimReg
Figure 5: Loss changes of adopting our SimReg loss at different layers on the 1B model.

In this part, we empirically investigate at which positions in the model, embedding supervision yields the best results. We divide the network according to its natural layer-wise structure and apply supervision at different depths. As shown in Figure [5](https://arxiv.org/html/2605.08809#S5.F5), supervision on intermediate layers brings only negligible performance gains. This is expected, as token representations in the middle layers are not simply tied to independent word meanings; instead, they encode blended contextual information aggregated from preceding tokens. Enforcing similarity regularization on such entangled representations may therefore provide limited useful signal. In contrast, the final layers gradually project these broad contextual representations into more distinct semantic spaces that are directly used for next-token prediction. Our experiments further show that applying SimReg only at the last layer is sufficient to achieve efficient pretraining.
### 5.4 Runtime and Memory Consumption
Table 3: Performance and sensitivity ($T=600$).

We evaluate the training efficiency of our method on a 7B-scale model with a token embedding dimension of 4096. For SimReg-Chunk, we use a chunk size of 1024 to further reduce the computational footprint. All reported statistics are collected on H800 GPUs, and memory usage is measured by the maximum GPU memory allocation. As shown in Table [3](https://arxiv.org/html/2605.08809#S5.T3), incorporating the full SimReg loss results in less than a 2% increase in runtime and under a 1% increase in GPU memory consumption. In comparison, SimReg-Chunk introduces only negligible computational and memory overhead, making it effectively resource-neutral in practice. These results show that SimReg delivers meaningful performance gains with minimal additional training cost, highlighting its practicality as a lightweight and effective auxiliary component for large-scale pretraining.
## 6 Conclusion
In this work, we introduced SimReg, a similarity regularization loss for large-scale pretraining. We show that cross-entropy alone does not sufficiently enforce embedding consistency, whereas SimReg strengthens representation learning by aligning same-class tokens while separating different classes. Experiments on both dense and MoE models demonstrate that SimReg consistently accelerates convergence by more than 30% and improves downstream performance by over 1%. Moreover, it remains robust across different model scales and hyperparameter settings, indicating its practical applicability. These findings highlight consistency regularization as a promising direction for improving the efficiency and generalization of LLM pretraining.
## References
- Evaluating the stability of embedding\-based word similarities\.Transactions of the Association for Computational Linguistics6,pp\. 107–119\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p2.1)\.
- C\. Azuma, T\. Ito, and T\. Shimobaba \(2023\)Adversarial domain adaptation using contrastive learning\.Engineering Applications of Artificial Intelligence123,pp\. 106394\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1)\.
- Baidu\-ERNIE\-Team \(2025\)ERNIE 4\.5 technical report\.Cited by:[§5](https://arxiv.org/html/2605.08809#S5.p3.19)\.
- L\. Chen, Z\. Wang, S\. Ren, L\. Li, H\. Zhao, Y\. Li, Z\. Cai, H\. Guo, L\. Zhang, Y\. Xiong,et al\.\(2024\)Next token prediction towards multimodal intelligence: a comprehensive survey\.arXiv preprint arXiv:2412\.18619\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p2.1)\.
- C\. M\. de Andrade, F\. M\. Belem, W\. Cunha, C\. França, F\. Viegas, L\. Rocha, and M\. A\. Goncalves \(2023\)On the class separability of contextual embeddings representations–or “the classifier does not matter when the \(text\) representation is so good\!”\.Information Processing & Management60\(4\),pp\. 103336\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p2.1)\.
- X\. Deng, H\. Zhong, R\. Ai, F\. Feng, Z\. Wang, and X\. He \(2026\)Less is more: improving llm alignment via preference data selection\.Advances in Neural Information Processing Systems38,pp\. 161259–161285\.Cited by:[§1](https://arxiv.org/html/2605.08809#S1.p1.1)\.
- J\. Denize, J\. Rabarisoa, A\. Orcesi, R\. Hérault, and S\. Canu \(2023\)Similarity contrastive estimation for self\-supervised soft contrastive learning\.InProceedings of the IEEE/CVF winter conference on applications of computer vision,pp\. 2706–2716\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1),[§5](https://arxiv.org/html/2605.08809#S5.p4.1)\.
- Z\. Fan, Y\. Xian, Y\. Sun, and L\. Shen \(2025\)Joint selection for large\-scale pre\-training data via policy gradient\-based mask learning\.arXiv preprint arXiv:2512\.24265\.Cited by:[§1](https://arxiv.org/html/2605.08809#S1.p1.1)\.
- L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2024\)The language model evaluation harness\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.12608602),[Link](https://zenodo.org/records/12608602)Cited by:[§A\.1](https://arxiv.org/html/2605.08809#A1.SS1.p6.1)\.
- P\. Gao, R\. Zhang, Z\. He, H\. Wu, and H\. Wang \(2023\)An empirical study of consistency regularization for end\-to\-end speech\-to\-text translation\.arXiv preprint arXiv:2308\.14482\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p2.1)\.
- T\. Gao, X\. Yao, and D\. Chen \(2021\)Simcse: simple contrastive learning of sentence embeddings\.arXiv preprint arXiv:2104\.08821\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p2.1),[§5](https://arxiv.org/html/2605.08809#S5.p4.1)\.
- Y\. Geng, S\. Li, F\. Zhang, S\. Zhang, L\. Yang, and H\. Lin \(2021\)Context\-aware and data\-augmented transformer for interactive argument pair identification\.InCCF International Conference on Natural Language Processing and Chinese Computing,pp\. 579–589\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1)\.
- B\. Gunel, J\. Du, A\. Conneau, and V\. Stoyanov \(2021\)Supervised contrastive learning for pre\-trained language model fine\-tuning\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=cu7IUiOhujH)Cited by:[§1](https://arxiv.org/html/2605.08809#S1.p3.1)\.
- J\. Hu, W\. Xia, X\. Zhang, C\. Fu, W\. Wu, Z\. Huan, A\. Li, Z\. Tang, and J\. Zhou \(2024\)Enhancing sequential recommendation via llm\-based semantic embedding learning\.InCompanion Proceedings of the ACM Web Conference 2024,pp\. 103–111\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p2.1)\.
- H\. Huang and Y\. Gong \(2022\)Contrastive learning: an alternative surrogate for offline data\-driven evolutionary computation\.IEEE Transactions on Evolutionary Computation27\(2\),pp\. 370–384\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1)\.
- Q\. Huang, T\. Ko, L\. Tang, X\. Liu, and B\. Wu \(2021\)Token\-level supervised contrastive learning for punctuation restoration\.InInterspeech,External Links:[Link](https://api.semanticscholar.org/CorpusID:236134216)Cited by:[§1](https://arxiv.org/html/2605.08809#S1.p3.1)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Roux, A\. Mensch, B\. Savary, C\. Bamford, D\. S\. Chaplot, D\. d\. l\. Casas, E\. B\. Hanna, F\. Bressand,et al\.\(2024\)Mixtral of experts\.arXiv preprint arXiv:2401\.04088\.Cited by:[§A\.1](https://arxiv.org/html/2605.08809#A1.SS1.p2.1),[§A\.1](https://arxiv.org/html/2605.08809#A1.SS1.p3.8),[§1](https://arxiv.org/html/2605.08809#S1.p4.1),[§5](https://arxiv.org/html/2605.08809#S5.p2.1),[§5](https://arxiv.org/html/2605.08809#S5.p3.19)\.
- T\. Kenter and M\. De Rijke \(2015\)Short text similarity with word embeddings\.InProceedings of the 24th ACM international on conference on information and knowledge management,pp\. 1411–1420\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p2.1)\.
- P\. Khosla, P\. Teterwak, C\. Wang, A\. Sarna, Y\. Tian, P\. Isola, A\. Maschinot, C\. Liu, and D\. Krishnan \(2020\)Supervised contrastive learning\.Advances in neural information processing systems33,pp\. 18661–18673\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1)\.
- B\. M\. Lake \(2019\)Compositional generalization through meta sequence\-to\-sequence learning\.Advances in neural information processing systems32\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p2.1)\.
- M\. T\. R\. Laskar, X\. Huang, and E\. Hoque \(2020\)Contextualized embeddings based transformer encoder for sentence similarity modeling in answer selection task\.InProceedings of the twelfth language resources and evaluation conference,pp\. 5505–5514\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p2.1)\.
- J\. Li, J\. Xu, S\. Li, S\. Huang, J\. Liu, Y\. Lian, and G\. Dai \(2024a\)Fast and efficient 2\-bit llm inference on gpu: 2/4/16\-bit in a weight matrix with asynchronous dequantization\.InProceedings of the 43rd IEEE/ACM International Conference on Computer\-Aided Design,pp\. 1–9\.Cited by:[§1](https://arxiv.org/html/2605.08809#S1.p1.1)\.
- Y\. Li, Y\. Huang, M\. E\. Ildiz, A\. S\. Rawat, and S\. Oymak \(2024b\)Mechanics of next token prediction with self\-attention\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 685–693\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p2.1)\.
- J\. Lin, J\. Tang, H\. Tang, S\. Yang, G\. Xiao, and S\. Han \(2025a\)Awq: activation\-aware weight quantization for on\-device llm compression and acceleration\.GetMobile: Mobile Computing and Communications28\(4\),pp\. 12–17\.Cited by:[§1](https://arxiv.org/html/2605.08809#S1.p1.1)\.
- Z\. Lin, S\. Basu, M\. Beigi, V\. Manjunatha, R\. A\. Rossi, Z\. Wang, Y\. Zhou, S\. Balasubramanian, A\. Zarei, K\. Rezaei,et al\.\(2025b\)A survey on mechanistic interpretability for multi\-modal foundation models\.arXiv preprint arXiv:2502\.17516\.Cited by:[§1](https://arxiv.org/html/2605.08809#S1.p1.1)\.
- A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan,et al\.\(2024a\)Deepseek\-v3 technical report\.arXiv preprint arXiv:2412\.19437\.Cited by:[§A\.1](https://arxiv.org/html/2605.08809#A1.SS1.p3.8),[§5](https://arxiv.org/html/2605.08809#S5.p3.19)\.
- R\. Liu, H\. Zuo, Z\. Lian, B\. W\. Schuller, and H\. Li \(2024b\)Contrastive learning based modality\-invariant feature acquisition for robust multimodal emotion recognition with missing modalities\.IEEE Transactions on Affective Computing15\(4\),pp\. 1856–1873\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1)\.
- I\. Loshchilov and F\. Hutter \(2019\)Decoupled weight decay regularization\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Bkg6RiCqY7)Cited by:[§A\.1](https://arxiv.org/html/2605.08809#A1.SS1.p3.8),[§5](https://arxiv.org/html/2605.08809#S5.p3.19)\.
- S\. Mai, Y\. Zeng, S\. Zheng, and H\. Hu \(2022\)Hybrid contrastive learning of tri\-modal representation for multimodal sentiment analysis\.IEEE Transactions on Affective Computing14\(3\),pp\. 2276–2289\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1)\.
- A\. Neelakantan, T\. Xu, R\. Puri, A\. Radford, J\. M\. Han, J\. Tworek, Q\. Yuan, N\. Tezak, J\. W\. Kim, C\. Hallacy,et al\.\(2022\)Text and code embeddings by contrastive pre\-training\.arXiv preprint arXiv:2201\.10005\.Cited by:[§5](https://arxiv.org/html/2605.08809#S5.p4.1)\.
- A\. v\. d\. Oord, Y\. Li, and O\. Vinyals \(2018\)Representation learning with contrastive predictive coding\.arXiv preprint arXiv:1807\.03748\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1)\.
- A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, I\. Sutskever,et al\.\(2019\)Language models are unsupervised multitask learners\.OpenAI blog1\(8\),pp\. 9\.Cited by:[§1](https://arxiv.org/html/2605.08809#S1.p1.1)\.
- N\. Shazeer \(2020\)Glu variants improve transformer\.arXiv preprint arXiv:2002\.05202\.Cited by:[§A\.1](https://arxiv.org/html/2605.08809#A1.SS1.p2.1),[§5](https://arxiv.org/html/2605.08809#S5.p2.1)\.
- L\. Shen, Y\. Sun, Z\. Yu, L\. Ding, X\. Tian, and D\. Tao \(2024\)On efficient training of large\-scale deep learning models\.ACM Computing Surveys57\(3\),pp\. 1–36\.Cited by:[§1](https://arxiv.org/html/2605.08809#S1.p1.1)\.
- L\. Shi, F\. Giunchiglia, R\. Song, D\. Shi, T\. Liu, X\. Diao, and H\. Xu \(2022\)A simple contrastive learning framework for interactive argument pair identification via argument\-context extraction\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,pp\. 10027–10039\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1)\.
- J\. Su, M\. Ahmed, Y\. Lu, S\. Pan, W\. Bo, and Y\. Liu \(2024\)Roformer: enhanced transformer with rotary position embedding\.Neurocomputing568,pp\. 127063\.Cited by:[§A\.1](https://arxiv.org/html/2605.08809#A1.SS1.p2.1),[§5](https://arxiv.org/html/2605.08809#S5.p2.1)\.
- L\. Sun, M\. Zhang, Y\. Lu, W\. Zhu, Y\. Yi, and F\. Yan \(2024\)Nodule\-clip: lung nodule classification based on multi\-modal contrastive learning\.Computers in Biology and Medicine175,pp\. 108505\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1)\.
- Y\. Sun, Q\. Zhang, Z\. Yu, X\. Zhang, L\. Shen, and D\. Tao \(2025\)Maskpro: linear\-space probabilistic learning for strict \(n: m\)\-sparsity on large language models\.arXiv preprint arXiv:2506\.12876\.Cited by:[§1](https://arxiv.org/html/2605.08809#S1.p1.1)\.
- C\. Tao, T\. Shen, S\. Gao, J\. Zhang, Z\. Li, K\. Hua, W\. Hu, Z\. Tao, and S\. Ma \(2024\)Llms are also effective embedding models: an in\-depth overview\.arXiv preprint arXiv:2412\.12591\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p2.1)\.
- O\. Topsakal and T\. C\. Akinci \(2023\)Creating large language model applications utilizing langchain: a primer on developing llm apps fast\.InInternational conference on applied engineering and natural sciences,Vol\.1,pp\. 1050–1056\.Cited by:[§1](https://arxiv.org/html/2605.08809#S1.p1.1)\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale,et al\.\(2023\)Llama 2: open foundation and fine\-tuned chat models\.arXiv preprint arXiv:2307\.09288\.Cited by:[§A\.1](https://arxiv.org/html/2605.08809#A1.SS1.p2.1),[§A\.1](https://arxiv.org/html/2605.08809#A1.SS1.p3.8),[§1](https://arxiv.org/html/2605.08809#S1.p4.1),[§5](https://arxiv.org/html/2605.08809#S5.p2.1),[§5](https://arxiv.org/html/2605.08809#S5.p3.19)\.
- V\. Verma, T\. Luong, K\. Kawaguchi, H\. Pham, and Q\. Le \(2021\)Towards domain\-agnostic contrastive learning\.InInternational Conference on Machine Learning,pp\. 10530–10541\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1)\.
- F\. Wang and H\. Liu \(2021\)Understanding the behaviour of contrastive loss\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 2495–2504\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1)\.
- G\. Wang, S\. Li, C\. Chen, J\. Zeng, J\. Yang, T\. Sun, Y\. Ma, D\. Yu, and L\. Shen \(2025\)AdaGC: improving training stability for large language model pretraining\.arXiv preprint arXiv:2502\.11034\.Cited by:[§A\.1](https://arxiv.org/html/2605.08809#A1.SS1.p4.8),[§5](https://arxiv.org/html/2605.08809#S5.p3.19)\.
- H\. Wang and D\. Yu \(2023\)Going beyond sentence embeddings: a token\-level matching algorithm for calculating semantic textual similarity\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),pp\. 563–570\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p2.1)\.
- R\. Wang, Z\. Wu, Z\. Weng, J\. Chen, G\. Qi, and Y\. Jiang \(2022\)Cross\-domain contrastive learning for unsupervised domain adaptation\.IEEE Transactions on Multimedia25,pp\. 1665–1673\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1)\.
- Y\. Wang, J\. Zhang, and Y\. Wang \(2024\)Do generated data always help contrastive learning?\.arXiv preprint arXiv:2403\.12448\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[§1](https://arxiv.org/html/2605.08809#S1.p1.1)\.
- Z\. Wen and Y\. Li \(2021\)Toward understanding the feature learning process of self\-supervised contrastive learning\.InInternational Conference on Machine Learning,pp\. 11112–11122\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1)\.
- T\. Wiedemer, P\. Mayilvahanan, M\. Bethge, and W\. Brendel \(2023\)Compositional generalization from first principles\.Advances in Neural Information Processing Systems36,pp\. 6941–6960\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p2.1)\.
- R\. Xie, Q\. Liu, L\. Wang, S\. Liu, B\. Zhang, and L\. Lin \(2022\)Contrastive cross\-domain recommendation in matching\.InProceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining,pp\. 4226–4236\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1)\.
- Y\. Ye, C\. Yu, Y\. Chang, L\. Zhu, X\. Zhao, L\. Yan, and Y\. Tian \(2022\)Unsupervised deraining: where contrastive learning meets self\-similarity\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 5821–5830\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1)\.
- Y\. Yin, J\. Zeng, Y\. Li, F\. Meng, J\. Zhou, and Y\. Zhang \(2023\)Consistency regularization training for compositional generalization\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 1294–1308\.Cited by:[§1](https://arxiv.org/html/2605.08809#S1.p3.1),[§2](https://arxiv.org/html/2605.08809#S2.p2.1),[§5](https://arxiv.org/html/2605.08809#S5.p4.1)\.
- X\. Yuan, Z\. Lin, J\. Kuen, J\. Zhang, Y\. Wang, M\. Maire, A\. Kale, and B\. Faieta \(2021\)Multimodal contrastive training for visual representation learning\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 6995–7004\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1)\.
- B\. Zhang and R\. Sennrich \(2019\)Root mean square layer normalization\.Advances in neural information processing systems32\.Cited by:[§A\.1](https://arxiv.org/html/2605.08809#A1.SS1.p2.1),[§5](https://arxiv.org/html/2605.08809#S5.p2.1)\.
- P\. Zhou, Y\. Huang, Y\. Xie, J\. Gao, S\. Wang, J\. B\. Kim, and S\. Kim \(2024\)Is contrastive learning necessary? a study of data augmentation vs contrastive learning in sequential recommendation\.InProceedings of the ACM Web Conference 2024,pp\. 3854–3863\.Cited by:[§2](https://arxiv.org/html/2605.08809#S2.p1.1)\.
## Appendix A: Experiments
### A.1 Experimental Setups
Here we present the detailed experimental setups used in this paper to ensure reproducibility.
Model Hyperparameters. We mainly select LLaMA2 (Touvron et al., [2023](https://arxiv.org/html/2605.08809#bib.bib3)) and Mixtral (Jiang et al., [2024](https://arxiv.org/html/2605.08809#bib.bib4)) as the dense and MoE backbones for pretraining. They include the core modules of the mainstream models in the current community, e.g., RoPE (Su et al., [2024](https://arxiv.org/html/2605.08809#bib.bib7)), RMSNorm (Zhang and Sennrich, [2019](https://arxiv.org/html/2605.08809#bib.bib8)), and SwiGLU (Shazeer, [2020](https://arxiv.org/html/2605.08809#bib.bib9)). We follow common practice in the community to scale models to different sizes, and the detailed configurations are shown in Table [4](https://arxiv.org/html/2605.08809#A1.T4).
Table 4: Model Hyperparameters.

Training Hyperparameters. We follow the experimental setups reported in several recent LLM pretraining studies (Touvron et al., [2023](https://arxiv.org/html/2605.08809#bib.bib3); Liu et al., [2024a](https://arxiv.org/html/2605.08809#bib.bib1); Jiang et al., [2024](https://arxiv.org/html/2605.08809#bib.bib4)) to configure the baseline hyperparameters, ensuring comparability with prior work. Specifically, we employ the AdamW optimizer (Loshchilov and Hutter, [2019](https://arxiv.org/html/2605.08809#bib.bib2)) with $\beta_1=0.9$ and $\beta_2=0.95$, and a weight decay of $0.1$. The standard deviation of weight initialization is set to $0.01$. To balance efficiency and stability, we use a global batch size of $512$ for the 350M and MoE-1×8B models, and $2048$ for the 1.3B, 3B, and 7B dense models, while the input sequence length is fixed at $2048$.
For the learning rate schedule, we adopt a 2000-step warm-up phase that linearly increases the learning rate from $0$ to 3e-4, followed by a cosine decay strategy that gradually reduces it to one-tenth of its peak value. Regarding training length, dense models are trained for $12{,}500$ steps, corresponding to roughly 13B tokens for the 350M model and 52B tokens for the larger dense models. In contrast, MoE models are trained for $50{,}000$ steps to ensure comparable exposure of approximately 52B tokens. Finally, to mitigate potential instabilities caused by loss spikes, we adopt AdaGC (Wang et al., [2025](https://arxiv.org/html/2605.08809#bib.bib5)) for adaptive gradient clipping. We summarize the details in Table [5](https://arxiv.org/html/2605.08809#A1.T5).
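The schedule and token budgets described above can be checked with a short sketch. It assumes that the cosine decay runs from the end of warm-up to the final training step, which the text does not state explicitly; the function names `learning_rate` and `tokens_seen` are hypothetical.

```python
import math

PEAK_LR = 3e-4       # peak learning rate after warm-up
WARMUP_STEPS = 2000  # linear warm-up length
MIN_LR_RATIO = 0.1   # decay floor: one-tenth of the peak value
SEQ_LEN = 2048       # input sequence length


def learning_rate(step, total_steps):
    """Linear warm-up from 0 to PEAK_LR, then cosine decay to 0.1 * PEAK_LR.

    Assumption: the decay horizon ends exactly at total_steps.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine)


def tokens_seen(steps, global_batch_size, seq_len=SEQ_LEN):
    """Number of tokens processed after a given number of optimizer steps."""
    return steps * global_batch_size * seq_len
```

Under these settings, 12,500 steps at batch size 512 give about 13.1B tokens, while 12,500 steps at batch size 2048 and 50,000 steps at batch size 512 both give about 52.4B tokens, matching the rounded figures in the text.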
Table 5: Training Hyperparameters.

Specific Hyperparameters. Our proposed loss function is primarily characterized by two key hyperparameters, the temperature $\tau$ and the coefficient $\lambda$. We conduct extensive grid search experiments ($\tau\in[0.001, 0.01, 0.1]$ and $\lambda\in[0.2, 0.5, 1, 2, 5, 10, 20, 50, 100]$) on the 350M model to determine the effective range of these hyperparameters, and validate them on larger models according to scaling theory. The simple settings of $\tau=0.01$ and $\lambda=10$ are sufficient to achieve good performance for most experiments. To better adapt to model scaling, we explore a more refined yet simple strategy to determine the selections, which is detailed in Sec. [A.2](https://arxiv.org/html/2605.08809#A1.SS2).
Evaluations. To ensure a fair comparison, we conduct all evaluations with the EleutherAI lm-evaluation-harness (Gao et al., [2024](https://arxiv.org/html/2605.08809#bib.bib6)). We mainly evaluate the pretrained models on the general reasoning tasks arc_easy, arc_challenge, openbookqa, boolq, hellaswag, piqa, winogrande, mmlu, and sciq, and on the domain-specific tasks gsm8k, drop, race, squadv2, nq_open, humaneval, and mbpp (three domains: math, code, and reading comprehension).
Training Resources. We conduct experiments on H800 GPUs. Pretraining the 350M model on 13B tokens requires approximately 56 GPU hours per experiment, and the 7B model on 52B tokens takes over 2,000 GPU hours per experiment.
### A.2 How to Scale Hyperparameters on Large Models?
In this part, we introduce a refined hyperparameter tuning mechanism to accommodate model scaling. Before introducing it, we first relate the representation ability of our SimReg loss to the dimensionality of the embeddings in the model. The SimReg loss regularizes pretraining by leveraging the embedding similarity between pairs of tokens. Assume that $\mathbf{x}, \mathbf{y} \in \mathbb{R}^{d}$ are independent and identically distributed isotropic random variables, e.g., $\mathbf{x}, \mathbf{y} \sim \mathcal{N}(0, I_{d})$, and consider their cosine similarity $z = \frac{\langle \mathbf{x}, \mathbf{y} \rangle}{\|\mathbf{x}\| \cdot \|\mathbf{y}\|} \in [-1, 1]$. Without loss of generality, we can take $\frac{\mathbf{y}}{\|\mathbf{y}\|} = (1, 0, \cdots, 0)$, the first basis vector on the sphere $S^{d-1}$. The distribution of $z$ then reduces to that of the first coordinate of $v \sim \text{Unif}(S^{d-1})$. Write $v$ in the recursive form of spherical coordinates, $v = (\cos\theta, \sin\theta \cdot \zeta)$ with $\zeta \in S^{d-2}$. According to the decomposition of the spherical surface element, we have $d\sigma_{d-1}(v) = \sin^{d-2}(\theta)\, d\theta\, d\sigma_{d-2}(\zeta)$ and the marginal density of the polar angle:

$$f_{p}(\theta) = \frac{1}{|S^{d-1}|} \int_{S^{d-2}} \sin^{d-2}(\theta)\, d\sigma_{d-2}(\zeta) = \frac{|S^{d-2}|}{|S^{d-1}|} \sin^{d-2}(\theta).$$

Then we consider the variable $z$. Since the first coordinate is $z = v_{0} = \cos(\theta)$, we have:

$$f_{p}(z) = f_{p}(\theta) \left|\frac{d\theta}{dz}\right| = \frac{|S^{d-2}|}{|S^{d-1}|} \cdot \frac{\sin^{d-2}(\theta)}{\sin(\theta)} = \frac{|S^{d-2}|}{|S^{d-1}|} \left(1 - z^{2}\right)^{\frac{d-3}{2}} = \frac{\Gamma(\frac{d}{2})}{\sqrt{\pi}\, \Gamma(\frac{d-1}{2})} \left(1 - z^{2}\right)^{\frac{d-3}{2}}.$$

It is easy to check that $\mathbb{E}[z] = 0$ and $\mathbb{E}[z^{2}] = \frac{1}{d}$, so the typical magnitude of $z$ scales as $1/\sqrt{d}$. Therefore, as the model size increases and the embedding dimensionality changes from $d_{0}$ to $d_{1}$, the capacity of the SimReg loss decreases by a factor of $\sqrt{\frac{d_{1}}{d_{0}}}$. To preserve the representation capability, we can revise the $\lambda$ coefficient.
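The two moment claims for the cosine-similarity density can be checked numerically. The sketch below integrates the density $\frac{\Gamma(d/2)}{\sqrt{\pi}\,\Gamma((d-1)/2)}(1-z^{2})^{(d-3)/2}$ with a plain midpoint rule; the helper names `cosine_density` and `moments` and the grid size are illustrative choices, not part of the paper.

```python
import math


def cosine_density(z, d):
    """Density of the cosine similarity z between two isotropic vectors in R^d (d >= 3)."""
    # Work in log space so that large d does not overflow the Gamma functions.
    log_c = math.lgamma(d / 2) - math.lgamma((d - 1) / 2) - 0.5 * math.log(math.pi)
    return math.exp(log_c + 0.5 * (d - 3) * math.log1p(-z * z))


def moments(d, n=20000):
    """Midpoint-rule estimates of the total mass and the second moment on [-1, 1]."""
    h = 2.0 / n
    mass = 0.0
    second = 0.0
    for i in range(n):
        z = -1.0 + (i + 0.5) * h  # midpoints never hit z = -1 or z = 1 exactly
        f = cosine_density(z, d)
        mass += f * h
        second += z * z * f * h
    return mass, second


if __name__ == "__main__":
    for d in (8, 64, 1024):
        mass, second = moments(d)
        # mass should be close to 1 and second * d close to 1 (i.e. E[z^2] = 1/d).
        print(d, mass, second * d)
```

The second moment shrinking as $1/d$ is what motivates rescaling $\lambda$ with the hidden size in the rule proposed later in this section.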
We next investigate the feasibility of this scaling method from an empirical perspective. We separately sweep the hyperparameters and report the evaluation perplexity (ppl) at the end of training.

Table 6: Validation perplexity (generalization performance) of different $(\tau, \lambda_{\text{reg}})$.

The optimal range and variation trend in Table [6](https://arxiv.org/html/2605.08809#A1.T6) and Figure [4](https://arxiv.org/html/2605.08809#S5.F4) are almost identical to those observed in the optimization process, indicating that the improvements brought by the SimReg loss in both optimization and generalization are consistent. The optimal choice of $\tau$ remains concentrated around $0.01$. Next, we evaluate models of different scales (primarily with increased embedding hidden sizes), while keeping $\tau$ fixed at $0.01$.

Table 7: Optimal validation perplexity (generalization performance) for different model sizes.

It can be observed that the trend largely aligns with our hypothesis. Therefore, we propose the following estimation method for the optimal hyperparameters:
$$\tau = 0.01, \quad \lambda_{\text{reg}} \approx 10 \times \sqrt{\frac{d}{1024}},$$

where $d$ is the hidden size of the token embedding. Of course, the scale of the model also affects the results. In practice, a simple grid search within this range of choices can be performed to identify the optimal combination.
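The estimation rule above is simple enough to write down as a helper. The function name `suggested_simreg_hparams` and its keyword arguments are hypothetical and only encode the formula $\lambda_{\text{reg}} \approx 10 \times \sqrt{d / 1024}$ with $\tau = 0.01$; the result is meant as a starting point for the grid search mentioned in the text, not a final setting.

```python
import math


def suggested_simreg_hparams(hidden_size, base_lambda=10.0, base_hidden_size=1024, tau=0.01):
    """Return (tau, lambda_reg) following lambda_reg ~ 10 * sqrt(d / 1024)."""
    lambda_reg = base_lambda * math.sqrt(hidden_size / base_hidden_size)
    return tau, lambda_reg


if __name__ == "__main__":
    for d in (1024, 2048, 4096):
        print(d, suggested_simreg_hparams(d))
```

For example, a hidden size of 4096 gives $\lambda_{\text{reg}} \approx 10 \times \sqrt{4} = 20$.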
### A.3 SimReg Loss Curves
In this section, we mainly present the variations of the SimReg loss. We explore the limitations of cross-entropy in LLM pretraining, namely, that it cannot achieve better classification performance simply by further reducing feature separability. This is because cross-entropy focuses solely on aligning predictions with ground-truth labels, while leaving the underlying structure of token embeddings insufficiently constrained. As the model scales up, this weakness becomes more pronounced: embeddings of the same class may still scatter in the representation space, leading to instability in optimization and slower convergence. By contrast, the SimReg loss explicitly regularizes intra-class consistency and inter-class separation, complementing cross-entropy with a more direct control of embedding geometry. This additional constraint not only improves convergence speed but also yields more robust generalization in downstream tasks.
Figure 6: The training curve of the SimReg loss, panels (a) and (b).

Figure [6](https://arxiv.org/html/2605.08809#A1.F6) illustrates the loss behavior when increasing the coefficient of the SimReg loss. It can be observed that even with a large weighting ratio, SimReg does not cause the training to diverge. At the same time, we also note that the feature consistency loss exhibits a strictly monotonic trend. This phenomenon suggests that SimReg serves as a stable regularization term: rather than interfering with the optimization of cross-entropy, it progressively strengthens the alignment of token embeddings as its weight grows. In practice, this means that a wide range of coefficient values can be applied without destabilizing training, making SimReg highly robust and easy to integrate into large-scale pretraining pipelines.
Trade-off of $\lambda$. Although we generally hope that greater feature separability will lead to better performance, the pretraining process involves not only learning representations but also learning classification. If $\lambda$ is increased without bound, the weight of SimReg may eventually become too dominant and interfere with the optimization of cross-entropy. This phenomenon can be directly observed from the changes in gradient behavior, which provide an intuitive reflection of the trade-off between the two objectives. The comparison in Table [8](https://arxiv.org/html/2605.08809#A1.T8) clearly illustrates the effects of cross-entropy and SimReg under different parameter settings.

Table 8: Changing trend of the CrossEntropy and SimReg losses for different $\lambda$.
### A.4 Visualization of the Token Embedding Similarity
Here we provide more visualizations for real pretraining samples from the C4 dataset on LLaMA-7B.

Text 1: [so I’m not sure if there’s anything holding the back. I do not think there is by wiggling on it but could possibly have a strap or the like. I would think there must be a way to remove the panel blocking the bottom of the washer. We installed our own washer and used the clips mentioned in the previous post. Here is a PDF file on how they are used and what they look like. You may want to run your fingers over the entire carpeted lip ... typically, the]

Figure 7: The averaged cosine similarity values are 0.488 (CrossEntropy only, left) and 0.354 (CrossEntropy + SimReg, right).

Text 2: [manufacturer runs screws into the floors/cabinets and the heads are buried in the carpet. There are two screws with square heads in the top of the carpet. Have you tired to do the recommended procedure to clean the lint out of the drain. 1. Run the unit without clothes and with the dry time off on cycle # 11. 2. When the water stops entering the unit push and hold the start button until all the lights come on then release the button.]

Figure 8: The averaged cosine similarity values are 0.445 (CrossEntropy only, left) and 0.333 (CrossEntropy + SimReg, right).
## Appendix B: Theoretical Analysis
In this section, we present a theoretical analysis of how the SimReg loss improves convergence and generalization efficiency. To this end, we first establish the fundamental properties of the proposed objective and analyze its impact on representation learning. We then present rigorous bounds and intuitive explanations that highlight its advantages over conventional cross-entropy training. These insights not only provide a deeper understanding of why SimReg is effective but also offer useful guidance for its broader application in large-scale pretraining.
### B.1 Relationship between Empirical Loss and Margins
We first introduce the simplified modeling and corresponding notation for LLM pretraining. Without loss of generality, we decompose the model into two simple parts. The first part is the front-end structure, which takes the raw data as input and outputs the embedding representations. The second part is the back-end structure, which transforms the embeddings into logits, followed by a cross-entropy loss function. We denote $X = [x_{1}, x_{2}, \cdots, x_{n}] \in \mathbb{R}^{n \times d}$ as the embeddings and $Z = f_{h}(X) \in \mathbb{R}^{n \times C}$ as the logits. The category labels are denoted by $Y = [y_{1}, y_{2}, \cdots, y_{n}] \in \{1, 2, \cdots, C\}^{n}$. For the sample-wise cross-entropy loss, we have:

$$\ell(x_{i}, y_{i}) = -z_{i,y_{i}} + \log\left(\sum_{j=1}^{C} e^{z_{i,j}}\right).$$

The empirical loss is $L = \frac{1}{n} \sum_{i=1}^{n} \ell(x_{i}, y_{i})$. Then we consider the margin in multi-class classification, i.e., the gap between the true-class logit and the largest competing logit, $m_{i} = z_{i,y_{i}} - \max_{j \neq y_{i}} z_{i,j}$. Therefore, we have:

$$\ell(x_{i}, y_{i}) = \log\left(1 + \sum_{j \neq y_{i}} e^{-\left(z_{i,y_{i}} - z_{i,j}\right)}\right) \leq \log\left(1 + (C-1) e^{-m_{i}}\right) \leq (C-1) e^{-m_{i}},$$

so the empirical loss satisfies $L = \frac{1}{n} \sum_{i=1}^{n} \ell(x_{i}, y_{i}) \leq \frac{C-1}{n} \sum_{i=1}^{n} e^{-m_{i}}$. Generally, if the classification margins of all samples are increased by at least $\Delta \geq 0$, this upper bound on the loss is multiplicatively reduced by a factor of $e^{-\Delta}$.
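The chain of inequalities relating the cross-entropy loss to the margin can be verified on concrete logits. The following sketch is a numerical illustration only; the function names and the example logits are made up for this purpose.

```python
import math


def cross_entropy(logits, label):
    """Sample-wise cross-entropy -z_y + log(sum_j exp(z_j)), computed stably."""
    m = max(logits)
    return -logits[label] + m + math.log(sum(math.exp(z - m) for z in logits))


def margin(logits, label):
    """Margin m = z_y - max_{j != y} z_j."""
    return logits[label] - max(z for j, z in enumerate(logits) if j != label)


def margin_bounds(logits, label):
    """Return (loss, log(1 + (C-1) e^{-m}), (C-1) e^{-m}) for one sample."""
    num_classes = len(logits)
    m = margin(logits, label)
    exp_bound = (num_classes - 1) * math.exp(-m)
    return cross_entropy(logits, label), math.log1p(exp_bound), exp_bound


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0, 0.0]
    for label in range(len(logits)):
        loss, log_bound, exp_bound = margin_bounds(logits, label)
        # Each row should be non-decreasing from left to right.
        print(label, loss, log_bound, exp_bound)
```

Raising the true-class logit by $\Delta$ while leaving the others fixed increases the margin by $\Delta$ and scales the last bound by exactly $e^{-\Delta}$, which is the multiplicative reduction stated above.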
### B.2 Equivalent Constraint of the SimReg Loss
Here we study how the SimReg loss affects the embeddings and the model performance. We write each embedding as $\mathbf{e}_{i} = r_{i} \mathbf{a}_{i}$, where $r_{i} = \|\mathbf{e}_{i}\| \geq 0$ is the magnitude and $\mathbf{a}_{i}$ is the normalized embedding. The SimReg loss evaluates the exponential of the cosine similarity of two embeddings, so its core focus lies in the geometric information carried by the direction $\mathbf{a}$. To analyze SimReg, for each label $y_{i}$ we define a positive set $\mathcal{P}_{i} = \{\mathbf{a}_{j} : y_{j} = y_{i}\}$ and a negative set $\mathcal{N}_{i} = \{\mathbf{a}_{j} : y_{j} \neq y_{i}\}$. The union of $\mathcal{P}_{i}$ and $\mathcal{N}_{i}$ always covers the complete sequence.
To understand the behavior of SimReg in detail, we first introduce a general kernel function $\kappa(\mathbf{u}, \mathbf{v}) = \exp(\mathbf{u}^{\top} \mathbf{v})$, which admits the Maclaurin series $\kappa(\mathbf{u}, \mathbf{v}) = \sum_{m=0}^{\infty} \frac{(\mathbf{u}^{\top} \mathbf{v})^{m}}{m!}$. It is a positive definite kernel on the unit sphere. We introduce an explicit map $h: \mathbb{S}^{d-1} \rightarrow \mathcal{H}$ on the symmetric tensor powers:

$$h(\mathbf{u}) = \left[1, \frac{1}{\sqrt{\pi}} \mathbf{u}, \frac{1}{\sqrt{2!\,\pi^{2}}} \text{vec}\left(\mathbf{u}^{\otimes 2}\right), \frac{1}{\sqrt{3!\,\pi^{3}}} \text{vec}\left(\mathbf{u}^{\otimes 3}\right), \cdots\right], \tag{7}$$

so that $\langle h(\mathbf{u}), h(\mathbf{v}) \rangle = \kappa(\mathbf{u}, \mathbf{v})$. The mapping $h$ constructs a linear expansion of $\kappa$ in the reproducing kernel Hilbert space (RKHS) $\mathcal{H}$. Therefore, we have:

$$\log\left(\sum_{i \in \mathcal{P}_{k}} \exp\left(\mathbf{e}_{k}^{\top} \mathbf{e}_{i}\right)\right) = \log\left(\sum_{i \in \mathcal{P}_{k}} \left\langle h(\mathbf{e}_{k}), h(\mathbf{e}_{i}) \right\rangle\right) = \log\left(\left\langle h(\mathbf{e}_{k}), \mu_{k}^{+} \right\rangle\right) + \log\left(|\mathcal{P}_{k}|\right),$$

where $\mu_{k}^{+} = \frac{1}{|\mathcal{P}_{k}|} \sum_{i \in \mathcal{P}_{k}} h(\mathbf{e}_{i})$ is the positive kernel mean. Here $|\mathcal{P}_{k}|$ can be considered as an offset that scales the positive samples. The theoretical analysis can be symmetrically extended to negative samples, yielding an equivalent conclusion.
Therefore, the SimReg loss considers the difference between the positive and negative sets by:

$$\min_{\mathbf{e} = f_{E}(\mathbf{x})}\ J = \mathbb{E}_{\mathbf{x}} \log\left(\frac{\left\langle h(\mathbf{e}_{k}), \mu_{k}^{-} \right\rangle}{\left\langle h(\mathbf{e}_{k}), \mu_{k}^{+} \right\rangle}\right) + \log\left(\frac{|\mathcal{N}_{k}|}{|\mathcal{P}_{k}|}\right).$$

The ratio of positive to negative samples only adds an offset to the loss and does not alter the primary optimization objective of the first term. That term pushes the anchor direction to align with the positive kernel mean and to anti-align with the negative kernel mean. It also nudges the group means themselves: positives move toward anchors that they are already close to, and negatives move away in the RKHS sense. We also have the nearest positive prototype for each class:

$$\max_{\|\mathbf{e}\| = 1} \left\langle h(\mathbf{e}), \mu_{k}^{+} \right\rangle = \|h(\mathbf{e})\| \|\mu_{k}^{+}\| = \sqrt{\kappa(\mathbf{e}, \mathbf{e})}\, \|\mu_{k}^{+}\| = \sqrt{e}\, \|\mu_{k}^{+}\|,$$

where the scalar $e$ under the square root is Euler's number, since $\kappa(\mathbf{e}, \mathbf{e}) = \exp(1)$ for a unit-norm embedding. The same $\sqrt{e}$ scaling also holds for the negative set. Beyond the optimization objective itself, we can further consider the problem from the perspective of gradient directions to refine the learning target. By considering the Fréchet gradient, we have:

$$\nabla_{h(\mathbf{e}_{k})} J = \frac{\mu_{k}^{-}}{\left\langle h(\mathbf{e}_{k}), \mu_{k}^{-} \right\rangle} - \frac{\mu_{k}^{+}}{\left\langle h(\mathbf{e}_{k}), \mu_{k}^{+} \right\rangle}.$$

Generally, $\mu_{k}^{-} \neq \mu_{k}^{+}$. From the gradient expression, we can see that the optimization dynamics naturally combine both "attractive" and "repulsive" effects. Specifically, the first term pushes the representation $h(\mathbf{e}_{k})$ away from the negative center $\mu_{k}^{-}$, while the second term pulls it closer to the positive center $\mu_{k}^{+}$. As a result, the overall update direction is shaped by the joint effect of being attracted to positives and repelled from negatives, thereby optimizing the representation space effectively. From the above two perspectives, it is clear that SimReg enforces feature consistency alignment in the RKHS sense.
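Because $\langle h(\mathbf{u}), h(\mathbf{v}) \rangle = \kappa(\mathbf{u}, \mathbf{v})$, the objective $J$ from this subsection can be evaluated without ever forming $h$: per anchor it reduces to $\log \sum_{j \in \mathcal{N}_{k}} \exp(\mathbf{a}_{k}^{\top} \mathbf{a}_{j}) - \log \sum_{i \in \mathcal{P}_{k}} \exp(\mathbf{a}_{k}^{\top} \mathbf{a}_{i})$. The sketch below computes this analysis-form quantity on a toy sequence. It is not the authors' training loss: no temperature is applied, the anchor is kept inside its own positive set (following the set definition literally), and anchors without negatives are skipped, all of which are assumptions made for illustration.

```python
import math


def simreg_objective(embeddings, labels):
    """Mean over anchors k of log(sum over N_k of exp(a_k . a_j)) - log(sum over P_k of exp(a_k . a_i)).

    Embeddings are L2-normalized first, so dot products are cosine similarities.
    Assumption: anchors with an empty negative set are skipped.
    """
    unit = []
    for e in embeddings:
        norm = math.sqrt(sum(x * x for x in e))
        unit.append([x / norm for x in e])

    total, count = 0.0, 0
    for k, a_k in enumerate(unit):
        pos = 0.0
        neg = 0.0
        for j, a_j in enumerate(unit):
            sim = math.exp(sum(p * q for p, q in zip(a_k, a_j)))
            if labels[j] == labels[k]:
                pos += sim  # P_k includes the anchor itself (assumption)
            else:
                neg += sim
        if neg > 0.0:
            total += math.log(neg) - math.log(pos)
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    emb = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    print(simreg_objective(emb, [0, 0, 1, 1]))  # same-label tokens aligned
    print(simreg_objective(emb, [0, 1, 0, 1]))  # labels unrelated to geometry
```

In the first call, same-label tokens share a direction and different-label tokens are orthogonal, so the objective is lower than in the second call, where the labels carry no geometric structure.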
### B.3 Center-aligned Embeddings Can Enhance Optimization
Then we consider the behavior of the center-aligned embedding. To study how the effect on the mapping $h(\mathbf{e}_{k})$ transfers to the original variable $\mathbf{e}_{k}$, we first consider the normalized term $\mathbf{a}_{k}$, for which the cosine similarity is $\mathbf{a}_{k}^{\top} \mathbf{a}_{j}$. To simplify the notation, we additionally define the weighted average directions of $\mathbf{a}_{k}$ over its associated positive and negative sets by $\mathbf{v}_{k}^{+} = \frac{1}{|\mathcal{P}_{k}|} \sum_{i \in \mathcal{P}_{k}} \exp\left(\mathbf{a}_{k}^{\top} \mathbf{a}_{i}\right) \mathbf{a}_{i}$ and $\mathbf{v}_{k}^{-} = \frac{1}{|\mathcal{N}_{k}|} \sum_{j \in \mathcal{N}_{k}} \exp\left(\mathbf{a}_{k}^{\top} \mathbf{a}_{j}\right) \mathbf{a}_{j}$. Similarly, we denote the losses on the positive set and the negative set by $P_{k}$ and $N_{k}$. Therefore, we have the following gradient form:

$$\nabla_{\mathbf{a}_{k}} L_{\text{sr}} = \frac{N_{k}}{P_{k} + N_{k}} \left(\mathbf{v}_{k}^{-} - \mathbf{v}_{k}^{+}\right).$$

Since $\mathbf{a}_{k}$ is constrained by $\|\mathbf{a}_{k}\| = 1$, the true update direction is obtained by projecting the gradient onto the tangent space at $\mathbf{a}_{k}$: $-\Pi_{\mathbf{a}_{k}} \nabla_{\mathbf{a}_{k}} L_{\text{sr}} = -\frac{N_{k}}{P_{k} + N_{k}} \left(I - \mathbf{a}_{k} \mathbf{a}_{k}^{\top}\right) \left(\mathbf{v}_{k}^{-} - \mathbf{v}_{k}^{+}\right)$. Next, we analyze how the gradient dynamics associated with the positive sample set vary along the update direction. This dynamic essentially characterizes how strongly the representation is pulled toward the positive center during optimization. A larger value indicates that the update direction aligns well with the attraction force from positive samples, thereby accelerating convergence. Conversely, a smaller value reflects weaker alignment, suggesting limited contribution from positive samples in shaping the optimization trajectory. For the positive sample loss, we obtain (for clarity of exposition, we omit constant scalar terms):
$$\frac{d}{dt} \|\mathbf{a}_{k} - \mathbf{v}_{k}^{+}\|^{2} = 2 \left(\mathbf{a}_{k} - \mathbf{v}_{k}^{+}\right)^{\top} \left(I - \mathbf{a}_{k} \mathbf{a}_{k}^{\top}\right) \mathbf{v}_{k}^{+} - \underbrace{2 \left(\mathbf{a}_{k} - \mathbf{v}_{k}^{+}\right)^{\top} \left(I - \mathbf{a}_{k} \mathbf{a}_{k}^{\top}\right) \mathbf{v}_{k}^{-}}_{\text{negative perturbation}}.$$

When treating the update on the negative sample set as a small perturbation to that on the positive samples, we have $\frac{d}{dt} \|\mathbf{a}_{k} - \mathbf{v}_{k}^{+}\|^{2} \leq 2 \left[\left(\mathbf{a}_{k}^{\top} \mathbf{v}_{k}^{+}\right)^{2} - \|\mathbf{v}_{k}^{+}\|^{2}\right] \leq 0$, where the first term is simplified using $\|\mathbf{a}_{k}\| = 1$ and the last inequality follows from the Cauchy-Schwarz inequality. Similarly, the gradient dynamics on the negative sample set can be obtained as $\frac{d}{dt} \|\mathbf{a}_{k} - \mathbf{v}_{k}^{-}\|^{2} \geq 0$. In conclusion, taking a small step along the tangent update direction inherently drives the representation closer to the weighted center of the positive class while simultaneously pushing it away from that of the negative class. In other words, such updates reinforce the consistency among positive samples and reduce the influence of negatives, thereby shaping a clearer separation in the feature space. Importantly, this property does not rely on any assumptions about the underlying functional form, but rather arises directly from the optimization objective itself, ensuring both generality and robustness.

To further refine the update dynamics, a temperature coefficient can be introduced as a scaling factor. By adjusting the sharpness of the similarity distribution, the temperature effectively controls the relative strength of attraction toward positive samples and repulsion from negative samples. In particular, incorporating a temperature into the formulation normalizes the gradient magnitudes and ensures that the update direction satisfies the desired balance condition between positive and negative contributions. This modification not only stabilizes training but also enhances the flexibility of the loss function in adapting to different representation scales. This result can be directly extended from the normalized variables to the original embedding variables $\mathbf{e}$, thereby completing the proof.
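The tangent-space update analyzed in this subsection can be illustrated with a few lines of Python. The sketch takes one small step along $(I - \mathbf{a}_{k} \mathbf{a}_{k}^{\top})(\mathbf{v}_{k}^{+} - \mathbf{v}_{k}^{-})$ and renormalizes. The helper names, the step size, and the toy vectors are assumptions for illustration; $\mathbf{v}_{k}^{+}$ and $\mathbf{v}_{k}^{-}$ are held fixed and the scalar $\frac{N_{k}}{P_{k} + N_{k}}$ is absorbed into the step size, mirroring the omitted constants in the analysis.

```python
import math


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def normalize(u):
    n = math.sqrt(dot(u, u))
    return [x / n for x in u]


def tangent_project(a, g):
    """Apply (I - a a^T) to g, i.e. remove the component of g along the unit vector a."""
    c = dot(a, g)
    return [gi - c * ai for gi, ai in zip(g, a)]


def tangent_step(a, v_pos, v_neg, lr=0.1):
    """One step along (I - a a^T)(v_pos - v_neg), then renormalize onto the sphere.

    Assumption: v_pos and v_neg are treated as constants during the step.
    """
    direction = tangent_project(a, [p - n for p, n in zip(v_pos, v_neg)])
    return normalize([ai + lr * di for ai, di in zip(a, direction)])


if __name__ == "__main__":
    a = [1.0, 0.0, 0.0]
    v_pos = [0.6, 0.8, 0.0]
    v_neg = [0.5, -0.5, 0.1]
    a_new = tangent_step(a, v_pos, v_neg)
    print("similarity to positive center:", dot(a, v_pos), "->", dot(a_new, v_pos))
    print("similarity to negative center:", dot(a, v_neg), "->", dot(a_new, v_neg))
```

On this toy input, the step increases the similarity to the positive center and decreases the similarity to the negative center, in line with the signs of the two derivatives derived above.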