Understanding and Improving Noisy Embedding Techniques in Instruction Finetuning

arXiv cs.LG 05/25/26, 04:00 AM Papers
instruction-tuning embedding-noise language-models fine-tuning symmetric-noise neftune
Summary
This paper analyzes noisy embedding techniques for instruction fine-tuning, explains why uniform noise outperforms Gaussian, and introduces SymNoise, a symmetric noise method that significantly improves LLaMA-2-7B performance on AlpacaEval over NEFTune.
arXiv:2605.23171v1 Announce Type: new Abstract: Recent advancements in instructional fine-tuning have injected noise into embeddings, with NEFTune (Jain et al., 2024) setting benchmarks using uniform noise. Despite NEFTune's empirical findings that uniform noise outperforms Gaussian noise, the reasons for this remain unclear. This paper aims to clarify this by offering a thorough analysis, both theoretical and empirical, indicating comparable performance among these noise types. Additionally, we introduce a new fine-tuning method for language models, utilizing symmetric noise in embeddings. This method aims to enhance the model's function by more stringently regulating its local curvature, demonstrating superior performance over the current method, NEFTune. When fine-tuning the LLaMA-2-7B model using Alpaca, standard techniques yield a 29.79% score on AlpacaEval. However, our approach, SymNoise, increases this score significantly to 69.04%, using symmetric noisy embeddings. This is a 6.7% improvement over the state-of-the-art method, NEFTune (64.69%). Furthermore, when tested on various models and stronger baseline instruction datasets, such as Evol-Instruct, ShareGPT, OpenPlatypus, SymNoise consistently outperforms NEFTune. The current literature, including NEFTune, has underscored the importance of more in-depth research into the application of noise-based strategies in the fine-tuning of language models. Our approach, SymNoise, is another significant step towards this direction, showing notable improvement over the existing state-of-the-art method.
# Understanding and Improving Noisy Embedding Techniques in Instruction Finetuning
Source: [https://arxiv.org/html/2605.23171](https://arxiv.org/html/2605.23171)
###### Abstract

Recent advancements in instructional fine-tuning have injected noise into embeddings, with NEFTune (Jain et al., [2024](https://arxiv.org/html/2605.23171#bib.bib14)) setting benchmarks using uniform noise. Despite NEFTune’s empirical findings that uniform noise outperforms Gaussian noise, the reasons for this remain unclear. This paper aims to clarify this by offering a thorough analysis, both theoretical and empirical, indicating comparable performance among these noise types. Additionally, we introduce a new fine-tuning method for language models, utilizing symmetric noise in embeddings. This method aims to enhance the model’s function by more stringently regulating its local curvature, demonstrating superior performance over the current method, NEFTune. When fine-tuning the LLaMA-2-7B model using Alpaca, standard techniques yield a 29.79% score on AlpacaEval. However, our approach, SymNoise, increases this score significantly to 69.04%, using symmetric noisy embeddings. This is a 6.7% improvement over the state-of-the-art method, NEFTune (64.69%). Furthermore, when tested on various models and stronger baseline instruction datasets, such as Evol-Instruct, ShareGPT, OpenPlatypus, SymNoise consistently outperforms NEFTune. The current literature, including NEFTune, has underscored the importance of more in-depth research into the application of noise-based strategies in the fine-tuning of language models. Our approach, SymNoise, is another significant step towards this direction, showing notable improvement over the existing state-of-the-art method.

## 1 Introduction

For Large Language Models (Vaswani et al., [2017](https://arxiv.org/html/2605.23171#bib.bib31); Devlin et al., [2019](https://arxiv.org/html/2605.23171#bib.bib8); Radford et al., [2019](https://arxiv.org/html/2605.23171#bib.bib23); Raffel et al., [2020](https://arxiv.org/html/2605.23171#bib.bib24); Zhang et al., [2022](https://arxiv.org/html/2605.23171#bib.bib38); Touvron et al., [2023b](https://arxiv.org/html/2605.23171#bib.bib29); Zhao et al., [2023](https://arxiv.org/html/2605.23171#bib.bib39)) to be effective, their proficiency in executing specific instructions is crucial (Wang et al., [2022](https://arxiv.org/html/2605.23171#bib.bib32); Ouyang et al., [2022](https://arxiv.org/html/2605.23171#bib.bib21); Brown et al., [2020](https://arxiv.org/html/2605.23171#bib.bib3); Chung et al., [2022](https://arxiv.org/html/2605.23171#bib.bib6)). These models typically begin with training on a vast array of unfiltered web data, after which they undergo a more focused fine-tuning stage using a smaller, selectively chosen collection of instructional data. The fine-tuning stage, centered on instructions, is fundamental in unlocking and controlling the full capabilities of LLMs. The practical value of these models is predominantly dependent on how efficiently we can leverage these concise instructional data sets for optimal performance.

In recent years, noise injection (Nukrai et al., [2022](https://arxiv.org/html/2605.23171#bib.bib19); Zang et al., [2021](https://arxiv.org/html/2605.23171#bib.bib36); Akbiyik, [2020](https://arxiv.org/html/2605.23171#bib.bib2)) has been a focal point in computer vision research, yielding methods that enhance model robustness and accuracy. This strategy has recently been adapted for fine-tuning Large Language Models (LLMs), exemplified by the NEFTune method (Jain et al., [2024](https://arxiv.org/html/2605.23171#bib.bib14)), which applies uniform random noise to improve model performance on diverse datasets. Despite NEFTune’s efficacy surpassing traditional fine-tuning techniques, the reasons behind its success, particularly against the commonly used Gaussian noise, are not entirely understood. Our work demystifies this by presenting a detailed theoretical and empirical analysis that reveals comparable results between noise types when appropriately scaled. Moreover, we introduce a novel noise injection approach which not only facilitates a more intuitive understanding but also achieves superior empirical results, outperforming NEFTune and other established fine-tuning methods by a considerable margin.

In particular, our objective is to regularize the curvature of the function learned during training. Curvature regularization has been used in domains such as computer vision (Moosavi-Dezfooli et al., [2019](https://arxiv.org/html/2605.23171#bib.bib18); Lee & Park, [2023](https://arxiv.org/html/2605.23171#bib.bib17)), graph embedding (Pei et al., [2020](https://arxiv.org/html/2605.23171#bib.bib22)), and deep neural networks (Huh, [2020](https://arxiv.org/html/2605.23171#bib.bib13)). Specifically, we aim to ensure that the function’s response changes gradually when the input is modified slightly by noise. In more technical terms, our goal is to have the gradient approach zero in the immediate vicinity of an input altered by a minimal amount. This represents a more stringent condition than merely requiring small values for the Hessian or gradient. However, considering computational efficiency, we opt to avoid the direct computation of gradients or Hessians. Instead, we employ this stringent condition, which, as our experiments on real-world benchmark datasets demonstrate, is effective in practical scenarios.

In this paper, we unveil Symmetric Noise Fine Tuning (SymNoise), a new technique that leverages symmetric Bernoulli distribution-based noise applied to the embedding vectors of training data during the finetuning stage. Each noise component takes the value $-1$ or $1$ with equal probability $\frac{1}{2}$. This method significantly enhances instruction finetuning outcomes, often with remarkable gains, while avoiding additional computational or data resources. While maintaining simplicity, SymNoise has a profound impact on downstream conversational output quality. We show that when a large language model like LLaMA-2-7B (Touvron et al., [2023c](https://arxiv.org/html/2605.23171#bib.bib30)) is finetuned using SymNoise, its performance on AlpacaEval (Dubois et al., [2023](https://arxiv.org/html/2605.23171#bib.bib9)) rises from 29.79% to 69.04%, a substantial increase of about 39.25 percentage points.

Importantly, when compared to the existing NEFTune method (which uses random uniform noise), SymNoise demonstrates a superior performance edge, outperforming NEFTune by approximately 6.7%. Thus, SymNoise not only represents a valuable advancement over traditional finetuning methods but also establishes a new benchmark in efficiency and effectiveness for LLM finetuning.

Contributions. In our comprehensive study, we conduct a detailed theoretical and empirical examination, demonstrating that Gaussian and uniform random noise exhibit functional equivalence when adjusted with an appropriate scaling factor, leading to similar performance on real-world datasets. This insight holds significant importance, especially considering that the creators of the NEFTune method, a leading approach employing uniform noise, have openly recognized gaps in their understanding of the method’s superior performance, notably in comparison to the extensively studied Gaussian noise. By establishing a connection with Gaussian noise, our study helps demystify the NEFTune method. Moreover, we introduce an innovative noise injection method that exceeds the capabilities of NEFTune and existing alternatives. Our contributions thus propel the momentum for continued exploration in this field.

Table 1: AlpacaEval Win Rate against Text-Davinci-003 when applied with LLaMA-2, trained across diverse datasets. SymNoise shows an overall improvement throughout all datasets, outperforming NEFTune on all datasets. The noise scaling factor for the Gaussian distribution is divided by $\sqrt{3}$, resulting in similar performance for both methods.

Table 2: AlpacaEval Win Rate with and without NEFTune, SymNoise on LLaMA-2, LLaMA-1, and OPT on various datasets. SymNoise shows improved performance across these datasets and models.

## 2 Background and Related Work

In the evolving landscape of instruction finetuning for Large Language Models (LLMs), initial efforts like FLAN and T0 marked the beginning of cross-task generalization (Sanh et al., [2021](https://arxiv.org/html/2605.23171#bib.bib25); Wei et al., [2021](https://arxiv.org/html/2605.23171#bib.bib33)). These models, involving encoder-decoder language architectures, underwent finetuning across a diverse spectrum of thousands of NLP tasks. This progression, detailed in studies by Chung et al. ([2022](https://arxiv.org/html/2605.23171#bib.bib6)) and Xu et al. ([2022](https://arxiv.org/html/2605.23171#bib.bib35)), demonstrated the adaptability of LLMs to a variety of standard NLP tasks.

Following this trajectory, OpenAI’s InstructGPT (Ouyang et al., [2022](https://arxiv.org/html/2605.23171#bib.bib21)) emerged as a pioneering model adept at handling open-ended questions with remarkable efficiency. This model, an iteration of GPT-3 (Brown et al., [2020](https://arxiv.org/html/2605.23171#bib.bib3)), incorporated reinforcement learning from human feedback (RLHF), leading to the development of advanced models like ChatGPT (OpenAI, [2022](https://arxiv.org/html/2605.23171#bib.bib20)). ChatGPT, in particular, gained widespread attention for generating more coherent and extended texts compared to InstructGPT.

Building on these developments, Wang et al. ([2022](https://arxiv.org/html/2605.23171#bib.bib32)) introduced the Self-Instruct approach, utilizing InstructGPT to generate instructional pairs for further finetuning of foundational models like LLaMA into specialized variants such as Alpaca (Taori et al., [2023](https://arxiv.org/html/2605.23171#bib.bib27)). Concurrently, the trend towards distilled models, as discussed by Taori et al. ([2023](https://arxiv.org/html/2605.23171#bib.bib27)) and Xu et al. ([2023](https://arxiv.org/html/2605.23171#bib.bib34)), led to the creation of diverse datasets. These datasets, including works by Xu et al. ([2023](https://arxiv.org/html/2605.23171#bib.bib34)) and Lee et al. ([2023](https://arxiv.org/html/2605.23171#bib.bib16)), focused on refining specific model capabilities like STEM question answering and logical reasoning. Another notable advancement was AlpaGasus by Chen et al. ([2023](https://arxiv.org/html/2605.23171#bib.bib4)), which employed a quality-filtering mechanism based on GPT-4 evaluations to enhance model performance. In a different methodology, ShareGPT, as described by Chiang et al. ([2023](https://arxiv.org/html/2605.23171#bib.bib5)), was developed through the crowdsourcing of real user interactions with ChatGPT.

In the context of incorporating noise into model training, the pioneering work by Zhu et al. ([2019](https://arxiv.org/html/2605.23171#bib.bib40)) with the FreeLB method demonstrated the effectiveness of adversarial perturbations in boosting MLM model performance. This method involved introducing calculated Gaussian perturbations into the embeddings and optimizing them to maximally impact model performance. Similar strategies were later applied in various domains, such as image captioning (Nukrai et al., [2022](https://arxiv.org/html/2605.23171#bib.bib19)), point cloud processing (Zang et al., [2021](https://arxiv.org/html/2605.23171#bib.bib36)), graphs (Kong et al., [2022](https://arxiv.org/html/2605.23171#bib.bib15)), and privacy mechanisms (Dwork et al., [2014](https://arxiv.org/html/2605.23171#bib.bib10)). Curvature regularization has been used in domains such as computer vision (Moosavi-Dezfooli et al., [2019](https://arxiv.org/html/2605.23171#bib.bib18); Lee & Park, [2023](https://arxiv.org/html/2605.23171#bib.bib17)), graph embedding (Pei et al., [2020](https://arxiv.org/html/2605.23171#bib.bib22)), and deep neural networks (Huh, [2020](https://arxiv.org/html/2605.23171#bib.bib13)). Noise based on the Bernoulli distribution, as opposed to Gaussian or Uniform noise, has been utilized, as mentioned by Spall ([1998](https://arxiv.org/html/2605.23171#bib.bib26)). In this approach, each outcome, either $-1$ or $1$, is assigned an equal probability of $\frac{1}{2}$.

## 3 On Similarity of Uniform noise and Gaussian noise

In this section, we investigate the similarity between uniform and Gaussian noise when used for embedding perturbations\. While these noise types yield different statistical properties in low dimensions, their behavior becomes increasingly similar as the number of dimensions grows\. This phenomenon is especially relevant in the context of large language models \(LLMs\), where embeddings typically reside in high\-dimensional spaces\.

###### Lemma 1\.

(Uniform Distribution $L_2$ Norm) For $P=(x_1,x_2,\dots,x_d)\sim U^d([-1,1])$, the expected $L_2$ norm is:

$$E[\|P\|_2]=\sqrt{\frac{d}{3}}. \tag{1}$$

The proof is deferred to Appendix[B\.0\.1](https://arxiv.org/html/2605.23171#A2.SS0.SSS1)\.

###### Lemma 2\.

(Gaussian Distribution $L_2$ Norm) For $P=(x_1,x_2,\dots,x_d)\sim N^d(0,1)$, the expected $L_2$ norm is:

$$E[\|P\|_2]=\sqrt{d}. \tag{2}$$

The proof is deferred to Appendix[B\.0\.2](https://arxiv.org/html/2605.23171#A2.SS0.SSS2)\.
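
A short second-moment calculation shows where the two scalings come from. This is only a sketch, not the appendix proof: it computes $E[\|P\|_2^2]$ exactly and assumes that $\|P\|_2$ concentrates around its root-mean-square value when $d$ is large.

```latex
\begin{align*}
x_i \sim U([-1,1]) &:\quad E[x_i^2] = \int_{-1}^{1} \frac{t^2}{2}\,dt = \frac{1}{3}
  &&\Rightarrow\quad E[\|P\|_2^2] = \sum_{i=1}^{d} E[x_i^2] = \frac{d}{3},\\
x_i \sim N(0,1) &:\quad E[x_i^2] = 1
  &&\Rightarrow\quad E[\|P\|_2^2] = \sum_{i=1}^{d} E[x_i^2] = d.
\end{align*}
```

By Jensen's inequality, $E[\|P\|_2] \le \sqrt{E[\|P\|_2^2]}$, so $\sqrt{d/3}$ and $\sqrt{d}$ are upper bounds on the expected norms that become tight as $d$ grows. Their ratio is $\sqrt{3}$ in every dimension.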

![Refer to caption](https://arxiv.org/html/2605.23171v1/x1.png) (a) Gaussian/Uniform average $L_2$ ratio
![Refer to caption](https://arxiv.org/html/2605.23171v1/x2.png) (b) Bernoulli/Uniform average $L_2$ ratio

Figure 1: Comparison of average $L_2$ norm ratios for Gaussian and Bernoulli noise relative to Uniform noise as a function of dimensionality.

Drawing from Lemma [1](https://arxiv.org/html/2605.23171#Thmlemma1) and Lemma [2](https://arxiv.org/html/2605.23171#Thmlemma2), it is apparent that the expected noise from the Gaussian distribution is $\sqrt{3}$ times that of the Uniform distribution. Consequently, to equate the noise scales for comparison, the noise scaling factor for the Gaussian distribution should be scaled down by a factor of $\sqrt{3}$.

As depicted in Figure [1(a)](https://arxiv.org/html/2605.23171#S3.F1.sf1), a distinct pattern emerges in the ratio of average noise (quantified via $L_2$ norms) between Gaussian and Uniform distributions as dimensionality increases. Notably, the relative impact of Gaussian noise amplifies, approximating $\sqrt{3}$ times the effect induced by Uniform noise with increasing dimensions. An in-depth exploration and analysis concerning the influence of altering the sample size, while maintaining a fixed dimensionality, are detailed in Appendix [A.2](https://arxiv.org/html/2605.23171#A1.SS2).
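
The trend described for Figure 1(a) can be checked numerically. The following is a minimal Monte Carlo sketch using NumPy, not the authors' plotting code: the function names, sample count, and seed are illustrative assumptions.

```python
import numpy as np


def mean_l2_norm(samples: np.ndarray) -> float:
    """Average L2 norm over the rows of `samples` (shape: n_samples x d)."""
    return float(np.linalg.norm(samples, axis=1).mean())


def gaussian_to_uniform_ratio(d: int, n_samples: int = 2000, seed: int = 0) -> float:
    """Estimate E||g||_2 / E||u||_2 for g ~ N(0, I_d) and u ~ U([-1, 1]^d)."""
    rng = np.random.default_rng(seed)
    uniform = rng.uniform(-1.0, 1.0, size=(n_samples, d))
    gaussian = rng.standard_normal(size=(n_samples, d))
    return mean_l2_norm(gaussian) / mean_l2_norm(uniform)


# The ratio starts below sqrt(3) in low dimensions and approaches it as d grows.
for d in (1, 4, 64, 1024):
    print(d, round(gaussian_to_uniform_ratio(d), 3), "target:", round(3 ** 0.5, 3))
```

Embedding dimensions for 7B-scale models are in the thousands, so under this sketch the ratio is already very close to $\sqrt{3}$ in the regime that matters.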

Moreover, the comparative results on real-world datasets are presented in Table [1](https://arxiv.org/html/2605.23171#S1.T1), where all conditions are held constant except for the substitution of the Uniform distribution with a Gaussian distribution. In this context, the noise scaling factor for the Gaussian distribution is adjusted by a factor of $\sqrt{3}$, consistent with the discussion above, and one can notice that the performance of both methods thereafter is comparable. A more detailed ablation study is given in Sec. [5.5.2](https://arxiv.org/html/2605.23171#S5.SS5.SSS2).

## 4 Proposed Method: SymNoise

In the ideal scenario, our goal is to implement curvature regularization, a technique prevalent in fields such as computer vision (Moosavi-Dezfooli et al., [2019](https://arxiv.org/html/2605.23171#bib.bib18); Lee & Park, [2023](https://arxiv.org/html/2605.23171#bib.bib17)), graph embedding (Pei et al., [2020](https://arxiv.org/html/2605.23171#bib.bib22)), and deep neural networks (Huh, [2020](https://arxiv.org/html/2605.23171#bib.bib13)). However, due to the high computational cost associated with these methods, we aim to explore an alternative approach that adheres to a more stringent condition. This approach has demonstrated superior performance in practice, surpassing current state-of-the-art methodologies. Specifically, we seek to design a function with a gradient ($\nabla f$) having value $0$ in the vicinity of the input, i.e., for $x,\epsilon\in\mathbb{R}^{d}$, $\nabla f=\frac{\left|\frac{f(x-\epsilon)-f(x+\epsilon)}{\epsilon}\right|}{2}\leq\delta$; when $\delta=0$, we have $f(x+\epsilon)=f(x-\epsilon)$. In this formulation, the noise turns out to be based on a Bernoulli distribution, diverging from the more commonly used Gaussian or Uniform noise types. Specifically, it uses values of $-1$ and $1$ with equal probability, as in Spall ([1998](https://arxiv.org/html/2605.23171#bib.bib26)), to provide a balanced and predictable effect on the network’s learning.

Following Jain et al. ([2024](https://arxiv.org/html/2605.23171#bib.bib14)), we train instruction-based models using instruction-response pairs. Unlike NEFTune, which adds uniform noise to token embeddings, we introduce symmetric Bernoulli noise. While we retain the same noise scaling factor, $\epsilon=\alpha/\sqrt{Ld}$ (with $L$ as sequence length, $d$ as embedding dimension, and $\alpha$ as a tunable parameter), our method differs in how noise is applied. Details of our approach, SymNoise, are provided in Algorithm [2](https://arxiv.org/html/2605.23171#alg2), alongside NEFTune in Algorithm [1](https://arxiv.org/html/2605.23171#alg1) for comparison.

### 4.1 On Similarity of Uniform noise and Bernoulli noise

###### Lemma 3\.

(Bernoulli Distribution $L_2$ Norm) For $P=(x_1,x_2,\dots,x_d)$, with $x_i\in\{-1,1\}$ and $P(x_i=1)=P(x_i=-1)=0.5$, the expected $L_2$ norm is:

$$E[\|P\|_2]=\sqrt{d}. \tag{3}$$

The proof is deferred to Appendix[B\.0\.3](https://arxiv.org/html/2605.23171#A2.SS0.SSS3)\.

In alignment with the discussions in Sec. [3](https://arxiv.org/html/2605.23171#S3) and corroborated by Lemma [1](https://arxiv.org/html/2605.23171#Thmlemma1), Lemma [2](https://arxiv.org/html/2605.23171#Thmlemma2), and Fig. [1(b)](https://arxiv.org/html/2605.23171#S3.F1.sf2), it is evident that the noise induced by the Bernoulli distribution is amplified by a factor of $\sqrt{3}$ compared to that of the Uniform distribution. To accommodate this disparity, our proposed method SymNoise incorporates this $\sqrt{3}$ scaling factor, as detailed in Algorithm [2](https://arxiv.org/html/2605.23171#alg2).
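
For the Bernoulli case the norm is not just $\sqrt{d}$ in expectation: every entry squares to $1$, so every sample has norm exactly $\sqrt{d}$. The sketch below checks this and the effect of dividing by $\sqrt{3}$; the dimension, sample count, and variable names are illustrative assumptions rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 4096, 500

# Symmetric Bernoulli noise: each entry is -1 or +1 with probability 1/2.
bernoulli = rng.integers(0, 2, size=(n_samples, d)) * 2 - 1
uniform = rng.uniform(-1.0, 1.0, size=(n_samples, d))

bernoulli_norms = np.linalg.norm(bernoulli, axis=1)
uniform_norms = np.linalg.norm(uniform, axis=1)

# Every Bernoulli sample has norm sqrt(d); after dividing by sqrt(3) the
# average norm matches that of uniform noise up to sampling error.
rescaled_ratio = (bernoulli_norms.mean() / np.sqrt(3.0)) / uniform_norms.mean()
print(bernoulli_norms.min(), bernoulli_norms.max(), np.sqrt(d))
print(rescaled_ratio)
```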

Algorithm 1 NEFTune: Noisy Embedding Instruction Finetuning (taken from Jain et al., [2024](https://arxiv.org/html/2605.23171#bib.bib14))

Input: $\mathcal{D}=\{x_i,y_i\}_1^N$ tokenized dataset, embedding layer $\text{emb}(\cdot)$, rest of model $f_{/\text{emb}}(\cdot)$, model parameters $\theta$, $\text{loss}(\cdot)$, optimizer $\text{opt}(\cdot)$

Hyperparameter: base noise scale $\alpha\in\mathbb{R}^{+}$

Initialize $\theta$ from a pretrained model.

repeat

- $(X_i,Y_i)\sim\mathcal{D}$ {sample a minibatch of data and labels}
- $X_{\text{emb}}\leftarrow\text{emb}(X_i)\in\mathbb{R}^{B\times L\times d}$ {batch size $B$, seq. length $L$, embedding dimension $d$}
- $\epsilon\sim\text{Uniform}(-1,1)$, $\epsilon\in\mathbb{R}^{B\times L\times d}$ {sample a noise vector}
- $X_{\text{emb}}^{\prime}\leftarrow X_{\text{emb}}+\left(\frac{\alpha}{\sqrt{Ld}}\right)\epsilon$ {add scaled noise to embeds, see footnote 1}
- $\hat{Y}_i\leftarrow f_{/\text{emb}}(X_{\text{emb}}^{\prime})$ {make prediction at noised embeddings}
- $\theta\leftarrow\text{opt}(\theta,\text{loss}(\hat{Y}_i,Y_i))$ {train step, e.g., grad descent}

until stopping criteria met/max iterations.

Footnote 1: If sequence lengths in a batch are not equivalent, then $L$ is a vector $\in\mathbb{Z}_{>0}^{B}$ and the scaling factor $(\alpha/\sqrt{Ld})$ is computed independently for each sequence in batch.
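
The noise step of Algorithm 1 can be sketched in a few lines of NumPy. This is an illustration of the perturbation only, not the reference NEFTune implementation: the function name `neftune_noise` is hypothetical, and the sketch assumes every sequence in the batch has the same length $L$, so the per-sequence scaling from the footnote is not handled.

```python
import numpy as np


def neftune_noise(x_emb: np.ndarray, alpha: float, rng: np.random.Generator) -> np.ndarray:
    """Return embeddings perturbed by scaled uniform noise.

    x_emb has shape (B, L, d); all sequences are assumed to share length L.
    """
    _, seq_len, dim = x_emb.shape
    eps = rng.uniform(-1.0, 1.0, size=x_emb.shape)  # eps ~ Uniform(-1, 1)
    scale = alpha / np.sqrt(seq_len * dim)          # alpha / sqrt(L * d)
    return x_emb + scale * eps


rng = np.random.default_rng(0)
x_emb = rng.standard_normal((2, 5, 8))              # toy batch: B=2, L=5, d=8
noisy = neftune_noise(x_emb, alpha=5.0, rng=rng)
print(noisy.shape, float(np.abs(noisy - x_emb).max()))
```

In a training loop the perturbed embeddings would be fed to the rest of the model in place of the clean ones, and only during training.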

Algorithm 2 SymNoise: Symmetric Noisy Embedding Instruction Finetuning (Proposed Method)

Input: $\mathcal{D}=\{x_i,y_i\}_1^N$ tokenized dataset, embedding layer $\text{emb}(\cdot)$, rest of model $f_{/\text{emb}}(\cdot)$, model parameters $\theta$, $\text{loss}(\cdot)$, optimizer $\text{opt}(\cdot)$

Hyperparameter: base noise scale $\alpha\in\mathbb{R}^{+}$

Initialize $\theta$ from a pretrained model.

repeat

- $(X_i,Y_i)\sim\mathcal{D}$ {sample a minibatch of data and labels}
- $X_{\text{emb}}\leftarrow\text{emb}(X_i)\in\mathbb{R}^{B\times L\times d}$ {batch size $B$, seq. length $L$, embedding dimension $d$}
- $\epsilon\sim\text{Bernoulli}\{-1,1\}$, $\epsilon\in\mathbb{R}^{B\times L\times d}$ {sample a noise vector}
- $X_{\text{emb}}^{\prime}\leftarrow X_{\text{emb}}+\left(\frac{\alpha}{\sqrt{Ld}}\right)\frac{\epsilon}{\sqrt{3}}$ {add scaled noise to embeds, see footnote 2}
- $X_{\text{emb}}^{\prime\prime}\leftarrow X_{\text{emb}}-\left(\frac{\alpha}{\sqrt{Ld}}\right)\frac{\epsilon}{\sqrt{3}}$ {subtract same symmetric noise from embeds}
- $\hat{Y}_i\leftarrow f_{/\text{emb}}(\text{concat}(X_{\text{emb}}^{\prime},X_{\text{emb}}^{\prime\prime}))$ {make prediction at noised embeddings}
- $\theta\leftarrow\text{opt}(\theta,\text{loss}(\hat{Y}_i,Y_i))$ {train step}

until stopping criteria met/max iterations.

Footnote 2: If sequence lengths in a batch are not equivalent, then $L$ is a vector $\in\mathbb{Z}_{>0}^{B}$ and the scaling factor $(\alpha/\sqrt{Ld})$ is computed independently for each sequence in batch.
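
The symmetric perturbation in Algorithm 2 can be sketched the same way. This is a hedged illustration, not the authors' code. The function name `symnoise_pair` is hypothetical, equal sequence lengths are assumed, and the algorithm does not state which axis `concat` uses, so concatenating along the batch axis is an assumption made here.

```python
import numpy as np


def symnoise_pair(x_emb: np.ndarray, alpha: float, rng: np.random.Generator) -> np.ndarray:
    """Return the +noise and -noise copies of x_emb stacked along the batch axis.

    x_emb has shape (B, L, d); the result has shape (2B, L, d).
    Batch-axis concatenation is an assumption; Algorithm 2 only says `concat`.
    """
    _, seq_len, dim = x_emb.shape
    # Symmetric Bernoulli noise: each entry is -1 or +1 with probability 1/2.
    eps = rng.integers(0, 2, size=x_emb.shape) * 2.0 - 1.0
    # Same alpha / sqrt(L * d) scale as NEFTune, divided by sqrt(3).
    delta = (alpha / np.sqrt(seq_len * dim)) * eps / np.sqrt(3.0)
    return np.concatenate([x_emb + delta, x_emb - delta], axis=0)


rng = np.random.default_rng(0)
x_emb = rng.standard_normal((2, 5, 8))              # toy batch: B=2, L=5, d=8
pair = symnoise_pair(x_emb, alpha=5.0, rng=rng)
print(pair.shape)
```

Under the batch-axis assumption, the labels would have to be repeated to match, for example `np.concatenate([y, y], axis=0)`. The two halves average back to the clean embeddings, which is what makes the perturbation symmetric around each input.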

## 5 Experiments

In this section, we perform numerous experiments across various models and benchmarks to demonstrate the efficacy of our proposed method SymNoise and compare it with existing approaches including NEFTune.

### 5.1 Datasets

In this section, we delve into finetuning datasets that have either gained widespread popularity or have recently achieved state-of-the-art results. Due to the memory limitations of our hardware setup, our focus is exclusively on single-turn datasets, following a protocol similar to that used in Jain et al. ([2024](https://arxiv.org/html/2605.23171#bib.bib14)). The chosen datasets are: Alpaca (Taori et al., [2023](https://arxiv.org/html/2605.23171#bib.bib27)), ShareGPT (Chiang et al., [2023](https://arxiv.org/html/2605.23171#bib.bib5)), Evol-Instruct (Xu et al., [2023](https://arxiv.org/html/2605.23171#bib.bib34)), and Open-Platypus (Lee et al., [2023](https://arxiv.org/html/2605.23171#bib.bib16)). More details about these datasets are in Appendix [A.1](https://arxiv.org/html/2605.23171#A1.SS1).

In the fine-tuning phase, each model, with the exception of those trained on ShareGPT, utilizes the prompt from the Alpaca system. Conversely, ShareGPT models are fine-tuned using the prompt from the Vicuna system. Our approach to hyperparameter tuning follows Jain et al. ([2024](https://arxiv.org/html/2605.23171#bib.bib14)): we use exactly the same set of hyperparameters as NEFTune.

### 5.2 Models

Following the experimental setup of Jain et al. ([2024](https://arxiv.org/html/2605.23171#bib.bib14)), our experiments predominantly utilize Large Language Models (LLMs) with a parameter size of 7 billion. Specifically, our focus is on models such as LLaMA-1 (Touvron et al., [2023a](https://arxiv.org/html/2605.23171#bib.bib28)), LLaMA-2 (Touvron et al., [2023c](https://arxiv.org/html/2605.23171#bib.bib30)), and OPT-6.7B (Zhang et al., [2022](https://arxiv.org/html/2605.23171#bib.bib38)). These transformer-based models primarily differ in the amount of training data they’ve been exposed to, with OPT-6.7B, LLaMA-1, and LLaMA-2 being trained on 180 billion, 1 trillion, and 2 trillion tokens, respectively. This variance in training data volume is expected to manifest in their performance across benchmarks like MMLU, where LLaMA-2 typically outperforms the others.

### 5.3 Evaluation Protocols

Our experimental framework, adapted from the original NEFTune (Jain et al., [2024](https://arxiv.org/html/2605.23171#bib.bib14)) setup, primarily utilizes single-turn data for training. We assess the models’ conversational skills using AlpacaEval and examine their performance on tasks from the OpenLLM Leaderboard. This was done to verify that the introduction of our SymNoise augmentation does not negatively impact the models’ performance on standard multiple-choice tasks. Notably, the results demonstrate that our augmented models consistently outperform the original NEFTune models, albeit by a modest margin.

- AlpacaEval: Introduced by Dubois et al. ([2023](https://arxiv.org/html/2605.23171#bib.bib9)), AlpacaEval is crucial for appraising generative quality. It functions as an automated model-based evaluator, comparing Text-Davinci-003’s generations with our model’s over 805 prompts, focusing on the Win Rate. The Win Rate indicates how often our model is preferred over Text-Davinci-003, as judged by a model evaluator (GPT-4, ChatGPT, etc.). The dataset’s 805 test prompts, sourced from various platforms, ensure a comprehensive testing scope. AlpacaEval’s high human agreement rate (Dubois et al., [2023](https://arxiv.org/html/2605.23171#bib.bib9)), validated by 20K annotations, highlights its usefulness and accuracy. We employ both GPT-4 and ChatGPT as evaluators, using ChatGPT initially due to GPT-4’s API limitations and costs.
- Hugging Face OpenLLM Leaderboard: For leaderboard assessments, datasets like ARC (Clark et al., [2018](https://arxiv.org/html/2605.23171#bib.bib7)), HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2605.23171#bib.bib37)), and MMLU (Hendrycks et al., [2020](https://arxiv.org/html/2605.23171#bib.bib12)) are utilized. These verbalized multiclass classification datasets test the LLM’s capability in factual questioning and reasoning. Our evaluations confirm that the SymNoise method does not diminish the models’ proficiency in these domains.

### 5.4 Results

The methodology we employed for tuning hyperparameters and choosing their values adheres closely to the protocols proposed by Jain et al. ([2024](https://arxiv.org/html/2605.23171#bib.bib14)). Specifically, we adopted the identical hyperparameter set used for NEFTune by Jain et al. ([2024](https://arxiv.org/html/2605.23171#bib.bib14)).

#### 5.4.1 Improvement in generated text quality

Our results demonstrate an enhanced metric performance compared to NEFTune in terms of generated text quality. As evident from Table [1](https://arxiv.org/html/2605.23171#S1.T1), there is a notable improvement across all datasets at the 7B size, with an average increase of 17.6% (compared to NEFTune’s improvement of 15.1%). This indicates that the implementation of SymNoise significantly enhances conversational capabilities and answer quality. These findings are supported by evaluations using AlpacaEval, where SymNoise notably outperforms NEFTune. Furthermore, as shown in Table [2](https://arxiv.org/html/2605.23171#S1.T2), enhancements are also observed in older models like LLaMA-1 and OPT-6.7B, with SymNoise consistently surpassing NEFTune in these models as well. An interesting observation is the comparatively smaller gain by NEFTune in ShareGPT, as per ChatGPT’s analysis, a trend not mirrored in GPT-4’s evaluation. However, SymNoise consistently excels over NEFTune for ShareGPT in evaluations by both GPT-4 and ChatGPT. In Table [1](https://arxiv.org/html/2605.23171#S1.T1), the Win Rate shows a significant increase from 29.79% to 69.04% for Alpaca, thereby outperforming the state-of-the-art method NEFTune by 6.7%.
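
The Alpaca figures quoted above mix percentage-point gains and relative gains, so it helps to work the arithmetic through. The short calculation below uses only the three win rates stated in the text (29.79%, 64.69%, 69.04%); the variable names are illustrative. It suggests that the reported 6.7% matches the relative improvement over NEFTune rather than the percentage-point difference.

```python
# AlpacaEval win rates (%) for LLaMA-2-7B finetuned on Alpaca, as quoted in the text.
baseline, neftune, symnoise = 29.79, 64.69, 69.04

gain_over_baseline = symnoise - baseline                      # percentage points
gain_over_neftune = symnoise - neftune                        # percentage points
relative_gain_over_neftune = (symnoise / neftune - 1) * 100   # relative, in %

print(round(gain_over_baseline, 2))          # 39.25
print(round(gain_over_neftune, 2))           # 4.35
print(round(relative_gain_over_neftune, 2))  # 6.72
```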

#### 5.4.2 Improvement in OpenLLM Leaderboard tasks

In addressing the potential that SymNoise could enhance conversational abilities at the expense of traditional skills, we conducted evaluations using tasks from the OpenLLM Leaderboard. Employing the LM-Eval Harness framework (Gao et al., [2021](https://arxiv.org/html/2605.23171#bib.bib11)), we assessed our model’s performance on benchmarks such as MMLU (Hendrycks et al., [2020](https://arxiv.org/html/2605.23171#bib.bib12)), ARC (Clark et al., [2018](https://arxiv.org/html/2605.23171#bib.bib7)), and HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2605.23171#bib.bib37)). These tests shed light on the model’s knowledge base, reasoning capabilities, and adherence to factual information. As illustrated in Table [3](https://arxiv.org/html/2605.23171#S5.T3), the results indicate that SymNoise not only stabilizes scores but also actively preserves and, in some cases, enhances the model’s capabilities. Notably, SymNoise consistently outperforms NEFTune, highlighting its effectiveness in striking a balance between conversational proficiency and traditional computational skills.

Table 3: For OpenLLM Leaderboard tasks, the influence of NEFTune and SymNoise is investigated on LLaMA-2, encompassing Alpaca, Evol-Instruct, and OpenPlatypus datasets, alongside LLaMA-1 trained on the Evol-Instruct dataset. Comparative observations reveal a uniformity in performance metrics across the diverse datasets and models, indicating negligible impact of NEFTune but slightly better performance of SymNoise on the overall effectiveness. We follow a procedure similar to that of Jain et al. ([2024](https://arxiv.org/html/2605.23171#bib.bib14)), and report their results for completeness. In order to minimize computational expenses, we refrained from conducting thorough hyper-parameter optimization, which may have further improved the results.

### 5.5 Analysis

As shown in NEFTune (Jain et al., [2024](https://arxiv.org/html/2605.23171#bib.bib14)) and related work, adding noise to embeddings during training helps mitigate overfitting to dataset-specific quirks such as formatting or phrasing. This shifts the model from memorizing instructions to leveraging the broader capabilities of the pretrained base model. A direct effect is that models produce longer, more coherent responses, which are preferred by both human and automated evaluators (Dubois et al., [2023](https://arxiv.org/html/2605.23171#bib.bib9)). While increased verbosity contributes to performance gains, our analysis shows that SymNoise improves both response quality and quantity beyond what NEFTune achieves.

Conceptually, SymNoise assigns probability mass to multiple noisy variants of instructions, encouraging the model to learn a broader, more uniform distribution rather than overfitting to the training data or a single perturbed version. This promotes better generalization and reduces overfitting.
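The idea of training on several noisy variants of the same instruction can be illustrated with a small sketch. It is not the authors' implementation: it assumes the noise is a random ±1 sign pattern, that it is scaled by `alpha / sqrt(seq_len * dim)` as in NEFTune, and that each example is perturbed by both `+eps` and `-eps`. The function name `symmetric_noisy_pair` is hypothetical; the exact algorithm is the one defined in the method section.

```python
import numpy as np

def symmetric_noisy_pair(emb, alpha=5.0, rng=None):
    """Return (emb + eps, emb - eps) for a random sign pattern eps.

    emb:   array of shape (seq_len, dim), token embeddings of one example.
    alpha: noise strength; the alpha / sqrt(seq_len * dim) scaling is an
           assumption borrowed from NEFTune.
    """
    rng = np.random.default_rng() if rng is None else rng
    seq_len, dim = emb.shape
    scale = alpha / np.sqrt(seq_len * dim)
    # Assumption: each entry is -1 or +1 with equal probability.
    signs = rng.integers(0, 2, size=emb.shape) * 2 - 1
    eps = scale * signs
    return emb + eps, emb - eps

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))          # toy embeddings: 8 tokens, 16 dims
plus, minus = symmetric_noisy_pair(emb, alpha=5.0, rng=rng)

# The two variants are mirror images around the clean embedding.
print(np.allclose((plus + minus) / 2, emb))
# With +/-1 entries, the perturbation's overall norm equals alpha.
print(np.linalg.norm(plus - emb))
```

Because the two perturbations cancel on average, the pair stays centered on the clean embedding while still exposing the model to two opposing noisy variants.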

#### 5.5.1 Longer responses vs repetition

In this section, we determine whether the longer responses produced with SymNoise result from increased repetition or contribute more diverse and detailed content.

Consistent with Jain et al. ([2024](https://arxiv.org/html/2605.23171#bib.bib14)) and with leaderboard results, there is a notable correlation between longer responses and improved performance on the AlpacaEval task. This raises the question of whether the increase in response length under SymNoise reduces text diversity and quality. We therefore measured the frequency of N-gram repetitions in responses generated by LLaMA-2 trained on various datasets, both with and without SymNoise.

Following the methodology of Jain et al. ([2024](https://arxiv.org/html/2605.23171#bib.bib14)), we restricted the analysis to the initial segment of each response for consistency. Specifically, we examined the first 50 words for Alpaca-trained models, 100 words for Evol-Instruct, and 150 words for OpenPlatypus, ensuring that at least half of the responses exceeded these thresholds. Responses shorter than these limits were excluded from the analysis.

As shown in Table [5](https://arxiv.org/html/2605.23171#S5.T5), SymNoise typically yields longer responses. Importantly, however, the frequency of 2-gram repetitions and the overall token log-diversity remain largely unchanged, paralleling the results observed with NEFTune. This suggests that the increased response length under SymNoise is not simply due to repetitive content but reflects additional, relevant information that enriches the generated responses.
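The truncate-then-count procedure described above can be sketched as follows. This is a simplified illustration rather than the evaluation code used for Table 5: it assumes whitespace tokenization and defines the repetition rate as one minus the fraction of distinct n-grams, and the function name `ngram_repetition_rate` is hypothetical.

```python
def ngram_repetition_rate(text, n=2, max_words=50):
    """Fraction of repeated n-grams among the first `max_words` words.

    Returns None when the response is shorter than `max_words`, mirroring
    the exclusion rule in the text. Whitespace tokenization is an assumption.
    """
    words = text.split()
    if len(words) < max_words:
        return None
    words = words[:max_words]
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

repetitive = " ".join(["the cat sat"] * 20)      # 60 words, heavy repetition
varied = " ".join(f"w{i}" for i in range(60))    # 60 distinct words
short = "too short to count"

print(ngram_repetition_rate(repetitive))  # close to 1: almost every 2-gram repeats
print(ngram_repetition_rate(varied))      # 0.0: no 2-gram repeats
print(ngram_repetition_rate(short))       # None: excluded from the analysis
```

With `max_words` set to 50, 100, or 150 depending on the training dataset, a longer response only scores higher here if its opening segment actually reuses the same 2-grams.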

#### 5.5.2 Ablation study with different strengths of noise

In this section, we compare different noise distributions, specifically uniform (NEFTune) and Gaussian noise, against the SymNoise algorithm. From Table [5](https://arxiv.org/html/2605.23171#S5.T5), one can see that Gaussian noise tends to produce longer outputs. However, this increased length does not translate into a corresponding gain in performance. While longer generations are generally associated with higher scores, none of the generation-time strategies employed matched the effectiveness of models trained with SymNoise. SymNoise achieves the best results, an approximate improvement of 6.7% over models trained with NEFTune. Furthermore, we conducted a comparative analysis with Bernoulli noise to underscore the effectiveness of the symmetric opposing noise component in SymNoise.

Moreover, we kept the experimental conditions fixed while substituting the Uniform distribution with a Gaussian distribution. In line with our theoretical framework, we adjusted the noise scaling factor for the Gaussian distribution by dividing it by $\sqrt{3}$. This adjustment led to comparable performance between the two methods across various NEFTune noise levels, reinforcing the validity of our noise scaling approach.
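The $\sqrt{3}$ adjustment can be checked numerically: a coordinate drawn from $U(-1,1)$ has standard deviation $1/\sqrt{3}$, so a standard Gaussian divided by $\sqrt{3}$ has the same per-coordinate spread. The sketch below assumes NumPy; the sample count and dimension are arbitrary illustrative values, not settings from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 4096  # hypothetical sample count and embedding dimension

uniform = rng.uniform(-1.0, 1.0, size=(n, d))
gauss_scaled = rng.standard_normal((n, d)) / np.sqrt(3.0)

# Per-coordinate spread of each noise type.
std_uniform = uniform.std()
std_gauss = gauss_scaled.std()

# Average L2 norm of one noise vector under each distribution.
norm_uniform = np.linalg.norm(uniform, axis=1).mean()
norm_gauss = np.linalg.norm(gauss_scaled, axis=1).mean()

print(f"std:       uniform={std_uniform:.4f}  gaussian/sqrt(3)={std_gauss:.4f}")
print(f"mean norm: uniform={norm_uniform:.2f}  gaussian/sqrt(3)={norm_gauss:.2f}")
```

Both standard deviations should land near $1/\sqrt{3} \approx 0.577$ and both average norms near $\sqrt{d/3}$, which is the sense in which the rescaled Gaussian noise is matched to the uniform noise.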

Table 4: AlpacaEval Win Rate and Average Character Count (in parentheses) assessed by ChatGPT across various noise settings.

| Setting | Alpaca | Evol-Instruct | OpenPlatypus |
| --- | --- | --- | --- |
| LLaMA-2-7b | 48.26 (375) | 62.55 (864) | 57.20 (1101) |
| +NEFT Noise 5 | 62.55 (1062) | 67.58 (1404) | 60.99 (1428) |
| +NEFT Noise 10 | 61.18 (1010) | 65.59 (1697) | 60.62 (1834) |
| +NEFT Noise 15 | 61.86 (820) | 66.58 (1651) | 61.74 (1694) |
| +Gaussian Noise $5/\sqrt{3}$ | 62.6 (1073) | 68.01 (1431) | 60.31 (1437) |
| +Gaussian Noise $10/\sqrt{3}$ | 61.01 (1211) | 65.29 (1783) | 60.32 (1878) |
| +Gaussian Noise $15/\sqrt{3}$ | 61.93 (835) | 65.99 (1767) | 61.38 (1806) |
| +Gaussian Noise 5 | 60.93 (1371) | 65.09 (2066) | 59.13 (2061) |
| +Bernoulli Noise 5 | 61.32 (1272) | 65.10 (1840) | 60.22 (1968) |
| +SymNoise Noise 5 | 64.92 (1186) | 69.62 (1700) | 62.14 (1689) |

Table 5: Transposed view of average lengths and 2-gram repetition rates in AlpacaEval responses for different training methods.

## 6 CONCLUSION

In this work, we rigorously establish that Gaussian and uniform random noise are functionally analogous under appropriate scaling, and demonstrate similar effectiveness on real-world datasets. This finding is pivotal, particularly given the NEFTune authors' acknowledgment that the method's advantage over the well-studied Gaussian noise was unexplained. It sheds light on the previously opaque superiority of NEFTune and bridges the gap with the well-understood Gaussian noise.

Furthermore, we have introduced SymNoise, a novel noise injection technique that outperforms NEFTune and other existing methods by a large margin. The advances shown by SymNoise in training large language models (LLMs) emphasize the importance of innovative algorithmic strategies and regularization techniques. As Jain et al. ([2024](https://arxiv.org/html/2605.23171#bib.bib14)) observe, the field of LLMs, unlike its counterpart in computer vision, has often favored standardized training methods that focus on model scaling and dataset expansion. SymNoise, however, underscores the potential of fine-tuning techniques to enhance model performance, particularly where overfitting to limited instruction datasets is a concern.

## References

- Sha (2023) Sharegpt. [https://sharegpt.com/](https://sharegpt.com/), 2023.
- Akbiyik (2020) Murtaza Eren Akbiyik. Data augmentation in training CNNs: Injecting noise to images, 2020. URL [https://openreview.net/forum?id=SkeKtyHYPS](https://openreview.net/forum?id=SkeKtyHYPS).
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. *Advances in neural information processing systems*, 33:1877–1901, 2020.
- Chen et al. (2023) Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et al. Alpagasus: Training a better alpaca with fewer data. *arXiv preprint arXiv:2307.08701*, 2023.
- Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%\* chatgpt quality. [https://lmsys.org/blog/2023-03-30-vicuna/](https://lmsys.org/blog/2023-03-30-vicuna/), Mar 2023.
- Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. *arXiv preprint arXiv:2210.11416*, 2022.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. *arXiv preprint arXiv:1803.05457*, 2018.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In *Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics*, 2019. URL [https://aclanthology.org/N19-1423/](https://aclanthology.org/N19-1423/).
- Dubois et al. (2023) Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. *arXiv preprint arXiv:2305.14387*, 2023.
- Dwork et al. (2014) Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. *Foundations and Trends® in Theoretical Computer Science*, 9(3–4):211–407, 2014.
- Gao et al. (2021) Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, et al. A framework for few-shot language model evaluation. *Version v0. 0.1. Sept*, 2021.
- Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. *arXiv preprint arXiv:2009.03300*, 2020.
- Huh (2020) Dongsung Huh. Curvature-corrected learning dynamics in deep neural networks. In *International Conference on Machine Learning*, pp. 4552–4560. PMLR, 2020.
- Jain et al. (2024) Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, et al. Neftune: Noisy embeddings improve instruction finetuning. In *ICLR*, 2024.
- Kong et al. (2022) Kezhi Kong, Guohao Li, Mucong Ding, Zuxuan Wu, Chen Zhu, Bernard Ghanem, Gavin Taylor, and Tom Goldstein. Robust optimization as data augmentation for large-scale graphs. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 60–69, 2022.
- Lee et al. (2023) Ariel N Lee, Cole J Hunter, and Nataniel Ruiz. Platypus: Quick, cheap, and powerful refinement of llms. *arXiv preprint arXiv:2308.07317*, 2023.
- Lee & Park (2023) Yonghyeon Lee and Frank C Park. On explicit curvature regularization in deep generative models. In *Topological, Algebraic and Geometric Learning Workshops 2023*, pp. 505–518. PMLR, 2023.
- Moosavi-Dezfooli et al. (2019) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, and Pascal Frossard. Robustness via curvature regularization, and vice versa. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 9078–9086, 2019.
- Nukrai et al. (2022) David Nukrai, Ron Mokady, and Amir Globerson. Text-only training for image captioning using noise-injected clip. *arXiv preprint arXiv:2211.00575*, 2022.
- OpenAI (2022) OpenAI. Introducing chatgpt. [https://openai.com/blog/chatgpt](https://openai.com/blog/chatgpt), 2022.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. *Advances in Neural Information Processing Systems*, 35:27730–27744, 2022.
- Pei et al. (2020) Hongbin Pei, Bingzhe Wei, Kevin Chang, Chunxu Zhang, and Bo Yang. Curvature regularization to prevent distortion in graph embedding. *Advances in Neural Information Processing Systems*, 33:20779–20790, 2020.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. *OpenAI blog*, 1(8):9, 2019.
- Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. *Journal of Machine Learning Research*, 21(140):1–67, 2020. URL [http://jmlr.org/papers/v21/20-074.html](http://jmlr.org/papers/v21/20-074.html).
- Sanh et al. (2021) Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. *arXiv preprint arXiv:2110.08207*, 2021.
- Spall (1998) James C Spall. Implementation of the simultaneous perturbation algorithm for stochastic optimization. *IEEE Transactions on aerospace and electronic systems*, 34(3):817–823, 1998.
- Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction-following llama model, 2023.
- Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. *arXiv preprint arXiv:2302.13971*, 2023a.
- Touvron et al. (2023b) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. *arXiv preprint arXiv:2302.13971*, 2023b.
- Touvron et al. (2023c) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. *arXiv preprint arXiv:2307.09288*, 2023c.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In *Advances in Neural Information Processing Systems*, volume 30. Curran Associates, Inc., 2017. URL [https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf).
- Wang et al. (2022) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. *arXiv preprint arXiv:2212.10560*, 2022.
- Wei et al. (2021) Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. *arXiv preprint arXiv:2109.01652*, 2021.
- Xu et al. (2023) Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. *arXiv preprint arXiv:2304.12244*, 2023.
- Xu et al. (2022) Hanwei Xu, Yujun Chen, Yulun Du, Nan Shao, Yanggang Wang, Haiyu Li, and Zhilin Yang. Zeroprompt: scaling prompt-based pretraining to 1,000 tasks improves zero-shot generalization. *arXiv preprint arXiv:2201.06910*, 2022.
- Zang et al. (2021) Xiao Zang, Yi Xie, Siyu Liao, Jie Chen, and Bo Yuan. Noise injection-based regularization for point cloud processing. *arXiv preprint arXiv:2103.15027*, 2021.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? *arXiv preprint arXiv:1905.07830*, 2019.
- Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. *arXiv preprint arXiv:2205.01068*, 2022.
- Zhao et al. (2023) Wayne Xin Zhao, Yujian Shao, Jingyuan Li, Yuan Wang, Xinyu Li, Zihan Yu, Yujia Ji, Jing Chen, Fei Wang, and Ji-Rong Li. A survey of large language models. *arXiv preprint arXiv:2303.18223*, 2023.
- Zhu et al. (2019) Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. Freelb: Enhanced adversarial training for natural language understanding. *arXiv preprint arXiv:1909.11764*, 2019.

## Appendix A Appendix

### A.1 Datasets

- Alpaca (Taori et al., [2023](https://arxiv.org/html/2605.23171#bib.bib27)): Developed using the Self-Instruct method of Wang et al. ([2022](https://arxiv.org/html/2605.23171#bib.bib32)) and the Text-Davinci-003 model (Ouyang et al., [2022](https://arxiv.org/html/2605.23171#bib.bib21)), Alpaca uses a small set of seed tasks to generate new instruction-tuning tasks and filter out ineffective ones. This dataset has been instrumental in advancing instruction-based learning.
- ShareGPT (Chiang et al., [2023](https://arxiv.org/html/2605.23171#bib.bib5)): Comprising 70k voluntarily shared ChatGPT conversations (Sha, [2023](https://arxiv.org/html/2605.23171#bib.bib1)), ShareGPT provides a rich source of real-world interaction data. While originally multi-turn, we adapt it to a single-turn format using the Vicuna v1.1 dataset version for consistency with our experimental setup.
- Evol-Instruct (Xu et al., [2023](https://arxiv.org/html/2605.23171#bib.bib34)): This dataset, comprising 70k single-turn instructions, is considered more complex than Alpaca. Originating from the Alpaca dataset, Evol-Instruct uses ChatGPT to refine and evolve the initial instructions, broadening the scope and complexity of the tasks.
- Open-Platypus (Lee et al., [2023](https://arxiv.org/html/2605.23171#bib.bib16)): Formed by combining 11 open-source datasets, Open-Platypus is tailored to enhance LLM performance in STEM and logical reasoning domains. It includes approximately 25k questions, with around 10% generated by LLMs and the rest by human experts. This dataset emphasizes the importance of variety and complexity in question formats.

### A.2 Analysis of Distributional Characteristics in High-Dimensional Spaces

In this section, we analyze the behavior of Gaussian, Bernoulli, and Uniform distributions in high-dimensional spaces. We explore how the average $L_2$ norm ratios of these distributions change with respect to varying dimensions and sample sizes, providing insights into their geometric properties and implications for high-dimensional data analysis.

#### A.2.1 Average $L_2$ Norm Ratio with Varying Dimensionality

![Refer to caption](https://arxiv.org/html/2605.23171v1/x3.png)

Figure 2: Gaussian/Uniform Average $L_2$ Norm Ratio as a Function of Dimensionality. The plot illustrates the ratio of the average $L_2$ norm of points drawn from a Gaussian distribution to that of a Uniform distribution, with the number of points fixed at 256 and the dimensionality varying from 1 to 4096.

![Refer to caption](https://arxiv.org/html/2605.23171v1/x4.png)

Figure 3: Bernoulli/Uniform Average $L_2$ Norm Ratio as a Function of Dimensionality. The plot depicts the ratio of the average $L_2$ norm of points drawn from a Bernoulli distribution to that of a Uniform distribution, with the number of points fixed at 256 and the dimensionality varying from 1 to 4096.

#### A.2.2 Average $L_2$ Norm Ratio with Varying Number of Points

![Refer to caption](https://arxiv.org/html/2605.23171v1/x5.png)

Figure 4: Gaussian/Uniform Average $L_2$ Norm Ratio for Varying Number of Points. The plot illustrates the ratio of the average $L_2$ norm of points drawn from a Gaussian distribution to that of a Uniform distribution, with the dimensionality fixed at 4096 and the number of points varying from 64 to 256.

![Refer to caption](https://arxiv.org/html/2605.23171v1/x6.png)

Figure 5: Bernoulli/Uniform Average $L_2$ Norm Ratio for Varying Number of Points. The plot depicts the ratio of the average $L_2$ norm of points drawn from a Bernoulli distribution to that of a Uniform distribution, with the dimensionality fixed at 4096 and the number of points varying from 64 to 256.
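The quantities plotted in Figures 2 to 5 can be approximated with a small simulation. The sketch below is not the authors' plotting code: it assumes NumPy, samples Uniform noise on $[-1,1]$, standard Gaussian noise, and ±1 Bernoulli noise, and uses hypothetical helper names (`mean_l2_norm`, `norm_ratios`).

```python
import numpy as np

def mean_l2_norm(samples):
    """Average L2 norm over the rows of `samples` (shape: n_points x dim)."""
    return np.linalg.norm(samples, axis=1).mean()

def norm_ratios(dim, n_points, rng):
    """Return (Gaussian/Uniform, Bernoulli/Uniform) average L2 norm ratios."""
    uniform = rng.uniform(-1.0, 1.0, size=(n_points, dim))
    gaussian = rng.standard_normal((n_points, dim))
    bernoulli = rng.choice([-1.0, 1.0], size=(n_points, dim))
    base = mean_l2_norm(uniform)
    return mean_l2_norm(gaussian) / base, mean_l2_norm(bernoulli) / base

rng = np.random.default_rng(0)

# Dimensionality sweep with 256 points, as in Figures 2 and 3.
for dim in (1, 16, 256, 4096):
    g, b = norm_ratios(dim, 256, rng)
    print(f"dim={dim:5d}  gauss/unif={g:.3f}  bern/unif={b:.3f}")

# Point-count sweep at dimension 4096, as in Figures 4 and 5.
for n_points in (64, 128, 256):
    g, b = norm_ratios(4096, n_points, rng)
    print(f"n={n_points:4d}  gauss/unif={g:.3f}  bern/unif={b:.3f}")
```

At high dimension both ratios should settle close to $\sqrt{3} \approx 1.732$, the value implied by Lemmas 1 to 3, while at very low dimension the ratios can deviate because the norms are not yet concentrated.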

## Appendix B Deferred proofs

In this section, we show the proofs omitted from Sec. [3](https://arxiv.org/html/2605.23171#S3) and Sec. [4.1](https://arxiv.org/html/2605.23171#S4.SS1).

#### B.0.1 Proof of Lemma [1](https://arxiv.org/html/2605.23171#Thmlemma1)

We state again Lemma [1](https://arxiv.org/html/2605.23171#Thmlemma1) from Sec. [3](https://arxiv.org/html/2605.23171#S3) and present the proof.

###### Lemma [1](https://arxiv.org/html/2605.23171#Thmlemma1).

(Uniform Distribution $L_2$ Norm) For $P=(x_1,x_2,\ldots,x_d)\sim U^d([-1,1])$, the expected $L_2$ norm is:

$$E[\|P\|_2]=\sqrt{\frac{d}{3}}. \tag{4}$$

Proof: Each $x_i$ is uniformly distributed over $[-1,1]$. The second moment about the origin for a uniform distribution $U(a,b)$ is given by $\frac{b^3-a^3}{3(b-a)}$. For $U(-1,1)$, this yields $E[x_i^2]=\frac{1}{3}$. The components $x_i$ are independent, hence the expected value of the sum of their squares, which represents the $L_2$ norm squared, is the sum of their expected values: $E[\|P\|_2^2]=\sum_{i=1}^{d}E[x_i^2]=d\cdot\frac{1}{3}$. Taking the square root gives the expected $L_2$ norm: $E[\|P\|_2]=\sqrt{E[\|P\|_2^2]}=\sqrt{\frac{d}{3}}$.

#### B.0.2 Proof of Lemma [2](https://arxiv.org/html/2605.23171#Thmlemma2)

We state again Lemma [2](https://arxiv.org/html/2605.23171#Thmlemma2) from Sec. [3](https://arxiv.org/html/2605.23171#S3) and present the proof.

###### Lemma [2](https://arxiv.org/html/2605.23171#Thmlemma2).

(Gaussian Distribution $L_2$ Norm) For $P=(x_1,x_2,\ldots,x_d)\sim N^d(0,1)$, the expected $L_2$ norm is:

$$E[\|P\|_2]=\sqrt{d}. \tag{5}$$

Proof: Each $x_i$ is distributed according to $N(0,1)$. The square of a standard normal variable, $x_i^2$, follows a chi-squared distribution with 1 degree of freedom, for which the mean (expected value) is 1. Given the independence of the components $x_i$, the expected value of the sum of their squares, representing the $L_2$ norm squared, is: $E[\|P\|_2^2]=\sum_{i=1}^{d}E[x_i^2]=d\cdot 1=d$. The expected $L_2$ norm is the square root of this sum: $E[\|P\|_2]=\sqrt{E[\|P\|_2^2]}=\sqrt{d}$.
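As a side note that is not part of the authors' proof, the final step above can be read as a high-dimensional approximation. By Jensen's inequality the square root of the second moment is an upper bound on the expected norm, and for the Gaussian case the exact value is the mean of a chi distribution with $d$ degrees of freedom:

```latex
\begin{aligned}
E[\|P\|_2] &\le \sqrt{E[\|P\|_2^2]} = \sqrt{d}, \\
E[\|P\|_2] &= \sqrt{2}\,\frac{\Gamma\!\left(\frac{d+1}{2}\right)}{\Gamma\!\left(\frac{d}{2}\right)}
            = \sqrt{d}\left(1 - \frac{1}{4d} + O\!\left(d^{-2}\right)\right).
\end{aligned}
```

For $d=1$ this gives $\sqrt{2/\pi}\approx 0.80$ rather than $1$, while for $d=4096$ the relative gap is roughly $1/(4d)\approx 6\times 10^{-5}$, so $\sqrt{d}$ is accurate at embedding-scale dimensions. The same consideration applies to the uniform case in Lemma 1, whereas the Bernoulli case in Lemma 3 is exact because $\|P\|_2=\sqrt{d}$ for every sample.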

#### B.0.3 Proof of Lemma [3](https://arxiv.org/html/2605.23171#Thmlemma3)

We state again Lemma [3](https://arxiv.org/html/2605.23171#Thmlemma3) from Sec. [4.1](https://arxiv.org/html/2605.23171#S4.SS1) and present the proof.

###### Lemma [3](https://arxiv.org/html/2605.23171#Thmlemma3).

(Bernoulli Distribution $L_2$ Norm) For $P=(x_1,x_2,\ldots,x_d)$, with $x_i\in\{-1,1\}$ and $P(x_i=1)=P(x_i=-1)=0.5$, the expected $L_2$ norm is:

$$E[\|P\|_2]=\sqrt{d}. \tag{6}$$

Proof: Each $x_i$ takes values $-1$ or $1$ with equal probability, leading to $x_i^2=1$ irrespective of $x_i$'s actual value. Hence, $E[x_i^2]=1$. Given the independence of the components $x_i$, the expected value of the sum of their squares, which represents the $L_2$ norm squared, is simply the sum of the expected values: $E[\|P\|_2^2]=\sum_{i=1}^{d}E[x_i^2]=d\cdot 1=d$. Therefore, the expected $L_2$ norm is the square root of this value: $E[\|P\|_2]=\sqrt{E[\|P\|_2^2]}=\sqrt{d}$.