Speculative Decoding Across Languages

arXiv cs.CL Papers

Summary

This paper compares three strategies to improve speculative decoding efficiency for non-English languages, finding that task-specific distillation improves acceptance rates but generalizes poorly, while n-gram draft models offer consistent speed-ups despite lower acceptance rates.

arXiv:2605.30580v1

Cached at: 06/01/26, 09:25 AM

# Speculative Decoding Across Languages
Source: [https://arxiv.org/html/2605.30580](https://arxiv.org/html/2605.30580)
Nirajan Paudel\*, Michael Ginn\*, Luc De Nardi, Alexis Palmer \(University of Colorado\)\. \*Equal contribution

###### Abstract

Speculative decoding \(Leviathan et al\., [2023](https://arxiv.org/html/2605.30580#bib.bib14); Chen et al\., [2023](https://arxiv.org/html/2605.30580#bib.bib5)\) has become a crucial component of large language model \(LLM\) inference, enabling faster generation by drafting multiple tokens and verifying them in parallel\. However, small draft models tend to suffer from disproportionately poor multilingual capabilities \(Conneau et al\., [2020](https://arxiv.org/html/2605.30580#bib.bib6)\)\. Thus, when generating text in a non\-English language, speculative decoding is far less effective \(Yi et al\., [2024](https://arxiv.org/html/2605.30580#bib.bib26); Sandler et al\., [2025](https://arxiv.org/html/2605.30580#bib.bib19)\)\.

We compare three strategies to improve speculative decoding efficiency for eleven languages: finetuning the draft model on task\-specific data \(translation\); finetuning the draft model on unlabeled monolingual corpora; and training simple n\-gram draft models on the same monolingual corpora\. We evaluate efficiency on translation \(from English into the target language\) and the held\-out task of story generation\. We find that while task\-specific distillation can significantly improve efficiency, distilled models generalize poorly to a new task\. Meanwhile, n\-gram draft models, despite lower acceptance rates, consistently provide large speed\-ups due to much faster draft generation\.

Nirajan Paudel\*, Michael Ginn\*, Luc De Nardi, and Alexis Palmer, University of Colorado\. \*Equal contribution

## 1 Introduction

Autoregressive decoding from LLMs requires $K$ serial forward passes \(each reloading the model weights into memory\) to generate a sequence of $K$ tokens\. Speculative decoding is a popular technique to accelerate inference by drafting a sequence of tokens with a lightweight draft model and verifying the drafted tokens in parallel \(Leviathan et al\., [2023](https://arxiv.org/html/2605.30580#bib.bib14); Chen et al\., [2023](https://arxiv.org/html/2605.30580#bib.bib5)\), which can drastically reduce the total number of forward passes with the target model \(also known as the verifier\)\.

![Refer to caption](https://arxiv.org/html/2605.30580v1/x1.png)

Figure 1: Overall structure of experiments\. We test four different approaches to creating draft models, which generate tokens that are then verified by the larger model\. We test on two tasks: translation and story generation\.

However, the effectiveness of speculative decoding depends on the similarity between the next\-token probability distributions of the verifier model and the draft model, often measured via the acceptance rate: how often a drafted token is accepted by the verifier model\. A poor acceptance rate results in a minimal speed\-up, or even worse generation speed if the cost of drafting tokens outweighs the benefit\.

This raises a concern for languages other than English\. It is well known that model capacity is correlated with multilingual capabilities \(Conneau et al\., [2020](https://arxiv.org/html/2605.30580#bib.bib6); Chang et al\., [2024](https://arxiv.org/html/2605.30580#bib.bib4)\), suggesting that smaller draft models may differ significantly from the verifier model when generating text in a less common language\. Yi et al\. \([2024](https://arxiv.org/html/2605.30580#bib.bib26)\) and Sandler et al\. \([2025](https://arxiv.org/html/2605.30580#bib.bib19)\) provide empirical support for this claim, observing significantly worse acceptance rates for non\-English languages\. The issue is exacerbated by tokenization biases, where less common languages can require far more tokens on average \(Petrov et al\., [2023](https://arxiv.org/html/2605.30580#bib.bib17)\)\. As a result, non\-English users of LLMs may suffer far slower generation speed, yet another instance of disparities in language technologies \(Blasi et al\., [2022](https://arxiv.org/html/2605.30580#bib.bib2)\)\.

We test a standard speculative decoding setup with multilingual verifier and draft models \(Qwen 3\.5\) on translation from English into eleven different languages, observing poor acceptance rates\. We compare three different approaches for training draft models for specific languages: \(1\) task\-specific distillation, \(2\) distillation on general domain monolingual corpora, and \(3\) n\-gram modeling\. We test whether these approaches generalize to a new domain—story generation—without specifically training for that domain\. We find resounding evidence in favor of n\-gram models, which achieve large speed\-ups due to inexpensive forward passes\. Meanwhile, distillation is effective in\-domain, but generalizes to new tasks poorly\. We will release our code on GitHub \(anonymized\)\.

## 2 Background and Related Work

Leviathan et al\. \([2023](https://arxiv.org/html/2605.30580#bib.bib14)\); Chen et al\. \([2023](https://arxiv.org/html/2605.30580#bib.bib5)\); Xia et al\. \([2023](https://arxiv.org/html/2605.30580#bib.bib25)\) proposed speculative decoding as a technique for accelerating generation\. Specifically, the draft model autoregressively generates $\gamma$ new tokens with conditional probabilities $p(x_1) \dots p(x_\gamma)$, and the verifier model runs a single forward pass over the full drafted sequence, assigning distinct probabilities $q(x_1) \dots q(x_\gamma)$\. Writing $p_i = p(x_i)$ and $q_i = q(x_i)$, for each token $i$, if $p_i \leq q_i$, the token is accepted\. If $p_i > q_i$, we randomly choose between: a\) accepting the token with probability $\frac{q_i}{p_i}$; or b\) rejecting the token and all subsequent draft tokens and then sampling a new token from the adjusted distribution $p'(x) \propto \max(0, q(x) - p(x))$, that is, the positive part of $q - p$ renormalized to sum to one\. This results in a sampling distribution equivalent to the verifier model’s distribution\.
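The accept/reject rule can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `verify_draft` and the dict-based distributions are hypothetical, and the residual distribution is clipped at zero and renormalized, following the standard formulation in Leviathan et al. (2023).

```python
import random


def verify_draft(draft_tokens, p_draft, q_verifier, rng=random):
    """Accept or reject drafted tokens from left to right.

    draft_tokens: token ids proposed by the draft model.
    p_draft[i], q_verifier[i]: dicts mapping token id -> probability at
    draft position i (draft and verifier distributions, respectively).
    Returns (accepted_tokens, replacement_token_or_None).
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_draft[i], q_verifier[i]
        p_tok, q_tok = p.get(tok, 0.0), q.get(tok, 0.0)
        # Accept outright if p <= q, otherwise with probability q / p.
        if p_tok <= q_tok or rng.random() < q_tok / p_tok:
            accepted.append(tok)
            continue
        # Rejected: drop this and all later draft tokens, then resample
        # from the renormalized positive part of q - p.
        residual = {t: max(0.0, q.get(t, 0.0) - p.get(t, 0.0)) for t in q}
        total = sum(residual.values())
        tokens = list(residual)
        weights = [residual[t] / total for t in tokens]
        return accepted, rng.choices(tokens, weights=weights, k=1)[0]
    # Every draft token was accepted; a full decoder would typically
    # sample one further token from the verifier at this point.
    return accepted, None


# Toy example: the first token is accepted (p <= q), the second is
# always rejected because the verifier assigns it zero probability.
p = [{1: 0.5, 2: 0.5}, {1: 1.0}]
q = [{1: 0.9, 2: 0.1}, {2: 1.0}]
print(verify_draft([1, 1], p, q))
```

A rejection ends the round early, which is why a low acceptance rate wastes most of the drafting work.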

To improve the probability of accepting a draft token, a popular technique is performing knowledge distillation from the verifier model to the draft model \(Zhou et al\., [2024](https://arxiv.org/html/2605.30580#bib.bib31)\), where the draft model is finetuned to minimize its KL divergence with the teacher’s distribution\. Yi et al\. \([2024](https://arxiv.org/html/2605.30580#bib.bib26)\) apply this technique to a multilingual setting, distilling models for translation from another language into English, but they do not study the more difficult case of generating text in the target language \(except for a brief evaluation in their appendix\)\.

Our work is most similar to Sandler et al\. \([2025](https://arxiv.org/html/2605.30580#bib.bib19)\), which observes acceptance rate disparities across tasks and languages\. They propose methods to mitigate these issues by balancing the training dataset or scaling per\-task gradients during distillation\. However, their approach requires a representative dataset for each of the desired tasks, which is infeasible to acquire for many languages\. Instead, we focus on testing whether methods can generalize to new tasks without task\-specific training data\.

## 3 Methodology

We evaluate on two tasks: machine translation \(MT\) and open\-ended story generation\. MT serves as our in\-domain task, as we perform distillation \(see §[3\.3](https://arxiv.org/html/2605.30580#S3.SS3)\) on task\-specific translation examples\. Story generation serves as an out\-of\-domain task \(where the draft model is not trained on story generation data\) in order to measure the robustness of our methods to different types of tasks\.

### 3\.1 Datasets

For machine translation, we use parallel data from a variety of sources \(limited to 5,200 examples per language\), with dataset counts described in [Table 1](https://arxiv.org/html/2605.30580#S3.T1) and sources in [Appendix C](https://arxiv.org/html/2605.30580#A3)\. Our main evaluation for speculative decoding efficiency uses the test split only, and the train split is reserved for distillation\. We also collect monolingual, unlabeled data for distillation and training, also listed in [Table 1](https://arxiv.org/html/2605.30580#S3.T1)\. We use datasets for eleven different languages, covering a range of resourcedness, typological properties, and geographic regions\.

| Language | Mono tokens \(train/test\) | Parallel sentences \(train/test\) |
|---|---|---|
| Amharic \[amh\] | 1\.3M / 321\.0k | 4800 / 400 |
| Berber \[ber\] | 306\.3k / 76\.9k | 4800 / 400 |
| Cherokee \[chr\] | 1\.3M / 315\.8k | 4800 / 400 |
| Guarani \[grn\] | 319\.3k / 79\.2k | 788 / 198 |
| Hawaiian \[haw\] | 82\.1k / 20\.5k | 96 / 25 |
| Igbo \[ibo\] | 677\.5k / 164\.0k | 1398 / 350 |
| Nepali \[npi\] | 13\.0M / 3\.2M | 3133 / 400 |
| Occitan \[oci\] | 46\.0k / 11\.5k | 3631 / 400 |
| Quechua \[que\] | 510\.2k / 131\.8k | 4800 / 400 |
| Yoruba \[yor\] | 1\.0M / 256\.9k | 4800 / 400 |
| Tamazight \[zgh\] | 4\.8M / 336\.6k | 4800 / 400 |

Table 1: Per language, number of tokens of monolingual text and number of parallel sentences\. Sources are described in [Table 4](https://arxiv.org/html/2605.30580#A4.T4) and [Table 5](https://arxiv.org/html/2605.30580#A4.T5)\.

For story generation, we create topics by pairing nouns from Brysbaert et al\. \([2014](https://arxiv.org/html/2605.30580#bib.bib3)\) and adjectives from the Brown Corpus \(Francis, [1979](https://arxiv.org/html/2605.30580#bib.bib10)\) by similarity according to GloVe word embeddings \(Pennington et al\., [2014](https://arxiv.org/html/2605.30580#bib.bib16)\)\. During evaluation we prompt the model to generate a story in the target language about each topic\. A few examples of generated stories appear in [Figure 11](https://arxiv.org/html/2605.30580#A5.F11)\. We do not compute any metrics over these stories, but we do verify that they are consistently generated in the correct language and do not get stuck in endless repetition loops\. As generated stories tend to be longer than translations, we only use 200 test examples\.

### 3\.2 Models

We use the Qwen 3\.5 family of models \(Team, [2026](https://arxiv.org/html/2605.30580#bib.bib21)\)\. Our main results are presented using the 9b parameter model as the verifier and the 0\.8b parameter model as the draft model; however, we include results with 2b and 4b parameter draft models in §[D\.3](https://arxiv.org/html/2605.30580#A4.SS3)\. We use sampling\-based inference with a top\-k of 100, a top\-p of 0\.9, and a maximum of 128 generated tokens\. We use KV caching for both the draft and verifier models \(Pope et al\., [2022](https://arxiv.org/html/2605.30580#bib.bib18)\)\. For each setting, we sweep the $\gamma$ value over $[2, 4]$ and report the best result\.
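To make the sampling configuration concrete, the sketch below shows how top-k and top-p (nucleus) truncation restrict a next-token distribution before sampling. It is a simplified illustration with a hypothetical function name and toy probabilities, not the Hugging Face implementation the experiments rely on, and boundary handling may differ from library code.

```python
def top_k_top_p_filter(probs, top_k=100, top_p=0.9):
    """Restrict a next-token distribution to its top-k / nucleus support.

    probs: dict mapping token -> probability (assumed to sum to 1).
    Returns a renormalized dict over the surviving tokens.
    """
    # Keep the top_k most probable tokens and renormalize.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    mass = sum(p for _, p in ranked)
    ranked = [(tok, p / mass) for tok, p in ranked]
    # Keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    mass = sum(p for _, p in kept)
    return {tok: p / mass for tok, p in kept}


# Toy distribution over four tokens (illustrative values only).
toy = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_top_p_filter(toy, top_k=3, top_p=0.9)
print(sorted(filtered))
```

Because both models sample from truncated distributions, the acceptance test compares probabilities only over the tokens that survive this filtering.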

### 3\.3 Experimental Settings

Our baseline setting is an unmodified smaller model of the same family \(0\.8b Qwen\)\. We then perform soft\-target distillation \(details in §[B](https://arxiv.org/html/2605.30580#A2)\) from the verifier model to the 0\.8b model on two types of corpora\. First, for our distilled \(task\) setting, we use parallel data for distillation on translation prompts\. Second, we distill on monolingual, general domain text for the distilled \(general\) setting\.

Finally, we train simple n\-gram draft models on the same monolingual corpora, using the Qwen tokenizer to segment tokens\. We test various $n$ values and find that bigrams \(2\-grams\) produce the highest acceptance rates for all languages\. During inference, we compute a simple logit distribution for an $(n-1)$\-length prefix from the conditional distribution observed during training\.
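A bigram draft model of this kind can be sketched with plain counting. The example below is an assumption-laden illustration: the function names are hypothetical, the token ids are toy integers rather than Qwen tokenizer output, and it returns probabilities where the paper describes a logit distribution.

```python
from collections import Counter, defaultdict


def train_bigram(token_ids):
    """Count next-token frequencies for each single-token prefix."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(token_ids, token_ids[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_distribution(counts, prev):
    """Return the observed conditional distribution P(next | prev)."""
    followers = counts.get(prev)
    if not followers:
        # Unseen prefix: a real system would need some fallback here.
        return {}
    total = sum(followers.values())
    return {tok: c / total for tok, c in followers.items()}


# Toy corpus of token ids (illustrative only).
corpus = [1, 2, 1, 3, 1, 2]
model = train_bigram(corpus)
print(next_token_distribution(model, 1))
```

Drafting with such a model is a dictionary lookup rather than a neural forward pass, which is the source of its very low per-token cost.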

### 3\.4 Metrics

#### Acceptance Rate \($\alpha$\)\.

Following Leviathan et al\. \([2023](https://arxiv.org/html/2605.30580#bib.bib14)\), Definition 3\.1, we measure the probability that a draft token is accepted by the target model, given all prior tokens have been confirmed\. We estimate this probability for a single example $i$ with the Monte Carlo estimator $\hat{\alpha}_i = \frac{m_i}{d_i}$, where $m_i$ is the number of accepted tokens and $d_i$ is the number proposed, and we take the mean over examples\.
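As a small illustration of this estimator (the function name and the counts are hypothetical), note that it averages per-example ratios rather than pooling all tokens, so short and long examples carry equal weight:

```python
def mean_acceptance_rate(per_example_counts):
    """Average the per-example ratios m_i / d_i.

    per_example_counts: list of (accepted, proposed) pairs, one per example.
    Examples with no proposed tokens are skipped to avoid division by zero.
    """
    rates = [m / d for m, d in per_example_counts if d > 0]
    return sum(rates) / len(rates)


# Two toy examples: 3 of 4 and 1 of 4 drafted tokens accepted.
print(mean_acceptance_rate([(3, 4), (1, 4)]))
```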

#### Speed\-up Factor \($f$\)\.

Using $\alpha$, we compute the speed\-up factor, which quantifies the expected improvement under speculative decoding:

$$f = \frac{1 - \alpha^{\gamma + 1}}{(1 - \alpha)(\gamma c + 1)} \tag{1}$$

where $c = t_{\text{draft}} / t_{\text{target}}$ is the cost ratio of a single draft forward pass to a single target forward pass \(measured via CUDA timing\)\. Speed\-up values are reported as multiplicative factors \(e\.g\., $1.2\times$\)\.
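Equation (1) is simple to evaluate directly, and doing so makes the trade-off between acceptance rate and draft cost easy to explore. The sketch below uses a hypothetical function name, and the cost ratios in the example are assumed values chosen for illustration, not measurements from the paper.

```python
def speedup_factor(alpha, gamma, c):
    """Expected speed-up f from Equation (1).

    alpha: acceptance rate, gamma: number of drafted tokens per round,
    c: cost of one draft forward pass relative to one target pass.
    """
    if alpha == 1.0:
        expected_tokens = gamma + 1  # limit of the geometric sum
    else:
        expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    return expected_tokens / (gamma * c + 1)


# Assumed cost ratios (hypothetical): a neural draft at c = 0.3 versus
# a near-free draft at c = 0.01, using acceptance rates of 0.40 and 0.24.
print(speedup_factor(0.40, gamma=2, c=0.3))
print(speedup_factor(0.24, gamma=2, c=0.01))
```

Under these assumed cost ratios the cheaper draft comes out ahead despite its lower acceptance rate, and the first configuration falls slightly below 1, meaning speculative decoding would be a net slowdown.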

![Refer to caption](https://arxiv.org/html/2605.30580v1/x2.png)

Figure 2: Distribution of average forward pass times for each LLM and our n\-gram models\. The small line in the upper left represents the n\-gram time of 0\.001 s\.

![Refer to caption](https://arxiv.org/html/2605.30580v1/x3.png)
![Refer to caption](https://arxiv.org/html/2605.30580v1/x4.png)

Figure 3: Acceptance rates \(top\) and speed\-up factors \(bottom\) for translation prompts, over four experimental settings and eleven languages\. Error bars are standard deviation\.

![Refer to caption](https://arxiv.org/html/2605.30580v1/x5.png)

Figure 4: Speed\-up factors for story generation prompts, over four experimental settings and eleven languages\. Error bars are standard deviation\.

## 4 Findings

### 4\.1 Translation

[Figure 3](https://arxiv.org/html/2605.30580#S3.F3) reports the average acceptance rates and speed\-up factors per setting for the translation task\.

#### Standard speculative decoding is ineffective due to low acceptance rates\.

In the baseline setting using unspecialized draft models, acceptance rates are fairly low, ranging from $0.30$ \(ber\) to $0.54$ \(amh\) with an average over languages of $\bar{\alpha} = 0.40$\. This results in most languages having speed\-up factors close to 1 \($\bar{f} = 1.02\times$\), in which case speculative decoding is not beneficial\. We find no meaningful correlation between translation quality \(chrF score\) and acceptance rates, as discussed in §D\.4 \([Figure 9](https://arxiv.org/html/2605.30580#A0.F9)\)\.

#### Task\-specific distillation improves acceptance rates, while general domain distillation is less effective\.

As expected, distillation on translation prompts clearly improves acceptance rates \($\bar{\alpha} = 0.60$\) and therefore speed\-up factors \($\bar{f} = 1.28\times$\)\. Meanwhile, the general domain distillation setting \($\bar{\alpha} = 0.39$, $\bar{f} = 1.03\times$\) only beats the baseline in five of eleven languages\. Thus, distillation is an effective option only for specific, predetermined domains\. In this case, the distillation loss provides a lower bound on acceptance rate \(see §[D\.5](https://arxiv.org/html/2605.30580#A4.SS5)\)\.

### 4\.2 Story Generation

[Figure 4](https://arxiv.org/html/2605.30580#S3.F4) and [Figure 6](https://arxiv.org/html/2605.30580#A0.F6) report speed\-up factors and acceptance rates, respectively, for the story generation prompts\. We see trends similar to the prior setting for the baseline \($\bar{\alpha} = 0.46$, $\bar{f} = 1.09\times$\), n\-gram model \($\bar{\alpha} = 0.30$, $\bar{f} = 1.39\times$\), and general domain distilled model \($\bar{\alpha} = 0.47$, $\bar{f} = 1.10\times$\)\.

#### Task\-specific distillation does not generalize to new domains well\.

The task\-specific distilled models show a large deterioration on the new task \($\bar{\alpha} = 0.43$, $\bar{f} = 1.03\times$\), underperforming the baseline and the general distillation approach\. [Figure 5](https://arxiv.org/html/2605.30580#S4.F5) shows this trade\-off: distillation improves acceptance on the training task at the expense of the other, unseen task\.

#### N\-gram models are highly effective due to favorable draft\-to\-verifier cost ratios\.

While the n\-gram models tend to have inferior acceptance rates \($\bar{\alpha} = 0.24$\), they are much faster to run, with a mean forward pass of $0.001$ s compared to $0.033$ s for the 0\.8b draft model \([Figure 2](https://arxiv.org/html/2605.30580#S3.F2)\)\. This results in highly competitive speed\-up factors \($\bar{f} = 1.30\times$\), beating the baseline on all but one language \(oci\), and suggesting that n\-gram models are a viable option if a monolingual training corpus is available\.
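As a rough worked example (an illustration built from the averages quoted above, not a calculation taken from the paper), the two forward-pass times give the relative cost of the two drafters, and letting $c \to 0$ in Equation (1) gives the ceiling on the speed-up at the n-gram acceptance rate. The choice $\gamma = 2$ is an assumption made here for illustration; the paper sweeps $\gamma$ and reports the best result.

```latex
\frac{t_{\text{n-gram}}}{t_{\text{0.8b}}} = \frac{0.001}{0.033} \approx 0.03,
\qquad
\lim_{c \to 0} f = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}
= 1 + \alpha + \alpha^{2}
= 1 + 0.24 + 0.0576 \approx 1.30
\quad (\alpha = 0.24,\ \gamma = 2).
```

Plugging a mean acceptance rate into the formula is only indicative, since the reported $\bar{f}$ is averaged over per-language values, but it shows that a nearly free drafter can approach this ceiling even when fewer than a quarter of its tokens are accepted.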

![Refer to caption](https://arxiv.org/html/2605.30580v1/x6.png)

Figure 5: Relationship between the acceptance rates, per language, for translation and story generation\.

## 5 Conclusion

Our results confirm that standard speculative decoding may not be useful for non\-English languages, and can even be detrimental\. We identify distillation as an effective way to improve per\-language performance, but improvements are limited to the training domain and fail to generalize to new tasks\. Meanwhile, we demonstrate that simple n\-gram models can provide superior speed\-up factors \(across domains\) thanks to the low cost of inference\. In real\-world usage, then, it would be beneficial to dynamically switch the draft model based on the language being generated, and in many cases an n\-gram model may be the best choice\.

## Limitations

This work studied a single family of models \(Qwen 3\.5\), using the 9b model as a verifier and primarily using the 0\.8b model as the draft\. Absolute results may vary across other size combinations and model families, but we expect the general trends to hold\. Our work also studies a small set of eleven languages, and results may differ for languages dissimilar to those eleven\. Finally, our work studies two realistic tasks in a rare language, and our results may not hold for tasks like mathematical reasoning \(as in Sandler et al\. \([2025](https://arxiv.org/html/2605.30580#bib.bib19)\)\)\.

## Ethical Considerations

This work suggests approaches to mitigate user experience disparities for disadvantaged languages\. However, we recognize that work done without the involvement of language stakeholders risks poor alignment with the actual needs of speakers\. In particular, we are aware that the generations in our evaluation suites were very poor quality, largely unusable for the desired tasks\. This was not a primary concern as our main focus was generation speed, but there remains a large gap in usability for many languages\. Our work used data sourced from existing open access datasets, and we defer to their ethical policies regarding data collection and processing\. Finally, we recognize the severe environmental cost of training large\-scale models, and have striven to use our resources efficiently\. We used AI assistants for code review and to help modify plotting code\.

## References

- Adelani et al\. \(2021\) David Ifeoluwa Adelani, Dana Ruiter, Jesujoba O\. Alabi, Damilola Adebonojo, Adesina Ayeni, Mofe Adeyemi, Ayodele Esther Awokoya, and Cristina España\-Bonet\. 2021\. [The effect of domain and diacritics in Yoruba–English neural machine translation](https://aclanthology.org/2021.mtsummit-research.6/)\. In *Proceedings of Machine Translation Summit XVIII: Research Track*, pages 61–75, Virtual\. Association for Machine Translation in the Americas\.
- Blasi et al\. \(2022\) Damian Blasi, Antonios Anastasopoulos, and Graham Neubig\. 2022\. [Systematic inequalities in language technology performance across the world’s languages](https://doi.org/10.18653/v1/2022.acl-long.376)\. In *Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 5486–5505, Dublin, Ireland\. Association for Computational Linguistics\.
- Brysbaert et al\. \(2014\) Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman\. 2014\. Concreteness ratings for 40 thousand generally known English word lemmas\. *Behavior Research Methods*, 46\(3\):904–911\.
- Chang et al\. \(2024\) Tyler A\. Chang, Catherine Arnett, Zhuowen Tu, and Ben Bergen\. 2024\. [When is multilinguality a curse? language modeling for 250 high\- and low\-resource languages](https://doi.org/10.18653/v1/2024.emnlp-main.236)\. In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 4074–4096, Miami, Florida, USA\. Association for Computational Linguistics\.
- Chen et al\. \(2023\) Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean\-Baptiste Lespiau, Laurent Sifre, and John Jumper\. 2023\. [Accelerating large language model decoding with speculative sampling](https://arxiv.org/abs/2302.01318)\. *Preprint*, arXiv:2302\.01318\.
- Conneau et al\. \(2020\) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov\. 2020\. [Unsupervised cross\-lingual representation learning at scale](https://doi.org/10.18653/v1/2020.acl-main.747)\. In *Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, pages 8440–8451, Online\. Association for Computational Linguistics\.
- Dasgupta et al\. \(2026\) Sayantan Dasgupta, Timothy Baldwin, and Trevor Cohn\. 2026\. [Don’t ignore the tail: Decoupling top\-$k$ probabilities for efficient language model distillation](https://openreview.net/forum?id=EYflZV1caL)\.
- Doherty \(2016\) Liam Doherty\. 2016\. [The Hawaiian corpus project: Data from a corpus of written Hawaiian](https://dohliam.github.io/corpus/haw/)\.
- Ebrahimi et al\. \(2024\) Abteen Ebrahimi, Ona de Gibert, Raul Vazquez, Rolando Coto\-Solano, Pavel Denisov, Robert Pugh, Manuel Mager, Arturo Oncevay, Luis Chiruzzo, Katharina von der Wense, and Shruti Rijhwani\. 2024\. [Findings of the AmericasNLP 2024 shared task on machine translation into indigenous languages](https://doi.org/10.18653/v1/2024.americasnlp-1.28)\. In *Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas \(AmericasNLP 2024\)*, pages 236–246, Mexico City, Mexico\. Association for Computational Linguistics\.
- Francis \(1979\) W Nelson Francis\. 1979\. Brown corpus manual\. *http://icame\.uib\.no/brown/bcm\.html*\.
- Goldhahn et al\. \(2012\) Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff\. 2012\. [Building large monolingual dictionaries at the Leipzig corpora collection: From 100 to 200 languages](https://aclanthology.org/L12-1154/)\. In *Proceedings of the Eighth International Conference on Language Resources and Evaluation \(LREC’12\)*, pages 759–765, Istanbul, Turkey\. European Language Resources Association \(ELRA\)\.
- Hinton et al\. \(2015\) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean\. 2015\. [Distilling the knowledge in a neural network](https://arxiv.org/abs/1503.02531)\. *Preprint*, arXiv:1503\.02531\.
- IRCA \(2015\) IRCA\. 2015\. [Talam \(traitement automatique de la langue amazighe\)](https://tal.ircam.ma/talam/corpus.php)\.
- Leviathan et al\. \(2023\) Yaniv Leviathan, Matan Kalman, and Yossi Matias\. 2023\. [Fast inference from transformers via speculative decoding](https://arxiv.org/abs/2211.17192)\. *Preprint*, arXiv:2211\.17192\.
- \(15\) Tim Nuttle, Tim Orr, TommyLee Whitlock, and Sarah Orndorff\. [\[link\]](https://cherokeedictionary.net/)\.
- Pennington et al\. \(2014\) Jeffrey Pennington, Richard Socher, and Christopher Manning\. 2014\. [GloVe: Global vectors for word representation](https://doi.org/10.3115/v1/D14-1162)\. In *Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pages 1532–1543, Doha, Qatar\. Association for Computational Linguistics\.
- Petrov et al\. \(2023\) Aleksandar Petrov, Emanuele La Malfa, Philip Torr, and Adel Bibi\. 2023\. [Language model tokenizers introduce unfairness between languages](https://proceedings.neurips.cc/paper_files/paper/2023/file/74bb24dca8334adce292883b4b651eda-Paper-Conference.pdf)\. In *Advances in Neural Information Processing Systems*, volume 36, pages 36963–36990\. Curran Associates, Inc\.
- Pope et al\. \(2022\) Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean\. 2022\. [Efficiently scaling transformer inference](https://arxiv.org/abs/2211.05102)\. *Preprint*, arXiv:2211\.05102\.
- Sandler et al\. \(2025\) Jameson Sandler, Ahmet Üstün, Marco Romanelli, Sara Hooker, and Ferdinando Fioretto\. 2025\. [The disparate impacts of speculative decoding](https://arxiv.org/abs/2510.02128)\. *Preprint*, arXiv:2510\.02128\.
- Singh et al\. \(2024\) Shivalika Singh, Freddie Vargus, Daniel D’souza, Börje F\. Karlsson, Abinaya Mahendiran, Wei\-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura O’Mahony, Mike Zhang, Ramith Hettiarachchi, Joseph Wilson, Marina Machado, Luisa Moura, Dominik Krzemiński, Hakimeh Fadaei, Irem Ergun, Ifeoma Okoh, and 14 others\. 2024\. [Aya dataset: An open\-access collection for multilingual instruction tuning](https://doi.org/10.18653/v1/2024.acl-long.620)\. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 11521–11567, Bangkok, Thailand\. Association for Computational Linguistics\.
- Team \(2026\) Qwen Team\. 2026\. [Qwen3\.5: Accelerating productivity with native multimodal agents](https://qwen.ai/blog?id=qwen3.5)\.
- Thapa et al\. \(2025\) Prajwal Thapa, Jinu Nyachhyon, Mridul Sharma, and Bal Krishna Bal\. 2025\. [Development of pre\-trained transformer\-based models for the Nepali language](https://aclanthology.org/2025.chipsal-1.2/)\. In *Proceedings of the First Workshop on Challenges in Processing South Asian Languages \(CHiPSAL 2025\)*, pages 9–16, Abu Dhabi, UAE\. International Committee on Computational Linguistics\.
- Tiedemann \(2012\) Jörg Tiedemann\. 2012\. [Parallel data, tools and interfaces in OPUS](https://aclanthology.org/L12-1246/)\. In *Proceedings of the Eighth International Conference on Language Resources and Evaluation \(LREC’12\)*, pages 2214–2218, Istanbul, Turkey\. European Language Resources Association \(ELRA\)\.
- Tiedemann \(2020\) Jörg Tiedemann\. 2020\. [The tatoeba translation challenge – realistic data sets for low resource and multilingual MT](https://aclanthology.org/2020.wmt-1.139/)\. In *Proceedings of the Fifth Conference on Machine Translation*, pages 1174–1182, Online\. Association for Computational Linguistics\.
- Xia et al\. \(2023\) Heming Xia, Tao Ge, Peiyi Wang, Si\-Qing Chen, Furu Wei, and Zhifang Sui\. 2023\. [Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation](https://doi.org/10.18653/v1/2023.findings-emnlp.257)\. In *Findings of the Association for Computational Linguistics: EMNLP 2023*, pages 3909–3925, Singapore\. Association for Computational Linguistics\.
- Yi et al\. \(2024\) Euiin Yi, Taehyeon Kim, Hongseok Jeung, Du\-Seong Chang, and Se\-Young Yun\. 2024\. [Towards fast multilingual LLM inference: Speculative decoding and specialized drafters](https://doi.org/10.18653/v1/2024.emnlp-main.602)\. In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 10789–10802, Miami, Florida, USA\. Association for Computational Linguistics\.
- Yu et al\. \(2025\) Yijiong Yu, Ziyun Dai, Zekun Wang, Wei Wang, Ran Chen, and Ji Pei\. 2025\. [Opencsg chinese corpus: A series of high\-quality chinese datasets for llm training](https://arxiv.org/abs/2501.08197)\. *Preprint*, arXiv:2501\.08197\.
- Zevallos et al\. \(2022\) Rodolfo Zevallos, John Ortega, William Chen, Richard Castro, Núria Bel, Cesar Yoshikawa, Renzo Venturas, Hilario Aradiel, and Nelsi Melgarejo\. 2022\. [Introducing QuBERT: A large monolingual corpus and BERT model for Southern Quechua](https://doi.org/10.18653/v1/2022.deeplo-1.1)\. In *Proceedings of the Third Workshop on Deep Learning for Low\-Resource Natural Language Processing*, pages 1–13, Hybrid\. Association for Computational Linguistics\.
- Zhang et al\. \(2020a\) Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich\. 2020a\. [Improving massively multilingual neural machine translation and zero\-shot translation](https://doi.org/10.18653/v1/2020.acl-main.148)\. In *Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, pages 1628–1639, Online\. Association for Computational Linguistics\.
- Zhang et al\. \(2020b\) Shiyue Zhang, Benjamin Frey, and Mohit Bansal\. 2020b\. [ChrEn: Cherokee\-English machine translation for endangered language revitalization](https://doi.org/10.18653/v1/2020.emnlp-main.43)\. In *Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pages 577–595, Online\. Association for Computational Linguistics\.
- Zhou et al\. \(2024\) Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean\-François Kagy, and Rishabh Agarwal\. 2024\. [Distillspec: Improving speculative decoding via knowledge distillation](https://openreview.net/forum?id=rsY6J3ZaTF)\. In *The Twelfth International Conference on Learning Representations*\.

![Refer to caption](https://arxiv.org/html/2605.30580v1/x7.png)

Figure 6: Acceptance rates for story generation prompts, over four experimental settings and eleven languages\. Error bars are standard deviation\.

![Refer to caption](https://arxiv.org/html/2605.30580v1/x8.png)
![Refer to caption](https://arxiv.org/html/2605.30580v1/x9.png)

Figure 7: Tokens/second for translation \(top\) and story generation \(bottom\) prompts, over four experimental settings and eleven languages\. Error bars are standard deviation\.

![Refer to caption](https://arxiv.org/html/2605.30580v1/x10.png)
![Refer to caption](https://arxiv.org/html/2605.30580v1/x11.png)

Figure 8: Acceptance rates \(left\) and speed\-up factors \(right\) for translation prompts in the baseline setting over draft model sizes\.

![Refer to caption](https://arxiv.org/html/2605.30580v1/x12.png)

Figure 9: Relationship between the model’s translation capability \(chrF\+\+\) in a language and acceptance rates on translation prompts, using the baseline draft model\.

## Appendix A Implementation

We use the Hugging Face libraries for model implementations and tokenizers, Weights and Biases for tracking, Sacrebleu for evaluation, and NLTK for story generation\.

## Appendix B Distillation Details

We train draft models via distillation using the truncated cross\-entropy loss on the top twenty logits of the teacher\. We use the train splits from each dataset, and we select 5% of these examples to use as a validation split\. For translation, we use greedy decoding to generate translations using the teacher model, and perform soft\-target distillation on the generated outputs using the top 20 logits per token \(Hinton et al\., [2015](https://arxiv.org/html/2605.30580#bib.bib12)\)\. We ignore the tail of the logit distribution, which is known to incur bias \(Dasgupta et al\., [2026](https://arxiv.org/html/2605.30580#bib.bib7)\)\. For the general domain distillation, we do not generate outputs, but rather compute teacher logits on the raw texts\. All languages and models use the hyperparameters in [Table 2](https://arxiv.org/html/2605.30580#A2.T2) and bfloat16 precision, and we sweep the learning rate over $\{1 \times 10^{-4}, 5 \times 10^{-5}, 2 \times 10^{-5}\}$ and select the best model by validation loss\. For the general domain distillation approach, we truncate sequences to 128 tokens\. For translation, we allow up to 128 generated tokens \(excluding the prompt\)\.

| Hyperparameter | Value |
| --- | --- |
| Total steps | 800 |
| Batch size | 16 |
| Grad\. acc\. steps | 4 |
| Warm\-up steps | 60 |
| Optimizer | AdamW |
| LR Schedule | cosine |
| Weight decay | 0\.01 |

Table 2: Distillation hyperparameters
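
To make the schedule implied by these hyperparameters concrete, here is a small sketch of a cosine learning\-rate schedule with warm\-up. Several details are assumptions rather than facts from the paper: the warm\-up is taken to be linear, the cosine phase is taken to decay to zero, "total steps" is taken to count optimizer updates, and the peak learning rate is just one value from the sweep. The function name `lr_at` is hypothetical.

```python
import math

TOTAL_STEPS = 800     # from Table 2
WARMUP_STEPS = 60     # from Table 2
BATCH_SIZE = 16       # from Table 2
GRAD_ACC_STEPS = 4    # from Table 2

# Sequences contributing to one optimizer update (assuming a single device).
EFFECTIVE_BATCH = BATCH_SIZE * GRAD_ACC_STEPS


def lr_at(step, peak_lr=5e-5):
    """Learning rate at a given optimizer step.

    Assumptions: linear warm-up from 0 to peak_lr, then cosine decay to 0.
    """
    if step < WARMUP_STEPS:
        return peak_lr * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 30, 60, 430, 800):
        print(step, lr_at(step))
    print("effective batch size:", EFFECTIVE_BATCH)
```

Under these assumptions the rate peaks at step 60 and reaches half of its peak at step 430, the midpoint of the decay phase.
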
## Appendix C Dataset Sources

We list the various data sources in [Table 3](https://arxiv.org/html/2605.30580#A4.T3), as well as the per\-language counts in [Table 4](https://arxiv.org/html/2605.30580#A4.T4) and [Table 5](https://arxiv.org/html/2605.30580#A4.T5)\.

## Appendix D Additional Results

### D\.1 Story Generation Acceptance Rates

Acceptance rates for story generation are provided in [Figure 6](https://arxiv.org/html/2605.30580#A0.F6), corresponding to the speed\-up factors in [Figure 4](https://arxiv.org/html/2605.30580#S3.F4)\.

### D\.2 Throughput

We report throughput \(tokens/second\) for both tasks in [Figure 7](https://arxiv.org/html/2605.30580#A0.F7)\. Raw decoding throughput is an implementation\-agnostic measure of generation speed\. We compute it as the total number of generated tokens divided by the total decode wall\-clock time, excluding prefill:

$$\text{TPS} = \frac{\sum_{i=1}^{N} g_i}{\sum_{i=1}^{N} t_i} \tag{2}$$

where $g_i$ is the number of output tokens for sentence $i$ and $t_i$ is its decode time\.
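
The throughput definition above pools tokens and time over the whole test set, which is not the same as averaging per\-sentence rates. The sketch below contrasts the two on made\-up numbers; the function names and the toy values are hypothetical and are not measurements from the paper.

```python
def corpus_tps(tokens, seconds):
    """Pooled throughput: total generated tokens over total decode time."""
    return sum(tokens) / sum(seconds)


def mean_sentence_tps(tokens, seconds):
    """Unweighted mean of per-sentence rates, shown only for contrast."""
    rates = [g / t for g, t in zip(tokens, seconds)]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Toy values: one short, slow sentence and one long, fast sentence.
    tokens = [10, 100]
    seconds = [1.0, 4.0]
    print(corpus_tps(tokens, seconds))         # 22.0
    print(mean_sentence_tps(tokens, seconds))  # 17.5
```

The pooled form weights each sentence by its decode time, so a few short outputs with high fixed overhead do not dominate the result.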

### D\.3 Draft Model Size

In addition to the 0\.8b parameter draft model, we also test the baseline setting \(no distillation\) with the 2b and 4b parameter Qwen 3\.5 models\. From the 0\.8b results, we select the language with the highest acceptance rate \(amh\), the language with the lowest acceptance rate \(ber\), and a random language from the middle \(grn\)\. We report speed\-up factors and acceptance rates in [Figure 8](https://arxiv.org/html/2605.30580#A0.F8)\. Larger draft models improve both the acceptance rate and the speed\-up factor in most cases, although in one case \(amh\) the largest model shows a drop in speed\-up\.

### D\.4 Translation Quality

In [Figure 9](https://arxiv.org/html/2605.30580#A0.F9), we visualize the relationship between the model’s translation capability in a language \(measured via the chrF score of generated translations\) and acceptance rate\. We had hypothesized that acceptance rates would be higher for languages the model is stronger on \(as a proxy for the prevalence of a given language in the training data\)\. However, we observe only a very weak correlation \($r = 0.170$\), which does not support our hypothesis\.
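
For readers who want to reproduce this kind of check on their own per\-language results, the sketch below computes a correlation coefficient from paired scores. It assumes that $r$ denotes Pearson's correlation coefficient, which the text does not state explicitly; the function name `pearson_r` is hypothetical and the inputs in the usage example are placeholders, not values from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder inputs: one quality score and one acceptance rate per language.
    quality = [1.0, 2.0, 3.0, 4.0]
    acceptance = [1.0, -1.0, -1.0, 1.0]
    print(pearson_r(quality, acceptance))  # 0.0 for this symmetric toy pattern
```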

### D\.5 Distillation and Acceptance Rate

When performing distillation on the same domain as the test data, we can compute a lower bound on the acceptance rate from the KL divergence between the teacher and student model\. First, we use Theorem 3\.5 from Leviathan et al\. \([2023](https://arxiv.org/html/2605.30580#bib.bib14)\) to get:

$$\alpha = \sum_{x} \min(p(x), q(x)) \tag{3}$$

Let $A = \{x : p(x) \geq q(x)\}$ and $B = \{x : q(x) > p(x)\}$\. Then, we get:

$$
\begin{aligned}
\alpha &= \sum_{x \in A} q(x) + \sum_{x \in B} p(x) \\
&= \sum_{x \in A} \bigl[ p(x) - (p(x) - q(x)) \bigr] + \sum_{x \in B} p(x) \\
&= \sum_{x \in A} p(x) + \sum_{x \in B} p(x) - \sum_{x \in A} (p(x) - q(x)) \\
&= 1 - \sum_{x \in A} (p(x) - q(x))
\end{aligned}
\tag{4}
$$

Next, we use the definition of total variation distance $\delta$:

$$\delta(P, Q) = \frac{1}{2} \sum_{x} \lvert p(x) - q(x) \rvert \tag{5}$$

Using the same definitions of $A$ and $B$,

$$\delta(P, Q) = \frac{1}{2} \left( \sum_{x \in A} (p(x) - q(x)) + \sum_{x \in B} (q(x) - p(x)) \right)$$

We show these two sums are equal:

$$
\begin{aligned}
\sum_{x \in A} (p(x) - q(x)) - \sum_{x \in B} (q(x) - p(x)) &= \sum_{x \in A} (p(x) - q(x)) + \sum_{x \in B} (p(x) - q(x)) \\
&= \sum_{x} p(x) - \sum_{x} q(x) = 0 \\
\therefore \sum_{x \in A} (p(x) - q(x)) &= \sum_{x \in B} (q(x) - p(x)) = \delta(P, Q)
\end{aligned}
$$

Substituting into Equation [4](https://arxiv.org/html/2605.30580#A4.E4), we get

$$\alpha = 1 - \delta(P, Q) \tag{6}$$

By Pinsker’s inequality, we have an upper bound on the total variation distance:

$$\delta(P, Q) \leq \sqrt{\frac{1}{2} D_{KL}(P \,\|\, Q)} \tag{7}$$

where $D_{KL}$ is the KL\-divergence\. Finally, this gives the lower bound on acceptance rate:

$$\alpha \geq 1 - \sqrt{\frac{1}{2} D_{KL}(P \,\|\, Q)} \tag{8}$$
We validate this bound against the task\-specific distillation runs in [Figure 10](https://arxiv.org/html/2605.30580#A4.F10)\. All of the models exceed the bound by a wide margin, and the negative correlation between KL divergence and acceptance rate that we would expect is not clearly present\. Nevertheless, the bound makes it possible to predict a lower limit on the acceptance rate based solely on the distillation training run\.
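
The derivation above can be checked numerically on a small example. The sketch below computes the acceptance rate, the total variation distance, and the Pinsker\-based bound for two hand\-picked distributions over three tokens. The distributions and all function names are hypothetical illustrations, not outputs of the paper's models.

```python
import math


def acceptance_rate(p, q):
    """Sum over tokens of min(p(x), q(x)), as in Equation 3."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def total_variation(p, q):
    """Half the L1 distance between the two distributions (Equation 5)."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def acceptance_lower_bound(p, q):
    """Pinsker-based lower bound on the acceptance rate (Equation 8)."""
    return 1.0 - math.sqrt(0.5 * kl_divergence(p, q))


if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]  # toy distribution P
    q = [0.4, 0.4, 0.2]  # toy distribution Q
    alpha = acceptance_rate(p, q)
    print("alpha      :", alpha)                         # approximately 0.8
    print("1 - TV     :", 1.0 - total_variation(p, q))   # matches alpha
    print("lower bound:", acceptance_lower_bound(p, q))  # about 0.79, below alpha
```

When the KL divergence exceeds 2 nats the bound becomes negative and therefore uninformative, so it is only useful for a student that is already close to the teacher.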

![Refer to caption](https://arxiv.org/html/2605.30580v1/x13.png)

Figure 10: Relationship between KL divergence and acceptance rate for task\-specific distilled models\.

| Code | Name | Source | Mono / Parallel |
| --- | --- | --- | --- |
| Ame | America’s NLP 2024 | Ebrahimi et al\. \([2024](https://arxiv.org/html/2605.30580#bib.bib9)\) | ✓ |
| Aya | Aya Dataset | Singh et al\. \([2024](https://arxiv.org/html/2605.30580#bib.bib20)\) | ✓ |
| Che | \- | Hugging Face \(footnote A\) | ✓ |
| Chr | ChrEn | Zhang et al\. \([2020b](https://arxiv.org/html/2605.30580#bib.bib30)\) | ✓ |
| ChEn | Cherokee English Dictionary | [Nuttle et al\.](https://arxiv.org/html/2605.30580#bib.bib15) | ✓ |
| Haw | Hawaiian Corpus Project | Doherty \([2016](https://arxiv.org/html/2605.30580#bib.bib8)\) | ✓ |
| Lei | Leipzig Corpus Collections | Goldhahn et al\. \([2012](https://arxiv.org/html/2605.30580#bib.bib11)\) | ✓ |
| Men | MENYO\-20k | Adelani et al\. \([2021](https://arxiv.org/html/2605.30580#bib.bib1)\) | ✓ |
| Ope | OpenCSR | Yu et al\. \([2025](https://arxiv.org/html/2605.30580#bib.bib27)\) | ✓ |
| Opu | OPUS\-100 | Zhang et al\. \([2020a](https://arxiv.org/html/2605.30580#bib.bib29)\) | ✓ |
| Opus | Opus | Tiedemann \([2012](https://arxiv.org/html/2605.30580#bib.bib23)\) | ✓ |
| Que | QuBERT | Zevallos et al\. \([2022](https://arxiv.org/html/2605.30580#bib.bib28)\) | ✓ |
| Tal | TALAM | IRCA \([2015](https://arxiv.org/html/2605.30580#bib.bib13)\) | ✓ |
| Tat | Tatoeba | Tiedemann \([2020](https://arxiv.org/html/2605.30580#bib.bib24)\) | ✓ ✓ |
| Tha | \- | Thapa et al\. \([2025](https://arxiv.org/html/2605.30580#bib.bib22)\) | ✓ |
| Wub | Unified Amharic\-English Corpus | GitHub \(footnote B\) | ✓ |
- A
- B

Table 3: All of the sources used throughout our experiments\. The Mono and Parallel columns indicate how we use the corpus; in some cases, a corpus has parallel sentences but we only use the text in the target language\.

| Language | Total Tokens | Sources |
| --- | --- | --- |
| Amharic \[amh\] | 2\.2M | Aya |
| Berber \[ber\] | 335\.0k | Tat |
| Cherokee \[chr\] | 1\.1M | Chr, Tat |
| Guarani \[grn\] | 426\.7k | Lei |
| Hawaiian \[haw\] | 101\.9k | Haw |
| Igbo \[ibo\] | 852\.3k | Lei, Aya |
| Nepali \[npi\] | 22\.3M | Tha, Aya |
| Occitan \[oci\] | 60\.1k | Lei |
| Quechua \[que\] | 684\.7k | Que, Tat |
| Yoruba \[yor\] | 1\.7M | Aya |
| Tamazight \[zgh\] | 2\.8M | Tat, Tal |

Table 4: Monolingual source counts for each language; see [Table 3](https://arxiv.org/html/2605.30580#A4.T3) for source abbreviations\.

| Language | Total Examples | Sources |
| --- | --- | --- |
| Amharic \[amh\] | 5200 | Tat |
| Berber \[ber\] | 5200 | Tat |
| Cherokee \[chr\] | 5200 | ChEn, Che |
| Guarani \[grn\] | 986 | Tat |
| Hawaiian \[haw\] | 121 | Tat |
| Igbo \[ibo\] | 1748 | Opus, Tat |
| Nepali \[npi\] | 3533 | Tat |
| Occitan \[oci\] | 4031 | Tat |
| Quechua \[que\] | 5200 | Opus, Tat |
| Yoruba \[yor\] | 5200 | Men |
| Tamazight \[zgh\] | 5200 | Tat |

Table 5: Parallel sentence counts for each language; see [Table 3](https://arxiv.org/html/2605.30580#A4.T3) for source abbreviations\.

## Appendix E Generated Story Examples

Topic: sunlit garden

Berber: Aza tanura taqul ta aymez nna\. Sef n’wag\. Lhaz tana wu aya na aqamra nqasim tawila, tawz\. Tannit\.

Guarani: O sol kóme ña\. Ndon, o ro añañ, o c ó n n ñ añañ, o r e añañ\. Ipeca añe ñ ñ ïo añañ, o eca añ e nñ ïañ, añ\. O añañ ñ añañ nñ añ ïo e o\.

Topic: vibrant festival

Berber: Aṭ n Imlaɣer n wass, tmazran dḥ n tmazulay\.ɣer amla s wllm t dda n wul n yimezyanum,ɣ\. A umez dda n wad amur s d ayen imi i yesɣmz\.

Guarani: Ñemo’ê heta, áreko oñe, ojapo\. Ñembyrã, ojeri’ro’ro "¡Yvy’re\! ’guí\!", oikéramo\. "k, ojei’ro’\. Ñe, o ña\. "Ñ\! "reko’ ko’rë\. Ojapópe ’pytaro’\. Omo, ojapo\. "Ñamandúta\! "

Figure 11: Samples of story generation for the topics sunlit garden and vibrant festival, sampled from the 9b Qwen 3\.5 model\.
