
# Optimizing Korean-Centric LLMs via Token Pruning

Source: https://arxiv.org/html/2604.16235

Hoyeol Kim (College of Computing, Georgia Institute of Technology, [email protected]) and Hyeonwoo Kim (Independent Researcher, [email protected])

###### Abstract

This paper presents a systematic benchmark of state-of-the-art multilingual large language models (LLMs) adapted via token pruning—a compression technique that eliminates tokens and embedding parameters corresponding to languages irrelevant to the target application. Focusing on Korean-centric natural language processing (NLP) tasks, we evaluate architectures including Qwen3, Gemma-3, Llama-3, and Aya across three vocabulary configurations: Original, English-Korean (EnKo), and English-Korean-Chinese (EnKoZh). Performance is assessed using established benchmarks for general aptitude, cultural literacy, instruction following, and machine translation. Our findings indicate that token pruning significantly improves generation stability by eliminating language confusion, and in the case of machine translation, frequently enhances performance on Korean-specific tasks. While instruction-following capabilities display architecture-dependent variance linked to latent cross-lingual representations, the significant reduction in vocabulary size validates token pruning as a highly effective optimization strategy for memory-constrained, domain-specific deployments, despite modest gains in inference latency.


## 1 Introduction

The proliferation of massively multilingual LLMs has significantly expanded the accessibility of advanced natural language processing. Models such as LLaMA-3 (Dubey et al., 2024), Qwen (Yang et al., 2025), and Gemma (Team et al., 2025) rely on vast multilingual training corpora to achieve universal applicability. However, this breadth introduces a "curse of multilinguality," wherein a substantial fraction of the model's parameters—particularly within the vocabulary and embedding layers—is allocated to languages that are superfluous for specific downstream applications. For deployments in strictly monolingual or bilingual contexts, such as Korean-centric services, this redundancy imposes unnecessary memory overhead and computational inefficiency.

In the context of Korean NLP, this challenge is two-pronged. First, despite being a high-resource language, Korean often comprises a small minority of the training data in dominant English- or Chinese-centric models, which risks diluting cultural nuance. Second, while local deployment requires lightweight models, optimization strategies must not come at the cost of linguistic proficiency.

Token pruning offers a targeted solution. By systematically excising tokens linked to non-target languages, this method compresses the vocabulary and embedding matrices, reducing the memory footprint without altering the core transformer architecture. While prior works have explored pruning generally, systematic research quantifying its effects on Korean language tasks remains limited. This paper bridges that gap by benchmarking token pruning on state-of-the-art (SOTA) multilingual LLMs. We specifically evaluate how aggressive vocabulary compression impacts performance across three distinct pillars: general aptitude, cultural competence, and translation quality.

## 2 Related Work

### 2.1 Token Pruning

Token pruning is increasingly recognized as a vital component of transformer efficiency. Li et al. (2025) provide a comprehensive review of token pruning methods, categorizing them into heuristic-based, learnable, and reinforcement learning-based approaches, and emphasize its role in reducing inference complexity while retaining accuracy. In multimodal contexts, Wen and Gao (2024) question the utility of complex attention-based importance scores, demonstrating that simple random baselines can approach the performance of specialized methods that lack strong theoretical grounding. Building on the theoretical view of token redundancy in transformers, Tai et al. (2025) formalized benchmarks to capture FLOP reductions and latency trade-offs under different pruning strategies. Complementing these input-level approaches, Lee and Kim (2024) introduced dynamic vocabulary pruning to restrict the output search space during testing, suggesting a synergistic effect between vocabulary- and token-level pruning for practical deployment.

### 2.2 Korean-Centric LLMs and Language Specialization

Efforts to enhance Korean LLM performance bifurcate into adapting English-centric models and developing sovereign AI. Vo et al. (2024) showed that continual pretraining with tokenizer augmentation effectively adapts English foundation models, while Kim et al. (2025) introduced Thunder-LLM, achieving near-SOTA results via balanced bilingual corpora. Domain-specific applications have also expanded, exemplified by the bilingual GECKO model for code generation (Oh and Kim, 2024) and the FINKRX model for financial NLP (Son et al., 2025a).

## 3 Methods

### 3.1 Target Models and Configurations

We evaluate a diverse set of open-weight multilingual LLMs, specifically the Qwen3 series (ranging from 0.6B to 14B parameters), Gemma-3 (270M to 12B), and the Llama-3 family (3.1-8B and 3.2-3B). Additional models include Tri (Han et al., 2025), Ministral-8B (Liu et al., 2026), and Aya-23 (Aryabumi et al., 2024). To assess the impact of vocabulary compression, each model is processed into three distinct configurations: the Original setup, which retains the full multilingual vocabulary; the EnKo configuration, pruned to retain only English and Korean tokens; and the EnKoZh configuration, which preserves English, Korean, and Chinese tokens. The pruning process necessitates tokenizer reconstruction, positional encoding realignment, and the remapping of output layer weights to correspond to the reduced vocabulary indices.

### 3.2 Vocabulary Pruning Procedure

We implement a "language-aware filtering" strategy that proceeds in three distinct stages. First, tokens are categorized by language using Unicode block ranges and script properties such as Hangul, Latin, and Hanzi (Chinese characters). Subsequently, tokens irrelevant to the target configuration—whether EnKo or EnKoZh—are discarded, and the retained tokens are mapped onto a new, continuous index space. Finally, the embedding matrix and output projection layer are physically rearranged to align with the new indices, effectively reducing the parameter count. This procedure leaves the internal transformer blocks and positional encodings intact, preserving the model's sequence modeling capabilities.
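To make the three stages concrete, the following minimal Python sketch (illustrative only, not the authors' implementation) classifies tokens by Unicode script, discards out-of-target tokens, and slices the embedding and output-projection matrices onto the new dense index space. The `KEEP_SCRIPTS` set, function names, and the simplified script test are assumptions.

```python
import unicodedata
import torch

# Scripts retained for the EnKo configuration; add "hanzi" for EnKoZh.
KEEP_SCRIPTS = {"latin", "hangul", "common"}

def char_script(ch: str) -> str:
    """Bucket a character into a coarse script class via its Unicode name."""
    name = unicodedata.name(ch, "")
    if "HANGUL" in name:
        return "hangul"
    if "CJK" in name:
        return "hanzi"
    if "LATIN" in name:
        return "latin"
    # Digits, punctuation, whitespace, sub-word markers, etc.
    return "common" if not ch.isalpha() else "other"

def prune_vocab(vocab: dict[str, int], emb: torch.Tensor, head: torch.Tensor):
    """Stages 2-3: drop out-of-target tokens, remap the survivors onto a
    dense index space, and slice embedding / output-projection rows."""
    kept = [(tok, idx) for tok, idx in sorted(vocab.items(), key=lambda kv: kv[1])
            if {char_script(c) for c in tok} <= KEEP_SCRIPTS]
    new_vocab = {tok: new_idx for new_idx, (tok, _) in enumerate(kept)}
    rows = torch.tensor([idx for _, idx in kept])
    return new_vocab, emb[rows].clone(), head[rows].clone()
```

Consistent with the procedure above, the transformer blocks and positional encodings are left untouched; a production version would additionally rebuild the tokenizer's merge rules and explicitly whitelist special tokens.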

### 3.3 Benchmarks and Metrics

To quantify performance, we utilize four categories of Korean-centric benchmarks. General aptitude is assessed via KMMLU (Son et al., 2025b), which tests general knowledge and logical reasoning. Cultural and linguistic understanding are evaluated using HAERAE (Son et al., 2024) for cultural literacy and CLIcK (Kim et al., 2024) for linguistic nuance. Instruction-following capabilities are measured through LogicKor (Park, 2024) and KoMTBench (LG AI Research, 2024), which assess reasoning and constraint satisfaction. Finally, machine translation performance is evaluated using the WMT24++ (Korean–English) benchmark (Deutsch et al., 2025), scored via XCOMET-XXL (Guerreiro et al., 2024).
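For the translation metric, scoring with XCOMET-XXL could look like the snippet below, which uses the open-source `unbabel-comet` toolkit and the public Unbabel/XCOMET-XXL checkpoint from the Hugging Face Hub; the example sentences are placeholders, and the paper's exact evaluation pipeline may differ.

```python
from comet import download_model, load_from_checkpoint  # pip install unbabel-comet

# Fetch and load the XCOMET-XXL checkpoint (large; needs substantial GPU memory).
model_path = download_model("Unbabel/XCOMET-XXL")
model = load_from_checkpoint(model_path)

samples = [{
    "src": "The weather is nice today.",  # English source (placeholder)
    "mt":  "오늘은 날씨가 좋다.",           # Korean system output (placeholder)
    "ref": "오늘 날씨가 좋네요.",           # Korean reference (placeholder)
}]
scores = model.predict(samples, batch_size=8, gpus=1)
print(scores.system_score)  # corpus-level quality estimate in [0, 1]
```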

### 3.4 Latency Measurement

Token-level inference latency is quantified using the Seed-X-PPO-7B model (Cheng et al., 2025) on the WMT24++ dataset (Deutsch et al., 2025). We compare configurations under identical hardware conditions to isolate the speedups resulting from the reduced softmax computational load.
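A simple harness for such a comparison, sketched here under the assumption of Hugging Face `transformers` checkpoints and a single CUDA device (the model IDs in the usage comment are hypothetical), measures generated tokens per second under greedy decoding:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def tokens_per_second(model_id: str, prompts: list[str], max_new_tokens: int = 64) -> float:
    """Greedy-decode each prompt and report generated tokens per second."""
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).cuda()
    total_tokens, total_time = 0, 0.0
    for p in prompts:
        inputs = tok(p, return_tensors="pt").to(model.device)
        torch.cuda.synchronize()          # exclude queued kernels from the timer
        start = time.perf_counter()
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        torch.cuda.synchronize()
        total_time += time.perf_counter() - start
        total_tokens += out.shape[1] - inputs["input_ids"].shape[1]
    return total_tokens / total_time

# Usage (checkpoint names hypothetical): compare an original model to its pruned variant.
# print(tokens_per_second("original-ckpt", prompts), tokens_per_second("pruned-enko-ckpt", prompts))
```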

## 4 Experiments and Analysis

### 4.1 General Aptitude and Cultural Alignment

Table 1 summarizes the assessment of fundamental Korean language grasp and cultural knowledge. The data reveals minimal performance degradation in pruned models. For the Qwen3 series, the variance between configurations is negligible (<0.01 fluctuation), while Gemma-3-12b-it shows a slight improvement in linguistic capability (0.6311→0.6321). This stability suggests that pruning successfully eliminates vocabulary redundancy without excising the semantic structures requisite for Korean reasoning. Furthermore, Llama-3.2-3B-Inst exhibits a marginal increase in KMMLU accuracy after pruning (0.3307→0.3311), implying that narrowing the vocabulary search space may reduce generation noise in lower-parameter models.

### 4.2 Instruction-Following Capabilities

Performance in complex instruction following (LogicKor and KoMTBench) exhibits greater variance, influenced by the interaction between model architecture and pruning depth. For the Qwen3 family, retaining Chinese tokens (EnKoZh) consistently outperforms the stricter EnKo pruning. Notably, Qwen3-4B scores 7.85 in LogicKor (EnKoZh), surpassing its Original score (7.77). This suggests that models heavily pre-trained on Chinese corpora rely on latent cross-lingual alignments for reasoning. Conversely, larger models like Llama-3.1-8B-Inst show improved performance in the EnKo setting (5.36→5.57), indicating that for sufficiently large models, vocabulary reduction may sharpen instruction focus.

### 4.3 Machine Translation

Machine translation (WMT24++) results provide the strongest evidence for the efficacy of token pruning. Pruned configurations consistently match or outperform baselines. Llama-3.1-8B-Inst (0.5879→0.6342) and Aya-expanse-8b (0.6957→0.7496) show substantial gains. This indicates that eliminating extraneous language tokens regularizes the output distribution, minimizing off-target hallucinations and enhancing the English-Korean translation pathway.

### 4.4 Language Confusion Performance

We measured the Word-level Pass Rate (WPR) to evaluate stability in language generation, following the methodology of Marchisio et al. (2024). Table 4 details the improvements yielded by the EnKo adaptation, which significantly mitigates language confusion. Qwen3-4B, which exhibited the lowest baseline stability (0.8882), demonstrated the most dramatic recovery (ΔWPR = +0.1041). Even stable baselines like Qwen3-0.6B achieved near-perfect consistency (>0.999) post-pruning.
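WPR can be approximated by testing every whitespace-delimited word of a generation against the target scripts. The sketch below follows the spirit of Marchisio et al. (2024) but simplifies the script test to Hangul-or-Latin membership; it is not the paper's exact scorer.

```python
import unicodedata

def word_in_target_script(word: str) -> bool:
    """A word 'passes' if none of its letters belong to an off-target script."""
    for ch in word:
        if not ch.isalpha():
            continue  # ignore digits and punctuation
        name = unicodedata.name(ch, "")
        if "HANGUL" not in name and "LATIN" not in name:
            return False  # e.g. Hanzi or Cyrillic leakage counts as confusion
    return True

def word_level_pass_rate(generations: list[str]) -> float:
    """Fraction of words across all generations that stay in the target scripts."""
    words = [w for g in generations for w in g.split()]
    return sum(word_in_target_script(w) for w in words) / max(len(words), 1)

# Example: one Chinese word leaking into otherwise Korean output lowers WPR.
print(word_level_pass_rate(["좋은 아침입니다", "안녕하세요 世界"]))  # 0.75
```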

### 4.5 Computational Latency

Efficiency gains were measured using the Seed-X-PPO-7B model. While vocabulary reduction is substantial (36% reduction in EnKo), latency improvement is modest (0.89%). This confirms that while pruning alleviates memory overhead related to embeddings, it does not significantly impact the computational bottlenecks in attention mechanisms.

## 5 Conclusion

We presented a systematic benchmark of token pruning for Korean NLP, confirming that high-resource languages can be effectively decoupled from massive multilingual vocabularies without performance penalty. Pruning eliminated extraneous tokens, leading to near-perfect generation consistency (WPR > 0.99) and improved translation quality, all while significantly reducing parameter count.

However, our results also highlight critical architectural dependencies. While pruning is generally robust, the performance gap between EnKo and EnKoZh configurations in Qwen series models suggests that latent cross-lingual representations (specifically Chinese) remain integral to reasoning capabilities in certain architectures. Finally, while the inference latency gains were modest (<1%), the substantial reduction in vocabulary size offers significant memory savings. This positions token pruning not merely as a compression technique, but as a vital tool for deploying stable, high-performance sovereign AI models in resource-constrained local environments.

## References

- Aryabumi et al. (2024) Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Jon Ander Campos, Yi Chern Tan, et al. 2024. Aya 23: Open weight releases to further multilingual progress. *arXiv preprint arXiv:2405.15032*.
- Cheng et al. (2025) Shanbo Cheng, Yu Bao, Qian Cao, Luyang Huang, Liyan Kang, Zhicheng Liu, Yu Lu, Wenhao Zhu, Jingwen Chen, Zhichao Huang, Tao Li, Yifu Li, Huiying Lin, Sitong Liu, Ningxin Peng, Shuaijie She, Lu Xu, Nuo Xu, Sen Yang, Runsheng Yu, Yiming Yu, Liehao Zou, Hang Li, Lu Lu, and Yuxuan Wang. 2025. Seed-x: Building strong multilingual translation llm with 7b parameters. *Preprint*, arXiv:2507.13618.
- Deutsch et al. (2025) Daniel Deutsch, Eleftheria Briakou, Isaac Rayburn Caswell, Mara Finkelstein, Rebecca Galor, Juraj Juraska, Geza Kovacs, Alison Lui, Ricardo Rei, Jason Riesa, et al. 2025. WMT24++: Expanding the language coverage of WMT24 to 55 languages & dialects. In *Findings of the Association for Computational Linguistics: ACL 2025*, pages 12257–12284.
- Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. *arXiv preprint arXiv:2407.21783*.
- Guerreiro et al. (2024) Nuno M Guerreiro, Ricardo Rei, Daan van Stigt, Luisa Coheur, Pierre Colombo, and André FT Martins. 2024. XCOMET: Transparent machine translation evaluation through fine-grained error detection. *Transactions of the Association for Computational Linguistics*, 12.
