Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization
Summary
This paper proposes a parameter-efficient vocabulary adaptation method for LLM-based text summarization in specialized domains, augmenting pretrained tokenizers with domain-specific tokens and selectively replacing under-trained ones to reduce training time by 35-55% and parameter counts by up to 37%.
# Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

Source: [https://arxiv.org/html/2605.17379](https://arxiv.org/html/2605.17379)

Gunjan Balde¹, Soumyadeep Roy², Mainack Mondal¹ and Niloy Ganguly¹

¹ Dept. of Computer Science and Engg., IIT Kharagpur, Kharagpur, India; ² Dept. of Medicine (Biomedical Informatics), Stanford University, Stanford, CA, USA

Correspondence: balde.gunjan0812@gmail.com

###### Abstract

Large language models pretrained on general-domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviates performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks under a challenge-oriented evaluation protocol focused on expert-driven text and summaries, which typically have a higher concentration of over-fragmented Out-of-Vocabulary (OOV) words. The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness.
We further observe that our proposed approach significantly reduces training time by 35-55% over continual pretraining and reduces parameter counts by up to 37% w.r.t. expansion-only methods. We make the codebase publicly available at [https://github.com/gb-kgp/VocabReplace-Then-Expand](https://github.com/gb-kgp/VocabReplace-Then-Expand).

This is the author's version of the manuscript. It is posted here for your personal use. Not for redistribution. To appear in the 64th Annual Meeting of the Association for Computational Linguistics, ACL (Mains) 2026.

## 1 Introduction

While large language models (LLMs) have revolutionized natural language processing, adapting generalist models to expert domains remains challenging due to high vocabulary mismatch between general and domain-specific corpora. Recent domain-specific models including Meditron-70B Chen et al. ([2023](https://arxiv.org/html/2605.17379#bib.bib62)); BioMistral Labrak et al. ([2024](https://arxiv.org/html/2605.17379#bib.bib55)), built on Mistral-7B and further pretrained on PubMed Central; and PMC-LLaMA Wu et al. ([2024](https://arxiv.org/html/2605.17379#bib.bib72)) demonstrate that continued pretraining on specialized corpora yields substantial performance improvements.
However, vocabulary mismatch fundamentally limits these gains: PubMedBERT Gu et al. ([2021](https://arxiv.org/html/2605.17379#bib.bib73)) demonstrates that medical terms like "naloxone" fragment into meaningless subwords ("nal", "##ox", "##one"), while domain-specific vocabularies treat them atomically. This tokenization inefficiency imposes substantial costs: non-English and domain-specific text can require up to 13× more tokens than English Rust et al. ([2021](https://arxiv.org/html/2605.17379#bib.bib32)); Ahia et al. ([2023](https://arxiv.org/html/2605.17379#bib.bib74)); Petrov et al. ([2023](https://arxiv.org/html/2605.17379#bib.bib30)), directly increasing API costs, latency, and memory requirements. Recent work establishes that this fragmentation reduces effective context window size and impedes learning meaningful representations Hofmann et al. ([2022](https://arxiv.org/html/2605.17379#bib.bib29)); Kaplan et al. ([2025](https://arxiv.org/html/2605.17379#bib.bib75)).

The conventional approach to addressing vocabulary mismatch involves domain-adaptive pretraining (DAPT), where models undergo continued pretraining on domain-specific corpora Gururangan et al. ([2020](https://arxiv.org/html/2605.17379#bib.bib28)). While effective, this paradigm presents significant practical limitations. BioMistral-7B required 32 A100 GPUs for 20 hours on 3 billion tokens from PubMed Central, while Meditron-70B consumed 128 A100 GPUs for 332 hours processing 46 billion tokens, yet achieved only marginal improvements. For contemporary large language models, Hu et al. ([2022](https://arxiv.org/html/2605.17379#bib.bib31)) note that full fine-tuning is “prohibitively expensive”, requiring complete parameter updates and storage of separate model instances per domain. While parameter-efficient methods reduce trainable parameters, they do not address the underlying tokenization inefficiency caused by vocabulary mismatch.
An alternative paradigm that directly addresses vocabulary mismatch is vocabulary adaptation: modifying a pretrained model's tokenizer and embedding layer to incorporate domain-specific vocabulary. Recent works Sachidananda et al. ([2021](https://arxiv.org/html/2605.17379#bib.bib49)); Hong et al. ([2021](https://arxiv.org/html/2605.17379#bib.bib40)); Liu et al. ([2023](https://arxiv.org/html/2605.17379#bib.bib13)); Yamaguchi et al. ([2024](https://arxiv.org/html/2605.17379#bib.bib33)); Balde et al. ([2024](https://arxiv.org/html/2605.17379#bib.bib16)); Gao et al. ([2024](https://arxiv.org/html/2605.17379#bib.bib24)); Balde et al. ([2025](https://arxiv.org/html/2605.17379#bib.bib61)) establish this as a resource-efficient path. However, vocabulary expansion introduces computational overhead through parameter growth: adding 10,000 tokens to Llama-3-8B requires approximately 80 million additional parameters (at 4096-dimensional embeddings, counting both the embedding and unembedding matrices), a non-trivial increase in model size and inference cost.

Land and Bartolo ([2024](https://arxiv.org/html/2605.17379#bib.bib65)) reveal a critical insight: contemporary language models contain 0.1-1% severely under-trained "glitch tokens", i.e., tokens that occupy vocabulary slots but contribute minimally due to insufficient pretraining exposure. This observation suggests that efficient vocabulary adaptation is possible by strategically replacing under-trained tokens with domain-specific vocabulary, achieving adaptation benefits with minimal parameter expansion Purason et al. ([2026](https://arxiv.org/html/2605.17379#bib.bib80)).

In this work, we propose a vocabulary adaptation method that strategically replaces under-trained and unreachable tokens with domain-specific vocabulary before resorting to expansion, thereby minimizing parameter overhead while enabling effective domain specialization.
Our approach operates on Llama-3.1-8B and Qwen2.5-7B and consists of four key steps: (1) we train a BPE tokenizer on domain-specific corpora to identify candidate domain vocabulary, (2) we select the top 10,000 tokens based on frequency and coverage statistics as our vocabulary adaptation budget, (3) we compile a replacement candidate list by identifying under-trained and unreachable tokens using the methodology of Land and Bartolo ([2024](https://arxiv.org/html/2605.17379#bib.bib65)), and (4) we replace tokens from this candidate list with domain-specific tokens, expanding the vocabulary only when the replacement budget is exhausted. This hybrid replacement-then-expansion strategy prioritizes recycling underutilized vocabulary slots, minimizing the net parameter increase while maximizing domain vocabulary coverage.

Beyond standard benchmarking, we introduce a challenge-oriented evaluation framework that stress-tests model performance under conditions where domain vocabulary knowledge is critical. We restructure the downstream domain-specific corpus to explicitly capture challenging scenarios: test sets with high out-of-vocabulary (OOV) concentrations in either source documents (SD) or reference summaries (RS), denoted OOV_SD and OOV_RS respectively. We also take a Random subset without any restriction on OOV concentration, which serves as a reference point for the gains observed in these challenging scenarios. This targeted evaluation approach allows us to assess how well generalist models handle expert-level summarization tasks where domain-specific terminology is essential, providing a more rigorous test of vocabulary adaptation effectiveness than aggregate performance metrics alone.
We evaluate our approach on two specialized domains, medical and legal literature, demonstrating that our method achieves competitive or superior performance compared to conventional vocabulary extension while substantially reducing parameter overhead and maintaining inference efficiency. We hypothesize that the effectiveness of vocabulary adaptation is governed by the severity and location of lexical mismatch between pretrained tokenizers and downstream data. We find that: (i) across the challenging scenarios OOV_SD and OOV_RS, improvements over competing baselines are more consistent in the former setting, although the margin of gain is slightly higher in OOV_RS (4.44%) than in OOV_SD (4.26%); (ii) performance gains are notably higher than those observed in the Random setting (3.06%), validating that gains grow with the degree of vocabulary mismatch; (iii) vocabulary adaptation enables models to reach their best-performing checkpoints 35-55% earlier than continual pretraining alone, reducing the training time; (iv) the hybrid replacement-then-expansion strategy remains highly parameter-efficient, reducing parameters by 12.04% and 37.19% for the Llama and Qwen models respectively, averaged across both domains, relative to expansion-only methods. These results identify tokenization mismatch as a bottleneck in domain adaptation and motivate vocabulary-adaptation strategies as a targeted, data-dependent intervention. We make our codebase publicly available at [https://github.com/gb-kgp/VocabReplace-Then-Expand](https://github.com/gb-kgp/VocabReplace-Then-Expand).

## 2 Proposed Methodology (VocabAdapt)

### 2.1 Background

Generalist LLMs are pretrained on broad-coverage corpora, resulting in tokenizers optimized for general text distributions. When deployed on specialized domains such as medical text, these tokenizers exhibit systematic over-fragmentation.
For instance, the term “Osteoporosis” is tokenized as `[O, ste, opor, osis]` by the Llama tokenizer, splitting into four subwords. This over-fragmentation introduces two primary challenges: first, the model must reconstruct semantic meaning across multiple token positions, increasing computational overhead and representation noise; second, generation becomes error-prone as the model must correctly predict each fragment in sequence, with errors compounding across token boundaries.

The standard solution to this vocabulary mismatch problem involves expanding the model's vocabulary by adding domain-specific tokens. Let $V_{\text{src}}$ denote the source vocabulary of size $|V_{\text{src}}|$ with corresponding embedding matrix $E \in \mathbb{R}^{|V_{\text{src}}| \times d}$ and unembedding matrix $U \in \mathbb{R}^{d \times |V_{\text{src}}|}$, where $d$ represents the model's hidden dimension. Adding $k$ domain-specific tokens to form an expanded vocabulary $V_{\text{exp}} = V_{\text{src}} \cup V_{\text{new}}$ necessitates expanding both embedding and unembedding matrices, introducing $2k \cdot d$ additional parameters. For models with large hidden dimensions and substantial domain vocabularies, this parameter overhead becomes significant, increasing memory footprint and inference cost.

We propose an alternative approach that challenges the necessity of vocabulary expansion. Our central hypothesis is that generalist tokenizers contain a substantial subset of undertrained and unreachable tokens that contribute minimally to model performance. Rather than expanding the vocabulary, we identify these ineffectual tokens and replace them with domain-specific terminology, maintaining constant vocabulary size while addressing fragmentation.
When domain requirements exceed the available candidate tokens, we resort to expansion only for the remaining terms, thereby minimizing parameter growth.

### 2.2 Identifying Candidate Tokens for Replacement

Our replacement strategy relies on identifying a candidate set $V_{\text{cand}} \subseteq V_{\text{src}}$ comprising tokens that satisfy either of two independent criteria: being undertrained or being unreachable.

The undertrained tokens are identified through the methodology of Land and Bartolo ([2024](https://arxiv.org/html/2605.17379#bib.bib65)), where the L2 norm $\|e_i\|_2$ of each token embedding $e_i$ in the vocabulary is computed, excluding partial UTF-8 sequences, fallback bytes, and unreachable tokens. Their analysis demonstrates that tokens with embedding norms below a threshold correspond to vocabulary items that appeared infrequently during pretraining and are hence undertrained. This token set is henceforth denoted $V_{\text{undertrained}}$.

The unreachable tokens are identified through a consistency test Land and Bartolo ([2024](https://arxiv.org/html/2605.17379#bib.bib65)); Purason et al. ([2026](https://arxiv.org/html/2605.17379#bib.bib80)). A token is deemed unreachable if decoding its vocabulary token-id $t_i$ and then encoding the decoded string does not yield the original token-id $t_i$ (encoding and decoding here correspond to the built-in `tokenizer.encode` and `tokenizer.decode` function calls of a model tokenizer). For example, decoding token-id 378 in Llama-3.1-8B results in `âĢ`, which upon encoding yields token-id 5809. Formally, a token is unreachable when $\text{encode}(\text{decode}(t_i)) \neq [t_i]$. These tokens represent vocabulary entries that cannot be produced through the standard tokenization algorithm and thus remain inaccessible during normal model inference.
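The round-trip test for unreachable tokens can be illustrated with a minimal sketch. The toy vocabulary, merge table, and the helper names `toy_encode`, `toy_decode`, and `find_unreachable` are hypothetical and only stand in for a real tokenizer; with a HuggingFace tokenizer one would presumably pass its `encode` (without special tokens) and `decode` methods instead.

```python
from typing import Callable, Iterable, List

# Toy vocabulary and merge table (hypothetical, for illustration only).
VOCAB = {"a": 0, "b": 1, "c": 2, "ab": 3, "abc": 4}
MERGES = [("a", "b")]  # note: no merge rule ever produces "abc"
ID_TO_STR = {idx: tok for tok, idx in VOCAB.items()}


def toy_encode(text: str) -> List[int]:
    """Simplified BPE: start from characters and apply each merge rule in order."""
    parts = list(text)
    for left, right in MERGES:
        merged, i = [], 0
        while i < len(parts):
            if i + 1 < len(parts) and parts[i] == left and parts[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(parts[i])
                i += 1
        parts = merged
    return [VOCAB[p] for p in parts]


def toy_decode(ids: List[int]) -> str:
    """Concatenate the surface strings of the given token ids."""
    return "".join(ID_TO_STR[i] for i in ids)


def find_unreachable(
    token_ids: Iterable[int],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> List[int]:
    """Return every id t for which encode(decode([t])) != [t]."""
    return [t for t in token_ids if encode(decode([t])) != [t]]


unreachable = find_unreachable(range(len(VOCAB)), toy_encode, toy_decode)
print(unreachable)
```

In this toy setup the token `abc` sits in the vocabulary but no merge rule can produce it, so its string is re-encoded as `ab` + `c` and the id fails the round trip.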
Unreachable tokens occupy vocabulary slots and contribute to the parameter count, yet they serve no functional role in model operation. This token set is henceforth denoted $V_{\text{unreachable}}$. We define our candidate set as the union of these two sets:

$$V_{\text{cand}} = V_{\text{undertrained}} \cup V_{\text{unreachable}} \tag{1}$$

This union ensures we replace only tokens that are poorly trained or inaccessible, providing a conservative strategy that minimizes the risk of degrading model performance on general domains. Empirically, we observe that approximately 3-4% of vocabulary tokens in both Llama-3.1-8B and Qwen2.5-7B satisfy the candidate set criterion, providing a substantial pool of replacement candidates.

We apply a final refinement to ensure tokenizer integrity. The Byte-Pair Encoding (BPE) subword tokenization algorithm constructs its vocabulary through iterative merge operations, where character sequences are progressively combined into larger units based on merge rules. Replacing a token that appears in the merge rule of another token outside the candidate set would break the tokenization process, rendering certain vocabulary tokens untokenizable. To prevent this, we filter the candidate set to exclude any token that appears as a component in the merge rule of a token not designated for replacement. We construct a directed acyclic graph (DAG) whose nodes are token-ids, with an edge from token-i to token-j if token-i contributes to the merge rule of token-j (e.g., in → ing). Then, for every candidate that could be replaced, we check whether it has any descendants (nodes reachable from this node) that lie outside the candidate replacement set. If so, we do not replace it, and such tokens form the set $V_{\text{exclude}}$; otherwise, we consider it for replacement.
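The descendant check on the merge graph can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are represented by their strings rather than token-ids, each merge rule `(left, right)` is assumed to produce the token `left + right`, and the function names `build_merge_graph` and `refine_candidates` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_merge_graph(merges: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Edge token -> product for every merge rule the token takes part in."""
    children: Dict[str, Set[str]] = defaultdict(set)
    for left, right in merges:
        product = left + right  # assumption: product is the concatenation
        children[left].add(product)
        children[right].add(product)
    return children


def refine_candidates(
    merges: List[Tuple[str, str]], candidates: Iterable[str]
) -> Set[str]:
    """Keep a candidate only if all of its descendants are candidates too."""
    children = build_merge_graph(merges)
    cand = set(candidates)
    kept: Set[str] = set()
    for token in cand:
        stack = list(children.get(token, ()))
        seen: Set[str] = set()
        safe = True
        while stack:  # depth-first walk over everything built from `token`
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node not in cand:  # a surviving token depends on `token`
                safe = False
                break
            stack.extend(children.get(node, ()))
        if safe:
            kept.add(token)
    return kept


merges = [("i", "n"), ("in", "g"), ("x", "y"), ("xy", "z")]
candidates = {"in", "xy", "xyz"}
refined = refine_candidates(merges, candidates)
print(sorted(refined))
```

Here `in` is dropped from the candidates because `ing`, which is not a candidate, is built from it, whereas `xy` is kept because its only descendant `xyz` is itself a candidate.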
This merge-rule constraint guarantees that all remaining merge rules remain valid after vocabulary modification, preserving the deterministic and complete nature of the tokenization algorithm. The refined candidate set therefore contains only tokens that are undertrained or unreachable and whose removal does not compromise the structural integrity of the tokenizer:

$$V_{\text{cand}} = V_{\text{cand}} \setminus V_{\text{exclude}} \tag{2}$$

The final replacement candidate set is of size 1528 for Llama-3.1-8B (vocabulary size: 128K) and 3987 for Qwen-2.5-7B (vocabulary size: 151K). We next describe our domain-specific vocabulary construction step.

### 2.3 Building Domain-Specific Vocabulary

We construct domain-specific vocabulary through a process involving corpus curation, independent tokenizer training, and vocabulary filtering for each target domain. This approach ensures that our added tokens genuinely represent domain-salient terminology rather than arbitrary subword fragments.

We curate two domain-specific corpora, each comprising 100 million tokens (100M) sampled from authoritative sources within their respective domains. The medical domain corpus is sampled from the MEDITRON pretraining corpora (Chen et al., [2023](https://arxiv.org/html/2605.17379#bib.bib62)), which aggregate clinical practice guidelines, PubMed Central full-text articles, and article abstracts, providing comprehensive coverage of both clinical and biomedical language. For the legal domain, we compile a corpus from Supreme Court of India case documents, capturing the specialized vocabulary and linguistic conventions of Indian jurisprudence. We train an independent Byte-Pair Encoding tokenizer using the HuggingFace tokenizers library ([https://github.com/huggingface/tokenizers](https://github.com/huggingface/tokenizers)) with a vocabulary size of 256,000 tokens for each domain corpus.
This training process learns domain-optimized merge operations that naturally surface frequently occurring domain-specific terms as single tokens.

From each trained domain tokenizer vocabulary, we extract candidate tokens for addition to the base model. We filter this set to exclude any tokens that already exist in the source model vocabulary $V_{\text{src}}$, as these tokens require no adaptation. This non-overlapping constraint ensures we only add genuinely new vocabulary items that address coverage gaps in the original tokenizer.

We apply an additional refinement to ensure linguistic coherence across models and avoid introducing problematic tokens. We restrict the candidate set to tokens containing only English alphabetic characters, excluding any subwords that contain numeric digits, special symbols, or mixed alphanumeric patterns. This filtering serves multiple purposes: it eliminates formatting artifacts, date fragments, and identifier components that do not represent meaningful linguistic units; it ensures that added tokens correspond to genuine lexical items rather than incidental character sequences; and it maintains consistency with the predominantly alphabetic nature of established vocabulary in pretrained models. The resulting filtered set forms our domain-specific vocabulary $V_{\text{new}}^{\mathcal{D}}$, comprising high-frequency, domain-salient, purely alphabetic tokens that address the most significant tokenization inefficiencies for the target domain. In both domains, we select the top 10,000 vocabulary tokens ranked by frequency in the domain corpus, representing the most salient domain-specific vocabulary items. We next describe the procedure of vocabulary replacement.
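The filtering and ranking described in this section can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function name `select_domain_vocab`, the frequency dictionary, and the toy tokens are assumptions, and the handling of tokenizer-specific whitespace markers is deliberately left out.

```python
import re
from typing import Dict, List, Set

# Purely alphabetic English tokens only (no digits, symbols, or mixed patterns).
ALPHABETIC = re.compile(r"[A-Za-z]+")


def select_domain_vocab(
    domain_freq: Dict[str, int], source_vocab: Set[str], budget: int
) -> List[str]:
    """Drop tokens already in the source vocabulary or not purely alphabetic,
    then keep the `budget` most frequent ones."""
    eligible = [
        (token, freq)
        for token, freq in domain_freq.items()
        if token not in source_vocab and ALPHABETIC.fullmatch(token)
    ]
    # Highest frequency first; ties broken alphabetically for determinism.
    eligible.sort(key=lambda item: (-item[1], item[0]))
    return [token for token, _ in eligible[:budget]]


# Hypothetical token frequencies from a domain tokenizer.
domain_freq = {
    "the": 1000,         # already in the source vocabulary
    "mg/dl": 90,         # contains a symbol
    "naloxone": 80,
    "covid19": 70,       # contains digits
    "appellant": 60,
    "osteoporosis": 50,
}
selected = select_domain_vocab(domain_freq, source_vocab={"the"}, budget=2)
print(selected)
```

With the paper's settings the budget would be 10,000 and the frequencies would come from the 100M-token domain corpus.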
### 2.4 Vocabulary Replacement-Then-Expansion and Embedding Initialization

Thus far, we have a domain vocabulary $V_{\text{new}}^{\mathcal{D}}$ and a replacement candidate set $V_{\text{cand}}$ (Eq. [2](https://arxiv.org/html/2605.17379#S2.E2)), such that $|V_{\text{new}}^{\mathcal{D}}| > |V_{\text{cand}}|$. We first replace the tokens of $V_{\text{cand}}$ in the LLM's base vocabulary with an equal-sized subset of $V_{\text{new}}^{\mathcal{D}}$, sorted by the natural merge order. We then expand the base vocabulary with the remaining $|V_{\text{new}}^{\mathcal{D}}| - |V_{\text{cand}}|$ elements of $V_{\text{new}}^{\mathcal{D}}$.

Initializing embeddings for the newly replaced and added tokens presents a critical challenge, as random initialization would require substantial training to achieve reasonable representations. Instead, we employ subword aggregation Yamaguchi et al. ([2024](https://arxiv.org/html/2605.17379#bib.bib33)), leveraging the model's existing understanding of subwords. For each new token $t_{\text{new}}$, we tokenize it using the original tokenizer to obtain a sequence of source tokens $[t_1, \ldots, t_n]$. We then initialize the new token's embedding as the mean of these constituent embeddings:

$$e_{t_{\text{new}}} = \frac{1}{n} \sum_{i=1}^{n} e_{t_i} \tag{3}$$

This initialization provides a reasonable starting point that captures compositional semantics while allowing subsequent training to refine the representation. The same subword aggregation strategy is applied to initialize the corresponding vector of the unembedding matrix. Next, we describe the procedure to tune the model with the modified vocabulary.
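A minimal sketch of the replace-then-expand bookkeeping and the mean initialization of Eq. 3 is given below, using plain Python lists instead of a tensor library. The function names, the ascending-id slot order, and the toy embeddings are assumptions made for illustration and are not taken from the paper's codebase.

```python
from typing import Callable, Dict, List, Sequence


def assign_slots(
    new_tokens: Sequence[str], candidate_ids: Sequence[int], vocab_size: int
) -> Dict[str, int]:
    """Give each new token an id: recycled candidate slots first, fresh ids after."""
    slots = sorted(candidate_ids)  # assumption: recycle slots in ascending id order
    assignment: Dict[str, int] = {}
    for rank, token in enumerate(new_tokens):
        if rank < len(slots):
            assignment[token] = slots[rank]  # replacement (no new parameters)
        else:
            assignment[token] = vocab_size + rank - len(slots)  # expansion
    return assignment


def mean_subword_init(
    token: str,
    encode: Callable[[str], List[int]],
    embeddings: Sequence[Sequence[float]],
) -> List[float]:
    """Eq. 3: average the ORIGINAL embeddings of the token's source subwords.
    Compute this before any recycled row is overwritten."""
    ids = encode(token)
    dim = len(embeddings[0])
    return [sum(embeddings[i][j] for i in ids) / len(ids) for j in range(dim)]


def extra_parameters(num_new: int, num_candidates: int, hidden_dim: int) -> int:
    """Parameters added by expansion: 2 * k_expand * d (embedding + unembedding)."""
    return 2 * max(0, num_new - num_candidates) * hidden_dim


# Toy example: 4-token vocabulary, 2-dimensional embeddings.
old_embeddings = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [1.0, 1.0]]
assignment = assign_slots(["aa", "bb", "cc"], candidate_ids=[3, 1], vocab_size=4)
init_vector = mean_subword_init("aa", lambda _: [0, 2], old_embeddings)
print(assignment, init_vector)
print(extra_parameters(10_000, 0, 4096))
```

The last line reproduces the expansion-only count for 10,000 tokens at a hidden size of 4096, which is 81,920,000 parameters; every recycled slot lowers that count by twice the hidden dimension.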
### 2.5 Domain-Specific Continual Pretraining

Following vocabulary modification, we conduct domain-specific continual pretraining to adapt the model to the target domain while training the new token representations. We employ Low-Rank Adaptation (LoRA) (Hu et al., [2022](https://arxiv.org/html/2605.17379#bib.bib31)) to enable parameter-efficient training, inserting trainable low-rank matrices into the model's attention and feed-forward layers while keeping the original pretrained parameters frozen. This approach substantially reduces the number of trainable parameters and memory requirements during adaptation. Each domain model is trained independently on the domain-specific corpus of 100M tokens sampled from the high-quality, representative sources described in Section 2.3. We train using the standard causal language modeling objective with next-token prediction, optimizing the model to predict each token given all preceding context. Training is conducted separately for medical and legal domains, producing two specialized model variants from each base model architecture.

## 3 Experimental Setup

Here, we describe the evaluation metrics and datasets used, followed by the baseline models and implementation details.

| Corpus | Subset | Size | SD Token Count (Llama) | SD Token Count (Qwen) | RS Token Count (Llama) | RS Token Count (Qwen) | SD OOV Conc. (Llama) | SD OOV Conc. (Qwen) | RS OOV Conc. (Llama) | RS OOV Conc. (Qwen) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Medical | Random | 399 | 823 | 847 | 150 | 153 | 11.83 | 11.88 | 13.51 | 13.59 |
| Medical | OOV_SD | 399 | 808 | 828 | 147 | 150 | 16.90 | 16.94 | 17.23 | 17.24 |
| Medical | OOV_RS | 399 | 837 | 862 | 137 | 139 | 14.40 | 14.43 | 21.61 | 21.65 |
| Legal | Random | 711 | 5870 | 6059 | 1171 | 1221 | 4.75 | 4.76 | 4.76 | 4.76 |
| Legal | OOV_SD | 710 | 4661 | 4801 | 940 | 975 | 7.48 | 7.49 | 6.97 | 6.98 |
| Legal | OOV_RS | 711 | 5058 | 5214 | 856 | 889 | 6.39 | 6.40 | 8.57 | 8.57 |

Table 1: Dataset statistics across Legal and Medical domains under Random, OOV_RS, and OOV_SD settings, reporting mean token counts, OOV concentration (fraction of unigrams in text split more than once), and novel unigram concentration (fraction of unigrams in RS not present in SD).
The Medical domain exhibits higher OOV concentrations than the Legal domain, while the Legal domain has substantially higher token counts than the Medical domain.

#### Datasets.

We test our pipeline on two summarization datasets, one from each domain. We use the English subset of the MultiClinSumm dataset Lima López et al. ([2025](https://arxiv.org/html/2605.17379#bib.bib83)) for the medical domain. The dataset comprises clinical case reports as source documents (SD) and their corresponding summaries, derived from the case reports, as the reference summaries (RS). We use the abstractive summarization dataset (IN-ABS) proposed in Shukla et al. ([2022](https://arxiv.org/html/2605.17379#bib.bib85)) for the Legal domain. Here, SD is a court case judgment from an Indian court and RS is an abstractive summary of the case judgment. To understand the generalizability of our approach across tasks, we further supplement the evaluation for the medical domain with two summarization tasks: evidence-based summarization Mollá and Santiago-Martinez ([2011](https://arxiv.org/html/2605.17379#bib.bib2)) and patient healthcare query summarization Ben Abacha and Demner-Fushman ([2019](https://arxiv.org/html/2605.17379#bib.bib58)); Van Veen et al. ([2024](https://arxiv.org/html/2605.17379#bib.bib9)). In EBM (evidence-based summarization), the source document is a query accompanied by a PubMed abstract as context, and the reference summary is the answer to the query given that context. CHQ (patient healthcare query summarization) consists of a patient-written healthcare query as input and a concise one-line question written by a medical expert as the summary. In the main text we discuss the results on clinical report summarization; the results for the EBM and CHQ datasets are in Appendix [A](https://arxiv.org/html/2605.17379#A1).

#### Restructuring Datasets for Expert-Level Summaries.
We restructure the standard dataset in such a way that challenging data points constitute our test set Balde et al. ([2024](https://arxiv.org/html/2605.17379#bib.bib16), [2025](https://arxiv.org/html/2605.17379#bib.bib61)). We specifically consider two scenarios where: a) the source documents have higher OOV concentration (OOV_SD), and b) the reference summaries have higher OOV (Out-of-Vocabulary) concentration (OOV_RS). The top 10% of data points from each of these categories are considered higher-concentration documents and constitute our restructured test set. The remaining 90% of the corpus is kept as the training set. Additionally, we create an equal-sized Random train/test subset without any restrictions on OOV concentrations to understand the degree of improvement in challenging scenarios. The dataset statistics are reported in Table [1](https://arxiv.org/html/2605.17379#S3.T1). We note that there is roughly 30-40% overlap between the test sets of the two challenging scenarios.

| Domain | Prompt structure |
| --- | --- |
| Medical | `You are an expert medical professional. #### Summarize the given clinical case report into a discharge summary of 100 words or less. Use the examples to guide word choice. Clinical Case Report 1: {Train-Case-Document} Discharge Summary 1 : {Train-Summary} ## Clinical Case Report 2: {Test-Case-Document} Discharge Summary 2 :` |
| Legal | `You are an expert Indian Legal professional. #### Summarize the given legal case document in 300 words on less. Use the examples to guide word choice. Case Document 1: {Train-SD} Summary 1 : {Train-RS} ## Case Document 2: {Test-SD} Summary 2 :` |

Table 2: The prompt structure used for prompting LLMs, inspired by the structure proposed in ClinSumm Van Veen et al. ([2024](https://arxiv.org/html/2605.17379#bib.bib9)). Since we are using BASE LLMs, there is no explicit segregation of system prompt and user prompt.

#### Baseline Models.
We use the base variants of two LLMs, Qwen-2.5 Qwen et al. ([2025](https://arxiv.org/html/2605.17379#bib.bib56)) (Model id: [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B)) and Llama-3.1 Touvron et al. ([2023](https://arxiv.org/html/2605.17379#bib.bib4)) (Model id: [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B)), as our BASE models. They undergo neither vocabulary adaptation nor continual pretraining. Additionally, we use variants of these base models that are continually pretrained on domain-specific text, which we label ‘CPTOnly (No Vocab Adapt)’. This helps us isolate the improvements that are due solely to vocabulary adaptation.

#### Training and Inference Strategy.

All experiments are conducted on a single H100 80 GB GPU. We train the models using the standard causal language modeling task of next-token prediction and use greedy decoding to generate summaries. We use LoRA with rank 32, alpha 64, and a learning rate of 2e-5. For all domains, we adapt a vocabulary of size 10K and train the models on the 100M-token dataset for a total of 3 epochs with an effective batch size of 64. Both CPTOnly and VocabAdapt are trained on identical corpora and hyperparameter settings, with VocabAdapt additionally performing a one-time vocabulary construction step that takes roughly 30 minutes (on a single core of an Apple M3 Pro laptop). Despite this overhead, VocabAdapt completes training in 6.5-8.5 hours total, making it notably faster than CPTOnly, which requires 10.5-12.5 hours. For inference, we use in-context learning Brown et al. ([2020](https://arxiv.org/html/2605.17379#bib.bib6)), providing the model with a single example demonstration followed by the test data point (Appendix [A.1](https://arxiv.org/html/2605.17379#A1.SS1) contains details on the sampling procedure for the ICL demonstration).
The prompt structure for ICL is provided in Table [2](https://arxiv.org/html/2605.17379#S3.T2).

#### Evaluation Metrics.

We evaluate summarization quality using Rouge-LCS (R-LCS) as the main evaluation metric and report F-score values, following prior works Balde et al. ([2024](https://arxiv.org/html/2605.17379#bib.bib16), [2025](https://arxiv.org/html/2605.17379#bib.bib61)); Fabbri et al. ([2021](https://arxiv.org/html/2605.17379#bib.bib41)). We also report BERTScore Zhang et al. ([2020](https://arxiv.org/html/2605.17379#bib.bib76)), for which we use BioBERT Lee et al. ([2020](https://arxiv.org/html/2605.17379#bib.bib77)) embeddings and InLegalBERT Paul et al. ([2023](https://arxiv.org/html/2605.17379#bib.bib78)) embeddings for the medical and legal domain evaluation respectively. We also conduct an LLM-as-judge evaluation of the summaries generated in the medical and legal domains. We use Google's MedGemma-27B model Sellergren et al. ([2025](https://arxiv.org/html/2605.17379#bib.bib91)) for the medical domain and Gemma3-27B Team et al. ([2025](https://arxiv.org/html/2605.17379#bib.bib68)) for the legal domain to evaluate the model-generated summaries across three evaluation dimensions, coherence, relevance, and faithfulness, on a scale of 1-5 Fabbri et al. ([2021](https://arxiv.org/html/2605.17379#bib.bib41)); Zhang et al. ([2023](https://arxiv.org/html/2605.17379#bib.bib8)).

## 4 Experimental Results

Table 3: Comparison of best vocabulary adaptation methods across different domains (Legal and Medical) using in-context learning with one exemplar demonstration in two challenging scenarios, OOV_RS and OOV_SD, and the Random subset. We report Rouge-LCS (R-LCS), BERTScore (BSr), and Fragment scores in SD (FrSrSD) and RS (FrSrRS).
We note that: \(i\) vocabulary adaptation significantly brings down fragment scores; \(ii\) improvement margins are higher in the medical domain than in the legal domain \(owing to higher OOV concentration\); \(iii\) improvements due to vocabulary adaptation are typically higher in the challenging scenarios than in the Random setting; and \(iv\) vocabulary adaptation brings down training time by 35–55% compared to the CPTOnly baselines\. The contrast between OOV\_SD and OOV\_RS highlights that source\-side OOV primarily affects content understanding, while reference\-side OOV impacts lexical realization, and effective vocabulary adaptation is crucial in addressing both challenges beyond what is observed in the Random setting\.

We report Rouge\-LCS, BERTScore, and Fragment Scores \(the average number of subwords a word is tokenized into\) in Table [3](https://arxiv.org/html/2605.17379#S4.T3), focusing on the best vocabulary adaptation strategies\. Further results are provided in Appendix [A](https://arxiv.org/html/2605.17379#A1)\. We observe that the impact of vocabulary expansion is strongly domain\-dependent\. Improvements are more pronounced in the medical domain, which has a higher OOV concentration than the legal domain\. We now provide a detailed discussion across scenarios, highlighting where vocabulary adaptation does and does not work\.

#### Vocabulary adaptation leads to a lower fragment score\.

Vocabulary adaptation techniques lower the fragment score, thus reducing over\-fragmentation and addressing vocabulary mismatch\. In the medical domain, we see a reduction of 16\.02% and 15\.63% for Llama and Qwen respectively across the challenging OOV scenarios\. In the legal domain, we see a reduction of 5\.95% and 5\.73% for Llama and Qwen respectively across the challenging scenarios\. This reduction makes the models more energy\-efficient, as fewer tokens are needed to encode and generate text than with BASE, and results in better representations\.
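As a rough illustration of the fragment score \(the average number of subwords per word\), the sketch below computes it with a toy longest\-match tokenizer\. The tokenizer, both vocabularies, and the example sentence are hypothetical stand\-ins chosen for illustration; they are not the Llama or Qwen tokenizers, and the exact word\-splitting rules used in the paper are not specified in this section\.

```python
from typing import Callable, Iterable, List, Set


def greedy_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Toy longest-match-first subword tokenizer (a stand-in, not real BPE)."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest substring first; fall back to a single character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def fragment_score(words: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subwords per word (lower means less fragmentation)."""
    counts = [len(tokenize(w)) for w in words]
    return sum(counts) / len(counts) if counts else 0.0


# Hypothetical vocabularies: the adapted one adds two whole domain words.
base_vocab = {"the", "patient", "has", "hyper", "tension", "and", "nephro", "pathy"}
adapted_vocab = base_vocab | {"hypertension", "nephropathy"}

words = "the patient has hypertension and nephropathy".split()
base_score = fragment_score(words, lambda w: greedy_tokenize(w, base_vocab))
adapted_score = fragment_score(words, lambda w: greedy_tokenize(w, adapted_vocab))
reduction_pct = 100 * (base_score - adapted_score) / base_score
```

In this toy case the score drops from about 1\.33 to 1\.00, a 25% relative reduction\. With a real tokenizer, one would pass its tokenize function in place of `greedy_tokenize`\.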
#### Vocabulary adaptation improves more on the OOV\-concentrated subset of source documents than of reference summaries\.

Vocabulary adaptation improves over BASE in all cases \(in terms of R\-LCS and BERTScore\) and over CPTOnly in 6 out of 8 comparisons in OOV\_SD\. In OOV\_RS, vocabulary adaptation improves over BASE in 7 out of 8 comparisons and over CPTOnly in only 3 out of 8 comparisons\. The stronger results in OOV\_SD can be attributed to a greater reduction in source\-side token fragmentation: 10\.16% for OOV\_SD compared to 8\.92% for OOV\_RS\. Higher fragmentation in OOV\_RS leads to a more dispersed attention distribution, which can hinder the model’s ability to effectively capture and understand the source document, ultimately affecting overall performance\.

#### Improvements in the medical domain are higher than in the legal domain\.

Vocabulary adaptation consistently improves over BASE in both domains, but the gains are larger in the medical domain\. This behavior can be tied to a rather simple observation from Table [1](https://arxiv.org/html/2605.17379#S3.T1): the medical domain has substantially higher OOV concentrations in source documents and reference summaries, which makes it an ideal candidate for vocabulary adaptation\.

#### Improvement in Random is moderate compared to the OOV settings\.

The Random setting yields slightly lower absolute performance than the challenging OOV scenarios for the medical domain \(Qwen: 75\.72 vs OOV\_SD 76\.15 BSr; Llama: 75\.98 vs OOV\_SD 76\.55 BSr\), indicating that vocabulary adaptation is most beneficial under severe OOV constraints\. The performance gap between the Random and OOV scenarios is more pronounced in the medical domain, consistent with the higher OOV concentration in the SD and RS subsets\. Fragmentation reduction is more substantial in the OOV scenarios than in Random for the medical domain \(Qwen: FrSr 1\.10\-1\.14 in OOV vs 1\.09\-1\.10 in Random\), showing that VocabAdapt’s higher performance is consistent with a larger reduction in fragmentation\.
That said, VocabAdapt still has a positive, albeit small, impact in the Random setting\.

Figure 1: Median novel unigram concentration observed in the summaries generated by the BASE, CPTOnly, and VocabAdapt methods for the Llama and Qwen models\. We note that the vocabulary adaptation method brings in more \(meaningful\) novel words than the baselines\.

#### Vocabulary adaptation improves training efficiency\.

Beyond final performance, vocabulary adaptation methods consistently reach their best checkpoints substantially earlier than CPTOnly across domains and model families\. Concretely, from Table [3](https://arxiv.org/html/2605.17379#S4.T3), we note that vocabulary adaptation variants result in an approximate 35–55% reduction in training steps to peak performance compared to CPTOnly\. This in turn reduces training time while maintaining similar performance or even outperforming CPTOnly\. This indicates that correcting tokenization mismatch improves optimization efficiency by allowing models to allocate capacity to coherent domain tokens\.

#### Vocabulary adaptation improves semantic overlap\.

We conduct a brief analysis to understand why, in certain cases, there is a slight drop in Rouge\-LCS but a gain in BERTScore\. We hypothesize that vocabulary adaptation leads to higher abstraction, that is, the generation of novel unigrams absent from the source document becomes more prevalent\. This might in turn bring in terms that do not lexically overlap with the reference summary but do carry similar semantics\. We report our findings in Figure [1](https://arxiv.org/html/2605.17379#S4.F1)\. We note that vocabulary adaptation indeed introduces more meaningful abstraction \(novel unigrams\) than the baselines, consistently across evaluation scenarios\. This supports the hypothesis and may account for the slight drop in Rouge\-LCS accompanied by an increase in BERTScore\.
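The notion of novel unigrams used in this analysis can be sketched as follows\. This is a minimal illustration that assumes novelty is measured over distinct lowercased word types and reported as a percentage; the exact normalization behind Figure 1 is not given in this section, and the helper names and example texts are hypothetical\.

```python
import re
from statistics import median
from typing import List


def unigrams(text: str) -> List[str]:
    """Lowercased alphanumeric word tokens; a simple stand-in for a word tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def novel_unigram_pct(source: str, summary: str) -> float:
    """Percentage of distinct summary unigrams that never occur in the source."""
    summary_types = set(unigrams(summary))
    if not summary_types:
        return 0.0
    novel = summary_types - set(unigrams(source))
    return 100 * len(novel) / len(summary_types)


# Hypothetical (source, generated summary) pairs.
pairs = [
    ("The patient reported chest pain and shortness of breath.",
     "Patient presents with dyspnea and chest pain."),
    ("The court dismissed the appeal filed by the tenant.",
     "The tenant appeal was dismissed."),
]
scores = [novel_unigram_pct(src, summ) for src, summ in pairs]
median_score = median(scores)  # median over examples, as a figure might report
```

A higher value means the summary relies more on words that do not appear in the source, which is one plausible reason lexical\-overlap metrics can fall while embedding\-based metrics rise\.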
The next question is whether the introduction of such novel words improves the readability and coherence of the summary; we answer it using an LLM as a judge\.

#### LLM\-as\-a\-Judge Evaluation for measuring the quality of a generated summary\.

We conduct an LLM\-as\-a\-Judge evaluation Croxford et al\. \([2025](https://arxiv.org/html/2605.17379#bib.bib90)\) of the summaries generated by CPTOnly and vocabulary adaptation methods in the medical and legal domains\. We evaluate along three dimensions \(coherence, relevance, and faithfulness\), as done in prior art Zhang et al\. \([2023](https://arxiv.org/html/2605.17379#bib.bib8)\); Balde et al\. \([2024](https://arxiv.org/html/2605.17379#bib.bib16),[2025](https://arxiv.org/html/2605.17379#bib.bib61)\)\. We take 100 random samples from the medical domain and 20 from the legal domain, distributed uniformly across OOV scenarios and models\. We report the average scores in Table [4](https://arxiv.org/html/2605.17379#S4.T4)\. We find that vocabulary adaptation generates more coherent, relevant, and faithful summaries than the competitive CPTOnly baseline \(see Appendix [A\.4](https://arxiv.org/html/2605.17379#A1.SS4) for further details\)\.

Table 4: LLM\-as\-a\-Judge results using MedGemma\-27B as the evaluator for the Medical domain and Gemma3\-27B for the Legal domain\. The evaluation is carried out across coherence, relevance, and faithfulness on a scale of 1–5\. We observe that the summaries generated by vocabulary adaptation methods are mostly rated higher than those of the CPTOnly baseline, indicating better summaries\.

Table 5: Ablation analysis for vocabulary adaptation methods with and without replacement\. We show vocabulary sizes, parameter counts \(in millions\), and performance metrics \(R\-LCS and BERTScore\) in the challenging scenarios\.
We note that replacement\-based methods \(i\) save 12\.04% of parameters in Llama\-3\.1 and 37\.19% of parameters in Qwen2\.5\-7B, and \(ii\) perform better than their without\-replacement counterparts in 13/16 settings\.

#### Ablation analysis of vocabulary adaptation with and without replacement\.

We report an ablation of vocabulary adaptation techniques with and without replacement in Table [5](https://arxiv.org/html/2605.17379#S4.T5)\. We note that replacement\-based strategies perform better than or on par with without\-replacement strategies in 13 out of 16 settings\. Replacement\-based strategies favor Llama \(7 out of 8 settings\) slightly more than Qwen \(6 out of 8 settings\)\. Contrary to the previous discussion of higher OOV concentration in the medical domain, we find here that the legal domain benefits more \(all 8 settings\) than the medical domain \(5 out of 8 settings\)\. One possible explanation for this observation is the higher replacement fraction in the legal domain \(25\.43%\) compared to the medical domain \(23\.81%\)\. Importantly, replacement\-based vocabulary adaptation does not increase the number of trainable parameters beyond the expanded embedding and unembedding layer \(lm\_head\)\. We note that replacement\-based methods save 12\.04% of parameters in Llama\-3\.1 and 37\.19% of parameters in Qwen2\.5\-7B\.

#### Comparison with closed\-source LLMs\.

We conducted a zero\-shot analysis on a closed\-source model, GPT\-5\-mini \(gpt\-5\-mini\-2025\-08\-07\)\. We aim to understand whether over\-fragmentation still persists as an underlying issue as the number of parameters increases\. To that end, we ran the evaluation on GPT\-5\-mini and compared the results with the best of the Llama and Qwen results obtained with the VocabAdapt method\. The results are shown in Table [6](https://arxiv.org/html/2605.17379#S4.T6)\.
Table 6: Performance of GPT\-5\-mini and VocabAdapt Best on summarization across the Medical and Legal domains, evaluated using Rouge\-LCS \(R\-LCS\) and BERTScore \(BSr\) under the challenging scenarios \(OOV\_SD and OOV\_RS\) and the Random setting\. We note that our 7\-8B parameter model with vocabulary adaptation consistently outperforms gpt\-5\-mini \(speculated to be several orders larger than a 7B model, with a more complex architecture and workflow, including MoE\) in all the scenarios\. This motivates the need for vocabulary adaptation even for larger\-parameter models\.

## 5 Related Works

#### Domain Adaptation via Continued Pretraining\.

Standard adaptation strategies rely on continued pretraining \(CPT\) to align generalist models to expert domains\. Prominent examples include MEDITRON \(Chen et al\.,[2023](https://arxiv.org/html/2605.17379#bib.bib62)\), BioMistral \(Labrak et al\.,[2024](https://arxiv.org/html/2605.17379#bib.bib55)\), and ChatLaw Cui et al\. \([2024](https://arxiv.org/html/2605.17379#bib.bib25)\), which utilize massive domain corpora to enhance performance\. However, these model\-centric approaches are computationally intensive and fail to address the underlying tokenization over\-fragmentation Si et al\. \([2019](https://arxiv.org/html/2605.17379#bib.bib53)\), leading to inefficient inference and context window erosion \(Gu et al\.,[2021](https://arxiv.org/html/2605.17379#bib.bib73)\)\.

#### Vocabulary Expansion Strategies\.

To mitigate fragmentation, recent research has pivoted towards vocabulary expansion\. Hong et al\. \([2021](https://arxiv.org/html/2605.17379#bib.bib40)\) introduced AVocaDo to optimize vocabulary based on fragment scores Rust et al\. \([2021](https://arxiv.org/html/2605.17379#bib.bib32)\), while Task\-Adaptive Tokenization \(Liu et al\.,[2023](https://arxiv.org/html/2605.17379#bib.bib13)\) leverages subword regularization to reduce sequence length\.
More targeted approaches like MEDVOC \(Balde et al\.,[2024](https://arxiv.org/html/2605.17379#bib.bib16)\), Gold Panning \(Liu et al\.,[2024](https://arxiv.org/html/2605.17379#bib.bib12)\), HYPEROFA Özeren et al\. \([2025](https://arxiv.org/html/2605.17379#bib.bib92)\), AdaptiVocab Nakash et al\. \([2025](https://arxiv.org/html/2605.17379#bib.bib93)\), and MEDVOC\-LLM and ScafFix Balde et al\. \([2025](https://arxiv.org/html/2605.17379#bib.bib61)\) focus on selecting high\-value domain tokens, though these additive methods inevitably increase the model’s parameter count and memory footprint\.

#### Vocabulary Pruning and Replacement\.

Addressing parameter efficiency, emerging work investigates pruning and token recycling\. Land and Bartolo \([2024](https://arxiv.org/html/2605.17379#bib.bib65)\) identified "glitch tokens" as under\-trained vocabulary artifacts ripe for removal\. Building on this, methods like Vocab Diet \(Reif et al\.,[2025](https://arxiv.org/html/2605.17379#bib.bib81)\) and COMPACT \(Kong et al\.,[2025](https://arxiv.org/html/2605.17379#bib.bib67)\) demonstrate that pruning unused tokens or replacing them with domain\-specific terms can maintain performance Purason et al\. \([2026](https://arxiv.org/html/2605.17379#bib.bib80)\)\. This establishes the basis for our replacement\-based framework, which achieves adaptation with less parameter growth than expansion techniques\.

## 6 Conclusion

We presented a systematic study of vocabulary adaptation for domain\-specific summarization, focusing on when and why it improves LLM performance\. Across the controlled settings OOV\_RS and OOV\_SD, we showed that gains are governed by the severity and location of vocabulary mismatch\. Vocabulary\-adapted models converge faster than continual pretraining alone, reaching peak performance in 35–55% fewer training steps\.
Furthermore, vocabulary adaptation improves performance not only quantitatively \(in terms of ROUGE\-LCS and BERTScore\) but also qualitatively \(coherence, relevance, and faithfulness\), as noted in our LLM\-as\-a\-Judge evaluation\. Replacement\-based strategies remain parameter\-efficient, saving up to 37% of parameters, and further improve robustness over their expansion\-only counterparts\. These findings position tokenization as a design consideration for future domain adaptation work\. We make our codebase publicly available at [https://github\.com/gb\-kgp/VocabReplace\-Then\-Expand](https://github.com/gb-kgp/VocabReplace-Then-Expand)\.

## 7 Limitations

Our work has the following limitations\. First, we built our fixed\-size 100M\-token pretraining corpora following prior art Beltagy et al\. \([2019](https://arxiv.org/html/2605.17379#bib.bib79)\); Chen et al\. \([2023](https://arxiv.org/html/2605.17379#bib.bib62)\); Paul et al\. \([2023](https://arxiv.org/html/2605.17379#bib.bib78)\); however, there can be many other ways to construct a more fine\-grained pretraining corpus\. This can be an interesting direction for future work\. Second, we note that the LLMs considered, Llama\-3\.1 and Qwen2\.5, have large vocabulary sizes \(128K and 151K\); still, there is a significant overlap in the vocabulary of these models\. However, this in no way affects the findings of this work\. It could indeed be interesting to explore the efficacy of these strategies on models with other vocabulary sizes, like Microsoft Phi Abdin et al\. \([2024](https://arxiv.org/html/2605.17379#bib.bib69)\) with a vocabulary size of 100K, and the Gemma Team et al\. \([2025](https://arxiv.org/html/2605.17379#bib.bib68)\) series with a vocabulary size of 256K\. Third, we fix the size of the expansion vocabulary at 10K based on the natural frequency order, which we found resulted in decent fragment scores, mitigating over\-fragmentation\.
We speculate that there can be more nuanced ways to carefully select this 10K subset, and leave this as potential future work\.

## 8 Ethics Statement and Broader Impact

The LLMs considered in this study, from the Llama and Qwen families, are general\-purpose LLMs\. Although our techniques show promising improvements, they are in no way ready for production deployment without first ensuring proper safety checks and balances\. More dedicated research is still needed to investigate hallucination, correctness, and completeness of the responses in real\-world open\-ended generation\.

## Acknowledgments

We thank the Ministry of Education, Govt of India, for supporting Gunjan Balde with the Prime Minister Research Fellowship during his Ph\.D\. tenure\. This research was partially funded by a Google Academic Research Award\. We acknowledge the National Supercomputing Mission \(NSM\) for providing computing resources of ‘PARAM Shakti’ at IIT Kharagpur, implemented by C\-DAC and supported by the Ministry of Electronics and Information Technology \(MeitY\) and Department of Science and Technology \(DST\), Government of India\.

## References

- Phi\-4 technical report\.arXiv preprint arXiv:2412\.08905\.Cited by:[§7](https://arxiv.org/html/2605.17379#S7.p1.1)\. - O\. Ahia, S\. Kumar,et al\.\(2023\)Do all languages cost the same? tokenization in the era of commercial language models\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 9904–9923\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.614/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.614)Cited by:[§1](https://arxiv.org/html/2605.17379#S1.p1.1)\. - G\. Balde, S\. Roy,et al\.\(2024\)MEDVOC: vocabulary adaptation for fine\-tuning pre\-trained language models on medical text summarization\.InProceedings of the Thirty\-Third International Joint Conference on Artificial Intelligence, IJCAI\-24,pp\.
6180–6188\.Note:Main TrackExternal Links:[Document](https://dx.doi.org/10.24963/ijcai.2024/683)Cited by:[§1](https://arxiv.org/html/2605.17379#S1.p3.1),[§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px2.p1.1),[§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px5.p1.1),[§4](https://arxiv.org/html/2605.17379#S4.SS0.SSS0.Px7.p1.1),[§5](https://arxiv.org/html/2605.17379#S5.SS0.SSS0.Px2.p1.1)\. - G\. Balde, S\. Roy,et al\.\(2025\)Evaluation of LLMs in medical text summarization: the role of vocabulary adaptation in high OOV settings\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 22989–23004\.External Links:[Link](https://aclanthology.org/2025.findings-acl.1179/),ISBN 979\-8\-89176\-256\-5Cited by:[§A\.2](https://arxiv.org/html/2605.17379#A1.SS2.SSS0.Px3.p1.1),[§A\.2](https://arxiv.org/html/2605.17379#A1.SS2.SSS0.Px5.p1.1),[§1](https://arxiv.org/html/2605.17379#S1.p3.1),[§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px2.p1.1),[§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px5.p1.1),[§4](https://arxiv.org/html/2605.17379#S4.SS0.SSS0.Px7.p1.1),[§5](https://arxiv.org/html/2605.17379#S5.SS0.SSS0.Px2.p1.1)\. - I\. Beltagy, K\. Lo, and A\. Cohan \(2019\)SciBERT: a pretrained language model for scientific text\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),K\. Inui, J\. Jiang, V\. Ng, and X\. Wan \(Eds\.\),Hong Kong, China,pp\. 3615–3620\.External Links:[Link](https://aclanthology.org/D19-1371/),[Document](https://dx.doi.org/10.18653/v1/D19-1371)Cited by:[§7](https://arxiv.org/html/2605.17379#S7.p1.1)\. - A\. Ben Abacha and D\. Demner\-Fushman \(2019\)On the summarization of consumer health questions\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,Florence, Italy,pp\. 
2228–2234\.Cited by:[§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px1.p1.1)\. - T\. Brown, B\. Mann,et al\.\(2020\)Language models are few\-shot learners\.Advances in neural information processing systems33,pp\. 1877–1901\.Cited by:[§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px4.p1.5)\. - Z\. Chen, A\. Hernández Cano,et al\.\(2023\)MEDITRON\-70b: scaling medical pretraining for large language models\.External Links:2311\.16079Cited by:[§1](https://arxiv.org/html/2605.17379#S1.p1.1),[§2\.3](https://arxiv.org/html/2605.17379#S2.SS3.p2.1),[§5](https://arxiv.org/html/2605.17379#S5.SS0.SSS0.Px1.p1.1),[§7](https://arxiv.org/html/2605.17379#S7.p1.1)\. - E\. Croxford, Y\. Gao,et al\.\(2025\)Automating evaluation of ai text generation in healthcare with a large language model \(llm\)\-as\-a\-judge\.medRxiv,pp\. 2025–04\.Cited by:[§4](https://arxiv.org/html/2605.17379#S4.SS0.SSS0.Px7.p1.1)\. - J\. Cui, M\. Ning,et al\.\(2024\)Chatlaw: a multi\-agent collaborative legal assistant with knowledge graph enhanced mixture\-of\-experts large language model\.External Links:2306\.16092Cited by:[§5](https://arxiv.org/html/2605.17379#S5.SS0.SSS0.Px1.p1.1)\. - A\. R\. Fabbri, W\. Kryściński,et al\.\(2021\)SummEval: re\-evaluating summarization evaluation\.Transactions of the Association for Computational Linguistics9,pp\. 391–409\.External Links:[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00373)Cited by:[§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px5.p1.1)\. - P\. Gao, T\. Yamasaki, and K\. Imoto \(2024\)VE\-kd: vocabulary\-expansion knowledge\-distillation for training smaller domain\-specific language models\.InFindings of the Association for Computational Linguistics: EMNLP 2024,pp\. 15046–15059\.Cited by:[§1](https://arxiv.org/html/2605.17379#S1.p3.1)\. - Y\. Gu, R\. Tinn,et al\.\(2021\)Domain\-specific language model pretraining for biomedical natural language processing\.ACM Transactions on Computing for Healthcare \(HEALTH\)3\(1\),pp\. 
1–23\.Cited by:[§1](https://arxiv.org/html/2605.17379#S1.p1.1),[§5](https://arxiv.org/html/2605.17379#S5.SS0.SSS0.Px1.p1.1)\. - S\. Gururangan, A\. Marasović,et al\.\(2020\)Don’t stop pretraining: adapt language models to domains and tasks\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,Online,pp\. 8342–8360\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.740)Cited by:[§1](https://arxiv.org/html/2605.17379#S1.p2.1)\. - V\. Hofmann, H\. Schuetze, and J\. Pierrehumbert \(2022\)An embarrassingly simple method to mitigate undesirable properties of pretrained language model tokenizers\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),Dublin, Ireland,pp\. 385–393\.External Links:[Document](https://dx.doi.org/10.18653/v1/2022.acl-short.43)Cited by:[§1](https://arxiv.org/html/2605.17379#S1.p1.1)\. - J\. Hong, T\. Kim,et al\.\(2021\)AVocaDo: strategy for adapting vocabulary to downstream domain\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,Online and Punta Cana, Dominican Republic,pp\. 4692–4700\.External Links:[Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.385)Cited by:[§1](https://arxiv.org/html/2605.17379#S1.p3.1),[§5](https://arxiv.org/html/2605.17379#S5.SS0.SSS0.Px2.p1.1)\. - E\. J\. Hu, yelong shen,et al\.\(2022\)LoRA: low\-rank adaptation of large language models\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.17379#S1.p2.1),[§2\.5](https://arxiv.org/html/2605.17379#S2.SS5.p1.1)\. - A\. Joshi, S\. Paul,et al\.\(2024\)IL\-TUR: benchmark for Indian legal text understanding and reasoning\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 
11460–11499\.External Links:[Link](https://aclanthology.org/2024.acl-long.618/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.618)Cited by:[§A\.1](https://arxiv.org/html/2605.17379#A1.SS1.p1.1)\. - G\. Kaplan, M\. Oren,et al\.\(2025\)From tokens to words: on the inner lexicon of LLMs\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=328vch6tRs)Cited by:[§1](https://arxiv.org/html/2605.17379#S1.p1.1)\. - Z\. Kong, Y\. Li,et al\.\(2025\)Token reduction should go beyond efficiency in generative models–from vision, language to multimodality\.arXiv preprint arXiv:2505\.18227\.Cited by:[§5](https://arxiv.org/html/2605.17379#S5.SS0.SSS0.Px3.p1.1)\. - Y\. Labrak, A\. Bazoge,et al\.\(2024\)BioMistral: a collection of open\-source pretrained large language models for medical domains\.InFindings of the Association for Computational Linguistics: ACL 2024,Bangkok, Thailand,pp\. 5848–5864\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.348)Cited by:[§1](https://arxiv.org/html/2605.17379#S1.p1.1),[§5](https://arxiv.org/html/2605.17379#S5.SS0.SSS0.Px1.p1.1)\. - S\. Land and M\. Bartolo \(2024\)Fishing for magikarp: automatically detecting under\-trained tokens in large language models\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 11631–11646\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.649/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.649)Cited by:[§1](https://arxiv.org/html/2605.17379#S1.p3.1),[§1](https://arxiv.org/html/2605.17379#S1.p4.1),[§2\.2](https://arxiv.org/html/2605.17379#S2.SS2.p2.3),[§2\.2](https://arxiv.org/html/2605.17379#S2.SS2.p3.7),[§5](https://arxiv.org/html/2605.17379#S5.SS0.SSS0.Px3.p1.1)\. - J\. Lee, W\. 
Yoon,et al\.\(2020\)BioBERT: a pre\-trained biomedical language representation model for biomedical text mining\.Bioinformatics36\(4\),pp\. 1234–1240\.Cited by:[§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px5.p1.1)\. - S\. Lima López, M\. Rodríguez Ortega,et al\.\(2025\)MultiClinSum dataset: summarization of clinical case reports in english, spanish, french and portuguese\.Zenodo\.Cited by:[§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px1.p1.1)\. - C\. Liu, S\. Wang,et al\.\(2024\)Gold panning in vocabulary: an adaptive method for vocabulary expansion of domain\-specific LLMs\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Miami, Florida, USA,pp\. 7442–7459\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.424)Cited by:[§5](https://arxiv.org/html/2605.17379#S5.SS0.SSS0.Px2.p1.1)\. - S\. Liu, N\. Deng,et al\.\(2023\)Task\-adaptive tokenization: enhancing long\-form text generation efficacy in mental health and beyond\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,Singapore,pp\. 15264–15281\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.944)Cited by:[§1](https://arxiv.org/html/2605.17379#S1.p3.1),[§5](https://arxiv.org/html/2605.17379#S5.SS0.SSS0.Px2.p1.1)\. - X\. H\. Lù \(2024\)BM25S: orders of magnitude faster lexical search via eager sparse scoring\.External Links:2407\.03618,[Link](https://arxiv.org/abs/2407.03618)Cited by:[§A\.1](https://arxiv.org/html/2605.17379#A1.SS1.p1.1)\. - D\. Mollá and M\. E\. Santiago\-Martinez \(2011\)Development of a corpus for evidence based medicine summarisation\.InProceedings of the Australasian Language Technology Association Workshop 2011,pp\. 86–94\.Cited by:[§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px1.p1.1)\. - I\. Nakash, N\. 
Calderon,et al\.\(2025\)AdaptiVocab: enhancing LLM efficiency in focused domains through lightweight vocabulary adaptation\.InSecond Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=TyXf9dwpZP)Cited by:[§5](https://arxiv.org/html/2605.17379#S5.SS0.SSS0.Px2.p1.1)\. - E\. Özeren, Y\. Liu, and H\. Schuetze \(2025\)HYPEROFA: expanding LLM vocabulary to new languages via hypernetwork\-based embedding initialization\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 4: Student Research Workshop\),J\. Zhao, M\. Wang, and Z\. Liu \(Eds\.\),Vienna, Austria,pp\. 79–96\.External Links:[Link](https://aclanthology.org/2025.acl-srw.6/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-srw.6),ISBN 979\-8\-89176\-254\-1Cited by:[§5](https://arxiv.org/html/2605.17379#S5.SS0.SSS0.Px2.p1.1)\. - S\. Paul, A\. Mandal,et al\.\(2023\)Pre\-trained language models for the legal domain: a case study on indian law\.InProceedings of the Nineteenth International Conference on Artificial Intelligence and Law,ICAIL ’23,New York, NY, USA,pp\. 187–196\.External Links:ISBN 9798400701979,[Link](https://doi.org/10.1145/3594536.3595165),[Document](https://dx.doi.org/10.1145/3594536.3595165)Cited by:[§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px5.p1.1),[§7](https://arxiv.org/html/2605.17379#S7.p1.1)\. - A\. Petrov, E\. La Malfa,et al\.\(2023\)Language model tokenizers introduce unfairness between languages\.InAdvances in Neural Information Processing Systems,Vol\.36,pp\. 36963–36990\.Cited by:[§1](https://arxiv.org/html/2605.17379#S1.p1.1)\. - T\. Purason, P\. Chizhov,et al\.\(2026\)Teaching old tokenizers new words: efficient tokenizer adaptation for pretrained models\.InFindings of the Association for Computational Linguistics: EACL 2026,V\. Demberg, K\. Inui, and L\. Marquez \(Eds\.\),Rabat, Morocco,pp\. 
6492–6516\.External Links:[Link](https://aclanthology.org/2026.findings-eacl.341/),[Document](https://dx.doi.org/10.18653/v1/2026.findings-eacl.341),ISBN 979\-8\-89176\-386\-9Cited by:[§1](https://arxiv.org/html/2605.17379#S1.p3.1),[§2\.2](https://arxiv.org/html/2605.17379#S2.SS2.p3.7),[§5](https://arxiv.org/html/2605.17379#S5.SS0.SSS0.Px3.p1.1)\. - Qwen:, A\. Yang,et al\.\(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px3.p1.1)\. - Y\. Reif, G\. Kaplan, and R\. Schwartz \(2025\)Vocab diet: reshaping the vocabulary of llms with vector arithmetic\.External Links:2510\.17001,[Link](https://arxiv.org/abs/2510.17001)Cited by:[§5](https://arxiv.org/html/2605.17379#S5.SS0.SSS0.Px3.p1.1)\. - P\. Rust, J\. Pfeiffer,et al\.\(2021\)How good is your tokenizer? on the monolingual performance of multilingual language models\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),Online,pp\. 3118–3135\.External Links:[Document](https://dx.doi.org/10.18653/v1/2021.acl-long.243)Cited by:[§1](https://arxiv.org/html/2605.17379#S1.p1.1),[§5](https://arxiv.org/html/2605.17379#S5.SS0.SSS0.Px2.p1.1)\. - V\. Sachidananda, J\. Kessler, and Y\. Lai \(2021\)Efficient domain adaptation of language models via adaptive tokenization\.InProceedings of the Second Workshop on Simple and Efficient Natural Language Processing,Virtual,pp\. 155–165\.External Links:[Document](https://dx.doi.org/10.18653/v1/2021.sustainlp-1.16)Cited by:[§1](https://arxiv.org/html/2605.17379#S1.p3.1)\. - A\. Sellergren, S\. Kazemzadeh,et al\.\(2025\)MedGemma technical report\.External Links:2507\.05201,[Link](https://arxiv.org/abs/2507.05201)Cited by:[§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px5.p1.1)\. - A\. Shukla, P\. 
Bhattacharya, et al. (2022) Legal case document summarization: extractive and abstractive methods and their evaluation. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Y. He, H. Ji, S. Li, Y. Liu, and C. Chang (Eds.), Online only, pp. 1048–1064. External Links: [Link](https://aclanthology.org/2022.aacl-main.77/), [Document](https://dx.doi.org/10.18653/v1/2022.aacl-main.77) Cited by: [§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px1.p1.1).
- Y. Si, J. Wang, et al. (2019) Enhancing clinical concept extraction with contextual embeddings. Journal of the American Medical Informatics Association 26(11), pp. 1297–1304. Cited by: [§5](https://arxiv.org/html/2605.17379#S5.SS0.SSS0.Px1.p1.1).
- G. Team, A. Kamath, et al. (2025) Gemma 3 technical report. External Links: 2503.19786, [Link](https://arxiv.org/abs/2503.19786) Cited by: [§A.4](https://arxiv.org/html/2605.17379#A1.SS4.p1.1), [§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px5.p1.1), [§7](https://arxiv.org/html/2605.17379#S7.p1.1).
- H. Touvron, L. Martin, et al. (2023) Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. Cited by: [§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px3.p1.1).
- D. Van Veen, C. Van Uden, et al. (2024) Adapted large language models can outperform medical experts in clinical text summarization. Nature Medicine 30(4), pp. 1134–1142. Cited by: [Table 8](https://arxiv.org/html/2605.17379#A1.T8), [§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px1.p1.1), [Table 2](https://arxiv.org/html/2605.17379#S3.T2).
- C. Wu, W. Lin, et al. (2024) PMC-LLaMA: toward building open-source language models for medicine. Journal of the American Medical Informatics Association 31(9), pp. 1833–1843. Cited by: [§1](https://arxiv.org/html/2605.17379#S1.p1.1).
- A. Yamaguchi, A. Villavicencio, and N. Aletras (2024) How can we effectively expand the vocabulary of LLMs with 0.01 GB of target language text? arXiv preprint arXiv:2406.11477. Cited by: [§1](https://arxiv.org/html/2605.17379#S1.p3.1), [§2.4](https://arxiv.org/html/2605.17379#S2.SS4.p2.2).
- N. Zhang, Y. Zhang, et al. (2023) FaMeSumm: investigating and improving faithfulness of medical summarization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 10915–10931. Cited by: [§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px5.p1.1), [§4](https://arxiv.org/html/2605.17379#S4.SS0.SSS0.Px7.p1.1).
- T. Zhang, V. Kishore, et al. (2020) BERTScore: evaluating text generation with BERT. In International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=SkeHuCVFDr) Cited by: [§3](https://arxiv.org/html/2605.17379#S3.SS0.SSS0.Px5.p1.1).

## Appendix A Experiments and Results Details

### A.1 Sampling ICL Demonstration

To sample an ICL demonstration from the training set for each test example in the medical domain, we use cosine similarity over embeddings obtained from a sentence-transformers variant of PubMedBERT ([pritamdeka/PubMedBERT-mnli-snli-scinli-scitail-mednli-stsb](https://huggingface.co/pritamdeka/PubMedBERT-mnli-snli-scinli-scitail-mednli-stsb)). Because documents in the legal domain are extremely long, we instead use a standard BM25 model, following prior work Joshi et al. ([2024](https://arxiv.org/html/2605.17379#bib.bib89)), to retrieve the closest training demonstration for each test point. We use the bm25s library Lù ([2024](https://arxiv.org/html/2605.17379#bib.bib88)) to set up the retriever.

### A.2 Baseline Models

Here we describe the methods evaluated in this work:

#### xx-BASE.
These are the BASE variants of the LLMs (not the instruction-tuned ones): Llama-3.1-8B-BASE and Qwen2.5-7B-BASE.
They have not undergone any vocabulary modification or continual pretraining.

#### CPTOnly (No Vocab Adapt).
These are variants of the BASE LLMs that have undergone no vocabulary modification, only standard continual pretraining on the domain-specific corpora.

#### VocabAdapt W/o Replace.
These are pure vocabulary expansion baselines without any replacement. We take the top-10K vocabulary tokens learned by the domain-specific tokenizers that do not overlap with the BASE LLM vocabulary and add them to the model vocabulary. The expansion and addition procedure is similar to MEDVOC Balde et al. ([2025](https://arxiv.org/html/2605.17379#bib.bib61)): for each vocabulary token to be added, we iteratively add its subwords as obtained from the BASE LLM tokenizer.

#### VocabAdapt W/ Replace.
This is the replacement variant of VocabAdapt W/o Replace. Here, we first replace tokens from the candidate replacement set $V_{\text{cand}}$ (Eq. [2](https://arxiv.org/html/2605.17379#S2.E2)), then expand the BASE LLM vocabulary with the remaining vocabulary tokens.

#### VocabAdaptRefine W/o Replace.
These are vocabulary expansion baselines without any replacement. Before selecting the top-10K tokens for expansion, we apply a refinement step that removes non-standard tokens (mixtures of characters and numbers, characters and punctuation, or numbers and punctuation) that might be inconsistent with BASE LLM tokenization, inspired by MEDVOC-LLM Balde et al. ([2025](https://arxiv.org/html/2605.17379#bib.bib61)). After refinement, we take the top-10K vocabulary tokens learned by the domain-specific tokenizers that do not overlap with the BASE LLM vocabulary and add them to the model vocabulary.
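The refinement and selection steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the regular expressions that decide what counts as a "non-standard" token, and the assumption that candidate tokens arrive as plain surface strings (word-boundary markers stripped) already ranked by the domain-specific tokenizer are all ours.

```python
import re
from typing import Iterable, List, Set

# Assumed character classes; the paper does not spell out the exact rule.
_LETTER = re.compile(r"[^\W\d_]")  # any Unicode letter
_DIGIT = re.compile(r"\d")         # any decimal digit
_PUNCT = re.compile(r"[^\w\s]|_")  # punctuation and symbols, incl. underscore


def is_non_standard(token: str) -> bool:
    """Return True if the token mixes at least two character classes
    (letters+digits, letters+punctuation, or digits+punctuation)."""
    classes = sum(bool(p.search(token)) for p in (_LETTER, _DIGIT, _PUNCT))
    return classes >= 2


def select_expansion_tokens(
    ranked_domain_tokens: Iterable[str],
    base_vocab: Set[str],
    k: int = 10_000,
    refine: bool = True,
) -> List[str]:
    """Pick the top-k domain tokens that are absent from the base vocabulary.

    With refine=True, mixed-class tokens are dropped *before* the top-k cut,
    mirroring the "refine, then select" order described in the text.
    """
    selected: List[str] = []
    seen: Set[str] = set()
    for tok in ranked_domain_tokens:
        if tok in base_vocab or tok in seen:
            continue  # keep only non-overlapping, de-duplicated tokens
        if refine and is_non_standard(tok):
            continue  # hypothetical refinement filter
        selected.append(tok)
        seen.add(tok)
        if len(selected) == k:
            break
    return selected
```

Calling `select_expansion_tokens(tokens, base_vocab, refine=False)` corresponds to the plain expansion setting, while `refine=True` corresponds to the refined one; the subword-closure step borrowed from MEDVOC is not shown here.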
| Setting | Corpus Size | SD Token Count (Llama) | SD Token Count (Qwen) | RS Token Count (Llama) | RS Token Count (Qwen) | SD OOV Conc. (Llama) | SD OOV Conc. (Qwen) | RS OOV Conc. (Llama) | RS OOV Conc. (Qwen) |
|---|---|---|---|---|---|---|---|---|---|
| **Evidence-Based Summarization** | | | | | | | | | |
| Random | 185 | 392 | 410 | 92 | 97 | 7.81 | 7.81 | 8.38 | 8.38 |
| OOV_SD | 185 | 346 | 354 | 75 | 79 | 14.77 | 14.77 | 10.76 | 10.76 |
| OOV_RS | 185 | 382 | 395 | 53 | 54 | 9.93 | 9.93 | 19.10 | 19.10 |
| **Clinical Healthcare Query Summarization** | | | | | | | | | |
| Random | 150 | 79 | 80 | 13 | 13 | 8.32 | 8.33 | 11.07 | 11.07 |
| OOV_SD | 114 | 51 | 52 | 14 | 14 | 23.30 | 23.33 | 15.14 | 15.14 |
| OOV_RS | 143 | 80 | 80 | 15 | 15 | 12.11 | 12.12 | 27.74 | 27.74 |

Table 7: Dataset statistics of summarization datasets under Random, OOV_RS, and OOV_SD settings, reporting mean token counts, OOV concentration (fraction of unigrams in text split more than once).

| Task | Prompt structure |
|---|---|
| Evidence-Based Summarization | `You are an expert medical professional. #### Summarize the given source document in the context of the input query in 100 words or less. Use the examples to guide word choice. Query 1: {Train-Query} Source Document 1: {Train-Source-Document} Query-Focused Summary 1: {Train-Target-Summary} ## Query 2: {Test-Query} Source Document 2: {Test-Source-Document} Query-Focused Summary 2:` |
| Clinical Healthcare Query Summarization | `You are an expert medical professional. #### Summarize the given patient healthcare query into a concise single question of 10 words or less. Use the examples to guide word choice. Patient Health Query 1: {Train-Patient-Query} Summarized Question 1 : {Train-Summarized-Question} ## Patient Health Query 2: {Test-Patient-Query} Discharge Summary 2 :` |

Table 8: The prompt structure used for prompting LLMs inspired from the prompt structure proposed in ClinSumm Van Veen et al. ([2024](https://arxiv.org/html/2605.17379#bib.bib9)). Since we are using BASE LLMs there is no explicit segregation of system prompt and user prompt.

#### VocabAdaptRefine W/ Replace.
This is the replace-then-expand variant of VocabAdaptRefine W/o Replace.

### A.3 Results Trends

We report the full results in Table [9](https://arxiv.org/html/2605.17379#A1.T9).
We now provide discussions as observed across challenging scenarios.

| Model | Vocab Size | Param Incr. | Best Ckpt. | Random: FrSr SD | Random: FrSr RS | Random: R-LCS | Random: BSr | OOV_SD: FrSr SD | OOV_SD: FrSr RS | OOV_SD: R-LCS | OOV_SD: BSr | OOV_RS: FrSr SD | OOV_RS: FrSr RS | OOV_RS: R-LCS | OOV_RS: BSr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **MEDICAL – ClinSumm** | | | | | | | | | | | | | | | |
| Llama-3.1-8B-BASE | 128256 | - | - | 1.16 | 1.16 | 23.03 | 70.66 | 1.25 | 1.27 | 24.39 | 71.35 | 1.20 | 1.34 | 21.33 | 70.41 |
| CPTOnly (No Vocab Adapt) | 128256 | - | 7500 | 1.16 | 1.16 | 24.89 | 75.55 | 1.25 | 1.25 | 26.29 | 76.22 | 1.20 | 1.34 | 23.92 | 75.43 |
| VocabAdapt W/o Replace | 141779 | 110M | 3500 | 1.05 | 1.06 | 24.81 | 75.81 | 1.09 | 1.09 | 25.89 | 76.36 | 1.06 | 1.12 | 23.41 | 75.48 |
| VocabAdapt W/ Replace | 140251 | 98M | 3500 | 1.05 | 1.06 | 24.13 | 75.36 | 1.09 | 1.09 | 25.45 | 75.83 | 1.06 | 1.12 | 22.81 | 75.02 |
| VocabAdaptRefine W/o Replace | 141791 | 110M | 3500 | 1.05 | 1.07 | 24.58 | 75.66 | 1.09 | 1.09 | 26.56 | 76.52 | 1.06 | 1.12 | 23.86 | 75.54 |
| VocabAdaptRefine W/ Replace | 140263 | 98M | 3500 | 1.05 | 1.07 | 24.98 | 75.98 | 1.09 | 1.09 | 26.68 | 76.55 | 1.06 | 1.12 | 23.65 | 75.58 |
| Qwen2.5-7B-BASE | 151665 | - | - | 1.19 | 1.23 | 15.51 | 43.35 | 1.28 | 1.29 | 14.15 | 37.44 | 1.24 | 1.36 | 12.23 | 35.22 |
| CPTOnly (No Vocab Adapt) | 151665 | - | 8000 | 1.19 | 1.23 | 24.73 | 75.32 | 1.28 | 1.29 | 25.96 | 75.90 | 1.24 | 1.36 | 22.41 | 74.44 |
| VocabAdapt W/o Replace | 162738 | 79M | 3500 | 1.09 | 1.10 | 25.04 | 75.72 | 1.12 | 1.11 | 26.11 | 76.15 | 1.10 | 1.14 | 23.12 | 75.21 |
| VocabAdapt W/ Replace | 158751 | 50M | 3500 | 1.09 | 1.10 | 24.67 | 75.41 | 1.12 | 1.11 | 26.43 | 76.19 | 1.10 | 1.14 | 23.06 | 74.98 |
| VocabAdaptRefine W/o Replace | 162745 | 79M | 3500 | 1.08 | 1.09 | 24.22 | 75.26 | 1.12 | 1.11 | 25.76 | 75.84 | 1.10 | 1.14 | 22.40 | 74.77 |
| VocabAdaptRefine W/ Replace | 158758 | 51M | 3500 | 1.08 | 1.09 | 24.68 | 75.51 | 1.12 | 1.11 | 26.21 | 76.12 | 1.10 | 1.14 | 22.81 | 74.93 |
| **MEDICAL – EBM** | | | | | | | | | | | | | | | |
| Llama-3.1-8B-BASE | 128256 | - | - | 1.07 | 1.08 | 18.31 | 67.14 | 1.24 | 1.12 | 17.75 | 71.66 | 1.12 | 1.30 | 15.90 | 68.20 |
| CPTOnly (No Vocab Adapt) | 128256 | 0M | 7500 | 1.07 | 1.08 | 19.99 | 74.90 | 1.24 | 1.12 | 17.76 | 71.30 | 1.12 | 1.30 | 16.58 | 72.56 |
| VocabAdapt W/o Replace | 141779 | 110M | 3500 | 1.01 | 1.01 | 20.18 | 72.40 | 1.09 | 1.02 | 17.12 | 69.10 | 1.03 | 1.12 | 16.44 | 72.21 |
| VocabAdapt W/ Replace | 140251 | 98M | 3500 | 1.01 | 1.01 | 20.20 | 73.99 | 1.09 | 1.02 | 16.85 | 69.73 | 1.03 | 1.12 | 16.30 | 72.23 |
| VocabAdaptRefine W/o Replace | 141791 | 110M | 3500 | 1.01 | 1.01 | 20.60 | 72.38 | 1.09 | 1.02 | 16.85 | 70.45 | 1.03 | 1.12 | 16.34 | 72.44 |
| VocabAdaptRefine W/ Replace | 140263 | 98M | 3500 | 1.01 | 1.01 | 19.43 | 74.28 | 1.09 | 1.02 | 16.74 | 71.05 | 1.03 | 1.12 | 16.24 | 72.23 |
| Qwen2.5-7B | 151665 | - | - | 1.13 | 1.13 | 15.31 | 55.33 | 1.26 | 1.17 | 14.53 | 62.12 | 1.17 | 1.33 | 13.29 | 55.76 |
| CPTOnly (No Vocab Adapt) | 151665 | 0M | 8000 | 1.13 | 1.13 | 17.79 | 63.70 | 1.26 | 1.17 | 14.39 | 57.32 | 1.17 | 1.33 | 12.27 | 55.93 |
| VocabAdapt W/o Replace | 162738 | 79M | 3500 | 1.05 | 1.05 | 20.06 | 73.13 | 1.12 | 1.07 | 16.58 | 71.44 | 1.07 | 1.14 | 15.08 | 70.10 |
| VocabAdapt W/ Replace | 158751 | 50M | 3500 | 1.05 | 1.05 | 19.77 | 73.90 | 1.12 | 1.07 | 16.56 | 71.43 | 1.07 | 1.14 | 15.19 | 70.81 |
| VocabAdaptRefine W/o Replace | 162745 | 79M | 3500 | 1.05 | 1.05 | 19.48 | 70.15 | 1.12 | 1.07 | 15.40 | 64.45 | 1.07 | 1.14 | 14.80 | 63.72 |
| VocabAdaptRefine W/ Replace | 158758 | 51M | 3500 | 1.05 | 1.05 | 18.99 | 69.80 | 1.12 | 1.07 | 15.41 | 66.21 | 1.07 | 1.14 | 14.78 | 66.32 |
| **MEDICAL – CHQ** | | | | | | | | | | | | | | | |
| Llama-3.1-8B-BASE | 128256 | - | - | 1.08 | 1.17 | 43.81 | 83.44 | 1.35 | 1.25 | 50.06 | 85.11 | 1.39 | 1.51 | 48.54 | 84.23 |
| CPTOnly (No Vocab Adapt) | 128256 | 0M | 7500 | 1.08 | 1.17 | 43.30 | 83.67 | 1.35 | 1.25 | 51.32 | 85.94 | 1.39 | 1.51 | 47.69 | 84.93 |
| VocabAdapt W/o Replace | 141779 | 110M | 3500 | 1.06 | 1.08 | 43.69 | 83.98 | 1.28 | 1.13 | 50.37 | 86.03 | 1.09 | 1.27 | 46.88 | 84.52 |
| VocabAdapt W/ Replace | 140251 | 98M | 3500 | 1.06 | 1.08 | 42.99 | 83.81 | 1.28 | 1.13 | 51.41 | 85.94 | 1.09 | 1.27 | 46.67 | 84.38 |
| VocabAdaptRefine W/o Replace | 141791 | 110M | 3500 | 1.06 | 1.08 | 42.14 | 83.51 | 1.28 | 1.13 | 50.22 | 85.28 | 1.09 | 1.27 | 46.58 | 84.62 |
| VocabAdaptRefine W/ Replace | 140263 | 98M | 3500 | 1.06 | 1.08 | 41.58 | 83.31 | 1.28 | 1.13 | 50.60 | 85.69 | 1.09 | 1.27 | 46.82 | 84.70 |
| Qwen2.5-7B | 151665 | - | - | 1.09 | 1.17 | 40.59 | 83.84 | 1.36 | 1.25 | 46.56 | 85.17 | 1.52 | 1.51 | 49.60 | 85.27 |
| CPTOnly (No Vocab Adapt) | 151665 | 0M | 8000 | 1.09 | 1.17 | 43.09 | 84.02 | 1.36 | 1.25 | 51.44 | 85.99 | 1.52 | 1.51 | 48.67 | 84.75 |
| VocabAdapt W/o Replace | 162738 | 79M | 3500 | 1.07 | 1.08 | 42.09 | 83.55 | 1.29 | 1.13 | 49.22 | 85.90 | 1.09 | 1.27 | 48.08 | 84.99 |
| VocabAdapt W/ Replace | 158751 | 50M | 3500 | 1.07 | 1.08 | 41.28 | 83.24 | 1.29 | 1.13 | 50.00 | 86.20 | 1.09 | 1.27 | 48.25 | 85.12 |
| VocabAdaptRefine W/o Replace | 162745 | 79M | 3500 | 1.07 | 1.08 | 41.60 | 83.77 | 1.29 | 1.13 | 47.86 | 85.74 | 1.09 | 1.27 | 47.26 | 85.10 |
| VocabAdaptRefine W/ Replace | 158758 | 51M | 3500 | 1.07 | 1.08 | 42.85 | 83.83 | 1.29 | 1.13 | 47.41 | 85.15 | 1.09 | 1.27 | 48.22 | 85.19 |
| **LEGAL** | | | | | | | | | | | | | | | |
| Llama-3.1-8B-BASE | 128256 | - | - | 1.03 | 1.03 | 25.89 | 67.04 | 1.08 | 1.06 | 24.82 | 64.10 | 1.06 | 1.08 | 24.86 | 65.28 |
| CPTOnly (No Vocab Adapt) | 128256 | - | 10000 | 1.03 | 1.03 | 25.36 | 69.05 | 1.08 | 1.06 | 24.83 | 68.04 | 1.06 | 1.08 | 24.36 | 68.38 |
| VocabAdapt W/o Replace | 139470 | 91M | 6500 | 1.01 | 1.01 | 25.47 | 69.05 | 1.01 | 1.01 | 24.17 | 67.45 | 1.01 | 1.01 | 23.71 | 67.64 |
| VocabAdapt W/ Replace | 137942 | 79M | 6500 | 1.01 | 1.01 | 25.42 | 69.14 | 1.01 | 1.01 | 24.89 | 68.12 | 1.01 | 1.01 | 23.92 | 68.11 |
| VocabAdaptRefine W/o Replace | 139653 | 93M | 6500 | 1.01 | 1.01 | 24.91 | 68.86 | 1.01 | 1.01 | 24.49 | 67.72 | 1.01 | 1.01 | 23.59 | 67.71 |
| VocabAdaptRefine W/ Replace | 138125 | 81M | 6500 | 1.01 | 1.01 | 24.61 | 68.59 | 1.01 | 1.01 | 23.93 | 67.28 | 1.01 | 1.01 | 23.41 | 67.55 |
| Qwen2.5-7B-BASE | 151665 | - | - | 1.06 | 1.06 | 10.63 | 28.25 | 1.11 | 1.10 | 9.97 | 27.64 | 1.09 | 1.12 | 9.90 | 27.15 |
| CPTOnly (No Vocab Adapt) | 151665 | - | 10500 | 1.06 | 1.06 | 25.68 | 69.04 | 1.11 | 1.10 | 25.16 | 67.60 | 1.09 | 1.12 | 24.72 | 67.97 |
| VocabAdapt W/o Replace | 162206 | 76M | 6500 | 1.02 | 1.02 | 23.31 | 66.83 | 1.05 | 1.05 | 22.22 | 65.25 | 1.04 | 1.05 | 21.95 | 65.67 |
| VocabAdapt W/ Replace | 158219 | 47M | 6500 | 1.02 | 1.02 | 24.12 | 67.89 | 1.05 | 1.05 | 23.32 | 66.25 | 1.04 | 1.05 | 22.91 | 66.94 |
| VocabAdaptRefine W/o Replace | 162352 | 77M | 6500 | 1.02 | 1.02 | 23.54 | 67.29 | 1.05 | 1.05 | 23.08 | 66.04 | 1.04 | 1.05 | 22.26 | 66.13 |
| VocabAdaptRefine W/ Replace | 158365 | 48M | 6500 | 1.02 | 1.02 | 24.23 | 67.89 | 1.05 | 1.05 | 23.60 | 66.19 | 1.05 | 1.05 | 23.22 | 66.89 |

Table 9: Comparison of vocabulary adaptation methods across different Legal and Medical domains using in-context learning with 1 ICL demonstration in two challenging scenarios (OOV_RS and OOV_SD) and a Random subset. Performance is measured using Rouge-LCS (R-LCS) and BertScore (BSr) metrics.

#### EBM and CHQ results.
The dataset statistics for these datasets is reported in Table [7](https://arxiv.org/html/2605.17379#A1.T7). The prompt structure used for inference is reported in Table [8](https://arxiv.org/html/2605.17379#A1.T8).
We evaluate following our challenge-oriented protocol, focusing on data points that are difficult to generate (OOV_RS) and difficult to encode (OOV_SD). VocabAdapt outperforms BASE in 11 of 16 comparisons and CPTOnly in 9 of 16 comparisons across the OOV_RS and OOV_SD scenarios, the same behavior observed in the main paper. This supports the generalizability of our approach across a broader set of tasks.

#### OOV_SD (High OOV in Source Documents).
In the OOV_SD setting, the test set is explicitly constructed from documents whose *inputs* exhibit the highest OOV concentration, making accurate content understanding and alignment particularly challenging. Results show that BASE and CPTOnly models degrade noticeably in both R-LCS and BERTScore, indicating that continual pretraining alone is insufficient when the source text itself is dominated by unseen or poorly tokenized terms. Vocabulary adaptation methods consistently improve performance, demonstrating that better lexical coverage at the input level directly enhances content selection and factual grounding in summaries. Refinement-based expansion yields the most stable gains, suggesting that removing noisy or irregular candidate tokens before expansion helps the model form cleaner input representations. Replacement-based variants offer additional improvements in some cases, but the gains are less uniform, highlighting the sensitivity of source-side comprehension to overly aggressive vocabulary restructuring. Overall, OOV_SD emphasizes the importance of robust input tokenization, where accurate segmentation of domain-specific terms is critical for downstream summarization quality.

#### OOV_RS (High OOV in Reference Summaries).
In contrast, OOV_RS focuses on data points where the *references*, rather than the sources, contain high OOV concentration, stressing the model's ability to generate or align with rare and domain-specific lexical forms. While BASE and CPTOnly models perform reasonably on the Random split, they lag behind vocabulary-adapted models in OOV_RS, particularly in R-LCS, indicating difficulty in matching reference phrasing and terminology. Vocabulary expansion significantly narrows this gap, with consistent improvements across backbones, confirming that enhanced output-side lexical expressivity enables closer overlap with reference summaries. Refinement again proves beneficial by stabilizing gains across both metrics, whereas replacement-based methods yield modest but less consistent improvements. The contrast between OOV_SD and OOV_RS highlights that source-side OOV primarily affects content understanding, while reference-side OOV impacts lexical realization, and effective vocabulary adaptation is crucial in addressing both challenges beyond what is observed in the Random setting.

#### Implications.
Taken together, these results demonstrate that the benefits of vocabulary expansion scale with the severity and location of vocabulary mismatch. When OOVs are heavily concentrated in reference summaries (OOV_RS), vocabulary expansion directly improves generation fidelity. When OOVs originate in the source document (OOV_SD), vocabulary expansion becomes critical, yielding the largest and most consistent improvements. These findings highlight vocabulary mismatch as a bottleneck in expert domain adaptation and suggest that vocabulary adaptation should be applied selectively, in domains where vocabulary mismatch is indeed a significant problem.

Figure 2: We report the score distribution obtained from our LLM-as-a-Judge evaluation for the medical domain. Across each dimension, the bars on the left correspond to the Llama model and the bars on the right to the Qwen model.
We note that VocabAdapt consistently receives more scores of 4 or 5 than CPTOnly.

Figure 3: We report the score distribution obtained from our LLM-as-a-Judge evaluation for the legal domain. Across each dimension, the bars on the left correspond to the Llama model and the bars on the right to the Qwen model.

### A.4 LLM-as-a-Judge Evaluation

We conduct an LLM-as-a-judge (LlaJ) evaluation of summaries generated in the medical and legal domains. We gather a uniform random subset of 100 summaries for the medical domain and 20 summaries for the legal domain from the OOV_SD and OOV_RS settings, distributed uniformly across the Llama and Qwen models. We use models from Google's Gemma3 family Team et al. ([2025](https://arxiv.org/html/2605.17379#bib.bib68)): MedGemma-27B-text-it ([google/medgemma-27b-text-it](https://huggingface.co/google/medgemma-27b-text-it)) as our judge model for the medical domain, and Gemma3-27B-it ([google/gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it)) as our judge model for the legal domain. The judge model receives the source document and a generated summary as input. It is then asked to rate the generated summary on three dimensions in three separate runs: (i) coherence, (ii) relevance, and (iii) faithfulness, each on a scale of 1-5 (higher is better). This is done for both VocabAdapt and CPTOnly summaries. The detailed system prompts and user prompts for each setting are available in the codebase inside the "Random-Eval-LlaJ" folder. Our final scores are reported as averages across summary pairs in Table [4](https://arxiv.org/html/2605.17379#S4.T4) and Figures [2](https://arxiv.org/html/2605.17379#A1.F2) and [3](https://arxiv.org/html/2605.17379#A1.F3).
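To make the aggregation step concrete, the sketch below shows one way to turn per-summary judge ratings into per-dimension averages and 1-5 score distributions of the kind described above. The data layout and the `summarize_ratings` helper are hypothetical and only illustrate the bookkeeping; the actual prompts and parsing are in the authors' codebase.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List

# The three dimensions rated by the judge model, as listed in the text.
DIMENSIONS = ("coherence", "relevance", "faithfulness")


def summarize_ratings(ratings: List[Dict[str, int]]) -> Dict[str, Dict[str, object]]:
    """Aggregate judge ratings for one method (e.g. VocabAdapt or CPTOnly).

    `ratings` holds one dict per summary, mapping each dimension to an
    integer score in 1..5 (assumed to be already parsed from judge output).
    """
    summary: Dict[str, Dict[str, object]] = {}
    for dim in DIMENSIONS:
        scores = [r[dim] for r in ratings]
        if any(s not in range(1, 6) for s in scores):
            raise ValueError(f"scores for {dim!r} must be integers in 1..5")
        counts = Counter(scores)
        summary[dim] = {
            "mean": mean(scores),
            # Histogram over the full 1..5 scale, including empty bins.
            "distribution": {s: counts.get(s, 0) for s in range(1, 6)},
            # Share of summaries rated 4 or 5.
            "share_4_or_5": (counts[4] + counts[5]) / len(scores),
        }
    return summary
```

Running the helper once per method and comparing the `share_4_or_5` values gives the kind of side-by-side comparison shown in the score-distribution figures.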
### A.5 Supplementary Human Evaluation

Figure 4: Annotation instructions shown to the participants.

We conducted an additional human assessment of fluency, consistency, relevance, and coherence, each rated on a scale of 1-5, on a subset of 20 summaries for the medical domain. The annotation instructions are shown in Figure [4](https://arxiv.org/html/2605.17379#A1.F4). The evaluation was conducted on the Prolific platform with the following participation criteria:

- Highest education level completed: Undergraduate degree (BA/BS/other) OR Graduate degree (MA/MS/MPhil/other) OR Doctorate degree (PhD/other)
- Employment status: Full-Time OR Part-Time OR Due to start a new job within the next month
- Subject: Biochemistry (Molecular and Cellular) OR Biological Sciences OR Biology OR Biomedical Sciences

Each annotator was shown five summary pairs, and each summary pair was evaluated independently by three annotators. The median time to complete the study was 20 minutes. In total, 12 annotators were hired for the evaluation task. All annotators were compensated at a rate of GBP 9 per hour. The average results across each category of annotation are shown in Table [10](https://arxiv.org/html/2605.17379#A1.T10).

Table 10: Human evaluation trends comparing the competing baseline and our proposed VocabAdapt method.

The human evaluation exhibits the same overall trend as the LLM-as-a-Judge results: vocabulary adaptation models generate better summaries than their CPTOnly counterparts, reinforcing the validity of our conclusions. However, conducting domain-specific human evaluation presented substantial practical challenges. Evaluating only 20 summary pairs in the medical domain cost approximately GBP 48 via Prolific (GBP 36 for annotators and GBP 12 in platform fees). In the legal domain, Prolific does not even provide a sufficiently large or appropriate participant pool to enable reliable evaluation.
These constraints make large\-scale domain\-expert evaluation difficult to sustain\. We therefore view LLM\-as\-a\-Judge not as a replacement for human evaluation, but as a scalable and reproducible alternative that is particularly valuable in settings where domain expertise is limited, costly, or difficult to source\. Our results demonstrate that, when validated against human judgments, it provides consistent and reliable comparative assessment\.
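As a rough consistency check on the study figures reported above, the following sketch recomputes the number of annotators and the annotator payment from the stated design. It assumes every annotator is paid for exactly the median completion time of 20 minutes, which is our simplification; the function and parameter names are illustrative only.

```python
from typing import Tuple


def human_eval_budget(
    n_pairs: int = 20,              # summary pairs evaluated
    ratings_per_pair: int = 3,      # independent annotators per pair
    pairs_per_annotator: int = 5,   # pairs shown to each annotator
    median_minutes: float = 20.0,   # median completion time
    hourly_rate_gbp: float = 9.0,   # compensation rate
) -> Tuple[int, float]:
    """Return (number of annotators, total annotator payment in GBP)."""
    total_ratings = n_pairs * ratings_per_pair
    # Ceiling division: a partially filled batch still needs one annotator.
    n_annotators = -(-total_ratings // pairs_per_annotator)
    annotator_cost = n_annotators * median_minutes / 60 * hourly_rate_gbp
    return n_annotators, annotator_cost


annotators, annotator_cost = human_eval_budget()
platform_fee_gbp = 12.0  # taken directly from the reported figures
total_cost = annotator_cost + platform_fee_gbp
print(annotators, annotator_cost, total_cost)
```

Under these assumptions the sketch yields 12 annotators, GBP 36 in annotator payments, and GBP 48 in total, matching the reported numbers.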