From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction

arXiv cs.CL 05/13/26, 04:00 AM Papers
prompt-compression clinical-ai efficiency tokenization healthcare llm-optimization
Summary
This paper introduces MedTPE, a method for efficient, lossless prompt compression of electronic health records for large language models, significantly reducing token length and inference latency in clinical prediction tasks.
arXiv:2605.11774v1 Announce Type: new Abstract: By processing electronic health records (EHRs) as natural language sequences, large language models (LLMs) have shown potential in clinical prediction tasks such as mortality prediction and phenotyping. However, longitudinal or highly frequent EHRs often yield excessively long token sequences that result in high computational costs and even reduced performance. Existing solutions either add modules for compression or remove less important tokens, which introduce additional inference latency or risk losing clinical information. To achieve lossless compression of token sequences without additional cost or loss of performance, we propose Medical Token-Pair Encoding (MedTPE), a layered method that extends standard tokenisation for EHR sequences. MedTPE merges frequently co-occurring medical token pairs into composite tokens, providing lossless compression while preserving the computational complexity through a dependency-aware replacement strategy. Only the embeddings of the newly introduced tokens of merely 0.5-1.0% of the LLM's parameters are fine-tuned via self-supervised learning. Experiments on real-world datasets for two clinical scenarios demonstrate that MedTPE reduces input token length by up to 31% and inference latency by 34-63%, while maintaining or even improving both predictive performance and output format compliance across multiple LLMs and four clinical prediction tasks. Furthermore, MedTPE demonstrates robustness across different input context lengths and generalisability to scientific and financial domains and different languages.
Cached at: 05/13/26, 06:18 AM
# From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction
Source: [https://arxiv.org/html/2605.11774](https://arxiv.org/html/2605.11774)
###### Abstract

By processing electronic health records (EHRs) as natural language sequences, large language models (LLMs) have shown potential in clinical prediction tasks such as mortality prediction and phenotyping. However, longitudinal or highly frequent EHRs often yield excessively long token sequences that result in high computational costs and even reduced performance. Existing solutions either add modules for compression or remove less important tokens, which introduce additional inference latency or risk losing clinical information. To achieve lossless compression of token sequences without additional cost or loss of performance, we propose Medical Token-Pair Encoding (MedTPE), a layered method that extends standard tokenisation for EHR sequences. MedTPE merges frequently co-occurring medical token pairs into composite tokens, providing lossless compression while preserving the computational complexity through a dependency-aware replacement strategy. Only the embeddings of the newly introduced tokens, amounting to merely 0.5-1.0% of the LLM's parameters, are fine-tuned via self-supervised learning. Experiments on real-world datasets for two clinical scenarios demonstrate that MedTPE reduces input token length by up to 31% and inference latency by 34-63%, while maintaining or even improving both predictive performance and output format compliance across multiple LLMs and four clinical prediction tasks. Furthermore, MedTPE demonstrates robustness across different input context lengths and generalisability to scientific and financial domains and different languages. The code is available in the [GitHub repository](https://github.com/JasonZuu/MedTPE).

Machine Learning, ICML

![Refer to caption](https://arxiv.org/html/2605.11774v1/x1.png)

Figure 1: Illustration of the LLM-based clinical prediction.

## 1 Introduction

Electronic health records (EHRs) document a longitudinal timeline of clinical events, including diagnosis, discharge notes, laboratory results, vital signs, medications, and procedures (Theodorou et al., [2023](https://arxiv.org/html/2605.11774#bib.bib32)). By transforming these clinical events into natural language sequences, large language models (LLMs) can capture temporal and contextual patterns across the entire clinical trajectory, thereby supporting patient care and decision-making within healthcare systems (Lee et al., [2020](https://arxiv.org/html/2605.11774#bib.bib18); Zhu et al., [2026](https://arxiv.org/html/2605.11774#bib.bib68)). Recent studies have reported that LLMs can perform a range of clinical prediction tasks, such as mortality prediction and phenotyping, in zero-shot settings and produce human-readable explanations (Renc et al., [2024](https://arxiv.org/html/2605.11774#bib.bib12); Cui et al., [2025](https://arxiv.org/html/2605.11774#bib.bib20); Williams et al., [2024](https://arxiv.org/html/2605.11774#bib.bib19)). Specifically, this paradigm combines the EHR sequence in text with a task-specific prompt, allowing the LLM to generate both predictions and free-text explanations. This offers promising solutions for both predictive performance and interpretability, as illustrated in Figure [1](https://arxiv.org/html/2605.11774#S0.F1).

While LLM-based clinical prediction seems promising, the transformation of longitudinal medical records often produces token sequences that exceed the context window limitations of most pre-trained LLMs (Wornow et al., [2025](https://arxiv.org/html/2605.11774#bib.bib21)). For example, even a single stay in the intensive care unit (ICU) can result in a token sequence longer than 64,000 tokens due to the high frequency of clinical events (Fleming et al., [2024](https://arxiv.org/html/2605.11774#bib.bib2)). Such lengthy sequences increase computational requirements, slow inference, and limit the feasibility of context-length and test-time scaling strategies that can improve the clinical prediction performance of LLMs. This inefficiency arises because widely used tokenisation algorithms, such as Byte-Pair Encoding (BPE), WordPiece, and SentencePiece (Sennrich et al., [2016](https://arxiv.org/html/2605.11774#bib.bib6); Song et al., [2021](https://arxiv.org/html/2605.11774#bib.bib7); Kudo and Richardson, [2018](https://arxiv.org/html/2605.11774#bib.bib8)), were originally optimised for general language modelling and are not suitable for the complex and specialised vocabulary of clinical text. As a result, medical terms are fragmented into multiple subword tokens, unnecessarily extending the sequence length and computation (Yu, [2025](https://arxiv.org/html/2605.11774#bib.bib22)). For example, a standard tokeniser divides the single clinical concept “Spirometry” into three separate tokens, [‘Spi’, ‘rom’, ‘etry’], rather than processing it as a unified term.

To address the challenge of long token sequences in EHRs, several approaches have been proposed, but each has notable limitations in the clinical context. One method is to develop a medical-specific vocabulary from medical corpora, which helps to avoid over-fragmented tokens (Bolton et al., [2024](https://arxiv.org/html/2605.11774#bib.bib10); Kim et al., [2024](https://arxiv.org/html/2605.11774#bib.bib11); Renc et al., [2024](https://arxiv.org/html/2605.11774#bib.bib12)). However, this method requires the resource-intensive retraining of the entire LLM and may compromise the core capabilities of pre-trained LLMs. Removal-based compression methods, which discard less important tokens from the input, do not require model retraining, but risk omitting clinically significant information (Liskavets et al., [2025](https://arxiv.org/html/2605.11774#bib.bib34); Jiang et al., [2023](https://arxiv.org/html/2605.11774#bib.bib35); Pan et al., [2024](https://arxiv.org/html/2605.11774#bib.bib36)). Merge-based compression methods, which dynamically merge medical tokens during inference, enable lossless prompt compression, but often introduce additional parameters or modules, thereby increasing inference latency and model complexity (Nakash et al., [2025](https://arxiv.org/html/2605.11774#bib.bib23); Han et al., [2025](https://arxiv.org/html/2605.11774#bib.bib24); Harvill et al., [2025](https://arxiv.org/html/2605.11774#bib.bib33)). Consequently, there remains a need for a medical tokenisation approach that achieves lossless compression, maintains compatibility with pre-trained LLMs, and does not add further space or computational overhead.

To address the limitations of existing compression methods in the clinical context, we propose medical token\-pair encoding \(MedTPE\), as illustrated in Figure[2](https://arxiv.org/html/2605.11774#S2.F2)\. Specifically, MedTPE operates in three steps to achieve efficient lossless compression of token sequences in EHRs\. First, it discovers and merges frequently co\-occurring token pairs from EHR sequences, creating TPE tokens tailored to medical text\. Next, MedTPE employs a dependency\-aware replacement strategy, substituting approximately 3% of the least common original tokens in the pre\-trained LLM’s vocabulary with the most common TPE tokens\. This strategy preserves the original vocabulary size and model parameters while ensuring the integrity of the original tokenisation process, retaining the same computational complexity as standard tokenisation methods\. Finally, only the embeddings of the new TPE tokens are fine\-tuned via self\-supervised learning, while all other model parameters remain fixed\. By design, MedTPE increases the information density of each token, allowing for a more compact representation of the EHR sequence within the model’s fixed context constraints\. Overall, MedTPE delivers efficient and lossless compression that integrates smoothly with pre\-trained LLMs, improving the inference efficiency of LLMs in clinical prediction tasks\. Our main contributions are as follows:

- **Tokenisation-driven compression for clinical prediction.** We are the first to address the challenge of long token sequences of EHR in LLM-based clinical prediction by optimising the tokenisation process. The proposed method achieves lossless compression of EHR sequences, enhancing the efficiency of LLMs in clinical prediction across different tokenisers and LLM backbones.
- **Efficient and label-free tokenisation.** MedTPE preserves the computational complexity of standard tokenisation by maintaining the original tokenisation rules through the dependency-aware replacement. Furthermore, it fine-tunes the embeddings of new tokens with self-supervised learning, enabling embedding alignment without requiring any labelled data.
- **Compression without performance loss.** MedTPE achieves substantial compression while maintaining and even improving predictive performance on clinical tasks for the highly frequent and heterogeneous ICU scenario and the long and sparse longitudinal care. Beyond surpassing state-of-the-art compression strategies, MedTPE exhibits robustness across varying context lengths and demonstrates strong generalisability across clinical narratives, scientific reasoning, and financial summarisation.

## 2 Related Work

#### LLM\-based Prediction on EHRs

Recent studies have leveraged LLMs for clinical prediction based on EHR (Chen et al., [2024a](https://arxiv.org/html/2605.11774#bib.bib1); Fleming et al., [2024](https://arxiv.org/html/2605.11774#bib.bib2); Niu et al., [2024](https://arxiv.org/html/2605.11774#bib.bib3); Wu et al., [2024](https://arxiv.org/html/2605.11774#bib.bib4)). EHR-KnowGen (Niu et al., [2024](https://arxiv.org/html/2605.11774#bib.bib3)) extracted targeted subsets of medical events and laboratory results from EHR sequences, presenting them as narrative summaries for input to the LLM. ClinicalBench (Chen et al., [2024a](https://arxiv.org/html/2605.11774#bib.bib1)) converted diagnosis, procedure, and medication codes into descriptive sentences to enrich the clinical context available to the model. Similarly, Llemr (Wu et al., [2024](https://arxiv.org/html/2605.11774#bib.bib4)) transformed the entire set of EHR events into descriptive sentences, embedding these events using ClinicalBERT (Alsentzer et al., [2019](https://arxiv.org/html/2605.11774#bib.bib5)) before passing them on to the LLM. Moreover, MedAlign (Fleming et al., [2024](https://arxiv.org/html/2605.11774#bib.bib2)) transformed complete patient event histories into XML-formatted text for LLM input. Despite their successes, these approaches encountered challenges with excessively long token sequences, which they addressed either by omitting portions of clinical events, at the risk of losing important information, or by employing hybrid encoding schemes that increase model complexity.

#### Tokenisation and Compression Strategies

Modern LLMs commonly use subword tokenisation methods to represent rare or out-of-vocabulary words using a fixed-size vocabulary, such as BPE (Sennrich et al., [2016](https://arxiv.org/html/2605.11774#bib.bib6)), WordPiece (Song et al., [2021](https://arxiv.org/html/2605.11774#bib.bib7)), and SentencePiece (Kudo and Richardson, [2018](https://arxiv.org/html/2605.11774#bib.bib8)). BPE iteratively merges frequent adjacent characters into subword tokens. WordPiece optimises merges based on language model predictability, while SentencePiece selects tokens through iterative probabilistic pruning. Although effective for general text, these tokenisers often over-segment specialised medical terms, leading to longer token sequences, higher computational costs, and reduced semantic cohesion (Hasan et al., [2024](https://arxiv.org/html/2605.11774#bib.bib9)).

To address this, prompt compression methods have been proposed to reduce the input sequence length, generally falling into two categories: removal-based and merge-based. Removal-based methods evaluate the importance of individual tokens or sentences and selectively remove those deemed less relevant from the input sequence. Specifically, LLMLingua (Jiang et al., [2023](https://arxiv.org/html/2605.11774#bib.bib35)) and LLMLingua2 (Pan et al., [2024](https://arxiv.org/html/2605.11774#bib.bib36)) implement token-level compression by estimating token importance and discarding lower-ranked tokens. In contrast, CPC (Liskavets et al., [2025](https://arxiv.org/html/2605.11774#bib.bib34)) operates at the sentence level, measuring the semantic relevance between each context sentence and the query, subsequently retaining only those most relevant to the given question. However, these methods risk discarding diagnostic nuances that are essential for clinical fidelity, potentially compromising the performance of clinical predictions.

Merge-based methods, conversely, create domain-specific tokens by aggregating frequent co-occurring units. For example, AdaptiVocab (Nakash et al., [2025](https://arxiv.org/html/2605.11774#bib.bib23)) dynamically replaces less useful tokens with domain-specific ones during inference, thus increasing computational complexity. The meta-token method named LTSC (Harvill et al., [2025](https://arxiv.org/html/2605.11774#bib.bib33)) replaces co-occurring tokens with a single meta-token, also requiring dynamic inference-time replacement. Both of these approaches require supervised alignment of the newly introduced embeddings with labelled data to ensure the model remains effective within the target domain. Similarly, ZeTT (Minixhofer et al., [2024](https://arxiv.org/html/2605.11774#bib.bib52)) uses a hypernetwork to generate embeddings for new tokens by aggregating information from their original constituent sub-tokens, thereby extending both the original vocabulary and the embedding space. In contrast, our method maintains the original vocabulary size and model parameter count, preserves the computational complexity of the original tokenisation, and eliminates the need for labelled data using self-supervised learning. More information can be found in Table [1](https://arxiv.org/html/2605.11774#S2.T1).

Table 1: Summary of prompt compression methods, describing methods that achieve lossless compression and label-free training, maintain their original vocabulary size and parameter count, and their tokenisation complexity (where $n$ is the sequence length).

![Refer to caption](https://arxiv.org/html/2605.11774v1/x2.png)

Figure 2: Overview of MedTPE tokenisation and its integration with LLMs. (a) Token-pair encoding: MedTPE identifies frequently co-occurring pairs in a medical corpus to form unified TPE tokens. (b) Dependency-aware replacement: The vocabulary is optimised by replacing low-utility general tokens (e.g., replacing “Cat” with “Spirometry”) with high-value medical tokens, while strictly retaining all dependent sub-tokens to preserve the original tokenisation logic. (c) Self-supervised fine-tuning (SSFT): The original LLM processes the input (“Incentive Spirometry”) to generate pseudo-labels. These labels supervise the fine-tuning of *only* the new TPE token embeddings, aligning them with the pre-trained latent space while the rest of the model remains frozen.

## 3 Preliminaries

Formally, we represent the longitudinal EHR for a patient $i \in \{1, \dots, N_p\}$ as a sequence of timestamped events $\mathbf{E}^{(i)} = \{e^{(i)}_j\}_{j=1}^{T^{(i)}}$. Each event is defined as a tuple $e^{(i)}_j = (c^{(i)}_j, o^{(i)}_j, t^{(i)}_j)$, comprising a clinical concept $c^{(i)}_j \in \mathcal{C}$ (e.g., diagnosis, medication code), an observation value $o^{(i)}_j \in \mathcal{O}$ (e.g., lab result), and a timestamp $t^{(i)}_j \in \mathbb{R}$. The sequence is ordered chronologically such that $t^{(i)}_j \leq t^{(i)}_{j+1}$, allowing multiple events to occur at the same timestamp. Subsequently, each structured event is transformed into a natural language sequence $s^{(i)}_j = \phi(e^{(i)}_j)$ via a data-to-text function $\phi: \mathcal{C} \times \mathcal{O} \times \mathbb{R} \rightarrow \mathcal{S}$. These sequences are concatenated chronologically to form the full patient history:

$$S^{(i)} = s^{(i)}_1 \oplus s^{(i)}_2 \oplus \dots \oplus s^{(i)}_{T^{(i)}}. \tag{1}$$

Finally, this aggregate text $S^{(i)}$ is processed by a tokeniser $\tau: \mathcal{S} \rightarrow \mathcal{V}^{*}$ to yield a discrete token sequence as

$$\mathcal{X}^{(i)} = \{x^{(i)}_n\}_{n=1}^{L^{(i)}}, \qquad x^{(i)}_n \in \mathcal{V}, \tag{2}$$

where $\mathcal{V}$ is the tokeniser vocabulary and $L^{(i)}$ is the sequence length. For clinical prediction, a natural language prompt $s_{\mathrm{pmt}}$ (e.g., “What is the discharge diagnosis?”) is tokenised into a sequence $\mathbf{P} = \{p_m\}_{m=1}^{M}$ and concatenated to form the final input:

$$\mathcal{X}'^{(i)} = \mathcal{X}^{(i)} \oplus \mathbf{P}. \tag{3}$$

A pre-trained autoregressive model $f_{\theta}$ generates the output sequence $\mathcal{G}^{(i)} = (g^{(i)}_1, \dots, g^{(i)}_{K^{(i)}})$ by modelling the conditional probability:

$$p(\mathcal{G}^{(i)} \mid \mathcal{X}'^{(i)}; \theta) = \prod_{k=1}^{K^{(i)}} p(g^{(i)}_k \mid \mathcal{X}'^{(i)}, g^{(i)}_{<k}; \theta), \tag{4}$$

where $K^{(i)}$ is the output length. A task-specific extraction function $\mathit{ext}: \mathcal{V}^{*} \rightarrow \mathcal{Y}$ then maps the generated text to a clinical prediction:

$$\hat{y}^{(i)} = \mathit{ext}(\mathcal{G}^{(i)}). \tag{5}$$

In the following, we use $x$ to denote a generic token.
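
The pipeline in Eqs. (1)-(5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Event` class, the text template inside `phi`, the whitespace `toy_tokenise` stand-in for $\tau$, and the yes/no rule in `ext` are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    concept: str   # c_j, e.g. a lab test or medication name
    value: str     # o_j, e.g. a lab result
    time: float    # t_j, assumed here to be hours since admission


def phi(event: Event) -> str:
    """Hypothetical data-to-text template; the paper's format may differ."""
    return f"[{event.time:.1f}h] {event.concept}: {event.value}."


def build_history(events: List[Event]) -> str:
    """Eq. (1): order events chronologically and concatenate their text."""
    ordered = sorted(events, key=lambda e: e.time)  # stable for equal timestamps
    return " ".join(phi(e) for e in ordered)


def toy_tokenise(text: str) -> List[str]:
    """Whitespace stand-in for tau; real LLMs use subword tokenisers."""
    return text.split()


def build_input(events: List[Event], prompt: str) -> List[str]:
    """Eqs. (2)-(3): tokenise the history, then append the tokenised prompt."""
    return toy_tokenise(build_history(events)) + toy_tokenise(prompt)


def ext(generated: str) -> Optional[int]:
    """Toy extraction function: map generated text to a binary label."""
    text = generated.strip().lower()
    if text.startswith("yes"):
        return 1
    if text.startswith("no"):
        return 0
    return None  # output did not follow the expected format


events = [Event("Heart rate", "88 bpm", 2.0), Event("Lactate", "2.1 mmol/L", 0.5)]
tokens = build_input(events, "Will the patient die?")
```

Returning `None` from `ext` for unparseable generations is one simple way to separate format failures from wrong predictions.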

## 4 Methodology

The core design of MedTPE, as illustrated in Figure [2](https://arxiv.org/html/2605.11774#S2.F2), consists of three modules that transform the original tokenisation into a medical-domain-optimised process. First, TPE mines a medical corpus to identify and merge frequently co-occurring original tokens into composite TPE tokens. Second, dependency-aware replacement integrates these TPE tokens into the original vocabulary while preserving all merge dependencies, thus keeping the vocabulary size and tokenisation complexity unchanged. Third, a light self-supervised fine-tuning (SSFT) step aligns the new TPE embeddings with the pre-trained embedding spaces so that the LLM can leverage the compressed, clinically relevant tokens for downstream prediction tasks.

### 4.1 Token-pair Encoding (TPE)

TPE is designed to encode subwords in medical terminology as single, semantically meaningful medical tokens, as illustrated in Figure [2](https://arxiv.org/html/2605.11774#S2.F2)(a). TPE functions as an extra layer built upon any base tokeniser, treating original subword units as the fundamental atoms for its merging operations.

#### Vocabulary Discovery\.

The construction of the TPE vocabulary, $\mathcal{V}_{\mathrm{TPE}}$, begins with a data-driven discovery phase on a large-scale medical corpus. Initialising with the base tokeniser's vocabulary $\mathcal{V}$, we systematically identify contiguous sequences of $N$ tokens (where $2 \leq N \leq n_{\max}$) that represent cohesive clinical concepts. We calculate the global frequency of these $N$-gram candidates of TPE tokens in the medical corpus to identify a pool of clinical terms that are frequently fragmented by general-purpose tokenisers. This candidate set serves as the basis for the subsequent dependency-aware replacement strategy.
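
A minimal sketch of the discovery phase is shown below, assuming the corpus has already been split into base tokens. The function name `discover_candidates` and the `min_freq` cut-off are illustrative assumptions; the paper does not state how (or whether) candidates are thresholded at this stage.

```python
from collections import Counter


def discover_candidates(corpus, n_max=4, min_freq=2):
    """Count contiguous N-grams of base tokens for 2 <= N <= n_max.

    `corpus` is an iterable of base-token lists. `min_freq` is a hypothetical
    threshold used here only to keep the toy output small.
    """
    counts = Counter()
    for tokens in corpus:
        for n in range(2, n_max + 1):
            for start in range(len(tokens) - n + 1):
                counts[tuple(tokens[start:start + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}


corpus = [
    ["Spi", "rom", "etry", "normal"],
    ["Spi", "rom", "etry", "low"],
    ["hypo", "tension"],
]
candidates = discover_candidates(corpus)
```

In this toy corpus only the fragments of “Spirometry” recur, so they are the only candidates that survive the threshold.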

#### Encoding Process\.

The TPE tokenisation process, $\tau^{\star}$, consists of two stages. First, the input text is processed by the base tokeniser to generate an intermediate sequence of original tokens. Subsequently, the TPE tokeniser applies a structured merge table, $\mathcal{M}_{\text{TPE}}$, to further consolidate these units. Specifically, for any contiguous span of length $N \in \{2, \dots, n_{\max}\}$, a TPE token is constructed as:

$$d_j = x_1 \oplus x_2 \oplus \cdots \oplus x_N, \quad x_i \in \mathcal{V}, \quad d_j \in \mathcal{V}_{\mathrm{TPE}}, \tag{6}$$

where $\oplus$ denotes string concatenation, and $x_i, x_{i+1}$ are adjacent sub-tokens.

To formalise the encoding of $d_j$ based on tokens $\{x_i\}_{i=1}^{N}$, we traverse its constituent sub-tokens from left to right to construct the merge path $MP(d_j)$:

$$MP(d_j) = \Bigl[(x_1, x_2),\; (x_1 x_2, x_3),\; \dots,\; (x_1 \cdots x_{N-1}, x_N)\Bigr]. \tag{7}$$

The TPE merging rule is defined by the merge table $\mathcal{M}_{\text{TPE}}$, which is generated by concatenating the unique merge paths of all discovered TPE tokens, ordered by their empirical frequency in the medical corpus:

$$\mathcal{M}_{\text{TPE}} = \left[\, MP(d_1) \,\|\, MP(d_2) \,\|\, \dots \,\|\, MP(d_J) \,\right], \tag{8}$$

where $\|$ denotes list concatenation with duplicate removal. This layered approach achieves substantial compression of EHR sequences while maintaining a computational complexity of $\mathcal{O}(n)$, where $n$ is the length of the input sequence. Crucially, this encoding process is information-lossless, ensuring that the original input text can be completely reconstructed through decoding ($\text{detokenise}(\tau^{\star}(s)) = s$). Although the initial average embeddings are representationally lossy, the subsequent SSFT step recovers the representational fidelity required for high-precision clinical inference.
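
The merge path of Eq. (7) and the merge table of Eq. (8) can be illustrated with a short sketch. It rests on several assumptions: TPE tokens are passed in as lists of sub-tokens already sorted by corpus frequency, intermediate merges such as “Spirom” are treated as valid tokens, and the encoder is a simple rule-by-rule reference loop rather than the linear-time procedure the paper describes. All function names are hypothetical.

```python
def merge_path(sub_tokens):
    """Eq. (7): left-to-right pairs (merged prefix, next sub-token)."""
    path, prefix = [], sub_tokens[0]
    for nxt in sub_tokens[1:]:
        path.append((prefix, nxt))
        prefix += nxt
    return path


def build_merge_table(tpe_tokens):
    """Eq. (8): concatenate merge paths in the given order, dropping duplicates."""
    table, seen = [], set()
    for sub_tokens in tpe_tokens:
        for pair in merge_path(sub_tokens):
            if pair not in seen:
                seen.add(pair)
                table.append(pair)
    return table


def tpe_encode(base_tokens, table):
    """Apply each merge rule in table order over the base-token sequence."""
    tokens = list(base_tokens)
    for left, right in table:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)  # consolidate the adjacent pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


table = build_merge_table([["Spi", "rom", "etry"], ["Spi", "rom", "eter"]])
base = ["In", "centive", " ", "Spi", "rom", "etry"]
encoded = tpe_encode(base, table)
```

Because every merge is plain string concatenation, joining the encoded tokens reproduces the original text, which is the lossless property stated above.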

### 4.2 Dependency-aware Replacement

The dependency-aware replacement strategy is central to integrating the original and TPE vocabularies without increasing the overall vocabulary size, while keeping the layered TPE tokenisation intact. A straightforward solution is simply taking the union of the original and TPE vocabularies, $\mathcal{V} \cup \mathcal{V}_{\mathrm{TPE}}$, which would enlarge the embedding matrix and reduce computational efficiency. Instead, we maintain the original vocabulary size by replacing the least frequent original tokens with the most beneficial TPE tokens. To identify which TPE tokens to include, we use a length-aware frequency score that prioritises tokens that both occur frequently and contribute more to sequence compression:

$$\mathrm{score}(d_j) = \mathrm{freq}(d_j) \cdot \lvert d_j \rvert_{\text{orig}}, \tag{9}$$

where $\mathrm{freq}(d_j)$ is the raw corpus frequency of token $d_j$, and $\lvert d_j \rvert_{\text{orig}}$ is the number of its constituent original tokens.
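
As a quick illustration of Eq. (9), the sketch below ranks a few made-up candidates. The frequencies are invented for the example, and `select_insertions` is a hypothetical helper name.

```python
def length_aware_score(freq, n_orig):
    """Eq. (9): corpus frequency times the number of constituent original tokens."""
    return freq * n_orig


def select_insertions(candidates, m):
    """Return the top-m candidates by score; `candidates` maps sub-token tuples to frequency."""
    ranked = sorted(
        candidates,
        key=lambda sub: length_aware_score(candidates[sub], len(sub)),
        reverse=True,
    )
    return ranked[:m]


# Invented frequencies, for illustration only.
toy_candidates = {
    ("Spi", "rom", "etry"): 400,   # score 400 * 3 = 1200
    ("hypo", "tension"): 500,      # score 500 * 2 = 1000
    ("m", "mol"): 450,             # score 450 * 2 = 900
}
top2 = select_insertions(toy_candidates, 2)
```

The three-token candidate wins despite being the least frequent, because each occurrence removes two tokens from the sequence instead of one.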

We select the top-$M$ TPE tokens ranked by this score to form the insertion set $\mathcal{I}$. Each TPE token $d_j \in \mathcal{I}$ is encoded with a specific merge path $MP(d_j)$ that starts from the bytes of input. To ensure correct layered tokenisation, we retain all original tokens involved in these merge paths, including any tokens that are required recursively along the merge path. This guarantees the presence of all necessary tokens for the original tokenisation. We refer to this preserved collection as the *dependent set* $\mathcal{D}$, defined as

$$\mathcal{D} = \left\{\, x \;\middle|\; \begin{array}{l} \exists\, d_j \in \mathcal{I},\ \exists\, u:\ (u, x) \in MP(d_j), \\[2pt] \text{or} \\[2pt] \exists\, (u, x') \in MP(d_j),\ \exists\, v:\ (v, x) \in MP(x') \end{array} \right\}. \tag{10}$$

Given the dependent set $\mathcal{D}$, we identify tokens eligible for replacement in the unprotected set $\mathcal{U} = \mathcal{V} \setminus \mathcal{D}$. We then select the $M$ least frequent tokens from $\mathcal{U}$ by occurrence frequency, forming the eviction set $\mathcal{E} \subseteq \mathcal{U}$. Combining removals and insertions gives the MedTPE vocabulary $\mathcal{V}^{\star}$:

$$\mathcal{V}^{\star} = (\mathcal{V} \setminus \mathcal{E}) \cup \mathcal{I}, \quad |\mathcal{I}| = |\mathcal{E}| = M. \tag{11}$$

For tokens in the eviction set $\mathcal{E}$ that are encountered during inference, the tokeniser falls back to their underlying byte-level or sub-token representations present in the preserved vocabulary $\mathcal{V} \setminus \mathcal{E}$. This mechanism ensures that after removing rare tokens, the vocabulary remains computationally complete and capable of encoding any input string. This replacement strategy also prevents any increase in vocabulary size or embedding matrix, making MedTPE an efficient and compatible module for existing LLMs. The process is illustrated in Algorithm [1](https://arxiv.org/html/2605.11774#alg1).
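
The replacement step can be sketched as follows. This is not the paper's Algorithm 1: it follows the prose description by protecting every original token that a selected TPE token is built from, and it uses a hypothetical `base_merges` mapping (original token to the tokens it was itself merged from) to model the recursive dependencies of the base tokeniser.

```python
def dependent_set(insertions, base_merges):
    """Collect every original token needed to build the inserted TPE tokens."""
    protected = set()
    stack = [tok for sub_tokens in insertions for tok in sub_tokens]
    while stack:
        tok = stack.pop()
        if tok not in protected:
            protected.add(tok)
            # Follow the base tokeniser's own merges recursively (assumed mapping).
            stack.extend(base_merges.get(tok, ()))
    return protected


def replace_vocabulary(vocab_freq, insertions, base_merges):
    """Eq. (11): evict the M least frequent unprotected tokens, insert M TPE tokens."""
    protected = dependent_set(insertions, base_merges)
    unprotected = [t for t in vocab_freq if t not in protected]
    m = len(insertions)
    if len(unprotected) < m:
        raise ValueError("not enough unprotected tokens to evict")
    evicted = set(sorted(unprotected, key=lambda t: vocab_freq[t])[:m])
    inserted = {"".join(sub_tokens) for sub_tokens in insertions}
    return (set(vocab_freq) - evicted) | inserted, evicted


# Toy vocabulary with invented frequencies.
vocab_freq = {"Spi": 5, "rom": 3, "etry": 2, "Sp": 50, "i": 90,
              "cat": 1, "zebra": 0, "the": 1000}
base_merges = {"Spi": ("Sp", "i")}
new_vocab, evicted = replace_vocabulary(
    vocab_freq, [("Spi", "rom", "etry")], base_merges
)
```

Note that "etry" and "rom" are rare in this toy vocabulary but are never evicted, since removing them would break the merge path of "Spirometry"; the vocabulary size stays at eight.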

### 4.3 Self-supervised Fine-tuning (SSFT)

To align the embeddings of the newly introduced TPE tokens with the latent space of the pre\-trained LLM, we employ SSFT\. This process involves minimising the cross\-entropy loss between the model’s predictions and pseudo\-labels generated by the original, uncompressed LLM\. Specifically, we employ the original model to generate pseudo\-labels for the training samples\. The LLM, integrated with the MedTPE tokeniser, is then fine\-tuned to predict these labels\. This self\-supervised alignment leverages the original model’s generative behaviour as a supervisory signal, allowing the new token embeddings to seamlessly integrate into the pre\-trained latent space without requiring manual annotations\.
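
The idea of training only the new embeddings against pseudo-labels can be illustrated with a deliberately tiny toy. Everything here is an assumption made for illustration: a frozen linear layer `W` stands in for the LLM, the "original model" pools constituent embeddings by summation, and the new embedding starts from the plain mean of its constituents. The real method fine-tunes an actual LLM on generated sequences.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def logits(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def cross_entropy(W, h, label):
    return -math.log(softmax(logits(W, h))[label])


def ssft_step(W, emb, new_token, label, lr=0.1):
    """One gradient step on the new token's embedding; W and other rows stay frozen."""
    h = emb[new_token]
    grad_logits = softmax(logits(W, h))
    grad_logits[label] -= 1.0  # d(loss)/d(logits) = softmax - one_hot(label)
    grad_h = [sum(W[k][j] * grad_logits[k] for k in range(len(W)))
              for j in range(len(h))]
    emb[new_token] = [x - lr * g for x, g in zip(h, grad_h)]


W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]            # frozen stand-in for the LLM
emb = {"Spi": [0.9, 0.1], "rom": [0.6, 0.2], "etry": [0.8, -0.3]}
parts = ["Spi", "rom", "etry"]

# Pseudo-label: what the original model predicts from the uncompressed tokens.
h_orig = [sum(emb[t][j] for t in parts) for j in range(2)]
z = logits(W, h_orig)
pseudo_label = z.index(max(z))

# New TPE token, initialised from the mean of its constituents.
emb["Spirometry"] = [v / len(parts) for v in h_orig]
frozen_before = {t: list(emb[t]) for t in parts}
loss_before = cross_entropy(W, emb["Spirometry"], pseudo_label)
for _ in range(50):
    ssft_step(W, emb, "Spirometry", pseudo_label)
loss_after = cross_entropy(W, emb["Spirometry"], pseudo_label)
```

Only `emb["Spirometry"]` changes during the loop; the constituent embeddings and `W` are untouched, mirroring the frozen-backbone setup.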

To facilitate training, we initialise the embedding of each new TPE token based on the embeddings of its constituent original tokens. For a TPE token $d_j$ composed of $N$ constituent original tokens $\{x_1, x_2, \cdots, x_N\}$, we first compute the arithmetic mean of the pre-trained embeddings of its constituent tokens:

$$\tilde{\mathbf{e}}_{d_j} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{e}_{x_i}. \tag{12}$$

To maintain numerical stability, we rescale this vector to the average norm of all original token embeddings, weighted by a scaling factor $\alpha$. Specifically, the initial embedding $\mathbf{e}_{d_j}$ is given by:

$$\mathbf{e}_{d_j} = \alpha \cdot \mu \cdot \frac{\tilde{\mathbf{e}}_{d_j}}{\|\tilde{\mathbf{e}}_{d_j}\|}, \quad \mu = \frac{1}{|\mathcal{V}|} \sum_{x \in \mathcal{V}} \|\mathbf{e}_x\|, \tag{13}$$

where $\mu$ denotes the average norm of embeddings across the BPE vocabulary $\mathcal{V}$, and $\alpha$ is a scaling factor for normalisation, which we set to $\alpha = 0.5$ in this study. Furthermore, to utilise the pre-trained LLM and improve the fine-tuning efficiency, we freeze all parameters of the LLM and train only the embeddings of the newly introduced TPE tokens. This self-supervised approach preserves the original capability of the LLM while integrating the MedTPE tokeniser with the LLM without labels. The process is illustrated in Algorithm [2](https://arxiv.org/html/2605.11774#alg2).
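
Eqs. (12)-(13) translate directly into a few lines of code. The sketch below uses plain Python lists instead of tensors so that it runs without dependencies; `init_tpe_embedding` is a hypothetical name, and the zero-norm guard is an added safety check not discussed in the paper.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def init_tpe_embedding(constituents, all_embeddings, alpha=0.5):
    """Mean of constituent embeddings (Eq. 12), rescaled to alpha * mu (Eq. 13)."""
    dim = len(constituents[0])
    mean = [sum(e[k] for e in constituents) / len(constituents) for k in range(dim)]
    mu = sum(l2_norm(e) for e in all_embeddings) / len(all_embeddings)
    norm = l2_norm(mean)
    if norm == 0.0:
        raise ValueError("mean embedding has zero norm and cannot be rescaled")
    return [alpha * mu * x / norm for x in mean]


# Toy 2-d embeddings with norms 5, 5 and 10, so mu = 20 / 3.
vocab_embeddings = [[3.0, 4.0], [0.0, 5.0], [6.0, 8.0]]
new_embedding = init_tpe_embedding(vocab_embeddings[:2], vocab_embeddings)
```

The result keeps the direction of the mean vector while its norm is fixed at $\alpha \mu$, here $0.5 \times 20/3$.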

Table 2: Assessment of LLMs with MedTPE. Mean and standard deviation of F1 scores are reported, calculated by bootstrapping the test set 1,000 times. Inference time is reported in minutes, with the percentage change shown relative to the original LLM. Bold values indicate the best performance among the compression methods, and underlined values indicate the second-best performance.

(a) Comparison of prompt compression methods on MIMIC-IV

(b) Comparison of prompt compression methods on EHRSHOT

## 5 Experiments

To evaluate MedTPE, we formulate the following key research questions (RQs): RQ1 (Effectiveness) assesses whether MedTPE outperforms existing prompt compression baselines; RQ2 (Ablation) isolates the contributions of token merging and embedding alignment; RQ3 (Efficiency) quantifies the setup costs; RQ4 (Robustness) examines performance stability across varying context lengths; and RQ5 (Generalisability) tests the method's transferability to non-clinical domains.

### 5.1 Experiment Setup

#### Datasets & Tasks

To evaluate the effectiveness of MedTPE, we conducted experiments on MIMIC-IV (Johnson et al., [2023](https://arxiv.org/html/2605.11774#bib.bib13)) and EHRSHOT (Wornow et al., [2023](https://arxiv.org/html/2605.11774#bib.bib31)). MIMIC-IV is an ICU dataset from the Beth Israel Deaconess Medical Center in Boston, characterised by densely sampled data and a wide variety of clinical features. EHRSHOT includes patient sequences from 1990 to 2023 with irregular timestamps from Stanford Hospital, reflecting the longitudinal complexity of real-world scenarios. The statistics of the datasets are shown in Appendix Table [5](https://arxiv.org/html/2605.11774#A3.T5). In MIMIC-IV, we evaluate model performance on two clinical prediction tasks using the first 24 hours of data: (1) ICU Mortality, a binary classification task predicting whether a patient will die during their ICU stay, and (2) Phenotyping, a multi-label classification task predicting the presence of 25 clinical phenotypes. In EHRSHOT, we evaluate two further tasks: (3) 30-day Hospital Readmission, predicting whether a patient will be readmitted within 30 days of discharge, and (4) 1-year Pancreatic Cancer Prediction, predicting pancreatic cancer diagnosis within one year.

#### Metrics & Baselines

We evaluate LLMs using four key metrics: the F1 score for predictive performance, the format compliance rate \(FCR\) for format adherence \(Ni et al\., [2025](https://arxiv.org/html/2605.11774#bib.bib26)\), the inference time for inference efficiency, and the compression rate \(CR\) to quantify the reduction in the length of the token sequence\. The LLMs evaluated include Qwen2\.5 \(Team, [2024](https://arxiv.org/html/2605.11774#bib.bib39)\) and Llama3 \(Grattafiori et al\., [2024](https://arxiv.org/html/2605.11774#bib.bib40)\)\.
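The bootstrapped F1 statistics described in the caption of Table 2 can be sketched as follows. This is a minimal illustration rather than the authors' evaluation code: it assumes a binary task, resamples the test set with replacement, and uses toy labels; the function names `f1_score` and `bootstrap_f1` are hypothetical.

```python
import random
from statistics import mean, stdev


def f1_score(y_true, y_pred):
    """Binary F1 from parallel lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1(y_true, y_pred, n_resamples=1000, seed=0):
    """Mean and standard deviation of F1 over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    return mean(scores), stdev(scores)


# Toy labels, purely illustrative (not data from the paper).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
f1_mean, f1_std = bootstrap_f1(y_true, y_pred)
print(f"F1 = {f1_mean:.3f} +/- {f1_std:.3f}")
```

For the multi\-label phenotyping task an averaged F1 over labels would be needed instead; which averaging scheme is used is not specified here, so the sketch stays with the binary case.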

As primary baselines for prompt compression, we employ three distinct approaches: a specialised clinical text summarisation method T5Summary \(Wilson et al\., [2025](https://arxiv.org/html/2605.11774#bib.bib50)\), the removal\-based compression method LLMLingua2 \(Pan et al\., [2024](https://arxiv.org/html/2605.11774#bib.bib36)\), and the merge\-based compression method ZeTT \(Minixhofer et al\., [2024](https://arxiv.org/html/2605.11774#bib.bib52)\)\. To ensure a rigorous comparison, we configure ZeTT with the same vocabulary as MedTPE to eliminate the impact of different tokenisation granularities and align LLMLingua2’s CR to match our method\. More details can be found in Appendix [C](https://arxiv.org/html/2605.11774#A3)\.

### 5\.2 RQ1: Comparison of MedTPE with Other Methods

Table [2](https://arxiv.org/html/2605.11774#S4.T2) presents the experimental results on MIMIC\-IV and EHRSHOT\. Across both datasets, MedTPE reduces inference time by 33\.9% to 62\.5% and achieves compression rates between 22\.8% and 32\.4%\. Compared to the prompt compression baselines \(LLMLingua2 and ZeTT\), MedTPE consistently improves inference efficiency while preserving, and often enhancing, predictive performance across diverse LLMs and tasks\. Notably, the baseline methods often increase total inference latency despite compressing the input\. We attribute the underperformance of ZeTT to its reliance on a hypernetwork to generate embeddings\. While effective for general vocabulary adaptation, hypernetworks struggle to capture the semantic meaning that emerges from combining sub\-tokens \(e\.g\., “hypotension” in the clinical context is quite different from “hypo” and “tension” in the general context\)\. This semantic ambiguity forces the LLM to generate longer and more redundant output sequences to resolve the confusion, thereby extending generation time \(Li et al\., [2023](https://arxiv.org/html/2605.11774#bib.bib30)\)\. Compared to the summarisation baseline \(T5Summary\), MedTPE produces higher F1 scores\. Although T5Summary achieves lower inference latency through aggressive compression, it compromises predictive performance, especially on challenging tasks such as phenotyping\. MedTPE, in contrast, achieves a substantial gain in inference efficiency without sacrificing the reliability of clinical predictions\. MedTPE consistently outperforms the baseline methods, even when they are augmented with an embedding fine\-tuning step \(Appendix [H](https://arxiv.org/html/2605.11774#A8)\)\. Furthermore, MedTPE demonstrates robust effectiveness across a wide range of model families and parameter scales \(1B to 32B\), different training domains \(Appendix [E](https://arxiv.org/html/2605.11774#A5)\), and Chain\-of\-Thought \(CoT\) prompting scenarios \(Appendix [F](https://arxiv.org/html/2605.11774#A6)\)\.

Table 3: Ablation study on MIMIC\-IV tasks comparing MedTPE, MedTPE without SSFT, and MedTPE without dependency\-aware replacement \(Rep\.\)\. Inference time is reported in minutes\.

### 5\.3 RQ2: Ablation Study

We evaluated MedTPE’s core components via an ablation study in Table [3](https://arxiv.org/html/2605.11774#S5.T3)\. Omitting Self\-Supervised Fine\-Tuning \(“w\.o\. SSFT”\) causes catastrophic performance collapse\. Without SSFT, average\-initialised embeddings act as out\-of\-distribution noise, triggering hallucinations that negate compression efficiency\. Furthermore, omitting dependency\-aware replacement \(“w\.o\. Rep\.”\) by expanding the vocabulary and embedding matrices degrades accuracy and increases latency\. Retaining original tokens introduces semantic ambiguity as they compete with new TPE tokens during generation, and unnecessarily expands structural matrices\. Ultimately, both SSFT and strategic token replacement are indispensable to prevent generation confusion, minimise computational overhead, and preserve model capabilities\.
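The difference between replacing rows and expanding the embedding matrix can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: “average\-initialised” is taken to mean the element\-wise mean of the constituent token embeddings, embeddings are plain Python lists, and the names `mean_embedding` and `replace_rows` are hypothetical.

```python
def mean_embedding(rows):
    """Element-wise mean of equally sized embedding vectors."""
    dim = len(rows[0])
    return [sum(r[k] for r in rows) / len(rows) for k in range(dim)]


def replace_rows(embeddings, evicted_ids, new_token_parts):
    """Overwrite evicted rows in place with mean-initialised composite tokens.

    embeddings      : list of vectors, one per vocabulary id
    evicted_ids     : ids whose rows are recycled (hypothetical eviction set)
    new_token_parts : for each new token, the ids of its constituent tokens
    """
    assert len(evicted_ids) == len(new_token_parts)
    # Compute every mean before writing, so a recycled row never feeds a mean.
    new_rows = [mean_embedding([embeddings[i] for i in parts])
                for parts in new_token_parts]
    for row_id, row in zip(evicted_ids, new_rows):
        embeddings[row_id] = row
    return embeddings


# Toy 4-token vocabulary with 2-dimensional embeddings (illustrative values).
table = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0], [9.0, 9.0]]
replace_rows(table, evicted_ids=[3], new_token_parts=[[0, 1]])
print(len(table), table[3])
```

The table keeps its original number of rows, which is the property the “w\.o\. Rep\.” variant gives up; the recycled rows would then be the ones updated during SSFT.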

Table 4: Generalisation evaluation of MedTPE on clinical narrative, scientific, and financial domains\. Metrics include Accuracy \(Acc\.\), F1, FCR, ROUGE \(R\-1, R\-2, R\-L\), BERTScore \(BS\), and inference time\.

\(a\) Evaluation of MedTPE on clinical narratives\. Time is reported in minutes\.

\(b\) Evaluation of MedTPE across scientific \(ARC\-Challenge\) and financial \(ECTSum\) domains\. Time is reported in seconds\.

\(c\) Evaluation of MedTPE on the Chinese\-language dataset \(CMedQA2\)\. Inference time is reported in seconds\.

### 5\.4 RQ3: Evaluation of the Setup Cost

![Refer to caption](https://arxiv.org/html/2605.11774v1/x3.png) \(a\) MIMIC\-IV
![Refer to caption](https://arxiv.org/html/2605.11774v1/x4.png) \(b\) EHRSHOT

Figure 3: Analysis of cost for MedTPE integration\. Each curve shows the CR achieved with different numbers of replaced tokens and $N$\-gram configurations \($N=2,3,4,5$\), evaluated using the Qwen2\.5 tokeniser on \(a\) MIMIC\-IV and \(b\) EHRSHOT\.

Figure [3](https://arxiv.org/html/2605.11774#S5.F3) presents the CR achieved by MedTPE across various $N$\-gram configurations and replacement budgets\. We observe that the CR plateaus at approximately 30%, which likely represents the empirical bound for lossless compression on the MIMIC\-IV and EHRSHOT datasets\. This saturation is consistent with Shannon’s Source Coding Theorem \(Shannon, [1948](https://arxiv.org/html/2605.11774#bib.bib57)\), which states that lossless compression is limited by the inherent entropy of the source text \(Li, [2025](https://arxiv.org/html/2605.11774#bib.bib54)\)\. Our results indicate that a budget of 5,000 tokens offers the best trade\-off for approaching this bound\. Beyond this threshold, the marginal gain in compression diminishes markedly, as the remaining patterns fall into the ‘long tail’ of the clinical term distribution \(Portelli et al\., [2022](https://arxiv.org/html/2605.11774#bib.bib55)\)\. Furthermore, achieving these gains requires constructing the vocabulary in the target domain \(Appendix [I](https://arxiv.org/html/2605.11774#A9)\)\.
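As a rough illustration of why a plateau is expected, the entropy argument can be written out under two assumptions that are not stated in the text: CR is taken to be the fraction of tokens removed, and the tokeniser is treated as a uniquely decodable code over an alphabet of $|\mathcal{V}|$ symbols, whose size the replacement strategy leaves unchanged. Here $H(S)$ denotes the entropy of the source text in bits, $n_0$ the original token count, and $n$ the compressed token count.

```latex
% Source coding bound for a uniquely decodable code with |V| symbols
\mathbb{E}[n] \;\ge\; \frac{H(S)}{\log_2 |\mathcal{V}|}

% Implied ceiling on the compression rate, CR = 1 - n / n_0 (assumed definition)
\mathbb{E}[\mathrm{CR}] \;=\; 1 - \frac{\mathbb{E}[n]}{n_0} \;\le\; 1 - \frac{H(S)}{n_0 \log_2 |\mathcal{V}|}
```

Because the right\-hand side depends only on the text and the vocabulary size, adding further merge rules cannot push CR past it, which is consistent with the observed saturation.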

In addition, the results indicate that bigrams \($N=2$\) consistently yield the highest compression rate across all settings\. Although larger $N$\-grams \(e\.g\., “blood pressure monitor”\) are semantically richer, they are statistically rarer than their constituent bigrams \(“blood pressure”, “pressure monitor”\) \(Clark et al\., [2013](https://arxiv.org/html/2605.11774#bib.bib56)\)\. Given a fixed replacement budget \($M=5{,}000$\), allocating slots to high\-frequency bigrams maximises the total number of merge operations across the corpus, whereas lower\-frequency multi\-word patterns offer diminishing returns\. A budget of 5,000 tokens therefore balances computational overhead against compression performance\. This addition represents only 3\.3% of the Qwen2\.5 vocabulary and requires fine\-tuning a negligible fraction of model parameters \($\approx$0\.5–1\.0%\)\. Supported by the qualitative analysis in Appendix [G](https://arxiv.org/html/2605.11774#A7), our findings demonstrate that MedTPE achieves substantial efficiency gains by targeting high\-frequency clinical units with minimal architectural modification\. The cost of each setup step for MedTPE is shown in Appendix Table [7](https://arxiv.org/html/2605.11774#A3.T7)\.
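The effect of a fixed budget on compression can be illustrated with a toy sketch. It is not the scoring used by MedTPE: it assumes whitespace\-separated words stand in for tokens, ranks bigrams by raw frequency only, merges greedily from left to right, and defines CR as the fraction of tokens removed. The helper names are hypothetical.

```python
from collections import Counter


def top_bigrams(tokens, budget):
    """Return the `budget` most frequent adjacent token pairs."""
    counts = Counter(zip(tokens, tokens[1:]))
    return {pair for pair, _ in counts.most_common(budget)}


def merge_pairs(tokens, pairs):
    """Greedy left-to-right merge of selected pairs into composite tokens."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            out.append(tokens[i] + "+" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def compression_rate(original, compressed):
    """Fraction of tokens removed (assumed definition of CR)."""
    return 1 - len(compressed) / len(original)


# Toy sequence, purely illustrative.
tokens = ("blood pressure 120 blood pressure 118 heart rate 80 "
          "blood pressure monitor heart rate 82").split()
for budget in (1, 2, 3):
    merged = merge_pairs(tokens, top_bigrams(tokens, budget))
    print(budget, f"{compression_rate(tokens, merged):.2f}")
```

In this toy sequence the first two budget slots go to the repeated pairs and remove most of the redundancy, while any further slot can only be spent on a pair that occurs once, mirroring the diminishing returns described above. Splitting each composite token on `+` recovers the original sequence, so the merge is lossless.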

![Refer to caption](https://arxiv.org/html/2605.11774v1/x5.png) \(a\) Llama3\-1B \(MIMIC\-IV\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x6.png) \(b\) Llama3\-1B \(EHRSHOT\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x7.png) \(c\) Qwen2\.5\-1\.5B \(MIMIC\-IV\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x8.png) \(d\) Qwen2\.5\-1\.5B \(EHRSHOT\)

Figure 4: Context length robustness of MedTPE\. Each plot shows the mean F1 score with 95% confidence interval \(shaded areas\)\.

### 5\.5 RQ4: Evaluation across Context\-length

Figure [4](https://arxiv.org/html/2605.11774#S5.F4) illustrates the predictive stability of MedTPE compared to original tokenisers across varying context scales\. Evaluated on MIMIC\-IV and EHRSHOT, MedTPE consistently maintains or surpasses baseline performance, substantiating its robustness across diverse sequence lengths\. This reliability is further enhanced when MedTPE is integrated with test\-time scaling strategies \(Appendix [K](https://arxiv.org/html/2605.11774#A11)\), which make effective use of the available computational budget\. Consequently, MedTPE enables predictive performance to scale alongside input length, offering a robust solution for clinical applications over highly frequent or longitudinal patient records\.

### 5\.6 RQ5: Evaluation of Generalisability

To evaluate the generalisability of MedTPE beyond structured EHR sequences, we assessed its performance across three distinct textual domains: clinical narratives, scientific reasoning, and financial summarisation\. Within the clinical domain, we utilised the MIMIC\-IV\-Note dataset \(Johnson et al\., [2023](https://arxiv.org/html/2605.11774#bib.bib13)\) for two tasks: 30\-day readmission prediction from discharge summaries and coding of the main diagnostic category \(MDC\), which requires mapping unstructured text to 25 standardised diagnostic groups\. To test cross\-domain robustness, we further added the ARC\-Challenge dataset \(Clark et al\., [2018](https://arxiv.org/html/2605.11774#bib.bib49)\), comprising science questions that require logical reasoning, and the ECTSum dataset \(Mukherjee et al\., [2022](https://arxiv.org/html/2605.11774#bib.bib51)\), which requires the generation of concise summaries from lengthy financial earnings call transcripts\. To evaluate cross\-lingual transferability, we incorporated the Chinese Medical Question Answer Matching v2 \(CMedQA2\) dataset \(Zhang et al\., [2018](https://arxiv.org/html/2605.11774#bib.bib65)\)\. Following the evaluation protocol of Jiang et al\. \([2024](https://arxiv.org/html/2605.11774#bib.bib64)\), we formulated this as a text generation task\. We include LLMLingua2 \(Pan et al\., [2024](https://arxiv.org/html/2605.11774#bib.bib36)\) as the prompt compression baseline\.

As shown in Table [4](https://arxiv.org/html/2605.11774#S5.T4), MedTPE demonstrates generalisability, delivering substantial efficiency gains with minimal impact on predictive performance\. On clinical narratives with typographical errors and artefacts, MedTPE maintains its effectiveness, confirming its robustness in processing noisy, real\-world documentation\. While LLMLingua2 succeeds in this context by removing linguistic redundancies, MedTPE consistently achieves superior efficiency gains while maintaining comparable or higher predictive performance\.

In the scientific domain, the method achieves high accuracy while reducing inference latency by 49\.4% to 76\.4%\. Similarly, in the financial summarisation task, MedTPE consistently improves inference efficiency across diverse LLM families and parameter scales\. In particular, for Llama\-3\-1B, MedTPE yields higher R\-1 and R\-L scores than the baseline, suggesting that the increased information density of TPE tokens may help smaller models to attend to salient concepts, even when applied to out\-of\-domain financial text\.

MedTPE also exhibits cross\-lingual transferability on the Chinese\-language CMedQA2 dataset\. Unlike LLMLingua2, which degrades both predictive performance and latency, MedTPE preserves baseline predictive performance while accelerating inference, enabling effective compression of non\-English texts without language\-specific retraining\.

## 6 Conclusion

In this study, we proposed MedTPE, an efficient and lossless compression method that addresses the challenge of long EHR sequences by merging token pairs yielded from the general tokeniser into medical tokens\. MedTPE achieves a substantial gain in inference efficiency without increasing the parameter size or sacrificing performance\. Empirical evaluations across highly frequent and longitudinal EHR sequences and clinical narratives demonstrate that MedTPE effectively enhances the inference efficiency and robustness of pre\-trained LLMs for clinical tasks\.

While MedTPE provides efficiency gains, it relies on modifications to the tokeniser and embedding matrix\. This process requires direct access to and alteration of the tokeniser, which may be impractical in closed\-source models or production environments constrained by a fixed vocabulary\. In addition, our qualitative analysis of Meditron3\-8B under CoT prompting reveals instruction fragility in medically continual\-pretrained models, which can be highly sensitive to input distribution shifts under CoT prompting\. Integrating MedTPE with CoT prompting or advanced reasoning frameworks remains a limitation of our methodology and represents a valuable direction for future exploration\. Future work will explore instruction\-aware token\-pair selection and optimisation that preserves both clinical reasoning coherence and output\-format compliance under CoT prompting\.

## Acknowledgements

Mingcheng Zhu was supported by the Clarendon Fund Scholarship\. Tingting Zhu was supported by the Royal Academy of Engineering under the Research Fellowship scheme\.

## Impact Statement

The TPE vocabularies developed in this study are mined from specific, high\-resource databases \(e\.g\., MIMIC\-IV, EHRSHOT\) and inherently reflect localised clinical documentation practices\. Consequently, deploying these tailored models in under\-resourced settings or different geographic regions without rigorous re\-validation introduces a distinct risk of dataset bias\. If a model utilising these specific embeddings misinterprets divergent clinical terminologies or regional shorthand, it could inadvertently exacerbate existing healthcare disparities\. Therefore, localised validation and vocabulary re\-alignment are critical prerequisites for safe, real\-world deployment\.

## References

- E\. Alsentzer, J\. Murphy, W\. Boag, W\. Weng, D\. Jindi, T\. Naumann, and M\. McDermott \(2019\) Publicly available clinical\. In Proceedings of the 2nd Clinical Natural Language Processing Workshop\. Cited by: [§2](https://arxiv.org/html/2605.11774#S2.SS0.SSS0.Px1.p1.1)\.
- B\. Arnrich, E\. Choi, J\. A\. Fries, M\. B\. McDermott, J\. Oh, T\. Pollard, N\. Shah, E\. Steinberg, M\. Wornow, and R\. van de Water \(2024\) Medical event data standard \(meds\): facilitating machine learning for health\. In ICLR 2024 Workshop on Learning from Time Series For Health, pp\. 03–08\. Cited by: [§C\.1](https://arxiv.org/html/2605.11774#A3.SS1.p1.1)\.
- E\. Bolton, A\. Venigalla, M\. Yasunaga, D\. Hall, B\. Xiong, T\. Lee, R\. Daneshjou, J\. Frankle, P\. Liang, M\. Carbin, et al\. \(2024\) Biomedlm: a 2\.7 b parameter language model trained on biomedical text\. arXiv preprint arXiv:2403\.18421\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p3.1)\.
- N\. Calderon, N\. Porat, E\. Ben\-David, A\. Chapanin, Z\. Gekhman, N\. Oved, V\. Shalumov, and R\. Reichart \(2024\) Measuring the robustness of nlp models to domain shifts\. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp\. 126–154\. Cited by: [Appendix I](https://arxiv.org/html/2605.11774#A9.p2.1)\.
- C\. Chen, J\. Yu, S\. Chen, C\. Liu, Z\. Wan, D\. Bitterman, F\. Wang, and K\. Shu \(2024a\) ClinicalBench: can llms beat traditional ml models in clinical prediction?\. arXiv preprint arXiv:2411\.06469\. Cited by: [§2](https://arxiv.org/html/2605.11774#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Chen, J\. Q\. Davis, B\. Hanin, P\. Bailis, I\. Stoica, M\. A\. Zaharia, and J\. Y\. Zou \(2024b\) Are more llm calls all you need? towards the scaling properties of compound ai systems\. Advances in Neural Information Processing Systems 37, pp\. 45767–45790\. Cited by: [Appendix K](https://arxiv.org/html/2605.11774#A11.p1.2)\.
- T\. Chen, M\. Zhu, Z\. Luo, and T\. Zhu \(2026\) Cross\-representation benchmarking in time\-series electronic health records for clinical outcome prediction\. In ICASSP 2026\-2026 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\), pp\. 7076–7080\. Cited by: [§C\.1](https://arxiv.org/html/2605.11774#A3.SS1.p1.1)\.
- P\. Chizhov, C\. Arnett, E\. Korotkova, and I\. Yamshchikov \(2024\) BPE gets picky: efficient vocabulary refinement during tokenizer training\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp\. 16587–16604\. Cited by: [§C\.4](https://arxiv.org/html/2605.11774#A3.SS4.p1.1)\.
- A\. Clark, G\. Giorgolo, and S\. Lappin \(2013\) Statistical representation of grammaticality judgements: the limits of n\-gram models\. In Proceedings of the fourth annual workshop on cognitive modeling and computational linguistics \(CMCL\), pp\. 28–36\. Cited by: [§5\.4](https://arxiv.org/html/2605.11774#S5.SS4.p2.4)\.
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\) Think you have solved question answering? try arc, the ai2 reasoning challenge\. arXiv preprint arXiv:1803\.05457\. Cited by: [§C\.1](https://arxiv.org/html/2605.11774#A3.SS1.p3.1), [§5\.6](https://arxiv.org/html/2605.11774#S5.SS6.p1.1)\.
- H\. Cui, Z\. Shen, J\. Zhang, H\. Shao, L\. Qin, J\. C\. Ho, and C\. Yang \(2025\) Llms\-based few\-shot disease predictions using ehr: a novel approach combining predictive agent reasoning and critical agent instruction\. In AMIA Annual Symposium Proceedings, Vol\. 2024, pp\. 319\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p1.1)\.
- W\. Du, Y\. Yang, and S\. Welleck \(2025\) Optimizing temperature for language models with multi\-sample inference\. In Forty\-second International Conference on Machine Learning\. Cited by: [§C\.3](https://arxiv.org/html/2605.11774#A3.SS3.p1.1)\.
- S\. L\. Fleming, A\. Lozano, W\. J\. Haberkorn, J\. A\. Jindal, E\. Reis, R\. Thapa, L\. Blankemeier, J\. Z\. Genkins, E\. Steinberg, A\. Nayak, et al\. \(2024\) Medalign: a clinician\-generated dataset for instruction following with electronic medical records\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 38, pp\. 22021–22030\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p2.1), [§2](https://arxiv.org/html/2605.11774#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\. Cited by: [§C\.4](https://arxiv.org/html/2605.11774#A3.SS4.p1.1), [§5\.1](https://arxiv.org/html/2605.11774#S5.SS1.SSS0.Px2.p1.1)\.
- H\. Han, A\. Eriguchi, H\. Xu, H\. Hoang, M\. Carpuat, and H\. Khayrallah \(2025\) Adapters for altering llm vocabularies: what languages benefit the most?\. In The Thirteenth International Conference on Learning Representations\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p3.1)\.
- J\. Harvill, Z\. Fan, H\. Wang, Y\. Sun, H\. Ding, L\. Huan, and A\. Deoras \(2025\) Lossless token sequence compression via meta\-tokens\. arXiv preprint arXiv:2506\.00307\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p3.1), [§2](https://arxiv.org/html/2605.11774#S2.SS0.SSS0.Px2.p3.1)\.
- A\. Hasan, J\. Wu, Q\. N\. Nguyen, S\. Andres, I\. Guellil, H\. Zhang, A\. Casey, B\. Alex, B\. Guthrie, and H\. Wu \(2024\) Infusing clinical knowledge into tokenisers for language models\. arXiv preprint arXiv:2406\.14312\. Cited by: [§2](https://arxiv.org/html/2605.11774#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Hu, X\. Zuo, Y\. Zhou, X\. Peng, J\. Huang, V\. K\. Keloth, V\. J\. Zhang, R\. Weng, C\. Shyr, Q\. Chen, et al\. \(2026\) Information extraction from clinical notes: are we ready to switch to large language models?\. Journal of the American Medical Informatics Association, pp\. ocaf213\. Cited by: [Appendix I](https://arxiv.org/html/2605.11774#A9.p2.1)\.
- H\. Jiang, Q\. Wu, C\. Lin, Y\. Yang, and L\. Qiu \(2023\) LLMLingua: compressing prompts for accelerated inference of large language models\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp\. 13358–13376\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p3.1), [§2](https://arxiv.org/html/2605.11774#S2.SS0.SSS0.Px2.p2.1)\.
- Z\. Jiang, L\. Zhong, M\. Sun, J\. Xu, R\. Sun, H\. Cai, S\. Luo, and Z\. Zhang \(2024\) Efficient knowledge infusion via kg\-llm alignment\. In Findings of the Association for Computational Linguistics: ACL 2024, pp\. 2986–2999\. Cited by: [§5\.6](https://arxiv.org/html/2605.11774#S5.SS6.p1.1)\.
- A\. E\. Johnson, L\. Bulgarelli, L\. Shen, A\. Gayles, A\. Shammout, S\. Horng, T\. J\. Pollard, S\. Hao, B\. Moody, B\. Gow, et al\. \(2023\) MIMIC\-iv, a freely accessible electronic health record dataset\. Scientific data 10\(1\), pp\. 1\. Cited by: [§C\.1](https://arxiv.org/html/2605.11774#A3.SS1.p2.1), [§5\.1](https://arxiv.org/html/2605.11774#S5.SS1.SSS0.Px1.p1.1), [§5\.6](https://arxiv.org/html/2605.11774#S5.SS6.p1.1)\.
- D\. Kim, R\. Hwa, and M\. M\. Rahman \(2024\) MhGPT: a lightweight generative pre\-trained transformer for mental health text analysis\. arXiv preprint arXiv:2408\.08261\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p3.1)\.
- T\. Kudo and J\. Richardson \(2018\) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing\. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp\. 66–71\. Cited by: [§C\.4](https://arxiv.org/html/2605.11774#A3.SS4.p1.1), [§1](https://arxiv.org/html/2605.11774#S1.p2.1), [§2](https://arxiv.org/html/2605.11774#S2.SS0.SSS0.Px2.p1.1)\.
- T\. C\. Lee, N\. U\. Shah, A\. Haack, and S\. L\. Baxter \(2020\) Clinical implementation of predictive models embedded within electronic health record systems: a systematic review\. In Informatics, Vol\. 7, pp\. 25\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p1.1)\.
- L\. M\. Li \(2025\) Empirical lossless compression bound of a data sequence\. Entropy 27\(8\), pp\. 864\. Cited by: [§5\.4](https://arxiv.org/html/2605.11774#S5.SS4.p1.1)\.
- Y\. Li, Z\. Li, W\. Yang, and C\. Liu \(2023\) RT\-lm: uncertainty\-aware resource management for real\-time inference of language models\. In 2023 IEEE Real\-Time Systems Symposium \(RTSS\), pp\. 158–171\. Cited by: [§5\.2](https://arxiv.org/html/2605.11774#S5.SS2.p1.1)\.
- B\. Liskavets, M\. Ushakov, S\. Roy, M\. Klibanov, A\. Etemad, and S\. K\. Luke \(2025\) Prompt compression with context\-aware sentence encoding for fast and improved llm inference\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 39, pp\. 24595–24604\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p3.1), [§2](https://arxiv.org/html/2605.11774#S2.SS0.SSS0.Px2.p2.1)\.
- M\. Mesinovic, S\. Molaei, P\. Watkinson, and T\. Zhu \(2025\) DynaGraph: interpretable multi\-label prediction from ehrs via dynamic graph learning and contrastive augmentation\. arXiv preprint arXiv:2503\.22257\. Cited by: [§C\.1](https://arxiv.org/html/2605.11774#A3.SS1.p1.1)\.
- B\. Minixhofer, E\. M\. Ponti, and I\. Vulić \(2024\) Zero\-shot tokenizer transfer\. Advances in Neural Information Processing Systems 37, pp\. 46791–46818\. Cited by: [§2](https://arxiv.org/html/2605.11774#S2.SS0.SSS0.Px2.p3.1), [§5\.1](https://arxiv.org/html/2605.11774#S5.SS1.SSS0.Px2.p2.1)\.
- R\. Mukherjee, A\. Bohra, A\. Banerjee, S\. Sharma, M\. Hegde, A\. Shaikh, S\. Shrivastava, K\. Dasgupta, N\. Ganguly, S\. Ghosh, et al\. \(2022\) ECTSum: a new benchmark dataset for bullet point summarization of long earnings call transcripts\. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp\. 10893–10906\. Cited by: [§C\.1](https://arxiv.org/html/2605.11774#A3.SS1.p3.1), [§5\.6](https://arxiv.org/html/2605.11774#S5.SS6.p1.1)\.
- I\. Nakash, N\. Calderon, E\. B\. David, E\. Hoffer, and R\. Reichart \(2025\) AdaptiVocab: enhancing llm efficiency in focused domains through lightweight vocabulary adaptation\. arXiv preprint arXiv:2503\.19693\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p3.1), [§2](https://arxiv.org/html/2605.11774#S2.SS0.SSS0.Px2.p3.1)\.
- M\. Ni, Z\. Yang, L\. Li, C\. Lin, K\. Lin, W\. Zuo, and L\. Wang \(2025\) Point\-rft: improving multimodal reasoning with visually grounded reinforcement finetuning\. arXiv preprint arXiv:2505\.19702\. Cited by: [§5\.1](https://arxiv.org/html/2605.11774#S5.SS1.SSS0.Px2.p1.1)\.
- S\. Niu, J\. Ma, L\. Bai, Z\. Wang, L\. Guo, and X\. Yang \(2024\) EHR\-knowgen: knowledge\-enhanced multimodal learning for disease diagnosis generation\. Information Fusion 102, pp\. 102069\. Cited by: [§2](https://arxiv.org/html/2605.11774#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Pan, Q\. Wu, H\. Jiang, M\. Xia, X\. Luo, J\. Zhang, Q\. Lin, V\. Rühle, Y\. Yang, C\. Lin, et al\. \(2024\) LLMLingua\-2: data distillation for efficient and faithful task\-agnostic prompt compression\. In Findings of the Association for Computational Linguistics ACL 2024, pp\. 963–981\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p3.1), [§2](https://arxiv.org/html/2605.11774#S2.SS0.SSS0.Px2.p2.1), [§5\.1](https://arxiv.org/html/2605.11774#S5.SS1.SSS0.Px2.p2.1), [§5\.6](https://arxiv.org/html/2605.11774#S5.SS6.p1.1)\.
- B\. Portelli, S\. Scaboro, E\. Santus, H\. Sedghamiz, E\. Chersoni, and G\. Serra \(2022\) Generalizing over long tail concepts for medical term normalization\. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp\. 8580–8591\. Cited by: [§5\.4](https://arxiv.org/html/2605.11774#S5.SS4.p1.1)\.
- P\. Renc, Y\. Jia, A\. E\. Samir, J\. Was, Q\. Li, D\. W\. Bates, and A\. Sitek \(2024\) Zero shot health trajectory prediction using transformer\. NPJ Digital Medicine 7\(1\), pp\. 256\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p1.1), [§1](https://arxiv.org/html/2605.11774#S1.p3.1)\.
- A\. Sallinen, A\. Solergibert, M\. Zhang, G\. Boyé, M\. Dupont\-Roc, X\. Theimer\-Lienhard, E\. Boisson, B\. Bernath, H\. Hadhri, A\. Tran, et al\. \(2025\) Llama\-3\-meditron: an open\-weight suite of medical llms based on llama\-3\.1\. In Workshop on Large Language Models and Generative AI for Health at AAAI 2025\. Cited by: [§C\.4](https://arxiv.org/html/2605.11774#A3.SS4.p1.1)\.
- R\. Sennrich, B\. Haddow, and A\. Birch \(2016\) Neural machine translation of rare words with subword units\. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 1715–1725\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p2.1), [§2](https://arxiv.org/html/2605.11774#S2.SS0.SSS0.Px2.p1.1)\.
- C\. E\. Shannon \(1948\) A mathematical theory of communication\. The Bell system technical journal 27\(3\), pp\. 379–423\. Cited by: [§5\.4](https://arxiv.org/html/2605.11774#S5.SS4.p1.1)\.
- X\. Song, A\. Salcianu, Y\. Song, D\. Dopson, and D\. Zhou \(2021\) Fast wordpiece tokenization\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp\. 2089–2103\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p2.1), [§2](https://arxiv.org/html/2605.11774#S2.SS0.SSS0.Px2.p1.1)\.
- G\. Team, M\. Riviere, S\. Pathak, P\. G\. Sessa, C\. Hardin, S\. Bhupatiraju, L\. Hussenot, T\. Mesnard, B\. Shahriari, A\. Ramé, et al\. \(2024\) Gemma 2: improving open language models at a practical size\. arXiv preprint arXiv:2408\.00118\. Cited by: [§C\.4](https://arxiv.org/html/2605.11774#A3.SS4.p1.1)\.
- Q\. Team \(2024\) Qwen2 technical report\. arXiv preprint arXiv:2407\.10671\. Cited by: [§C\.4](https://arxiv.org/html/2605.11774#A3.SS4.p1.1), [§5\.1](https://arxiv.org/html/2605.11774#S5.SS1.SSS0.Px2.p1.1)\.
- B\. Theodorou, C\. Xiao, and J\. Sun \(2023\) Synthesize high\-dimensional longitudinal electronic health records via hierarchical autoregressive language model\. Nature communications 14\(1\), pp\. 5305\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p1.1)\.
- C\. Y\. Williams, B\. Y\. Miao, A\. E\. Kornblith, and A\. J\. Butte \(2024\) Evaluating the use of large language models to provide clinical recommendations in the emergency department\. Nature Communications 15\(1\), pp\. 8236\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p1.1)\.
- J\. Wilson, M\. Pollack, R\. Edwards, A\. Bellamy, and H\. Salgi \(2025\) RainCityNLP at biolaysumm2025: extract then summarize at home\. In Proceedings of the 24th Workshop on Biomedical Language Processing \(Shared Tasks\), pp\. 190–195\. Cited by: [§5\.1](https://arxiv.org/html/2605.11774#S5.SS1.SSS0.Px2.p2.1)\.
- M\. Wornow, S\. Bedi, M\. A\. F\. Hernandez, E\. Steinberg, J\. A\. Fries, C\. Re, S\. Koyejo, and N\. Shah \(2025\) Context clues: evaluating long context models for clinical prediction tasks on ehr data\. In The Thirteenth International Conference on Learning Representations\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p2.1)\.
- M\. Wornow, R\. Thapa, E\. Steinberg, J\. Fries, and N\. Shah \(2023\) Ehrshot: an ehr benchmark for few\-shot evaluation of foundation models\. Advances in Neural Information Processing Systems 36, pp\. 67125–67137\. Cited by: [§5\.1](https://arxiv.org/html/2605.11774#S5.SS1.SSS0.Px1.p1.1)\.
- Z\. Wu, A\. Dadu, M\. Nalls, F\. Faghri, and J\. Sun \(2024\) Instruction tuning large language models to understand electronic health records\. In The Thirty\-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track\. Cited by: [§2](https://arxiv.org/html/2605.11774#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\. Cited by: [§C\.4](https://arxiv.org/html/2605.11774#A3.SS4.p1.1)\.
- M\. Yu \(2025\) MedSeg: a statistical approach to tokenization assessment in medical nlp\. Journal of Information Systems Engineering and Management 10, pp\. 698–704\. External Links: [Document](https://dx.doi.org/10.52783/jisem.v10i37s.6506)\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p2.1)\.
- S\. Zhang, X\. Zhang, H\. Wang, L\. Guo, and S\. Liu \(2018\) Multi\-scale attentive interaction networks for chinese medical question answer selection\. IEEE access 6, pp\. 74061–74071\. Cited by: [§5\.6](https://arxiv.org/html/2605.11774#S5.SS6.p1.1)\.
- M\. Zhu, Y\. Liu, Z\. Luo, and T\. Zhu \(2026\) The taxonomies, training, and applications of event stream modelling for electronic health records\. arXiv preprint arXiv:2603\.14003\. Cited by: [§1](https://arxiv.org/html/2605.11774#S1.p1.1)\.

## Appendix A Algorithm of Dependency\-aware Replacement

Algorithm 1: Dependency\-aware Replacement

Input: Original vocabulary $\mathcal{V}$, TPE candidates $\mathcal{V}_{\mathrm{TPE}}$, budget $M$

Output: Optimised vocabulary $\mathcal{V}^{\star}$

Step 1: Formulate Insertion Set \($\mathcal{I}$\)

- for each $d_{j}\in\mathcal{V}_{\mathrm{TPE}}$ do
  - Calculate $\mathrm{score}(d_{j})$ via Eq\. \([9](https://arxiv.org/html/2605.11774#S4.E9)\)
- end for
- $\mathcal{I}\leftarrow\text{Top-}M(\mathcal{V}_{\mathrm{TPE}},\text{score})$

Step 2: Identify Dependencies \($\mathcal{D}$\)

- Initialize $\mathcal{D}\leftarrow\emptyset$
- for each $d_{j}\in\mathcal{I}$ do
  - Retrieve merge path components $MP(d_{j})$
  - $\mathcal{D}\leftarrow\mathcal{D}\cup\{x\mid x\in MP(d_{j})\}$ \{Preserve recursive dependencies \(Eq\. \([10](https://arxiv.org/html/2605.11774#S4.E10)\)\)\}
- end for

Step 3: Select Eviction Set \($\mathcal{E}$\)

- $\mathcal{U}\leftarrow\mathcal{V}\setminus\mathcal{D}$ \{Identify unprotected tokens\}
- $\mathcal{S}\leftarrow\text{SortAscending}(\mathcal{U},\text{freq})$
- $\mathcal{E}\leftarrow\mathcal{S}[1:M]$ \{Select least frequent candidates\}

Step 4: Construct Final Vocabulary \($\mathcal{V}^{\star}$\)

- $\mathcal{V}^{\star}\leftarrow(\mathcal{V}\setminus\mathcal{E})\cup\mathcal{I}$ \{Merge vocabularies \(Eq\. \([11](https://arxiv.org/html/2605.11774#S4.E11)\)\)\}
- return $\mathcal{V}^{\star}$
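The four steps of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name, the dictionary-based inputs, and the toy tokens are hypothetical, and the scores are passed in precomputed because the scoring function of Eq. (9) is not reproduced in this appendix. The sketch also evicts exactly as many tokens as it inserts, which equals $M$ whenever at least $M$ candidates are available.

```python
def dependency_aware_replacement(vocab_freq, candidate_scores, merge_paths, budget):
    """Sketch of Algorithm 1 (hypothetical data layout).

    vocab_freq:       dict token -> corpus frequency (original vocabulary V)
    candidate_scores: dict candidate -> precomputed score (stand-in for Eq. 9)
    merge_paths:      dict candidate -> tokens on its merge path MP(d)
    budget:           number of tokens to swap (M)
    """
    # Step 1: insertion set = top-M candidates by score.
    insertion = sorted(candidate_scores, key=candidate_scores.get, reverse=True)[:budget]

    # Step 2: protect every token that an inserted candidate depends on.
    protected = set()
    for candidate in insertion:
        protected.update(merge_paths[candidate])

    # Step 3: evict the least frequent unprotected tokens,
    # one per inserted candidate so the vocabulary size is unchanged.
    unprotected = [tok for tok in vocab_freq if tok not in protected]
    eviction = sorted(unprotected, key=vocab_freq.get)[:len(insertion)]

    # Step 4: final vocabulary = (V \ E) union I.
    new_vocab = (set(vocab_freq) - set(eviction)) | set(insertion)
    return new_vocab, insertion, eviction


# Toy example with made-up frequencies and scores.
vocab_freq = {"mm": 50, "Hg": 40, "Heart": 30, " rate": 25, "zq": 1, "xj": 2, "the": 100}
candidate_scores = {"mmHg": 9.0, "Heart rate": 7.5, "zqxj": 0.1}
merge_paths = {"mmHg": ["mm", "Hg"], "Heart rate": ["Heart", " rate"], "zqxj": ["zq", "xj"]}

new_vocab, insertion, eviction = dependency_aware_replacement(
    vocab_freq, candidate_scores, merge_paths, budget=2
)
print(insertion, eviction, len(new_vocab))
```

In this toy run the rare tokens `zq` and `xj` are evicted, while `mm`, `Hg`, `Heart`, and ` rate` are kept because the inserted composites depend on them.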

## Appendix B Algorithm of MedTPE Integration

Algorithm 2: MedTPE Integration with LLM

Input: Medical corpus $\mathcal{C}$, original tokeniser $\tau$, LLM $F=(\tau,\Theta)$, budget $M$, input $s$

Output: LLM with MedTPE $F^{\star}=(\tau^{\star},\Theta^{\star})$

Step 1: Build MedTPE Tokeniser

- Extract original vocab $\mathcal{V}$ and merge table $M_{\text{orig}}$ from $\tau$
- Score all $N$\-grams \($2\leq N\leq N_{\max}$\) via Eq\. [9](https://arxiv.org/html/2605.11774#S4.E9)
- $\mathcal{I}\leftarrow$ top\-$M$ TPE tokens
- $\mathcal{D}\leftarrow$ dependent set via Eq\. [10](https://arxiv.org/html/2605.11774#S4.E10)
- $\mathcal{E}\leftarrow$ $M$ least\-frequent tokens in $\mathcal{V}\setminus\mathcal{D}$
- $\mathcal{V}^{\star}\leftarrow(\mathcal{V}\setminus\mathcal{E})\cup\mathcal{I}$ via Eq\. [11](https://arxiv.org/html/2605.11774#S4.E11)
- $\mathcal{M}_{\mathrm{TPE}}\leftarrow\bigl[\Vert_{d\in\mathcal{I}}MP(d)\bigr]$
- Define $\tau^{\star}=(\tau,\mathcal{V}^{\star},\mathcal{M}_{\mathrm{TPE}})$

Step 2: MedTPE Encoding

- $\mathcal{X}\leftarrow\tau(s)$
- $\mathcal{X}^{\star}\leftarrow[\,]$
- $i\leftarrow 1$
- while $i\leq|\mathcal{X}|$ do
  - $j\leftarrow\max\{k\geq i:MP(x_{i}\dots x_{k})\subseteq\mathcal{M}_{\mathrm{TPE}}\}$
  - Append $(x_{i}\oplus\dots\oplus x_{j})$ to $\mathcal{X}^{\star}$
  - $i\leftarrow j+1$
- end while

Step 3: Self\-supervised Fine\-tuning

- Initialise embedding $\mathbf{e}_{d}$ for $d\in\mathcal{I}$ with Eq\. [12](https://arxiv.org/html/2605.11774#S4.E12) and Eq\. [13](https://arxiv.org/html/2605.11774#S4.E13)
- Freeze $\Theta\setminus\{\mathbf{e}_{d}\}_{d\in\mathcal{I}}$
- for each mini\-batch $b\subset\mathcal{C}$ do
  - $X^{\star}\leftarrow$ MedTPE\-Tokenise\($b$\)
  - $y_{\text{pseu}}\leftarrow\operatorname*{arg\,max}(\mathcal{F}(X^{\star}))$ \{Generate pseudo\-labels\}
  - Update $\{\mathbf{e}_{d}\}$ using cross\-entropy loss with $y_{\text{pseu}}$
- end for
- return $F^{\star}=(\tau^{\star},\Theta^{\star})$
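The encoding loop in Step 2 of Algorithm 2 is a greedy longest-match scan over the output of the original tokeniser. The sketch below shows one way this could look in Python. It simplifies the merge-path test $MP(x_{i}\dots x_{k})\subseteq\mathcal{M}_{\mathrm{TPE}}$ to a membership check in a set of base-token tuples; the function name, the `tpe_tokens` set, and the example tokens are all hypothetical.

```python
def tpe_encode(base_tokens, tpe_tokens, max_n):
    """Greedy longest-match merge of base tokens into composite TPE tokens.

    base_tokens: list of strings produced by the original tokeniser
    tpe_tokens:  set of tuples of base tokens that form a composite token
                 (a simplification of the merge-path check in Algorithm 2)
    max_n:       longest composite length in base tokens (N_max)
    """
    encoded = []
    i = 0
    while i < len(base_tokens):
        j = i  # fall back to emitting the single base token
        # Try the longest admissible span first, then shrink it.
        for k in range(min(len(base_tokens), i + max_n) - 1, i, -1):
            if tuple(base_tokens[i:k + 1]) in tpe_tokens:
                j = k
                break
        encoded.append("".join(base_tokens[i:j + 1]))
        i = j + 1
    return encoded


base = ["Heart", " rate", ":", " 80", " mm", "Hg"]
tpe = {("Heart", " rate"), (" mm", "Hg")}
compressed = tpe_encode(base, tpe, max_n=2)
print(compressed)  # ['Heart rate', ':', ' 80', ' mmHg']

# The compression is lossless: the surface string is unchanged.
assert "".join(compressed) == "".join(base)
```

Here six base tokens become four, and concatenating the output recovers the original text exactly, which is the sense in which the compression is lossless.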

## Appendix C Implementation Details

Table 5: Statistics of Datasets

Table 6: Summary of dataset splits for all tasks\.

| Task \(Dataset\) | Domain | Train size | Validation size | Test size | \# Labels |
| --- | --- | --- | --- | --- | --- |
| ICU Mortality \(MIMIC\-IV\) | Clinical \(EHR\) | 46,106 | 5,602 | 5,815 | 2 |
| Phenotyping \(MIMIC\-IV\) | Clinical \(EHR\) | 26,937 | 3,267 | 3,420 | 25 |
| 30\-day Readmission \(EHRSHOT\) | Clinical \(EHR\) | 2,608 | 2,206 | 2,189 | 2 |
| 1\-year Pancreatic Cancer \(EHRSHOT\) | Clinical \(EHR\) | 2,576 | 2,215 | 2,220 | 2 |
| 30\-day Readmission \(MIMIC\-IV\-Note\) | Clinical \(Narrative\) | 25,670 | 3,166 | 3,180 | 2 |
| MDC Classification \(MIMIC\-IV\-Note\) | Clinical \(Narrative\) | 17,935 | 2,280 | 2,258 | 25 |
| Chinese Medical QA Matching v2 \(cMedQA2\) | Clinical \(Chinese\) | 100,000 | 4,000 | 4,000 | Free\-text |
| ARC\-Challenge \(ARC\) | Scientific | 1,119 | 299 | 1,172 | 4 |
| ECT Summarisation \(ECTSUM\) | Financial | 1,681 | 249 | 495 | Free\-text |

### C\.1 Datasets

For clinical datasets, we transform raw EHR data into structured event streams following the Medical Event Data Standard \(MEDS\) \(Arnrich et al\., [2024](https://arxiv.org/html/2605.11774#bib.bib14)\), capturing each patient’s record as a chronologically ordered sequence of timestamped clinical events\. Following previous works \(Mesinovic et al\., [2025](https://arxiv.org/html/2605.11774#bib.bib47); Chen et al\., [2026](https://arxiv.org/html/2605.11774#bib.bib67)\), both EHR datasets \(MIMIC\-IV and EHRSHOT\) are split at the patient level so that all records from the same patient appear in a single subset, keeping the training, validation, and test sets independent\. Patient IDs are matched across datasets according to MEDS specifications to maintain consistency and ensure reproducibility \(Arnrich et al\., [2024](https://arxiv.org/html/2605.11774#bib.bib14)\)\. For MIMIC\-IV, we use an 8:1:1 split for training, validation, and test sets, while for EHRSHOT, we use a 1:1:1 split\.
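A patient-level split of this kind can be sketched as follows. This is an illustrative sketch rather than the pipeline used in the paper: the function name, the shuffling procedure, and the seed are assumptions, and the exact subset sizes in Table 6 come from the authors' own processing.

```python
import random


def split_by_patient(patient_ids, ratios=(8, 1, 1), seed=1):
    """Assign each unique patient to exactly one of train/validation/test.

    patient_ids: iterable of patient identifiers, one per record
                 (a patient may appear many times)
    ratios:      relative sizes, e.g. (8, 1, 1) or (1, 1, 1)
    seed:        hypothetical seed for a reproducible shuffle
    """
    unique = sorted(set(patient_ids))      # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(unique)

    total = sum(ratios)
    n_train = len(unique) * ratios[0] // total
    n_val = len(unique) * ratios[1] // total

    train = set(unique[:n_train])
    val = set(unique[n_train:n_train + n_val])
    test = set(unique[n_train + n_val:])   # remainder goes to the test set
    return train, val, test


# Each patient has two records here; both always land in the same subset.
records = [f"patient_{i}" for i in range(100)] * 2
train, val, test = split_by_patient(records, ratios=(8, 1, 1))
print(len(train), len(val), len(test))  # 80 10 10
```

Because the split is made over unique patient identifiers rather than over records, no patient can contribute records to more than one subset.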

For clinical narratives, we utilised the MIMIC\-IV\-Note dataset \(Johnson et al\., [2023](https://arxiv.org/html/2605.11774#bib.bib13)\), which provides a comprehensive collection of de\-identified clinical notes associated with hospital admissions\. Unlike the structured events in the core MIMIC\-IV database, this dataset consists of unstructured text, including discharge summaries, radiology reports, and nursing notes, which often contain typographical errors and idiosyncratic human artefacts\. We use the discharge summaries for our evaluation\.

To evaluate the cross\-domain generalisation of MedTPE beyond the clinical sphere, we extended our evaluation to the scientific and financial sectors using the ARC\-Challenge and ECTSUM datasets\. The ARC\-Challenge \(Clark et al\., [2018](https://arxiv.org/html/2605.11774#bib.bib49)\) comprises grade\-school science questions requiring complex reasoning, while ECTSUM \(Mukherjee et al\., [2022](https://arxiv.org/html/2605.11774#bib.bib51)\) focuses on the financial domain, requiring long\-form summarisation of verbose earnings call transcripts to extract salient corporate facts\. These datasets provide a diverse testing ground to verify that MedTPE’s tokenisation logic is robust across varied linguistic structures\. Table [6](https://arxiv.org/html/2605.11774#A3.T6) summarises the training, validation, and test splits for all datasets, all of which were processed in full compliance with their respective licensing conditions\.

### C\.2 Training Details

We performed SSFT of all LLMs using a single NVIDIA A800 80G GPU, while inference experiments were conducted on a single NVIDIA A5500 24G GPU\.

We fine\-tuned each LLM using self\-supervised learning with the AdamW optimiser and a base learning rate of $5\times 10^{-5}$\. Training used a batch size of 2 with gradient accumulation over 2 steps, producing an effective batch size of 4\. The input and output sequence lengths were capped at 4,096 tokens\. The learning rate schedule involved a linear warmup during the first 10% of the total training steps, followed by a cosine decay until the end of training\. Model performance was validated every 1,000 steps, with early stopping applied if validation performance did not improve for 3 consecutive checks\. To ensure reproducibility, the random seed was set to 1\. Only the embeddings of new tokens were trainable during fine\-tuning, with all other embeddings and LLM layers frozen\. The padding token was set to be identical to the EOS token\. We did not conduct hyperparameter tuning: the settings were constrained by hardware limits \(e\.g\., batch size and sequence length\) and no baselines required training, so we adopted a single configuration feasible for the largest model \(Llama3\-8B\) on a single 80 GB GPU and applied it consistently across all LLMs\.
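The warmup-then-cosine schedule described above can be written as a small function of the training step. The sketch below is only illustrative: the function name is hypothetical, and the choices of a per-step linear ramp and a decay all the way to zero are assumptions, since the text does not state a minimum learning rate.

```python
import math


def lr_at_step(step, total_steps, peak_lr=5e-5, warmup_frac=0.10):
    """Learning rate with linear warmup followed by cosine decay.

    step:        zero-based optimiser step
    total_steps: total number of optimiser steps
    peak_lr:     learning rate reached at the end of warmup
    warmup_frac: fraction of total_steps spent warming up
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr to 0 (assumed floor) over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


for s in (0, 99, 100, 550, 1000):
    print(s, lr_at_step(s, total_steps=1000))
```

With an effective batch size of 4, one optimiser step here corresponds to two forward and backward passes of batch size 2.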

### C\.3 Evaluation Details

For all tasks, the maximum input length was set to 8,192, and the maximum output length was set to 4,096 tokens\. The amount of input information provided to each model was determined by the maximum input length allowed by the model’s original tokeniser\. The sampling temperature for LLM inference was set at 0\.7, following common practice to balance generation diversity and output reliability for general\-purpose LLMs \(Du et al\., [2025](https://arxiv.org/html/2605.11774#bib.bib38)\)\. All reported F1 scores were obtained by bootstrapping the test set 1,000 times to ensure a robust and reliable evaluation\. We implement a dedicated embedding split module \(Appendix [D](https://arxiv.org/html/2605.11774#A4)\) to ensure that gradients are computed exclusively for the new embeddings, avoiding unnecessary memory usage and computation\.
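Bootstrapped F1 reporting of this kind can be sketched in plain Python. The example assumes a binary task and resampling with replacement at the level of individual test examples; the function names, the seed, and the toy labels are hypothetical, and the multi-label tasks in the paper would need an averaged F1 in place of the binary one.

```python
import random
import statistics


def f1_score(y_true, y_pred):
    """Binary F1 for 0/1 labels; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1(y_true, y_pred, n_resamples=1000, seed=1):
    """Mean and standard deviation of F1 over bootstrap resamples of the test set."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Draw n test examples with replacement and score that resample.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    return statistics.mean(scores), statistics.stdev(scores)


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(f1_score(y_true, y_pred))  # 0.75
mean_f1, std_f1 = bootstrap_f1(y_true, y_pred)
print(round(mean_f1, 3), round(std_f1, 3))
```

The mean and standard deviation returned by `bootstrap_f1` correspond to the "mean and standard deviation of F1 scores" reported in the result tables.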

### C\.4 Evaluated LLMs

We evaluated MedTPE across several LLM families to ensure broad architectural compatibility\. Specifically, we used the Qwen2\.5 \(1\.5B and 7B\) models \(Team, [2024](https://arxiv.org/html/2605.11774#bib.bib39)\), which were trained on an 18\-trillion\-token corpus with dual\-stage supervised and RLHF fine\-tuning\. We also included the Qwen3 \(1\.7B\) model \(Yang et al\., [2025](https://arxiv.org/html/2605.11774#bib.bib66)\) in the evaluation\. From the Llama3 family \(Grattafiori et al\., [2024](https://arxiv.org/html/2605.11774#bib.bib40)\), we selected the 1B and 8B variants\. We also included Meditron3\-8B \(Sallinen et al\., [2025](https://arxiv.org/html/2605.11774#bib.bib41)\), an open clinical LLM suite developed through continued pre\-training of Llama3 on medical corpora for enhanced clinical decision support\. All these families use Byte Pair Encoding \(BPE\) tokenisation \(Chizhov et al\., [2024](https://arxiv.org/html/2605.11774#bib.bib37)\)\. Finally, to evaluate MedTPE’s versatility with different segmentation methods, we included Gemma2\-2B \(Team et al\., [2024](https://arxiv.org/html/2605.11774#bib.bib48)\), which employs the SentencePiece tokeniser \(Kudo and Richardson, [2018](https://arxiv.org/html/2605.11774#bib.bib8)\) for text processing\.

Table 7: Time breakdown for MedTPE processing phases on MIMIC\-IV datasets\.

### C\.5 Setup Cost of MedTPE

Table [7](https://arxiv.org/html/2605.11774#A3.T7) delineates the computational overhead associated with the MedTPE setup pipeline, comprising the Encoding, Replacement, and SSFT phases\. The initial encoding and replacement steps are highly efficient, executing in a few minutes\. The primary computational investment lies in the SSFT phase\. However, because this phase freezes the core LLM backbone and updates only the newly introduced TPE embeddings \(constituting roughly 0\.5% to 1\.0% of total parameters\), it is efficient and remains a strictly one\-time offline procedure\. Overall training times scale with model capacity: processing 1B to 1\.5B parameter models requires approximately 2 to 6 hours on a single NVIDIA RTX 6000 Ada GPU, while the larger 7B to 8B architectures require between 13 and 16\.5 hours on a single NVIDIA A100 GPU\.

### C\.6 Code Availability

## Appendix D Embedding Split Module

To enable supervised fine\-tuning of only a subset of embedding vectors, we introduce the embedding split module\. Fine\-tuning a subset of embeddings within a single unified embedding matrix can result in unnecessary gradient computation for the frozen embeddings, leading to increased memory usage and computational overhead\. Our module addresses this by explicitly partitioning the embedding set into two disjoint subsets:

$$\mathbf{E}=\mathbf{E}_{\mathrm{orig,\,fixed}}\cup\mathbf{E}_{\mathrm{TPE,\,trainable}},$$

where $\mathbf{E}_{\mathrm{orig,\,fixed}}$ denotes the pre\-trained token embeddings, which remain fixed, and $\mathbf{E}_{\mathrm{TPE,\,trainable}}$ denotes the trainable embeddings of the new TPE tokens\.

During the forward pass, the embeddings are retrieved by separately indexing both subsets and concatenating the results to form the complete embedding set $\mathbf{E}$\. During the backward pass, gradient\-based updates are restricted to the trainable subset $\mathbf{E}_{\mathrm{TPE,\,trainable}}$, substantially improving memory and computational efficiency\.
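One way such a module might look in PyTorch is sketched below. It is not the authors' code: the class name `SplitEmbedding`, the `slot` lookup table, and the initialisation from the existing rows (a stand-in for Eq. 12 and Eq. 13) are assumptions. It also combines the two lookups with `torch.where` rather than an explicit concatenation, which gives the same per-token result.

```python
import torch
import torch.nn as nn


class SplitEmbedding(nn.Module):
    """Embedding lookup with a frozen table and a small trainable table."""

    def __init__(self, pretrained, trainable_ids):
        super().__init__()
        idx = torch.tensor(trainable_ids, dtype=torch.long)
        # Frozen pre-trained embeddings: a buffer, so no gradient is allocated.
        self.register_buffer("frozen", pretrained.detach().clone())
        # Trainable rows for the new TPE tokens (placeholder initialisation).
        self.trainable = nn.Parameter(pretrained[idx].detach().clone())
        # slot[token_id] = row in `trainable`, or -1 for frozen tokens.
        slot = torch.full((pretrained.shape[0],), -1, dtype=torch.long)
        slot[idx] = torch.arange(len(trainable_ids))
        self.register_buffer("slot", slot)

    def forward(self, ids):
        slot = self.slot[ids]
        frozen_rows = self.frozen[ids]
        # Clamp so frozen positions index row 0; they are masked out below.
        trainable_rows = self.trainable[slot.clamp(min=0)]
        return torch.where((slot >= 0).unsqueeze(-1), trainable_rows, frozen_rows)


emb = SplitEmbedding(torch.randn(6, 4), trainable_ids=[2, 5])
out = emb(torch.tensor([[0, 2, 5]]))
out.sum().backward()
print(out.shape, emb.trainable.grad.shape)
```

Only the `trainable` parameter appears in `emb.parameters()`, so an optimiser built from it allocates state for the new rows only, which is the memory saving the module is designed to provide.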

## Appendix E Evaluation of MedTPE on More LLMs

Table 8: Assessment of LLMs with MedTPE\. Mean and standard deviation of F1 scores are reported, calculated by bootstrapping the test set 1,000 times\. Inference time is reported in minutes, with the percentage change shown relative to the original LLM\. Llama3 series and Meditron3 use the same tokeniser, resulting in identical CR with prompt compression\.

\(a\) Evaluation of LLMs with MedTPE on MIMIC\-IV

\(b\) Evaluation of LLMs with MedTPE on EHRSHOT

\(c\) Scalability evaluation of MedTPE on the phenotyping task \(MIMIC\-IV\)\. \*Evaluated using eight A100 GPUs\.

To evaluate the versatility and robustness of MedTPE, we extended our evaluation to a broader range of LLMs across different model families, parameter scales \(1B to 8B\), and training domains\. As detailed in Table [8](https://arxiv.org/html/2605.11774#A5.T8)\(a\) and \(b\), MedTPE consistently enhances inference efficiency across both general\-purpose models \(Llama\-3, Qwen\-2\.5, Qwen\-3, Gemma\-2\) and domain\-specific architectures \(Meditron\-3\)\. Across all configurations, MedTPE achieves a substantial reduction in inference latency, ranging from 30\.9% to 62\.5%, regardless of the underlying tokenisation method or parameter count\. Crucially, the predictive performance remains remarkably stable\. Even as model size increases to 7B and 8B parameters, MedTPE preserves the high reliability of the original models, maintaining FCR values near or at 1\.0\. These results demonstrate that MedTPE is a highly adaptable and reliable framework for clinical LLM acceleration, offering consistent efficiency gains without compromising diagnostic precision\.

To evaluate the scalability of MedTPE, we extended our experiments to larger architectural scales using the Qwen2\.5\-14B and 32B models\. For these evaluations, we specifically selected the phenotyping prediction task, which is the most challenging benchmark in our study\. As summarised in Table [8](https://arxiv.org/html/2605.11774#A5.T8)\(c\), MedTPE maintains its efficiency and efficacy as model capacity increases\. Specifically, we observed substantial reductions in inference latency \(36\.3% for the 14B model and 40\.9% for the 32B model\) while preserving F1 scores with only marginal degradation\. These findings confirm that MedTPE’s compression mechanism is highly compatible with large\-scale architectures and remains robust even under the most demanding clinical reasoning requirements\.

Table 9: Assessment of LLMs with MedTPE using CoT prompting\. Mean and standard deviation of F1 scores are reported, calculated by bootstrapping the test set 1,000 times\. Inference time is reported in minutes, with the percentage change shown relative to the original LLM\. Llama3 series and Meditron3 use the same tokeniser, resulting in identical CR with prompt compression\.

\(a\) Evaluation of LLMs with MedTPE on MIMIC\-IV with CoT Prompting

\(b\) Evaluation of LLMs with MedTPE on EHRSHOT with CoT Prompting

Table 10: Qualitative failure\-mode analysis of Meditron3\-8B with MedTPE under CoT prompting on the MIMIC\-IV phenotyping task\. Representative outputs are shortened for readability while preserving the relevant reasoning and output\-format generation\.

## Appendix F Impact of CoT Prompting

We further evaluated the effectiveness of MedTPE under CoT prompting, as shown in Table[9](https://arxiv.org/html/2605.11774#A5.T9)\. In all models and tasks, MedTPE consistently reduced inference time and achieved substantial sequence compression, with compression rates ranging from 22\.8% to 32\.4%\. For LLMs larger than 7B parameters, F1 scores and format compliance rates were largely maintained with only marginal decreases\. In contrast, smaller models such as Llama3\-1B and Qwen2\.5\-1\.5B exhibited notable improvements in both predictive performance and compliance with the output format when equipped with MedTPE\. For example, Llama3\-1B with CoT prompting in ICU mortality saw its F1 score increase from 0\.030 to 0\.122 and FCR from 0\.231 to 0\.999, while the inference time was reduced by more than 66%\. Similar trends were observed across the EHRSHOT tasks and for Qwen2\.5\-1\.5B, which demonstrated improvements in both efficiency and predictive performance\.

An exception to the general CoT results is observed for Meditron3\-8B on the MIMIC\-IV phenotyping task\. In this setting, MedTPE reduces the input length by 32\.4%, but the F1 score decreases from 0\.176 to 0\.137, the format compliance rate decreases from 0\.987 to 0\.852, and the inference time increases from 94\.0 to 135\.8 minutes\. To better understand this degradation, we qualitatively inspected representative generated responses from Meditron3\-8B with CoT prompting on ICU phenotyping\. The analysis reveals one expected behaviour and two recurring failure modes, summarised in Table[10](https://arxiv.org/html/2605.11774#A5.T10)\. Failure Mode A indicates reasoning hallucination: the model preserves the required output format, but the CoT rationale becomes misaligned with the final prediction\. Failure Mode B indicates a loss of instruction adherence: the model continues generating clinical explanations but fails to produce the required JSON output\. These behaviours are consistent with instruction fragility in medically continual\-pretrained models, where changes to the input token distribution can disrupt either the reasoning trajectory or the formatting constraint\. Therefore, although MedTPE is generally effective under CoT prompting, integrating token\-pair compression with advanced reasoning prompts should be validated for each fine\-tuned model\.

Table 11: Top\-10 most frequent TPE tokens in MIMIC\-IV\. Counts represent frequencies observed across the corpus\.

Table 12: Assessment of Llama3\-1B on the MIMIC\-IV Phenotyping task with baselines augmented with embedding fine\-tuning \(Emb\-FT\)\.

Table 13: Evaluation of MedTPE cross\-domain transferability\. “In\-Domain” refers to mining and SSFT on EHRSHOT\. “MIMIC\-IV → EHRSHOT” refers to applying the MIMIC\-IV trained tokeniser to EHRSHOT\.

## Appendix G Analysis of TPE Tokens

To identify the primary drivers of the observed efficiency gains, we analysed the most frequent TPE tokens within the MIMIC\-IV dataset\. As illustrated in Table[11](https://arxiv.org/html/2605.11774#A6.T11), the top\-10 tokens are dominated by clinical units, physiological measurements, and frequent medical terminology\. Notably, while the specific rankings differ slightly, Llama3 and Qwen2\.5 produced identical top\-10 token sets, primarily capturing vital sign descriptors \(e\.g\., “mmHg”, “Heart rate”\) and nursing observations \(e\.g\., “Nonin”, “Infusion”\)\. Gemma2 similarly prioritises high\-frequency substrings such as “Infusion” and various volumetric units\.

This distribution indicates that MedTPE effectively identifies and merges semantically meaningful substrings that recur throughout clinical documentation\. By representing these common patterns, such as blood pressure metrics and infusion statuses, MedTPE achieves substantial sequence compression\. This targeted reduction in input length allows the models to retain essential diagnostic information while significantly decreasing the computational cost of processing lengthy clinical records\.

## Appendix H Evaluating Baselines with Embedding Fine\-Tuning

To ensure a strictly equitable comparison, we conducted an experiment applying the embedding fine\-tuning \(Emb\-FT\) step to the prompt compression baselines\. While MedTPE inherently utilises SSFT to align new tokens, methods like LLMLingua2 and T5Summary do not traditionally undergo domain\-specific embedding adaptation\. To isolate the impact of this tuning phase, we augmented both baselines with an equivalent Emb\-FT step on the target domain, evaluating them on the MIMIC\-IV Phenotyping task using the Llama3\-1B model\. ZeTT was explicitly excluded from this setup, as its architecture relies on a hypernetwork to generate dynamic embeddings rather than tuning static representations\.

As shown in Table[12](https://arxiv.org/html/2605.11774#A6.T12), introducing this fine\-tuning step to the baselines fails to close the performance gap\. Emb\-FT yields a marginal F1 improvement for T5Summary \(from 0\.078 to 0\.081\) and degrades LLMLingua2’s predictive performance \(from 0\.058 to 0\.017\)\. This degradation likely occurs because traditional prompt compression methods aggressively prune tokens, resulting in discontinuous text\. Fine\-tuning representations on this fragmented context disrupts the LLM’s pre\-trained semantic space\. Conversely, MedTPE preserves semantic continuity and significantly outperforms these augmented baselines, confirming that its efficacy stems from its dependency\-aware token construction rather than the fine\-tuning step alone\.

## Appendix I Cross\-Domain Transferability of MedTPE

To evaluate MedTPE’s sensitivity to the corpus used for vocabulary mining and representation alignment, we conducted a cross\-domain transferability experiment\. Specifically, we directly applied the MedTPE tokeniser and embeddings optimised exclusively on the MIMIC\-IV dataset to the EHRSHOT evaluation suites without any adaptation\.

Results in Table [13](https://arxiv.org/html/2605.11774#A6.T13) reveal a limitation regarding the direct cross\-domain transferability of MedTPE\. Applying the MIMIC\-IV\-derived MedTPE model to the EHRSHOT dataset degrades performance on both tasks\. This decline is driven by differences in term frequencies and lexical distributions across distinct healthcare scenarios \(Hu et al\., [2026](https://arxiv.org/html/2605.11774#bib.bib62)\)\. As indicated by the drops in CR, the highly frequent $N$\-grams mined from MIMIC\-IV appear less frequently in EHRSHOT patient records\. Consequently, the model is forced to rely on misaligned sub\-token representations\. Furthermore, the results show that Llama3\-1B is highly sensitive to this domain shift, while Qwen2\.5\-1\.5B is comparatively robust\. This disparity can be attributed to the ability of higher\-capacity base models to partially mitigate the impact of out\-of\-domain token representations \(Calderon et al\., [2024](https://arxiv.org/html/2605.11774#bib.bib63)\)\. However, even though robust LLMs can partially absorb this shift, achieving optimal efficiency and predictive performance requires domain\-specific vocabulary mining and SSFT to capture the unique linguistic distributions of the target dataset\.

## Appendix J Context\-length Robustness Evaluation

Appendix Figure[5](https://arxiv.org/html/2605.11774#A11.F5)further illustrates the context\-length robustness of MedTPE across a diverse set of LLMs and clinical prediction tasks\. Across all model sizes and both MIMIC\-IV and EHRSHOT benchmarks, MedTPE consistently matches or outperforms the original tokeniser at varying input lengths\. This pattern holds for both small and large models, including Llama3\-1B, Qwen2\.5\-1\.5B, Qwen2\.5\-7B, Llama3\-8B, and Meditron3\-8B, and across multiple prediction tasks such as ICU mortality, phenotyping, 30\-day readmission, and 1\-year pancreatic cancer prediction\. Although the magnitude of improvement varies by task and model, MedTPE demonstrates stable or enhanced F1 scores as the input window grows, confirming its effectiveness for long\-context clinical inference\. These results reinforce the conclusion that MedTPE is a robust solution for compressing clinical input sequences and enabling scalable LLM performance under extended context settings\.

## Appendix K Test\-time Scaling Evaluation

In this experiment, we adopt majority voting as the test\-time scaling approach\. In majority voting, the model generates multiple independent responses for each input, and the final prediction is determined by the most frequent output \(Chen et al\., [2024b](https://arxiv.org/html/2605.11774#bib.bib42)\)\. Figure [6](https://arxiv.org/html/2605.11774#A11.F6) reports the improvement in the F1 score and the relative inference time for MedTPE, where the relative time is measured against the inference time of the original tokeniser for one round\. The number of independent responses is denoted by $n$ \($n=1,3,5$\)\. In most LLMs and tasks on the MIMIC\-IV and EHRSHOT datasets, MedTPE enables improvements in the F1 score over the original tokeniser as the number of majority voting responses increases, without increasing relative inference time\. This positive trend is observed for both small and large models, including Llama3\-1B, Qwen2\.5\-1\.5B, Qwen2\.5\-7B, Llama3\-8B, and Meditron3\-8B\. These results demonstrate that MedTPE can be effectively integrated with test\-time scaling strategies, delivering improved performance without incurring extra inference costs compared to original models\.
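Majority voting over $n$ sampled responses can be sketched in a few lines of Python. The helper names and the `generate` callable are hypothetical placeholders for whatever function samples one parsed prediction from the LLM; with binary labels and odd $n$ no ties occur, and otherwise this sketch resolves ties in favour of the response seen first.

```python
from collections import Counter


def majority_vote(responses):
    """Return the most frequent response; ties go to the earliest-seen value."""
    return Counter(responses).most_common(1)[0][0]


def predict_with_voting(generate, prompt, n=3):
    """Sample n independent responses and aggregate them by majority vote.

    generate: hypothetical callable mapping a prompt to one parsed prediction
    n:        number of independent responses (1, 3, or 5 in this appendix)
    """
    return majority_vote([generate(prompt) for _ in range(n)])


print(majority_vote(["yes", "no", "yes"]))  # yes
```

Since each response is generated independently, inference cost grows roughly linearly with $n$, which is why the relative inference time in Figure 6 is measured against a single round of the original tokeniser.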

![Refer to caption](https://arxiv.org/html/2605.11774v1/x9.png) \(a\) Llama3\-1B \(ICU Mortality\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x10.png) \(b\) Llama3\-1B \(Phenotyping\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x11.png) \(c\) Llama3\-1B \(30\-day Readmission\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x12.png) \(d\) Llama3\-1B \(1\-year Pancreatic Cancer\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x13.png) \(e\) Qwen2\.5\-1\.5B \(ICU Mortality\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x14.png) \(f\) Qwen2\.5\-1\.5B \(Phenotyping\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x15.png) \(g\) Qwen2\.5\-1\.5B \(30\-day Readmission\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x16.png) \(h\) Qwen2\.5\-1\.5B \(1\-year Pancreatic Cancer\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x17.png) \(i\) Qwen2\.5\-7B \(ICU Mortality\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x18.png) \(j\) Qwen2\.5\-7B \(Phenotyping\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x19.png) \(k\) Qwen2\.5\-7B \(30\-day Readmission\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x20.png) \(l\) Qwen2\.5\-7B \(1\-year Pancreatic Cancer\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x21.png) \(m\) Llama3\-8B \(ICU Mortality\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x22.png) \(n\) Llama3\-8B \(Phenotyping\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x23.png) \(o\) Llama3\-8B \(30\-day Readmission\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x24.png) \(p\) Llama3\-8B \(1\-year Pancreatic Cancer\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x25.png) \(q\) Meditron3\-8B \(ICU Mortality\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x26.png) \(r\) Meditron3\-8B \(Phenotyping\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x27.png) \(s\) Meditron3\-8B \(30\-day Readmission\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x28.png) \(t\) Meditron3\-8B \(1\-year Pancreatic Cancer\)

Figure 5: Context\-length robustness of MedTPE across LLMs and clinical tasks\. Each curve shows the mean F1 score \(solid line for the original tokeniser, dashed line for MedTPE\) with shaded areas indicating the 95% confidence interval, evaluated across 0 to 8,192 input tokens\. Token counts are measured using the original tokeniser for each LLM, ensuring that both models receive the same amount of information\. Subfigures \(a–d\) show results for Llama3\-1B, \(e–h\) for Qwen2\.5\-1\.5B, \(i–l\) for Qwen2\.5\-7B, \(m–p\) for Llama3\-8B, and \(q–t\) for Meditron3\-8B, each covering ICU mortality and phenotyping on MIMIC\-IV, as well as 30\-day readmission and 1\-year pancreatic cancer prediction on EHRSHOT\.

![Refer to caption](https://arxiv.org/html/2605.11774v1/x29.png) \(a\) Llama3\-1B \(ICU Mortality\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x30.png) \(b\) Llama3\-1B \(Phenotyping\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x31.png) \(c\) Llama3\-1B \(30\-day Readmission\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x32.png) \(d\) Llama3\-1B \(1\-year Pancreatic Cancer\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x33.png) \(e\) Qwen2\.5\-1\.5B \(ICU Mortality\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x34.png) \(f\) Qwen2\.5\-1\.5B \(Phenotyping\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x35.png) \(g\) Qwen2\.5\-1\.5B \(30\-day Readmission\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x36.png) \(h\) Qwen2\.5\-1\.5B \(1\-year Pancreatic Cancer\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x37.png) \(i\) Qwen2\.5\-7B \(ICU Mortality\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x38.png) \(j\) Qwen2\.5\-7B \(Phenotyping\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x39.png) \(k\) Qwen2\.5\-7B \(30\-day Readmission\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x40.png) \(l\) Qwen2\.5\-7B \(1\-year Pancreatic Cancer\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x41.png) \(m\) Llama3\-8B \(ICU Mortality\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x42.png) \(n\) Llama3\-8B \(Phenotyping\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x43.png) \(o\) Llama3\-8B \(30\-day Readmission\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x44.png) \(p\) Llama3\-8B \(1\-year Pancreatic Cancer\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x45.png) \(q\) Meditron3\-8B \(ICU Mortality\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x46.png) \(r\) Meditron3\-8B \(Phenotyping\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x47.png) \(s\) Meditron3\-8B \(30\-day Readmission\)
![Refer to caption](https://arxiv.org/html/2605.11774v1/x48.png) \(t\) Meditron3\-8B \(1\-year Pancreatic Cancer\)

Figure 6: Test\-time scaling performance of MedTPE\. Each point shows the improvement in F1 score relative to the original tokeniser \(y\-axis\) versus the relative inference time \(x\-axis\) for different numbers of responses \($n=1,3,5$\)\. Subfigures \(a–d\) show results for Llama3\-1B, \(e–h\) for Qwen2\.5\-1\.5B, \(i–l\) for Qwen2\.5\-7B, \(m–p\) for Llama3\-8B, and \(q–t\) for Meditron3\-8B, each covering ICU mortality and phenotyping on MIMIC\-IV, as well as 30\-day readmission and 1\-year pancreatic cancer prediction on EHRSHOT\.