This paper introduces MedTPE, a method for efficient, lossless prompt compression of electronic health records for large language models, significantly reducing input token counts and inference latency in clinical prediction tasks.
This paper systematically derives compression-aware neural scaling laws by training nearly 1,300 models, demonstrating that the widely used heuristic of 20 tokens per parameter is an artifact of specific tokenizers. The authors propose a tokenizer-agnostic scaling law based on bytes, offering a new framework for compute-efficient training across diverse languages and modalities.
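As a back-of-the-envelope illustration of why a byte-based budget is tokenizer-agnostic, the sketch below converts the 20 tokens-per-parameter rule into a byte budget under two assumed bytes-per-token ratios; the ratios and the helper are illustrative, not the paper's fitted law.

```python
# Minimal sketch: the same "20 tokens per parameter" rule implies very
# different data budgets depending on the tokenizer's compression rate,
# which is the artifact the paper points at. Ratios below are assumptions.

def byte_budget(n_params: float, tokens_per_param: float, bytes_per_token: float) -> float:
    """Training-data budget in bytes implied by a tokens-per-parameter rule."""
    return n_params * tokens_per_param * bytes_per_token

# Example: a 1B-parameter model. GPT-style BPE tokenizers average roughly
# 4 bytes of English text per token; a tokenizer at ~3 bytes/token needs
# more tokens to cover the same underlying bytes.
for bpt in (3.0, 4.0):
    b = byte_budget(1e9, 20, bpt)
    print(f"{bpt} bytes/token -> {b / 1e9:.0f} GB of text")
```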
This research paper investigates the 'Text Uncanny Valley,' a phenomenon where LLM performance in information retrieval tasks degrades non-monotonically as word-boundary corruption increases. The authors propose a mode transition hypothesis to explain this U-shaped performance curve and demonstrate its relevance to real-world noisy text inputs.
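For concreteness, word-boundary corruption can be swept with a single noise rate, as in the toy function below; the paper's exact corruption procedure is not specified here, so this only illustrates the kind of noise being varied.

```python
# Toy word-boundary corruption: each space is independently deleted with
# probability p, fusing the neighboring words. Sweeping p from 0 to 1
# produces the kind of corruption axis the U-shaped curve is plotted over.

import random

def corrupt_boundaries(text: str, p: float, seed: int = 0) -> str:
    """Delete each word boundary (space) with probability p."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch == " " and rng.random() < p:
            continue  # drop this boundary
        out.append(ch)
    return "".join(out)

print(corrupt_boundaries("retrieve the patient discharge summary", 0.5))
```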
MiniMax published a technical blog post with an in-depth analysis of the systematic vocabulary degradation behind its M2-series large models' inability to output certain personal names. The post traces the failure to parameter shifts caused by a data-coverage gap between the pre-training and post-training stages, and proposes an effective remediation based on full-coverage synthetic data.
A blog post exploring how human typing habits like typos, shorthand, filler words, and whitespace affect token counts in OpenAI and Claude tokenizers, noting that common misspellings can inflate token usage and costs without changing meaning.
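This is easy to check yourself: OpenAI's tiktoken library reports per-string token counts, and the sketch below compares a word against a common misspelling (exact counts depend on the encoding chosen; cl100k_base is used here as an example).

```python
# Quick check of how misspellings inflate token counts.
# Requires: pip install tiktoken

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ("definitely", "definately", "though", "tho"):
    ids = enc.encode(word)
    print(f"{word!r}: {len(ids)} token(s) -> {ids}")
```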
This paper investigates how informal text (slang, emoji, Gen-Z filler tokens) degrades NLI accuracy in ELECTRA-small and RoBERTa-large models, identifying two distinct failure mechanisms—tokenization failure (emoji mapped to [UNK]) and distribution shift (out-of-domain noise tokens)—and proposes targeted mitigations that recover accuracy without harming clean-text performance.
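The first mechanism can be verified directly: the snippet below (using the standard Hugging Face checkpoints for these two model families) shows whether each tokenizer maps an emoji to [UNK].

```python
# Checking the tokenization-failure mechanism: does the tokenizer have
# any representation for an emoji, or does it fall back to [UNK]?
# Requires: pip install transformers

from transformers import AutoTokenizer

text = "this movie was great \U0001F602"  # ends with 'face with tears of joy'
for name in ("google/electra-small-discriminator", "roberta-large"):
    tok = AutoTokenizer.from_pretrained(name)
    print(name, "->", tok.tokenize(text))

# ELECTRA's WordPiece vocabulary typically yields [UNK] for the emoji,
# while RoBERTa's byte-level BPE decomposes it into byte tokens instead,
# which is why the two models fail through different mechanisms.
```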
Researchers from the University of Utah and CMU propose FragMend, an interpretability-based approach for vocabulary expansion in LLMs that addresses token over-fragmentation in non-Latin-script languages. Their method outperforms frequency-based vocabulary selection and baseline embedding initialization by ~20 points for several underrepresented languages.
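For context, the baseline embedding initialization in this line of work is commonly the mean of the new token's old subword embeddings; the sketch below shows that baseline only (FragMend's interpretability-based selection is not reproduced here), with hypothetical subword ids.

```python
# Mean-of-subwords initialization, a common baseline when adding a new
# whole-word token to an expanded vocabulary.

import torch

def init_new_token_embedding(emb: torch.nn.Embedding, subword_ids: list[int]) -> torch.Tensor:
    """Initialize a new token's embedding as the mean of its old subword pieces."""
    with torch.no_grad():
        return emb.weight[subword_ids].mean(dim=0)

# Usage with hypothetical ids for the fragments the old tokenizer produced:
emb = torch.nn.Embedding(32000, 768)
vec = init_new_token_embedding(emb, [1042, 7, 993])
print(vec.shape)  # torch.Size([768])
```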
This paper introduces Triadic Suffix Tokenization (TST), a deterministic tokenization scheme that partitions digits into three-digit triads with explicit magnitude markers to improve numerical reasoning in large language models. The method addresses inconsistent number fragmentation in standard tokenizers by providing transparent order-of-magnitude relationships at the token level, with two implementation variants offering scalable vocabulary expansion.
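One plausible rendering of the triad idea is sketched below; the marker names (<E0>, <E1>, ...) are made up for illustration, and the paper's actual marker format and its two implementation variants may differ.

```python
# Hypothetical illustration of triad-plus-magnitude tokenization: digits
# are grouped into 3-digit triads, each tagged with an explicit marker
# <Ej> meaning "scaled by 1000**j", so magnitude is visible per token.

def triadic_tokens(n: int) -> list[str]:
    """Split an integer's digits into 3-digit triads tagged with magnitude markers."""
    s = str(n)
    s = s.zfill((len(s) + 2) // 3 * 3)  # pad so the leading group is a full triad
    triads = [s[i:i + 3] for i in range(0, len(s), 3)]
    k = len(triads)
    return [t + f"<E{k - 1 - i}>" for i, t in enumerate(triads)]

print(triadic_tokens(1234567))  # ['001<E2>', '234<E1>', '567<E0>']
```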
This paper demonstrates that training large language models with stochastic tokenization instead of deterministic canonical tokenization significantly improves robustness to adversarial attacks and random perturbations, with improvements shown across pre-training, fine-tuning, and in-context learning without increasing inference costs.
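One standard way to realize stochastic tokenization is SentencePiece's subword-regularization sampling; the sketch below assumes a trained model file "spm.model" and may differ from the paper's exact sampling scheme.

```python
# Stochastic vs. canonical segmentation with SentencePiece sampling
# (Kudo-style subword regularization). Requires: pip install sentencepiece
# and a trained model file, assumed here to be "spm.model".

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.model")
text = "stochastic tokenization improves robustness"

# Canonical (deterministic) segmentation:
print(sp.encode(text, out_type=str))

# Sampled segmentations differ run to run: alpha is the sampling
# temperature, and nbest_size=-1 samples from the full lattice.
for _ in range(3):
    print(sp.encode(text, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1))
```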
This paper investigates how 1D coarse-to-fine token structures in autoregressive models improve test-time search efficiency compared to classical 2D grid tokenization. The authors show that such ordered tokens enable better test-time scaling and even training-free text-to-image generation when guided by image-text verifiers.