This paper introduces MedTPE, a method for efficient, lossless prompt compression of electronic health records for large language models, significantly reducing input token counts and inference latency in clinical prediction tasks.
This paper systematically derives compression-aware neural scaling laws by training nearly 1,300 models, demonstrating that the widely used heuristic of 20 tokens per parameter is an artifact of specific tokenizers. The authors propose a tokenizer-agnostic scaling law based on bytes, offering a new framework for compute-efficient training across diverse languages and modalities.
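As a back-of-the-envelope illustration of why a byte-based budget is tokenizer-agnostic, the sketch below converts the 20 tokens-per-parameter rule into a byte budget under two assumed bytes-per-token ratios; the ratios and the helper are illustrative, not the paper's fitted law.

```python
# Minimal sketch: the same "20 tokens per parameter" rule implies very
# different data budgets depending on the tokenizer's compression rate,
# which is the artifact the paper points at. Ratios below are assumptions.

def byte_budget(n_params: float, tokens_per_param: float, bytes_per_token: float) -> float:
    """Training-data budget in bytes implied by a tokens-per-parameter rule."""
    return n_params * tokens_per_param * bytes_per_token

# Example: a 1B-parameter model. GPT-style BPE tokenizers average roughly
# 4 bytes of English text per token; a tokenizer at ~3 bytes/token needs
# more tokens to cover the same underlying bytes.
for bpt in (3.0, 4.0):
    b = byte_budget(1e9, 20, bpt)
    print(f"{bpt} bytes/token -> {b / 1e9:.0f} GB of text")
```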
This research paper investigates the 'Text Uncanny Valley,' a phenomenon where LLM performance in information retrieval tasks degrades non-monotonically as word-boundary corruption increases. The authors propose a mode transition hypothesis to explain this U-shaped performance curve and demonstrate its relevance to real-world noisy text inputs.
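For concreteness, word-boundary corruption can be swept with a single noise rate, as in the toy function below; the paper's exact corruption procedure is not specified here, so this only illustrates the kind of noise being varied.

```python
# Toy word-boundary corruption: each space is independently deleted with
# probability p, fusing the neighboring words. Sweeping p from 0 to 1
# produces the kind of corruption axis the U-shaped curve is plotted over.

import random

def corrupt_boundaries(text: str, p: float, seed: int = 0) -> str:
    """Delete each word boundary (space) with probability p."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch == " " and rng.random() < p:
            continue  # drop this boundary
        out.append(ch)
    return "".join(out)

print(corrupt_boundaries("retrieve the patient discharge summary", 0.5))
```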
MiniMax published a technical blog post with an in-depth analysis of the systematic vocabulary degradation behind its M2-series large models' inability to output certain personal names. The post traces the failure to parameter shifts caused by a data-coverage gap between the pre-training and post-training stages, and proposes an effective remediation based on full-coverage synthetic data.
A blog post exploring how human typing habits like typos, shorthand, filler words, and whitespace affect token counts in OpenAI and Claude tokenizers, noting that common misspellings can inflate token usage and costs without changing meaning.
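This is easy to check yourself: OpenAI's tiktoken library reports per-string token counts, and the sketch below compares a word against a common misspelling (exact counts depend on the encoding chosen; cl100k_base is used here as an example).

```python
# Quick check of how misspellings inflate token counts.
# Requires: pip install tiktoken

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ("definitely", "definately", "though", "tho"):
    ids = enc.encode(word)
    print(f"{word!r}: {len(ids)} token(s) -> {ids}")
```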
This paper investigates how informal text (slang, emoji, Gen-Z filler tokens) degrades NLI accuracy in ELECTRA-small and RoBERTa-large models, identifying two distinct failure mechanisms—tokenization failure (emoji mapped to [UNK]) and distribution shift (out-of-domain noise tokens)—and proposes targeted mitigations that recover accuracy without harming clean-text performance.
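The first mechanism can be verified directly: the snippet below (using the standard Hugging Face checkpoints for these two model families) shows whether each tokenizer maps an emoji to [UNK].

```python
# Checking the tokenization-failure mechanism: does the tokenizer have
# any representation for an emoji, or does it fall back to [UNK]?
# Requires: pip install transformers

from transformers import AutoTokenizer

text = "this movie was great \U0001F602"  # ends with 'face with tears of joy'
for name in ("google/electra-small-discriminator", "roberta-large"):
    tok = AutoTokenizer.from_pretrained(name)
    print(name, "->", tok.tokenize(text))

# ELECTRA's WordPiece vocabulary typically yields [UNK] for the emoji,
# while RoBERTa's byte-level BPE decomposes it into byte tokens instead,
# which is why the two models fail through different mechanisms.
```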
Researchers from the University of Utah and CMU propose FragMend, an interpretability-based approach for vocabulary expansion in LLMs that addresses token over-fragmentation in non-Latin-script languages. Their method outperforms frequency-based vocabulary selection and baseline embedding initialization by ~20 points for several underrepresented languages.
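For context, the baseline embedding initialization in this line of work is commonly the mean of the new token's old subword embeddings; the sketch below shows that baseline only (FragMend's interpretability-based selection is not reproduced here), with hypothetical subword ids.

```python
# Mean-of-subwords initialization, a common baseline when adding a new
# whole-word token to an expanded vocabulary.

import torch

def init_new_token_embedding(emb: torch.nn.Embedding, subword_ids: list[int]) -> torch.Tensor:
    """Initialize a new token's embedding as the mean of its old subword pieces."""
    with torch.no_grad():
        return emb.weight[subword_ids].mean(dim=0)

# Usage with hypothetical ids for the fragments the old tokenizer produced:
emb = torch.nn.Embedding(32000, 768)
vec = init_new_token_embedding(emb, [1042, 7, 993])
print(vec.shape)  # torch.Size([768])
```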
This paper introduces Triadic Suffix Tokenization (TST), a deterministic tokenization scheme that partitions digits into three-digit triads with explicit magnitude markers to improve numerical reasoning in large language models. The method addresses inconsistent number fragmentation in standard tokenizers by providing transparent order-of-magnitude relationships at the token level, with two implementation variants offering scalable vocabulary expansion.
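One plausible rendering of the triad idea is sketched below; the marker names (<E0>, <E1>, ...) are made up for illustration, and the paper's actual marker format and its two implementation variants may differ.

```python
# Hypothetical illustration of triad-plus-magnitude tokenization: digits
# are grouped into 3-digit triads, each tagged with an explicit marker
# <Ej> meaning "scaled by 1000**j", so magnitude is visible per token.

def triadic_tokens(n: int) -> list[str]:
    """Split an integer's digits into 3-digit triads tagged with magnitude markers."""
    s = str(n)
    s = s.zfill((len(s) + 2) // 3 * 3)  # pad so the leading group is a full triad
    triads = [s[i:i + 3] for i in range(0, len(s), 3)]
    k = len(triads)
    return [t + f"<E{k - 1 - i}>" for i, t in enumerate(triads)]

print(triadic_tokens(1234567))  # ['001<E2>', '234<E1>', '567<E0>']
```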
This paper demonstrates that training large language models with stochastic tokenization instead of deterministic canonical tokenization significantly improves robustness to adversarial attacks and random perturbations, with improvements shown across pre-training, fine-tuning, and in-context learning without increasing inference costs.
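One standard way to realize stochastic tokenization is SentencePiece's subword-regularization sampling; the sketch below assumes a trained model file "spm.model" and may differ from the paper's exact sampling scheme.

```python
# Stochastic vs. canonical segmentation with SentencePiece sampling
# (Kudo-style subword regularization). Requires: pip install sentencepiece
# and a trained model file, assumed here to be "spm.model".

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.model")
text = "stochastic tokenization improves robustness"

# Canonical (deterministic) segmentation:
print(sp.encode(text, out_type=str))

# Sampled segmentations differ run to run: alpha is the sampling
# temperature, and nbest_size=-1 samples from the full lattice.
for _ in range(3):
    print(sp.encode(text, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1))
```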
This paper investigates how 1D coarse-to-fine token structures in autoregressive models improve test-time search efficiency compared to classical 2D grid tokenization. The authors show that such ordered tokens enable better test-time scaling and even training-free text-to-image generation when guided by image-text verifiers.