Tag
This paper systematically quantifies the tokenization penalty for 20 African languages across 11 frontier and open tokenizers, finding inference cost and latency multipliers of up to 8.9× and an effective context window as small as 11% of that available for English, highlighting a structural digital divide encoded in subword vocabularies.
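As a rough illustration of how the two headline figures relate, the sketch below assumes that cost scales linearly with token count and that the effective context window is simply the reciprocal of the token ratio. The summary does not state how the paper derives its numbers, so the function names and the token counts here are hypothetical.

```python
def token_ratio(target_tokens: int, english_tokens: int) -> float:
    """Tokens needed for the target-language text per token of the English equivalent."""
    return target_tokens / english_tokens


def effective_context_fraction(ratio: float) -> float:
    """Share of a fixed token budget that still holds the same amount of content.

    Assumption: content per token shrinks in proportion to the token ratio.
    """
    return 1.0 / ratio


# Hypothetical counts chosen to reproduce the reported 8.9x multiplier.
ratio = token_ratio(target_tokens=890, english_tokens=100)
fraction = effective_context_fraction(ratio)

print(f"token ratio: {ratio:.1f}x")
print(f"effective context: {fraction:.0%} of the English window")
```

Under these assumptions an 8.9× multiplier and an 11% effective context window are two views of the same ratio.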
This paper systematically investigates the optimal order of preprocessing techniques for sentiment analysis on Twitter data, finding that tokenisation is most impactful and spelling correction least, with the best order being tokenisation, cleaning, stemming, then stopword removal.
This paper presents QuechuaTok, a benchmark for evaluating tokenization strategies for Southern Quechua, and introduces Morphological Boundary Accuracy (MorphAcc) as a necessary metric. It shows that BPE achieves low fertility but poor morphological accuracy, while a morphology-aware PRPE tokenizer achieves 83% MorphAcc, demonstrating that fertility rate alone is insufficient for agglutinative languages.
TOTEN is a knowledge-based ontological tokenization framework that replaces statistical tokenization with declarative classification grounded in a formal ontology of engineering entities, achieving high ontological atomicity and numerical reconstruction for physical quantities and technical notation in Brazilian Portuguese.
BioMatrix is a multimodal foundation model that unifies molecular sequences, structures, and natural language in a single decoder-only architecture, achieving state-of-the-art performance on 77 out of 80 biological tasks.
A developer working on an AI agent wrapper observes that the agent's hallucinations of user responses can actually aid problem-solving, and proposes treating such hallucinations as imagined events rather than errors.
This paper introduces CADE, a framework for time-series question answering that maps each timestep directly into the LLM embedding space and uses a one-directional supervised contrastive loss to align time-series representations with frozen text anchors, outperforming existing baselines on the Time-MQA benchmark.
This paper presents Morpheus, a neural tokenizer and word embedder for Turkish that learns morpheme boundaries without string normalization, achieving lossless tokenization and competitive embeddings for lexical retrieval, while using less GPU memory than subword tokenizers.
This paper discovers that large language models partially exhibit emergent symmetry under retokenization—replacing a prompt's canonical tokenization with an alternative valid segmentation while preserving bytes exactly. The authors use this phenomenon to probe compositional understanding and propose retokenization as a novel inference-time sampling strategy that can recover solutions not found by conventional temperature sampling.
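To make the idea of an alternative valid segmentation concrete, here is a minimal sketch that enumerates every way to split a string into pieces from a toy vocabulary. The vocabulary and the `segmentations` helper are hypothetical, and characters stand in for bytes; this is not the paper's procedure, only an illustration that many token sequences can decode to exactly the same text.

```python
def segmentations(text: str, vocab: set[str]) -> list[list[str]]:
    """Return every way to split `text` into consecutive pieces drawn from `vocab`."""
    if not text:
        return [[]]  # one way to segment the empty string: no pieces
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            # Keep this piece and segment the remainder recursively.
            for rest in segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results


# Toy vocabulary (an assumption for illustration only).
TOY_VOCAB = {"t", "o", "k", "e", "n", "to", "ken", "token"}

alternatives = segmentations("token", TOY_VOCAB)
for pieces in alternatives:
    # Every segmentation concatenates back to the original string.
    print(pieces, "->", "".join(pieces))
```

A tokenizer normally emits only one of these sequences (the canonical one); retokenization feeds the model one of the others.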
Introduces PACUTE, a diagnostic benchmark of 4,600 tasks evaluating morphological understanding in Filipino, revealing that even frontier models struggle with morpheme decomposition and productive morphological composition.
This paper systematically compares equitable tokenizers for multilingual LLMs across 11 Southeast Asian languages, finding that Parity-aware BPE achieves the best efficiency-equity trade-off and that cross-lingual fairness and tokenization efficiency are not fundamentally at odds.
Discusses whether byte-level tokenizers outperform subword tokenizers for precise tasks like distinguishing similar names, counting characters, and case sensitivity, and asks for current recommendations.
A Chinese-language science explainer tweet that walks intuitively through the core chain of an LLM (large language model): tokens, embeddings, positional encoding, attention, FFN, the residual stream, and next-token prediction, helping readers without a math background follow AI papers.
This tweet shares a well-made explanation of the internal workings of LLMs, covering tokens, embeddings, positional encoding, attention, and feed-forward networks, via a blog post by 0xkato.
This blog post presents an algorithm using integer linear programming to compute optimal tokenizers for language models, drawing parallels to solving the Traveling Salesman Problem. It notes that while the result is theoretically interesting, practical tokenizers are already near-optimal and the method may not generalize well.
Visa and OpenAI partner to enable AI agents to make purchases on users' behalf using tokenized Visa credentials, with user-controlled spending limits and fraud monitoring, backed by Microsoft, IBM, Anthropic, Samsung, and Stripe.
A tweet promoting a resource for learning LLM internals step by step, covering tokenization, attention, and optimization techniques.
This article systematically outlines nine core mechanisms inside modern LLMs, from tokenization to next-token prediction, including embedding, positional encoding, attention, multi-head attention, and feed-forward networks, and compares architectural differences across models.
An in-depth walkthrough of how modern LLMs work, covering core mechanisms from tokenization to next-token prediction, without heavy math.
Introduces SelfBootTok, a self-bootstrapped tokenization method that separates global and local information, reducing generator computation by ~40% and achieving a new state-of-the-art gFID of 1.56 with only 64 tokens.