tokenization

#tokenization

The African Language Tax: Quantifying the Cost, Latency, and Context Penalty of Tokenizing African Languages in Frontier LLMs

arXiv cs.CL · 4d ago

This paper systematically quantifies the tokenization penalty for 20 African languages across 11 frontier and open tokenizers, finding up to 8.9× inference cost and latency multipliers and as little as 11% effective context window compared to English, highlighting a structural digital divide encoded in subword vocabularies.
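The two headline figures can be related with simple arithmetic. The sketch below is illustrative only and is not the paper's method: it assumes cost and latency scale linearly with token count, and that the usable context shrinks by the reciprocal of the token multiplier. The function names and the sample token counts are hypothetical.

```python
def token_multiplier(tokens_target: int, tokens_english: int) -> float:
    """Ratio of tokens needed for the same content in a target language vs. English."""
    if tokens_english <= 0:
        raise ValueError("tokens_english must be positive")
    return tokens_target / tokens_english


def effective_context_fraction(multiplier: float) -> float:
    """Share of the context window left for content, relative to English.

    Assumption: a fixed token budget holds 1/multiplier as much content.
    """
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return 1.0 / multiplier


if __name__ == "__main__":
    # Hypothetical counts: 890 tokens in the target language vs. 100 in English.
    m = token_multiplier(890, 100)
    print(f"token multiplier: {m:.1f}x")
    print(f"effective context: {effective_context_fraction(m):.0%} of English")
```

Under these assumptions an 8.9× multiplier corresponds to roughly 11% of the English context window, which is consistent with the two numbers quoted in the summary.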

Best Preprocessing Techniques for Sentiment Analysis

arXiv cs.CL · 4d ago

This paper systematically investigates the optimal order of preprocessing techniques for sentiment analysis on Twitter data, finding that tokenisation is most impactful and spelling correction least, with the best order being tokenisation, cleaning, stemming, then stopword removal.

QuechuaTok: Morphological Boundary Accuracy as a Necessary Metric for Tokenizer Evaluation in Agglutinative Low-Resource Languages

arXiv cs.CL · 4d ago

This paper presents QuechuaTok, a benchmark for evaluating tokenization strategies for Southern Quechua, and introduces Morphological Boundary Accuracy (MorphAcc) as a necessary metric. It shows that BPE achieves low fertility but poor morphological accuracy, while a morphology-aware PRPE tokenizer achieves 83% MorphAcc, demonstrating that fertility rate alone is insufficient for agglutinative languages.

TOTEN: Knowledge-Based Ontological Tokenization of Physical Quantities and Technical Notation in Brazilian Portuguese

arXiv cs.AI · 2026-06-20

TOTEN is a knowledge-based ontological tokenization framework that replaces statistical tokenization with declarative classification grounded in a formal ontology of engineering entities, achieving high ontological atomicity and numerical reconstruction for physical quantities and technical notation in Brazilian Portuguese.

BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language

Hugging Face Daily Papers · 2026-06-20

BioMatrix is a multimodal foundation model that unifies molecular sequences, structures, and natural language in a single decoder-only architecture, achieving state-of-the-art performance on 77 out of 80 biological tasks.

Hallucinations = Imagination

Reddit r/ArtificialInteligence · 2026-06-18

A developer working on an AI agent wrapper observes that the agent's hallucinations of user responses can actually aid problem-solving, and proposes treating such hallucinations as imagined events rather than errors.

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

arXiv cs.CL · 2026-06-18

This paper introduces CADE, a framework for time-series question answering that maps each timestep directly into the LLM embedding space and uses a one-directional supervised contrastive loss to align time-series representations with frozen text anchors, outperforming existing baselines on the Time-MQA benchmark.

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

arXiv cs.CL · 2026-06-18

This paper presents Morpheus, a neural tokenizer and word embedder for Turkish that learns morpheme boundaries without string normalization, achieving lossless tokenization and competitive embeddings for lexical retrieval, while using less GPU memory than subword tokenizers.

Emergent retokenization symmetry in large language models: phenomenology and applications

arXiv cs.CL · 2026-06-16

This paper discovers that large language models partially exhibit emergent symmetry under retokenization—replacing a prompt's canonical tokenization with an alternative valid segmentation while preserving bytes exactly. The authors use this phenomenon to probe compositional understanding and propose retokenization as a novel inference-time sampling strategy that can recover solutions not found by conventional temperature sampling.

PACUTE: Phonology-, Affix-, and Character-level Understanding of Tokens for Filipino

arXiv cs.CL · 2026-06-16

Introduces PACUTE, a diagnostic benchmark of 4,600 tasks evaluating morphological understanding in Filipino, revealing that even frontier models struggle with morpheme decomposition and productive morphological composition.

Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models

arXiv cs.CL · 2026-06-16

This paper systematically compares equitable tokenizers for multilingual LLMs across 11 Southeast Asian languages, finding that Parity-aware BPE achieves the best efficiency-equity trade-off and that cross-lingual fairness and tokenization efficiency are not fundamentally at odds.

Byte-level models

Reddit r/LocalLLaMA · 2026-06-15

Discusses whether byte-level tokenizers outperform subword tokenizers for precise tasks like distinguishing similar names, counting characters, and case sensitivity, and asks for current recommendations.

@freeman1266: You don't need math to understand most AI papers—just understand this chain: token → embedding → position encoding → attention → FFN → residual stream → next-token prediction. LLMs essentially stack Transf…

X AI KOLs Timeline · 2026-06-15

A Chinese-language explainer tweet that walks through the core LLM pipeline (token, embedding, position encoding, attention, FFN, residual stream, next-token prediction) so that readers without a math background can follow AI papers.

@CamilleRoux: A well-made explanation of the inner workings of LLMs: tokens, embeddings, positional encoding, attention, fee…

X AI KOLs Timeline · 2026-06-14

This tweet shares a well-made explanation of the internal workings of LLMs, covering tokens, embeddings, positional encoding, attention, and feed-forward networks, via a blog post by 0xkato.

Finding Optimal Tokenizers

Hacker News Top · 2026-06-11

This blog post presents an algorithm using integer linear programming to compute optimal tokenizers for language models, drawing parallels to solving the Traveling Salesman Problem. It notes that while the result is theoretically interesting, practical tokenizers are already near-optimal and the method may not generalize well.

Visa and OpenAI Let AI Agents Shop on Your Behalf Using Visa's Global Network

Reddit r/artificial · 2026-06-11

Visa and OpenAI partner to enable AI agents to make purchases on users' behalf using tokenized Visa credentials, with user-controlled spending limits and fraud monitoring, backed by Microsoft, IBM, Anthropic, Samsung, and Stripe.

@pallavishekhar_: Learn LLM internals step by step - from tokenization to attention to inference optimization - BPE - Tokenization - Tran…

X AI KOLs Timeline · 2026-06-09

A tweet promoting a resource for learning LLM internals step by step, covering tokenization, attention, and optimization techniques.

@Potatoloogs: How LLMs Actually Work Inside: From Token to Next-Token – A Complete Overview of Nine Core Mechanisms a) Tokenization: The model doesn't read text, it reads integers · Text is first split into subword pieces, then mapped to integer IDs; modern LLM vocabularies typically have tens of thousands to...

X AI KOLs Timeline · 2026-06-08

This article outlines nine core mechanisms inside modern LLMs, from tokenization through embedding, positional encoding, attention, multi-head attention, and feed-forward networks to next-token prediction, and compares architectural differences between models.
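The first step described in the post, splitting text into subword pieces and mapping them to integer IDs, can be sketched as follows. This is a toy illustration, not the article's code: the vocabulary, the `UNK_ID` fallback, and the greedy longest-match rule are all assumptions made for the example, and production tokenizers such as BPE use learned merge rules instead.

```python
# Hypothetical toy vocabulary: subword piece -> integer ID.
VOCAB = {"token": 0, "ization": 1, "un": 2, "believ": 3, "able": 4}
UNK_ID = 5  # assumed fallback ID for any character no piece covers


def encode(text: str, vocab: dict[str, int], unk_id: int = UNK_ID) -> list[int]:
    """Greedy longest-match segmentation of text into vocabulary IDs."""
    max_len = max(len(piece) for piece in vocab)
    ids: list[int] = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # No piece matched at this position: emit UNK and skip one character.
            ids.append(unk_id)
            i += 1
    return ids


if __name__ == "__main__":
    print(encode("tokenization", VOCAB))
    print(encode("unbelievable", VOCAB))
```

The model only ever sees the resulting list of integers, which is the point the post makes with "the model doesn't read text, it reads integers".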

How LLMs Actually Work

Lobsters Hottest · 2026-06-07

An in-depth walkthrough of how modern LLMs work, covering core mechanisms from tokenization to next-token prediction, without heavy math.

Balancing Image Compression and Generation with Bootstrapped Tokenization

arXiv cs.LG · 2026-06-05

Introduces SelfBootTok, a self-bootstrapped tokenization method that separates global and local information, reducing generator computation by ~40% and achieving a new state-of-the-art gFID of 1.56 with only 64 tokens.
