tokenization

#tokenization

Adaptive Patching Is Harder Than It Looks For Time-Series Forecasting

arXiv cs.LG ↗ · 2026-06-04 Cached

This paper theoretically and empirically examines adaptive patching for time-series Transformers, deriving conditions under which content-adaptive tokenization should outperform tuned uniform patching. Controlled experiments on standard benchmarks show that a well-tuned uniform baseline is competitive with dynamic patching methods, challenging the assumed benefit of adaptive approaches.
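For readers unfamiliar with the baseline, uniform patching can be sketched as follows. This is a minimal illustration and not the paper's code: the function name `uniform_patches` and the parameters `patch_len` and `stride` are assumptions chosen for this example. An adaptive method would instead pick patch boundaries from the content of the series.

```python
def uniform_patches(series, patch_len, stride):
    """Split a 1-D sequence into fixed-length patches taken every `stride` steps.

    Hypothetical helper for illustration; trailing values that do not fill a
    whole patch are dropped.
    """
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    n_patches = (len(series) - patch_len) // stride + 1
    return [
        list(series[i * stride : i * stride + patch_len])
        for i in range(max(n_patches, 0))
    ]


patches = uniform_patches(list(range(10)), patch_len=4, stride=2)
print(patches)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Here `patch_len` and `stride` are the two knobs a tuned uniform baseline would search over.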

LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling

arXiv cs.CL ↗ · 2026-06-04 Cached

LDARNet is a 120M-parameter hierarchical genomic foundation model that introduces learnable adaptive tokenization (inspired by H-Net's dynamic chunking) for masked language modeling on DNA sequences. It achieves state-of-the-art results on 5 histone modification tasks and outperforms models up to 20× larger on several genomic benchmarks, with learned token boundaries aligning with biological features like promoter motifs and splice junctions.

@MaximeRivest: Tool calling in open source LLMs is wildly different from one model to another. I just whipped up: http://chattemplatepl…

X AI KOLs Following ↗ · 2026-06-03 Cached

A new web tool, Chat Template Playground, lets users visualize how different open-source LLMs render their chat templates, highlighting differences in prompting and tokenization.

MeshWeaver: Sparse-Voxel-Guided Surface Weaving for Autoregressive Mesh Generation

Hugging Face Daily Papers ↗ · 2026-06-03 Cached

MeshWeaver presents an autoregressive mesh generation framework that directly predicts vertices using a multi-level sparse-voxel encoder, achieving state-of-the-art compression and geometric fidelity for high-poly meshes.

Incremental BPE Tokenization

arXiv cs.CL ↗ · 2026-06-01 Cached

This paper introduces an incremental algorithm for Byte Pair Encoding (BPE) tokenization that processes each byte in O(log^2 t) time, enabling efficient partial tokenization in streaming settings and achieving speedups over existing implementations.
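As background, the conventional non-incremental procedure that streaming approaches aim to avoid rerunning can be sketched as below. This is a toy illustration only: it does not reproduce the paper's incremental algorithm, it works on characters rather than bytes for readability, and the merge table is made up for the example.

```python
def bpe_encode(text, merges):
    """Tokenize `text` by repeatedly applying the highest-priority merge.

    `merges` is an ordered list of symbol pairs; earlier pairs have higher
    priority. Toy character-level version for illustration.
    """
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    tokens = list(text)
    while len(tokens) > 1:
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge is left
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical merge table, highest priority first.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_encode("lower", merges))  # ['low', 'er']
```

Rerunning this from scratch each time a new byte arrives repeats work on the whole prefix, which is the cost an incremental method is meant to avoid.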

Agentic RL: Token-In, Token-Out Done Right (16 minute read)

TLDR AI ↗ · 2026-06-01 Cached

This article explains the 'Token-In, Token-Out' (TITO) invariant in reinforcement learning for LLMs, highlighting a common error when training multi-turn agents with tool calls. It presents two solutions: using per-model renderers or designing training to avoid re-encoding decoded tokens, emphasizing prefix-preserving chat templates.
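The re-encoding problem described above can be reproduced with a toy tokenizer. The sketch below is an assumption-laden illustration, not the article's code: the three-entry vocabulary and the greedy longest-match `encode` are invented for the example, and real tokenizers differ in detail while showing the same kind of mismatch.

```python
# Hypothetical toy vocabulary: "ab" exists as a single token alongside "a" and "b".
VOCAB = {"a": 0, "b": 1, "ab": 2}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def encode(text):
    """Greedy longest-match encoding over the toy vocabulary."""
    ids, i = [], 0
    while i < len(text):
        for length in (2, 1):
            piece = text[i : i + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i += len(piece)
                break
        else:
            raise ValueError(f"cannot encode {text[i]!r}")
    return ids


def decode(ids):
    """Concatenate the text of each token id."""
    return "".join(ID_TO_TEXT[token_id] for token_id in ids)


sampled = [0, 1]                     # the policy emitted "a" then "b"
reencoded = encode(decode(sampled))  # text round trip yields the single token "ab"
print(sampled, reencoded)            # [0, 1] [2]
```

If training scores `reencoded` instead of `sampled`, the trainer is evaluating tokens the policy never produced, which is why keeping the sampled token ids end to end matters.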

Continuity and Ordinality Matter: Constraining Time Series Tokens for Effective Time Series Analysis with Large Language Models

arXiv cs.LG ↗ · 2026-05-29 Cached

This paper proposes COM, a method that enforces continuity and ordinality constraints on time series token embeddings to improve the performance of token-based time series large language models.

@royvanrijn: For curious developers I built "The Anatomy of an LLM", an interactive explainer showing how text becomes tokens, vecto…

X AI KOLs Timeline ↗ · 2026-05-28 Cached

An interactive visual guide that explains how large language models work, from tokenization through attention, transformer blocks, and text generation, built by Roy van Rijn.

An In-Vitro Study on Cross-Lingual Generalization in Language Models

arXiv cs.CL ↗ · 2026-05-27 Cached

This paper introduces an in-vitro framework with two procedurally generated languages to study cross-lingual generalization in language models, finding that tokenization's preservation of reusable substructure is more critical than lexical similarity or data balance for transferring capabilities across languages.

BrickAnything: Geometry-Conditioned Buildable Brick Generation with Structure-Aware Tokenization

arXiv cs.AI ↗ · 2026-05-27 Cached

BrickAnything is an autoregressive framework that generates physically buildable brick structures from diverse 3D representations using point clouds and structure-aware tree tokenization, ensuring geometric fidelity and structural stability.

Do machines think or tokenize?

Reddit r/artificial ↗ · 2026-05-26

This paper introduces the SAPS (Synthetic Algorithmic Predictive Systems) framework, arguing that modern AI systems do not think but tokenize and compute statistical patterns, and clarifies the critical distinction between artificial and synthetic systems.

@gordic_aleksa: new in-depth blog post time: Inside the Transformer: The Life of a Token a deep dive into a modern dense transformer, i…

X AI KOLs Timeline ↗ · 2026-05-26 Cached

An in-depth blog post exploring the inner workings of modern dense transformers, covering topics such as YaRN for positional information, hybrid attention for long context lengths, soft capping, QK normalization, and transformer math including FLOPs/token formulas and cluster sizing.

Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan

Hugging Face Daily Papers ↗ · 2026-05-26 Cached

A deep learning framework is developed to analyze grammatical gender evolution from Latin to Romance languages, focusing on low-resource historical settings using lexical and contextual analysis.

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

Hugging Face Daily Papers ↗ · 2026-05-25 Cached

LLaVA-OneVision-2 introduces codec-stream tokenization and windowed attention for efficient video understanding, achieving state-of-the-art performance across multiple multimodal benchmarks including video, spatial, and tracking tasks.

@shabnam_774: https://x.com/shabnam_774/status/2058517919760355729

X AI KOLs Timeline ↗ · 2026-05-24 Cached

This article provides a comprehensive step-by-step breakdown of how modern Large Language Models like ChatGPT and Claude are built from scratch, covering data collection, tokenization, transformer architectures, training, alignment, and deployment.

@Tabbu_ai: https://x.com/Tabbu_ai/status/2058145123444347339

X AI KOLs Timeline ↗ · 2026-05-23 Cached

An educational thread explaining 11 key lessons for understanding and building LLM architectures from scratch, covering tokens, embeddings, attention, positional encoding, data quality, and common misconceptions.

AI agents are making tokenization platforms far more usable than I expected

Reddit r/AI_Agents ↗ · 2026-05-20

A developer shares how AI agents are improving tokenization platforms through intelligent orchestration of humans and systems, rather than full autonomy.

Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

arXiv cs.CL ↗ · 2026-05-19 Cached

This paper proposes a parameter-efficient vocabulary adaptation method for LLM-based text summarization in specialized domains, augmenting pretrained tokenizers with domain-specific tokens and selectively replacing under-trained ones to reduce training time by 35-55% and parameter counts by up to 37%.

@nemild: Y Combinator is hosting fintech happy hour on Thursday in New York City. Thinking of a startup in stablecoins, tokeniza…

X AI KOLs Timeline ↗ · 2026-05-18 Cached

Y Combinator is hosting a fintech happy hour on Thursday in New York City, inviting startups working on stablecoins, tokenization, financial AI, agentic commerce, and prediction markets.

Dywave: Event-Aligned Dynamic Tokenization for Heterogeneous IoT Sensing Signal

arXiv cs.LG ↗ · 2026-05-15 Cached

Dywave is a dynamic tokenization framework for IoT sensing signals that uses wavelet-based hierarchical decomposition to align tokens with semantic events, achieving up to 12% higher accuracy and 75% reduction in input token length on five real-world datasets.
