Tag
This paper theoretically and empirically examines adaptive patching for time-series Transformers, deriving conditions under which content-adaptive tokenization should outperform tuned uniform patching. Controlled experiments on standard benchmarks show that a well-tuned uniform baseline is competitive with dynamic patching methods, challenging the assumed benefit of adaptive approaches.
LDARNet is a 120M-parameter hierarchical genomic foundation model that introduces learnable adaptive tokenization (inspired by H-Net's dynamic chunking) for masked language modeling on DNA sequences. It achieves state-of-the-art results on 5 histone modification tasks and outperforms models up to 20× larger on several genomic benchmarks, with learned token boundaries aligning with biological features like promoter motifs and splice junctions.
A new web tool, Chat Template Playground, lets users visualize how different open-source LLMs render their chat templates, highlighting differences in prompting and tokenization.
MeshWeaver presents an autoregressive mesh generation framework that directly predicts vertices using a multi-level sparse-voxel encoder, achieving state-of-the-art compression and geometric fidelity for high-poly meshes.
This paper introduces an incremental algorithm for Byte Pair Encoding (BPE) tokenization that processes each byte in O(log^2 t) time, enabling efficient partial tokenization in streaming settings and achieving speedups over existing implementations.
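For context, the sketch below shows a naive, non-incremental BPE encoder. It is not the paper's algorithm, only the baseline that an incremental method avoids: in a streaming setting, this function would have to be re-run from scratch every time a byte arrives. The function name `bpe_encode` and the tiny merge table are hypothetical and chosen for illustration.

```python
def bpe_encode(data: bytes, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Naive BPE: repeatedly apply the highest-priority merge present in the sequence."""
    # Earlier entries in the merge table have higher priority (lower rank).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    tokens = [bytes([b]) for b in data]  # start from single bytes

    while len(tokens) > 1:
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge remains

        # Merge every non-overlapping occurrence of the best pair, left to right.
        merged = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged

    return tokens


# Hypothetical merge table, for illustration only.
MERGES = [(b"a", b"b"), (b"ab", b"c")]
result = bpe_encode(b"abcab", MERGES)
```

Each pass of the outer loop scans the whole sequence, so re-encoding after every appended byte repeats work over the entire prefix. That repeated cost is what a per-byte incremental algorithm is meant to remove.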
This article explains the 'Token-In, Token-Out' (TITO) invariant in reinforcement learning for LLMs, highlighting a common error when training multi-turn agents with tool calls. It presents two solutions: using per-model renderers or designing training to avoid re-encoding decoded tokens, emphasizing prefix-preserving chat templates.
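The re-encoding error can be illustrated with a toy tokenizer. The sketch below is an assumption-laden illustration, not the article's code: the three-entry vocabulary and the greedy longest-match `encode` function are hypothetical stand-ins for a real tokenizer. It shows that decoding sampled token IDs to text and encoding that text again need not return the same IDs.

```python
# Hypothetical toy vocabulary; real tokenizers are far larger.
VOCAB = {0: "a", 1: "b", 2: "ab"}
TOKEN_IDS = {text: idx for idx, text in VOCAB.items()}


def decode(ids: list[int]) -> str:
    """Concatenate the text of each token ID."""
    return "".join(VOCAB[i] for i in ids)


def encode(text: str) -> list[int]:
    """Greedy longest-match encoding over the toy vocabulary."""
    pieces = sorted(TOKEN_IDS, key=len, reverse=True)  # try longest pieces first
    ids = []
    pos = 0
    while pos < len(text):
        for piece in pieces:
            if text.startswith(piece, pos):
                ids.append(TOKEN_IDS[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"cannot encode character at position {pos}")
    return ids


sampled = [0, 1]           # the model sampled "a" and then "b" as separate tokens
text = decode(sampled)     # both tokens collapse into the string "ab"
reencoded = encode(text)   # the tokenizer prefers the single token "ab"
round_trip_ok = reencoded == sampled
```

Here `round_trip_ok` is false: the trainer would compute log-probabilities for a token sequence the policy never produced. Keeping the sampled IDs and passing them through unchanged, rather than re-encoding decoded text, avoids the mismatch.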
This paper proposes COM, a method that enforces continuity and ordinality constraints on time series token embeddings to improve the performance of token-based time series large language models.
An interactive visual guide that explains how large language models work, from tokenization through attention, transformer blocks, and text generation, built by Roy van Rijn.
This paper introduces an in-vitro framework with two procedurally generated languages to study cross-lingual generalization in language models, finding that tokenization's preservation of reusable substructure is more critical than lexical similarity or data balance for transferring capabilities across languages.
BrickAnything is an autoregressive framework that generates physically buildable brick structures from diverse 3D representations using point clouds and structure-aware tree tokenization, ensuring geometric fidelity and structural stability.
This paper introduces the SAPS (Synthetic Algorithmic Predictive Systems) framework, which argues that modern AI systems do not think but rather tokenize and compute statistical patterns, and which clarifies the critical distinction between artificial and synthetic systems.
An in-depth blog post exploring the inner workings of modern dense transformers, covering topics such as YaRN for positional information, hybrid attention for long context lengths, soft capping, QK normalization, and transformer math including FLOPs/token formulas and cluster sizing.
A deep learning framework is developed to analyze grammatical gender evolution from Latin to Romance languages, focusing on low-resource historical settings using lexical and contextual analysis.
LLaVA-OneVision-2 introduces codec-stream tokenization and windowed attention for efficient video understanding, achieving state-of-the-art performance across multiple multimodal benchmarks including video, spatial, and tracking tasks.
This article provides a comprehensive step-by-step breakdown of how modern Large Language Models like ChatGPT and Claude are built from scratch, covering data collection, tokenization, transformer architectures, training, alignment, and deployment.
An educational thread explaining 11 key lessons for understanding and building LLM architectures from scratch, covering tokens, embeddings, attention, positional encoding, data quality, and common misconceptions.
A developer shares how AI agents are improving tokenization platforms through intelligent orchestration of humans and systems, rather than full autonomy.
This paper proposes a parameter-efficient vocabulary adaptation method for LLM-based text summarization in specialized domains, augmenting pretrained tokenizers with domain-specific tokens and selectively replacing under-trained ones to reduce training time by 35-55% and parameter counts by up to 37%.
Y Combinator is hosting a fintech happy hour on Thursday in New York City, inviting startups working on stablecoins, tokenization, financial AI, agentic commerce, and prediction markets.
Dywave is a dynamic tokenization framework for IoT sensing signals that uses wavelet-based hierarchical decomposition to align tokens with semantic events, achieving up to 12% higher accuracy and 75% reduction in input token length on five real-world datasets.