Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
Summary
This paper introduces Scratchpad Patching, a technique for tokenizer-free language models that decouples compute from patch size by dynamically refreshing context within patches to reduce patch lag.
Source: https://huggingface.co/papers/2605.09630
Abstract
Tokenizer-free language models using patch-based approaches face a trade-off between compute efficiency and modeling quality due to patch lag, which Scratchpad Patching addresses by dynamically refreshing context within patches based on prediction entropy.
Tokenizer-free language models eliminate the tokenizer step of the language modeling pipeline by operating directly on bytes; patch-based variants further aggregate contiguous byte spans into patches for efficiency. However, the average patch size chosen at the model design stage governs a tight trade-off: larger patches reduce compute and KV-cache footprint, but degrade modeling quality. We trace this trade-off to patch lag: until a patch is fully observed, byte predictions within it must rely on a stale representation from the previous patch to preserve causality; this lag widens as patches grow larger. We introduce Scratchpad Patching (SP), which inserts transient scratchpads inside each patch to aggregate the bytes seen so far and refresh patch-level context for subsequent predictions. SP triggers scratchpads using next-byte prediction entropy, selectively allocating compute to information-dense regions and enabling post-hoc adjustment of inference-time compute. Across experiments on natural language and code, SP improves model quality at the same patch size; for example, even at 16 bytes per patch, SP-augmented models match or closely approach the byte-level baseline on downstream evaluations while using a 16× smaller KV cache over patches and 3-4× less inference compute.
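The entropy-based trigger described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not give the exact rule, so the fixed threshold, the fixed patch size, and the function names `entropy_bits` and `scratchpad_positions` are all assumptions made for illustration. The sketch only shows the general idea that a scratchpad is placed inside a patch wherever the next-byte prediction is uncertain.

```python
import math
from typing import List, Sequence


def entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy, in bits, of a next-byte probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def scratchpad_positions(
    entropies: Sequence[float], patch_size: int, threshold: float
) -> List[int]:
    """Return byte indices after which a scratchpad would be inserted.

    Assumption (not from the paper): a scratchpad is triggered when the
    next-byte entropy at a position exceeds a fixed threshold. The last
    byte of each fixed-size patch is skipped, because the regular patch
    boundary already refreshes the patch-level context there.
    """
    positions = []
    for i, h in enumerate(entropies):
        is_patch_end = (i + 1) % patch_size == 0
        if h > threshold and not is_patch_end:
            positions.append(i)
    return positions


# Hypothetical per-byte entropies for two patches of 4 bytes each.
example_entropies = [0.1, 3.0, 0.2, 4.0, 5.0, 0.1, 0.1, 6.0]
low = scratchpad_positions(example_entropies, patch_size=4, threshold=2.0)
high = scratchpad_positions(example_entropies, patch_size=4, threshold=4.5)
```

Under these assumptions, raising the threshold yields fewer scratchpads (`high` is a subset of `low`), which is one plausible way to read the abstract's claim that inference-time compute can be adjusted post hoc without changing the patch size.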
Similar Articles
Adaptive Patching Is Harder Than It Looks For Time-Series Forecasting
This paper theoretically and empirically examines adaptive patching for time-series Transformers, deriving conditions under which content-adaptive tokenization should outperform tuned uniform patching. Controlled experiments on standard benchmarks show that a well-tuned uniform baseline is competitive with dynamic patching methods, challenging the assumed benefit of adaptive approaches.
When Attribution Patching Lies: Diagnosis and a Second-Order Correction
This paper diagnoses systematic errors in attribution patching, a gradient-based approximation used for causal localization in language models, and proposes a second-order correction using Hessian-vector products that improves reliability with minimal additional computational cost.
PatchBoard: Schema-Grounded State Mutation for Reliable and Auditable LLM Multi-Agent Collaboration
PatchBoard replaces natural-language dialogue in LLM multi-agent systems with validated JSON Patch mutations over a shared structured state, achieving higher success rates and significantly lower token usage on ALFWorld benchmarks.
Demystifying Training-Time Augmentation for Data-Constrained Language Model Pretraining
This paper investigates training-time data augmentation techniques to mitigate overfitting in autoregressive language model pretraining under data-constrained, compute-abundant regimes, finding that combining token-level noise, sequence permutations, and target offset prediction improves validation loss.
Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
This paper investigates the impact of subword tokenization on LLM training efficiency and performance by conducting controlled byte-level pretraining experiments. It reveals key factors such as training throughput and the integration of subword boundaries as linguistic priors.