This paper systematically derives compression-aware neural scaling laws by training nearly 1,300 models, demonstrating that the widely used heuristic of 20 tokens per parameter is an artifact of specific tokenizers. The authors propose a tokenizer-agnostic scaling law based on bytes, offering a new framework for compute-efficient training across diverse languages and modalities.
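As a rough illustration of the machinery involved (a minimal sketch, not the paper's actual fit), the snippet below uses the standard Chinchilla-style parametric loss L(N, D) = E + A/N^alpha + B/D^beta and the common C ≈ 6·N·D compute approximation to derive the compute-optimal allocation. The point is that the "tokens per parameter" ratio falls out of the fitted constants, which are tokenizer-dependent; a byte-based variant would measure D in bytes and fold a compression (bytes-per-token) factor into the compute constraint. All constants here are placeholders.

```python
# Minimal sketch, assuming a Chinchilla-style parametric loss
#   L(N, D) = E + A / N**alpha + B / D**beta
# and the common approximation C ≈ 6 * N * D (D in tokens).
# The fitted constants below are placeholders, not values from the paper.

def compute_optimal(C, A, alpha, B, beta):
    """Return (N_opt, D_opt) minimizing A/N^alpha + B/D^beta subject to C = 6*N*D."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N_opt = G * (C / 6.0) ** (beta / (alpha + beta))
    D_opt = (1.0 / G) * (C / 6.0) ** (alpha / (alpha + beta))
    return N_opt, D_opt

# Placeholder coefficients; in practice they are fit to training runs and
# depend on the tokenizer (or, in a byte-based law, on the byte stream itself).
A, alpha, B, beta = 400.0, 0.34, 400.0, 0.28

for C in (1e21, 1e23, 1e25):
    N_opt, D_opt = compute_optimal(C, A, alpha, B, beta)
    print(f"C={C:.0e}: N_opt={N_opt:.2e}, D_opt={D_opt:.2e}, D/N={D_opt / N_opt:.1f}")
```

Because the optimal D/N ratio depends entirely on the fitted coefficients, any fixed tokens-per-parameter rule inherits the tokenizer used to measure D, which is what motivates restating the law in bytes.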
A modified scaling law that accounts for data repetition provides compute-optimal training strategies for data-constrained scenarios, showing that, beyond a point, further repetition is counterproductive and compute is better spent on model capacity.
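One common way to capture diminishing returns from repeated data is an exponentially saturating "effective data" term, in which the first epoch counts fully and each additional pass contributes less. The sketch below assumes that form with a placeholder saturation constant `r_star`; it is illustrative only and not necessarily the paper's exact parameterization.

```python
# Illustrative sketch, assuming an exponentially saturating effective-data term.
# `r_star` is a placeholder for a fitted constant, not a value from the paper.
import math

def effective_data(unique_tokens, epochs, r_star=15.0):
    """Effective dataset size after `epochs` passes over `unique_tokens`.

    The first epoch counts fully; each repetition adds less, saturating at
    roughly `r_star` extra epochs' worth of value.
    """
    repeats = max(epochs - 1, 0)
    return unique_tokens + unique_tokens * r_star * (1.0 - math.exp(-repeats / r_star))

U = 100e9  # e.g. 100B unique tokens available
for epochs in (1, 2, 4, 8, 16, 64):
    seen = epochs * U
    d_eff = effective_data(U, epochs)
    print(f"{epochs:3d} epochs: {seen / 1e9:6.0f}B tokens seen, ~{d_eff / 1e9:5.0f}B effective")
```

Under this form, the marginal value of an extra epoch decays toward zero, so past a certain number of repetitions the same compute buys more loss reduction when allocated to a larger model instead.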