byte-level-tokenization

#byte-level-tokenization

Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models

arXiv cs.CL ↗ · 3d ago Cached

This paper investigates the relationship between training scale and UTF-8 generation reliability in byte-level language models, finding that UTF-8 validity convergence lags behind perplexity by roughly a factor of two. The authors introduce evaluation protocols to isolate structural validity and show that reliable UTF-8 generation is a distinct capability requiring separate evaluation.

0 favorites 0 likes

byte-level-tokenization

Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models

Submit Feedback