byte-level-tokenization

Tag

Cards List
#byte-level-tokenization

Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models

arXiv cs.CL · 3d ago Cached

This paper investigates the relationship between training scale and UTF-8 generation reliability in byte-level language models, finding that UTF-8 validity convergence lags behind perplexity by roughly a factor of two. The authors introduce evaluation protocols to isolate structural validity and show that reliable UTF-8 generation is a distinct capability requiring separate evaluation.

0 favorites 0 likes
← Back to home

Submit Feedback