This paper investigates how training scale affects the reliability of UTF-8 generation in byte-level language models. It finds that convergence of UTF-8 validity lags behind convergence of perplexity by roughly a factor of two. The authors introduce evaluation protocols that isolate structural validity, and they show that reliable UTF-8 generation is a distinct capability that requires separate evaluation.
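
To make "structural validity" concrete, the sketch below shows one way such a check could be scored: the fraction of generated byte strings that decode as strict UTF-8. This is an illustrative assumption, not the authors' evaluation protocol. The helper names `is_valid_utf8` and `utf8_validity_rate` and the sample byte strings are hypothetical.

```python
from typing import Iterable


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8."""
    try:
        # Python's strict decoder rejects truncated sequences, stray
        # continuation bytes, overlong encodings, and encoded surrogates.
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def utf8_validity_rate(samples: Iterable[bytes]) -> float:
    """Fraction of samples that are structurally valid UTF-8."""
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one sample")
    return sum(is_valid_utf8(s) for s in samples) / len(samples)


# Hypothetical model outputs, used only to exercise the metric.
samples = [
    "héllo".encode("utf-8"),  # valid: includes a two-byte sequence
    b"\xe2\x82",              # invalid: truncated three-byte sequence
    b"\xc3\x28",              # invalid: lead byte followed by a non-continuation byte
    b"plain ascii",           # valid: ASCII is a subset of UTF-8
]
rate = utf8_validity_rate(samples)
print(rate)  # 0.5
```

A score like this depends only on byte structure, not on how likely the text is under the model, which is why it can move independently of perplexity. One design choice to note: a sample cut off mid-character by a length limit counts as invalid here, so a real protocol might trim or separately track trailing incomplete sequences.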