Tag
This paper measures tokenizer fertility across 25 European languages on parallel text, revealing a 2.5× spread from English to Greek/Maltese, with Ukrainian paying a 15–18% penalty. It demonstrates domain invariance of fertility rankings, analyzes subword fragmentation, and evaluates cross-lingual few-shot effects.
Benchmarks seven foundation models on Ukrainian legal text, finding that tokenizer fertility varies by 1.6×, that few-shot prompting degrades performance, and that, in a cost-performance analysis, NVIDIA Nemotron Super 3 outperforms larger models.