ukrainian-nlp


The Tokenizer Tax Across 25 European Languages: Domain Invariance, Cross-Lingual Few-Shot Effects, and the Ukrainian Penalty

arXiv cs.CL · 2026-05-26

This paper measures tokenizer fertility across 25 European languages on parallel text. It reports a 2.5× spread from English to Greek and Maltese, with Ukrainian paying a 15-18% penalty. It also shows that fertility rankings are invariant across domains, analyzes subword fragmentation, and evaluates cross-lingual few-shot effects.
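As a rough illustration of the metric mentioned in this summary, the sketch below computes fertility as the average number of tokens per whitespace-separated word, which is a common definition; the paper's exact definition is not given here. The `chunk_tokenizer` function is a hypothetical stand-in for a real subword tokenizer, the helper names are made up for this example, and the two sentences are toy inputs rather than data from the paper.

```python
from typing import Callable, List


def chunk_tokenizer(text: str, size: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer: cut each word into fixed-size character chunks."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[i:i + size] for i in range(0, len(word), size))
    return tokens


def fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def relative_penalty(target: float, baseline: float) -> float:
    """Extra tokens per word for `target` relative to `baseline`, as a fraction."""
    return target / baseline - 1.0


# Toy sentences for illustration only; not parallel text and not from the paper.
english = "the court shall hear the case"
ukrainian = "суд розглядає справу відповідно до закону"

f_en = fertility(english, chunk_tokenizer)
f_uk = fertility(ukrainian, chunk_tokenizer)
print(f"English fertility:   {f_en:.2f}")
print(f"Ukrainian fertility: {f_uk:.2f}")
print(f"Relative penalty:    {relative_penalty(f_uk, f_en):.0%}")
```

To apply the same calculation to a real model, pass that model's tokenizer as the `tokenize` argument and use genuinely parallel sentences so the comparison between languages is meaningful.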


Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study

arXiv cs.CL · 2026-05-15

This paper benchmarks seven foundation models on Ukrainian legal text. It finds that tokenizer fertility varies by a factor of 1.6, that few-shot prompting degrades performance, and that NVIDIA Nemotron Super 3 outperforms larger models in a cost-performance analysis.
