Tag
This paper investigates the relationship between training scale and UTF-8 generation reliability in byte-level language models, finding that convergence of UTF-8 validity lags behind convergence of perplexity by roughly a factor of two. The authors introduce evaluation protocols that isolate structural validity and show that reliable UTF-8 generation is a distinct capability requiring separate evaluation.
This paper introduces the Resource Density Index (RDI) and uses LLM-assisted citation mining to reveal that many languages appear data-poor in catalogue records but have substantial dataset activity in research literature, highlighting a visibility asymmetry in low-resource multilingual NLP.
A developer seeks advice on classifying English-Hindi code-mixed text without heavy LLMs, because sentence transformers fail on Romanized Hindi.
This paper introduces a data-efficient fine-tuning framework for teaching reasoning models to code-switch (mix languages) effectively, demonstrating that strategic code-switching can improve reasoning capabilities for lower-resource languages. The work analyzes code-switching behaviors in large language models across diverse languages, tasks, and domains, then develops interventions to promote beneficial code-switching patterns.