Tag
Presents a systematic methodology for converting Hindi WordNet into 1.25 million instruction-response pairs to fine-tune a 12B-parameter language model using LoRA, demonstrating improved pedagogical effectiveness for specialized conversational systems in low-resource languages.
Presents Tatoxa, a state-of-the-art system for text detoxification in the Tatar language, outperforming existing LLMs. Introduces a new dataset and shows that cross-lingual transfer performs worse than training on native-language data.
This paper proposes SARA, a framework that aligns routing distributions of multilingual inputs using Jensen-Shannon divergence to improve expert sharing for low-resource languages in sparse Mixture-of-Experts models. Experiments on Qwen3-30B-A3B and Phi-3.5-MoE-instruct show improvements on multilingual benchmarks.
This paper proposes an error-aware TF-IDF retrieval-augmented generation framework for correcting automatic speech recognition errors, achieving significant accuracy gains on Persian FLEURS with near-zero inference latency.
This paper systematically quantifies the tokenization penalty for 20 African languages across 11 frontier and open tokenizers, finding up to 8.9× inference cost and latency multipliers and as little as 11% effective context window compared to English, highlighting a structural digital divide encoded in subword vocabularies.
This paper presents QuechuaTok, a benchmark for evaluating tokenization strategies for Southern Quechua, and introduces Morphological Boundary Accuracy (MorphAcc) as a necessary metric. It shows that BPE achieves low fertility but poor morphological accuracy, while a morphology-aware PRPE tokenizer achieves 83% MorphAcc, demonstrating that fertility rate alone is insufficient for agglutinative languages.
This paper investigates activation steering as an alternative to few-shot prompting for generating synthetic data in low-resource languages. The authors propose LanguageSteering and QualitySteering strategies, showing that steering on early layers improves diversity and downstream model performance.
This paper investigates whether pretrained self-supervised speech models like Wav2Vec2 and HuBERT can accurately recognize click consonants, which are rare in training data, by fine-tuning on Khoisan languages. Results show the models recognize clicks more accurately than non-clicks, indicating generalization to uncommon phonemes.
Translate-R1 introduces a reinforcement learning approach for cost-aware translation tool use in LLMs, where the model learns to decide when to translate inputs based on its own comprehension and a cost-sensitivity parameter, achieving Pareto-optimal trade-offs across multiple languages.
This paper proposes a modular approach for adapting pretrained language models to low-resource languages by freezing embeddings and tuning the rest, showing improvements on NLU tasks for Scottish Gaelic, Irish, and Quechua.
This paper proposes a novel pipeline for multilingual coreference resolution that uses cycle-consistent machine translation from English to low-resource languages to generate training data, validated by back-translation and BERT similarity. Experiments on four low-resource languages show significant performance gains, enabling accurate coreference resolution where no prior corpora existed.
GlossAssist is a tool for creating interlinear glossed text (IGT) corpora in low-resource language documentation settings, built around the CWoMP retrieval-based architecture with an active learning feedback loop that improves predictions as annotators make corrections without retraining the model.
This paper proposes a reinforcement learning approach to enable large language models to translate unseen languages by leveraging in-context linguistic knowledge, outperforming in-context learning and supervised fine-tuning.
This paper introduces CulturalNB, a dataset of Bengali cultural question-answer pairs, and evaluates nine LLMs for cross-lingual cultural bias. Findings show that English prompting increases global narrative substitution and reduces local perspectives, revealing that cultural failures in LLMs are grounding and prioritization issues, not just missing knowledge.
This survey catalogs publicly available text and speech resources for Hausa and Fongbe, two West African languages, assessing availability, quality, and gaps for NLP development, and providing task-specific recommendations.
This paper presents a comparative evaluation of embedding models and generator backends for Khmer-language retrieval-augmented question answering in the telecom domain, finding that BGE-M3 performs best for retrieval while generator strengths vary across metrics.
This paper describes the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. It uses a two-stage pipeline, with Qwen2.5-VL for Spanish captioning and retrieval-augmented Gemini 2.5 Flash for target-language translation, and achieves significant improvements over the baseline.
This paper introduces the Resource Density Index (RDI) and uses LLM-assisted citation mining to reveal that many languages appear data-poor in catalogue records but have substantial dataset activity in research literature, highlighting a visibility asymmetry in low-resource multilingual NLP.
Introduces Vividh-ASR, a complexity-tiered benchmark for Hindi and Malayalam ASR, identifies studio bias in fine-tuning, and proposes R-MFT to improve spontaneous speech performance efficiently.
The article introduces LLiMba, a 3B-parameter model adapted from Qwen2.5 for Sardinian using continued pretraining and supervised fine-tuning on a single consumer GPU. It evaluates various LoRA configurations, finding that adapter capacity significantly impacts performance and factual accuracy in low-resource language adaptation.
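
The tokenization-penalty summary above (20 African languages, 11 tokenizers) reports an inference cost multiplier of up to 8.9× and an effective context window as low as 11% of English. The sketch below shows how the two figures could relate, assuming the penalty is the ratio of token counts for the same content and the effective context fraction is its reciprocal. Both definitions and the function names are assumptions made for illustration; the summary does not state the paper's formulas.

```python
def tokenization_penalty(tokens_target: int, tokens_english: int) -> float:
    """Assumed definition: tokens needed for the same content, relative to English."""
    if tokens_english <= 0:
        raise ValueError("tokens_english must be positive")
    return tokens_target / tokens_english


def effective_context_fraction(penalty: float) -> float:
    """Assumed definition: share of a fixed token window left for actual content."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    return 1.0 / penalty


# Hypothetical token counts chosen only to reproduce the reported 8.9x multiplier.
penalty = tokenization_penalty(tokens_target=890, tokens_english=100)
fraction = effective_context_fraction(penalty)
print(f"penalty: {penalty:.1f}x, effective context: {fraction:.1%}")
# prints "penalty: 8.9x, effective context: 11.2%"
```

Under these assumptions the 8.9× and 11% figures are consistent with each other, since 1 / 8.9 ≈ 0.112.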