low-resource-languages

From Lexicon to AI: A Structured-Data Pipeline for Specialized Conversational Systems in Low-Resource Languages

arXiv cs.CL · 20h ago

Presents a systematic methodology for converting Hindi WordNet into 1.25 million instruction-response pairs to fine-tune a 12B-parameter language model using LoRA, demonstrating improved pedagogical effectiveness for specialized conversational systems in low-resource languages.

The Tatoxa System for Text Detoxification in Low-Resource Languages: The Case of Tatar

arXiv cs.CL · yesterday

Presents Tatoxa, a state-of-the-art system for text detoxification in the Tatar language, outperforming existing LLMs. Introduces a new dataset and shows that cross-lingual transfer performs worse than native data.

SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment

arXiv cs.CL · yesterday

This paper proposes SARA, a framework that aligns routing distributions of multilingual inputs using Jensen-Shannon divergence to improve expert sharing for low-resource languages in sparse Mixture-of-Experts models. Experiments on Qwen3-30B-A3B and Phi-3.5-MoE-instruct show improvements on multilingual benchmarks.

Error-Aware TF-IDF Retrieval-Augmented Generation for ASR Error Correction

arXiv cs.CL · yesterday

This paper proposes an error-aware TF-IDF retrieval-augmented generation framework for correcting automatic speech recognition errors, achieving significant accuracy gains on Persian FLEURS with near-zero inference latency.

The African Language Tax: Quantifying the Cost, Latency, and Context Penalty of Tokenizing African Languages in Frontier LLMs

arXiv cs.CL · 2d ago

This paper systematically quantifies the tokenization penalty for 20 African languages across 11 frontier and open tokenizers, finding up to 8.9× inference cost and latency multipliers and as little as 11% effective context window compared to English, highlighting a structural digital divide encoded in subword vocabularies.

QuechuaTok: Morphological Boundary Accuracy as a Necessary Metric for Tokenizer Evaluation in Agglutinative Low-Resource Languages

arXiv cs.CL · 2d ago

This paper presents QuechuaTok, a benchmark for evaluating tokenization strategies for Southern Quechua, and introduces Morphological Boundary Accuracy (MorphAcc) as a necessary metric. It shows that BPE achieves low fertility but poor morphological accuracy, while a morphology-aware PRPE tokenizer achieves 83% MorphAcc, demonstrating that fertility rate alone is insufficient for agglutinative languages.

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

arXiv cs.CL · 2026-06-18

This paper investigates activation steering as an alternative to few-shot prompting for generating synthetic data in low-resource languages. The authors propose LanguageSteering and QualitySteering strategies, showing that steering on early layers improves diversity and downstream model performance.

Pretrained self-supervised speech models can recognize unseen consonants

arXiv cs.CL · 2026-06-11

This paper investigates whether pretrained self-supervised speech models like Wav2Vec2 and HuBERT can accurately recognize click consonants, which are rare in training data, by fine-tuning on Khoisan languages. Results show the models recognize clicks more accurately than non-clicks, indicating generalization to uncommon phonemes.

Translate-R1: Cost-Aware Translation Tool Use via Reinforcement Learning

arXiv cs.CL · 2026-06-08

Translate-R1 introduces a reinforcement learning approach for cost-aware translation tool use in LLMs, where the model learns to decide when to translate inputs based on its own comprehension and a cost-sensitivity parameter, achieving Pareto-optimal trade-offs across multiple languages.

Modular Monolingual Adaptation using Pretrained Language Models

arXiv cs.CL · 2026-06-08

This paper proposes a modular approach for adapting pretrained language models to low-resource languages by freezing embeddings and tuning the rest, showing improvements on NLU tasks for Scottish Gaelic, Irish, and Quechua.

Multilingual Coreference Resolution via Cycle-Consistent Machine Translation

arXiv cs.CL · 2026-06-05

This paper proposes a novel pipeline for multilingual coreference resolution that uses cycle-consistent machine translation from English to low-resource languages to generate training data, validated by back-translation and BERT similarity. Experiments on four low-resource languages show significant performance gains, enabling accurate coreference resolution where no prior corpora existed.

GlossAssist -- A Tool to Simplify Corpus Creation and Study the Effect of NLP Models in Low-Resource Documentation Settings

arXiv cs.CL · 2026-06-04

GlossAssist is a tool for creating interlinear glossed text (IGT) corpora in low-resource language documentation settings, built around the CWoMP retrieval-based architecture with an active learning feedback loop that improves predictions as annotators make corrections without retraining the model.

Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation

Hugging Face Daily Papers · 2026-06-04

This paper proposes a reinforcement learning approach to enable large language models to translate unseen languages by leveraging in-context linguistic knowledge, outperforming in-context learning and supervised fine-tuning.

When English Rewrites Local Knowledge: Global Narrative Dominance in Large Language Models

arXiv cs.CL · 2026-06-01

This paper introduces CulturalNB, a dataset of Bengali cultural question-answer pairs, and evaluates nine LLMs for cross-lingual cultural bias. Findings show that English prompting increases global narrative substitution and reduces local perspectives, revealing that cultural failures in LLMs are grounding and prioritization issues, not just missing knowledge.

A Survey of Text and Speech Resources for Hausa and Fongbe: Availability, Quality, and Gaps for NLP Development

arXiv cs.CL · 2026-05-25

This survey catalogs publicly available text and speech resources for Hausa and Fongbe, two West African languages, assessing availability, quality, and gaps for NLP development, and providing task-specific recommendations.

A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering

arXiv cs.CL · 2026-05-22

This paper presents a comparative evaluation of embedding models and generator backends for Khmer-language retrieval-augmented question answering in the telecom domain, finding that BGE-M3 performs best for retrieval while generator strengths vary across metrics.

Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

arXiv cs.CL · 2026-05-21

University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages, using a two-stage pipeline with Qwen2.5-VL for Spanish captioning and retrieval-augmented Gemini 2.5 Flash for target-language translation, achieving significant improvements over the baseline.

Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

arXiv cs.CL · 2026-05-19

This paper introduces the Resource Density Index (RDI) and uses LLM-assisted citation mining to reveal that many languages appear data-poor in catalogue records but have substantial dataset activity in research literature, highlighting a visibility asymmetry in low-resource multilingual NLP.

Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition

Hugging Face Daily Papers · 2026-05-13

Introduces Vividh-ASR, a complexity-tiered benchmark for Hindi and Malayalam ASR, identifies a studio bias in fine-tuning, and proposes R-MFT to improve performance on spontaneous speech efficiently.

LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language

arXiv cs.CL · 2026-05-12

Introduces LLiMba, a 3B-parameter model adapted from Qwen2.5 for Sardinian using continued pretraining and supervised fine-tuning on a single consumer GPU. It evaluates several LoRA configurations and finds that adapter capacity significantly affects performance and factual accuracy in low-resource language adaptation.
