Tag
This paper introduces a KAN-enhanced BiGRU architecture for classifying and summarizing multilingual legal documents from Bangladesh. It achieves modest accuracy and ROUGE scores, and the KAN block improves classification accuracy over the baseline BiGRU.
This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model built via cross-lingual tokenizer surgery and offline distillation, achieving strong performance on Turkish benchmarks with a cost-quality trade-off.
This paper proposes a dialect-aware phonetic framework for modeling phonetic variation in Vietnamese ASR, decomposing syllables into structured components and mapping them to dialect-specific IPA representations. On the UIT-ViMD multi-dialect dataset, the approach matches pretrained baselines while using fewer parameters and no external pretraining.
This paper develops a deep learning framework for analyzing the evolution of grammatical gender from Latin to the Romance languages, combining lexical and contextual analysis in low-resource historical settings.
This paper presents a reproducible pipeline for building Universal Dependencies-style parsing resources for Katharevousa Greek parliamentary text, including OCR reconstruction, LLM-assisted annotation, and evaluation of multiple parsers. The best model (XLM-R) achieves 0.8893 UPOS accuracy and 0.5162 LAS, significantly outperforming off-the-shelf baselines.
This paper proposes a knowledge-aware Text-to-SQL framework that uses knowledge distillation to improve performance in low-resource settings by constructing task-specific knowledge bases and generating synthetic training data. Experiments on seven benchmarks show substantial improvements, especially for open-source models.
This paper introduces BLADE, a culturally aligned instruction-tuning dataset of 4,196 interaction pairs for fixing honorific failures and pragmatic gaps in multilingual Bangla generation. Fine-tuning models such as DeepSeek-8B and LLaMA-3.2-3B on this dataset yields substantial improvements in structural fidelity and honorific alignment.
This paper introduces CLD, a lightweight language detection head for ASR based on convex optimization, designed for robustness to accents and dialects across 5 languages and 24 sub-dialects. It achieves 97-98% accuracy with under 100 training samples while reducing compute costs by 13x.
DPR-BAG is a training-free, zero-shot framework that generates coherent biomedical abstracts from full-text articles by decomposing them into rhetorical facets, summarizing each facet with an LLM, and refining the result for coherence. It achieves better novelty than baselines while maintaining factual consistency.
This paper proposes a multi-pass prompt verification method to improve the performance of quantized LLMs (LLaMA-3.1 8B) in qualitative analysis, reducing hallucinations and increasing stability across different quantization levels (8-bit, 4-bit, 3-bit, 2-bit).
This critical survey examines the Annotation Scarcity Paradox in low-resource NLP evaluation, where rapid model scaling outpaces the human infrastructure needed for authentic evaluation, and discusses emerging responses with equity and validity trade-offs.
This tutorial paper provides an overview of building multilingual and multimodal LLMs for low-resource languages, covering data creation, model alignment, fine-tuning, and evaluation, with a focus on practical recipes and hands-on resources.
This paper presents Adesua, a WhatsApp-based AI teaching assistant for science education in West Africa that integrates retrieval-augmented generation with curated textbooks and exam questions. A 6-month feasibility study in Ghana found high perceived usefulness (93.75% helpfulness), though the sample size was small.
This paper proposes a context-aware synthetic augmentation framework combined with a hybrid classification model to address data scarcity and class imbalance in classifying psychological defense mechanisms from text. The method achieves significant improvements on the PsyDefDetect shared task benchmark.