Tag
This paper proposes SARA, a framework that aligns routing distributions of multilingual inputs using Jensen-Shannon divergence to improve expert sharing for low-resource languages in sparse Mixture-of-Experts models. Experiments on Qwen3-30B-A3B and Phi-3.5-MoE-instruct show improvements on multilingual benchmarks.
This paper investigates prompt-based learning for automatically generating highlights of academic papers, using models like GPT-2, T5, and ChatGPT, and shows that ChatGPT with few-shot prompts achieves performance comparable to or better than supervised methods without requiring task-specific training data.
This paper develops a codebook for self-stigma among people who use drugs and analyzes 72,115 Reddit posts to examine the prevalence, co-occurrence, and temporal patterns of cognitive, affective, and behavioral stigma indicators. It finds that self-stigma is expressed as an integrated phenomenon, with behavioral indicators often preceding core indicators.
This paper proposes a resource-light algorithm to automatically assign part-of-speech tags to senses in the Al-Mawrid Arabic-English bilingual dictionary by transferring tags from English WordNet after disambiguation, achieving high accuracy with minimal cost.
T2D-Bench is a benchmark that evaluates LLM outputs about Type 2 Diabetes using a multi-layer clinical-lifestyle knowledge graph. It reveals that current LLMs fail evidence-path checks in about a third of cases.
This paper constructs large-scale algorithm co-occurrence networks from the full text of academic papers to study the collective influence of algorithms in NLP, finding that classic, high-performing, and intersectional algorithms hold central network positions.
This paper introduces RASC+, a retrieval-constrained LLM adjudication method for clinical value set authoring that improves candidate-pool recall and selection precision over prior RASC baselines, demonstrating that blinded LLM adjudication with Qwen3-based retrieval significantly outperforms direct generation.
This paper presents a scalable framework using LLMs for implicit sentiment analysis of product desirability from qualitative feedback, achieving up to 0.97 Pearson correlation and 94% accuracy while providing explanations, with GPT-4o-mini offering similar performance at 94% lower cost.
A systematic experimental analysis evaluating eight state-of-the-art Diffusion Language Models across multiple benchmarks, analyzing trade-offs between generation quality and computational efficiency.
The article discusses why AI systems have difficulty interpreting uncertainty and ambiguity in human conversation, highlighting ongoing challenges in natural language understanding.
The Jan 6, 2026 draft of the 3rd edition of 'Speech and Language Processing' by Dan Jurafsky and James H. Martin is released, featuring a revised structure with a focus on large language models and updated chapters.
This paper introduces Approximate Structured Diffusion, a method that combines conditional random fields (CRFs) with discrete diffusion for sequence labelling. It uses a CRF conditioned on noisy label sequences and approximate mean-field inference, achieving a 16.5% error reduction on POS tagging.
This paper introduces PEC-Home, a simulated home dataset for interpreting progressively elliptical commands in smart homes, and finds that current LLM-based assistants struggle with such commands due to referential and intention ambiguity.
This paper introduces a benchmark of 1,200 clinical documents with 9,184 uncertainty annotations to evaluate whether LLMs preserve diagnostic uncertainty in clinical text, finding that LLMs often fail to preserve original uncertainty cues and struggle with nuanced distinctions.
This paper investigates verbalized methods for extracting LLM confidence in machine translation outputs, comparing them with internal token probabilities. The study finds that while both approaches perform similarly in error detection and calibration, there is little correlation between internal and verbalized confidence measures.
A fine-grained study of narrative features in web-scale LLM pretraining data, introducing NarraBERT and NarraDolma to measure narrative patterns and their distribution across sources.
Sumi is a 7B uniform diffusion language model pretrained from scratch on 1.5T tokens, achieving competitive performance on knowledge and reasoning tasks while being fully open-source with released weights and training recipe.
This paper introduces the Call Playbook dataset for classifying real-world B2B conversations and proposes methods to distill examples into compact, interpretable task instructions, achieving 99% token reduction and up to 7% AUC improvement over traditional in-context learning.
This study investigates whether instruction-tuned LLMs (Llama-3.1-8B, Qwen2.5-7B, Mistral-7B, Phi-3-mini) can reliably classify Correct Information Units in aphasic discourse transcripts. Few-shot prompting yields competitive F1 scores (0.776–0.817) for three models, but performance varies by aphasia severity, and agreement with human raters remains insufficient for fully autonomous use.
Proposes CoCoGEC, a counterfactual generation framework that alters error-irrelevant contexts in GEC training data to improve model robustness, achieving significant F0.5 gains on perturbed benchmarks.
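For readers unfamiliar with the F0.5 metric named in the CoCoGEC summary, the following is a minimal sketch of the standard F-beta formula with beta = 0.5, which weights precision more heavily than recall. The function name `f_beta` and the precision and recall values used with it are illustrative assumptions, not figures from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score.

    beta < 1 favors precision, beta > 1 favors recall, beta == 1 gives F1.
    The name and default are hypothetical choices for this illustration.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero: define the score as 0.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Illustrative inputs only: the same pair of values, swapped between
# precision and recall, gives a higher score when precision is the larger one.
high_precision = f_beta(precision=0.8, recall=0.4)
high_recall = f_beta(precision=0.4, recall=0.8)
```

With beta = 0.5, a system that is precise but misses some errors scores higher than one that catches more errors but proposes more wrong corrections, which matches why grammatical error correction work commonly reports F0.5.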