Tag
Introduces BioDivergence, a benchmark and evaluation framework for detecting context-conditioned contradictions in biomedical abstracts, featuring a six-class conflict taxonomy and a silver dataset of 11,865 claim pairs.
SHALA-LLM is a reinforcement learning framework that enables LLMs to learn directly from annotator distributions and dynamically prioritize highly ambiguous samples during alignment, improving agreement with human label distributions and classification performance.
Proposes a Multi-Granularity Reasoning Network (MGRN) that explicitly leverages hierarchical semantic features for natural language inference, outperforming strong baselines on multiple benchmarks.
Introduces SEA-NLI, a culturally grounded NLI benchmark covering eight Southeast Asian countries, revealing low performance of LLMs on culturally specific knowledge, especially in languages and science/technology. Shows that culture-aware prompting helps but chain-of-thought offers limited gains.
This paper examines the effect of labeled data size on natural language inference performance for 16 African languages using the AfriXNLI benchmark. The results show that scaling behavior is language-sensitive and often non-monotonic, challenging the common assumption of monotonic improvement, and emphasizing the need for language-specific dataset creation and stronger multilingual strategies.
This paper proposes a semantic verification framework using Natural Language Inference (NLI) to evaluate the sensitivity of clinical LLMs to meaning-preserving prompt variations, introducing metrics such as MVS, ΔC, and WCI. Results show that domain specialization does not consistently improve robustness, with both domain-specific and general-purpose models showing mixed performance.
LLMBridge introduces an LLM-based pipeline for end-to-end referential bridging resolution, achieving state-of-the-art performance on three English datasets. The system combines heuristic pre/post-processing with LLM natural language inference.
This paper proposes Product-of-Experts (PoE) training to reduce dataset artifacts in Natural Language Inference, downweighting examples where biased models are overconfident. PoE nearly preserves accuracy on SNLI (89.10% vs. 89.30%) while reducing bias reliance by ~4.85 percentage points.
This paper investigates how informal text (slang, emoji, Gen-Z filler tokens) degrades NLI accuracy in ELECTRA-small and RoBERTa-large models, identifying two distinct failure mechanisms—tokenization failure (emoji mapped to [UNK]) and distribution shift (out-of-domain noise tokens)—and proposes targeted mitigations that recover accuracy without harming clean-text performance.