Tag
This paper argues that universal LLM reliability is impossible, but within operationally bounded patches (e.g., legal review, medical RAG), failures are sparse and repetitive, making reliability a local catalogue-discovery problem. It formalizes this with propositions and a corollary, relocating rather than dissolving the difficulty of long-context generation.
This paper presents a framework using LLMs to generate targeted synthetic misconceptions aligned to a five-class taxonomy adapted from Bloom's taxonomy, addressing the scarcity of labeled student error data in education research.
This paper decomposes MXFP4 quantization error into three additive components (scale bias, deadzone truncation, and grid noise) and proposes targeted corrections for LLM reinforcement learning post-training that bring accuracy to within 0.7 pp of the BF16 baseline on Qwen2.5-3B and within 3.0 pp on Qwen3-30B-A3B-Base.
This paper provides the first systematic analysis of error sources in trajectory-based data attribution methods, identifies optimizer mismatch as the dominant error, proposes AdamW-influence to address it, and offers practical guidelines for data selection via a K-step look-ahead framework.
This update to the RLM arXiv paper adds depth>1 experiments with recursive RLM calls, showing significant performance gains on OOLONG-Pairs and other benchmarks, along with new comparisons to OpenCode and Claude Code, additional training results on MRCRv2, and an expanded error analysis.
This paper introduces the Context-Contaminated Restart Model (CCRM) to formally analyze how failed attempts in LLM agent pipelines contaminate context and increase error rates during retries. It provides theoretical proofs and validates the model against SWE-bench data, showing significant discrepancies with standard models that treat retries as independent.
This paper presents a dual-aspect evaluation framework for large language models on Vietnamese legal text simplification, combining quantitative benchmarking (Accuracy, Readability, Consistency) with qualitative error analysis across GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1.
DELEGATE-52 is a new benchmark revealing that current LLMs, including frontier models like GPT-5.4 and Claude 4.6 Opus, corrupt an average of 25% of document content during long delegated workflows across 52 professional domains. The research demonstrates that LLMs introduce sparse but severe errors that compound over interactions, raising concerns about their reliability for delegated work paradigms.