error-analysis

The Architecture of Errors: From Universal Impossibility to Patch-Local LLM Reliability

arXiv cs.CL · 2026-06-01

This paper argues that universal LLM reliability is impossible, but within operationally bounded patches (e.g., legal review, medical RAG), failures are sparse and repetitive, making reliability a local catalogue-discovery problem. It formalizes this with propositions and a corollary, relocating rather than dissolving the difficulty of long-context generation.

Error as a Lens: Probing LLM Reasoning through Synthetic Misconception Generation

arXiv cs.CL · 2026-05-29

This paper presents a framework using LLMs to generate targeted synthetic misconceptions aligned to a five-class taxonomy adapted from Bloom's taxonomy, addressing the scarcity of labeled student error data in education research.

Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor

arXiv cs.LG · 2026-05-21

This paper decomposes MXFP4 quantization error into three additive components—scale bias, deadzone truncation, and grid noise—and proposes targeted corrections that recover BF16 accuracy to within 0.7 pp on Qwen2.5-3B and 3.0 pp on Qwen3-30B-A3B-Base for LLM reinforcement learning post-training.

How Faithful Is Trajectory-Based Data Attribution? Error Sources, Remedies, and Practical Guidelines

arXiv cs.LG · 2026-05-20

This paper provides the first systematic analysis of error sources in trajectory-based data attribution methods, identifies optimizer mismatch as the dominant error, proposes AdamW-influence to address it, and offers practical guidelines for data selection via a K-step look-ahead framework.

@a1zhang: RLM arXiv paper update: depth>1 results, more comparisons, more training, and more error analysis! We add depth=2/3 exp…

X AI KOLs Following · 2026-05-12

This update to the RLM arXiv paper adds depth>1 experiments with recursive RLM calls, showing significant performance gains on OOLONG-Pairs and other benchmarks, along with new comparisons to OpenCode and Claude Code, additional training results on MRCRv2, and an expanded error analysis.

Why Retrying Fails: Context Contamination in LLM Agent Pipelines

arXiv cs.AI · 2026-05-12

This paper introduces the Context-Contaminated Restart Model (CCRM) to formally analyze how failed attempts in LLM agent pipelines contaminate the context and raise error rates on retries. It provides theoretical proofs and validates the model against SWE-bench data, showing significant discrepancies from standard models that treat retries as independent.

From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text

arXiv cs.CL · 2026-04-20

This paper presents a dual-aspect evaluation framework for large language models on Vietnamese legal text simplification, combining quantitative benchmarking (Accuracy, Readability, Consistency) with qualitative error analysis across GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1.

LLMs Corrupt Your Documents When You Delegate

arXiv cs.CL · 2026-04-20

DELEGATE-52 is a new benchmark revealing that current LLMs, including frontier models like GPT-5.4 and Claude 4.6 Opus, corrupt an average of 25% of document content during long delegated workflows across 52 professional domains. The research demonstrates that LLMs introduce sparse but severe errors that compound over interactions, raising concerns about their reliability for delegated work.
