This update to the RLM arXiv paper adds depth > 1 experiments with recursive RLM calls, showing significant performance gains on OOLONG-Pairs and other benchmarks, along with new comparisons to OpenCode and Claude Code, additional training results on MRCRv2, and an expanded error analysis.
This paper introduces the Context-Contaminated Restart Model (CCRM) to formally analyze how failed attempts in LLM agent pipelines contaminate context and increase error rates during retries. It provides theoretical proofs and validates the model against SWE-bench data, showing significant discrepancies with standard independent models.
This paper presents a comprehensive dual-aspect evaluation framework for large language models on Vietnamese legal text simplification, combining quantitative benchmarking (accuracy, readability, consistency) with qualitative error analysis across GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1.
DELEGATE-52 is a new benchmark revealing that current LLMs, including frontier models such as GPT-5.4 and Claude 4.6 Opus, corrupt an average of 25% of document content during long delegated workflows spanning 52 professional domains. The research demonstrates that LLMs introduce sparse but severe errors that compound across interactions, raising concerns about their reliability for delegated-work paradigms.