This update to the RLM arXiv paper adds depth > 1 experiments with recursive RLM calls, showing significant performance gains on OOLONG-Pairs and other benchmarks, along with new comparisons to OpenCode and Claude Code, additional training results on MRCRv2, and an expanded error analysis.
This paper introduces the Context-Contaminated Restart Model (CCRM) to formally analyze how failed attempts in LLM agent pipelines contaminate context and increase error rates during retries. It provides theoretical proofs and validates the model against SWE-bench data, showing significant discrepancies with standard independent models.
This paper presents a comprehensive dual-aspect evaluation framework for large language models on Vietnamese legal text simplification, combining quantitative benchmarking (accuracy, readability, consistency) with qualitative error analysis across GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1.
DELEGATE-52 is a new benchmark revealing that current LLMs, including frontier models such as GPT-5.4 and Claude 4.6 Opus, corrupt an average of 25% of document content during long delegated workflows spanning 52 professional domains. The research demonstrates that LLMs introduce sparse but severe errors that compound across interactions, raising concerns about their reliability for delegated-work paradigms.