Tag
DELEGATE-52 is a new benchmark revealing that current LLMs, including frontier models like GPT-5.4 and Claude 4.6 Opus, corrupt an average of 25% of document content during long delegated workflows across 52 professional domains. The research demonstrates that LLMs introduce sparse but severe errors that compound over interactions, raising concerns about their reliability for delegated work paradigms.