LLMs Corrupt Your Documents When You Delegate

arXiv cs.CL Papers

Summary

DELEGATE-52 is a new benchmark revealing that current LLMs, including frontier models like GPT-5.4 and Claude 4.6 Opus, corrupt an average of 25% of document content during long delegated workflows across 52 professional domains. The research demonstrates that LLMs introduce sparse but severe errors that compound over interactions, raising concerns about their reliability for delegated work paradigms.

arXiv:2604.15597v1 Announce Type: new

# LLMs Corrupt Your Documents When You Delegate
Source: https://arxiv.org/html/2604.15597
Philippe Laban, Tobias Schnabel, Jennifer Neville
Microsoft Research
{plaban, tobias.schnabel, jenneville}@microsoft.com

###### Abstract

Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust – the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows. DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains, such as coding, crystallography, and music notation. Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.

Figure 1: Illustrative examples of how LLMs corrupt documents over long workflows in the DELEGATE-52 benchmark. As LLMs edit files that represent graph diagrams, textile patterns, or 3D objects, they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction. (DELEGATE-52 is a text-only benchmark; visual renderings are for illustrative purposes.)

## 1 Introduction

Recent LLM progress is enabling new interaction paradigms such as delegated work (Shao et al., 2025 (https://arxiv.org/html/2604.15597#bib.bib15); Ulloa et al., 2025 (https://arxiv.org/html/2604.15597#bib.bib323)), where knowledge workers supervise LLMs as they complete tasks on their behalf (e.g., "vibe coding"). Crucially, users delegating work might lack the expertise or time to review changes implemented by the LLM, and must trust that the LLM does not introduce unchecked errors, such as hallucinations or deletions.

The viability of delegated work hinges on LLMs' ability to carry out tasks and manipulate domain documents without introducing errors. We study, through simulation, the readiness of current LLMs for delegated work across a wide range of professions.

The first contribution of our work is DELEGATE-52, a benchmark with 310 work environments across 52 professional domains, including coding, crystallography, genealogy, and music sheet notation. Each environment consists of real documents totaling around 15k tokens in length, and 5–10 complex editing tasks that a user might ask an LLM to carry out. This substantially differs from past work that focuses on tasks within a single domain (e.g., code editing (Cassano et al., 2023 (https://arxiv.org/html/2604.15597#bib.bib83)) or text editing (Spangher et al., 2022 (https://arxiv.org/html/2604.15597#bib.bib90))).

Our second contribution is the round-trip relay simulation method, which enables us to simulate long-horizon delegated interaction and evaluate LLM performance without requiring annotation or reference solutions. Specifically, we assume every editing task is reversible, defined by a forward instruction and its inverse. Applying both in sequence forms a backtranslation round-trip that, under a perfect model, recovers the original documents exactly. This lets us evaluate performance by measuring document similarity before and after a round-trip. Round-trips can further be composed sequentially, forming a relay. Backtranslation originated as a data augmentation and evaluation technique in machine translation (Sennrich et al., 2015 (https://arxiv.org/html/2604.15597#bib.bib102); Somers, 2005 (https://arxiv.org/html/2604.15597#bib.bib175)), and has recently been adapted to evaluate LLM consistency through chained reversible transformations (Hong et al., 2025 (https://arxiv.org/html/2604.15597#bib.bib81); Allamanis et al., 2024 (https://arxiv.org/html/2604.15597#bib.bib98)). We repurpose the technique to study long-horizon delegated interaction.

Our third contribution is a large-scale simulation with 19 LLMs on DELEGATE-52. Our findings show that current LLMs introduce substantial errors when editing work documents, with frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4) losing on average 25% of document content over 20 delegated interactions, and an average degradation across all models of 50%. Degradation depends on the domain: LLMs perform better in programmatic domains (Python, Database) and worse in natural language and niche domains (e.g., earning statements, music notation). We define a model as "ready" for delegated work in a domain if it achieves a score of 98% or higher after 20 interactions. Python is the only domain (out of 52) where most models are ready, highlighting the significant gap that remains.

Finally, targeted experiments refine our understanding of current LLM capabilities. We confirm that known factors such as document size, interaction length, and distractor context contribute to degradation (Liu et al., 2023 (https://arxiv.org/html/2604.15597#bib.bib324); Shi et al., 2023 (https://arxiv.org/html/2604.15597#bib.bib313)), but these negative effects compound over time, meaning short simulations underestimate their severity. We also find that using a basic agentic harness does not improve the performance of LLMs we test on DELEGATE-52, and that performance after two interactions is not predictive of long-horizon performance (20 interactions), validating the importance of long-horizon evaluation. We release DELEGATE-52 publicly as a tool to monitor AI readiness for delegated work and drive research on long-horizon Human-AI interaction.

## 2 The DELEGATE-52 Benchmark

Figure 2: The backtranslation round-trip primitive.

In DELEGATE-52 we simulate long workflows that could be part of a knowledge worker's tasks. A workflow consists of seed documents, along with other content, that are transformed via a sequence of complex editing tasks, mirroring the iterative nature of delegated work. We now introduce the framework that allows us to (i) perform evaluation automatically and (ii) scale the length of workflows.

### 2.1 Evaluating Without References

Figure 2 (https://arxiv.org/html/2604.15597#S2.F2) illustrates the round-trip primitive made up of a pair of editing tasks, inspired by backtranslation (Somers, 2005 (https://arxiv.org/html/2604.15597#bib.bib175)). Given a seed document s, we can define a pair of forward and backward edit instructions (x→, x←) that describe in natural language a transformation of the seed document and its inverse (σ, σ⁻¹). First, an LLM applies a forward instruction to the seed document, producing a transformed document t = σ(s) = LLM(s; x→). Second, the LLM applies the backward instruction to the transformed document, producing a reconstructed document ŝ = σ⁻¹(t) = LLM(t; x←). Each step is conducted as an independent, single-turn session.

To measure reconstruction quality, we implement a domain-specific similarity function sim(si, sj). A perfect model yields sim(s, ŝ) = 1, reducing evaluation to semantic equivalence without reference annotations. For backtranslation to be aligned with model performance, models need to genuinely attempt the editing instructions rather than taking shortcuts; we validate this in Appendix A (https://arxiv.org/html/2604.15597#A1). Appendix B (https://arxiv.org/html/2604.15597#A2) discusses other properties, assumptions, and limitations of this framework.
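The round-trip primitive can be sketched in a few lines. In this hypothetical sketch, `llm(document, instruction)` stands in for any single-turn editing model, and the generic `difflib` fallback is only a placeholder for the domain-specific similarity functions described later; none of these names are the benchmark's actual code:

```python
import difflib


def round_trip_score(seed, forward, backward, llm, sim=None):
    """Score one backtranslation round-trip: forward edit, then its inverse.

    A perfect model reconstructs the seed exactly, so the score is 1.
    """
    if sim is None:
        # Generic fallback; DELEGATE-52 uses domain-specific similarity.
        sim = lambda a, b: difflib.SequenceMatcher(None, a, b).ratio()
    transformed = llm(seed, forward)            # t = sigma(s)
    reconstructed = llm(transformed, backward)  # s_hat = sigma^-1(t)
    return sim(seed, reconstructed)
```

Each of the two calls is an independent, single-turn session, so no information about the seed leaks into the backward step except through the transformed document itself.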

##### Simulating long workflows.

Since each round-trip is designed to return to the seed document s, round-trips can be chained into longer workflows. We sample N pairs of forward and backward instructions (x1→, x1←), ..., (xN→, xN←) from the set of available options, each representing a transformation σi(s). We simulate an n-relay by applying n round-trip edits in sequence:

ŝn = (σn⁻¹ ∘ σn ∘ ⋯ ∘ σ1⁻¹ ∘ σ1)(s),     1 ≤ n ≤ N.

Our main metric is the reconstruction score after k interactions (i.e., k/2 round-trips):

RS@k(s) = sim(s, ŝk/2).
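Chaining round-trips into a relay is then a simple loop. This illustrative sketch (not the benchmark's code) records the reconstruction score after every completed round-trip, i.e., at even interaction counts k = 2, 4, ..., 2n, with `llm` a single-turn editing call and `sim` a similarity function:

```python
def simulate_relay(seed, edit_pairs, llm, sim):
    """Run an n-relay of round-trips; return {k: RS@k} for even k.

    edit_pairs is a list of (forward, backward) instruction pairs;
    each round-trip is two interactions.
    """
    scores = {}
    doc = seed
    for i, (forward, backward) in enumerate(edit_pairs, start=1):
        doc = llm(doc, forward)       # interaction 2i - 1
        doc = llm(doc, backward)      # interaction 2i
        scores[2 * i] = sim(seed, doc)
    return scores
```

Because the output of each round-trip is the input to the next, any residual error a model leaves behind propagates into every subsequent interaction, which is exactly the compounding behavior the relay is designed to surface.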

### 2.2 Benchmark Construction

We selected 52 professional domains to simulate workflows (listed in Figure 3 (https://arxiv.org/html/2604.15597#S2.F3)), representing diverse knowledge work professions across five categories: Science & Engineering, Code & Configuration, Creative & Media, Structured Records, and Everyday. A key criterion for inclusion is the existence of a standard document type that is textual and unencoded (e.g., .srt for subtitles, .cif for crystallography). Secondary considerations in domain selection are listed in Appendix K.1 (https://arxiv.org/html/2604.15597#A11.SS1).

Figure 3: DELEGATE-52 includes work environments from 52 professional domains in five categories: Science & Engineering, Code & Configuration, Creative & Media, Structured Records, and Everyday.

#### 2.2.1 Work Environments

For each domain, we construct six work environments consisting of a seed document, a set of 5–10 possible edit tasks, and a distractor context. An example environment for the accounting domain is presented in Figure 4 (https://arxiv.org/html/2604.15597#S2.F4), and environment creation is detailed in Appendix K (https://arxiv.org/html/2604.15597#A11).

Figure 4: Example work environment from the accounting domain in DELEGATE-52. The seed document is an accounting ledger of Hack Club, a non-profit organization. The highlighted edit (Category Split) consists of first splitting the seed document hack_club.ledger into separate files by expense category (forward edit task), then merging them back chronologically into one file (backward edit task).

##### Seed Documents.

The seed document is the starting point for all simulations. Seed documents are real documents found online (no synthetic data, exemplars, or templates), range from 2–5k tokens (based on the GPT-4 tiktoken encoder), and have a permissive license for redistribution. Secondary requirements are listed in Appendix N (https://arxiv.org/html/2604.15597#A14). The simulations in Figure 1 (https://arxiv.org/html/2604.15597#footnotex2) use three seed documents: a Linux Kernel Architecture Diagram (graph), a 12-shaft Twill Diamond Pattern (textile), and the ActionBoy Palm Tree (3D objects).

##### Edit Tasks.

Edit tasks are pairs of forward and backward instructions defining invertible transformations. The instructions must: (1) represent realistic work tasks that a stakeholder might perform given the document, (2) require in-depth, non-trivial transformation of the context that goes beyond expansion. In other words, σ(s) cannot be decomposed into [s, σ′(s)] (concatenation), as this would make the backward edit trivial (cropping). Each edit task is tagged with the semantic operations required to perform the edit (e.g., numerical reasoning, classification, splitting). The accounting work environment in Figure 4 (https://arxiv.org/html/2604.15597#S2.F4) has 10 edit tasks, including tasks that require splitting the ledger into separate files by expense category or reimbursement recipient, converting the amounts to Euro, or formatting the ledger in Beancount format. Appendix K.4 (https://arxiv.org/html/2604.15597#A11.SS4) describes the edit creation and tagging process.
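An edit task can be pictured as a small record holding the instruction pair and its tags. The container below is hypothetical (field names and example strings are illustrative, paraphrasing the Category Split edit of Figure 4), not the benchmark's data model:

```python
from dataclasses import dataclass, field


@dataclass
class EditTask:
    """An invertible edit task: a forward instruction and its inverse."""
    forward: str   # natural-language forward instruction
    backward: str  # the inverse instruction
    # Semantic operations required to perform the edit.
    tags: list = field(default_factory=list)


category_split = EditTask(
    forward="Split hack_club.ledger into separate files by expense category.",
    backward="Merge the per-category files back chronologically into one file.",
    tags=["classification", "splitting"],
)
```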

##### Distractor Context.

In realistic work settings, retrieved or available documents are not always relevant to the task at hand (i.e., retrieval precision is imperfect). To simulate this, each work environment includes a distractor context: topically related documents that do not interfere with any of the editing tasks. In the accounting example of Figure 4 (https://arxiv.org/html/2604.15597#S2.F4), the distractor context includes a chart of accounts, the organization expense reimbursement policy, and three other documents from the organization. Distractor contexts range from 8–12k tokens per environment, and are included by default in experiments to enhance simulation realism. Distractor construction and non-interference validation are detailed in Appendix K.7 (https://arxiv.org/html/2604.15597#A11.SS7).

#### 2.2.2 Domain-Specific Evaluation

Figure 5: Top: Domains in DELEGATE-52 implement a parsing function that converts text documents into a structured representation which is then used by a similarity function to score two parsed instances. Bottom: a concrete example for the recipe domain.

Common textual similarity methods consider either low-level overlap (e.g., Levenshtein ratio (Levenshtein, 1965 (https://arxiv.org/html/2604.15597#bib.bib298))) or semantic distance in a generic embedding space (Neelakanta et al., 2022 (https://arxiv.org/html/2604.15597#bib.bib299)). These do not adequately capture fine-grained semantic changes, so we implement a custom similarity function for each domain, illustrated in Figure 5 (https://arxiv.org/html/2604.15597#S2.F5).

Semantic equivalence is measured in two steps: parsing and evaluation. A parsing function converts documents into a structured representation. In Figure 5 (https://arxiv.org/html/2604.15597#S2.F5), a recipe is parsed into ingredients (names, quantities, units), steps, and tips. A similarity function then compares two parsed representations and outputs a score in [0,1]. In the recipe domain, similarity is a weighted sum over ingredient lists (40%), steps (40%), and tips (20%). Per-domain component combination and relative weights are calibrated through ablation testing to ensure proportional sensitivity to content loss or corruption (Appendix K.2 (https://arxiv.org/html/2604.15597#A11.SS2)).

This flexibility allows for a domain-appropriate weighting of the various components of the scoring function. For instance, a small surface-level change in an ingredient (e.g., 200 → 800 g of butter) can severely impact the overall score (as desired). Conversely, the domain-specific parsing allows for robustness in the scoring function: surface-level changes that do not impact semantics (e.g., 200g vs. 0.2kg of butter, or shuffling the order of the ingredient list) do not affect the score.
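A minimal sketch of such a scorer for the recipe domain might look as follows, assuming documents have already been parsed into dicts of ingredients, steps, and tips. The unit table, component metrics, and weights here are illustrative simplifications, not the benchmark's calibrated implementation:

```python
import difflib

# Hypothetical unit table; a real parser would cover many more units.
UNIT_TO_GRAMS = {"g": 1.0, "kg": 1000.0}


def normalize_ingredient(name, qty, unit):
    # Canonicalize mass units so 0.2 kg and 200 g of butter compare equal.
    if unit in UNIT_TO_GRAMS:
        return (name.lower(), qty * UNIT_TO_GRAMS[unit], "g")
    return (name.lower(), qty, unit)


def set_sim(a, b):
    # Order-insensitive Jaccard similarity (shuffling must not matter).
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def seq_sim(a, b):
    # Order-sensitive similarity for step lists.
    return difflib.SequenceMatcher(None, a, b).ratio()


def recipe_similarity(r1, r2):
    """Weighted sum: ingredients 40%, steps 40%, tips 20%."""
    ing1 = [normalize_ingredient(*i) for i in r1["ingredients"]]
    ing2 = [normalize_ingredient(*i) for i in r2["ingredients"]]
    return (0.4 * set_sim(ing1, ing2)
            + 0.4 * seq_sim(r1["steps"], r2["steps"])
            + 0.2 * set_sim(r1["tips"], r2["tips"]))
```

Under this sketch, shuffling the ingredient list or converting 200 g to 0.2 kg leaves the score at 1.0, while changing 200 g of butter to 800 g shrinks the ingredient overlap and drags the total score down, matching the sensitivity behavior described above.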

Implementing robust semantic equivalence for 52 domains is central to our methodology. In Appendix C (https://arxiv.org/html/2604.15597#A3), we show that generic similarity measures (including LLM-as-a-judge with GPT 5.4) fail to capture nuanced semantic differences, only moderately correlating with our parsing-based metric and capturing at most 25% of the variance.

#### 2.2.3 Quality Assurance

To ensure experimental validity, we performed quality assurance at each stage of the construction process (Appendix K (https://arxiv.org/html/2604.15597#A11)), evaluating (1) parsing robustness, (2) evaluation sensitivity, (3) edit testing, and (4) distractor interference.
