Revising RVL-CDIP: Quantifying Errors and Test-Train Overlap

arXiv cs.CL 07/01/26, 04:00 AM Papers
data-quality benchmark document-classification dataset-revision label-errors test-train-overlap
Summary
This paper identifies and corrects label errors and test-train overlap in the RVL-CDIP document classification dataset, finding 12% label errors and 35% duplication. Correction improves classification accuracy and out-of-distribution generalization.
arXiv:2606.31446v1 Announce Type: new Abstract: RVL-CDIP is a popular dataset for benchmarking document classifiers. However, the dataset contains ample amounts of label errors as well as non-trivial amounts of test-train overlap, both of which may impact model performance metrics. In this paper, we address these two problems by (1) finding and fixing label errors, and (2) detecting and addressing test-train overlap. We produce several variations of RVL-CDIP with label error and test-train overlap fixes, and benchmark document classification performance on these new RVL-CDIP variations. Our rigorous analysis of RVL-CDIP finds that the corpus contains 12% label error and approximately 35% test-train duplication. Remediation sees improvements in classification accuracy when errors are removed, but sees decreases in accuracy when duplicates are removed. We additionally evaluate models on RVL-CDIP-N, an out-of-distribution benchmark, finding that training on error-corrected data substantially improves OOD generalization, with supervised models gaining an average of 8.1 percentage points in accuracy and improvements as large as 14 percentage points.
# Revising RVL-CDIP: Quantifying Errors and Test-Train Overlap
Source: [https://arxiv.org/html/2606.31446](https://arxiv.org/html/2606.31446)
Attila Nagy \(ML Collective\), Sam Desai \(University of Michigan\), Cyrus Desai \(University of Michigan\), Nicole C\. Lima \(University of Michigan\), Yixin Yuan \(University of Michigan\), Siddharth Betala \(IIT Madras, ML Collective\), Kaushal K\. Prajapati \(ML Collective\), Jamiu T\. Suleiman \(ML Collective\), Sharad Duwal \(ML Collective\), and Kevin Leach \(Vanderbilt University\)

\(2026\)

###### Abstract\.

RVL\-CDIP is a popular dataset for benchmarking document classifiers\. However, the dataset contains ample amounts of label errors as well as non\-trivial amounts of test\-train overlap, both of which may impact model performance metrics\. In this paper, we address these two problems by \(1\) finding and fixing label errors, and \(2\) detecting and addressing test\-train overlap\. We produce several variations of RVL\-CDIP with label error and test\-train overlap fixes, and benchmark document classification performance on these new RVL\-CDIP variations\. Our rigorous analysis of RVL\-CDIP finds that the corpus contains 12% label error and approximately 35% test\-train duplication\. Remediation sees improvements in classification accuracy when errors are removed, but sees decreases in accuracy when duplicates are removed\. We additionally evaluate models on RVL\-CDIP\-N, an out\-of\-distribution benchmark, finding that training on error\-corrected data substantially improves OOD generalization, with supervised models gaining an average of 8\.1 percentage points in accuracy and improvements as large as 14 percentage points\.

document classification, data quality, benchmark evaluation

Conference: the 26th ACM Symposium on Document Engineering; August 25–28, 2026; Fribourg, Switzerland\. DOI: 10\.1145/3820755\.3821486\. CCS concepts: Computing methodologies \(Document analysis and recognition; Neural networks\)\.

## 1\.Introduction

The RVL\-CDIP document classification corpus \(Harley et al\., [2015](https://arxiv.org/html/2606.31446#bib.bib9)\) has been called the *de facto* standard benchmark for evaluating document classification models \(Larson et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib6)\)\. Indeed, modern document understanding models like LayoutLMv3 \(Huang et al\., [2022](https://arxiv.org/html/2606.31446#bib.bib56)\), Donut \(Kim et al\., [2022](https://arxiv.org/html/2606.31446#bib.bib7)\), and UDOP \(Tang et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib41)\), among others, are often exclusively benchmarked on RVL\-CDIP to establish performance scores for document classification\. Despite this, recent work has estimated that there are large amounts of label errors in the RVL\-CDIP dataset, as well as large amounts of “data leakage” \(duplicate and near\-duplicate samples\) across RVL\-CDIP’s test and train splits \(Larson et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib6)\) \(see Figure [1](https://arxiv.org/html/2606.31446#S1.F1) for examples of these phenomena\)\. These two undesirable features cast doubt on high reported performance scores, and motivate a rigorous investigation into the impacts of widespread label errors and test\-train overlap on models trained and evaluated on RVL\-CDIP\.

![Refer to caption](https://arxiv.org/html/2606.31446v1/x1.png)

Figure 1\. Example label errors \(top row\) and test\-train near\-duplicate pair \(bottom row, from the invoice category\)\. The resume document’s correct label is scientific\_publication, and the letter’s is memo\.

![Refer to caption](https://arxiv.org/html/2606.31446v1/figures/rvlcdip_examples.png)

Figure 2\. Example documents from a selection of RVL\-CDIP’s 16 categories\.

In this paper, we seek to identify and fix the label errors found in RVL\-CDIP, and we aim to detect and address the data leakage observed across the dataset’s test and train splits\. We manually review RVL\-CDIP and uncover a large number of label errors: our analysis finds that roughly 12% of RVL\-CDIP is erroneously labeled, with error rates in the letter category exceeding 30%\. We then apply a filter\-and\-refine method to detect test\-train overlap, finding that roughly 35% of the samples in RVL\-CDIP’s test set have a \(near\-\) duplicate counterpart in the dataset’s train set\. We remediate these data quality issues and create several “cleaned” versions of RVL\-CDIP for benchmarking\. These new versions of RVL\-CDIP are corrected and cleaned, and minimize test\-train duplication\. With these new clean versions of RVL\-CDIP we are then able to study the impact of label errors and test\-train overlap on several document classification models, including transformer\-based and zero\-shot models\. We find that model performance increases upon removing label errors, but decreases when duplicates are removed\. We additionally evaluate our models on RVL\-CDIP\-N \(Larson et al\., [2022](https://arxiv.org/html/2606.31446#bib.bib40)\), an out\-of\-distribution \(OOD\) benchmark for RVL\-CDIP, to assess whether training on cleaner data improves generalization beyond RVL\-CDIP’s tobacco\-industry domain\. We make our cleaned RVL\-CDIP versions publicly available \([https://rvlcdip\-errors\.com/](https://rvlcdip-errors.com/)\) to aid in benchmarking document classification models on cleaner, more reliable data\.

Table 1\. Dataset statistics for RVL\-CDIP\. Train, validation, and test split sizes per category\.

| Category | Train | Val\. | Test |
| --- | --- | --- | --- |
| advertisement | 19,963 | 2,522 | 2,515 |
| budget | 20,010 | 2,485 | 2,505 |
| email | 19,954 | 2,530 | 2,505 |
| file\_folder | 20,022 | 2,451 | 2,527 |
| form | 19,957 | 2,537 | 2,506 |
| handwritten | 20,034 | 2,434 | 2,532 |
| invoice | 19,947 | 2,576 | 2,477 |
| letter | 20,106 | 2,430 | 2,464 |
| memo | 19,975 | 2,533 | 2,492 |
| news\_article | 20,011 | 2,526 | 2,463 |
| presentation | 20,043 | 2,468 | 2,489 |
| questionnaire | 20,048 | 2,517 | 2,435 |
| resume | 20,037 | 2,426 | 2,537 |
| scientific\_publication | 19,902 | 2,526 | 2,571 |
| scientific\_report | 19,994 | 2,508 | 2,498 |
| specification | 19,997 | 2,531 | 2,472 |

This paper is a *benchmark audit*: we propose no new model or training procedure\. Instead, our contribution is to provide the first exhaustive, empirically grounded analysis of data quality in the document classification community’s primary benchmark, and to quantify the concrete impact of these issues on reported model performance\. This type of paper is in the tradition of prior work that has revealed similar problems across NLP, computer vision, and document understanding benchmarks \(see Section [2](https://arxiv.org/html/2606.31446#S2)\)\.

To summarize, the main contributions of this paper are:

1. We conduct the first exhaustive, 400,000\-document review of RVL\-CDIP, finding that approximately 12% of the dataset is erroneously labeled, a finding with implications for every performance claim made against this benchmark\.
2. We develop a method to detect test\-train overlap, finding roughly 35% of RVL\-CDIP’s test set to have a \(near\-\) duplicate in the training set\.
3. We create cleaned versions of RVL\-CDIP free of label error and test\-train duplication and re\-benchmark several document classification models\.
4. We evaluate our models on RVL\-CDIP\-N, an out\-of\-distribution benchmark, to assess the effect of training data quality on OOD generalization\.

## 2\.Background and Related Work

RVL\-CDIP is widely used to benchmark document page classifiers\. Larson et al\. \(Larson et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib6)\) called it the *de facto* standard benchmark for document page classification, and when we surveyed 30 papers that used RVL\-CDIP to benchmark document classification systems or models, roughly 95% *exclusively* used RVL\-CDIP \(or a subset of RVL\-CDIP\) for performance evaluation\. A wide range of model types have used RVL\-CDIP to benchmark classification performance, including image\-based models, text\-based models, multi\-modal models, zero\-shot models, and Large Language Models\. RVL\-CDIP is therefore not merely the most popular document classification benchmark; it is effectively the *only* standardized benchmark of comparable scale for this task, making a rigorous audit of its quality a critical service to the document understanding community\.

The standard benchmarking setup with RVL\-CDIP is 16\-class classification\. RVL\-CDIP consists of 400,000 samples spread roughly evenly across its 16 categories\. These categories and their counts are listed in Table [1](https://arxiv.org/html/2606.31446#S1.T1); samples of RVL\-CDIP are displayed in Figure [2](https://arxiv.org/html/2606.31446#S1.F2)\. Harley et al\. \(Harley et al\., [2015](https://arxiv.org/html/2606.31446#bib.bib9)\), who introduced RVL\-CDIP, provided no definitions or distinguishing criteria for the 16 categories, but noted that the source collection had “missing or erroneous tags,” and acknowledged that “the final categories are not perfectly distinct\.” This absence of official category definitions means the research community has been benchmarking against a dataset with no authoritative criteria for what each category contains, a gap that our annotation work in Section [3](https://arxiv.org/html/2606.31446#S3) directly addresses\.

In the supervised learning setting, models are trained on RVL\-CDIP’s training set and evaluated on the test set to establish an accuracy score\. In zero\-shot settings \(e\.g\., with LLMs or zero\-shot image classifiers\), models are evaluated on RVL\-CDIP’s test set only\. To our knowledge, the state\-of\-the\-art model is LayoutLLM, with a reported accuracy of 98\.8% \(Fujitake, [2024](https://arxiv.org/html/2606.31446#bib.bib33)\), followed closely by EAML \(97\.70%; Bakkali et al\., [2021](https://arxiv.org/html/2606.31446#bib.bib51)\) and Bi\-VLDoc \(97\.17%; Luo et al\., [2025](https://arxiv.org/html/2606.31446#bib.bib50)\)\.

Table 2\. Label error breakdown in RVL\-CDIP\. In total, we determine that RVL\-CDIP contains approximately 12% label error\.

| Category | Unknown | Wrong | Mixed | Any Error | Unsure | Any \+ Unsure |
| --- | --- | --- | --- | --- | --- | --- |
| advertisement | 7\.51% | 0\.60% | 1\.62% | 9\.74% | 1\.50% | 11\.24% |
| budget | 8\.87% | 8\.48% | 3\.94% | 21\.29% | 1\.66% | 22\.94% |
| email | 8\.24% | 1\.27% | 0\.12% | 9\.63% | 0\.50% | 10\.13% |
| file\_folder | 3\.85% | 0\.16% | 0\.52% | 4\.53% | 0\.08% | 4\.61% |
| form | 10\.63% | 3\.62% | 6\.10% | 20\.35% | 1\.04% | 21\.39% |
| handwritten | 10\.51% | 2\.14% | 1\.03% | 13\.68% | 0\.00% | 13\.69% |
| invoice | 6\.42% | 8\.52% | 0\.59% | 15\.53% | 1\.16% | 16\.70% |
| letter | 15\.75% | 15\.51% | 0\.42% | 31\.68% | 0\.04% | 31\.72% |
| memo | 4\.23% | 1\.72% | 1\.12% | 7\.08% | 1\.91% | 8\.99% |
| news\_article | 4\.05% | 4\.50% | 0\.54% | 9\.09% | 0\.96% | 10\.05% |
| presentation | 4\.13% | 1\.34% | 0\.87% | 6\.34% | 2\.08% | 8\.42% |
| questionnaire | 8\.47% | 5\.98% | 1\.09% | 15\.54% | 1\.27% | 16\.81% |
| resume | 1\.48% | 0\.01% | 0\.00% | 1\.49% | 0\.01% | 1\.50% |
| scientific\_publication | 2\.15% | 2\.39% | 0\.00% | 4\.54% | 2\.72% | 7\.26% |
| scientific\_report | 8\.07% | 4\.09% | 5\.17% | 17\.32% | 0\.45% | 17\.77% |
| specification | 2\.29% | 1\.13% | 0\.74% | 4\.16% | 0\.24% | 4\.40% |
| RVL\-CDIP | 6\.67% | 3\.84% | 1\.49% | 12\.00% | 0\.98% | 12\.98% |

Recent work by Larson et al\. \(Larson et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib6)\) raised several issues with RVL\-CDIP, including the presence of label errors and data duplication in the dataset\. Given the role that RVL\-CDIP plays in the document understanding research area as the *de facto* benchmark for classification, these data quality issues are alarming, as they call into question the validity of certain claims like state\-of\-the\-art performance\. In our present work, we aim to rigorously measure the extent of label errors and test\-train overlap in RVL\-CDIP, and to quantify their impact on model performance\. In this way, our paper fits within relevant literature on analyzing data quality in benchmark datasets\. This prior work includes investigating data label issues in text \(e\.g\., Niu and Penn, [2019](https://arxiv.org/html/2606.31446#bib.bib48); Wang and Mueller, [2019](https://arxiv.org/html/2606.31446#bib.bib45); Croft et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib37); Rücker and Akbik, [2023](https://arxiv.org/html/2606.31446#bib.bib49)\), image \(e\.g\., Radenovic et al\., [2018](https://arxiv.org/html/2606.31446#bib.bib44); Müller and Markert, [2019](https://arxiv.org/html/2606.31446#bib.bib46); Northcutt et al\., [2021](https://arxiv.org/html/2606.31446#bib.bib47); Li et al\., [2022a](https://arxiv.org/html/2606.31446#bib.bib36); Agnew et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib35); Gröger et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib61)\), and document \(Vu and Nguyen, [2020](https://arxiv.org/html/2606.31446#bib.bib38); Larson et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib6); Jungo et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib34); Lim et al\., [2024](https://arxiv.org/html/2606.31446#bib.bib42)\) datasets, along with investigating data duplication and test\-train overlap issues in text \(e\.g\., Allamanis, [2019](https://arxiv.org/html/2606.31446#bib.bib52); Lewis et al\., [2021](https://arxiv.org/html/2606.31446#bib.bib59); Mu et al\., [2024](https://arxiv.org/html/2606.31446#bib.bib60)\), image \(e\.g\., Barz and Denzler, [2020](https://arxiv.org/html/2606.31446#bib.bib53); Laroca et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib54); Wu et al\., [2024](https://arxiv.org/html/2606.31446#bib.bib57)\), and document \(e\.g\., Laatiri et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib43); Larson et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib6)\) benchmarks\. Collectively, these works demonstrate that label errors and test\-train overlap are pervasive problems across benchmark datasets, and that addressing them often changes the interpretation of reported model performance\.

In particular, Larson et al\. \(Larson et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib6)\) estimated that 9\.7% of RVL\-CDIP has label errors, and 32% of the benchmark’s test set has a duplicate or near\-duplicate in the train set\. Our present work goes beyond the approximation of Larson et al\. \(Larson et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib6)\) and more rigorously quantifies the amount of label error and test\-train overlap present in RVL\-CDIP\. We create cleaned versions of RVL\-CDIP with errors and duplicates removed and compute model performance on this cleaned data, which was not explored by Larson et al\. \(Larson et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib6)\)\. Separately, Larson et al\. \(Larson et al\., [2022](https://arxiv.org/html/2606.31446#bib.bib40)\) introduced RVL\-CDIP\-N, an out\-of\-distribution benchmark for RVL\-CDIP, finding that models trained on RVL\-CDIP do not generalize well beyond its tobacco\-industry domain; we include RVL\-CDIP\-N in our evaluation\.

## 3\.Label Error Detection

Unlike prior work that estimated label error rates through sampling \(Larson et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib6)\), we exhaustively review all 400,000 documents in RVL\-CDIP for label errors\.

![Refer to caption](https://arxiv.org/html/2606.31446v1/figures/unknown_errors.png)

Figure 3\. Examples of RVL\-CDIP label errors with no valid true label\.

![Refer to caption](https://arxiv.org/html/2606.31446v1/figures/mixed_examples.png)

Figure 4\. Examples from RVL\-CDIP with multiple valid labels\. Top labels: original; bottom labels: alternate label\.

### 3\.1\.Annotation

Our goal is to measure the amount of label error in RVL\-CDIP\. Possible tools to help accomplish this goal include CleanLab \([https://github\.com/cleanlab/cleanlab](https://github.com/cleanlab/cleanlab)\), which uses confident learning \(Northcutt et al\., [2021](https://arxiv.org/html/2606.31446#bib.bib47)\), but this approach is not feasible here because RVL\-CDIP’s pervasive and systematic label errors corrupt the model confidence scores that confident learning relies on, a conclusion also reached by Larson et al\. \(Larson et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib6)\) in their preliminary investigation\. Instead, we manually inspect all documents from RVL\-CDIP using the label descriptions and criteria outlined by Larson et al\. \(Larson et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib6)\)\. We re\-print these criteria on our companion website \([https://rvlcdip\-errors\.com/](https://rvlcdip-errors.com/)\)\. There are three error types that we annotate in RVL\-CDIP: \(1\) *unknown*, where we determine that the document does not have a valid RVL\-CDIP label; \(2\) *wrong label*, where we determine the document has a wrong original label but a valid, more\-correct label; \(3\) *mixed*, where we determine that the document has multiple valid labels, including the original\.
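The three error types can be read as a small decision procedure over the set of categories a reviewer judges valid for a document\. The sketch below illustrates that reading only: the `ErrorType` enum, the `classify_annotation` function, and the `NONE` value for correctly labeled documents are hypothetical names introduced here, not part of the annotation tooling described above\.

```python
from enum import Enum


class ErrorType(Enum):
    NONE = "none"        # original label is the only valid label (no error)
    UNKNOWN = "unknown"  # no RVL-CDIP category fits the document
    WRONG = "wrong"      # original label is invalid, but another category fits
    MIXED = "mixed"      # several categories fit, including the original


def classify_annotation(original_label: str, valid_labels: set[str]) -> ErrorType:
    """Map a reviewed document to an error type.

    `valid_labels` is the (hypothetical) set of RVL-CDIP categories that a
    reviewer considers valid for the document; it may be empty.
    """
    if not valid_labels:
        return ErrorType.UNKNOWN
    if original_label not in valid_labels:
        return ErrorType.WRONG
    if len(valid_labels) > 1:
        return ErrorType.MIXED
    return ErrorType.NONE
```

Under this sketch, the Figure 1 resume document whose correct label is scientific\_publication would map to `ErrorType.WRONG`\.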

A team of seven volunteer undergraduate researchers annotated RVL\-CDIP over the span of roughly two years\. The annotators were given document images in batches grouped by category along with descriptions of each category\. Each batch contained 1,000 samples from one particular category\. The annotators were instructed to provide one of the following annotations: “yes” if the document fit the provided category criteria, “no” if the document clearly did not fit the category criteria, or “unsure” in cases of ambiguity\. The project lead reviewed all “no” and “unsure” cases as well as a portion of “yes” cases in order to provide corrective instruction and to update the category criteria if necessary\. Documents marked as “no” were also given a corrected label, where appropriate\. This annotation process was thus iterative, with category criteria refined continuously as annotators encountered ambiguous cases\. In some cases, a document fit within multiple category criteria; it was then marked as “mixed” and the categories to which it belonged were also recorded\. Inter\-rater agreement was measured on a sample of 1,500 documents across 5 categories \(resume, letter, memo, handwritten, and email\), which yielded a Fleiss’s kappa score of 0\.74, indicating substantial agreement \(Landis and Koch, [1977](https://arxiv.org/html/2606.31446#bib.bib63)\)\.
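For readers unfamiliar with the agreement statistic, the following is a minimal sketch of the standard Fleiss’s kappa computation\. It assumes every sampled document received the same number of ratings, which the text above does not specify; the function name `fleiss_kappa` and the count\-matrix input layout are illustrative choices, not the authors’ code\.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss's kappa for a count matrix.

    counts[i][j] is the number of raters who assigned item i to category j
    (for example, the categories could be "yes", "no", and "unsure").
    Every row must sum to the same number of raters (at least 2).
    The statistic is undefined when all ratings fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    total = n_items * n_raters
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]

    # Per-item agreement: fraction of rater pairs that agree on the item.
    pairs = n_raters * (n_raters - 1)
    p_item = [(sum(c * c for c in row) - n_raters) / pairs for row in counts]

    p_observed = sum(p_item) / n_items     # mean observed agreement
    p_expected = sum(p * p for p in p_cat)  # agreement expected by chance
    return (p_observed - p_expected) / (1.0 - p_expected)
```

A value of 1 corresponds to perfect agreement and a value near 0 to chance\-level agreement, which is the scale on which the reported 0\.74 is interpreted\.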

![Refer to caption](https://arxiv.org/html/2606.31446v1/figures/wrong_label_examples.png)

Figure 5\. Examples of incorrectly labeled samples from RVL\-CDIP with a known correct label\.

### 3\.2\.Findings

Through our exhaustive annotation process, we find that RVL\-CDIP contains many label errors\. Table [2](https://arxiv.org/html/2606.31446#S2.T2) lists a per\-category breakdown of these errors\. The category with the highest error rate is letter, with roughly 31\.7% of that category being erroneously labeled, but budget, form, handwritten, invoice, questionnaire, and scientific\_report also each have more than 10% label error\. Examples of various error types are shown in Figure [3](https://arxiv.org/html/2606.31446#S3.F3) \(*unknown* errors\), Figure [5](https://arxiv.org/html/2606.31446#S3.F5) \(*wrong* errors\), and Figure [4](https://arxiv.org/html/2606.31446#S3.F4) \(*mixed* errors\)\. We note that the 12% label error rate is higher than the rough estimate of Larson et al\. \(Larson et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib6)\), who reported 9\.7% \(they reported 8\.1% for *wrong* and *unknown*, and 1\.7% for *mixed*\)\. Our approach is more exhaustive and thorough than that of Larson et al\. \(Larson et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib6)\), who relied on sampling\. The exhaustive nature of our review meant that annotators developed a deeper understanding of category boundaries over time, and were better able to identify subtle errors that sampling\-based approaches may miss, particularly systematic error patterns that only become apparent at scale\. For example, press release documents appear in both the news\_article and presentation categories, but are more prevalent in presentation; this pattern was only recognizable after reviewing documents at scale\.

Main finding: RVL\-CDIP contains approximately 12% label error\. The dataset’s letter category has the highest error rate \(31\.68%\)\.

## 4\.Quantifying Test\-Train Overlap

Our goal is to quantify the amount of overlap between the official test and train splits of RVL\-CDIP, and if a non\-trivial amount of overlap is found, to re\-compute model performance scores on a modified version of the test set where duplicates and near\-duplicates are removed\. Prior work has found non\-trivial amounts of overlap between the test and train splits of other datasets \(see Section [2](https://arxiv.org/html/2606.31446#S2)\); in the words of Allamanis \(Allamanis, [2019](https://arxiv.org/html/2606.31446#bib.bib52)\), large amounts of data duplication across test and train splits can lead to “inflated” performance scores because a portion of the test set has been “seen” by the model during training\. Thus it is important to ensure datasets minimize the amount of test\-train overlap in order to facilitate meaningful model benchmarking\.

![Refer to caption](https://arxiv.org/html/2606.31446v1/figures/dup-near-examples.png)

Figure 6\. Example duplicate \(top\) and near\-duplicate \(bottom\) test\-train pairs\. Near\-duplicates are not pixel\-identical: they share the same underlying document template but may differ in filled fields or scan conditions\. Both types constitute test\-train leakage\.

### 4\.1\.Data Similarity Analysis Methods

We seek samples in RVL\-CDIP’s test set that have a duplicate or near\-duplicate in the train set\. We consider a pair of documents to be \(near\-\) duplicates if they are either \(1\) nearly exact copies of the same document, with one of the pair perhaps containing different amounts of noise or geometric distortion than the other, or \(2\) a document template match, meaning both documents in the pair are slightly different manifestations of the same template document\. This template can be a form that is filled out slightly differently, or the same letter that is addressed to two different addressees but bears the same content\. Examples of cases \(1\) and \(2\) are shown in Figure [6](https://arxiv.org/html/2606.31446#S4.F6), where they are called duplicate and near\-duplicate pairs, respectively\. While we do not distinguish between these two types in the remainder of this paper, we point them out here because they help inform us of the space of data similarity analysis methods to consider\. Importantly, both types constitute meaningful test\-train leakage: a model trained on a near\-duplicate of a test document has effectively been exposed to that test instance during training, regardless of whether the two documents are pixel\-identical copies\.

Since we consider two documents to be \(near\-\) duplicates even if their pixels differ significantly \(e\.g\., due to scanner noise or geometric transformation\), image hashing algorithms like pHash \([https://www\.phash\.org/](https://www.phash.org/)\), which is used in open source tools like CleanVision \([https://github\.com/cleanlab/cleanvision](https://github.com/cleanlab/cleanvision)\) and fastdup \([https://github\.com/visual\-layer/fastdup](https://github.com/visual-layer/fastdup)\), are not applicable, as they will suffer from low recall\. Instead, we consider a filter\-and\-refine pipeline that uses approximate similarity to propose candidate match pairs\. Then, more exact matching methods are used to verify if a candidate match pair is indeed a \(near\-\) duplicate pair\.

The “filter” step of this process involves computing rough similarity scores between documents using page embeddings\. Similarity scores can be efficiently computed between the test set and the training set with matrix multiplication: if we stack $d$\-dimensional $\ell_2$\-scaled embedding vectors of data from the train and test sets into matrices $X_{train}$ and $X_{test}$, then the resulting matrix of cosine similarity scores is $S = X_{train} X_{test}^{\top}$\. In this work, we use CLIP \(Radford et al\., [2021](https://arxiv.org/html/2606.31446#bib.bib5)\) image embeddings \($d = 512$\)\. For each test set sample, we find the 10 most similar training set documents using this approach\. We then pass those 10 candidate match documents on to the “refine” step where they are more thoroughly compared against the test sample\.
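As an illustration of the filter step, the sketch below normalizes two embedding matrices to unit length, forms the cosine similarity matrix by a single matrix product, and keeps the 10 highest\-scoring training samples per test sample\. The function names `l2_normalize` and `top_k_train_candidates` are hypothetical, and the embeddings are assumed to be precomputed NumPy arrays; no CLIP model is loaded here\.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def top_k_train_candidates(x_train: np.ndarray, x_test: np.ndarray, k: int = 10):
    """Return (indices, scores), each of shape (n_test, k).

    Row i lists the training samples most similar to test sample i,
    ordered from most to least similar.
    """
    # S has shape (n_train, n_test), as in S = X_train @ X_test^T.
    s = l2_normalize(x_train) @ l2_normalize(x_test).T
    k = min(k, s.shape[0])
    # Sort training samples by descending similarity for every test column.
    order = np.argsort(-s, axis=0)[:k].T                # (n_test, k)
    scores = np.take_along_axis(s.T, order, axis=1)     # (n_test, k)
    return order, scores
```

At the scale of RVL\-CDIP the full matrix $S$ would be very large, so a practical variant would likely process the test set in chunks; that detail is an assumption and is not described in the text\.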

In the “refine” step, we aim to verify if candidate test\-train pairs selected based on the “filter” step are actually duplicates\. We consider two approaches: \(1\) image feature matching, where local features within two images are compared, and \(2\) text sequence similarity\. The image feature matching method we investigate uses SuperPoint \(DeTone et al\., [2018](https://arxiv.org/html/2606.31446#bib.bib3)\) for feature extraction and LightGlue \(Lindenberger et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib62)\) for feature matching\. SuperPoint is a self\-supervised convolutional neural network that detects keypoints and computes descriptors from image patches\. LightGlue is a lightweight graph neural network that matches keypoints between two images based on these descriptors\. This pairing is well\-suited for detecting near\-duplicates in RVL\-CDIP, as SuperPoint was designed to handle geometric transformations such as the distortion that can differentiate \(near\-\) duplicate document scans\.

The text sequence similarity method we investigate uses the Ratcliff/Obershelp pattern recognition algorithm \(Ratcliff and Metzener, [1988](https://arxiv.org/html/2606.31446#bib.bib64)\), implemented via Python’s difflib library \([https://docs\.python\.org/3/library/difflib\.html](https://docs.python.org/3/library/difflib.html)\), which recursively finds the longest common substring between two sequences to compute an overall similarity ratio\. This makes it well\-suited for detecting template\-match near\-duplicates, where two documents share a large body of common text but differ in small regions \(e\.g\., a filled\-in field or a different addressee\)\.

In practice, we use a decision rule where we consider two documents to be a \(near\-\) duplicate match if the sequence similarity is $\geq 0.2$ and the feature match score \(i\.e\., the number of features matched between the two documents\) is $\geq 100$\. These thresholds were selected based on tuning on a ground truth dataset of 1,600 document pairs \(100 test samples from each of the 16 categories\), where candidate pairs were generated using the CLIP\-based filter step and then manually reviewed by the authors\. This ground\-truth dataset consists of 403 match pairs\. All ground truth pairs are available on our companion website \([https://rvlcdip\-errors\.com/](https://rvlcdip-errors.com/)\)\. The decision rule settings that we use achieve a precision of 100%, recall of 82\.9%, and F1 score of 90\.6% on this calibration dataset\.
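The decision rule and its calibration metrics can be sketched as follows\. This is an illustration under stated assumptions: the text similarity uses `difflib.SequenceMatcher` with its default settings, the page text is taken as a plain string from some unspecified extraction step, and the feature match count is passed in as an integer rather than computed with SuperPoint and LightGlue\. The names `is_near_duplicate` and `precision_recall_f1` are hypothetical\.

```python
from difflib import SequenceMatcher

SEQ_SIM_THRESHOLD = 0.2        # minimum text similarity ratio (from the text)
FEATURE_MATCH_THRESHOLD = 100  # minimum number of matched features (from the text)


def text_similarity(text_a: str, text_b: str) -> float:
    """Ratcliff/Obershelp-style similarity ratio in [0, 1] via difflib."""
    return SequenceMatcher(None, text_a, text_b).ratio()


def is_near_duplicate(text_a: str, text_b: str, n_feature_matches: int) -> bool:
    """Both conditions must hold for a candidate pair to count as a match."""
    return (
        text_similarity(text_a, text_b) >= SEQ_SIM_THRESHOLD
        and n_feature_matches >= FEATURE_MATCH_THRESHOLD
    )


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

As a consistency check on the reported figures: with 403 true match pairs, a recall of 82\.9% corresponds to 334 detected pairs \(a count inferred here from the reported recall, not stated in the text\), and with no false positives `precision_recall_f1(334, 0, 69)` gives an F1 of about 0\.906\.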

### 4\.2\.Findings

We run our duplicate detection method on RVL\-CDIP, and find that a large portion of RVL\-CDIP’s test set has a \(near\-\) duplicate training sample\. Examples of \(near\-\) duplicates are shown in Figure [7](https://arxiv.org/html/2606.31446#S4.F7)\. A breakdown of the number and percentage of \(near\-\) duplicates for each RVL\-CDIP category is shown in Table [3](https://arxiv.org/html/2606.31446#S4.T3)\. The more extreme cases are budget, form, invoice, questionnaire, resume, and specification; in each of these, more than half of the test set has a duplicate \(or near\-duplicate\) document in the training set\. In half \(8 out of 16\) of the RVL\-CDIP categories, over one\-third of the test set has a corresponding duplicate in the train set \(see Table [3](https://arxiv.org/html/2606.31446#S4.T3)\)\. Overall, approximately 35% of RVL\-CDIP’s test set has a duplicate or near\-duplicate counterpart in the training set\. We additionally apply our method to RVL\-CDIP’s validation set, finding a nearly identical rate: approximately 35% of the validation set also has a \(near\-\) duplicate in the training set \(Table [3](https://arxiv.org/html/2606.31446#S4.T3)\)\. The close agreement between test\-train and val\-train duplication rates suggests that duplication is a systematic property of how RVL\-CDIP was constructed, rather than an artifact of any particular split\.

Table 3\. Test\-train and val\-train duplication by RVL\-CDIP category\. The % Dup\. percentage indicates the portion of the split that contains a \(near\-\) duplicate in the training set\.

| Category | Test \# Dup\. | Test % Dup\. | Val\. \# Dup\. | Val\. % Dup\. |
| --- | --- | --- | --- | --- |
| advertisement | 856 | 34\.04% | 843 | 33\.43% |
| budget | 1,479 | 59\.04% | 1,450 | 58\.35% |
| email | 175 | 6\.99% | 190 | 7\.51% |
| file\_folder | 50 | 1\.98% | 56 | 2\.28% |
| form | 1,458 | 58\.18% | 1,455 | 57\.35% |
| handwritten | 102 | 4\.03% | 112 | 4\.60% |
| invoice | 1,703 | 68\.75% | 1,803 | 70\.00% |
| letter | 292 | 11\.85% | 259 | 10\.66% |
| memo | 457 | 18\.34% | 440 | 17\.37% |
| news\_article | 501 | 20\.34% | 514 | 20\.35% |
| presentation | 710 | 28\.53% | 688 | 27\.88% |
| questionnaire | 1,499 | 61\.56% | 1,515 | 60\.19% |
| resume | 1,461 | 57\.59% | 1,360 | 56\.06% |
| scientific\_publication | 445 | 17\.30% | 450 | 17\.81% |
| scientific\_report | 856 | 34\.27% | 899 | 35\.84% |
| specification | 2,032 | 82\.20% | 2,117 | 83\.64% |
| Total | 14,076 | 35\.20% | 14,151 | 35\.38% |
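The test\-split percentages in Table 3 follow from dividing each duplicate count by the corresponding test split size in Table 1\. The short sketch below reproduces that arithmetic; the variable names are illustrative, and the numbers are copied from the two tables\.

```python
# category: (test split size from Table 1, test-train duplicates from Table 3)
TEST_STATS = {
    "advertisement": (2515, 856),
    "budget": (2505, 1479),
    "email": (2505, 175),
    "file_folder": (2527, 50),
    "form": (2506, 1458),
    "handwritten": (2532, 102),
    "invoice": (2477, 1703),
    "letter": (2464, 292),
    "memo": (2492, 457),
    "news_article": (2463, 501),
    "presentation": (2489, 710),
    "questionnaire": (2435, 1499),
    "resume": (2537, 1461),
    "scientific_publication": (2571, 445),
    "scientific_report": (2498, 856),
    "specification": (2472, 2032),
}

# Per-category duplication rate: duplicates divided by test split size.
dup_rate = {cat: dups / size for cat, (size, dups) in TEST_STATS.items()}

# Overall rate pools all test samples rather than averaging category rates.
total_size = sum(size for size, _ in TEST_STATS.values())
total_dups = sum(dups for _, dups in TEST_STATS.values())
overall_rate = total_dups / total_size

over_half = sorted(cat for cat, rate in dup_rate.items() if rate > 0.5)
over_third = sorted(cat for cat, rate in dup_rate.items() if rate > 1 / 3)
```

With these inputs the pooled rate is about 35\.2%, six categories exceed one half, and eight exceed one third, in line with the figures quoted in the text\.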

Main finding: RVL\-CDIP’s test set has many samples that are highly similar to the training set\. More than one\-third \(∼35%\) of the test set has a duplicate or near\-duplicate sample in the training set, including template\-match near\-duplicates where shared document structure constitutes effective data leakage even without pixel\-identical copies\.

![Refer to caption](https://arxiv.org/html/2606.31446v1/figures/test-train_pairs_2x3.png)

Figure 7\.Example \(near\-\) duplicate test\-train pairs\. Documents need not be pixel\-identical to constitute leakage; template\-match pairs share the same underlying document structure, effectively exposing the model to the test document during training\.

## 5\.Evaluation on Cleaned RVL\-CDIP Versions

Having identified label errors and test\-train duplication, we investigate their impact on document classification performance\. We do this by creating versions of RVL\-CDIP in which these data quality issues are addressed\. The rest of this section is organized as follows: Section [5\.1](https://arxiv.org/html/2606.31446#S5.SS1) introduces the models that we benchmark, Section [5\.2](https://arxiv.org/html/2606.31446#S5.SS2) discusses our modifications to RVL\-CDIP, and Section [5\.3](https://arxiv.org/html/2606.31446#S5.SS3) discusses RVL\-CDIP\-N’s OOD data\.

### 5\.1\. Models

We investigate the impact of data label issues on several classes of document classification models\. The supervised learning models we investigate include text\-only, image\-only, and multi\-modal models\. For the text\-only modality, we investigate BERT \(Devlin et al\., [2019](https://arxiv.org/html/2606.31446#bib.bib26)\), RoBERTa \(Liu et al\., [2019](https://arxiv.org/html/2606.31446#bib.bib32)\), LongFormer \(Beltagy et al\., [2020](https://arxiv.org/html/2606.31446#bib.bib66)\), and Big Bird \(Zaheer et al\., [2020](https://arxiv.org/html/2606.31446#bib.bib65)\)\. For the image\-only modality, we investigate the CNNs VGG\-16 \(Simonyan and Zisserman, [2015](https://arxiv.org/html/2606.31446#bib.bib18)\), ResNet\-50 \(He et al\., [2016](https://arxiv.org/html/2606.31446#bib.bib20)\), ResNeXT\-50 \(Xie et al\., [2017](https://arxiv.org/html/2606.31446#bib.bib22)\), GoogLeNet \(Szegedy et al\., [2015](https://arxiv.org/html/2606.31446#bib.bib19)\), AlexNet \(Krizhevsky et al\., [2012](https://arxiv.org/html/2606.31446#bib.bib21)\), and DocXclassifier \(Saifullah et al\., [2024](https://arxiv.org/html/2606.31446#bib.bib2)\), as well as DiT \(Li et al\., [2022b](https://arxiv.org/html/2606.31446#bib.bib8)\) and Donut \(Kim et al\., [2022](https://arxiv.org/html/2606.31446#bib.bib7)\)\. The supervised multi\-modal models we use are LayoutLM \(Xu et al\., [2020](https://arxiv.org/html/2606.31446#bib.bib55)\) and LayoutLMv3 \(Huang et al\., [2022](https://arxiv.org/html/2606.31446#bib.bib56)\)\.

Prior work has benchmarked pre\-trained Large Language Models \(LLMs\) in a zero\-shot setting on RVL\-CDIP as well \(Scius\-Bertrand et al\., [2024](https://arxiv.org/html/2606.31446#bib.bib31)\), so we also include these in our investigation\. We experiment with gpt\-oss\-20b \(OpenAI, [2025](https://arxiv.org/html/2606.31446#bib.bib28)\), Llama 4 Maverick 17B \(Meta AI, [2025](https://arxiv.org/html/2606.31446#bib.bib29)\), Qwen 3 32b \(Yang et al\., [2025](https://arxiv.org/html/2606.31446#bib.bib58)\), and Mistral 7B Instruct \(Jiang et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib30)\)\. We additionally evaluate vision\-language models in a zero\-shot setting, including CLIP \(Radford et al\., [2021](https://arxiv.org/html/2606.31446#bib.bib5)\), ALIGN \(Jia et al\., [2021](https://arxiv.org/html/2606.31446#bib.bib25)\), SigLIP \(Zhai et al\., [2023](https://arxiv.org/html/2606.31446#bib.bib23)\), and SigLIP 2 \(Tschannen et al\., [2025](https://arxiv.org/html/2606.31446#bib.bib24)\)\. \(Footnote 9: We use the Hugging Face implementations openai/clip\-vit\-large\-patch14\-336 for CLIP, kakaobrain/align\-base for ALIGN, google/siglip\-so400m\-patch14\-384 for SigLIP, and google/siglip2\-so400m\-patch16\-512 for SigLIP 2\.\) We measure document classification accuracy for all models\. Experimental settings such as libraries and hyperparameters used are available on our companion website\. We do not re\-train Donut or DiT due to resource constraints, and use a pre\-trained version of these supervised models only\.

### 5\.2\. RVL\-CDIP Variants

We use the following versions of RVL\-CDIP in our benchmarking \(label errors and duplicates are detected using the approaches discussed in Sections [3](https://arxiv.org/html/2606.31446#S3) and [4](https://arxiv.org/html/2606.31446#S4)\):

1. Original RVL\-CDIP: the original RVL\-CDIP from Harley et al\. \([2015](https://arxiv.org/html/2606.31446#bib.bib9)\)\. We train models on this version’s training set\.
2. Errors Dropped: where all erroneously labeled test samples are removed\.
3. Errors Fixed: where all erroneously labeled test samples are fixed\. This means that *wrong*\-label documents have their labels corrected, while *unknown*, *mixed*, and *unsure* documents are removed\.
4. Fixed \+ Retrain: where all erroneously labeled samples in RVL\-CDIP are fixed; we retrain models on this version’s training set\.
5. Original De\-Duplicated: where test samples flagged as having a \(near\-\) duplicate counterpart in the train set are removed\.
6. Errors Dropped and De\-Duplicated: where test samples with a \(near\-\) duplicate train counterpart are removed and all erroneously labeled test samples are dropped\.
7. Errors Fixed and De\-Duplicated: where test samples with a \(near\-\) duplicate train counterpart are removed and all erroneously labeled test samples are fixed\.
8. Fixed \+ Retrain and De\-Duplicated: where we fix all errors in all of RVL\-CDIP, and remove test samples that have a \(near\-\) duplicate train counterpart\.
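As a rough illustration of how the test\-side variants relate to one another, the sketch below derives the Errors Dropped, Errors Fixed, and De\-Duplicated test sets from a toy test set\. It is a minimal sketch under assumptions: the `Sample` record, its field names, and the three helper functions are hypothetical and are not taken from the released data format\.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Sample:
    # Hypothetical record layout, for illustration only.
    doc_id: str
    label: str
    error_type: Optional[str] = None       # None, "wrong", "unknown", "mixed", or "unsure"
    corrected_label: Optional[str] = None   # set only when error_type == "wrong"
    has_train_duplicate: bool = False


def errors_dropped(test: List[Sample]) -> List[Sample]:
    """Remove every erroneously labeled test sample."""
    return [s for s in test if s.error_type is None]


def errors_fixed(test: List[Sample]) -> List[Sample]:
    """Relabel "wrong" samples; remove unknown, mixed, and unsure samples."""
    kept = []
    for s in test:
        if s.error_type is None:
            kept.append(s)
        elif s.error_type == "wrong" and s.corrected_label is not None:
            kept.append(replace(s, label=s.corrected_label, error_type=None))
    return kept


def de_duplicated(test: List[Sample]) -> List[Sample]:
    """Remove test samples flagged as having a (near-) duplicate in the train set."""
    return [s for s in test if not s.has_train_duplicate]


toy_test = [
    Sample("a", "memo"),
    Sample("b", "letter", error_type="wrong", corrected_label="memo"),
    Sample("c", "form", error_type="mixed"),
    Sample("d", "email", has_train_duplicate=True),
]

dropped_ids = [s.doc_id for s in errors_dropped(toy_test)]
fixed_ids = [s.doc_id for s in errors_fixed(toy_test)]
fixed_dedup_ids = [s.doc_id for s in de_duplicated(errors_fixed(toy_test))]
print(dropped_ids, fixed_ids, fixed_dedup_ids)
```

Because the filters are independent, the combined variants are compositions of the individual ones; the two retrained variants additionally require applying the label fixes to the training split before training\.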

Table 4\. Model performance \(accuracy, %\) on RVL\-CDIP with different data configurations\. “De\-Dup\.” means test samples that were determined to have duplicates in the train set were dropped \(see Section [4](https://arxiv.org/html/2606.31446#S4)\)\. Overall, model accuracy scores increase when data label issues are addressed by removing those test samples or re\-training on error\-free data\. Accuracy scores decrease when duplicates are removed from the test set\.

| Model | Modality | Original \(Reported\) | Original \(Ours\) | Errors Dropped | Errors Fixed | Fixed \+ Retrain | Original De\-Dup\. | Err\. Dr\. De\-Dup\. | Err\. Fix\. De\-Dup\. | Fix\. Ret\. De\-Dup\. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT | Text | 89\.81 | 92\.44 | 95\.53 | 93\.27 | 95\.94 | 89\.47 | 93\.59 | 91\.16 | 94\.22 |
| RoBERTa | Text | 90\.06 | 92\.71 | 95\.73 | 93\.44 | 96\.14 | 89\.87 | 93\.91 | 91\.48 | 94\.56 |
| LongFormer | Text | 93\.85 | 93\.11 | 96\.03 | 93\.75 | 96\.11 | 90\.62 | 94\.43 | 92\.95 | 94\.65 |
| Big Bird | Text | 93\.48 | 92\.10 | 95\.30 | 93\.08 | 95\.71 | 89\.20 | 93\.38 | 91\.99 | 94\.05 |
| VGG\-16 | Image | 90\.97 | 90\.99 | 94\.01 | 91\.18 | 94\.26 | 87\.73 | 91\.70 | 89\.31 | 92\.11 |
| ResNet\-50 | Image | 90\.40 | 89\.88 | 93\.19 | 90\.42 | 93\.38 | 86\.70 | 90\.95 | 88\.58 | 91\.36 |
| ResNeXT\-50 | Image | — | 90\.48 | 93\.71 | 90\.94 | 93\.90 | 87\.28 | 91\.54 | 89\.17 | 91\.90 |
| GoogLeNet | Image | 89\.02 | 87\.78 | 91\.26 | 88\.56 | 91\.43 | 84\.48 | 89\.00 | 86\.64 | 89\.37 |
| AlexNet | Image | 88\.60 | 88\.07 | 91\.41 | 88\.48 | 91\.93 | 84\.51 | 88\.81 | 86\.50 | 89\.42 |
| DocXclassifier | Image | 94\.00 | 94\.04 | 96\.71 | 94\.14 | — | 91\.71 | 95\.29 | 93\.45 | — |
| DiT | Image | 92\.11 | 93\.28 | 96\.26 | 93\.76 | — | 90\.84 | 94\.75 | 92\.94 | — |
| Donut | Image | 95\.30 | 95\.24 | 97\.57 | 94\.88 | — | 93\.44 | 96\.58 | 94\.06 | — |
| LayoutLM | Multi | 94\.42 | 93\.48 | 96\.47 | 94\.18 | 94\.35 | 91\.11 | 95\.03 | 93\.54 | 93\.79 |
| LayoutLMv3 | Multi | 95\.44 | 94\.30 | 96\.87 | 94\.24 | 97\.32 | 92\.34 | 95\.73 | 93\.81 | 96\.36 |
| CLIP | Image | — | 42\.26 | 45\.54 | 45\.32 | — | 42\.69 | 47\.01 | 46\.87 | — |
| ALIGN | Image | — | 46\.95 | 50\.98 | 51\.14 | — | 47\.07 | 52\.02 | 52\.15 | — |
| SigLIP | Image | — | 56\.69 | 61\.90 | 61\.66 | — | 56\.69 | 61\.71 | 62\.84 | — |
| SigLIP 2 | Image | — | 57\.93 | 62\.89 | 62\.57 | — | 56\.24 | 61\.55 | 61\.45 | — |
| gpt\-oss\-20b | Text | — | 65\.44 | 71\.08 | 70\.44 | — | 64\.47 | 70\.89 | 69\.05 | — |
| Mistral 7B Instruct | Text | — | 56\.96 | 61\.56 | 60\.86 | — | 54\.35 | 59\.75 | 59\.43 | — |
| Llama 4 Maverick 17B | Text | — | 67\.61 | 73\.47 | 72\.78 | — | 65\.28 | 72\.03 | 71\.55 | — |
| Qwen 3 32b | Text | — | 62\.28 | 67\.27 | 66\.38 | — | 62\.61 | 68\.92 | 68\.47 | — |

### 5\.3\. Out\-of\-Distribution Data

To evaluate generalization beyond RVL\-CDIP’s tobacco\-industry domain, Larson et al\. \([2022](https://arxiv.org/html/2606.31446#bib.bib40)\) created RVL\-CDIP\-N, an out\-of\-distribution benchmark of 1,002 documents that span the same 16 categories as RVL\-CDIP but are sourced from DocumentCloud \(footnote 10: [https://www\.documentcloud\.org/home/](https://www.documentcloud.org/home/)\) and web search rather than the Legacy Tobacco Document Library\. Examples from RVL\-CDIP\-N are available on our companion website\. Larson et al\. found that certain models trained on RVL\-CDIP exhibit accuracy drops of roughly 15–30% on RVL\-CDIP\-N, suggesting that high in\-distribution accuracy does not reflect general classification ability\. We evaluate our selected models on RVL\-CDIP\-N to check for performance differences across versions of RVL\-CDIP training data, and in particular whether training on error\-corrected data \(Fixed \+ Retrain\) improves OOD generalization\.

## 6\. Results

### 6\.1\. In\-Distribution Results

Document classification accuracy scores for each model are listed in Table [4](https://arxiv.org/html/2606.31446#S5.T4)\. Before comparing across conditions, we note that our replicated Original scores are generally consistent with published results\. Text\-based models BERT and RoBERTa are exceptions, scoring 2\.63 percentage points \(pp\) and 2\.65 pp above their reported scores respectively; we attribute this to our use of AWS Textract as the OCR engine rather than Tesseract, as better OCR provides a richer text signal for text\-only models\. Differences for image\-based and multimodal models are small \(≤1\.4 pp\) and attributable to variation in hyperparameters, random seeds, or library versions\.

Among zero\-shot models on the original RVL\-CDIP, vision\-language models range from 42\.26% \(CLIP\) to 57\.93% \(SigLIP 2\), while zero\-shot LLMs perform considerably better, ranging from 56\.96% \(Mistral 7B Instruct\) to 67\.61% \(Llama 4 Maverick 17B\)\. Both are well below the top supervised models but establish a useful baseline for interpreting cross\-condition gains\.

Turning to cross\-condition comparisons, we observe several consistent patterns\. First, removing errors \(Errors Dropped\) results in a higher accuracy score for all models, highlighting the impact of data quality on performance evaluation\. Supervised models gain approximately 3 percentage points on average, while zero\-shot LLMs see the biggest increases, gaining roughly 4\.6 to 5\.9 points each \(e\.g\., gpt\-oss\-20b increases from 65\.44% to 71\.08%\)\.

Second, when label errors are fixed \(Errors Fixed\), every model except ALIGN \(which moves from 50\.98% to 51\.14%\) sees a decrease in accuracy from its Errors Dropped score\. This indicates that fixing the *wrong* label error type exposes some overfitting by the models: since the label errors are pervasive across all folds of the dataset, a pervasively *wrong* label error in the train set will lead the model to make “correct” predictions on erroneously labeled data\. When those cases are fixed in the test set, the model will still predict the old label, which has now been corrected, resulting in misclassification\. When we address this issue by *retraining* the model with corrected data \(Fixed \+ Retrain\), the accuracy scores increase again to levels above the Errors Fixed scores\. We see these Fixed \+ Retrain scores as representing what models achieve when label errors are addressed throughout RVL\-CDIP\. The top\-performing model here is LayoutLMv3, with a cleaned test accuracy of 97\.32%\.
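The mechanism described above can be illustrated with a toy example\. The sketch below is not the paper’s evaluation code; the seven samples and the `accuracy` helper are invented for illustration\. Each sample records the label shipped with the dataset, the corrected label, and a model prediction, and the model is assumed to have partly memorized a systematic *wrong* label from the train set\.

```python
# Each toy test sample: (given_label, true_label, model_prediction).
# given_label differs from true_label only for "wrong"-type label errors.
toy_test = [
    ("memo", "memo", "memo"),
    ("memo", "memo", "memo"),
    ("letter", "letter", "letter"),
    ("letter", "letter", "memo"),    # genuine model mistake
    ("letter", "memo", "letter"),    # mislabeled; model reproduces the wrong label
    ("memo", "letter", "letter"),    # mislabeled; model predicts the true class
    ("memo", "letter", "letter"),    # mislabeled; model predicts the true class
]


def accuracy(pairs):
    """Fraction of (gold, prediction) pairs that agree."""
    pairs = list(pairs)
    return sum(gold == pred for gold, pred in pairs) / len(pairs)


# Original: score against the labels as shipped.
acc_original = accuracy((given, pred) for given, _, pred in toy_test)
# Errors Dropped: score only the samples whose shipped label was already right.
acc_dropped = accuracy((given, pred) for given, true, pred in toy_test if given == true)
# Errors Fixed: score every sample against its corrected label.
acc_fixed = accuracy((true, pred) for _, true, pred in toy_test)

print(f"original={acc_original:.3f} dropped={acc_dropped:.3f} fixed={acc_fixed:.3f}")
```

In this toy setting the Errors Fixed score sits between the Original and Errors Dropped scores, because the sample on which the model reproduces the memorized wrong label counts as a mistake once the label is corrected\. Retraining on corrected labels is what removes that memorized behavior\.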

Finally, we observe that removing test samples with a \(near\-\) duplicate in the train set leads to decreases in classification accuracy scores\. These drops range from approximately 1\.8 to 3\.6 percentage points across supervised models, with CNN models tending toward the higher end\. The zero\-shot models are less impacted by data deduplication in the test set, with mixed results: some models see small drops \(SigLIP 2: −1\.69 pp, Llama 4 Maverick 17B: −2\.33 pp\) while others are unaffected or improve slightly \(CLIP: \+0\.43 pp, ALIGN: \+0\.12 pp\); we speculate this may be because the removed duplicate test samples tend to be “easier”\.
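The de\-duplication deltas quoted above follow from the Original \(Ours\) and Original De\-Dup\. columns of Table 4\. The short Python sketch below recomputes them for a handful of models; the `dedup_delta` helper and the selection of models are illustrative choices, not part of the paper’s tooling\.

```python
# (Original "Ours", Original De-Dup.) accuracy pairs copied from Table 4.
table4 = {
    "Donut": (95.24, 93.44),
    "AlexNet": (88.07, 84.51),
    "SigLIP": (56.69, 56.69),
    "SigLIP 2": (57.93, 56.24),
    "Llama 4 Maverick 17B": (67.61, 65.28),
    "CLIP": (42.26, 42.69),
    "ALIGN": (46.95, 47.07),
}


def dedup_delta(model):
    """Change in accuracy (percentage points) after test-set de-duplication."""
    original, dedup = table4[model]
    return round(dedup - original, 2)


for name in table4:
    print(f"{name}: {dedup_delta(name):+.2f} pp")
```

Donut and AlexNet mark the two ends of the 1\.8 to 3\.6 pp range reported for supervised models, and SigLIP is the zero\-shot model whose score is unchanged\.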

Across all conditions, we also note little manifestation of the tendency of model leaderboards to “destabilize” \(Northcutt et al\., [2021](https://arxiv.org/html/2606.31446#bib.bib47)\) upon the removal or remediation of label errors\. In our experiments, top\-performing models on the original RVL\-CDIP still tend to be top performers on cleaned versions of the dataset\.

Table 5\. Model performance \(accuracy, %\) on RVL\-CDIP\-N\. The Original column reports the performance of models trained on the original RVL\-CDIP training set, while the Fixed column reports models trained on the cleaned \(errors fixed\) data\. Zero\-shot models \(CLIP, ALIGN, SigLIP, SigLIP 2, and LLMs\) require no training and are therefore evaluated under the Original condition only\.

| Model | Original | Fixed | Δ |
| --- | --- | --- | --- |
| BERT | 88\.12 | 91\.32 | \+3\.20 |
| RoBERTa | 89\.82 | 94\.91 | \+5\.09 |
| LongFormer | 90\.02 | 93\.91 | \+3\.89 |
| Big Bird | 89\.22 | 93\.51 | \+4\.29 |
| VGG\-16 | 66\.97 | 80\.34 | \+13\.37 |
| ResNet\-50 | 62\.08 | 75\.55 | \+13\.47 |
| ResNeXT\-50 | 65\.17 | 77\.94 | \+12\.77 |
| GoogLeNet | 60\.38 | 67\.47 | \+7\.09 |
| AlexNet | 62\.08 | 72\.85 | \+10\.77 |
| DocXclassifier | 74\.95 | — | — |
| DiT | 78\.84 | — | — |
| Donut | 78\.84 | — | — |
| LayoutLM | 86\.73 | 88\.52 | \+1\.79 |
| LayoutLMv3 | 81\.34 | 95\.11 | \+13\.77 |
| CLIP | 81\.64 | — | — |
| ALIGN | 89\.22 | — | — |
| SigLIP | 92\.51 | — | — |
| SigLIP 2 | 92\.61 | — | — |
| gpt\-oss\-20b | 84\.32 | — | — |
| Mistral 7B Instruct | 78\.22 | — | — |
| Llama 4 Maverick 17B | 87\.81 | — | — |
| Qwen 3 32b | 85\.41 | — | — |

### 6\.2\. Out\-of\-Distribution Results

Table [5](https://arxiv.org/html/2606.31446#S6.T5) reports model performance on RVL\-CDIP\-N’s out\-of\-distribution data\. Every model trained on Fixed data outperforms its Original\-trained counterpart, with the top Fixed\-trained model \(LayoutLMv3, 95\.11%\) improving by \+13\.77 pp\. We find that the magnitude of improvement splits along input modality: models that ingest pixels \(the CNNs and LayoutLMv3\) gain 7 to 14 pp, whereas OCR\-based models \(BERT, RoBERTa, and LayoutLM\) gain less than 5\.1 pp each\. We interpret this as evidence that pixel\-ingesting models were learning RVL\-CDIP\-specific visual shortcuts during Original training; training on corrected labels disrupts these shortcuts and encourages class\-intrinsic features that transfer better out\-of\-distribution\. OCR\-token inputs are largely corpus\-invariant, so text\-based models have less to re\-learn\.

Fixed training also substantially narrows the in\-distribution\-versus\-OOD accuracy gap\. Original\-trained LayoutLMv3 drops from 94\.30% on RVL\-CDIP to 81\.34% on RVL\-CDIP\-N, a gap of 13 pp; Fixed\-trained LayoutLMv3 drops only from 97\.32% to 95\.11%, a gap of 2\.2 pp\. A similar narrowing holds for the CNN models\. Label cleanup therefore does more than raise average accuracy—it restores a substantial portion of the OOD generalization that Original training otherwise sacrifices\.
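The gains and gaps discussed in this subsection can be recomputed from Tables 4 and 5\. The Python sketch below does so for the eleven supervised models that were retrained; the variable names are illustrative, and the in\-distribution LayoutLMv3 scores are the Original \(Ours\) and Fixed \+ Retrain entries of Table 4\.

```python
# (Original-trained, Fixed-trained) RVL-CDIP-N accuracy pairs copied from Table 5.
ood = {
    "BERT": (88.12, 91.32), "RoBERTa": (89.82, 94.91),
    "LongFormer": (90.02, 93.91), "Big Bird": (89.22, 93.51),
    "VGG-16": (66.97, 80.34), "ResNet-50": (62.08, 75.55),
    "ResNeXT-50": (65.17, 77.94), "GoogLeNet": (60.38, 67.47),
    "AlexNet": (62.08, 72.85), "LayoutLM": (86.73, 88.52),
    "LayoutLMv3": (81.34, 95.11),
}

# Per-model OOD gain from retraining on error-corrected data.
gains = {model: fixed - original for model, (original, fixed) in ood.items()}
mean_gain = sum(gains.values()) / len(gains)
best_model = max(gains, key=gains.get)

# In-distribution (Table 4) minus OOD (Table 5) accuracy for LayoutLMv3.
gap_original = 94.30 - ood["LayoutLMv3"][0]
gap_fixed = 97.32 - ood["LayoutLMv3"][1]

print(f"mean gain: {mean_gain:.1f} pp, largest: {best_model} ({gains[best_model]:.2f} pp)")
print(f"LayoutLMv3 gap: {gap_original:.2f} pp (Original) vs {gap_fixed:.2f} pp (Fixed)")
```

The mean works out to about 8\.1 pp, in line with the average reported in the abstract, and the LayoutLMv3 gap shrinks from about 13 pp to about 2\.2 pp\.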

Notably, zero\-shot models perform competitively on RVL\-CDIP\-N despite receiving no training on RVL\-CDIP\. SigLIP and SigLIP 2 achieve 92\.51% and 92\.61% respectively—outperforming every Original\-trained supervised model, including LayoutLMv3 \(81\.34%\)\. Zero\-shot LLMs \(78–88%\) similarly match or exceed most pixel\-ingesting supervised models trained on original data\. This pattern is consistent with the RVL\-CDIP shortcut interpretation: zero\-shot models have never seen RVL\-CDIP’s training distribution and are therefore naturally unaffected by its domain\-specific biases\.

## 7\. Conclusion

In this paper, we conduct a thorough review of RVL\-CDIP\. After a manual review, we find that approximately 12% of the dataset is erroneously labeled\. We implement a filter\-and\-refine approach to detect test\-train overlap in RVL\-CDIP, and find that approximately 35% of RVL\-CDIP’s test set has a near\-duplicate or duplicate counterpart in the training set\. When we address these issues in new versions of RVL\-CDIP, we find that model performance increases when label errors are addressed, but decreases when duplicates are removed from RVL\-CDIP’s test set\. We additionally find that training on error\-corrected data substantially improves out\-of\-distribution generalization on RVL\-CDIP\-N, with models seeing improvements as large as 14 percentage points\. Our work brings to light several important data quality issues with the widely\-used RVL\-CDIP benchmark, calling attention to the need for high\-quality and clean data\. We make our data and analyses available at: [https://rvlcdip\-errors\.com/](https://rvlcdip-errors.com/)\.

## References

- C\. Agnew, C\. Eising, P\. Denny, A\. Scanlan, P\. Van De Ven, and E\. M\. Grua \(2023\) Quantifying the effects of ground truth annotation quality on object detection and instance segmentation performance\. IEEE Access 11, pp\. 25174–25188\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1)\.
- M\. Allamanis \(2019\) The adverse effects of code duplication in machine learning models of code\. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1), [§4](https://arxiv.org/html/2606.31446#S4.p1.1)\.
- S\. Bakkali, Z\. Ming, M\. Coustaty, and M\. Rusiñol \(2021\) EAML: ensemble self\-attention\-based mutual learning network for document image classification\. Int\. J\. Doc\. Anal\. Recognit\. 24\(3\), pp\. 251–268\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p3.1)\.
- B\. Barz and J\. Denzler \(2020\) Do we train on test data? Purging CIFAR of near\-duplicates\. Journal of Imaging 6\(6\)\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1)\.
- I\. Beltagy, M\. E\. Peters, and A\. Cohan \(2020\) Longformer: the long\-document transformer\. arXiv preprint arXiv:2004\.05150\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p1.1)\.
- R\. Croft, M\. A\. Babar, and M\. M\. Kholoosi \(2023\) Data quality for software vulnerability datasets\. In Proceedings of the 45th International Conference on Software Engineering \(ICSE\)\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1)\.
- D\. DeTone, T\. Malisiewicz, and A\. Rabinovich \(2018\) SuperPoint: self\-supervised interest point detection and description\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops \(CVPRW\)\. Cited by: [§4\.1](https://arxiv.org/html/2606.31446#S4.SS1.p4.2)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\) BERT: pre\-training of deep bidirectional transformers for language understanding\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(NAACL\-HLT\)\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p1.1)\.
- M\. Fujitake \(2024\) LayoutLLM: large language model instruction tuning for visually rich document understanding\. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\)\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p3.1)\.
- F\. Gröger, S\. Lionetti, P\. Gottfrois, A\. Gonzalez\-Jimenez, M\. Groh, R\. Daneshjou, L\. Consortium, A\. A\. Navarini, and M\. Pouly \(2023\) Towards reliable dermatology evaluation benchmarks\. In Proceedings of the 3rd Machine Learning for Health Symposium, Proceedings of Machine Learning Research, Vol\. 225, pp\. 101–128\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1)\.
- A\. W\. Harley, A\. Ufkes, and K\. G\. Derpanis \(2015\) Evaluation of deep convolutional nets for document image classification and retrieval\. In Proceedings of the International Conference on Document Analysis and Recognition \(ICDAR\)\. Cited by: [§1](https://arxiv.org/html/2606.31446#S1.p1.1), [§2](https://arxiv.org/html/2606.31446#S2.p2.1), [item 1](https://arxiv.org/html/2606.31446#S5.I1.i1.p1.1)\.
- K\. He, X\. Zhang, S\. Ren, and J\. Sun \(2016\) Deep residual learning for image recognition\. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\)\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p1.1)\.
- Y\. Huang, T\. Lv, L\. Cui, Y\. Lu, and F\. Wei \(2022\) LayoutLMv3: pre\-training for document AI with unified text and image masking\. In Proceedings of the 30th ACM International Conference on Multimedia\. Cited by: [§1](https://arxiv.org/html/2606.31446#S1.p1.1), [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p1.1)\.
- C\. Jia, Y\. Yang, Y\. Xia, Y\. Chen, Z\. Parekh, H\. Pham, Q\. V\. Le, Y\. Sung, Z\. Li, and T\. Duerig \(2021\) Scaling up visual and vision\-language representation learning with noisy text supervision\. In Proceedings of the 38th International Conference on Machine Learning \(ICML\)\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p2.1)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, L\. R\. Lavaud, M\. Lachaux, P\. Stock, T\. L\. Scao, T\. Lavril, T\. Wang, T\. Lacroix, and W\. E\. Sayed \(2023\) Mistral 7b\. arXiv preprint arXiv:2310\.06825\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p2.1)\.
- M\. Jungo, L\. Vögtlin, A\. Fakhari, N\. Wegmann, R\. Ingold, A\. Fischer, and A\. Scius\-Bertrand \(2023\) Impact of the ground truth quality for handwriting recognition\. In Proceedings of the 12th International Symposium on Information and Communication Technology\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1)\.
- G\. Kim, T\. Hong, M\. Yim, J\. Nam, J\. Park, J\. Yim, W\. Hwang, S\. Yun, D\. Han, and S\. Park \(2022\) OCR\-free document understanding transformer\. In Proceedings of the European Conference on Computer Vision \(ECCV\)\. Cited by: [§1](https://arxiv.org/html/2606.31446#S1.p1.1), [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p1.1)\.
- A\. Krizhevsky, I\. Sutskever, and G\. E\. Hinton \(2012\) ImageNet classification with deep convolutional neural networks\. In Proceedings of the 25th International Conference on Neural Information Processing Systems\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p1.1)\.
- S\. Laatiri, P\. Ratnamogan, J\. Tang, L\. Lam, W\. Vanhuffel, and F\. Caspani \(2023\) Information redundancy and biases in public document information extraction benchmarks\. In Proceedings of the International Conference on Document Analysis and Recognition \(ICDAR\)\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1)\.
- J\. R\. Landis and G\. G\. Koch \(1977\) The measurement of observer agreement for categorical data\. Biometrics 33\(1\), pp\. 159–174\. Cited by: [§3\.1](https://arxiv.org/html/2606.31446#S3.SS1.p2.1)\.
- R\. Laroca, V\. Estevam, A\. S\. Britto, R\. Minetto, and D\. Menotti \(2023\) Do we train on test data? the impact of near\-duplicates on license plate recognition\. In Proceedings of the International Joint Conference on Neural Networks \(IJCNN\)\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1)\.
- S\. Larson, G\. Lim, Y\. Ai, D\. Kuang, and K\. Leach \(2022\) Evaluating out\-of\-distribution performance on document image classifiers\. In Advances in Neural Information Processing Systems\. Cited by: [§1](https://arxiv.org/html/2606.31446#S1.p2.1), [§2](https://arxiv.org/html/2606.31446#S2.p5.1), [§5\.3](https://arxiv.org/html/2606.31446#S5.SS3.p1.1)\.
- S\. Larson, G\. Lim, and K\. Leach \(2023\) On evaluation of document classification with RVL\-CDIP\. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics \(EACL\)\. Cited by: [§1](https://arxiv.org/html/2606.31446#S1.p1.1), [§2](https://arxiv.org/html/2606.31446#S2.p1.1), [§2](https://arxiv.org/html/2606.31446#S2.p4.1), [§2](https://arxiv.org/html/2606.31446#S2.p5.1), [§3\.1](https://arxiv.org/html/2606.31446#S3.SS1.p1.1), [§3\.2](https://arxiv.org/html/2606.31446#S3.SS2.p1.1), [§3](https://arxiv.org/html/2606.31446#S3.p1.1)\.
- P\. Lewis, P\. Stenetorp, and S\. Riedel \(2021\) Question and answer test\-train overlap in open\-domain question answering datasets\. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics \(EACL\)\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1)\.
- B\. Li, G\. Baris, P\. H\. Chan, A\. Rahman, and V\. Donzella \(2022a\) Testing ground\-truth errors in an automotive dataset for a dnn\-based object detector\. In Proceedings of the 2022 International Conference on Electrical, Computer, Communications and Mechatronics Engineering \(ICECCME\), pp\. 1–6\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1)\.
- J\. Li, Y\. Xu, T\. Lv, L\. Cui, C\. Zhang, and F\. Wei \(2022b\) DiT: self\-supervised pre\-training for document image transformer\. In Proceedings of the 30th ACM International Conference on Multimedia\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p1.1)\.
- G\. Lim, S\. Larson, and K\. Leach \(2024\) Label errors in the Tobacco3482 dataset\. arXiv preprint arXiv:2412\.13140\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1)\.
- P\. Lindenberger, P\. Sarlin, and M\. Pollefeys \(2023\) LightGlue: Local Feature Matching at Light Speed\. In Proceedings of the IEEE/CVF International Conference on Computer Vision \(ICCV\)\. Cited by: [§4\.1](https://arxiv.org/html/2606.31446#S4.SS1.p4.2)\.
- Y\. Liu, M\. Ott, N\. Goyal, J\. Du, M\. Joshi, D\. Chen, O\. Levy, M\. Lewis, L\. Zettlemoyer, and V\. Stoyanov \(2019\) RoBERTa: a robustly optimized BERT pretraining approach\. arXiv preprint arXiv:1907\.11692\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p1.1)\.
- C\. Luo, G\. Tang, Q\. Zheng, C\. Yao, L\. Jin, C\. Li, Y\. Xue, and L\. Si \(2025\) Bi\-VLDoc: bidirectional vision\-language modeling for visually\-rich document understanding\. International Journal on Document Analysis and Recognition \(IJDAR\)\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p3.1)\.
- Meta AI \(2025\) The Llama 4 herd: the beginning of a new era of natively multimodal AI innovation\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p2.1)\.
- Y\. Mu, M\. Jin, X\. Song, and N\. Aletras \(2024\) Enhancing data quality through simple de\-duplication: navigating responsible computational social science research\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1)\.
- N\. M\. Müller and K\. Markert \(2019\) Identifying mislabeled instances in classification datasets\. In Proceedings of the International Joint Conference on Neural Networks \(IJCNN\)\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1)\.
- J\. Niu and G\. Penn \(2019\) Rationally reappraising ATIS\-based dialogue systems\. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics \(ACL\)\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1)\.
- C\. G\. Northcutt, A\. Athalye, and J\. Mueller \(2021\) Pervasive label errors in test sets destabilize machine learning benchmarks\. In Proceedings of the 35th Conference on Neural Information Processing Systems Track on Datasets and Benchmarks\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1), [§3\.1](https://arxiv.org/html/2606.31446#S3.SS1.p1.1), [§6\.1](https://arxiv.org/html/2606.31446#S6.SS1.p6.1)\.
- OpenAI \(2025\) Gpt\-oss\-120b & gpt\-oss\-20b model card\. arXiv preprint arXiv:2508\.10925\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p2.1)\.
- F\. Radenovic, A\. Iscen, G\. Tolias, Y\. Avrithis, and O\. Chum \(2018\) Revisiting oxford and paris: large\-scale image retrieval benchmarking\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\), pp\. 5706–5715\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1)\.
- A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark, G\. Krueger, and I\. Sutskever \(2021\) Learning transferable visual models from natural language supervision\. In Proceedings of the 38th International Conference on Machine Learning \(ICML\)\. Cited by: [§4\.1](https://arxiv.org/html/2606.31446#S4.SS1.p3.6), [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p2.1)\.
- J\. W\. Ratcliff and D\. E\. Metzener \(1988\) Pattern matching: the gestalt approach\. Dr\. Dobb’s Journal\. Cited by: [§4\.1](https://arxiv.org/html/2606.31446#S4.SS1.p4.2)\.
- S\. Rücker and A\. Akbik \(2023\) CleanCoNLL: a nearly noise\-free named entity recognition dataset\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1)\.
- S\. Saifullah, S\. Agne, A\. Dengel, and S\. Ahmed \(2024\) DocXclassifier: towards a robust and interpretable deep neural network for document image classification\. International Journal on Document Analysis and Recognition \(IJDAR\)\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p1.1)\.
- A\. Scius\-Bertrand, M\. Jungo, L\. Vögtlin, J\. Spat, and A\. Fischer \(2024\) Zero\-shot prompting and few\-shot fine\-tuning: revisiting document image classification using large language models\. In Proceedings of the 27th International Conference on Pattern Recognition \(ICPR\)\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p2.1)\.
- K\. Simonyan and A\. Zisserman \(2015\) Very deep convolutional networks for large\-scale image recognition\. In Proceedings of the International Conference on Learning Representations \(ICLR\)\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p1.1)\.
- C\. Szegedy, W\. Liu, Y\. Jia, P\. Sermanet, S\. Reed, D\. Anguelov, D\. Erhan, V\. Vanhoucke, and A\. Rabinovich \(2015\) Going deeper with convolutions\. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\)\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p1.1)\.
- Z\. Tang, Z\. Yang, G\. Wang, Y\. Fang, Y\. Liu, C\. Zhu, M\. Zeng, C\. Zhang, and M\. Bansal \(2023\) Unifying vision, text, and layout for universal document processing\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)\. Cited by: [§1](https://arxiv.org/html/2606.31446#S1.p1.1)\.
- M\. Tschannen, A\. Gritsenko, X\. Wang, M\. F\. Naeem, I\. Alabdulmohsin, N\. Parthasarathy, T\. Evans, L\. Beyer, Y\. Xia, B\. Mustafa, O\. Hénaff, J\. Harmsen, A\. Steiner, and X\. Zhai \(2025\) SigLIP 2: multilingual vision\-language encoders with improved semantic understanding, localization, and dense features\. arXiv preprint arXiv:2502\.14786\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p2.1)\.
- H\. M\. Vu and D\. T\. Nguyen \(2020\) Revising funsd dataset for key\-value detection in document images\. arXiv preprint arXiv:2010\.05322\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1)\.
- W\. Wang and J\. Mueller \(2022\) Detecting label errors in token classification data\. arXiv preprint arXiv:2210\.03920\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1)\.
- H\. Wu, S\. Tian, J\. Gutierrez, A\. Bhatta, K\. Öztürk, and K\. W\. Bowyer \(2024\) Identity overlap between face recognition train/test data: causing optimistic bias in accuracy measurement\. arXiv preprint arXiv:2405\.09403\. Cited by: [§2](https://arxiv.org/html/2606.31446#S2.p4.1)\.
- S\. Xie, R\. Girshick, P\. Dollár, Z\. Tu, and K\. He \(2017\) Aggregated residual transformations for deep neural networks\. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\)\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p1.1)\.
- Y\. Xu, M\. Li, L\. Cui, S\. Huang, F\. Wei, and M\. Zhou \(2020\) LayoutLM: pre\-training of text and layout for document image understanding\. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p2.1)\.
- M\. Zaheer, G\. Guruganesh, K\. A\. Dubey, J\. Ainslie, C\. Alberti, S\. Ontanon, P\. Pham, A\. Ravula, Q\. Wang, L\. Yang, et al\. \(2020\) Big bird: transformers for longer sequences\. Advances in Neural Information Processing Systems 33\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p1.1)\.
- X\. Zhai, B\. Mustafa, A\. Kolesnikov, and L\. Beyer \(2023\) Sigmoid loss for language image pre\-training\. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision \(ICCV\)\. Cited by: [§5\.1](https://arxiv.org/html/2606.31446#S5.SS1.p2.1)\.