Can OCR-VLMs Read Devanagari? A Stress-Test Benchmark and Post-Correction Study

arXiv cs.CL 06/30/26, 04:00 AM Papers

ocr devanagari vision-language-models benchmark hindi post-correction

Summary

This paper benchmarks ten OCR systems on Devanagari script under synthetic degradation and real scans, finding that synthetic renders overstate quality, specialized OCR-VLMs are fragile, and strong English OCR does not predict Indic OCR performance. It releases a benchmark, code, and models.

arXiv:2606.29213v1 Announce Type: new Abstract: OCR systems, ranging from classical engines to specialised OCR vision-language models (OCR-VLMs) and frontier multimodal LLMs, report strong results on English and Chinese document benchmarks, yet their behaviour on Indic scripts is largely uncharacterised. We benchmark ten systems on Devanagari (Hindi): classical EasyOCR; open VLMs (Qwen2.5-VL-3B, Qwen3-VL-8B, olmOCR-7B); specialised OCR-VLMs (DeepSeek-OCR, Unlimited-OCR); and frontier closed models (Gemini 2.5 Flash, Claude Opus 4.7, GPT-5.5, Mistral OCR), across four synthetic degradation conditions and 300 real printed scans. We report four findings. First, on clean rendered text all ten cluster within chrF++ 91 to 98, so synthetic text does not separate them. Second, under degradation the specialised OCR-VLMs are the most fragile: DeepSeek-OCR suffers rare but catastrophic repetition failures (outputs up to 71× the reference length) that wreck its corpus mean even though its median is the best of any system, which is why we report median and catastrophic-rate instead of the mean. Third, on real scans nine of the ten systems collapse (EasyOCR falls from chrF++ 93.6 to 58.3) and the field spreads across a 76-point range, so synthetic renders badly overstate Devanagari quality. Fourth, strong English OCR does not predict Indic OCR: GPT-5.5 drops to chrF++ 58.5 (tying classical EasyOCR) and olmOCR-7B, the model behind olmOCR-Bench, falls to 40.5, while the open Qwen3-VL-8B (75.2, runnable on a single 24 GB GPU) beats GPT-5.5 and approaches Mistral; Gemini and Claude lead at 86.3 and 82.2. An error taxonomy separates surface errors (numerals, punctuation) from structural ones (conjuncts, matras, nukta), and a byte-level (ByT5) post-corrector improves a cheap engine on its own error distribution (chrF++ +1.2 to +1.5) but does not transfer across engines. We release the benchmark, code, and models.


# Can OCR-VLMs Read Devanagari? A Stress-Test Benchmark and Post-Correction Study
Source: [https://arxiv.org/html/2606.29213](https://arxiv.org/html/2606.29213)
\(2026\)

###### Abstract

OCR systems, ranging from classical engines to specialised OCR vision\-language models \(OCR\-VLMs\) and frontier multimodal LLMs, report strong results on English and Chinese document benchmarks, yet their behaviour on Indic scripts is largely uncharacterised\. We benchmark ten systems on Devanagari \(Hindi\): classical EasyOCR; open VLMs \(Qwen2\.5\-VL\-3B, Qwen3\-VL\-8B, olmOCR\-7B\); specialised OCR\-VLMs \(DeepSeek\-OCR, Unlimited\-OCR\); and frontier closed models \(Gemini 2\.5 Flash, Claude Opus 4\.7, GPT\-5\.5, Mistral OCR\), across four synthetic degradation conditions and 300 real printed scans\. We report four findings\. First, on clean rendered text all ten cluster within chrF\+\+ 91 to 98, so synthetic text does not separate them\. Second, under degradation the specialised OCR\-VLMs are the most fragile: DeepSeek\-OCR suffers rare but catastrophic repetition failures \(outputs up to 71× the reference length\) that wreck its corpus mean even though its median is the best of any system, which is why we report median and catastrophic\-rate instead of the mean\. Third, on real scans nine of the ten systems collapse \(EasyOCR falls from chrF\+\+ 93\.6 to 58\.3\) and the field spreads across a 76\-point range, so synthetic renders badly overstate Devanagari quality\. Fourth, strong English OCR does not predict Indic OCR: GPT\-5\.5 drops to chrF\+\+ 58\.5 \(tying classical EasyOCR\) and olmOCR\-7B, the model behind olmOCR\-Bench, falls to 40\.5, while the open Qwen3\-VL\-8B \(75\.2, runnable on a single 24 GB GPU\) beats GPT\-5\.5 and approaches Mistral; Gemini and Claude lead at 86\.3 and 82\.2\. An error taxonomy separates surface errors \(numerals, punctuation\) from structural ones \(conjuncts, matras, nukta\), and a byte\-level \(ByT5\) post\-corrector improves a cheap engine on its own error distribution \(chrF\+\+ \+1\.2 to \+1\.5\) but does not transfer across engines\. We release the benchmark, code, and models at [https://github\.com/Aditya\-PS\-05/devanagari\-ocr\-benchmark](https://github.com/Aditya-PS-05/devanagari-ocr-benchmark)\.

## 1 Introduction

The latest wave of end\-to\-end OCR vision\-language models \(OCR\-VLMs\), including DeepSeek\-OCR \[[1](https://arxiv.org/html/2606.29213#bib.bib1)\], its successor DeepSeek\-OCR 2, and the recently released Unlimited\-OCR \[[2](https://arxiv.org/html/2606.29213#bib.bib2)\], treats document parsing as image\-to\-text generation with a large language decoder\. These models report state\-of\-the\-art results on OmniDocBench, whose documents are overwhelmingly English and Chinese\. Whether those gains hold for Indic scripts is unknown\.

Devanagari, the script of Hindi and several other languages, poses challenges that do not appear in Latin and CJK text: stacked conjunct consonants \(saṃyuktākṣar\), dependent vowel signs \(matras\) placed above, below, and beside a base glyph, the connecting headline \(shirorekha\), the nukta diacritic, frequent Hindi/English code\-mixing, and two numeral systems\. A model that excels on Latin text may still mishandle these\.

We ask three questions\. \(Q1\) How accurate and how robust are modern OCR\-VLMs on Devanagari under realistic image degradation? \(Q2\) What do they get wrong, categorically? \(Q3\) Can a lightweight post\-corrector recover the errors of a cheap engine? Our contributions are:

- A controlled, multi\-font, multi\-condition Devanagari OCR benchmark with a script\-aware evaluation protocol \(Unicode NFC normalisation; CER/WER/chrF\+\+\)\.
- A robustness analysis showing that corpus\-mean error is dominated by rare catastrophic repetition failures, so that median together with catastrophic\-rate is the faithful summary\.
- A Devanagari error taxonomy that contrasts classical\-OCR and VLM failure modes\.
- A distribution\-matched byte\-level post\-corrector, with a positive result for matched noise and a negative cross\-engine transfer result\.

## 2 Related Work

End\-to\-end OCR\-VLMs\. GOT\-OCR2 \[[4](https://arxiv.org/html/2606.29213#bib.bib4)\], Nougat, and the DeepSeek\-OCR line cast OCR as long\-form generation, using a high\-compression visual encoder and an LLM decoder\. Unlimited\-OCR \[[2](https://arxiv.org/html/2606.29213#bib.bib2)\] replaces the decoder’s attention with Reference Sliding Window Attention to bound the KV cache for long\-document parsing, reporting a \+6 overall gain over DeepSeek\-OCR on OmniDocBench\.

Generic VLMs such as Qwen2\.5\-VL \[[3](https://arxiv.org/html/2606.29213#bib.bib3)\] also perform competitive document OCR\.

Document\-OCR benchmarks such as OmniDocBench \[[5](https://arxiv.org/html/2606.29213#bib.bib5)\] and olmOCR\-Bench \[olmocr\] drive progress with unit\-test\-style checks over real PDFs, but their documents are overwhelmingly English/Latin and Chinese, and Indic scripts are essentially absent\.

Indic OCR has historically relied on pipeline systems; to our knowledge a large\-scale evaluation of the new OCR\-VLMs and frontier LLMs on Devanagari is absent, and that is the gap this paper fills\.

OCR post\-correction as sequence\-to\-sequence denoising is established for Latin and historical text; we study it for Devanagari with a byte\-level model\.

## 3 Benchmark Construction

Source text\. We use the Hindi side of the FLORES test set \(997 sentences\), taking the first $N = 100$ Devanagari sentences for the main evaluation\. FLORES is held out from all training\.

Rendering\. Each sentence is rendered to a white\-background image with one of five Devanagari fonts \(Droid Sans Devanagari; Lohit Devanagari; Noto Sans Devanagari Regular/Medium/Condensed\), cycled across sentences, with line wrapping at 1400 px width and 40 px type\.

Degradation conditions\. From each clean image we derive three degraded variants: *blur* \(Gaussian, $\sigma \in [1.0, 1.8]$\); *noise* \(additive pixel noise on 6% of pixels\); and *low\-DPI* \(0\.45× downscale then upscale, bilinear\)\. Together with the clean render, this yields 4 conditions × 100 images\.

Metrics\. All references and hypotheses are Unicode NFC\-normalised before scoring\. We report Character Error Rate \(CER\), Word Error Rate \(WER\), and chrF\+\+ \(character $n$\-gram F\-score with word order 2\)\. Because a single visual character \(akṣara\) spans multiple code points, code\-point CER understates structural errors; we treat this as a known limitation \(§[6](https://arxiv.org/html/2606.29213#S6)\)\.
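The scoring protocol above can be sketched as follows. This is a minimal illustration, assuming plain Levenshtein distance over code points for CER and over whitespace-separated tokens for WER; the function names are hypothetical and are not taken from the released code, and chrF++ is left out for brevity.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalise to Unicode NFC so canonically equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text)


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Code-point CER in percent; exceeds 100 when the hypothesis runs long."""
    ref, hyp = nfc(ref), nfc(hyp)
    return 100.0 * edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref: str, hyp: str) -> float:
    """WER in percent over whitespace-separated tokens (an assumed tokenisation)."""
    ref_words, hyp_words = nfc(ref).split(), nfc(hyp).split()
    return 100.0 * edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```

Normalising first matters for Devanagari: the precomposed letter U+0958 and the sequence U+0915 U+093C (ka plus nukta) are canonically equivalent, so without NFC a correct transcription could be scored as an error. The limitation noted above is also visible here: the single akṣara क्ष is three code points, so one visual mistake can count as one, two, or three edits.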

Real printed set\. To measure the gap between synthetic and real images we additionally evaluate on 300 real printed\-Devanagari images with transcriptions, sampled from the Sanskrit\-OCR\-Typed corpus \(historical typeset scans\)\. These are word and short\-phrase level, so we use them as a real\-image robustness probe rather than a document\-parsing benchmark\.

Models\. We evaluate ten systems across four families\. *Classical:* EasyOCR \(Hindi and English\)\. *Open VLMs:* Qwen2\.5\-VL\-3B and the newer Qwen3\-VL\-8B \(generic, prompted to transcribe verbatim\) and olmOCR\-7B \(the model behind olmOCR\-Bench\)\. *Specialised OCR\-VLMs:* DeepSeek\-OCR \(3B, 0\.5B active; “Free OCR”\) and Unlimited\-OCR \(3B, 0\.5B active; “document parsing”, Gundam mode\)\. *Frontier closed \(API\):* Google Gemini 2\.5 Flash, Anthropic Claude Opus 4\.7, OpenAI GPT\-5\.5, and Mistral OCR, evaluated on the clean and real sets \(cost\-bounded\), while the local open and specialised models additionally run all four degradation conditions\. VLM outputs are stripped of layout and grounding special tokens and of bounding\-box coordinates before scoring\. Local inference runs on a single NVIDIA A10G \(23 GB\) in bfloat16, one model resident at a time\. We also attempted PaddleOCR, GOT\-OCR2, and LlamaParse: the first two would not run reliably in our environment \(a PaddlePaddle segfault on Amazon Linux 2023 and a processor\-instantiation error\), and LlamaParse returned non\-Devanagari \(Latin\) output on every image, so we omit all three from the quantitative tables\.

## 4 Results

### 4\.1 Clean accuracy: everyone looks good

On clean rendered text all ten systems score chrF\+\+ in a narrow 91 to 98 band \(Table [1](https://arxiv.org/html/2606.29213#S4.T1)\)\. The frontier closed models lead slightly \(Claude 98\.0, Mistral 97\.6\), but classical EasyOCR, the open VLMs, and the specialised OCR\-VLMs are all within a few points\. Clean synthetic text does not separate the systems, which is exactly why degradation and real data matter\.

Table 1: Clean rendered Devanagari, $N = 100$\. CER ↓ / chrF\+\+ ↑\.

### 4\.2 Robustness under degradation

Table [2](https://arxiv.org/html/2606.29213#S4.T2) reports corpus CER per condition for the six systems we run locally across all four conditions\. EasyOCR and the Qwen models are nearly flat, olmOCR is stable, Unlimited\-OCR degrades moderately, and DeepSeek\-OCR’s corpus CER explodes to 111\.8 under blur and 51\.9 under low\-DPI\.

Table 2: Corpus CER \(%, ↓\) by condition, $N = 100$\. Best per column in bold\.

![Refer to caption](https://arxiv.org/html/2606.29213v1/figs/fig_cer.png)

Figure 1: Corpus CER by image condition \(log scale\)\. EasyOCR and Qwen are nearly flat, while DeepSeek\-OCR collapses under blur and low\-DPI\.

The mean hides the truth\. Table [3](https://arxiv.org/html/2606.29213#S4.T3) decomposes per\-sample CER\. DeepSeek\-OCR has the best median CER of all systems \(1\.2 to 1\.5\), yet its mean is wrecked by the 2 to 3% of samples that enter a degenerate repetition loop and produce outputs up to 71\.6× the reference length\. Unlimited\-OCR, whose decoder uses an explicit no\-repeat\-$n$\-gram guard, bounds its worst case to 3\.8×\. We therefore recommend reporting median CER and catastrophic rate \(the fraction of samples with CER above 50%\) alongside the mean\.
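A minimal sketch of the recommended summary, assuming per-sample CER values (in percent) and output-to-reference length ratios have already been computed. The function name and the toy numbers in the usage note are illustrative only and are not the paper's data.

```python
from statistics import mean, median


def summarise(cers, length_ratios, threshold=50.0):
    """Summarise per-sample CER with statistics that expose rare blow-ups.

    cers          -- per-sample CER values in percent
    length_ratios -- per-sample len(hypothesis) / len(reference)
    threshold     -- CER above which a sample counts as catastrophic
    """
    return {
        "mean_cer": mean(cers),
        "median_cer": median(cers),
        # Fraction of samples whose CER exceeds the threshold.
        "cat_rate": sum(c > threshold for c in cers) / len(cers),
        # Largest output-to-reference length ratio, a proxy for repetition loops.
        "max_len_ratio": max(length_ratios),
    }
```

With a toy set of 98 samples at CER 1.0 and two repetition failures at CER 2000, the median stays at 1.0 and the catastrophic rate is 0.02, while the mean jumps to about 41. Two samples out of a hundred are enough to make the mean unrepresentative of typical behaviour.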

Table 3: Per\-sample CER distribution under blur and low\-DPI\. “cat” is the fraction with CER above 50%; “max×” is the largest output\-to\-reference length ratio\.

![Refer to caption](https://arxiv.org/html/2606.29213v1/figs/fig_meanmed.png)

Figure 2: Under blur, DeepSeek\-OCR has the best median CER \(1\.5\) but a catastrophic mean \(73\.7\), because 2% of samples enter a repetition loop\. Median together with catastrophic rate is the faithful summary\.

The reported ordering does not transfer\. Unlimited\-OCR is reported to beat DeepSeek\-OCR by \+6 overall on Latin and CJK OmniDocBench\. On clean Devanagari the ordering reverses: DeepSeek\-OCR attains higher chrF\+\+ \(93\.84 versus 91\.04\) and lower WER\. The two specialised OCR\-VLMs are also both outperformed in robustness by a generic VLM \(Qwen\) and by classical EasyOCR\.

### 4\.3 Error taxonomy

We align each hypothesis to its reference at the character level and classify every edit into Devanagari\-specific categories \(Table [4](https://arxiv.org/html/2606.29213#S4.T4)\)\. Catastrophic repetition samples are excluded so the taxonomy reflects genuine recognition errors\.

Table 4: Error counts by category, clean condition, $N = 100$\.

![Refer to caption](https://arxiv.org/html/2606.29213v1/figs/fig_taxonomy.png)

Figure 3: Error composition by category \(clean\)\. EasyOCR errors are dominated by numerals and punctuation, while the VLMs fail on structural elements \(conjunct, matra, nukta\)\.

Two profiles emerge\. The classical engine fails on surface elements: Devanagari numerals \(it transcribes them as Latin digits or misreads them\) and punctuation \(for example, danda normalisation\)\. The VLMs fail on structural elements: conjuncts, matras, and nukta\. Unlimited\-OCR makes the most structural errors, consistent with its lower chrF\+\+\. Recurring look\-alike confusions are visually and phonetically motivated and are consistent across systems: ba ↔ va, gha ↔ dha, ma ↔ bha, da ↔ dha, and ta ↔ Ta\. We also note that a substantial share of the look\-alike edits are really punctuation normalisation \(danda versus full\-stop, smart quotes\); such differences inflate raw error counts and should be normalised, a methodological caveat for Indic OCR evaluation\.
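The align-then-classify step can be sketched roughly as below. This is an illustration under stated assumptions, not the paper's implementation: alignment uses Python's `difflib.SequenceMatcher` (which is not guaranteed to give a minimal edit script), the category boundaries are a simplified reading of the Devanagari Unicode block, and the names `category` and `classify_edits` are hypothetical.

```python
import unicodedata
from collections import Counter
from difflib import SequenceMatcher

DANDA, DOUBLE_DANDA = 0x0964, 0x0965
NUKTA, VIRAMA = 0x093C, 0x094D


def category(ch: str) -> str:
    """Map one code point to a coarse error category (simplified assumption)."""
    cp = ord(ch)
    if ch.isdigit():  # ASCII and Devanagari digits alike
        return "numeral"
    if cp in (DANDA, DOUBLE_DANDA) or unicodedata.category(ch).startswith("P"):
        return "punctuation"
    if cp == NUKTA:
        return "nukta"
    if cp == VIRAMA:  # the virama is what joins consonants into a conjunct
        return "conjunct"
    if 0x093E <= cp <= 0x094C:  # dependent vowel signs
        return "matra"
    return "other"


def classify_edits(ref: str, hyp: str) -> Counter:
    """Count edited characters per category after NFC normalisation."""
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    counts = Counter()
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Attribute each edit to the reference side; pure insertions have
        # no reference characters, so fall back to the hypothesis side.
        chars = hyp[j1:j2] if tag == "insert" else ref[i1:i2]
        counts.update(category(ch) for ch in chars)
    return counts
```

For instance, a reference of `२०२४।` transcribed as `2024.` would register four numeral edits and one punctuation edit, the surface profile described for the classical engine, whereas dropping the virama in क्ष registers a single conjunct edit.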

### 4\.4 Synthetic versus real printed Devanagari

The clean ties vanish on 300 real printed\-Devanagari scans \(Table [5](https://arxiv.org/html/2606.29213#S4.T5), Fig\. [4](https://arxiv.org/html/2606.29213#S4.F4)\), which spread the ten systems across a 76\-point chrF\+\+ range\. Four findings stand out\.

\(1\) Synthetic renders badly overstate quality\. Nine of the ten systems drop sharply from synthetic to real; EasyOCR falls from chrF\+\+ 93\.6 to 58\.3 and its median CER rises from about 2% to 17%\. Benchmarks built only on rendered text are misleading for Devanagari\.

\(2\) Specialised OCR\-VLMs collapse\. DeepSeek\-OCR’s median CER is 100% \(with 89% of samples catastrophic\) and Unlimited\-OCR emits on average 4× the reference length through hallucination; both sit at the bottom \(chrF\+\+ 10 to 25\)\.

\(3\) Frontier closed models mostly hold\. Gemini, Claude, and Mistral all reach a median CER of 0\.0 with chrF\+\+ between 77 and 86\. This is not a simple “closed beats open” story, as finding \(4\) shows\.

\(4\) The English ranking does not transfer\. Two results break the olmOCR\-Bench and English ordering\. First, GPT\-5\.5, a top model on English document OCR, drops to chrF\+\+ 58\.5 on real Devanagari, tying classical EasyOCR and falling far below Gemini and Claude\. Second, the open Qwen3\-VL\-8B reaches chrF\+\+ 75\.2 \(median CER 0\.0\), beating GPT\-5\.5 and approaching Mistral even though it runs freely on a single 24 GB GPU\. Most pointedly, olmOCR\-7B, the model behind the olmOCR\-Bench leaderboard, collapses to chrF\+\+ 40\.5 on real Devanagari\. Strong English OCR performance is thus a poor predictor of Indic performance\.

We acknowledge a confound: these images are word and short\-phrase level, which disadvantages page\-oriented models\. Even granting this, the gap between Gemini and GPT\-5\.5, or between Qwen3\-VL and olmOCR, occurs within the same regime and cannot be explained by granularity alone\.

Table 5: Real printed\-Devanagari scans \($N = 300$, word\-level\), sorted by chrF\+\+\. “cat” is the CER\-above\-50% rate; “len×” is the mean output\-to\-reference length\. F is frontier closed, O is open, C is classical, and S is specialised OCR\-VLM\.

![Refer to caption](https://arxiv.org/html/2606.29213v1/figs/fig_real.png)

Figure 4: chrF\+\+ on synthetic clean renders versus real printed scans, ten systems sorted by real\-data chrF\+\+ \(real bars coloured by family\)\. Nearly all collapse on real images\. Gemini, Claude, Mistral, and the open Qwen3\-VL stay strong, while GPT\-5\.5 and olmOCR\-7B drop sharply despite strong English performance\.

## 5 Distribution\-Matched Post\-Correction

We test whether a cheap engine can be rescued by a post\-corrector that maps noisy OCR text to clean text\. We fine\-tune ByT5\-small \(byte\-level, which suits character\-level OCR noise\) on 6,000 real \(OCR\-output, clean\) pairs\. Held\-out Hindi sentences \(IITB and a general corpus, disjoint from FLORES\) are rendered under the same four conditions and transcribed with EasyOCR\. At inference we chunk inputs to at most 90 characters, because the corrector is trained on short spans and long inputs otherwise induce repetition, then rejoin the pieces\.
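The chunk-and-rejoin step at inference can be sketched as follows. The text does not say where chunk boundaries fall, so this sketch assumes greedy packing at whitespace with a hard split only for single words longer than the limit; `chunk_text`, `correct_long`, and the `corrector` callable (standing in for the fine-tuned model) are hypothetical names.

```python
from typing import Callable, List


def chunk_text(text: str, max_chars: int = 90) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of <= max_chars."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        # A single word longer than the limit is hard-split.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def correct_long(text: str, corrector: Callable[[str], str],
                 max_chars: int = 90) -> str:
    """Correct each chunk independently, then rejoin with single spaces."""
    return " ".join(corrector(chunk) for chunk in chunk_text(text, max_chars))
```

Splitting at whitespace keeps every word intact for the corrector, at the cost of collapsing runs of whitespace to a single space on rejoin. The limit here counts code points; a byte-level model sees roughly three UTF-8 bytes per Devanagari code point, so the effective input length in bytes is correspondingly larger.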

Table 6: Post\-correction \(ByT5\-small trained on EasyOCR noise\)\. CER ↓ / chrF\+\+ ↑, before → after\.

![Refer to caption](https://arxiv.org/html/2606.29213v1/figs/fig_pc.png)

Figure 5: A ByT5 post\-corrector trained on EasyOCR’s own error distribution improves EasyOCR chrF\+\+ in every condition\.

The corrector consistently improves the engine it was trained on \(EasyOCR chrF\+\+ rises by 1\.2 to 1\.5 in all conditions, and CER improves on clean and noise\)\. It does not transfer: applied to Qwen, Unlimited, or DeepSeek outputs, whose error distributions differ, it is neutral or harmful\. The practical conclusion is that OCR post\-correction is effective but must be matched to the target engine’s error distribution\.

## 6 Limitations

Our main controlled evaluation uses rendered images, whose low baseline CER limits post\-correction headroom\. We partly address this with the real printed set \(§[4\.4](https://arxiv.org/html/2606.29213#S4.SS4)\), but that set is word and short\-phrase level and Sanskrit typeset rather than sentence\-level Hindi\. An openly available, sentence\-level, real\-scanned Hindi corpus with reliable transcriptions remains scarce, and obtaining one is the most valuable next step\. The controlled evaluation uses $N = 100$ sentences and Hindi only, CER is computed at the code\-point rather than grapheme\-cluster level, and we evaluate single\-page images rather than the multi\-page long\-horizon setting that Unlimited\-OCR targets\. Line\- and document\-level real Hindi data, multi\-script coverage, and grapheme\-aware scoring are clear next steps\.
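To illustrate what grapheme-aware scoring could look like, the sketch below groups code points into approximate akṣaras and computes an error rate over those units. It is a rough heuristic, not the paper's method and not a full implementation of Unicode grapheme segmentation: combining marks attach to the preceding unit, and a letter that follows a virama is treated as part of the same conjunct. All names are hypothetical.

```python
import unicodedata

VIRAMA = "\u094d"                  # Devanagari sign virama (halant)
JOINERS = {"\u200c", "\u200d"}     # ZWNJ and ZWJ


def aksharas(text: str) -> list:
    """Split text into approximate akshara units (heuristic assumption)."""
    clusters = []
    for ch in unicodedata.normalize("NFC", text):
        # Matras, nukta, virama, anusvara, etc. are combining marks.
        is_mark = unicodedata.category(ch).startswith("M") or ch in JOINERS
        # A letter right after a virama continues the same conjunct.
        joins_conjunct = (
            bool(clusters)
            and clusters[-1].endswith(VIRAMA)
            and unicodedata.category(ch) == "Lo"
        )
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def akshara_cer(ref: str, hyp: str) -> float:
    """Error rate in percent over akshara units instead of code points."""
    ref_units, hyp_units = aksharas(ref), aksharas(hyp)
    return 100.0 * edit_distance(ref_units, hyp_units) / max(len(ref_units), 1)
```

Take the word क्षत्रिय (eight code points, three akṣaras under this heuristic) with the first virama dropped in the hypothesis. Code-point CER is one edit in eight, or 12.5%, while the akṣara-level rate is two edits in three, about 66.7%, because the broken conjunct splits into two wrong units. The gap shows how code-point scoring can understate structural errors.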

## 7 Conclusion

Benchmarking ten OCR systems on Devanagari yields a consistent message: clean rendered text hides every difference, and only real degraded scans reveal them\. The specialised OCR\-VLMs are the least safe choice, since DeepSeek\-OCR’s strong median is masked by catastrophic repetition and both it and Unlimited\-OCR collapse on real scans\. Crucially, strong English OCR does not predict Indic OCR: GPT\-5\.5 and olmOCR\-7B, both strong on English document benchmarks, fall to the middle and bottom of the real\-Devanagari ranking, while the open Qwen3\-VL\-8B, runnable on a single 24 GB GPU, beats GPT\-5\.5 and trails only the strongest closed models \(Gemini and Claude\)\. Synthetic\-only Devanagari benchmarks are therefore misleading, real evaluation is indispensable, and the best general OCR for English may be far from the best for Indic\. We also contribute a robustness methodology \(median and catastrophic\-rate over the mean\), a Devanagari error taxonomy, and a distribution\-matched post\-corrector\. We release the benchmark and code to support evaluation of OCR systems on Indic scripts\.

## References

- \[1\] H\. Wei, Y\. Sun, Y\. Li\. DeepSeek\-OCR: Contexts Optical Compression\. arXiv:2510\.18234, 2025\.
- \[2\] Y\. Yin et al\. Unlimited OCR Works\. arXiv:2606\.23050, 2026\.
- \[3\] S\. Bai et al\. Qwen2\.5\-VL Technical Report\. arXiv:2502\.13923, 2025\.
- \[4\] H\. Wei et al\. General OCR Theory: Towards OCR\-2\.0 via a Unified End\-to\-End Model\. arXiv:2409\.01704, 2024\.
- \[5\] L\. Ouyang et al\. OmniDocBench\. CVPR, 2025\.
