Can OCR-VLMs Read Devanagari? A Stress-Test Benchmark and Post-Correction Study
Summary
This paper benchmarks ten OCR systems on Devanagari script under synthetic degradation and real scans, finding that synthetic renders overstate quality, specialized OCR-VLMs are fragile, and strong English OCR does not predict Indic OCR performance. It releases a benchmark, code, and models.
# Can OCR-VLMs Read Devanagari? A Stress-Test Benchmark and Post-Correction Study
Source: [https://arxiv.org/html/2606.29213](https://arxiv.org/html/2606.29213)
\(2026\)
###### Abstract
OCR systems, ranging from classical engines to specialised OCR vision\-language models \(OCR\-VLMs\) and frontier multimodal LLMs, report strong results on English and Chinese document benchmarks, yet their behaviour on Indic scripts is largely uncharacterised\. We benchmark ten systems on Devanagari \(Hindi\): classical EasyOCR; open VLMs \(Qwen2\.5\-VL\-3B, Qwen3\-VL\-8B, olmOCR\-7B\); specialised OCR\-VLMs \(DeepSeek\-OCR, Unlimited\-OCR\); and frontier closed models \(Gemini 2\.5 Flash, Claude Opus 4\.7, GPT\-5\.5, Mistral OCR\), across four synthetic degradation conditions and 300 real printed scans\. We report four findings\. First, on clean rendered text all ten cluster within chrF\+\+ 91 to 98, so synthetic text does not separate them\. Second, under degradation the specialised OCR\-VLMs are the most fragile: DeepSeek\-OCR suffers rare but catastrophic repetition failures \(outputs up to 71× the reference length\) that wreck its corpus mean even though its median is the best of any system, which is why we report median and catastrophic\-rate instead of the mean\. Third, on real scans nine of the ten systems collapse \(EasyOCR falls from chrF\+\+ 93\.6 to 58\.3\) and the field spreads across a 76\-point range, so synthetic renders badly overstate Devanagari quality\. Fourth, strong English OCR does not predict Indic OCR: GPT\-5\.5 drops to chrF\+\+ 58\.5 \(tying classical EasyOCR\) and olmOCR\-7B, the model behind olmOCR\-Bench, falls to 40\.5, while the open Qwen3\-VL\-8B \(75\.2, runnable on a single 24 GB GPU\) beats GPT\-5\.5 and approaches Mistral; Gemini and Claude lead at 86\.3 and 82\.2\. An error taxonomy separates surface errors \(numerals, punctuation\) from structural ones \(conjuncts, matras, nukta\), and a byte\-level \(ByT5\) post\-corrector improves a cheap engine on its own error distribution \(chrF\+\+ \+1\.2 to \+1\.5\) but does not transfer across engines\.
We release the benchmark, code, and models: [https://github\.com/Aditya\-PS\-05/devanagari\-ocr\-benchmark](https://github.com/Aditya-PS-05/devanagari-ocr-benchmark)
## 1 Introduction
The latest wave of end\-to\-end OCR vision\-language models \(OCR\-VLMs\), including DeepSeek\-OCR \[[1](https://arxiv.org/html/2606.29213#bib.bib1)\], its successor DeepSeek\-OCR 2, and the recently released Unlimited\-OCR \[[2](https://arxiv.org/html/2606.29213#bib.bib2)\], treats document parsing as image\-to\-text generation with a large language decoder\. These models report state\-of\-the\-art results on OmniDocBench, whose documents are overwhelmingly English and Chinese\. Whether those gains hold for Indic scripts is unknown\.
Devanagari, the script of Hindi and several other languages, poses challenges that do not appear in Latin and CJK text: stacked conjunct consonants \(saṃyuktākṣar\), dependent vowel signs \(matras\) placed above, below, and beside a base glyph, the connecting headline \(shirorekha\), the nukta diacritic, frequent Hindi/English code\-mixing, and two numeral systems\. A model that excels on Latin text may still mishandle these\.
We ask three questions\. \(Q1\) How accurate and how robust are modern OCR\-VLMs on Devanagari under realistic image degradation? \(Q2\) What do they get wrong, categorically? \(Q3\) Can a lightweight post\-corrector recover the errors of a cheap engine? Our contributions are:
- A controlled, multi\-font, multi\-condition Devanagari OCR benchmark with a script\-aware evaluation protocol \(Unicode NFC normalisation; CER/WER/chrF\+\+\)\.
- A robustness analysis showing that corpus\-mean error is dominated by rare catastrophic repetition failures, so that median together with catastrophic\-rate is the faithful summary\.
- A Devanagari error taxonomy that contrasts classical\-OCR and VLM failure modes\.
- A distribution\-matched byte\-level post\-corrector, with a positive result for matched noise and a negative cross\-engine transfer result\.
## 2 Related Work
End\-to\-end OCR\-VLMs\. GOT\-OCR2 \[[4](https://arxiv.org/html/2606.29213#bib.bib4)\], Nougat, and the DeepSeek\-OCR line cast OCR as long\-form generation, using a high\-compression visual encoder and an LLM decoder\. Unlimited\-OCR \[[2](https://arxiv.org/html/2606.29213#bib.bib2)\] replaces the decoder’s attention with Reference Sliding Window Attention to bound the KV cache for long\-document parsing, reporting a \+6 overall gain over DeepSeek\-OCR on OmniDocBench\. Generic VLMs such as Qwen2\.5\-VL \[[3](https://arxiv.org/html/2606.29213#bib.bib3)\] also perform competitive document OCR\. Document\-OCR benchmarks such as OmniDocBench \[[5](https://arxiv.org/html/2606.29213#bib.bib5)\] and olmOCR\-Bench \[olmocr\] drive progress with unit\-test\-style checks over real PDFs, but their documents are overwhelmingly English/Latin and Chinese, and Indic scripts are essentially absent\. Indic OCR has historically relied on pipeline systems; to our knowledge a large\-scale evaluation of the new OCR\-VLMs and frontier LLMs on Devanagari is absent, and that is the gap this paper fills\. OCR post\-correction as sequence\-to\-sequence denoising is established for Latin and historical text; we study it for Devanagari with a byte\-level model\.
## 3 Benchmark Construction
Source text\. We use the Hindi side of the FLORES test set \(997 sentences\), sampling the first N = 100 Devanagari sentences for the main evaluation\. FLORES is held out from all training\.
Rendering\. Each sentence is rendered to a white\-background image with one of five Devanagari fonts \(Droid Sans Devanagari; Lohit Devanagari; Noto Sans Devanagari Regular/Medium/Condensed\), cycled across sentences, with line wrapping at 1400 px width and 40 px type\.
Degradation conditions\. From each clean image we derive three degraded variants: *blur* \(Gaussian, σ ∈ \[1\.0, 1\.8\]\); *noise* \(additive pixel noise on 6% of pixels\); and *low\-DPI* \(0\.45× downscale then upscale, bilinear\)\. Together with the clean originals this yields 4 conditions × 100 images\.
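The three degradations can be sketched in pure Python on a grayscale image stored as a list of rows of 0..255 values. This is a minimal illustration under stated assumptions, not the released pipeline: the function names, the noise amplitude, the fixed seed, and the 3σ kernel radius are all hypothetical choices; only the σ range, the 6% pixel fraction, and the 0.45× bilinear round trip come from the text.

```python
import math
import random

def gaussian_kernel(sigma):
    """Normalised 1-D Gaussian weights covering roughly +/- 3 sigma (assumed radius)."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def _blur_rows(img, kernel):
    # Convolve every row with the kernel, clamping indices at the borders.
    radius = len(kernel) // 2
    out = []
    for row in img:
        n = len(row)
        out.append([
            sum(w * row[min(max(x + k - radius, 0), n - 1)] for k, w in enumerate(kernel))
            for x in range(n)
        ])
    return out

def _transpose(img):
    return [list(col) for col in zip(*img)]

def blur(img, sigma):
    """Separable Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    return _transpose(_blur_rows(_transpose(_blur_rows(img, kernel)), kernel))

def add_noise(img, fraction=0.06, amplitude=128, rng=None):
    """Additive noise on a fixed fraction of pixels; the amplitude is an assumption."""
    rng = rng or random.Random(0)
    h, w = len(img), len(img[0])
    out = [list(row) for row in img]
    for idx in rng.sample(range(h * w), round(fraction * h * w)):
        y, x = divmod(idx, w)
        out[y][x] = min(255, max(0, out[y][x] + rng.randint(-amplitude, amplitude)))
    return out

def resize_bilinear(img, new_h, new_w):
    """Bilinear resize using pixel-centre alignment."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(new_h):
        y = min(max((i + 0.5) * h / new_h - 0.5, 0), h - 1)
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for j in range(new_w):
            x = min(max((j + 0.5) * w / new_w - 0.5, 0), w - 1)
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

def low_dpi(img, scale=0.45):
    """Downscale by `scale`, then upscale back to the original size."""
    h, w = len(img), len(img[0])
    small = resize_bilinear(img, max(1, round(h * scale)), max(1, round(w * scale)))
    return resize_bilinear(small, h, w)

def degrade(img, condition, rng=None):
    """Return one of the four benchmark conditions for a clean image."""
    rng = rng or random.Random(0)
    if condition == "clean":
        return [list(row) for row in img]
    if condition == "blur":
        return blur(img, rng.uniform(1.0, 1.8))
    if condition == "noise":
        return add_noise(img, rng=rng)
    if condition == "low_dpi":
        return low_dpi(img)
    raise ValueError(condition)
```

Calling `degrade(img, c)` for each `c` in `("clean", "blur", "noise", "low_dpi")` gives the four variants of one rendered sentence. A production pipeline would more likely use an imaging library; the pure-Python form is used here only to keep the sketch self-contained.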
Metrics\. All references and hypotheses are Unicode NFC\-normalised before scoring\. We report Character Error Rate \(CER\), Word Error Rate \(WER\), and chrF\+\+ \(character n\-gram F\-score with word order 2\)\. Because a single visual character \(akṣara\) spans multiple code points, code\-point CER understates structural errors; we treat this as a known limitation \(§[6](https://arxiv.org/html/2606.29213#S6)\)\.
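To make the code-point limitation concrete, the sketch below computes CER after NFC normalisation and compares it with a cluster-level variant. The clustering rule is a rough assumption made for illustration (combining marks attach to the preceding base, and a consonant following a virama joins the same cluster); it is not a full Unicode grapheme segmenter and is not taken from the paper's code.

```python
import unicodedata

VIRAMA = "\u094d"

def nfc(text):
    return unicodedata.normalize("NFC", text)

def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """Code-point CER on NFC-normalised text."""
    ref, hyp = nfc(reference), nfc(hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)

def clusters(text):
    """Approximate akshara segmentation (assumed rule, see lead-in)."""
    out = []
    for ch in nfc(text):
        is_mark = unicodedata.category(ch).startswith("M")
        is_consonant = "\u0915" <= ch <= "\u0939"
        if out and (is_mark or (is_consonant and out[-1].endswith(VIRAMA))):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def cluster_cer(reference, hypothesis):
    """CER where the unit of comparison is an approximate akshara."""
    ref, hyp = clusters(reference), clusters(hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)

# "kshatriya" (8 code points, 3 clusters) and a hypothesis that drops one matra
reference = "\u0915\u094d\u0937\u0924\u094d\u0930\u093f\u092f"
hypothesis = "\u0915\u094d\u0937\u0924\u094d\u0930\u092f"
print(cer(reference, hypothesis), cluster_cer(reference, hypothesis))
```

In this example a single dropped matra costs one edit out of eight code points but one out of three clusters, which is the sense in which code-point CER understates structural errors.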
Real printed set\. To measure the gap between synthetic and real images we additionally evaluate on 300 real printed\-Devanagari images with transcriptions, sampled from the Sanskrit\-OCR\-Typed corpus \(historical typeset scans\)\. These are word and short\-phrase level, so we use them as a real\-image robustness probe rather than a document\-parsing benchmark\.
Models\. We evaluate ten systems across four families\. *Classical:* EasyOCR \(Hindi and English\)\. *Open VLMs:* Qwen2\.5\-VL\-3B and the newer Qwen3\-VL\-8B \(generic, prompted to transcribe verbatim\) and olmOCR\-7B \(the model behind olmOCR\-Bench\)\. *Specialised OCR\-VLMs:* DeepSeek\-OCR \(3B, 0\.5B active; “Free OCR”\) and Unlimited\-OCR \(3B, 0\.5B active; “document parsing”, Gundam mode\)\. *Frontier closed \(API\):* Google Gemini 2\.5 Flash, Anthropic Claude Opus 4\.7, OpenAI GPT\-5\.5, and Mistral OCR, evaluated on the clean and real sets \(cost\-bounded\), while the local open and specialised models additionally run all four degradation conditions\. VLM outputs are stripped of layout and grounding special tokens and of bounding\-box coordinates before scoring\. Local inference runs on a single NVIDIA A10G \(23 GB\) in bfloat16, one model resident at a time\. We also attempted PaddleOCR, GOT\-OCR2, and LlamaParse: the first two would not run reliably in our environment \(a PaddlePaddle segfault on Amazon Linux 2023 and a processor\-instantiation error\), and LlamaParse returned non\-Devanagari \(Latin\) output on every image, so we omit all three from the quantitative tables\.
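The output clean-up step might look like the sketch below. The token and bounding-box patterns are hypothetical placeholders chosen for illustration (tags of the form `<|name|>` and boxes of the form `[[x1, y1, x2, y2]]`); the actual markup differs by model and should be checked against each model's own output before reuse.

```python
import re

# Assumed markup shapes, not taken from any specific model's documentation.
SPECIAL_TOKEN = re.compile(r"<\|[^|>]*\|>")
BBOX = re.compile(r"\[\[\s*\d+(?:\s*,\s*\d+){3}\s*\]\]")

def strip_markup(text):
    """Remove assumed grounding tokens and boxes, then collapse whitespace."""
    text = BBOX.sub(" ", text)
    text = SPECIAL_TOKEN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

raw = "<|det|>[[12, 30, 480, 72]]<|/det|> नमस्ते  दुनिया"
print(strip_markup(raw))
```

Collapsing all whitespace is acceptable for single-sentence images; a page-level evaluation would probably need to preserve line breaks.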
## 4 Results
### 4\.1 Clean accuracy: everyone looks good
On clean rendered text all ten systems score chrF\+\+ in a narrow 91 to 98 band \(Table [1](https://arxiv.org/html/2606.29213#S4.T1)\)\. The frontier closed models lead slightly \(Claude 98\.0, Mistral 97\.6\), but classical EasyOCR, the open VLMs, and the specialised OCR\-VLMs are all within a few points\. Clean synthetic text does not separate the systems, which is exactly why degradation and real data matter\.
Table 1: Clean rendered Devanagari, N = 100\. CER ↓ / chrF\+\+ ↑\.
### 4\.2 Robustness under degradation
Table [2](https://arxiv.org/html/2606.29213#S4.T2) reports corpus CER per condition for the six systems we run locally across all four conditions\. EasyOCR and the Qwen models are nearly flat, olmOCR is stable, Unlimited\-OCR degrades moderately, and DeepSeek\-OCR’s corpus CER explodes to 111\.8 under blur and 51\.9 under low\-DPI\.
Table 2: Corpus CER \(%, ↓\) by condition, N = 100\. Best per column in bold\.
Figure 1: Corpus CER by image condition \(log scale\)\. EasyOCR and Qwen are nearly flat, while DeepSeek\-OCR collapses under blur and low\-DPI\.
The mean hides the truth\. Table [3](https://arxiv.org/html/2606.29213#S4.T3) decomposes per\-sample CER\. DeepSeek\-OCR has the best median CER of all systems \(1\.2 to 1\.5\), yet its mean is wrecked by the 2 to 3% of samples that enter a degenerate repetition loop and produce outputs up to 71\.6× the reference length\. Unlimited\-OCR, whose decoder uses an explicit no\-repeat\-n\-gram guard, bounds its worst case to 3\.8×\. We therefore recommend reporting median CER and catastrophic rate \(the fraction of samples with CER above 50%\) alongside the mean\.
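A minimal sketch of this summary, assuming per-sample CER values in percent and one output-to-reference length ratio per sample. The function names and the toy numbers are illustrative assumptions, not figures from the benchmark.

```python
from statistics import mean, median

def length_ratio(reference, hypothesis):
    """Output length divided by reference length, in characters."""
    return len(hypothesis) / max(len(reference), 1)

def summarise(cers, length_ratios, threshold=50.0):
    """Mean, median, catastrophic rate (CER above threshold), and worst length ratio."""
    return {
        "mean": mean(cers),
        "median": median(cers),
        "cat_rate": sum(c > threshold for c in cers) / len(cers),
        "max_len_ratio": max(length_ratios),
    }

# Toy data: 98 good samples plus 2 runaway repetitions.
cers = [1.5] * 98 + [3600.0] * 2
ratios = [1.0] * 98 + [40.0, 50.0]
print(summarise(cers, ratios))
```

With these toy values two runaway samples out of a hundred lift the mean far above the median, while the median and the 2% catastrophic rate describe the two behaviours separately.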
Table 3: Per\-sample CER distribution under blur and low\-DPI\. “cat” is the fraction with CER above 50%; “max×” is the largest output\-to\-reference length ratio\.
Figure 2: Under blur, DeepSeek\-OCR has the best median CER \(1\.5\) but a catastrophic mean \(73\.7\), because 2% of samples enter a repetition loop\. Median together with catastrophic rate is the faithful summary\.
The reported ordering does not transfer\. Unlimited\-OCR is reported to beat DeepSeek\-OCR by \+6 overall on Latin and CJK OmniDocBench\. On clean Devanagari the ordering reverses: DeepSeek\-OCR attains higher chrF\+\+ \(93\.84 versus 91\.04\) and lower WER\. The two specialised OCR\-VLMs are also both outperformed in robustness by a generic VLM \(Qwen\) and by classical EasyOCR\.
### 4\.3 Error taxonomy
We align each hypothesis to its reference at the character level and classify every edit into Devanagari\-specific categories \(Table [4](https://arxiv.org/html/2606.29213#S4.T4)\)\. Catastrophic repetition samples are excluded so the taxonomy reflects genuine recognition errors\.
Table 4: Error counts by category, clean condition, N = 100\.
Figure 3: Error composition by category \(clean\)\. EasyOCR errors are dominated by numerals and punctuation, while the VLMs fail on structural elements \(conjunct, matra, nukta\)\.
Two profiles emerge\. The classical engine fails on surface elements: Devanagari numerals \(it transcribes them as Latin digits or misreads them\) and punctuation \(for example, danda normalisation\)\. The VLMs fail on structural elements: conjuncts, matras, and nukta\. Unlimited\-OCR makes the most structural errors, consistent with its lower chrF\+\+\. Recurring look\-alike confusions are visually and phonetically motivated and are consistent across systems: ba ↔ va, gha ↔ dha, ma ↔ bha, da ↔ dha, and ta ↔ Ta\. We also note that a substantial share of the look\-alike edits are really punctuation normalisation \(danda versus full stop, smart quotes\); such differences inflate raw error counts and should be normalised, a methodological caveat for Indic OCR evaluation\.
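One way to approximate this classification is sketched below. It rests on assumptions that are not in the paper: alignment uses `difflib.SequenceMatcher` (which is not guaranteed to be a minimum-edit alignment), each edit is attributed to the reference characters involved (or to the inserted hypothesis characters), a virama edit is counted as a conjunct error, and the category boundaries are simple Unicode ranges.

```python
import unicodedata
from collections import Counter
from difflib import SequenceMatcher

def edit_category(ch):
    """Map one character to an assumed taxonomy bucket."""
    cp = ord(ch)
    if unicodedata.category(ch) == "Nd":
        return "numeral"          # Devanagari or Latin digits
    if cp == 0x093C:
        return "nukta"
    if cp == 0x094D:
        return "conjunct"         # virama, the conjunct-forming sign
    if 0x093E <= cp <= 0x094C:
        return "matra"            # dependent vowel signs
    if unicodedata.category(ch).startswith("P"):
        return "punctuation"      # includes danda and double danda
    return "other"

def classify_edits(reference, hypothesis):
    """Count edited characters per category after NFC normalisation."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    counts = Counter()
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        span = hyp[j1:j2] if tag == "insert" else ref[i1:i2]
        counts.update(edit_category(ch) for ch in span)
    return counts

# Devanagari digits read as Latin digits, and danda read as a full stop
ref = "\u0967\u0968 \u0915\u093f\u0924\u093e\u092c\u0947\u0902\u0964"
hyp = "12 \u0915\u093f\u0924\u093e\u092c\u0947\u0902."
print(classify_edits(ref, hyp))
```

The example reproduces the surface-error profile described above: the only edits fall in the numeral and punctuation buckets, so normalising digits and danda before scoring would remove them.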
### 4\.4 Synthetic versus real printed Devanagari
The clean ties vanish on 300 real printed\-Devanagari scans \(Table [5](https://arxiv.org/html/2606.29213#S4.T5), Fig\. [4](https://arxiv.org/html/2606.29213#S4.F4)\), which spread the ten systems across a 76\-point chrF\+\+ range\. Four findings stand out\.
\(1\) Synthetic renders badly overstate quality\. Nine of the ten systems drop sharply from synthetic to real; EasyOCR falls from chrF\+\+ 93\.6 to 58\.3 and its median CER rises from about 2% to 17%\. Benchmarks built only on rendered text are misleading for Devanagari\.
\(2\) Specialised OCR\-VLMs collapse\. DeepSeek\-OCR’s median CER is 100% \(with 89% of samples catastrophic\) and Unlimited\-OCR emits on average 4× the reference length through hallucination; both sit at the bottom \(chrF\+\+ 10 to 25\)\.
\(3\) Frontier closed models mostly hold\. Gemini, Claude, and Mistral all reach a median CER of 0\.0 with chrF\+\+ between 77 and 86\. This is not a simple “closed beats open” story, as finding \(4\) shows\.
\(4\) The English ranking does not transfer\. Two results break the olmOCR\-Bench and English ordering\. First, GPT\-5\.5, a top model on English document OCR, drops to chrF\+\+ 58\.5 on real Devanagari, tying classical EasyOCR and falling far below Gemini and Claude\. Second, the open Qwen3\-VL\-8B reaches chrF\+\+ 75\.2 \(median CER 0\.0\), beating GPT\-5\.5 and approaching Mistral even though it runs freely on a single 24 GB GPU\. Most pointedly, olmOCR\-7B, the model behind the olmOCR\-Bench leaderboard, collapses to chrF\+\+ 40\.5 on real Devanagari\. Strong English OCR performance is thus a poor predictor of Indic performance\.
We acknowledge a confound: these images are word and short\-phrase level, which disadvantages page\-oriented models\. Even granting this, the gap between Gemini and GPT\-5\.5, or between Qwen3\-VL and olmOCR, occurs within the same regime and cannot be explained by granularity alone\.
Table 5: Real printed\-Devanagari scans \(N = 300, word\-level\), sorted by chrF\+\+\. “cat” is the CER\-above\-50% rate; “len×” is the mean output\-to\-reference length\. F is frontier closed, O is open, C is classical, and S is specialised OCR\-VLM\.
Figure 4: chrF\+\+ on synthetic clean renders versus real printed scans, ten systems sorted by real\-data chrF\+\+ \(real bars coloured by family\)\. Nearly all collapse on real images\. Gemini, Claude, Mistral, and the open Qwen3\-VL stay strong, while GPT\-5\.5 and olmOCR\-7B drop sharply despite strong English performance\.
## 5 Distribution\-Matched Post\-Correction
We test whether a cheap engine can be rescued by a post\-corrector that maps noisy OCR text to clean text\. We fine\-tune ByT5\-small \(byte\-level, which suits character\-level OCR noise\) on 6,000 real \(OCR\-output, clean\) pairs\. Held\-out Hindi sentences \(IITB and a general corpus, disjoint from FLORES\) are rendered under the same four conditions and transcribed with EasyOCR\. At inference we chunk inputs to at most 90 characters, because the corrector is trained on short spans and long inputs otherwise induce repetition, then rejoin the pieces\.
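The chunk-and-rejoin step could be implemented roughly as follows. The 90-character limit is from the text; cutting at the last space inside each window, hard-cutting when a window contains no space, and the `correct` callable standing in for the fine-tuned model are assumptions made for this sketch.

```python
def chunk_text(text, max_chars=90):
    """Split text into pieces of at most max_chars, preferring to cut after a space."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1  # keep the space with the left-hand chunk
        chunks.append(text[start:end])
        start = end
    return chunks

def post_correct(text, correct, max_chars=90):
    """Apply `correct` (a hypothetical str -> str model call) chunk by chunk."""
    pieces = []
    for chunk in chunk_text(text, max_chars):
        body = chunk.rstrip(" ")
        tail = chunk[len(body):]  # restore the trailing spaces the model never sees
        pieces.append(correct(body) + tail)
    return "".join(pieces)

sample = " ".join(["abcde"] * 50)
print([len(c) for c in chunk_text(sample)])
```

Because the chunks concatenate back to the original string, an identity corrector returns the input unchanged, which makes the wrapper easy to check before a real model is plugged in.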
Table 6: Post\-correction \(ByT5\-small trained on EasyOCR noise\)\. CER ↓ / chrF\+\+ ↑, before → after\.
Figure 5: A ByT5 post\-corrector trained on EasyOCR’s own error distribution improves EasyOCR chrF\+\+ in every condition\.
The corrector consistently improves the engine it was trained on \(EasyOCR chrF\+\+ rises by 1\.2 to 1\.5 in all conditions, and CER improves on clean and noise\)\. It does not transfer: applied to Qwen, Unlimited, or DeepSeek outputs, whose error distributions differ, it is neutral or harmful\. The practical conclusion is that OCR post\-correction is effective but must be matched to the target engine’s error distribution\.
## 6 Limitations
Our main controlled evaluation uses rendered images, whose low baseline CER limits post\-correction headroom\. We partly address this with the real printed set \(§[4\.4](https://arxiv.org/html/2606.29213#S4.SS4)\), but that set is word and short\-phrase level and Sanskrit typeset rather than sentence\-level Hindi\. An openly available, sentence\-level, real\-scanned Hindi corpus with reliable transcriptions remains scarce, and obtaining one is the most valuable next step\. The controlled evaluation uses N = 100 sentences and Hindi only, CER is computed at the code\-point rather than grapheme\-cluster level, and we evaluate single\-page images rather than the multi\-page long\-horizon setting that Unlimited\-OCR targets\. Line\- and document\-level real Hindi data, multi\-script coverage, and grapheme\-aware scoring are clear next steps\.
## 7 Conclusion
Benchmarking ten OCR systems on Devanagari yields a consistent message: clean rendered text hides every difference, and only real degraded scans reveal them\. The specialised OCR\-VLMs are the least safe choice, since DeepSeek\-OCR’s strong median is masked by catastrophic repetition and both it and Unlimited\-OCR collapse on real scans\. Crucially, strong English OCR does not predict Indic OCR: GPT\-5\.5 and olmOCR\-7B, both strong on English document benchmarks, fall to the middle and bottom of the real\-Devanagari ranking, while the open Qwen3\-VL\-8B, runnable on a single 24 GB GPU, beats GPT\-5\.5 and trails only the strongest closed models \(Gemini and Claude\)\. Synthetic\-only Devanagari benchmarks are therefore misleading, real evaluation is indispensable, and the best general OCR for English may be far from the best for Indic\. We also contribute a robustness methodology \(median and catastrophic\-rate over the mean\), a Devanagari error taxonomy, and a distribution\-matched post\-corrector\. We release the benchmark and code to support evaluation of OCR systems on Indic scripts\.
## References
- \[1\] H\. Wei, Y\. Sun, Y\. Li\. DeepSeek\-OCR: Contexts Optical Compression\. arXiv:2510\.18234, 2025\.
- \[2\] Y\. Yin et al\. Unlimited OCR Works\. arXiv:2606\.23050, 2026\.
- \[3\] S\. Bai et al\. Qwen2\.5\-VL Technical Report\. arXiv:2502\.13923, 2025\.
- \[4\] H\. Wei et al\. General OCR Theory: Towards OCR\-2\.0 via a Unified End\-to\-End Model\. arXiv:2409\.01704, 2024\.
- \[5\] L\. Ouyang et al\. OmniDocBench\. CVPR, 2025\.