Vision-capable LLMs vs. OCR for long-document (including charts, images, tables, etc.) QA

Reddit r/artificial 05/24/26, 02:52 AM News

benchmark vision-llm ocr rag document-qa long-document pdf-processing

Summary

A benchmark comparing vision-capable LLMs (native PDF reading) against OCR-based pipelines on 30 long, image-heavy PDFs finds that OCR with layout extraction still outperforms vision models on chart/table-heavy pages and has a 0% failure rate vs. 7% for native PDF, though the sample size is small and many gaps are within noise.

I benchmarked vision-capable LLMs (the "just attach the PDF and let the model read it" pattern) against OCR-based pipelines on 30 long, image-heavy PDFs from MMLongBench-Doc ([https://github.com/mayubo2333/MMLongBench-Doc](https://github.com/mayubo2333/MMLongBench-Doc)). There were 171 questions in total, using Claude Sonnet 4.5 as the LLM.

Post-retry results:

|Approach|Accuracy|$/query|
|:-|:-|:-|
|LlamaCloud premium + full-context|59.6%|$0.1885|
|Azure premium + full-context|58.5%|$0.2051|
|Azure basic + full-context|54.4%|$0.1062|
|Agentic RAG|53.2%|$0.0827|
|**Native PDF (vision LLM)**|**52.0%**|**$0.2552**|
|LlamaCloud basic + full-context|50.9%|$0.1049|

Native PDF came 5th of 6 on accuracy and was the most expensive arm at $0.2552 per query.

Two findings:

1. Vision underperformed on chart-heavy and table-heavy pages, the territory that the "vision LLMs make OCR obsolete" claim most often points to. Premium OCR with layout extraction held up better there.
2. The native-PDF arm had a 7% intrinsic failure rate (related to PDF file size) that survived retries. There were 27 first-pass failures, with 5 attempts of exponential backoff per failed query. Fifteen recovered, and 12 stayed permanently broken. These were concentrated in two specific PDFs that fail for predictable transport-layer reasons (the blog identifies them). OCR-based arms had a 0% intrinsic failure rate after retries.

Caveats: 30 docs is a small sample. I ran McNemar's pairwise test to determine which gaps are real and which are within noise. Only 3 of 15 head-to-head gaps are statistically distinguishable at α = 0.05, so the order in the table is partly noise. The vision-versus-OCR finding survives the test.

Full writeup: [https://www.surfsense.com/blog/agentic-rag-vs-long-context-llms-benchmark](https://www.surfsense.com/blog/agentic-rag-vs-long-context-llms-benchmark)
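For readers who have not used McNemar's test, here is a minimal sketch of how one pairwise comparison between two arms could be computed. It assumes the exact (binomial) variant of the test and per-question correctness lists for each arm. The post does not say which variant was used or publish per-question outcomes, so the function names and the example data below are hypothetical.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count the questions where exactly one of the two arms was correct.

    correct_a / correct_b are equal-length lists of booleans, one entry per
    question, in the same question order (hypothetical input format).
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis each discordant question is a fair coin flip,
    so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # the arms never disagree, so there is no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Made-up outcomes for two arms on six questions, for illustration only.
    arm_a = [True, True, False, True, False, True]
    arm_b = [True, False, False, False, True, False]
    b, c = discordant_counts(arm_a, arm_b)
    print(f"b={b}, c={c}, p={mcnemar_exact(b, c):.3f}")
```

Only the questions where the two arms disagree enter the test, which is why two arms several accuracy points apart can still be statistically indistinguishable on 171 questions. With six arms there are 6 × 5 / 2 = 15 such pairwise comparisons, matching the 15 head-to-head gaps mentioned in the caveats.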
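The retry policy described in the second finding (exponential backoff, 5 attempts per failed query) can be sketched as follows. This is an illustration under assumptions, not the benchmark's actual harness: `call_with_backoff`, the base delay, the delay cap, and the jitter are all hypothetical choices, and `fn` stands in for whatever function issues the query.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is exhausted.

    The wait doubles after each failure (1s, 2s, 4s, ...), is capped at
    max_delay, and gets up to 10% random jitter. All values are assumptions.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:  # a real harness would catch narrower errors
            last_error = exc
            if attempt == max_attempts - 1:
                break  # no point sleeping after the final attempt
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError(f"failed after {max_attempts} attempts") from last_error
```

A query that still raises after the last attempt is what the post counts as a permanent failure. From the numbers given, 27 first-pass failures minus 15 recoveries leaves 12 permanent failures, and 12 / 171 ≈ 7.0%, which is the intrinsic failure rate reported for the native-PDF arm.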

Similar Articles

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R]

@jerryjliu0: A downside with using VLMs to parse PDFs is guaranteeing that the output text is correct and output in the correct re…

@techNmak: A lightweight VLM that beats the giants at OCR. (1.7B parameters, SOTA on OmniDocBench) dots.ocr is a new multilingual…

dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

PaddlePaddle/PaddleOCR
