Vision-capable LLMs vs. OCR for long-document (including charts, images, tables, etc.) QA
Summary
A benchmark on 30 long, image-heavy PDFs compares vision-capable LLMs (native PDF reading) against OCR-based pipelines. OCR with layout extraction still outperforms vision models on chart- and table-heavy pages and has a 0% failure rate vs. 7% for native PDF. The sample size is small, however, and many of the gaps are within noise.
Similar Articles
We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R]
A comprehensive benchmark of 18 LLMs on OCR tasks (7k+ calls) reveals that cheaper and older models often match premium accuracy at a fraction of the cost, with full dataset and framework open-sourced.
@jerryjliu0: A downside with using VLMs to parse PDFs is guaranteeing that the output text is *correct* and output in the correct re…
Jerry Liu discusses challenges with using Vision Language Models for PDF parsing, particularly around ensuring text correctness and maintaining proper reading order while avoiding hallucinations.
@techNmak: A lightweight VLM that beats the giants at OCR. (1.7B parameters, SOTA on OmniDocBench) dots.ocr is a new multilingual…
dots.ocr is a new lightweight, 1.7B-parameter multilingual vision-language model that achieves state-of-the-art performance on OmniDocBench, outperforming much larger models (72B+) on document parsing and OCR tasks.
dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model
This paper presents dots.ocr, a unified Vision-Language Model that jointly learns layout detection, text recognition, and relational understanding for multilingual document layout parsing. It achieves state-of-the-art results on OmniDocBench and introduces the XDocParse benchmark spanning 126 languages.
PaddlePaddle/PaddleOCR
PaddleOCR is a powerful, lightweight OCR toolkit that converts PDFs and images into structured data for AI applications, supporting 100+ languages and designed to bridge documents with LLMs.