@jerryjliu0: Last week we revamped Liteparse to be the fastest PDF parser out there An underrated part of liteparse is it doesn't ju…
Summary
Jerry Liu announces a revamped LiteParse, a fast PDF parser that provides bounding boxes for audit trails, with sample demos available.
Cached at: 06/01/26, 09:36 PM
Last week we revamped Liteparse to be the fastest PDF parser out there. An underrated part of liteparse is that it doesn’t just give you text: it gives you bounding boxes that a coding agent can use to paint exact audit trails back to the source document. For instance, check out the deep research skill we compiled in liteparse_samples: https://github.com/jerryjliu/liteparse_samples… Come check out liteparse: https://github.com/run-llama/liteparse… We are hard at work making liteparse even better (e.g. Markdown support). Please feel free to open issues and PRs, and let us know your feature requests.
jerryjliu/liteparse_samples
Source: https://github.com/jerryjliu/liteparse_samples
LiteParse Samples
Interactive demos showcasing LiteParse — a fast, local, model-free document parser by LlamaIndex.
Samples
Parser Comparison
Side-by-side comparison of LiteParse vs PyPDF vs PyMuPDF on real government and financial documents. See the original PDF on the left, then tab through each parser’s extracted text on the right.

Quick start: Open comparison/output/comparison.html in your browser.
Features:
- 8 document sections from 5 real-world PDFs (FDIC, Federal Reserve, CMS, IRS, WHO)
- Embedded PDF viewer alongside parsed text
- Per-document timing for each parser
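As a rough illustration of per-document timing, the sketch below measures wall-clock time for each parser on a single file. It is not the code in generate.py: `time_parsers` and `fake_parser` are hypothetical names, and each real parser (LiteParse, PyPDF, PyMuPDF) is assumed to be wrapped in a function that takes a path and returns text.

```python
import time
from typing import Callable, Dict


def time_parsers(parsers: Dict[str, Callable[[str], str]], pdf_path: str) -> Dict[str, float]:
    """Return the wall-clock seconds each parser took on one document."""
    timings = {}
    for name, parse in parsers.items():
        start = time.perf_counter()
        parse(pdf_path)  # the extracted text is discarded; only the duration matters here
        timings[name] = time.perf_counter() - start
    return timings


def fake_parser(path: str) -> str:
    """Hypothetical stand-in so the sketch runs without any PDF library installed."""
    return f"text of {path}"


if __name__ == "__main__":
    results = time_parsers({"fake": fake_parser}, "data/fed_h41.pdf")
    for name, seconds in results.items():
        print(f"{name}: {seconds * 1000:.2f} ms")
```

A single run per parser is noisy; repeating each measurement and reporting the median would give steadier numbers.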
Visual Citations
Exact keyword search over parsed documents — see precisely where each match appears on the source PDF page, with bounding boxes highlighted directly on the page image. This is a simple substring match demo (not fuzzy or RAG-based search). Learn more in the Visual Citations guide.

Quick start: Open visual_citations/output/visual-citations.html in your browser.
Features:
- Interactive keyword search across all documents
- Bounding box overlays on rendered page images
- Side-by-side view of source page and parsed text with highlighted matches
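The substring matching described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the sample's actual code: the item shape (a dict with `text` and `bbox` keys), the `(x0, y0, x1, y1)` coordinate order, and the case-insensitive comparison are all hypothetical and may differ from LiteParse's real output.

```python
from typing import Dict, List, Tuple

# Assumed bounding-box layout: (x0, y0, x1, y1) in page coordinates.
BBox = Tuple[float, float, float, float]


def find_matches(pages: List[List[Dict]], keyword: str) -> List[Dict]:
    """Case-insensitive substring match over text items.

    Returns one hit per matching item with its page index and bounding box.
    """
    needle = keyword.lower()
    hits: List[Dict] = []
    if not needle:
        return hits  # an empty query would otherwise match every item
    for page_index, items in enumerate(pages):
        for item in items:
            if needle in item["text"].lower():
                hits.append({"page": page_index, "text": item["text"], "bbox": item["bbox"]})
    return hits


# Hypothetical parsed output: one list of text items per page.
sample_pages = [
    [
        {"text": "Total revenue was $4.2M", "bbox": (72.0, 100.0, 300.0, 112.0)},
        {"text": "Net income", "bbox": (72.0, 120.0, 140.0, 132.0)},
    ],
    [
        {"text": "Revenue by segment", "bbox": (72.0, 90.0, 210.0, 102.0)},
    ],
]

if __name__ == "__main__":
    for hit in find_matches(sample_pages, "revenue"):
        print(hit["page"], hit["bbox"], hit["text"])
```

Each hit carries enough information to draw a rectangle on the rendered page image. A keyword that spans two text items would be missed by this item-by-item approach.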
Research Docs (Claude Code Skill)
Ask questions about your documents — get answers with visual source citations. Install as a Claude Code skill and invoke with /research-docs. The skill parses your documents, has Claude answer your question, and generates an HTML report with the answer and cited source pages highlighted with bounding boxes.

Install:
npx skills add run-llama/liteparse_samples --skill research_docs
Usage: /research-docs ./my-pdfs What is the total revenue?
Features:
- Parse any document LiteParse supports (PDF, DOCX, PPTX, XLSX, images) plus plaintext
- AI-powered answers with exact-quote source citations
- Bounding box highlights on source page images
- PDF viewer toggle for each citation
- Self-contained HTML report
Regenerating with Your Own Data
- Add your PDFs to the data/ folder
- Edit docs.json in the relevant sample folder to configure your documents and pages
- Install dependencies and run:
pip install -r requirements.txt
# Regenerate comparison
cd comparison && python generate.py
# Regenerate visual citations
cd visual_citations && python generate.py
# Install research_docs skill
cp -r research_docs ~/.claude/skills/research-docs
# Then use: /research-docs ./data "Your question here"
docs.json format
Each sample has a docs.json that controls which documents and pages are processed:
[
{
"name": "My Document Title",
"file": "my_document.pdf",
"pages": [0, 1, 2],
"source": "example.com",
"desc": "Optional description (comparison only)"
}
]
- file: PDF filename (must exist in data/)
- pages: 0-indexed page numbers to parse
- source: Attribution label
- desc: Description shown in comparison cards (comparison sample only)
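Before regenerating, it can help to check a docs.json against the field rules listed above. The sketch below is a hypothetical helper, not part of the samples: `load_docs` is an invented name, and the location of the data folder is passed in explicitly because its path relative to each sample is an assumption.

```python
import json
from pathlib import Path


def load_docs(config_path, data_dir):
    """Load docs.json and report entries that break the documented rules.

    Returns (entries, problems), where problems is a list of messages.
    """
    entries = json.loads(Path(config_path).read_text(encoding="utf-8"))
    problems = []
    for i, entry in enumerate(entries):
        # file: must name a PDF that exists in the data folder
        file_name = entry.get("file")
        if not file_name or not (Path(data_dir) / file_name).is_file():
            problems.append(f"entry {i}: file {file_name!r} not found in {data_dir}")
        # pages: 0-indexed, so every value must be a non-negative integer
        pages = entry.get("pages", [])
        if not all(isinstance(p, int) and not isinstance(p, bool) and p >= 0 for p in pages):
            problems.append(f"entry {i}: pages must be 0-indexed non-negative integers")
    return entries, problems
```

For example, `load_docs("comparison/docs.json", "data")` would return the parsed entries along with any problems found, assuming those paths match your checkout.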
Data
The included PDFs are publicly available government documents:
| File | Source | Description |
|---|---|---|
| cms_pfs.pdf | cms.gov | CMS Medicare Physician Fee Schedule (CY 2026) |
| fdic_qbp.pdf | fdic.gov | FDIC Quarterly Banking Profile |
| fed_h41.pdf | federalreserve.gov | Federal Reserve H.4.1 Statistical Release |
| irs_1040.pdf | irs.gov | IRS Form 1040 — U.S. Individual Income Tax Return |
| who_indicators.pdf | who.int | WHO Core Health Indicators |
Requirements
- Python 3.9+
- Dependencies: liteparse, pypdf, pymupdf (see requirements.txt)
pip install -r requirements.txt
Jerry Liu (@jerryjliu0): We’ve created the world’s fastest PDF parser ⚡️
And it’s more accurate than any other open-source, model-free PDF parser out there (pymupdf, pypdf, markitdown, pdftotext, opendataloader, pymupdf4llm)
Introducing LiteParse v2 - we rewrote the entire library into Rust and