@jerryjliu0: Last week we revamped Liteparse to be the fastest PDF parser out there An underrated part of liteparse is it doesn't ju…

Summary

Jerry Liu announces a revamped LiteParse, a fast PDF parser that provides bounding boxes for audit trails, with sample demos available.

Last week we revamped Liteparse to be the fastest PDF parser out there. An underrated part of liteparse is that it doesn't just give you text. It gives you bounding boxes that a coding agent can use to paint exact audit trails back to the source document. For instance, check out the deep research skill we compiled in liteparse_samples: https://github.com/jerryjliu/liteparse_samples… Come check out liteparse: https://github.com/run-llama/liteparse… We are hard at work making liteparse even better (e.g. Markdown support). Please feel free to open up issues, PRs, and let us know your feature requests.

Cached at: 06/01/26, 09:36 PM


jerryjliu/liteparse_samples

Source: https://github.com/jerryjliu/liteparse_samples

LiteParse Samples

Interactive demos showcasing LiteParse — a fast, local, model-free document parser by LlamaIndex.

Samples

Parser Comparison

Side-by-side comparison of LiteParse vs PyPDF vs PyMuPDF on real government and financial documents. See the original PDF on the left, then tab through each parser’s extracted text on the right.

Quick start: Open comparison/output/comparison.html in your browser.

Features:

  • 8 document sections from 5 real-world PDFs (FDIC, Federal Reserve, CMS, IRS, WHO)
  • Embedded PDF viewer alongside parsed text
  • Per-document timing for each parser
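The per-document timing can be pictured as a small harness like the one below. This is only a sketch: `time_parsers` and the stand-in parser functions are hypothetical names, not code taken from the sample's generate.py. A real run would pass callables that wrap LiteParse, PyPDF, and PyMuPDF instead.

```python
import time
from typing import Callable, Dict


def time_parsers(path: str, parsers: Dict[str, Callable[[str], str]]) -> Dict[str, dict]:
    """Run each parser on one document; record elapsed seconds and output length."""
    results = {}
    for name, parse in parsers.items():
        start = time.perf_counter()
        text = parse(path)
        elapsed = time.perf_counter() - start
        results[name] = {"seconds": elapsed, "chars": len(text)}
    return results


# Stand-in parsers (assumptions for illustration only, they read no file).
def fake_fast(path: str) -> str:
    return f"text from {path}"


def fake_slow(path: str) -> str:
    time.sleep(0.01)  # simulate a slower parser
    return f"text from {path}"


timings = time_parsers("data/fdic_qbp.pdf", {"fast": fake_fast, "slow": fake_slow})
for name, row in timings.items():
    print(f"{name}: {row['seconds'] * 1000:.2f} ms, {row['chars']} chars")
```

Using `time.perf_counter` rather than `time.time` keeps the measurement monotonic and high-resolution, which matters when a parser finishes in a few milliseconds.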

Visual Citations

Exact keyword search over parsed documents — see precisely where each match appears on the source PDF page, with bounding boxes highlighted directly on the page image. This is a simple substring match demo (not fuzzy or RAG-based search). Learn more in the Visual Citations guide.

Quick start: Open visual_citations/output/visual-citations.html in your browser.

Features:

  • Interactive keyword search across all documents
  • Bounding box overlays on rendered page images
  • Side-by-side view of source page and parsed text with highlighted matches
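A substring search that returns bounding boxes might look roughly like the sketch below. The `TextItem` shape and the `find_matches` helper are assumptions made for illustration, not LiteParse's actual output schema or the demo's code. Case-insensitive matching is likewise a choice made here, not something the sample documents.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextItem:
    # Hypothetical shape for one parsed text span; not LiteParse's real schema.
    page: int
    text: str
    x: float
    y: float
    width: float
    height: float


def find_matches(items: List[TextItem], keyword: str) -> List[TextItem]:
    """Plain substring match (case-insensitive here); no fuzzy or semantic search."""
    needle = keyword.lower()
    if not needle:
        return []
    return [item for item in items if needle in item.text.lower()]


# Made-up items standing in for parser output.
items = [
    TextItem(0, "Total revenue increased", 72.0, 100.0, 180.0, 12.0),
    TextItem(0, "Operating expenses", 72.0, 120.0, 140.0, 12.0),
    TextItem(1, "Revenue by segment", 72.0, 90.0, 150.0, 12.0),
]

for hit in find_matches(items, "revenue"):
    # Each hit carries the rectangle to draw over the rendered page image.
    print(hit.page, (hit.x, hit.y, hit.width, hit.height), hit.text)
```

Because every match keeps its page index and rectangle, highlighting becomes a drawing step over the page image. One limitation of this sketch: a keyword that straddles two text items would not be found.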

Research Docs (Claude Code Skill)

Ask questions about your documents — get answers with visual source citations. Install as a Claude Code skill and invoke with /research-docs. The skill parses your documents, has Claude answer your question, and generates an HTML report with the answer and cited source pages highlighted with bounding boxes.

Install:

npx skills add run-llama/liteparse_samples --skill research_docs

Usage: /research-docs ./my-pdfs What is the total revenue?

Features:

  • Parse any document LiteParse supports (PDF, DOCX, PPTX, XLSX, images) plus plaintext
  • AI-powered answers with exact-quote source citations
  • Bounding box highlights on source page images
  • PDF viewer toggle for each citation
  • Self-contained HTML report
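One way to keep exact-quote citations honest is to check that each quote appears verbatim on the page it cites. The sketch below assumes a hypothetical citation shape (`page`, `quote`) and made-up page text. It is not the skill's own implementation, which this README does not describe.

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line breaks in parsed text don't block a match."""
    return re.sub(r"\s+", " ", text).strip()


def verify_citations(page_texts: Dict[int, str], citations: List[dict]) -> List[dict]:
    """Mark a citation verified only if its quote appears verbatim on the cited page."""
    checked = []
    for cite in citations:
        page_text = normalize(page_texts.get(cite["page"], ""))
        quote = normalize(cite["quote"])
        checked.append({**cite, "verified": bool(quote) and quote in page_text})
    return checked


# Illustrative data only; not taken from any of the sample PDFs.
page_texts = {0: "Total revenue was\n$4.2 million in 2024.", 1: "Expenses fell."}
citations = [
    {"page": 0, "quote": "Total revenue was $4.2 million"},
    {"page": 1, "quote": "Expenses rose"},
]

checked = verify_citations(page_texts, citations)
for row in checked:
    print(row["page"], row["verified"], row["quote"])
```

A citation that fails this check can be dropped or flagged before the report is generated, so every highlighted box corresponds to text that is really on the page.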

Regenerating with Your Own Data

  1. Add your PDFs to the data/ folder
  2. Edit docs.json in the relevant sample folder to configure your documents and pages
  3. Install dependencies and run:
pip install -r requirements.txt

# Regenerate comparison (subshell keeps you in the repo root)
(cd comparison && python generate.py)

# Regenerate visual citations
(cd visual_citations && python generate.py)

# Install research_docs skill
cp -r research_docs ~/.claude/skills/research-docs
# Then use: /research-docs ./data "Your question here"

docs.json format

Each sample has a docs.json that controls which documents and pages are processed:

[
  {
    "name": "My Document Title",
    "file": "my_document.pdf",
    "pages": [0, 1, 2],
    "source": "example.com",
    "desc": "Optional description (comparison only)"
  }
]
  • name: Document title
  • file: PDF filename (must exist in data/)
  • pages: 0-indexed page numbers to parse
  • source: Attribution label
  • desc: Description shown in comparison cards (comparison sample only)
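A loader that enforces the rules above could be sketched as follows. `load_docs` is a hypothetical helper, not part of the samples. Treating `name`, `file`, and `pages` as required keys is an assumption, since the README does not say which fields are optional apart from `desc`.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def load_docs(config_path: Path, data_dir: Path) -> List[dict]:
    """Read docs.json and check the fields described above."""
    entries = json.loads(config_path.read_text(encoding="utf-8"))
    for entry in entries:
        for key in ("name", "file", "pages"):  # assumed to be required
            if key not in entry:
                raise ValueError(f"missing '{key}' in entry: {entry}")
        if not (data_dir / entry["file"]).is_file():
            raise FileNotFoundError(f"{entry['file']} not found in {data_dir}")
        if any(not isinstance(p, int) or p < 0 for p in entry["pages"]):
            raise ValueError(f"pages must be 0-indexed integers: {entry['pages']}")
    return entries


# Demo against a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data_dir = root / "data"
    data_dir.mkdir()
    (data_dir / "my_document.pdf").write_bytes(b"%PDF-1.4\n")  # placeholder file
    config = root / "docs.json"
    config.write_text(
        json.dumps([{"name": "My Document Title", "file": "my_document.pdf", "pages": [0, 1, 2]}]),
        encoding="utf-8",
    )
    docs = load_docs(config, data_dir)
    print(docs[0]["name"], docs[0]["pages"])
```

Because pages are 0-indexed, `[0, 1, 2]` selects the first three pages of the PDF.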

Data

The included PDFs are publicly available documents from government and intergovernmental agencies:

| File | Source | Description |
| --- | --- | --- |
| cms_pfs.pdf | cms.gov | CMS Medicare Physician Fee Schedule (CY 2026) |
| fdic_qbp.pdf | fdic.gov | FDIC Quarterly Banking Profile |
| fed_h41.pdf | federalreserve.gov | Federal Reserve H.4.1 Statistical Release |
| irs_1040.pdf | irs.gov | IRS Form 1040: U.S. Individual Income Tax Return |
| who_indicators.pdf | who.int | WHO Core Health Indicators |

Requirements

pip install -r requirements.txt

Links

Jerry Liu (@jerryjliu0): We’ve created the world’s fastest PDF parser ⚡️

And it’s more accurate than any other open-source, model-free PDF parser out there (pymupdf, pypdf, markitdown, pdftotext, opendataloader, pymupdf4llm)

Introducing LiteParse v2 - we rewrote the entire library into Rust and
