A comparison of four on-prem document processing tools (Docling, LiteParse, MinerU, and Unstructured) for university use, evaluating their suitability for local deployment.
Baidu has released Unlimited-OCR, which processes entire documents in a single pass without chunking, overcoming a major limitation of current OCR technology.
Vik Paruchuri is open-sourcing a 9B model that extracts structured data from documents with near-frontier performance (90.2% on their benchmark, vs Gemini 3.5 Flash at 91.3%).
Hyper-Extract is an open-source framework that converts messy documents into typed knowledge structures, supporting multiple graph architectures like GraphRAG, LightRAG, and KG-Gen, with 10+ extraction engines and 80+ YAML templates for various domains.
Typst 0.15, a major release of the open-source typesetting system, introduces support for variable fonts, MathML export, multi-file output, multiple bibliographies, and multiple PDF standards, along with improved documentation and diagnostics.
PP-OCRv6 is a new open-source OCR model series from Baidu's PaddleOCR, available in Tiny/Small/Medium sizes with excellent accuracy and speed, beating several commercial models.
DeepSeek-OCR is a 3B vision model that uses context optical compression for efficient document processing. Fine-tuning it on Persian text with Unsloth achieved an 88.26% improvement in character error rate; the workflow is open-source and runs on a single GPU.
A developer shares lessons from building a local document-to-JSON extractor using llama3.2 3B on Ollama, highlighting that deterministic post-processing and schema-constrained outputs matter more than model size, while seeking feedback on hallucination and context truncation issues with long documents.
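The point about deterministic post-processing can be illustrated with a minimal sketch. It is not the developer's code: the invoice-style `SCHEMA`, the accepted date formats, and the helper names are all hypothetical, and the model call itself is left out. The sketch only shows how a raw model reply might be cut down to a JSON object, checked against a fixed set of fields, and normalized with plain code rather than a second model pass.

```python
import json
import re
from datetime import datetime

# Hypothetical target schema: field name -> expected Python type.
SCHEMA = {"invoice_number": str, "total": float, "date": str}
# Assumed set of date spellings the model might produce.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")


def extract_json(raw: str) -> dict:
    """Take the outermost {...} span from a reply that may contain extra chatter."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in model output")
    return json.loads(raw[start:end + 1])


def to_iso_date(value: str) -> str:
    """Try each known format in order and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


def normalize(record: dict) -> dict:
    """Keep only schema fields, coerce their types, and reject missing ones."""
    clean = {}
    for field, expected in SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        value = record[field]
        if expected is float:
            if isinstance(value, str):
                # Strip currency symbols and thousands separators.
                value = re.sub(r"[^0-9.\-]", "", value)
            value = float(value)
        else:
            value = str(value).strip()
        clean[field] = value
    clean["date"] = to_iso_date(clean["date"])
    return clean


raw_reply = (
    'Sure, here is the JSON:\n'
    '{"invoice_number": " INV-7 ", "total": "$1,234.50", '
    '"date": "March 5, 2024", "note": "not in schema"}'
)
print(normalize(extract_json(raw_reply)))
```

Fields outside the schema are dropped and a missing field raises an error, so the same reply always yields the same record or the same failure, regardless of model size.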
A user shares six months of experience with NotebookLM and provides 10 prompts that they claim turn 200 pages of documents into clear answers in one hour.
Microsoft has open-sourced MarkItDown, a tool that can convert PDF, Word, Excel, PPT and other files into well-structured Markdown format with a single click, making it easy to feed directly into LLMs. It has garnered over 138k stars on GitHub.
The author attended the Applied AI Conference in Berlin and gave a talk on building document agents, including a detailed walkthrough of LobsterX, a document-processing agent built with LlamaIndex that uses structured outputs and event-driven workflows.
This blog post describes the architecture for a scalable ingestion pipeline using Temporal to handle crawling, extracting, chunking, and embedding customer documentation from various sources, emphasizing durability, statefulness, and concurrency control.
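The four stages and the concurrency limit can be sketched in a few lines. This is not the post's Temporal code: it uses Python's `asyncio` with a semaphore as a stand-in, and the stage functions, `MAX_CONCURRENT`, and the fake embedding are placeholders. It also has none of the durability or state recovery that the post attributes to Temporal; it only shows the shape of a crawl, extract, chunk, embed flow with a cap on simultaneous documents.

```python
import asyncio
import re

MAX_CONCURRENT = 2  # assumed cap on documents processed at the same time


async def crawl(url: str) -> str:
    """Placeholder fetch; a real crawler would do network I/O here."""
    await asyncio.sleep(0)
    return f"<p>docs for {url}</p>"


def extract(html: str) -> str:
    """Strip tags to get plain text (deliberately naive)."""
    return re.sub(r"<[^>]+>", "", html)


def chunk(text: str, size: int = 10) -> list[str]:
    """Split text into fixed-size character windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(piece: str) -> list[float]:
    """Stand-in embedding: a one-dimensional vector holding the chunk length."""
    return [float(len(piece))]


async def ingest(url: str, limiter: asyncio.Semaphore, stats: dict):
    # The semaphore bounds how many documents are in flight at once.
    async with limiter:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        try:
            text = extract(await crawl(url))
            return url, [(piece, embed(piece)) for piece in chunk(text)]
        finally:
            stats["active"] -= 1


async def run(urls: list[str]):
    limiter = asyncio.Semaphore(MAX_CONCURRENT)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(ingest(u, limiter, stats) for u in urls))
    return dict(results), stats["peak"]


if __name__ == "__main__":
    indexed, peak = asyncio.run(run(["a", "b", "c", "d"]))
    print(len(indexed), "documents ingested, peak concurrency", peak)
```

In a durable workflow engine each stage would typically be a separately retried step with persisted state, which is the part this in-memory sketch does not attempt.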
The author shares three years of experience feeding PDFs to AI and argues that Markdown is a better input format than PDF: a PDF is essentially a mix of coordinates and characters, so the AI must first reconstruct the structure, which is error-prone and consumes more tokens. The article provides specific cases and recommended tools (markitdown, pandoc, LlamaParse), and teases a new series called 'The Art of Feeding AI'.
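To make the "coordinates and characters" point concrete, here is a small illustration, not taken from the article. It assumes a PDF page has already been reduced to hypothetical `(x, y, text)` fragments with `y` growing downward, and rebuilds reading order by grouping fragments whose `y` values fall within an assumed tolerance. The tolerance is exactly the kind of guess that makes this step error-prone.

```python
def rebuild_lines(fragments, y_tolerance=2.0):
    """Group positioned text fragments into lines, then order each line by x.

    fragments: iterable of (x, y, text) tuples; y is assumed to grow downward.
    """
    rows = []  # each entry: [anchor_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            # Close enough to the current row's baseline: same line.
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    return [" ".join(t for _, t in sorted(items)) for _, items in rows]


# Fragments arrive in arbitrary order, with slightly different baselines.
page = [
    (120, 10.5, "Report"),
    (50, 10.0, "Quarterly"),
    (200, 30.4, "42"),
    (50, 30.0, "Revenue"),
]
print(rebuild_lines(page))
```

With a tolerance that is too small, "Quarterly" and "Report" split into separate lines; too large, and adjacent rows merge. Markdown input carries the line and heading structure explicitly, so none of this guessing is needed.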
LightRAG v1.5 is released with six major improvements including multimodal document processing, enhanced parsing, and role-specific LLM configuration, making RAG simpler, faster, and more powerful.
LlamaParse now offers latency metrics for Parse, Extract, and Classify jobs, providing queue time, processing time, and total latency breakdowns. This helps users monitor and scale their document processing.
Parsewise is an API for agentic multi-document processing.
pdf-inspector is an open-source Rust library that classifies PDFs as text-based or scanned, extracts text, and converts it to Markdown, skipping unnecessary OCR to improve speed and save costs.
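The idea of classifying before running OCR can be sketched as follows. This is not pdf-inspector's algorithm or API (the library is written in Rust and its heuristics are not described here). It is a Python illustration that assumes the embedded text of each page has already been extracted by some other tool, and the thresholds `min_chars` and `text_ratio`, as well as the extra "mixed" label, are assumptions.

```python
def classify_pdf(page_texts, min_chars=50, text_ratio=0.8):
    """Label a document from the embedded text found on each page.

    page_texts: one string per page, empty when no text layer was found.
    """
    if not page_texts:
        raise ValueError("document has no pages")
    with_text = sum(1 for t in page_texts if len(t.strip()) >= min_chars)
    ratio = with_text / len(page_texts)
    if ratio >= text_ratio:
        return "text"      # text layer is usable, OCR can be skipped
    if ratio == 0:
        return "scanned"   # no usable text layer, OCR every page
    return "mixed"         # OCR only the pages that need it


def pages_needing_ocr(page_texts, min_chars=50):
    """Return zero-based indices of pages with too little embedded text."""
    return [i for i, t in enumerate(page_texts) if len(t.strip()) < min_chars]


document = ["Lorem ipsum dolor sit amet. " * 5, "", "   "]
print(classify_pdf(document), pages_needing_ocr(document))
```

The saving comes from the first branch: when most pages already carry a text layer, the OCR step is not invoked at all.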
MADP is a multi-agent architecture for enterprise document processing that combines deep learning and LLMs with human-in-the-loop validation, achieving 97% automation and significant reductions in resource usage.
This reference implementation demonstrates how to run an LLM agent securely within a local sandbox to process and analyze various document types using Rust, LiteParse, and microsandbox. The open-source CLI leverages OpenAI's GPT models and native bash commands to perform file retrieval and analysis in an isolated environment.
Paper2Any is an open-source AI tool that converts research papers into editable diagrams, technical roadmaps, and slide decks with support for universal file formats and custom styling.