Interfaze introduces a new hybrid AI model architecture that combines DNN/CNN encoders with transformers, achieving accuracy and cost-efficiency superior to generalist models on deterministic tasks such as OCR, vision, and STT.
Wink Engineering evaluates the efficacy of neural super-resolution as a pre-filter for license plate OCR, concluding that it fails to improve accuracy and often leads to hallucinated characters compared to training directly on low-resolution data.
LlamaIndex releases liteparse-server, an open-source, self-hosted, model-free HTTP API for parsing PDFs, images, and Office documents with high spatial fidelity, OCR, and screenshot generation; running entirely locally, it preserves privacy for AI and data workflows.
The article introduces dots.ocr, a 1.7B-parameter model that parses text, tables, formulas, and images from documents in over 100 languages without needing a separate OCR pipeline.
A new AI model from interfaze_ai claims to outperform leading models (sonnet 4.6, gemini 3 flash, gpt 5.4 mini) on OCR, vision, and speech-to-text tasks.
A comprehensive benchmark of 18 LLMs on OCR tasks (7k+ calls) reveals that cheaper and older models often match premium accuracy at a fraction of the cost, with full dataset and framework open-sourced.
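Accuracy comparisons like this typically reduce to a string-distance metric between model output and ground truth. The benchmark's actual scoring code is not shown here; below is a generic character-error-rate (CER) sketch of the kind of metric such a framework would use:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    # Character error rate: edit operations divided by reference length.
    return levenshtein(reference, hypothesis) / max(len(reference), 1)
```

A model's OCR score over a dataset would then be an average of per-page CER values, which makes cheap and premium models directly comparable on the same scale.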
Koharu is an open-source Rust-based manga/image translator that combines object detection, visual LLM OCR, layout analysis, and inpainting, with llama.cpp integration supporting Gemma 4 and Qwen3.5 models.
Gemma 4’s vision performance is bottlenecked by low default token budgets; raising --image-max-tokens to 2240 in llama.cpp unlocks state-of-the-art OCR and detail recognition at the cost of ~14 GB extra VRAM.
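A minimal sketch of applying that setting, assuming a llama.cpp server build with multimodal support; the model and projector paths are placeholders, and the binary name may differ by build:

```shell
# Raise the per-image token budget so the vision encoder output is not
# truncated; ~14 GB of extra VRAM is reported at this setting.
llama-server \
  --model gemma-4-it.gguf \
  --mmproj mmproj-gemma-4.gguf \
  --image-max-tokens 2240
```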
Interfaze AI introduces a specialized model that surpasses general LLMs on deterministic developer tasks including OCR, object detection, web scraping, speech-to-text, and classification.
dots.ocr is a new lightweight 1.7B parameter multilingual vision-language model that achieves state-of-the-art performance on OmniDocBench, outperforming much larger models (72B+) at document parsing and OCR tasks.
SGOCR is an open-source dataset pipeline for generating spatially-grounded, OCR-focused visual question answering (VQA) tuples with rich metadata to support diverse VLM training. The pipeline uses a multi-stage approach combining models like Nvidia's nemotron-ocr-v2, Gemma 4, Qwen3-VL, and Gemini-2.5-Flash, along with an agentic optimization loop.
NVIDIA introduces Nemotron OCR v2, a fast multilingual OCR model built using synthetic data generation. The model achieves 34.7 pages/second on a single A100 GPU by using a unified FOTS-based architecture with feature reuse across detection, recognition, and relational components.
MinerU2.5 is a 1.2B-parameter vision-language model that achieves state-of-the-art document parsing accuracy with high computational efficiency using a coarse-to-fine parsing strategy.
SmolDocling is a compact 256M parameter vision-language model designed for end-to-end multi-modal document conversion. It introduces a new universal markup format called DocTags to capture page elements with location, competing with models 27 times larger.
Paperless-ngx is an open-source document management system that digitizes and archives physical documents with full-text search capabilities. It is the official successor to the original Paperless and Paperless-ng projects, designed as a community-driven initiative.