Baidu has open-sourced Unlimited-OCR, a vision-language model built as an upgrade of DeepSeek-OCR that can parse extremely long documents in a single pass. It offers two inference modes: gundam, for dense text in a single image, and base, for multi-page documents and PDFs.
A new work introduces MMIOC-1M, a large-scale multi-modal benchmark for industrial defect detection, and proposes RTVPNet, a refined text-visual prompt network that the authors report achieves state-of-the-art performance.
A tweet claims that a small vision-language model fine-tuned on custom data can match GPT-5's accuracy at 50× lower cost, citing Liquid AI's 1.6B-parameter model running locally with llama.cpp.