SGOCR: A Spatially-Grounded OCR-focused Pipeline & V1 Dataset [P]

Reddit r/MachineLearning 04/20/26, 03:24 AM Tools

ocr vision-language-model dataset open-source spatial-grounding vqa fine-tuning

Summary

SGOCR is an open-source dataset pipeline for generating spatially-grounded, OCR-focused visual question answering (VQA) tuples with rich metadata to support diverse VLM training. The pipeline uses a multi-stage approach combining models like Nvidia's nemotron-ocr-v2, Gemma4, Qwen3-VL, and Gemini-2.5-Flash, along with an agentic optimization loop.

Hello everyone! I've been independently researching & developing small-but-powerful vision-language models (VLMs) and noticed a gap in visual datasets: none were teaching my model to simply ground text in imagery. Instead, they tried to get it to reason about the text or about the scene itself. This led me down a 2-week side-side-project to create SGOCR, an open-source dataset pipeline for generating spatially-grounded, OCR-focused VQA tuples with tons of rich metadata to support diverse VLM training strategies.

[Code](https://github.com/cothogonal/sgocr-dataset-pipeline) [v1 dataset](https://huggingface.co/datasets/dreeseaw/SGOCR)

My development began with simply prompting Qwen2.5-VL locally and grew into a multi-stage beast. At one point, my OCR stage looked for consensus between 3 text recognition models (Parseq), my anchor stage did the same between GroundingDino, Florence 2, and SAM 3.1, and verification required passes from both Gemini 3.1 Pro & ChatGPT 5.3 Codex. I discovered that less is more in this case, and landed on using Nvidia's nemotron-ocr-v2 for text extraction, a combination of Gemma4 with a Qwen3-VL fallback for anchor discovery & labeling, and then gemini-2.5-flash as a teacher model with simple grounding checks for verification. I got away with using the smaller 2.5 Flash teacher model because the highly grounded annotations provided in context let Flash focus on semantics.

I used an agentic loop for development after first creating a dataset review frontend that stores my personal accept/reject/maybe marks to be referenced as human-grounded context later. I bootstrapped this process into a quality score that reflected the aspects of questions I accepted, and from there the rest was much easier to automate.
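To make the "simple grounding checks" idea concrete, here is a minimal sketch of what such a check could look like. It is not taken from the SGOCR repo: the `OcrSpan` type, the `is_grounded` function, and the 0.5 IoU threshold are all hypothetical. The assumption is that a generated answer only counts as grounded if some OCR span both contains the answer text and overlaps the box the answer claims to come from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrSpan:
    """Hypothetical OCR result: recognized text plus an (x0, y0, x1, y1) pixel box."""
    text: str
    box: tuple


def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def normalize(text):
    """Lowercase and collapse whitespace so minor formatting differences don't matter."""
    return " ".join(text.lower().split())


def is_grounded(answer, answer_box, spans, min_iou=0.5):
    """True if some OCR span contains the answer text AND overlaps the claimed box.

    min_iou is an assumed threshold, not a value from the original pipeline.
    """
    target = normalize(answer)
    if not target:
        return False
    return any(
        target in normalize(span.text) and iou(answer_box, span.box) >= min_iou
        for span in spans
    )


spans = [
    OcrSpan("EXIT 12", (10, 10, 110, 40)),
    OcrSpan("Main St", (200, 50, 300, 80)),
]
print(is_grounded("exit 12", (12, 11, 108, 41), spans))    # text and box agree
print(is_grounded("exit 12", (200, 50, 300, 80), spans))   # right text, wrong box
```

A check like this is cheap enough to run on every candidate tuple, which fits the point above about letting the teacher model focus on semantics rather than localization.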
I run a custom optimization loop agent, based on Karpathy's autoresearch (which I found a bit too hyperparameter-searchy), that uses a sweep-based process. This allows better holistic observation, an opportunity to make code changes, and less risk of good ideas dying early because their evals are slightly lower than another variant's.

I'm looking for general feedback and am interested in whether other people were looking for something like this or are building similar VLMs. Thanks for reading!
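For anyone curious about the "good ideas not dying early" part of the sweep, here is a rough sketch of one way to do the selection step. This is an illustration, not the actual agent: `select_survivors`, the variant names, and the `tolerance` value are all made up. The assumption is that after each sweep, every variant scoring within a small margin of the best one is kept for the next round instead of keeping only the single winner.

```python
def select_survivors(scores, tolerance=0.02, max_keep=None):
    """Keep every variant whose score is within `tolerance` of the best score.

    scores: dict mapping a (hypothetical) variant name to its eval score.
    Returns variant names ordered from best to worst.
    """
    if not scores:
        return []
    best = max(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    survivors = [name for name, score in ranked if best - score <= tolerance]
    return survivors[:max_keep] if max_keep is not None else survivors


sweep_scores = {
    "baseline": 0.71,
    "stricter_prompt": 0.70,      # slightly worse, but close enough to keep exploring
    "new_anchor_filter": 0.64,
}
print(select_survivors(sweep_scores, tolerance=0.02))
```

A wider `tolerance` keeps more variants alive at the cost of more eval runs per round; `max_keep` caps that cost when a sweep produces many near-ties.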

Similar Articles

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Towards One-to-Many Temporal Grounding

HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning

LV-ROVER: Multi-Stream Tesseract Voting for Maltese Paragraph OCR

@techNmak: A lightweight VLM that beats the giants at OCR. (1.7B parameters, SOTA on OmniDocBench) dots.ocr is a new multilingual…
