SGOCR: A Spatially-Grounded OCR-focused Pipeline & V1 Dataset [P]

Reddit r/MachineLearning Tools

Summary

SGOCR is an open-source dataset pipeline for generating spatially-grounded, OCR-focused visual question answering (VQA) tuples with rich metadata to support diverse VLM training. The pipeline uses a multi-stage approach combining models like Nvidia's nemotron-ocr-v2, Gemma4, Qwen3-VL, and Gemini-2.5-Flash, along with an agentic optimization loop.

Hello everyone! I've been independently researching & developing small-but-powerful vision-language models (VLMs) and noticed a gap in visual datasets - none were teaching my model to simply ground text in imagery, but trying to get it to reason about the text or about the scene itself. This lead me down a 2 week side-side-project to create SGOCR, an open source dataset pipeline for generating spatially-grounded, OCR-focused VQA tuples with tons of rich metadata to support diverse VLM training strategies. [Code](https://github.com/cothogonal/sgocr-dataset-pipeline) [v1 dataset](https://huggingface.co/datasets/dreeseaw/SGOCR) My development began with simply prompting Qwen2.5-VL locally and grew into a multi-stage beast. At one point, my OCR-stage looked for concensus between 3 text recognition models (Parseq), my anchor stage did the same between GroundingDino, Florence 2, and SAM 3.1, and verification required passes from both Gemini 3.1 Pro & ChatGPT 5.3 Codex to pass. I discovered that less is more in this case, and landed on using Nvidia's nemotron-ocr-v2 for text extraction, a combination of Gemma4 with a Qwen3-VL fallback for anchor discovery & labeling, and then gemini-2.5-flash as a teacher model with simple grounding checks for verification. I got away with using the smaller 2.5 Flash teacher model due to the highly grounded annotations provided in context allowing flash to focus on semantics. I utilized an agentic loop for development after first creating a dataset review frontend that would store my personal accept/reject/maybe marks to be referenced as human-grounded context later. I bootstrapped this process into a quality score that reflected the aspects of questions I accepted, and from there the rest was much easier to automate. I run a custom optimization loop agent, based on Karpathy's autoresearch (which I found a bit too hyperparameter-searchy), that uses a sweep-based process that allows better holisitc observation, an oppurtunity to make code changes, and less risks of good ideas dying earlier due to their evals being slightly less than another variant's. I'm looking for general feedback and interested if other people were looking for something like this, or building similar VLMs. Thanks for reading!
Original Article

Similar Articles

Towards One-to-Many Temporal Grounding

Hugging Face Daily Papers

This paper introduces One-to-Many Temporal Grounding (OMTG), a new task for localizing multiple disjoint video segments from a single text query, along with a benchmark, evaluation metrics, a 56k-sample dataset, and novel reward functions that achieve state-of-the-art results, outperforming Gemini 2.5 Pro and Seed-1.8.