spatial-grounding

#spatial-grounding

ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation

Hugging Face Daily Papers · 2026-06-22

ABACUS is a unified vision-language model that handles multiple counting tasks and count-faithful image generation without benchmark-specific training, achieving state-of-the-art results across seven benchmarks.


#spatial-grounding

AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization

Hugging Face Daily Papers · 2026-06-05

AnchorWorld is a framework for egocentric world simulation that enhances interaction integrity and enables flexible world customization through 3D human motion and anchor view definitions, outperforming state-of-the-art baselines.


#spatial-grounding

Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

arXiv cs.CL · 2026-05-11

RIS is a framework for spatial-semantic grounded latent visual reasoning in Multimodal Large Language Models (MLLMs). To overcome information bottlenecks, it anchors latent tokens to spatial and semantic evidence, and it reports improvements on benchmarks such as V* and HRBench.


#spatial-grounding

CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

Hugging Face Daily Papers · 2026-04-21

CityRAG introduces a video generative model that produces long, physically grounded, 3D-consistent videos of real-world cities using geo-registered data, enabling realistic navigation and simulation for robotics and autonomous driving.


#spatial-grounding

GIST: Multimodal Knowledge Extraction and Spatial Grounding via Intelligent Semantic Topology

arXiv cs.AI · 2026-04-20

GIST is a multimodal knowledge extraction pipeline that transforms mobile point cloud data into semantically annotated navigation topologies for dense environments. It enables semantic search, localization, and natural language routing, with an 80% navigation success rate in real-world evaluation.


#spatial-grounding

SGOCR: A Spatially-Grounded OCR-focused Pipeline & V1 Dataset [P]

Reddit r/MachineLearning · 2026-04-20

SGOCR is an open-source dataset pipeline that generates spatially-grounded, OCR-focused visual question answering (VQA) tuples with rich metadata to support diverse VLM training. The multi-stage pipeline combines models such as Nvidia's nemotron-ocr-v2, Gemma4, Qwen3-VL, and Gemini-2.5-Flash with an agentic optimization loop.

