ABACUS is a unified vision-language model that handles multiple counting tasks and count-faithful image generation without benchmark-specific training, achieving state-of-the-art results across seven benchmarks.
AnchorWorld is a framework for egocentric world simulation that enhances interaction integrity and enables flexible world customization through 3D human motion and anchor view definitions, outperforming state-of-the-art baselines.
RIS is a framework for spatial-semantic grounded latent visual reasoning in Multimodal Large Language Models (MLLMs) that targets information bottlenecks. It anchors latent tokens to spatial and semantic evidence and shows improvements on benchmarks such as V* and HRBench.
CityRAG is a video generative model that produces long, physically grounded, 3D-consistent videos of real-world cities using geo-registered data, enabling realistic navigation and simulation for robotics and autonomous driving.
GIST is a multimodal knowledge extraction pipeline that transforms mobile point cloud data into semantically annotated navigation topologies for dense environments, enabling semantic search, localization, and natural language routing with an 80% navigation success rate in real-world evaluation.
SGOCR is an open-source dataset pipeline for generating spatially grounded, OCR-focused visual question answering (VQA) tuples with rich metadata to support diverse VLM training. The pipeline uses a multi-stage approach that combines models such as NVIDIA's nemotron-ocr-v2, Gemma4, Qwen3-VL, and Gemini-2.5-Flash with an agentic optimization loop.