SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

Hugging Face Daily Papers

Summary

SpaceDG is a large-scale dataset and benchmark that evaluates multimodal language models' spatial reasoning robustness under visual degradations like motion blur and low light, revealing significant performance gaps and showing that fine-tuning on SpaceDG improves robustness without degrading clean image performance.

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that fine-tuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.
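The robustness gap described above can be illustrated as the accuracy drop between clean and degraded inputs. The following is a minimal sketch, not the paper's evaluation code: the function names, the `clean` condition label, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Return {condition: accuracy} from (condition, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for condition, is_correct in records:
        totals[condition] += 1
        correct[condition] += int(is_correct)
    return {c: correct[c] / totals[c] for c in totals}


def robustness_gaps(acc, clean_key="clean"):
    """Accuracy drop relative to clean images for each degraded condition."""
    clean = acc[clean_key]
    return {c: clean - a for c, a in acc.items() if c != clean_key}


# Made-up toy records (hypothetical, not results from the paper):
# each entry is (visual condition, whether the model answered correctly).
records = [
    ("clean", True), ("clean", True), ("clean", True), ("clean", False),
    ("motion_blur", True), ("motion_blur", False),
    ("motion_blur", False), ("motion_blur", True),
    ("low_light", True), ("low_light", False),
    ("low_light", False), ("low_light", False),
]

acc = accuracy_by_condition(records)
gaps = robustness_gaps(acc)
for condition, gap in sorted(gaps.items()):
    print(f"{condition}: accuracy drop of {gap:.2f} versus clean")
```

A positive gap means the model answers fewer questions correctly under that degradation than on clean images. The same per-condition breakdown could be used to check that fine-tuning leaves clean accuracy unchanged.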

Cached at: 05/22/26, 06:27 AM


Source: https://huggingface.co/papers/2605.22536



Similar Articles

Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge

arXiv cs.AI

This paper introduces Pre-Flight, an open-source benchmark of 300 multiple-choice questions designed to evaluate large language models on aviation operational knowledge, covering international regulations and ground operations. Results show that even the best models in 2026 score 82.7%, significantly below the expert reference of ~95%, highlighting a persistent reliability gap.

Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification

arXiv cs.AI

This paper presents Vera, an end-to-end automated safety testing framework for LLM agents that combines literature-driven risk discovery, combinatorial composition of safety cases, and evidence-grounded verification. Evaluations on four agent frameworks reveal substantial safety weaknesses, with average attack success rates reaching 93.9% under multi-channel attacks. The paper also releases Vera-Bench with 1,600 executable safety cases.