SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

Hugging Face Daily Papers 05/21/26, 12:00 AM Papers

Summary

SpaceDG is a large-scale dataset and benchmark that evaluates multimodal language models' spatial reasoning robustness under visual degradations like motion blur and low light, revealing significant performance gaps and showing that fine-tuning on SpaceDG improves robustness without degrading clean image performance.

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that fine-tuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.

Original Article

Cached at: 05/22/26, 06:27 AM

Source: https://huggingface.co/papers/2605.22536

Abstract

SpaceDG dataset and benchmark evaluate multimodal language models’ spatial reasoning robustness under visual degradations, revealing significant performance gaps and demonstrating improved robustness through targeted training.

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that fine-tuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.
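The abstract speaks of a "robustness gap" between clean and degraded inputs without defining how it is measured. One plausible reading, sketched below, is the difference between accuracy on clean images and accuracy under each degradation for the same questions. This is an illustrative assumption, not the paper's stated metric: the record layout, the condition names, the helper functions `accuracy_by_condition` and `robustness_gaps`, and the toy data are all hypothetical and are not results from SpaceDG-Bench.

```python
from collections import defaultdict

# Hypothetical toy records, NOT results from the paper.
# Each record: (question_id, condition, model_answered_correctly)
records = [
    ("q1", "clean", True),
    ("q1", "motion_blur", False),
    ("q1", "low_light", True),
    ("q2", "clean", True),
    ("q2", "motion_blur", True),
    ("q2", "low_light", False),
    ("q3", "clean", False),
    ("q3", "motion_blur", False),
    ("q3", "low_light", False),
    ("q4", "clean", True),
    ("q4", "motion_blur", False),
    ("q4", "low_light", True),
]


def accuracy_by_condition(records):
    """Return the fraction of correct answers for each visual condition."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for _question_id, condition, is_correct in records:
        totals[condition] += 1
        correct[condition] += int(is_correct)
    return {condition: correct[condition] / totals[condition] for condition in totals}


def robustness_gaps(accuracy, clean_key="clean"):
    """Return clean accuracy minus degraded accuracy for each degradation.

    A positive gap means the model does worse under that degradation.
    """
    clean_accuracy = accuracy[clean_key]
    return {
        condition: clean_accuracy - value
        for condition, value in accuracy.items()
        if condition != clean_key
    }


acc = accuracy_by_condition(records)
gaps = robustness_gaps(acc)
for condition in sorted(gaps):
    print(f"{condition}: accuracy={acc[condition]:.2f}, gap vs clean={gaps[condition]:+.2f}")
```

Pairing every degraded image with its clean counterpart, as assumed here, keeps the question set fixed, so a gap reflects the degradation rather than a change in question difficulty.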

Get this paper in your agent:

hf papers read 2605.22536

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
