SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation
Summary
SpaceDG is a large-scale dataset and benchmark that evaluates multimodal language models' spatial reasoning robustness under visual degradations like motion blur and low light, revealing significant performance gaps and showing that fine-tuning on SpaceDG improves robustness without degrading clean image performance.
Cached at: 05/22/26, 06:27 AM
Source: https://huggingface.co/papers/2605.22536
Abstract
SpaceDG dataset and benchmark evaluate multimodal language models’ spatial reasoning robustness under visual degradations, revealing significant performance gaps and demonstrating improved robustness through targeted training.
Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that finetuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.
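The "robustness gap" mentioned in the abstract can be read as the drop in accuracy from clean images to a given degradation. The following is a minimal sketch of that bookkeeping, not the paper's evaluation code: the function names, the condition labels ("clean", "motion_blur", "low_light"), and the toy records are all assumptions made for illustration, and the numbers are invented rather than taken from SpaceDG-Bench.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Return accuracy per condition from (condition, is_correct) pairs.

    `records` is a hypothetical flat log of graded answers; the real
    benchmark format may differ.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for condition, is_correct in records:
        totals[condition] += 1
        hits[condition] += int(is_correct)
    return {condition: hits[condition] / totals[condition] for condition in totals}


def robustness_gaps(accuracy, clean_key="clean"):
    """Return clean accuracy minus degraded accuracy for each degradation."""
    clean = accuracy[clean_key]
    return {
        condition: clean - value
        for condition, value in accuracy.items()
        if condition != clean_key
    }


# Toy, made-up results: four graded answers per condition.
records = [
    ("clean", True), ("clean", True), ("clean", True), ("clean", False),
    ("motion_blur", True), ("motion_blur", False),
    ("motion_blur", False), ("motion_blur", True),
    ("low_light", True), ("low_light", False),
    ("low_light", False), ("low_light", False),
]

acc = accuracy_by_condition(records)
gaps = robustness_gaps(acc)
print(acc)
print(gaps)
```

A positive gap means the model answers less accurately under that degradation than on clean images. Under this reading, the finetuning claim in the abstract would correspond to smaller gaps with clean accuracy unchanged.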