SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation
Summary
SpaceDG is a large-scale dataset and benchmark that evaluates multimodal language models' spatial reasoning robustness under visual degradations like motion blur and low light, revealing significant performance gaps and showing that fine-tuning on SpaceDG improves robustness without degrading clean image performance.
Cached at: 05/22/26, 06:27 AM
Source: https://huggingface.co/papers/2605.22536
Abstract
SpaceDG dataset and benchmark evaluate multimodal language models’ spatial reasoning robustness under visual degradations, revealing significant performance gaps and demonstrating improved robustness through targeted training.
Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that fine-tuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.
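As a rough illustration of what a "robustness gap" could mean in practice, the sketch below computes per-condition accuracy and the accuracy drop of each degraded condition relative to clean inputs. This is only one plausible reading: the abstract does not specify the paper's actual metric or evaluation protocol, and the function names, the condition labels, and the toy records are all hypothetical and invented for illustration.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Return {condition: accuracy} from (condition, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for condition, is_correct in records:
        totals[condition] += 1
        hits[condition] += int(is_correct)
    return {c: hits[c] / totals[c] for c in totals}


def robustness_gaps(acc, clean_key="clean"):
    """Accuracy drop of each degraded condition relative to the clean one.

    Assumption: the gap is defined as clean accuracy minus degraded accuracy.
    """
    clean = acc[clean_key]
    return {c: clean - a for c, a in acc.items() if c != clean_key}


# Toy records, invented for illustration only (not results from the paper).
toy_records = [
    ("clean", True), ("clean", True), ("clean", True), ("clean", False),
    ("motion_blur", True), ("motion_blur", False),
    ("motion_blur", False), ("motion_blur", True),
    ("low_light", True), ("low_light", False),
    ("low_light", False), ("low_light", False),
]

acc = accuracy_by_condition(toy_records)
gaps = robustness_gaps(acc)
for condition, gap in gaps.items():
    print(f"{condition}: accuracy {acc[condition]:.2f}, gap {gap:.2f}")
```

A larger gap indicates that the same questions become harder for the model once the images are degraded; a gap near zero would indicate a robust model under this definition.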
Similar Articles
Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge
This paper introduces Pre-Flight, an open-source benchmark of 300 multiple choice questions designed to evaluate large language models on aviation operational knowledge, covering international regulations and ground operations. Results show even the best models in 2026 score 82.7%, significantly below the expert reference of ~95%, highlighting a persistent reliability gap.
Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification
This paper presents Vera, an end-to-end automated safety testing framework for LLM agents that combines literature-driven risk discovery, combinatorial composition of safety cases, and evidence-grounded verification. Evaluations on four agent frameworks reveal substantial safety weaknesses, with average attack success rates reaching 93.9% under multi-channel attacks, and the release of Vera-Bench with 1600 executable safety cases.
Distributionally Robust Listwise Preference Optimization
This paper proposes a distributionally robust listwise preference optimization method for LLM alignment that handles ranking-label uncertainty, with a tractable objective and strong convergence guarantees.
SPLIT: Cross-Lingual Empathy and Cultural Grounding in English and Ukrainian LLM Responses
Introduces SPLIT, a 500-prompt benchmark evaluating LLM cross-lingual empathy and cultural grounding in English and Ukrainian. Findings show Gemini-2.5-Flash and LLaMA-3.3-70B-Instruct degrade in Ukrainian while DeepSeek-V3 remains stable, with weak agreement between human and AI evaluators on cultural dimensions.
OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets
OpenSafeIntent introduces a benchmark of controlled prompt sets that vary intent while holding tasks fixed, enabling evaluation of whether models calibrate assistance across benign, dual-use, and malicious variants rather than appearing safe on average.