RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes
Summary
RoboStressBench proposes a benchmark for evaluating vision-language model robustness to physical visual stresses (material, viewpoint, lighting, geometry) in embodied scenes, identifying stress-specific failure modes.
Source: https://huggingface.co/papers/2606.00828
Published on May 30
Submitted by LeyiWu (https://huggingface.co/YUEVII) on Jun 2
Abstract
RoboStressBench presents a principled benchmark for evaluating vision-language model robustness to physical visual stress in embodied AI, decomposing visual stress into material, viewpoint, lighting, and geometry dimensions.
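For orientation, the four dimensions can be read against the standard rendering equation from computer graphics. The mapping of terms to dimensions given after the equation is an illustrative assumption, not the paper's stated formulation.

```latex
L_o(\mathbf{x}, \omega_o) = L_e(\mathbf{x}, \omega_o)
  + \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o)\,
    L_i(\mathbf{x}, \omega_i)\,
    (\omega_i \cdot \mathbf{n})\, \mathrm{d}\omega_i
```

Under this reading, the BRDF $f_r$ plausibly corresponds to Material (M), the outgoing direction $\omega_o$ toward the camera to Viewpoint (V), the incoming radiance $L_i$ to Lighting (L), and the surface point $\mathbf{x}$ with its normal $\mathbf{n}$ to Geometry (G).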
Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real conditions is essential. However, existing benchmarks assess VLMs using clean images or isolated perturbations rather than stresses caused by physical scene formation. This design has two limitations: it covers only a narrow subset of everyday visual stresses, and some perturbations rarely appear in realistic embodied scenes. This gap raises a fundamental question: how can we define visual stress in a principled way that captures the diverse factors encountered in physical environments? To address this question, we formulate visual perception from an inverse graphics perspective and introduce RoboStressBench, a benchmark for evaluating VLM robustness to physical visual stress in embodied scenes. Inspired by the physical rendering equation, RoboStressBench decomposes visual stress into four physically grounded dimensions: Material (M), Viewpoint (V), Lighting (L), and Geometry (G). This design enables RoboStressBench to cover a broad range of visual stresses in real-world environments, while allowing controlled analysis of their effects on VLM capabilities such as visual recognition, reasoning, and planning. Through comprehensive evaluations of state-of-the-art VLMs, we identify stress-specific failure modes and reveal that different physical factors degrade different embodied capabilities, which are often obscured by aggregate accuracy. We further introduce a stress-aware agentic solver that detects visual stressors and invokes visual-editing skills before reasoning, improving robustness in high-stress scenarios. Overall, RoboStressBench provides a principled evaluation framework for diagnosing and improving VLM perception under real-world physical stress, supporting the development of more reliable embodied AI systems.
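The abstract's point that aggregate accuracy can hide stress-specific failures can be illustrated with a small sketch. Everything below is hypothetical: the record layout, the function names, and the toy data are assumptions made for illustration, and are neither the benchmark's actual format nor results from the paper.

```python
from collections import defaultdict

# Hypothetical toy records: (stress_dimension, capability, is_correct).
# Dimensions follow the abstract's labels: M, V, L, G. The values are made up.
records = [
    ("M", "recognition", True),
    ("M", "recognition", True),
    ("M", "planning", True),
    ("M", "planning", True),
    ("L", "recognition", False),
    ("L", "recognition", False),
    ("L", "planning", True),
    ("L", "planning", True),
]


def aggregate_accuracy(rows):
    """Fraction of correct answers over all records, ignoring stress type."""
    return sum(1 for _, _, ok in rows if ok) / len(rows)


def accuracy_by_cell(rows):
    """Accuracy for each (stress_dimension, capability) pair."""
    counts = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for dim, cap, ok in rows:
        counts[(dim, cap)][0] += int(ok)
        counts[(dim, cap)][1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


if __name__ == "__main__":
    print("aggregate:", aggregate_accuracy(records))
    for (dim, cap), acc in sorted(accuracy_by_cell(records).items()):
        print(f"{dim:>2} / {cap:<12} {acc:.2f}")
```

In this toy data the aggregate accuracy is 0.75, yet recognition under lighting stress is 0.0, which is the kind of failure a single aggregate number would conceal.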
Datasets citing this paper: 1
RoboStressBench/RoboStressBench-Dataset
Similar Articles
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
This paper introduces WorldReasonBench and WorldRewardBench, new benchmarks designed to evaluate video generation models' ability to reason about world-state evolution and physical consistency. The research highlights a gap between visual plausibility and true logical reasoning in current commercial video generators.
SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation
SleepWalk is a three-tier benchmark for evaluating vision-language models' ability to predict spatially coherent trajectories in 3D environments from textual instructions and visual observations, revealing systematic failures in grounded spatial reasoning under occlusions and multi-step instructions.
MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models
MiraBench is a hierarchical benchmark for evaluating action-conditioned reliability in robotic world models, assessing physics adherence, action-following fidelity, and optimism bias across 12 model configurations.
EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation
This paper introduces EnvSimBench, a benchmark for evaluating Large Language Models' ability to simulate environments for agent training. It identifies a 'state change cliff' in current LLMs and proposes a constraint-driven pipeline to reduce hallucinations and costs.
RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models
RoboSemanticBench is a benchmark that diagnoses semantic grounding in action prediction for vision-language-action models, revealing that while robots can grasp objects, they fail to select semantically correct targets based on instruction semantics.