RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes

Hugging Face Daily Papers 05/30/26, 12:00 AM Papers

vision-language-models robustness benchmark embodied-ai visual-stress evaluation

Summary

RoboStressBench proposes a benchmark for evaluating vision-language model robustness to physical visual stresses (material, viewpoint, lighting, geometry) in embodied scenes, identifying stress-specific failure modes.

Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real conditions is essential. However, existing benchmarks assess VLMs using clean images or isolated perturbations rather than stresses caused by physical scene formation. This design has two limitations: it covers only a narrow subset of everyday visual stresses, and some perturbations rarely appear in realistic embodied scenes. This gap raises a fundamental question: how can we define visual stress in a principled way that captures the diverse factors encountered in physical environments? To address this question, we formulate visual perception from an inverse graphics perspective and introduce RoboStressBench, a benchmark for evaluating VLM robustness to physical visual stress in embodied scenes. Inspired by the physical rendering equation, RoboStressBench decomposes visual stress into four physically grounded dimensions: Material (M), Viewpoint (V), Lighting (L), and Geometry (G). This design enables RoboStressBench to cover a broad range of visual stresses in real-world environments, while allowing controlled analysis of their effects on VLM capabilities such as visual recognition, reasoning, and planning. Through comprehensive evaluations of state-of-the-art VLMs, we identify stress-specific failure modes and reveal that different physical factors degrade different embodied capabilities, which are often obscured by aggregate accuracy. We further introduce a stress-aware agentic solver that detects visual stressors and invokes visual-editing skills before reasoning, improving robustness in high-stress scenarios. Overall, RoboStressBench provides a principled evaluation framework for diagnosing and improving VLM perception under real-world physical stress, supporting the development of more reliable embodied AI systems.
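The abstract's point that stress-specific failures are "often obscured by aggregate accuracy" can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the record layout, the helper names `accuracy` and `accuracy_by`, and the toy values are hypothetical and are not taken from the paper or its dataset.

```python
from collections import defaultdict

# Hypothetical evaluation records: (stress dimension, capability, correct?).
# The values are made up for illustration and are NOT results from the paper.
RECORDS = [
    ("M", "recognition", True),
    ("M", "recognition", True),
    ("V", "recognition", True),
    ("V", "planning", True),
    ("L", "recognition", False),
    ("L", "reasoning", True),
    ("G", "planning", False),
    ("G", "planning", False),
]


def accuracy(records):
    """Fraction of records marked correct."""
    records = list(records)
    return sum(ok for _, _, ok in records) / len(records)


def accuracy_by(records, key_index):
    """Group records by one field (0 = stress, 1 = capability) and score each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec)
    return {key: accuracy(group) for key, group in groups.items()}


overall = accuracy(RECORDS)
per_stress = accuracy_by(RECORDS, 0)
per_capability = accuracy_by(RECORDS, 1)

print(f"aggregate accuracy: {overall:.3f}")
for dim in "MVLG":
    print(f"  stress {dim}: {per_stress[dim]:.3f}")
for cap, acc in sorted(per_capability.items()):
    print(f"  capability {cap}: {acc:.3f}")
```

In this toy data the single aggregate number hides that every Geometry item fails while Material and Viewpoint items all pass, which is the kind of pattern a per-dimension breakdown is meant to expose.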

Original Article


Cached at: 06/02/26, 03:23 AM


Source: https://huggingface.co/papers/2606.00828 Published on May 30

Submitted by LeyiWu (https://huggingface.co/YUEVII) on Jun 2


Abstract

RoboStressBench presents a principled benchmark for evaluating vision-language model robustness to physical visual stress in embodied AI, decomposing visual stress into material, viewpoint, lighting, and geometry dimensions.

Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real conditions is essential. However, existing benchmarks assess VLMs using clean images or isolated perturbations rather than stresses caused by physical scene formation. This design has two limitations: it covers only a narrow subset of everyday visual stresses, and some perturbations rarely appear in realistic embodied scenes. This gap raises a fundamental question: how can we define visual stress in a principled way that captures the diverse factors encountered in physical environments? To address this question, we formulate visual perception from an inverse graphics perspective and introduce RoboStressBench, a benchmark for evaluating VLM robustness to physical visual stress in embodied scenes. Inspired by the physical rendering equation, RoboStressBench decomposes visual stress into four physically grounded dimensions: Material (M), Viewpoint (V), Lighting (L), and Geometry (G). This design enables RoboStressBench to cover a broad range of visual stresses in real-world environments, while allowing controlled analysis of their effects on VLM capabilities such as visual recognition, reasoning, and planning. Through comprehensive evaluations of state-of-the-art VLMs, we identify stress-specific failure modes and reveal that different physical factors degrade different embodied capabilities, which are often obscured by aggregate accuracy. We further introduce a stress-aware agentic solver that detects visual stressors and invokes visual-editing skills before reasoning, improving robustness in high-stress scenarios. Overall, RoboStressBench provides a principled evaluation framework for diagnosing and improving VLM perception under real-world physical stress, supporting the development of more reliable embodied AI systems.
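The abstract describes the stress-aware agentic solver only at a high level: detect visual stressors, invoke visual-editing skills, then reason. A minimal sketch of that control flow might look as follows. All names (`detect_stressors`, `brighten`, `reduce_glare`, `SKILLS`, `toy_vlm`, `solve`), the thresholds, and the dictionary stand-in for an image are hypothetical placeholders, not the authors' implementation or API.

```python
from typing import Callable, Dict, List, Tuple

# Toy stand-in for an image: a few scalar properties (an assumption for this sketch).
Image = Dict[str, float]


def detect_stressors(image: Image) -> List[str]:
    """Hypothetical detector: flag a stress dimension when a property crosses a threshold."""
    found = []
    if image.get("brightness", 1.0) < 0.3:
        found.append("L")  # Lighting stress: scene too dark
    if image.get("glare", 0.0) > 0.5:
        found.append("M")  # Material stress: strong specular reflection
    return found


def brighten(image: Image) -> Image:
    """Hypothetical editing skill that raises brightness."""
    return {**image, "brightness": 0.8}


def reduce_glare(image: Image) -> Image:
    """Hypothetical editing skill that suppresses glare."""
    return {**image, "glare": 0.1}


# Map each detected stress dimension to the editing skill that addresses it.
SKILLS: Dict[str, Callable[[Image], Image]] = {"L": brighten, "M": reduce_glare}


def toy_vlm(image: Image, question: str) -> str:
    """Stand-in for a VLM call: answers only when the image is readable."""
    readable = image.get("brightness", 1.0) >= 0.3 and image.get("glare", 0.0) <= 0.5
    return "visible" if readable else "unsure"


def solve(image: Image, question: str,
          answer_fn: Callable[[Image, str], str]) -> Tuple[str, List[str]]:
    """Detect stressors, apply the matching edits, then hand the image to the model."""
    stressors = detect_stressors(image)
    for stress in stressors:
        image = SKILLS[stress](image)
    return answer_fn(image, question), stressors


stressed = {"brightness": 0.1, "glare": 0.9}
answer, stressors = solve(stressed, "Is the mug on the table?", toy_vlm)
print(stressors, answer)
```

The design choice illustrated here is that editing happens before reasoning, so the downstream model is unchanged and only sees a pre-processed input; how the paper's solver actually detects stressors or selects skills is not specified in the abstract.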




Datasets citing this paper: 1

RoboStressBench/RoboStressBench-Dataset


Similar Articles

WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation

MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation

RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models
