Robusto-2: Benchmarking Humans & VLMs for Autonomous Driving in Lima & New York City

Hugging Face Daily Papers

Summary

This paper compares how human drivers and vision-language models (VLMs) answer visual questions about dashcam footage from Lima and New York City. Humans and VLMs diverge in their responses, and the size of that gap depends on the question type. Geography has no strong effect: human drivers from Lima and from New York City answer similarly, and neither human nor VLM answers vary much with geography.

As self-driving cars continue to expand internationally and use multi-modal systems such as VLMs as a cognitive backbone for their action models, how well will these systems generalize to new settings, in particular out-of-distribution (OOD) edge-case scenarios in new geographies? In this paper, we study this open question through a full factorial analysis: we show dashcam footage collected from Lima and New York City to human drivers from Lima, human drivers from New York City, and VLMs, and prompt them with a variety of questions under a Visual Question Answering (VQA) paradigm. We pick these two cities because they are highly challenging driving locations where no self-driving car company currently operates, and we ask questions that span four categories: Factual, Ratings, Counterfactual, and Reasoning. We find that humans and VLMs diverge in their responses, though this is modulated by the type of question asked, and that humans answer similarly regardless of where they are from (Lima/NYC). To our surprise, we did not find a strong difference in answers (from humans or VLMs) that was modulated by geography, likely due to the highly out-of-distribution nature of the scenarios. Our dataset is available at: https://huggingface.co/datasets/Artificio/robusto-2
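
To make the study layout concrete, the following is a minimal Python sketch that enumerates the design cells implied by the abstract. The factor levels come from the abstract. The variable names, the function `design_cells`, and the choice to cross question category as a third factor are assumptions made for illustration, not the authors' code or analysis.

```python
from itertools import product

# Factor levels as named in the abstract; identifiers are illustrative only.
RESPONDENTS = ["Lima drivers", "NYC drivers", "VLMs"]
FOOTAGE_CITIES = ["Lima", "New York City"]
QUESTION_CATEGORIES = ["Factual", "Ratings", "Counterfactual", "Reasoning"]


def design_cells():
    """Return every (respondent, footage city, question category) combination."""
    return list(product(RESPONDENTS, FOOTAGE_CITIES, QUESTION_CATEGORIES))


cells = design_cells()
print(len(cells))  # 3 respondent groups * 2 footage cities * 4 categories = 24
for respondent, city, category in cells[:4]:
    print(f"{respondent} | {city} footage | {category}")
```

Laid out this way, the reported findings map onto comparisons between cells: humans versus VLMs within a question category, and Lima versus New York City footage within a respondent group.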

Cached at: 06/23/26, 05:43 PM

Source: https://huggingface.co/papers/2606.20980

Similar Articles

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Papers with Code Trending

OVO-S-Bench introduces a comprehensive human-annotated benchmark of 1,680 questions across 348 videos to evaluate streaming spatial intelligence in multimodal LLMs, revealing that even the best model (Gemini-3.1-Pro) trails human experts by 27 points. The benchmark exposes key limitations including allocentric mapping as a major bottleneck and chain-of-thought reasoning amplifying spatial errors.

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

Hugging Face Daily Papers

SuperMemory-VQA is a new egocentric VQA benchmark featuring 52.9 hours of AI-glasses footage and 4,853 QA pairs designed to evaluate AI assistants on long-horizon memory tasks spanning object recall, intent, timelines, and conversations. Benchmarking reveals existing agentic frameworks and LLMs remain far from reliable on these real-world memory challenges.

Revealing Interpretable Failure Modes of VLMs

arXiv cs.AI

This paper introduces Revelio, a framework that systematically discovers interpretable failure modes in Vision-Language Models (VLMs) by searching over discrete concept combinations. Applied to autonomous driving and indoor robotics, it reveals previously unreported vulnerabilities that lead to crashes or safety hazards.