Robusto-2: Benchmarking Humans & VLMs for Autonomous Driving in Lima & New York City

Hugging Face Daily Papers 06/18/26, 12:00 AM Papers

autonomous-driving benchmark vlm vqa self-driving-cars generalization out-of-distribution

Summary

This paper studies how humans and VLMs answer visual questions about dashcam footage from two cities (Lima and New York City). It finds that geography has little effect on the answers of either group, while humans and VLMs diverge from each other to a degree that depends on question type.

As Self-Driving Cars continue to expand internationally and use multi-modal systems such as VLMs as a cognitive backbone for their Action models, how well will these systems generalize in new settings, in particular out-of-distribution (OOD) edge-case scenarios in new geographies? In this paper, we study this open question by providing a full factorial analysis with human drivers from Lima, human drivers from New York City, and VLMs, showing them dashcam footage collected from Lima and New York City and prompting them with a variety of questions under a Visual Question Answering (VQA) paradigm. In particular, we pick these two cities as they are highly challenging driving locations where no Self-Driving Car company currently operates, and ask questions that span 4 categories: Factual, Ratings, Counterfactual and Reasoning. We find that Humans and VLMs diverge in their responses, though this is modulated by the type of questions asked, and that Humans answer similarly independent of where they are from (Lima/NYC). To our surprise, we did not find a strong difference in answers (Humans or VLMs) that was modulated by geography, likely due to their high out-of-distribution nature. Our dataset is available at: https://huggingface.co/datasets/Artificio/robusto-2
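To make the "full factorial" design concrete, the sketch below enumerates every combination of responder group, footage city, and question category named in the abstract. It is only an illustration: the variable and function names are hypothetical, and the abstract does not say how many clips, questions, or participants fall in each cell.

```python
from itertools import product

# Factor levels as named in the abstract; the identifiers are illustrative only.
RESPONDERS = ["Lima drivers", "NYC drivers", "VLMs"]
FOOTAGE_CITIES = ["Lima", "New York City"]
QUESTION_TYPES = ["Factual", "Ratings", "Counterfactual", "Reasoning"]


def factorial_cells():
    """Return every (responder, footage city, question type) combination."""
    return list(product(RESPONDERS, FOOTAGE_CITIES, QUESTION_TYPES))


cells = factorial_cells()
print(len(cells))  # 3 responder groups * 2 cities * 4 question types = 24
```

Under these assumptions, each responder group sees footage from both cities and answers all four question categories, which is what allows the effects of geography and question type to be compared separately.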

Original Article

Cached at: 06/23/26, 05:43 PM

Source: https://huggingface.co/papers/2606.20980

Abstract

Research examines how humans and the VLMs used in self-driving car systems perform on visual question answering tasks across different geographic locations, revealing that human and VLM responses diverge from each other depending on question type, while geography has no strong effect on either.

As Self-Driving Cars continue to expand internationally and use multi-modal systems such as VLMs as a cognitive backbone for their Action models, how well will these systems generalize in new settings, in particular out-of-distribution (OOD) edge-case scenarios in new geographies? In this paper, we study this open question by providing a full factorial analysis with human drivers from Lima, human drivers from New York City, and VLMs, showing them dashcam footage collected from Lima and New York City and prompting them with a variety of questions under a Visual Question Answering (VQA) paradigm. In particular, we pick these two cities as they are highly challenging driving locations where no Self-Driving Car company currently operates, and ask questions that span 4 categories: Factual, Ratings, Counterfactual and Reasoning. We find that Humans and VLMs diverge in their responses, though this is modulated by the type of questions asked, and that Humans answer similarly independent of where they are from (Lima/NYC). To our surprise, we did not find a strong difference in answers (Humans or VLMs) that was modulated by geography, likely due to their high out-of-distribution nature. Our dataset is available at: https://huggingface.co/datasets/Artificio/robusto-2

Similar Articles

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

LEVANTE-bench: Multi-Scale Comparison of VLMs to Children Using Cognitive Tasks (or, "Is Your VLM Smarter Than a 5th Grader?")

Revealing Interpretable Failure Modes of VLMs
