Robusto-2: Benchmarking Humans & VLMs for Autonomous Driving in Lima & New York City
Summary
This paper compares human drivers from Lima and New York City with VLMs on visual question answering over dashcam footage collected in both cities. Human and VLM responses diverge, with the size of the gap depending on question type, while geography has no strong effect on the answers of either group.
Cached at: 06/23/26, 05:43 PM
Source: https://huggingface.co/papers/2606.20980
Abstract
Research examines how humans and the VLMs used in self-driving car systems answer visual question answering tasks on dashcam footage from different geographic locations, revealing that human and VLM responses diverge depending on question type, while answers are not strongly affected by location.
As Self-Driving Cars continue to expand internationally and use multi-modal systems such as VLMs as a cognitive backbone for their Action models; how well will these systems generalize in new settings, in particular out-of-distribution (OOD) edge-case scenarios in new geographies? In this paper, we study this open question by providing a full factorial analysis with human drivers of Lima, human drivers from New York City, and VLMs and showing them dashcam footage collected from Lima and New York City -- prompting them with a variety of questions under a Visual Question Answering (VQA) paradigm. In particular, we pick these two cities as they are highly challenging driving locations where no Self-Driving Car company currently operates in, and ask questions that span 4 categories: Factual, Ratings, Counterfactual and Reasoning. We find that Humans and VLMs diverge in their responses -- though this is modulated by the type of questions asked, and that Humans answer similarly independent of where they are from (Lima/NYC). To our surprise, we did not find a strong difference in terms of answers (Humans or VLMs) that was modulated by geography, likely due to their high out-of-distribution nature. Our dataset is available at: https://huggingface.co/datasets/Artificio/robusto-2
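As a rough illustration of what a full factorial design means here, the sketch below enumerates every combination of respondent group, footage city, and question category named in the abstract. The identifier names are hypothetical, and crossing the four question categories with every respondent and city is an assumption made for illustration. The abstract does not describe the paper's actual analysis code or the dataset schema.

```python
from itertools import product

# Factor levels taken from the abstract; the identifier names are illustrative only.
RESPONDENTS = ("human_lima", "human_nyc", "vlm")
FOOTAGE_CITIES = ("lima", "nyc")
# Assumption: every question category is asked in every respondent/city cell.
QUESTION_CATEGORIES = ("factual", "ratings", "counterfactual", "reasoning")


def factorial_cells():
    """Return every (respondent, footage city, question category) combination."""
    return list(product(RESPONDENTS, FOOTAGE_CITIES, QUESTION_CATEGORIES))


cells = factorial_cells()
print(len(cells))  # 3 * 2 * 4 = 24
print(cells[0])    # ('human_lima', 'lima', 'factual')
```

Under these assumptions the design has 24 cells. Comparing answers across the respondent factor within a fixed city and category isolates the human versus VLM difference, while comparing across footage cities isolates the effect of geography.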
Similar Articles
OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs
OVO-S-Bench introduces a comprehensive human-annotated benchmark of 1,680 questions across 348 videos to evaluate streaming spatial intelligence in multimodal LLMs, revealing that even the best model (Gemini-3.1-Pro) trails human experts by 27 points. The benchmark exposes key limitations including allocentric mapping as a major bottleneck and chain-of-thought reasoning amplifying spatial errors.
PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation
This paper introduces PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, enabling style-diverse non-ego agents for closed-loop simulation and improving driving scores on Bench2Drive.
SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory
SuperMemory-VQA is a new egocentric VQA benchmark featuring 52.9 hours of AI-glasses footage and 4,853 QA pairs designed to evaluate AI assistants on long-horizon memory tasks spanning object recall, intent, timelines, and conversations. Benchmarking reveals existing agentic frameworks and LLMs remain far from reliable on these real-world memory challenges.
LEVANTE-bench: Multi-Scale Comparison of VLMs to Children Using Cognitive Tasks (or, "Is Your VLM Smarter Than a 5th Grader?")
This paper introduces LEVANTE-bench, a benchmark that systematically evaluates vision-language models on six cognitive tasks and compares their performance to children aged 5-12, finding that current VLMs align only partially with children's cognitive abilities.
Revealing Interpretable Failure Modes of VLMs
This paper introduces Revelio, a framework that systematically discovers interpretable failure modes in Vision-Language Models (VLMs) by searching over discrete concept combinations. Applied to autonomous driving and indoor robotics, it reveals previously unreported vulnerabilities that lead to crashes or safety hazards.