The Amazing Agent Race: Strong Tool Users, Weak Navigators

arXiv cs.CL 04/20/26, 04:00 AM Papers

benchmark agent-evaluation tool-use navigation llm-agents compositional-reasoning

Summary

The Amazing Agent Race (AAR) is a benchmark of 1,400 puzzle instances (800 sequential legs and 600 compositional DAG legs) that evaluates LLM agents on Wikipedia navigation and fork-merge tool chains. Agents handle tool use well (tool-use errors stay below 17%) but struggle with navigation (navigation errors occur in 27 to 52% of trials), a gap that existing linear benchmarks cannot expose.

arXiv:2604.10261v2 Announce Type: replace-cross

Cached at: 04/20/26, 08:33 AM

# The Amazing Agent Race: Strong Tool Users, Weak Navigators
Source: https://arxiv.org/html/2604.10261
Zae Myung Kim¹, Dongseok Lee², Jaehyung Kim², Vipul Raheja³, Dongyeop Kang¹; University of Minnesota Twin Cities¹, Yonsei University², Grammarly³; {kim01756,dongyeop}@umn.edu

###### Abstract

Existing tool-use benchmarks for LLM agents are overwhelmingly *linear*: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring *directed acyclic graph* (DAG) puzzles (or "legs") with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6× fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the-amazing-agent-race

## 1 Introduction

Consider an innocuous question: "What is the elevation difference between the birthplaces of Apple's founders?" Using Wikipedia as one possible information source, an agent might (1) navigate to Apple's page, (2) extract the founders' names, (3) follow links to their biographical pages, (4) identify their birthplaces (San Francisco and Green Bay), (5) geocode each city, (6) query an elevation API, and (7) compute the difference:

```
coords_1 = geocode("San Francisco") → (37.77, -122.42)
coords_2 = geocode("Green Bay")     → (44.51, -88.01)
elev_1   = elevation(coords_1)      → 16 m
elev_2   = elevation(coords_2)      → 177 m
answer   = abs(elev_1 - elev_2)     → 161 m
```

A wrong page visit or swapped coordinate cascades through the chain and invalidates the answer. If the question also asks for the driving distance, the agent must *fork* coordinates into parallel API calls and *merge* results, a non-linear dependency that existing benchmarks leave untested.
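The fork and merge described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's tooling: `geocode` and `elevation` are stubs that return the values from the worked example, and `great_circle_km` is a hypothetical stand-in for a driving-distance tool (it computes haversine distance, which is not the same quantity).

```python
from concurrent.futures import ThreadPoolExecutor
from math import asin, cos, radians, sin, sqrt

# Stubbed tool outputs copied from the worked example; a real agent would
# call live geocoding and elevation APIs instead.
GEOCODE = {"San Francisco": (37.77, -122.42), "Green Bay": (44.51, -88.01)}
ELEVATION_M = {(37.77, -122.42): 16, (44.51, -88.01): 177}


def geocode(city):
    return GEOCODE[city]


def elevation(coords):
    return ELEVATION_M[coords]


def great_circle_km(a, b):
    """Haversine distance; hypothetical stand-in for a driving-distance tool."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def run_fork_merge(city_1, city_2):
    # Source: both branches depend on the same geocoded coordinates.
    c1, c2 = geocode(city_1), geocode(city_2)
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Fork: two independent branches consume the shared coordinates.
        elev_diff = pool.submit(lambda: abs(elevation(c1) - elevation(c2)))
        distance = pool.submit(great_circle_km, c1, c2)
        # Merge: the downstream step needs the results of both branches.
        return {
            "elevation_diff_m": elev_diff.result(),
            "distance_km": distance.result(),
        }


result = run_fork_merge("San Francisco", "Green Bay")
```

Swapping the two coordinate tuples in either branch would leave the code runnable but change the merged result, which is the kind of silent cascade the paragraph above describes.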

Existing benchmarks isolate these capabilities: tool-use benchmarks (Qi et al., 2024; Patil et al., 2025) omit navigation, compositional benchmarks (Basu et al., 2024; Ye et al., 2025) provide all inputs upfront, and web-navigation benchmarks (Zhou et al., 2024; Mialon et al., 2024) omit compositional tool chains. Our analysis of their dependency structures reveals that 55 to 100% of instances are strictly linear chains averaging only 2 to 5 steps (§2), a *compositionality deficit* that leaves fork–merge reasoning untested.

**This work.** We introduce The Amazing Agent Race (AAR), a benchmark designed around one diagnostic question: *where exactly does an agent break down when it must discover information through navigation, fork that information into parallel tool branches, and merge the results?* Inspired by the television series *The Amazing Race* (CBS, 2001), AAR frames evaluation as a race across Wikipedia. Each instance is a *leg*: a sequence of steps where the agent navigates Wikipedia pages, executes tool chains (e.g., geocode → elevation, geocode → weather), applies analytical reasoning, and aggregates results into a single-digit answer. Legs are not linear chains but directed acyclic graphs (DAGs): fork–merge *diamond* patterns spawn parallel tool branches from a single extracted entity whose outputs merge downstream. Every AAR instance is a true DAG (0% linear) with an average of 22 pit stops and up to 5 diamonds, compared to 55–100% linearity and 1.7–4.8 steps in prior benchmarks.

An automated pipeline generates legs from random Wikipedia seeds with pre-validated tool chains, diamond augmentation, and verbalized clue envelopes that never reveal titles or tool names directly. AAR provides 19 tools and spans four difficulty levels (8 to 33 pit stops); live APIs ensure answers must be *derived*, not recalled.

Figure 1: (a) Existing benchmarks are 55 to 100% linear; AAR is 0% linear (all DAGs). Numbers in parentheses show mean steps per instance (abbreviated "s"). (b) Best agent accuracy is 36.6% (aggregated across 1,400 legs). (c) Navigation errors dominate (5% to 52%) while tool-use errors stay below 15%.

Three metrics separately diagnose failures at each pipeline stage (Figure 1): finish-line accuracy (FA), pit-stop visit rate (PVR, navigation), and roadblock completion rate (RCR, tool use).
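To make the decomposition concrete, the sketch below computes the three metrics from a list of trial records. The exact definitions are not given in this section, so the ones used here are assumptions inferred from the metric names: FA as exact match on the finish-line code, PVR as the fraction of golden pages the agent visited, and RCR as the fraction of golden roadblocks it completed. The `Trial` record and its field names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Trial:
    predicted_code: Optional[int]   # None if the agent gave no answer
    gold_code: int
    gold_pages: set[str]            # pages on the golden path (assumed)
    visited_pages: set[str]
    gold_roadblocks: set[str]       # roadblock ids in the golden trace (assumed)
    completed_roadblocks: set[str]


def _rate(hit: set[str], gold: set[str]) -> float:
    # Fraction of golden items covered; vacuously 1.0 when nothing is required.
    return len(hit & gold) / len(gold) if gold else 1.0


def score(trials: list[Trial]) -> dict[str, float]:
    n = len(trials)
    return {
        "FA": sum(t.predicted_code == t.gold_code for t in trials) / n,
        "PVR": sum(_rate(t.visited_pages, t.gold_pages) for t in trials) / n,
        "RCR": sum(_rate(t.completed_roadblocks, t.gold_roadblocks) for t in trials) / n,
    }


trials = [
    Trial(4, 4, {"A", "B"}, {"A", "B", "X"}, {"r1"}, {"r1"}),
    Trial(7, 2, {"A", "B", "C", "D"}, {"A"}, {"r1", "r2"}, {"r1", "r2"}),
]
metrics = score(trials)
```

In this toy data the second trial completes every roadblock it reaches yet misses most pages and the final code, the pattern a single accuracy number would hide.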

**Key findings.** Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% FA. Navigation errors dominate (27 to 52% of trials) while tool-use errors stay below 17%. Moving from AAR-Linear to AAR-DAG drops navigation scores by 13 to 18 percentage points while tool-use scores remain stable, confirming that compositional structure challenges navigation, not tool use (§6.1).

**Contributions.**

1. A *compositionality analysis* of six benchmarks showing 55–100% linearity (§2).
2. An *automated generation pipeline* producing DAG-structured legs from random Wikipedia seeds with fork–merge diamond patterns, four structurally controlled difficulty levels, and contamination resistance via live APIs and clue paraphrasing (§3.3, §4). Code and data are available at https://github.com/minnesotanlp/the-amazing-agent-race.
3. *Three decomposed metrics* (FA, PVR, RCR) that isolate failures at the navigation, tool-use, and computation stages (§6).
4. An *evaluation on 1,400 legs* across three agent frameworks and two model families, with a detailed failure taxonomy (§6.1, §6.5).

## 2 Related Work

| Benchmark | Venue | Tools | Nav | Met | Stp | Live | Diff | Gld | Gen | Steps | %Lin | %DAG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ToolBench | ICLR'24 | 16k+ | ✗ | 2 | ✗ | ✓† | 3 lvl | ✓ | Auto | 1.9 | 100 | 0 |
| TaskBench | NeurIPS'24 | graph | ✗ | 3 | ✓ | ✗ | size | ✓ | Auto | 1.7 | 94 | 2.5 |
| NESTFUL | arXiv'24 | nest | ✗ | 2 | ✓ | ✗ | depth | ✓ | Scr | 3.4 | 55 |  |
| GAIA | ICLR'24 | var | ✓ | 1 | ✗ | ✗ | 3 lvl | ✗ | Man | ~5‡ | 100 | 0 |
| WebArena | ICLR'24 | brow | ✓ | 1 | ✗ | ✓ | impl | ✗ | Scr | – | – | – |
| AgentBench | ICLR'24 | 8env | part | 1 | ✓ | mix | env | ✗ | Man | – | – | – |
| AAR | – | 19 | ✓ | 3 | ✓ | ✓ | 4 lvl | ✓ | Auto | 22.1 | 0 | 100 |

Table 1: Comparison with representative benchmarks (3 per category; full table with 12 benchmarks in Appendix L). †ToolBench suffers API instability. ‡GAIA step count from annotator metadata only.

Deploying an LLM agent in the wild requires interpreting instructions, navigating information sources, invoking APIs, and chaining results, all within a single episode. Existing benchmarks isolate one or two of these capabilities; AAR combines open web navigation with multi-step tool composition in a structurally controlled, automatically generated benchmark (Table 1).

**Tool-use benchmarks.** ToolBench (Qi et al., 2024) curates 16,464 REST APIs for multi-step planning; real-API instability motivated StableToolBench (Guo et al., 2024) to replace live endpoints with a virtual server. BFCL (Patil et al., 2025) standardizes function-calling evaluation with AST-based scoring and multi-turn stateful workflows. API-Bank (Li et al., 2023) introduces a three-level framework over 73 APIs. All three scale the *number* of available tools but present them in isolation: the agent receives a query and calls APIs without needing to *find* the inputs first.

**Multi-step tool composition.** TaskBench (Shen et al., 2024) models inter-tool dependencies as a Tool Graph. NESTFUL (Basu et al., 2024) tests nested API sequences (GPT-4o: 28% full-sequence accuracy). ToolHop (Ye et al., 2025) constructs multi-hop queries requiring 3+ chained calls (best model: 49%). T-Eval (Chen et al., 2024) decomposes tool use into six sub-capabilities. ToolSandbox (Lu et al., 2025) adds statefulness and implicit dependencies. These benchmarks show compositional tool use is hard even when all inputs are given upfront. AAR adds a further challenge: agents must first *discover* inputs through navigation, coupling navigation errors with downstream tool failures.

**Compositionality gap.** We extract dependency graphs from the golden execution traces of six benchmarks (Table 1). ToolBench, ToolHop, and GAIA are entirely linear (100%). TaskBench, the only benchmark with explicit DAG annotations, is 94% linear with just 1.7 steps on average. NESTFUL and T-Eval show moderate non-linearity (45% and 38%) but remain shallow (3.4 and 4.8 steps). Every AAR instance is a DAG averaging 22 pit stops with fan-out and fan-in through diamond patterns, a structural gap that motivates our benchmark.¹

¹ GAIA lacks structured golden chains; we use annotator-reported step counts as a linear-chain proxy (165 validation samples only).
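One way to classify a dependency graph as linear is sketched below. The criterion is an assumption, since this section does not spell out the one used in the analysis: a graph counts as linear when no step consumes more than one prior step (no fan-in) and no step feeds more than one later step (no fan-out). The function names and the `depends_on` mapping format are hypothetical.

```python
from collections import Counter


def is_linear_chain(depends_on):
    """depends_on maps each step id to the list of step ids it consumes."""
    fan_out = Counter()
    for parents in depends_on.values():
        if len(parents) > 1:        # fan-in: this step merges several inputs
            return False
        fan_out.update(parents)
    # Fan-out: some step feeds more than one downstream step.
    return all(count <= 1 for count in fan_out.values())


def percent_linear(graphs):
    return 100.0 * sum(is_linear_chain(g) for g in graphs) / len(graphs)


chain = {"a": [], "b": ["a"], "c": ["b"]}
diamond = {"src": [], "elev": ["src"], "poi": ["src"], "merge": ["elev", "poi"]}
share = percent_linear([chain, diamond])
```

Under this criterion a set of disconnected chains also counts as linear; a stricter check would additionally require a single connected path.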

**Web navigation benchmarks.** WebArena (Zhou et al., 2024) evaluates long-horizon tasks across self-hosted web applications. Mind2Web (Deng et al., 2024) tests generalization across 137 real websites. OSWorld (Xie et al., 2024) extends evaluation to desktop GUI environments. GAIA (Mialon et al., 2024) comes closest to AAR's scope (some questions require both web lookup and tool use), but its 466 manually curated, static instances risk contamination, difficulty is human-annotated rather than structurally controlled, and evaluation is limited to final-answer exact match. AAR addresses all three limitations.

**Broader context.** Holistic multi-environment benchmarks (Liu et al., 2024; Ma et al., 2024; Trivedi et al., 2024; Yao et al., 2024; Xu et al., 2024) trade depth for breadth; AAR makes the complementary trade-off. Contamination resistance via live APIs and procedural generation is discussed alongside related fixed-benchmark limitations in Appendix A.

## 3 Benchmark Design Principles

While our framework is source-agnostic, we use Wikipedia because it offers dense hyperlink graphs (~40 outgoing links per page), semi-structured infoboxes for deterministic fact extraction, broad topical diversity, free licensing (CC BY-SA), and a contamination testbed: since LLMs have trained extensively on Wikipedia, our benchmark specifically tests whether agents can go *beyond* memorized facts via paraphrased clues and live API calls (§4.2).

### 3.1 Task Formulation

An AAR instance (a *leg*) consists of four inputs and produces one output:

- A *seed URL* u₀ pointing to a Wikipedia article (the starting line).
- A *clue envelope* 𝒞: a natural-language riddle whose K clues describe a sequence of steps without naming Wikipedia titles or tool names.
- A *tool set* 𝒯 of 19 tools with schema descriptions.
- A *step budget* B = max(10, ⌊1.5K⌋).

The agent must produce a single-digit *finish-line code* ŷ ∈ {0,...,9}. The ground-truth code y* is computed by the golden executor from a verified execution trace.
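The formulation above maps onto a small record type. The following is a minimal sketch: the `Leg` class, its field names, and the placeholder clues are hypothetical, while the budget rule B = max(10, ⌊1.5K⌋) and the answer range {0,...,9} come from the text.

```python
from __future__ import annotations

from dataclasses import dataclass


def step_budget(num_clues: int) -> int:
    # B = max(10, floor(1.5 * K)); (3 * K) // 2 keeps the floor in integers.
    return max(10, (3 * num_clues) // 2)


def is_valid_code(code: object) -> bool:
    # The finish-line code must be a single digit in {0, ..., 9}.
    return isinstance(code, int) and 0 <= code <= 9


@dataclass(frozen=True)
class Leg:
    seed_url: str                  # starting Wikipedia article u0
    clues: tuple[str, ...]         # clue envelope with K clues
    tool_names: tuple[str, ...]    # tool set (19 tools in AAR)

    @property
    def budget(self) -> int:
        return step_budget(len(self.clues))


leg = Leg(
    seed_url="https://en.wikipedia.org/wiki/Apple_Inc.",
    clues=("placeholder clue 1", "placeholder clue 2", "placeholder clue 3"),
    tool_names=("geocode", "elevation"),
)
```

The floor of 10 means short legs get proportionally more slack: a leg with 4 clues has a budget of 10 steps, while one with 22 clues gets 33.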

### 3.2 Leg Structure

A leg is a directed acyclic graph (DAG) of *pit stops* s₁,...,sₖ, each producing a typed value vᵢ and optionally depending on prior stops via explicit depends_on edges. Borrowing terminology from *The Amazing Race* (CBS, 2001), we define four pit-stop types:

1. **Route info** (route_info): Navigate to a Wikipedia page and extract a fact (e.g., a numeric infobox field, a date from prose).
2. **Roadblock** (roadblock): Execute a multi-step tool chain, e.g., geocode a location then query the elevation API.
3. **Detour** (detour): Apply an analytical transform to a prior value, e.g., next_prime(vᵢ), digit_sum(vᵢ).
4. **Finish line** (finish_line): Aggregate values from earlier stops via arithmetic to produce y* ∈ {0,...,9}.

Transitions are typed (link_follow, search_query, tool_call, compute), and values are typed (number, text, coords, date), enabling type-aware argument passing between stops.
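A leg of this shape can be represented and checked with the standard library. The sketch below is illustrative only: the `PitStop` class and the `execution_order` helper are hypothetical names, transition types are omitted for brevity, and the stop and value type names are taken from the text.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter

STOP_TYPES = {"route_info", "roadblock", "detour", "finish_line"}
VALUE_TYPES = {"number", "text", "coords", "date"}


@dataclass
class PitStop:
    stop_id: str
    stop_type: str
    value_type: str
    depends_on: list[str] = field(default_factory=list)


def execution_order(stops: list[PitStop]) -> list[str]:
    """Validate the leg and return stop ids with dependencies first."""
    ids = {s.stop_id for s in stops}
    for s in stops:
        if s.stop_type not in STOP_TYPES or s.value_type not in VALUE_TYPES:
            raise ValueError(f"unknown type on stop {s.stop_id}")
        if set(s.depends_on) - ids:
            raise ValueError(f"stop {s.stop_id} depends on an undefined stop")
    graph = {s.stop_id: set(s.depends_on) for s in stops}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError("leg is not acyclic") from exc


leg = [
    PitStop("birthplace", "route_info", "text"),
    PitStop("elevation", "roadblock", "number", ["birthplace"]),
    PitStop("digit_sum", "detour", "number", ["elevation"]),
    PitStop("finish", "finish_line", "number", ["elevation", "digit_sum"]),
]
order = execution_order(leg)
```

Here `elevation` feeds both the detour and the finish line, so the graph already has fan-out and fan-in even without an explicit diamond.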

### 3.3 Diamond Patterns

Figure 3: Diamond pattern structure.

AAR introduces *diamond patterns* (Figure 3) to create non-linear DAG structure. A diamond has a *source stop* (extract a geocodable entity), two *branch stops* (independent tool chains on the same entity, e.g., elevation and POI count), and a *merge stop* (combines branch outputs). Each branch records a depends_on edge to the source; the merge depends on both branches. Diamond count scales with difficulty (1 for easy up to 3–5 for extreme) across four types (elevation × POI, elevation × rating, population × area, temperature × precipitation), guaranteeing every instance is a true DAG.
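The diamond's edge structure can be written down directly from this description. In the sketch below, `make_diamond` and `find_diamonds` are hypothetical helpers, and each branch is simplified to a single stop even though a real branch may be a multi-step tool chain.

```python
def make_diamond(source, branch_a, branch_b, merge):
    """Build depends_on edges: both branches read the source, merge reads both."""
    return {
        source: [],
        branch_a: [source],
        branch_b: [source],
        merge: [branch_a, branch_b],
    }


def find_diamonds(depends_on):
    """Return (source, branch_a, branch_b, merge) for every fork-merge pattern."""
    found = []
    for merge, parents in depends_on.items():
        for i, a in enumerate(parents):
            for b in parents[i + 1:]:
                # Two parents of the merge that share a common parent form a diamond.
                shared = set(depends_on.get(a, [])) & set(depends_on.get(b, []))
                for src in sorted(shared):
                    found.append((src, a, b, merge))
    return found


# An "elevation x POI" diamond with illustrative stop names.
diamond = make_diamond("city", "elevation", "poi_count", "combine")
chain = {"a": [], "b": ["a"], "c": ["b"]}
```

Counting the tuples returned by `find_diamonds` gives a rough per-leg diamond count, the quantity that scales with difficulty.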

### 3.4 Tool Set

AAR provides 19 tools across eight categories (Appendix D), designed for composability (e.g., geocode → elevation) and temporal dynamism (stock/crypto tools return live data). Roadblock pit stops instantiate 17 templates composing 1–3 tools. Each tool returns values in a canonical unit (elevation in meters, distance in km, temperature in °C); explicit python_execute_code conversion stops handle unit changes when needed. The finish-line stop reduces gathered values to a single digit via modular arithmetic (digital_root, mod10, etc.), absorbing small API perturbations.
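The two reductions named above are standard arithmetic and can be sketched as follows. The function bodies are an assumption about how `digital_root` and `mod10` behave (in particular, taking the absolute value and requiring integer input); the benchmark's own implementations may differ.

```python
def digital_root(n: int) -> int:
    """Repeatedly sum decimal digits until one digit remains."""
    n = abs(n)
    # Closed form: 0 maps to 0, every other n maps to 1 + (n - 1) mod 9.
    return 0 if n == 0 else 1 + (n - 1) % 9


def mod10(n: int) -> int:
    """Keep only the last decimal digit."""
    return abs(n) % 10


# The 161 m elevation difference from the introduction, reduced both ways.
root = digital_root(161)   # 1 + 6 + 1
last = mod10(161)
```

Both reductions always land in {0,...,9}, so any integer aggregate can serve as the input to the finish-line code. Note that `digital_root` returns 0 only for an input of 0.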

### 3.5 Difficulty Levels

Difficulty is controlled through four levels that independently vary five parameters: pre-augmentation leg length (3–6 for easy up to 17–21 for extreme), roadblock count, detour count, extraction complexity (infobox-only vs. cross-section), and
