workflow-completion

GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

Hugging Face Daily Papers · 2026-04-17

GTA-2 introduces a hierarchical benchmark that evaluates general tool agents on both atomic tool use and open-ended workflows. It reveals a sharp capability cliff: frontier models achieve only 14.39% success on complex tasks despite reasonable atomic performance.
