Tag
GTA-2 introduces a hierarchical benchmark for evaluating general tool agents across atomic tool-use and open-ended workflows, revealing a significant capability cliff where frontier models achieve only 14.39% success on complex tasks despite reasonable atomic performance.