Tag
TerraBench is a new benchmark for evaluating AI agents' ability to reason over heterogeneous Earth-system data, including gridded data, satellite imagery, and simulator outputs. It reveals significant limitations in current frontier models, with top performers achieving only 59.2% tool-use score on average.