Tag
The paper introduces the GeoNatureAgent Benchmark, the first benchmark for evaluating LLM agents on environmental geospatial analysis tasks via structured tool calls. It evaluates seven models on 93 tasks across 18 categories and finds that Claude Sonnet 4 achieves the highest accuracy at 60.8%, while open-weight models such as DeepSeek V3.2 offer strong cost-performance tradeoffs.