How Do Tool-Augmented LLM Agents Perform on Real-World Energy Analytics Tasks?
Summary
This paper presents an empirical study and benchmark for evaluating tool-augmented LLM agents on real-world energy analytics tasks, comprising 243 expert-curated problems across market data retrieval, knowledge interpretation, and quantitative modeling.
# How Do Tool-Augmented LLM Agents Perform on Real-World Energy Analytics Tasks?
Source: [https://arxiv.org/html/2606.26346](https://arxiv.org/html/2606.26346)
David Akinpelu (Independent Researcher), Akintonde Abbas (Tume AI), Rereloluwa Alimi (Tume AI), Ayodeji Lana (Independent Researcher)
###### Abstract
While agentic benchmarks have emerged across both general\-purpose and domain\-specific settings, including finance, coding, law, and drug discovery, energy\-domain evaluations remain limited to static knowledge recall\. This is a critical gap for a sector that demands live data retrieval, specialized regulatory and market knowledge, and multi\-step quantitative reasoning under real\-world constraints\. Despite its complexity and societal importance, the energy sector remains substantially underserved relative to domains where dynamic, tool\-augmented evaluation has matured considerably\.
We present an empirical study of tool\-augmented LLM agents on real\-world energy market analytics tasks\. Our evaluation environment consists of 243 expert\-curated problems spanning three broad categories: \(1\) Market Data Retrieval and Analysis, \(2\) Knowledge Retrieval and Interpretation, and \(3\) Advanced Quantitative Modeling and Decision Analytics, encompassing tasks such as price and demand analysis, tariff impact modeling, asset revenue and returns estimation, hedging strategy analysis, and optimization modeling, each graded across multiple difficulty levels\.
Agents are provided with a configurable suite of domain tools, including live electricity market APIs for major U\.S\. ISOs, regulatory docket search, utility tariff databases, asset optimization models, and retrieval\-augmented generation over energy market documents\. To assess the performance of the agents along multiple dimensions, we employ a multi\-dimensional evaluation protocol that scores responses on approach correctness, answer accuracy, attribute alignment, and source validity, with category\-aware routing to match scoring criteria to question type\. We evaluate both closed\-source and open\-source LLMs, offering a comparative analysis of how model capability and domain tooling interact in a high\-stakes professional domain, with key artifacts publicly released\.
## 1 Introduction
The energy sector is one of the most analytically demanding domains for AI\-assisted decision support\. Energy market professionals routinely synthesize heterogeneous, time\-sensitive information spanning real\-time market prices, complex regulatory frameworks, asset\-level financial models, and datasets from ISO/RTO market systems, utility tariffs, interconnection queues, and weather services\. Analysts at utilities, independent power producers, regulators, and consulting firms execute these workflows under conditions where errors carry material financial and operational consequences\. Large language models \(LLMs\) have demonstrated strong capabilities in natural language understanding, knowledge retrieval, and structured reasoning\[[28](https://arxiv.org/html/2606.26346#bib.bib1),[2](https://arxiv.org/html/2606.26346#bib.bib2),[16](https://arxiv.org/html/2606.26346#bib.bib3)\]\. The emergence of tool\-augmented agentic frameworks—where models iteratively invoke external tools to retrieve data and execute computations—has further expanded the potential of LLMs in professional analytical workflows\[[43](https://arxiv.org/html/2606.26346#bib.bib4),[33](https://arxiv.org/html/2606.26346#bib.bib5)\]\. Domain\-specific benchmarks have begun evaluating such systems in finance, law, software engineering, and drug discovery\[[41](https://arxiv.org/html/2606.26346#bib.bib15),[8](https://arxiv.org/html/2606.26346#bib.bib26),[18](https://arxiv.org/html/2606.26346#bib.bib13),[21](https://arxiv.org/html/2606.26346#bib.bib9),[10](https://arxiv.org/html/2606.26346#bib.bib14)\]\.
Despite this progress, the energy sector remains largely absent from rigorous agentic evaluation. Most prior AI work in energy focuses on predictive tasks such as load forecasting, renewable generation estimation, and electricity price prediction using supervised learning on historical data \[[19](https://arxiv.org/html/2606.26346#bib.bib19), [38](https://arxiv.org/html/2606.26346#bib.bib20)\]. WattWorks, a benchmark from the Electric Power Research Institute (EPRI), evaluates LLMs on power system questions but does not assess tool-augmented agents performing multi-step analytical workflows reflective of real analyst practice \[[15](https://arxiv.org/html/2606.26346#bib.bib53)\]. Consequently, no benchmark currently evaluates whether LLM agents can execute end-to-end energy analytics workflows under realistic operational constraints. To address this gap, we introduce EnergyEvals, an evaluation framework for tool-augmented LLM agents on real-world energy analytics tasks. The first iteration focuses on U.S. electricity market analytics, with future versions expanding to additional regions and energy sub-domains. This paper makes the following contributions:
- A domain benchmark of 243 expert-curated tasks spanning three core capability areas: market data retrieval and analysis, knowledge retrieval and interpretation, and advanced quantitative modeling and decision analytics. Each of these categories contains tasks across three difficulty levels (Easy, Medium, and Hard), which were generated by practitioners with doctoral-level training and combined industry experience exceeding 25 years at leading energy consulting and engineering organizations.
- A configurable agentic evaluation environment providing agents with nine domain-specific tools, including live ISO/RTO market APIs covering all major U.S. wholesale markets, utility tariff databases, regulatory docket search, renewable energy generation simulation, battery revenue optimization, and retrieval-augmented generation over electricity market reports and market protocol documents.
- A multi-dimensional evaluation protocol with category-aware rubric routing that assesses approach correctness, answer accuracy, attribute alignment, and source validity through a multiple LLM-as-a-judge framework calibrated to specific quality requirements defined by energy analytics domain experts.
- An empirical study of seven frontier LLMs spanning closed-source and open-source models, revealing model-specific performance profiles and failure modes that emerge exclusively under realistic agentic task execution in a high-stakes professional domain.
- Public release of the benchmark dataset, evaluation framework, scoring code, and a subset of agent execution traces, to support reproducibility and community extension.

The remainder of this paper is organized as follows. Section [2](https://arxiv.org/html/2606.26346#S2) surveys related work. Section [3](https://arxiv.org/html/2606.26346#S3) describes the benchmark dataset. Section [4](https://arxiv.org/html/2606.26346#S4) presents the agent architecture and tool suite. Section [5](https://arxiv.org/html/2606.26346#S5) defines the evaluation protocol. Section [6](https://arxiv.org/html/2606.26346#S6) reports experimental results. Conclusions and next steps are covered in Section [7](https://arxiv.org/html/2606.26346#S7).
## 2 Related Work
Research on benchmarks for agentic large language model (LLM) systems has expanded rapidly as tool-use frameworks mature. Early general-purpose benchmarks demonstrate that real-world multi-step reasoning remains difficult even for frontier models. GAIA evaluates web-augmented reasoning tasks where human non-experts solve 92% of problems while state-of-the-art models score below 30% \[[27](https://arxiv.org/html/2606.26346#bib.bib8)\], and SWE-bench measures software engineering agents solving real GitHub issues \[[21](https://arxiv.org/html/2606.26346#bib.bib9)\]. Benchmarks such as AgentBench, ToolBench, τ-bench, and TheAgentCompany consistently reveal large gaps between model reasoning ability and successful task completion across interactive environments, API ecosystems, and simulated enterprise settings \[[26](https://arxiv.org/html/2606.26346#bib.bib10), [32](https://arxiv.org/html/2606.26346#bib.bib11), [42](https://arxiv.org/html/2606.26346#bib.bib12), [40](https://arxiv.org/html/2606.26346#bib.bib27)\]. AstaBench further demonstrates that autonomous scientific discovery remains unsolved \[[9](https://arxiv.org/html/2606.26346#bib.bib28)\].
Domain-specific agentic benchmarks have emerged across professional fields, revealing that general capability improvements do not reliably transfer to specialized domains. In finance and enterprise analytics, Finance Agent Benchmark and InvestorBench show that even top models achieve only moderate accuracy \[[8](https://arxiv.org/html/2606.26346#bib.bib26), [24](https://arxiv.org/html/2606.26346#bib.bib30)\], while CLASSIC and EnterpriseBench highlight agent struggles with workflow orchestration and tool usage \[[39](https://arxiv.org/html/2606.26346#bib.bib44), [36](https://arxiv.org/html/2606.26346#bib.bib47)\]. PaperBench and ScienceAgentBench demonstrate that replicating academic research remains extremely challenging \[[35](https://arxiv.org/html/2606.26346#bib.bib43), [11](https://arxiv.org/html/2606.26346#bib.bib45)\]. Across productivity and specialized domains (including OdysseyBench, ContextBench, MedAgentBench, and LegalAgentBench), performance degrades significantly as tasks require deeper contextual understanding or domain expertise \[[37](https://arxiv.org/html/2606.26346#bib.bib46), [23](https://arxiv.org/html/2606.26346#bib.bib50), [20](https://arxiv.org/html/2606.26346#bib.bib31), [22](https://arxiv.org/html/2606.26346#bib.bib29)\]. SkillsBench covers energy through only three narrow power-system tasks, yet even with curated Skills, agents fail more than half of them (47.5% pass rate), and domain knowledge gaps emerge as a primary failure mode. This further motivates purpose-built evaluation frameworks that pair domain-specific tools with tasks representative of real energy analyst workflows \[[25](https://arxiv.org/html/2606.26346#bib.bib51)\].
Within the energy domain, most AI research has focused on predictive tasks such as load forecasting, electricity price forecasting, and renewable power generation modeling\[[19](https://arxiv.org/html/2606.26346#bib.bib19),[38](https://arxiv.org/html/2606.26346#bib.bib20),[30](https://arxiv.org/html/2606.26346#bib.bib21)\]\. Recent surveys highlight the absence of robust agentic evaluation frameworks for real\-world analytical workflows\[[1](https://arxiv.org/html/2606.26346#bib.bib52)\]\. Some recently published works indicate initial steps to fill this gap\. WattWorks from EPRI shows that frontier LLMs perform well on multiple\-choice power\-sector questions \(around 83–86% accuracy\) but decline by roughly 27 percentage points on open\-ended technical tasks\[[15](https://arxiv.org/html/2606.26346#bib.bib53)\]\. ElecBench and smart\-grid agent frameworks confirm that retrieval\-augmented tools improve but do not fully resolve operational challenges\[[47](https://arxiv.org/html/2606.26346#bib.bib36),[31](https://arxiv.org/html/2606.26346#bib.bib38)\]\. PFBench presents a benchmark dataset that focuses on power\-flow analysis and leverages standard IEEE transmission test cases\[[34](https://arxiv.org/html/2606.26346#bib.bib58)\]\. GridMind and GridAgent discuss agents for transmission power flows and contingency analysis and evaluate them on simulation\-based test cases\[[4](https://arxiv.org/html/2606.26346#bib.bib61),[45](https://arxiv.org/html/2606.26346#bib.bib62)\]\. PowerDAG and PowerChain present agentic systems for distribution networks analysis\[[5](https://arxiv.org/html/2606.26346#bib.bib60),[6](https://arxiv.org/html/2606.26346#bib.bib59)\]\. None of the existing works, however, focus on practical workflows that are part of daily activities for typical energy analysts and decision makers\.
Tool-augmented language agents offer a promising paradigm for such workflows. ReAct demonstrated that interleaving reasoning with tool invocation enables iterative solution refinement \[[43](https://arxiv.org/html/2606.26346#bib.bib4)\], and Toolformer showed that models can learn autonomous tool invocation \[[33](https://arxiv.org/html/2606.26346#bib.bib5)\]. Sandboxed code execution and structured tool access further expand agent capabilities \[[12](https://arxiv.org/html/2606.26346#bib.bib42)\], though poor tool integration can introduce new reasoning errors \[[44](https://arxiv.org/html/2606.26346#bib.bib41)\]. Recent frameworks therefore emphasize multi-dimensional scoring and LLM-as-a-judge methodologies to better diagnose performance \[[46](https://arxiv.org/html/2606.26346#bib.bib16), [14](https://arxiv.org/html/2606.26346#bib.bib17), [7](https://arxiv.org/html/2606.26346#bib.bib18)\]. Building on these contributions, EnergyEvals evaluates tool-augmented agents on real-world energy analytics tasks involving live market data retrieval, regulatory analysis, and optimization modeling absent from existing benchmarks.
## 3 Benchmark Dataset
### 3.1 Design Philosophy
The dataset is designed to evaluate whether tool\-augmented LLM agents can execute realistic, end\-to\-end energy analytics workflows in a professional domain where accuracy, traceability, and quantitative rigor are essential\. Rather than testing benchmark\-style factual recall, tasks mirror the workflows of practicing energy market analysts, including retrieving live pricing data, interpreting formal regulatory and interconnection documents, and executing multi\-step financial models under operationally realistic constraints\. Task development was led by domain experts with doctoral\-level training and prior professional experience at organizations including McKinsey, ICF, LCG Consulting, and General Electric, representing more than 25 years of combined industry experience\. This practitioner grounding is reflected in the prompt design through the use of market\-specific terminology \(e\.g\., nodal pricing, ancillary service qualification thresholds, interconnection milestones\) and operational constraints such as efficiency parameters, degradation costs, state\-of\-charge limits, and IRR targets that are typical of client engagements rather than academic exercises\. The current release is intentionally scoped to U\.S\. electricity markets—covering both deregulated and vertically integrated regions—to ensure cross\-task comparability while preserving real\-world complexity arising from differences in market design, tariff structures, and regulatory and interconnection documentation\. Future releases will expand coverage to additional geographies, commodities, and adjacent energy sub\-domains\.
### 3.2 Capability Areas
The 243-task corpus is organized around three broad capability areas across three difficulty levels (Easy, Medium, Hard), with 107 Data, 86 Knowledge, and 50 Quant tasks, respectively (see Appendix [A.1](https://arxiv.org/html/2606.26346#A1.SS1) for the full breakdown).
1. Market Data Retrieval and Analysis ("Data"). Tasks requiring extraction, aggregation, filtering, and formatting of structured market data from ISO/RTO databases and APIs. Representative analyst functions include day-ahead and real-time price analysis, ancillary service performance evaluation, load and generation dispatch reporting, and cross-market comparisons. Example: “Show me the monthly average of day-ahead prices for ERCOT Houston hub in 2023 based on your ERCOT database.”
2. Knowledge Retrieval and Interpretation ("Knowledge"). Tasks requiring navigation of formal regulatory documents, utility tariff filings, market operation manuals, and interconnection procedures to answer precise procedural and structural questions. These tasks test the agent’s ability to identify authoritative sources, locate relevant provisions, and interpret regulatory language accurately without fabricating content. Example: “What are the fees associated with each milestone in the ERCOT generation interconnection process based on the ERCOT fee schedule and Resource Interconnection Handbook?”
3. Advanced Quantitative Modeling and Decision Analytics ("Quant"). Tasks requiring multi-step analytical reasoning, modeling, and optimization under explicit operational assumptions. Representative functions include battery energy storage revenue estimation, demand-charge impact assessment, internal rate of return (IRR) computation, and optimization-based decision support with explicit constraint specifications. Example: “If a 4-hour battery earns revenues from arbitrage only in ERCOT West hub over 15 years, what should the \$/MW capex be to earn a 13% IRR? Assume 81% roundtrip efficiency, \$25/MWh degradation cost, state-of-charge limits of 10–90%, and use prices from 2010–2024 as the representative 15-year window.”
### 3.3 Difficulty Stratification
Tasks are stratified across three difficulty levels to probe increasingly demanding agent behaviors:
- Easy: Direct retrieval tasks with explicit source context and limited data transformation. The agent must select and invoke the correct tool but requires minimal multi-step reasoning. Example: “What detailed fees are associated with each decision point in the NYISO generation interconnection process based on NYISO Manuals 23 and UG21?”
- Medium: Tasks requiring retrieval with or without explicit source hints, combined with moderate aggregation, filtering, or cross-attribute comparison. Example: “Which PJM price hub had the highest day-ahead average price in January 2024 based on your PJM database?”
- Hard: Tasks requiring multi-step, multi-source reasoning and advanced quantitative modeling under realistic operational assumptions. Agents must correctly sequence tool calls, apply domain-specific constraints, and integrate outputs across multiple reasoning steps. Example: “If a 4-hour battery earns revenues from arbitrage only in ERCOT West hub over 15 years, what should the \$/MW capex be to earn a 13% IRR? Assume 81% roundtrip efficiency, \$25/MWh degradation cost, state-of-charge limits of 10–90%, and use prices from 2010–2024 as the representative 15-year window.”
### 3.4 Paired Prompt Construction
A central design feature of the dataset is *paired prompt construction*: selected tasks are available in two variants, one explicitly specifying the information source and one omitting source specification. This enables controlled evaluation of source-scaffolding effects on agent performance under matched semantic intent. For example, “What are the participation requirements for regulation service in CAISO based on the latest Business Practice Manual for Market Operations?” is a task with a specified source, and “What are the participation requirements for regulation service in CAISO?” is its without-source counterpart.
## 4 Agent Architecture and Tool Suite
### 4.1 ReAct Agent Framework
Agents are implemented as ReAct-style reasoning-and-acting agents that execute an iterative Thought → Action → Observation loop \[[43](https://arxiv.org/html/2606.26346#bib.bib4)\]. At each step, the agent produces a natural language reasoning trace (*Thought*), selects and invokes a tool with structured arguments (*Action*), and receives the tool’s structured output (*Observation*). The loop terminates when the agent produces a final answer or reaches a configurable maximum iteration budget. This architecture is expressive enough to represent multi-hop retrieval chains, sequential computation pipelines, and iterative refinement strategies without constraining the agent to an execution pattern (see Appendix [A.2](https://arxiv.org/html/2606.26346#A1.SS2) for a conceptual view of the architecture and Appendix [A.4](https://arxiv.org/html/2606.26346#A1.SS4) for implementation details).
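The loop described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the released implementation: the `Step`, `Trace`, and `run_react` names, the policy signature, the `get_price` tool, the hub identifier, and the returned value are all hypothetical placeholders, and the model call is replaced by a scripted policy so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One Thought -> Action -> Observation step (hypothetical structure)."""
    thought: str
    action: Optional[str]            # tool name, or None on the final-answer step
    arguments: dict
    observation: Optional[str] = None


@dataclass
class Trace:
    steps: List[Step] = field(default_factory=list)
    final_answer: Optional[str] = None
    stopped_reason: str = ""


# Assumed policy interface: (question, steps so far) ->
# (thought, action, arguments, final_answer). A real agent would call an LLM here.
Policy = Callable[[str, List[Step]], Tuple[str, Optional[str], dict, Optional[str]]]


def run_react(question: str, policy: Policy,
              tools: Dict[str, Callable[..., str]],
              max_iterations: int = 10) -> Trace:
    trace = Trace()
    for _ in range(max_iterations):
        thought, action, arguments, final_answer = policy(question, trace.steps)
        if final_answer is not None:
            # The agent chose to answer instead of calling another tool.
            trace.steps.append(Step(thought, None, {}))
            trace.final_answer = final_answer
            trace.stopped_reason = "final_answer"
            return trace
        if action in tools:
            observation = tools[action](**arguments)
        else:
            observation = f"error: unknown tool {action!r}"
        trace.steps.append(Step(thought, action, arguments, observation))
    # Budget exhausted without a final answer.
    trace.stopped_reason = "max_iterations"
    return trace


def scripted_policy(question, steps):
    """Stand-in for the model: call one tool, then answer."""
    if not steps:
        return ("I need the hub price.", "get_price", {"hub": "HB_HOUSTON"}, None)
    return ("I have the data.", None, {}, f"Price is {steps[-1].observation}")


tools = {"get_price": lambda hub: "42.0"}   # placeholder tool and value
trace = run_react("What is the hub price?", scripted_policy, tools, max_iterations=5)
```

A run that never produces a final answer ends with `stopped_reason == "max_iterations"`, which corresponds to the "maxed out iterations" failure category used later in the evaluation.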
### 4.2 Model Configurations
Seven frontier LLMs are evaluated as agent backends (see Appendix [A.3](https://arxiv.org/html/2606.26346#A1.SS3) for the full configuration table). Closed-source models (GPT-5.2, GPT-5-mini, Gemini-3.1-Pro, Claude Sonnet 4.6) and open-source models (Kimi-K2.5, Qwen3-Max-Thinking, DeepSeek-V3.2) are all configured with low reasoning effort and temperature 0 for deterministic, reproducible outputs, using off-the-shelf inference APIs without domain adaptation. Reasoning effort is intentionally set to low to evaluate model performance under cost constraints. Future iterations will explore trade-offs between higher reasoning modes and cost implications.
### 4.3 Tool Suite
Agents are given access to a suite of domain-specific tools grouped under nine categories spanning live structured market data (GridStatus API, Database MCP), formal document retrieval (RAG MCP, Dockets, Web Search), domain computation (Battery Optimization, Renewables), and contextual supplementary data (Tariffs, Weather). Tools are registered in a typed registry and exposed as structured JSON Schema function definitions, enabling identical execution across all models. A full tool inventory and descriptions are provided in Appendix [A.7](https://arxiv.org/html/2606.26346#A1.SS7).
All agent executions are traced at the step level as structured JSON artifacts (see Appendix [A.6](https://arxiv.org/html/2606.26346#A1.SS6)), enabling the failure mode analyses in Section [6](https://arxiv.org/html/2606.26346#S6), and a subset is released as a secondary research artifact. Output traces and raw evaluation reports are included in the GitHub repository (https://github.com/Tume-AI/energy-evals).
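As a rough illustration of what a typed registry that emits JSON Schema function definitions might look like, consider the sketch below. The `ToolSpec` and `ToolRegistry` classes, the `lookup_tariff` tool, and its placeholder handler are hypothetical and are not taken from the released code; the only validation performed is a check of the schema's `required` list.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    parameters: Dict[str, Any]       # JSON Schema object describing the arguments
    handler: Callable[..., Any]


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        if spec.name in self._tools:
            raise ValueError(f"duplicate tool name: {spec.name}")
        self._tools[spec.name] = spec

    def function_definitions(self) -> List[Dict[str, Any]]:
        """Model-agnostic definitions, sorted by name so every model sees the same list."""
        return [
            {"name": s.name, "description": s.description, "parameters": s.parameters}
            for s in sorted(self._tools.values(), key=lambda s: s.name)
        ]

    def call(self, name: str, arguments: Dict[str, Any]) -> Any:
        spec = self._tools[name]
        # Minimal check: only the schema's "required" keys are enforced here.
        missing = [k for k in spec.parameters.get("required", []) if k not in arguments]
        if missing:
            raise ValueError(f"{name}: missing required arguments {missing}")
        return spec.handler(**arguments)


registry = ToolRegistry()
registry.register(ToolSpec(
    name="lookup_tariff",            # hypothetical tool name
    description="Return a placeholder tariff record for a utility.",
    parameters={
        "type": "object",
        "properties": {"utility": {"type": "string"}},
        "required": ["utility"],
    },
    handler=lambda utility: {"utility": utility, "rate_usd_per_kwh": None},
))
definitions = registry.function_definitions()
```

Keeping the schema separate from the handler means the same `definitions` list can be passed to any provider's function-calling interface, while execution always goes through `registry.call`.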
## 5 Evaluation Protocol
### 5.1 Evaluation Dimensions
Agent responses are assessed across three complementary dimensions that together capture the distinct quality requirements that are typical in the energy analytics domain\. Each dimension targets a failure mode that matters in practice but would be invisible under aggregate accuracy\-only scoring\.
1. Approach Correctness (1–5). Does the agent employ an appropriate analytical strategy? This dimension evaluates tool selection, sequencing logic, and whether the agent’s reasoning pathway is consistent with how a professional analyst would approach the task. An agent that reaches a correct numerical answer through an inappropriate pathway (for example, by hallucinating data rather than retrieving it from the correct API) receives reduced credit on this dimension.
2. Answer Accuracy / Attribute Alignment (0–1). Is the final answer factually correct, and does it satisfy all specified constraints, including temporal scope, geographic jurisdiction, entity type, and units of measurement? For tasks that are only quantitative in nature, accuracy is considered based on the difference between the ground truth and the agent’s response within an acceptable absolute or relative tolerance. For tasks with a combination of quantitative and qualitative components, a set of up to 5 expected attributes (specific numerical values, named entities, or conclusions) is extracted from the ground truth with an LLM judge and manually reviewed and updated as needed by human domain experts. Three different LLM judges (GPT-5-mini, Gemini-3.1-Flash-Lite, and DeepSeek V3.2) are then used separately to extract attributes from the agent’s answer and compare each of the extracted attributes with the expected attributes while considering the defined absolute (e.g., $\epsilon_{\mathrm{abs}} = \pm 2$) or relative tolerances (e.g., $\epsilon_{\mathrm{rel}} = \pm 10\,\%$) for numerical attributes. The score equals matched / total attributes, yielding a continuous value in $[0,1]$. This dimension captures failures where an agent retrieves valid data but for the wrong ISO, wrong time period, or wrong entity, as well as simple factual errors.
3. Source Validity (1–5). Are the data sources cited or implicitly relied upon real, appropriate, and accessible? This dimension penalizes hallucinated source citations, fabricated document version numbers, use of inappropriate or outdated sources, and failures to ground answers in tool-retrieved evidence where the task requires it.
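The matched / total computation in the second dimension can be illustrated with a small deterministic stand-in. In the paper, an LLM judge performs the extraction and comparison; the sketch below assumes the attributes have already been extracted into dictionaries and replaces the judge's comparison with a simple tolerance rule. The function names, attribute keys, and values are hypothetical.

```python
from typing import Dict, Optional, Union

Value = Union[int, float, str]


def attribute_matches(expected: Value, actual: Optional[Value],
                      abs_tol: Optional[float] = None,
                      rel_tol: Optional[float] = None) -> bool:
    """Return True if an extracted attribute matches the expected one."""
    if actual is None:
        return False                      # attribute missing from the agent's answer
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        diff = abs(actual - expected)
        if abs_tol is not None and diff <= abs_tol:
            return True
        if rel_tol is not None and diff <= rel_tol * abs(expected):
            return True
        return diff == 0                  # no tolerance satisfied: require exact equality
    # Non-numeric attributes: case-insensitive string comparison (a simplification).
    return str(expected).strip().lower() == str(actual).strip().lower()


def alignment_score(expected: Dict[str, Value], extracted: Dict[str, Value],
                    abs_tol: Optional[float] = None,
                    rel_tol: Optional[float] = None) -> float:
    """Score = matched attributes / total expected attributes, in [0, 1]."""
    if not expected:
        raise ValueError("at least one expected attribute is required")
    matched = sum(
        attribute_matches(value, extracted.get(key), abs_tol, rel_tol)
        for key, value in expected.items()
    )
    return matched / len(expected)


# Hypothetical ground-truth attributes and agent-answer attributes.
expected = {"peak_month": "August", "avg_price_usd_mwh": 50.0, "hours": 744}
extracted = {"peak_month": "august", "avg_price_usd_mwh": 54.0, "hours": 600}
score = alignment_score(expected, extracted, rel_tol=0.10)
```

With a 10% relative tolerance, the month and the average price match while the hour count does not, so two of the three expected attributes are matched.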
### 5.2 Category-Aware Rubric Routing
Rubric emphasis is adapted to each capability area, reflecting the differential importance of evaluation dimensions across task types. For *Data Retrieval and Analysis* and *Advanced Quantitative Modeling* tasks, Answer Accuracy is important as a measure of the agent’s correctness, since the outcomes of the agent’s analysis are expected to closely match the ground truth if all goes well. However, for *Knowledge Retrieval and Interpretation* tasks, Attribute Alignment provides a better measure of correctness because the ground truth will contain multiple attributes that the agent’s response is expected to match. Source Validity and Approach Correctness are critical for all the task categories as it is important to verify that the agent is arriving at the answers in a logical way and not via lucky hallucinations.
Rubrics are applied using three different judges, GPT-5-mini, Gemini-3.1-Flash-Lite, and DeepSeek V3.2, with access to human-annotated ground truths and attributes, full agent execution trace, and a structured scoring rubric (see Appendix [A.8](https://arxiv.org/html/2606.26346#A1.SS8)). The median of the three judges’ results is taken as the final score for each rubric. Three judges from different providers are considered to avoid bias arising from overreliance on a single judge from a particular provider. The LLM-as-a-judge approach is well-suited to the open-ended, multi-part responses characteristic of professional analytics tasks, which resist reduction to exact-match or template-based scoring \[[46](https://arxiv.org/html/2606.26346#bib.bib16)\]. Judge outputs include a numeric score on each dimension and a natural language justification, enabling qualitative decision audits.
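The median-of-three aggregation is simple to express in code. The sketch below uses Python's standard `statistics.median`; the judge labels, dimension keys, and scores are made-up placeholders rather than values from the paper.

```python
from statistics import median
from typing import Dict


def aggregate_judges(scores_by_judge: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Take the per-dimension median across judges."""
    dimensions = next(iter(scores_by_judge.values())).keys()
    return {
        dim: median(scores[dim] for scores in scores_by_judge.values())
        for dim in dimensions
    }


# Placeholder scores for a single agent response.
judge_scores = {
    "judge_a": {"approach": 4, "accuracy": 0.6, "source_validity": 2},
    "judge_b": {"approach": 5, "accuracy": 0.8, "source_validity": 3},
    "judge_c": {"approach": 4, "accuracy": 0.4, "source_validity": 5},
}
final_scores = aggregate_judges(judge_scores)
```

With an odd number of judges, the median is always one of the scores actually assigned, so a single outlying judge (such as the 5 for source validity above) cannot move the final score on its own.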
### 5.3 Reported Metrics
The main reported metrics are as follows\.
- Overall class-balanced means for each dimension with confidence intervals: $\bar{a}_{m}$, $\bar{c}_{m}$, $\bar{v}_{m}$, aggregated across the 243 tasks considering each category and difficulty level combination. For each model $m$ and question $q$, the judge produces scores on three dimensions: Approach Correctness $a_{m,q} \in [1,5]$, Answer Accuracy $c_{m,q} \in [0,1]$, and Source Validity $v_{m,q} \in [1,5]$. We report class-balanced scores (i.e., weighted average score per category and difficulty level with equal weighting applied to each category-difficulty pair) per dimension, independently without aggregation into a composite score, along with their confidence intervals. Each dimension captures a distinct failure mode, and collapsing them into a single number would obscure the performance profiles that are important to observe. Also, the class-balanced scores account for the different total number of questions in each category. See Appendix [A.9](https://arxiv.org/html/2606.26346#A1.SS9), [A.10](https://arxiv.org/html/2606.26346#A1.SS10), and [A.11](https://arxiv.org/html/2606.26346#A1.SS11) for a breakdown by capability areas.
- Efficiency and cost metrics: Includes total tokens per question, tool calls, and cost per question. These are reported as simple averages. The cost is estimated using input, output, and cached tokens without changes to the default caching behavior of the models. Latency is excluded because network round-trip times vary per provider API and are not a model-capability metric.
- Task failure rate: Reflects the percentage of tasks that failed based on four failure mode definitions: maxed-out iterations, context window limitations, missing final responses, and clarification requests. This is also reported as a class-balanced metric to avoid placing too much weight on categories with more but easier questions. Also, this failure mode definition focuses on tasks where useful responses were not returned. In subsequent iterations, we will also consider tasks with sub-par responses based on defined thresholds.
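One plausible reading of the class-balanced mean described above is a mean of per-cell means, where each cell is a (category, difficulty) pair. The sketch below follows that reading; it is an assumption about the aggregation rather than the released scoring code, it omits the confidence intervals (the text does not say how they are computed), and the records are invented to show the effect of balancing.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, Tuple


def class_balanced_mean(records: Iterable[Tuple[str, str, float]]) -> float:
    """Average the per-cell means, giving each (category, difficulty) cell equal weight."""
    cells = defaultdict(list)
    for category, difficulty, score in records:
        cells[(category, difficulty)].append(score)
    if not cells:
        raise ValueError("no records to aggregate")
    return mean([mean(scores) for scores in cells.values()])


# Invented example: a large easy cell with perfect scores and a small hard cell with none.
records = [
    ("Data", "Easy", 1.0),
    ("Data", "Easy", 1.0),
    ("Data", "Easy", 1.0),
    ("Data", "Easy", 1.0),
    ("Quant", "Hard", 0.0),
]
simple = mean([score for _, _, score in records])
balanced = class_balanced_mean(records)
```

Here the simple average is 0.8 while the class-balanced mean is 0.5, which shows how balancing keeps a large, easy cell from dominating the headline number. The same function applies to the failure rate if each record carries 1.0 for a failed task and 0.0 otherwise.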
## 6 Results and Analysis
### 6.1 Overall Results
Table 1: Evaluation Metrics Across Models

| Model | Accuracy | Approach | Source Validity | Tokens | Tool Calls | Cost Est. (\$) | Failure Rate (%) |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 0.56 ± 0.05 | 3.94 ± 0.11 | 2.72 ± 0.23 | 266k | 7.5 | 0.86 | 2.6 |
| Qwen3-Max-Thinking | 0.44 ± 0.05 | 3.74 ± 0.16 | 2.24 ± 0.16 | 290k | 7.8 | 0.36 | 2.7 |
| DeepSeek V3.2 | 0.43 ± 0.05 | 3.42 ± 0.12 | 2.16 ± 0.20 | 491k | 13.4 | 0.08 | 20.2 |
| Kimi-K2.5 | 0.49 ± 0.05 | 3.70 ± 0.13 | 2.63 ± 0.20 | 453k | 11.4 | 0.07 | 16.5 |
| Gemini-3.1-Pro | 0.62 ± 0.07 | 3.74 ± 0.16 | 2.77 ± 0.30 | 368k | 6.7 | 0.23 | 3.2 |
| GPT-5-mini | 0.38 ± 0.06 | 3.39 ± 0.15 | 3.11 ± 0.33 | 101k | 3.7 | 0.01 | 7.5 |
| GPT-5.2 | 0.57 ± 0.06 | 3.69 ± 0.17 | 4.02 ± 0.28 | 191k | 7.3 | 0.12 | 0.8 |
Table [1](https://arxiv.org/html/2606.26346#S6.T1) provides a summary of the class-balanced evaluation metrics across all capability areas and difficulty stratifications for the seven frontier models considered. A breakdown of the class-balanced metrics for each capability area is included in the Appendix section (see Appendix [A.9](https://arxiv.org/html/2606.26346#A1.SS9), [A.10](https://arxiv.org/html/2606.26346#A1.SS10), and [A.11](https://arxiv.org/html/2606.26346#A1.SS11)). Figure [1](https://arxiv.org/html/2606.26346#S6.F1) shows the distribution of tool usage across the different models. The key takeaways from Table [1](https://arxiv.org/html/2606.26346#S6.T1) and Figure [1](https://arxiv.org/html/2606.26346#S6.F1) are as follows.
\(a\)Closed\-source models
\(b\)Open\-source models
Figure 1:Tool\-use distribution across closed\-source and open\-source models\.1. 1\.Closed\-source models lead, but overall performance remains unsaturated\.Considering the accuracy results \(range: 0 to 1\) in Table[1](https://arxiv.org/html/2606.26346#S6.T1), Gemini\-3\.1\-Pro, GPT\-5\.2, and Claude Sonnet 4\.6 have the best performance values—62%, 57%, and 56%, respectively\. The best\-performing open\-source model is Kimi\-K2\.5 with a score of 49%, which is 7 percentage points lower than Claude Sonnet 4\.6\. However, Kimi\-K2\.5 achieved this performance at 58%, 31%, and 8% of the costs of GPT\-5\.2, Gemini\-3\.1\-Pro, and Claude Sonnet 4\.6, respectively\. The best\-performing model, based on accuracy, still shows a 38% improvement margin, suggesting that domain expertise will continue to play an important role in designing agentic systems with high\-accuracy guarantees for energy\-domain applications\. It is worth noting that Qwen3\-Max\-Thinking is a cost outlier among open\-source models due to its unified reasoning architecture, which defaults to extended chain\-of\-thought generation, thereby incurring substantially higher token costs than standard open\-source models\.
2. 2\.Models exhibit a planning–execution gap in agentic tasks\.The Approach results \(range: 0 to 5\) from Table[1](https://arxiv.org/html/2606.26346#S6.T1)show that both closed\-source and open\-source models generally perform reasonably well\. GPT\-5\-mini, which has the lowest performance, has a score equivalent to 68% \(i\.e\., 3\.39/5\) compared to a maximum accuracy of 62%\. These results further corroborate the observation that expert guidance remains important in designing agentic systems for energy\-domain applications\. While both open and closed models are capable of proposing reasonable analytical approaches, reliable execution requires domain\-specific knowledge to structure workflows, ensure correct dataset usage, and define appropriate validation criteria\.
3. Source attribution is not an emergent behavior across current frontier models\. The Source Validity scores \(range: 0 to 5\) in Table [1](https://arxiv.org/html/2606.26346#S6.T1) are generally low across both closed\-source and open\-source models, although OpenAI models tend to attribute sources better\. Scores are low because the models do not always include clear source links by default, which could hinder reproducibility and limit trust, both of which are particularly important in this context\. Explicit prompting with expert guidance will be required to achieve the desired source referencing behavior\.
4. Context window size has a significant impact on task completion and accuracy\. The models with the highest failure rates are DeepSeek\-V3\.2, Kimi\-K2\.5, and GPT\-5\-mini, as shown in Table [1](https://arxiv.org/html/2606.26346#S6.T1)\. The context windows for DeepSeek\-V3\.2 and Kimi\-K2\.5 are 163k and 262k tokens, respectively, compared to at least 400k tokens for the GPT\-5\.2, Gemini\-3\.1\-Pro, and Claude Sonnet 4\.6 models \[[13](https://arxiv.org/html/2606.26346#bib.bib54),[29](https://arxiv.org/html/2606.26346#bib.bib55),[17](https://arxiv.org/html/2606.26346#bib.bib57),[3](https://arxiv.org/html/2606.26346#bib.bib56)\]\. These shorter context windows could be a limitation, especially for knowledge retrieval tasks that require processing large amounts of information and for multi\-step problems that need longer contexts to preserve information from each step\. Qwen3\-Max\-Thinking also has a relatively short context window \(256k tokens\), yet its failure rate is much lower \(2\.7%\)\. Its overall accuracy score is nevertheless significantly lower \(0\.44\), implying that tasks that did not fail under the four failure mode definitions \(maxed\-out iterations, context window limitations, missing final answers, and clarification requests\) still produced low\-accuracy outcomes\. GPT\-5\-mini’s failures are largely due to excessive requests for additional clarification, which violate the instructions in the system prompt \(see Appendix [A\.8](https://arxiv.org/html/2606.26346#A1.SS8)\)\.
5. Higher token usage is not always correlated with better performance\. In Table [1](https://arxiv.org/html/2606.26346#S6.T1), some of the lower\-performing models \(DeepSeek\-V3\.2 and Kimi\-K2\.5\) also had the highest average token usage \(494k and 420k, respectively\)\. Conversely, Gemini\-3\.1\-Pro, the best performer, uses more tokens on average than the other top\-performing models\. Average token usage alone is therefore not a reliable indicator of superior or inferior performance\.
6. Tool selection bias appears generally similar between open\-source and closed\-source models\. The tool usage distribution charts \(Figure [1](https://arxiv.org/html/2606.26346#S6.F1)\) show "run\_query" as the dominant tool across the models, because many tasks require retrieving data from the energy markets database\. Excluding "run\_query", both open\-source and closed\-source models appear to rely mainly on running Python code to execute tasks\. However, DeepSeek\-V3\.2 appears to be more retrieval\-heavy\.
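The arithmetic behind findings 1 and 2 can be reproduced in a few lines\. The sketch below uses only the figures quoted in the findings above \(not the full Table 1\); the variable and function names are illustrative and do not come from the released code\.

```python
# Figures quoted in the findings above; names are illustrative only.
best_accuracy = 0.62      # Gemini-3.1-Pro, accuracy on a 0-1 scale
claude_accuracy = 0.56    # Claude Sonnet 4.6
kimi_accuracy = 0.49      # Kimi-K2.5, best open-source model
lowest_approach = 3.39    # GPT-5-mini, Approach on a 0-5 scale


def to_percent(score: float, scale_max: float) -> float:
    """Normalize a score to a percentage of its scale maximum."""
    return round(100 * score / scale_max, 1)


# Distance of the best model from a perfect accuracy score, in percentage points.
headroom_pts = round(100 * (1 - best_accuracy))
# Gap between the best open-source model and Claude Sonnet 4.6, in percentage points.
open_closed_gap_pts = round(100 * (claude_accuracy - kimi_accuracy))
# Weakest Approach score expressed as a percentage of the 5-point scale.
lowest_approach_pct = to_percent(lowest_approach, 5)

print(headroom_pts, open_closed_gap_pts, lowest_approach_pct)
# Planning-execution gap: the weakest plan quality still exceeds the best accuracy.
print(lowest_approach_pct > 100 * best_accuracy)
```

Comparing a normalized Approach score with an accuracy score is only indicative, since the two metrics are judged with different rubrics\.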
### 6\.2Performance with and without sources specified
As highlighted in the benchmark dataset design philosophy section, paired prompt construction is employed\. This provides the basis for measuring the impact of explicitly naming the sources or tools to use in each question\. Table [2](https://arxiv.org/html/2606.26346#S6.T2) shows the class\-balanced accuracy and source validity metrics for a subset of 61 tasks \(see Appendix [A\.14](https://arxiv.org/html/2606.26346#A1.SS14) for the task IDs\) that have clear counterparts with and without sources\. The table shows that including sources does not always improve accuracy: four of the seven models score slightly higher without sources, and the reported ± ranges overlap for every model\. This can be explained by a combination of the models’ reasoning capabilities and the nature of the questions and tools provided, which makes it likely that the models take similar steps whether or not sources are specified\. Figure [2](https://arxiv.org/html/2606.26346#S6.F2) corroborates this, as the distribution of tool usage for with\-source and without\-source questions is practically the same, except for an uptick in "search\_web" usage when sources are not specified\. A step\-by\-step view of the traces across the seven models for a pair of questions \(see Appendix [A\.15](https://arxiv.org/html/2606.26346#A1.SS15)\) also confirms this\. Source validity, however, improves materially for questions with sources, as expected\.
Table 2: Evaluation Metrics With vs Without Sources

| Model | Accuracy \(with sources\) | Source Validity \(with sources\) | Accuracy \(without sources\) | Source Validity \(without sources\) |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4\.6 | 0\.68 ± 0\.11 | 3\.58 ± 0\.37 | 0\.72 ± 0\.11 | 1\.92 ± 0\.36 |
| Qwen3\-Max\-Thinking | 0\.59 ± 0\.10 | 2\.88 ± 0\.34 | 0\.63 ± 0\.11 | 1\.82 ± 0\.35 |
| DeepSeek V3\.2 | 0\.57 ± 0\.11 | 2\.80 ± 0\.33 | 0\.47 ± 0\.09 | 1\.88 ± 0\.39 |
| Kimi\-K2\.5 | 0\.66 ± 0\.10 | 3\.59 ± 0\.35 | 0\.62 ± 0\.11 | 2\.51 ± 0\.33 |
| Gemini\-3\.1\-Pro | 0\.69 ± 0\.09 | 3\.88 ± 0\.31 | 0\.71 ± 0\.09 | 1\.84 ± 0\.30 |
| GPT\-5\-mini | 0\.47 ± 0\.10 | 3\.42 ± 0\.43 | 0\.49 ± 0\.12 | 2\.23 ± 0\.46 |
| GPT\-5\.2 | 0\.71 ± 0\.10 | 4\.75 ± 0\.14 | 0\.70 ± 0\.10 | 3\.32 ± 0\.46 |
Figure 2:Tool use distribution without sources vs with sources specified \(n represents 61 questions across 7 models\)
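To make the with\-source versus without\-source comparison concrete, the following sketch computes per\-model differences from the point estimates in Table 2\. It ignores the ± ranges, so it is a reading aid and not a significance test; the dictionary layout and the `deltas` helper are assumptions made for illustration\.

```python
# (accuracy, source validity) point estimates copied from Table 2.
TABLE_2 = {
    "Claude Sonnet 4.6":  {"with": (0.68, 3.58), "without": (0.72, 1.92)},
    "Qwen3-Max-Thinking": {"with": (0.59, 2.88), "without": (0.63, 1.82)},
    "DeepSeek V3.2":      {"with": (0.57, 2.80), "without": (0.47, 1.88)},
    "Kimi-K2.5":          {"with": (0.66, 3.59), "without": (0.62, 2.51)},
    "Gemini-3.1-Pro":     {"with": (0.69, 3.88), "without": (0.71, 1.84)},
    "GPT-5-mini":         {"with": (0.47, 3.42), "without": (0.49, 2.23)},
    "GPT-5.2":            {"with": (0.71, 4.75), "without": (0.70, 3.32)},
}


def deltas(row):
    """Return (accuracy delta, source validity delta), computed as with minus without."""
    acc_with, sv_with = row["with"]
    acc_without, sv_without = row["without"]
    return round(acc_with - acc_without, 2), round(sv_with - sv_without, 2)


source_deltas = {model: deltas(row) for model, row in TABLE_2.items()}

for model, (d_acc, d_sv) in source_deltas.items():
    print(f"{model:<20} accuracy {d_acc:+.2f}   source validity {d_sv:+.2f}")

# Models whose accuracy point estimate is higher when sources are NOT specified.
higher_without = [m for m, (d_acc, _) in source_deltas.items() if d_acc < 0]
print(len(higher_without), "of", len(TABLE_2), "models score higher without sources")
```

The source validity delta is positive for every model, while the accuracy delta changes sign from model to model, which matches the discussion above\.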
### 6\.3Performance with and without selected domain\-specific tools
A subset of 30 tasks \(see Appendix [A\.14](https://arxiv.org/html/2606.26346#A1.SS14) for the task IDs\) was selected from the overall dataset to evaluate performance without domain\-specific tools\. These 30 tasks are among the most challenging in the dataset and are at Medium and Hard difficulty levels across the three capability areas\. The Accuracy and Approach metrics are shown in Table [3](https://arxiv.org/html/2606.26346#S6.T3)\. Accuracy is higher with domain tools for every model considered, and for Qwen3\-Max\-Thinking, DeepSeek V3\.2, and Kimi\-K2\.5 it at least doubles\. Approach scores also improve with tools for every model except Kimi\-K2\.5 \(2\.85 with tools vs 2\.98 without\)\. These results emphasize the importance of domain tools in agentic applications for the energy domain\.
Table 3: Evaluation Results With and Without Domain\-Specific Tools

| Model | Accuracy \(with tools\) | Approach \(with tools\) | Accuracy \(without tools\) | Approach \(without tools\) |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4\.6 | 0\.54 ± 0\.16 | 3\.88 ± 0\.23 | 0\.33 ± 0\.13 | 3\.53 ± 0\.26 |
| Qwen3\-Max\-Thinking | 0\.28 ± 0\.11 | 3\.60 ± 0\.34 | 0\.11 ± 0\.08 | 2\.78 ± 0\.34 |
| DeepSeek V3\.2 | 0\.25 ± 0\.10 | 3\.23 ± 0\.33 | 0\.12 ± 0\.08 | 2\.53 ± 0\.28 |
| Kimi\-K2\.5 | 0\.28 ± 0\.11 | 2\.85 ± 0\.41 | 0\.14 ± 0\.08 | 2\.98 ± 0\.33 |
| Gemini\-3\.1\-Pro | 0\.38 ± 0\.15 | 3\.55 ± 0\.31 | 0\.25 ± 0\.12 | 2\.55 ± 0\.34 |
| GPT\-5\-mini | 0\.18 ± 0\.10 | 2\.58 ± 0\.36 | 0\.11 ± 0\.08 | 2\.18 ± 0\.28 |
| GPT\-5\.2 | 0\.48 ± 0\.15 | 3\.53 ± 0\.23 | 0\.25 ± 0\.12 | 3\.00 ± 0\.43 |
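The size of the tool effect can be read off Table 3 as a ratio of accuracy with tools to accuracy without tools\. The sketch below does this with the point estimates only \(the ± ranges are ignored\); the `uplift` helper and the variable names are illustrative\.

```python
# Accuracy point estimates (with tools, without tools) copied from Table 3.
ACCURACY = {
    "Claude Sonnet 4.6":  (0.54, 0.33),
    "Qwen3-Max-Thinking": (0.28, 0.11),
    "DeepSeek V3.2":      (0.25, 0.12),
    "Kimi-K2.5":          (0.28, 0.14),
    "Gemini-3.1-Pro":     (0.38, 0.25),
    "GPT-5-mini":         (0.18, 0.11),
    "GPT-5.2":            (0.48, 0.25),
}


def uplift(with_tools: float, without_tools: float) -> float:
    """Ratio of accuracy with domain tools to accuracy without them."""
    return round(with_tools / without_tools, 2)


ratios = {model: uplift(*pair) for model, pair in ACCURACY.items()}
at_least_doubled = sorted(m for m, r in ratios.items() if r >= 2.0)

for model, ratio in ratios.items():
    print(f"{model:<20} {ratio:.2f}x")
print("At least doubled:", at_least_doubled)
```

Because the without\-tools accuracies are small and the subset has only 30 tasks, these ratios are sensitive to a handful of tasks and should be read as rough indications\.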
## 7Conclusion
The preceding sections examined the performance of tool\-augmented LLM agents on real\-world energy analytics tasks\. The results show that while these agents can tackle some real\-world energy analytics tasks, expert guidance is still required to improve the quality of their outputs\. Domain expert guidance is what takes the performance of these agents from generally good to exceptionally insightful\.
We also note some limitations of this first iteration of EnergyEvals, which future iterations will address as follows\.

1\. Regional and sub\-domain expansion\. The dataset in this first iteration of EnergyEvals focuses on the US and on tasks relating to electricity markets\. While this represents a good starting point, energy analytics is global in scope and goes beyond electricity\-related tasks\. Subsequent iterations of EnergyEvals will include tasks covering new regions and other energy sub\-domains\.

2\. Performance under high reasoning model configurations\. The models considered in this iteration use low reasoning configurations to evaluate performance under resource\-constrained scenarios\. However, it is possible that other reasoning configuration levels can improve performance\. We will investigate this in subsequent iterations\.
## References
- \[1\]F\. Amjad, T\. Korõtko, and A\. Rosin\(2025\)Review of llms applications in electrical power and energy systems\.IEEE Access13\(\),pp\. 150951–150969\.External Links:[Document](https://dx.doi.org/10.1109/ACCESS.2025.3599922)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p3.1)\.
- \[2\]Anthropic\(2024\)The Claude 3 model family: Opus, Sonnet, Haiku\.Technical reportAnthropic\.External Links:[Link](https://api.semanticscholar.org/CorpusID:268232499)Cited by:[§1](https://arxiv.org/html/2606.26346#S1.p1.1)\.
- \[3\]Anthropic\(2026\)Claude sonnet\.Note:Anthropic model page describing the Claude Sonnet family of models\. Accessed: 2026\-03\-24External Links:[Link](https://www.anthropic.com/claude/sonnet)Cited by:[item 4](https://arxiv.org/html/2606.26346#S6.I1.i4.p1.1)\.
- \[4\]G\. Authors\(2025\-11\)GridMind: llms\-powered agents for power system analysis and operations\.InProceedings of the SC ’25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis,External Links:[Document](https://dx.doi.org/10.1145/3731599.3767409),[Link](https://doi.org/)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p3.1)\.
- \[5\]E\. O\. Badmus and A\. Pandey\(2026\)PowerDAG: reliable agentic ai system for automating distribution grid analysis\.External Links:2603\.17418,[Link](https://arxiv.org/abs/2603.17418)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p3.1)\.
- \[6\]E\. O\. Badmus, P\. Sang, D\. Stamoulis, and A\. Pandey\(2025\)PowerChain: a verifiable agentic ai system for automating distribution grid analyses\.External Links:2508\.17094,[Link](https://arxiv.org/abs/2508.17094)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p3.1)\.
- \[7\]Y\. Bai, J\. Ying, Y\. Cao, X\. Lv, Y\. He, X\. Wang, J\. Yu, K\. Zeng, Y\. Xiao, H\. Lyu, J\. Zhang, J\. Li, and L\. Hou\(2023\)Benchmarking foundation models with language\-model\-as\-an\-examiner\.InProceedings of the 37th International Conference on Neural Information Processing Systems,NIPS ’23,Red Hook, NY, USA\.External Links:[Link](https://doi.org/10.48550/arXiv.2306.04181)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p4.1)\.
- \[8\]A\. Bigeard, L\. Nashold, R\. Krishnan, and S\. Wu\(2025\)Finance agent benchmark: benchmarking llms on real\-world financial research tasks\.External Links:2508\.00828,[Link](https://arxiv.org/abs/2508.00828)Cited by:[§1](https://arxiv.org/html/2606.26346#S1.p1.1),[§2](https://arxiv.org/html/2606.26346#S2.p2.1)\.
- \[9\]J\. Bragg, M\. D’Arcy, N\. Balepur, D\. Bareket, B\. Dalvi, S\. Feldman, D\. Haddad, J\. D\. Hwang, P\. Jansen, V\. Kishore, B\. P\. Majumder, A\. Naik, S\. Rahamimov, K\. Richardson, A\. Singh, H\. Surana, A\. Tiktinsky, R\. Vasu, G\. Wiener, C\. Anastasiades, S\. Candra, J\. Dunkelberger, D\. Emery, R\. Evans, M\. Hamada, R\. Huff, R\. Kinney, M\. Latzke, J\. Lochner, R\. Lozano\-Aguilera, C\. Nguyen, S\. Rao, A\. Tanaka, B\. Vlahos, P\. Clark, D\. Downey, Y\. Goldberg, A\. Sabharwal, and D\. S\. Weld\(2024\)AstaBench: rigorous benchmarking of AI agents with a scientific research suite\.External Links:[Link](https://arxiv.org/abs/2510.21652)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p1.1)\.
- \[10\]A\. M\. Bran, S\. Cox, O\. Schilter, C\. Baldassari, A\. D\. White, and P\. Schwaller\(2023\)ChemCrow: augmenting large language models with chemistry tools\.External Links:[Link](https://arxiv.org/abs/2304.05376)Cited by:[§1](https://arxiv.org/html/2606.26346#S1.p1.1)\.
- \[11\]Z\. Chen, S\. Chen, Y\. Ning, Q\. Zhang, B\. Wang, B\. Yu, Y\. Li, Z\. Liao, C\. Wei, Z\. Lu, V\. Dey, M\. Xue, F\. N\. Baker, B\. Burns, D\. Adu\-Ampratwum, X\. Huang, X\. Ning, S\. Gao, Y\. Su, and H\. Sun\(2025\)ScienceAgentBench: toward rigorous assessment of language agents for data\-driven scientific discovery\.InInternational Conference on Learning Representations \(ICLR\),External Links:[Link](https://arxiv.org/abs/2410.05080)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p2.1)\.
- \[12\]D\. Cheng, S\. Huang, Y\. Gu, H\. Song, G\. Chen, L\. Dong, W\. X\. Zhao, J\. Wen, and F\. Wei\(2026\)LLM\-in\-Sandbox elicits general agentic intelligence\.External Links:[Link](https://arxiv.org/abs/2601.16206)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p4.1)\.
- \[13\]DeepInfra\(2026\)DeepInfra models library\.Note:Accessed: 2026\-03\-24External Links:[Link](https://deepinfra.com/models)Cited by:[§A\.3](https://arxiv.org/html/2606.26346#A1.SS3.p1.1),[item 4](https://arxiv.org/html/2606.26346#S6.I1.i4.p1.1)\.
- \[14\]Y\. Dubois, C\. X\. Li, R\. Taori, T\. Zhang, I\. Gulrajani, J\. Ba, C\. Guestrin, P\. Liang, and T\. B\. Hashimoto\(2023\)AlpacaFarm: a simulation framework for methods that learn from human feedback\.arXiv preprint arXiv:2305\.14387\.External Links:[Link](https://arxiv.org/abs/2305.14387)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p4.1)\.
- \[15\]Electric Power Research Institute \(EPRI\)\(2025\)Benchmarking large language models for the electric power sector\.White PaperTechnical Report3002034347,Electric Power Research Institute,Palo Alto, CA\.Cited by:[§1](https://arxiv.org/html/2606.26346#S1.p2.1),[§2](https://arxiv.org/html/2606.26346#S2.p3.1)\.
- \[16\]Google DeepMind\(2024\)Gemini: a family of highly capable multimodal models\.Technical reportGoogle DeepMind\.External Links:[Link](https://arxiv.org/abs/2312.11805)Cited by:[§1](https://arxiv.org/html/2606.26346#S1.p1.1)\.
- \[17\]Google DeepMind\(2026\)Gemini pro\.Note:Model page describing the Gemini Pro family of multimodal AI models\. Accessed: 2026\-03\-24External Links:[Link](https://deepmind.google/models/gemini/pro/)Cited by:[item 4](https://arxiv.org/html/2606.26346#S6.I1.i4.p1.1)\.
- \[18\]N\. Guha, J\. Nyarko, D\. E\. Ho, C\. Ré, A\. Chilton, A\. Narayana, A\. Chohlas\-Wood, A\. Peters, B\. Waldon, D\. N\. Rockmore, D\. Zambrano, D\. Talisman, E\. Hoque, F\. Surani, F\. Fagan, G\. Sarfaty, G\. M\. Dickinson, H\. Porat, J\. Hegland, J\. Wu, J\. Nudell, J\. Niklaus, J\. Nay, J\. H\. Choi, K\. Tobia, M\. Hagan, M\. Ma, M\. Livermore, N\. Rasumov\-Rahe, N\. Holzenberger, N\. Kolt, P\. Henderson, S\. Rehaag, S\. Goel, S\. Gao, S\. Williams, S\. Gandhi, T\. Zur, V\. Iyer, and Z\. Li\(2023\)LegalBench: a collaboratively built benchmark for measuring legal reasoning in large language models\.InAdvances in Neural Information Processing Systems \(NeurIPS\),External Links:[Link](https://arxiv.org/abs/2308.11462)Cited by:[§1](https://arxiv.org/html/2606.26346#S1.p1.1)\.
- \[19\]T\. Hong and S\. Fan\(2016\)Probabilistic electric load forecasting: a tutorial review\.International Journal of Forecasting32\(3\),pp\. 914–938\.External Links:[Document](https://dx.doi.org/10.1016/j.ijforecast.2015.11.011)Cited by:[§1](https://arxiv.org/html/2606.26346#S1.p2.1),[§2](https://arxiv.org/html/2606.26346#S2.p3.1)\.
- \[20\]Y\. Jiang, K\. C\. Black, G\. Geng, D\. Park, J\. Zou, A\. Y\. Ng, and J\. H\. Chen\(2025\)MedAgentBench: a realistic virtual EHR environment to benchmark medical LLM agents\.NEJM AI\.External Links:[Link](https://arxiv.org/abs/2501.14654)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p2.1)\.
- \[21\]C\. E\. Jiménez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. Narasimhan\(2024\)SWE\-bench: can language models resolve real\-world GitHub issues?\.InInternational Conference on Learning Representations \(ICLR\),External Links:[Link](https://arxiv.org/abs/2310.06770)Cited by:[§1](https://arxiv.org/html/2606.26346#S1.p1.1),[§2](https://arxiv.org/html/2606.26346#S2.p1.1)\.
- \[22\]H\. Li, J\. Chen, J\. Yang, Q\. Ai, W\. Jia, Y\. Liu, K\. Lin, Y\. Wu, G\. Yuan, Y\. Hu, W\. Wang, Y\. Liu, and M\. Huang\(2025\)LegalAgentBench: evaluating LLM agents in legal domain\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(ACL\), Volume 1: Long Papers,Vienna, Austria,pp\. 2322–2344\.External Links:[Link](https://aclanthology.org/2025.acl-long.116)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p2.1)\.
- \[23\]H\. Li, L\. Zhu, B\. Zhang, R\. Feng, J\. Wang, Y\. Pan, E\. T\. Barr, F\. Sarro, Z\. Chu, and H\. Ye\(2026\)ContextBench: a benchmark for context retrieval in coding agents\.External Links:[Link](https://arxiv.org/abs/2602.05892)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p2.1)\.
- \[24\]H\. Li, Y\. Cao, Y\. Yu, S\. R\. Javaji, Z\. Deng, Y\. He, Y\. Jiang, Z\. Zhu, K\. Subbalakshmi, G\. Xiong, J\. Huang, L\. Qian, X\. Peng, Q\. Xie, and J\. W\. Suchow\(2025\)InvestorBench: a benchmark for financial decision\-making tasks with LLM\-based agents\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(ACL\), Volume 1: Long Papers,Vienna, Austria\.External Links:[Link](https://aclanthology.org/2025.acl-long.126)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p2.1)\.
- \[25\]X\. Li, W\. Chen, Y\. Liu, S\. Zheng, X\. Chen, Y\. He, Y\. Li, B\. You, H\. Shen, J\. Sun, S\. Wang, B\. Li, Q\. Zeng, D\. Wang, X\. Zhao, Y\. Wang, R\. Ben Chaim, Z\. Di, Y\. Gao, J\. He, Y\. He, L\. Jing, L\. Kong, X\. Lan, J\. Li, S\. Li, Y\. Li, Y\. Lin, X\. Liu, X\. Liu, H\. Lyu, Z\. Ma, B\. Wang, R\. Wang, T\. Wang, W\. Ye, Y\. Zhang, H\. Xing, Y\. Xue, S\. Dillmann, and H\. Lee\(2026\)SkillsBench: benchmarking how well agent skills work across diverse tasks\.Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p2.1)\.
- \[26\]X\. Liu, H\. Yu, H\. Zhang, Y\. Xu, X\. Lei, H\. Lai, Y\. Gu, H\. Ding, K\. Men, K\. Yang, S\. Zhang, X\. Deng, A\. Zeng, Z\. Du, C\. Zhang, S\. Shen, T\. Shen, Y\. Dong, J\. Tang, and Y\. LeCun\(2023\)AgentBench: evaluating LLMs as agents\.arXiv preprint arXiv:2308\.03688\.External Links:[Link](https://arxiv.org/abs/2308.03688)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p1.1)\.
- \[27\]G\. Mialon, C\. Fourrier, C\. Swift, T\. Wolf, Y\. LeCun, and T\. Scialom\(2024\)GAIA: a benchmark for general AI assistants\.InInternational Conference on Learning Representations \(ICLR\),External Links:[Link](https://arxiv.org/abs/2311.12983)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p1.1)\.
- \[28\]OpenAI\(2024\)GPT\-4 technical report\.Technical reportOpenAI\.External Links:[Link](https://arxiv.org/abs/2303.08774)Cited by:[§1](https://arxiv.org/html/2606.26346#S1.p1.1)\.
- \[29\]OpenAI\(2026\)Models\.Note:[https://developers\.openai\.com/api/docs/models](https://developers.openai.com/api/docs/models)OpenAI API Documentation\. Accessed: 2026\-03\-24Cited by:[item 4](https://arxiv.org/html/2606.26346#S6.I1.i4.p1.1)\.
- \[30\]S\. Pfenninger and I\. Staffell\(2016\)Long\-term patterns of european PV output using 30 years of validated hourly reanalysis and satellite data\.Energy114,pp\. 1251–1265\.External Links:[Document](https://dx.doi.org/10.1016/j.energy.2016.08.060)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p3.1)\.
- \[31\]S\. S\. Polagani\(2025\)AI agents for smart grid operations and renewable energy management\.Iconic Research and Engineering Journals8\(11\),pp\. 1278–1292\.External Links:[Link](https://www.irejournals.com/formatedpaper/1708600.pdf)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p3.1)\.
- \[32\]Y\. Qin, S\. Liang, Y\. Ye, K\. Zhu, L\. Yan, Y\. Lu, Y\. Lin, X\. Cong, X\. Tang, B\. Qian, S\. Zhao, L\. Hong, R\. Tian, R\. Xie, J\. Zhou, M\. Gerstein, D\. Li, Z\. Liu, and M\. Sun\(2024\)ToolLLM: facilitating large language models to master 16000\+ real\-world APIs\.InInternational Conference on Learning Representations \(ICLR\),External Links:[Link](https://arxiv.org/abs/2307.16789)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p1.1)\.
- \[33\]T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom\(2023\)Toolformer: language models can teach themselves to use tools\.InAdvances in Neural Information Processing Systems \(NeurIPS\),External Links:[Link](https://arxiv.org/abs/2302.04761)Cited by:[§1](https://arxiv.org/html/2606.26346#S1.p1.1),[§2](https://arxiv.org/html/2606.26346#S2.p4.1)\.
- \[34\]Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p3.1)\.
- \[35\]G\. Starace, O\. Jaffe, D\. Sherburn, J\. Aung, J\. S\. Chan, L\. Maksin, R\. Dias, E\. Mays, B\. Kinsella, W\. Thompson, J\. Heidecke, A\. Glaese, and T\. Patwardhan\(2025\)PaperBench: evaluating AI’s ability to replicate AI research\.External Links:2504\.01848,[Link](https://arxiv.org/abs/2504.01848)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p2.1)\.
- \[36\]H\. Vishwakarma, A\. Agarwal, O\. Patil, C\. Devaguptapu, and M\. Chandran\(2025\)Can LLMs help you at work? A sandbox for evaluating LLM agents in enterprise environments\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),External Links:[Link](https://arxiv.org/abs/2510.27287)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p2.1)\.
- \[37\]W\. Wang, D\. Han, D\. Madrigal Díaz, J\. Xu, V\. Rühle, and S\. Rajmohan\(2025\)OdysseyBench: evaluating LLM agents on long\-horizon complex office application workflows\.External Links:[Link](https://arxiv.org/abs/2508.09124)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p2.1)\.
- \[38\]R\. Weron\(2014\)Electricity price forecasting: a review of the state\-of\-the\-art with a look into the future\.International Journal of Forecasting30\(4\),pp\. 1030–1081\.External Links:[Document](https://dx.doi.org/10.1016/j.ijforecast.2014.08.008)Cited by:[§1](https://arxiv.org/html/2606.26346#S1.p2.1),[§2](https://arxiv.org/html/2606.26346#S2.p3.1)\.
- \[39\]M\. Wornow, V\. Garodia, and V\. Vassalos\(2025\)Top of the CLASS: benchmarking LLM agents on real\-world enterprise tasks\.InICLR 2025 Workshop on Building Trust in LLMs and LLM Applications,External Links:[Link](https://openreview.net/forum?id=RQjUpeINII)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p2.1)\.
- \[40\]F\. F\. Xu, Y\. Song, B\. Li, Y\. Tang, K\. Jain, M\. Bao, Z\. Z\. Wang, X\. Zhou, Z\. Guo, M\. Cao, M\. Yang, H\. Y\. Lu, A\. Martin, Z\. Su, L\. Maben, R\. Mehta, W\. Chi, L\. Jang, Y\. Xie, S\. Zhou, and G\. Neubig\(2025\)TheAgentCompany: benchmarking LLM agents on consequential real world tasks\.InAdvances in Neural Information Processing Systems \(NeurIPS\), Datasets and Benchmarks Track,External Links:[Link](https://arxiv.org/abs/2412.14161)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p1.1)\.
- \[41\]H\. Yang, X\. Liu, and C\. D\. Wang\(2025\)FinGPT: open\-source financial large language models\.External Links:2306\.06031,[Link](https://arxiv.org/abs/2306.06031)Cited by:[§1](https://arxiv.org/html/2606.26346#S1.p1.1)\.
- \[42\]S\. Yao, N\. Shinn, P\. Razavi, and K\. Narasimhan\(2024\)τ\\tau\-Bench: a benchmark for tool\-agent\-user interaction in real\-world domains\.arXiv preprint arXiv:2406\.12045\.External Links:[Link](https://arxiv.org/abs/2406.12045)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p1.1)\.
- \[43\]S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao\(2023\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations \(ICLR\),External Links:[Link](https://arxiv.org/abs/2210.03629)Cited by:[§1](https://arxiv.org/html/2606.26346#S1.p1.1),[§2](https://arxiv.org/html/2606.26346#S2.p4.1),[§4\.1](https://arxiv.org/html/2606.26346#S4.SS1.p1.2)\.
- \[44\]B\. Yu, F\. N\. Baker, Z\. Chen, G\. Herb, B\. Gou, D\. Adu\-Ampratwum, X\. Ning, and H\. Sun\(2025\)Tooling or not tooling? the impact of tools on language agents for chemistry problem solving\.InFindings of the Association for Computational Linguistics: NAACL 2025,External Links:[Link](https://arxiv.org/abs/2411.07228)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p4.1)\.
- \[45\]Y\. Zhang, A\. M\. Saber, A\. Youssef, and D\. Kundur\(2025\)Grid\-agent: an llm\-powered multi\-agent system for power grid control\.External Links:2508\.05702,[Link](https://arxiv.org/abs/2508.05702)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p3.1)\.
- \[46\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica\(2023\)Judging LLM\-as\-a\-Judge with MT\-Bench and chatbot arena\.InAdvances in Neural Information Processing Systems \(NeurIPS\),External Links:[Link](https://arxiv.org/abs/2306.05685)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p4.1),[§5\.2](https://arxiv.org/html/2606.26346#S5.SS2.p2.1)\.
- \[47\]X\. Zhou, H\. Zhao, Y\. Cheng, Y\. Cao, G\. Liang, G\. Liu, W\. Liu, Y\. Xu, and J\. Zhao\(2024\)ElecBench: a power dispatch evaluation benchmark for large language models\.arXiv preprint arXiv:2407\.05365\.External Links:[Link](https://arxiv.org/abs/2407.05365)Cited by:[§2](https://arxiv.org/html/2606.26346#S2.p3.1)\.
## Appendix AAppendices
### A\.1Dataset Breakdown
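Table A\.1: Dataset breakdown by capability area and difficulty level \(n=243\)

| Capability Area | Easy | Medium | Hard | Total |
| --- | --- | --- | --- | --- |
| Data | 13 | 61 | 33 | 107 |
| Knowledge | 40 | 43 | 3 | 86 |
| Quant\. | 0 | 8 | 42 | 50 |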
### A\.2Conceptual view of overall evaluation pipeline
An illustration of the overall evaluation pipeline is shown below\.
\[Figure: Benchmark questions \(243 tasks; categories: Data Retrieval, Knowledge, Quantitative; difficulty: Easy / Med / Hard\) are passed as tasks to the ReAct agent loop \(1\. Thought, 2\. Action \(tool call\), 3\. Observation, repeated until a Final Answer is produced\)\. The agent exchanges calls and results with the domain tools \(GridStatus API, Tariff Database, Renewables\.ninja, Battery Optimizer, Docket Search, Web Search \(Exa\), Weather API, RAG Server \(MCP\), Database \(MCP\)\)\. The answer and trace are passed to the LLM\-as\-a\-Judge with category\-aware routing, which scores Approach Correctness, Answer Accuracy or Attribute Alignment, and Source Validity\.\]

Figure A\.1: Conceptual overview of the evaluation pipeline\. A question from the benchmark dataset is presented to a ReAct agent backed by one of seven LLMs\. The agent iterates through Thought, Action \(tool call\), and Observation steps, invoking domain tools as needed, until it emits a Final Answer\. The answer and execution trace are then scored by the three LLM judges across four dimensions using category\-aware rubric routing\.
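The Thought, Action, Observation cycle in the pipeline can be sketched as a small loop\. This is a minimal illustration and not the released agent: `Step`, `run_react`, `scripted_policy`, and `max_iterations` are hypothetical names, and a scripted function stands in for the LLM so that the example runs offline\.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    thought: str
    action: Optional[str] = None         # tool name; None when the agent is finishing
    action_input: Optional[str] = None
    final_answer: Optional[str] = None


Policy = Callable[[str, List[str]], Step]  # (question, observations so far) -> next step
Tool = Callable[[str], str]


def run_react(question: str, policy: Policy, tools: Dict[str, Tool],
              max_iterations: int = 5) -> Tuple[Optional[str], List[dict]]:
    """Iterate Thought -> Action -> Observation until a final answer or the iteration cap."""
    observations: List[str] = []
    trace: List[dict] = []
    for _ in range(max_iterations):
        step = policy(question, observations)
        trace.append({"thought": step.thought, "action": step.action})
        if step.final_answer is not None:
            trace[-1]["final_answer"] = step.final_answer
            return step.final_answer, trace
        tool = tools.get(step.action)
        observation = tool(step.action_input or "") if tool else f"unknown tool: {step.action}"
        observations.append(observation)
        trace[-1]["observation"] = observation
    return None, trace  # iteration cap reached without a final answer


def scripted_policy(question: str, observations: List[str]) -> Step:
    """Scripted stand-in for the LLM (assumption for this sketch)."""
    if not observations:
        return Step("I need price data.", action="run_query", action_input="SELECT 1")
    return Step("I have the data.", final_answer=f"Answer based on: {observations[-1]}")


tools = {"run_query": lambda sql: f"rows for [{sql}]"}
answer, trace = run_react("What was the average price?", scripted_policy, tools)
print(answer)
```

Returning `None` when the cap is hit mirrors the "maxed\-out iterations" failure mode discussed in Section 6; how the released agent actually signals that case is not stated in the text\.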
### A\.3Model Configurations
Table A\.2: Models evaluated and inference configurations

| Model | Provider | Open Source | Reasoning | Reasoning Level |
| --- | --- | --- | --- | --- |
| GPT\-5\.2 | OpenAI | No | Yes | Low |
| GPT\-5\-mini | OpenAI | No | Yes | Low |
| Gemini\-3\.1\-Pro | Google | No | Yes | Low |
| Claude Sonnet 4\.6 | Anthropic | No | Yes | Low |
| Kimi\-K2\.5 | Moonshot | Yes | Yes | N/A |
| Qwen3\-Max\-Thinking | Alibaba | Yes | Yes | N/A |
| DeepSeek\-V3\.2 | DeepSeek | Yes | Yes | N/A |

GPT, Gemini, and Sonnet models are configured with low reasoning effort to evaluate performance under compute\-efficient inference conditions\. In subsequent releases, high reasoning level configurations will be examined\. All models are evaluated with temperature set to 0 for deterministic, reproducible outputs\. No system\-level fine\-tuning or domain adaptation is applied; all models are used off\-the\-shelf via their respective inference APIs \(with DeepInfra APIs used for all the open\-source models \[[13](https://arxiv.org/html/2606.26346#bib.bib54)\]\)\.
### A\.4Agent Implementation Details
The ReAct agent is implemented through a provider abstraction layer that wraps model\-specific API differences, including function\-calling schemas, tool\-use message blocks, response parsing logic, and streaming behavior, behind a common interface\. This design enables identical benchmark execution across all evaluated models without modification to the agent loop or tool suite\.
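One way to picture such an abstraction layer is an abstract base class with one subclass per provider\. The sketch below is an assumption about the general pattern, not the released implementation: the class names and the two payload shapes are invented for illustration and are not real vendor schemas\.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToolCall:
    name: str
    arguments: Dict[str, Any]


@dataclass
class ModelTurn:
    text: str
    tool_calls: List[ToolCall] = field(default_factory=list)


class Provider(ABC):
    """Common interface used by the agent loop; subclasses hide API-specific details."""

    @abstractmethod
    def format_tools(self, tools: List[Dict[str, Any]]) -> Any:
        """Translate neutral tool specs into the provider's tool schema."""

    @abstractmethod
    def parse_response(self, raw: Dict[str, Any]) -> ModelTurn:
        """Translate a raw API response into a provider-neutral ModelTurn."""


class FunctionStyleProvider(Provider):
    # Toy payload shape invented for this sketch (text plus a flat list of calls).
    def format_tools(self, tools):
        return [{"type": "function", "function": t} for t in tools]

    def parse_response(self, raw):
        calls = [ToolCall(c["name"], c["args"]) for c in raw.get("calls", [])]
        return ModelTurn(text=raw.get("text", ""), tool_calls=calls)


class BlockStyleProvider(Provider):
    # Toy payload shape invented for this sketch (a list of typed content blocks).
    def format_tools(self, tools):
        return [{"name": t["name"], "input_schema": t["parameters"]} for t in tools]

    def parse_response(self, raw):
        blocks = raw["blocks"]
        text = "".join(b["text"] for b in blocks if b["type"] == "text")
        calls = [ToolCall(b["name"], b["input"]) for b in blocks if b["type"] == "tool_use"]
        return ModelTurn(text=text, tool_calls=calls)


raw_a = {"text": "", "calls": [{"name": "run_query", "args": {"sql": "SELECT 1"}}]}
raw_b = {"blocks": [{"type": "tool_use", "name": "run_query", "input": {"sql": "SELECT 1"}}]}
turn_a = FunctionStyleProvider().parse_response(raw_a)
turn_b = BlockStyleProvider().parse_response(raw_b)
print(turn_a == turn_b)  # both payloads normalize to the same ModelTurn
```

Because the agent loop only sees `ModelTurn` objects, swapping the model behind it does not require changes to the loop or the tool suite, which is the property the paragraph above describes\.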
### A\.5Tool Suite Overview
Table A\.3: Tool suite available to agents

| Tool Category | Data Source | Coverage |
| --- | --- | --- |
| GridStatus API | GridStatus\.io | All US wholesale electricity markets |
| Tariffs | OpenEI Tariffs API | U\.S\. utility tariffs |
| Renewables | Renewables\.ninja | Solar/wind generation simulation |
| Battery Optim\. | N/A | Arbitrage\-only revenues for battery projects |
| Dockets | FERC; state PUCs | Federal and 7 other state jurisdictions |
| Web Search | Exa API | Open web |
| Weather | OpenWeatherMap | Current & forecast |
| RAG \(MCP\) | Document corpus | Market reports and manuals |
| Database \(MCP\) | Market data portals | ERCOT, NYISO, PJM and ISONE markets |
### A\.6Observability and Trace Collection
All agent executions are traced at the step level, capturing the complete sequence of thought, action, observation, and answer events as structured JSON artifacts\. Traces record tool call arguments and raw responses alongside token\-level timing and iteration counts\. These traces serve two functions: they enable the failure mode discussions in Section[6](https://arxiv.org/html/2606.26346#S6), and they constitute a secondary research artifact released alongside the benchmark\.
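A step\-level trace of this kind could be represented as follows\. The field names and the `Trace` and `TraceEvent` classes are assumptions for illustration; the released traces may use a different schema, and this sketch records elapsed time per event instead of token\-level timing\.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class TraceEvent:
    iteration: int
    kind: str                                   # "thought", "action", "observation", or "answer"
    content: str
    tool_name: Optional[str] = None
    tool_arguments: Optional[Dict[str, Any]] = None
    elapsed_s: float = 0.0                      # seconds since the run started


@dataclass
class Trace:
    task_id: str
    model: str
    events: List[TraceEvent] = field(default_factory=list)

    def log(self, started: float, **fields: Any) -> None:
        """Append one event, stamping it with the elapsed wall-clock time."""
        elapsed = round(time.perf_counter() - started, 4)
        self.events.append(TraceEvent(elapsed_s=elapsed, **fields))

    def to_json(self) -> str:
        """Serialize the whole trace as a structured JSON artifact."""
        return json.dumps(asdict(self), indent=2)


started = time.perf_counter()
trace = Trace(task_id="example-001", model="example-model")  # placeholder identifiers
trace.log(started, iteration=1, kind="thought", content="Need hourly prices.")
trace.log(started, iteration=1, kind="action", content="call tool",
          tool_name="run_query", tool_arguments={"sql": "SELECT 1"})
trace.log(started, iteration=1, kind="observation", content="1 row")
trace.log(started, iteration=2, kind="answer", content="Done.")

record = json.loads(trace.to_json())
kinds = [event["kind"] for event in record["events"]]
iteration_count = max(event["iteration"] for event in record["events"])
print(kinds, iteration_count)
```

Keeping tool arguments and raw observations in the same record is what makes it possible to attribute a failure to a specific step after the run\.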
### A\.7Tool description
Table A\.4: Tool inventory grouped by category

| Tool category | Tool | Description |
| --- | --- | --- |
| System | list\_files | Lists files/directories in a specified path, optionally recursively\. |
| | grep\_files | Searches files for text patterns with optional glob/path filters\. |
| | run\_python\_code | Executes Python code in a sandboxed environment and returns output/errors\. |
| | run\_shell\_command | Executes shell commands in a controlled environment and returns stdout/stderr\. |
| GridStatus API | list\_gridstatus\_datasets | Lists available GridStatus datasets \(ID, name, description\)\. |
| | inspect\_gridstatus\_dataset | Returns schema/metadata for a specific GridStatus dataset\. |
| | query\_gridstatus\_dataset | Queries a GridStatus dataset with filters/time bounds and returns results\. |
| Tariffs | get\_utility\_tariffs | Retrieves utility tariff/rate records from OpenEI IURDB\. |
| Renewables | get\_solar\_profile | Returns hourly solar generation profile \(capacity factors\) for a location/date range\. |
| | get\_wind\_profile | Returns hourly wind generation profile \(capacity factors\) for a location/date range\. |
| Battery Optim\. | battery\_revenue\_optimization | Solves battery dispatch/arbitrage optimization and outputs revenue metrics/profile\. |
| Dockets | search\_ferc\_dockets | Searches FERC dockets/filings\. |
| | search\_dc\_dockets | Searches District of Columbia PSC dockets\. |
| | search\_maryland\_dockets | Searches Maryland PSC dockets\. |
| | search\_new\_york\_dockets | Searches New York PSC dockets\. |
| | search\_north\_carolina\_dockets | Searches North Carolina Utilities Commission dockets\. |
| | search\_south\_carolina\_dockets | Searches South Carolina PSC dockets\. |
| | search\_texas\_dockets | Searches Texas PUCT dockets/filings\. |
| | search\_virginia\_dockets | Searches Virginia SCC dockets\. |
| Web Search | search\_web | Runs web search over external sources\. |
| | get\_page\_contents | Fetches and extracts content from specified URLs\. |
| Weather | geocode\_location | Converts location names to latitude/longitude\. |
| | get\_current\_weather | Returns current weather conditions for a location\. |
| | get\_forecast | Returns short\-term weather forecast for a location\. |
| | get\_historical\_weather | Returns historical weather over a specified period\. |
| | get\_air\_pollution | Returns air\-quality and pollutant metrics for a location\. |
| RAG \(MCP\) | search\_documents | Retrieves relevant passages from indexed document corpora \(MCP RAG\)\. |
| Database \(MCP\) | show\_databases | Lists accessible databases\. |
| | show\_tables | Lists tables in a selected database/schema\. |
| | describe\_table | Returns table schema/column metadata\. |
| | show\_indexes | Shows table indexes for query planning\. |
| | run\_query | Executes SQL query against connected database\. |
| | inspect\_query | Provides query inspection/validation metadata\. |
| | preview\_table | Returns a row preview/sample from a table\. |
### A\.8 System and evaluation prompts
Here are the different prompts used for the benchmark and evaluation runs\. They are also included in the publicly available repository\.
Agent System Prompt:

You are an Expert Energy Analyst\. Use your best effort to answer each question with only one attempt\. No room for back and forths with the user

Judge System Prompt:

You are a strict evaluator of answers relating to energy markets analysis\. Follow expert industry standards\. Your output MUST exactly match the provided output schema\. Do not add extra fields or surrounding text\.
Approach Evaluation Prompt:

You are evaluating the approach correctness of how an AI agent obtained answers to an energy market related question and not the correctness of the answer itself\. In addition to question, you also have a summary of the suggested approach provided by an expert and a trace of the steps the agent took to answer the question which you can use to infer the agent’s approach to answering the question

Question: \{question\}

Suggested Approach \(Ground Truth\): \{suggested\_steps\}

Agent’s Steps: \{agent\_steps\_trace\}

Evaluate:

• Correct problem framing
• Appropriate data sources \(ISO postings, tariffs, settlement data, APIs\)
• Logical analytical steps
• Correct tool usage \(if applicable\)

Rating scale: 5=expert\-like, 4=minor issues, 3=notable gaps, 2=major flaws, 1=wrong approach
Accuracy Evaluation Prompt:

You are evaluating the factual and numerical accuracy of an AI agent’s answer to a question relating to energy markets analysis\.

Question: \{question\}

Expected Answer \(Ground Truth\): \{expected\_answer\}

Agent’s Answer: \{agent\_answer\}

Evaluate:

• Numerical correctness \(values, sign, magnitude, units, time basis\)
• Factual alignment \(market/ISO, node/zone, product, settlement type etc\.\)
• Completeness of key facts

Tolerance: Allow ≤ \{abs\_tol\} absolute error OR ≤ \{rel\_tol\}% relative error unless exactness is required\.
Source Evaluation Prompt:

You are evaluating the following two things only\.

1\. Explicit inclusion of sources in an AI agent’s answer to a question relating to energy markets analysis\.

2\. Relevance of the included sources for the question

You can extract or infer relevant sources from the question itself or from the suggested approach ground truth

Do not penalize for not explicitly adding queries or code for pulling data for verification as long as the source specified is consistent with what the agent has access to and is plausible

Internal databases are based on data from authoritative external sources and as such, the internal databases are equivalent to external authoritative sources \(e\.g\. market portals\) and should be treated as such

Question: \{question\}

Suggested Steps: \{suggested\_steps\}

Agent’s Answer: \{agent\_answer\}

Evaluate:

• Authority of sources
• Alignment with expected sources
• Appropriateness for the claim
• Missing citations when required
Attribute Evaluation Prompt:

You are evaluating attribute alignment of an AI agent’s answer against a canonical set of expected attributes\.

Question: \{question\}

Expected Attributes \(canonical, JSON\): \{expected\_attributes\_json\}

Agent’s Answer: \{agent\_answer\}

For each expected attribute, decide whether the agent answer contains the correct value or a reasonable equivalent, respecting units and time basis\.

Tolerance: For numeric attributes, allow ≤ \{abs\_tol\} absolute error OR ≤ \{rel\_tol\}% relative error unless exactness is required\.
Attribute Extraction Prompt:

You are generating a canonical attribute set for evaluating an AI agent answer to an energy market question\. Extract no more than 5 high\-value attributes from the expected answer\. Each attribute should be specific, evaluable, and tied to the question intent\. Prefer attributes that are most critical to correctness\.

Question: \{question\}

Expected Answer: \{expected\_answer\}
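The tolerance clause in the accuracy and attribute prompts is applied by the LLM judge rather than by code\. As a rough illustration only, the "absolute OR relative" rule could be read deterministically as in the sketch below\. The function name `within_tolerance`, its parameter names, and the handling of a zero expected value are assumptions made for this example and are not taken from the released framework\.

```python
def within_tolerance(expected: float, actual: float,
                     abs_tol: float, rel_tol_pct: float) -> bool:
    """Return True if `actual` is within abs_tol (absolute error)
    OR within rel_tol_pct percent (relative error) of `expected`."""
    abs_err = abs(actual - expected)
    if abs_err <= abs_tol:
        return True
    if expected == 0:
        # Assumption: relative error is undefined for a zero reference value,
        # so only the absolute test can pass in that case.
        return False
    return abs_err / abs(expected) * 100.0 <= rel_tol_pct


# Hypothetical values chosen only to exercise both branches.
print(within_tolerance(100.0, 103.0, abs_tol=0.5, rel_tol_pct=5.0))  # relative test passes
print(within_tolerance(100.0, 110.0, abs_tol=0.5, rel_tol_pct=5.0))  # both tests fail
```

Because the two tests are joined with OR, a small absolute tolerance still accepts near\-zero quantities whose relative error would otherwise be large\.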
### A\.9 Results for Market Data Retrieval and Analysis Tasks
Compared with the overall results in Table[1](https://arxiv.org/html/2606.26346#S6.T1), accuracy scores are higher for market data retrieval and analysis tasks\. This suggests that the models generally perform better on this category of questions\. Failure rates are also lower for most models, with the exception of GPT\-5\-mini, whose failures appear to be driven largely by clarification requests rather than by errors specific to this task category\.
Table A\.5: Evaluation Metrics for Market Data Retrieval and Analysis Tasks

| Model | Accuracy | Approach | Source Validity | Tokens | Tool Calls | Cost Est\. \($\) | Failure Rate \(%\) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Sonnet 4\.6 | 0\.73 ± 0\.06 | 3\.99 ± 0\.15 | 2\.44 ± 0\.33 | 261k | 7\.2 | 0\.81 | 1\.2 |
| Qwen3\-Max\-Thinking | 0\.67 ± 0\.07 | 4\.18 ± 0\.17 | 2\.39 ± 0\.28 | 293k | 8\.5 | 0\.36 | 1\.9 |
| DeepSeek V3\.2 | 0\.61 ± 0\.06 | 3\.93 ± 0\.17 | 2\.14 ± 0\.26 | 560k | 14\.9 | 0\.09 | 6\.8 |
| Kimi\-K2\.5 | 0\.66 ± 0\.07 | 3\.94 ± 0\.18 | 2\.73 ± 0\.34 | 494k | 11\.9 | 0\.07 | 6\.9 |
| Gemini\-3\.1\-Pro | 0\.77 ± 0\.05 | 4\.04 ± 0\.14 | 2\.83 ± 0\.31 | 273k | 6\.6 | 0\.17 | 1\.2 |
| GPT\-5\-mini | 0\.53 ± 0\.09 | 3\.86 ± 0\.19 | 3\.00 ± 0\.33 | 108k | 4\.6 | 0\.01 | 4\.9 |
| GPT\-5\.2 | 0\.75 ± 0\.05 | 4\.08 ± 0\.17 | 4\.22 ± 0\.21 | 264k | 9\.8 | 0\.13 | 0 |
### A\.10 Results for Knowledge Retrieval and Interpretation Tasks
Scores in Table [A\.6](https://arxiv.org/html/2606.26346#A1.T6) are generally lower than the overall results in Table [1](https://arxiv.org/html/2606.26346#S6.T1), suggesting that knowledge retrieval and interpretation tasks are more challenging for most evaluated models\.
Table A\.6: Evaluation Metrics for Knowledge Retrieval and Interpretation Tasks

| Model | Accuracy | Approach | Source Validity | Tokens | Tool Calls | Cost Est\. \($\) | Failure Rate \(%\) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Sonnet 4\.6 | 0\.42 ± 0\.07 | 3\.65 ± 0\.22 | 3\.64 ± 0\.37 | 198k | 4\.2 | 0\.62 | 0 |
| Qwen3\-Max\-Thinking | 0\.31 ± 0\.05 | 3\.50 ± 0\.23 | 2\.43 ± 0\.18 | 181k | 3\.4 | 0\.22 | 1\.2 |
| DeepSeek V3\.2 | 0\.33 ± 0\.06 | 2\.98 ± 0\.15 | 2\.39 ± 0\.27 | 353k | 6\.9 | 0\.06 | 9\.3 |
| Kimi\-K2\.5 | 0\.36 ± 0\.05 | 3\.51 ± 0\.22 | 2\.82 ± 0\.25 | 234k | 5\.1 | 0\.04 | 2\.3 |
| Gemini\-3\.1\-Pro | 0\.44 ± 0\.15 | 3\.67 ± 0\.22 | 2\.89 ± 0\.65 | 245k | 3\.5 | 0\.16 | 2\.3 |
| GPT\-5\-mini | 0\.23 ± 0\.06 | 3\.22 ± 0\.25 | 3\.15 ± 0\.72 | 102k | 2\.1 | 0\.02 | 10\.5 |
| GPT\-5\.2 | 0\.45 ± 0\.11 | 3\.53 ± 0\.34 | 4\.25 ± 0\.65 | 143k | 4\.3 | 0\.12 | 1\.2 |
### A\.11 Results for Advanced Quantitative Modeling and Decision Analytics Tasks
Table A\.7 also shows lower accuracy scores and substantially higher failure rates for most models, indicating that questions in this category are more challenging than those in the other two categories\. However, Gemini\-3\.1\-Pro achieves the highest accuracy in this category \(0\.66\), although this comes with the highest token usage \(783k\)\.
Table A\.7: Evaluation Metrics for Advanced Quantitative Modeling and Decision Analytics Tasks

| Model | Accuracy | Approach | Source Validity | Tokens | Tool Calls | Cost Est\. \($\) | Failure Rate \(%\) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Sonnet 4\.6 | 0\.53 ± 0\.14 | 4\.32 ± 0\.21 | 1\.76 ± 0\.48 | 391k | 14 | 1\.28 | 8 |
| Qwen3\-Max\-Thinking | 0\.28 ± 0\.15 | 3\.42 ± 0\.45 | 1\.72 ± 0\.41 | 474k | 13\.9 | 0\.60 | 6 |
| DeepSeek V3\.2 | 0\.29 ± 0\.18 | 3\.30 ± 0\.32 | 1\.86 ± 0\.55 | 580k | 21 | 0\.09 | 58 |
| Kimi\-K2\.5 | 0\.44 ± 0\.14 | 3\.60 ± 0\.27 | 2\.21 ± 0\.42 | 743k | 21 | 0\.12 | 50 |
| Gemini\-3\.1\-Pro | 0\.66 ± 0\.13 | 3\.39 ± 0\.47 | 2\.48 ± 0\.51 | 783k | 12\.4 | 0\.48 | 8 |
| GPT\-5\-mini | 0\.37 ± 0\.17 | 2\.92 ± 0\.39 | 3\.21 ± 0\.55 | 87k | 4\.7 | 0\.01 | 6 |
| GPT\-5\.2 | 0\.49 ± 0\.16 | 3\.35 ± 0\.32 | 3\.36 ± 0\.49 | 121k | 7\.2 | 0\.10 | 2 |
### A\.12 Judge Attribution Sanity Check for Gemini\-3\.1\-Pro Accuracy Wins
To assess whether Gemini\-3\.1\-Pro’s accuracy performance was disproportionately influenced by any single judge, we examined the subset of questions where Gemini\-3\.1\-Pro achieved a median accuracy score greater than or equal to every other evaluated model\. This subset includes 139 of 243 questions, corresponding to 57\.2% of the benchmark\. Ties at the top are included\.
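The subset definition above can be sketched as follows\. This is a minimal illustration, assuming per\-question accuracy scores are stored as a mapping from question ID to model name to a list of judge scores\. The function name `win_subset`, the data layout, and the toy scores are hypothetical and are not drawn from the released traces\.

```python
from statistics import median


def win_subset(scores, target="Gemini-3.1-Pro"):
    """Return question IDs where the target model's median judge score is
    greater than or equal to every other model's median (ties at the top count)."""
    wins = []
    for qid, by_model in scores.items():
        medians = {model: median(vals) for model, vals in by_model.items()}
        others = [v for model, v in medians.items() if model != target]
        if all(medians[target] >= v for v in others):
            wins.append(qid)
    return wins


# Toy, made-up scores for three questions and two models (illustration only).
toy_scores = {
    1: {"Gemini-3.1-Pro": [1.0, 1.0, 0.5], "GPT-5.2": [0.5, 0.5, 1.0]},
    2: {"Gemini-3.1-Pro": [0.0, 0.5, 0.5], "GPT-5.2": [1.0, 1.0, 0.5]},
    3: {"Gemini-3.1-Pro": [0.5, 0.5, 0.5], "GPT-5.2": [0.5, 0.0, 1.0]},
}
subset = win_subset(toy_scores)

# Share reported in the text: 139 of 243 questions.
reported_share = round(100 * 139 / 243, 1)
```

In the toy data, question 3 enters the subset through a tie, which mirrors the inclusion of ties at the top in the reported 139\-question subset\.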
Table A\.8: Judge Attribution for Gemini 3\.1 Pro Accuracy Wins

| Judge | Subset: Gemini Wins, $n=139$ | Baseline: All Gemini Panels, $n=243$ |
| --- | --- | --- |
| OpenAI GPT\-5\-mini | 32\.01% | 30\.38% |
| DeepInfra DeepSeek V3\.2 | 28\.06% | 28\.53% |
| Google Gemini 3\.1 Flash Lite | 39\.93% | 41\.08% |
| Total | 100\.00% | 100\.00% |

Table A\.9: Panel Tie Regimes for Gemini 3\.1 Pro Accuracy Wins

| Regime | Panels | Share |
| --- | --- | --- |
| 3\-way tie among judges | 69 | 49\.64% |
| 2\-way tie with one outlier | 51 | 36\.69% |
| All three judges different | 19 | 13\.67% |

The attribution pattern does not suggest that Gemini 3\.1 Pro’s accuracy wins are driven by a single judge\. The judge shares in the Gemini\-winning subset are close to the baseline shares over all Gemini panels: OpenAI GPT\-5\-mini accounts for 32\.01% of the winning subset versus 30\.38% overall, DeepSeek V3\.2 accounts for 28\.06% versus 28\.53%, and Gemini 3\.1 Flash Lite accounts for 39\.93% versus 41\.08%\. These differences are small, with all deviations within approximately two percentage points\.
The tie\-regime analysis provides an additional check\. In 49\.64% of Gemini’s winning panels, all three judges assigned the same median\-relevant accuracy score\. A further 36\.69% involved a two\-judge agreement with one outlier\. Thus, in 86\.33% of the Gemini\-winning panels, at least two judges agreed on the score\. Only 13\.67% of the winning panels had all three judges disagree, which are the cases where the median is most sensitive to a single judge\. Overall, this suggests that Gemini 3\.1 Pro’s accuracy wins are primarily supported by cross\-judge agreement rather than by judge\-specific bias\.
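The tie\-regime shares can be reproduced from the panel counts reported in Table A\.9\. The sketch below is illustrative only: the helper `tie_regime` and its rule of counting distinct scores in a three\-judge panel are an assumed reading of the regimes, not the authors' code\.

```python
from statistics import median


def tie_regime(judge_scores):
    """Classify a three-judge panel by the number of distinct scores it contains."""
    distinct = len(set(judge_scores))
    if distinct == 1:
        return "3-way tie among judges"
    if distinct == 2:
        return "2-way tie with one outlier"
    return "All three judges different"


# Panel counts as reported in Table A.9.
counts = {
    "3-way tie among judges": 69,
    "2-way tie with one outlier": 51,
    "All three judges different": 19,
}
total = sum(counts.values())
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

# Share of panels in which at least two judges agree.
agree_share = round(
    100 * (counts["3-way tie among judges"] + counts["2-way tie with one outlier"]) / total, 2
)

# When two of three judges agree, the median equals the agreed score,
# so a single outlier judge cannot move it.
two_agree_median = median([1.0, 1.0, 0.0])
```

The counts sum to the 139 winning panels, and the combined share of the first two regimes matches the 86\.33% quoted above\.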
### A\.13 Public framework and data repository overview
The public repository \(https://github\.com/Tume\-AI/energy\-evals\) contains a complete list of the 212 questions\. However, benchmark traces, ground truths, evaluation results, and justifications are released only for a subset of 30 questions, to limit data contamination and model overfitting\. As subsequent versions of EnergyEvals become available, the datasets and code will be updated\.
### A\.14 Task IDs for source and tools impact analysis
The task IDs for the 61 tasks without sources considered for the source impact analysis are as follows\. The IDs follow the format: With Source \(Without Source Counterpart\)\.

With and Without source task IDs:

1 \(22\), 2 \(23\), 3 \(24\), 6 \(27\), 15 \(36\), 16 \(37\), 17 \(38\), 18 \(39\), 19 \(40\), 20 \(41\), 21 \(42\), 67 \(45\), 68 \(46\), 69 \(47\), 70 \(48\), 71 \(49\), 72 \(50\), 73 \(51\), 83 \(61\), 84 \(62\), 85 \(63\), 86 \(64\), 87 \(101\), 88 \(102\), 89 \(103\), 90 \(104\), 91 \(105\), 92 \(106\), 93 \(107\), 95 \(109\), 96 \(110\), 97 \(111\), 98 \(112\), 100 \(114\), 123 \(152\), 124 \(153\), 125 \(154\), 126 \(155\), 127 \(156\), 128 \(157\), 129 \(158\), 130 \(159\), 131 \(160\), 132 \(161\), 133 \(162\), 134 \(163\), 144 \(172\), 149 \(177\), 150 \(178\), 151 \(179\), 188 \(213\), 189 \(214\), 191 \(216\), 196 \(221\), 197 \(222\), 198 \(223\), 200 \(225\), 201 \(226\), 202 \(227\), 203 \(228\), 206 \(231\)
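For readers who want to pair up traces programmatically, the "With Source \(Without Source Counterpart\)" format can be parsed into a lookup table\. The sketch below is an assumed convenience helper, not part of the released repository, and the name `parse_pairs` is hypothetical\. It is shown on a short excerpt of the list above, with the markdown escapes removed\.

```python
import re


def parse_pairs(text: str) -> dict[int, int]:
    """Map each with-source task ID to its without-source counterpart,
    given text in the form '1 (22), 2 (23), ...'."""
    return {int(a): int(b) for a, b in re.findall(r"(\d+)\s*\((\d+)\)", text)}


# Excerpt of the ID list (first three pairs and the pair used in Figure A.2).
excerpt = "1 (22), 2 (23), 3 (24), 88 (102)"
pairs = parse_pairs(excerpt)
print(pairs[88])  # without-source counterpart of task 88
```

Applied to the full list, the resulting dictionary should contain 61 entries, one per task pair\.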
The task IDs for the 30 tasks considered for the tool impact analysis are as follows\.
Tool impact analysis task IDs:

111, 112, 115, 172, 207, 209, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 245
### A\.15 Sample trace path of a pair of questions with and without sources
The trace paths for the answers to questions 88 \(with source\) and 102 \(without source variant\) across all seven models evaluated are shown in the figure below\.
Figure A\.2: Trace paths for all 7 models to answer the question \(a\) "What was the difference between average ERCOT weekday and weekend day\-ahead prices in the summer of 2023 based on your ERCOT database?" and \(b\) "What was the difference between average ERCOT weekday and weekend day\-ahead prices in the summer of 2023?"