Herculean: An Agentic Benchmark for Financial Intelligence

arXiv cs.AI Papers

Summary

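Introduces Herculean, a benchmark that evaluates AI agents on four financial workflows: Trading, Hedging, Market Insights, and Auditing. Each workflow is implemented as a standardized MCP-based skill environment with its own tools, constraints, and success criteria. Frontier agents perform relatively well on Trading and Market Insights but struggle substantially on Hedging and Auditing.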

arXiv:2605.14355v1 (announce type: new)

# Herculean: An Agentic Benchmark for Financial Intelligence
Source: [https://arxiv.org/abs/2605.14355](https://arxiv.org/abs/2605.14355)
Authors: [Xueqing Peng](https://arxiv.org/search/cs?searchtype=author&query=Peng,+X), [Zhuohan Xie](https://arxiv.org/search/cs?searchtype=author&query=Xie,+Z), [Yupeng Cao](https://arxiv.org/search/cs?searchtype=author&query=Cao,+Y), [Haohang Li](https://arxiv.org/search/cs?searchtype=author&query=Li,+H), [Lingfei Qian](https://arxiv.org/search/cs?searchtype=author&query=Qian,+L), [Yan Wang](https://arxiv.org/search/cs?searchtype=author&query=Wang,+Y), [Vincent Jim Zhang](https://arxiv.org/search/cs?searchtype=author&query=Zhang,+V+J), [Huan He](https://arxiv.org/search/cs?searchtype=author&query=He,+H), [Xuguang Ai](https://arxiv.org/search/cs?searchtype=author&query=Ai,+X), [Linhai Ma](https://arxiv.org/search/cs?searchtype=author&query=Ma,+L), [Ruoyu Xiang](https://arxiv.org/search/cs?searchtype=author&query=Xiang,+R), [Yueru He](https://arxiv.org/search/cs?searchtype=author&query=He,+Y), [Yi Han](https://arxiv.org/search/cs?searchtype=author&query=Han,+Y), [Shuyao Wang](https://arxiv.org/search/cs?searchtype=author&query=Wang,+S), [Yuqing Guo](https://arxiv.org/search/cs?searchtype=author&query=Guo,+Y), [Mingyang Jiang](https://arxiv.org/search/cs?searchtype=author&query=Jiang,+M), [Yilun Zhao](https://arxiv.org/search/cs?searchtype=author&query=Zhao,+Y), [Youzhong Dong](https://arxiv.org/search/cs?searchtype=author&query=Dong,+Y), [Xiaoyu Wang](https://arxiv.org/search/cs?searchtype=author&query=Wang,+X), [Yankai Chen](https://arxiv.org/search/cs?searchtype=author&query=Chen,+Y), [Ye Yuan](https://arxiv.org/search/cs?searchtype=author&query=Yuan,+Y), [Qiyuan Zhang](https://arxiv.org/search/cs?searchtype=author&query=Zhang,+Q), [Fuyuan Lyu](https://arxiv.org/search/cs?searchtype=author&query=Lyu,+F), [Haolun Wu](https://arxiv.org/search/cs?searchtype=author&query=Wu,+H), [Yonghan Yang](https://arxiv.org/search/cs?searchtype=author&query=Yang,+Y), [Zichen Zhao](https://arxiv.org/search/cs?searchtype=author&query=Zhao,+Z), [Yuyang Dai](https://arxiv.org/search/cs?searchtype=author&query=Dai,+Y), [Fan Zhang](https://arxiv.org/search/cs?searchtype=author&query=Zhang,+F), [Rania Elbadry](https://arxiv.org/search/cs?searchtype=author&query=Elbadry,+R), [Ayesha Gull](https://arxiv.org/search/cs?searchtype=author&query=Gull,+A), [Muhammad Usman Safder](https://arxiv.org/search/cs?searchtype=author&query=Safder,+M+U), [Nuo Chen](https://arxiv.org/search/cs?searchtype=author&query=Chen,+N), [Fengbin Zhu](https://arxiv.org/search/cs?searchtype=author&query=Zhu,+F), [Tianshi Cai](https://arxiv.org/search/cs?searchtype=author&query=Cai,+T), [Zimu Wang](https://arxiv.org/search/cs?searchtype=author&query=Wang,+Z), [Polydoros Giannouris](https://arxiv.org/search/cs?searchtype=author&query=Giannouris,+P), [Yuechen Jiang](https://arxiv.org/search/cs?searchtype=author&query=Jiang,+Y), [Zhiwei Liu](https://arxiv.org/search/cs?searchtype=author&query=Liu,+Z), [Mohsinul Kabir](https://arxiv.org/search/cs?searchtype=author&query=Kabir,+M), [Yuyan Wang](https://arxiv.org/search/cs?searchtype=author&query=Wang,+Y), [Yixiang Zheng](https://arxiv.org/search/cs?searchtype=author&query=Zheng,+Y), [Yangyang Yu](https://arxiv.org/search/cs?searchtype=author&query=Yu,+Y), [Weijin Liu](https://arxiv.org/search/cs?searchtype=author&query=Liu,+W), [Wenbo Cao](https://arxiv.org/search/cs?searchtype=author&query=Cao,+W), [Anke Xu](https://arxiv.org/search/cs?searchtype=author&query=Xu,+A), [Peng Lu](https://arxiv.org/search/cs?searchtype=author&query=Lu,+P), [Jerry Huang](https://arxiv.org/search/cs?searchtype=author&query=Huang,+J), [Fengran Mo](https://arxiv.org/search/cs?searchtype=author&query=Mo,+F), [Mingquan Lin](https://arxiv.org/search/cs?searchtype=author&query=Lin,+M), [Prayag Tiwari](https://arxiv.org/search/cs?searchtype=author&query=Tiwari,+P), [Yijia Zhao](https://arxiv.org/search/cs?searchtype=author&query=Zhao,+Y), [Victor Gutierrez Basulto](https://arxiv.org/search/cs?searchtype=author&query=Basulto,+V+G), [Xiao-Yang Liu](https://arxiv.org/search/cs?searchtype=author&query=Liu,+X), [Kaleb E Smith](https://arxiv.org/search/cs?searchtype=author&query=Smith,+K+E), [Jiahuan Pei](https://arxiv.org/search/cs?searchtype=author&query=Pei,+J), [Arman Cohan](https://arxiv.org/search/cs?searchtype=author&query=Cohan,+A), [Jimin Huang](https://arxiv.org/search/cs?searchtype=author&query=Huang,+J), [Yuehua Tang](https://arxiv.org/search/cs?searchtype=author&query=Tang,+Y), [Alejandro Lopez-Lira](https://arxiv.org/search/cs?searchtype=author&query=Lopez-Lira,+A), [Xi Chen](https://arxiv.org/search/cs?searchtype=author&query=Chen,+X), [Xue Liu](https://arxiv.org/search/cs?searchtype=author&query=Liu,+X), [Junichi Tsujii](https://arxiv.org/search/cs?searchtype=author&query=Tsujii,+J), [Jian-Yun Nie](https://arxiv.org/search/cs?searchtype=author&query=Nie,+J), [Sophia Ananiadou](https://arxiv.org/search/cs?searchtype=author&query=Ananiadou,+S)

[View PDF](https://arxiv.org/pdf/2605.14355)

> Abstract: As AI agents improve, the central question is no longer whether they can solve isolated well-defined financial tasks, but whether they can reliably carry out financial professional work. Existing financial benchmarks offer only a partial view of this ability, as they primarily evaluate static competencies such as question answering, retrieval, summarization, and classification. We introduce Herculean, the first skilled benchmark for agentic financial intelligence spanning four representative workflows, including Trading, Hedging, Market Insights, and Auditing. Each workflow is instantiated as a standardized MCP-based skill environment with its own tools, interaction dynamics, constraints, and success criteria, enabling consistent end-to-end assessment of heterogeneous agent systems. Across frontier agents, we find agents perform relatively well on Trading and Market Insights, but struggle substantially on Hedging and Auditing, where long-horizon coordination, state consistency, and structured verification are critical. Overall, our results point to a key gap in current agents in turning financial reasoning into dependable workflow execution in high-stakes financial workflows.

## Submission history

From: Dr. Xueqing Peng \[[view email](https://arxiv.org/show-email/92e15dc2/2605.14355)\]

**\[v1\]** Thu, 14 May 2026 04:30:49 UTC (3,734 KB)

Similar Articles

AI-Trader: Benchmarking Autonomous Agents in Real-Time Financial Markets

Papers with Code Trending

This paper introduces AI-Trader, the first fully automated live benchmark for evaluating LLMs in financial decision-making across US stocks, A-shares, and cryptocurrencies. It highlights that general intelligence does not guarantee trading success and emphasizes the importance of risk control in autonomous agents.

anthropics/financial-services

GitHub Trending (daily)

Anthropic released a repository of AI agents and plugins tailored for financial services workflows, including investment banking and wealth management, deployable via Claude Cowork or the Managed Agents API.

Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse

arXiv cs.LG

This paper benchmarks agentic AI systems on the task of loading, understanding, and reformatting fragmented neuroscience data, finding that while agents perform well on subtasks, they rarely achieve fully error-free end-to-end solutions and human oversight remains necessary.

Data readiness for agentic AI in financial services

MIT Technology Review

The article discusses how financial services companies must ensure data quality, security, and accessibility to successfully deploy agentic AI, emphasizing that the technology's effectiveness depends more on robust data foundations than on system sophistication.