OpenFinGym: A Verifiable Multi-Task Gym Environment for Evaluating Quant Agents

arXiv cs.AI 06/26/26, 04:00 AM Papers
Summary
This paper introduces OpenFinGym, a unified multi-task gym environment for evaluating large language model agents in quantitative finance, covering forecasting, market generation, real-time trading, and fraud detection with verifiable execution and automated task construction.
arXiv:2606.26350v1 Announce Type: new Abstract: Although large language model agents are increasingly applied to quantitative-finance workflows, their evaluation remains fragmented across isolated tasks, while the financial relevance of benchmark tasks is often overlooked. Yet financial workflows are inherently multi-stage, spanning interdependent tasks such as forecasting, strategy construction, risk management, and trading. Existing platforms typically focus on a single task, and can therefore overstate agent competence and fail to reveal weaknesses in generalization, real-market interaction, and financially meaningful decision-making. We introduce OpenFinGym, a unified gym environment for quantitative-finance agent development that covers forecasting, market generation, real-time trading, and fraud detection under a single execution and verification interface. OpenFinGym additionally provides an automated task-construction pipeline that turns quantitative finance publications into executable task packages; a containerised runtime with a host-side verifier service that supports scalable agent rollouts and prevents runtime train-test leakage; a paper trading engine with a low-latency data-stream design; deferred-resolution support for long-horizon and event-market forecasts; and integration for SFT and RL post-training
# OpenFinGym: A Verifiable Multi-Task Gym Environment for Evaluating Quant Agents
Source: [https://arxiv.org/html/2606.26350](https://arxiv.org/html/2606.26350)
Kaicheng Zhang¹,∗, Wen Ge²,∗, Lei Jiang³, Weixin Yang², Jordan Langham\-Lopez³, Jialin Yu⁴, Lukasz Szpruch¹,†, Hao Ni²,†

¹University of Edinburgh, ²University College London, ³Alan Turing Institute, ⁴University of Oxford

K\.Zhang\-60@sms\.ed\.ac\.uk, \{wen\.ge\.25, weixin\.yang, h\.ni\}@ucl\.ac\.uk, \{ljiang, jlanghamlopez\}@turing\.ac\.uk, jialin\.yu@eng\.ox\.ac\.uk, L\.Szpruch@ed\.ac\.uk

###### Abstract

Although large language model agents are increasingly applied to quantitative\-finance workflows, their evaluation remains fragmented across isolated tasks, while the financial relevance of benchmark tasks is often overlooked\. Yet financial workflows are inherently multi\-stage, spanning interdependent tasks such as forecasting, strategy construction, risk management, and trading\. Existing platforms typically focus on a single task, and can therefore overstate agent competence and fail to reveal weaknesses in generalization, real\-market interaction, and financially meaningful decision\-making\. We introduce OpenFinGym, a unified gym environment for quantitative\-finance agent development that covers forecasting, market generation, real\-time trading, and fraud detection under a single execution and verification interface\. OpenFinGym additionally provides an automated task\-construction pipeline that turns quantitative finance publications into executable task packages; a containerised runtime with a host\-side verifier service that supports scalable agent rollouts and prevents runtime train–test leakage; a paper trading engine with a low\-latency data\-stream design; deferred\-resolution support for long\-horizon and event\-market forecasts; and integration for SFT and RL post\-training\. \[Code will be available upon publication\.\]

∗Equal contribution\. †Corresponding authors\.

![Refer to caption](https://arxiv.org/html/2606.26350v1/Figures/openfingym_v5.png)

Figure 1: A high\-level illustration of the OpenFinGym architecture\.

## 1 Introduction

Quantitative finance has, over the past two years, emerged as a testing ground for autonomous large language model agents \(Zhang et al\., [2024](https://arxiv.org/html/2606.26350#bib.bib66); Yu et al\., [2025b](https://arxiv.org/html/2606.26350#bib.bib67), [2024](https://arxiv.org/html/2606.26350#bib.bib68)\)\. Where earlier work investigated whether deep models could outperform classical baselines on a forecasting or trading task \(Gu et al\., [2020b](https://arxiv.org/html/2606.26350#bib.bib69); Sun et al\., [2023](https://arxiv.org/html/2606.26350#bib.bib14); Liu et al\., [2022](https://arxiv.org/html/2606.26350#bib.bib11)\), recent systems treat the LLM itself as a researcher: an agent that selects data, engineers features, trains models, and generates predictions or executes trades\. This shift changes what evaluation and training infrastructure must support\. The task surface should span forecasting future prices \(jun Gu et al\., [2024](https://arxiv.org/html/2606.26350#bib.bib62); Wang et al\., [2024](https://arxiv.org/html/2606.26350#bib.bib63)\), generating synthetic market scenarios for stress testing \(Hounwanou and Gaba, [2025](https://arxiv.org/html/2606.26350#bib.bib64)\), executing trades against historical and live data \(Zong et al\., [2024](https://arxiv.org/html/2606.26350#bib.bib59); Kashif and Ślepaczuk, [2025](https://arxiv.org/html/2606.26350#bib.bib60)\), and detecting anomalies in transaction graphs \(Li et al\., [2024](https://arxiv.org/html/2606.26350#bib.bib61)\)\. It should also span interaction modes, from one\-shot batch submission on historical data to step\-by\-step decision\-making on streaming market feeds\.

Although some existing platforms such as FinRL\-Meta \(Liu et al\., [2022](https://arxiv.org/html/2606.26350#bib.bib11)\) and TradeMaster \(Sun et al\., [2023](https://arxiv.org/html/2606.26350#bib.bib14)\) supply gym\-style trading environments, and domain\-general gym environments such as DSGym \(Nie et al\., [2026](https://arxiv.org/html/2606.26350#bib.bib10)\) and TimeSeriesGym \(Cai et al\., [2025](https://arxiv.org/html/2606.26350#bib.bib2)\) can handle static tasks such as batch forecasting, the former are restricted to trading and the latter are not specialised for finance\. Yet financial tasks are interconnected in practice: forecasts can guide trades, generated scenarios can stress\-test strategies, and anomaly signals can inform risk management\. No existing platform provides a unified environment for evaluating and training agents across forecasting, generation, trading, and anomaly detection under a common execution and verification interface, which is needed to assess whether agents can perform and generalize across financial workflows\. Moreover, in a high\-stakes domain such as finance, scalable task construction must be paired with rigorous verification: tasks should reflect meaningful financial workflows while ensuring reproducibility, leakage control, and evaluation against hidden ground truth\.

We introduce OpenFinGym, a unified gym environment for quantitative\-finance agent development that closes this gap\. OpenFinGym contains a curated suite of 78 tasks across the four task families, together with an automated literature\-to\-task pipeline\. Agents run inside containers that exclude held\-out test set ground truth, and submit predictions or actions to a host\-side verifier service for reward computation\. A low\-latency data streaming service, a paper trading engine, and a persistent SQLite ledger extend the same interface to real\-time trading and long\-horizon price and prediction market forecasting\. We benchmark four frontier LLMs on OpenFinGym and find task\-family\-specific leaders rather than a single dominant model\. We also demonstrate OpenFinGym’s utility as a post\-training environment, where SFT and GRPO post\-training improve the executable generation success rates of Qwen3 base models from 0% to 100% on held\-out forecasting tasks, and improve the reward by an order of magnitude on Treasury\.

Our contributions are as follows:

- A curated task suite of 78 verified quantitative\-finance tasks spanning four different task families, constructed from prior literature and extended to real\-time financial market environments\.
- An automated task\-construction pipeline that converts academic quantitative\-finance papers into executable and verifiable task packages\.
- A containerised runtime that separates agent\-visible information from verifier\-only ground truth, enabling leakage\-resistant evaluation across both batch and sequential tasks\.
- A gym environment for evaluating and training quantitative\-finance agents, which enables supervised fine\-tuning and reinforcement learning\.

Column groups: task coverage \(Forecasting through Fraud detection\), runtime \(Leakage control, Low\-latency\), and agent lifecycle \(Auto task construction, SFT / RL integration\)\.

| System | Forecasting | Trading (offline) | Trading (real-time) | Market generation | Fraud detection | Leakage control | Low-latency | Auto task construction | SFT / RL integration |
|---|---|---|---|---|---|---|---|---|---|
| *Quantitative finance agent systems* | | | | | | | | | |
| FinRL-Meta (Liu et al., [2022](https://arxiv.org/html/2606.26350#bib.bib11)) | – | ✓ | ∼ | – | – | ✓ | – | – | ∼ |
| FinBen (Xie et al., [2024](https://arxiv.org/html/2606.26350#bib.bib43)) | ∼ | ✓ | – | – | ✓ | – | – | – | ✗ |
| INVESTORBENCH (Li et al., [2025a](https://arxiv.org/html/2606.26350#bib.bib65)) | – | ✓ | – | – | – | – | – | – | – |
| Agent Market Arena (Qian et al., [2025](https://arxiv.org/html/2606.26350#bib.bib3)) | – | – | ✓ | – | – | ∼ | – | – | – |
| *General containerised agent gyms* | | | | | | | | | |
| SWE-Bench (Jimenez et al., [2024](https://arxiv.org/html/2606.26350#bib.bib48)) | – | – | – | – | – | ✓ | – | ∼ | – |
| TimeSeriesGym (Cai et al., [2025](https://arxiv.org/html/2606.26350#bib.bib2)) | ✓ | – | – | – | – | ✓ | – | ∼ | – |
| Harbor (Harbor Framework Team, [2026](https://arxiv.org/html/2606.26350#bib.bib9)) | – | – | – | – | – | ✓ | – | – | ✓ |
| OpenFinGym (ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

Table 1: Comparison of OpenFinGym against competing systems across task coverage, runtime properties, and agent\-lifecycle support\. ✓ denotes full support, ∼ partial support, and – no support\.
## 2 Related work

Financial agent environments include reinforcement learning trading platforms such as FinRL\-Meta \(Liu et al\., [2022](https://arxiv.org/html/2606.26350#bib.bib11)\) and TradeMaster \(Sun et al\., [2023](https://arxiv.org/html/2606.26350#bib.bib14)\), quantitative research frameworks such as Qlib \(Yang et al\., [2020](https://arxiv.org/html/2606.26350#bib.bib12)\), and market microstructure simulators such as ABIDES\-Gym \(Amrouni et al\., [2021](https://arxiv.org/html/2606.26350#bib.bib13)\)\. These systems primarily support quantitative research, trading, and simulation rather than executable agentic benchmarking grounded in financial literature\. Financial LLM benchmarks, including PIXIU \(Xie et al\., [2023](https://arxiv.org/html/2606.26350#bib.bib42)\), FinBen \(Xie et al\., [2024](https://arxiv.org/html/2606.26350#bib.bib43)\), and FinanceBench \(Islam et al\., [2023](https://arxiv.org/html/2606.26350#bib.bib44)\), evaluate financial reasoning and analysis abilities on largely static datasets and evaluation protocols\. More recent work has focused on executable agent evaluation\. SWE\-Bench \(Jimenez et al\., [2024](https://arxiv.org/html/2606.26350#bib.bib48)\), AgentBench \(Liu et al\., [2024](https://arxiv.org/html/2606.26350#bib.bib51)\), and WebArena \(Zhou et al\., [2024](https://arxiv.org/html/2606.26350#bib.bib50)\) evaluate agents in autonomous environments, while DSGym \(Nie et al\., [2026](https://arxiv.org/html/2606.26350#bib.bib10)\), TimeSeriesGym \(Cai et al\., [2025](https://arxiv.org/html/2606.26350#bib.bib2)\), and Harbor \(Harbor Framework Team, [2026](https://arxiv.org/html/2606.26350#bib.bib9)\) extend this paradigm to data\-science, time\-series, and containerised execution settings\. OpenFinGym builds on this line of work but targets finance\-specific challenges such as temporal leakage and economically grounded evaluation\.
Concurrent studies further investigate LLM\-based financial agents in live or semi\-live markets, such as StockBench \(Chen et al\., [2025](https://arxiv.org/html/2606.26350#bib.bib4)\), LiveTradeBench \(Yu et al\., [2025a](https://arxiv.org/html/2606.26350#bib.bib5)\), and When Agents Trade \(Qian et al\., [2025](https://arxiv.org/html/2606.26350#bib.bib3)\)\. Other works highlight look\-ahead risks, temporal contamination, and benchmark leakage in financial agent evaluation \(Benhenda, [2026](https://arxiv.org/html/2606.26350#bib.bib6); Li et al\., [2025b](https://arxiv.org/html/2606.26350#bib.bib7); Xu et al\., [2024](https://arxiv.org/html/2606.26350#bib.bib46)\)\. OpenFinGym is positioned at the intersection of executable agent evaluation and quantitative\-finance benchmarking through literature\-grounded task construction and verifier\-isolated evaluation\.

## 3 Open Financial Gym

OpenFinGym is designed to support rigorous, scalable, and leakage\-resistant evaluation of quantitative\-finance agents\. It covers a broad family of financial workflows, including forecasting, trading, market generation, and fraud detection\. The benchmark is organised around three components: a reusable quantitative\-finance knowledge base, a collection of representative and verified tasks, and a gym\-style execution environment for evaluating and training agents\. The overall architecture is shown in Figure [1](https://arxiv.org/html/2606.26350#S0.F1)\.

### 3\.1 Task Specifications

##### Unified task abstraction\.

The quantitative\-finance tasks in OpenFinGym are formulated as machine\-learning problems\. Formally, each task is represented by a triplet $(\mathcal{D}, \mathcal{T}, \mathcal{E})$, where

- $\mathcal{D}$ denotes the dataset, including the training and test splits;
- $\mathcal{T}$ specifies the task description, including the data source, input\-output interface, prediction or decision objective, required output format, and overall goal of the task;
- $\mathcal{E}$ denotes the evaluation protocol used to assess agent performance\. The evaluation protocol may include multiple metrics, together with an aggregation rule that defines the final task score\.

This formulation provides a unified interface for representing heterogeneous quantitative\-finance problems\.

Each task is packaged in a standardised format extended from the Harbor framework \(Harbor Framework Team, [2026](https://arxiv.org/html/2606.26350#bib.bib9)\)\. A task package contains the agent\-facing instructions, task interface module, data payload, configuration file used for execution and orchestration, and the evaluation components required by the runtime\. A key design principle is the separation between information available to the agent and information used only for verification\. During evaluation, agents may access the training split and the input features of the test split, but the corresponding labels, realised outcomes, or hidden evaluation state are withheld and made available only to the verifier\. This separation reduces leakage risk and allows the same agent to be evaluated across heterogeneous task sources without task\-specific glue code\.
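
The triplet and the agent/verifier split described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `TaskPackage`, its field names, and the weighted-sum aggregation are hypothetical and are not taken from the OpenFinGym code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskPackage:
    """Hypothetical container mirroring the (D, T, E) triplet."""

    instructions: str                  # T: task description and output contract
    train_inputs: List[List[float]]    # D: visible to the agent
    train_labels: List[float]          # D: visible to the agent
    test_inputs: List[List[float]]     # D: visible to the agent
    test_labels: List[float]           # D: verifier-only ground truth
    metric_weights: Dict[str, float]   # E: assumed weighted-sum aggregation rule

    def agent_view(self) -> dict:
        """Return only what the agent may see; test labels are left out."""
        return {
            "instructions": self.instructions,
            "train_inputs": self.train_inputs,
            "train_labels": self.train_labels,
            "test_inputs": self.test_inputs,
        }

    def aggregate(self, per_metric: Dict[str, float]) -> float:
        """Combine per-metric scores into one task score (weighted sum)."""
        missing = set(self.metric_weights) - set(per_metric)
        if missing:
            raise ValueError(f"missing metrics: {sorted(missing)}")
        return sum(w * per_metric[name] for name, w in self.metric_weights.items())


task = TaskPackage(
    instructions="Predict next-day return; submit one float per test row.",
    train_inputs=[[0.1], [0.2]],
    train_labels=[0.01, 0.02],
    test_inputs=[[0.3]],
    test_labels=[0.03],
    metric_weights={"rmse": 0.7, "directional_error": 0.3},
)
visible = task.agent_view()
final_score = task.aggregate({"rmse": 0.05, "directional_error": 0.4})
```

Keeping `test_labels` out of `agent_view()` is the point of the sketch: whatever is handed to the agent is built from an explicit allow-list rather than by deleting fields from the full package.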

![[Uncaptioned image]](https://arxiv.org/html/2606.26350v1/Figures/openfingym_task.png)

| Task type | Asset Classes | # Tasks |
|---|---|---|
| Forecasting | Equities, Commodities, FX, Crypto, LOB, Yield Curve | 48 |
| Market Simulation | Equities, FX, Crypto, LOB, Synthetic Processes | 13 |
| Trading | Equities, Crypto | 10 |
| Fraud Detection | Transactional, Behavioral, and Graph-Structured Signals | 7 |

Table 2: Task type distribution \(top\) and asset class breakdown \(bottom\) of the curated financial tasks in the gym\.
##### Task taxonomy\.

OpenFinGym covers four task families\.

1. Forecasting\. Forecasting tasks require an agent to predict future financial quantities, such as returns, prices, volatility, yield\-curve movements, or market states, from historical observations and other predictive signals\. These tasks are evaluated using statistical metrics such as $R^2$, RMSE, and directional accuracy, as well as finance\-specific metrics that measure the downstream economic value or risk profile of the forecasts\.
2. Trading\. Trading tasks require an agent to take sequential trading actions on the observed market snapshots\. Offline trading tasks evaluate policies on historical data, while real\-time trading tasks require sequential interaction with live market observations\. Evaluation metrics include profit and loss, Sharpe ratio, drawdown, turnover, and returns with configurable transaction cost and price slippage\.
3. Market generation\. Market generation tasks evaluate whether an agent can generate realistic financial time series, market states, or scenario distributions\. Generated samples are compared with real data using distributional, temporal, and financial diagnostics, including distributional scores, statistic\-based scores \(e\.g\., autocorrelation score, marginal distribution score\), volatility, tail\-risk, and return\-dynamics statistics\.
4. Fraud detection\. Fraud detection tasks are formulated as classification, ranking, or anomaly\-detection problems\. The objective is to identify fraudulent, suspicious, or anomalous financial activity\. Evaluation metrics include AUROC, recall, and threshold\-dependent operating metrics\.
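
To make a few of the metrics named above concrete, the following sketch gives textbook definitions of RMSE, directional accuracy, Sharpe ratio, and maximum drawdown. The function names, the zero risk-free rate, and the 252-period annualisation are assumptions chosen for illustration; the paper does not state which conventions its reward functions use.

```python
import math
import statistics
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between realised and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def directional_accuracy(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Fraction of predictions on the same side of zero as the realised value.

    Zero is grouped with negative values here, which is a simplification.
    """
    hits = sum(1 for t, p in zip(y_true, y_pred) if (t > 0) == (p > 0))
    return hits / len(y_true)


def sharpe_ratio(returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Annualised Sharpe ratio, assuming a zero risk-free rate."""
    vol = statistics.stdev(returns)
    if vol == 0:
        return 0.0  # undefined in theory; return 0 so callers get a number
    return statistics.mean(returns) / vol * math.sqrt(periods_per_year)


def max_drawdown(equity: Sequence[float]) -> float:
    """Largest peak-to-trough loss as a fraction of the running peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


example_drawdown = max_drawdown([100.0, 120.0, 90.0, 110.0])
example_hits = directional_accuracy([0.1, -0.2, 0.3], [0.2, -0.1, -0.3])
```

In the example, the equity curve peaks at 120 and falls to 90, so the drawdown is 30 / 120, and two of the three forecasts have the correct sign.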

The task catalogue spans multiple asset classes, including equities, indices, foreign exchange \(FX\), commodities, crypto\-assets, rates, limit\-order\-book \(LOB\) microstructure, and prediction markets, as well as different frequencies ranging from intraday to monthly observations and graph\-structured data\. It includes offline tasks derived from the academic literature and real\-time tasks in which agents must act before the relevant market or event outcome is observed\. The resulting task universe is summarised in Table [2](https://arxiv.org/html/2606.26350#S3.T2)\.

##### Knowledge base\.

Task construction produces a reusable quantitative\-finance knowledge base\. The knowledge base stores structured information about papers, datasets, task settings, experimental protocols, evaluation metrics, and reusable reward functions\. This enables datasets and metrics to be reused across related tasks, supports systematic construction of new benchmark instances, and provides the substrate for the automated task construction pipeline described in Section [4](https://arxiv.org/html/2606.26350#S4)\.

### 3\.2 Agents

An agent receives the task instructions, the observable data, and the required output contract\. For batch tasks, the agent typically writes executable code that trains a model on the provided training data and submits predictions for the held\-out test inputs\. For sequential tasks, the agent interacts with the environment through a step\-based interface, observes market or portfolio state, and submits actions such as forecasts, trades, or portfolio weights\.

The agent is not required to know how the verifier stores hidden labels or realised outcomes\. Instead, it interacts with the task through a fixed submission interface\. This design allows agents with different internal architectures, such as prompted language\-model agents, tool\-using agents, coding agents, or trained policies, to be evaluated under the same task interface\.

## 4 Methodology

OpenFinGym has two methodological components\. The first is a task\-construction pipeline that turns quantitative\-finance publications and curated benchmark specifications into executable tasks\. The second is a task\-execution framework that runs agents in controlled environments and evaluates their submissions through isolated verifier processes or deferred\-resolution services\.

### 4\.1 Task construction

![Refer to caption](https://arxiv.org/html/2606.26350v1/Figures/pipe_v3.png)

Figure 2: Schematic of the four phases of the task construction pipeline and the knowledge base\.

OpenFinGym constructs tasks through three complementary pathways: automated task generation, curated benchmark implementation, and trading task routing\. Forecasting and market\-generation tasks are well suited to automated construction because they typically follow a common pattern: materialise a dataset, define an input\-output split, generate predictions, and evaluate them against held\-out targets\. Trading and real\-time tasks require additional infrastructure, including execution engines, market\-data providers, portfolio\-state tracking, and deferred evaluation; these tasks are therefore implemented through curated harnesses that can be parameterised by paper\-specific configurations\.

#### 4\.1\.1 Automated Task Construction pipeline

OpenFinGym constructs its knowledge base and executable benchmark tasks through a multi\-stage pipeline, comprising four main phases: paper harvesting, dataset construction, reward construction, and task generation, as shown in Figure [2](https://arxiv.org/html/2606.26350#S4.F2)\.

##### Paper harvesting\.

The pipeline first collects candidate academic finance papers within a user\-specified task scope, then filters them through a two\-stage procedure\. A prefiltering stage screens titles and abstracts to discard out\-of\-scope papers; a full\-text filtering stage admits a candidate only if it presents a complete, in\-scope experimental protocol together with well\-documented evaluation metrics and at least one publicly accessible dataset\. Each accepted paper is summarised into a structured record detailing its task setting, datasets, evaluation metrics, and experimental protocol\. These records form the initial substrate of the knowledge base and provide the task specifications for the downstream construction stages\.

##### Dataset and Reward construction\.

For each accepted task specification, the pipeline first checks whether existing datasets and rewards in the knowledge base can be reused\. If not, the required artifact is materialised via a generator\-reviewer loop, in which an LLM writes the artifact and a reviewer judges it with the aid of unit tests\. For dataset construction, the loop runs an LLM\-authored download script in a safety sandbox, with the reviewer judging both the script and the materialised payload against the task’s metadata\. Evaluation metrics undergo a similar reuse check against a reusable reward bank\. A metric absent from the bank is generated by the same generator\-reviewer loop from a standardised template, and is admitted only after interface checks, unit tests on synthetic tensors, and the reviewer’s audit\. Together these stages populate a reusable quantitative\-finance knowledge base of papers, datasets and reward functions organised by task scope\.
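
The control flow of such a generator-reviewer loop can be sketched as follows. Everything here is a stand-in: `toy_generate` plays the role of the LLM author and `toy_review` the role of the unit-test-backed reviewer, and neither name nor the retry budget comes from the paper.

```python
from typing import Callable, Optional, Tuple

Metric = Callable[[list, list], float]


def generator_reviewer_loop(
    generate: Callable[[Optional[str]], Metric],
    review: Callable[[Metric], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[Optional[Metric], int]:
    """Ask the generator for an artifact until the reviewer accepts it."""
    feedback: Optional[str] = None
    for round_idx in range(1, max_rounds + 1):
        artifact = generate(feedback)
        accepted, feedback = review(artifact)
        if accepted:
            return artifact, round_idx
    return None, max_rounds  # budget exhausted, nothing admitted


def toy_generate(feedback: Optional[str]) -> Metric:
    """Stand-in for the LLM: the first draft is buggy, the revision is not."""
    if feedback is None:
        # Bug: signed errors cancel out instead of accumulating.
        return lambda y, p: sum(t - q for t, q in zip(y, p)) / len(y)
    return lambda y, p: sum(abs(t - q) for t, q in zip(y, p)) / len(y)


def toy_review(metric: Metric) -> Tuple[bool, str]:
    """Stand-in reviewer: unit tests for mean absolute error on synthetic data."""
    cases = [(([1.0, 2.0], [1.0, 2.0]), 0.0), (([0.0, 0.0], [1.0, -1.0]), 1.0)]
    for (y, p), expected in cases:
        got = metric(y, p)
        if abs(got - expected) > 1e-9:
            return False, f"expected {expected}, got {got}"
    return True, ""


metric, rounds_used = generator_reviewer_loop(toy_generate, toy_review)
```

The first draft passes the trivial test but fails the second one because the two signed errors cancel, so the reviewer's feedback triggers a second round, which is accepted.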

##### Task assembly\.

The pipeline assembles the datasets, rewards, and task specifications into a self\-contained, OpenFinGym\-structured task package\. Where applicable, the dataset is partitioned into train and test sets by a task\-specific loader generated through the same generator\-reviewer loop\. The generator step also emits a ground\-truth provenance record \(e\.g\. source column, prediction horizon\) that the reviewer audits for source\-mechanism consistency\. A separate feature\-leakage test then checks the loader’s output tensors before the loader is accepted\. An evaluator is then assembled deterministically from a template by plugging in each resolved reward, as specified by the task’s metadata\. A final generation step writes the agent\-facing task interface module, declaring the task’s metadata and observation / action spaces, and writes an instruction markdown file detailing the task goal, dataset description, and evaluation protocol\. The task is installed only after it passes both contract and smoke tests that exercise the full agent–evaluator path\. Each package bundles these artifacts with environment configuration files and the dataset payload, forming an executable benchmark task consumable by the OpenFinGym runtime for agent evaluation and downstream training\.
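
The paper does not describe how its feature-leakage test works. One simple heuristic that such a test could include, shown here purely as an assumption, is to flag any feature column that is almost perfectly correlated with the target, since that usually means the label (or a rescaled copy of it) has slipped into the inputs. The function names and the 0.999 threshold are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def leaking_columns(
    features: List[List[float]], targets: Sequence[float], threshold: float = 0.999
) -> List[int]:
    """Return indices of feature columns suspiciously correlated with the target."""
    flagged = []
    for j in range(len(features[0])):
        column = [row[j] for row in features]
        if abs(pearson(column, targets)) >= threshold:
            flagged.append(j)
    return flagged


# Column 0 is the target divided by two (a leak); column 1 is unrelated noise.
features = [[1.0, 0.3], [2.0, -0.1], [3.0, 0.5], [4.0, 0.2]]
targets = [2.0, 4.0, 6.0, 8.0]
suspicious = leaking_columns(features, targets)
```

A correlation check only catches linear copies of the label; a production test would likely also inspect timestamps and provenance, which this sketch does not attempt.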

#### 4\.1\.2 Generated and curated task packages

The automated pipeline of Section [4\.1\.1](https://arxiv.org/html/2606.26350#S4.SS1.SSS1) emits a task package from each accepted paper\. For forecasting, market\-generation and fraud\-detection papers, this pipeline materialises the full executable package end to end: dataset, evaluator, task module and instructions\. Additionally, we construct offline benchmarks within each scope by manually selecting papers that are widely cited, methodologically reproducible, and accompanied by either public code or a sufficiently detailed protocol\. Each task is then implemented and manually verified to reproduce the original literature benchmark as faithfully as possible\. The full summary of the curated offline tasks is given in Appendix [C](https://arxiv.org/html/2606.26350#A3)\.

For trading tasks, the required execution and market\-data infrastructure cannot be reliably built by the automated pipeline alone\. The pipeline instead routes the task specification to the curated trading harness and re\-exports a task parameterised by the paper\-derived configuration \(symbols, dates, frequency, prediction horizon, etc\.\)\. The harness reused here is the same one used by the manually curated trading tasks, which bundles a configurable execution engine, in\-process portfolio\-state accounting, and adapters to stock, crypto, and prediction\-market data providers\.

Table 3: Latency and mean refresh rate measurements of the WebSocket \(WS\) and REST paths of data providers for real\-time trading tasks\. Both measurements are taken over a 60 s window\. The WS path incurs an additional ≈0\.5 μs buffer lookup time on the agent’s side\. Results may vary depending on users’ locations and market conditions\.

| Provider | Channel | Latency (ms) | Speed-up (Latency) | Mean refresh (Hz) | Improvement (Freshness) |
|---|---|---|---|---|---|
| Alpaca | WS | 54.2 | 1.90× | 13.53 | 9.7× |
| Alpaca | REST | 103.1 | | 1.40 | |
| Binance | WS | 129.1 | 1.86× | 0.67 | 40.1× |
| Binance | REST | 240.5 | | 0.017 | |

#### 4\.1\.3 Real\-time tasks

Real\-time tasks are another case where OpenFinGym relies on a curated harness rather than the automated pipeline\. The agent takes trading actions or makes forecasts on streaming market data, and the runtime requirements, such as low\-latency data streaming, in\-process order execution, and long\-horizon deferred resolution, cannot be reliably built by an automated pipeline alone\.

##### Real\-time trading tasks\.

The agent observes live market and portfolio state at each gym step and submits orders through an executor interface\. The default is an in\-process paper trading simulator that holds positions, cash, and pending orders in memory and fills orders against the latest market snapshot\. It supports the common order types and applies configurable price slippage and per\-trade transaction costs\. The alternative executor routes the same orders to a broker’s paper trading sandbox and mirrors the resulting account state locally, keeping account and portfolio reads constant\-time on the agent side without a broker round\-trip per observation\. Further details of the trading implementation are given in Appendix [A](https://arxiv.org/html/2606.26350#A1)\.
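
A minimal sketch of an in-process paper trading simulator of this kind is shown below, restricted to market orders. The class name `PaperBroker`, the basis-point slippage model, and the proportional fee are assumptions made for illustration; the actual engine is described in the paper's Appendix A and supports more order types.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PaperBroker:
    """Hypothetical in-memory simulator: cash, positions, market orders only."""

    cash: float
    slippage_bps: float = 5.0   # assumed: price moves against the trader
    fee_rate: float = 0.001     # assumed: proportional cost per trade
    positions: Dict[str, float] = field(default_factory=dict)

    def market_order(self, symbol: str, qty: float, last_price: float) -> float:
        """Fill at the latest snapshot price; qty > 0 buys, qty < 0 sells."""
        if qty == 0:
            raise ValueError("order quantity must be non-zero")
        side = 1 if qty > 0 else -1
        # Buys fill slightly above the snapshot, sells slightly below.
        fill_price = last_price * (1 + side * self.slippage_bps / 10_000)
        notional = fill_price * qty
        fee = abs(notional) * self.fee_rate
        if qty > 0 and notional + fee > self.cash:
            raise ValueError("insufficient cash")
        self.cash -= notional + fee
        self.positions[symbol] = self.positions.get(symbol, 0.0) + qty
        return fill_price

    def equity(self, prices: Dict[str, float]) -> float:
        """Mark-to-market account value at the given prices."""
        return self.cash + sum(q * prices[s] for s, q in self.positions.items())


broker = PaperBroker(cash=10_000.0, slippage_bps=10.0, fee_rate=0.001)
fill = broker.market_order("AAPL", 10, last_price=100.0)
account_value = broker.equity({"AAPL": 100.0})
```

With 10 bps of slippage the buy fills at about 100.10, and after the fee the account is worth slightly less than the starting cash when marked at the unchanged snapshot price, which is the cost of the round of trading.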

##### Latency\-aware data path\.

Each gym step in a real\-time trading task reads the current price and order book before the agent acts, and the freshness of those reads affects the quality of the agent’s actions\. A naive REST\-per\-step approach incurs a network round\-trip on every observation and reaches rate limits well below the price refresh rate\. OpenFinGym instead serves these data from an in\-process buffer that is continuously refreshed by a background WebSocket \(WS\) subscription\. The WS path roughly halves the per\-read latency and achieves a mean refresh rate roughly 10 to 40 times higher than REST \(9\.7× on Alpaca and 40\.1× on Binance\), as shown in Table [3](https://arxiv.org/html/2606.26350#S4.T3)\.
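
The buffering pattern can be sketched with a background thread that keeps overwriting the latest quote while the agent reads from memory. To stay self-contained, the sketch replaces the WebSocket subscription with a simulated feed; `LatestQuoteBuffer` and `simulated_feed` are hypothetical names, not part of OpenFinGym.

```python
import threading
import time
from typing import Dict, Iterable, Optional, Tuple


class LatestQuoteBuffer:
    """Thread-safe store holding only the most recent price per symbol."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._quotes: Dict[str, Tuple[float, float]] = {}

    def update(self, symbol: str, price: float) -> None:
        """Called by the feed thread whenever a new tick arrives."""
        with self._lock:
            self._quotes[symbol] = (price, time.monotonic())

    def read(self, symbol: str) -> Optional[Tuple[float, float]]:
        """Called by the agent side; a dictionary lookup, no network round-trip."""
        with self._lock:
            return self._quotes.get(symbol)


def simulated_feed(
    buffer: LatestQuoteBuffer, symbol: str, prices: Iterable[float], stop: threading.Event
) -> None:
    """Stand-in for a WebSocket subscription pushing ticks into the buffer."""
    for price in prices:
        if stop.is_set():
            break
        buffer.update(symbol, price)
        time.sleep(0.001)


buffer = LatestQuoteBuffer()
stop = threading.Event()
worker = threading.Thread(
    target=simulated_feed,
    args=(buffer, "BTCUSDT", [100.0, 100.5, 101.0], stop),
    daemon=True,
)
worker.start()
worker.join()  # a live feed would keep running; here we wait for it to drain
latest_price, received_at = buffer.read("BTCUSDT")
```

The stored monotonic timestamp lets the caller measure how stale a quote is, which is the freshness quantity the latency table is concerned with.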

##### Deferred\-resolution forecasting tasks\.

OpenFinGym also accommodates price and event market forecasting on real\-time data\. This raises a different problem: the realised outcome of a long\-horizon price forecast or an event market prediction is not available at the moment of submission\. The verifier therefore records each submission together with its reference state and the resolution time in a persistent SQLite ledger\. Once the resolution time has elapsed, a resolver service fetches the realised market price or event outcome from the relevant data provider and computes the configured reward\.
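
A deferred-resolution ledger of this shape can be sketched with Python's built-in `sqlite3` module. The table layout, the function names, and the absolute-error reward are assumptions for illustration only; the paper does not publish its schema.

```python
import sqlite3
from typing import Callable


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) submissions table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS submissions (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               symbol TEXT NOT NULL,
               forecast REAL NOT NULL,
               reference_price REAL NOT NULL,
               resolve_at REAL NOT NULL,
               reward REAL
           )"""
    )
    return conn


def record_submission(conn, symbol, forecast, reference_price, resolve_at) -> int:
    """Store a forecast with its reference state; the reward stays NULL for now."""
    cur = conn.execute(
        "INSERT INTO submissions (symbol, forecast, reference_price, resolve_at) "
        "VALUES (?, ?, ?, ?)",
        (symbol, forecast, reference_price, resolve_at),
    )
    conn.commit()
    return cur.lastrowid


def resolve_due(conn, now: float, fetch_price: Callable[[str], float]) -> int:
    """Score every unresolved submission whose resolution time has passed."""
    rows = conn.execute(
        "SELECT id, symbol, forecast FROM submissions "
        "WHERE reward IS NULL AND resolve_at <= ?",
        (now,),
    ).fetchall()
    for sub_id, symbol, forecast in rows:
        realised = fetch_price(symbol)
        reward = abs(forecast - realised)  # assumed reward: absolute error
        conn.execute("UPDATE submissions SET reward = ? WHERE id = ?", (reward, sub_id))
    conn.commit()
    return len(rows)


ledger = open_ledger()
sub_id = record_submission(ledger, "BTCUSDT", forecast=103.0, reference_price=100.0, resolve_at=100.0)
too_early = resolve_due(ledger, now=50.0, fetch_price=lambda s: 105.0)
resolved = resolve_due(ledger, now=150.0, fetch_price=lambda s: 105.0)
stored_reward = ledger.execute("SELECT reward FROM submissions WHERE id = ?", (sub_id,)).fetchone()[0]
```

Passing a file path instead of `":memory:"` makes the ledger persistent, so pending forecasts survive a restart of the resolver between submission and resolution.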

| Model | Forecasting reward ↓ | Forecasting success (%) ↑ | Generation reward ↓ | Generation success (%) ↑ | Trading reward ↑ | Trading success (%) ↑ | Fraud detection reward ↑ | Fraud detection success (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 0.052 | 100 | 7.377 | 70 | 19.912 | 100 | 1.290 | 100 |
| GPT-5.1-codex-mini | 0.036 | 100 | 11.212 | 80 | 18.090 | 100 | 1.205 | 100 |
| Haiku 4.5 | 0.058 | 100 | 14.772 | 90 | 11.490 | 100 | 1.147 | 100 |
| Sonnet 4.6 | 0.053 | 100 | 7.114 | 90 | 12.044 | 100 | 1.630 | 100 |

Table 4: Agent performance across task families and LLM models\. For each task family, we report both the average reward score and the task success rate, aggregated over all tasks within that category\. The reward score is computed as a weighted combination of multiple evaluation metrics, providing a comprehensive assessment of overall agent performance\. ↓ indicates that lower is better, while ↑ indicates that higher is better\.

### 4\.2 Task execution

OpenFinGym executes tasks through containerised agent environments and a verifier\-based scoring service\. Each agent rollout runs inside a container that holds the task instructions, the training data \(both inputs and labels\), test inputs, task interface, and the Python libraries required for modelling and submission\. The held\-out test labels are explicitly excluded from the container’s bind mounts and are therefore inaccessible to the agent, removing test\-set leakage during runtime by construction\.

The verifier is a host\-side process that loads the held\-out test set ground truth into memory at startup and exposes a submission API\. The API has a rate limit that prevents the reverse engineering of the held\-out ground truth\. The verifier process can be shared across every agent container running the same task concurrently, making the setup suitable for large\-scale batch evaluation, batch rollouts for RL, and trajectory collection for SFT\. On each call, the verifier validates the submission format, computes the reward host\-side, and returns the per\-metric scores together with a weighted total\.
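
The verifier behaviour described above (labels held in memory, format validation, a limit on submissions, per-metric scores plus a weighted total) can be sketched in-process as follows. The class name `HostVerifier`, the per-agent call quota used as the rate limit, and the choice of metrics are assumptions; the real service exposes an API across the container boundary, which this sketch does not model.

```python
import math
from typing import Callable, Dict, Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


class HostVerifier:
    """Hypothetical verifier: holds hidden labels and scores submissions."""

    def __init__(
        self,
        labels: Sequence[float],
        metrics: Dict[str, Callable[[Sequence[float], Sequence[float]], float]],
        weights: Dict[str, float],
        max_calls_per_agent: int = 5,
    ) -> None:
        self._labels = list(labels)          # never returned to callers
        self._metrics = metrics
        self._weights = weights
        self._max_calls = max_calls_per_agent
        self._calls: Dict[str, int] = {}     # scored submissions per agent

    def submit(self, agent_id: str, predictions: Sequence[float]) -> dict:
        used = self._calls.get(agent_id, 0)
        if used >= self._max_calls:
            raise RuntimeError("submission quota exhausted")
        if len(predictions) != len(self._labels):
            raise ValueError("wrong number of predictions")
        if not all(isinstance(p, (int, float)) and math.isfinite(p) for p in predictions):
            raise ValueError("predictions must be finite numbers")
        self._calls[agent_id] = used + 1
        scores = {name: fn(self._labels, predictions) for name, fn in self._metrics.items()}
        total = sum(self._weights[name] * value for name, value in scores.items())
        return {"metrics": scores, "total": total}


verifier = HostVerifier(
    labels=[1.0, 2.0],
    metrics={"rmse": rmse, "mae": mae},
    weights={"rmse": 0.5, "mae": 0.5},
    max_calls_per_agent=2,
)
result = verifier.submit("agent-a", [1.0, 4.0])
```

Capping the number of scored calls limits how much an agent can learn about the hidden labels by probing the scores, and keeping the quota per agent lets one verifier instance serve many concurrent rollouts of the same task.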

OpenFinGym supports two interaction surfaces\. Batch tasks, such as forecasting, use a one\-shot submission interface: the agent submits all predictions in a single call, and they are scored immediately or after deferred resolution\. Gym\-loop tasks, such as trading, use a sequential interface: the agent repeatedly observes state, takes actions, accumulates a trajectory, and is evaluated at the end of the episode\.
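
The gym-loop surface can be illustrated with a toy environment that follows the familiar reset/step convention. `ToyTradingEnv`, its three-valued action space, and the PnL-based episode score are assumptions made for this sketch and do not reproduce OpenFinGym's task interface.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class ToyTradingEnv:
    """Hypothetical sequential task: hold a position of -1, 0, or +1 each step."""

    def __init__(self, prices: Sequence[float]) -> None:
        if len(prices) < 2:
            raise ValueError("need at least two prices")
        self.prices = list(prices)
        self.t = 0
        self.position = 0
        self.pnl = 0.0

    def reset(self) -> Dict[str, float]:
        self.t, self.position, self.pnl = 0, 0, 0.0
        return {"price": self.prices[0], "position": 0}

    def step(self, action: int) -> Tuple[Dict[str, float], bool]:
        """Apply the target position, then advance one price tick."""
        self.position = action
        self.t += 1
        self.pnl += self.position * (self.prices[self.t] - self.prices[self.t - 1])
        done = self.t == len(self.prices) - 1
        return {"price": self.prices[self.t], "position": self.position}, done


def run_episode(env: ToyTradingEnv, policy: Callable[[Dict[str, float]], int]):
    """Observe, act, and accumulate a trajectory; score only at episode end."""
    obs = env.reset()
    trajectory: List[Tuple[Dict[str, float], int]] = []
    done = False
    while not done:
        action = policy(obs)
        trajectory.append((obs, action))
        obs, done = env.step(action)
    return env.pnl, trajectory


episode_pnl, trajectory = run_episode(ToyTradingEnv([100.0, 101.0, 103.0, 102.0]), lambda obs: 1)
```

An always-long policy on this price path earns 1 + 2 − 1 = 2 over three steps. A batch task, by contrast, would skip the loop entirely and hand all predictions to the verifier in one call.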

| Model | Commodity reward ↓ | Commodity success ↑ | Crypto reward ↓ | Crypto success ↑ | Equity reward ↓ | Equity success ↑ | Treasury reward ↓ | Treasury success ↑ | FX reward ↓ | FX success ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3 1.7b | – | 0 | – | 0 | – | 0 | – | 0 | – | 0 |
| SFT | 0.023 | 60 | 0.539 | 30 | 0.016 | 60 | 1.152 | 60 | 0.006 | 40 |
| RL | 0.020 | 100 | 0.007 | 100 | 0.015 | 100 | 0.099 | 100 | 0.006 | 100 |
| Qwen3 4b | – | 0 | – | 0 | – | 0 | – | 0 | – | 0 |
| SFT | 0.019 | 70 | 0.007 | 50 | 0.015 | 100 | 1.055 | 60 | 0.006 | 90 |
| RL | 0.018 | 100 | 0.006 | 100 | 0.014 | 100 | 0.090 | 100 | 0.006 | 100 |
| Qwen3 8b | – | 0 | – | 0 | – | 0 | – | 0 | – | 0 |
| SFT | 0.019 | 90 | 0.007 | 90 | 0.015 | 70 | 0.803 | 70 | 0.006 | 70 |
| RL | 0.018 | 100 | 0.006 | 100 | 0.014 | 100 | 0.090 | 100 | 0.005 | 100 |

Table 5: Performance of the base, SFT, and RL checkpoints on the five held\-out forecasting test tasks, averaged over 10 random seeds per task\. The first row of each model is the base checkpoint\. We report the mean reward \(↓ lower is better\) and the task success rate \(%, ↑ higher is better\)\. Base success rates are uniformly 0%\.

## 5 Numerical Results

### 5\.1 Agent Performance

We evaluate a representative subset of OpenFinGym comprising 24 tasks across four families: 5 forecasting tasks, 10 market\-generation tasks, 2 trading tasks, and 7 fraud\-detection tasks\. Results are summarised in Table [4](https://arxiv.org/html/2606.26350#S4.T4)\. For each task family, we report both the average reward score and the task success rate\. The success rate measures the percentage of tasks for which the agent successfully generates executable submissions that satisfy the task interface and can be evaluated by the OpenFinGym verifier harness\.

Several observations emerge from the results\. First, all evaluated agents achieve 100% success rates on forecasting, trading, and fraud\-detection tasks, indicating that frontier models can reliably interact with the OpenFinGym interfaces and produce executable quantitative\-finance solutions\. In forecasting, GPT\-5\.1\-codex\-mini performs best with a reward of 0\.036, clearly ahead of the other agents whose rewards range from 0\.052 to 0\.058\. In market generation, Sonnet 4\.6 gives the strongest result with the lowest reward of 7\.114, slightly better than GPT\-4o at 7\.377 and substantially better than the remaining agents\. For trading, GPT\-4o achieves the highest reward of 19\.912, followed by GPT\-5\.1\-codex\-mini at 18\.090, indicating stronger trading\-strategy quality among the evaluated agents\. In fraud detection, Sonnet 4\.6 obtains the best reward of 1\.630, outperforming GPT\-4o by 26\.4%, GPT\-5\.1\-codex\-mini by 35\.3%, and Haiku 4\.5 by 42\.1%\.

Overall, Table [4](https://arxiv.org/html/2606.26350#S4.T4) shows that OpenFinGym reveals task\-specific agent strengths rather than a single dominant model: GPT\-5\.1\-codex\-mini leads on forecasting, Sonnet 4\.6 leads on market generation and fraud detection, and GPT\-4o leads on trading\. These results demonstrate that OpenFinGym can differentiate agent capabilities across heterogeneous quantitative\-finance settings while supporting reliable large\-scale automated evaluation\.

### 5\.2 SFT & RL Results

OpenFinGym is integrated with both SFT and RL post\-training\. To demonstrate its utility as a post\-training environment for quant agents, we focus on forecasting tasks under a single\-shot generation setting, in which the agent emits its full submission in one pass without iterative tool use\.

Our SFT corpus consists of 189 single\-shot trajectories collected from Claude Opus 4\.7 at low reasoning effort over 21 forecasting training tasks\. Agents are encouraged to use classical machine\-learning approaches rather than write neural\-network architectures and training loops from scratch, as the latter are harder for small base models to get right within a single shot\.

We fine\-tune three Qwen3 backbones \(Yang et al\., [2025](https://arxiv.org/html/2606.26350#bib.bib55)\) of sizes 1\.7B, 4B, and 8B with LoRA \(Hu et al\., [2022](https://arxiv.org/html/2606.26350#bib.bib56)\) for two to four epochs\. The merged LoRA weights then serve as the initialization for RL, where each SFT checkpoint is post\-trained with GRPO \(Shao et al\., [2024](https://arxiv.org/html/2606.26350#bib.bib57)\) under SkyRL \(Cao et al\., [2025](https://arxiv.org/html/2606.26350#bib.bib58)\)\. With a batch size of 4 prompts per gradient step and 8 rollouts per prompt, training runs for 50–100 epochs \(∼250–500 gradient steps\), with the reward defined as the negative of the weighted average of per\-task metrics\. Full hyperparameter settings are listed in Appendix [B](https://arxiv.org/html/2606.26350#A2)\.
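To make the GRPO setup concrete, the sketch below shows the group\-relative advantage that GRPO computes over the rollouts of one prompt, plus the arithmetic linking epochs to gradient steps\. The reward values are made up, the use of the population standard deviation is an implementation choice, and the step count assumes that one epoch is a single pass over the 21 training tasks with the incomplete final batch dropped; none of these details are confirmed by the text\.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalise each rollout's reward against the other rollouts of the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def gradient_steps(epochs: int, n_prompts: int = 21, prompts_per_step: int = 4) -> int:
    # Assumption: one epoch is one pass over the prompts, incomplete batch dropped.
    return epochs * (n_prompts // prompts_per_step)


# 8 rollouts for one prompt (illustrative values; -6.0 marks a failed submission).
rollout_rewards = [-0.02, -0.03, -6.0, -0.02, -0.05, -6.0, -0.02, -0.04]
advantages = group_relative_advantages(rollout_rewards)
```

Under these assumptions `gradient_steps(50)` and `gradient_steps(100)` give 250 and 500, matching the range quoted above, and failed rollouts receive negative advantages relative to scoreable ones in the same group\.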

Table [5](https://arxiv.org/html/2606.26350#S4.T5) reports mean reward and success rate on the five held\-out forecasting tasks\. Base Qwen3 of every size fails to produce an executable submission on any task\. SFT closes most of this gap, lifting mean success to roughly 50–80%\. RL then boosts every model to a 100% success rate on every task, with mean reward at or below the corresponding SFT value across the board, including order\-of\-magnitude reductions on Treasury for all three sizes and on Crypto for the 1\.7B model\. We view these results as a demonstration of OpenFinGym’s SFT and RL integration\.
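As a quick check of the "roughly 50–80%" figure, the snippet below averages the per\-task SFT success rates from Table 5 \(ordered Commodity, Crypto, Equity, Treasury, FX\)\. The variable names are ours\.

```python
# SFT success rates (%) per held-out task, copied from Table 5.
sft_success = {
    "Qwen3 1.7B": [60, 30, 60, 60, 40],
    "Qwen3 4B": [70, 50, 100, 60, 90],
    "Qwen3 8B": [90, 90, 70, 70, 70],
}

# Mean success across the five tasks for each backbone.
mean_success = {model: sum(rates) / len(rates) for model, rates in sft_success.items()}
```

The means come out at 50%, 74%, and 78% for the 1\.7B, 4B, and 8B backbones respectively\.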

## 6 Conclusion

We introduced OpenFinGym, a comprehensive benchmark and training environment for quant agents\. By combining executable task packages, verification mechanisms, and an automated task\-construction pipeline, this gym supports rigorous benchmarking and downstream agent training across diverse financial workflows\.

Future work on OpenFinGym includes extending it to a broader range of financial task types, including additional execution settings, risk\-management workflows, and decision\-making problems beyond the current task families\. The existing task\-generation pipeline already provides a scalable path for this expansion\. Another future direction is to expose OpenFinGym’s authenticated verifier interface as a public evaluation service, allowing external users to evaluate their agents without downloading the full benchmark or accessing test labels\. This would enable fair comparison, support leaderboard\-style evaluation, and help establish OpenFinGym as a community platform for quant\-agent development\.

## 7 Limitations

While OpenFinGym provides a unified environment for evaluating and training quant agents, our current SFT and RL results should be viewed as an initial demonstration rather than a fully optimized training recipe\. Due to time and computational constraints, the trajectories used for post\-training are limited in both scale and quality, and the backbone agent used in our experiments is relatively lightweight\. These choices constrain the attainable gains from supervised fine\-tuning and reinforcement learning\. Future work will scale trajectory collection, improve trajectory filtering and reward\-based data selection, incorporate stronger backbone models, and study larger\-scale RL post\-training across more diverse financial tasks\.

## 8 Acknowledgements

KZ was supported by the EPSRC Centre for Doctoral Training in Mathematical Modelling, Analysis and Computation \(MAC\-MIGS\) funded by the UK Engineering and Physical Sciences Research Council \(grant EP/S023291/1\), Heriot\-Watt University and the University of Edinburgh\. LJ, JL and LS acknowledge the support of the UKRI Prosperity Partnership Scheme \(FAIR\) under the EPSRC Grant EP/V056883/1 and The Alan Turing Institute\. WY and HN are supported by the EPSRC Program Grant \[Grant No\. UKRI1010\] entitled “High order mathematical and computational infrastructure for streamed data that enhance contemporary generative and large language models”\. WG is supported by the EPSRC through the UCL EPSRC Landscape Award \(UELA\) under Grant EP/Z534882/1\.

## References

- S\. Amrouni, A\. Moulin, J\. Vann, S\. Vyetrenko, T\. Balch, and M\. Veloso \(2021\) ABIDES\-gym: gym environments for multi\-agent discrete event simulation and application to financial markets\. In Proceedings of the Second ACM International Conference on AI in Finance, pp\. 1–9\.
- Y\. Ang, Y\. Bao, L\. Jiang, J\. Tao, A\. K\. Tung, L\. Szpruch, and H\. Ni \(2025\) Structured agentic workflows for financial time\-series modelling with llms and reflective feedback\. In Proceedings of the 6th ACM International Conference on AI in Finance, pp\. 924–932\.
- Y\. Bao, X\. Xi, X\. Liu, W\. Ge, L\. Jiang, K\. Zhang, R\. Khraishi, Y\. Ang, A\. K\. Tung, L\. Szpruch, and H\. Ni \(2026\) MOSAIC: modular orchestration for structured agentic intelligence and composition\. arXiv preprint arXiv:2606\.00708\.
- M\. Benhenda \(2026\) Look\-ahead\-bench: a standardized benchmark of look\-ahead bias in point\-in\-time llms for finance\. External Links: 2601\.13770, [Link](https://arxiv.org/abs/2601.13770)\.
- Y\. Cai, X\. Li, M\. Goswami, M\. Wiliński, G\. Welter, and A\. Dubrawski \(2025\) TimeSeriesGym: a scalable benchmark for \(time series\) machine learning engineering agents\. External Links: 2505\.13291, [Link](https://arxiv.org/abs/2505.13291)\.
- S\. Cao, S\. Hegde, D\. Li, T\. Griggs, S\. Liu, E\. Tang, J\. Pan, X\. Wang, A\. Malik, G\. Neubig, K\. Hakhamaneshi, R\. Liaw, P\. Moritz, M\. Zaharia, J\. E\. Gonzalez, and I\. Stoica \(2025\) SkyRL\-v0: train real\-world long\-horizon agents via reinforcement learning\.
- Y\. Chen, Z\. Yao, Y\. Liu, J\. Ye, J\. Yu, L\. Hou, and J\. Li \(2025\) StockBench: can llm agents trade stocks profitably in real\-world markets? External Links: 2510\.02209, [Link](https://arxiv.org/abs/2510.02209)\.
- I\. Chevyrev and H\. Oberhauser \(2022\) Signature moments to characterize laws of stochastic processes\. Journal of Machine Learning Research 23 \(176\), pp\. 1–42\.
- R\. Cont, M\. Cucuringu, R\. Xu, and C\. Zhang \(2026\) Tail\-gan: learning to simulate tail risk scenarios\. Management Science 72 \(4\), pp\. 2917–2936\.
- F\. X\. Diebold and K\. Yilmaz \(2012\) Better to give than to receive: predictive directional measurement of volatility spillovers\. International Journal of Forecasting 28 \(1\), pp\. 57–66\.
- F\. X\. Diebold and R\. S\. Mariano \(1995\) Comparing predictive accuracy\. Journal of Business & Economic Statistics 13 \(3\), pp\. 253–263\.
- B\. Dinda \(2024\) Gated recurrent neural network with tpe bayesian optimization for enhancing stock index prediction accuracy\. arXiv preprint arXiv:2406\.02604\.
- C\. Esteban, S\. L\. Hyland, and G\. Rätsch \(2017\) Real\-valued \(medical\) time series generation with recurrent conditional gans\. arXiv preprint arXiv:1706\.02633\.
- European Insurance and Occupational Pensions Authority \(EIOPA\) \(2026\) Risk\-free interest rate term structures\. Note: Official monthly Solvency II risk\-free interest rate term structures\. Accessed: 2026\-05\-26\. External Links: [Link](https://www.eiopa.europa.eu/tools-and-data/risk-free-interest-rate-term-structures_en)\.
- S\. Gu, B\. Kelly, and D\. Xiu \(2020a\) Empirical asset pricing via machine learning\. The Review of Financial Studies 33 \(5\), pp\. 2223–2273\.
- S\. Gu, B\. Kelly, and D\. Xiu \(2020b\) Empirical asset pricing via machine learning\. The Review of Financial Studies 33 \(5\), pp\. 2223–2273\.
- Harbor Framework Team \(2026\) Harbor: A framework for evaluating and optimizing agents and models in container environments\. External Links: [Link](https://github.com/harbor-framework/harbor)\.
- C\. D\. Hounwanou and Y\. U\. Gaba \(2025\) Deep generative models for synthetic financial data: applications to portfolio and risk modeling: applications of synthetic financial data in portfolio and risk modeling\. arXiv preprint arXiv:2512\.21798\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\) LoRA: low\-rank adaptation of large language models\. In International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=nZeVKeeFYf9)\.
- P\. Islam, A\. Kannappan, D\. Kiela, R\. Qian, N\. Scherrer, and B\. Vidgen \(2023\) Financebench: a new benchmark for financial question answering\. arXiv preprint arXiv:2311\.11944\.
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. Narasimhan \(2024\) Swe\-bench: can language models resolve real\-world github issues? In International Conference on Learning Representations, Vol\. 2024, pp\. 54107–54157\.
- W\. jun Gu, Y\. hao Zhong, S\. zun Li, C\. song Wei, L\. ting Dong, Z\. yue Wang, and C\. Yan \(2024\) Predicting stock prices with finbert\-lstm: integrating news sentiment analysis\. In Proceedings of the 2024 8th International Conference on Cloud and Big Data Computing, ICCBDC ’24, New York, NY, USA, pp\. 67–72\. External Links: ISBN 9798400717253, [Link](https://doi.org/10.1145/3694860.3694870), [Document](https://dx.doi.org/10.1145/3694860.3694870)\.
- K\. Kashif and R\. Ślepaczuk \(2025\) LSTM\-arima as a hybrid approach in algorithmic investment strategies\. Knowledge\-Based Systems 320, pp\. 113563\.
- A\. Khosravi, S\. Nahavandi, D\. Creighton, and A\. F\. Atiya \(2010\) Lower upper bound estimation method for construction of neural network\-based prediction intervals\. IEEE Transactions on Neural Networks 22 \(3\), pp\. 337–346\.
- P\. N\. Kumar, N\. Umeorah, and A\. Alochukwu \(2024\) Dynamic graph neural networks for enhanced volatility prediction in financial markets\. arXiv preprint arXiv:2410\.16858\.
- P\. H\. Kupiec et al\. \(1995\) Techniques for verifying the accuracy of risk measurement models\.
- H\. Li, Y\. Cao, Y\. Yu, S\. R\. Javaji, Z\. Deng, Y\. He, Y\. Jiang, Z\. Zhu, K\. Subbalakshmi, J\. Huang, et al\. \(2025a\) Investorbench: a benchmark for financial decision\-making tasks with llm\-based agent\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 2509–2525\.
- K\. Li, T\. Yang, M\. Zhou, J\. Meng, S\. Wang, Y\. Wu, B\. Tan, H\. Song, L\. Pan, F\. Yu, et al\. \(2024\) Sefraud: graph\-based self\-explainable fraud detection via interpretative mask learning\. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp\. 5329–5338\.
- X\. Li, Y\. Zeng, X\. Xing, J\. Xu, and X\. Xu \(2025b\) Profit mirage: revisiting information leakage in llm\-based financial agents\. External Links: 2510\.07920, [Link](https://arxiv.org/abs/2510.07920)\.
- X\. Liu, H\. Yu, H\. Zhang, Y\. Xu, X\. Lei, H\. Lai, Y\. Gu, H\. Ding, K\. Men, K\. Yang, et al\. \(2024\) Agentbench: evaluating llms as agents\. In International Conference on Learning Representations, Vol\. 2024, pp\. 52989–53046\.
- X\. Liu, Z\. Xia, J\. Rui, J\. Gao, H\. Yang, M\. Zhu, C\. Wang, Z\. Wang, and J\. Guo \(2022\) FinRL\-meta: market environments and benchmarks for data\-driven financial reinforcement learning\. Advances in Neural Information Processing Systems 35, pp\. 1835–1849\.
- H\. Lou, S\. Li, and H\. Ni \(2023\) PCF\-gan: generating sequential data via the characteristic function of measures on the path space\. Advances in Neural Information Processing Systems 36, pp\. 39755–39781\.
- W\. K\. Newey and K\. D\. West \(1986\) A simple, positive semi\-definite, heteroskedasticity and autocorrelation consistent covariance matrix\.
- F\. Nie, J\. Wang, H\. Hua, F\. Bianchi, Y\. Kwon, Z\. Qi, O\. Queen, S\. Zhu, and J\. Zou \(2026\) DSGym: a holistic framework for evaluating and training data science agents\. arXiv preprint arXiv:2601\.16344\.
- L\. Qian, X\. Peng, Y\. Wang, V\. J\. Zhang, H\. He, H\. Smith, Y\. Han, Y\. He, H\. Li, Y\. Cao, Y\. Yu, A\. Lopez\-Lira, P\. Lu, J\. Nie, G\. Xiong, J\. Huang, and S\. Ananiadou \(2025\) When agents trade: live multi\-market trading benchmark for llm agents\. External Links: 2510\.11695, [Link](https://arxiv.org/abs/2510.11695)\.
- E\. Rahimikia, H\. Ni, and W\. Wang \(2025\) Re \(visiting\) time series foundation models in finance\. arXiv preprint arXiv:2511\.18578\.
- R\. Richman and S\. Scognamiglio \(2024\) Multiple yield curve modeling and forecasting using deep learning\. ASTIN Bulletin: The Journal of the IAA 54 \(3\), pp\. 463–494\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu, et al\. \(2024\) Deepseekmath: pushing the limits of mathematical reasoning in open language models\. arXiv preprint arXiv:2402\.03300\.
- S\. Sun, M\. Qin, W\. Zhang, H\. Xia, C\. Zong, J\. Ying, Y\. Xie, L\. Zhao, X\. Wang, and B\. An \(2023\) TradeMaster: a holistic quantitative trading platform empowered by reinforcement learning\. Advances in Neural Information Processing Systems 36, pp\. 59047–59061\.
- J\. Tao, H\. Ni, and C\. Liu \(2024\) High rank path development: an approach to learning the filtration of stochastic processes\. Advances in Neural Information Processing Systems 37, pp\. 115309–115350\.
- S\. Wang, T\. Ji, J\. He, M\. Almutairi, D\. Wang, L\. Wang, M\. Zhang, and C\. Lu \(2024\) AMA\-lstm: pioneering robust and fair financial audio analysis for stock volatility prediction\. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 6: Industry Track\), pp\. 379–386\.
- Q\. Xie, W\. Han, Z\. Chen, R\. Xiang, X\. Zhang, Y\. He, M\. Xiao, D\. Li, Y\. Dai, D\. Feng, Y\. Xu, H\. Kang, Z\. Kuang, C\. Yuan, K\. Yang, Z\. Luo, T\. Zhang, Z\. Liu, G\. Xiong, Z\. Deng, Y\. Jiang, Z\. Yao, H\. Li, Y\. Yu, G\. Hu, J\. Huang, X\. Liu, A\. Lopez\-Lira, B\. Wang, Y\. Lai, H\. Wang, M\. Peng, S\. Ananiadou, and J\. Huang \(2024\) Finben: a holistic financial benchmark for large language models\. Advances in Neural Information Processing Systems 37, pp\. 95716–95743\.
- Q\. Xie, W\. Han, X\. Zhang, Y\. Lai, M\. Peng, A\. Lopez\-Lira, and J\. Huang \(2023\) Pixiu: a large language model, instruction data and evaluation benchmark for finance\. arXiv preprint arXiv:2306\.05443\.
- R\. Xu, Z\. Wang, R\. Fan, and P\. Liu \(2024\) Benchmarking benchmark leakage in large language models\. arXiv preprint arXiv:2404\.18824\.
- Yahoo Finance \(2026\) Historical market data\. Note: Accessed: 2026\-05\-20\. External Links: [Link](https://finance.yahoo.com/)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\.
- X\. Yang, W\. Liu, D\. Zhou, J\. Bian, and T\. Liu \(2020\) Qlib: an ai\-oriented quantitative investment platform\. arXiv preprint arXiv:2009\.11189\.
- J\. Yoon, D\. Jarrett, and M\. Van der Schaar \(2019\) Time\-series generative adversarial networks\. Advances in Neural Information Processing Systems 32\.
- H\. Yu, F\. Li, and J\. You \(2025a\) LiveTradeBench: seeking real\-world alpha with large language models\. External Links: 2511\.03628, [Link](https://arxiv.org/abs/2511.03628)\.
- Y\. Yu, H\. Li, Z\. Chen, Y\. Jiang, Y\. Li, J\. W\. Suchow, D\. Zhang, and K\. Khashanah \(2025b\) Finmem: a performance\-enhanced llm trading agent with layered memory and character design\. IEEE Transactions on Big Data\.
- Y\. Yu, Z\. Yao, H\. Li, Z\. Deng, Y\. Jiang, Y\. Cao, Z\. Chen, J\. W\. Suchow, Z\. Cui, R\. Liu, et al\. \(2024\) Fincon: a synthesized llm multi\-agent system with conceptual verbal reinforcement for enhanced financial decision making\. Advances in Neural Information Processing Systems 37, pp\. 137010–137045\.
- W\. Zhang, L\. Zhao, H\. Xia, S\. Sun, J\. Sun, M\. Qin, X\. Li, Y\. Zhao, Y\. Zhao, X\. Cai, et al\. \(2024\) A multimodal foundation agent for financial trading: tool\-augmented, diversified, and generalist\. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp\. 4314–4325\.
- L\. Zheng, J\. Birge, H\. Wu, Y\. Zhang, and J\. He \(2025\) Cluster aware graph anomaly detection\. In Proceedings of the ACM on Web Conference 2025, pp\. 1771–1782\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, et al\. \(2024\) Webarena: a realistic web environment for building autonomous agents\. In International Conference on Learning Representations, Vol\. 2024, pp\. 15585–15606\.
- C\. Zong, C\. Wang, M\. Qin, L\. Feng, X\. Wang, and B\. An \(2024\) Macrohft: memory augmented context\-aware reinforcement learning on high frequency trading\. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp\. 4712–4721\.

## Appendix A Paper Trade Engine

Trading tasks require additional execution and market\-data infrastructure that the automated pipeline cannot synthesise from a paper alone\. We therefore build a paper trade engine so that constructing a trading task in the task\-construction pipeline reduces to a configuration change, e\.g\. of asset type, time window, or data resolution\. The engine is organised around a single executor interface that plugs into the existing financial data providers\. The default executor is an in\-process simulator that updates portfolio state directly against the latest asset price and applies configurable slippage on market orders together with per\-trade transaction costs\.

Offline trading tasks step over a retrieved historical bar dataset\. The engine accepts market, limit, stop, and stop\-limit orders under immediate\-or\-cancel or good\-till\-cancelled time\-in\-force, supports batch multi\-symbol submissions and explicit order cancellation, and enforces account\-level constraints such as cash availability for long entries and a $1\times$ equity cap on short leverage\. Orders rejected for constraint violations are surfaced through the next observation step rather than raised as exceptions, letting the agent learn from rejections without crashing the rollout\.
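The rejection behaviour can be illustrated with a toy account\. This is a sketch under simplifying assumptions and not the engine itself: `ToyAccount` is a hypothetical name, it handles one symbol and market orders only, fills at the bar price without slippage or costs, and applies the cash check to every buy\.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyAccount:
    """Hypothetical single-symbol account that fills market orders at the bar price."""
    cash: float
    position: float = 0.0                                  # signed share count
    pending: List[float] = field(default_factory=list)     # signed order quantities

    def submit(self, qty: float) -> None:
        self.pending.append(qty)        # never raises; checked at the next step

    def step(self, price: float) -> Dict:
        rejections = []
        for qty in self.pending:
            new_position = self.position + qty
            new_cash = self.cash - qty * price
            equity = self.cash + self.position * price
            if qty > 0 and new_cash < 0:
                # Simplification: every buy must be fully funded by cash.
                rejections.append({"qty": qty, "reason": "insufficient cash"})
            elif qty < 0 and new_position < 0 and -new_position * price > 1.0 * equity:
                # Short notional may not exceed 1x account equity.
                rejections.append({"qty": qty, "reason": "short leverage cap"})
            else:
                self.position, self.cash = new_position, new_cash
        self.pending = []
        return {"price": price, "cash": self.cash, "position": self.position,
                "rejections": rejections}
```

Returning rejections inside the observation keeps the episode running, so an agent that over\-orders sees the reason on its next step instead of terminating the rollout with an exception\.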

Realtime trading tasks act on fresh market data and require a latency\-sensitive design\. Rather than the naive approach of pulling each observation from a REST endpoint, we subscribe to the data provider’s per\-trade websocket stream and aggregate the trades into OHLCV bars locally\. This decision is motivated by two factors\. First, the per\-update latency is roughly halved relative to a REST round\-trip, and the achievable refresh rate is over an order of magnitude higher in our measurements \(Table [3](https://arxiv.org/html/2606.26350#S4.T3)\)\. Second, the websocket path avoids the tight rate limits that the REST API imposes on frequent data fetches\. Each symbol’s prices and order book are served from a market\-data buffer that a background daemon thread continuously refreshes from the streaming subscription, so an ordinary gym observation step is a constant\-time read from memory\. A REST refresh is dispatched only as a fallback when the buffer’s staleness exceeds a small per\-quantity budget; when one is needed, per\-symbol calls are fanned out in parallel through a persistent worker pool so that the wall\-clock cost is one round\-trip rather than one per symbol\.
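The two ideas in this paragraph, local trade\-to\-bar aggregation and a staleness\-gated fallback, can be sketched as below\. All names \(`Bar`, `aggregate_trades`, `read_price`, `rest_fetch`\) and the tuple layouts are hypothetical; the real engine reads from a websocket stream and a threaded buffer, which are omitted here\.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class Bar:
    start: int       # bar start time in seconds
    open: float
    high: float
    low: float
    close: float
    volume: float


def aggregate_trades(trades: Iterable[Tuple[float, float, float]],
                     bar_seconds: int = 60) -> List[Bar]:
    """Fold time-ordered (timestamp_s, price, size) trades into OHLCV bars."""
    bars: Dict[int, Bar] = {}
    for ts, price, size in trades:
        start = int(ts // bar_seconds) * bar_seconds
        bar = bars.get(start)
        if bar is None:
            bars[start] = Bar(start, price, price, price, price, size)
        else:
            bar.high = max(bar.high, price)
            bar.low = min(bar.low, price)
            bar.close = price
            bar.volume += size
    return [bars[k] for k in sorted(bars)]


def read_price(buffer: Dict[str, Tuple[float, float]], symbol: str, now: float,
               max_age_s: float, rest_fetch: Callable[[str], float]) -> float:
    """Constant-time buffer read; fall back to REST only when the entry is stale."""
    price, updated_at = buffer[symbol]
    if now - updated_at > max_age_s:
        price = rest_fetch(symbol)          # slow path, expected to be rare
        buffer[symbol] = (price, now)
    return price
```

In this sketch the observation path touches only an in\-memory dictionary unless the entry is older than `max_age_s`, which mirrors the fallback rule described above\.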

As an alternative to the in\-process simulator, and selectable by configuration alone, the engine can route the same orders to a real broker’s paper\-trading sandbox and mirror the resulting account state locally\. The local mirror keeps portfolio queries \(e\.g\. positions, cash, pending orders, and PnL\) as constant\-time lookups on the agent side\. It is kept honest by treating the broker’s order\-event stream as the primary update path, with a periodic REST reconciler as gap\-fill; the reconciler takes care not to overwrite confirmed local fills with the lagging account aggregate\.

## Appendix B Training Integrations

This appendix lists the hyperparameters and implementation details of the SFT and RL results in Section [5\.2](https://arxiv.org/html/2606.26350#S5.SS2)\.

##### Trajectory collection\.

The 189 SFT trajectories are produced over 21 training tasks by a one\-shot agent with a Claude Opus 4\.7 backbone at low reasoning effort\. The agent emits a single executable per task, which the verifier executes, returning an aggregated reward\. The system prompt forbids training neural networks from scratch and steers the agent toward ridge, lasso, HistGradientBoosting, LightGBM, ARIMA/VAR, exponential smoothing, and GARCH\. Trajectories from the five held\-out test tasks are dropped\.

##### Supervised fine\-tuning\.

Each Qwen3 backbone is adapted with a LoRA module targeting every linear projection: rank 32 with $\alpha=64$ for the 1\.7B model, and rank 16 with $\alpha=32$ for the 4B and 8B models; dropout is 0\.05 for 1\.7B and 4B, and 0\.10 for 8B\. Training uses a global batch size of 4, a maximum sequence length of 8192, and AdamW with a warmup\-then\-cosine schedule\. Per\-size learning rates are $2\times10^{-4}$, $1.5\times10^{-4}$, and $1\times10^{-4}$; the three models are trained for 184, 138, and 92 optimization steps, corresponding to roughly four, three, and two epochs over the trajectory corpus\. This takes less than 0\.5 GPU\-hours on a single GH200\. Because these results are a demonstration, we did not tune the SFT hyperparameters\. The trained LoRA weights are then merged into the full\-parameter checkpoint that initializes RL\.
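The stated step counts can be related to epochs with a short calculation\. It assumes that every optimization step consumes exactly one global batch of 4 trajectories and that the corpus is the full set of 189 trajectories; the variable names are ours\.

```python
CORPUS_SIZE = 189      # SFT trajectories
GLOBAL_BATCH = 4       # trajectories per optimization step

steps_by_model = {"1.7B": 184, "4B": 138, "8B": 92}

# Epochs = trajectories consumed / corpus size.
epochs_by_model = {
    size: steps * GLOBAL_BATCH / CORPUS_SIZE for size, steps in steps_by_model.items()
}

for size, epochs in epochs_by_model.items():
    print(f"Qwen3 {size}: {epochs:.2f} epochs")
```

This gives about 3\.89, 2\.92, and 1\.95 epochs, consistent with "roughly four, three, and two"\. The three step counts are also exact multiples of 46 \(4, 3, and 2 times\), which suggests 46 steps per epoch\.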

##### Reinforcement learning\.

GRPO runs under SkyRL and vLLM with full\-parameter updates\. Rollouts use temperature 1\.0 and top\-$p$ 1\.0, with sequence length capped at 8192 tokens\. We apply a KL penalty of 0\.01 against a frozen reference policy together with token\-level truncated importance sampling clipped at 2\.0\. Training takes roughly 1–4 GPU\-hours depending on the model size\. The RL system prompt is identical to the one used at trajectory collection and SFT\.
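The truncated importance\-sampling term can be sketched as follows\. The text does not spell out the details, so this assumes that the ratio is taken per token between the probability under the trainer and the probability under the rollout engine that sampled the token, and that 2\.0 is an upper cap on that ratio\. The function name and the example probabilities are hypothetical\.

```python
import math
from typing import List

TIS_CAP = 2.0  # clip value quoted in the text


def truncated_is_weights(logp_trainer: List[float], logp_rollout: List[float],
                         cap: float = TIS_CAP) -> List[float]:
    """Per-token weight min(p_trainer / p_rollout, cap), computed from log-probabilities."""
    return [min(math.exp(lt - lr), cap) for lt, lr in zip(logp_trainer, logp_rollout)]


# Illustrative token probabilities: the raw ratios are about 1.0, 3.0, and 0.5.
weights = truncated_is_weights(
    [math.log(0.5), math.log(0.9), math.log(0.2)],
    [math.log(0.5), math.log(0.3), math.log(0.4)],
)
```

Only the second token is affected by the cap here; weights at or below the cap pass through unchanged, so the truncation bounds the influence of tokens that the trainer finds far more likely than the sampler did\.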

##### Reward transform\.

The verifier’s per\-task scalar is a lower\-is\-better weighted aggregate of task\-specific metrics\. We remap it to a maximization\-friendly RL reward by negating and clipping: scoreable trials receiver=−min⁡\(loss,5\)r=\-\\min\(\\mathrm\{loss\},\\,5\), while failed or unparseable submissions receive a fixedr=−6r=\-6so that any scoreable trial is strictly preferred over any failure\.
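The transform is small enough to write out directly\. In the sketch below the function and constant names are ours, and treating `None` or NaN as the signal for a failed or unparseable submission is an assumption about how failures are represented\.

```python
import math
from typing import Optional

LOSS_CLIP = 5.0          # losses above this are clipped
FAILURE_REWARD = -6.0    # fixed reward for failed or unparseable submissions


def rl_reward(loss: Optional[float]) -> float:
    """Map the verifier's lower-is-better loss to a higher-is-better RL reward."""
    if loss is None or math.isnan(loss):   # assumed encoding of a failed trial
        return FAILURE_REWARD
    return -min(loss, LOSS_CLIP)
```

The worst scoreable reward is $-5$, which still exceeds the failure reward of $-6$, so the ordering between scoreable and failed trials holds regardless of how large the loss is\.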

## Appendix C Curated Tasks Catalogue

This appendix catalogues the manually curated and verified forecasting, generation, and fraud detection tasks shipped with OpenFinGym. Each task fixes deterministic train/validation/test splits, a supervised or sample contract that pins the agent's output shape, and a verifier-enforced evaluator whose metric definitions are ported from the source paper.

##### Dataset provenance\.

Every curated task is either reproduced from a publicly released source-paper benchmark (Bao et al., [2026](https://arxiv.org/html/2606.26350#bib.bib31); Ang et al., [2025](https://arxiv.org/html/2606.26350#bib.bib22); Tao et al., [2024](https://arxiv.org/html/2606.26350#bib.bib27); Yoon et al., [2019](https://arxiv.org/html/2606.26350#bib.bib30)) following their Apache licenses, or rebuilt from open public data feeds: Yahoo Finance (Yahoo Finance, [2026](https://arxiv.org/html/2606.26350#bib.bib53)), EIOPA (European Insurance and Occupational Pensions Authority (EIOPA), [2026](https://arxiv.org/html/2606.26350#bib.bib54)), and U.S. Treasury par yields (Rahimikia et al., [2025](https://arxiv.org/html/2606.26350#bib.bib25); Gu et al., [2020a](https://arxiv.org/html/2606.26350#bib.bib21); Dinda, [2024](https://arxiv.org/html/2606.26350#bib.bib24); Kumar et al., [2024](https://arxiv.org/html/2606.26350#bib.bib23); Lou et al., [2023](https://arxiv.org/html/2606.26350#bib.bib28); Richman and Scognamiglio, [2024](https://arxiv.org/html/2606.26350#bib.bib26)), in a way consistent with their intended uses; no proprietary or paywalled feed is required at any stage. Synthetic tasks, namely the Ornstein–Uhlenbeck (OU) and rough-volatility data (Lou et al., [2023](https://arxiv.org/html/2606.26350#bib.bib28)), Tail-GAN (Cont et al., [2026](https://arxiv.org/html/2606.26350#bib.bib29)), and the multivariate fractional Brownian motion (fBM) dataset (Tao et al., [2024](https://arxiv.org/html/2606.26350#bib.bib27)), regenerate their data from the model parameters specified in the corresponding source papers. Fraud-detection tasks reuse established node-anomaly benchmarks (Zheng et al., [2025](https://arxiv.org/html/2606.26350#bib.bib52)) together with a synthetic graph benchmark (Li et al., [2024](https://arxiv.org/html/2606.26350#bib.bib61)).
Per-task source citations appear in the Source column of Tables [7](https://arxiv.org/html/2606.26350#A3.T7), [8](https://arxiv.org/html/2606.26350#A3.T8), and [9](https://arxiv.org/html/2606.26350#A3.T9).
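To make "regenerate their data from the model parameters" concrete, the sketch below simulates one Ornstein–Uhlenbeck path with an Euler–Maruyama scheme. The parameter values (`theta`, `mu`, `sigma`, `dt`), the seed, and the function name are placeholders chosen for illustration and are not the values used in the source paper; only the path length of 64 mirrors the OU sample contract listed in Table 8.

```python
import math
import random


def simulate_ou(n_steps=64, theta=1.0, mu=0.0, sigma=0.3, dt=1 / 64, x0=0.0, seed=0):
    """Euler-Maruyama path of dX = theta * (mu - X) dt + sigma dW.

    All default parameter values are hypothetical placeholders.
    """
    rng = random.Random(seed)  # fixed seed so the path is reproducible
    path = [x0]
    for _ in range(n_steps - 1):
        x = path[-1]
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over dt
        path.append(x + theta * (mu - x) * dt + sigma * dw)
    return path


path = simulate_ou()
print(len(path))
```

Fixing the seed is what makes a regenerated synthetic dataset identical across machines, which is the property a deterministic split relies on.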

### C.1 Metric definitions and references

Table [6](https://arxiv.org/html/2606.26350#A3.T6) defines the metric abbreviations used throughout this appendix. Unless otherwise stated, scoring metrics are losses for which smaller values are better.

| Symbol | Definition |
| --- | --- |
| *Point-forecast errors* | |
| MSE | mean squared error. |
| RMSE | root mean squared error. |
| MAE | mean absolute error. |
| MAFE | mean absolute forecast error (per-horizon MAE used in the volatility-graph task). |
| MAPE | mean absolute percentage error. |
| SMAPE | symmetric MAPE. |
| $R^{2}$ | coefficient of determination. |
| $R^{2}_{\rm OOS}$, $R^{2}_{\rm zero}$ | out-of-sample $R^{2}$ / $R^{2}$ vs. a zero forecast. |
| *Statistical tests* | |
| DM | Diebold-Mariano forecast-accuracy test (Diebold and Mariano, [1995](https://arxiv.org/html/2606.26350#bib.bib33); Gu et al., [2020a](https://arxiv.org/html/2606.26350#bib.bib21)). |
| NW $t$-stat | Newey-West HAC $t$-statistic (Newey and West, [1986](https://arxiv.org/html/2606.26350#bib.bib34); Rahimikia et al., [2025](https://arxiv.org/html/2606.26350#bib.bib25)). |
| *Financial-risk diagnostics* | |
| Sharpe | annualised Sharpe ratio. |
| $\mathrm{VaR}_{\alpha}$, $\mathrm{ES}_{\alpha}$ | Value-at-Risk / Expected Shortfall at level $\alpha$. |
| $\Delta$Sharpe | asset-mean $\lvert \text{Sharpe}_{\rm pred} - \text{Sharpe}_{\rm true} \rvert$. |
| $\Delta$VaR | asset-mean $\lvert \text{VaR}_{\rm pred} - \text{VaR}_{\rm true} \rvert$. |
| $\Delta$ES | asset-mean $\lvert \text{ES}_{\rm pred} - \text{ES}_{\rm true} \rvert$. |
| Kupiec test | unconditional coverage test for VaR (Kupiec and others, [1995](https://arxiv.org/html/2606.26350#bib.bib36); Cont et al., [2026](https://arxiv.org/html/2606.26350#bib.bib29)). |
| *Interval forecasting* | |
| PICP | prediction-interval coverage probability (target $\approx 0.95$) (Khosravi et al., [2010](https://arxiv.org/html/2606.26350#bib.bib38); Richman and Scognamiglio, [2024](https://arxiv.org/html/2606.26350#bib.bib26)). |
| MPIW | mean prediction-interval width (tighter is better, conditional on PICP) (Khosravi et al., [2010](https://arxiv.org/html/2606.26350#bib.bib38); Richman and Scognamiglio, [2024](https://arxiv.org/html/2606.26350#bib.bib26)). |
| *Generative-model evaluators* | |
| Disc. score | post-hoc discriminative score: train / test AUC of a classifier separating real from generated paths (Yoon et al., [2019](https://arxiv.org/html/2606.26350#bib.bib30); Lou et al., [2023](https://arxiv.org/html/2606.26350#bib.bib28)). |
| Pred. score | post-hoc predictive score: next-step prediction error of a model trained on generated paths and tested on real ones (Yoon et al., [2019](https://arxiv.org/html/2606.26350#bib.bib30); Esteban et al., [2017](https://arxiv.org/html/2606.26350#bib.bib41); Lou et al., [2023](https://arxiv.org/html/2606.26350#bib.bib28)). |
| Sig-MMD | signature-kernel MMD between path distributions (Chevyrev and Oberhauser, [2022](https://arxiv.org/html/2606.26350#bib.bib40); Lou et al., [2023](https://arxiv.org/html/2606.26350#bib.bib28)). |
| Sig-$W_{1}$ | signature-based Wasserstein-1 path distance (Tao et al., [2024](https://arxiv.org/html/2606.26350#bib.bib27)). |
| ACF | autocorrelation-function distance, per asset and lag. |
| ONND | one-nearest-neighbour distance for distributional fidelity (Tao et al., [2024](https://arxiv.org/html/2606.26350#bib.bib27)). |
| Marginal | marginal-distribution histogram distance. |
| Correlation | $L_{1}$ error of cross-asset correlation matrix. |
| Covariance | Frobenius distance of covariance matrices. |
| *Volatility-graph specific* | |
| DY spillover | Diebold-Yilmaz generalised-FEVD volatility-spillover graph (Diebold and Yilmaz, [2012](https://arxiv.org/html/2606.26350#bib.bib39)). |

Table 6: Metric abbreviations used throughout the task catalogue.
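Several of the Table 6 entries have short closed forms. The sketch below writes out a few of them in plain Python under common textbook definitions; the function names are illustrative, and the verifier's own implementations (ported from the source papers) may differ in details such as weighting or aggregation across assets.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_zero(y_true, y_pred):
    """R^2 against a zero forecast: the denominator is the raw, non-demeaned sum of squares."""
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - sse / sum(t ** 2 for t in y_true)


def picp(y_true, lower, upper):
    """Fraction of targets that fall inside [lower, upper]."""
    hits = sum(1 for t, lo, hi in zip(y_true, lower, upper) if lo <= t <= hi)
    return hits / len(y_true)


def mpiw(lower, upper):
    """Mean width of the prediction intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)
```

PICP and MPIW are read together: a very wide interval trivially reaches the coverage target, so width only becomes informative once coverage is near 0.95.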
### C.2 Curated forecasting tasks

The forecasting catalogue is shown in Table [7](https://arxiv.org/html/2606.26350#A3.T7). For each task we report the source paper, the data and frequency, the supervised contract (lookback $L$ to horizon $H$), and the headline metric family. Three entries deviate from the standard time-series tensor contract and are flagged in the table: Equity excess return uses a tabular per-row input (42 features per asset-month); Global index volatility is graph-conditioned (correlation + Diebold-Yilmaz spillover graphs); and Yield curve is probabilistic (lower / central / upper interval curves).
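A lookback-to-horizon contract can be pictured as a sliding window over a series with one row per time step. The sketch below is a hypothetical helper, not part of OpenFinGym, and it assumes a stride of one step between consecutive windows.

```python
def make_windows(series, lookback, horizon):
    """Slice a T x C series (list of rows) into (input, target) pairs.

    Each input holds `lookback` consecutive rows and each target the
    `horizon` rows that immediately follow. Stride 1 is an assumption.
    """
    pairs = []
    for start in range(len(series) - lookback - horizon + 1):
        x = series[start : start + lookback]
        y = series[start + lookback : start + lookback + horizon]
        pairs.append((x, y))
    return pairs


# Example: 100 daily rows of 10 channels under a 60 -> 3 contract.
series = [[float(t)] * 10 for t in range(100)]
pairs = make_windows(series, lookback=60, horizon=3)
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))
```

With 100 rows and a 60-to-3 contract there are 100 - 63 + 1 = 38 windows, each pairing a 60 x 10 input with a 3 x 10 target.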

Following the task-generation method proposed in Bao et al. ([2026](https://arxiv.org/html/2606.26350#bib.bib31)) and its accompanying code, we construct 38 curated forecasting tasks. They cover U.S. equities (organised by sector), U.S. Treasury curve points (3m–30y), G7 / emerging-market / Asia-focused FX pairs, crypto majors and altcoins, and metal / energy / soft commodities, with 7–8 variants per asset family that differ in panel selection and lookback-to-horizon contract. All variants are scored with the same metric set as in the source paper (RMSE, MAE, and the $\Delta$risk triplet $(\Delta\text{Sharpe}, \Delta\text{VaR}, \Delta\text{ES})$ between predicted and realised paths).
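The $\Delta$risk triplet compares risk statistics of the predicted and realised paths, averaged over assets. The sketch below is one possible reading and rests on several assumptions that the text does not fix: inputs are already per-period returns, Sharpe is annualised with 252 periods, VaR and ES are historical estimates at level 0.05 reported as positive losses, and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def sharpe(returns, periods_per_year=252):
    """Annualised Sharpe ratio (zero risk-free rate assumed)."""
    sd = pstdev(returns)
    return 0.0 if sd == 0 else mean(returns) / sd * math.sqrt(periods_per_year)


def var_es(returns, alpha=0.05):
    """Historical VaR and ES at level alpha, as positive loss numbers."""
    ordered = sorted(returns)                    # worst returns first
    k = max(1, math.ceil(alpha * len(ordered)))  # size of the alpha tail
    tail = ordered[:k]
    return -tail[-1], -mean(tail)


def delta_risk(pred, true, alpha=0.05):
    """Asset-mean absolute gaps; pred and true map asset name -> list of returns."""
    d_sharpe, d_var, d_es = [], [], []
    for asset in true:
        var_p, es_p = var_es(pred[asset], alpha)
        var_t, es_t = var_es(true[asset], alpha)
        d_sharpe.append(abs(sharpe(pred[asset]) - sharpe(true[asset])))
        d_var.append(abs(var_p - var_t))
        d_es.append(abs(es_p - es_t))
    return mean(d_sharpe), mean(d_var), mean(d_es)
```

A forecast can score well on RMSE while smoothing away the tails, and the triplet is the part of the metric set that penalises that.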

| Task | Source | Data / frequency | Contract ($L\to H$) | Metrics |
| --- | --- | --- | --- | --- |
| LOB | Bao et al. ([2026](https://arxiv.org/html/2606.26350#bib.bib31)) | 14-channel LOB; high-frequency bars | $72\to3$ (5 of 14 price channels scored) | RMSE, MAE, $\Delta$risk |
| Stock-10 | Bao et al. ([2026](https://arxiv.org/html/2606.26350#bib.bib31)) | 10 U.S. equities; daily | $60\to3$ | RMSE, MAE, $\Delta$risk |
| Crypto pairs | Bao et al. ([2026](https://arxiv.org/html/2606.26350#bib.bib31)) | 20 crypto pairs; hourly | $72\to3$ | RMSE, MAE, $\Delta$risk |
| Exchange rate | Ang et al. ([2025](https://arxiv.org/html/2606.26350#bib.bib22)) | 8 FX series; daily | $60\to20$ | RMSE, MAE, MAPE, SMAPE |
| Stock OHLCV | Ang et al. ([2025](https://arxiv.org/html/2606.26350#bib.bib22)) | 5-channel OHLCV; daily | $60\to10$ (Close only scored) | RMSE, MAE, MAPE, SMAPE |
| TSFM | Rahimikia et al. ([2025](https://arxiv.org/html/2606.26350#bib.bib25)) | 60-name U.S. excess returns; daily | $252\to1$, yearly walk-forward | $R^{2}_{\rm OOS}$, Sharpe, NW $t$-stat |
| NIFTY-50 | Dinda ([2024](https://arxiv.org/html/2606.26350#bib.bib24)) | 8-factor panel; daily | $30\to1$ | $R^{2}$, RMSE, MAPE |
| Equity excess return∗ | Gu et al. ([2020a](https://arxiv.org/html/2606.26350#bib.bib21)) | U.S. stock-month panel, 42 features | per-row, yearly walk-forward | $R^{2}_{\rm zero}$, DM, long-short Sharpe |
| Global index volatility† | Kumar et al. ([2024](https://arxiv.org/html/2606.26350#bib.bib23)) | 8 indices with corr. / DY graphs; daily | $21\to\{1,5,15,21\}$ | MSE, MAFE, MAPE, $R^{2}$ per horizon |
| Yield curve‡ | Richman and Scognamiglio ([2024](https://arxiv.org/html/2606.26350#bib.bib26)) | Multi-country EIOPA curves; monthly | $10\times150\to150$ | MSE, MAE, PICP, MPIW |
| *MOSAIC pipeline-generated tasks (7–8 variants per asset family)* | | | | |
| Equity panels | Bao et al. ([2026](https://arxiv.org/html/2606.26350#bib.bib31)) | U.S. equity panels by sector (5–9 names); daily | $20\to1$ – $90\to20$ | RMSE, MAE, $\Delta$risk |
| Commodity panels | Bao et al. ([2026](https://arxiv.org/html/2606.26350#bib.bib31)) | metals / energy / softs panels (6–17 names); daily | $20\to2$ – $90\to20$ | RMSE, MAE, $\Delta$risk |
| Crypto panels | Bao et al. ([2026](https://arxiv.org/html/2606.26350#bib.bib31)) | majors / DeFi / L1 / altcoin panels (6–12 names); hourly | $12\to1$ – $168\to24$ | RMSE, MAE, $\Delta$risk |
| FX panels | Bao et al. ([2026](https://arxiv.org/html/2606.26350#bib.bib31)) | G7 / EM / Asia / European FX panels (6–10 pairs); daily | $10\to1$ – $90\to20$ | RMSE, MAE, $\Delta$risk |
| Treasury curve | Bao et al. ([2026](https://arxiv.org/html/2606.26350#bib.bib31)) | U.S. Treasury curve segments (3m–30y, 6–10 points); daily | $5\to1$ – $60\to10$ | RMSE, MAE, $\Delta$risk |

Table 7: Forecasting tasks in OpenFinGym. The top block lists the individually curated tasks; the bottom block summarises additional tasks produced by the MOSAIC pipeline, with each row collapsing several variants that share a data family and metric set but differ in panel composition and lookback-to-horizon contract. Input and output channel counts are equal unless noted. $\Delta$risk denotes the triplet $(\Delta\text{Sharpe}, \Delta\text{VaR}, \Delta\text{ES})$ between predicted and realised paths. Markers: ∗ tabular per-row input (no lookback tensor); † graph-conditioned input (correlation and Diebold-Yilmaz spillover graphs alongside the lookback tensor); ‡ probabilistic output (lower / central / upper interval curves) rather than a point forecast. All metric abbreviations are defined in Table [6](https://arxiv.org/html/2606.26350#A3.T6).

### C.3 Curated generative tasks

The generative catalogue is shown in Table [8](https://arxiv.org/html/2606.26350#A3.T8). The *Sample contract* column pins the shape of the generated tensor: unconditional tasks list a single window shape, while conditional tasks list the context-to-future shape together with the number $K$ of futures generated per context.

| Task | Source | Data / frequency | Sample contract | Metrics |
| --- | --- | --- | --- | --- |
| Crypto | Bao et al. ([2026](https://arxiv.org/html/2606.26350#bib.bib31)) | 20 pairs; hourly | uncond. $60\times20$ | Marginal, Correlation, ACF, Covariance, $\Delta$risk |
| LOB | Bao et al. ([2026](https://arxiv.org/html/2606.26350#bib.bib31)) | 14-channel LOB; HF bars | uncond. $60\times14$ | Marginal, Correlation, ACF, Covariance |
| Stock | Bao et al. ([2026](https://arxiv.org/html/2606.26350#bib.bib31)) | 10 equities; daily | uncond. $60\times10$ | Marginal, Correlation, ACF, Covariance, $\Delta$risk |
| Crypto log-return | Ang et al. ([2025](https://arxiv.org/html/2606.26350#bib.bib22)) | 10 returns; hourly | uncond. $24\times10$ | VaR, ES |
| Exchange | Ang et al. ([2025](https://arxiv.org/html/2606.26350#bib.bib22)) | 8 FX series; daily | uncond. $7\times8$ | Marginal, Correlation, ACF, Covariance |
| Stock log-return | Ang et al. ([2025](https://arxiv.org/html/2606.26350#bib.bib22)) | 8 stocks; daily | uncond. $5\times8$ | Marginal, Correlation, ACF, Covariance |
| OU | Lou et al. ([2023](https://arxiv.org/html/2606.26350#bib.bib28)) | Ornstein–Uhlenbeck | uncond. $64\times1$ | Disc., Pred., Sig-MMD |
| Rough vol | Lou et al. ([2023](https://arxiv.org/html/2606.26350#bib.bib28)) | log price + log vol | uncond. $200\times2$ | Disc., Pred., Sig-MMD, ACF |
| Stock OHLCV | Lou et al. ([2023](https://arxiv.org/html/2606.26350#bib.bib28)) | 10 stocks OHLCV; daily | uncond. $20\times5$ | Disc., Pred., Sig-MMD |
| US-5 (cond.) | Tao et al. ([2024](https://arxiv.org/html/2606.26350#bib.bib27)) | 5 U.S. stocks; daily ret. | cond. $5\to5$, $K=128$ | Sig-$W_{1}$, ACF, Correlation, Marginal, Am.-put err., ONND |
| fBM (cond.) | Tao et al. ([2024](https://arxiv.org/html/2606.26350#bib.bib27)) | 3D fBM, $H=0.25$ | cond. $6\to5$, $K=1$ | Sig-$W_{1}$, ACF, Correlation, Marginal, ONND |
| Tail-GAN | Cont et al. ([2026](https://arxiv.org/html/2606.26350#bib.bib29)) | 5-asset synthetic | uncond. $100\times5$ | rel. VaR / ES, Kupiec, Correlation, ACF |
| TimeGAN stock | Yoon et al. ([2019](https://arxiv.org/html/2606.26350#bib.bib30)) | 6-column stock; daily | uncond. $24\times6$ | Disc., Pred. |

Table 8: Curated generative tasks. The *Sample contract* column reports the window shape as time-steps $\times$ channels for unconditional tasks; conditional bundles list context length $\to$ future length together with the number $K$ of generated futures emitted per context. Metric symbols are defined in Table [6](https://arxiv.org/html/2606.26350#A3.T6); $\Delta$risk denotes the triplet $(\Delta\text{Sharpe}, \Delta\text{VaR}, \Delta\text{ES})$. For the fBM task, the model emits a single future per context ($K=1$). For US-5, the American-put error aggregates over futures and therefore uses $K=128$; the put is priced on the $K=128$ paths at an at-the-money strike.
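A sample contract amounts to a shape check on the submitted tensor. The sketch below shows what such a check could look like for nested Python lists; the helper names and the axis order (contexts, $K$, future length, channels) for conditional tasks are assumptions made for illustration, not the verifier's actual interface.

```python
def shape_of(x):
    """Shape of a nested list, assuming it is rectangular."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        if not x:
            break
        x = x[0]
    return tuple(dims)


def check_unconditional(samples, steps, channels):
    """Every sample must be a steps x channels window; the sample count is free."""
    return shape_of(samples)[1:] == (steps, channels)


def check_conditional(futures, n_contexts, k, future_len, channels):
    """One block of k futures per context, each future_len x channels (assumed layout)."""
    return shape_of(futures) == (n_contexts, k, future_len, channels)


# Example: three unconditional Crypto samples under the 60 x 20 contract.
crypto = [[[0.0] * 20 for _ in range(60)] for _ in range(3)]
print(check_unconditional(crypto, steps=60, channels=20))
```

For US-5, the same check would be called with `k=128`, `future_len=5`, and `channels=5`.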

### C.4 Curated fraud detection tasks

The fraud detection tasks are shown in Table [9](https://arxiv.org/html/2606.26350#A3.T9). The catalogue contains seven graph-structured tasks drawn from established anomaly-detection and fraud-detection benchmarks. Six tasks are node-level anomaly scoring problems on attributed graphs; one task is a graph-level binary classification on a synthetic motif benchmark.

| Dataset | Original task | Description | Evaluation mode | Anomaly rate | Metric |
| --- | --- | --- | --- | --- | --- |
| Amazon | Zheng et al. ([2025](https://arxiv.org/html/2606.26350#bib.bib52)) | 10,224-node e-commerce product graph; 25 feat.; COO adjacency | Node anomaly scoring | 6.8% | AUCROC, Recall |
| BlogCatalog | Zheng et al. ([2025](https://arxiv.org/html/2606.26350#bib.bib52)) | 5,196-node social network (bloggers); 8,189 feat.; COO adjacency | Node anomaly scoring | 5.8% | AUCROC, Recall |
| YelpChi | Zheng et al. ([2025](https://arxiv.org/html/2606.26350#bib.bib52)) | 23,831-node review graph (reviews); 32 feat.; COO adjacency | Node anomaly scoring | 5.1% | AUCROC, Recall |
| DBLP | Zheng et al. ([2025](https://arxiv.org/html/2606.26350#bib.bib52)) | 4,057-node citation/co-author graph; 334 feat.; 2 relation types merged into dense adjacency | Node anomaly scoring | 7.0% | AUCROC, Recall |
| IMDB 5K | Zheng et al. ([2025](https://arxiv.org/html/2606.26350#bib.bib52)) | 4,780-node movie co-occurrence graph; 1,232 feat.; 2 relation types merged into dense adjacency | Node anomaly scoring | 7.0% | AUCROC, Recall |
| CERT | Zheng et al. ([2025](https://arxiv.org/html/2606.26350#bib.bib52)) | 1,000-node user behaviour graph; 6 feat. (2 per view: email, logon, file); COO email-activity adjacency | Node anomaly scoring | 7.0% | AUCROC, Recall |
| BA-2motifs | Li et al. ([2024](https://arxiv.org/html/2606.26350#bib.bib61)) | 1,000 synthetic graphs; 25 nodes $\times$ 10 feat. per graph (constant 0.1); $25\times25$ adjacency per graph; balanced binary classes (house vs. 5-cycle motif) | Graph binary classification | 50.0% (balanced) | AUCROC, Recall |

Table 9: Fraud detection tasks in OpenFinGym. The Anomaly rate column reports the fraction of positive (anomalous / fraud-class) labels in the training split.
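For reference, the quantities in the last two columns can be computed from binary labels and anomaly scores as sketched below. The AUCROC function uses the pairwise-ranking definition, which is quadratic in the number of nodes and meant only as an illustration. The recall helper takes an explicit `threshold` argument because the text does not say how scores are binarised, so that parameter and all function names are assumptions.

```python
def auc_roc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at_threshold(labels, scores, threshold):
    """Share of true anomalies whose score reaches the (hypothetical) threshold."""
    caught = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    return caught / sum(labels)


def anomaly_rate(labels):
    """Fraction of positive labels, as reported in the Anomaly rate column."""
    return sum(labels) / len(labels)
```

With anomaly rates of 5 to 7 percent on the node-level tasks, accuracy would be dominated by the negative class, which is why a ranking metric and recall are the more informative pair here.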