PandaAI: A Practical Agent CQ2 for Neuro-symbolic Data Analysis And Integrated Decision-Making in Quantitative Finance
Summary
PandaAI proposes a closed-loop neuro-symbolic LLM agent for sequential decision-making in quantitative finance, integrating market regime modeling and constrained alpha generation to address low SNR and non-stationarity in financial data, achieving significant improvements over state-of-the-art time-series models.
# PandaAI: A Practical Agent CQ2 for Neuro-symbolic Data Analysis And Integrated Decision-Making in Quantitative Finance
Source: [https://arxiv.org/html/2606.06823](https://arxiv.org/html/2606.06823)
Siyuan Liu Panda AI liusiyuan@pandaai\.online Bingjun Liu Panda AI liubingjun@pandaai\.online
###### Abstract
While deep learning has excelled in various domains, its application to sequential decision-making in finance remains challenging due to the low Signal-to-Noise Ratio (SNR) and non-stationarity of financial data. Leveraging the reasoning capabilities of Large Language Models (LLMs), we propose PandaAI, a closed-loop neuro-symbolic LLM agent with market regime modeling and constrained alpha generation, which suppresses the financial toxicity of LLM-generated outputs. To bridge the gap between general linguistic capability and financial rigor, we fine-tune a domain-specific LLM. Furthermore, we integrate this LLM into a modular architecture and form a closed-loop system. Unlike traditional models that optimize isolated prediction metrics, PandaAI is designed as a neuro-symbolic agent that navigates the complex, real-world financial environment with explicit risk awareness. Extensive experiments on CSI 300 stock data show that PandaAI achieves an 18.2% higher Rank IC and a 25.7% lower maximum drawdown than state-of-the-art time-series models. Our constrained LLM generation and dual-channel adaptation method provide a general paradigm for LLM deployment in high-stakes sequential decision-making scenarios.
## 1 Introduction
Recently, deep learning has achieved great success in many real-world applications, such as facial recognition Sun et al. ([2024](https://arxiv.org/html/2606.06823#bib.bib50)), object segmentation Kirillov et al. ([2023](https://arxiv.org/html/2606.06823#bib.bib49)), and natural language processing Devlin et al. ([2019](https://arxiv.org/html/2606.06823#bib.bib51)); Yenduri et al. ([2023](https://arxiv.org/html/2606.06823#bib.bib52)). Financial data, however, poses great challenges to deep learning due to its inherently low Signal-to-Noise Ratio (SNR) and non-stationarity. SNR refers to the relative strength of predictable, economically meaningful patterns (the signal) compared to the random, unpredictable fluctuations (the noise) that dominate price movements. Financial price series also exhibit strong non-stationarity in the form of trending behavior (near unit-root processes), volatility clustering, structural breaks during economic regime changes, and evolving cross-asset relationships, all of which violate the stationarity assumptions implicit in many standard deep learning architectures.
In this paper, we adopt quantitative investment methods to improve the SNR by mining formulaic alpha factors $f$ for decision-making, rather than relying on raw data. The quantitative investment task is modeled as a sequential decision-making process. The goal is to optimize the portfolio weights $\mathbf{w}_t$ to maximize cumulative rewards while satisfying a set of risk constraints $\mathcal{C}$. A formulaic alpha factor $f$ is a symbolic expression mapping market history to a cross-sectional signal vector $\mathbf{s}_t \in \mathbb{R}^{N}$, where $N$ represents the number of products in the panel. The search space for $f$ is defined by a context-free grammar involving mathematical operators (e.g., $+$, $-$, $\log$, rank) and market variables. Unlike unconstrained code generation, viable financial factors must adhere to specific structural constraints (e.g., dimensional homogeneity) and risk constraints (e.g., decay rate). We denote the feasible set of factors as $\mathcal{A}_{\text{feasible}} \subset \mathcal{A}_{\text{all}}$.
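To make the notion of a formulaic alpha factor concrete, the following is a minimal sketch of how a symbolic expression could be evaluated into a cross-sectional signal. The tuple encoding, the operator names, and the toy `market` dictionary are illustrative assumptions and are not taken from the paper's grammar.

```python
import math

# Hypothetical mini-grammar: a factor is either the name of a market variable
# or a nested tuple (operator, *arguments). Operator names are illustrative.

def cs_rank(values):
    """Cross-sectional rank scaled to [0, 1]; ties keep input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position / (len(values) - 1) if len(values) > 1 else 0.0
    return ranks

def evaluate(expr, market):
    """Evaluate a factor expression into one signal value per asset."""
    if isinstance(expr, str):
        return list(market[expr])
    op, *args = expr
    operands = [evaluate(a, market) for a in args]
    if op == "add":
        return [x + y for x, y in zip(*operands)]
    if op == "sub":
        return [x - y for x, y in zip(*operands)]
    if op == "log":
        return [math.log(x) for x in operands[0]]
    if op == "rank":
        return cs_rank(operands[0])
    raise ValueError(f"unknown operator: {op}")

# Toy panel with N = 3 assets on a single day (made-up numbers).
market = {"close": [10.0, 20.0, 40.0], "open": [10.0, 25.0, 20.0]}
factor = ("rank", ("sub", ("log", "close"), ("log", "open")))
signal = evaluate(factor, market)  # one entry per asset
```

Because the factor is data rather than free-form code, structural checks such as an operator whitelist can be applied to the expression before it is ever evaluated.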
Financial time series violate the stationarity assumption (i.e., their joint distributions change over time). We formalize this by introducing a continuous latent market regime state $z_t$, which captures the dynamic characteristics of the market (e.g., volatility, liquidity) at time $t$. The market dynamics are governed by a time-varying transition function conditioned on $z_t$. Consequently, a static policy $\pi(a \mid s)$ inevitably degrades. A market-aware policy must instead take the form $\pi(a \mid s, z_t)$, dynamically adapting parameters (e.g., risk aversion $\lambda$) based on the inferred regime $z_t$. We summarize the notation in this paper in Table [4](https://arxiv.org/html/2606.06823#S6.T4). In summary, our main contributions are:
- Constrained MCTS Alpha Mining with LLM Guidance: We design an LLM-guided constrained MCTS alpha mining framework, integrating financial hard constraints into the full lifecycle of LLM generation to address the financial toxicity issue of factors generated by unconstrained methods.
- Market Regime Latent Modeling with Dual-Channel Adaptation: We propose market regime latent variable modeling and a dual-channel adaptation mechanism, compressing high-dimensional market dynamics factors into continuous latent states $z_t$ to achieve unified market perception for the LLM's symbolic reasoning and the numerical optimization of quantitative modules.
- Closed-Loop Update System for Quantitative Finance Lifecycle: We construct a closed-loop update system covering the full lifecycle of quantitative finance, combining fast logical constraint induction and slow parameter adaptation to realize continuous adaptation of the model to non-stationary financial markets, breaking through the limitations of traditional open-loop models.
## 2 Related Work
Figure 1: Overview of the PandaAI Market-Aware Quantitative Framework. The system operates as a closed-loop dynamical system spanning six core modules. (Left) The Market Dynamics Module ($\mathcal{M}$) ingests data to generate the regime state $z_t$ (supporting H1). (Center) The Alpha Research Module ($\mathcal{R}$) utilizes LLM-guided MCTS to search for robust factors under constraints $\mathcal{C}$ (supporting H2). (Right) The Portfolio ($\mathcal{P}$) and Execution ($\mathcal{E}$) modules actuate decisions conditioned on $z_t$. (Bottom) The feedback loop collects Evidence ($E$) via Verification ($\mathcal{V}$) to update parameters $\theta$ and constraints $\mathcal{C}$ (supporting H3). Solid arrows denote Data Flow; dashed arrows denote Control/Update Flow.
### 2.1 Alpha Space Exploration
Like feature engineering in machine learning, automated alpha mining is a cornerstone of quantitative finance. Before the rise of machine learning, Genetic Programming (GP) Koza ([1992](https://arxiv.org/html/2606.06823#bib.bib11)) performed effectively, although it often searched the alpha space inefficiently. Subsequent works have sought broader coverage of the alpha space. DeepScalper Sun et al. ([2022](https://arxiv.org/html/2606.06823#bib.bib13)) introduces Deep Reinforcement Learning, while DSO Petersen et al. ([2019](https://arxiv.org/html/2606.06823#bib.bib14)) and AlphaGen Yu et al. ([2023](https://arxiv.org/html/2606.06823#bib.bib15)) achieve better interpretability using symbolic regression. More recently, as large language models (LLMs) have shown remarkable improvement in complex semantic understanding and code generation, AlphaGPT Wang et al. ([2025](https://arxiv.org/html/2606.06823#bib.bib16)) integrates Llama3 70B Grattafiori et al. ([2024](https://arxiv.org/html/2606.06823#bib.bib48)) to mine, test, and deploy investment signals (alphas) by translating human intuition into quantitative trading strategies. Shi et al. ([2025](https://arxiv.org/html/2606.06823#bib.bib17)) propose MCTS-guided exploration to explicitly cover the full alpha space. Despite these significant strides in search capability, robustness remains a primary concern. Generative approaches are susceptible to overfitting, often yielding factors that are mathematically valid but financially toxic (e.g., extreme turnover) due to the absence of continuous, rigorous verification mechanisms.
### 2.2 Market Dynamics and Adaptation
In the broader machine learning community, advanced architectures like TimesNet Wu et al. ([2022](https://arxiv.org/html/2606.06823#bib.bib19)) and iTransformer Liu et al. ([2023](https://arxiv.org/html/2606.06823#bib.bib20)) have set new state-of-the-art standards for handling temporal variations. However, financial markets are inherently non-stationary. The distribution shift derived from market dynamics poses severe challenges to static models Hamilton ([2020](https://arxiv.org/html/2606.06823#bib.bib18)). Although RevIN Kim et al. ([2021](https://arxiv.org/html/2606.06823#bib.bib21)) and the meta-learning framework DoubleAdapt Zhao et al. ([2023](https://arxiv.org/html/2606.06823#bib.bib22)) have successfully addressed concept drift in stock forecasting, transferring these adaptive mechanisms to LLM-based agents remains underexplored. Contemporary LLM agents typically operate under implicit stationarity assumptions, often overlooking the explicit modeling of market dynamics (e.g., regime shifts). This limitation restricts their ability to contextually adapt downstream strategies during periods of market turbulence, such as liquidity crises.
### 2.3 Autonomous Agents and Closed-Loop Systems
The deployment of autonomous agents represents a frontier in AI research. Generalist frameworks Shen et al. ([2023](https://arxiv.org/html/2606.06823#bib.bib24)); Park et al. ([2023](https://arxiv.org/html/2606.06823#bib.bib25)) demonstrated the immense potential of planning and tool usage. By integrating domain-specific tools, finance-specific agents Li et al. ([2023](https://arxiv.org/html/2606.06823#bib.bib26)); Zhang et al. ([2024](https://arxiv.org/html/2606.06823#bib.bib27)) extended these capabilities to finance. Notwithstanding these innovations, current systems predominantly operate in open-loop simulations. They are frequently decoupled from strict financial hard constraints (e.g., leverage limits, transaction costs) and lack systematic feedback loops from execution to model updates. This structural fragmentation limits the potential for synergistic distillation, where insights from one module (e.g., regime detection) could critically inform another (e.g., alpha pruning). To address these limitations, we propose a foundational, market-aware framework that integrates these disparate components into a unified, closed-loop system, enabling holistic optimization across the entire quantitative investment lifecycle.
## 3 Methodology
To bridge the structural fragmentation, we propose a foundational framework and posit that its efficacy stems from three mechanism\-driven hypotheses:
H1 (Contextualization Hypothesis): Explicitly modeling market regimes ($z_t$) and conditioning all downstream tasks on them will yield more robust and context-aware strategies than those assuming stationarity.
H2 (Constrained-Creativity Hypothesis): Guiding LLM-based alpha generation with first-class financial constraints ($\mathcal{C}$) within an MCTS search will produce factors with superior out-of-sample robustness and lower financial toxicity compared to unconstrained generative methods.
H3 (Meta-Adaptation Hypothesis): A closed-loop feedback mechanism that updates both model parameters ($\theta$) and constraint logic ($\mathcal{C}$) based on backtest evidence ($E$) will enable continuous adaptation to non-stationary markets, outperforming static or open-loop systems.
Our framework, as shown in Figure [1](https://arxiv.org/html/2606.06823#S2.F1), is designed to instantiate and test these hypotheses, and Table [1](https://arxiv.org/html/2606.06823#S3.T1) summarizes the relation between our hypotheses and the corresponding mechanisms.
Table 1: Correspondence between Scientific Hypotheses, System Modules, and Implementation Mechanisms.
### 3.1 Market Dynamics Module $\mathcal{M}$
Financial markets are inherently non-stationary, characterized by shifting distributions that render static models obsolete. To address this, we operationalize market awareness not as discrete labels, but as a continuous latent regime manifold. The module $\mathcal{M}$ functions as a compression engine that distills high-dimensional heterogeneous data into a compact, informative state representation $z_t$.
##### Latent State Construction
We utilize Barra factors Sheikh ([1996](https://arxiv.org/html/2606.06823#bib.bib56)), which serve as industry-standard risk indicators comprising style (e.g., momentum and volatility) and industry exposures, to characterize market dynamics. We collected these factors over a 10-year horizon. To retain numerical fidelity while reducing noise, a lightweight Autoencoder architecture is employed to obtain the low-dimensional $z_t$ that preserves the continuous dynamic properties of the market. This encoder is pre-trained in an unsupervised manner to minimize reconstruction error, ensuring $z_t$ captures the intrinsic manifold of market evolution.
##### Dual\-Channel Adaptation
Since $z_t$ must interface with both the symbolic reasoning of the LLM and the numerical optimization of execution modules, we design a Dual-Channel Adapter:
- Symbolic Adapter for LLM (Channel 1): To make the continuous vector $z_t$ comprehensible to the LLM, we employ a projection MLP that maps $z_t$ into $k$ learnable soft tokens. These tokens are prepended to the LLM's input embedding sequence.
- Numerical Adapter for Control (Channel 2): For modules requiring scalar inputs (Portfolio $\mathcal{P}$ and Execution $\mathcal{E}$), a separate feature extraction network maps $z_t$ to specific control parameters (e.g., risk aversion $\lambda_t$, liquidity participation rate $\gamma_t$).
This architecture ensures that a unified, consistent market perception $z_t$ drives both the high-level reasoning and low-level control of the agent.
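The two channels described above can be sketched as follows. This is a simplified illustration, not the paper's architecture: a single linear layer stands in for the projection MLP, an affine map with a sigmoid stands in for the feature extraction network, and all weights, bounds, and function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def symbolic_adapter(z, projection, k, d_model):
    """Channel 1: project z into k soft tokens of width d_model.

    `projection` has k * d_model rows, each of length len(z); a real
    implementation would use a trained MLP instead of one linear layer.
    """
    flat = [sum(w * x for w, x in zip(row, z)) for row in projection]
    if len(flat) != k * d_model:
        raise ValueError("projection must have k * d_model rows")
    return [flat[i * d_model:(i + 1) * d_model] for i in range(k)]

def numerical_adapter(z, weights, bias, low, high):
    """Channel 2: map z to one control parameter bounded in [low, high]."""
    score = sum(w * x for w, x in zip(weights, z)) + bias
    return low + (high - low) * sigmoid(score)

z_t = [0.5, -1.0]  # toy 2-dimensional regime state

# Two soft tokens of width 3, to be prepended to the LLM input embeddings.
soft_tokens = symbolic_adapter(z_t, [[1.0, 0.0]] * 6, k=2, d_model=3)

# Risk aversion lambda_t, kept inside an assumed range [0.5, 5.0].
lambda_t = numerical_adapter(z_t, weights=[2.0, 0.0], bias=-1.0, low=0.5, high=5.0)
```

Bounding the numerical output is a design choice of this sketch: it keeps a poorly estimated $z_t$ from producing a negative or extreme risk-aversion value.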
### 3.2 LLM-Powered Alpha Research Module $\mathcal{R}$
We conceptualize Alpha Mining not as creative writing, but as a Constrained Search Problem over a Directed Acyclic Graph (DAG) of operators. We implement an LLM-guided Monte Carlo Tree Search (MCTS) framework to navigate this sparse solution space. The process ensures robustness through four quant-specific phases that incorporate the constraint set $\mathcal{C}$ and the market state $z_t$ at distinct checkpoints (visualized in Figure [2](https://arxiv.org/html/2606.06823#S3.F2)); the detailed procedure is presented in Section [6.1](https://arxiv.org/html/2606.06823#S6.SS1).
Figure 2: Single MCTS Iteration Flow, illustrating where the Constraint Set $\mathcal{C}$ is applied. The four phases are (1) Selection, where $c(z_t)$ modulates exploration; (2) Expansion, where the LLM generates candidates with $\mathcal{C}$ injected into the prompt and regenerates any candidate that fails $G_{\text{forbidden}}$; (3) Simulation, which checks $\mathcal{C}_{\text{dynamic}}$; and (4) Backpropagation of $V$ on a pass or $V - \lambda$ on a failure. $G_{\text{forbidden}}$ (a subset of $\mathcal{C}$) acts as a hard filter during Expansion, while dynamic constraints $\mathcal{C}_{\text{dynamic}}$ apply soft penalties during Simulation.
In summary, $\mathcal{R}$ reframes LLM-based financial creativity from an open-ended generation task into a constrained, tree-search-guided reasoning process. This addresses the core limitation of prior generative approaches (Section 3.2): the LLM is not merely a code generator but a reasoning engine whose proposals are continuously subjected to simulation-based financial validation (Eq. 8) within the search loop. The constraint set $\mathcal{C}$ acts primarily as an intrinsic regularizer via prompting and filtering, with residual enforcement via value penalization. This tight integration of symbolic reasoning (LLM), systematic exploration (MCTS), and domain-specific validation (financial backtest) is the key to generating alphas that are both novel and robust. This process directly embodies and operationalizes H2.
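The three places where constraints and the regime state enter one MCTS iteration can be sketched with small helper functions. This is an illustrative skeleton only: the standard UCT formula is used for selection, the form of `exploration_constant` (less exploration in volatile regimes) is a guess at what $c(z_t)$ might look like, and the operator names are hypothetical. The full procedure is the one in Section 6.1.

```python
import math

def exploration_constant(z_volatility, base=1.4):
    """Hypothetical c(z_t): shrink exploration as inferred volatility rises."""
    return base / (1.0 + z_volatility)

def uct_score(value_sum, visits, parent_visits, c):
    """Selection: mean value plus an exploration bonus scaled by c."""
    if visits == 0:
        return float("inf")  # always try unvisited children first
    return value_sum / visits + c * math.sqrt(math.log(parent_visits) / visits)

def passes_hard_filter(factor_ops, forbidden_ops):
    """Expansion: reject (and regenerate) any candidate using a forbidden operator."""
    return not (set(factor_ops) & set(forbidden_ops))

def penalized_value(value, violates_dynamic, penalty):
    """Simulation: back up V on a pass and V - penalty on a failure."""
    return value - penalty if violates_dynamic else value

c_t = exploration_constant(z_volatility=1.0)
ok = passes_hard_filter(["rank", "reversal"], forbidden_ops=["reversal"])
backed_up = penalized_value(0.30, violates_dynamic=True, penalty=0.10)
```

The split mirrors the figure: a hard filter removes candidates outright, whereas the soft penalty only lowers the value that is propagated up the tree.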
### 3.3 Fine-Tuning Module $\mathcal{T}$ for CQ2
In Module $\mathcal{T}$, the Chain-of-thought Quantitative LLM, named CQ2, plays a central role in the framework: it processes financial information into the textual embedding $h_{txt}$ and generates formulaic alpha factors. CQ2 is fine-tuned first, before the other modules. Unlike generic LLMs, CQ2 must exhibit two domain-specific capabilities: (i) understanding market dynamics, risk constraints, and alpha semantics, and (ii) adapting its reasoning patterns to the market state $z_t$. To achieve this, we design a two-stage fine-tuning pipeline. CQ2 utilizes the DeepSeek-Coder-33B architecture Guo et al. ([2024](https://arxiv.org/html/2606.06823#bib.bib47)). The fine-tuning procedure consists of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from Human Feedback (RLHF). RLHF aligns language models with human feedback, which improves truthfulness and generalization to user preferences and substantially reduces hallucinations: Ouyang et al. ([2022](https://arxiv.org/html/2606.06823#bib.bib38)) report a 21% hallucination rate after alignment versus 41% before. Furthermore, RLHF primarily increases the probability of sampling high-quality rollouts Qi et al. ([2025](https://arxiv.org/html/2606.06823#bib.bib31)). The detailed fine-tuning technique and procedure are described in Section [6.2](https://arxiv.org/html/2606.06823#S6.SS2).
### 3.4 Market-Aware Decision Making ($\mathcal{P}$ & $\mathcal{E}$)
Instead of static optimization, PandaAI dynamically adjusts the entire trading pipeline based on $z_t$.
##### Portfolio Optimization ($\mathcal{P}$)
We solve the regime-conditioned convex problem:
$$w_t = \arg\max_{w \in \mathcal{C}_{port}} \left( w^{T} s_t - \lambda(z_t) \cdot w^{T} \Sigma w \right) \tag{1}$$
where $\lambda(z_t)$ automatically increases during high-volatility regimes to prioritize capital preservation.
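To see how $\lambda(z_t)$ scales positions in Eq. (1), consider the special case in which the constraint set $\mathcal{C}_{port}$ is not binding, $\Sigma$ is symmetric positive definite, and $\lambda(z_t) > 0$. These are simplifying assumptions made only for this illustration. Setting the gradient of the objective to zero gives a closed form:

```latex
\begin{aligned}
\nabla_w \left( w^{T} s_t - \lambda(z_t)\, w^{T} \Sigma w \right)
  &= s_t - 2\lambda(z_t)\, \Sigma w = 0 \\
\Rightarrow \quad w_t^{\star}
  &= \frac{1}{2\lambda(z_t)}\, \Sigma^{-1} s_t
\end{aligned}
```

In this unconstrained case, doubling $\lambda(z_t)$ halves every position while leaving the relative weights unchanged. With binding constraints the solution no longer has this closed form and a convex solver is needed.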
##### Execution Control ($\mathcal{E}$)
To minimize implementation shortfall (IS), the execution policy $\pi_{exec}(a \mid w_t, z_t)$ selects strategies (e.g., TWAP, VWAP) matched to the inferred liquidity profile. This narrows the “simulation-to-reality” gap by accounting for market friction.
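As a rough illustration of the two strategies named above, the sketch below slices a parent order either evenly over time (TWAP) or in proportion to an expected volume profile (VWAP). The selection rule in `choose_schedule`, including its threshold and the scalar `liquidity_score`, is a hypothetical stand-in for the learned policy $\pi_{exec}$.

```python
def twap_schedule(total_shares, n_slices):
    """Split an order into near-equal integer child orders."""
    base, extra = divmod(total_shares, n_slices)
    return [base + (1 if i < extra else 0) for i in range(n_slices)]

def vwap_schedule(total_shares, volume_profile):
    """Split an order in proportion to an expected intraday volume profile."""
    total_volume = sum(volume_profile)
    sizes = [int(total_shares * v / total_volume) for v in volume_profile]
    sizes[-1] += total_shares - sum(sizes)  # rounding remainder goes last
    return sizes

def choose_schedule(total_shares, liquidity_score, volume_profile, threshold=0.5):
    """Hypothetical rule: follow volume when liquidity is thin, else slice evenly."""
    if liquidity_score < threshold:
        return vwap_schedule(total_shares, volume_profile)
    return twap_schedule(total_shares, len(volume_profile))

profile = [1, 2, 1]  # toy U-shaped-free profile: most volume mid-session
thin_market = choose_schedule(100, liquidity_score=0.2, volume_profile=profile)
deep_market = choose_schedule(100, liquidity_score=0.9, volume_profile=profile)
```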
### 3.5 Updates $\mathcal{U}$: Closing the Loop
Operator $\mathcal{U}$ operationalizes H3 via a dual-timescale mechanism, enabling PandaAI to adapt to non-stationary environments beyond static pipelines.
##### Fast Loop \(Symbolic Rule Induction\)
Upon detecting statistically significant failure clusters in $E$, the system triggers symbolic induction. By contrastively prompting the LLM with failed samples ($\text{Sharpe} < 0$) against successful ones within regime $z_t$, $\mathcal{U}$ extracts logical predicates, for instance identifying that reversal operators underperform during high-momentum phases. These insights are formalized as symbolic rules (e.g., IF Trend($z_t$) $> \tau$ THEN Ban(Reversal)) and immediately appended to $\mathcal{C}$.
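A rule such as IF Trend($z_t$) $> \tau$ THEN Ban(Reversal) can be represented as plain data and checked against candidate factors. The sketch below assumes a hypothetical rule schema (`BanRule`) and a dictionary of named regime features; neither is specified in the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BanRule:
    """IF regime[feature] > threshold THEN ban `operator` (hypothetical schema)."""
    feature: str
    threshold: float
    operator: str

    def forbids(self, regime, factor_ops):
        active = regime.get(self.feature, 0.0) > self.threshold
        return active and self.operator in factor_ops

# Stand-in for the constraint set C; induced rules are appended as they arrive.
constraints = []
constraints.append(BanRule(feature="trend", threshold=0.8, operator="reversal"))

def is_allowed(regime, factor_ops):
    """A candidate factor is allowed only if no active rule forbids it."""
    return not any(rule.forbids(regime, factor_ops) for rule in constraints)

trending = {"trend": 0.9}
calm = {"trend": 0.2}
```

Because each rule is conditioned on the regime, the same reversal factor is rejected while the trend feature exceeds the threshold and accepted again once it falls back.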
##### Slow Loop \(Parametric Adaptation\)
To incorporate long-term market evolution, successful trajectories (verified CoT traces and profitable executions) are stored in an experience replay buffer. We periodically update $\theta$ via LoRA Hu et al. ([2022](https://arxiv.org/html/2606.06823#bib.bib33)) with a 5% data replay ratio Qi et al. ([2025](https://arxiv.org/html/2606.06823#bib.bib31)). This configuration preserves structural priors while adapting to distribution shifts, mitigating catastrophic forgetting at lower cost than full-parameter fine-tuning, so that the model continues to evolve alongside market dynamics.
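One possible reading of the 5% replay ratio is that replayed samples make up about 5% of each update batch, with the remainder being fresh trajectories. The sketch below implements that reading; the function name, the sampling scheme, and the interpretation of the ratio itself are assumptions rather than details given in the paper.

```python
import random

def build_update_batch(new_samples, replay_buffer, replay_ratio=0.05, seed=0):
    """Mix fresh trajectories with a small share of replayed older ones."""
    rng = random.Random(seed)
    # Choose n_replay so that replayed items are ~replay_ratio of the batch.
    n_replay = round(len(new_samples) * replay_ratio / (1.0 - replay_ratio))
    n_replay = min(n_replay, len(replay_buffer))
    batch = list(new_samples) + rng.sample(replay_buffer, n_replay)
    rng.shuffle(batch)
    return batch

new = [("new", i) for i in range(95)]
old = [("old", i) for i in range(50)]
batch = build_update_batch(new, old)
n_old = sum(1 for tag, _ in batch if tag == "old")
```

The resulting batch would then be passed to a LoRA update step, which is outside the scope of this sketch.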
## 4 Experiments
##### Data Partition and Anti\-leakage
Our data universe consists of the CSI 300 Index constituents\. To ensure the integrity of our results against temporal data leakage \(a common concern in LLM\-based finance\), we implement a strict time\-series split:
- Training and SFT Period: January 1, 2015, to December 31, 2022.
- Validation/Buffer Period: January 1, 2023, to December 31, 2023.
- Out-of-Sample (OOS) Test: January 1, 2024, to August 31, 2024.
Notably, our testing period (2024) falls strictly after the release of DeepSeek-Coder-33B (Nov 2023), ensuring that the agent faces market dynamics that the base model could not have encountered during pre-training. Table [5](https://arxiv.org/html/2606.06823#S6.T5) summarizes the formulas of all alpha factors derived from our approach.
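The split above can be encoded as a small lookup, which makes it easy to assert that no training sample carries a date from the validation or test window. The dates are those listed above; the function and dictionary names are illustrative.

```python
from datetime import date

# Inclusive date ranges taken from the partition described above.
SPLITS = {
    "train": (date(2015, 1, 1), date(2022, 12, 31)),
    "validation": (date(2023, 1, 1), date(2023, 12, 31)),
    "test": (date(2024, 1, 1), date(2024, 8, 31)),
}

def assign_split(day):
    """Return the split a trading day belongs to, or None if outside all ranges."""
    for name, (start, end) in SPLITS.items():
        if start <= day <= end:
            return name
    return None

def check_no_leakage(train_days):
    """Raise if any day in the training set lies outside the training range."""
    leaked = [d for d in train_days if assign_split(d) != "train"]
    if leaked:
        raise ValueError(f"{len(leaked)} training samples leak past the cutoff")
    return True
```

Note that a date check alone is not sufficient for labels that look forward in time: a sample dated near the end of a split can still have a label computed from prices in the next split, so a gap of at least the label horizon is usually left at each boundary.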
##### Features and Labels
The input sequence data $x$ has a look-back window of $T = 60$ days of OHLCV data. The target label $y_t$ is the cross-sectionally standardized 5-day forward return: $y_t = \frac{price_{t+5+1} - price_{t+1}}{price_{t+1}}$. This formulation is consistent with baselines like LSTM Hochreiter and Schmidhuber ([1997](https://arxiv.org/html/2606.06823#bib.bib54)), Transformer Vaswani et al. ([2017](https://arxiv.org/html/2606.06823#bib.bib53)), and StockMixer Fan and Shen ([2024](https://arxiv.org/html/2606.06823#bib.bib55)).
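The label definition can be written out directly. The sketch below computes the raw forward return for one asset and then standardizes across assets on the same day; using a z-score with the population standard deviation is an assumption, since the text does not say which standardization is applied.

```python
from statistics import mean, pstdev

def forward_return(prices, t, horizon=5):
    """Return from the price at t+1 to the price at t+1+horizon."""
    entry, exit_price = prices[t + 1], prices[t + 1 + horizon]
    return (exit_price - entry) / entry

def standardize(values):
    """Cross-sectional z-score across assets on a single day."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]

# Toy daily prices for one asset; index 0 is day t.
prices = [100.0, 100.0, 101.0, 102.0, 103.0, 104.0, 110.0]
raw_label = forward_return(prices, t=0)  # uses prices[1] and prices[6]

# Raw labels of three assets on the same day, then standardized.
labels = standardize([0.01, 0.02, 0.03])
```

Starting the return at $t+1$ rather than $t$ reflects that a signal computed from data up to day $t$ can only be traded from the next day onward.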
##### Financial Realism and Metrics
To bridge the “simulation\-to\-reality” gap, our backtest incorporates realistic market frictions:
- Transaction Costs: We apply a double-sided commission of 15 bps and a slippage of 5 bps.
- Trading Logic: Rebalancing is performed daily. Daily turnover is capped at 50% through the constraint set $\mathcal{C}$ to suppress “financially toxic” high-frequency noise.
Performance is evaluated via Information Coefficient (IC), Rank IC, ICIR, Annualized Return (AR), and Maximum Drawdown (MDD). We conduct a 5-group backtest and perform a t-test on the returns. A t-statistic $> 2.0$ (95% confidence) indicates significant alpha. Formulaic factors used in our experiments are summarized in Table [5](https://arxiv.org/html/2606.06823#S6.T5).
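For reference, Rank IC and ICIR can be computed as below, using their common definitions: Rank IC is the Spearman correlation between the signal and the realized returns on one day, and ICIR is the mean of the daily ICs divided by their standard deviation. The paper does not spell out its exact conventions (for example tie handling or annualization), so those details here are assumptions.

```python
import math
from statistics import mean, stdev

def average_ranks(values):
    """1-based ranks in which tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def rank_ic(signal, returns):
    """Spearman correlation between one day's signal and realized returns."""
    return pearson(average_ranks(signal), average_ranks(returns))

def icir(daily_ics):
    """Mean daily IC divided by its sample standard deviation."""
    return mean(daily_ics) / stdev(daily_ics)
```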
### 4.1 Overall Performance
Table [2](https://arxiv.org/html/2606.06823#S4.T2) compares the performance with the neural network baselines. We observe that pure deep neural networks learn little meaningful pattern from the low-SNR financial data. The tailored StockMixer adapts to this data better. Notably, Factor 1 generated by PandaAI outperforms all the neural networks, indicating that our framework can mine sound formulaic alpha factors from financial data. Furthermore, the t-statistic of Factor 1 is 9.9667, well above the 2.0 significance threshold.
Table 2: Comparison on CSI 300 with neural network baselines
### 4.2 Ablation Study
In this section, we examine the effectiveness of each module with respect to the three hypotheses.
#### 4.2.1 Contextualization Hypothesis Test
To investigate the contributions of fine-tuning and the injection of the latent state $z_t$, we conduct a series of controlled ablation experiments: Factor 2 is generated by the LLM without fine-tuning; Factor 3 is generated by the fine-tuned LLM without $z_t$; Factor 4 is generated for $\mathcal{A}$ without $z_t$; Factor 5 is generated for $\mathcal{P}$ without $z_t$. Figure [3](https://arxiv.org/html/2606.06823#S4.F3) visualizes their performance. We observe that Factor 1 from the fully equipped framework outperforms the factors from the ablated variants. This supports H1.
Figure 3: The results of the Contextualization Hypothesis test on 5 metrics
#### 4.2.2 Constrained-Creativity Hypothesis Test
To verify H2, we generated Factor 6 by disabling the constraint set $\mathcal{C}$ during the MCTS search. Factor 6 achieved an IC of 0.0207, a Rank IC of 0.0592, and a raw ICIR of 0.2484.
While the unconstrained Factor 6 shows a higher raw ICIR (0.2484) than Factor 1 (0.193), a deeper financial analysis reveals its financial toxicity: its extreme daily turnover ($> 80\%$) renders it untradeable in real-world scenarios. Once realistic transaction costs (15 bps fee + 5 bps slippage) are deducted, its net performance drops significantly, failing to maintain consistent profitability. In contrast, Factor 1 (generated under constraints $\mathcal{C}$) maintains a healthy balance between predictive power and tradeability. This supports H2: the constraints are essential for generating alpha factors that are not just statistically significant but also practically robust and execution-friendly.
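A back-of-the-envelope calculation shows why turnover dominates net performance at these cost levels. The sketch assumes that the 15 bps fee and 5 bps slippage are charged on the traded notional, that turnover is the fraction of the portfolio traded per day, and that there are 252 trading days per year. All three are simplifying assumptions of this illustration, not figures from the paper's backtest.

```python
def annual_cost_drag(daily_turnover, cost_bps, trading_days=252):
    """Approximate yearly cost as turnover x per-trade cost x trading days."""
    return daily_turnover * cost_bps / 10_000 * trading_days

COST_BPS = 15 + 5  # fee plus slippage, in basis points of traded notional

drag_unconstrained = annual_cost_drag(0.80, COST_BPS)  # turnover at the 80% level
drag_capped = annual_cost_drag(0.50, COST_BPS)         # turnover at the 50% cap
```

Under these assumptions the cost drag is roughly 40% per year at 80% daily turnover and roughly 25% per year at the 50% cap, so a factor needs a correspondingly large gross return before it is profitable net of costs.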
#### 4.2.3 Meta-Adaptation Hypothesis Test
The Meta-Adaptation Hypothesis involves the fast loop and the slow loop. First, we generate Factor 7 without the fast loop. Second, we hold out a short period of data for slow-loop adaptation and use the rest for the initial fine-tuning; the resulting fine-tuned LLM generates Factor 8. Table [3](https://arxiv.org/html/2606.06823#S4.T3) shows the results of this ablation study. We conclude that without the fast loop, the quality of the mined formulaic alpha decays, and that the slow loop lets the fine-tuned LLM perform similarly to one fine-tuned on the full data. In short, H3 is supported.
Table 3: Results on Meta-Adaptation Hypothesis Test
## 5 Conclusion
In this paper, we presented PandaAI, a neuro-symbolic framework that integrates a fine-tuned domain-specific Large Language Model into a closed-loop system to address the low signal-to-noise ratio and non-stationarity inherent in financial data. By explicitly modeling latent market regimes $z_t$ (H1), constraining LLM-guided MCTS search with financial priors to produce robust, low-toxicity alpha factors (H2), and closing the loop via symbolic constraint induction and parametric updates from execution feedback (H3), PandaAI enables adaptive, context-aware quantitative decision-making across market dynamics, alpha mining, portfolio optimization, and realistic execution.
## References
- \[1\]DeepSeek\-AI\(2025\)DeepSeek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.External Links:2501\.12948,[Link](https://arxiv.org/abs/2501.12948)Cited by:[§6\.2\.2](https://arxiv.org/html/2606.06823#S6.SS2.SSS2.Px2.p1.3)\.
- \[2\]J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova\(2019\)BERT: pre\-training of deep bidirectional transformers for language understanding\.External Links:1810\.04805,[Link](https://arxiv.org/abs/1810.04805)Cited by:[§1](https://arxiv.org/html/2606.06823#S1.p1.1)\.
- \[3\]J\. Fan and Y\. Shen\(2024\)StockMixer: a simple yet strong mlp\-based architecture for stock price forecasting\.InProceedings of the Thirty\-Eighth AAAI Conference on Artificial Intelligence and Thirty\-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence,AAAI’24/IAAI’24/EAAI’24\.External Links:ISBN 978\-1\-57735\-887\-9,[Link](https://doi.org/10.1609/aaai.v38i8.28681),[Document](https://dx.doi.org/10.1609/aaai.v38i8.28681)Cited by:[§4](https://arxiv.org/html/2606.06823#S4.SS0.SSS0.Px2.p1.4)\.
- \[4\]A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, A\. Yang, A\. Fan, A\. Goyal, A\. Hartshorn, A\. Yang, A\. Mitra, A\. Sravankumar, A\. Korenev, A\. Hinsvark, A\. Rao, A\. Zhang, A\. Rodriguez, A\. Gregerson, A\. Spataru, B\. Roziere, B\. Biron, B\. Tang, B\. Chern, C\. Caucheteux, C\. Nayak, C\. Bi, C\. Marra, C\. McConnell, C\. Keller, C\. Touret, C\. Wu, C\. Wong, C\. C\. Ferrer, C\. Nikolaidis, D\. Allonsius, D\. Song, D\. Pintz, D\. Livshits, D\. Wyatt, D\. Esiobu, D\. Choudhary, D\. Mahajan, D\. Garcia\-Olano, D\. Perino, D\. Hupkes, E\. Lakomkin, E\. AlBadawy, E\. Lobanova, E\. Dinan, E\. M\. Smith, F\. Radenovic, F\. Guzmán, F\. Zhang, G\. Synnaeve, G\. Lee, G\. L\. Anderson, G\. Thattai, G\. Nail, G\. Mialon, G\. Pang, G\. Cucurell, H\. Nguyen, H\. Korevaar, H\. Xu, H\. Touvron, I\. Zarov, I\. A\. Ibarra, I\. Kloumann, I\. Misra, I\. Evtimov, J\. Zhang, J\. Copet, J\. Lee, J\. Geffert, J\. Vranes, J\. Park, J\. Mahadeokar, J\. Shah, J\. van der Linde, J\. Billock, J\. Hong, J\. Lee, J\. Fu, J\. Chi, J\. Huang, J\. Liu, J\. Wang, J\. Yu, J\. Bitton, J\. Spisak, J\. Park, J\. Rocca, J\. Johnstun, J\. Saxe, J\. Jia, K\. V\. Alwala, K\. Prasad, K\. Upasani, K\. Plawiak, K\. Li, K\. Heafield, K\. Stone, K\. El\-Arini, K\. Iyer, K\. Malik, K\. Chiu, K\. Bhalla, K\. Lakhotia, L\. Rantala\-Yeary, L\. van der Maaten, L\. Chen, L\. Tan, L\. Jenkins, L\. Martin, L\. Madaan, L\. Malo, L\. Blecher, L\. Landzaat, L\. de Oliveira, M\. Muzzi, M\. Pasupuleti, M\. Singh, M\. Paluri, M\. Kardas, M\. Tsimpoukelli, M\. Oldham, M\. Rita, M\. Pavlova, M\. Kambadur, M\. Lewis, M\. Si, M\. K\. Singh, M\. Hassan, N\. Goyal, N\. Torabi, N\. Bashlykov, N\. Bogoychev, N\. Chatterji, N\. Zhang, O\. Duchenne, O\. Çelebi, P\. Alrassy, P\. Zhang, P\. Li, P\. Vasic, P\. Weng, P\. Bhargava, P\. Dubal, P\. Krishnan, P\. S\. Koura, P\. Xu, Q\. He, Q\. Dong, R\. Srinivasan, R\. Ganapathy, R\. Calderer, R\. 
S\. Cabral, R\. Stojnic, R\. Raileanu, R\. Maheswari, R\. Girdhar, R\. Patel, R\. Sauvestre, R\. Polidoro, R\. Sumbaly, R\. Taylor, R\. Silva, R\. Hou, R\. Wang, S\. Hosseini, S\. Chennabasappa, S\. Singh, S\. Bell, S\. S\. Kim, S\. Edunov, S\. Nie, S\. Narang, S\. Raparthy, S\. Shen, S\. Wan, S\. Bhosale, S\. Zhang, S\. Vandenhende, S\. Batra, S\. Whitman, S\. Sootla, S\. Collot, S\. Gururangan, S\. Borodinsky, T\. Herman, T\. Fowler, T\. Sheasha, T\. Georgiou, T\. Scialom, T\. Speckbacher, T\. Mihaylov, T\. Xiao, U\. Karn, V\. Goswami, V\. Gupta, V\. Ramanathan, V\. Kerkez, V\. Gonguet, V\. Do, V\. Vogeti, V\. Albiero, V\. Petrovic, W\. Chu, W\. Xiong, W\. Fu, W\. Meers, X\. Martinet, X\. Wang, X\. Wang, X\. E\. Tan, X\. Xia, X\. Xie, X\. Jia, X\. Wang, Y\. Goldschlag, Y\. Gaur, Y\. Babaei, Y\. Wen, Y\. Song, Y\. Zhang, Y\. Li, Y\. Mao, Z\. D\. Coudert, Z\. Yan, Z\. Chen, Z\. Papakipos, A\. Singh, A\. Srivastava, A\. Jain, A\. Kelsey, A\. Shajnfeld, A\. Gangidi, A\. Victoria, A\. Goldstand, A\. Menon, A\. Sharma, A\. Boesenberg, A\. Baevski, A\. Feinstein, A\. Kallet, A\. Sangani, A\. Teo, A\. Yunus, A\. Lupu, A\. Alvarado, A\. Caples, A\. Gu, A\. Ho, A\. Poulton, A\. Ryan, A\. Ramchandani, A\. Dong, A\. Franco, A\. Goyal, A\. Saraf, A\. Chowdhury, A\. Gabriel, A\. Bharambe, A\. Eisenman, A\. Yazdan, B\. James, B\. Maurer, B\. Leonhardi, B\. Huang, B\. Loyd, B\. D\. Paola, B\. Paranjape, B\. Liu, B\. Wu, B\. Ni, B\. Hancock, B\. Wasti, B\. Spence, B\. Stojkovic, B\. Gamido, B\. Montalvo, C\. Parker, C\. Burton, C\. Mejia, C\. Liu, C\. Wang, C\. Kim, C\. Zhou, C\. Hu, C\. Chu, C\. Cai, C\. Tindal, C\. Feichtenhofer, C\. Gao, D\. Civin, D\. Beaty, D\. Kreymer, D\. Li, D\. Adkins, D\. Xu, D\. Testuggine, D\. David, D\. Parikh, D\. Liskovich, D\. Foss, D\. Wang, D\. Le, D\. Holland, E\. Dowling, E\. Jamil, E\. Montgomery, E\. Presani, E\. Hahn, E\. Wood, E\. Le, E\. Brinkman, E\. Arcaute, E\. Dunbar, E\. Smothers, F\. Sun, F\. Kreuk, F\. Tian, F\. Kokkinos, F\. 
Ozgenel, F\. Caggioni, F\. Kanayet, F\. Seide, G\. M\. Florez, G\. Schwarz, G\. Badeer, G\. Swee, G\. Halpern, G\. Herman, G\. Sizov, Guangyi, Zhang, G\. Lakshminarayanan, H\. Inan, H\. Shojanazeri, H\. Zou, H\. Wang, H\. Zha, H\. Habeeb, H\. Rudolph, H\. Suk, H\. Aspegren, H\. Goldman, H\. Zhan, I\. Damlaj, I\. Molybog, I\. Tufanov, I\. Leontiadis, I\. Veliche, I\. Gat, J\. Weissman, J\. Geboski, J\. Kohli, J\. Lam, J\. Asher, J\. Gaya, J\. Marcus, J\. Tang, J\. Chan, J\. Zhen, J\. Reizenstein, J\. Teboul, J\. Zhong, J\. Jin, J\. Yang, J\. Cummings, J\. Carvill, J\. Shepard, J\. McPhie, J\. Torres, J\. Ginsburg, J\. Wang, K\. Wu, K\. H\. U, K\. Saxena, K\. Khandelwal, K\. Zand, K\. Matosich, K\. Veeraraghavan, K\. Michelena, K\. Li, K\. Jagadeesh, K\. Huang, K\. Chawla, K\. Huang, L\. Chen, L\. Garg, L\. A, L\. Silva, L\. Bell, L\. Zhang, L\. Guo, L\. Yu, L\. Moshkovich, L\. Wehrstedt, M\. Khabsa, M\. Avalani, M\. Bhatt, M\. Mankus, M\. Hasson, M\. Lennie, M\. Reso, M\. Groshev, M\. Naumov, M\. Lathi, M\. Keneally, M\. Liu, M\. L\. Seltzer, M\. Valko, M\. Restrepo, M\. Patel, M\. Vyatskov, M\. Samvelyan, M\. Clark, M\. Macey, M\. Wang, M\. J\. Hermoso, M\. Metanat, M\. Rastegari, M\. Bansal, N\. Santhanam, N\. Parks, N\. White, N\. Bawa, N\. Singhal, N\. Egebo, N\. Usunier, N\. Mehta, N\. P\. Laptev, N\. Dong, N\. Cheng, O\. Chernoguz, O\. Hart, O\. Salpekar, O\. Kalinli, P\. Kent, P\. Parekh, P\. Saab, P\. Balaji, P\. Rittner, P\. Bontrager, P\. Roux, P\. Dollar, P\. Zvyagina, P\. Ratanchandani, P\. Yuvraj, Q\. Liang, R\. Alao, R\. Rodriguez, R\. Ayub, R\. Murthy, R\. Nayani, R\. Mitra, R\. Parthasarathy, R\. Li, R\. Hogan, R\. Battey, R\. Wang, R\. Howes, R\. Rinott, S\. Mehta, S\. Siby, S\. J\. Bondu, S\. Datta, S\. Chugh, S\. Hunt, S\. Dhillon, S\. Sidorov, S\. Pan, S\. Mahajan, S\. Verma, S\. Yamamoto, S\. Ramaswamy, S\. Lindsay, S\. Lindsay, S\. Feng, S\. Lin, S\. C\. Zha, S\. Patil, S\. Shankar, S\. Zhang, S\. Zhang, S\. Wang, S\. Agarwal, S\. 
Sajuyigbe, S\. Chintala, S\. Max, S\. Chen, S\. Kehoe, S\. Satterfield, S\. Govindaprasad, S\. Gupta, S\. Deng, S\. Cho, S\. Virk, S\. Subramanian, S\. Choudhury, S\. Goldman, T\. Remez, T\. Glaser, T\. Best, T\. Koehler, T\. Robinson, T\. Li, T\. Zhang, T\. Matthews, T\. Chou, T\. Shaked, V\. Vontimitta, V\. Ajayi, V\. Montanez, V\. Mohan, V\. S\. Kumar, V\. Mangla, V\. Ionescu, V\. Poenaru, V\. T\. Mihailescu, V\. Ivanov, W\. Li, W\. Wang, W\. Jiang, W\. Bouaziz, W\. Constable, X\. Tang, X\. Wu, X\. Wang, X\. Wu, X\. Gao, Y\. Kleinman, Y\. Chen, Y\. Hu, Y\. Jia, Y\. Qi, Y\. Li, Y\. Zhang, Y\. Zhang, Y\. Adi, Y\. Nam, Yu, Wang, Y\. Zhao, Y\. Hao, Y\. Qian, Y\. Li, Y\. He, Z\. Rait, Z\. DeVito, Z\. Rosnbrick, Z\. Wen, Z\. Yang, Z\. Zhao, and Z\. Ma\(2024\)The llama 3 herd of models\.External Links:2407\.21783,[Link](https://arxiv.org/abs/2407.21783)Cited by:[§2\.1](https://arxiv.org/html/2606.06823#S2.SS1.p1.1)\.
- \[5\]D\. Guo, Q\. Zhu, D\. Yang, Z\. Xie, K\. Dong, W\. Zhang, G\. Chen, X\. Bi, Y\. Wu, Y\. K\. Li, F\. Luo, Y\. Xiong, and W\. Liang\(2024\)DeepSeek\-coder: when the large language model meets programming – the rise of code intelligence\.External Links:2401\.14196,[Link](https://arxiv.org/abs/2401.14196)Cited by:[§3\.3](https://arxiv.org/html/2606.06823#S3.SS3.p1.5)\.
- \[6\]J\. D\. Hamilton\(2020\)Time series analysis\.Princeton university press\.Cited by:[§2\.2](https://arxiv.org/html/2606.06823#S2.SS2.p1.1)\.
- \[7\]S\. Hochreiter and J\. Schmidhuber\(1997\)Long short\-term memory\.Neural Computation9\(8\),pp\. 1735–1780\.External Links:[Document](https://dx.doi.org/10.1162/neco.1997.9.8.1735)Cited by:[§4](https://arxiv.org/html/2606.06823#S4.SS0.SSS0.Px2.p1.4)\.
- \[8\]E\. J\. Hu, yelong shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen\(2022\)LoRA: low\-rank adaptation of large language models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=nZeVKeeFYf9)Cited by:[§3\.5](https://arxiv.org/html/2606.06823#S3.SS5.SSS0.Px2.p1.1)\.
- \[9\]T\. Kim, J\. Kim, Y\. Tae, C\. Park, J\. Choi, and J\. Choo\(2021\)Reversible instance normalization for accurate time\-series forecasting against distribution shift\.InInternational conference on learning representations,Cited by:[§2\.2](https://arxiv.org/html/2606.06823#S2.SS2.p1.1)\.
- \[10\]A\. Kirillov, E\. Mintun, N\. Ravi, H\. Mao, C\. Rolland, L\. Gustafson, T\. Xiao, S\. Whitehead, A\. C\. Berg, W\. Lo, P\. Dollár, and R\. Girshick\(2023\)Segment anything\.External Links:2304\.02643,[Link](https://arxiv.org/abs/2304.02643)Cited by:[§1](https://arxiv.org/html/2606.06823#S1.p1.1)\.
- \[11\]J\. R\. Koza\(1992\)Genetic programming: on the programming of computers by means of natural selection\.Cambridge, MA: MIT Press\.Cited by:[§2\.1](https://arxiv.org/html/2606.06823#S2.SS1.p1.1)\.
- \[12\]H\. Li, L\. Ding, M\. Fang, and D\. Tao\(2024\-11\)Revisiting catastrophic forgetting in large language model tuning\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 4297–4308\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.249/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.249)Cited by:[§6\.2\.1](https://arxiv.org/html/2606.06823#S6.SS2.SSS1.Px3.p1.4)\.
- \[13\]Y\. Li, Y\. Yu, H\. Li, Z\. Chen, and K\. Khashanah\(2023\)Tradinggpt: multi\-agent system with layered memory and distinct characters for enhanced financial trading performance\.arXiv preprint arXiv:2309\.03736\.Cited by:[§2\.3](https://arxiv.org/html/2606.06823#S2.SS3.p1.1)\.
- \[14\]Y\. Liu, T\. Hu, H\. Zhang, H\. Wu, S\. Wang, L\. Ma, and M\. Long\(2023\)Itransformer: inverted transformers are effective for time series forecasting\.arXiv preprint arXiv:2310\.06625\.Cited by:[§2\.2](https://arxiv.org/html/2606.06823#S2.SS2.p1.1)\.
- \[15\]L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. L\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. Christiano, J\. Leike, and R\. Lowe\(2022\)Training language models to follow instructions with human feedback\.InProceedings of the 36th International Conference on Neural Information Processing Systems,NIPS ’22,Red Hook, NY, USA\.External Links:ISBN 9781713871088Cited by:[§3\.3](https://arxiv.org/html/2606.06823#S3.SS3.p1.5),[§6\.2\.2](https://arxiv.org/html/2606.06823#S6.SS2.SSS2.Px2.p1.3)\.
- \[16\]J\. S\. Park, J\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein\(2023\)Generative agents: interactive simulacra of human behavior\.InProceedings of the 36th annual acm symposium on user interface software and technology,pp\. 1–22\.Cited by:[§2\.3](https://arxiv.org/html/2606.06823#S2.SS3.p1.1)\.
- \[17\]B\. K\. Petersen, M\. Landajuela, T\. N\. Mundhenk, C\. P\. Santiago, S\. K\. Kim, and J\. T\. Kim\(2019\)Deep symbolic regression: recovering mathematical expressions from data via risk\-seeking policy gradients\.arXiv preprint arXiv:1912\.04871\.Cited by:[§2\.1](https://arxiv.org/html/2606.06823#S2.SS1.p1.1)\.
- \[18\]Z\. Qi, F\. Nie, A\. Alahi, J\. Zou, H\. Lakkaraju, Y\. Du, E\. P\. Xing, S\. M\. Kakade, and H\. Zhang\(2025\)EvoLM: in search of lost language model training dynamics\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=B6bE2GC71a)Cited by:[§3\.3](https://arxiv.org/html/2606.06823#S3.SS3.p1.5),[§3\.5](https://arxiv.org/html/2606.06823#S3.SS5.SSS0.Px2.p1.1)\.
- \[19\]J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov\(2017\)Proximal policy optimization algorithms\.External Links:1707\.06347,[Link](https://arxiv.org/abs/1707.06347)Cited by:[§6\.2\.2](https://arxiv.org/html/2606.06823#S6.SS2.SSS2.Px3.p2.4)\.
- \[20\]A\. Sheikh\(1996\)BARRA’s risk models\.Barra Research Insights,pp\. 1–24\.Cited by:[§3\.1](https://arxiv.org/html/2606.06823#S3.SS1.SSS0.Px1.p1.2)\.
- \[21\]Y\. Shen, K\. Song, X\. Tan, D\. Li, W\. Lu, and Y\. Zhuang\(2023\)Hugginggpt: solving ai tasks with chatgpt and its friends in hugging face\.Advances in Neural Information Processing Systems36,pp\. 38154–38180\.Cited by:[§2\.3](https://arxiv.org/html/2606.06823#S2.SS3.p1.1)\.
- \[22\]Y\. Shi, Y\. Duan, and J\. Li\(2025\)Navigating the alpha jungle: an llm\-powered mcts framework for formulaic factor mining\.arXiv preprint arXiv:2505\.11122\.Cited by:[§2\.1](https://arxiv.org/html/2606.06823#S2.SS1.p1.1)\.
- \[23\]N\. Stiennon, L\. Ouyang, J\. Wu, D\. M\. Ziegler, R\. Lowe, C\. Voss, A\. Radford, D\. Amodei, and P\. Christiano\(2020\)Learning to summarize from human feedback\.InProceedings of the 34th International Conference on Neural Information Processing Systems,NIPS ’20,Red Hook, NY, USA\.External Links:ISBN 9781713829546Cited by:[§6\.2\.2](https://arxiv.org/html/2606.06823#S6.SS2.SSS2.Px2.p1.13)\.
- \[24\]S\. Sun, W\. Xue, R\. Wang, X\. He, J\. Zhu, J\. Li, and B\. An\(2022\)DeepScalper: a risk\-aware reinforcement learning framework to capture fleeting intraday trading opportunities\.InProceedings of the 31st ACM International Conference on Information & Knowledge Management,pp\. 1858–1867\.Cited by:[§2\.1](https://arxiv.org/html/2606.06823#S2.SS1.p1.1)\.
- \[25\]Z\. Sun, C\. Feng, I\. Patras, and G\. Tzimiropoulos\(2024\-06\)LAFS: landmark\-based facial self\-supervised learning for face recognition\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),pp\. 1639–1649\.Cited by:[§1](https://arxiv.org/html/2606.06823#S1.p1.1)\.
- \[26\]A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin\(2017\)Attention is all you need\.InAdvances in Neural Information Processing Systems,I\. Guyon, U\. V\. Luxburg, S\. Bengio, H\. Wallach, R\. Fergus, S\. Vishwanathan, and R\. Garnett \(Eds\.\),Vol\.30,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)Cited by:[§4](https://arxiv.org/html/2606.06823#S4.SS0.SSS0.Px2.p1.4)\.
- \[27\]S\. Wang, H\. Yuan, L\. Zhou, L\. Ni, H\. Y\. Shum, and J\. Guo\(2025\)Alpha\-gpt: human\-ai interactive alpha mining for quantitative investment\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations,pp\. 196–206\.Cited by:[§2\.1](https://arxiv.org/html/2606.06823#S2.SS1.p1.1)\.
- \[28\]H\. Wu, T\. Hu, Y\. Liu, H\. Zhou, J\. Wang, and M\. Long\(2022\)Timesnet: temporal 2d\-variation modeling for general time series analysis\.arXiv preprint arXiv:2210\.02186\.Cited by:[§2\.2](https://arxiv.org/html/2606.06823#S2.SS2.p1.1)\.
- \[29\]G\. Yenduri, R\. M, C\. S\. G, S\. Y, G\. Srivastava, P\. K\. R\. Maddikunta, D\. R\. G, R\. H\. Jhaveri, P\. B, W\. Wang, A\. V\. Vasilakos, and T\. R\. Gadekallu\(2023\)Generative pre\-trained transformer: a comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions\.External Links:2305\.10435,[Link](https://arxiv.org/abs/2305.10435)Cited by:[§1](https://arxiv.org/html/2606.06823#S1.p1.1)\.
- \[30\]S\. Yu, H\. Xue, X\. Ao, F\. Pan, J\. He, D\. Tu, and Q\. He\(2023\)Generating synergistic formulaic alpha collections via reinforcement learning\.InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,pp\. 5476–5486\.Cited by:[§2\.1](https://arxiv.org/html/2606.06823#S2.SS1.p1.1)\.
- \[31\]W\. Zhang, L\. Zhao, H\. Xia, S\. Sun, J\. Sun, M\. Qin, X\. Li, Y\. Zhao, Y\. Zhao, X\. Cai,et al\.\(2024\)A multimodal foundation agent for financial trading: tool\-augmented, diversified, and generalist\.InProceedings of the 30th acm sigkdd conference on knowledge discovery and data mining,pp\. 4314–4325\.Cited by:[§2\.3](https://arxiv.org/html/2606.06823#S2.SS3.p1.1)\.
- \[32\]L\. Zhao, S\. Kong, and Y\. Shen\(2023\)Doubleadapt: a meta\-learning approach to incremental learning for stock trend forecasting\.InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,pp\. 3492–3503\.Cited by:[§2\.2](https://arxiv.org/html/2606.06823#S2.SS2.p1.1)\.
## 6 Appendix
Table 4: Summary of key notations and module definitions in PandaAI.
### 6.1 Detailed Procedure of Constrained LLM-guided Monte Carlo Tree Search
Algorithm [4](https://arxiv.org/html/2606.06823#S6.F4) illustrates the pseudocode of the whole procedure.
1\. Selection (Regime-Adaptive): Nodes are selected using a modified UCT algorithm in which the exploration constant $c$ is not static but modulated by the market state $z_t$:

$$\text{UCT}(s)=\frac{Q(s)}{N(s)}+c(z_t)\cdot\sqrt{\frac{\ln N(\text{parent})}{N(s)}} \tag{2}$$

Here, $c(z_t)$ is inversely proportional to the market entropy detected by $\mathcal{M}$. In stable regimes, $c(z_t)$ increases to encourage broad exploration; in turbulent regimes (e.g., liquidity crises), $c(z_t)$ decays to enforce conservative exploitation of known safe sub-trees.
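The selection rule in Eq. (2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the paper only states that $c(z_t)$ is inversely proportional to market entropy, so the concrete form `c0 / (entropy + eps)`, the function names, and the convention of visiting unvisited children first are all assumptions.

```python
import math

def exploration_constant(entropy, c0=1.0, eps=1e-6):
    """Hypothetical c(z_t): inversely proportional to market entropy."""
    return c0 / (entropy + eps)

def uct_score(q_total, n_visits, n_parent, c):
    """Eq. (2): mean node value plus an exploration bonus scaled by c."""
    if n_visits == 0:
        return math.inf  # assumed convention: unvisited children are tried first
    return q_total / n_visits + c * math.sqrt(math.log(n_parent) / n_visits)

def select_child(children, n_parent, entropy):
    """Pick the child (dict with keys 'Q' and 'N') with the highest UCT score."""
    c = exploration_constant(entropy)
    return max(children, key=lambda ch: uct_score(ch["Q"], ch["N"], n_parent, c))

children = [
    {"name": "exploit", "Q": 8.0, "N": 10},  # well-visited, high mean value
    {"name": "explore", "Q": 0.5, "N": 1},   # rarely visited, lower mean value
]
calm = select_child(children, n_parent=11, entropy=0.1)
turbulent = select_child(children, n_parent=11, entropy=10.0)
```

With low entropy the exploration bonus dominates and the rarely visited child is chosen; with high entropy the bonus shrinks and the well-tested child wins, which matches the behaviour described above.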
2\. Expansion (Constrained Generation): The $\text{LLM}_{\theta}$ acts as the policy network $\pi(a|s,z_t)$. To operationalize Constraints as A Priori Regularization, we clarify the relationship between the static syntax rules $G_{\text{forbidden}}$ (in Algorithm 1) and the dynamic risk constraints $\mathcal{C}$ (updated by Module $\mathcal{U}$): specifically, $G_{\text{forbidden}}\subset\mathcal{C}$. During generation, we employ a "Prompt-Check-Regenerate" loop:
- Prompt Injection: The semantic rules from $\mathcal{C}$ are combined with the regime state $z_t$, which is prepended as continuous soft tokens to the system prompt via Channel 1.
- Pre-Simulation Filter: Generated candidates are immediately parsed against $G_{\text{forbidden}}$. Invalid formulas are rejected before entering the costly simulation phase.
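The Prompt-Check-Regenerate loop can be sketched as follows. Everything concrete here is hypothetical: the two regular-expression rules stand in for $G_{\text{forbidden}}$, `generate` is any callable that maps a prompt to a formula string, and the retry cap `max_tries` is an added safeguard that the paper's loop does not mention. The soft-token injection of $z_t$ is not modeled.

```python
import re

# Hypothetical static syntax rules standing in for G_forbidden.
FORBIDDEN_PATTERNS = [
    r"shift\([^()]*,\s*-",   # negative shift would read future data (look-ahead)
    r"/\s*0(?![.\d])",       # literal division by zero
]

def violates_forbidden(formula):
    """Return True if the formula matches any forbidden syntax pattern."""
    return any(re.search(p, formula) for p in FORBIDDEN_PATTERNS)

def build_prompt(context, constraints):
    """Combine the tree context with the textual constraint rules."""
    rules = "\n".join(f"- {rule}" for rule in constraints)
    return f"{context}\nConstraints:\n{rules}"

def generate_valid(generate, prompt, max_tries=5):
    """Call the generator until a candidate passes the filter, or give up."""
    for _ in range(max_tries):
        candidate = generate(prompt)
        if candidate and not violates_forbidden(candidate):
            return candidate
    return None  # the caller decides how to handle an exhausted retry budget

# Stub generator: first a look-ahead formula, then a valid one.
_outputs = iter(["rank(close / shift(close, -1))", "rank(close / shift(close, 5))"])
prompt = build_prompt("seed: rank(close)", ["no look-ahead", "no division by zero"])
accepted = generate_valid(lambda p: next(_outputs), prompt)
```

Rejecting candidates with a cheap syntactic check keeps invalid formulas out of the backtest, which is the expensive step of the search.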
3\. Simulation (Feedback & Soft Penalty): Candidates passing the expansion filter undergo backtesting. Although obvious violations are filtered a priori, subtle financial toxicities (e.g., high correlation with existing factors) can only be detected a posteriori. Thus, we define the node value function $V(f)$ with a penalty term for these residual violations:

$$V(f)=R_{\text{perf}}(f)+\alpha\cdot R_{\text{model}}(f)-\lambda\cdot\mathbb{I}(f\notin\mathcal{C}_{\text{dynamic}}) \tag{3}$$

where $R_{\text{perf}}$ denotes backtest metrics (e.g., IC), and $R_{\text{model}}$ is the alignment score from the RLHF Reward Model. The penalty indicator $\mathbb{I}(\cdot)$ handles the dynamic subset $\mathcal{C}_{\text{dynamic}}=\mathcal{C}\setminus G_{\text{forbidden}}$, ensuring that factors surviving the filter but failing on risk metrics (e.g., excessive daily turnover $>50\%$, which usually incurs prohibitive transaction costs) are heavily penalized.
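A worked instance of Eq. (3) in Python, as a sketch only. The weights `alpha` and `lam` are placeholder values (the appendix does not report them), and the turnover check is the single dynamic constraint named in the text; a real constraint set would contain more rules.

```python
def violates_turnover(daily_turnover, limit=0.5):
    """Example dynamic constraint from the text: daily turnover above 50%."""
    return daily_turnover > limit

def node_value(r_perf, r_model, violates_dynamic, alpha=0.5, lam=1.0):
    """Eq. (3): V(f) = R_perf + alpha * R_model - lam * 1[f violates C_dynamic]."""
    penalty = 1.0 if violates_dynamic else 0.0
    return r_perf + alpha * r_model - lam * penalty

# Same backtest IC and reward-model score, different turnover.
v_ok = node_value(r_perf=0.05, r_model=0.8, violates_dynamic=violates_turnover(0.2))
v_bad = node_value(r_perf=0.05, r_model=0.8, violates_dynamic=violates_turnover(0.7))
```

Under these placeholder weights the compliant factor scores 0.45 and the high-turnover factor scores -0.55, so the penalty alone moves a candidate from attractive to unattractive.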
4\. Backpropagation: The evaluation signals are propagated to update the node statistics, progressively steering the LLM towards the "valid and robust" subspace of the alpha universe.
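The backpropagation step can be illustrated with a small tree structure. This is a generic MCTS backup under the assumption that each node stores a cumulative value $Q$ and a visit count $N$, as implied by the ratio $Q(s)/N(s)$ in Eq. (2); the `Node` class and its field names are hypothetical.

```python
class Node:
    """Tree node holding the statistics used by UCT: total value Q and visits N."""

    def __init__(self, formula, parent=None):
        self.formula = formula
        self.parent = parent
        self.children = []
        self.Q = 0.0
        self.N = 0

    def add_child(self, formula):
        child = Node(formula, parent=self)
        self.children.append(child)
        return child

def backup(node, value):
    """Propagate a value from a node up to the root, updating Q and N on the path."""
    while node is not None:
        node.N += 1
        node.Q += value
        node = node.parent

root = Node("rank(close)")
child = root.add_child("rank(close / shift(close, 5))")
grandchild = child.add_child("rank(volume * close / shift(close, 5))")
backup(grandchild, 0.4)   # value of the newly evaluated factor
backup(child, -0.1)       # a later, penalized evaluation
```

After both backups the root has been visited twice with a total value of 0.3, so its mean value used in the next selection step is 0.15.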
Figure 4: LLM-Guided Constrained MCTS for Alpha Mining

- Input: $f_{\mathrm{seed}}$ (seed alpha), $\text{LLM}_{\theta}$, $z_t$ (market state), $\mathcal{C}$ (constraint set), $B$ (budget)
- Output: $F_{\mathrm{zoo}}$ (robust alpha repository)
- Initialization:
  - $F_{\mathrm{zoo}}\leftarrow\emptyset$
  - $s_0\leftarrow\text{CreateRoot}(f_{\mathrm{seed}})$
  - Tree $\mathcal{T}\leftarrow\{s_0\}$
  - $G_{\text{forbidden}}\leftarrow\text{ExtractSyntaxRules}(\mathcal{C})$ (static constraints subset)
- for $\textit{iter}\leftarrow 1$ to $B$ do
  - 1\. Selection Phase (Regime-Adaptive via H1):
    - $c_{\text{exp}}\leftarrow\text{ComputeExploration}(z_t)$ ($c(z_t)$ modulated by market entropy)
    - $s_{\text{leaf}}\leftarrow\text{SelectViaUCT}(\mathcal{T},s_0,c_{\text{exp}})$
  - 2\. Expansion Phase (Constraint-Guided via H2):
    - $\textit{context}\leftarrow\text{GetContext}(s_{\text{leaf}})$
    - $\textit{prompt}\leftarrow\textit{context}\cup\mathcal{C}$ (inject constraints into prompt)
    - $f_{\text{new}}\leftarrow\text{NULL}$
    - Pre-Simulation Filter Loop: while $f_{\text{new}}$ is invalid or $f_{\text{new}}\in G_{\text{forbidden}}$ do
      - $f_{\text{new}}\leftarrow\text{LLM}_{\theta}.\text{Generate}(\textit{prompt},z_t)$ (CoT generation)
    - end while
    - $s_{\text{new}}\leftarrow\text{AddChild}(\mathcal{T},s_{\text{leaf}},f_{\text{new}})$
  - 3\. Simulation Phase (Value Estimation):
    - $R_{\text{perf}}\leftarrow\text{Backtest}(f_{\text{new}},z_t)$
    - $R_{\text{model}}\leftarrow\text{RewardModel}(f_{\text{new}})$ (RLHF alignment score)
    - $\textit{penalty}\leftarrow\text{CheckDynamicConstraints}(f_{\text{new}},\mathcal{C}\setminus G_{\text{forbidden}})$
    - $V(f_{\text{new}})\leftarrow R_{\text{perf}}+\alpha\cdot R_{\text{model}}-\lambda\cdot\textit{penalty}$ (Eq. (3))
  - 4\. Backpropagation Phase:
    - $\text{BackupValues}(\mathcal{T},s_{\text{new}},V(f_{\text{new}}))$
  - Repository Update: if $V(f_{\text{new}})>\tau_{\text{accept}}$ then
    - $F_{\mathrm{zoo}}\leftarrow F_{\mathrm{zoo}}\cup\{f_{\text{new}}\}$
  - end if
- end for
- return $F_{\mathrm{zoo}}$
### 6.2 Detailed Fine-Tuning $\mathcal{T}$
The fine\-tuning dataset is not publicly available due to privacy obligations to clients and restrictions imposed by non\-disclosure agreements\.
#### 6.2.1 Supervised Fine-Tuning
##### Regime\-Conditioned Instruction Tuning
A financial instruction dataset $\mathcal{D}_{\text{SFT}}$ is constructed where each sample $v$ is tagged with the originating market state $z_t$ from Module $\mathcal{M}$. Each sample $v$ contains three components $(x,y,z_t)$, where $x$ denotes the question and $y$ the corresponding answer. This teaches the model to condition financial concepts on market context: the same underlying concept must be implemented differently depending on $z_t$, supporting H1 by embedding regime-awareness into the model's generative process.
##### Chain\-of\-Thought Financial Reasoning
A reasoning trace dataset $V_{SFT\_QRA}$ is constructed, where each sample is a triplet $v=(x,c,y)$. Here $c$ represents the reasoning trace, formatted as <think\>...</think\>, and $y$ is the answer, formatted as <answer\>...</answer\>. From the $V_{SFT\_QRA}$ samples, the model learns explicit reasoning structures that decompose factor design into logical steps.
##### Supervised Fine\-Tuning
Supervised Fine-Tuning (SFT) is first performed on DeepSeek-Coder-33B using $V_{SFT\_QA}$ to expose the LLM to the contextualization hypothesis. In this stage, the tagged $z_t$ of each sample is injected into the LLM following Channel 1, and the parameters of the Symbolic Adapter are trained jointly with the CQ2 fine-tuning. Next, SFT continues using $V_{SFT\_QRA}$, optimizing key aspects of financial reasoning. However, naively applying SFT to all parameters with new data leads to catastrophic forgetting \[[12](https://arxiv.org/html/2606.06823#bib.bib36)\], which degrades the capabilities of the original model. To mitigate this, the original DeepSeek-Coder-33B is used as the teacher, and an auxiliary loss $L_{KD}$ measures the KL divergence between the outputs of the original and fine-tuned models.
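The role of the auxiliary loss $L_{KD}$ can be illustrated for a single next-token distribution. This is a sketch under assumptions the text does not settle: the total loss is taken to be cross-entropy plus `gamma` times the KL term, the KL direction is teacher-to-student, and `gamma` is a placeholder weight.

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution (shifted for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def sft_with_kd_loss(student_logits, teacher_logits, target_index, gamma=0.1):
    """Cross-entropy on the target token plus gamma * KL(teacher || student)."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    ce = -math.log(p_student[target_index])
    return ce + gamma * kl_divergence(p_teacher, p_student)

# The student has drifted away from the frozen teacher on this token.
drifted = sft_with_kd_loss([2.0, 0.0, 0.0], [0.0, 2.0, 0.0], target_index=0)
aligned = sft_with_kd_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], target_index=0)
```

When the student matches the teacher the KL term vanishes and only the SFT loss remains; any drift from the original model adds a positive penalty, which is how the teacher counteracts forgetting.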
#### 6.2.2 Execution-Driven Reinforcement Learning from Human Feedback
While SFT establishes financial knowledge, Reinforcement Learning from Human Feedback (RLHF) aligns CQ2's generative behavior with execution success criteria derived from trading contexts. Conventional RLHF is extended by incorporating execution simulation feedback into the reward signal.
##### RLHF Dataset
Two datasets are used in the RLHF procedure: (1) the Reward Model dataset $V_{RL\_RM}$, in which each sample consists of a prompt $x$ and two responses $(y_0,y_1)$ and which is used to train our reward model, and (2) the Proximal Policy Optimization dataset $V_{RL\_PPO}$, in which each sample contains only a prompt $x$ that serves as input for RLHF.
##### Execution\-Grounded Reward Modeling \(RM\)
DeepSeek-R1-Distill-Qwen-7B \[[1](https://arxiv.org/html/2606.06823#bib.bib42)\] is fine-tuned with a randomly initialized linear head that takes in a prompt and response and outputs a scalar value. Only a 7B reward model is used, both to reduce computational cost and because larger reward models can be unstable and are therefore less suitable as the value function during reinforcement learning \[[15](https://arxiv.org/html/2606.06823#bib.bib38)\]. The model is trained to predict which $y\in\{y_0,y_1\}$ is better as judged by a user, given a prompt $x$; that is, it compares two model outputs on the same prompt. If the response preferred by the user is $y_i$, the RM loss can be written as:

$$L(r_\theta)=-\mathbb{E}_{V_{RL\_RM}}\left[\log\left(\sigma\left(r_\theta(x,y_i)-r_\theta(x,y_{1-i})\right)\right)\right], \tag{4}$$

where $r_\theta(x,y)$ is the scalar output of the reward model with parameters $\theta$ for the prompt $x$ and the response $y$, and $\sigma(z)=\frac{1}{1+e^{-z}}$ is the logistic sigmoid function. Reward models trained with this pairwise objective have been shown to generalize well to unseen datasets \[[23](https://arxiv.org/html/2606.06823#bib.bib23)\], which supports the quality of the subsequent reinforcement learning. Furthermore, $r_\theta(\cdot,\cdot)$ is also used as $R_{model}(f)=r_\theta(x,f)$ in Equation ([3](https://arxiv.org/html/2606.06823#S6.E3)) of Section [3.2](https://arxiv.org/html/2606.06823#S3.SS2), where $x$ is the prompt and $f$ is the generated formulaic factor. Maximizing $R_{model}(f)$ steers the generated factors toward user preferences.
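The pairwise loss in Eq. (4) can be computed directly from reward scores. In this sketch each pair is given as `(r_preferred, r_rejected)`, i.e. the scalar outputs $r_\theta(x,y_i)$ and $r_\theta(x,y_{1-i})$; the function names and the example scores are illustrative only.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reward_model_loss(pairs):
    """Eq. (4): mean of -log sigmoid(r_preferred - r_rejected) over comparison pairs."""
    losses = [-log_sigmoid(r_pref - r_rej) for r_pref, r_rej in pairs]
    return sum(losses) / len(losses)

undecided = reward_model_loss([(0.0, 0.0)])   # model cannot tell the responses apart
correct = reward_model_loss([(2.0, 0.0)])     # preferred response scored higher
wrong = reward_model_loss([(0.0, 2.0)])       # preferred response scored lower
```

An undecided model incurs a loss of $\ln 2$; ranking the preferred response higher pushes the loss toward zero, while ranking it lower is penalized roughly linearly in the score gap.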
##### Reinforcement Learning \(RL\)
After training the reward model, a policy $\pi^{RL}_\phi(y\mid x)$ is optimized to generate responses that achieve higher reward under human preference. This stage is formulated as reinforcement learning in a bandit-style setting: a prompt $x$ is sampled from the dataset, the policy generates a complete response $y\sim\pi^{RL}_\phi(\cdot\mid x)$, and the episode terminates with a single scalar score provided by the fixed reward model $r_\theta(x,y)$.
Directly maximizing $r_\theta(x,y)$ can cause the policy to drift away from the SFT policy and overfit to imperfections in the reward model. To stabilize training, a KL regularization term that penalizes deviation from the SFT policy $\pi^{SFT}(y\mid x)$ is introduced. The resulting shaped reward is

$$R(x,y)=r_\theta(x,y)-\beta\log\left(\frac{\pi^{RL}_\phi(y\mid x)}{\pi^{SFT}(y\mid x)}\right), \tag{5}$$

where $\beta$ controls the strength of the constraint. This KL term acts as an entropy bonus, encouraging the policy to explore and deterring it from collapsing to a single mode. Then $\mathbb{E}_{y\sim\pi^{RL}_\phi(\cdot\mid x)}[R(x,y)]$ is maximized using Proximal Policy Optimization (PPO) \[[19](https://arxiv.org/html/2606.06823#bib.bib39)\].
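Eq. (5) reduces to simple arithmetic once sequence log-probabilities are available. The sketch below assumes that the log-probability of a response is the sum of its per-token log-probabilities under each policy; `beta` and all numbers are placeholders.

```python
def sequence_logprob(token_logprobs):
    """Log-probability of a full response: the sum of its per-token log-probs."""
    return sum(token_logprobs)

def shaped_reward(r, logp_rl, logp_sft, beta=0.1):
    """Eq. (5): reward-model score minus beta * log(pi_RL(y|x) / pi_SFT(y|x))."""
    return r - beta * (logp_rl - logp_sft)

# The RL policy assigns this response more probability than the SFT policy does.
logp_rl = sequence_logprob([-2.0, -3.0, -5.0])    # -10.0
logp_sft = sequence_logprob([-3.0, -4.0, -5.0])   # -12.0
reward = shaped_reward(r=1.0, logp_rl=logp_rl, logp_sft=logp_sft)
```

Here the log-ratio is 2.0, so the shaped reward is 1.0 - 0.1 * 2.0 = 0.8: responses that the RL policy favors far more than the SFT policy does lose part of their reward-model score.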
Table 5: Summary of the formulaic alpha factors used in our experiments.
### 6.3 Evaluation Metrics
To comprehensively evaluate the performance of the proposed PandaAI framework in quantitative finance decision-making, we employ three groups of widely recognized metrics, measuring its predictive power, risk-adjusted performance, and absolute performance, respectively.
##### Predictive Power
Predictive power metrics quantify the correlation between the alpha signals generated by the model and the subsequent realized returns. We employ the Information Coefficient and its rank-based variant.
- Information Coefficient (IC) measures the linear correlation between the predicted returns vector $R_p$ and the actual realized returns vector $R_a$, defined as the Pearson correlation coefficient:

  $$IC=\rho(R_p,R_a) \tag{6}$$

  where $\rho(\cdot)$ denotes the Pearson correlation. An IC value closer to 1 indicates stronger predictive ability.
- Rank Information Coefficient (Rank IC) assesses the monotonic relationship by comparing the ordinal rankings of predictions and outcomes, which is more robust to outliers:

  $$RankIC=\rho(\text{rank}(R_p),\text{rank}(R_a)) \tag{7}$$

  where $\text{rank}(\cdot)$ denotes the transformation of the return series into their rank orders.
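Both coefficients can be computed with the standard library alone. This sketch implements Eqs. (6) and (7) for a single cross-section; assigning tied values their average rank is an assumption, since the text does not say how ties are handled, and the return vectors are made-up numbers.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def information_coefficient(pred, actual):
    """Eq. (6): Pearson correlation of predicted and realized returns."""
    return pearson(pred, actual)

def rank_ic(pred, actual):
    """Eq. (7): Pearson correlation of the ranks of both vectors."""
    return pearson(average_ranks(pred), average_ranks(actual))

pred = [0.01, 0.02, 0.03, 0.04]
actual = [0.00, 0.01, 0.02, 0.50]   # last stock is an outlier
ic = information_coefficient(pred, actual)
ric = rank_ic(pred, actual)
```

The ordering of the four stocks is predicted perfectly, so the Rank IC equals 1, while the single outlier pulls the Pearson IC down to roughly 0.8; this is the robustness to outliers mentioned above.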
##### Risk\-Adjusted Performance
Risk-adjusted performance metrics evaluate the model's ability to generate excess return per unit of risk undertaken and the stability of its predictive skill.
- Information Coefficient Information Ratio (ICIR) evaluates the stability and significance of predictive skill over time. It is defined as the mean of period IC divided by its standard deviation:

  $$ICIR=\frac{\overline{IC}}{\sigma(IC)} \tag{8}$$

  where $\overline{IC}$ is the mean of IC over multiple periods (e.g., monthly), and $\sigma(IC)$ is its standard deviation. A higher ICIR indicates more stable and reliable predictive ability.
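A short sketch of Eq. (8). The text does not say whether $\sigma(IC)$ is the sample or the population standard deviation; the sample version (`statistics.stdev`) is assumed here, and the IC series is invented for illustration.

```python
import statistics

def icir(ic_series):
    """Eq. (8): mean of per-period IC divided by its (sample) standard deviation."""
    return statistics.mean(ic_series) / statistics.stdev(ic_series)

steady = icir([0.03, 0.04, 0.05])    # small dispersion around 0.04
erratic = icir([-0.06, 0.04, 0.14])  # same mean, much larger dispersion
```

Both series have the same mean IC of 0.04, but the steadier one has a tenfold higher ICIR, which is why the metric is read as a measure of stability rather than of raw predictive strength.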
##### Absolute Performance
Absolute performance metrics provide direct insight into the raw profitability and the extreme downside risk of the investment strategy.
- Annualized Return (AR) compounds the cumulative return to an annualized figure, facilitating comparison across strategies and time periods:

  $$AR=(1+\text{Total Return})^{\frac{252}{N}}-1 \tag{9}$$

  where Total Return is the cumulative return over the entire period, and $N$ is the total number of trading days. The constant 252 represents the typical number of trading days in a year.
- Maximum Drawdown (MDD) quantifies the maximum observed loss from a peak to a trough of a portfolio before a new peak is attained. It is a critical measure of extreme downside risk:

  $$MDD=\max_{t\in(0,T)}\left(\frac{P_t-\min_{t\le\tau\le T}P_\tau}{P_t}\right) \tag{10}$$

  where $P_t$ is the portfolio's net asset value at time $t$, and $T$ is the evaluation period.
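Eqs. (9) and (10) translate into a few lines of Python. The net asset value series below is made up for illustration; the backward scan is one way to evaluate the inner minimum of Eq. (10) in a single pass.

```python
def annualized_return(total_return, n_days, days_per_year=252):
    """Eq. (9): compound the cumulative return to a yearly rate."""
    return (1.0 + total_return) ** (days_per_year / n_days) - 1.0

def max_drawdown(nav):
    """Eq. (10): largest relative drop from any point to the lowest later value."""
    worst = 0.0
    future_min = nav[-1]
    # Walk backwards so future_min is the minimum of nav[t:] at each step.
    for p in reversed(nav):
        future_min = min(future_min, p)
        worst = max(worst, (p - future_min) / p)
    return worst

nav = [100.0, 120.0, 90.0, 110.0, 80.0, 130.0]
mdd = max_drawdown(nav)                              # peak 120 to trough 80
ar = annualized_return(total_return=0.21, n_days=504)
```

For this series the worst loss runs from 120 down to 80, a drawdown of one third, and a 21% cumulative return over 504 trading days (two years) annualizes to 10%.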