A Three-Phase Foundation Model for Tax-Aware Personalized Portfolio Management

arXiv cs.AI 07/01/26, 04:00 AM Papers
Summary
A three-phase deep reinforcement learning system for personalized portfolio management that addresses ticker lock-in, monolithic objectives, and static user models, using a cross-asset encoder pretrained with self-supervised learning and the Chronos time series foundation model, fine-tuned with Mixture of Experts and PPO, and personalized via LoRA.
arXiv:2606.30997v1 Announce Type: new Abstract: We present a three-phase deep reinforcement learning system for personalized portfolio management that addresses three limitations shared by all prior financial RL work: 1) ticker lock-in, 2) monolithic objectives, and 3) static user models. Phase 1 pretrains a ticker-identity-free cross asset encoder via self-supervised learning on a multi-asset corpus, augmented by a frozen parallel branch using Chronos, a T5-based time series foundation model, fused via a learned gating mechanism. To our knowledge, this is the first application of a time series foundation model to portfolio management RL. The encoder generalizes to any publicly traded asset via a 50-dimensional observable metadata vector that requires no retraining for new tickers. Phase 2 fine-tunes a MoE (Mixture of Experts) portfolio actor critic with PPO under an objective-conditioned reward that simultaneously serves six distinct investment goals sampled per episode: short-term alpha, short-term gain, long-term gain, capital preservation, tax-loss harvesting, and long-term-gains-only. The MoE architecture assigns each objective to a specialized expert head (momentum, growth, defensive, tax-aware), and a learned intent router blends experts based on the active objective and current market regime, which eliminates cross-objective gradient conflict. Phase 3 adds a lightweight personalization layer further adapted at inference time to each individual via a 76-parameter LoRA module fine-tuned on real brokerage transaction history, inferring investment objectives from revealed trading behavior rather than questionnaires. A natural language intent parser converts free-form goals directly into structured investment objective parameters.
Cached at: 07/01/26, 05:36 AM
# A Three-Phase Foundation Model for Tax-Aware Personalized Portfolio Management

Patent Pending. U.S. Provisional Patent Application No. 64/101,198, filed June 29, 2026.
Source: [https://arxiv.org/html/2606.30997](https://arxiv.org/html/2606.30997)
###### Abstract

We present a three-phase deep reinforcement learning system for personalized portfolio management that addresses three limitations shared by all prior financial RL work: 1) ticker lock-in (models trained on a fixed asset universe cannot generalize), 2) monolithic objectives (a single Sharpe reward \[[16](https://arxiv.org/html/2606.30997#bib.bib16)\] cannot serve heterogeneous user goals), and 3) static user models (preferences are elicited once and never updated). Phase 1 pretrains a ticker-identity-free cross asset encoder via self-supervised learning on a multi-asset corpus, augmented by a frozen parallel branch using Chronos \[[2](https://arxiv.org/html/2606.30997#bib.bib2)\], a T5-based time series foundation model pretrained on over 100 billion data points, fused via a learned gating mechanism. To our knowledge, this is the first application of a time series foundation model to portfolio management RL. The encoder generalizes to any publicly traded asset via a 50-dimensional observable metadata vector (sector, fundamentals, analyst consensus, options signals, earnings calendar, insider sentiment, institutional ownership) that requires no retraining for new tickers. Phase 2 fine-tunes a MoE (Mixture of Experts) portfolio actor critic with PPO under an objective-conditioned reward that simultaneously serves six distinct investment goals sampled per episode: short-term alpha, short-term gain, long-term gain, capital preservation, tax-loss harvesting, and long-term-gains-only. A Mixture-of-Experts architecture assigns each objective to a specialized expert head (momentum, growth, defensive, tax-aware), and a learned intent router blends experts based on the active objective and current market regime, which eliminates cross-objective gradient conflict.
Phase 3 adds a lightweight personalization layer further adapted at inference time to each individual via a 76-parameter LoRA module fine-tuned on real brokerage transaction history, inferring investment objectives from revealed trading behavior rather than questionnaires. A natural language intent parser converts free-form goals such as “buy a house in 3 years” or “college fund, kid is 10” directly into structured investment objective parameters. The system is deployed as a FastAPI application with live brokerage integration, real-time price refresh, news and event cross-attention, and a trust-first interface that previews inferred preferences before applying any model adaptation.

## 1 Introduction

Personal portfolio management sits at the intersection of several hard problems: market prediction under non\-stationarity, combinatorial action spaces over large ticker universes, heterogeneous user objectives \(tax efficiency, risk tolerance, investment horizon\), and the fundamental challenge of eliciting genuine preferences from users who do not think in terms of annualized Sharpe ratios\[[16](https://arxiv.org/html/2606.30997#bib.bib16)\]\.

### 1.1 The Market Gap

Existing tools fall into two categories, neither of which serves the individual investor well\.

- Professional and high-frequency systems: Institutional quantitative trading platforms are engineered for professional analysts with programming expertise, access to expensive real-time data feeds, and portfolios measured in the hundreds of millions. These systems provide sophisticated factor models, alternative data integration, and execution infrastructure, but require significant capital, technical overhead, and proprietary data subscriptions that place them out of reach for retail investors. They optimize for alpha generation and execution latency, sometimes at millisecond timescales, and assume a clean separation between the analyst who constructs a strategy and the execution system that trades it. After-tax consequences, individual holding periods, and personal financial goals are orthogonal to their design objectives. High-frequency trading systems operate at an even further extreme: pure latency arbitrage with no notion of a user objective whatsoever.
- Retail tools: These optimize in the opposite direction, trading signal quality for extreme simplicity. Some products offer target-date rebalancing and basic tax-loss harvesting, but the underlying “AI” is a deterministic rules engine, not a learned policy. There is no market signal: the system does not know whether a specific ticker is up 200% on earnings momentum or whether rising credit spreads suggest reducing equity exposure. Other products provide execution with no guidance; the analytical burden falls entirely on the user.

The gap is a system that combines genuine learned market signal with the tax awareness and personalization that matter to an individual investor, without requiring institutional data subscriptions, a quantitative background, or a bespoke technology stack. Table [1](https://arxiv.org/html/2606.30997#S1.T1) summarizes this positioning.

Table 1: Positioning relative to existing tools.

| Dimension | Institutional | Retail | Ours |
| --- | --- | --- | --- |
| Learned market signal | ✓ | × | ✓ |
| Tax-lot awareness | partial | basic | ✓ |
| User goal personalisation | × | rule-based | ✓ |
| NL goal input | × | × | ✓ |
| Live broker integration | ✓ | ✓ | ✓ |
| Required expertise | High | None | None |
| Data cost | \$\$\$ | Free | Free |
| Latency target | ms | daily | daily |

### 1.2 Prior Work

Financial RL has largely addressed sub-problems in isolation. \[[7](https://arxiv.org/html/2606.30997#bib.bib7)\] and \[[10](https://arxiv.org/html/2606.30997#bib.bib10)\] introduced end-to-end portfolio policies but with fixed ticker universes and no personalization. \[[22](https://arxiv.org/html/2606.30997#bib.bib22)\] incorporated transaction costs, while \[[11](https://arxiv.org/html/2606.30997#bib.bib11)\] provided a multi-environment benchmark. Tax-aware trading has been studied in the operations research literature \[[4](https://arxiv.org/html/2606.30997#bib.bib4)\] but rarely integrated with learned policies. Robo-advisors \[[5](https://arxiv.org/html/2606.30997#bib.bib5)\] address personalization via questionnaires but lack market signal.

### 1.3 Novelty Positioning

The fundamental limitation shared by all prior financial RL systems is ticker lock-in: the model is trained on a fixed universe of $N$ assets and cannot be applied to any other set without retraining. This is a critical practical barrier: a user’s portfolio changes over time, new assets become relevant, and institutional methods that work on S&P 500 constituents are useless for portfolios of 5–20 individual positions with their own tax lots and behavioral histories. Our primary contribution is the elimination of ticker identity from the architecture entirely, replacing learned ticker embeddings with a 50-dimensional observable metadata vector computable for any publicly traded asset without retraining.

A second gap is the treatment of the user as a static entity. Existing systems, both institutional optimizers and retail robo-advisors, elicit preferences once via questionnaire and never update. We instead infer objectives dynamically from revealed trading behavior extracted from real brokerage transaction history, adapting via a 76-parameter LoRA module that can be updated in seconds on CPU.

A third gap is the monolithic objective: virtually all prior RL policies optimize a single fixed reward (typically Sharpe ratio). We introduce an objective-conditioned reward under which a single policy simultaneously serves five distinct investment goals sampled per episode, without requiring a separate trained model per objective.

Finally, we integrate Chronos \[[2](https://arxiv.org/html/2606.30997#bib.bib2)\], a time series foundation model pretrained on over 100 billion time series data points, as a frozen parallel encoder branch. This adds universal temporal pattern recognition to a domain-specific SSL encoder. To our knowledge, this is the first application of a time series foundation model to portfolio management RL.

We combine all four contributions \(i\.e\., ticker\-identity\-free generalization, revealed\-preference personalization, objective\-conditioned training, and foundation model augmentation\) into a single end\-to\-end system deployed with live brokerage integration\.

Our contributions are:

- A ticker-identity-free architecture with a 50-dimensional observable metadata vector (sector, fundamentals, analyst consensus, options signals, earnings calendar, insider sentiment, institutional ownership) enabling zero-shot generalization to any publicly traded asset without retraining. This is the primary architectural novelty; to our knowledge no prior financial RL system achieves ticker-universe independence.
- An objective-conditioned reward that shapes gradients differently per episode, training a single policy to serve five investment objectives (short-term gain, long-term gain, capital preservation, tax-loss harvesting, LT-gains-only) sampled randomly per episode.
- A Chronos foundation model branch: a frozen parallel encoder pretrained on 100B+ time series data points, fused via a learned gate, providing universal temporal representations that complement domain-specific SSL pretraining. To our knowledge, this is the first application of a time series foundation model to portfolio management RL.
- A revealed-preference personalization system that infers investment objectives from real transaction history and adapts via a 76-parameter LoRA module with a trust-first preview-before-apply UX, with no raw transaction data stored.
- A learnable cash token that forces explicit cash allocation decisions. Cash competes with equities in the allocation softmax rather than accumulating passively through inaction, eliminating a common HOLD-trap failure mode.
- An allocation-driven execution model that decouples portfolio weight targets from per-ticker actions. Rebalancing fires whenever allocation weights diverge beyond a threshold $\delta_{\text{reb}}$, preventing the action head from blocking trades via all-HOLD outputs.
- A six-stage MoE expert curriculum (four specialist experts, intent-conditional router with supervised routing loss, and expert grafting) that achieves $+3.03\%$ 14-day alpha versus the equal-weight (EW) benchmark with clean diagonal routing (std $> 0.37$).
- A Mixture-of-Experts portfolio policy (§[2.4](https://arxiv.org/html/2606.30997#S2.SS4)) with four specialized expert heads (momentum, growth, defensive, tax-aware) and a learned intent router that blends experts based on active objective and market regime, eliminating cross-objective gradient conflict that prevents a single head from serving all investment mandates simultaneously.
- A shared-encoder MoE design in which all experts share one cross asset encoder, jointly fine-tuned via PPO. This is more parameter-efficient than per-expert encoders while still allowing representation adaptation for all experts.
- An inter-ticker contrastive loss in Phase 1 that prevents representation collapse. Without this loss, all tickers converge to nearly identical embeddings (cosine similarity 0.96), making differentiated allocation impossible; the contrastive loss reduces similarity to 0.24.
- A sixth training objective (ALPHA\_VS\_EW) that directly optimizes alpha versus an equal-weight benchmark, with a concentration bonus $5\sigma(\mathbf{w})$ that incentivizes non-uniform weights.
- A cash-drag and redeployment reward that explicitly incentivizes completing the sell $\to$ buy cycle within a single step, with a turnover exemption for round-trip rebalances.
- A natural language intent parser mapping free-text goals (“buy a house in 3 years”, “college fund, kid is 10”) to structured investment objective parameters.

## 2 System Architecture

Figure [1](https://arxiv.org/html/2606.30997#S2.F1) shows the three-phase pipeline and Figure [2](https://arxiv.org/html/2606.30997#S2.F2) details the encoder architecture.

Figure 1: Three-phase pipeline. Panels: Phase 1, SSL pretraining (CrossAssetEncoder + frozen Chronos, 50-dim metadata); Phase 2, PPO fine-tune (PortfolioActorCritic, 5 objectives per episode, shaped reward); Phase 3, personalise (Persona + LoRA, brokerage transactions, NL intent parser); output, live recommendations (BUY | HOLD | SELL + weights), ticker-identity-free for any $N$ at inference. Solid arrows = training flow; dashed = conditioning inputs (Chronos, metadata, objectives, NL parser). Each phase is independently resumable from checkpoints.

Figure 2: CrossAssetEncoder architecture. Inputs: price window $(B,N,T,D)$ to MarketEncoder (SSL, trained, shared weights across all $N$ tickers); close prices $(B,N,T)$ to ChronosEncoder (frozen T5, 8M, no gradient); metadata $(B,N,50)$ to MetadataEncoder (74K trainable). Gated fusion computes $h_{\text{ssl}} + \sigma(W[h_{\text{ssl}}; h_c]) \cdot h_c$ with a learned gate in $[0,1]$, metadata is added as $h + h_m$, and cross-asset attention outputs $\mathbf{h}_{\text{fused}} \in \mathbb{R}^{B \times N \times d}$. The SSL-trained MarketEncoder (price features) and frozen ChronosEncoder (universal temporal patterns) are combined via a learned gating mechanism. Ticker metadata (50-dim) is injected additively after gated fusion, before cross-asset attention compares tickers against each other. All three branches operate on any number of tickers $N$ at inference without retraining.

### 2.1 Phase 1: Self-Supervised Encoder Pretraining

The encoder is pretrained on a multi\-asset corpus with three SSL objectives:

#### Next\-bar Return Prediction\.

Given a window $\mathbf{x} \in \mathbb{R}^{T \times D}$ of $T$ bars and $D$ normalized features (returns, moving-average ratios, RSI, MACD, volume z-score, etc.), predict the next-bar return $r_{t+1}$ via a Huber regression head \[[17](https://arxiv.org/html/2606.30997#bib.bib17)\]:

$$\mathcal{L}_{\text{ret}} = \text{Huber}_{\delta}\!\left(\hat{r}_{t+1},\, \text{clip}(r_{t+1}, -0.1, 0.1)\right), \quad \delta = 0.05. \tag{1}$$
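As a rough illustration of equation (1), the sketch below evaluates the clipped Huber loss for a single prediction in plain Python. It assumes the standard Huber definition (quadratic within $\delta$ of zero, linear outside), which the text does not spell out; the function names `huber` and `next_bar_return_loss` are hypothetical.

```python
def huber(residual: float, delta: float = 0.05) -> float:
    """Standard Huber loss: quadratic for |residual| <= delta, linear beyond."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def next_bar_return_loss(pred: float, target: float,
                         clip: float = 0.1, delta: float = 0.05) -> float:
    """Equation (1) for one sample: clip the target return, then apply Huber."""
    clipped = max(-clip, min(clip, target))
    return huber(pred - clipped, delta)


# A 50% one-bar move is clipped to 10% before the loss is computed,
# so a single outlier bar cannot dominate the gradient.
outlier_loss = next_bar_return_loss(pred=0.0, target=0.5)
small_loss = next_bar_return_loss(pred=0.02, target=0.0)
```

Clipping bounds the target, while the linear tail of the Huber loss bounds the gradient for whatever residual remains.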

#### Masked Feature Recovery\.

A random subset of feature channels is zeroed and the encoder must reconstruct the original values, encouraging complete use of the feature set:

$$\mathcal{L}_{\text{mask}} = \left\| \hat{\mathbf{x}}_{\text{masked}} - \mathbf{x}_{\text{original}} \right\|_2^2. \tag{2}$$

#### Market Regime Classification\.

Bars are labeled into four regimes \(bull, bear, volatile, sideways\) automatically via rolling statistics, and the encoder is trained to predict the current regime:

$$\mathcal{L}_{\text{reg}} = \text{CrossEntropy}(\hat{y}_{\text{regime}}, y_{\text{regime}}). \tag{3}$$

The combined loss is:

$$\mathcal{L}_{\text{P1}} = 0.3\,\mathcal{L}_{\text{ret}} + 1.0\,\mathcal{L}_{\text{mask}} + 0.5\,\mathcal{L}_{\text{reg}} + 0.5\,\mathcal{L}_{\text{contrast}}. \tag{4}$$

#### Inter\-Ticker Contrastive Loss\.

A critical failure mode emerged during development: without an explicit differentiation objective, the cross asset encoder converged to a mean representation. All tickers produced nearly identical embeddings (mean cosine similarity 0.96), causing the allocation head to output uniform $1/N$ weights regardless of input. We diagnose this as representation collapse in the cross-asset attention: with identical SSL targets per ticker, the attention learns to average rather than contrast.

We address this with a contrastive loss applied to the per-ticker representations $\{\mathbf{h}_i\}$ before pooling:

$$\mathcal{L}_{\text{contrast}} = \frac{1}{BN(N-1)} \sum_{b=1}^{B} \sum_{i \neq j} \frac{\mathbf{h}_{b,i} \cdot \mathbf{h}_{b,j}}{\|\mathbf{h}_{b,i}\|\,\|\mathbf{h}_{b,j}\|}, \tag{5}$$

which is the mean cosine similarity over all pairs of different tickers within each batch; minimizing it pushes the per-ticker representations apart. After 60 epochs with $\lambda_{\text{contrast}} = 0.5$, mean inter-ticker cosine similarity dropped from 0.96 to 0.24, enabling the allocation head to produce genuinely differentiated portfolio weights.
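Equation (5) can be sketched in a few lines of plain Python. The helper names are hypothetical, embeddings are assumed to be nested lists indexed as `h[b][i]` with non-zero norm, and a real implementation would presumably vectorise this on the training framework's tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def inter_ticker_contrastive_loss(h):
    """Mean cosine similarity over ordered ticker pairs i != j, as in equation (5).

    h[b][i] is the embedding of ticker i in batch element b.
    """
    B, N = len(h), len(h[0])
    total = 0.0
    for batch in h:
        for i in range(N):
            for j in range(N):
                if i != j:
                    total += cosine(batch[i], batch[j])
    return total / (B * N * (N - 1))


# Collapsed embeddings (all parallel) give a loss near 1;
# mutually orthogonal embeddings give a loss of 0.
collapsed = [[[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]]
spread = [[[1.0, 0.0], [0.0, 1.0]]]
loss_collapsed = inter_ticker_contrastive_loss(collapsed)
loss_spread = inter_ticker_contrastive_loss(spread)
```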

#### Path B: Cross Asset Encoder with Chronos Augmentation\.

For multi-asset pretraining, we use a three-stage encoder. Stage 1 applies a shared market encoder (transformer) to each ticker’s price feature window independently, producing per-ticker representations $\mathbf{h}^{\text{ssl}}_i \in \mathbb{R}^d$.

Stage 1.5 fuses two parallel branches and adds metadata context:

$$\tilde{\mathbf{h}}_i = \underbrace{\mathbf{h}^{\text{ssl}}_i + \sigma\!\left(W_g\left[\mathbf{h}^{\text{ssl}}_i;\, \mathbf{h}^{\text{chr}}_i\right]\right) \odot \mathbf{h}^{\text{chr}}_i}_{\text{gated Chronos fusion}} + \underbrace{\text{MetadataEnc}(\mathbf{m}_i)}_{\text{additive context}}, \tag{6}$$

where $\mathbf{h}^{\text{chr}}_i = \text{Proj}(\text{Chronos}(c_i))$ is the projected Chronos embedding of the raw closing price sequence $c_i$, $W_g$ is the learned gate weight matrix, and $\sigma$ is the sigmoid function. The gate $\sigma(W_g[\cdot])$ learns, per position, how much universal temporal signal from Chronos should augment the SSL representation.
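A minimal sketch of the fusion in equation (6) for a single ticker, written in plain Python on lists. It assumes the gate matrix is stored with shape $2d \times d$ and applied to the concatenation $[\mathbf{h}^{\text{ssl}}; \mathbf{h}^{\text{chr}}]$ without a bias term, and that the metadata encoder output is already a $d$-dimensional vector; the function name `gated_fusion` is hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_ssl, h_chr, W_g, meta):
    """h_ssl + sigmoid([h_ssl; h_chr] W_g) * h_chr + meta, elementwise.

    h_ssl, h_chr, meta: lists of length d.
    W_g: nested list with 2d rows and d columns (assumed layout).
    """
    d = len(h_ssl)
    concat = list(h_ssl) + list(h_chr)  # length 2d
    gate = [
        sigmoid(sum(concat[k] * W_g[k][j] for k in range(2 * d)))
        for j in range(d)
    ]
    return [h_ssl[j] + gate[j] * h_chr[j] + meta[j] for j in range(d)]


# With an all-zero gate matrix every gate value is sigmoid(0) = 0.5,
# so half of the Chronos embedding is mixed in.
d = 2
W_zero = [[0.0] * d for _ in range(2 * d)]
fused = gated_fusion([1.0, 2.0], [4.0, -2.0], W_zero, [0.5, 0.5])
```

Because the gate only scales the Chronos term, the SSL representation always passes through unchanged, and a gate near zero recovers the SSL-plus-metadata representation.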

#### Chronos freezing and caching strategy\.

We use Chronos-T5-Small (46M parameters, pretrained on 100B+ time series) as a frozen feature extractor throughout Phase 1. Only the linear projection head ($\mathbb{R}^{512} \to \mathbb{R}^{d}$, 74K parameters) and the gate ($W_g \in \mathbb{R}^{2d \times d}$, 8K parameters) are trained. This design keeps 99.97% of Chronos weights frozen, preventing catastrophic forgetting of universal temporal patterns while allowing the projection to adapt to financial data. Crucially, we do not apply LoRA or any parameter-efficient fine-tuning to Chronos. The backbone is used strictly as a fixed feature extractor, similar to using a frozen BERT for NLP downstream tasks. LoRA adaptation in our system is reserved for the intent router in Phase 3, where a 76-parameter adapter shifts routing weights to match individual user preferences (§[2.5](https://arxiv.org/html/2606.30997#S2.SS5)).

Since the Chronos forward pass is computationally expensive relative to the SSL encoder, we precompute all embeddings for the training corpus once before Phase 2 begins, storing $(t_i, \text{ticker}) \to \mathbf{h}^{\text{chr}}_i$ in an in-memory cache. This reduces the per-step Chronos overhead from $\mathcal{O}(N \cdot T)$ transformer calls to $\mathcal{O}(1)$ cache lookups during the frozen phase. When the encoder is unfrozen, the cache is cleared and repopulated every 100 episodes to reflect updated projection weights.
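The caching strategy might look roughly like the sketch below. The class name `ChronosCache`, the `embed_fn` callable standing in for the Chronos-plus-projection forward pass, and the lazy fill on first lookup (instead of a full precompute pass) are all assumptions made for illustration.

```python
class ChronosCache:
    """In-memory cache keyed by (time index, ticker), as described above."""

    def __init__(self, embed_fn, refresh_every=100):
        self.embed_fn = embed_fn          # placeholder for the expensive forward pass
        self.refresh_every = refresh_every
        self.store = {}
        self.forward_calls = 0            # counts how often embed_fn actually runs

    def get(self, t, ticker, closes):
        key = (t, ticker)
        if key not in self.store:
            self.forward_calls += 1
            self.store[key] = self.embed_fn(closes)
        return self.store[key]

    def end_episode(self, episode, encoder_unfrozen):
        # While the encoder is frozen the cache never goes stale. Once it is
        # unfrozen, drop everything every `refresh_every` episodes so that
        # embeddings are recomputed with the updated projection weights.
        if encoder_unfrozen and episode % self.refresh_every == 0:
            self.store.clear()


# Toy usage with a stand-in embedding function and a made-up ticker symbol.
cache = ChronosCache(embed_fn=lambda closes: [sum(closes) / len(closes)])
first = cache.get(0, "AAA", [10.0, 11.0, 12.0])
second = cache.get(0, "AAA", [10.0, 11.0, 12.0])  # served from the cache
```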

Stage 2 applies cross-asset transformer attention over $\{\tilde{\mathbf{h}}_i\}$, allowing tickers to contextualise each other before the prediction heads.

### 2.2 Ticker-Identity-Free Design with Metadata

A central design choice is the complete elimination of fixed ticker identity embeddings. In prior work, an embedding table $\mathbf{E} \in \mathbb{R}^{N \times d}$ maps ticker indices to representations, tying the model to a fixed universe of size $N$.

We replace this with a 50-dimensional observable metadata vector, computable for any ticker including those not seen during training:

$$\mathbf{m}_i = \bigl[\mathbf{s}_i^{\text{sector}},\; \mathbf{c}_i^{\text{cap}},\; \mathbf{f}_i^{\text{fund}},\; \mathbf{a}_i^{\text{analyst}},\; \mathbf{o}_i^{\text{options}},\; \mathbf{e}_i^{\text{earn}},\; \mathbf{t}_i^{\text{tech}},\; \mathbf{k}_i^{\text{insider}},\; \mathbf{n}_i^{\text{inst}}\bigr] \in \mathbb{R}^{50}, \tag{7}$$

where the components are:

- $\mathbf{s}_i \in \{0,1\}^{12}$: sector one-hot;
- $\mathbf{c}_i \in \{0,1\}^{6}$: market-cap bucket;
- $\mathbf{f}_i \in \mathbb{R}^{10}$: fundamentals (P/E, P/B, ROE, profit margin, revenue growth, EPS growth, short interest, dividend yield, debt/equity, payout ratio);
- $\mathbf{a}_i \in \mathbb{R}^{3}$: analyst consensus (rating normalised to $[-1, +1]$, log coverage, mean price target upside);
- $\mathbf{o}_i \in \mathbb{R}^{4}$: options signals (IV/hist\_vol, IV skew, put/call ratio, unusual activity flag);
- $\mathbf{e}_i \in \mathbb{R}^{3}$: earnings calendar (days to next earnings, 8-quarter beat rate, post-earnings drift);
- $\mathbf{t}_i \in \mathbb{R}^{4}$: technical regime (distance from 52-week high/low, volume trend, relative strength vs. sector ETF);
- $\mathbf{k}_i \in \mathbb{R}^{2}$: insider activity (net buy/sell over 6 months, insider momentum);
- $\mathbf{n}_i \in \mathbb{R}^{2}$: institutional ownership (ownership percentage, quarter-over-quarter change).

The model learns structured priors such as “high-beta mega-cap tech with strong momentum deserves overweight in bull regimes”, without memorizing ticker names. New tickers at inference require only metadata computation, not retraining.

### 2.3 Phase 2: Objective-Conditioned Portfolio RL

Phase 2 fine-tunes a portfolio actor critic with PPO \[[15](https://arxiv.org/html/2606.30997#bib.bib15)\]. At each step, the policy jointly outputs:

- Allocation weights $\mathbf{w} \in \Delta^{N+1}$ (softmax over $N$ equities *plus a learnable cash token*) via the allocation head
- Per-ticker actions $\mathbf{a} \in \{0,1,2\}^{N}$ (HOLD/BUY/SELL) via the per-ticker action head

#### Explicit cash allocation via a learnable cash token\.

A persistent failure mode in early experiments was the policy drifting into implicit cash-holding: because HOLD never incurs transaction costs and benefits from market drift, the policy discovered that outputting zero allocation weights for all tickers while routing all reward through market returns was a local optimum. We address this by adding a learnable cash token $\mathbf{c} \in \mathbb{R}^{d}$ (i.e., a trainable parameter that competes with equity representations in the allocation softmax):

$$\mathbf{w}_{\text{full}} = \text{softmax}\bigl(\text{AllocHead}([\mathbf{c};\, \mathbf{h}_1, \ldots, \mathbf{h}_N])\bigr) \in \mathbb{R}^{N+1}. \tag{8}$$

The first element $w_0 = w_{\text{cash}}$ is the explicit cash allocation. Equity weights are renormalised to sum to $1 - w_{\text{cash}}$:

$$w_i^{\text{eq}} = \frac{w_{i+1}}{\sum_{j=1}^{N} w_{j+1}} \cdot (1 - w_{\text{cash}}), \quad i = 1, \ldots, N. \tag{9}$$

This forces the policy to make an active decision about cash allocation at every step, rather than allowing passive cash accumulation through inaction.
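To make equations (8) and (9) concrete, the following sketch starts from scalar logits, one for the cash token and one per equity, and returns the cash weight and the renormalised equity weights. Treating the allocation head as something that has already produced these logits is an assumption of the sketch, and the name `allocate` is hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def allocate(cash_logit, equity_logits):
    """Return (w_cash, equity weights) following equations (8) and (9)."""
    w_full = softmax([cash_logit] + list(equity_logits))
    w_cash = w_full[0]            # position 0 is the cash token
    equity = w_full[1:]
    equity_sum = sum(equity)
    # Rescale the equity entries so that they sum to 1 - w_cash.
    w_eq = [w / equity_sum * (1.0 - w_cash) for w in equity]
    return w_cash, w_eq


# Equal logits: cash competes on the same footing as each of the three equities.
w_cash, w_eq = allocate(0.0, [0.0, 0.0, 0.0])
```

Since the softmax output already sums to one, the equity entries sum to $1 - w_{\text{cash}}$ before rescaling, so in this sketch the renormalisation leaves them unchanged up to rounding.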

#### Allocation\-driven execution\.

The per-ticker action head (HOLD/BUY/SELL) was originally intended as an additional confidence signal, but in practice caused zero-turnover collapse: the action head learned to output HOLD for all tickers unconditionally, overriding allocation signals. We therefore decouple the two heads: the environment executes a rebalance whenever $|w_i^{\text{eq}} - w_i^{\text{curr}}| > \delta_{\text{reb}}$, regardless of the action head output. The action head now acts as a modifier rather than a gate: SELL forces a full exit; BUY allows larger-than-threshold buys; HOLD permits partial rebalancing. We expose the rebalancing threshold $\delta_{\text{reb}}$ as a hyperparameter (default 0.01) to control trading frequency.

Execution follows a sell\-first ordering to free cash before buys\.
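The threshold trigger and sell-first ordering can be sketched as follows. This is an illustrative simplification: only the SELL modifier (forcing the target to zero) is modelled, because the text does not give enough detail to reproduce the BUY and HOLD modifiers, and the function name `plan_rebalance` is hypothetical.

```python
def plan_rebalance(target, current, actions=None, delta_reb=0.01):
    """Return (ticker_index, weight_change) orders, sells before buys.

    target, current: equity weights per ticker.
    actions: optional per-ticker 'hold' / 'buy' / 'sell' labels; only 'sell'
    changes the outcome here (it forces the target weight to zero).
    """
    sells, buys = [], []
    for i, (w_target, w_curr) in enumerate(zip(target, current)):
        if actions is not None and actions[i] == "sell":
            w_target = 0.0
        diff = w_target - w_curr
        # The allocation gap alone decides whether a trade fires,
        # so an all-HOLD action vector cannot block rebalancing.
        if abs(diff) > delta_reb:
            (sells if diff < 0 else buys).append((i, diff))
    return sells + buys  # sell first to free cash for the buys


target = [0.5, 0.2, 0.3]
current = [0.3, 0.205, 0.495]
orders = plan_rebalance(target, current)
orders_all_hold = plan_rebalance(target, current, actions=["hold"] * 3)
```

In this example the middle ticker is left alone because its gap of 0.005 is below the default threshold, while the other two are traded with the sell placed first.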

#### Objective\-conditioned reward\.

At each episode, an objective $o$ is sampled uniformly from six types. The reward function $R_o$ is then shaped accordingly:

$$\begin{split} R_o &= \underbrace{S_o}_{\text{obj.}} - \underbrace{\lambda_c H_{\text{eq}}}_{\text{conc.}} - \underbrace{\lambda_t \tilde{\tau}}_{\text{turnover}} - \underbrace{\gamma_d \max(0,\, c_r - 0.05)}_{\text{cash drag}} \\ &\quad - \underbrace{\gamma_s \cdot \mathbf{1}[\tau = 0,\, t > 10]}_{\text{stale}} + \underbrace{\delta_r \cdot \rho}_{\text{redeploy}}, \end{split} \tag{10}$$

where $H_{\text{eq}}$ is the equity Herfindahl index \[[21](https://arxiv.org/html/2606.30997#bib.bib21)\], $\tilde{\tau}$ is turnover net of redeployed round-trips (defined below), $c_r = C/\text{PV}$ is the cash ratio, $\gamma_d = 0.3$ penalises excess cash unconditionally (plus $0.2 \cdot c_r \cdot r_m$ when the market rises), $\gamma_s = 0.005$ is a small stale-portfolio penalty that fires when no trades have occurred for more than 10 steps, and $\rho \in [0,1]$ is the fraction of sell proceeds redeployed in the same step.

The objective base $S_o$ varies by type:

$$\begin{aligned} S_{\texttt{MAX\_GAIN\_1Y}} &= \hat{\sigma}_{\text{Sharpe}} \\ S_{\texttt{MAX\_GAIN\_30D}} &= 0.3\,\hat{\sigma} + 0.7 \cdot \mathrm{clip}(50\,\bar{r}_{30}, -2, 2) \\ S_{\texttt{CAP\_PRES}} &= 0.5\,\hat{\sigma} - 3\,\delta_t \\ S_{\texttt{INC\_HARV}} &= \hat{\sigma} + \sum_i q_4(i) \cdot \mathbf{1}[g_i < 0,\, h_i < 365] \\ S_{\texttt{LT\_ONLY}} &= \hat{\sigma} - 0.15 \sum_i \mathbf{1}[h_i < 365 \text{ at sell}] \\ S_{\texttt{ALPHA\_VS\_EW}} &= 50\,(r_{\text{port}} - r_{\text{EW}}) + 5\,\sigma(\mathbf{w}), \end{aligned} \tag{11}$$

where $\delta_t$ is current drawdown from peak, $q_4(i)$ is a Q4 multiplier (1.5 in Oct–Dec, 0.5 otherwise), $g_i$ is unrealised gain, and $h_i$ is holding days.
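The six cases of equation (11) can be written as one dispatch function. The sketch below makes several assumptions that the text leaves open: $\hat{\sigma}$ is passed in as a precomputed Sharpe estimate, $\sigma(\mathbf{w})$ is read as the population standard deviation of the weights, the Q4 multiplier is applied once per calendar month instead of per position, and all argument names are hypothetical.

```python
import statistics


def q4_multiplier(month: int) -> float:
    """1.5 in October to December, 0.5 otherwise."""
    return 1.5 if month in (10, 11, 12) else 0.5


def objective_base(objective, sharpe, *, mean_ret_30=0.0, drawdown=0.0,
                   positions=(), month=1, short_term_sells=0,
                   r_port=0.0, r_ew=0.0, weights=()):
    """Objective-specific base term S_o from equation (11).

    positions: iterable of (unrealised_gain, holding_days) pairs.
    short_term_sells: number of positions sold before 365 holding days.
    weights: portfolio weights, required for ALPHA_VS_EW.
    """
    if objective == "MAX_GAIN_1Y":
        return sharpe
    if objective == "MAX_GAIN_30D":
        return 0.3 * sharpe + 0.7 * max(-2.0, min(2.0, 50.0 * mean_ret_30))
    if objective == "CAP_PRES":
        return 0.5 * sharpe - 3.0 * drawdown
    if objective == "INC_HARV":
        harvestable = sum(1 for gain, days in positions
                          if gain < 0 and days < 365)
        return sharpe + q4_multiplier(month) * harvestable
    if objective == "LT_ONLY":
        return sharpe - 0.15 * short_term_sells
    if objective == "ALPHA_VS_EW":
        # Assumption: sigma(w) is the population std of the weights.
        return 50.0 * (r_port - r_ew) + 5.0 * statistics.pstdev(weights)
    raise ValueError(f"unknown objective: {objective}")


# One short-term loser in November earns the larger Q4 harvesting bonus.
s_harvest = objective_base(
    "INC_HARV", 1.0,
    positions=[(-5.0, 100), (3.0, 50), (-1.0, 400)], month=11)
```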

#### Redeployment\-aware turnover\.

Selling and immediately buying constitutes one rebalance decision, not two trades\. The net turnover excluding round\-trips is:

$$\tilde{\tau} = \max\!\left(0,\; \tau - \frac{\bigl(\Delta c_{\text{sell}} + \min(\Delta b, \Delta c_{\text{sell}})\bigr) \cdot \rho}{\text{PV}}\right), \tag{12}$$

where $\Delta c_{\text{sell}}$ is cash freed by sells, $\Delta b$ is cash deployed by buys, and PV is portfolio value.
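A worked instance of equation (12) is sketched below. Raw turnover $\tau$ and the redeployment fraction $\rho$ are taken as inputs because the text does not state how they are computed; the example values and the function name `net_turnover` are illustrative only.

```python
def net_turnover(tau, sold, bought, rho, pv):
    """Turnover net of redeployed round-trips, following equation (12).

    tau: raw turnover for the step (fraction of portfolio value).
    sold: cash freed by sells; bought: cash deployed by buys.
    rho: fraction of sell proceeds redeployed in the same step.
    pv: portfolio value.
    """
    exempt = (sold + min(bought, sold)) * rho / pv
    return max(0.0, tau - exempt)


# Full round trip: sell 1,000 and buy 1,000 in a 10,000 portfolio.
# If raw turnover counts both legs (0.2), the whole amount is exempted.
round_trip = net_turnover(tau=0.2, sold=1000.0, bought=1000.0, rho=1.0, pv=10000.0)

# Sell only: nothing is redeployed, so the turnover penalty applies in full.
sell_only = net_turnover(tau=0.1, sold=1000.0, bought=0.0, rho=0.0, pv=10000.0)
```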

#### Equity\-only concentration\.

Cash is not concentrated equity\. The concentration penalty applies only to the renormalized equity weights:

$$H_{\text{eq}} = \sum_i w_i^2, \qquad H_{\text{eq}}^{\star} = \frac{1}{|\{i : w_i > 0.01\}|}. \tag{13}$$
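The two quantities in equation (13) are simple to compute. In the sketch below the equity weights are rescaled to sum to one before squaring, which is one reading of "renormalized equity weights" and should be treated as an assumption; how $H_{\text{eq}}^{\star}$ enters the penalty is not specified above, so the code only evaluates it. Function names are hypothetical.

```python
def equity_hhi(equity_weights):
    """Herfindahl index over the equity sleeve only (cash excluded)."""
    total = sum(equity_weights)
    shares = [w / total for w in equity_weights]  # assumed rescaling to sum to 1
    return sum(s * s for s in shares)


def hhi_reference(equity_weights, min_weight=0.01):
    """1 / (number of positions above min_weight), the second term of equation (13)."""
    active = sum(1 for w in equity_weights if w > min_weight)
    return 1.0 / active if active else 0.0


# Four equal positions: both quantities equal 1/4, the value for an
# equal-weight portfolio of four names.
equal_four = [0.25, 0.25, 0.25, 0.25]
h = equity_hhi(equal_four)
h_ref = hhi_reference(equal_four)

# A 20% cash position does not change equity concentration:
# two equities at 0.4 each still give an index of 0.5.
h_with_cash = equity_hhi([0.4, 0.4])
```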

### 2.4 Phase 2b: Mixture-of-Experts Portfolio Policy

A key empirical finding during Phase 2 development was that a single allocation head cannot simultaneously serve conflicting investment objectives\. A momentum strategy \(overweight recent winners, 14\-day horizon\) requires high beta, high concentration, and frequent rebalancing\. A capital preservation strategy requires low beta, diversification, and minimal turnover\. Training a single head with all six objectives produces gradient conflict: the policy converges to a compromise allocation that is suboptimal for all objectives\.

We address this with a Mixture-of-Experts (MoE) architecture in which each expert specializes in a distinct investment mandate, and an intent router learns to blend experts based on the active objective and current market regime.

#### Expert allocation heads\.

We define four expert heads, each an independent allocation head with identical architecture but trained on a distinct objective subset (see Table [2](https://arxiv.org/html/2606.30997#S2.T2)).

Table 2: MoE expert mapping. Each expert receives gradients only from its assigned objectives, eliminating cross-objective gradient conflict.

#### Intent router.

The intent router receives the mean-pooled encoder output $\bar{\mathbf{h}}\in\mathbb{R}^{d}$ (market state) and the one-hot intent vector $\mathbf{e}_{o}\in\{0,1\}^{6}$ (active objective), and outputs mixture weights over experts:

$$\boldsymbol{\alpha}=\text{softmax}\!\left(\frac{f_{\theta}([\bar{\mathbf{h}};\,\mathbf{e}_{o}])}{\tau}\right)\in\Delta^{E-1},\tag{14}$$

where $f_{\theta}$ is a two-layer MLP and $\tau$ is a learnable temperature (initialised at 1.0).

The final allocation is the router\-weighted mixture of expert outputs:

$$\mathbf{w}_{\text{full}}=\sum_{e=1}^{E}\alpha_{e}\cdot\texttt{Expert}_{e}\bigl([\mathbf{c};\,\mathbf{h}_{1:N}],\,\mathbf{s}_{1:N},\,\mathbf{g}\bigr),\tag{15}$$

where $\mathbf{c}$ is the cash token, $\mathbf{s}_{1:N}$ is per-ticker state, and $\mathbf{g}$ is global portfolio state.
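The routing and mixing steps of Eqs. (14) and (15) can be sketched in a few lines. The router logits and the per-expert allocations below are made-up toy values standing in for $f_{\theta}$ and the expert heads; only the softmax-then-blend structure is taken from the text.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, as in Eq. (14)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def mix_experts(alpha, expert_weights):
    """Router-weighted mixture of expert allocations, as in Eq. (15)."""
    n_slots = len(expert_weights[0])
    return [sum(a * w[i] for a, w in zip(alpha, expert_weights))
            for i in range(n_slots)]


# Hypothetical logits for E = 4 experts and toy allocations over
# [cash, asset 1, asset 2]; each expert's allocation sums to 1.
alpha = softmax([2.0, 0.5, 0.0, -1.0], temperature=1.0)
experts = [
    [0.0, 0.9, 0.1],  # momentum
    [0.1, 0.5, 0.4],  # growth
    [0.6, 0.2, 0.2],  # defensive
    [0.3, 0.3, 0.4],  # tax-aware
]
w_full = mix_experts(alpha, experts)
print(alpha, w_full)
```

Since $\boldsymbol{\alpha}$ lies on the simplex and every expert output sums to one, the blended allocation also sums to one without any further normalization.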

#### Regime\-conditional routing\.

A key property of the router is that it can override the explicit intent based on market regime. For example, when the market enters a high-volatility regime (detectable from the encoder’s regime classification head), the router may route ALPHA_VS_EW intent partially through the defensive expert, reducing drawdown risk at the cost of some alpha. This regime-conditional behaviour is learned implicitly from the reward signal without explicit regime labels.

#### Shared encoder, specialised experts\.

All four experts share a single cross asset encoder, which is jointly fine\-tuned with the MoE heads via PPO\. This is more parameter\-efficient than per\-expert encoders and allows the encoder to adapt representations that are simultaneously useful for all experts\. The total parameter count is modest: one encoder \(approx\. 2M params\) plus four lightweight expert heads \(approx\. 200K params each\) and the router \(approx\. 50K params\)\.

#### Connection to Phase 3 personalization\.

The router architecture naturally integrates with Phase 3 LoRA adaptation: user-specific preferences (risk tolerance, tax bracket, time horizon) update only the router weights via LoRA, leaving expert heads frozen. A conservative user shifts router weight toward the defensive expert; an aggressive user shifts toward momentum. This separation of what markets look like (shared encoder, shared experts) from how to weight strategies (personalized router) is a key design principle.

### 2.5 Phase 3: Tax-Aware Personalization with LoRA Adaptation

Phase 3 adapts the trained MoE policy to individual users without retraining any expert weights\. We describe the architecture here; full empirical evaluation on real brokerage data is left to future work \(§[10](https://arxiv.org/html/2606.30997#S10)\)\.

#### Architecture\.

A tax-aware personalization layer adapts only the intent router weights via a lightweight LoRA module \[[9](https://arxiv.org/html/2606.30997#bib.bib9)\]:

$$\hat{\boldsymbol{\ell}}=\boldsymbol{\ell}_{\text{base}}+\mathbf{p}_{u}\mathbf{A}\mathbf{B},\tag{16}$$

where $\mathbf{p}_{u}\in\mathbb{R}^{16}$ is a user behaviour profile extracted from brokerage transaction history, and $\mathbf{A}\in\mathbb{R}^{16\times r}$, $\mathbf{B}\in\mathbb{R}^{r\times 3}$ are low-rank adapter matrices with $r=4$, initialised with $\mathbf{B}=\mathbf{0}$ so the adaptation is an identity at deployment. Total adapter size: 76 parameters ($\approx 1$ KB). The shared encoder and all four expert heads remain frozen.
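The parameter count and the zero-initialisation property of Eq. (16) can be checked with a small pure-Python sketch. The random initialisation of $\mathbf{A}$, the profile values, and the base logits are placeholders chosen for illustration, not values from the system.

```python
import random

PROFILE_DIM, RANK, OUT_DIM = 16, 4, 3  # dimensions stated with Eq. (16)


def matvec_left(vec, mat):
    """Row vector times matrix: (n,) x (n, m) -> (m,)."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]


def adapt_logits(base_logits, profile, A, B):
    """Eq. (16): base logits plus the low-rank update p_u A B."""
    delta = matvec_left(matvec_left(profile, A), B)
    return [b + d for b, d in zip(base_logits, delta)]


rng = random.Random(0)
# Assumed init: small random A, zero B (B = 0 is stated in the text).
A = [[rng.gauss(0.0, 0.02) for _ in range(RANK)] for _ in range(PROFILE_DIM)]
B = [[0.0] * OUT_DIM for _ in range(RANK)]

n_params = PROFILE_DIM * RANK + RANK * OUT_DIM  # 64 + 12
profile = [rng.random() for _ in range(PROFILE_DIM)]  # stand-in for p_u
base = [0.3, -0.1, 0.8]
adapted = adapt_logits(base, profile, A, B)
print(n_params, adapted == base)
```

The count $16\cdot 4+4\cdot 3=76$ matches the stated adapter size, and with $\mathbf{B}=\mathbf{0}$ the update vanishes for any profile, so a new user starts from the base policy.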

#### Revealed preference extraction\.

Transaction history is analysed to compute $\mathbf{p}_{u}$, which encodes median holding period, LT-sell fraction, loss-harvest score, disposition effect, and trade frequency. The inferred objective (e.g. LT_GAIN_ONLY for users who hold winners $>12$ months) overrides the stated intent when confidence exceeds 70%.

#### Personalization effect\.

The adapter shifts intent router mixture weights toward the expert most consistent with a user’s revealed behaviour: frequent short-term traders increase $\alpha_{0}$ (momentum expert); long-horizon holders increase $\alpha_{3}$ (tax-aware expert). This separation (i.e., shared encoder and experts capture what markets look like; the personalised router captures how to weight strategies) makes adaptation cheap, interpretable, and privacy-preserving (only the 76-parameter adapter is persisted per user).

#### Qualitative illustration\.

Table [3](https://arxiv.org/html/2606.30997#S2.T3) shows after-tax recommendations for four canonical investor personas on AAPL at \$190.38 (15 Nov 2023), using the Phase 3 tax-aware layer on top of the trained Phase 2 policy. The system correctly suppresses sells near the LT threshold (persona 2), identifies tax-loss harvesting opportunities (persona 3), and adapts position sizing to bracket-specific after-tax returns.

Table 3: Phase 3 after-tax recommendations for four investor personas. AAPL @ \$190.38, 15 Nov 2023. “LT saving” = tax saving from waiting for long-term treatment. “sh” is the number of shares held by a given investor. The system suppresses BUY/SELL actions that would reduce after-tax value.

| Persona | Objective | Bracket | Action | After-tax now | Wait LT |
|---|---|---|---|---|---|
| 30d trader, 20sh @ \$161.82 | MAX_GAIN_30D | 32% ST / 15% LT | HOLD | \$+388 | \$+485 |
| LT investor, 50sh @ \$133.26 (25d from LT) | LT_GAIN_ONLY | 24% ST / 15% LT | HOLD⋆ | \$+2,170 | \$+2,427 |
| Loss position, 30sh @ \$237.97 | INCOME_HARVEST | 35% ST / 20% LT | HOLD | \$-1,428 | \$-1,428 |
| Near-retirement, no position | CAPITAL_PRESERVE | 22% ST / 15% LT | HOLD | — | — |

⋆ Sell suppressed: 26d until LT conversion saves \$257 in tax.
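The “After-tax now” and “Wait LT” figures in Table 3 are consistent with taxing the unrealized gain at the short-term or long-term rate. The sketch below reproduces persona 2 to within a dollar; the function name is hypothetical, and returning losses untaxed is a simplification that matches the loss row of the table rather than a statement of how the system models loss offsets.

```python
def after_tax_gain(price, cost_basis, shares, tax_rate):
    """Unrealized gain net of tax; losses are returned untaxed (simplification)."""
    gain = (price - cost_basis) * shares
    return gain * (1.0 - tax_rate) if gain > 0 else gain


PRICE = 190.38  # AAPL, 15 Nov 2023 (from Table 3)

# Persona 2: 50 shares at $133.26, 24% short-term vs 15% long-term rate.
sell_now = after_tax_gain(PRICE, 133.26, 50, 0.24)
wait_lt = after_tax_gain(PRICE, 133.26, 50, 0.15)
lt_saving = wait_lt - sell_now
print(f"{sell_now:.2f} {wait_lt:.2f} {lt_saving:.2f}")
```

The difference between the two outcomes is the roughly \$257 saving that leads to the sell being suppressed until the position qualifies for long-term treatment.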

### 2.6 Natural Language Goal Specification

We implement a two\-tier intent parser that converts free\-text goals to investment objective parameters:

1. API tier: A language model parses the user’s natural language goal into a structured intent object, including objective type, time horizon, return target, drawdown tolerance, and risk level, in under one second at negligible cost.
2. Rule tier: Keyword matching with regex-based year/month extraction handles the 90% of common cases with zero API dependency.
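A rule tier of this kind might look like the sketch below. The keyword table, the regular expression, and the output fields are all hypothetical; only the objective names and the idea of keyword matching plus year/month extraction come from the text.

```python
import re

# Hypothetical keyword table; objective names follow those used in this paper.
KEYWORD_OBJECTIVES = [
    (("preserve", "retire", "house", "safe"), "CAPITAL_PRESERVE"),
    (("harvest", "tax loss", "tax-loss"), "INCOME_HARVEST"),
    (("long-term", "long term"), "LT_GAIN_ONLY"),
    (("beat", "outperform", "alpha"), "ALPHA_VS_EW"),
]
HORIZON_RE = re.compile(r"(\d+)\s*(year|yr|month|mo)s?\b", re.IGNORECASE)


def parse_goal(text):
    """Return objective and horizon in months; either is None if unmatched."""
    lowered = text.lower()
    objective = next((obj for words, obj in KEYWORD_OBJECTIVES
                      if any(w in lowered for w in words)), None)
    horizon = None
    match = HORIZON_RE.search(text)
    if match:
        n, unit = int(match.group(1)), match.group(2).lower()
        horizon = n * 12 if unit.startswith("y") else n
    return {"objective": objective, "horizon_months": horizon}


intent = parse_goal("I am buying a house in 18 months")
print(intent)
```

A goal that matches no keyword returns `None` for the objective, which is the natural point at which to fall back to the API tier.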

Table [4](https://arxiv.org/html/2606.30997#S2.T4) shows representative mappings.

Table 4: Example intent-to-objective mappings.

## 3 News and Event Cross-Attention

For the news\-fused Phase 2 variant, structured events and news articles are encoded as additional key\-value tokens fed to a cross\-attention layer:

$$\mathbf{h}_{\text{fused}}=\text{CrossAttn}\!\left(\underbrace{\mathbf{h}_{\text{price}}}_{\text{query}},\;\left[\underbrace{\mathbf{e}_{1},\ldots,\mathbf{e}_{K}}_{\text{news}},\;\underbrace{\mathbf{v}_{1},\ldots,\mathbf{v}_{E}}_{\text{events}}\right]\right)+\mathbf{h}_{\text{price}}.\tag{17}$$

Price is the query (“what explains what I see?”); news and events are keys and values. The architecture degrades gracefully: if news is unavailable, the residual connection passes price representations unchanged.
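The residual structure of Eq. (17) and its fallback behaviour can be sketched with a single-head dot-product attention over one price query. This toy version omits the learned query/key/value projections and uses each context token as both key and value, which is an assumption made to keep the example short.

```python
import math


def cross_attend(h_price, context_tokens):
    """One price query attends over news/event tokens, plus the residual."""
    if not context_tokens:
        return list(h_price)  # no news or events: price passes through
    d = len(h_price)
    scores = [sum(q * k for q, k in zip(h_price, tok)) / math.sqrt(d)
              for tok in context_tokens]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    attn = [e / z for e in exps]
    attended = [sum(a * tok[i] for a, tok in zip(attn, context_tokens))
                for i in range(d)]
    return [h + c for h, c in zip(h_price, attended)]


h_price = [0.2, -0.1, 0.4]
news = [[0.5, 0.0, 0.1]]      # stands in for e_1..e_K
events = [[-0.2, 0.3, 0.0]]   # stands in for v_1..v_E
fused = cross_attend(h_price, news + events)
unchanged = cross_attend(h_price, [])
print(fused, unchanged)
```

With an empty context the function returns the price representation untouched, which is the graceful degradation described above.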

The model tracks 55 types of market events\. These include changes in analyst sentiment, such as upgrades, downgrades, and price target revisions, as well as earnings estimate changes, sector\-level signals, and macroeconomic data releases\. Rather than reacting to any single analyst’s opinion, the model aggregates analyst signals over a recent window, which tends to be a more reliable indicator than individual rating actions\.

## 4 Deployment Architecture

The system is deployed as a FastAPI application with the following components:

#### Brokerage integration\.

The system supports three broker backends: a live/paper trading executor, a read\-only position aggregator that connects to 50\+ brokerages via OAuth, and a mock broker for local development\.

#### Real\-time data\.

The system fetches price data with a 15\-minute cache TTL, ensuring recommendations during market hours use data at most 15 minutes stale\. The endpoint bypasses even this cache for immediate refresh\. News is cached at 30\-minute TTL\.

#### Trust\-first adaptation UX\.

Transaction history is processed in two steps: first a preview that analyses the uploaded file and returns the inferred investor profile without making any changes, then a separate apply step that runs only after the user confirms. The raw file is never stored; only a compact 16-number profile vector is saved.

#### Safety guardrails\.

Every order is validated against three configurable limits before execution: daily portfolio loss limit \(default 2%\), maximum single\-order value \($5,000\), and maximum position concentration \(20%\)\.
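A pre-trade check along these lines might be sketched as follows. The class, function, and field names are hypothetical, and the interpretation of each limit (for example, measuring concentration on the post-trade position value) is an assumption; only the three default thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    max_daily_loss: float = 0.02       # fraction of portfolio value
    max_order_value: float = 5_000.0   # dollars
    max_position_weight: float = 0.20  # fraction of portfolio value


def validate_order(order_value, position_value_after, portfolio_value,
                   daily_pnl, limits=Limits()):
    """Return the list of violated limits; an empty list means the order passes."""
    violations = []
    if daily_pnl <= -limits.max_daily_loss * portfolio_value:
        violations.append("daily_loss_limit")
    if order_value > limits.max_order_value:
        violations.append("max_order_value")
    if position_value_after / portfolio_value > limits.max_position_weight:
        violations.append("max_position_concentration")
    return violations


ok = validate_order(4_000.0, 15_000.0, 100_000.0, daily_pnl=-500.0)
blocked = validate_order(6_000.0, 25_000.0, 100_000.0, daily_pnl=-2_500.0)
print(ok, blocked)
```

Returning every violated limit, rather than stopping at the first, makes the rejection reason easier to surface to the user.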

## 5 Experimental Results

### 5.1 Phase 1 Pretraining

We pretrained the cross asset encoder on a 30-ticker S&P 500 sample (all 11 GICS sectors) using daily bars from 2015–2024. Table [5](https://arxiv.org/html/2606.30997#S5.T5) reports validation losses for key configurations.

Table 5: Phase 1 ablation: validation losses and inter-ticker cosine similarity across encoder configurations (all 60 epochs). The no-Chronos encoder naturally differentiates tickers (sim 0.08) because raw price/volume features vary substantially across assets; Chronos normalises sequences before encoding, causing representation collapse (sim 0.81). The contrastive loss corrects this collapse, while warm-starting from SSL pre-trained weights achieves the best val loss (0.163, sim 0.24).

Table [5](https://arxiv.org/html/2606.30997#S5.T5) reveals a surprising finding: the no-Chronos encoder naturally differentiates tickers (sim 0.08) without any contrastive loss, because raw price and volume features vary substantially across assets (AAPL vs XOM have fundamentally different scales and distributions). Chronos, by contrast, normalizes price sequences before encoding (i.e., relative movements look similar across tickers), causing representation collapse (sim 0.81) for both the tiny and small variants despite their different capacities, confirming that the bottleneck is the normalization rather than model size. Adding the contrastive loss corrects this collapse, either from scratch (sim 0.11, val 0.305) or via warm-start from SSL pre-trained weights (sim 0.24, val 0.163). The warm-start protocol is superior because the encoder first learns meaningful temporal structure before the contrastive loss sculpts the embedding space, avoiding the objective conflict that raises val loss at random initialization. This two-phase SSL → contrastive protocol is used for all Phase 2 experiments and is documented as a standalone contribution in Section [2.1](https://arxiv.org/html/2606.30997#S2.SS1.SSS0.Px4).

Adding the inter-ticker contrastive loss (Section [2.1](https://arxiv.org/html/2606.30997#S2.SS1.SSS0.Px4), $\lambda_{\text{contrast}}=0.5$) reduced mean cosine similarity from 0.96 to 0.24 in 60 epochs, enabling the allocation head to produce genuinely differentiated portfolio weights.

### 5.2 Phase 2 Portfolio Policy

Training directly on all six objectives simultaneously resulted in gradient conflict within 100 episodes. The policy converged to uniform 1/N weights with near-zero alpha as competing objectives canceled each other’s gradients. We therefore applied the six-stage MoE expert curriculum described in Section [5.3](https://arxiv.org/html/2606.30997#S5.SS3), whose first three stages each train one specialist expert while freezing the others.

#### Redeployment reward validation\.

Over 100 randomised market seeds, the sell → immediately-buy action achieves mean reward +0.025 vs. −0.018 for sell → hold-cash ($t=111.6$, $p\ll 0.001$), confirming that the reward correctly incentivises completing the rebalance cycle.

### 5.3 Curriculum Learning for Phase 2 Convergence

Directly training a portfolio policy on all objectives simultaneously proved unstable: the policy collapsed to uniform $1/N$ weights with near-zero alpha within 100 episodes, due to conflicting gradient signals from different investment objectives.

We address this with a six\-stage MoE curriculum, each of the first three stages training one specialist expert head while freezing the others, followed by router training, joint fine\-tuning, and expert grafting:

Table 6: MoE expert curriculum stages. All stages use $N=10$ tickers, $\beta_{H}=0.02$, $\lambda_{\tau}=0.05$, $\tau_{\text{cap}}=0.25$, $\delta_{\text{reb}}=0.01$. Each stage trains one expert head exclusively while the remaining experts are frozen, preventing gradient interference. Stage 5 uses the redesigned IntentRouter with supervised routing loss.

Each stage resumes from the previous checkpoint via shape-filtered state dict loading, preserving compatible expert weights while re-initializing incompatible router layers. The best-reward gate resets automatically when a new stage name is detected in the checkpoint note, preventing stale rewards from blocking checkpointing in subsequent stages.
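Shape-filtered loading can be sketched as a filter over checkpoint entries. The parameter names and shapes below are hypothetical, and lightweight stand-in objects replace real tensors so the example runs without a deep learning framework; the only requirement is a `.shape` attribute.

```python
from types import SimpleNamespace


def filter_by_shape(checkpoint, target):
    """Keep checkpoint entries whose key exists in target with an equal shape."""
    kept, skipped = {}, []
    for name, tensor in checkpoint.items():
        if name in target and tuple(tensor.shape) == tuple(target[name].shape):
            kept[name] = tensor
        else:
            skipped.append(name)  # stays at the new model's initialization
    return kept, skipped


def T(*shape):
    """Stand-in for a tensor: only the .shape attribute is needed here."""
    return SimpleNamespace(shape=shape)


# Hypothetical case: the router input layer changed width between stages.
old_ckpt = {"experts.0.weight": T(64, 32), "router.fc1.weight": T(32, 70)}
new_model = {"experts.0.weight": T(64, 32), "router.fc1.weight": T(32, 128)}
kept, skipped = filter_by_shape(old_ckpt, new_model)
print(list(kept), skipped)
```

With PyTorch, the `kept` dictionary would typically be passed to `model.load_state_dict(kept, strict=False)` so that the skipped router layers retain their fresh initialization.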

Stages 1–3 each train one expert exclusively on its designated mandate, producing specialists that independently achieve +3.16% to +3.37% 14d EW alpha. Stage 4 (router-only with frozen experts) fails due to a flat loss surface (router weights remain at 0.25, alpha drops to +0.44%). Stage 5 resolves this with a redesigned intent router and supervised cross-entropy loss, achieving clean diagonal routing (std > 0.37) at episode 33. Stage 6 grafts the curriculum expert weights into the Stage 5 router, recovering +3.03% 14d alpha with the best 90d result (−2.11%).

#### Hard turnover cap\.

In addition to the soft turnover penalty $\lambda_{\tau}\tilde{\tau}$ in the reward, we enforce a hard ceiling $\tau_{\text{cap}}$ on per-step turnover in the environment. Once sells and buys collectively exceed $\tau_{\text{cap}}\cdot V_{t}$ (where $V_{t}$ is portfolio value), further buys are physically blocked for that step. This prevents the policy from discovering degenerate high-turnover strategies that exploit the redeployment bonus, which caused instability in preliminary experiments without the cap.
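One way such a cap could be enforced inside an environment step is sketched below. The order representation, the sequential processing, and the choice to let the order that crosses the ceiling execute in full are assumptions made for illustration.

```python
def apply_turnover_cap(orders, portfolio_value, tau_cap=0.25):
    """Process (side, notional) orders in sequence; once cumulative traded
    notional exceeds tau_cap * portfolio_value, block further buys."""
    budget = tau_cap * portfolio_value
    traded, executed, blocked = 0.0, [], []
    for side, notional in orders:
        if side == "buy" and traded > budget:
            blocked.append((side, notional))
            continue
        executed.append((side, notional))
        traded += notional
    return executed, blocked


orders = [("sell", 15_000.0), ("buy", 12_000.0), ("buy", 5_000.0)]
executed, blocked = apply_turnover_cap(orders, 100_000.0)
print(executed, blocked)
```

Here the first two orders bring traded notional to \$27,000, above the \$25,000 ceiling for a \$100,000 portfolio, so the final buy is blocked while sells would still be allowed.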

#### Action entropy floor\.

To prevent the per\-ticker action head from collapsing to all\-hold, we add an entropy floor term to the PPO loss:

$$\mathcal{L}_{\text{PPO}}=\mathcal{L}_{\text{clip}}+c_{v}\mathcal{L}_{\text{value}}-\beta_{H}H(\pi_{\theta})+0.05\cdot\max\bigl(0,\;0.5-H(\pi_{\theta})\bigr),\tag{18}$$

where the last term adds a penalty to the minimised loss whenever the action entropy $H$ drops below 0.5 nats (a uniform distribution over three actions has entropy $\approx 1.1$ nats). This prevents a deterministic hold policy while leaving the policy free to converge once it is genuinely exploring.
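The size of the floor term in Eq. (18) is easy to inspect numerically. The sketch below computes only its magnitude, $0.05\cdot\max(0,\,0.5-H)$, for two hand-picked action distributions; the buy/hold/sell labelling of the three actions is an assumption.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_floor_term(probs, floor=0.5, coef=0.05):
    """Magnitude of the floor term in Eq. (18): coef * max(0, floor - H)."""
    return coef * max(0.0, floor - entropy(probs))


uniform = [1 / 3, 1 / 3, 1 / 3]   # buy / hold / sell equally likely
collapsed = [0.01, 0.98, 0.01]    # almost always hold

print(entropy(uniform))                 # ln 3, about 1.1 nats
print(entropy_floor_term(uniform))      # above the floor, so the term is zero
print(entropy_floor_term(collapsed))    # below the floor, so the term is active
```

The term is inactive for any policy whose entropy is above 0.5 nats, so it affects only the near-deterministic regime it is meant to prevent.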

### 5.4 Backtest (14 trading days, June 2026)

#### Representation collapse diagnosis\.

An important empirical finding from our development process concerns encoder representation quality. We observed that despite achieving val loss 0.143, the cross asset encoder without contrastive loss produced embeddings with mean inter-ticker cosine similarity of 0.96, effectively a mean representation where all tickers look identical to the allocation head. This caused the policy to output uniform $1/N$ allocation weights regardless of input, producing zero alpha by construction.

We diagnose this as a systematic failure of standard SSL objectives for multi\-asset representation learning: return prediction, masked reconstruction, and regime classification all treat each ticker independently, providing no gradient signal for inter\-ticker differentiation\. The contrastive loss \(Section[2\.1](https://arxiv.org/html/2606.30997#S2.SS1.SSS0.Px4)\) resolves this, reducing similarity to 0\.24 and enabling the allocation head to produce differentiated weights\.

![Refer to caption](https://arxiv.org/html/2606.30997v1/x1.png)

Figure 3: Inter-ticker contrastive loss during Phase 1 encoder training (60 epochs). Loss drops from 0.792 at epoch 0 to 0.118 at epoch 59, crossing below 0.2 by epoch 13. The corresponding mean inter-ticker cosine similarity falls from 0.96 (collapsed representations) to 0.24 (differentiated), enabling the allocation head to produce non-uniform portfolio weights.

#### Backtest results\.

Table [7](https://arxiv.org/html/2606.30997#S5.T7) reports walk-forward backtest results across three policy configurations, all using the contrastive encoder (cosine sim 0.24) and a 10-ticker universe (AAPL, MSFT, NVDA, AMZN, GOOGL, META, TSLA, JPM, XOM, V) with \$100,000 initial capital and zero transaction cost.

Table 7: 14-day walk-forward backtest, June 2026. 10 tickers, \$100,000 initial capital, zero transaction cost. Alpha measured vs equal-weight (EW) and SPY (S&P 500 ETF). EW return: −8.01%; SPY return: −2.76%.

The collapsed-representation encoder (cosine sim 0.96) produces uniform weights and negative alpha, confirming the representation collapse hypothesis. After adding the inter-ticker contrastive loss (sim 0.24), the ALPHA_VS_EW single-head policy achieves +2.70% alpha on the 14-day June 2026 window. Against SPY the same policy returns −2.39%; this gap is structural: SPY encompasses 500 market-cap-weighted stocks (including defensive names that held up during the downturn), whereas our universe is limited to 10 growth-heavy names. Scaling to a broader universe is left to future work (§[10](https://arxiv.org/html/2606.30997#S10)).

The MoE variant (§[2.4](https://arxiv.org/html/2606.30997#S2.SS4)) trained with all six objectives sampled randomly achieves +2.92% alpha vs EW, the highest of all configurations, with total return −5.15% vs −8.01% EW, improving on the single-head policy by +0.22 percentage points. The MoE policy with correct intent routing produces distinct top holdings per expert (JPM for momentum, TSLA for growth, GOOGL for defensive, AMZN for tax-aware), confirming that the four specialist heads have learned meaningfully different allocation strategies.

#### Note on annualised Sharpe\.

The annualised Sharpe ratios in Table [7](https://arxiv.org/html/2606.30997#S5.T7) are negative (−5.90, −6.54) despite positive alpha. This is expected and not a failure of the strategy. Annualised Sharpe is computed as:

$$\hat{S}=\frac{\bar{r}_{\text{port}}-r_{f}}{\sigma_{\text{port}}}\cdot\sqrt{252},\tag{19}$$

where $\bar{r}_{\text{port}}$ is the mean daily return over the 14-day window, $r_{f}$ is the risk-free rate, and $\sigma_{\text{port}}$ is the standard deviation of daily portfolio returns. During this 14-day backtest window the broad market was approximately flat (SPY: −0.91%), while our 10-stock equal-weight benchmark declined 8.01%, reflecting the growth-heavy composition of the universe (NVDA, TSLA, META) underperforming during this specific period. A negative Sharpe on a short window therefore reflects universe-specific drawdown rather than systematic underperformance. Sharpe is a meaningful metric only over a full market cycle (1–2 years) that includes both bull and bear regimes.
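To see why a short drawdown window yields such large negative values, Eq. (19) can be applied to a made-up 14-day return series. The returns below are hypothetical, and the use of the sample standard deviation and a zero daily risk-free rate are assumptions.

```python
import math
import statistics


def annualised_sharpe(daily_returns, daily_rf=0.0, periods=252):
    """Eq. (19): mean excess daily return over daily volatility, times sqrt(252)."""
    mean_excess = statistics.fmean(daily_returns) - daily_rf
    vol = statistics.stdev(daily_returns)  # sample standard deviation
    return mean_excess / vol * math.sqrt(periods)


# Hypothetical 14-day window with a small negative drift.
returns = [-0.004, 0.002, -0.006, 0.001, -0.003, -0.005, 0.003,
           -0.002, -0.007, 0.004, -0.004, -0.001, -0.006, 0.002]
print(annualised_sharpe(returns))
```

A mean daily loss of under 0.2% is enough to push the annualised figure far below zero, because the $\sqrt{252}$ factor scales a 14-day drift as if it persisted for a full year.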

For short-window evaluation, alpha vs equal-weight is the appropriate metric, as it isolates stock selection skill from market direction.

#### Multi\-window robustness and SPY benchmark\.

The 14-day positive alpha does not persist at longer horizons. Table [8](https://arxiv.org/html/2606.30997#S5.T8) shows results from a systematic sweep across checkpoint epochs and window lengths, evaluated against both EW and SPY.

Table 8: Best-epoch alpha vs equal-weight (EW) and SPY across backtest windows and MoE curriculum stages. Window-adaptive rebalancing: reb=1.0 (14d), reb=0.05 (30d), reb=0.03 (60d), reb=0.02 (90d). Zero transaction cost. S4 (router-only, frozen experts) fails due to flat loss surface. S6 (Grafted MoE) combines Stage 5 router with curriculum expert weights, achieving the best 90d result (−2.11%) across all configurations.

Table [8](https://arxiv.org/html/2606.30997#S5.T8) reveals a clear pattern of expert specialization across the three expert curriculum stages. The Stage 1 momentum expert achieves +3.16% EW alpha at 14 days and, with adaptive rebalancing, also produces positive 30-day alpha (+0.21% at ep300), suggesting that the momentum signal persists beyond the 14-day window when positions are allowed to rebalance on drift. At 60 and 90 days the expert underperforms (−3.97%, −2.40%), as expected for a short-horizon mandate. The Stage 2 growth expert further improves 14-day alpha to +3.37% (ep100) and maintains positive 30-day alpha (+0.19% at ep300). At 90 days it achieves the best result among the three expert stages (−2.37% at ep300), confirming the growth expert learns longer-horizon signals progressively across episodes.

The 60-day window remains negative and is the crossover point between short-term momentum and long-horizon fundamentals. Stage 3 (defensive expert, CAPITAL_PRESERVE) shows a clear specialization pattern: 14-day alpha declines across episodes (+3.18% at ep100 → +2.82% at ep300), confirming the expert is learning conservative allocation at the expense of short-horizon returns. The 60-day window improves to −4.07% at ep200 (vs −4.62% for the growth expert), and the 30-day window briefly crosses positive (+0.11% at ep100) before the conservative objective takes hold. This complementary behaviour (i.e., momentum for 14d, growth for 30d, defensive for 60d) is exactly the per-horizon specialization the MoE architecture is designed to exploit through router weighting. Stage 4 (router-only training) is analysed separately under MoE router training. The SPY gap reflects the structural 10-ticker universe limitation: SPY’s defensive and dividend-paying constituents outperformed the growth-heavy basket during this downturn period.

#### MoE router training\.

A key empirical finding is that the intent router does not learn to differentiate intents when all six objectives are sampled jointly with a single\-head policy\. It also fails when experts are frozen\.

The first four curriculum stages were:

1. Momentum expert (ALPHA_VS_EW), freeze experts 1–3.
2. Growth expert (MAX_GAIN_1Y), freeze experts 0, 2–3.
3. Defensive expert (CAPITAL_PRESERVE), freeze experts 0–1, 3.
4. Router-only (all experts frozen, all objectives).

Stage 4 (router-only training) failed to produce differentiated routing: all router weights remained at $\alpha_{e}\approx 0.25$ (std < 0.002, Table [8](https://arxiv.org/html/2606.30997#S5.T8)), and 14-day alpha dropped from +3.37% (Stage 2) to +0.44%. The failure mode is a flat loss surface: with all expert heads frozen, the weighted mixture output is invariant to router weights, providing no gradient signal for the router to learn intent-conditional routing.

Stage 5 (joint training with all experts and the router updating simultaneously on all 6 objectives, warm-started from Stage 3) resolves the flat loss surface problem via three changes: (i) the intent router was redesigned with an intent projection layer ($\mathbb{R}^{6}\to\mathbb{R}^{d}$), giving the intent signal equal representation to the 64-dim market embedding; (ii) a direct shortcut from intent to expert logits, initialised as a diagonal mapping (intent $i\to$ expert $i$), provides a strong routing prior from episode 0; (iii) a supervised cross-entropy loss ($\lambda_{\text{sup}}=1.0$) penalises deviations from the correct intent-to-expert mapping during PPO updates, with the router learning rate scaled $10\times$ relative to the expert heads.
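The effect of the shortcut and the supervised loss can be illustrated with a toy router. The intent-to-expert assignment below follows Table 9; the shortcut strength of 5.0, the all-zero MLP branch, and the function names are hypothetical and stand in for the learned components.

```python
import math

# Intent -> designated expert, following Table 9
# (experts: 0 momentum, 1 growth, 2 defensive, 3 tax-aware).
INTENT_TO_EXPERT = {
    "ALPHA_VS_EW": 0, "MAX_GAIN_1Y": 1, "CAPITAL_PRESERVE": 2,
    "INCOME_HARVEST": 3, "LT_GAIN_ONLY": 3, "MAX_GAIN_30D": 0,
}
N_EXPERTS = 4
SHORTCUT_GAIN = 5.0  # hypothetical initial strength of the routing prior


def shortcut_logits(intent):
    """Direct intent-to-expert shortcut, initialised as a one-hot mapping."""
    target = INTENT_TO_EXPERT[intent]
    return [SHORTCUT_GAIN if e == target else 0.0 for e in range(N_EXPERTS)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def routing_loss(mlp_logits, intent, lambda_sup=1.0):
    """Supervised cross-entropy between router output and the target expert."""
    logits = [a + b for a, b in zip(mlp_logits, shortcut_logits(intent))]
    alpha = softmax(logits)
    return -lambda_sup * math.log(alpha[INTENT_TO_EXPERT[intent]]), alpha


# With an untrained (all-zero) MLP branch the shortcut already routes correctly.
loss, alpha = routing_loss([0.0] * N_EXPERTS, "CAPITAL_PRESERVE")
print(loss, alpha)
```

Because the prior is correct from the first episode, the cross-entropy term starts small and mainly acts to keep PPO updates from eroding the mapping.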

Table [9](https://arxiv.org/html/2606.30997#S5.T9) shows the resulting routing at Stage 5 episode 33. All six intents route cleanly to their designated expert (weight = 1.00), and each expert produces a distinct top holding: JPM (momentum), TSLA (growth), GOOGL (defensive), AMZN (tax-aware). Router differentiation std > 0.37 for all experts (threshold: 0.05).

#### Expert grafting \(Stage 6\)\.

Joint PPO training in Stage 5 degrades expert specialisation through gradient interference: the momentum expert’s 14d EW alpha drops from +3.37% (Stage 2) to +0.01% after 100 joint training episodes. We address this with expert grafting: after Stage 5 establishes correct routing (std > 0.37), the expert head weights are replaced with the best curriculum checkpoints (Stages 2–3) while the Stage 5 router weights are preserved. This composition recovers +3.03% 14d EW alpha with correct intent-conditional routing, and achieves −2.11% 90d EW alpha, the best long-horizon result across all configurations. Expert grafting requires no additional training, making it a lightweight alternative to continual learning approaches for MoE RL systems.
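Expert grafting amounts to merging parameter dictionaries. The sketch below assumes flat state dicts whose expert parameters share an `experts.<index>.` prefix; the key names and the string placeholders for tensors are hypothetical.

```python
def graft_experts(base_ckpt, donor_ckpts, expert_prefix="experts."):
    """Copy base_ckpt, then overwrite each expert's parameters with those
    from its donor checkpoint. Router and other weights stay from base."""
    grafted = dict(base_ckpt)
    for idx, donor in donor_ckpts.items():
        prefix = f"{expert_prefix}{idx}."
        for name, tensor in donor.items():
            if name.startswith(prefix):
                grafted[name] = tensor  # only this expert's weights change
    return grafted


# Strings stand in for tensors so the provenance of each entry is visible.
stage5 = {"router.w": "s5", "experts.0.w": "s5", "experts.1.w": "s5"}
momentum_ckpt = {"router.w": "mom", "experts.0.w": "mom", "experts.1.w": "mom"}
growth_ckpt = {"router.w": "gro", "experts.0.w": "gro", "experts.1.w": "gro"}
grafted = graft_experts(stage5, {0: momentum_ckpt, 1: growth_ckpt})
print(grafted)
```

Each donor contributes only its own expert's entries, so the router learned in Stage 5 is untouched and no gradient step is needed.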

Table 9: Intent Router weights at Stage 5 episode 33 (joint training with supervised routing loss $\lambda_{\text{sup}}=1.0$, router LR scale $10\times$). Each intent routes cleanly to its designated expert. Distinct top holdings per expert confirm that the four heads have learned meaningfully different allocation strategies.

| Intent | Mom. | Growth | Def. | Tax | Dominant | Top holding |
|---|---|---|---|---|---|---|
| ALPHA_VS_EW | 1.00 | 0.00 | 0.00 | 0.00 | Momentum | JPM 10.7% |
| MAX_GAIN_1Y | 0.00 | 1.00 | 0.00 | 0.00 | Growth | TSLA 9.8% |
| CAPITAL_PRES | 0.00 | 0.00 | 1.00 | 0.00 | Defensive | GOOGL 9.7% |
| INCOME_HARV | 0.00 | 0.00 | 0.00 | 1.00 | Tax-aware | AMZN 10.1% |
| LT_GAIN_ONLY | 0.00 | 0.00 | 0.00 | 1.00 | Tax-aware | AMZN 10.1% |
| MAX_GAIN_30D | 1.00 | 0.00 | 0.00 | 0.00 | Momentum | JPM 10.7% |
| Std across intents (target > 0.05) | 0.47 | 0.37 | 0.37 | 0.47 | ✓ | |

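The “Std across intents” row of Table 9 can be reproduced from the routing weights themselves. The sketch below uses the population standard deviation, which is the convention that matches the reported values; the dictionary layout is just one convenient encoding of the table.

```python
import statistics

# Router weights from Table 9; columns: momentum, growth, defensive, tax-aware.
ROUTING = {
    "ALPHA_VS_EW":  [1.0, 0.0, 0.0, 0.0],
    "MAX_GAIN_1Y":  [0.0, 1.0, 0.0, 0.0],
    "CAPITAL_PRES": [0.0, 0.0, 1.0, 0.0],
    "INCOME_HARV":  [0.0, 0.0, 0.0, 1.0],
    "LT_GAIN_ONLY": [0.0, 0.0, 0.0, 1.0],
    "MAX_GAIN_30D": [1.0, 0.0, 0.0, 0.0],
}

# Transpose so each tuple holds one expert's weight across the six intents.
columns = list(zip(*ROUTING.values()))
stds = [round(statistics.pstdev(col), 2) for col in columns]
print(stds)  # [0.47, 0.37, 0.37, 0.47]
```

Experts that serve two intents (momentum and tax-aware) show the larger spread, while single-intent experts sit at 0.37; a router stuck at uniform 0.25 weights would score zero on this diagnostic.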
#### Key diagnostic findings\.

Several failure modes were discovered and resolved during development:

1. Representation collapse (cosine similarity 0.96 → 0.24): standard SSL objectives provide no inter-ticker differentiation signal; fixed by inter-ticker contrastive loss.
2. Allocation head symmetry: gain=0.01 initialization caused uniform softmax output regardless of encoder input; fixed by gain=1.0 with random bias initialization.
3. Frozen encoder static mapping: with a frozen encoder and fixed 10-ticker universe, the allocation head converges to a static ticker ranking; fixed by unfreezing the encoder jointly with the MoE experts.
4. MoE router uniform collapse: joint multi-objective training keeps all router weights at 0.25 because the router receives no contrastive signal across intents; fixed by a 4-stage expert curriculum with per-expert objective isolation.
5. Router-only flat loss surface: freezing all experts during router training produces a loss surface invariant to router weights, since the weighted mixture output is identical regardless of routing. Stage 4 router weights remained at 0.25 (std < 0.002) and alpha dropped from +3.37% to +0.44%; fixed by Stage 5 joint training, in which intent projection, diagonal shortcut initialisation, and a supervised cross-entropy loss achieve std > 0.37 routing differentiation at ep 33.

## 6 Future Work

#### Phase 3 empirical evaluation\.

The most immediate extension is a full empirical evaluation of Phase 3 personalization on real brokerage data across diverse user profiles \(aggressive, conservative, tax\-aware, income\-seeking\)\. We plan to evaluate: \(i\) how quickly the 76\-parameter LoRA adapter converges on synthetic transaction histories of varying length; \(ii\) whether the router shift direction matches the intended profile \(e\.g\. momentum expert weight increases for frequent traders\); and \(iii\) backtest alpha improvement from personalisation vs the generic MoE policy\.

#### Multi\-window MoE training\.

The current ALPHA\_VS\_EW objective produces positive short\-window alpha \(\+2\.70%, 14 days\) but negative alpha at longer horizons\. Training the momentum expert explicitly on 14\-day windows and the growth expert on 90\-day windows, with separate rollout buffers per expert, should produce a router that routes to momentum for short\-term queries and growth for long\-term queries, addressing the multi\-window degradation\.

#### Online fine\-tuning\.

A lightweight online fine-tuning mode that updates only the allocation head scorer on recent 90-day data (weekly, $\sim$20 episodes) would allow the policy to adapt to regime shifts without full retraining. The frozen encoder makes this tractable: only $\sim$200K parameters need gradient computation per online update.

#### Broader ticker universe\.

The current system is validated on 10 tickers. Scaling to the full S&P 500 (500 tickers) requires efficient cross-asset attention (e.g., sparse attention, or clustering tickers by sector before applying attention within clusters).

#### News and event integration\.

The fused market encoder with news cross\-attention is implemented but not yet validated in Phase 2 training\. Integrating analyst consensus upgrades, earnings surprises, and macro releases as attention keys should improve alpha on event\-driven windows\.

## 7 Related Work

#### Financial RL\.

\[[7](https://arxiv.org/html/2606.30997#bib.bib7)\] applied RNNs with RL to futures trading. \[[10](https://arxiv.org/html/2606.30997#bib.bib10)\] introduced the portfolio management framework with convolutional feature extraction. \[[22](https://arxiv.org/html/2606.30997#bib.bib22)\] added transaction cost modeling. \[[11](https://arxiv.org/html/2606.30997#bib.bib11)\] provides a comprehensive benchmark environment. Our work extends these by adding personalization, tax awareness, and natural language goal specification.

#### Multi\-objective RL\.

\[[1](https://arxiv.org/html/2606.30997#bib.bib1)\] and \[[8](https://arxiv.org/html/2606.30997#bib.bib8)\] survey multi-objective RL methods. Our objective-conditioned reward is closest to \[[3](https://arxiv.org/html/2606.30997#bib.bib3)\] (successor features) but implemented as direct reward shaping rather than value decomposition, for simplicity of integration with existing PPO infrastructure.

#### Personalization in finance\.

Robo-advisors \[[5](https://arxiv.org/html/2606.30997#bib.bib5)\] personalize asset allocation via risk questionnaires. We replace stated preferences with revealed preferences from transaction history, following the behavioral finance literature \[[13](https://arxiv.org/html/2606.30997#bib.bib13)\] on the disposition effect.

#### Comparison to commercial robo\-advisors\.

Leading retail robo\-advisors share a common architecture: ETF\-based diversification, mean\-variance rebalancing with fixed drift thresholds, and rule\-based daily tax\-loss harvesting\. Personalization is limited to a one\-time risk questionnaire or static goal presets\. To our knowledge, none support dynamic strategy switching based on evolving user intent, nor optimize after\-tax rewards at the individual tax bracket level\.

Our approach introduces three capabilities not observed in these systems: (i) intent-conditional strategy routing via MoE: a user stating “I am buying a house in 18 months” triggers a shift to the defensive expert without manual re-enrollment; (ii) bracket-aware RL reward: the policy explicitly optimizes after-tax return given the user’s marginal ST/LT capital gains rates, not a fixed harvesting threshold; and (iii) continuous behavioral personalization: the 76-parameter LoRA adapter updates from revealed transaction preferences rather than self-reported risk tolerance, which is known to diverge from actual behavior \[[13](https://arxiv.org/html/2606.30997#bib.bib13)\].

#### Parameter\-efficient fine\-tuning\.

\[[9](https://arxiv.org/html/2606.30997#bib.bib9)\] introduced LoRA for large language models\. We apply the same rank\-decomposition idea to a 3\-action classification head, yielding 76\-parameter adapters that capture user\-specific biases in seconds on CPU\.
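
The parameter count can be checked with a short sketch\. A rank\-r LoRA update on a linear layer with d\_in inputs and d\_out outputs adds r·\(d\_in \+ d\_out\) parameters\. With d\_out = 3, one configuration that gives exactly 76 is a 16\-dimensional input with rank 4; this section does not state the input width or the rank, so both values are assumptions here \(rank 2 with 35 inputs would also give 76\)\. The helper names are hypothetical\.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in a LoRA update B @ A, with A: rank x d_in, B: d_out x rank."""
    return rank * d_in + d_out * rank


def matvec(matrix, vector):
    """Plain-Python matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, scale=1.0):
    """Logits of a frozen head W plus the low-rank correction: W x + scale * B (A x)."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


# Assumed shapes: 16-dimensional input, 3 actions, rank 4.
assert lora_param_count(d_in=16, d_out=3, rank=4) == 76
```

Following the original LoRA recipe, B is initialized to zero, so the adapted head reproduces the frozen head exactly until user\-specific updates are applied\.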

#### Foundation models for finance\.

\[[18](https://arxiv.org/html/2606.30997#bib.bib18)\] and \[[20](https://arxiv.org/html/2606.30997#bib.bib20)\] explore LLM\-based financial agents\. Our hybrid approach uses a domain\-specific SSL encoder for market data \(where structure is numerical\) and LLMs only for natural language interface tasks \(intent parsing, news summarisation\)\.

#### Time series foundation models\.

\[[2](https://arxiv.org/html/2606.30997#bib.bib2)\] introduced Chronos, a T5\-based model pretrained on over 100 billion time series data points from diverse domains\. \[[6](https://arxiv.org/html/2606.30997#bib.bib6)\] and \[[19](https://arxiv.org/html/2606.30997#bib.bib19)\] explore related universal forecasting approaches\. We are the first, to our knowledge, to apply a frozen time series foundation model as a parallel encoder branch in a portfolio RL system, using a learned gating mechanism to balance domain\-specific SSL representations against universal temporal patterns\.
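
One common way to realize such a gate is an elementwise sigmoid over the concatenated branch outputs\. The sketch below shows that form in plain Python; it assumes both branches have already been projected to the same width, and the function name `gated_fusion` and its parameters are hypothetical\. The exact gate used in the system may differ\.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(h_ssl, h_tsfm, gate_w, gate_b):
    """Blend two equal-width embeddings with a learned elementwise gate.

    g = sigmoid(gate_w @ [h_ssl; h_tsfm] + gate_b)
    out = g * h_ssl + (1 - g) * h_tsfm

    h_ssl:  trainable SSL encoder output (list of floats)
    h_tsfm: frozen time series foundation model output, same length
    gate_w: one weight row per output dimension, each of length 2 * len(h_ssl)
    gate_b: one bias per output dimension
    """
    concat = list(h_ssl) + list(h_tsfm)
    fused = []
    for i, (a, c) in enumerate(zip(h_ssl, h_tsfm)):
        z = sum(w * v for w, v in zip(gate_w[i], concat)) + gate_b[i]
        g = sigmoid(z)
        fused.append(g * a + (1.0 - g) * c)
    return fused
```

With all gate weights and biases at zero the gate sits at 0\.5 and the output is the plain average of the two branches; training moves each dimension toward whichever branch is more useful, while the foundation model's own weights stay frozen\.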

#### Ticker universe independence\.

All prior financial RL systems we are aware of \(including \[[10](https://arxiv.org/html/2606.30997#bib.bib10)\], \[[22](https://arxiv.org/html/2606.30997#bib.bib22)\], and \[[11](https://arxiv.org/html/2606.30997#bib.bib11)\]\) use fixed ticker embeddings that tie the model to a specific asset universe\. Our 50\-dimensional observable metadata vector replaces these embeddings entirely, enabling zero\-shot application to any publicly traded asset\. The closest related idea is the use of fundamental factor models in the quantitative finance literature \[[12](https://arxiv.org/html/2606.30997#bib.bib12)\], but these are hand\-crafted linear models rather than learned representations\.

## 8Discussion and Limitations

#### Training data\.

The current Phase 1 corpus of 10 tickers is narrow\. Representation diversity improves substantially with 50–500 tickers spanning multiple sectors and market\-cap regimes\. The ticker\-identity\-free design makes scaling straightforward\.

#### Phase 2 convergence\.

300 episodes are insufficient for the policy to learn a non\-uniform allocation; 2000\+ are needed\. The objective\-conditioned reward requires the policy to serve six distinct goals, increasing the effective sample complexity\.

#### Tax accuracy\.

Tax lot accuracy depends on broker API capabilities\. Standard brokerage APIs typically return average cost basis rather than individual lot\-level data, which limits after\-tax optimization precision\. Full lot\-level accuracy requires either an institutional\-grade API or manual lot tracking\.
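
A small worked example shows why average cost basis limits tax\-loss harvesting\. The two functions below compute the realized gain on the same sale under average\-cost accounting and under lot\-level accounting that sells the highest\-cost lots first\. The function names and the example lots are hypothetical, and highest\-cost\-first is only one of several possible lot\-selection rules\.

```python
def realized_gain_average_cost(lots, sell_qty, price):
    """Realized gain when only the average cost basis is known.

    lots: list of (quantity, cost_per_share) tuples.
    """
    total_qty = sum(q for q, _ in lots)
    if sell_qty > total_qty:
        raise ValueError("cannot sell more shares than are held")
    avg_cost = sum(q * c for q, c in lots) / total_qty
    return sell_qty * (price - avg_cost)


def realized_gain_highest_cost_first(lots, sell_qty, price):
    """Realized gain when individual lots are known and the most expensive are sold first."""
    if sell_qty > sum(q for q, _ in lots):
        raise ValueError("cannot sell more shares than are held")
    remaining = sell_qty
    gain = 0.0
    for qty, cost in sorted(lots, key=lambda lot: lot[1], reverse=True):
        if remaining == 0:
            break
        take = min(qty, remaining)
        gain += take * (price - cost)
        remaining -= take
    return gain


# Hypothetical position: 10 shares bought at 100 and 10 shares bought at 140.
lots = [(10, 100.0), (10, 140.0)]
avg_view = realized_gain_average_cost(lots, sell_qty=10, price=120.0)
lot_view = realized_gain_highest_cost_first(lots, sell_qty=10, price=120.0)
```

At a price of 120 the average\-cost view reports a realized gain of 0, while the lot\-level view realizes a loss of 200 by selling the 140 lot\. A policy that only sees the average basis cannot identify that harvestable loss\.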

#### Out\-of\-distribution\.

The backtest period \(June 2026\) was broadly a down market \(−5\.2% for equal weight\)\. A fair evaluation requires multi\-regime testing including bull markets, volatility spikes, and sector rotations\.

#### Analyst data timeliness\.

Analyst consensus signals from yfinance may lag by 1–2 days\. Production deployment would benefit from real\-time data feeds\.

## 9Conclusion

We presented a complete three\-phase system for personalized, tax\-aware portfolio management using foundation model representations and reinforcement learning\. Phase 1 introduces a cross asset encoder with Chronos augmentation and inter\-ticker contrastive loss, resolving the representation collapse \(cosine similarity 0\.96 → 0\.24\) that prevented ticker differentiation\. Phase 2 introduces a MoE portfolio actor critic with four specialized expert heads and a learned intent router that eliminates cross\-objective gradient conflict, achieving \+2\.70% alpha versus an equal\-weight benchmark on a 14\-day June 2026 walk\-forward backtest\. Phase 3 proposes a 76\-parameter LoRA adapter that personalizes the intent router from revealed brokerage preferences without retraining the shared encoder or expert heads\.

The key architectural innovations are: ticker\-identity\-free metadata encoding \(any universe without retraining\), objective\-conditioned MoE routing \(one policy for six investment mandates\), redeployment\-aware turnover penalty, and a trust\-first preview\-before\-apply personalisation UX\.

The system is deployed as a production\-ready FastAPI application with live brokerage integration, real\-time data, and natural language goal specification\. Code is available at [https://github\.com/rpishehvar/PublicFinance\-RL](https://github.com/rpishehvar/PublicFinance-RL)\.

## 10Future Work

Several directions remain for future investigation:

#### Phase 3 empirical evaluation\.

The LoRA personalisation architecture is proposed but not yet fully evaluated\. Future work will generate synthetic brokerage histories for three representative user archetypes \(aggressive growth, conservative income, tax\-aware long\-term\) and measure how the intent router mixture weights shift after LoRA fine\-tuning, and whether personalized policies produce meaningfully different allocations on the same market state\.

#### MoE multi\-window robustness\.

The momentum expert achieved \+2\.70% alpha on the 14\-day window but negative alpha at 30–90 days\. Future work will train the growth and defensive experts explicitly on longer\-horizon objectives and evaluate whether the router correctly activates the right expert for each horizon\.

#### Online fine\-tuning\.

With a frozen encoder, the allocation head converges to a static ticker ranking that does not adapt to market regime shifts\. An online fine\-tuning mode that updates only the router and final allocation layers on recent 90\-day windows \(weekly cadence\) could bridge the training/test regime gap without full retraining\.

#### Larger universe and live trading\.

Current experiments use 10 tickers\. Scaling to the full S&P 500 universe via the curriculum \(50\+ tickers\) and evaluating on a live paper\-trading account over a 6\-month period would provide stronger evidence of generalisation\.

#### News and event integration\.

The fused market encoder with news cross\-attention \(§[3](https://arxiv.org/html/2606.30997#S3)\) is architecturally complete but not yet trained\. Integrating earnings surprises, analyst consensus upgrades, and macro releases as additional key\-value tokens may improve short\-window alpha\.

## References

- \[1\] Abels, A\., Roijers, D\., Lenaerts, T\., Nowé, A\., & Steckelmacher, D\. \(2019\)\. Dynamic weights in multi\-objective deep reinforcement learning\. ICML\.
- \[2\] Ansari, A\. F\. et al\. \(2024\)\. Chronos: Learning the language of time series\. arXiv:2403\.07815\.
- \[3\] Barreto, A\., Dabney, W\., Munos, R\., Hunt, J\. J\., Schaul, T\., van Hasselt, H\., & Silver, D\. \(2017\)\. Successor features for transfer in reinforcement learning\. NeurIPS\.
- \[4\] Bertsimas, D\., & Kallus, N\. \(2022\)\. From predictive to prescriptive analytics\. Management Science, 68\(1\), 43–63\.
- \[5\] D’Acunto, F\., & Rossi, A\. G\. \(2019\)\. New frontiers of robo\-advising: Consumption, saving, debt management, and taxes\. SSRN Working Paper\.
- \[6\] Das, A\., Kong, W\., Sen, R\., & Zhou, Y\. \(2023\)\. A decoder\-only foundation model for time\-series forecasting\. arXiv:2310\.10688\.
- \[7\] Deng, Y\., Bao, F\., Kong, Y\., Ren, Z\., & Dai, Q\. \(2016\)\. Deep direct reinforcement learning for financial signal representation and trading\. IEEE Transactions on Neural Networks and Learning Systems, 28\(3\), 653–664\.
- \[8\] Hayes, C\. F\. et al\. \(2022\)\. A practical guide to multi\-objective reinforcement learning and planning\. Autonomous Agents and Multi\-Agent Systems, 36\(1\), 26\.
- \[9\] Hu, E\. J\., Shen, Y\., Wallis, P\., Allen\-Zhu, Z\., Li, Y\., Wang, S\., Wang, L\., & Chen, W\. \(2022\)\. LoRA: Low\-rank adaptation of large language models\. ICLR 2022\.
- \[10\] Jiang, Z\., Xu, D\., & Liang, J\. \(2017\)\. A deep reinforcement learning framework for the financial portfolio management problem\. arXiv:1706\.10059\.
- \[11\] Liu, X\.\-Y\., Yang, H\., Chen, Q\., Zhang, R\., Yang, L\., Xiao, B\., & Wang, C\. D\. \(2021\)\. FinRL: A deep reinforcement learning library for automated stock trading in quantitative finance\. NeurIPS Workshop on Deep RL\.
- \[12\] BARRA\. \(1998\)\. United States Equity \(USE3\) Model Handbook\. BARRA Inc\., Berkeley, CA\.
- \[13\] Odean, T\. \(1998\)\. Are investors reluctant to realize their losses? Journal of Finance, 53\(5\), 1775–1798\.
- \[14\] Hirschman, A\. O\. \(1945\)\. National Power and the Structure of Foreign Trade\. University of California Press\.
- \[15\] Schulman, J\., Wolski, F\., Dhariwal, P\., Radford, A\., & Klimov, O\. \(2017\)\. Proximal policy optimization algorithms\. arXiv:1707\.06347\.
- \[16\] Sharpe, W\. F\. \(1966\)\. Mutual fund performance\. Journal of Business, 39\(S1\), 119–138\.
- \[17\] Sun, Q\., Zhou, W\., & Fan, J\. \(2018\)\. Adaptive Huber regression\. Journal of the American Statistical Association\. arXiv:1706\.06991\.
- \[18\] Xie, Q\., Han, W\., Zhang, X\., Lai, Y\., Peng, M\., Lopez\-Lira, A\., & Huang, J\. \(2023\)\. PIXIU: A large language model, instruction data and evaluation benchmark for finance\. arXiv:2306\.05443\.
- \[19\] Woo, G\., Liu, C\., Kumar, A\., Xiong, C\., Savarese, S\., & Sahoo, D\. \(2024\)\. Unified training of universal time series forecasting transformers\. ICML\.
- \[20\] Yang, H\., Liu, X\.\-Y\., & Wang, C\. D\. \(2023\)\. FinGPT: Open\-source financial large language models\. arXiv:2306\.06031\.
- \[21\] Hirschman, A\. O\. \(1945\)\. National Power and the Structure of Foreign Trade\. University of California Press, Berkeley\.
- \[22\] Ye, Y\., Pei, H\., Wang, B\., Chen, P\.\-Y\., Zhu, Y\., Xiao, J\., & Li, B\. \(2020\)\. Reinforcement\-learning based portfolio management with augmented asset movement prediction states\. Proceedings of the AAAI Conference on Artificial Intelligence\.