Bridging the Gap Between Natural Language and Market Dynamics via High-Dimensional Representation Learning


Summary

This paper explores replacing scalar sentiment scores with high-dimensional FinBERT embeddings in a Transformer-based architecture for short-term stock price prediction, showing improved accuracy with Siamese-optimized embeddings.

arXiv:2605.30652v1 Announce Type: new Abstract: Traditional multi-modal financial forecasting often relies on scalar sentiment scores, which fail to capture the nuances of financial news. To address this information loss, this paper explores high-dimensional representation learning by replacing discrete polarity ratings with dense FinBERT embeddings within a Transformer-based forecasting architecture. We benchmarked various embedding strategies on the FNSPID dataset, including raw embeddings, attention-weighted aggregation, and a custom Siamese network. While the attention-based mechanism struggled with the low signal-to-noise ratio typical of financial data, the integration of Siamese-optimized embeddings outperformed both the scalar baseline and raw embedding approaches, demonstrating that preserving high-dimensional narrative context yields improved predictive accuracy for short-term stock price movements.

# Bridging the Gap Between Natural Language and Market Dynamics via High-Dimensional Representation Learning
Source: [https://arxiv.org/html/2605.30652](https://arxiv.org/html/2605.30652)

###### Abstract

Traditional multi\-modal financial forecasting often relies on scalar sentiment scores, which fail to capture the nuances of financial news\. To address this information loss, this paper explores high\-dimensional representation learning by replacing discrete polarity ratings with dense FinBERT embeddings within a Transformer\-based forecasting architecture\. We benchmarked various embedding strategies on the FNSPID dataset, including raw embeddings, attention\-weighted aggregation, and a custom Siamese network\. While the attention\-based mechanism struggled with the low signal\-to\-noise ratio typical of financial data, the integration of Siamese\-optimized embeddings outperformed both the scalar baseline and raw embedding approaches, demonstrating that preserving high\-dimensional narrative context yields improved predictive accuracy for short\-term stock price movements\.

Yujin Jeong yujinjng@stanford\.edu

Noelle Jung noellej@stanford\.edu

Brian Y\. C\. Leung \(Mike\) mbl@stanford\.edu

## 1 Introduction

Recent studies [jun\_Gu\_2024](https://arxiv.org/html/2605.30652#bib.bib14), [10796670](https://arxiv.org/html/2605.30652#bib.bib20) demonstrate the utility of capturing financial sentiment for financial modeling\. Motivated by the shift towards high\-dimensional representation learning [vinden2025contrastivesimilaritylearningmarket](https://arxiv.org/html/2605.30652#bib.bib21), [guo2024finetuninglargelanguagemodels](https://arxiv.org/html/2605.30652#bib.bib9), this project aims to bridge the gap between sentiment classification and direct market prediction\. We first replicate established baselines [dong2024fnspid](https://arxiv.org/html/2605.30652#bib.bib6) that fuse structured data with prompt\-generated sentiment scores, then extend this work by replacing scalar sentiment with richer semantic representations\. Specifically, we investigate four approaches: leveraging raw FinBERT [Huang2023FinBERTAL](https://arxiv.org/html/2605.30652#bib.bib11) embeddings, unfreezing FinBERT for stock prediction, training a custom Siamese network to learn a task\-specific embedding, and using attention to weight sentiment averages\. The input to our algorithm is a sequence of 50 days of financial news embeddings and structured price data\. A Transformer\-based architecture then outputs a predicted closing price 3 days in the future\. Our code can be found on GitHub [Github](https://arxiv.org/html/2605.30652#bib.bib1)\.

## 2 Related work

Past research in multi\-modal stock prediction that we examined can be grouped into \(1\) scalar sentiment integration and \(2\) semantic representation\.

Scalar Sentiment Integration\. This traditional approach reduces text to a polarity score prior to fusing it with structured data\. We draw on Dong et al\. [dong2024fnspid](https://arxiv.org/html/2605.30652#bib.bib6), who established baselines on the Financial News and Stock Price Integration Dataset \(FNSPID\) by combining ChatGPT\-derived discrete sentiment ratings with Transformer architectures\. Similarly, Gu et al\. [jun\_Gu\_2024](https://arxiv.org/html/2605.30652#bib.bib14) proposed the FinBERT\-LSTM model, which uses FinBERT to generate sentiment indicators that are fed into an LSTM alongside historical prices\. While this approach is computationally efficient and interpretable, we posit that it suffers from significant information loss: projecting nuanced financial narratives into scalars discards the context required for detecting market signals\.

Semantic Representation and Direct Prediction\. Recent state\-of\-the\-art research shifts toward mapping dense text embeddings directly to market movements\. Works like [vinden2025contrastivesimilaritylearningmarket](https://arxiv.org/html/2605.30652#bib.bib21) and [guo2024finetuninglargelanguagemodels](https://arxiv.org/html/2605.30652#bib.bib9) apply contrastive learning and fine\-tuning of general LLMs \(e\.g\., Mistral, LLaMA\) to this end, capturing task\-specific semantic nuances\. Our experiments align with this approach but differ by using article summaries rather than headlines to capture richer context\. We also apply FinBERT [Huang2023FinBERTAL](https://arxiv.org/html/2605.30652#bib.bib11) together with latent semantic analysis \(LSA\) summarization [gong2001generic](https://arxiv.org/html/2605.30652#bib.bib8) to the FNSPID dataset to set a new useful benchmark\. Unlike generic models, FinBERT is pre\-trained specifically on financial texts to better grasp domain\-specific contexts\. LSA employs singular value decomposition to extract semantically dense sentences, balancing input length against information density\.

## 3 Dataset and Features

We used FNSPID [dong2024fnspid](https://arxiv.org/html/2605.30652#bib.bib6), selecting Google \(GOOG\), Microsoft \(MSFT\), Nvidia \(NVDA\), Apple \(AAPL\), and Amazon \(AMZN\)\. These tickers were chosen to match the original selection criteria from [dong2024fnspid](https://arxiv.org/html/2605.30652#bib.bib6), which focused on the 5 most influential stocks in the S&P 500\.

### 3\.1 Dataset Construction for Market Prediction

FNSPID consists of daily stock metrics aligned with financial news summaries\. We constructed a time\-series corpus spanning the available daily price\-news history for the selected tickers, resulting in a total of 37,707 records\. The data was split chronologically to prevent data leakage, allocating the first 80% for training and the most recent 20% for validation, according to backtesting best practices\. This resulted in 30,165 training examples and 7,542 validation examples\.

Feature Engineering and Normalization\. Consistent with [dong2024fnspid](https://arxiv.org/html/2605.30652#bib.bib6), we used Open, Close, and Trade Volume as structured inputs\. To ensure numerical stability, these features were Min\-Max normalized per stock and per data partition\.
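As a minimal sketch of the split and scaling described above, the following pure-Python example splits a time-ordered series chronologically and then applies Min-Max scaling separately within each partition. The helper names `chronological_split` and `min_max_scale` and the toy price series are illustrative assumptions, not taken from the authors' code.

```python
def chronological_split(rows, train_frac=0.8):
    """Split time-ordered rows into (train, validation) without shuffling."""
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]


def min_max_scale(values):
    """Scale one feature of one stock within one partition to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant series: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy closing prices, already sorted by date (hypothetical values).
closes = [float(i) for i in range(1, 11)]
train, val = chronological_split(closes)

# Each partition uses only its own min and max, so no statistics
# cross the train/validation boundary.
train_scaled = min_max_scale(train)
val_scaled = min_max_scale(val)
```

Applied to 37,707 time-ordered records, an 80% cut computed this way gives 30,165 training and 7,542 validation rows, which is consistent with the counts reported above. In practice the scaling would be repeated for every stock and every feature.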

Sentiment Signal Processing\. First, we replaced the ChatGPT API used in [dong2024fnspid](https://arxiv.org/html/2605.30652#bib.bib6) with FinBERT\. For the scalar baseline, FinBERT probabilities \($P_{pos}$, $P_{neg}$, $P_{neu}$\) were mapped to a scalar sentiment score $S \in [1, 5]$ using the formula:

$$S = (P_{neg} \times 1.0) + (P_{neu} \times 3.0) + (P_{pos} \times 5.0) \tag{1}$$

Daily scores were computed by averaging over all articles released on a given trading day\. Data sparsity proved to be a challenge, with coverage of summaries for our selected stocks averaging only 7\.3%\. To impute missing sentiment, we adopted the recursive decay mechanism used by [dong2024fnspid](https://arxiv.org/html/2605.30652#bib.bib6)\. This approach assumes market sentiment persists but decays toward neutrality \($S_{neutral} = 3$\) over time:

$$S(t) = S_{neutral} + (S(0) - S_{neutral}) \cdot e^{-\lambda t} \tag{2}$$

where $\lambda = 0.03$ represents the decay rate\. For our later explorations on high\-dimensional embeddings, we used forward\-filling to impute missing embeddings\.
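The two formulas above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the interpretation of $t$ as the number of trading days since the last observed score, and the choice to fall back to the neutral score when no earlier news exists are assumptions rather than details confirmed by the paper.

```python
import math

NEUTRAL = 3.0  # S_neutral in Eq. (2)


def sentiment_score(p_neg, p_neu, p_pos):
    """Eq. (1): probability-weighted score in [1, 5]."""
    return p_neg * 1.0 + p_neu * 3.0 + p_pos * 5.0


def decayed_score(s0, t, lam=0.03, neutral=NEUTRAL):
    """Eq. (2): score t days after an observed score s0."""
    return neutral + (s0 - neutral) * math.exp(-lam * t)


def impute(daily_scores, lam=0.03):
    """Fill None entries by decaying the last observed score toward neutral."""
    filled = []
    last, age = None, 0
    for s in daily_scores:
        if s is not None:
            last, age = s, 0
            filled.append(s)
        elif last is None:
            filled.append(NEUTRAL)  # assumption: no prior news means neutral
        else:
            age += 1
            filled.append(decayed_score(last, age, lam))
    return filled


series = impute([5.0, None, None, 2.0, None])
```

With $\lambda = 0.03$, a fully positive score of 5 is still about 4.94 one day later, so the imputed signal fades slowly across gaps in news coverage.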

For visualization of sentiment data, refer to Figures [4](https://arxiv.org/html/2605.30652#S8.F4) and [5](https://arxiv.org/html/2605.30652#S8.F5) in the Appendix\.

Time\-Series Discretization\. Following the methodology established in [dong2024fnspid](https://arxiv.org/html/2605.30652#bib.bib6), the data was arranged into sliding windows consisting of a lookback window of $L = 50$ days to predict a target horizon of $H = 3$ days into the future\. Thus our final input and output dimensions become:

- Input: $X \in \mathbb{R}^{B \times 50 \times F}$
- Output: $y \in \mathbb{R}^{B \times 1}$, representing the Close price at $t + 3$

where $B$ is the batch size and $F$ is the number of features\. For our baseline replication, $B = 64$ and $F = 4$\.
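The windowing step can be sketched with NumPy as follows. The function name `make_windows` and the toy data are hypothetical, and the indexing convention (window days $0$ to $49$, target at day $52$) is an assumption chosen to agree with the example target in Table 3.

```python
import numpy as np


def make_windows(features, close, lookback=50, horizon=3):
    """Build sliding windows.

    features: array of shape (T, F); close: array of shape (T,).
    Returns X of shape (N, lookback, F) and y of shape (N, 1).
    """
    X, y = [], []
    # `end` is one past the last day of the window; the target lies
    # `horizon` days after the last day in the window.
    for end in range(lookback, len(features) - horizon + 1):
        X.append(features[end - lookback:end])
        y.append(close[end - 1 + horizon])
    return np.stack(X), np.asarray(y).reshape(-1, 1)


# Toy series: 60 days, 4 features (e.g. Open, Close, Volume, sentiment).
features = np.arange(60 * 4, dtype=float).reshape(60, 4)
close = features[:, 1]
X, y = make_windows(features, close)
```

A series of $T$ days yields $T - L - H + 1$ windows, so the 60-day toy series above produces 8 samples before batching.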

#### 3\.1\.1 Example data

Table 1: Example 50\-day window for AAPL

Table 2: Example news summaries for AAPL

| Target date | Target Close price |
| --- | --- |
| 2023\-03\-28 \($t = 52$\) | 157\.01 |

Table 3: Target variable for Table [1](https://arxiv.org/html/2605.30652#S3.T1)

### 3\.2 Dataset Construction for Representation Learning

To train our custom Siamese network we curated a distinct subset of data from the FNSPID tables\.

Label Generation for Contrastive Learning\. Our Siamese network requires pairs of news items labeled by their market impact\. We defined the ground\-truth market movement $y_{market}$ as the forward close\-to\-open return, calculated as the change from the previous day’s close to the next day’s open:

$$y_{market} = \frac{\text{Open}_{t+1} - \text{Close}_{t-1}}{\text{Close}_{t-1}} \tag{3}$$

We selected the previous day’s close \($\text{Close}_{t-1}$\) as the baseline to ensure the reference price is strictly prior to any information released on day $t$, avoiding lookahead bias\. We filtered outliers using an Isolation Forest [liu2008isolation](https://arxiv.org/html/2605.30652#bib.bib15) \(5% contamination\) and removed extreme price movements \(below the 1st and above the 99th percentiles\) to ensure stability, leaving 1,991,435 rows of data\. Refer to Figure [6](https://arxiv.org/html/2605.30652#S8.F6) in the Appendix for the distribution of $y_{market}$ values\.
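The label in Eq. (3) can be computed directly from aligned open and close series. The sketch below uses hypothetical helper names and toy prices, and it omits the Isolation Forest and percentile filtering steps.

```python
def forward_return(open_next, close_prev):
    """Eq. (3): return from the close of day t-1 to the open of day t+1."""
    return (open_next - close_prev) / close_prev


def market_labels(opens, closes):
    """Map each day t that has both a t-1 and a t+1 to its y_market value."""
    return {
        t: forward_return(opens[t + 1], closes[t - 1])
        for t in range(1, len(opens) - 1)
    }


# Toy prices for four consecutive trading days (hypothetical values).
opens = [100.0, 101.0, 102.0, 103.0]
closes = [100.0, 102.0, 101.0, 104.0]
labels = market_labels(opens, closes)
```

The first and last days of a series receive no label because they lack a previous close or a next open.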

Discretization\. To generate positive and negative pairs, we discretized the continuous $y_{market}$ values into bins\. We experimented with three binning strategies: \(1\) Quartile, \(2\) Median, and \(3\) Tercile\.

Sampling\. We sampled 15,000 distinct price\-news pairs, split into 12,000 for training/validation and a held\-out set of 3,000 for testing the embedding quality across all three bin strategies\. We excluded news from the 5 stocks used for the downstream task, since Siamese training pairs should be isolated from the validation set of the end\-to\-end Transformer architecture to prevent data leakage\.

## 4 Methods

Our research consists of three phases: replicating the scalar baseline, introducing raw embedding integration, and optimizing the embedding space via unfreezing FinBERT, Siamese networks, and attention\-based aggregation\.

### 4\.1 Benchmark Replication

In this phase, we replicated both the Transformer and LSTM baselines\. As [dong2024fnspid](https://arxiv.org/html/2605.30652#bib.bib6) did not disclose specific stock symbols or hyperparameters, we approximated the experiment setup to generate comparable results\. Consequently, our validation focused on matching relative performance trends rather than exact numerical values\. While we observed the expected performance gains when adding FinBERT sentiment scores to the Transformer, the LSTM model displayed mixed results rather than consistently benefiting from the sentiment data\. This aligns with the findings of [dong2024fnspid](https://arxiv.org/html/2605.30652#bib.bib6)\. From this point onward, we focus on the Transformer model since [dong2024fnspid](https://arxiv.org/html/2605.30652#bib.bib6) highlighted it as the best performing across six tested architectures\.

### 4\.2 Replacing Sentiment Score with Embedding

We hypothesized that both ChatGPT and FinBERT’s final projection layer remove valuable information when producing a sentiment score\. This motivated us to remove the projection layer and feed the FinBERT embedding directly into the training and inference input\. Validating this hypothesis was critical in determining whether to continue improving upon embeddings for the downstream task of stock prediction\.

### 4\.3 Further Embedding Improvement

Although FinBERT improves upon BERT for analyzing text in a financial context, it was trained on linguistic tasks rather than stock market forecasting\. We hypothesized that inputting optimized FinBERT embeddings into our Transformer together with the structured data would lead to better predictions\. We approached this task in 3 different ways: \(1\) unfreezing FinBERT layers, \(2\) applying contrastive learning, and \(3\) using attention weighted aggregation\.

#### 4\.3\.1 Unfreezing FinBERT

End\-to\-end training typically yields higher performance as the encoder learns to extract exactly the information downstream layers need\.

In [dong2024fnspid](https://arxiv.org/html/2605.30652#bib.bib6)’s baseline architecture, sentiment scores are precomputed and fed into the model’s input, which prevents backpropagation into the FinBERT layers\. We prepared $(\textbf{news\_tokens},\ \textit{open\_price},\ \textit{close\_price},\ \textit{trade\_volume})$ instead of $(\textbf{sentiment\_score},\ \textit{open\_price},\ \textit{close\_price},\ \textit{trade\_volume})$ as inputs, so that FinBERT could be brought into the model “in\-graph”\.

#### 4\.3\.2 Contrastive Learning

To align the semantic space of our text embeddings with financial outcomes, we implemented a Siamese network\.

Network Architecture\. The architecture consists of two identical sub\-networks \(towers\) with shared weights\. The input to each tower is the 768\-dimensional embedding vector generated by a pretrained FinBERT model\. Rather than using these high\-dimensional embeddings directly, we process them through a Multi\-Layer Perceptron \(MLP\) to introduce non\-linearity and to filter linguistic noise\. The specific sequence of layers in each tower is: Linear Layer \($768 \rightarrow 256$\), ReLU Activation, Linear Layer \($256 \rightarrow 64$\)\. The output is a pair of 64\-dimensional vectors representing the two input articles in the optimized manifold\.
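To make the shared-weight structure concrete, here is a forward-pass sketch of one tower in NumPy. Only the layer sizes come from the text; the random initialization scale, the helper names, and the random stand-in inputs (used in place of real FinBERT embeddings) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_tower(d_in=768, d_hidden=256, d_out=64):
    """Create parameters for Linear(768->256) -> ReLU -> Linear(256->64)."""
    return {
        "W1": rng.normal(0.0, 0.02, (d_in, d_hidden)),
        "b1": np.zeros(d_hidden),
        "W2": rng.normal(0.0, 0.02, (d_hidden, d_out)),
        "b2": np.zeros(d_out),
    }


def tower(params, x):
    """Project a batch of 768-d embeddings to 64-d vectors."""
    h = np.maximum(x @ params["W1"] + params["b1"], 0.0)  # Linear + ReLU
    return h @ params["W2"] + params["b2"]


params = init_tower()

# Random stand-ins for two batches of FinBERT article embeddings.
x1 = rng.normal(size=(2, 768))
x2 = rng.normal(size=(2, 768))

# Both articles pass through the SAME parameters (shared weights).
z1, z2 = tower(params, x1), tower(params, x2)
dist = np.linalg.norm(z1 - z2, axis=1)  # Euclidean distance per pair
n_params = sum(p.size for p in params.values())
```

Because both inputs go through a single parameter set, the tower has 213,312 parameters in total regardless of how many pairs are compared.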

![Refer to caption](https://arxiv.org/html/2605.30652v1/siamese_diagram.png)

Figure 1: Diagram of Siamese network architecture

Quantile Binning\. We constructed training pairs $(X_1, X_2)$ from our dataset using quantile binning on $y_{market}$\. We labeled pairs as "similar" \($Y = 1$\) if they fell into the same bin and "dissimilar" \($Y = 0$\) otherwise\. We chose to use quantile bins because financial data is inherently stochastic\. We experimented with three binning strategies to determine the optimal method for constructing contrastive pairs: \(1\) Quartile \(25/50/25\), \(2\) Median \(50/50\), and \(3\) Tercile \(33/33/33\)\.
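The binning and pair-labeling rule can be sketched as follows. The quantile edges for each strategy are inferred from the percentages quoted above, and the function names and toy returns are hypothetical.

```python
import numpy as np

# Quantile cut points inferred from the three strategies named in the text.
STRATEGIES = {
    "quartile": (0.25, 0.75),   # 25/50/25
    "median": (0.5,),           # 50/50
    "tercile": (1 / 3, 2 / 3),  # 33/33/33
}


def quantile_bins(y, edges):
    """Assign each return to a bin index 0..len(edges) via quantile cuts."""
    cuts = np.quantile(y, edges)
    return np.digitize(y, cuts)


def pair_label(bin_a, bin_b):
    """Y = 1 ("similar") if both items share a bin, else Y = 0."""
    return 1 if bin_a == bin_b else 0


# Toy y_market values (hypothetical).
y_market = np.array([-0.03, -0.02, -0.01, 0.0, 0.01, 0.02])
bins = quantile_bins(y_market, STRATEGIES["tercile"])
similar = pair_label(bins[0], bins[1])     # both in the lowest tercile
dissimilar = pair_label(bins[0], bins[5])  # lowest vs. highest tercile
```

Under this rule the share of "similar" pairs depends on the bin widths, which is one reason the three strategies can produce different class balances.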

Contrastive Loss\. The network was trained using Contrastive Loss, which minimizes the Euclidean distance $D_w$ between positive pairs and maximizes the distance between negative pairs up to a margin $m = 1.0$\. The loss function is defined as:

$$L = \frac{1}{2}\left[ Y \cdot D_w^2 + (1 - Y) \cdot \max\{0,\, m - D_w\}^2 \right] \tag{4}$$

By minimizing this loss, the shared encoder learns a metric space where proximity reflects market movement similarity\.
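Eq. (4) for a single pair can be written in a few lines of plain Python. This is a sketch for checking the formula by hand; the function names are illustrative, and a training implementation would average the loss over a batch.

```python
import math


def euclidean(a, b):
    """Euclidean distance D_w between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(z1, z2, y, margin=1.0):
    """Eq. (4): y = 1 for similar pairs, y = 0 for dissimilar pairs."""
    d = euclidean(z1, z2)
    return 0.5 * (y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2)


# A similar pair is penalized for being far apart ...
loss_similar_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], y=1)
# ... while a dissimilar pair beyond the margin incurs no loss.
loss_dissimilar_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], y=0)
# A dissimilar pair that collapses to the same point pays m^2 / 2.
loss_dissimilar_same = contrastive_loss([0.0, 0.0], [0.0, 0.0], y=0)
```

The margin term means dissimilar pairs stop contributing gradients once they are at least $m$ apart, so the loss focuses on pairs that are still confusable.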

#### 4\.3\.3 Attention\-Based Daily Sentiment Aggregation

The standard mean pooling approach, as used in [dong2024fnspid](https://arxiv.org/html/2605.30652#bib.bib6), is suboptimal due to significant variance in daily news volume, causing high\-volume days to dilute high\-signal narratives with noise\. Refer to Figure [7](https://arxiv.org/html/2605.30652#S8.F7) in the Appendix for an example of the variance in daily news volume for Nvidia\.

To mitigate this, we implemented a stock\-specific attention mechanism\. We formulate a Query\-Key\-Value operation where the Query \(Q\) is a learnable stock\-specific vector, the Key \(K\) represents the article’s semantic topic via SentenceBERT, and the Value \(V\) is the FinBERT sentiment embedding\. We compute attention weights $\alpha_{s,i}$ via softmax on the scaled dot\-product of the stock query $q_s$ and article key $k_i$:

$$\alpha_{s,i} = \text{softmax}\left(\frac{q_s \cdot k_i}{\sqrt{d_k}}\right) = \frac{\exp\left(q_s \cdot k_i / \sqrt{d_k}\right)}{\sum_{j=1}^{|N|} \exp\left(q_s \cdot k_j / \sqrt{d_k}\right)} \tag{5}$$

where $d_k$ is the key dimension and $|N|$ is the number of articles for the stock on that day\. These weights then aggregate the FinBERT values into the daily sentiment embedding $E_{s,t}$:

$$E_{s,t} = \sum_{i=1}^{|N|} \alpha_{s,i} v_i \tag{6}$$

This process dynamically emphasizes articles whose topics align with the stock’s learned profile\.
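Eqs. (5) and (6) for one stock on one day can be sketched with NumPy. The 384- and 768-dimensional sizes follow the key and value dimensions reported in the experiments; the function name and the random stand-in keys and values are assumptions, and the query here is a fixed array rather than a trained parameter.

```python
import numpy as np


def aggregate_day(q_s, K, V):
    """Attention-weighted daily embedding for one stock.

    q_s: (d_k,) stock query; K: (N, d_k) article keys; V: (N, d_v) values.
    Returns (E, alpha) with E of shape (d_v,) and alpha of shape (N,).
    """
    d_k = K.shape[1]
    scores = K @ q_s / np.sqrt(d_k)     # scaled dot products, Eq. (5)
    scores = scores - scores.max()      # subtract max for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ V, alpha             # weighted sum of values, Eq. (6)


rng = np.random.default_rng(0)
K = rng.normal(size=(5, 384))   # stand-in SentenceBERT topic embeddings
V = rng.normal(size=(5, 768))   # stand-in FinBERT embeddings
q = rng.normal(size=384)        # stand-in for the learnable stock query

E, alpha = aggregate_day(q, K, V)

# With an all-zero query every article gets equal weight, so the
# aggregation reduces to mean pooling.
E_uniform, alpha_uniform = aggregate_day(np.zeros(384), K, V)
```

The zero-query case shows that mean pooling is a special case of this mechanism, which is also what the weights drift toward when attention fails to separate articles.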

## 5 Experiments

Table 4: Performance of LSTM\-based Baseline Models

Table 5: Performance of Transformer\-based Baseline and Experiment Models

### 5\.1 Benchmark Replication

We follow [dong2024fnspid](https://arxiv.org/html/2605.30652#bib.bib6)’s evaluation framework and use MSE, MAE, and $R^2$ as our comparison yardsticks\. We selected the same hyperparameters used by [dong2024fnspid](https://arxiv.org/html/2605.30652#bib.bib6) \(100 epochs, 4 vanilla transformer layers\)\. We use the Mean Squared Error \(MSE\) as our loss function, consistent with [dong2024fnspid](https://arxiv.org/html/2605.30652#bib.bib6), defined as:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \tag{7}$$

where $N$ represents the number of samples in the batch, $y_i$ denotes the ground truth value, and $\hat{y}_i$ is the predicted value generated by the model\. This configuration remains consistent across all replication and experimental models\.
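For reference, the three evaluation metrics can be computed as in the sketch below. MSE follows Eq. (7); MAE and $R^2$ use their standard textbook definitions, which the paper does not spell out, so treat those two as assumptions about the exact variant used.

```python
def mse(y, y_hat):
    """Mean squared error, Eq. (7)."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def r2(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2(y_true, y_true)              # exact predictions
mean_only = r2(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
```

An $R^2$ of 0 corresponds to always predicting the mean of the targets, which gives a useful reference point when reading the results tables.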

We made reasonable choices for parameters that were not explicitly published \(e\.g\., the default AdamW optimizer, 4 attention heads, a dropout of 0\.2, and a learning rate scheduler with a 30% ramp\-up and a maximum learning rate of $1 \times 10^{-3}$\) and kept these consistent across experiments\. We found that these choices gave us more consistent loss improvement trends and helped prevent overfitting\. We experimented with learning rates and found that very low values waste compute because neither training nor validation loss changes, while high values cause the training loss to fluctuate\. Using a learning rate scheduler with a 30% ramp\-up period helped us achieve a relatively consistent downward loss trend and generalized well to our subsequent experiments\. Dropout, as expected, helped reduce overfitting to the training set\.
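One way such a schedule could look is sketched below. Only the 30% ramp-up fraction and the maximum learning rate of $1 \times 10^{-3}$ come from the text; the linear shape of the ramp-up and the cosine decay afterwards are assumptions, since the paper does not state which scheduler was used.

```python
import math


def lr_at(step, total_steps, max_lr=1e-3, warmup_frac=0.3):
    """Learning rate at a given step (0-indexed).

    Assumed shape: linear ramp-up over the first `warmup_frac` of
    training, then cosine decay toward zero.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at(s, total_steps=100) for s in range(100)]
peak_step = max(range(100), key=lambda s: schedule[s])
```

Starting from a small learning rate avoids large early updates while the Transformer weights are still close to their random initialization, which matches the more stable loss trend described above.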

For normalization, [dong2024fnspid](https://arxiv.org/html/2605.30652#bib.bib6) mentioned that they used Min\-Max scaling, but did not specify exactly how\. We tried normalizing stock prices globally and found that it led to a data leakage problem that resulted in inflated validation metrics, as the training data sees some information from the validation set\. We tested different normalization configurations and settled on strictly separating normalization between the training and validation partitions, and doing this per stock, using Min\-Max scaling\. Normalization was applied to targets as well\.

We observed that adding FinBERT sentiment achieves the same directional effect as in FNSPID’s benchmark \(see Tables [4](https://arxiv.org/html/2605.30652#S5.T4) and [5](https://arxiv.org/html/2605.30652#S5.T5)\)\. On Transformer\-based models, we achieved a larger improvement across all 3 metrics\.

Table 6: Performance impact of sentiment score on Transformer baseline

Table 7: Performance impact of sentiment score on LSTM baseline

### 5\.2 Replacing Sentiment Score with Embedding

For this experiment we used the same hyperparameters as the benchmark replication in order to maintain consistency for a direct comparison of metrics\.

The results of this experiment \(see Table [8](https://arxiv.org/html/2605.30652#S5.T8)\) validated our hypothesis that replacing the sentiment score with a daily embedding, calculated as the average of the individual last\-layer FinBERT article embeddings, would produce better predictions\. Going forward, “Transformer \+ Embeddings \(Frozen FinBERT\)” serves as our iteration platform for experimental text embeddings for the task of stock price prediction\.

Table 8: Performance impact of news embeddings on Transformer model

### 5\.3 Further Embedding Improvement

#### 5\.3\.1 Unfreezing FinBERT

Our first step in attempting to improve the sentiment embedding was task\-specific fine\-tuning\. We discovered that fine\-tuning even one layer of FinBERT is expensive\. With FinBERT frozen, training only 4 vanilla Transformer layers with 4 heads and a representation dimensionality of 32 involved 75k trainable parameters\. Unfreezing just one layer of FinBERT would raise this to 7\.7 million trainable parameters\. With our data and model configuration, each epoch would take ~50 minutes to train on a T4 GPU\. For a fair comparison, each full iteration would take approximately 100 hours\. We finished training 1 epoch, but ultimately abandoned this direction and opted for the more computationally efficient directions discussed below\.

#### 5\.3\.2 Contrastive Learning

We trained the Siamese network using the Adam optimizer with an initial learning rate of $1 \times 10^{-4}$\. We chose this lower rate to keep the training process stable, as contrastive loss can be sensitive to large updates\. We employed a batch size of 64, which fit within our memory limits\. The model was trained for 15 epochs with a learning rate scheduler that decayed the learning rate when validation loss plateaued, allowing fine\-grained optimization in later epochs\.

We evaluated the three binning strategies on a common hold\-out test set to ensure fair comparison\. As shown in Figure [8](https://arxiv.org/html/2605.30652#S8.F8) in the Appendix, all three strategies resulted in ROC AUC scores near random chance \(0\.5\), indicating that the Siamese network struggled to learn a robust manifold that linearly separates market reactions based solely on text embeddings\. This is expected, as each news summary represents only a small part of the information that the market considers, so we only expect it to contribute marginally to the daily market movement\.

While the Quartile strategy achieved AUC comparable to Tercile \(0\.509 vs\. 0\.505\), a deeper inspection of the confusion matrices revealed significant issues with model bias\.

![Refer to caption](https://arxiv.org/html/2605.30652v1/quartile_cm.png)
\(a\) Quartile 25/50/25 binning

![Refer to caption](https://arxiv.org/html/2605.30652v1/tercile_cm.png)
\(b\) Tercile 33/33/33 binning

Figure 2: Confusion matrices for binning strategies

The confusion matrix for the Quartile strategy \(Figure [2\(a\)](https://arxiv.org/html/2605.30652#S5.F2.sf1)\) shows a severe imbalance\. The model correctly identified 962 dissimilar pairs \(True Negatives\) but only 258 similar pairs \(True Positives\), resulting in a Recall of just 20\.9%\. This suggests the model collapsed into predicting "Dissimilar" for the majority of inputs\. In contrast, the Tercile strategy demonstrated a much more balanced confusion matrix \(Figure [2\(b\)](https://arxiv.org/html/2605.30652#S5.F2.sf2)\)\. It achieved a Recall of 49\.7% and a Precision of 49\.6%\. Although the overall AUC was slightly lower, this strategy forced the model to actually learn features distinguishing the two classes\.

Loss curves for each Siamese network \(see Figures [9](https://arxiv.org/html/2605.30652#S8.F9), [10](https://arxiv.org/html/2605.30652#S8.F10), and [11](https://arxiv.org/html/2605.30652#S8.F11) in the Appendix\) show overfitting in the Quartile and Median Split binning strategies\. We addressed overfitting by tuning hyperparameters such as dropout, weight decay, and learning rate, but were not able to further reduce variance without creating an upward trend in validation loss\. This is likely due to the inherent difficulty of the task as discussed above\.

We chose the Tercile strategy for our final model, which seemed to be the most balanced and robust to model collapse\. The Tercile strategy also showed slightly more stable validation loss during training, indicating it was less likely to memorize noise in the training data\. This combination of better class balance and training stability made it the best candidate for generating the embeddings needed for our Transformer model\.

End\-to\-end Performance\. The Transformer integrated with Siamese Network embeddings achieved the highest predictive performance metrics, reaching an MSE of 0\.078109 and an $R^2$ of 0\.969052 \(see Table [5](https://arxiv.org/html/2605.30652#S5.T5)\)\.

#### 5\.3\.3 Attention Weighted Aggregation

For the attention mechanism, we used:

- Value Vectors \(V\): 768\-dimensional FinBERT embeddings for each article
- Key Vectors \(K\): 384\-dimensional SentenceBERT embeddings representing article topics
- Query Vectors \(Q\): learnable 384\-dimensional stock\-specific vectors initialized randomly

The Q dimension was chosen to enable dot\-product calculation with K\.

Table 9: Performance Comparison of Aggregation Strategies \(Horizon = 3 Days\)

Table [9](https://arxiv.org/html/2605.30652#S5.T9) summarizes the performance of the Attention\-Based Aggregation compared to the standard Mean Pooling baseline on the test set\. Despite the theoretical appeal of filtering noise via attention, the experimental results suggest several factors contributed to its failure in this specific context\.

Financial news datasets are characterized by a low Signal\-to\-Noise Ratio \(SNR\)\. As observed in similar domains [jain2019attentionexplanation](https://arxiv.org/html/2605.30652#bib.bib13), attention mechanisms trained on noisy, limited data often suffer from "attention collapse," where the learned weights $\alpha_{s,i}$ fail to distinguish meaningfully between high\-value and low\-value articles, effectively reverting to a \(noisier\) uniform average\.

Introducing learnable queries and key\-value projections added significant complexity to the optimization landscape\. Without a dedicated auxiliary loss to guide the attention mechanism \(e\.g\., a supervised signal on which articles were important\), the model struggled to learn meaningful associations purely from future price movements\.

### 5\.4 Tying Results to the Real World

To translate our model’s predictive performance into practical trading outcomes, we conducted a simple event\-driven backtest\. We initiated a hypothetical long position when the model predicted a positive price movement three days ahead; otherwise, we held cash to accrue the risk\-free rate\. The results were consistent across all models, including the baseline\. With directional accuracies hovering near random chance \(~50%\), none were able to outperform a passive buy\-and\-hold benchmark \(see Figure[3](https://arxiv.org/html/2605.30652#S5.F3)\)\.
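The trading rule described above can be sketched as a simple loop. This is a simplified illustration rather than the authors' backtest: it assumes non-overlapping holding periods, ignores transaction costs, and uses hypothetical function names and toy returns.

```python
def backtest(pred_returns, realized_returns, rf_per_period=0.0):
    """Compare a long-or-cash strategy against buy-and-hold.

    Goes long for a period when the predicted return is positive,
    otherwise holds cash earning `rf_per_period`.
    Returns (strategy_value, buy_and_hold_value), both starting at 1.0.
    """
    strategy, hold = 1.0, 1.0
    for pred, real in zip(pred_returns, realized_returns):
        strategy *= (1.0 + real) if pred > 0 else (1.0 + rf_per_period)
        hold *= 1.0 + real
    return strategy, hold


def directional_accuracy(pred_returns, realized_returns):
    """Share of periods where predicted and realized directions agree."""
    hits = sum(
        (p > 0) == (r > 0) for p, r in zip(pred_returns, realized_returns)
    )
    return hits / len(pred_returns)


# Toy predicted and realized period returns (hypothetical values).
preds = [0.01, -0.02, 0.03]
reals = [0.02, -0.01, -0.03]
strategy_value, hold_value = backtest(preds, reals)
accuracy = directional_accuracy(preds, reals)
```

When directional accuracy sits near 50%, the strategy's periods in the market are close to a random subset of all periods, so it has no systematic way to beat holding the asset throughout.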

![Refer to caption](https://arxiv.org/html/2605.30652v1/backtest.png)

Figure 3: Comparison of sentiment\-enhanced models vs\. a passive buy\-and\-hold strategy

These results align with the Efficient Market Hypothesis [emh](https://arxiv.org/html/2605.30652#bib.bib7), suggesting that current asset prices largely reflect available information\. Our findings demonstrate that even a sophisticated NLP\-derived sentiment factor lacks the predictive power required to generate statistically significant alpha when used in isolation\.

This does not imply that markets are perfectly efficient or that quantitative strategies are futile\. Rather, it highlights the high bar for efficiency in modern markets\. Single\-factor signals are often quickly identified and arbitraged away, a phenomenon well\-documented by McLean and Pontiff [mclean](https://arxiv.org/html/2605.30652#bib.bib17)\. In this competitive environment, edge is rarely found in isolation; instead, successful quantitative strategies rely on the robust aggregation of many weakly correlated factors [valmoe](https://arxiv.org/html/2605.30652#bib.bib4)\.

## 6 Conclusion

This study evaluated high\-dimensional representation learning for financial forecasting, moving beyond scalar sentiment scores to preserve narrative nuance\. Our benchmarks demonstrate that the Transformer integrated with Siamese Network embeddings outperformed all other architectures, including those using raw FinBERT embeddings and attention weighted aggregation\. Although the Siamese network failed to effectively learn a "market\-aware" geometric space, its embeddings combined with structured data improved predictions\. Contrary to our hypothesis, the attention weighted average mechanism failed to improve performance\. We attribute this to the low Signal\-to\-Noise Ratio \(SNR\) of financial news, which often leads to attention collapse, and the limitation of using static learnable query vectors to model dynamic market interests\. A static query cannot adapt to the rapidly shifting narratives that drive stock prices over time\.

Future research should prioritize addressing the attention collapse issue, potentially through the use of Differential Transformers [ye2024differential](https://arxiv.org/html/2605.30652#bib.bib23), which use subtractive attention to cancel noise\. Additionally, the unfreezing of FinBERT, deprioritized in this study due to compute constraints, remains a promising avenue for end\-to\-end optimization\. Further exploration is needed to validate the model’s predictive power and eliminate autocorrelation bias\. Finally, the experimental scope should be expanded beyond the current five\-ticker subset to the full FNSPID dataset to assess generalizability across diverse market sectors\.

## 7 Contributions

All members contributed equally to literature review, idea selection, and writing reports\.

Noelle focused on the LSTM baseline replication with and without sentiment scores and the contrastive learning experiment\. The latter entailed creating scripts for processing news and price data, training the Siamese network, integrating the embeddings into the Transformer architecture, training the end\-to\-end model, and evaluating results\.

Mike built initial prototypes of the FinBERT Siamese network, trained a 1\-day look\-back variant of the baseline, trained the baseline variant that replaced sentiment scores with embeddings, trained the fine\-tuned FinBERT variant, and built the backtesting code\. He also set up GitHub and contributed to early data engineering and Colab pipeline setup\.

Yujin did data engineering and feature extraction using AWS Athena and SQL\. She developed and refined early Siamese network experiments to formulate experiment ideas, conducted the benchmark replication of Transformer\-based models, and implemented Attention\-Based Daily Sentiment Aggregation\.

## References

- \[1\] GitHub: github\.com/hkmamike/market\-encoder\.
- \[2\] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G\. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng\. TensorFlow: A system for large\-scale machine learning\. In 12th USENIX Symposium on Operating Systems Design and Implementation \(OSDI 16\), pages 265–283, 2016\.
- \[3\] Amazon Web Services\. Boto3: The AWS SDK for Python, 2023\.
- \[4\] Clifford S\. Asness, Tobias J\. Moskowitz, and Lasse Heje Pedersen\. Value and momentum everywhere\. The Journal of Finance, 68\(3\):929–985, 2013\.
- \[5\] AWS Professional Services\. AWS Data Wrangler, 2023\.
- \[6\] Zihan Dong, Xinyu Fan, and Zhiyuan Peng\. FNSPID: A comprehensive financial news dataset in time series, 2024\.
- \[7\] Eugene F\. Fama\. Efficient capital markets: A review of theory and empirical work\. The Journal of Finance, 25\(2\):383–417, 1970\.
- \[8\] Yihong Gong and Xin Liu\. Generic text summarization using relevance measure and latent semantic analysis\. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 19–25, 2001\.
- \[9\] Tian Guo and Emmanuel Hauptmann\. Fine\-tuning large language models for stock return prediction using newsflow, 2024\.
- \[10\] Charles R\. Harris, K\. Jarrod Millman, Stéfan J\. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J\. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H\. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard\-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E\. Oliphant\. Array programming with NumPy\. Nature, 585\(7825\):357–362, 2020\.
- \[11\] Allen H\. Huang, Hui Wang, and Yi Yang\. FinBERT: A large language model for extracting information from financial text\. Contemporary Accounting Research, 40\(7\):1588–1619, 2023\.
- \[12\] John D\. Hunter\. Matplotlib: A 2D graphics environment\. Computing in Science & Engineering, 9\(3\):90–95, 2007\.
- \[13\] Sarthak Jain and Byron C\. Wallace\. Attention is not explanation, 2019\.
- \[14\] Wen jun Gu, Yi hao Zhong, Shi zun Li, Chang song Wei, Li ting Dong, Zhuo yue Wang, and Chao Yan\. Predicting stock prices with FinBERT\-LSTM: Integrating news sentiment analysis\. In Proceedings of the 2024 8th International Conference on Cloud and Big Data Computing, ICCBDC 2024, pages 67–72\. ACM, August 2024\.
- \[15\] Fei Tony Liu, Kai Ming Ting, and Zhi\-Hua Zhou\. Isolation forest\. In 2008 Eighth IEEE International Conference on Data Mining, pages 413–422\. IEEE, 2008\.
- \[16\] Wes McKinney\. Data structures for statistical computing in Python\. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 56–61, 2010\.
- \[17\] R\. David McLean and Jeffrey Pontiff\. Does academic research destroy stock return predictability? The Journal of Finance, 71\(1\):5–31, 2016\.
- \[18\] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala\. PyTorch: An imperative style, high\-performance deep learning library\. In Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019\.
- \[19\] F\. Pedregosa, G\. Varoquaux, A\. Gramfort, V\. Michel, B\. Thirion, O\. Grisel, M\. Blondel, P\. Prettenhofer, R\. Weiss, V\. Dubourg, J\. Vanderplas, A\. Passos, D\. Cournapeau, M\. Brucher, M\. Perrot, and E\. Duchesnay\. Scikit\-learn: Machine learning in Python\. Journal of Machine Learning Research, 12:2825–2830, 2011\.
- \[20\] Yanxin Shen and Pulin Kirin Zhang\. Financial sentiment analysis on news and reports using large language models and FinBERT\. In 2024 IEEE 6th International Conference on Power, Intelligent Computing and Systems \(ICPICS\), pages 717–721, 2024\.
- \[21\] Nicholas Vinden, Raeid Saqur, Zining Zhu, and Frank Rudzicz\. Contrastive similarity learning for market forecasting: The ContraSim framework, 2025\.
- \[22\] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush\. Transformers: State\-of\-the\-art natural language processing\. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45\. Association for Computational Linguistics, 2020\.
- \[23\] Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, and Furu Wei\. Differential transformer\. arXiv preprint arXiv:2410\.05258, 2024\.

## 8 Appendix

### Implementation Details

The experimental framework was primarily implemented using PyTorch \[[18](https://arxiv.org/html/2605.30652#bib.bib18)\] for deep learning model development, though initial comparative experiments were conducted using TensorFlow \[[2](https://arxiv.org/html/2605.30652#bib.bib2)\]\.

For natural language processing tasks, we utilized the Hugging Face Transformers library \[[22](https://arxiv.org/html/2605.30652#bib.bib22)\]\. Data preprocessing, manipulation, and numerical computations were handled using Pandas \[[16](https://arxiv.org/html/2605.30652#bib.bib16)\] and NumPy \[[10](https://arxiv.org/html/2605.30652#bib.bib10)\], while Scikit\-learn \[[19](https://arxiv.org/html/2605.30652#bib.bib19)\] was employed for evaluation metrics and baseline algorithms\.

Cloud data interactions were managed via the AWS SDK for Python \(Boto3\) \[[3](https://arxiv.org/html/2605.30652#bib.bib3)\] and AWS Data Wrangler \[[5](https://arxiv.org/html/2605.30652#bib.bib5)\]\. Finally, all visualizations were generated using Matplotlib \[[12](https://arxiv.org/html/2605.30652#bib.bib12)\]\.

### Charts and Figures

![Refer to caption](https://arxiv.org/html/2605.30652v1/sentiment_coverage.png)
Figure 4: Percentage of sentiment data by stock symbol

![Refer to caption](https://arxiv.org/html/2605.30652v1/sentiment_distribution.png)
Figure 5: Percentage of sentiment data by stock symbol

![Refer to caption](https://arxiv.org/html/2605.30652v1/siamese_histograms.png)
Figure 6: Distribution of values for $y_{market}$ in contrastive learning data

![Refer to caption](https://arxiv.org/html/2605.30652v1/nvda.png)
Figure 7: Article counts by date for NVDA

![Refer to caption](https://arxiv.org/html/2605.30652v1/siamese_roc.png)
Figure 8: Comparison of ROC curves across Siamese networks for binning strategies

![Refer to caption](https://arxiv.org/html/2605.30652v1/loss_quartile.png)
Figure 9: Loss of Siamese network \(Quartile Strategy\)

![Refer to caption](https://arxiv.org/html/2605.30652v1/loss_median.png)
Figure 10: Loss of Siamese network \(Median Strategy\)

![Refer to caption](https://arxiv.org/html/2605.30652v1/loss_tercile.png)
Figure 11: Loss of Siamese network \(Tercile Strategy\)
