A Comparative Study on Affective Cues in Text Embeddings Across Psychological Emotion Theories
Summary
This paper evaluates twelve recent text encoders on their ability to encode affective cues from three psychological emotion theories, finding that instruction-aware open-weight encoders match or exceed proprietary ones at word level, while task-tuned embeddings are superior at sentence level.
# A Comparative Study on Affective Cues in Text Embeddings Across Psychological Emotion Theories
Source: [https://arxiv.org/html/2606.29068](https://arxiv.org/html/2606.29068)
\(1\) Department of Electronics, Information, and Bioengineering, Polytechnic University of Milan, fabio1\.ciani@mail\.polimi\.it
\(2\) Multimedia Mining and Search Group, Institute of Computational Perception, Johannes Kepler University Linz, \{harald\.schweiger,markus\.schedl\}@jku\.at
\(3\) Department of Music Pedagogy, Nuremberg University of Music, emiliaparada\.cabaleiro@hfm\-nuernberg\.de
\(4\) Human\-centered AI Group, AI Lab, Linz Institute of Technology
###### Abstract
Text encoders are known for their utility in natural language processing, as they are able to efficiently compress inputs into dense vectors while preserving semantics\. These models have been applied to affective computing, in particular to help with solving sentiment analysis and emotion recognition tasks\. Nevertheless, it remains unclear to what extent the latent representations produced by modern text encoders capture well\-defined psychological theories of affect\. In this work, we investigate the affective capabilities of twelve recently released text encoders by probing their generated embeddings as input features for solving regression and classification tasks across three established emotion frameworks, using both word\- and sentence\-level data\. Additionally, we apply a semantic data\-leakage prevention technique to improve robustness in word\-level evaluations\. Our main findings show that the latent manifolds of the latest instruction\-aware open\-weight encoders enclose an equal or even a larger amount of affective information in comparison with proprietary counterparts when evaluated at word level\. In contrast, embeddings of task\-tuned and proprietary encoders reach the highest scores on sentence\-level affective classification\. Furthermore, a qualitative analysis of latent representations and their encoded affective cues is provided\.
## 1 Introduction
The advent of text encoders has transformed how textual content is converted into numerical representations, known as embeddings, which now serve as core components for a variety of tasks, e\.g\., semantic similarity, retrieval, and reranking\[[8](https://arxiv.org/html/2606.29068#bib.bib8)\]\. They have also proven to be valuable for text\-based sentiment analysis, e\.g\., in emotion classification\[[11](https://arxiv.org/html/2606.29068#bib.bib11)\] or valence\-arousal regression\[[20](https://arxiv.org/html/2606.29068#bib.bib20)\]\. To solve these tasks, models such as BERT\[[12](https://arxiv.org/html/2606.29068#bib.bib12),[42](https://arxiv.org/html/2606.29068#bib.bib42)\] have been tested under various configurations, mostly involving end\-to\-end fine\-tuning\. Since then, the performance of text encoders has been progressively improved by different techniques, including instruction\-based queries and advanced training schemes\[[58](https://arxiv.org/html/2606.29068#bib.bib58)\]\.
Despite relevant prior works, the usage of text encoders as zero\-shot feature extractors for emotion recognition tasks remains underexplored, especially with respect to the latest state\-of\-the\-art and instruction\-aware models\. Moreover, it is uncertain whether the latent manifolds induced by these models enclose established emotion frameworks from psychology through a sufficient encoding of affective information\.
To fill this gap, our analysis compares recently released text encoders, without prior fine\-tuning, across different emotion theories, i\.e\., the Pleasure\-Arousal\-Dominance model by Mehrabian and Russell\[[27](https://arxiv.org/html/2606.29068#bib.bib27)\], the model of emotions by Plutchik\[[40](https://arxiv.org/html/2606.29068#bib.bib40)\], and the “big six” by Ekman\[[13](https://arxiv.org/html/2606.29068#bib.bib13)\]\. We use two structured lexica and one sentence\-level dataset, i\.e\., NRC\-VAD\[[33](https://arxiv.org/html/2606.29068#bib.bib33)\], NRC\-EIL\[[32](https://arxiv.org/html/2606.29068#bib.bib31)\], and GoEmotions\[[11](https://arxiv.org/html/2606.29068#bib.bib11)\], which respectively correspond to each of the emotion theories, together with a novel technique to limit leakage between splits\. Accordingly, embeddings are computed and frozen to be exploited as input features to four downstream predictors and subsequently evaluated with respect to their affective cues\. The quantitative results, together with a qualitative visual analysis, address the following research questions\.
- RQ1: To what extent do latent manifolds from text encoders enclose emotional cues?
- RQ2: Are instruction\-aware text encoders superior to task\-tuned models or ones without explicit prompt support for generating optimized embeddings?
- RQ3: Are proprietary models better than open\-weight ones?
- RQ4: Does model performance vary depending on the chosen emotion framework and downstream predictor?
The paper proceeds as follows\. In [Section 2](https://arxiv.org/html/2606.29068#S2), background on emotion theories and affective language processing is presented\. [Section 3](https://arxiv.org/html/2606.29068#S3) describes the experimental setup, including datasets, encoders, and predictors\. Finally, [Section 4](https://arxiv.org/html/2606.29068#S4) reports the results, while [Section 5](https://arxiv.org/html/2606.29068#S5) concludes the article\. Our code is publicly available at [https://github\.com/hcai\-mms/affective\_embeddings](https://github.com/hcai-mms/affective_embeddings)\.
## 2 Related works
### 2\.1 Emotion theories
Consolidated research in psychology has led to a variety of frameworks explaining human emotions from both taxonomic and perceptual standpoints\. One of the first categorical models was proposed by Ekman, in which the so\-called “big six” basic emotions \(*anger*, *disgust*, *fear*, *happiness*, *sadness*, and *surprise*\) were identified from facial expressions and considered biologically encoded and cross\-cultural\[[13](https://arxiv.org/html/2606.29068#bib.bib13)\]\.
Another notable theory is the circumplex model by Russell, which represents affects in a two\-dimensional space, where the horizontal and vertical axes respectively measure valence and arousal\. While valence captures the pleasure ranging from negative to positive, arousal reflects the energy spanning from low to high\[[43](https://arxiv.org/html/2606.29068#bib.bib43)\]\. This was preceded by the PAD \(Pleasure\-Arousal\-Dominance\) emotion model by Mehrabian, where a third bipolar dimension to quantify the evoked control or submissiveness was also included\[[27](https://arxiv.org/html/2606.29068#bib.bib27)\]\.
An additional framework, also relevant to affective computing, is the one by Plutchik, who designed a hybrid categorical\-dimensional model with a resemblance to Russell’s theory, in which spatial proximity links to affect similarity\. Eight primary emotions \(*joy*, *trust*, *fear*, *surprise*, *sadness*, *disgust*, *anger*, and *anticipation*\) are arranged in concentric circles corresponding to different levels of intensity, i\.e\., as a cone subdivided in sectors, with the possibility to mix adjacent and opposite emotions to form combined mood dyads\[[40](https://arxiv.org/html/2606.29068#bib.bib40)\]\.
### 2\.2 Affective language processing
Early techniques for extracting emotions from textual content drew on the distributional hypothesis in linguistics\[[19](https://arxiv.org/html/2606.29068#bib.bib19),[17](https://arxiv.org/html/2606.29068#bib.bib17)\] and included latent semantic analysis\[[3](https://arxiv.org/html/2606.29068#bib.bib3)\], a matrix factorization procedure to learn compressed representations, which later evolved into popular self\-supervised word embeddings\[[28](https://arxiv.org/html/2606.29068#bib.bib28),[29](https://arxiv.org/html/2606.29068#bib.bib29),[39](https://arxiv.org/html/2606.29068#bib.bib39),[4](https://arxiv.org/html/2606.29068#bib.bib4)\]\.
Word vectors were retrained from scratch incorporating supervised affective contexts into the objective function\[[49](https://arxiv.org/html/2606.29068#bib.bib49)\]\. Faruqui et al\.\[[15](https://arxiv.org/html/2606.29068#bib.bib15)\] and Mrksic et al\.\[[34](https://arxiv.org/html/2606.29068#bib.bib34)\] devised a method for adjusting pre\-trained word embeddings with respect to lexical relationships and constraints, which was adopted by Yu et al\.\[[55](https://arxiv.org/html/2606.29068#bib.bib55)\] and Seyeditabari et al\.\[[44](https://arxiv.org/html/2606.29068#bib.bib45)\] on affective datasets to mitigate reported issues in vector similarity and arithmetic associated with general\-purpose distributional embeddings\[[45](https://arxiv.org/html/2606.29068#bib.bib44)\]\. Notable extensions built upon the Transformer architecture have been presented both at word level\[[9](https://arxiv.org/html/2606.29068#bib.bib9),[10](https://arxiv.org/html/2606.29068#bib.bib10)\] and at sentence level\[[46](https://arxiv.org/html/2606.29068#bib.bib46)\]\. The attention mechanism has also been adapted to enrich learnt representations by combining vectors and data from a knowledge base\[[48](https://arxiv.org/html/2606.29068#bib.bib48)\]\.
More broadly, it has been questioned whether language models \(LMs\) can effectively understand the multifaceted nature of emotions\. Lee et al\.\[[22](https://arxiv.org/html/2606.29068#bib.bib23)\] isolated low\-level subcomponents focused on handling patterns deriving from specific affects, while Reichman et al\.\[[41](https://arxiv.org/html/2606.29068#bib.bib41)\] continued the analysis, underlining the presence of a complex redundancy scheme implemented by sets of specialized neurons and connections within the neural architecture\. At a higher level, the neuropsychology of LMs has been studied by assessing whether their internal representations can be refined to align with established emotion theories\[[24](https://arxiv.org/html/2606.29068#bib.bib22)\]\. It has also been observed that larger foundational models tend to exhibit emotional intelligence more accurately than smaller counterparts\[[53](https://arxiv.org/html/2606.29068#bib.bib52)\] and build increasingly detailed hierarchical taxonomies to organize emotions\[[57](https://arxiv.org/html/2606.29068#bib.bib57)\]\.
Lastly, connecting the affective expression in an emotion model to the definition in another theoretical framework has been achieved through annotated textual content, i\.e\., data with a series of assigned labels or numerical quantities, either directly bridging the categorical and dimensional families\[[38](https://arxiv.org/html/2606.29068#bib.bib38)\] or upon learning an agnostic intermediate representation space for conversions\[[5](https://arxiv.org/html/2606.29068#bib.bib5)\]\.
## 3 Methodology
To test the ability of text encoders to capture emotional cues, we performed evaluations on three corpora grounded in different emotion frameworks from psychology \(cf\. [Section 3\.1](https://arxiv.org/html/2606.29068#S3.SS1)\)\. We favored structured lexica, i\.e\., NRC\-VAD\[[31](https://arxiv.org/html/2606.29068#bib.bib32),[33](https://arxiv.org/html/2606.29068#bib.bib33)\] and NRC\-EIL\[[32](https://arxiv.org/html/2606.29068#bib.bib31)\], to reduce ambiguity in syntax and better match the conditions under which the corresponding theories have been studied\. Besides, GoEmotions\[[11](https://arxiv.org/html/2606.29068#bib.bib11)\] provides a more conservative perspective, as it contains full sentences instead of single\- and multi\-word samples\.
Figure 1: Pipeline demonstrating the fitting and evaluation procedure\. All embeddings are calculated once and frozen for each dataset \(blue section\)\. For simplicity, the remaining control flow is depicted for one experiment only \(yellow and purple sections\), i\.e\., the regression task of NRC\-VAD in combination with semantic leakage prevention and using KaLM v2 as text encoder\.
We evaluated twelve text encoders \(cf\. [Section 3\.2](https://arxiv.org/html/2606.29068#S3.SS2)\) using a two\-step procedure\. First, embeddings for all words and sentences in the datasets were computed and frozen\. Second, a collection of downstream predictors was trained and its hyperparameters were tuned, with the generated latent features as input, to assess the predictive performance on the corresponding regression and classification tasks \(cf\. [Figure 1](https://arxiv.org/html/2606.29068#S3.F1)\)\. We selected four predictors with distinct characteristics to map embeddings to emotions, allowing us to test whether emotional cues are accessible linearly or via nonlinear transformations \(cf\. [Section 3\.3](https://arxiv.org/html/2606.29068#S3.SS3)\)\.
To better measure the true generalization capabilities on the structured lexica, we applied two techniques to prevent morphological and semantic leakage across data splits, reducing evaluation biases on predictive models that rely on closely related lexical items to build their solutions\. The results presented in the main body of this paper focus on the semantics\-aware splitting strategy only, as it is inherently less biased\. More comprehensive summaries, including those obtained with the morphology\-aware approach, are reported in the Appendix \(cf\. [Appendices 0\.A](https://arxiv.org/html/2606.29068#Pt0.A1) and [0\.D](https://arxiv.org/html/2606.29068#Pt0.A4)\)\.
### 3\.1 Emotion datasets
All corpora are freely accessible and in English\. They present varying input granularity \(single\-word, multi\-word, and sentence\-level\) as well as output formats \(discrete and continuous\)\.
NRC\-VAD\[[31](https://arxiv.org/html/2606.29068#bib.bib32),[33](https://arxiv.org/html/2606.29068#bib.bib33)\] consists of around 55k single\- and multi\-word samples, annotated via crowdsourcing with real\-valued valence, arousal, and dominance scores in the interval $[-1, 1]$, following Mehrabian’s theory\[[27](https://arxiv.org/html/2606.29068#bib.bib27)\]\.
NRC\-EIL\[[32](https://arxiv.org/html/2606.29068#bib.bib31)\] contains almost 6k single words with emotion intensities in the interval $[0, 1]$, in line with Plutchik’s model\[[40](https://arxiv.org/html/2606.29068#bib.bib40)\]\. The properties of the collection would make it suitable as both a regression and a classification task, since 62\.4% of the entries are assigned a single emotion, 18\.6% have a pair of nonzero intensities, and the remaining terms are characterized by three or more emotions\. In our experiments, we focused on regression of the real\-valued intensities\.
GoEmotions\[[11](https://arxiv.org/html/2606.29068#bib.bib11)\] comprises over 54k comments crawled from Reddit, paired with 27 labels and filtered according to inter\-rater agreement\. The dataset provides official documentation to map these categories to a subset of labels consistent with Ekman’s framework\[[13](https://arxiv.org/html/2606.29068#bib.bib13)\]\. For our evaluations, we took the labeled sentences in combination with this projection to obtain a multi\-label classification dataset with 7 classes, six for Ekman’s emotions and one for the neutral category, where 91\.2% of the samples have one category and 8\.8% at least two\.
##### Splitting strategy
To train the predictive models, each dataset is partitioned into five folds for cross\-validation and one holdout test set for final evaluation in an 80%/20% proportion\. Given that GoEmotions has a predefined train\-dev\-test split, we combined the training and development splits and applied stratified sampling to balance the labels in the folds, while the test split was left unchanged to enable comparisons with the evaluations of the original work\. For the two lexicon\-based datasets, i\.e\., NRC\-VAD and NRC\-EIL, we used a novel technique for semantic leakage prevention, as detailed in the next paragraph\.
##### Leakage prevention
Random train\-test splits can overestimate generalization due to morphological or semantic leakage\. For instance, NRC\-VAD includes multiple inflected and derived forms of the same lexical root \(e\.g\., *pleasure*, *pleasures*, *pleasurable*\) and semantically related terms \(e\.g\., *calm*, *chill*, *peaceful*\)\. Text encoders tend to map these elements into nearby regions of the embedding space, which downstream predictive models could exploit by relying on nearest\-neighbor similarity rather than learning to genuinely generalize\.
To prevent this, we created a graph representation of the dataset lexemes and clustered them with the Leiden algorithm\[[50](https://arxiv.org/html/2606.29068#bib.bib50)\]\. Nodes correspond to lexemes\. An edge connects two lexemes when the Wu–Palmer semantic similarity\[[54](https://arxiv.org/html/2606.29068#bib.bib54)\] between their synsets in WordNet\[[30](https://arxiv.org/html/2606.29068#bib.bib30),[16](https://arxiv.org/html/2606.29068#bib.bib16)\] exceeds a threshold, and this similarity serves as the edge weight\. Lexemes not covered by WordNet are excluded from the dataset\.
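The graph construction can be illustrated with a minimal, dependency\-free sketch\. Everything in it is an assumption made for illustration: the similarity table holds made\-up values in place of Wu–Palmer scores from WordNet, the threshold of 0\.8 is hypothetical \(the text does not state the value used\), and connected components stand in for the Leiden algorithm, so the resulting groups are coarser than real community detection would give\.

```python
from itertools import combinations

# Made-up pairwise similarities standing in for Wu-Palmer scores (hypothetical values).
TOY_SIMILARITY = {
    frozenset({"calm", "peaceful"}): 0.92,
    frozenset({"calm", "chill"}): 0.85,
    frozenset({"peaceful", "chill"}): 0.60,
    frozenset({"calm", "rage"}): 0.20,
    frozenset({"rage", "fury"}): 0.95,
}


def build_edges(lexemes, similarity, threshold):
    """Keep an edge only when the similarity exceeds the threshold; weight = similarity."""
    edges = {}
    for a, b in combinations(sorted(lexemes), 2):
        score = similarity.get(frozenset({a, b}), 0.0)
        if score > threshold:
            edges[(a, b)] = score
    return edges


def connected_components(lexemes, edges):
    """Group lexemes linked by any path of edges (a simple stand-in for Leiden)."""
    parent = {w: w for w in lexemes}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path compression
            w = parent[w]
        return w

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for w in lexemes:
        groups.setdefault(find(w), []).append(w)
    return sorted(sorted(group) for group in groups.values())


lexemes = ["calm", "chill", "peaceful", "rage", "fury"]
edges = build_edges(lexemes, TOY_SIMILARITY, threshold=0.8)  # hypothetical threshold
clusters = connected_components(lexemes, edges)
print(clusters)
```

In this toy case, *peaceful* and *chill* end up in the same group through their shared neighbor *calm*, even though their direct similarity is below the threshold\. Keeping such indirectly related terms in one split is the purpose of clustering before splitting\.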
To assign the clusters to the cross\-validation folds and the holdout set, we applied a greedy balancing algorithm that optimizes two criteria: \(i\) the split sizes, i\.e\., 16% for each fold and 20% for the test set; and \(ii\) the preservation of the mean of the distribution for each emotion dimension across splits\. The procedure is executed multiple times with different initialization seeds and the best balanced split is kept\.
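The assignment step can be sketched as follows\. This is a simplified illustration under stated assumptions, not the authors' implementation: the cost function, the single score per lexeme \(the paper balances every emotion dimension\), the function names, and the toy data are all hypothetical\.

```python
import random

# Target proportions from the text: five folds of 16% and a 20% test set.
TARGETS = {"fold1": 0.16, "fold2": 0.16, "fold3": 0.16,
           "fold4": 0.16, "fold5": 0.16, "test": 0.20}


def greedy_assign(clusters, targets, seed):
    """Assign whole clusters to splits, one cluster at a time."""
    rng = random.Random(seed)
    order = list(range(len(clusters)))
    rng.shuffle(order)                           # seed-dependent tie-breaking
    order.sort(key=lambda i: -len(clusters[i]))  # stable sort: largest clusters first
    total = sum(len(c) for c in clusters)
    global_mean = sum(s for c in clusters for _, s in c) / total
    splits = {name: [] for name in targets}
    for i in order:
        def cost(name):
            items = splits[name] + clusters[i]
            size_gap = len(items) / total - targets[name]  # negative while under-filled
            mean_gap = abs(sum(s for _, s in items) / len(items) - global_mean)
            return size_gap + mean_gap
        splits[min(targets, key=cost)].extend(clusters[i])
    return splits


def imbalance(splits, targets, total, global_mean):
    """Lower is better: deviation from target sizes plus deviation from the global mean."""
    score = 0.0
    for name, items in splits.items():
        score += abs(len(items) / total - targets[name])
        if items:
            score += abs(sum(s for _, s in items) / len(items) - global_mean)
        else:
            score += 1.0  # heavy penalty for an empty split
    return score


def best_split(clusters, targets, seeds):
    """Run the greedy procedure once per seed and keep the best balanced result."""
    total = sum(len(c) for c in clusters)
    global_mean = sum(s for c in clusters for _, s in c) / total
    candidates = [greedy_assign(clusters, targets, seed) for seed in seeds]
    return min(candidates, key=lambda sp: imbalance(sp, targets, total, global_mean))


# Toy data: 60 clusters of (lexeme, score) pairs with scores in [-1, 1].
rng = random.Random(0)
toy_clusters = [[(f"w{c}_{j}", rng.uniform(-1, 1)) for j in range(rng.randint(1, 6))]
                for c in range(60)]
split = best_split(toy_clusters, TARGETS, seeds=range(10))
print({name: len(items) for name, items in split.items()})
```

Because clusters are moved as a unit, semantically related lexemes never straddle two splits, which is the property the leakage prevention relies on\.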
### 3\.2 Text encoders
For a comprehensive analysis, we handpicked six open\-weight and three proprietary models from the top entries of the Massive Text Embedding Benchmark \(MTEB\) leaderboard curated by Hugging Face\[[35](https://arxiv.org/html/2606.29068#bib.bib35),[14](https://arxiv.org/html/2606.29068#bib.bib14)\] \(online at [https://huggingface\.co/spaces/mteb/leaderboard](https://huggingface.co/spaces/mteb/leaderboard)\)\. When multiple models shared the same base architecture, including updated versions, we selected the encoder with the best average score\. Of the six considered open models, five \(KaLM Embedding v2, Qwen3 Embedding 8B, Linq Embed Mistral, LLaMA Embed Nemotron 8B, and Multilingual E5 Large Instruct\) are instruction\-aware, i\.e\., they were trained to generate embeddings optimized for user\-defined specifications included as additional input\[[47](https://arxiv.org/html/2606.29068#bib.bib47)\]\. In contrast, EmbeddingGemma is the only one that is task\-tuned, i\.e\., a relaxed instruction\-aware model with predefined configurations for a set of use cases \(e\.g\., classification or semantic text similarity\)\. These were set up following the recommended prompts in their model cards together with the task to be solved, i\.e\., regression and classification, to detail the desired feature extraction \(cf\. [Appendix 0\.B](https://arxiv.org/html/2606.29068#Pt0.A2)\)\. As for proprietary models, we chose OpenAI Text Embedding v3 Large, Gemini Embedding 001, and Voyage v3 Large, none of which is instruction\-aware\.
To further diversify the model pool, we also included two very recent encoders, i\.e\., Jina Embeddings v4 and Nomic Embed v2, whose earlier releases demonstrated good performance on the MTEB benchmark, but whose current versions have been assessed on only a limited subset of the MTEB tasks\. Additionally, Sentence T5 XXL, a popular sentence embedding model, is included to serve as a representative for text encoders without explicit support for custom prompts or task instructions\.
In total, twelve models are considered, spanning a wide range of parameter sizes, output dimensionalities, and training corpora, both English\-only and multilingual \(cf\. [Table 1](https://arxiv.org/html/2606.29068#S3.T1)\)\.
Table 1: Description of the analyzed text encoders\. Licenses are subdivided into open\-weight and proprietary\. The number of parameters for downloadable models and the dimensionality of the latent features are specified by $p$ and $d$, respectively\. Each entry is tagged with its type between no prompt support \(▲\), task\-tuned \(◼\), and instruction\-aware \(⚫\)\. All encoders are multilingual, unless their training corpora are mainly in English \(†\)\.

| Model | License | Type | $p$ | $d$ | Reference |
| --- | --- | --- | --- | --- | --- |
| [Sentence T5 XXL](https://huggingface.co/sentence-transformers/sentence-t5-xxl) | open\-weight | ▲ † | 4\.8B | 768 | \[[36](https://arxiv.org/html/2606.29068#bib.bib36)\] |
| [EmbeddingGemma](https://huggingface.co/google/embeddinggemma-300m) | open\-weight | ◼ | 300M | 768 | \[[51](https://arxiv.org/html/2606.29068#bib.bib51)\] |
| [Nomic Embed v2](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe) | open\-weight | ◼ | 305M | 768 | \[[37](https://arxiv.org/html/2606.29068#bib.bib37)\] |
| [Multilingual E5 Large Instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) | open\-weight | ⚫ | 560M | 1024 | \[[52](https://arxiv.org/html/2606.29068#bib.bib53)\] |
| [Jina Embeddings v4](https://huggingface.co/jinaai/jina-embeddings-v4) | open\-weight | ◼ | 3\.8B | 2048 | \[[18](https://arxiv.org/html/2606.29068#bib.bib18)\] |
| [KaLM Embedding v2](https://huggingface.co/tencent/KaLM-Embedding-Gemma3-12B-2511) | open\-weight | ⚫ | 12B | 3840 | \[[58](https://arxiv.org/html/2606.29068#bib.bib58)\] |
| [Linq Embed Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) | open\-weight | ⚫ † | 7B | 4096 | \[[21](https://arxiv.org/html/2606.29068#bib.bib21)\] |
| [LLaMA Embed Nemotron 8B](https://huggingface.co/nvidia/llama-embed-nemotron-8b) | open\-weight | ⚫ | 8B | 4096 | \[[2](https://arxiv.org/html/2606.29068#bib.bib2)\] |
| [Qwen3 Embedding 8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B) | open\-weight | ⚫ | 8B | 4096 | \[[56](https://arxiv.org/html/2606.29068#bib.bib56)\] |
| Voyage v3 Large | proprietary | ▲ | | 2048 | https://blog\.voyageai\.com/2025/01/07/voyage\-3\-large |
| OpenAI Text Embedding v3 Large | proprietary | ▲ | | 3072 | https://platform\.openai\.com/docs/models/text\-embedding\-3\-large |
| Gemini Embedding 001 | proprietary | ◼ | | 3072 | \[[23](https://arxiv.org/html/2606.29068#bib.bib24)\] |
### 3\.3 Predictive models
We employed four predictors, i\.e\., linear and logistic regression with elastic net regularization \(LR\), $k$\-nearest neighbors \($k$\-NN\), XGBoost \(XGB\), and multilayer perceptron \(MLP\), to leverage text embeddings for downstream emotion prediction tasks\.
On the regression datasets, i\.e\., NRC\-VAD and NRC\-EIL, all predictive models were trained to minimize the mean squared error \(MSE\)\. For multi\-label classification, i\.e\., GoEmotions, performance was optimized by maximizing the macro\-averaged $F_{1}$, using a fixed decision threshold of $0.5$\. Since LR and XGB do not natively support multi\-label classification, a one\-vs\-rest strategy was applied\. The hyperparameters of the predictors were independently tuned for each dataset and generating encoder over multiple optimization trials with respect to $R^{2}$ for regression and macro $F_{1}$ for classification\. More details on how hyperparameter tuning was carried out can be found in [Appendix 0\.C](https://arxiv.org/html/2606.29068#Pt0.A3)\.
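As an illustration of the classification objective, the following sketch computes the macro\-averaged $F_{1}$ from per\-class probabilities with a fixed threshold\. The function name, the toy data, and the choice of `>=` at the threshold boundary are assumptions for this example and are not taken from the paper's code\.

```python
def macro_f1(y_true, y_prob, threshold=0.5):
    """Macro-averaged F1 for multi-label data: one binary F1 per class, then the mean."""
    n_classes = len(y_true[0])
    per_class = []
    for c in range(n_classes):
        tp = fp = fn = 0
        for truth, prob in zip(y_true, y_prob):
            predicted = prob[c] >= threshold  # assumption: ties count as positive
            if predicted and truth[c]:
                tp += 1
            elif predicted:
                fp += 1
            elif truth[c]:
                fn += 1
        denominator = 2 * tp + fp + fn
        per_class.append(2 * tp / denominator if denominator else 0.0)
    return sum(per_class) / n_classes


# Toy multi-label example with three classes (values are made up).
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_prob = [[0.9, 0.2, 0.1], [0.4, 0.7, 0.1], [0.8, 0.3, 0.2], [0.1, 0.6, 0.55]]
score = macro_f1(y_true, y_prob)
print(round(score, 4))
```

Each class is scored as an independent binary problem, which mirrors the one\-vs\-rest treatment described above, and every class contributes equally to the average regardless of its frequency\.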
## 4 Experimental results
### 4\.1 Quantitative regression analysis
We evaluated the performance on the holdout test sets using three regression metrics, i\.e\., MSE, $R^{2}$, and the concordance correlation coefficient
$$\rho_{c} = \frac{2\rho\sigma_{x}\sigma_{y}}{\sigma_{x}^{2} + \sigma_{y}^{2} + (\mu_{x} - \mu_{y})^{2}}. \tag{1}$$
In Equation [1](https://arxiv.org/html/2606.29068#S4.E1), $x$ and $y$ denote true and predicted values, respectively, with means $\mu_{x}$, $\mu_{y}$ and standard deviations $\sigma_{x}$, $\sigma_{y}$, whereas $\rho$ is Pearson’s correlation coefficient\. $\rho_{c}$ was chosen because, as defined in its formula, it captures both correlation and agreement, penalizing scale mismatches\[[25](https://arxiv.org/html/2606.29068#bib.bib25)\]\. To further support our results, the outcomes of paired difference tests, estimated via bootstrapping, are provided for each metric and encoder\. The null hypothesis claims that, using a given predictor as backend, the candidate encoder leads to better performance than the best encoder\.
[Tables 2](https://arxiv.org/html/2606.29068#S4.T2) and [3](https://arxiv.org/html/2606.29068#S4.T3) summarize the inference performance\. The best results are in bold, cases without statistical evidence \($p > 0.05$\) carry a double underline, and statistically significant $p$\-values falling within the interval $[0.005, 0.05]$ carry a single underline\. The absence of an underline indicates a $p$\-value below $0.005$\.
Table 2: Regression metrics at test time for NRC\-VAD, sorted by $R^{2}$ score of the MLP model in descending order\.
#### 4\.1\.1 NRC\-VAD
KaLM v2 achieves the highest scores across all predictive models with one exception, i\.e\., $\rho_{c}$ for $k$\-NN, where Linq Mistral performs slightly better, though not in a statistically significant way \(cf\. double\-underlined value in the first row of [Table 2](https://arxiv.org/html/2606.29068#S4.T2)\)\. OpenAI Text v3 L ranks third with the MLP backend, whereas other instruction\-aware models, i\.e\., Qwen3 8B and LLaMA Nemotron 8B, show comparable performance, followed by the task\-tuned Gemini 001\. In relation to RQ[3](https://arxiv.org/html/2606.29068#S1.I1.ix3), these insights indicate that some open\-weight text encoders can significantly outperform proprietary alternatives \(cf\. no underlined scores for OpenAI Text v3 L\), likely due to instruction\-awareness\. This also provides initial evidence for RQ[2](https://arxiv.org/html/2606.29068#S1.I1.ix2)\. Interestingly, the remaining instruction\-aware model, Multilang E5 L Ins, occupies one of the last positions\.
The maximum $R^{2}$ score of \.677 and correlation $\rho$ of \.811 indicate that the text encoders are capable of capturing affective cues, thereby addressing RQ[1](https://arxiv.org/html/2606.29068#S1.I1.ix1)\. In general, with the MLP predictor, all encoders except for Nomic v2 reach $R^{2} > .55$ and $\rho > .7$, hinting at their ability to enclose affective signals\. Concerning RQ[4](https://arxiv.org/html/2606.29068#S1.I1.ix4), the extraction of this affective information is more effective when using the MLP backend, with LR as runner\-up\.
Table 3: Regression metrics at test time for NRC\-EIL, sorted by $R^{2}$ score of the MLP model in descending order\.
#### 4\.1\.2 NRC\-EIL
Similar patterns emerge with respect to NRC\-VAD\. KaLM v2 and Linq Mistral consistently occupy the top positions\. KaLM v2 benefits again from the MLP predictor \(cf\. top\-right corner of [Table 3](https://arxiv.org/html/2606.29068#S4.T3)\), while Linq Mistral performs better across almost all other predictive models \(cf\. bold scores on second row\)\. In contrast to the ranking on NRC\-VAD, Qwen3 8B and EmbeddingGemma rise to the third and the fourth places, while OpenAI Text v3 L drops to the sixth position\. This addresses RQ[4](https://arxiv.org/html/2606.29068#S1.I1.ix4)\.
Performance differences between Linq Mistral and KaLM v2, Qwen3 8B, and EmbeddingGemma have $p$\-values in $[0.005, 0.05]$, indicating that the ranking is not statistically decisive \(cf\. single\-underlined values in the corresponding rows\)\. This suggests that task\-tuned encoders such as EmbeddingGemma are competitive with respect to instruction\-aware alternatives, providing further insight into RQ[2](https://arxiv.org/html/2606.29068#S1.I1.ix2)\. With regard to RQ[3](https://arxiv.org/html/2606.29068#S1.I1.ix3), four instruction\-aware models and one task\-tuned encoder outperform all proprietary models \(cf\. top\-5 ranking and rows below\), hence the support for generating optimized embeddings via instructions appears to be generally beneficial\.
As for RQ[1](https://arxiv.org/html/2606.29068#S1.I1.ix1), the highest $R^{2}$ reaches a score of \.540 and $\rho$ a correlation of \.730, showing a moderate encoding of affective information for the top performing embedding model\. Recent open encoders, i\.e\., Jina v4 and Nomic v2, seem to weakly enclose affective cues, as hinted by their low $R^{2}$ scores of \.432 and \.369 \(cf\. last rows\)\.
Table 4: Summary of the macro\-averaged classification metrics at test time for GoEmotions, sorted by $F_{1}$ score of the MLP model in descending order\.
### 4\.2 Quantitative classification analysis
We used three common measures, i\.e\., precision, recall, and $F_{1}$ score, aggregating with respect to multiple classes by calculating the macro average of these metrics\. [Table 4](https://arxiv.org/html/2606.29068#S4.T4) summarizes the aggregated results\. As in [Section 4\.1](https://arxiv.org/html/2606.29068#S4.SS1), results are supported by significance checks\. In particular, paired permutation tests were applied\.
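The idea behind a paired permutation test can be sketched as follows\. The text does not specify the test statistic, so this example makes several assumptions: it works on hypothetical per\-sample scores of two systems, uses the mean paired difference as the statistic, and randomly flips the sign of each difference \(equivalent to swapping the two systems' outputs on that sample\) to build the null distribution\.

```python
import random


def paired_permutation_pvalue(scores_best, scores_candidate, n_permutations=2000, seed=0):
    """One-sided p-value for the mean paired difference (best minus candidate)."""
    rng = random.Random(seed)
    diffs = [b - c for b, c in zip(scores_best, scores_candidate)]
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_permutations):
        # Swapping the two systems on a sample flips the sign of its difference.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if permuted >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Toy per-sample scores (made up): identical systems vs. a clearly better one.
p_same = paired_permutation_pvalue([0.5] * 20, [0.5] * 20)
p_better = paired_permutation_pvalue([1.0] * 20, [0.0] * 20)
print(p_same, p_better)
```

For a set\-level metric such as macro $F_{1}$, the same scheme applies if the metric is recomputed after swapping the two systems' predictions on randomly chosen samples, rather than averaging per\-sample scores\.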
#### 4\.2\.1 GoEmotions
Results reveal a substantially different ranking in comparison with the regression analysis\. The proprietary Gemini 001 and the open\-weight EmbeddingGemma achieve the top places on all predictive backends with their $F_{1}$ scores, followed by OpenAI Text v3 L \(cf\. top\-3 ranking of [Table 4](https://arxiv.org/html/2606.29068#S4.T4)\)\. Interestingly, KaLM v2 achieves an $F_{1}$ score of \.588 with LR, which is the only occurrence where MLP is outperformed \(cf\. $F_{1}$ score of \.565 in the sixth row\)\. This makes KaLM v2 the model with the third best $F_{1}$, giving new insights on RQ[4](https://arxiv.org/html/2606.29068#S1.I1.ix4)\. Two encoders without explicit prompt support for optimized embeddings, i\.e\., Voyage v3 L and ST5 XXL, occupy the last positions \(cf\. last rows\)\.
Considering RQ[3](https://arxiv.org/html/2606.29068#S1.I1.ix3), in contrast to the analyses on NRC\-VAD and NRC\-EIL, the top ranker is a proprietary text encoder, i\.e\., Gemini 001, rather than an open\-weight model\. However, this superior performance is not statistically significant \(cf\. underlined $F_{1}$ scores for EmbeddingGemma\)\. In relation to RQ[2](https://arxiv.org/html/2606.29068#S1.I1.ix2), this highlights that instruction\-aware encoders, as well as models with large parameter size and embedding dimensionality, are not always better than task\-tuned or proprietary alternatives, in particular when sentence\-level samples are used instead of word\-level data\.
Concerning RQ[1](https://arxiv.org/html/2606.29068#S1.I1.ix1), a maximum $F_{1}$ of $0.60$ is achieved \(cf\. MLP column in the top\-right corner\)\. By comparison, the authors of the dataset report a score of $0.64$ by fine\-tuning BERT\[[11](https://arxiv.org/html/2606.29068#bib.bib11)\], though without specifying the selected classification threshold\. This comparison suggests that exploiting fine\-tuning could be advantageous\.
As for RQ[4](https://arxiv.org/html/2606.29068#S1.I1.ix4), in terms of $F_{1}$ scores, the MLP backend consistently outperforms all other predictive models, with only one exception \(cf\. LR backend for KaLM v2\)\. LR ranks second, whereas $k$\-NN and XGB exhibit more variable performance depending on the encoder used\. These trends are in line with those observed in the experiments on regression\.
### 4\.3 Qualitative visual analysis
Figure 2: UMAP visualization of the full embeddings with color\-coded labels\.
We report a series of visualizations to examine whether the generated vectors are expressive enough to cluster similar elements by affect\. For this analysis, we focus on the best performing open\-weight representatives for each of the three types of text encoder, i\.e\., without prompt support, task\-tuned, and instruction\-aware, as specified in [Table 1](https://arxiv.org/html/2606.29068#S3.T1)\.
The embeddings calculated on the full datasets were transformed into a 2D representation with UMAP\[[26](https://arxiv.org/html/2606.29068#bib.bib26)\], configured with cosine similarity as the metric to compare vectors\. Depending on the data format, the output variable information was converted through a color encoding as follows\.
- 3D emotion points from NRC\-VAD were read as RGB triples, and their hue was used to index a cyclic colormap\. Samples with a pure valence \(R\), arousal \(G\), or dominance \(B\) signal are thus equidistant from one another in the color space\.
- Entries of NRC\-EIL with more than one active affect intensity were filtered out, and the remaining elements were mapped to either the positive emotions \(joy, trust, anger, and anticipation\) or their negative counterparts \(sadness, disgust, fear, and surprise\), following Plutchik’s original statement; the two groups sit at opposite ends of a diverging colormap\.
- Samples from GoEmotions that either had multiple labels or were tagged as neutral were dropped, and the remaining ones were linked to Ekman’s taxonomy through the official dataset lookup table, with each category corresponding to a distinct color\.
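The hue\-based encoding described for NRC\-VAD can be sketched as follows\. This is a minimal illustration, not the authors' code: it assumes VAD scores already rescaled to the range 0 to 1, and the function name `vad_to_hue` is hypothetical\. The hue is computed with Python's standard `colorsys` module, and the resulting value could then index any cyclic colormap\.

```python
import colorsys


def vad_to_hue(valence: float, arousal: float, dominance: float) -> float:
    """Read a VAD point as an RGB triple and return its hue in [0, 1)."""
    # Assumption: all three scores lie in [0, 1] before the conversion.
    hue, _saturation, _value = colorsys.rgb_to_hsv(valence, arousal, dominance)
    return hue


# Pure valence, arousal, and dominance points land a third of the hue circle apart.
pure_valence = vad_to_hue(1.0, 0.0, 0.0)
pure_arousal = vad_to_hue(0.0, 1.0, 0.0)
pure_dominance = vad_to_hue(0.0, 0.0, 1.0)
```

Points whose three components are equal have no defined hue; `colorsys` returns 0 for them, so such samples would need separate handling in a real plot\.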
For the sake of readability, [Figure 2](https://arxiv.org/html/2606.29068#S4.F2) is limited to 500 points per dataset, with stratified sampling applied to NRC\-EIL and GoEmotions\. In the first row, none of the encoders produces clusters of consistent entries for NRC\-VAD\. Ideally, an equilateral triangle of uniformly distributed samples should form, with the pure points of each component at its vertices\. Among the three affective axes, arousal appears to be the most easily separable when considered alone\. In contrast, all text encoders sharply divide the elements of NRC\-EIL into two groups, as evident in the second row of plots\. In addition, the instruction\-aware model is particularly good at keeping intense negative points out of the cluster of positive samples\. As for GoEmotions in the third row, while disgust\-, fear\-, and sadness\-labeled entries tend to gather together, especially for the task\-tuned and instruction\-aware encoders, elements tagged with the anger, joy, and surprise labels spread out and hardly group\. This difficulty might be due to semantically broader labels being selected more frequently by human raters\. Since a larger category can imply a higher variance in its embedding vectors, samples marked with the most frequent tags are more likely to have latent features that are inconsistent with a representative of their own cluster\.
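One way to draw such a stratified subsample is sketched below\. It is an assumption\-laden illustration: the function name `stratified_sample`, the proportional allocation rule, and the toy label counts are all hypothetical, and the authors may have used a different tool or allocation scheme\.

```python
import random
from collections import defaultdict


def stratified_sample(labels, n_total, seed=0):
    """Return sorted indices of roughly n_total items, drawn proportionally per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    chosen = []
    for indices in by_label.values():
        # Proportional allocation, keeping at least one item per label.
        quota = max(1, round(n_total * len(indices) / len(labels)))
        chosen.extend(rng.sample(indices, min(quota, len(indices))))
    return sorted(chosen)


# Toy label distribution (hypothetical counts, not taken from the datasets).
labels = ["joy"] * 600 + ["fear"] * 300 + ["disgust"] * 100
subset = stratified_sample(labels, n_total=500)
```

Because each quota is rounded independently, the total can deviate slightly from the requested size when the label proportions do not divide evenly\.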
## 5 Conclusions and future work
To conclude, our analyses show that affective information is present to a varying degree within twelve text encoders and across three emotion frameworks\. Addressing RQ[1](https://arxiv.org/html/2606.29068#S1.I1.ix1), the highest $R^2$ score achieved by the best encoder on VAD regression is \.677, highlighting that affective information is well represented by the embeddings within this framework\. In contrast, for Plutchik regression, the maximum score is \.540, likely due to the higher dimensionality \(i\.e\., eight versus three\) and the smaller dataset size\. As for the sentence\-based multi\-label classification over Ekman’s six emotions plus a neutral one, the highest $F_1$ score is \.600\. In addition, visualizing the down\-projections of the embeddings reveals similar patterns\. Concerning RQ[4](https://arxiv.org/html/2606.29068#S1.I1.ix4), affective cues appear to be more readily accessible through nonlinear downstream predictors, even though linear transformations can achieve comparable results\. On the lexicon datasets, instruction\-aware models rank the highest and outperform other candidates, including proprietary ones\. However, this trend does not hold for sentence\-level data, where task\-tuned and proprietary models perform best, giving insights into RQ[2](https://arxiv.org/html/2606.29068#S1.I1.ix2) and RQ[3](https://arxiv.org/html/2606.29068#S1.I1.ix3)\.
Regarding future work, we focused on three emotion theories, considering both the categorical and the dimensional families\. However, other frameworks motivated by psychology and specifically developed for affective computing have been proposed \[[7](https://arxiv.org/html/2606.29068#bib.bib7)\]\. Therefore, it would be desirable to extend our study to them\. Additionally, we focused on discovering emotional cues from text\. Nevertheless, most of the latest LMs have been trained with multimodality in mind\. It has already been observed that incorporating different types of data can improve understanding capabilities on a wide range of downstream tasks \(for instance, see the technical report at [https://www\.anthropic\.com/claude\-3\-model\-card](https://www.anthropic.com/claude-3-model-card)\)\. Consequently, adapting our methodology and applying it to multimodal resources \[[6](https://arxiv.org/html/2606.29068#bib.bib6)\] could further clarify which embedding models are best at distinguishing the nuanced facets of emotions\.
## Limitations
We acknowledge that the prompts of the task\-tuned and instruction\-aware text encoders under examination were set up at our own discretion\. To enable fair evaluation and comparison, we tried to configure these models with matching settings for feature extraction as far as possible\. However, it is reasonable to suppose that, for each specific encoder and task \(regression or classification\), there might exist other instructions that would yield better predictive performance\.
## References
- \[1\] T\. Akiba, S\. Sano, T\. Yanase, T\. Ohta, and M\. Koyama \(2019\-08\) Optuna: A Next\-generation Hyperparameter Optimization Framework\. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, pp\. 2623–2631\. External Links: [Document](https://dx.doi.org/10.1145/3292500.3330701), ISBN 978\-1\-45\-036201\-6\. Cited by: [Appendix 0\.C](https://arxiv.org/html/2606.29068#Pt0.A3.p1.4)\.
- \[2\] Y\. Babakhin, R\. Osmulski, R\. Ak, G\. Moreira, M\. Xu, B\. Schifferer, B\. Liu, and E\. Oldridge \(2025\) Llama\-Embed\-Nemotron\-8B: A Universal Text Embedding Model for Multilingual and Cross\-Lingual Tasks\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2511.07025)\. Cited by: [Table 1](https://arxiv.org/html/2606.29068#S3.T1.22.12.5.1.1)\.
- \[3\] J\. R\. Bellegarda \(2013\-08\) Data\-driven Analysis of Emotion in Text Using Latent Affective Folding and Embedding\. Computational Intelligence 29\(3\), pp\. 506–526\. External Links: [Document](https://dx.doi.org/10.1111/j.1467-8640.2012.00457.x), ISSN 0824\-7935\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p1.1)\.
- \[4\] P\. Bojanowski, E\. Grave, A\. Joulin, and T\. Mikolov \(2017\-12\) Enriching Word Vectors with Subword Information\. Transactions of the Association for Computational Linguistics 5, pp\. 135–146\. External Links: [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00051), ISSN 2307\-387X\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p1.1)\.
- \[5\] S\. Buechel, L\. Modersohn, and U\. Hahn \(2021\-11\) Towards Label\-Agnostic Emotion Embeddings\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, pp\. 9231–9249\. External Links: [Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.728), ISBN 978\-1\-95\-591709\-4\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p4.1)\.
- \[6\] C\. Busso, M\. Bulut, C\. Lee, A\. Kazemzadeh, E\. Mower, S\. Kim, J\. N\. Chang, S\. Lee, and S\. S\. Narayanan \(2008\-12\) IEMOCAP: interactive emotional dyadic motion capture database\. Language Resources and Evaluation 42\(4\), pp\. 335–359\. External Links: [Document](https://dx.doi.org/10.1007/s10579-008-9076-6), ISSN 1574\-020X\. Cited by: [§5](https://arxiv.org/html/2606.29068#S5.p2.1)\.
- \[7\] E\. Cambria, A\. Livingstone, and A\. Hussain \(2012\) The Hourglass of Emotions\. In Cognitive Behavioural Systems, pp\. 144–157\. External Links: [Document](https://dx.doi.org/10.1007/978-3-642-34584-5%5F11), ISBN 978\-3\-64\-234583\-8\. Cited by: [§5](https://arxiv.org/html/2606.29068#S5.p2.1)\.
- \[8\] J\. Chen, S\. Xiao, P\. Zhang, K\. Luo, D\. Lian, and Z\. Liu \(2024\-08\) M3\-Embedding: Multi\-Linguality, Multi\-Functionality, Multi\-Granularity Text Embeddings Through Self\-Knowledge Distillation\. In Findings of the Association for Computational Linguistics 2024, Bangkok, Thailand, pp\. 2318–2335\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.137), ISBN 979\-8\-89\-176099\-8\. Cited by: [§1](https://arxiv.org/html/2606.29068#S1.p1.1)\.
- \[9\] G\. Chochlakis, G\. Mahajan, S\. Baruah, K\. Burghardt, K\. Lerman, and S\. Narayanan \(2023\-06\) Leveraging Label Correlations in a Multi\-Label Setting: a Case Study in Emotion\. In IEEE International Conference on Acoustics, Speech and Signal Processing 2023, Rhodes Island, Greece\. External Links: [Document](https://dx.doi.org/10.1109/ICASSP49357.2023.10096864), ISBN 978\-1\-72\-816327\-7\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p2.1)\.
- \[10\] G\. Chochlakis, G\. Mahajan, S\. Baruah, K\. Burghardt, K\. Lerman, and S\. Narayanan \(2023\-06\) Using Emotion Embeddings to Transfer Knowledge between Emotions, Languages, and Annotation Formats\. In IEEE International Conference on Acoustics, Speech and Signal Processing 2023, Rhodes Island, Greece\. External Links: [Document](https://dx.doi.org/10.1109/ICASSP49357.2023.10095597), ISBN 978\-1\-72\-816327\-7\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p2.1)\.
- \[11\] D\. Demszky, D\. Movshovitz\-Attias, J\. Ko, A\. Cowen, G\. Nemade, and S\. Ravi \(2020\-07\) GoEmotions: A Dataset of Fine\-Grained Emotions\. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp\. 4040–4054\. External Links: [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.372), ISBN 978\-1\-95\-214825\-5\. Cited by: [§1](https://arxiv.org/html/2606.29068#S1.p1.1), [§1](https://arxiv.org/html/2606.29068#S1.p3.1), [item GoEmotions](https://arxiv.org/html/2606.29068#S3.I1.ix3.p1.1), [§3](https://arxiv.org/html/2606.29068#S3.p1.1), [§4\.2\.1](https://arxiv.org/html/2606.29068#S4.SS2.SSS1.p3.3)\.
- \[12\] J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\-06\) BERT: Pre\-training of Deep Bidirectional Transformers for Language Understanding\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long and Short Papers\), Minneapolis, MN, USA, pp\. 4171–4186\. External Links: [Document](https://dx.doi.org/10.18653/v1/N19-1423), ISBN 978\-1\-95\-073713\-0\. Cited by: [§1](https://arxiv.org/html/2606.29068#S1.p1.1)\.
- \[13\] P\. Ekman \(1971\) Universals and Cultural Differences in Facial Expressions of Emotion\. Nebraska Symposium on Motivation 19, pp\. 207–283\. Cited by: [§1](https://arxiv.org/html/2606.29068#S1.p3.1), [§2\.1](https://arxiv.org/html/2606.29068#S2.SS1.p1.1), [item GoEmotions](https://arxiv.org/html/2606.29068#S3.I1.ix3.p1.1)\.
- \[14\] K\. Enevoldsen, I\. Chung, I\. Kerboua, M\. Kardos, A\. Mathur, D\. Stap, J\. Gala, W\. Siblini, D\. Krzemiński, G\. Indra Winata, S\. Sturua, S\. Utpala, M\. Ciancone, M\. Schaeffer, G\. Sequeira, D\. Misra, S\. Dhakal, J\. Rystrøm, R\. Solomatin, Ö\. Çağatan, A\. Kundu, M\. Bernstorff, S\. Xiao, A\. Sukhlecha, B\. Pahwa, R\. Poświata, K\. K\. GV, S\. Ashraf, D\. Auras, B\. Plüster, J\. P\. Harries, L\. Magne, I\. Mohr, M\. Hendriksen, D\. Zhu, H\. Gisserot\-Boukhlef, T\. Aarsen, J\. Kostkan, K\. Wojtasik, T\. Lee, M\. Šuppa, C\. Zhang, R\. Rocca, M\. Hamdy, A\. Michail, J\. Yang, M\. Faysse, A\. Vatolin, N\. Thakur, M\. Dey, D\. Vasani, P\. Chitale, S\. Tedeschi, N\. Tai, A\. Snegirev, M\. Günther, M\. Xia, W\. Shi, X\. H\. Lù, J\. Clive, G\. Krishnakumar, A\. Maksimova, S\. Wehrli, M\. Tikhonova, H\. Panchal, A\. Abramov, M\. Ostendorff, Z\. Liu, S\. Clematide, L\. J\. Miranda, A\. Fenogenova, G\. Song, R\. B\. Safi, W\. Li, A\. Borghini, F\. Cassano, H\. Su, J\. Lin, H\. Yen, L\. Hansen, S\. Hooker, C\. Xiao, V\. Adlakha, O\. Weller, S\. Reddy, and N\. Muennighoff \(2025\-04\) MMTEB: Massive Multilingual Text Embedding Benchmark\. In 13th International Conference on Learning Representations, Singapore\. External Links: ISBN 979\-8\-33\-132085\-0, [Link](https://iclr.cc/virtual/2025/poster/27651)\. Cited by: [§3\.2](https://arxiv.org/html/2606.29068#S3.SS2.p1.1)\.
- \[15\] M\. Faruqui, J\. Dodge, S\. K\. Jauhar, C\. Dyer, E\. Hovy, and N\. A\. Smith \(2015\-05\) Retrofitting Word Vectors to Semantic Lexicons\. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, USA, pp\. 1606–1615\. External Links: [Document](https://dx.doi.org/10.3115/v1/N15-1184), ISBN 978\-1\-94\-164349\-5\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p2.1)\.
- \[16\] C\. Fellbaum \(1998\-05\) WordNet: An Electronic Lexical Database\. Language, Speech, and Communication, MIT Press, Cambridge, MA, USA\. External Links: ISBN 978\-0\-26\-206197\-1\. Cited by: [§3\.1](https://arxiv.org/html/2606.29068#S3.SS1.SSS0.Px2.p2.1)\.
- \[17\] J\. R\. Firth \(1957\) Studies in Linguistic Analysis\. Blackwell, Oxford, United Kingdom\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p1.1)\.
- \[18\] M\. Günther, S\. Sturua, M\. K\. Akram, I\. Mohr, A\. Ungureanu, B\. Wang, S\. Eslami, S\. Martens, M\. Werk, N\. Wang, and H\. Xiao \(2025\-11\) jina\-embeddings\-v4: Universal Embeddings for Multimodal Multilingual Retrieval\. In Proceedings of the 5th Workshop on Multilingual Representation Learning, Suzhou, China, pp\. 531–550\. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.mrl-main.36), ISBN 979\-8\-89\-176345\-6\. Cited by: [Table 1](https://arxiv.org/html/2606.29068#S3.T1.18.8.5.1.1)\.
- \[19\] Z\. S\. Harris \(1954\-08\) Distributional Structure\. Word 10\(2\-3\), pp\. 146–162\. External Links: [Document](https://dx.doi.org/10.1080/00437956.1954.11659520), ISSN 0043\-7956\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p1.1)\.
- \[20\] M\. Ito and K\. Markov \(2022\-10\) Sentence Embedding Based Emotion Recognition from Text Data\. In Proceedings of the Conference on Research in Adaptive and Convergent Systems, Aizuwakamatsu, Japan, pp\. 53–57\. External Links: [Document](https://dx.doi.org/10.1145/3538641.3561488), ISBN 978\-1\-45\-039398\-0\. Cited by: [§1](https://arxiv.org/html/2606.29068#S1.p1.1)\.
- \[21\] K\. Junseong, L\. Seolhwa, K\. Jihoon, G\. Sangmo, K\. Yejin, C\. Minkyung, S\. Jy\-yong, and C\. Chanyeol \(2024\) Linq\-Embed\-Mistral: Elevating Text Retrieval with Improved GPT Data Through Task\-Specific Control and Quality Refinement\. External Links: [Link](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral/blob/main/LinqAIResearch2024_Linq-Embed-Mistral.pdf)\. Cited by: [Table 1](https://arxiv.org/html/2606.29068#S3.T1.21.11.5.1.1)\.
- \[22\] J\. Lee, W\. Lee, O\. Kwon, and H\. Kim \(2025\-07\) Do Large Language Models Have “Emotion Neurons”? Investigating the Existence and Role\. In Findings of the Association for Computational Linguistics 2025, Vienna, Austria, pp\. 15617–15639\. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.806), ISBN 979\-8\-89\-176256\-5\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p3.1)\.
- \[23\] J\. Lee, F\. Chen, S\. Dua, D\. Cer, M\. Shanbhogue, I\. Naim, G\. H\. Ábrego, Z\. Li, K\. Chen, H\. Schechter Vera, X\. Ren, S\. Zhang, D\. Salz, M\. Boratko, J\. Han, B\. Chen, S\. Huang, V\. Rao, P\. Suganthan, F\. Han, A\. Doumanoglou, N\. Gupta, F\. Moiseev, C\. Yip, A\. Jain, S\. Baumgartner, S\. Shahi, F\. Palma Gomez, S\. Mariserla, M\. Choi, P\. Shah, S\. Goenka, K\. Chen, Y\. Xia, K\. Chen, S\. M\. Karthik Duddu, Y\. Chen, T\. Walker, W\. Zhou, R\. Ghiya, Z\. Gleicher, K\. Gill, Z\. Dong, M\. Seyedhosseini, Y\. Sung, R\. Hoffmann, and T\. Duerig \(2025\) Gemini Embedding: Generalizable Embeddings from Gemini\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2503.07891)\. Cited by: [Table 1](https://arxiv.org/html/2606.29068#S3.T1.26.16.5.1.1)\.
- \[24\] J\. Lee and C\. Kim \(2023\-07\) A Structure of basic emotions: A review of basic emotion theories using an emotionally fine\-tuned language model\. In Proceedings of the 45th Annual Meeting of the Cognitive Science Society, Sydney, Australia, pp\. 509–516\. External Links: ISBN 978\-1\-71\-388579\-5, [Link](https://escholarship.org/uc/item/2zd4f4dk)\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p3.1)\.
- \[25\] L\. I\. Lin \(1989\-03\) A Concordance Correlation Coefficient to Evaluate Reproducibility\. Biometrics 45\(1\), pp\. 255–268\. External Links: [Document](https://dx.doi.org/10.2307/2532051), ISSN 0006\-341X\. Cited by: [§4\.1](https://arxiv.org/html/2606.29068#S4.SS1.p1.5)\.
- \[26\] L\. McInnes, J\. Healy, and J\. Melville \(2018\) UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.1802.03426)\. Cited by: [§4\.3](https://arxiv.org/html/2606.29068#S4.SS3.p2.1)\.
- \[27\] A\. Mehrabian and J\. A\. Russell \(1974\-03\) An Approach to Environmental Psychology\. MIT Press, Cambridge, MA, USA\. External Links: ISBN 978\-0\-26\-213090\-5\. Cited by: [§1](https://arxiv.org/html/2606.29068#S1.p3.1), [§2\.1](https://arxiv.org/html/2606.29068#S2.SS1.p2.1), [item NRC\-VAD](https://arxiv.org/html/2606.29068#S3.I1.ix1.p1.1)\.
- \[28\] T\. Mikolov, K\. Chen, G\. Corrado, and J\. Dean \(2013\) Efficient Estimation of Word Representations in Vector Space\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.1301.3781)\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p1.1)\.
- \[29\] T\. Mikolov, I\. Sutskever, K\. Chen, G\. S\. Corrado, and J\. Dean \(2013\-12\) Distributed Representations of Words and Phrases and their Compositionality\. In 27th Annual Conference on Neural Information Processing Systems, Advances in Neural Information Processing Systems, Vol\. 26, Lake Tahoe, NV, USA, pp\. 3136–3144\. External Links: ISBN 978\-1\-63\-266024\-4, [Link](https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html)\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p1.1)\.
- \[30\] G\. A\. Miller \(1995\-11\) WordNet: A Lexical Database for English\. Communications of the ACM 38\(11\), pp\. 39–41\. External Links: [Document](https://dx.doi.org/10.1145/219717.219748), ISSN 0001\-0782\. Cited by: [§3\.1](https://arxiv.org/html/2606.29068#S3.SS1.SSS0.Px2.p2.1)\.
- \[31\] S\. M\. Mohammad \(2018\-07\) Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 English Words\. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), Melbourne, Australia, pp\. 174–184\. External Links: [Document](https://dx.doi.org/10.18653/v1/P18-1017), ISBN 978\-1\-94\-808732\-2\. Cited by: [item NRC\-VAD](https://arxiv.org/html/2606.29068#S3.I1.ix1.p1.1), [§3](https://arxiv.org/html/2606.29068#S3.p1.1)\.
- \[32\] S\. M\. Mohammad \(2018\-05\) Word Affect Intensities\. In Proceedings of the 11th International Conference on Language Resources and Evaluation, Miyazaki, Japan, pp\. 174–183\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.1704.08798), ISBN 979\-1\-09\-554600\-9\. Cited by: [§1](https://arxiv.org/html/2606.29068#S1.p3.1), [item NRC\-EIL](https://arxiv.org/html/2606.29068#S3.I1.ix2.p1.1), [§3](https://arxiv.org/html/2606.29068#S3.p1.1)\.
- \[33\] S\. M\. Mohammad \(2025\) NRC VAD Lexicon v2: Norms for Valence, Arousal, and Dominance for over 55k English Terms\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2503.23547)\. Cited by: [§1](https://arxiv.org/html/2606.29068#S1.p3.1), [item NRC\-VAD](https://arxiv.org/html/2606.29068#S3.I1.ix1.p1.1), [§3](https://arxiv.org/html/2606.29068#S3.p1.1)\.
- \[34\] N\. Mrkšić, D\. Ó Séaghdha, B\. Thomson, M\. Gašić, L\. M\. Rojas\-Barahona, P\. Su, D\. Vandyke, T\. Wen, and S\. Young \(2016\-06\) Counter\-fitting Word Vectors to Linguistic Constraints\. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, pp\. 142–148\. External Links: [Document](https://dx.doi.org/10.18653/v1/N16-1018), ISBN 978\-1\-94\-164391\-4\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p2.1)\.
- \[35\] N\. Muennighoff, N\. Tazi, L\. Magne, and N\. Reimers \(2023\-05\) MTEB: Massive Text Embedding Benchmark\. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, pp\. 2014–2037\. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.eacl-main.148), ISBN 978\-1\-95\-942944\-9\. Cited by: [§3\.2](https://arxiv.org/html/2606.29068#S3.SS2.p1.1)\.
- \[36\] J\. Ni, G\. H\. Ábrego, N\. Constant, J\. Ma, K\. B\. Hall, D\. Cer, and Y\. Yang \(2021\) Sentence\-T5: Scalable Sentence Encoders from Pre\-trained Text\-to\-Text Models\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2108.08877)\. Cited by: [Table 1](https://arxiv.org/html/2606.29068#S3.T1.14.4.5.1.1)\.
- \[37\] Z\. Nussbaum and B\. Duderstadt \(2025\) Training Sparse Mixture Of Experts Text Embedding Models\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2502.07972)\. Cited by: [Table 1](https://arxiv.org/html/2606.29068#S3.T1.16.6.5.1.1)\.
- \[38\] S\. Park, J\. Kim, S\. Ye, J\. Jeon, H\. Y\. Park, and A\. Oh \(2021\-11\) Dimensional Emotion Detection from Categorical Emotion\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, pp\. 4367–4380\. External Links: [Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.358), ISBN 978\-1\-95\-591709\-4\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p4.1)\.
- \[39\] J\. Pennington, R\. Socher, and C\. Manning \(2014\-10\) GloVe: Global Vectors for Word Representation\. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, pp\. 1532–1543\. External Links: [Document](https://dx.doi.org/10.3115/v1/D14-1162), ISBN 978\-1\-93\-728496\-1\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p1.1)\.
- \[40\] R\. Plutchik \(1980\) A General Psychoevolutionary Theory of Emotion\. In Emotion: Theory, Research, and Experience, Volume 1: Theories of Emotion, pp\. 3–33\. External Links: [Document](https://dx.doi.org/10.1016/B978-0-12-558701-3.50007-7), ISBN 978\-0\-12\-558701\-3\. Cited by: [§1](https://arxiv.org/html/2606.29068#S1.p3.1), [§2\.1](https://arxiv.org/html/2606.29068#S2.SS1.p3.1), [item NRC\-EIL](https://arxiv.org/html/2606.29068#S3.I1.ix2.p1.1)\.
- \[41\] B\. Reichman, A\. Avsian, and L\. Heck \(2025\-10\) Emotions Where Art Thou: Understanding and Characterizing the Emotional Latent Space of Large Language Models\. In COLM 2025 1st Workshop on the Interplay of Model Behavior and Model Internals, Montreal, Canada\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2510.22042)\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p3.1)\.
- \[42\] N\. Reimers and I\. Gurevych \(2019\-11\) Sentence\-BERT: Sentence Embeddings using Siamese BERT\-Networks\. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, pp\. 3980–3990\. External Links: [Document](https://dx.doi.org/10.18653/v1/D19-1410), ISBN 978\-1\-95\-073790\-1\. Cited by: [§1](https://arxiv.org/html/2606.29068#S1.p1.1)\.
- \[43\] J\. A\. Russell \(1980\-12\) A Circumplex Model of Affect\. Journal of Personality and Social Psychology 39\(6\), pp\. 1161–1178\. External Links: [Document](https://dx.doi.org/10.1037/h0077714), ISSN 0022\-3514\. Cited by: [§2\.1](https://arxiv.org/html/2606.29068#S2.SS1.p2.1)\.
- \[44\] A\. Seyeditabari, N\. Tabari, S\. Gholizade, and W\. Zadrozny \(2019\) Emotional Embeddings: Refining Word Embeddings to Capture Emotional Content of Words\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.1906.00112)\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p2.1)\.
- \[45\] A\. Seyeditabari and W\. Zadrozny \(2017\-05\) Can Word Embeddings Help Find Latent Emotions in Text? Preliminary Results\. In Proceedings of the 30th International Florida Artificial Intelligence Research Society Conference, Marco Island, FL, USA, pp\. 206–209\. External Links: ISBN 978\-1\-57\-735787\-2, [Link](https://aaai.org/papers/206-flairs-2017-15516/)\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p2.1)\.
- \[46\] S\. Shah, S\. Reddy, and P\. Bhattacharyya \(2023\-12\) Retrofitting Light\-weight Language Models for Emotions using Supervised Contrastive Learning\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, pp\. 3640–3654\. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.222), ISBN 979\-8\-89\-176060\-8\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p2.1)\.
- \[47\] H\. Su, W\. Shi, J\. Kasai, Y\. Wang, Y\. Hu, M\. Ostendorf, W\. Yih, N\. A\. Smith, L\. Zettlemoyer, and T\. Yu \(2023\-07\) One Embedder, Any Task: Instruction\-Finetuned Text Embeddings\. In Findings of the Association for Computational Linguistics 2023, Toronto, Canada, pp\. 1102–1121\. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.71), ISBN 978\-1\-95\-942962\-3\. Cited by: [§3\.2](https://arxiv.org/html/2606.29068#S3.SS2.p1.1)\.
- \[48\] V\. Suresh and D\. C\. Ong \(2021\-09\) Using Knowledge\-Embedded Attention to Augment Pre\-trained Language Models for Fine\-Grained Emotion Recognition\. In 9th International Conference on Affective Computing and Intelligent Interaction, Nara, Japan\. External Links: [Document](https://dx.doi.org/10.1109/ACII52823.2021.9597390), ISBN 978\-1\-66\-540019\-0\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p2.1)\.
- \[49\] D\. Tang, F\. Wei, B\. Qin, N\. Yang, T\. Liu, and M\. Zhou \(2016\-02\) Sentiment Embeddings with Applications to Sentiment Analysis\. IEEE Transactions on Knowledge and Data Engineering 28\(2\), pp\. 496–509\. External Links: [Document](https://dx.doi.org/10.1109/TKDE.2015.2489653), ISSN 1041\-4347\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p2.1)\.
- \[50\] V\. A\. Traag, L\. Waltman, and N\. J\. Van Eck \(2019\-03\) From Louvain to Leiden: guaranteeing well\-connected communities\. Scientific Reports 9\(5233\)\. External Links: [Document](https://dx.doi.org/10.1038/s41598-019-41695-z), ISSN 2045\-2322\. Cited by: [§3\.1](https://arxiv.org/html/2606.29068#S3.SS1.SSS0.Px2.p2.1)\.
- \[51\] H\. Vera Schechter, S\. Dua, B\. Zhang, D\. Salz, R\. Mullins, S\. R\. Panyam, S\. Smoot, I\. Naim, J\. Zou, F\. Chen, D\. Cer, A\. Lisak, M\. Choi, L\. Gonzalez, O\. Sanseviero, G\. Cameron, I\. Ballantyne, K\. Black, K\. Chen, W\. Wang, Z\. Li, G\. Martins, J\. Lee, M\. Sherwood, J\. Ji, R\. Wu, J\. Zheng, J\. Singh, A\. Sharma, D\. Sreepathihalli, A\. Jain, A\. Elarabawy, A\. J\. Co, A\. Doumanoglou, B\. Samari, B\. Hora, B\. Potetz, D\. Kim, E\. Alfonseca, F\. Moiseev, F\. Han, F\. Palma Gomez, G\. H\. Ábrego, H\. Zhang, H\. Hui, J\. Han, K\. Gill, K\. Chen, K\. Chen, M\. Shanbhogue, M\. Boratko, P\. Suganthan, S\. M\. Karthik Duddu, S\. Mariserla, S\. Ariafar, S\. Zhang, S\. Zhang, S\. Baumgartner, S\. Goenka, S\. Qiu, T\. Dabral, T\. Walker, V\. Rao, W\. Khawaja, W\. Zhou, X\. Ren, Y\. Xia, Y\. Chen, Y\. Chen, Z\. Dong, Z\. Ding, F\. Visin, G\. Liu, J\. Zhang, K\. Kenealy, M\. Casbon, R\. Kumar, T\. Mesnard, Z\. Gleicher, C\. Brick, O\. Lacombe, A\. Roberts, Q\. Yin, Y\. Sung, R\. Hoffmann, T\. Warkentin, A\. Joulin, T\. Duerig, and M\. Seyedhosseini \(2025\) EmbeddingGemma: Powerful and Lightweight Text Representations\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2509.20354)\. Cited by: [Table 1](https://arxiv.org/html/2606.29068#S3.T1.15.5.5.1.1)\.
- \[52\] L\. Wang, N\. Yang, X\. Huang, L\. Yang, R\. Majumder, and F\. Wei \(2024\) Multilingual E5 Text Embeddings: A Technical Report\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2402.05672)\. Cited by: [Table 1](https://arxiv.org/html/2606.29068#S3.T1.17.7.5.1.1)\.
- \[53\] X\. Wang, X\. Li, Z\. Yin, Y\. Wu, and J\. Liu \(2023\-01\) Emotional intelligence of Large Language Models\. Journal of Pacific Rim Psychology 17\. External Links: [Document](https://dx.doi.org/10.1177/18344909231213958), ISSN 1834\-4909\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p3.1)\.
- \[54\] Z\. Wu and M\. Palmer \(1994\-06\) Verb Semantics and Lexical Selection\. In 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, USA, pp\. 133–138\. External Links: [Document](https://dx.doi.org/10.3115/981732.981751)\. Cited by: [§3\.1](https://arxiv.org/html/2606.29068#S3.SS1.SSS0.Px2.p2.1)\.
- \[55\] L\. Yu, J\. Wang, K\. R\. Lai, and X\. Zhang \(2017\-09\) Refining Word Embeddings for Sentiment Analysis\. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp\. 534–539\. External Links: [Document](https://dx.doi.org/10.18653/v1/D17-1056), ISBN 978\-1\-94\-562683\-8\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p2.1)\.
- \[56\] Y\. Zhang, M\. Li, D\. Long, X\. Zhang, H\. Lin, B\. Yang, P\. Xie, A\. Yang, D\. Liu, J\. Lin, F\. Huang, and J\. Zhou \(2025\) Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2506.05176)\. Cited by: [Table 1](https://arxiv.org/html/2606.29068#S3.T1.23.13.5.1.1)\.
- \[57\] B\. Zhao, M\. Okawa, E\. J\. Bigelow, R\. Yu, T\. D\. Ullman, and H\. Tanaka \(2024\-12\) Emergence of Hierarchical Emotion Representations in Large Language Models\. In NeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning, Vancouver, Canada\. External Links: [Link](https://neurips.cc/virtual/2024/99244)\. Cited by: [§2\.2](https://arxiv.org/html/2606.29068#S2.SS2.p3.1)\.
- \[58\] X\. Zhao, X\. Hu, Z\. Shan, S\. Huang, Y\. Zhou, X\. Zhang, Z\. Sun, Z\. Liu, D\. Li, X\. Wei, Y\. Pan, Y\. Xiang, M\. Zhang, H\. Wang, J\. Yu, B\. Hu, and M\. Zhang \(2025\) KaLM\-Embedding\-v2: Superior Training Techniques and Data Inspire A Versatile Embedding Model\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2506.20923)\. Cited by: [§1](https://arxiv.org/html/2606.29068#S1.p1.1), [Table 1](https://arxiv.org/html/2606.29068#S3.T1.19.9.5.1.1)\.
## Appendix 0\.A Morphological split
For NRC\-VAD and NRC\-EIL, along with the semantics\-aware splitting strategy, we also applied a simpler approach to prevent morphological leakage between the five cross\-validation folds and the holdout test set\. Comparing the results of the two strategies shows the impact of allowing semantic information to be exchanged across splits\.
The morphological split was created by applying stop word removal and Snowball stemming, and assigning words with the same stem to the same group\. All words in a group must be in the same split\. This is accomplished by the greedy algorithm in [Section 3\.1](https://arxiv.org/html/2606.29068#S3.SS1.SSS0.Px2)\.
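The grouping constraint can be illustrated with the sketch below\. It rests on several assumptions: `toy_stem` is a hypothetical stand\-in for a Snowball stemmer \(in practice, one could pass the `stem` method of NLTK's `SnowballStemmer`\), stop word removal is omitted, and the greedy rule shown here \(largest group first, into the currently smallest split\) is only one plausible choice, not necessarily the algorithm of Section 3\.1\.

```python
from collections import defaultdict


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a Snowball stemmer: strips a few suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def grouped_split(words, n_splits, stem=toy_stem):
    """Assign whole stem groups to splits so that no stem spans two splits."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    splits = [[] for _ in range(n_splits)]
    # Greedy balancing (an assumption): place the largest groups first,
    # always into the split that currently holds the fewest words.
    for group in sorted(groups.values(), key=len, reverse=True):
        min(splits, key=len).extend(group)
    return splits


words = ["fear", "fears", "feared", "joy", "joys", "anger", "angered", "calm"]
splits = grouped_split(words, n_splits=3)
```

Keeping each stem group intact means that inflected variants such as "fear" and "feared" can never appear on both sides of a train/test boundary\.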
## Appendix 0\.B Prompts
In [Appendix 0\.B](https://arxiv.org/html/2606.29068#Pt0.A2), we detail a subset of the prompts used to instantiate the analyzed task\-tuned and instruction\-aware text encoders\. As already mentioned in [Section 3\.2](https://arxiv.org/html/2606.29068#S3.SS2), these prompts specify the downstream problem under examination, i\.e\., regression or classification, taking inspiration from the recommended prompts of each model card\.
To be precise, for regression, we slightly modified default prompts tailored to either semantic text similarity \(STS\) or retrieval if no STS option was available\.
Listing: Configured prompts for EmbeddingGemma, KaLM v2, and Linq Mistral\. The other text encoders are set up analogously, with prompts similar to those in this sample list\. prompt\_name and prompt are arguments passed to the model instantiation via the Hugging Face API in Python\.
\{
  "google@embeddinggemma\-300m": \{
    "regression\_configs": \{
      "prompt\_name": "STS"
    \},
    "classification\_configs": \{
      "prompt\_name": "Classification"
    \}
  \},
  "tencent@KaLM\-Embedding\-Gemma3\-12B\-2511": \{
    "regression\_configs": \{
      "prompt": "Instruct: Retrieving emotion\.\\nQuery:"
    \},
    "classification\_configs": \{
      "prompt": "Instruct: Classifying emotion\.\\nQuery:"
    \}
  \},
  "Linq\-AI\-Research@Linq\-Embed\-Mistral": \{
    "regression\_configs": \{
      "prompt": "Instruct: Retrieve the emotion expressed in the text\\nQuery:"
    \},
    "classification\_configs": \{
      "prompt": "Instruct: Classify the emotion expressed in the text\\nQuery:"
    \}
  \}
\}
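As a hedged illustration of how such a configuration could be consumed, the sketch below selects the prompt arguments for a given encoder and task. The helper `prompt_kwargs` is hypothetical, and the prompt strings are written with restored word spacing, which is an assumption about the original listing. The comment at the end assumes the Sentence Transformers library, whose `encode` method accepts `prompt_name` and `prompt`; the actual loading code of the study is not shown in the paper.

```python
import json

# Excerpt of the prompt configuration (word spacing restored by assumption).
PROMPT_CONFIG = json.loads(r"""
{
  "google@embeddinggemma-300m": {
    "regression_configs": {"prompt_name": "STS"},
    "classification_configs": {"prompt_name": "Classification"}
  },
  "Linq-AI-Research@Linq-Embed-Mistral": {
    "regression_configs": {
      "prompt": "Instruct: Retrieve the emotion expressed in the text\nQuery:"
    },
    "classification_configs": {
      "prompt": "Instruct: Classify the emotion expressed in the text\nQuery:"
    }
  }
}
""")

ALLOWED_KEYS = {"prompt_name", "prompt"}


def prompt_kwargs(config: dict, model_key: str, task: str) -> dict:
    """Return the prompt arguments for one encoder and one task.

    `task` is either "regression" or "classification".
    """
    entry = config[model_key][f"{task}_configs"]
    unknown = set(entry) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected prompt arguments: {sorted(unknown)}")
    return dict(entry)


# Possible use with Sentence Transformers (assumption, not run here):
#   model = SentenceTransformer("google/embeddinggemma-300m")
#   kwargs = prompt_kwargs(PROMPT_CONFIG, "google@embeddinggemma-300m", "regression")
#   embeddings = model.encode(texts, **kwargs)
```

Keeping the choice between a named prompt and a literal prompt string in the configuration lets the same encoding code serve both kinds of encoders.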
## Appendix 0\.C Hyperparameter tuning
Hyperparameters were tuned for each predictive model \(4\), dataset and splitting strategy \(5\), and encoder \(12\), leading to $4 \times 5 \times 12 = 240$ experiments in total\. We used Optuna \[[1](https://arxiv.org/html/2606.29068#bib.bib1)\] for Bayesian optimization\.
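The experiment count can be reproduced by enumerating the grid, as in the short sketch below. The predictor names and the five dataset and split combinations follow the table captions of this appendix; the encoder identifiers are placeholders, since the full list of twelve encoders is given in Table 1 rather than here.

```python
from itertools import product

PREDICTORS = ["LR", "k-NN", "XGB", "MLP"]

# Dataset and splitting-strategy combinations; GoEmotions has a single split.
DATA_SETUPS = [
    ("NRC-VAD", "semantic"),
    ("NRC-VAD", "morphological"),
    ("NRC-EIL", "semantic"),
    ("NRC-EIL", "morphological"),
    ("GoEmotions", None),
]

# Placeholder identifiers standing in for the twelve encoders (assumption).
ENCODERS = [f"encoder_{i:02d}" for i in range(12)]

# One tuning experiment per (predictor, data setup, encoder) triple.
EXPERIMENTS = list(product(PREDICTORS, DATA_SETUPS, ENCODERS))
```

Each triple in `EXPERIMENTS` would correspond to one independent tuning study.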
The set of tunable hyperparameters excluded those that we expected to influence the performance of a given predictor uniformly\. For instance, the batch size of the MLP was kept fixed, as it mainly affects training efficiency and is unlikely to influence how well affective cues are extracted from text embeddings\. The number of optimization runs was determined heuristically from the complexity of the search space under examination \(e\.g\., number of hyperparameters, continuous or categorical variables\), while ensuring that all predictive models had sufficient chances to find their optimal configuration\.
The final hyperparameters are listed in [Table 5](https://arxiv.org/html/2606.29068#Pt0.A4.T5) to [Table 24](https://arxiv.org/html/2606.29068#Pt0.A4.T24)\. Most of them are self\-explanatory, except for the complexity parameter of the MLP, which controls the network architecture in terms of the number and size of its hidden layers\. It is a categorical variable with twelve possible values \(\#0 to \#11\), each corresponding to a predefined configuration\. Levels \#0 to \#4 use a single hidden layer \(e\.g\., 64 neurons for \#0 and 1024 for \#4\), levels \#5 to \#8 use two hidden layers, and levels \#9 to \#11 use three\.
## Appendix 0\.D Full reports
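The mapping from complexity level to network depth can be sketched as follows. Only the depth ranges and the widths of levels #0 (64) and #4 (1024) come from the text; the intermediate single-layer widths and the `deep_width` used for multi-layer levels are hypothetical placeholders, not the configurations used in the study.

```python
def hidden_depth(level: int) -> int:
    """Number of hidden layers for a complexity level in 0..11."""
    if not 0 <= level <= 11:
        raise ValueError("complexity level must be between 0 and 11")
    if level <= 4:
        return 1  # levels #0 to #4: one hidden layer
    if level <= 8:
        return 2  # levels #5 to #8: two hidden layers
    return 3      # levels #9 to #11: three hidden layers


# Only 64 (#0) and 1024 (#4) are stated in the text; the doubling in between
# is an assumption made for illustration.
SINGLE_LAYER_WIDTHS = (64, 128, 256, 512, 1024)


def hidden_layers(level: int, deep_width: int = 256) -> tuple:
    """Return hidden layer sizes; multi-layer widths are hypothetical."""
    depth = hidden_depth(level)
    if depth == 1:
        return (SINGLE_LAYER_WIDTHS[level],)
    return (deep_width,) * depth
```

Treating the architecture as one categorical choice keeps the search space small compared with tuning depth and per-layer width separately.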
As supplementary documentation, we attach more detailed quantitative results\. For NRC\-VAD \(cf\. [Table 27](https://arxiv.org/html/2606.29068#Pt0.A4.T27)\) and NRC\-EIL \(cf\. [Table 30](https://arxiv.org/html/2606.29068#Pt0.A4.T30)\), the outcomes with the morphology\-aware splitting strategy are added\. For GoEmotions \(cf\. [Table 33](https://arxiv.org/html/2606.29068#Pt0.A4.T33)\), we include weighted\- and micro\-averaged metrics as auxiliary measures for a more complete view\.
Furthermore, we report the mean and standard deviation of the cross\-validation scores of the training procedure to give insights into the variability across folds \(cf\. [Tables 25](https://arxiv.org/html/2606.29068#Pt0.A4.T25), [26](https://arxiv.org/html/2606.29068#Pt0.A4.T26), [28](https://arxiv.org/html/2606.29068#Pt0.A4.T28), [29](https://arxiv.org/html/2606.29068#Pt0.A4.T29), [31](https://arxiv.org/html/2606.29068#Pt0.A4.T31) and [32](https://arxiv.org/html/2606.29068#Pt0.A4.T32)\)\.
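To make the difference between the three averaging schemes reported for GoEmotions concrete, the following sketch computes macro-, weighted-, and micro-averaged F1 from per-class counts. It uses the standard definitions; the function names and the toy counts in the usage note are illustrative and are not taken from the paper's results.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(counts: dict) -> dict:
    """counts maps each label to a (tp, fp, fn) tuple."""
    per_class = {label: f1(*c) for label, c in counts.items()}
    # Support of a class = number of true instances = tp + fn.
    support = {label: tp + fn for label, (tp, fp, fn) in counts.items()}
    total_support = sum(support.values())

    # Macro: unweighted mean of per-class scores.
    macro = sum(per_class.values()) / len(per_class)
    # Weighted: per-class scores weighted by class support.
    weighted = (
        sum(per_class[label] * support[label] for label in counts) / total_support
        if total_support
        else 0.0
    )
    # Micro: pool the counts over all classes, then compute a single F1.
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    micro = f1(tp, fp, fn)
    return {"macro": macro, "weighted": weighted, "micro": micro}
```

For example, `averaged_f1({"joy": (8, 2, 2), "anger": (1, 1, 3)})` gives a lower macro than micro score, because the rare, poorly predicted class counts as much as the frequent one under macro averaging.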
Table 5: Summary of the best hyperparameters for LR on NRC\-VAD, split with the semantics\-aware strategy\.

Table 6: Summary of the best hyperparameters for LR on NRC\-VAD, split with the morphology\-aware strategy\.

Table 7: Summary of the best hyperparameters for LR on NRC\-EIL, split with the semantics\-aware strategy\.

Table 8: Summary of the best hyperparameters for LR on NRC\-EIL, split with the morphology\-aware strategy\.

Table 9: Summary of the best hyperparameters for LR on GoEmotions\.

Table 10: Summary of the best hyperparameters for $k$\-NN on NRC\-VAD, split with the semantics\-aware strategy\.

Table 11: Summary of the best hyperparameters for $k$\-NN on NRC\-VAD, split with the morphology\-aware strategy\.

Table 12: Summary of the best hyperparameters for $k$\-NN on NRC\-EIL, split with the semantics\-aware strategy\.

Table 13: Summary of the best hyperparameters for $k$\-NN on NRC\-EIL, split with the morphology\-aware strategy\.

Table 14: Summary of the best hyperparameters for $k$\-NN on GoEmotions\.

Table 15: Summary of the best hyperparameters for XGB on NRC\-VAD, split with the semantics\-aware strategy\.

Table 16: Summary of the best hyperparameters for XGB on NRC\-VAD, split with the morphology\-aware strategy\.

Table 17: Summary of the best hyperparameters for XGB on NRC\-EIL, split with the semantics\-aware strategy\.

Table 18: Summary of the best hyperparameters for XGB on NRC\-EIL, split with the morphology\-aware strategy\.

Table 19: Summary of the best hyperparameters for XGB on GoEmotions\.

Table 20: Summary of the best hyperparameters for MLP on NRC\-VAD, split with the semantics\-aware strategy\.

Table 21: Summary of the best hyperparameters for MLP on NRC\-VAD, split with the morphology\-aware strategy\.

Table 22: Summary of the best hyperparameters for MLP on NRC\-EIL, split with the semantics\-aware strategy\.

Table 23: Summary of the best hyperparameters for MLP on NRC\-EIL, split with the morphology\-aware strategy\.

Table 24: Summary of the best hyperparameters for MLP on GoEmotions\.

Table 25: Full summary of the 5\-fold cross\-validation scores for LR and $k$\-NN on NRC\-VAD\. For each encoder, top and bottom rows refer to data splits based on semantic and morphological leakage prevention, respectively\.

Table 26: Full summary of the 5\-fold cross\-validation scores for XGB and MLP on NRC\-VAD\. For each encoder, top and bottom rows refer to data splits based on semantic and morphological leakage prevention, respectively\.

Table 27: Full summary of the regression metrics at test time for NRC\-VAD, sorted by $R^2$ score of the MLP model in descending order\. For each encoder, top and bottom rows refer to data splits based on semantic and morphological leakage prevention, respectively\.

Table 28: Full summary of the 5\-fold cross\-validation scores for LR and $k$\-NN on NRC\-EIL\. For each encoder, top and bottom rows refer to data splits based on semantic and morphological leakage prevention, respectively\.

Table 29: Full summary of the 5\-fold cross\-validation scores for XGB and MLP on NRC\-EIL\. For each encoder, top and bottom rows refer to data splits based on semantic and morphological leakage prevention, respectively\.

Table 30: Full summary of the regression metrics at test time for NRC\-EIL, sorted by $R^2$ score of the MLP model in descending order\. For each encoder, top and bottom rows refer to data splits based on semantic and morphological leakage prevention, respectively\.

Table 31: Full summary of the 5\-fold cross\-validation scores for LR and $k$\-NN on GoEmotions\. For each encoder, its rows refer to macro, weighted, and micro averages, respectively\.

Table 32: Full summary of the 5\-fold cross\-validation scores for XGB and MLP on GoEmotions\. For each encoder, its rows refer to macro, weighted, and micro averages, respectively\.

Table 33: Full summary of the classification metrics at test time for GoEmotions, sorted by $F_1$ score of the MLP model in descending order\.
For each encoder, its rows refer to macro, weighted, and micro averages, respectively\.