Predict and Reconstruct: Joint Objectives for Self-Supervised Language Representation Learning
Summary
The paper proposes a hybrid pre-training objective combining JEPA latent-space prediction with MLM reconstruction for language models, showing improved embedding uniformity and semantic-lexical balance.
# Joint Objectives for Self-Supervised Language Representation Learning

Source: [https://arxiv.org/html/2606.05173](https://arxiv.org/html/2606.05173)

Aimen Boukhari, École Nationale Supérieure d’Informatique (ESI), Algiers, Algeria, mn_boukhari@esi.dz

###### Abstract

Masked language modelling (MLM) has been the dominant pre-training objective for text encoders since BERT, yet it encourages representations that are strongly anchored to surface-form token identity rather than deeper semantic structure. Inspired by the success of Joint Embedding Predictive Architectures (JEPA) (LeCun, [2022](https://arxiv.org/html/2606.05173#bib.bib21)) in vision and audio, we propose a hybrid pre-training objective that combines a JEPA-style latent-space prediction loss with an MLM reconstruction loss over a single shared encoder. A learnable scalar $\lambda$ continuously balances the two objectives during training. We pre-train both a hybrid model and a pure-MLM baseline on English Wikipedia using identical architectures and compute budgets (NVIDIA H100). Extensive representation analysis across five GLUE benchmarks (SST-2, MRPC, MNLI, CoLA, STS-B) using four pooling strategies reveals that the hybrid encoder produces significantly more uniform embeddings (uniformity $\leq -0.16$ vs. $-0.05$ for MLM), exhibits richer spectral geometry under max pooling, encodes less surface-level lexical information, and achieves a better semantic-to-lexical balance. Despite similar linear-probe downstream accuracy, the geometric differences are consistent and significant, suggesting that the JEPA predictive objective reshapes the latent space in ways that standard accuracy metrics alone cannot capture.
Code and checkpoints: [https://github.com/aymen-000/predict-reconstruct-language-models](https://github.com/aymen-000/predict-reconstruct-language-models)

## 1 Introduction

Self-supervised learning (SSL) has transformed representation learning across modalities. In computer vision, contrastive methods such as SimCLR (Chen et al., [2020](https://arxiv.org/html/2606.05173#bib.bib8)) and MoCo (He et al., [2020](https://arxiv.org/html/2606.05173#bib.bib18)), followed by non-contrastive approaches like BYOL (Grill et al., [2020](https://arxiv.org/html/2606.05173#bib.bib16)) and VICReg (Bardes et al., [2022](https://arxiv.org/html/2606.05173#bib.bib4)), demonstrated that powerful visual features can be learned without labels. In speech and audio, wav2vec 2.0 (Baevski et al., [2020](https://arxiv.org/html/2606.05173#bib.bib2)) and data2vec (Baevski et al., [2022](https://arxiv.org/html/2606.05173#bib.bib3)) showed that masked prediction in latent space generalises across modalities. These successes share a common principle: rather than reconstructing pixels or waveforms, the model learns to *predict abstract representations* of masked or future content.

LeCun ([2022](https://arxiv.org/html/2606.05173#bib.bib21)) formalised this intuition, arguing that reconstruction-based objectives are fundamentally limited because they force the model to allocate capacity to irrelevant low-level details. The Joint Embedding Predictive Architecture (JEPA) avoids this by having a predictor network match the *target encoder’s latent representation* of the masked region, never reconstructing the input in pixel or token space. The image instantiation I-JEPA (Assran et al., [2023](https://arxiv.org/html/2606.05173#bib.bib1)) demonstrated on ImageNet that JEPA pre-training produces representations that generalise better with fewer labelled examples than masked autoencoders, and that learned features are more semantically structured as evidenced by linear probing.
In natural language processing, BERT (Devlin et al., [2019](https://arxiv.org/html/2606.05173#bib.bib11)) established MLM as the standard SSL objective. While highly effective, MLM operates in token space: the model must predict the exact identity of masked tokens, incentivising the encoder to retain fine-grained lexical information at the expense of broader semantic structure. Several analyses have confirmed that BERT representations are highly contextualised yet remain sensitive to surface-form variation (Ethayarajh, [2019](https://arxiv.org/html/2606.05173#bib.bib12); Rogers et al., [2020](https://arxiv.org/html/2606.05173#bib.bib22)). SimCSE (Gao et al., [2021](https://arxiv.org/html/2606.05173#bib.bib13)) and DeCLUTR (Giorgi et al., [2021](https://arxiv.org/html/2606.05173#bib.bib15)) improved the *uniformity* of BERT-derived sentence embeddings through contrastive fine-tuning, confirming that the MLM objective alone does not fully exploit the embedding hypersphere.

A natural question therefore arises: can a JEPA-style latent prediction objective be combined with MLM to produce text encoders that encode semantics more robustly? A recent independent line of work, LLM-JEPA (Huang et al., [2025](https://arxiv.org/html/2606.05173#bib.bib17)), explores applying JEPA principles to autoregressive language models. Our work is complementary: we study the *representation geometry* of small-to-medium encoders trained under hybrid versus pure-MLM objectives, providing the first systematic analysis using alignment/uniformity metrics, eigenspectrum analysis, effective rank, and probing classifiers across multiple GLUE tasks.

##### Contributions.

1. We propose a hybrid pre-training architecture that jointly optimises a JEPA cosine prediction loss and an MLM cross-entropy loss through a single shared encoder with a learnable balance weight $\lambda$.
2. We conduct the first systematic geometric analysis of hybrid vs. MLM-only text encoders across five GLUE datasets and four pooling strategies using six complementary representation metrics.
3. We show that hybrid training consistently improves embedding uniformity and spectral richness while reducing surface-form bias, even under a small pre-training budget.
4. We release code and model checkpoints to facilitate reproduction and extension of this analysis.

## 2 Background and Related Work

### 2.1 Reconstruction-Based SSL and Its Limitations

Masked language modelling, introduced in BERT (Devlin et al., [2019](https://arxiv.org/html/2606.05173#bib.bib11)), predicts masked tokens from context. This objective is effective but carries an implicit bias: the model must memorise token-level statistics to recover the correct token identity, encouraging representations to retain lexical surface form rather than semantic content. LeCun ([2022](https://arxiv.org/html/2606.05173#bib.bib21)) identifies reconstruction objectives as fundamentally misaligned with the goal of learning abstract world models: predicting every detail of the input wastes model capacity on unpredictable or irrelevant information. In vision, masked autoencoders (MAE) (He et al., [2022](https://arxiv.org/html/2606.05173#bib.bib19)) achieve strong results but require fine-tuning to match JEPA-style methods on linear evaluation (Assran et al., [2023](https://arxiv.org/html/2606.05173#bib.bib1)), consistent with the hypothesis that pixel reconstruction does not optimally produce semantic features.

### 2.2 JEPA and Latent Prediction

I-JEPA (Assran et al., [2023](https://arxiv.org/html/2606.05173#bib.bib1)) instantiates LeCun’s JEPA framework for images: a context encoder processes visible patches, a predictor maps context representations to target representations, and the target encoder (updated via exponential moving average, EMA) produces representations of masked regions that the predictor must match.
The loss is computed in representation space rather than pixel space, avoiding the reconstruction trap. V-JEPA (Bardes et al., [2024](https://arxiv.org/html/2606.05173#bib.bib7)) extends this to video and MC-JEPA (Bardes et al., [2023](https://arxiv.org/html/2606.05173#bib.bib5)) adds motion consistency. LLM-JEPA (Huang et al., [2025](https://arxiv.org/html/2606.05173#bib.bib17)) recently proposed adapting JEPA to autoregressive language models. Our work differs in focus: we study encoder representation geometry rather than downstream generation quality, and provide direct comparison with a controlled MLM baseline.

### 2.3 Representation Quality Metrics

Wang and Isola ([2020](https://arxiv.org/html/2606.05173#bib.bib25)) introduced *alignment* and *uniformity* for evaluating sentence embeddings on the $\ell_2$-normalised hypersphere. Roy and Vetterli ([2007](https://arxiv.org/html/2606.05173#bib.bib23)) defined the *effective rank* as the exponential of the Shannon entropy of the normalised singular-value distribution. Vershynin ([2018](https://arxiv.org/html/2606.05173#bib.bib24)) introduces the *stable rank* as a robust alternative. Garrido et al. ([2023](https://arxiv.org/html/2606.05173#bib.bib14)) used eigenspectrum analysis to compare contrastive and non-contrastive SSL objectives. Conneau and Kiela ([2018](https://arxiv.org/html/2606.05173#bib.bib9)) established the probing task framework for analysing what linguistic information is encoded in sentence representations, and Ethayarajh ([2019](https://arxiv.org/html/2606.05173#bib.bib12)) applied contextuality analysis to BERT representations.

## 3 Method

### 3.1 Architecture Overview

Our architecture consists of three components: a shared encoder $f_\theta$, a predictor $g_\phi$, and a target encoder $\bar{f}_\theta$ updated via EMA. The shared encoder processes input tokens and produces contextualised representations used for both the JEPA and MLM objectives.
A lightweight token regressor $h_\psi$ maps encoder outputs to vocabulary logits for the MLM branch. The overall architecture is illustrated in Figure [1](https://arxiv.org/html/2606.05173#S3.F1).

Figure 1: Hybrid Predict-and-Reconstruct architecture. The shared encoder produces contextual representations used by two branches: a predictor for the JEPA objective and a token regression head for the MLM objective. The target encoder is updated using exponential moving average (EMA) and provides stable targets for representation prediction.

### 3.2 Hybrid Pre-training Objective

Given a token sequence $\mathbf{x}$ of length $L$, we apply two distinct masking operations.

##### Block masking (JEPA branch).

Following I-JEPA (Assran et al., [2023](https://arxiv.org/html/2606.05173#bib.bib1)), we sample contiguous block masks. An encoder mask $\mathcal{M}_{\text{enc}}$ (scale $[0.65, 0.85]$) defines the visible context; prediction masks $\{\mathcal{M}_{\text{pred}}^{k}\}_{k=1}^{2}$ (scale $[0.10, 0.25]$) define the target regions.

##### BERT masking (MLM branch).

We apply standard BERT-style masking with probability $p = 0.15$: 80% of selected positions receive the `[MASK]` token, 10% are replaced by a random token, and 10% are left unchanged.

##### Forward pass.
$$
\begin{aligned}
\mathbf{z} &= f_\theta(\tilde{\mathbf{x}},\, \mathcal{M}_{\text{enc}}) && (1)\\
\hat{\mathbf{h}} &= g_\phi(\mathbf{z},\, \mathcal{M}_{\text{enc}},\, \mathcal{M}_{\text{pred}}) && (2)\\
\mathbf{h} &= \bar{f}_\theta(\mathbf{x}) \quad \text{(no grad)} && (3)\\
\mathbf{z}_{\text{full}} &= f_\theta(\tilde{\mathbf{x}}) && (4)
\end{aligned}
$$

where $\tilde{\mathbf{x}}$ denotes the BERT-masked token sequence, $\mathbf{h}$ is the target representation, and $\mathbf{z}_{\text{full}}$ is the full-sequence latent used for token regression.

##### Loss functions.

$$
\begin{aligned}
\mathcal{L}_{\text{JEPA}} &= 1 - \frac{1}{|\mathcal{B}|} \sum_{(i,j)\in\mathcal{B}} \cos\!\bigl(\hat{\mathbf{h}}_{ij},\, \mathbf{h}_{ij}\bigr) && (5)\\
\mathcal{L}_{\text{MLM}} &= \mathrm{CE}\!\bigl(h_\psi(\mathbf{z}_{\text{full}}),\, \mathbf{y}_{\text{mask}}\bigr) && (6)\\
\mathcal{L} &= \lambda\,\mathcal{L}_{\text{JEPA}} + (1-\lambda)\,\mathcal{L}_{\text{MLM}}, \quad \lambda = \sigma(w) && (7)
\end{aligned}
$$

where $w \in \mathbb{R}$ is a learnable scalar optimised end-to-end and $\sigma$ is the sigmoid function.

##### EMA update.
$$
\bar{\boldsymbol{\theta}} \leftarrow m\,\bar{\boldsymbol{\theta}} + (1-m)\,\boldsymbol{\theta}, \quad m \in [0.996,\, 1.0] \qquad (8)
$$

### 3.3 Token Regressor

The token regressor $h_\psi$ operates directly on the shared encoder’s latent representation:

$$
h_\psi(\mathbf{z}) = W_2\,\mathrm{LN}\!\bigl(\mathrm{GELU}(W_1 \mathbf{z})\bigr) \qquad (9)
$$

with $W_1 \in \mathbb{R}^{D\times D}$ and $W_2 \in \mathbb{R}^{D\times V}$ ($V$ = vocabulary size), where $\mathrm{LN}(\cdot)$ denotes layer normalisation (Ba et al., [2016](https://arxiv.org/html/2606.05173#bib.bib6)). Unlike a standard MLM head, $h_\psi$ receives gradients from both objectives through the shared encoder weights.

### 3.4 Sentence Pooling

For downstream evaluation, token representations are aggregated via mean pooling over non-padding positions:

$$
\mathbf{s} = \frac{\sum_{t=1}^{L} \mathbf{1}[x_t \neq \texttt{pad}]\,\mathbf{z}_t}{\sum_{t=1}^{L} \mathbf{1}[x_t \neq \texttt{pad}]} \qquad (10)
$$

This matches the pooling used in all linear-probe fine-tuning experiments.

## 4 Experimental Setup

### 4.1 Pre-training

##### Data.

Both models are pre-trained on English Wikipedia (WikiText). All text is tokenised with the `bert-base-uncased` tokeniser (vocabulary size 30,522) and sequences are truncated or padded to 512 tokens.

##### Model architecture.

The model uses a transformer-based encoder with token embedding dimension $d_{\text{emb}} = 512$. The predictor consists of 6 layers with embedding dimension $d_{\text{pred}} = 512$. A target encoder, with the same architecture as the main encoder, receives no gradient updates and is updated only via exponential moving average.
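The scalar arithmetic behind Eqs. 5, 7, 8, and 10 can be illustrated with a small sketch. This is not the released training code: it is a plain-Python illustration that assumes vectors are lists of floats, and the function names (`jepa_loss`, `hybrid_loss`, `ema_update`, `mean_pool`) are hypothetical. In actual training, $w$ would be a trainable parameter updated by the optimiser; here it is just a number.

```python
import math


def sigmoid(w):
    """Logistic function, used for lambda = sigma(w) in Eq. 7."""
    return 1.0 / (1.0 + math.exp(-w))


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def jepa_loss(predicted, target):
    """Eq. 5: one minus the mean cosine similarity over predicted positions."""
    sims = [cosine(p, t) for p, t in zip(predicted, target)]
    return 1.0 - sum(sims) / len(sims)


def hybrid_loss(loss_jepa, loss_mlm, w):
    """Eq. 7: convex combination of the two losses with lambda = sigmoid(w)."""
    lam = sigmoid(w)
    return lam * loss_jepa + (1.0 - lam) * loss_mlm


def ema_update(target_params, online_params, m):
    """Eq. 8: move each target parameter towards its online counterpart."""
    return [m * t + (1.0 - m) * o for t, o in zip(target_params, online_params)]


def mean_pool(token_vectors, is_pad):
    """Eq. 10: average token vectors over non-padding positions."""
    kept = [z for z, pad in zip(token_vectors, is_pad) if not pad]
    return [sum(column) / len(kept) for column in zip(*kept)]
```

With $w = 0$ the two losses are weighted equally ($\lambda = 0.5$), and with $m$ close to 1 the target parameters change only slightly per step, which is what makes the targets in Eq. 3 stable.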
The token regressor head maps the encoder outputs back to the vocabulary (vocabulary size 30,522) for masked language modeling. Optimization is performed with AdamW, using a learnable scalar to balance the JEPA (cosine) and MLM (cross-entropy) losses.

##### Training.

Both models are trained for 3 epochs with batch size 64 on a single NVIDIA H100 GPU using bfloat16 mixed precision and AdamW with cosine learning rate schedule (peak LR $5\times 10^{-5}$, weight decay $0.05$). Full hyperparameters are given in Table [1](https://arxiv.org/html/2606.05173#S4.T1).

Table 1: Pre-training hyperparameters.

| Hyperparameter | Value |
|---|---|
| Epochs | 3 |
| Batch size | 64 |
| Peak LR | $5\times 10^{-5}$ |
| LR schedule | cosine |
| Warmup steps | 10 |
| Weight decay | 0.05 |
| Precision | bfloat16 |
| EMA range | [0.996, 1.0] |
| Encoder mask scale | [0.65, 0.85] |
| Pred. mask scale | [0.10, 0.25] |
| Num. pred. masks | 2 |
| Min. keep tokens | 32 |
| MLM probability | 0.15 |
| Hardware | NVIDIA H100 |

### 4.2 Downstream Evaluation: Linear Probing

We evaluate frozen encoder representations via linear probing on five GLUE tasks (Wang et al., [2018](https://arxiv.org/html/2606.05173#bib.bib26)). The encoder is kept frozen throughout; only a lightweight head is trained. Since GLUE test labels are not publicly available, all results are reported on the official validation splits. No hyperparameter search was performed on the validation set; the same configuration was applied to both models.

##### Single-sentence tasks (SST-2, CoLA).

Mean-pooled representations feed a LayerNorm $\to$ Dropout(0.2) $\to$ Linear head.

##### Sentence-pair tasks (MRPC, MNLI).

Both sentences are encoded independently; the classifier receives $[\mathbf{s}_1;\; \mathbf{s}_2;\; |\mathbf{s}_1 - \mathbf{s}_2|;\; \mathbf{s}_1 \odot \mathbf{s}_2]$ (Conneau et al., [2017](https://arxiv.org/html/2606.05173#bib.bib10)).

##### Regression task (STS-B).
Cosine similarity is calibrated to the $[0,5]$ score range via a learnable scale and bias: $\hat{y} = \sigma_w \cdot \cos(\mathbf{s}_1, \mathbf{s}_2) + b_w$. Per-task fine-tuning details are given in Table [2](https://arxiv.org/html/2606.05173#S4.T2).

Table 2: Fine-tuning hyperparameters per task.

| Task | LR | Epochs | Metric |
|---|---|---|---|
| SST-2 | $1\times 10^{-3}$ | 90 | Accuracy |
| MRPC | $1\times 10^{-3}$ | 10 | F1 |
| MNLI | $1\times 10^{-3}$ | 15 | Acc. (matched) |
| STS-B | $1\times 10^{-3}$ | 15 | Spearman $\rho$ |

### 4.3 Representation Analysis

We extract frozen sentence embeddings for up to 2,000 samples from each task’s validation split and compute six metrics:

1. Spectral entropy (Garrido et al., [2023](https://arxiv.org/html/2606.05173#bib.bib14)): $H_{\text{spec}} = H(\boldsymbol{\sigma}/\|\boldsymbol{\sigma}\|_1)/\log D$.
2. Effective rank (Roy and Vetterli, [2007](https://arxiv.org/html/2606.05173#bib.bib23)): $\text{erank} = \exp H(\boldsymbol{\sigma}/\|\boldsymbol{\sigma}\|_1)$.
3. Stable rank (Vershynin, [2018](https://arxiv.org/html/2606.05173#bib.bib24)): $\text{srank} = \|\mathbf{Z}\|_F^2/\|\mathbf{Z}\|_2^2$.
4. Alignment (Wang and Isola, [2020](https://arxiv.org/html/2606.05173#bib.bib25)): mean squared $\ell_2$ distance between same-class pairs.
5. Uniformity (Wang and Isola, [2020](https://arxiv.org/html/2606.05173#bib.bib25)): $\log \mathbb{E}[e^{-2\|\mathbf{u}-\mathbf{v}\|^2}]$ on $\ell_2$-normalised embeddings.
6. Probe gap (Conneau and Kiela, [2018](https://arxiv.org/html/2606.05173#bib.bib9)): semantic probe accuracy minus token probe accuracy.

All metrics are computed under four pooling strategies: mean, max, weighted mean, and attention pooling.
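The three spectral metrics in the list above depend only on the singular values $\boldsymbol{\sigma}$ of the embedding matrix $\mathbf{Z}$, so they can be sketched without any linear-algebra library once those values are known. The following is a minimal illustration, not the paper's evaluation code: it assumes the singular values have already been computed (for example with an SVD routine) and are passed as a list of non-negative floats, and the function names are hypothetical. It uses the identities $\|\mathbf{Z}\|_F^2 = \sum_i \sigma_i^2$ and $\|\mathbf{Z}\|_2 = \max_i \sigma_i$.

```python
import math


def _entropy(singular_values):
    """Shannon entropy (in nats) of the L1-normalised singular values."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0.0]
    return -sum(p * math.log(p) for p in probs)


def spectral_entropy(singular_values, dim):
    """Metric 1: entropy divided by log D, where D is the embedding dimension."""
    return _entropy(singular_values) / math.log(dim)


def effective_rank(singular_values):
    """Metric 2: exponential of the entropy of the normalised spectrum."""
    return math.exp(_entropy(singular_values))


def stable_rank(singular_values):
    """Metric 3: squared Frobenius norm over squared spectral norm."""
    top = max(singular_values)
    return sum(s * s for s in singular_values) / (top * top)
```

A flat spectrum such as `[1.0, 1.0, 1.0, 1.0]` gives an effective rank and stable rank of about 4, while a spectrum with a single non-zero value gives 1 for both, which matches the reading of these metrics as counts of dimensions that carry variance.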
Full formal definitions are given in Appendix [D](https://arxiv.org/html/2606.05173#A4).

## 5 Results

### 5.1 Downstream Task Accuracy

Table [3](https://arxiv.org/html/2606.05173#S5.T3) reports linear-probe performance on five GLUE tasks. Both models achieve comparable accuracy, consistent with the known finding that MLM baselines are strong linear classifiers under mean-pooled representations (Gao et al., [2021](https://arxiv.org/html/2606.05173#bib.bib13)).

Table 3: Linear-probe downstream performance on GLUE validation splits. Frozen encoder; best per-task result in **bold**. Spearman $\rho$ for STS-B.

| Task | Hybrid | MLM-only |
|---|---|---|
| SST-2 (Acc.) | 67.55 | **68.69** |
| MRPC (F1) | **63.09** | 59.84 |
| MNLI (Acc.) | 50.82 | **51.36** |
| STS-B† (Spearman) | 0.281 | **0.283** |

### 5.2 Representation Geometry

#### 5.2.1 Uniformity

The hybrid encoder consistently achieves significantly more negative uniformity scores across all five datasets and all four pooling strategies (Table [4](https://arxiv.org/html/2606.05173#S5.T4)). The mean uniformity under attention pooling is $-0.54$ for hybrid vs. $-0.07$ for MLM-only, a more than sevenfold difference. This confirms that the JEPA predictive objective prevents representational collapse and promotes a more isotropic use of the embedding hypersphere (Wang and Isola, [2020](https://arxiv.org/html/2606.05173#bib.bib25)), analogous to findings in vision (Assran et al., [2023](https://arxiv.org/html/2606.05173#bib.bib1)).
Table 4: Uniformity ($\downarrow$ better) by dataset and pooling.

| Dataset | Pooling | Hybrid | MLM |
|---|---|---|---|
| SST-2 | mean | **−0.160** | −0.052 |
| SST-2 | max | **−0.294** | −0.090 |
| SST-2 | weighted | **−0.160** | −0.052 |
| SST-2 | attention | **−0.448** | −0.055 |
| MRPC | mean | **−0.134** | −0.053 |
| MRPC | max | **−0.262** | −0.088 |
| MRPC | weighted | **−0.134** | −0.053 |
| MRPC | attention | **−0.269** | −0.055 |
| MNLI | mean | **−0.169** | −0.063 |
| MNLI | max | **−0.290** | −0.096 |
| MNLI | weighted | **−0.163** | −0.063 |
| MNLI | attention | **−0.365** | −0.067 |
| CoLA | mean | **−0.314** | −0.079 |
| CoLA | max | **−0.365** | −0.098 |
| CoLA | weighted | **−0.314** | −0.079 |
| CoLA | attention | **−0.955** | −0.083 |
| STS-B | mean | **−0.202** | −0.067 |
| STS-B | max | **−0.321** | −0.102 |
| STS-B | weighted | **−0.201** | −0.068 |
| STS-B | attention | **−0.577** | −0.069 |

#### 5.2.2 Alignment–Uniformity Trade-off

The improved uniformity comes at the cost of higher within-class alignment values for hybrid representations. MLM-only alignment is consistently near zero ($\leq 0.002$), indicating extremely tight class clusters, while hybrid alignment is larger (e.g. SST-2/attention: 0.20 vs. 0.001), reflecting a more relaxed intra-class structure. This alignment-uniformity trade-off (Wang and Isola, [2020](https://arxiv.org/html/2606.05173#bib.bib25)) directly explains why linear-probe accuracy is similar for both models: linear classifiers benefit primarily from tight clusters, which favours MLM-only representations. Scatter plots of alignment vs. uniformity and intra/inter-class distance ratios for all datasets are given in Appendix [B](https://arxiv.org/html/2606.05173#A2).
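To make the two quantities in this trade-off concrete, the sketch below computes alignment and uniformity for a small set of embeddings. It is a hedged illustration rather than the paper's evaluation code: it assumes embeddings are lists of floats, that both metrics are computed on $\ell_2$-normalised vectors (the metric list only states this explicitly for uniformity), that the expectation is taken over all distinct pairs, and the function names are hypothetical.

```python
import math
from itertools import combinations


def l2_normalise(v):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def uniformity(embeddings, t=2.0):
    """log of the mean of exp(-t * ||u - v||^2) over all distinct pairs.

    Needs at least two embeddings. Values near 0 mean the points sit close
    together; more negative values mean they are more spread out.
    """
    z = [l2_normalise(e) for e in embeddings]
    terms = [math.exp(-t * squared_distance(u, v)) for u, v in combinations(z, 2)]
    return math.log(sum(terms) / len(terms))


def alignment(embeddings, labels):
    """Mean squared distance between pairs of embeddings with the same label.

    Needs at least one same-label pair.
    """
    z = [l2_normalise(e) for e in embeddings]
    dists = [
        squared_distance(z[i], z[j])
        for i, j in combinations(range(len(z)), 2)
        if labels[i] == labels[j]
    ]
    return sum(dists) / len(dists)
```

Under this definition, embeddings that all point in the same direction give a uniformity of 0 and an alignment of 0, so scores close to zero on both metrics indicate tightly packed representations, while spreading points over the sphere drives uniformity down and same-class distances up.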
#### 5.2.3 Spectral Analysis

Under max pooling, the hybrid encoder achieves higher spectral entropy and effective rank on four of the five datasets (Table [5](https://arxiv.org/html/2606.05173#S5.T5)), indicating that more embedding dimensions carry meaningful variance; MRPC is the exception, where the MLM-only encoder is slightly higher on both. The stable rank follows the same pattern: it is higher for hybrid under max pooling on every dataset except MRPC.

Table 5: Spectral metrics under max pooling. H = Hybrid; M = MLM-only. Eff. Rank (Roy and Vetterli, [2007](https://arxiv.org/html/2606.05173#bib.bib23)); Srank (Vershynin, [2018](https://arxiv.org/html/2606.05173#bib.bib24)).

| Task | Spec. Ent. ($\uparrow$) H | Spec. Ent. ($\uparrow$) M | Eff. Rank ($\uparrow$) H | Eff. Rank ($\uparrow$) M | Srank ($\uparrow$) H | Srank ($\uparrow$) M |
|---|---|---|---|---|---|---|
| SST-2 | 0.765 | 0.735 | 384 | 364 | 6.98 | 4.11 |
| MRPC | 0.794 | 0.824 | 299 | 301 | 8.18 | 11.33 |
| MNLI | 0.779 | 0.767 | 406 | 391 | 8.35 | 4.99 |
| CoLA | 0.752 | 0.715 | 365 | 330 | 10.82 | 4.75 |
| STS-B | 0.743 | 0.721 | 390 | 372 | 6.94 | 4.09 |

#### 5.2.4 Probing Results

##### Token probe.

The hybrid encoder consistently exhibits lower token probe accuracy (Table [6](https://arxiv.org/html/2606.05173#S5.T6)), indicating that its representations retain less surface-level lexical information. Across all tasks under attention pooling, hybrid token accuracy is roughly 5–9 points lower than MLM-only.

##### Semantic probe.

Despite encoding less token-level information, hybrid representations achieve comparable or slightly higher semantic probe performance. Under max pooling on MRPC, hybrid achieves the only positive probe gap ($+0.031$), confirming that the JEPA objective promotes semantic encoding over lexical memorisation.

##### CoLA MCC.

The hybrid encoder achieves consistently positive MCC on CoLA (0.052 to 0.064 vs. −0.023 to 0.010 for MLM), suggesting that block-span prediction induces a weak but reproducible syntactic sensitivity absent from token-level MLM.

Table 6: Probing results under attention pooling. H = Hybrid; M = MLM-only.

| Task | Token ($\downarrow$) H | Token ($\downarrow$) M | Semantic ($\uparrow$) H | Semantic ($\uparrow$) M | Gap ($\uparrow$) H | Gap ($\uparrow$) M |
|---|---|---|---|---|---|---|
| SST-2 | 0.663 | 0.754 | 0.643 | 0.635 | −0.019 | −0.119 |
| MRPC | 0.616 | 0.678 | 0.608 | 0.610 | −0.008 | −0.068 |
| MNLI | 0.685 | 0.740 | 0.326 | 0.333 | −0.359 | −0.407 |
| CoLA | 0.813 | 0.870 | 0.602 | 0.594 | −0.211 | −0.276 |
| STS-B | 0.592 | 0.654 | 0.187 | 0.193 | −0.405 | −0.460 |

## 6 Ablation Study: Pooling Strategies

We evaluate four pooling strategies to disentangle the effect of aggregation from the effect of the pre-training objective.

##### Mean pooling

Mean pooling (Eq. [10](https://arxiv.org/html/2606.05173#S3.E10)) is the default and matches the pooling used during fine-tuning. It provides the fairest comparison for downstream accuracy but yields the most compressed spectral geometry.

##### Max pooling

Max pooling consistently yields the highest spectral entropy and effective rank for both models. It is the most discriminative strategy for the hybrid encoder on classification tasks, and the only configuration under which the probe gap turns positive for MRPC.

##### Weighted mean pooling

Weighted mean pooling produces results nearly identical to mean pooling across all metrics. Since weights are initialised uniformly and evaluated inside `no_grad`, this serves as a useful null result confirming that observed geometric differences are intrinsic to the encoder rather than to the pooling arithmetic.

##### Attention pooling

Attention pooling amplifies differences between the two models more than any other strategy. The uniformity gap is widest under attention pooling (SST-2: $-0.448$ vs. $-0.055$; STS-B: $-0.577$ vs. $-0.069$), suggesting that the JEPA objective most strongly affects high-attention positions, that is, the tokens the model considers most semantically salient.

Figure [2](https://arxiv.org/html/2606.05173#S6.F2) illustrates the pooling ablation for the hybrid encoder on SST-2.

Figure 2: Spectral entropy, semantic probe accuracy, and uniformity for the hybrid encoder under four pooling strategies on SST-2.
Max pooling maximises spectral richness; attention pooling maximises the uniformity advantage.

## 7 Discussion

##### Why similar accuracy despite different geometry?

The alignment-uniformity trade-off (Wang and Isola, [2020](https://arxiv.org/html/2606.05173#bib.bib25)) provides the explanation. Hybrid representations cover the embedding hypersphere more uniformly but do not cluster same-class points as tightly. Since linear classifiers benefit primarily from tight clusters, MLM’s alignment advantage offsets hybrid’s uniformity advantage at the linear-probe evaluation level. We expect the uniformity advantage to translate to downstream gains under non-linear probing, retrieval tasks, or longer pre-training.

##### Attention pooling as a diagnostic tool.

The amplified differences under attention pooling suggest that JEPA’s effect is concentrated at salient positions rather than distributed uniformly across the sequence, consistent with the block-masking design that forces the predictor to recover representations of contiguous spans.

##### Resource constraints.

Pre-training was intentionally limited to 3 epochs on a modest corpus to study the effect of the objective under a controlled compute budget. The geometric differences observed are therefore likely a lower bound on what is achievable at scale.

## 8 Future Work

- Alternative prediction losses. Cosine similarity enforces directional alignment but ignores magnitude. Future work should evaluate smooth $\ell_1$, VICReg (Bardes et al., [2022](https://arxiv.org/html/2606.05173#bib.bib4)), or Barlow Twins (Zbontar et al., [2021](https://arxiv.org/html/2606.05173#bib.bib27)) losses in the JEPA branch.
- Curriculum-based $\lambda$. The current $\lambda$ converges slowly. A schedule starting MLM-heavy and shifting to JEPA-heavy as the target encoder matures may accelerate learning.
- Alternative masking strategies. Span masking (Joshi et al., [2020](https://arxiv.org/html/2606.05173#bib.bib20)), whole-word masking, and syntactically informed masking are natural alternatives.
- Scale. Extending to BookCorpus + Wikipedia, more epochs, and larger model sizes is the most direct path to assessing whether geometric advantages translate to accuracy improvements.
- Non-linear probing and retrieval. Evaluating with MLP probes and semantic similarity benchmarks (STS12–STS16, SICK-R) would provide a more complete picture.
- Collapse monitoring. Online tracking of uniformity and effective rank during training could serve as an adaptive objective-switching signal.

## 9 Conclusion

We proposed a hybrid pre-training objective that combines a JEPA-style latent prediction loss with masked language modelling over a single shared encoder. Through systematic representation analysis across five GLUE benchmarks and four pooling strategies, we showed that the hybrid objective consistently produces more uniform embedding distributions, richer spectral geometry, and better semantic-to-lexical balance compared to a controlled MLM-only baseline trained under identical conditions. These advantages are not captured by linear-probe accuracy alone, highlighting the value of geometric representation analysis as a complementary evaluation protocol. Our findings provide empirical support for LeCun’s hypothesis that latent-space prediction objectives lead to more abstract representations than token reconstruction (LeCun, [2022](https://arxiv.org/html/2606.05173#bib.bib21)), and constitute a step toward understanding how JEPA principles can be applied to language encoders under realistic resource constraints.
## References

- M. Assran et al. (2023) Self-supervised learning from images with a joint-embedding predictive architecture. In CVPR 2023.
- J.L. Ba, J.R. Kiros, and G.E. Hinton (2016) Layer normalization. arXiv:1607.06450.
- A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli (2022) Data2vec: a general framework for self-supervised learning in speech, vision and language. In ICML 2022.
- A. Baevski, H. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. In NeurIPS 2020.
- A. Bardes, Q. Garrido, J. Ponce, M. Rabbat, Y. LeCun, M. Assran, and N. Ballas (2024) Revisiting feature prediction for learning visual representations from video. arXiv:2404.08471.
- A. Bardes, J. Ponce, and Y. LeCun (2022) VICReg: variance-invariance-covariance regularization for self-supervised learning. In ICLR 2022.
- A. Bardes, J. Ponce, and Y. LeCun (2023) MC-JEPA: a joint-embedding predictive architecture for self-supervised learning of motion and content features. arXiv:2307.12698.
- T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In ICML 2020.
- A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017) Supervised learning of universal sentence representations from natural language inference data. In EMNLP 2017.
- A. Conneau and D. Kiela (2018) SentEval: an evaluation toolkit for universal sentence representations. In LREC 2018.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
- K. Ethayarajh (2019) How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In EMNLP 2019.
- T. Gao, X. Yao, and D. Chen (2021) SimCSE: simple contrastive learning of sentence embeddings. In EMNLP 2021.
- Q. Garrido, Y. Chen, A. Bardes, L. Najman, and Y. LeCun (2023) On the duality between contrastive and non-contrastive self-supervised learning. In ICLR 2023.
- J. Giorgi, O. Nitski, B. Wang, and G. Bader (2021) DeCLUTR: deep contrastive learning for unsupervised textual representations. In ACL 2021.
- J.-B. Grill, F. Strub, F. Altché, et al. (2020) Bootstrap your own latent: a new approach to self-supervised learning. In NeurIPS 2020.
- K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick (2022) Masked autoencoders are scalable vision learners. In CVPR 2022.
- K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In CVPR 2020.
- H. Huang, Y. LeCun, and R. Balestriero (2025) LLM-JEPA: large language models meet joint embedding predictive architectures. arXiv:2509.14252.
- M. Joshi, D. Chen, Y. Liu, D.S. Weld, L. Zettlemoyer, and O. Levy (2020) SpanBERT: improving pre-training by representing and predicting spans.
- Y. LeCun (2022) A path towards autonomous machine intelligence. OpenReview preprint.
- A. Rogers, O. Kovaleva, and A. Rumshisky (2020) A primer in BERTology: what we know about how BERT works. TACL 2020.
- O. Roy and M. Vetterli (2007) The effective rank: a measure of effective dimensionality. In EUSIPCO 2007.
- R. Vershynin (2018) High-dimensional probability. Cambridge University Press.
- A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In ICLR 2019.
- T. Wang and P. Isola (2020) Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML 2020.
- J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny (2021) Barlow twins: self-supervised learning via redundancy reduction. In ICML 2021.

## Appendix A Full Representation Analysis Tables

Tables [7](https://arxiv.org/html/2606.05173#A1.T7)–[11](https://arxiv.org/html/2606.05173#A1.T11) report all six representation metrics for all pooling strategies across all datasets.
H = Hybrid; M = MLM-only.

Table 7: Full representation metrics, SST-2.

| Pool | Spec. Ent. (H) | Spec. Ent. (M) | Eff. Rank (H) | Eff. Rank (M) | Srank (H) | Srank (M) | Uniformity (H) | Uniformity (M) | Sem. Probe (H) | Sem. Probe (M) | Probe Gap (H) | Probe Gap (M) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | 0.609 | 0.605 | 242 | 249 | 4.49 | 4.50 | **-0.160** | -0.052 | 0.617 | 0.618 | **-0.138** | -0.152 |
| max | 0.765 | 0.735 | 384 | 364 | 6.98 | 4.11 | **-0.294** | -0.090 | 0.600 | 0.595 | **-0.008** | -0.045 |
| weighted | 0.609 | 0.605 | 242 | 249 | 4.49 | 4.50 | **-0.160** | -0.052 | 0.617 | 0.619 | **-0.139** | -0.149 |
| attention | 0.425 | 0.602 | 231 | 255 | 1.81 | 4.15 | **-0.448** | -0.055 | 0.643 | 0.635 | **-0.019** | -0.119 |

Table 8: Full representation metrics, MRPC.

| Pool | Spec. Ent. (H) | Spec. Ent. (M) | Eff. Rank (H) | Eff. Rank (M) | Srank (H) | Srank (M) | Uniformity (H) | Uniformity (M) | Sem. Probe (H) | Sem. Probe (M) | Probe Gap (H) | Probe Gap (M) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | 0.631 | 0.629 | 184 | 186 | 7.39 | 7.94 | **-0.134** | -0.053 | 0.583 | 0.635 | -0.052 | -0.038 |
| max | 0.794 | 0.824 | 299 | 301 | 8.18 | 11.33 | **-0.262** | -0.088 | 0.618 | 0.576 | **+0.031** | -0.002 |
| weighted | 0.631 | 0.629 | 184 | 186 | 7.39 | 7.94 | **-0.134** | -0.053 | 0.583 | 0.635 | -0.057 | -0.038 |
| attention | 0.611 | 0.642 | 198 | 194 | 4.13 | 8.73 | **-0.269** | -0.055 | 0.608 | 0.610 | **-0.008** | -0.068 |

Table 9: Full representation metrics, MNLI.

| Pool | Spec. Ent. (H) | Spec. Ent. (M) | Eff. Rank (H) | Eff. Rank (M) | Srank (H) | Srank (M) | Uniformity (H) | Uniformity (M) | Sem. Probe (H) | Sem. Probe (M) | Probe Gap (H) | Probe Gap (M) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | 0.607 | 0.597 | 242 | 245 | 6.06 | 5.83 | **-0.169** | -0.063 | 0.362 | 0.354 | **-0.392** | -0.410 |
| max | 0.779 | 0.767 | 406 | 391 | 8.35 | 4.99 | **-0.290** | -0.096 | 0.315 | 0.349 | -0.310 | -0.288 |
| weighted | 0.607 | 0.597 | 242 | 245 | 6.06 | 5.83 | **-0.163** | -0.063 | 0.363 | 0.355 | **-0.392** | -0.409 |
| attention | 0.538 | 0.603 | 253 | 253 | 2.76 | 5.91 | **-0.365** | -0.067 | 0.326 | 0.333 | -0.359 | -0.407 |

Table 10: Full representation metrics, CoLA. Semantic probe reports MCC.

| Pool | Spec. Ent. (H) | Spec. Ent. (M) | Eff. Rank (H) | Eff. Rank (M) | Srank (H) | Srank (M) | Uniformity (H) | Uniformity (M) | MCC ↑ (H) | MCC ↑ (M) | Probe Gap (H) | Probe Gap (M) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | 0.606 | 0.586 | 235 | 239 | 5.16 | 4.58 | **-0.314** | -0.079 | 0.064 | 0.000 | **-0.246** | -0.284 |
| max | 0.752 | 0.715 | 365 | 330 | 10.82 | 4.75 | **-0.365** | -0.098 | 0.024 | -0.023 | **-0.224** | -0.306 |
| weighted | 0.606 | 0.586 | 235 | 239 | 5.16 | 4.58 | **-0.314** | -0.079 | 0.064 | 0.000 | **-0.246** | -0.285 |
| attention | 0.479 | 0.572 | 221 | 241 | 2.20 | 3.91 | **-0.955** | -0.083 | 0.052 | 0.010 | **-0.211** | -0.276 |

Table 11: Full representation metrics, STS-B. Semantic probe reports Spearman $\rho$.

| Pool | Spec. Ent. (H) | Spec. Ent. (M) | Eff. Rank (H) | Eff. Rank (M) | Srank (H) | Srank (M) | Uniformity (H) | Uniformity (M) | Spearman ↑ (H) | Spearman ↑ (M) | Probe Gap (H) | Probe Gap (M) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | 0.581 | 0.588 | 228 | 235 | 4.83 | 5.53 | **-0.202** | -0.067 | 0.284 | 0.224 | **-0.369** | -0.449 |
| max | 0.743 | 0.721 | 390 | 372 | 6.94 | 4.09 | **-0.321** | -0.102 | 0.140 | 0.167 | **-0.376** | -0.397 |
| weighted | 0.581 | 0.588 | 227 | 235 | 4.83 | 5.53 | **-0.201** | -0.068 | 0.284 | 0.224 | **-0.369** | -0.446 |
| attention | 0.484 | 0.597 | 233 | 243 | 2.35 | 5.70 | **-0.577** | -0.069 | 0.187 | 0.193 | **-0.405** | -0.460 |

## Appendix B Alignment–Uniformity Plots

Figure 3: Alignment vs. uniformity and class separation, mean pooling, all datasets.

Figure 4: Alignment vs. uniformity and class separation, max pooling, all datasets.

Figure 5: Alignment vs. uniformity and class separation, weighted mean pooling, all datasets.

Figure 6: Alignment vs. uniformity and class separation, attention pooling, all datasets. The uniformity gap between hybrid and MLM-only is most pronounced here, consistent with the quantitative results in Table [4](https://arxiv.org/html/2606.05173#S5.T4).

## Appendix C Eigenspectrum Plots

Figure 7: Eigenvalue spectrum, SST-2, mean pooling.

Figure 8: Eigenvalue spectrum, MRPC, mean pooling.

Figure 9: Eigenvalue spectrum, MNLI, mean pooling.

Figure 10: Eigenvalue spectrum, CoLA, mean pooling.
## Appendix D Representation Metric Definitions

Let $\mathbf{Z}\in\mathbb{R}^{N\times D}$ denote the centred embedding matrix, and let $\boldsymbol{\sigma}=(\sigma_{1},\ldots,\sigma_{r})$ be its singular values in descending order. All metrics are computed on frozen sentence embeddings extracted from the validation split.

##### 1. Spectral Entropy \[Garrido et al., [2023](https://arxiv.org/html/2606.05173#bib.bib14)\].

$$H_{\text{spec}}=\frac{H\!\left(\boldsymbol{\sigma}/\|\boldsymbol{\sigma}\|_{1}\right)}{\log D},\qquad H(p)=-\textstyle\sum_{i}p_{i}\log p_{i} \tag{11}$$

$H_{\text{spec}}\in[0,1]$; a value of 1 means all singular values are equal, indicating that every embedding dimension carries the same variance (maximally rich representation).

##### 2. Effective Rank \[Roy and Vetterli, [2007](https://arxiv.org/html/2606.05173#bib.bib23)\].

$$\text{erank}(\mathbf{Z})=\exp H\!\left(\boldsymbol{\sigma}/\|\boldsymbol{\sigma}\|_{1}\right) \tag{12}$$

Measures the number of dimensions that effectively contribute to the variance; higher values indicate richer representations.

##### 3. Stable Rank \[Vershynin, [2018](https://arxiv.org/html/2606.05173#bib.bib24)\].

$$\text{srank}(\mathbf{Z})=\frac{\|\mathbf{Z}\|_{F}^{2}}{\|\mathbf{Z}\|_{2}^{2}}=\frac{\sum_{i}\sigma_{i}^{2}}{\sigma_{1}^{2}} \tag{13}$$

A robust alternative to effective rank that is less sensitive to outlier singular values.

##### 4. Alignment \[Wang and Isola, [2020](https://arxiv.org/html/2606.05173#bib.bib25)\].

$$\mathcal{A}(f;\,\alpha)=\underset{(\mathbf{u},\mathbf{v})\sim p_{\text{pos}}}{\mathbb{E}}\bigl[\|\hat{f}(\mathbf{u})-\hat{f}(\mathbf{v})\|^{2}\bigr]^{\alpha},\quad\alpha=2 \tag{14}$$

where $\hat{f}(\mathbf{x})=f(\mathbf{x})/\|f(\mathbf{x})\|_{2}$ and $p_{\text{pos}}$ is the distribution of same-class pairs. Lower alignment means same-class representations cluster more tightly.

##### 5. Uniformity \[Wang and Isola, [2020](https://arxiv.org/html/2606.05173#bib.bib25)\].

$$\mathcal{U}(f;\,t)=\log\,\underset{\mathbf{u},\mathbf{v}\overset{\text{i.i.d.}}{\sim}p_{\text{data}}}{\mathbb{E}}\bigl[e^{-t\|\hat{f}(\mathbf{u})-\hat{f}(\mathbf{v})\|^{2}}\bigr],\quad t=2 \tag{15}$$

More negative values indicate that embeddings are spread more uniformly over the unit hypersphere, which prevents representational collapse.

##### 6. Probe Gap \[Conneau and Kiela, [2018](https://arxiv.org/html/2606.05173#bib.bib9)\].

$$\Delta_{\text{probe}}=\text{acc}_{\text{semantic}}-\text{acc}_{\text{token}} \tag{16}$$

The semantic probe trains a linear classifier to predict the downstream task label from the frozen embedding; the token probe predicts the most frequent non-special token in the input sentence. A positive gap indicates that the representation encodes task-relevant semantics more strongly than surface-form lexical identity.
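The spectral and hypersphere metrics defined in this appendix can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function names (`spectral_metrics`, `uniformity`, `alignment`) are hypothetical, positive pairs are assumed to be all distinct same-label pairs, and alignment is assumed to follow the Wang and Isola form, i.e. the mean of the Euclidean distance raised to the power $\alpha$ (with $\alpha=2$, the mean squared distance). The probe gap is omitted because it depends on trained linear probes.

```python
import numpy as np


def shannon_entropy(p):
    """H(p) = -sum_i p_i log p_i, ignoring zero entries."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))


def spectral_metrics(Z):
    """Spectral entropy, effective rank and stable rank of an (N, D) matrix."""
    Z = Z - Z.mean(axis=0, keepdims=True)        # centre the embeddings
    sigma = np.linalg.svd(Z, compute_uv=False)   # singular values, descending
    h = shannon_entropy(sigma / sigma.sum())     # entropy of normalised spectrum
    return {
        "spectral_entropy": float(h / np.log(Z.shape[1])),
        "effective_rank": float(np.exp(h)),
        "stable_rank": float(np.sum(sigma ** 2) / sigma[0] ** 2),
    }


def pairwise_sq_dists(Z):
    """Squared distances between all distinct pairs of L2-normalised rows."""
    Zh = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    i, j = np.triu_indices(len(Zh), k=1)
    # For unit vectors, ||u - v||^2 = 2 - 2 u.v; clip guards rounding error.
    sq = np.clip(2.0 - 2.0 * (Zh @ Zh.T)[i, j], 0.0, None)
    return sq, i, j


def uniformity(Z, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs."""
    sq, _, _ = pairwise_sq_dists(Z)
    return float(np.log(np.mean(np.exp(-t * sq))))


def alignment(Z, labels, alpha=2.0):
    """Mean of ||u - v||_2^alpha over same-label pairs (assumed reading)."""
    sq, i, j = pairwise_sq_dists(Z)
    labels = np.asarray(labels)
    same = labels[i] == labels[j]
    return float(np.mean(sq[same] ** (alpha / 2.0)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Z = rng.normal(size=(200, 32))      # stand-in for frozen sentence embeddings
    y = rng.integers(0, 2, size=200)    # stand-in for class labels
    print(spectral_metrics(Z))
    print(uniformity(Z), alignment(Z, y))
```

The pairwise step builds an N×N Gram matrix, so for large validation splits one would likely subsample or process the pairs in chunks.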