@MatthieuWyart: LLMs learn by predicting tokens. World models (JEPA, data2vec) learn by predicting their own abstractions. Which needs …

X AI KOLs Timeline Papers

Summary

This paper proves that learning by predicting latent representations (as in world models like JEPA and data2vec) requires exponentially less data than predicting tokens (as in LLMs) for hierarchical data with hidden structure.

LLMs learn by predicting tokens. World models (JEPA, data2vec) learn by predicting their own abstractions. Which needs more data? For data with hidden hierarchy, we prove the gap is exponential. https://t.co/r2uuX0lBCu https://t.co/51canl7smG
Original Article
View Cached Full Text

Cached at: 06/01/26, 01:41 PM

LLMs learn by predicting tokens. World models (JEPA, data2vec) learn by predicting their own abstractions. Which needs more data? For data with hidden hierarchy, we prove the gap is exponential. https://t.co/r2uuX0lBCu https://t.co/51canl7smG


Learn from your own latents and not from tokens: A sample-complexity theory

Source: https://arxiv.org/html/2605.27734 {forest}{forest}

Supervised classification Token-level SSL (MLM, diffusion) Learn-from-your-latent SSL

Figure 1:The sample complexityPPto learn RHM data depends on training objective. The sample complexity to learn the synonymic invariance of correlations (and construct latent⋆\star) between prediction target (ZZ, boxed) and its context tuple (TT, circled) scales asmdtreem^{d_{\rm tree}}, wheredtreed_{\rm tree}denotes the tree-distance between the two. Overall sample complexity is bottlenecked by the weakest pertinent correlations, which we highlight for supervised classification and token-level SSL. Supervising on latents, as our ILC algorithm (Section˜3), SLC network (Section˜4), and data2vec (Section˜5) do, achieves the better sample complexity∼m3\sim m^{3}.#### The Random Hierarchy Model (RHM).

The RHM[32]is a probabilistic context-free grammar on a fixed regular tree. The tree has depthLL, branching factorss, and vocabularies𝒱0,𝒱1,…,𝒱L\mathcal{V}_{0},\mathcal{V}_{1},\ldots,\mathcal{V}_{L}, all of sizevv. Level0is visible and levels1,…,L1,\ldots,Lare latent. At levelℓ\ell, there aresL−ℓs^{L-\ell}variablesh1(ℓ),…,hsL−ℓ(ℓ)h^{(\ell)}_{1},\ldots,h^{(\ell)}_{s^{L-\ell}}, withhu(ℓ)∈𝒱ℓh^{(\ell)}_{u}\in\mathcal{V}_{\ell}. The visible sequence isxi=hi(0)x_{i}=h^{(0)}_{i},i=1,…,sLi=1,\ldots,s^{L}.

For each levelℓ=0,…,L−1\ell=0,\ldots,L-1, a grammar instance is sampled by choosingv​mvmdistinct tuples from𝒱ℓs\mathcal{V}_{\ell}^{s}uniformly without replacement and partitioning them intovvlabeled groups of sizemm, one groupℛℓ,a\mathcal{R}_{\ell,a}for each parent symbola∈𝒱ℓ+1a\in\mathcal{V}_{\ell+1}. A ruleν=(ν1,…,νs)∈ℛℓ,a\nu=(\nu_{1},\ldots,\nu_{s})\in\mathcal{R}_{\ell,a}means that parentaamay generate the tuple of childrenν\nu. The set of grammatical tuples at levelℓ\ellis𝒮ℓ:=⨆a∈𝒱ℓ+1ℛℓ,a,\mathcal{S}_{\ell}:=\bigsqcup_{a\in\mathcal{V}_{\ell+1}}\mathcal{R}_{\ell,a},so|𝒮ℓ|=v​m|\mathcal{S}_{\ell}|=vm. We writef:=m/vs−1f:=m/v^{s-1}for the fraction of validss-tuples, sincev​mvmout of thevsv^{s}possible tuples are grammatical. Because the rule map is injective, the grammar is unambiguous: each grammatical tuple has a unique parent. We denote it byparℓ​(ν)=a⟺ν∈ℛℓ,a.\mathrm{par}_{\ell}(\nu)=a\;\Longleftrightarrow\;\nu\in\mathcal{R}_{\ell,a}.Two grammatical tuplesν,ν′∈𝒮ℓ\nu,\nu^{\prime}\in\mathcal{S}_{\ell}are calledsynonymsif they have the same parent.

Generation proceeds top-down. Firsth1(L)∼Unif​(𝒱L)h^{(L)}_{1}\sim\mathrm{Unif}(\mathcal{V}_{L}). Then, recursively, ifhu(ℓ+1)=ah^{(\ell+1)}_{u}=a, the child tupleTu(ℓ):=(h(u−1)​s+1(ℓ),…,hu​s(ℓ))T^{(\ell)}_{u}:=\big(h^{(\ell)}_{(u-1)s+1},\ldots,h^{(\ell)}_{us}\big)is sampled uniformly fromℛℓ,a\mathcal{R}_{\ell,a}.

Correlation-based learning in the RHM.

A central observation fromCagnetta et al. [32]is that learning the RHM amounts to learning invariances under the exchange of synonyms. The statistical signal that identifies these synonym classes comes from correlations between a tuple and an external observable. Concretely, letT(ℓ)T^{(\ell)}be a level-ℓ\elltuple and letZZbe an observable elsewhere in the tree. This arrangement is depicted for different learning objectives inSection˜2. One can consider the connected correlation:

CZ​(ν,z):=ℙ​[T(ℓ)=ν,Z=z]−ℙ​[T(ℓ)=ν]​ℙ​[Z=z].C_{Z}(\nu,z):=\mathbb{P}[T^{(\ell)}=\nu,Z=z]-\mathbb{P}[T^{(\ell)}=\nu]\mathbb{P}[Z=z].If two tuplesν,ν′\nu,\nu^{\prime}are synonyms, thenCZ​(ν,⋅)=CZ​(ν′,⋅),C_{Z}(\nu,\cdot)=C_{Z}(\nu^{\prime},\cdot),wheneverZZdepends on the tuple only through its parent.111Indeed, conditioned on the parent of the tuple, the rest of the tree is independent of which production rule was used to generate the tuple, and the uniform choice of production rules givesℙ​[T(ℓ)=ν]=ℙ​[T(ℓ)=ν′]\mathbb{P}[T^{(\ell)}=\nu]=\mathbb{P}[T^{(\ell)}=\nu^{\prime}]within each synonym class.The strength of this correlation signal depends on the choice ofZZand on its position relative to the tuple. The signal must propagate through the hidden tree from the parent ofT(ℓ)T^{(\ell)}to the observableZZ. Along this path, each unresolved production rule averages overmmsynonymous choices, attenuating the signal. In sample-complexity terms, each additional unresolved rule costs a factor of ordermm, as illustrated inSection˜2.

Supervised classification[32]:the observableZZis the class label at the root. To recover the rules mapping level-ℓ\elltuples to level-(ℓ+1)(\ell+1)latents, the root-to-tuple correlation must traverse theL−ℓ−1L-\ell-1levels separating the tuple parent from the root. Including the fact that each grammatical tuple is observed with probability of order1/(v​m)1/(vm), this gives sample complexities of the formPℓsup≍v​mL−ℓ.P_{\ell}^{\mathrm{sup}}\asymp v\,m^{L-\ell}.The hardest step is therefore the first one,ℓ=0\ell=0, where one must learn the level-1 rules. Deep networks trained on the supervised RHM classification task were found to saturate this scaling and reconstruct the latent hierarchy in their representations.

Token-level self-supervision[33,38,39]:ZZis a masked or noised token. In the first step, local correlations between visible tokens identify the level-1 synonym classes, with sample complexityP1tok≍v​m3.P_{1}^{\mathrm{tok}}\asymp vm^{3}.Once these low-level latents have been internally reconstructed, the model can use them as coarse context variables. To learn higher levels, the relevant statistics are token–latent correlations: a visible token is correlated with a latent representation of its context. However, because the prediction target is still a surface token, the signal is averaged through the descendant channel from the latent scale down to the leaves. Each additional latent level therefore costs one extra factor ofmmin sample complexity. Thus, ifℓ≥1\ell\geq 1denotes the latent level being learned above the leaves,

Pℓtok≍v​mℓ+2.P_{\ell}^{\mathrm{tok}}\asymp v\,m^{\ell+2}.(1)Neural networks trained with token-level SSL objectives were empirically shown to saturate this staged scaling: as the dataset size increases, lower-level latents appear first, and higher-level latents become learnable thereafter. The overall sample-complexity is gated behind reconstructing latentsℓ=L−1\ell=L-1leading toPtok≍v​mL+1P_{\rm tok}\asymp vm^{L+1}.222Notice that since the root is sampled uniformly and each root symbol hasmmequiprobable rules, the distribution does not reveal how thev​mvmvalid top-level tuples are partitioned intovvgroups of sizemm, one group per root symbol. Thus unsupervised learning can recoverh(1),h(2),…,h(L−1)h^{(1)},h^{(2)},\ldots,h^{(L-1)}and the support of valid top-level tuples, but not the root labels themselves.

3Recovering the RHM hierarchy by clustering

Input:

PPsamples

x(1),…,x(P)x^{(1)},\ldots,x^{(P)}, RHM parameters

L,s,vL,s,v, and a clustering module

𝖢𝗅𝗎𝗌𝗍𝖾𝗋v\mathsf{Cluster}_{v}.

Output:estimated non-root hierarchy

h^(1),…,h^(L−1)\widehat{h}^{(1)},\ldots,\widehat{h}^{(L-1)}.

1

21exInitialize:

h^i(0)=xi\widehat{h}^{(0)}_{i}=x_{i}.

3

41exforℓ=0,1,…,L−2\ell=0,1,\ldots,L-2do

5Form all level-

ℓ\elltuples

T^u(ℓ)\widehat{T}^{(\ell)}_{u}.

6Estimate the support of grammatical tuples

𝒮^ℓ\widehat{\mathcal{S}}_{\ell}by the observed tuple set.

7For every

ν∈𝒮^ℓ\nu\in\widehat{\mathcal{S}}_{\ell}, count its incidence as

N​(ν)=∑p=1P𝟏​{T^ℓ(p)=ν}N(\nu)=\sum_{p=1}^{P}\mathbf{1}\{\widehat{T}^{(p)}_{\ell}=\nu\}and compute its empirical cousin context vector

ϕ^ℓ​(ν):=1N​(ν)​∑p=1PeZ^ℓ(p)​𝟏​{T^ℓ(p)=ν},\widehat{\phi}_{\ell}(\nu):=\frac{1}{N(\nu)}\sum_{p=1}^{P}e_{\widehat{Z}^{(p)}_{\ell}}\mathbf{1}\{\widehat{T}^{(p)}_{\ell}=\nu\},where

T^ℓ(p)\widehat{T}^{(p)}_{\ell}is a fixed tuple at level

ℓ\ellin sample

pp, and

Z^ℓ(p)\widehat{Z}^{(p)}_{\ell}is a fixed level-

ℓ\ellelement in a cousin tuple (i.e. sharing the

ℓ+2\ell+2grandparent with

T^ℓ(p)\widehat{T}^{(p)}_{\ell}).

8Cluster the context vectors:

𝒮^ℓ,1,…,𝒮^ℓ,v=𝖢𝗅𝗎𝗌𝗍𝖾𝗋v​({ϕ^ℓ​(ν):ν∈𝒮^ℓ}).\widehat{\mathcal{S}}_{\ell,1},\ldots,\widehat{\mathcal{S}}_{\ell,v}=\mathsf{Cluster}_{v}\!\left(\{\widehat{\phi}_{\ell}(\nu):\nu\in\widehat{\mathcal{S}}_{\ell}\}\right). 9Define the next-level latent label by the cluster identity:

h^u(ℓ+1)=aifT^u(ℓ)∈𝒮^ℓ,a.\widehat{h}^{(\ell+1)}_{u}=a\qquad\text{if}\qquad\widehat{T}^{(\ell)}_{u}\in\widehat{\mathcal{S}}_{\ell,a}.{\hbox to0pt{\vbox to0pt{\pgfpicture\makeatletter\hbox{\thinspace\lower 0.0pt\hbox to0.0pt{\pgfsys@beginscope\pgfsys@invoke{ }\definecolor{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@rgb@stroke{0}{0}{0}\pgfsys@invoke{ }\pgfsys@color@rgb@fill{0}{0}{0}\pgfsys@invoke{ }\pgfsys@setlinewidth{\the\pgflinewidth}\pgfsys@invoke{ }\nullfont\hbox to0.0pt{\pgfsys@beginscope\pgfsys@invoke{ }{{}}\hbox{\hbox{{\pgfsys@beginscope\pgfsys@invoke{ }{{}{}{{ {}{}}}{ {}{}} {{}{{}}}{{}{}}{}{{}{}} { }{{{{}}\pgfsys@beginscope\pgfsys@invoke{ }\pgfsys@transformcm{1.0}{0.0}{0.0}{1.0}{0.0pt}{0.0pt}\pgfsys@invoke{ }\hbox{{\definecolor{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@rgb@stroke{0}{0}{0}\pgfsys@invoke{ }\pgfsys@color@rgb@fill{0}{0}{0}\pgfsys@invoke{ }\hbox{{}} }}\pgfsys@invoke{ }\pgfsys@endscope}}} \pgfsys@invoke{ }\pgfsys@endscope}}} \pgfsys@invoke{ }\pgfsys@endscope\hbox to0.0pt{}{}{}{}\hss}\pgfsys@discardpath\pgfsys@invoke{ }\pgfsys@endscope\hss}}\endpgfpicture}}}Predictor,ppClusterer,CC

10-0.75em

Algorithm 1Iterative Latent Clustering (ILC) — seeFigure˜2for a graphical representation.As discussed in the previous section, the limitation of token-level objectives is not that they fail to learn latent variables. Rather, they learn them while still receiving supervision through visible tokens. The learn-from-your-own-latent setting studied below removes this residual token bottleneck: once levelℓ\ellhas been recovered, both the conditioning object and the prediction target are lifted to levelℓ\ell. The next stage is then again a local synonym-clustering problem, but now with the same statistical strength at every scale.

In other words, every level becomes as easy as the first token-level step. GivenEquation˜1, this suggests that the whole non-root hierarchy should be recoverable fromP≍v​m3P\asymp vm^{3}samples. This is illustrated inSection˜2–right.

Iterative latent-clustering algorithm (ILC).

The preceding section framed learning the RHM in terms of learning tuple-target correlations, and identifyingsynonymsby tuples that share identical correlations. Here, we cast this as a vector-clustering problem. LetTTdenote a level-ℓ\elltuple, letZZbe a level-ℓ\elltarget in a cousin tuple (cf. the arrangement inSection˜2– cousins are simply nodes sharing the same grandparent at levelℓ+2\ell+2), and leteZ∈ℝve_{Z}\in\mathbb{R}^{v}be its one-hot encoding. For each grammatical tupleν∈𝒮ℓ\nu\in\mathcal{S}_{\ell}, define theconditional context vector

ϕℓ​(ν):=𝔼​[eZ∣T=ν]∈Δv−1.\phi_{\ell}(\nu):=\mathbb{E}[e_{Z}\mid T=\nu]\in\Delta^{v-1}. x1x_{1}x2x_{2}x3x_{3}x4x_{4}x5x_{5}x6x_{6}x7x_{7}x8x_{8}ppppppppCCCCCCCCh^1(1)\widehat{h}_{1}^{(1)}h^2(1)\widehat{h}_{2}^{(1)}h^3(1)\widehat{h}_{3}^{(1)}h^4(1)\widehat{h}_{4}^{(1)}ppppCCCCh^1(2)\widehat{h}_{1}^{(2)}h^2(2)\widehat{h}_{2}^{(2)}

Figure 2:Graphical representation ofAlgorithm˜1forL=3L=3. Predictorppimplements steps 3-5, and clustererCCconstructs next-level latents. Highlighted prediction targets are as inSection˜2–right.As before,the essential observation is that synonyms have the same context vector. Ifν∈𝒮ℓ,a\nu\in\mathcal{S}_{\ell,a}, we denote the common context vector of parentaabyΦℓ,a:=ϕℓ​(ν)\Phi_{\ell,a}:=\phi_{\ell}(\nu). The goal is to cluster thev​mvmtuple context vectors into thesevvparent centers. This assignment of tuples into clusters explicitly constructs the next-level latents.

Algorithm˜1(Iterative Latent Clustering, or ILC) operationalizes this procedure. At each step, the algorithm assumes that levelℓ\ellhas been decoded, clusters level-ℓ\elltuples by their empirical context vectors, and thereby constructs levelℓ+1\ell+1. This clustering procedure is illustrated graphically inFigure˜2: the predictorppestimates context vectors (steps 3-5), while the clustererCCmaps these vectors to next-level latent labels (steps 6-7). We prove its sample complexity inTheorem˜1and validate it numerically inFigure˜3.

Theoretical sample complexity.

An RHM grammar isbalancedif, for every levelℓ\ell, every grammatical tuple occurs with probability of order1/(v​m)1/(vm). It isseparatedif every pair of parent context vectorΦℓ,a,Φℓ,b\Phi_{\ell,a},\Phi_{\ell,b}are separated by distance≳1/m\gtrsim 1/m. We refer the reader to AppendixBfor the formal statements of these assumptions and their justification in the limit ofv→∞v\to\inftyat fixedf∈(0,1)f\in(0,1). For a balanced and separated RHM grammar withf<1f<1, and assuming the existence of astableclustering module (see AppendixB), we have:

Theorem 1(Recovery of the non-root hierarchy; informal).

Fix an RHM grammar that is balanced and separated, and a clustering algorithm stable under perturbations. Then there exists a constantC>0C>0, depending only on the constants in those assumptions and onss, such that the following holds. If

P≥C​[v​m​log⁡L​v​mδ+v​m31−f​log⁡L​v​mδ],P\geq C\left[vm\log\frac{Lvm}{\delta}+\frac{vm^{3}}{1-f}\log\frac{Lvm}{\delta}\right],(2)then the iterative latent-clustering algorithm recoversh(1),h(2),…,h(L−1)h^{(1)},h^{(2)},\ldots,h^{(L-1)}and all production-rule classes below the root with probability at least1−δ1-\delta, up to independent permutations of the latent labels at each level.

In particular, for fixedffbounded away from11and up to logarithmic factors,PILC≍v​m3.P_{\mathrm{ILC}}\asymp vm^{3}.

In AppendixB, we give the formal theorem and its proof, which is a level-by-level induction based on concentration of the empirical context vectors. Notice thatv​m3vm^{3}is the scale needed to learn the first-level rules from visible tokens (seeSection˜2). However, once one level has been recovered, the decoded latent sequence is again an RHM with the same local parameters. The same sample scale therefore suffices at every subsequent level, allowing the algorithm to recover the non-root hierarchy without any exponential dependence onLL, as illustrated inSection˜2–right.

Numerical validation.

We generate samples from the fixed-tree RHM and apply the level-by-level procedure described above. At each level, we estimate the empirical context vectorsϕ^ℓ​(ν)\widehat{\phi}_{\ell}(\nu)for observed grammatical tuples and cluster them usingkk-means. The recovered labels are matched to the ground-truth latent labels by the optimal permutation, and we report the resulting reconstruction accuracy. Figure3–leftshows the reconstruction accuracy forL=5L=5,s=3s=3,v=8v=8, and varyingmm. In the unrescaled plot, the transition to perfect reconstruction shifts to larger sample sizes asmmincreases. After rescaling the number of samples byv​m3,vm^{3},the curves collapse and reconstruction becomes accurate onceP/v​m3P/vm^{3}is of order one. This confirms the predictedm3m^{3}scaling of the local clustering threshold. The algorithm recovers all non-root levels at this same scale, consistent with the theory that the first-level recovery threshold is sufficient to climb the hierarchy recursively.

Refer to captionFigure 3:Recursive clustering recovers the RHM hierarchy at the predicted scale.Left: Clustering reconstruction accuracy of the iterative clustering algorithm usingkk-means (L=5,s=3,v=8L=5,s=3,v=8).Right: Class reconstruction accuracy of stacked SLC architecture (L=4,s=3,v=10L=4,s=3,v=10) trained end-to-end, evaluated by linear probe on the highest level latentsh^(L−1)\widehat{h}^{(L-1)}. In each panel, the main plot is ascaling collapse, effected by rescaling number of samples by predicted sample complexityP∼v​m3P\sim vm^{3}. Insets show raw data (top left) and inferred sample complexity (bottom right)P∗P^{\ast}for which accuracy reaches 50%.

4Gradient-based Iterative Latent Clustering

In this section, we ask whether the same scaling of the ILC algorithm can be achieved by a differentiable architecture trained end-to-end by gradient descent.

The Stacked Latent-Clustering (SLC) network.

We mirror the recursion ofAlgorithm˜1with a stack ofL−1L-1identical modules (Figure˜2). Each module is composed of two subnetworks that play the roles of the two operations in the algorithm.

  • •pp: ApredictorPred(ℓ)\mathrm{Pred}^{(\ell)}takes anss-tuple of level-ℓ\elltokens and outputs, via cross-entropy training, a categorical distribution over its cousin tokens; this is the neural analog of the empirical cousin context vector.
  • •CC: AclustererClust(ℓ)\mathrm{Clust}^{(\ell)}then maps each prediction vector to a soft assignment in a discrete codebook through a contrastive objective: prediction vectors with high similarity are pulled to the same code while dissimilar ones are pushed apart – implementing𝖢𝗅𝗎𝗌𝗍𝖾𝗋v\mathsf{Cluster}_{v}in the neural setting.

The cluster assignmentsh^(ℓ+1)\widehat{h}^{(\ell+1)}produced by levelℓ\ellbecome the input tokens of the level-(ℓ+1)(\ell+1)module. To keep the difficulty of the prediction task constant across depth, cluster outputs are tokenized with a softmax. Prediction targets are computed by a teacher network whose weights track an exponential moving average of the student,Wteacher←(1−αema)​Wteacher+αema​Wstudent,W_{\mathrm{teacher}}\leftarrow(1-\alpha_{\mathrm{ema}})\,W_{\mathrm{teacher}}+\alpha_{\mathrm{ema}}\,W_{\mathrm{student}},a common trick to prevent the representation collapse common to self-distilled SSL[49]. Full architectural and training details are given in AppendixC.1.

Results.

Figure˜3–rightshows the accuracy of a linear probe trained to recover the top-level latent from the frozen final-layer representations of an SLC network, forL=4L=4,s=3s=3,v=10v=10, and varyingmm. Rescaling the sample axis byv​m3vm^{3}collapses the curves, confirming that the network saturates the clustering threshold ofTheorem˜1.Figure˜12shows that the sample complexity forL∈{3,…,7}L\in\{3,\ldots,7\}does not vary. Interestingly, inserting a stop-gradient at every module boundary (or even betweenppandCC), so that each level is trainedonlyby its own prediction and clustering losses, leaves thev​m3vm^{3}scaling unchanged (Figure˜9). SLC therefore admits a strictly local learning rule, consistent with biologically motivated learning schemes.

SLC shows that anexplicithierarchical architecture, with one predictor-clusterer module per latent level, can saturate thev​m3vm^{3}statistical bound ofSection˜3. In the next section, we demonstrate that the SSL representation learning objective of data2vec leads toimplicithierarchical supervision, and correspondingly also saturates this learning bound.

5Analysis of data2vec

data2vecstudentrepresentationsteacherrepresentationsaax1x_{1}bbx2x_{2}?x3x_{3}ddx4x_{4}?x5x_{5}aax1x_{1}bbx2x_{2}ccx3x_{3}ddx4x_{4}eex5x_{5}NNNNN−1N-1N−1N-1N−2N-2N−2N-21111⋮\vdots⋮\vdotsPredℓd2v\ell_{\mathrm{d2v}}avg.lastKKRefer to caption

Figure 4:Data2vec on RHM.Left:data2vec learns in a teacher-student setting; the student is supervised on reconstruction of pooled teacher representations at masked token positions.Right:Root-classification accuracy of probes on pooled final-layer features, as a function of the number of online pretraining samplesPP. RHM parameters:v=16v=16,s=2s=2,L=4L=4. Inset: raw curves vs.PPand inferred sample complexityP∗P^{\ast}for which accuracy reaches 50%. Main: rescaling byv​m3vm^{3}, curves collapse. Lines are averages over three independent realizations, dots are individual runs.In this section, we argue that the iterative clustering mechanism isimplicitin data2vec[24], a popular SSL method that predicts its own latent representations. We choose data2vec over the closely related JEPA family because it comes with a published recipe for discrete-token inputs. In particular, data2vec uses a student-teacher setup. The student network is fed a partially masked inputSS; the teacher sees the unmasked inputxx. At every masked positionii, the student is trained, with a squared loss, to predict the targetYi​(x)Y_{i}(x)given by the average of the lastKKblock activations of a teacher network applied to the unmasked inputxx, as illustrated inFigure˜4–left. As with SLC, the teacher has the same architecture as the student, and its weights track an exponential moving average of the student’s weights, updated after each gradient step.

Empirical sample complexity.

We pretrain data2vec on the RHM. We follow the data2vec text recipe: a fractionpmask=0.15p_{\rm mask}=0.15of positions are masked. Full details are presented in AppendixD. We compare two data-presentation regimes:online, where each step draws a fresh batch from the grammar, andoffline, where a fixed training set ofPPunique strings is reused across epochs. After pretraining, we freeze the encoder and train a one-hidden-layer MLP probe to predict the root label from mean-pooled final-layer features.

Figure˜4–rightshows the downstream root-classification accuracy of the probes forL=4L=4,s=2s=2,v=16v=16in the online setting.333The same plots are reported in AppendixDfor the offline setting.The learning curves collapse onto a single curve after rescaling the sample axis byv​m3vm^{3}. Root classification on the RHM requires resolving allL−1L-1non-root latent levels and then identifying the partition of top-level tuples into root classes. As discussed inSection˜2, token-level SSL is bottlenecked by learning the level-(L−1)(L-1)rules from tokens, which costs∼v​mL+1=v​m5\sim vm^{L+1}=vm^{5}samples; an additional𝒪​(v​m)\mathcal{O}(vm)labeled examples then suffice to learn the root partition.444For reference, fully supervised learning of the root from leaves would itself require∼v​mL=v​m4\sim vm^{L}=vm^{4}samples, so thev​m3vm^{3}scaling we observe is in fact better than what supervised learning achieves – and data2vec uses no labels during pretraining.The observed scaling is hence incompatible with this token-level prediction baseline.

Theoretical analysis.

We now explain thev​m3vm^{3}scaling by tying data2vec back to the iterative latent clustering mechanism ofSection˜3. Although the targetYi​(x)Y_{i}(x)is a continuous vector, we posit it acts as a soft encoding of the RHM latents that the encoder has already learned: at any point in training,Yi​(x)Y_{i}(x)contains a linear component of every latent learned in the encoder (as we verify inFigure˜13). We idealize the slow EMA as a discretized sequence of phases: within each phase, the teacher is held fixed, and between phases it is refreshed to absorb the latents most recently acquired by the student. The argument then proceeds by induction over the levels. Phase0reduces data2vec training to the token-level masked-prediction problem ofSection˜2and recovers the level-11latents atP≳v​m3P\gtrsim vm^{3}. Each subsequent phaseliftsthe prediction problem onto level-ℓ\elltuples, where it coincides with the cousin-tuple clustering problem ofSection˜3– with the same sample complexity. The full non-root hierarchy is therefore recovered at this scale. We now formalize this argument through two assumptions.

*(A1) Targets carry the learned latents.*Letzi(ℓ):=h⌈i/sℓ⌉(ℓ)z_{i}^{(\ell)}:=h^{(\ell)}_{\lceil i/s^{\ell}\rceil}be the level-ℓ\ellancestor of positionii, and letez∈ℝve_{z}\in\mathbb{R}^{v}denote the one-hot encoding of a latent symbolzz. If the encoder linearly represents the latentsh(1),…,h(ℓ)h^{(1)},\ldots,h^{(\ell)}, then the teacher target admits a decomposition

Yi​(x)=Fi​(S)+∑a=0ℓBa​ezi(a)+residual,Y_{i}(x)\;=\;F_{i}(S)\;+\;\sum_{a=0}^{\ell}B_{a}\,e_{z_{i}^{(a)}}\;+\;{\rm residual},whereFi​(S)F_{i}(S)collects components determined by the visible input and the matricesBaB_{a}are non-zero linear projections. The residual term collects everything else, which is typically nonlinear and not used in our analysis. The residual architecture of the transformer makes this natural: a feature decodable from one block propagates through the identity path to every later block, and hence appears with a non-zero linear coefficient in the layers entering the top-KKteacher average. At initialization,ℓ=0\ell=0and the only ancestor isxix_{i}itself, soYi​(x)Y_{i}(x)contains a linear component of the masked one-hot inputs.

*(A2) Correlation learning.*Whenever a correlation between the prediction target and a feature of the visible input becomes detectable above sampling noise, gradient descent extracts that feature. This is the same correlation-learning hypothesis used in previous RHM analyses[33,38], where it is supported empirically and by one-step gradient calculations.

Phase 0.

The population minimizer of the data2vec loss is the conditional expectation𝔼​[Yi​(x)∣S]\mathbb{E}[Y_{i}(x)\mid S]. By linearity of conditional expectation, in phase0, predictingYi​(x)Y_{i}(x)fromSSreduces, up to the visible component and the non-linear residual, to predicting the masked one-hotexie_{x_{i}}from its visible context. This is exactly the token-level masked-prediction problem analyzed inSection˜2: local token–token correlations become detectable aboveP≳v​m3P\gtrsim vm^{3}, and by (A2) the network extracts them. The resulting representation collapses synonymous level-0tuples onto a common image, and the level-11ancestor becomes linearly decodable from the encoder.

Phaseℓ≥1\ell\geq 1.

Suppose phases0,…,ℓ−10,\ldots,\ell-1have produced linearly decodable representations ofh(1),…,h(ℓ)h^{(1)},\ldots,h^{(\ell)}inside the encoder. After the next teacher update, these latents enter the target, and (A1) gives the new component𝔼​[ezi(ℓ)∣S]\mathbb{E}[e_{z_{i}^{(\ell)}}\mid S]in the optimal predictor. Because levels1,…,ℓ1,\ldots,\ellare decodable, the visible contextSScan be parsed into level-ℓ\ellsymbols away from the masked position. In particular, it contains the level-ℓ\ellcousin tupleTℓT_{\ell}, the closest level-ℓ\elltuple sitting outside the production rule of positionii. Conditioning onTℓT_{\ell}produces the context vectorϕℓ​(ν)=𝔼​[ezi(ℓ)∣Tℓ=ν]\phi_{\ell}(\nu)=\mathbb{E}[e_{z_{i}^{(\ell)}}\mid T_{\ell}=\nu]studied inSection˜3. The prediction problem at scaleℓ\ellhasliftedfrom leaves to level-ℓ\elltuples, and is now identical to the clustering problem solved by the ILC algorithm of that section, requiringP≳v​m3P\gtrsim vm^{3}. When the network extracts this signal, the level-(ℓ+1)(\ell+1)ancestorzi(ℓ+1)z_{i}^{(\ell+1)}becomes linearly decodable, and the next teacher update carries it into the target.

The induction continues up toℓ=L−2\ell=L-2. Every phase shares the same sample complexity. Since the same training set is reused across phases, the offline sample complexity equals the per-phase threshold,Pd2v≍v​m3P_{\rm d2v}\;\asymp\;vm^{3}, and depth enters only through the number of phases, not through the per-phase requirement.

Refer to captionFigure 5:Data2vec clusters synonyms.Synonym clustering score𝒞ℓ\mathcal{C}_{\ell}of the encoder atℓ∈{1,2,3}\ell\in\{1,2,3\}, forL=4L=4,s=2s=2,v=16v=16. A positive score means synonyms are mapped closer together than non-synonyms.*Insets:*raw curves vs.PP.*Main:*same curves rescaled byv​m3vm^{3}collapse.

The encoder internally clusters synonyms.

Our theoretical argument predicts that the data2vec encoder collapses synonymous level-ℓ\elltuples onto a common representation – the very geometric signature of clustering. We test this prediction directly by measuring thesynonym clustering score𝒞ℓ\mathcal{C}_{\ell}of the encoder at levelℓ\ellby comparing the change in its hidden representations under two interventions on the input: asynonym swap, which resamples every level-ℓ\ellproduction rule and thus replaces each level-(ℓ−1)(\ell-1)tuple by a synonym while leaving higher-level latents unchanged, and ageneric swapto an unrelated input. The score is normalized so that𝒞ℓ=0\mathcal{C}_{\ell}=0when synonym swaps move the representation as much as generic swaps (no clustering), and𝒞ℓ=1\mathcal{C}_{\ell}=1when synonyms are mapped to a common image (perfect clustering).Figure˜5reports𝒞ℓ\mathcal{C}_{\ell}at each non-root level after the third block of the encoder as a function of pretraining samplesPP. The scores at all three levels rise from zero and collapse onto a common curve once the sample axis is rescaled byv​m3vm^{3}. This is direct evidence that data2vec’s encoder implements the same recursive clustering mechanism as the ILC algorithm ofSection˜3, with the same per-levelv​m3vm^{3}sample complexity predicted by the theory.

6Conclusion

We have proved that learning from one’s own latents recovers the full non-root latent tree of the Random Hierarchy Model from a number of samples scaling asm3m^{3}, exponentially fewer than themL+1m^{L+1}required by token-level self-supervised objectives. We confirmed this prediction with a hierarchical clustering algorithm, an end-to-end neural network of predictor-clusterer modules, and the first sample-complexity analysis of data2vec. The main lesson is that latents at the same level of the hierarchy are far more correlated with each other than they are with raw tokens, so predicting from one’s own latents amplifies a signal that token-level prediction dilutes. This places on a firm quantitative footing the common intuition that token-level prediction is suboptimal.

Practical implications.

Recent work supports that empirical neural scaling laws in language are governed by power-law decays of token correlations with context length[33,34,38]. Can generative models trained with own-latent self-supervision systematically beat existing scaling laws, alone or in combination with token-level losses? A useful first step would be a controlled comparison between data2vec and a same-architecture next-token-prediction baseline as the training-set sizePPis varied: at smallPPthe two should diverge sharply, while at very largePPthey should converge to the same latent representations. Such a comparison would empirically test our central prediction and, if confirmed, motivate latent-supervised generative models as a route to break current scaling regimes.

Biological plausibility.

As we show in AppendixC.2, the predictor-clusterer modules ofSection˜4can also reach them3m^{3}scaling of sample complexity using only quasilocal learning rules – stop-gradients between modules in place of end-to-end backpropagation. In this local learning setting our method is conceptually similar to contrastive predictive coding[50]and CLAPP[20], which both propose biologically plausible self-supervised objectives that predict in latent rather than raw-sensory space. This raises the question of whether the remarkable sample efficiency of the human brain stems from a similar mechanism. Very recently, CLAPP was trained on the RHM in a setting where information on the top root label was given, recovering the supervised learning sample complexitymLm^{L}– seeappendix˜Afor details. It would be interesting to test whether such bio-plausible architectures can reach am3m^{3}sample complexity on the RHM simply by learning from their own latent.

Limitations.

The RHM is a deliberately simple synthetic language: its context-free grammar has a fixed tree topology and unambiguous, non-recursive production rules. Extending the analysis to (i) variable tree topologies, (ii) recursive production rules that can call themselves, and (iii) explicitly context-dependent rules will bring the theory closer to natural language and images. Such extensions are likely important to characterize precisely the relative strengths and weaknesses of various architectures. Although our analysis suggests that stacking JEPAs as in H-JEPA is largely redundant, very recent self-supervised architectures claim explicit hierarchical supervision but are neither stacked JEPAs nor stacked data2vecs, and instead resemble data2vec-style training procedures with multi-scale teachers[51,52]. These architectures would be interesting to study theoretically. More broadly, developing a suite of synthetic data models will provide a laboratory in which new architectures can be designed, tested, and theoretically understood before being scaled up.

Acknowledgments and Disclosure of Funding

We thank Francesco Cagnetta, Francesco D’Amico, Ariane Delrocq, Antonio Sclocchi, and Wu Zihan for discussions and feedback on the manuscript, and the members of the Simons collaboration for discussions. D. J. Korchinski acknowledges financial support from the Natural Sciences and Engineering Research Council of Canada (NSERC PDF - 587940 - 2024). A. Favero acknowledges support from the Infosys-Cambridge AI Centre. This work was supported by the Simons Foundation through the Simons Collaboration on the Physics of Learning and Neural Computation (Award ID: SFI-MPS-POL00012574-05), PI Wyart.

References

  • Ho et al. [2020]Jonathan Ho, Ajay Jain, and Pieter Abbeel.Denoising diffusion probabilistic models.InAdvances in Neural Information Processing Systems, 2020.
  • Rombach et al. [2022]Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.High-resolution image synthesis with latent diffusion models.InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Brooks et al. [2024]Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Wing Yin Ng, Ricky Wang, and Aditya Ramesh.Video generation models as world simulators.OpenAI Technical Report, 2024.URLhttps://openai.com/research/video-generation-models-as-world-simulators.
  • Brown et al. [2020]Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al.Language models are few-shot learners.InAdvances in Neural Information Processing Systems, 2020.
  • OpenAI [2023]OpenAI.GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023.
  • DeepSeek-AI [2025]DeepSeek-AI.DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025.
  • Hoffmann et al. [2022]Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, et al.Training compute-optimal large language models.InAdvances in Neural Information Processing Systems, 2022.
  • Touvron et al. [2023]Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample.LLaMA: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023.
  • Gilkerson et al. [2017]Jill Gilkerson, Jeffrey A. Richards, Steven F. Warren, Judith K. Montgomery, Charles R. Greenwood, D. Kimbrough Oller, John H. L. Hansen, and Terrance D. Paul.Mapping the early language environment using all-day recordings and automated analysis.American Journal of Speech-Language Pathology, 26(2):248–265, 2017.
  • Linzen [2020]Tal Linzen.How can we accelerate progress towards human-like linguistic generalization?InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
  • Frank [2023]Michael C. Frank.Bridging the data gap between children and large language models.Trends in Cognitive Sciences, 27(11):990–992, 2023.
  • Schuhmann et al. [2022]Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev.LAION-5B: An open large-scale dataset for training next generation image-text models.InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
  • Smith and Gasser [2005]Linda B. Smith and Michael Gasser.The development of embodied cognition: Six lessons from babies.Artificial Life, 11(1–2):13–29, 2005.
  • Alayrac et al. [2022]Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al.Flamingo: a visual language model for few-shot learning.InAdvances in Neural Information Processing Systems, 2022.
  • Rao and Ballard [1999]Rajesh P. N. Rao and Dana H. Ballard.Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects.Nature Neuroscience, 2(1):79–87, 1999.
  • Friston [2010]Karl Friston.The free-energy principle: a unified brain theory?Nature Reviews Neuroscience, 11(2):127–138, 2010.
  • LeCun et al. [2022]Yann LeCun et al.A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27.Open Review, 62(1):1–62, 2022.
  • Tack et al. [2025]Jihoon Tack, Jack Lanchantin, Jane Yu, Andrew Cohen, Ilia Kulikov, Janice Lan, Shibo Hao, Yuandong Tian, Jason Weston, and Xian Li.LLM pretraining with continuous concepts.arXiv preprint arXiv:2502.08524, February 2025.doi:10.48550/arXiv.2502.08524.
  • Liu et al. [2026]Yuliang Liu, Yunchong Song, Yixuan Wang, Kewen Ge, Alex Lamb, Qipeng Guo, Kai Chen, Bowen Zhou, and Zhouhan Lin.Next concept prediction in discrete latent space leads to stronger language models.arXiv preprint arXiv:2602.08984, February 2026.doi:10.48550/arXiv.2602.08984.
  • Illing et al. [2021]Bernd Illing, Jean Ventura, Guillaume Bellec, and Wulfram Gerstner.Local plasticity rules can learn deep representations using self-supervised contrastive predictions.InAdvances in Neural Information Processing Systems, volume 34, pages 30365–30379. Curran Associates, Inc., 2021.
  • Millidge et al. [2022]Beren Millidge, Anil Seth, and Christopher L. Buckley.Predictive coding: a theoretical and experimental review.arXiv preprint arXiv:2107.12979, 2022.
  • Grill et al. [2020]Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko.Bootstrap your own latent a new approach to self-supervised learning.InProceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, pages 21271–21284, Red Hook, NY, USA, December 2020. Curran Associates Inc.ISBN 978-1-7138-2954-6.
  • Caron et al. [2021]Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin.Emerging properties in self-supervised vision transformers.InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
  • Baevski et al. [2022]Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli.Data2vec: A general framework for self-supervised learning in speech, vision and language.InProceedings of the 39th International Conference on Machine Learning, pages 1298–1312. PMLR, June 2022.
  • Assel et al. [2025]Hugues Van Assel, Mark Ibrahim, Tommaso Biancalani, Aviv Regev, and Randall Balestriero.Joint embedding vs reconstruction: Provable benefits of latent space prediction for self-supervised learning.arXiv preprint arXiv:2505.12477, 2025.
  • Li et al. [2025]Lihuan Li, Hao Xue, Shuang Ao, Yang Song, and Flora Salim.HiT-JEPA: A Hierarchical Self-supervised Trajectory Embedding Framework for Similarity Computation.arXiv preprint arXiv:2507.00028, June 2025.doi:10.48550/arXiv.2507.00028.
  • Girgis et al. [2026]Abanoub M. Girgis, Ibtissam Labriji, and Mehdi Bennis.Hierarchical JEPA meets predictive remote control in beyond 5G networks.arXiv preprint arXiv:2602.07000, January 2026.doi:10.48550/arXiv.2602.07000.
  • Booth and Thompson [1973]Taylor L. Booth and Richard A. Thompson.Applying probability measures to abstract languages.IEEE Transactions on Computers, C-22(5):442–450, 1973.
  • Manning and Schütze [1999]Christopher D. Manning and Hinrich Schütze.Foundations of Statistical Natural Language Processing.MIT Press, Cambridge, MA, 1999.
  • Grenander [1996]Ulf Grenander.Elements of Pattern Theory.Johns Hopkins University Press, 1996.
  • Mumford [1994]David Mumford.Pattern theory: a unifying perspective.InFirst European Congress of Mathematics, Vol. I (Paris, 1992), volume 119 ofProgress in Mathematics, pages 187–224. Birkhäuser, 1994.
  • Cagnetta et al. [2024]Francesco Cagnetta, Leonardo Petrini, Umberto M Tomasini, Alessandro Favero, and Matthieu Wyart.How deep neural networks learn compositional data: The random hierarchy model.Physical Review X, 14(3):031001, 2024.
  • Cagnetta and Wyart [2024]Francesco Cagnetta and Matthieu Wyart.Towards a theory of how the structure of language is acquired by deep neural networks.Advances in Neural Information Processing Systems, 37:83119–83163, 2024.
  • Cagnetta et al. [2026]Francesco Cagnetta, Allan Raventós, Surya Ganguli, and Matthieu Wyart.Deriving neural scaling laws from the statistics of natural language.arXiv preprint arXiv:2602.07488, 2026.
  • Sclocchi et al. [2025a]Antonio Sclocchi, Alessandro Favero, and Matthieu Wyart.A phase transition in diffusion models reveals the hierarchical nature of data.Proceedings of the National Academy of Sciences, 122(1):e2408799121, January 2025a.doi:10.1073/pnas.2408799121.
  • Sclocchi et al. [2025b]Antonio Sclocchi, Alessandro Favero, Noam Itzhak Levi, and Matthieu Wyart.Probing the latent hierarchical structure of data via diffusion models.Journal of Statistical Mechanics: Theory and Experiment, 2025(8):84005, August 2025b.ISSN 1742-5468.doi:10.1088/1742-5468/aded6c.
  • Favero et al. [2025a]Alessandro Favero, Antonio Sclocchi, and Matthieu Wyart.Bigger isn’t always memorizing: Early stopping overparameterized diffusion models.arXiv preprint arXiv:2505.16959, 2025a.
  • Favero et al. [2025b]Alessandro Favero, Antonio Sclocchi, Francesco Cagnetta, Pascal Frossard, and Matthieu Wyart.How compositional generalization and creativity improve as diffusion models are trained.InInternational Conference on Machine Learning, pages 16286–16306. PMLR, 2025b.
  • Cagnetta et al. [2025]Francesco Cagnetta, Alessandro Favero, Antonio Sclocchi, and Matthieu Wyart.Scaling laws and representation learning in simple hierarchical languages: Transformers versus convolutional architectures.Physical Review E, 112(6):065312, 2025.
  • Poggio et al. [2017]Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao.Why and when can deep-but-not-shallow networks avoid the curse of dimensionality: A review.International Journal of Automation and Computing, 14(5):503–519, 2017.
  • Schmidt-Hieber [2020]Johannes Schmidt-Hieber.Nonparametric regression using deep neural networks with ReLU activation function.Annals of Statistics, 48(4):1875–1897, 2020.
  • Mei [2025]Song Mei.U-Nets as belief propagation: Efficient classification, denoising, and diffusion in generative hierarchical models.InInternational Conference on Learning Representations (ICLR), 2025.arXiv:2404.18444.
  • Allen-Zhu and Li [2025]Zeyuan Allen-Zhu and Yuanzhi Li.Physics of language models: Part 1, learning hierarchical language structures.Transactions on Machine Learning Research, 2025.URLhttps://openreview.net/forum?id=mPQKyzkA1K.
  • Zhao et al. [2023]Haoyu Zhao, Abhishek Panigrahi, Rong Ge, and Sanjeev Arora.Do transformers parse while predicting the masked word?InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 16513–16542, Singapore, 2023. Association for Computational Linguistics.doi:10.18653/v1/2023.emnlp-main.1029.URLhttps://aclanthology.org/2023.emnlp-main.1029/.
  • Garnier-Brun et al. [2025]Jerome Garnier-Brun, Marc Mézard, Emanuele Moscato, and Luca Saglietti.How transformers learn structured data: Insights from hierarchical filtering.InInternational Conference on Machine Learning (ICML), 2025.arXiv:2408.15138.
  • Mossel [2016]Elchanan Mossel.Deep learning and hierarchical generative models.arXiv preprint arXiv:1612.09057, 2016.
  • Malach and Shalev-Shwartz [2018]Eran Malach and Shai Shalev-Shwartz.A provably correct algorithm for deep learning that actually works.arXiv preprint arXiv:1803.09522, 2018.
  • DeGiuli [2019]Eric DeGiuli.Random language model.Physical Review Letters, 122(12):128301, 2019.doi:10.1103/PhysRevLett.122.128301.
  • Balestriero et al. [2023]Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian, Avi Schwarzschild, Andrew Gordon Wilson, Jonas Geiping, Quentin Garrido, Pierre Fernandez, Amir Bar, Hamed Pirsiavash, Yann LeCun, and Micah Goldblum.A cookbook of self-supervised learning.arXiv preprint arXiv:2304.12210, 2023.
  • van den Oord et al. [2019]Aaron van den Oord, Yazhe Li, and Oriol Vinyals.Representation Learning with Contrastive Predictive Coding.arXiv preprint arXiv:1807.03748, January 2019.doi:10.48550/arXiv.1807.03748.
  • Lowe et al. [2026]Scott C. Lowe, Anthony Fuller, Sageev Oore, Evan Shelhamer, and Graham W. Taylor.Self-distillation of hidden layers for self-supervised representation learning.arXiv preprint arXiv:2603.15553, March 2026.doi:10.48550/arXiv.2603.15553.
  • Maes et al. [2026]Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero.LeWorldModel: Stable end-to-end joint-embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312, March 2026.doi:10.48550/arXiv.2603.19312.
  • Ren et al. [2026]Yunwei Ren, Yatin Dandi, Florent Krzakala, and Jason D. Lee.Provable learning of random hierarchy models and hierarchical shallow-to-deep chaining.arXiv preprint arXiv:2601.19756, 2026.
  • Parley et al. [2026]Jack T Parley, Francesco Cagnetta, and Matthieu Wyart.Deep networks learn to parse uniform-depth context-free languages from local statistics.arXiv preprint arXiv:2602.06065, 2026.
  • Defilippis et al. [2026]Lorenzo Defilippis, Florent Krzakala, Bruno Loureiro, and Antoine Maillard.Optimal scaling laws in learning hierarchical multi-index models.arXiv preprint arXiv:2602.05846, 2026.
  • Caron et al. [2018]Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze.Deep clustering for unsupervised learning of visual features.InProceedings of the European Conference on Computer Vision (ECCV), pages 132–149, 2018.
  • Caron et al. [2020]Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin.Unsupervised learning of visual features by contrasting cluster assignments.InProceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, pages 9912–9924, Red Hook, NY, USA, December 2020. Curran Associates Inc.ISBN 978-1-7138-2954-6.
  • Oquab et al. [2024]Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski.DINOv2: Learning robust visual features without supervision.Transactions on Machine Learning Research, January 2024.ISSN 2835-8856.
  • Siméoni et al. [2026]Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seung Eun Yi, Michael Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Herve Jegou, Patrick Labatut, and Piotr Bojanowski.DINOv3.Transactions on Machine Learning Research, May 2026.ISSN 2835-8856.
  • Zhou et al. [2022]Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong.iBOT: Image BERT pre-training with online tokenizer.InInternational Conference on Learning Representations, January 2022.
  • Darcet et al. [2025]Timothée Darcet, Federico Baldassarre, Maxime Oquab, Julien Mairal, and Piotr Bojanowski.Cluster and predict latent patches for improved masked image modeling.Transactions on Machine Learning Research, February 2025.ISSN 2835-8856.
  • Mur-Labadia et al. [2026]Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes.V-JEPA 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, March 2026.doi:10.48550/arXiv.2603.14482.
  • Delrocq et al. [2026]Ariane Delrocq, Wu S. Zihan, Guillaume Bellec, and Wulfram Gerstner.Self-supervised local learning rules learn the hidden hierarchical structure of high-dimensional data.arXiv preprint arXiv:2605.18557, May 2026.doi:10.48550/arXiv.2605.18557.
  • Quinton and Rey [2025]Pierre Quinton and Valérian Rey.Jacobian descent for multi-objective optimization.arXiv preprint arXiv:2406.16232, February 2025.doi:10.48550/arXiv.2406.16232.
  • Akiba et al. [2019]Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama.Optuna: A next-generation hyperparameter optimization framework.InProceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, pages 2623–2631, New York, NY, USA, July 2019. Association for Computing Machinery.ISBN 978-1-4503-6201-6.doi:10.1145/3292500.3330701.

Appendix AFurther related work

Approximation- and information-theoretic bounds.

Deep networks can represent hierarchically compositional functions with exponentially fewer parameters than shallow ones[40], which translates into nonparametric sample-complexity bounds polynomial in the input dimension for both supervised[41]and self-supervised[42]tasks. These results characterize expressivity and information-theoretic statistics, but do not describe what gradient descent can actually learn from a finite training set.

Supervised learning.

Mossel [46]introduced phylogeny-inspired generative models on regular trees, andMalach and Shalev-Shwartz [47]showed that they can be learned by an iterative clustering algorithm.DeGiuli [48]introduced languages with random production rules. The RHM[32]combines both assumptions, leading to models with controllable correlations where sample complexity can be predicted and compared to observations in deep nets.Ren et al. [53]proved rigorously that result for multi-step gradient descent.Parley et al. [54]obtains sample complexity for non-regular trees.Defilippis et al. [55]derive optimal scaling laws for hierarchical multi-index models, a related class of generative models with continuous latents.

Self-supervised learning.

A complementary line of work has shown that transformers can approximately implement parsing-style (inside-algorithm) inference on context-free grammars[43,44,45]. These results interpret the algorithm that trained LLMs implement, but do not characterize the learning mechanism nor its sample complexity. Such analyses do exist for the RHM, but only for next-token prediction[32]and for diffusion[38,37], not for learning from one’s own latents as considered here.

Clustering-based approaches to self-distillation.

Our SLC method is similar in spirit to clustering or self-learned pseudo-labeling-based methods for image-representation learning, such as DeepCluster[56]which in turn evolved into SwAV[57], the DINO family of models[23,58,59], iBOT[60], and CAPI[61]. Our method addsexplicithierarchical clustering – it would be interesting to test to what degreeimplicithierarchical supervision is present for these clustering-based approaches.

Explicitly hierarchical self-distillation SSL.

Schematic realizations of the H-JEPA[17], V-JEPA 2.1[62], Bootleg[51], and our SLC architectures are depicted infig.˜6. Bootleg and V-JEPA 2.1 are extremely recent, report state of the art performance, and employ multi-loss optimization with one prediction per sampled representation level. Bootleg is schematically very similar to data2vec (cf.fig.˜4–left), except that in bootleg gradients are computed after averaging losses over representations, whereas in data2vec the averaging is done first over representations. V-JEPA 2.1 meanwhile collates representations from throughout the student network, similar to H-JEPA, but also allows low-level target predictions to be influenced by higher level latents. This high→\rightarrowlow prediction path is missing in H-JEPA, but present in SLC, V-JEPA 2.1, and Bootleg. It is important to clarify whether this is related to the fact that, with few exceptions[26,27], “standard” H-JEPA appears to have found limited application in the literature.

Predictive coding and the RHM.

Delrocq et al. [63]recently showed that a backprop-free predictive coding model, CLAPP, can learn the RHM’s hierarchical structure with sample complexity comparable to that of backprop. In that work, the model was trained on the RHM in the setting where all production rules exist, such that all sentences belong to the language (corresponding tof=1f=1in our notations). In that case, self-supervised methods cannot learn the RHM rules, as all inputs are equiprobable and correlations within the input are absent. However, the variation of CLAPP considered in that work uses a contrastive update rule that is gated by an extra learning signal: updates are given a±1\pm 1sign indicating whether pairs belong to the same top-level class. The sample complexity obtained was that of supervised learning∼mL\sim m^{L}. Baseline CLAPP[20]does not rely on this non-local context supervision – it would be interesting to understand whether it learns the RHM rules in sample complexitym3m^{3}whenf<1f<1.

H-JEPAstudentrepresentationsteacherrepresentationsMSEMSEMSEMSEaax1x_{1}bbx2x_{2}?x3x_{3}ddx4x_{4}?x5x_{5}aax1x_{1}bbx2x_{2}ccx3x_{3}ddx4x_{4}eex5x_{5} V-JEPA 2.1studentrepresentationsteacherrepresentationspredMSEMSEMSE⋮\vdots⋮\vdots⋮\vdots⋮\vdots⋮\vdots⋮\vdotsaax1x_{1}bbx2x_{2}?x3x_{3}ddx4x_{4}?x5x_{5}aax1x_{1}bbx2x_{2}ccx3x_{3}ddx4x_{4}eex5x_{5} BootlegstudentrepresentationsteacherrepresentationsMSEMSEMSE⋮\vdots⋮\vdots⋮\vdots⋮\vdots⋮\vdots⋮\vdotsaax1x_{1}bbx2x_{2}?x3x_{3}ddx4x_{4}?x5x_{5}aax1x_{1}bbx2x_{2}ccx3x_{3}ddx4x_{4}eex5x_{5} Stacked Latent Clusteringstudentrepresentationsteacherrepresentationsaax1x_{1}bbx2x_{2}ccx3x_{3}ddx4x_{4}eex5x_{5}aax1x_{1}bbx2x_{2}ccx3x_{3}ddx4x_{4}eex5x_{5}⋮\vdots⋮\vdotsCELCELCEL

Figure 6:Top left:Structure of losses and targets for naïve H-JEPAs.Top right:Structure of losses and targets in V-JEPA 2.1.Bottom left:Structure of losses and targets for Bootleg.Bottom right:A schematic implementation of Stacked Latent Clustering, in which high-level representations are grounded in predicted lower-level representations.

Appendix BProof of Iterative Latent Clustering recovery

This appendix proves the recovery guarantee for Iterative Latent Clustering. Throughout the proof, probabilities are taken over fresh samples from a fixed grammar instance, unless an expectation over grammars is explicitly denoted by𝔼𝒢\mathbb{E}_{\mathcal{G}}.

B.1Formal assumptions

We use the following two assumptions. The first is a typical-event assumption on the random grammar. The second is an algorithmic stability assumption on the clustering module.

Assumption 1(Balanced and separated grammar).

There exist constantscbal,Cbal,csep>0c_{\rm bal},C_{\rm bal},c_{\rm sep}>0, independent ofv,m,Lv,m,L, such that for every levelℓ=0,…,L−2\ell=0,\ldots,L-2, every grammatical tupleν∈𝒮ℓ\nu\in\mathcal{S}_{\ell}satisfies

cbalv​m≤ℙ​[T=ν]≤Cbalv​m,\frac{c_{\rm bal}}{vm}\leq\mathbb{P}[T=\nu]\leq\frac{C_{\rm bal}}{vm},and the parent context vectors satisfy

mina≠b⁡‖Φℓ,a−Φℓ,b‖2≥csep​1−fm.\min_{a\neq b}\|\Phi_{\ell,a}-\Phi_{\ell,b}\|_{2}\geq c_{\rm sep}\frac{\sqrt{1-f}}{m}.

Assumption 2(Stable clustering module).

Let{zν:ν∈𝒮ℓ}⊂ℝv\{z_{\nu}:\nu\in\mathcal{S}_{\ell}\}\subset\mathbb{R}^{v}be points with true centers{Φℓ,a:a∈𝒱ℓ+1}\{\Phi_{\ell,a}:a\in\mathcal{V}_{\ell+1}\}. Suppose that

‖zν−Φℓ,parℓ​(ν)‖2≤ε\|z_{\nu}-\Phi_{\ell,\mathrm{par}_{\ell}(\nu)}\|_{2}\leq\varepsilonfor allν∈𝒮ℓ\nu\in\mathcal{S}_{\ell}, and that

‖Φℓ,a−Φℓ,b‖2≥Δ\|\Phi_{\ell,a}-\Phi_{\ell,b}\|_{2}\geq\Deltafor alla≠ba\neq b. Ifε≤Δ/8\varepsilon\leq\Delta/8, then𝖢𝗅𝗎𝗌𝗍𝖾𝗋v​({zν}ν∈𝒮ℓ)\mathsf{Cluster}_{v}(\{z_{\nu}\}_{\nu\in\mathcal{S}_{\ell}})returns the true partition𝒮ℓ=⨆a∈𝒱ℓ+1𝒮ℓ,a\mathcal{S}_{\ell}=\bigsqcup_{a\in\mathcal{V}_{\ell+1}}\mathcal{S}_{\ell,a}, up to a permutation of labels.

Assumption1is a high-probability random-grammar event. Balancedness follows from asymptotic uniformity of single-symbol marginals in the limitv→∞v\to\inftyat fixedf∈(0,1)f\in(0,1), see Lemma C.1 of[54]. Indeed, ifν∈𝒮ℓ,a\nu\in\mathcal{S}_{\ell,a}, thenν\nuis generated with probability1/m1/mconditional on its parentaa. Therefore

ℙ​[T(ℓ)=ν]=1m​ℙ​[h(ℓ+1)=a]∼1v​m.\mathbb{P}[T^{(\ell)}=\nu]=\frac{1}{m}\mathbb{P}[h^{(\ell+1)}=a]\sim\frac{1}{vm}.The separation scale is the latent-latent correlation scale. Indeed, writing

ψℓ​(ν):=ϕℓ​(ν)−𝔼​[eY],\psi_{\ell}(\nu):=\phi_{\ell}(\nu)-\mathbb{E}[e_{Y}],the connected covariance column is

Cℓ​(:,ν)=ℙ​[T=ν]​ψℓ​(ν).C_{\ell}(:,\nu)=\mathbb{P}[T=\nu]\psi_{\ell}(\nu).Forℓ=0\ell=0,TTis a grammatical visibless-tuple andYYis a visible token in a sibling branch, with lowest common ancestor two levels above the leaves. This is precisely the token–tuple covariance computed inCagnetta and Wyart [33]. In our notation, their calculation gives, for a fixed visible valueyyand a grammatical tupleν\nu,

𝔼𝒢[C0(y,ν)2|ν∈𝒮0]≍1−fv3​m4.\mathbb{E}_{\mathcal{G}}\!\left[C_{0}(y,\nu)^{2}\,\middle|\,\nu\in\mathcal{S}_{0}\right]\asymp\frac{1-f}{v^{3}m^{4}}.Summing over thevvpossible values ofYYyields the squared column norm

𝔼𝒢[∥C0(:,ν)∥22|ν∈𝒮0]≍v⋅1−fv3​m4=1−fv2​m4.\mathbb{E}_{\mathcal{G}}\!\left[\|C_{0}(:,\nu)\|_{2}^{2}\,\middle|\,\nu\in\mathcal{S}_{0}\right]\asymp v\cdot\frac{1-f}{v^{3}m^{4}}=\frac{1-f}{v^{2}m^{4}}. The same estimate holds at every latent level by self-similarity of the RHM. Indeed, fixℓ>0\ell>0and consider the sequence of level-ℓ\ellvariables

H(ℓ):=(h1(ℓ),…,hsL−ℓ(ℓ)).H^{(\ell)}:=\big(h^{(\ell)}_{1},\ldots,h^{(\ell)}_{s^{L-\ell}}\big).Marginally,H(ℓ)H^{(\ell)}is generated by an RHM of depthL−ℓL-\ellwith the same local parameters(s,v,m)(s,v,m), using the production rules at levelsℓ,ℓ+1,…,L−1\ell,\ell+1,\ldots,L-1. In this truncated RHM, the level-ℓ\ellvariables play the role of visible tokens. Our statisticCℓ​(:,ν)C_{\ell}(:,\nu)is therefore exactly the level-0token–tuple covariance of this truncated model:TTis a terminalss-tuple,YYis a terminal child in a sibling branch, and their lowest common ancestor is again two levels above the terminal scale. Since the random rule ensemble has the same law at every level, the same covariance calculation gives

𝔼𝒢[∥Cℓ(:,ν)∥22|ν∈𝒮ℓ]≍1−fv2​m4,\mathbb{E}_{\mathcal{G}}\!\left[\|C_{\ell}(:,\nu)\|_{2}^{2}\,\middle|\,\nu\in\mathcal{S}_{\ell}\right]\asymp\frac{1-f}{v^{2}m^{4}},uniformly forℓ=0,…,L−2\ell=0,\ldots,L-2. Since

Cℓ​(:,ν)=ℙ​[T=ν]​ψℓ​(ν)C_{\ell}(:,\nu)=\mathbb{P}[T=\nu]\psi_{\ell}(\nu)and, on the balancedness event,ℙ​[T=ν]≍1/(v​m)\mathbb{P}[T=\nu]\asymp 1/(vm), this implies

𝔼𝒢[∥ψℓ(ν)∥22|ν∈𝒮ℓ]≍(1−f)/(v2​m4)1/(v2​m2)=1−fm2.\mathbb{E}_{\mathcal{G}}\!\left[\|\psi_{\ell}(\nu)\|_{2}^{2}\,\middle|\,\nu\in\mathcal{S}_{\ell}\right]\asymp\frac{(1-f)/(v^{2}m^{4})}{1/(v^{2}m^{2})}=\frac{1-f}{m^{2}}. Thus1−f/m\sqrt{1-f}/mis the natural scale of the centered context-vector signal.

To turn this into a separation statement, we need to assume that different parent context vectors have generic relative positions, as expected for independent random production rules. In particular, under a random-direction heuristic, overlaps are smaller than squared norms by a factorv−1/2v^{-1/2}, so pairwise distances have the same scale as the individual norms.

B.2Theorem and proof

Theorem 2(Recovery of the non-root hierarchy).

Assume the balancedness and separation conditions above, and assume the clustering is stable under perturbations as described. Then there exists a constantC>0C>0, depending only on the constants in those assumptions and onss, such that the following holds. If

P≥C​[v​m​log⁡L​v​mδ+v​m31−f​log⁡L​v​mδ],P\geq C\left[vm\log\frac{Lvm}{\delta}+\frac{vm^{3}}{1-f}\log\frac{Lvm}{\delta}\right],(3)then the iterative latent-clustering algorithm recoversh(1),h(2),…,h(L−1)h^{(1)},h^{(2)},\ldots,h^{(L-1)}and all production-rule classes below the root with probability at least1−δ1-\delta, up to independent permutations of the latent labels at each level.

In particular, up to logarithmic factors,PILC≍(1−f)−1​v​m3.P_{\mathrm{ILC}}\asymp(1-f)^{-1}\,vm^{3}.

Synonyms have identical context vectors.

We first record the property that makes clustering possible.

Lemma 1(Synonyms have identical context vectors).

Ifν,ν′∈𝒮ℓ\nu,\nu^{\prime}\in\mathcal{S}_{\ell}satisfyparℓ​(ν)=parℓ​(ν′)\mathrm{par}_{\ell}(\nu)=\mathrm{par}_{\ell}(\nu^{\prime}), thenϕℓ​(ν)=ϕℓ​(ν′)\phi_{\ell}(\nu)=\phi_{\ell}(\nu^{\prime}).

Proof.

LetAAbe the level-(ℓ+1)(\ell+1)parent of the branch containingTT, letBBbe the level-(ℓ+1)(\ell+1)parent of the sibling branch containingYY, and letGGbe their shared level-(ℓ+2)(\ell+2)parent. The local graphical structure isG→(A,B)G\to(A,B),A→TA\to T, andB→YB\to Y. By unambiguity, observingT=νT=\nudeterminesA=parℓ​(ν)A=\mathrm{par}_{\ell}(\nu). By the context-free Markov structure, conditioned onAA, the rest of the tree is independent of which production rule was used to generateTT. Hence, ifν∈𝒮ℓ,a\nu\in\mathcal{S}_{\ell,a},

ℙ​[Y=y∣T=ν]=ℙ​[Y=y∣A=a].\mathbb{P}[Y=y\mid T=\nu]=\mathbb{P}[Y=y\mid A=a].The right-hand side depends onν\nuonly through its parentaa. Therefore all tuples in the same synonym class have the same context vector. ∎

Concentration of empirical context vectors.

Fix a levelℓ\elland a grammatical tupleν∈𝒮ℓ\nu\in\mathcal{S}_{\ell}. LetNνN_{\nu}be the number of samples in which the chosen cousin tuple equalsν\nu, and letY1,…,YNνY_{1},\ldots,Y_{N_{\nu}}be the corresponding target values. Conditional onT=νT=\nu, these are independent draws from the conditional law ofYY, and the empirical context vector

ϕ^ℓ​(ν)=1Nν​∑q=1NνeYq\widehat{\phi}_{\ell}(\nu)=\frac{1}{N_{\nu}}\sum_{q=1}^{N_{\nu}}e_{Y_{q}}is the empirical mean ofNνN_{\nu}independent one-hot vectors with meanϕℓ​(ν)=𝔼​[eY∣T=ν]\phi_{\ell}(\nu)=\mathbb{E}[e_{Y}\mid T=\nu].

Lemma 2(Context-vector concentration).

There is a constantC0>0C_{0}>0such that, for anyη∈(0,e−1]\eta\in(0,e^{-1}],

ℙ[∥ϕ^ℓ(ν)−ϕℓ(ν)∥2>C0log⁡(1/η)Nν|Nν]≤η.\mathbb{P}\!\left[\|\widehat{\phi}_{\ell}(\nu)-\phi_{\ell}(\nu)\|_{2}>C_{0}\sqrt{\frac{\log(1/\eta)}{N_{\nu}}}\,\middle|\,N_{\nu}\right]\leq\eta.

Proof.

Condition onNν=n≥1N_{\nu}=n\geq 1. Define

Uq:=eYq−ϕℓ​(ν),U¯:=1n​∑q=1nUq.U_{q}:=e_{Y_{q}}-\phi_{\ell}(\nu),\qquad\bar{U}:=\frac{1}{n}\sum_{q=1}^{n}U_{q}.Then𝔼​[Uq]=0\mathbb{E}[U_{q}]=0and

ϕ^ℓ​(ν)−ϕℓ​(ν)=U¯.\widehat{\phi}_{\ell}(\nu)-\phi_{\ell}(\nu)=\bar{U}.Since theUqU_{q}’s are independent and mean zero, the cross terms vanish, and

𝔼​‖U¯‖22\displaystyle\mathbb{E}\|\bar{U}\|_{2}^{2}=𝔼​‖1n​∑q=1nUq‖22\displaystyle=\mathbb{E}\left\|\frac{1}{n}\sum_{q=1}^{n}U_{q}\right\|_{2}^{2}=1n2​∑q=1n𝔼​‖Uq‖22\displaystyle=\frac{1}{n^{2}}\sum_{q=1}^{n}\mathbb{E}\|U_{q}\|_{2}^{2}≤1n,\displaystyle\leq\frac{1}{n},where we used

𝔼​‖eY−ϕℓ​(ν)‖22=1−‖ϕℓ​(ν)‖22≤1.\mathbb{E}\|e_{Y}-\phi_{\ell}(\nu)\|_{2}^{2}=1-\|\phi_{\ell}(\nu)\|_{2}^{2}\leq 1.Thus,

𝔼​‖U¯‖2≤n−1/2.\mathbb{E}\|\bar{U}\|_{2}\leq n^{-1/2}. Now consider the function

F​(Y1,…,Yn):=‖U¯‖2.F(Y_{1},\ldots,Y_{n}):=\|\bar{U}\|_{2}.Changing one observation changesU¯\bar{U}by at most2/n\sqrt{2}/n, and hence changesFFby at most2/n\sqrt{2}/n. By McDiarmid’s bounded-differences inequality,

ℙ​[F−𝔼​F>t]≤exp⁡(−n​t2).\mathbb{P}\!\left[F-\mathbb{E}F>t\right]\leq\exp(-nt^{2}).Taking

t=log⁡(1/η)n,t=\sqrt{\frac{\log(1/\eta)}{n}},we get, with probability at least1−η1-\eta,

F≤1n+log⁡(1/η)n.F\leq\frac{1}{\sqrt{n}}+\sqrt{\frac{\log(1/\eta)}{n}}.Sinceη≤e−1\eta\leq e^{-1}, we havelog⁡(1/η)≥1\log(1/\eta)\geq 1, so the first term is absorbed into the second. Thus, for a constantC0C_{0},

‖ϕ^ℓ​(ν)−ϕℓ​(ν)‖2=‖U¯‖2≤C0​log⁡(1/η)n\|\widehat{\phi}_{\ell}(\nu)-\phi_{\ell}(\nu)\|_{2}=\|\bar{U}\|_{2}\leq C_{0}\sqrt{\frac{\log(1/\eta)}{n}}with probability at least1−η1-\eta. ∎

Proof of the theorem.

At each levelℓ\ell, letTℓT_{\ell}be the level-ℓ\elltuple to be clustered andZℓZ_{\ell}the level-ℓ\elltarget in the sibling branch. For samplepp, denote the corresponding true variables byTℓ(p)T_{\ell}^{(p)}andZℓ(p)Z_{\ell}^{(p)}

We now proveTheorem˜2.

Proof ofTheorem˜2.

We first define a good event using the true latent variables. For every levelℓ=0,…,L−2\ell=0,\ldots,L-2and every grammatical tupleν∈𝒮ℓ\nu\in\mathcal{S}_{\ell}, let

Nℓ,ν:=∑p=1P𝟏​{Tℓ(p)=ν}N_{\ell,\nu}:=\sum_{p=1}^{P}\mathbf{1}\{T_{\ell}^{(p)}=\nu\}be the number of samples in which the true level-ℓ\ellcousin tuple equalsν\nu, and let

ϕ~ℓ​(ν):=1Nℓ,ν​∑p=1PeZℓ(p)​𝟏​{Tℓ(p)=ν}\widetilde{\phi}_{\ell}(\nu):=\frac{1}{N_{\ell,\nu}}\sum_{p=1}^{P}e_{Z_{\ell}^{(p)}}\mathbf{1}\{T_{\ell}^{(p)}=\nu\}be the corresponding oracle empirical context vector.

By balancedness,

ℙ​[Tℓ=ν]≥cbalv​m.\mathbb{P}[T_{\ell}=\nu]\geq\frac{c_{\rm bal}}{vm}.ThusNℓ,νN_{\ell,\nu}is binomial with mean at leastcbal​P/(v​m)c_{\rm bal}P/(vm). A Chernoff bound and a union bound over all(L−1)​v​m(L-1)vmpairs(ℓ,ν)(\ell,\nu)imply that, if

P≳v​m​log⁡L​v​mδ,P\gtrsim vm\log\frac{Lvm}{\delta},then, with probability at least1−δ/21-\delta/2,

Nℓ,ν≳Pv​mN_{\ell,\nu}\gtrsim\frac{P}{vm}simultaneously for allℓ\ellandν\nu.

On this count event, Lemma2, applied withη=δ/(2​L​v​m)\eta=\delta/(2Lvm), and another union bound give

maxℓ,ν⁡‖ϕ~ℓ​(ν)−ϕℓ​(ν)‖2≲v​mP​log⁡L​v​mδ\max_{\ell,\nu}\|\widetilde{\phi}_{\ell}(\nu)-\phi_{\ell}(\nu)\|_{2}\lesssim\sqrt{\frac{vm}{P}\log\frac{Lvm}{\delta}}with probability at least1−δ/21-\delta/2. Therefore, if

P≳v​m31−f​log⁡L​v​mδ,P\gtrsim\frac{vm^{3}}{1-f}\log\frac{Lvm}{\delta},then, with probability at least1−δ1-\delta,

‖ϕ~ℓ​(ν)−ϕℓ​(ν)‖2≤18​csep​1−fm\|\widetilde{\phi}_{\ell}(\nu)-\phi_{\ell}(\nu)\|_{2}\leq\frac{1}{8}c_{\rm sep}\frac{\sqrt{1-f}}{m}for allℓ\elland allν∈𝒮ℓ\nu\in\mathcal{S}_{\ell}. Call this simultaneous eventℰ\mathcal{E}.

We now show that onℰ\mathcal{E}, the algorithm succeeds deterministically. At level0, the variables are visible tokens, soh^(0)=h(0)\widehat{h}^{(0)}=h^{(0)}. Suppose inductively that levelℓ\ellhas been recovered exactly, up to a permutation of labels. Then the recovered level-ℓ\elltuples are the true level-ℓ\elltuples, up to relabeling. Consequently, the empirical context vectors computed by the algorithm are exactly the oracle vectorsϕ~ℓ​(ν)\widetilde{\phi}_{\ell}(\nu), up to permutations of tuple labels and output coordinates. These permutations preserve distances.

Thus, onℰ\mathcal{E}, every empirical context vector lies within

18​csep​1−fm\frac{1}{8}c_{\rm sep}\frac{\sqrt{1-f}}{m}of its true parent center. By separation,

mina≠b⁡‖Φℓ,a−Φℓ,b‖2≥csep​1−fm.\min_{a\neq b}\|\Phi_{\ell,a}-\Phi_{\ell,b}\|_{2}\geq c_{\rm sep}\frac{\sqrt{1-f}}{m}.The stable clustering assumption therefore implies that the clustering step recovers the true synonym partition

𝒮ℓ=⨆a∈𝒱ℓ+1𝒮ℓ,a\mathcal{S}_{\ell}=\bigsqcup_{a\in\mathcal{V}_{\ell+1}}\mathcal{S}_{\ell,a}up to a permutation of labels. Assigning each tuple to its cluster recovers levelℓ+1\ell+1, again up to a permutation of labels.

Iterating this deterministic induction fromℓ=0\ell=0toL−2L-2recoversh(1),…,h(L−1)h^{(1)},\ldots,h^{(L-1)}and all production-rule classes below the root. Sinceℰ\mathcal{E}holds with probability at least1−δ1-\delta, the theorem follows.

The required sample size is

P≳v​m​log⁡L​v​mδ+v​m31−f​log⁡L​v​mδ.P\gtrsim vm\log\frac{Lvm}{\delta}+\frac{vm^{3}}{1-f}\log\frac{Lvm}{\delta}.Thus, up to logarithmic factors,P≳v​m3/(1−f)P\gtrsim vm^{3}/(1-f)suffices for recovery, and we write

PILC≍v​m31−fP_{\rm ILC}\asymp\frac{vm^{3}}{1-f}for this threshold. ∎

Appendix CStacked latent-clustering details and additional results

This appendix gives implementation details and additional diagnostics for the SLC network introduced inSection˜4. We use the RHM notation ofSection˜2: true latent variables are denoted byhi(ℓ)h^{(\ell)}_{i}, while learned SLC tokens are denoted byh^i(ℓ)\widehat{h}^{(\ell)}_{i}.

C.1SLC architecture details

We encode RHM input strings of sizesLs^{L}into one-hot feature vectors of dimensionℝsL×dh\mathbb{R}^{s^{L}\times d_{h}}, wheredh≫vd_{h}\gg vis the dimension into which we tokenize learned latents. This initializes the level-0tokens ash^i(0)=xi\widehat{h}^{(0)}_{i}=x_{i}. We use a single dimensiondhd_{h}for the cluster codebook and for the token vocabulary passed between SLC modules: each clusterer assigns labels in{1,…,dh}\{1,\ldots,d_{h}\}, these labels become the tokens for the next module, and predictors therefore output categorical distributions overdhd_{h}token identities.

A SLC model for an RHM of depthLLis comprised ofL−1L-1SLC modules. Moduleℓ\ellreduces the learned level-ℓ\ellsequence to an encoding of the level-(ℓ+1)(\ell+1)latents. For a module at levelℓ\ell, the tensor shapes are

h^(ℓ)∈ℝsL−ℓ×dh,xu(ℓ):=(h^(u−1)​s+1(ℓ),…,h^u​s(ℓ))∈ℝs×dh,ϕ^u(ℓ)=Pred(ℓ)​(xu(ℓ))∈ℝs×(s−1)×s×dh,vec​ϕ^u(ℓ)∈ℝs2​(s−1)​dh,qu(ℓ+1)=Clust(ℓ)​(vec​ϕ^u(ℓ))∈Δdh−1,h^(ℓ+1)∈ℝsL−ℓ−1×dh.\begin{array}[]{rcl}\widehat{h}^{(\ell)}&\in&\mathbb{R}^{s^{L-\ell}\times d_{h}},\\[2.84526pt] x_{u}^{(\ell)}:=\big(\widehat{h}^{(\ell)}_{(u-1)s+1},\ldots,\widehat{h}^{(\ell)}_{us}\big)&\in&\mathbb{R}^{s\times d_{h}},\\[2.84526pt] \widehat{\phi}^{(\ell)}_{u}=\mathrm{Pred}^{(\ell)}(x_{u}^{(\ell)})&\in&\mathbb{R}^{s\times(s-1)\times s\times d_{h}},\\[2.84526pt] \mathrm{vec}\,\widehat{\phi}^{(\ell)}_{u}&\in&\mathbb{R}^{s^{2}(s-1)d_{h}},\\[2.84526pt] q_{u}^{(\ell+1)}=\mathrm{Clust}^{(\ell)}(\mathrm{vec}\,\widehat{\phi}^{(\ell)}_{u})&\in&\Delta^{d_{h}-1},\\[2.84526pt] \widehat{h}^{(\ell+1)}&\in&\mathbb{R}^{s^{L-\ell-1}\times d_{h}}.\end{array}Hereu=1,…,sL−ℓ−1u=1,\ldots,s^{L-\ell-1}indexes level-ℓ\ellpatches, anddh2d_{h_{2}}is the hidden width of the CNN layers below. The dimensions ofϕ^u(ℓ)\widehat{\phi}^{(\ell)}_{u}enumerate, respectively, the possible position of the input patch within its grandparent, the target cousin patch, the target token within that patch, and thedhd_{h}token identity. This task is easiest fordh=vd_{h}=v, as clustering is architecturally enforced, and is hardest fordh≥m​vd_{h}\geq mv(when synonyms can be uniquely encoded without clustering). Unless otherwise noted, we will always usedh=m​vd_{h}=mv.

A SLC module is composed of a predictor and a clusterer. This arrangement is depicted inFigure˜7.

h^1(ℓ)\widehat{h}^{(\ell)}_{1}h^2(ℓ)\widehat{h}^{(\ell)}_{2}aabbh^3(ℓ)\widehat{h}^{(\ell)}_{3}h^4(ℓ)\widehat{h}^{(\ell)}_{4}ccdd⋮\vdotsPred(ℓ)\mathrm{Pred}^{(\ell)}p​(h^3(ℓ)|a,b)p(\widehat{h}^{(\ell)}_{3}\,|\,a,b)p​(h^4(ℓ)|a,b)p(\widehat{h}^{(\ell)}_{4}\,|\,a,b)ϕ^1(ℓ)\widehat{\phi}^{(\ell)}_{1}cross entropy lossClust(ℓ)\mathrm{Clust}^{(\ell)}SM\operatorname{\mathrm{SM}}h^1(ℓ+1)\widehat{h}^{(\ell+1)}_{1}contrastiveclustering lossϕ^i(ℓ)\widehat{\phi}^{(\ell)}_{i}⋯\cdotsh^i(ℓ+1)\widehat{h}^{(\ell+1)}_{i}Figure 7:Schematic of a single SLC module, with the layer-wise prediction and clustering losses.x1x_{1}x2x_{2}x3x_{3}x4x_{4}x5x_{5}x6x_{6}x7x_{7}x8x_{8}ppppppppCCCCCCCCh^1(1)\widehat{h}_{1}^{(1)}h^2(1)\widehat{h}_{2}^{(1)}h^3(1)\widehat{h}_{3}^{(1)}h^4(1)\widehat{h}_{4}^{(1)}ppppCCCCh^1(2)\widehat{h}_{1}^{(2)}h^2(2)\widehat{h}_{2}^{(2)}

Figure 8:Multiple predictions improve signal. We train the predictor and clusterer modules with weight sharing, and train cross-entropy prediction at all possible positions and cousin targets.#### Predictor.

The predictor operates on anss-tuple patch and predicts alls​(s−1)s(s-1)tokens with the same grandparent (cf.Figure˜8for a visualization). Using all cousin targets increases the available signal and improves prefactors, although it does not change the asymptotic sample-complexity scaling. Because the same patch can occupy any of thesspositions under its grandparent and we use weight sharing across positions, the predictor outputss2​(s−1)s^{2}(s-1)distributions; for each observed patch, only thes​(s−1)s(s-1)distributions matching its actual position enter the loss. The predictor is implemented in terms of convolutional layers, as

ϕ^(ℓ)​(x)=Pred(ℓ)​(x)=(SM∘CNN3∘A∘CNN2∘A∘CNN1)​(x)\widehat{\phi}^{(\ell)}(x)=\mathrm{Pred}^{(\ell)}(x)=(\operatorname{\mathrm{SM}}\circ\operatorname{\mathrm{CNN}}_{3}\circ A\circ\operatorname{\mathrm{CNN}}_{2}\circ A\circ\operatorname{\mathrm{CNN}}_{1})(x)(4)whereA=ReLU∘BNA=\operatorname{\mathrm{ReLU}}\circ\mathrm{BN}denotes the activation function,CNN1:ℝs×dh→ℝdh2\operatorname{\mathrm{CNN}}_{1}:\mathbb{R}^{s\times d_{h}}\rightarrow\mathbb{R}^{d_{h_{2}}}is a stridess1d convolutional layer, whileCNN2:ℝdh2→ℝdh2\operatorname{\mathrm{CNN}}_{2}:\mathbb{R}^{d_{h_{2}}}\rightarrow\mathbb{R}^{d_{h_{2}}}andCNN3:ℝdh2→ℝs×(s−1)×s×dh\operatorname{\mathrm{CNN}}_{3}:\mathbb{R}^{d_{h_{2}}}\rightarrow\mathbb{R}^{s\times(s-1)\times s\times d_{h}}are both stride 1. The softmax operationSM\operatorname{\mathrm{SM}}acts on the last dimension so that∑a=1dhϕ^(ℓ)​(x)r​ρ​t​a=1\sum_{a=1}^{d_{h}}\widehat{\phi}^{(\ell)}(x)_{r\rho ta}=1. This last dimension is therefore the marginal probability distribution over neighbouring tokens, conditioned on the input patch positionrr, target cousin-patch indexρ\rho, and target token positiontt.

We train the predictor with a cross-entropy prediction loss. Fix a grandparent indexg∈{1,…,sL−ℓ−2}g\in\{1,\ldots,s^{L-\ell-2}\}and an input patch positionr∈{1,…,s}r\in\{1,\ldots,s\}within that grandparent, so thatu=(g−1)​s+ru=(g-1)s+rand the input patch isxu(ℓ)x_{u}^{(\ell)}. For a target cousin patch positionr′≠rr^{\prime}\neq r, letρr​(r′)∈{1,…,s−1}\rho_{r}(r^{\prime})\in\{1,\ldots,s-1\}denote the index ofr′r^{\prime}among the patch positions excludingrr. The level-ℓ\elltoken at target positionttinside target patchr′r^{\prime}has absolute index

i​(g,r′,t):=((g−1)​s+r′−1)​s+t.i(g,r^{\prime},t):=((g-1)s+r^{\prime}-1)s+t.For this input patch, the prediction loss is

ℒpred​(g,r)=−∑r′≠r∑t=1s∑a=1dhh^i​(g,r′,t),a(ℓ)​log⁡(ϕ^(ℓ)​(xu(ℓ))r,ρr​(r′),t,a).\mathcal{L}_{\mathrm{pred}}(g,r)=-\sum_{r^{\prime}\neq r}\sum_{t=1}^{s}\sum_{a=1}^{d_{h}}\widehat{h}^{(\ell)}_{i(g,r^{\prime},t),a}\log\!\left(\widehat{\phi}^{(\ell)}(x_{u}^{(\ell)})_{r,\rho_{r}(r^{\prime}),t,a}\right).(5)The predictor loss for moduleℓ\ellis the mean of this quantity over all grandparent indicesggand input patch positionsrr.

Clusterer.

The clusterer in turn assigns soft cluster labels to each patch’s prediction vector – this amounts to assigning a learned level-(ℓ+1)(\ell+1)latent label.

q(ℓ+1)=Clust(ℓ)​(ϕ^(ℓ))=(SM∘CNN5∘A∘CNN4)​(ϕ^(ℓ))q^{(\ell+1)}=\mathrm{Clust}^{(\ell)}(\widehat{\phi}^{(\ell)})=(\operatorname{\mathrm{SM}}\circ\operatorname{\mathrm{CNN}}_{5}\circ A\circ\operatorname{\mathrm{CNN}}_{4})(\widehat{\phi}^{(\ell)})(6)whereϕ^(ℓ)\widehat{\phi}^{(\ell)}denotes the flattened output ofPred(ℓ)\mathrm{Pred}^{(\ell)},CNN4:ℝs2​(s−1)​dh→ℝdh2\operatorname{\mathrm{CNN}}_{4}:\mathbb{R}^{s^{2}(s-1)d_{h}}\rightarrow\mathbb{R}^{d_{h_{2}}}andCNN5:ℝdh2→ℝdh\operatorname{\mathrm{CNN}}_{5}:\mathbb{R}^{d_{h_{2}}}\rightarrow\mathbb{R}^{d_{h}}are stride-1 CNNs. The soft codeq(ℓ+1)q^{(\ell+1)}is the learned tokenh^(ℓ+1)\widehat{h}^{(\ell+1)}passed to the next module.A prioriwe do not specify thatdh=vd_{h}=v, so the clusterer could choose to instead assign each synonym of a patch into its own cluster. Indeed, for finite data, differing but synonymous patches will imply different contexts due to finite sampling effects. We address this by using a contrastive clustering loss that penalizes assignments of sufficiently similar predictions into different clusters.

Letqi=Clust(ℓ)​(ϕ^i(ℓ))q_{i}=\mathrm{Clust}^{(\ell)}(\widehat{\phi}^{(\ell)}_{i})denote the soft cluster assignment for prediction vectorϕ^i(ℓ)\widehat{\phi}^{(\ell)}_{i}, where the indexiiis taken to be over patch positions and batches, and we drop the(ℓ)superscript for brevity. The predictions are then centred, so thatϕ^i(ℓ)→ϕ^i(ℓ)−⟨ϕ^j(ℓ)⟩j\widehat{\phi}^{(\ell)}_{i}\rightarrow\widehat{\phi}^{(\ell)}_{i}-\langle\widehat{\phi}^{(\ell)}_{j}\rangle_{j}. The similaritySi​jS_{ij}between two predictions is the scaled cosine similarity

Si​j=12​(1+ϕ^i(ℓ)⋅ϕ^j(ℓ)‖ϕ^i(ℓ)‖2​‖ϕ^j(ℓ)‖2)S_{ij}=\frac{1}{2}\left(1+\frac{\widehat{\phi}^{(\ell)}_{i}\cdot\widehat{\phi}^{(\ell)}_{j}}{||\widehat{\phi}^{(\ell)}_{i}||_{2}||\widehat{\phi}^{(\ell)}_{j}||_{2}}\right)(7)of these centred predictions. The clustering loss is then

ℒclust=⟨ℒsim​(qi,qj,Si​j)+λsep​ℒsep​(qi,qj,Si​j)⟩i​j+λsparsity​⟨ℒsparsity​(qi)⟩i\mathcal{L}_{\mathrm{clust}}=\langle\mathcal{L}_{\mathrm{sim}}(q_{i},q_{j},S_{ij})+\lambda_{\mathrm{sep}}\mathcal{L}_{\mathrm{sep}}(q_{i},q_{j},S_{ij})\rangle_{ij}+\lambda_{\mathrm{sparsity}}\langle\mathcal{L}_{\mathrm{sparsity}}(q_{i})\rangle_{i}(8)where the⟨⋅⟩i​j\langle\cdot\rangle_{ij}indicates an average over patch positions and batches. The similarity loss component

ℒsim​(qi,qj,Si​j)=(qi−qj)⋅(qi−qj)​ReLU⁡(Si​j−Sm)\mathcal{L}_{\mathrm{sim}}(q_{i},q_{j},S_{ij})=(q_{i}-q_{j})\cdot(q_{i}-q_{j})\operatorname{\mathrm{ReLU}}(S_{ij}-S_{m})(9)serves to penalize predictions that are sufficiently similar (more than the marginSmS_{m}) but assigned to different clusters (i.e. with non-zeroΔ​q\Delta q) – this loss ensures that clusters consist of similar predictions. Alone, of course, the model could simply cluster ALL predictions into a single cluster. Therefore, we include a separation loss

ℒsep​(qi,qj,Si​j)=qi⋅qj​(1−Si​j)\mathcal{L}_{\mathrm{sep}}(q_{i},q_{j},S_{ij})=q_{i}\cdot q_{j}(1-S_{ij})(10)which penalizes dissimilar (i.e. large1−Si​j1-S_{ij}) predictions that are assigned into overlapping clusters (i.e.qi⋅qj>0q_{i}\cdot q_{j}>0). Finally, we introduce

ℒsparsity​(qi)=−qi⋅qi\mathcal{L}_{\mathrm{sparsity}}(q_{i})=-q_{i}\cdot q_{i}(11)which tends to maximize theL2L_{2}norm of the cluster assignment and therefore the sparsity (since‖qi‖1=1||q_{i}||_{1}=1due to the softmax). Preliminary experiments demonstrated that this sparsity reward improves training stability and downstream interpretability.

A general caveat of employing contrastive losses is that memory costs can explode. Employed naïvely, the first SLC module would computes2​(L−1)s^{2(L-1)}comparisons, each of which involves a potentially costly dot-product of as2​(s−1)​dhs^{2}(s-1)d_{h}vector. In practice for each batch, we select a random subset ofNcompareN_{\mathrm{compare}}patch positions and batches on which to compute the clustering loss for memory constraints.

Training protocol and stabilizers.

To keep the prediction problem at depthℓ\ellcomparable to the prediction problem at depth0, we tokenize clusterer outputs by means of a softmax operation,SM⁡(X,T)=exp⁡(X/T)/∑iexp⁡(Xi/T)\operatorname{\mathrm{SM}}(X,T)=\exp(X/T)/\sum_{i}\exp(X_{i}/T). ForT=0T=0, this is hard label assignment. Preliminary experiments found thatT<1T<1can improve convergence, but here we report results forT=1T=1.

The clustering codebook dimensiondhd_{h}is another important stabilizer. If the space of possible labels is large, each input can be assigned a unique label; this is a failure to cluster, because synonyms of a latent are not assigned the same symbol in the next layer. The input vocabulary for layerℓ+1\ell+1then increases by a factormm, making the prediction task for layerℓ+1\ell+1intractable. This problem is exacerbated by noisy or finite sampling, where synonymous inputs do not produce identical predictions. This can be resolved by an architectural bottleneck, i.e. settingdh=vd_{h}=v, as is done for the ILC. For most natural data, this dimension is unknowna priori; to simulate these conditions we set the label dimension todh>m​vd_{h}>mv(the hardest case) and instead use the contrastive loss to set the scale of attraction between similar predictions.

To mitigate representation collapse, we use a teacher-student framework, in which latent targets are obtained from a teacher network whose weights are simply the exponentially lagged weights of the student. After each step of gradient descent, the teacher weightsW(T)W^{(T)}are updated towards the student weightsW(S)W^{(S)}withW(T)←(1−αema)​W(T)+αema​W(S)W^{(T)}\leftarrow(1-\alpha_{\mathrm{ema}})W^{(T)}+\alpha_{\mathrm{ema}}W^{(S)}, where1/αema1/\alpha_{\mathrm{ema}}sets the exponential moving average timescale. The prediction loss therefore uses teacher representations to obtain targetsx:x_{:}, which are evaluated against student predictions. The clustering loss meanwhile, computes the similaritySi​jS_{ij}using the teacher’s predictions and subsequently uses the student’s clustering assignmentsqiq_{i},qjq_{j}. As an alternative to the teacher-student framework, we highlight inSection˜C.2that collapse can also be overcome by introducing stop-gradients between SLC modules, effectively making learning entirely local, reminiscent to predictive coding. Finally, we perform Jacobian descent, computing the gradient for each loss independently, and aggregate the gradients by means of the UPGrad algorithm. This selects an update in the dual-cone of the gradients to avoid conflicting updates[64], which we found in preliminary experiments reduced the need for careful hyperparameter selection.

Unless otherwise specified, SLC models are trained using AdamW and Jacobian descent (via TorchJD and UPGrad[64]) and following hyperparameters:αema=0.015\alpha_{\mathrm{ema}}=0.015,lr=3×10−3\mathrm{lr}=3\times 10^{-3},wd=10−3\mathrm{wd}=10^{-3},dh2=150d_{h_{2}}=150, batch size 32, clustering dimensiondh=128d_{h}=128,Sm=0.8S_{m}=0.8,λsep=0.5\lambda_{\mathrm{sep}}=0.5,Ncompare=300N_{\mathrm{compare}}=300, andλspars=10−2\lambda_{\mathrm{spars}}=10^{-2}. These were optimized by hyperparameter sweeps conducted atL=5L=5forv=16v=16,m=8m=8,s=3s=3using Optuna’s Tree-Parzen-Estimator sampler and Hyperband pruner[65].

We use a validation set of 2000 samples to evaluate the model throughout training. When training classifiers on the final SLC representation, we use early stopping to select the best pretraining checkpoint and classification checkpoint based on this validation set performance, but report test set performance on a held-out test set of 2000 samples. InFigure˜12the test sets are smaller, with 1000 samples.

C.2Training with local rules

InFigure˜9, we ablate the global training machinery while also testing increasingly biologically plausible training conditions. We first prevent gradient flow between SLC layers (denoted “Layer SG” for layerwise stop-gradient), then disable the exponential moving average (EMA) teacher network, and finally disable gradient flow between the prediction and clustering submodules. These changes remove the long-range backpropagating error signal and the biologically implausible time-delayed duplicate teacher network. When disabling EMA but allowing gradients to flow from the clustering loss through to the predictor, we observe representation collapse. We credit this to the clustering loss overpowering the prediction loss: trivial predictions are easy to cluster perfectly at the expense of prediction accuracy. Preventing gradient flow between prediction and clustering networks prevents this failure mode, showing that SLC can still learn when updates are constrained to local signals.

Refer to captionFigure 9:Local learning suffices to learn the RHM.We compare baseline training (blue) to increasingly local learning rules. “Layer SG” in legend indicates stopgradients between SLC layers, “P→CP\rightarrow CSG” indicates stopgradient between the predictor and clusterer submodules. EMA in legend indicates exponential-moving-average teacher is used as a target. RHM parameters are:L=4,v=10,m=10,s=3L=4,v=10,m=10,s=3. Solid lines are average over three independent realizations.

C.3Training dynamics

InFigure˜10we report the layer-wise accuracy and prediction loss of the different SLC modules. We evaluate the accuracy of moduleℓ\ellby taking the learned hard labelsh^i(ℓ+1)\widehat{h}^{(\ell+1)}_{i}(with theT=0T=0hard label assignment) and comparing them with the ground-truth RHM latentshi(ℓ+1)h^{(\ell+1)}_{i}. We identify an optimal mapping between the assigned labellings and the true latents with the Kuhn-Munkres “Hungarian” algorithm, and report the accuracy attained with such a mapping.

Refer to captionFigure 10:Training proceeds layer-wise.As each layer’s clustering improves, the subsequent layer’s prediction loss decreases, enabling that subsequent layer to then cluster.

C.4Independence of sample complexity with L

InFigure˜11we check that the sample complexity scaling for ILC and SLC better supportsm3m^{3}rather thanmL+1m^{L+1}.

Refer to captionFigure 11:Attempted curve collapse with token based SSL scaling.Data are as inFigure˜3, but with themL+1m^{L+1}scaling that holds for token level SSL. The collapse of accuracy vs. training sample is significantly degraded.Furthermore, inFigure˜12we validate the sample complexityP∼m3P\sim m^{3}for systems with varying depth, up toL=7L=7.

Refer to captionFigure 12:The SLC model exhibits a depth-independent sample complexity.Accuracy of linear probes trained to classify the root latenth1(L)h^{(L)}_{1}from the final SLC representation. RHM parameters ares=3s=3,m=8m=8, andv=16v=16, with depthLLindicated by the legend. The pre-trained models were trained for 30 epochs, and the classifiers for the same duration and on the same dataset. Solid lines are an average over three independent RHM instantiations.

Appendix DFurther results on data2vec

Figure˜13–leftshows the synonym clustering score for encoder layers to latents at different depths. All layers remain invariant toℓ=L=4\ell=L=4because there is not signal for which the network can learn the root latent. For latents at levelℓ<L\ell<L, we find clustering throughout the middle layers of the network. This indicates that clustering occurs throughout the early layers of data2vec. Furthermore, the fact that for latentℓ\ellencoder layerℓ−1\ell-1is the first to exhibit clustering of the synonyms indicates that latents are constructed layer-by-layer. Deeper layers exhibit stronger clustering of low-level latents, up to layer 2. This inversion is reminiscent of an encoder-decoder architecture, and mimics what is observed in U-Net architectures[38].

Figure˜13–righttests the presence of linear traces of latents in the teacher target for differentmmandℓ\ell. This data collapses withv​m3vm^{3}, confirming that the transition to all latents being linearly encoded in the teacher occurs at the predictedv​m3vm^{3}scaling.

InFigure˜14we test the scaling collapse for data2vec instances trained in the offline setting, thereby decoupling the number of gradient descent steps from the number of original observed samples. We findv​m3vm^{3}provides a good collapse.

Refer to captionFigure 13:Left:post-training synonym clustering score per encoder layer and RHM levelℓ\ell(m=6m=6). Values above0indicate that synonymous level-(ℓ−1)(\ell-1)tuples are clustered closer than non-synonymous ones.Right:linear-probe accuracy for the latenth(ℓ)h^{(\ell)}from the teacher target, vs. rescaled sample sizeP/(v​m3)P/(vm^{3}), forℓ=2,3,4\ell=2,3,4. Curves collapse withv​m3vm^{3}for everyℓ\ell.Refer to captionFigure 14:Offline data2vec pretraining.Linear and MLP probe test accuracy on the classification task as a function of the pretraining training-set sizePP. Each point is averaged over three seeds. Main axes: curves rescaled byv​m3vm^{3}. Insets: same data, rawPP.Table 1:Data2vec hyperparameters.GroupHyperparameterValueEncoder architecturehidden dimdmodeld_{\text{model}}20482048attention headsnhn_{h}3232head dim6464encoder layersnln_{l}88FFN dimdffd_{\text{ff}}8192=4​dmodel8192=4\,d_{\text{model}}non-linearityGELUdropout0.10.1positional embeddingslearned,T×dmodelT\times d_{\text{model}}token embeddingslearned,Vtok=v+2V_{\text{tok}}\!=\!v\!+\!2idsLayerNorm placementpre-LNdata2vec objectivemask probability0.150.15mask span length11top-KKteacher layers averagedK=4K=4per-layer LayerNorm of targetsyesglobal LayerNorm of targetsnoregression head depth11linear layerlosssmooth L1,β=4\beta=4teacher EMA decayμ\mu0.990.99(constant; no annealing)OptimizationoptimizerAdamW(β1,β2)(\beta_{1},\beta_{2})(0.9,0.98)(0.9,0.98)weight decay0.010.01(excl. bias / LayerNorm)learning rateη=10−4\eta=10^{-4}(constant)LR schedulenone (no warmup, no decay)gradient clippingglobal L2 norm≤1.0\leq 1.0batch sizeBB512512training stepsTstepsT_{\text{steps}}262 144262\,144Probespoolingmean over positionslinear probeLinear(dmodel,v)(d_{\text{model}},v)MLP probedmodel→2​dmodel→vd_{\text{model}}\!\to\!2d_{\text{model}}\!\to\!v, GELU, dropout0.10.1probe optimizerAdam, lr10−310^{-3}probe steps per eval20002000(each: linear, MLP)

Appendix ECompute budget

SLC experiments were conducted on individual H100 nodes. The SLC experiments inFigure˜3took approximately 10 H100 hours. TheLLscaling experiment reported inFigure˜12cost approximately 100 H100 hours, as the amount of computations scales linearly with input data size which consists ofsLs^{L}tokens (and therefore exponentially in depth).Figure˜9cost approximately 3 H100 hours. We do not have an estimate for the Optuna hyperparameter sweeps or preliminary experiments.

The data2vec experiments cost approximately 1,000 H100 hours.

Similar Articles

Representation Without Reward: A JEPA Audit for LLM Fine-Tuning

arXiv cs.LG

This paper audits Joint-embedding predictive architectures (JEPA) for LLM fine-tuning on a natural-language-to-regex task, testing twenty-two auxiliary objectives. The results show that hidden-state representation improvements are only weakly coupled to decoded-task accuracy, with no auxiliary surviving family-wise correction.