@xbresson: How do we design materials with AI? Excited to introduce Crys-JEPA, a new generative technique in collaboration w/ @liu…


Summary

Crys-JEPA introduces a joint embedding predictive architecture for crystals that learns an energy-aware latent space, achieving significant improvements in stability and novelty for de novo crystal discovery.

How do we design materials with AI? Excited to introduce Crys-JEPA, a new generative technique in collaboration w/ @liun_online, Kostya Novoselov, @ylecun & team. We achieved 47.9% VSUN on MP20 by building a high-quality energy-aware latent space with JEPA https://t.co/mWrjazmiGo https://t.co/UdltCOr9ix


Cached at: 05/22/26, 09:45 AM

Crys-JEPA: Accelerating Crystal Discovery via Embedding Screening and Generative Refinement

Source: https://arxiv.org/html/2605.14759

Nian Liu¹, Nikita Kazeev¹, Stephen Gregory Dale¹, Artem Maevskiy¹, Yuwei Zeng¹, Ryoji Kubo¹, Pengru Huang¹, Thomas Laurent², Yann LeCun³˒⁴, Kostya S. Novoselov¹, Xavier Bresson¹

{nianliu, yuweizeng, ryojikubo}@u.nus.edu, [email protected], [email protected], {kna, sdale, maevskiy, pengru, kostya, xaviercs}@nus.edu.sg

¹ National University of Singapore, ² Loyola Marymount University, ³ New York University, ⁴ AMI

Abstract

De novo crystal generation seeks to discover materials that are not merely realistic, but also stable and novel. However, most existing generative models are trained to maximize the likelihood of observed crystals, which encourages samples to stay close to known materials but does not necessarily align them with the criteria that matter in discovery. Through an empirical investigation, we show that current crystal generative models are caught in a pronounced stability–novelty trade-off: moving toward the observed distribution preserves stability but limits novelty, whereas moving away from it quickly destroys stability. This suggests that the useful region for discovering crystals that are both stable and novel is extremely narrow. To escape the trade-off, we introduce Crys-JEPA, a joint embedding predictive architecture for crystals that learns an energy-aware latent space preserving formation-energy differences. In this space, stability assessment can be reformulated as an embedding-based comparison against accessible training crystals, reducing the reliance on expensive energy evaluation and task-specific external references. Building on Crys-JEPA, we further develop a screening-and-refinement pipeline that identifies promising generated crystals and reintroduces them to refine the generative model. On the MP-20 and Alex-MP-20 datasets, we achieve improvements over baselines of up to 81.4% and 82.6% on the V.S.U.N. metric, respectively.

1 Introduction

Discovering new materials is a key driver of progress across a wide range of applications, including solar cells, batteries, and catalysis [7]. Among material classes, crystals are of particular importance because their periodic atomic arrangements give rise to diverse and tunable physical properties, making them a central target in computational materials design. This motivates the task of de novo generation (DNG) [48], which aims to discover entirely new crystal structures without relying on predefined templates. In recent years, DNG has been substantially advanced by deep generative models, particularly diffusion [18] and flow matching [29].

Most existing DNG models [48,53,23] are trained under the conventional objective of maximizing the log-likelihood of the observed data. However, it remains unclear to what extent this objective improves the criteria that matter most in practice, namely Validity (V), Stability (S), Uniqueness (U), and Novelty (N), which we collectively denote as V.S.U.N. In Section 3, we begin with an empirical study showing that current crystal generative models generally exhibit a pronounced trade-off between stability and novelty. To better understand this phenomenon, we analyze how these metrics vary with respect to the density landscape of the observed distribution. Our results suggest that crystal generation imposes an exceptionally strict precision requirement: moving closer to high-density regions is often insufficient to improve novelty, whereas drifting toward low-density regions can already destroy stability. In other words, there is little effective intermediate region in which a crystal can be both stable and novel. Consequently, although log-likelihood maximization encourages generated samples to stay near the observed distribution, these nearby regions do not necessarily satisfy both stability and novelty.

Notably, standard training datasets consist primarily of already known stable crystals. Therefore, without changing the underlying generative backbone, one natural way to alleviate the stability–novelty trade-off is to reintroduce crystals that satisfy V.S.U.N. into training, especially those that are both stable and novel, so that the model can fit a more desirable distribution. When external data cannot be used due to fairness considerations, an alternative is to identify such promising crystals directly from model generations. This, however, requires reliable V.S.U.N. assessment for generated samples. While validity, uniqueness, and novelty can be evaluated relatively cheaply, stability is substantially more difficult to assess. Stability is defined through a formation-energy comparison against a reference set, which introduces two fundamental challenges. First, reference ambiguity: the choice of reference depends on the task, and even public references such as the Materials Project [20] evolve over time, making it impossible to know the final reference set in advance during training. Second, computational cost: standard stability evaluation typically relies on density functional theory (DFT), which is prohibitively expensive at scale. As a result, performing DFT for all generated crystals is impractical, especially because some generated samples are of low quality.

In this work, we address these two challenges jointly. To mitigate reference ambiguity, we assess stability relative to the training set. The key intuition is that training crystals are already stable under commonly used reference sets. Therefore, if a generated crystal has formation energy comparable to that of training crystals under the same chemical system, it is also likely to be stable. This substantially reduces the dependence on an explicitly specified external reference. To avoid costly energy calculations, we further introduce an energy-aware surrogate model, Crys-JEPA. Specifically, we pre-train a joint-embedding predictive architecture (JEPA) [27] with an InfoNCE objective [36] to build a crystal latent space structured by formation energy, where crystals with similar formation energies are mapped nearby and energetically dissimilar crystals are well separated. We then use Crys-JEPA embedding-based comparison as a proxy for stability assessment, enabling efficient screening of generated crystals. Building on this surrogate, we construct a simple refinement loop: we pre-train a base generative model, generate candidate crystals, select promising ones using Crys-JEPA, and fine-tune the base model on the selected samples. Experiments show that this screening-and-refinement pipeline substantially improves generation quality.

Our contributions are summarized as follows:

• We identify a stability–novelty trade-off in crystal de novo generation, and show that it stems from the extreme precision required for crystals to remain both stable and novel.
• We develop Crys-JEPA, an energy-aware latent surrogate for DFT-based stability evaluation, and show that it can drive a simple refinement loop that improves the generation quality.
• Our approach consistently outperforms strong baselines on both MP-20 and Alex-MP-20, improving the DFT-evaluated V.S.U.N. metric by up to 81.4% and 82.6%, respectively.

2 Preliminaries

Crystal representation.

A crystal $\mathbf{C}$ is defined by the periodic arrangement of its fundamental repeating unit, the unit cell, across three-dimensional space. A unit cell is usually described by three components: the atomic fractional coordinates $\bm{X}\in[0,1)^{N\times 3}$, the atom types $\bm{A}\in\mathbb{R}^{N}$, and the lattice matrix $\bm{L}\in\mathbb{R}^{3\times 3}$, where $N$ is the number of atoms in the cell. For $\bm{A}$, we consider the first 100 chemical elements and encode each atom type with a one-hot vector, yielding $\text{one-hot}(\bm{A})\in\mathbb{R}^{N\times 100}$. To represent the lattice, we adopt the reparameterization used in prior work [53]. Specifically, we factorize $\bm{L}$ through singular value decomposition and rewrite it in terms of a rotation matrix and a symmetric matrix:

$$\bm{L}=\bm{U}\tilde{\bm{L}},\quad\bm{U}=\bm{W}\bm{V}^{\top},\quad\tilde{\bm{L}}=\bm{V}\bm{\Sigma}\bm{V}^{\top},\tag{1}$$

where $\bm{W}$ and $\bm{V}$ denote the left and right singular matrices of $\bm{L}$, and $\bm{\Sigma}$ contains the singular values on its diagonal. Under this formulation, $\bm{U}$ corresponds to a rotation matrix, while $\tilde{\bm{L}}$ is symmetric positive definite. We then take the upper-triangular part of $\tilde{\bm{L}}$ and flatten it into a 6-dimensional vector, written as $\widehat{\bm{L}}=\mathrm{vec}(\mathrm{triu}(\tilde{\bm{L}}))\in\mathbb{R}^{6}$. Based on these components, the $i$-th atom in crystal $\mathbf{C}$ is represented by an atom vector:

$$\bm{v}_{i}=[\bm{X}_{i}\,\|\,\text{one-hot}(A_{i})\,\|\,\widehat{\bm{L}}]\in\mathbb{R}^{3+100+6},\tag{2}$$

which concatenates its coordinate, element type, and lattice representation. The full crystal representation is then given by stacking all atom vectors:

$$\bm{V}=[\bm{v}_{1},\dots,\bm{v}_{N}]^{\top}\in\mathbb{R}^{N\times(3+100+6)}.\tag{3}$$
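As an illustration of Eqs. (1)–(3), the following minimal NumPy sketch builds the 6-dimensional lattice vector and the stacked atom-vector matrix for a toy cell. The function names, the use of atomic numbers 1–100 to index the one-hot encoding, the row-major ordering of the upper triangle, and the lattice values are all assumptions made for this example; they are not taken from the authors' code.

```python
import numpy as np

def lattice_to_vec6(L):
    """Factor L = U @ L_tilde via SVD (Eq. 1) and flatten triu(L_tilde)."""
    W, s, Vt = np.linalg.svd(L)          # L = W diag(s) V^T
    U = W @ Vt                           # orthogonal factor
    L_tilde = Vt.T @ np.diag(s) @ Vt     # symmetric factor V Sigma V^T
    iu = np.triu_indices(3)              # upper-triangular indices (assumed order)
    return U, L_tilde, L_tilde[iu]

def crystal_matrix(X, A, L, num_elements=100):
    """Stack atom vectors [X_i || one-hot(A_i) || L_hat] (Eqs. 2-3)."""
    X = np.asarray(X, dtype=float)
    A = np.asarray(A, dtype=int)         # atomic numbers 1..100 (assumption)
    one_hot = np.zeros((len(A), num_elements))
    one_hot[np.arange(len(A)), A - 1] = 1.0
    _, _, L_hat = lattice_to_vec6(np.asarray(L, dtype=float))
    # Every atom of the crystal carries the same lattice vector.
    return np.hstack([X, one_hot, np.tile(L_hat, (len(A), 1))])

# Toy two-atom cell with a slightly sheared lattice (made-up numbers).
L = np.array([[4.0, 0.0, 0.0], [0.5, 4.2, 0.0], [0.0, 0.3, 5.0]])
X = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
A = [11, 17]                             # Na, Cl
U, L_tilde, L_hat = lattice_to_vec6(L)
V = crystal_matrix(X, A, L)              # shape (2, 109)
```

Because the rotation factor is discarded, two cells that differ only by a rigid rotation share the same 6-dimensional lattice vector under this construction.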

Thermodynamic stability.

For a crystal $\mathbf{C}$ with $N$ atoms in its unit cell, the total energy is denoted as $E_{t}$, and the total energy per atom is defined as $E_{t/\mathrm{atom}}=E_{t}/N$. Suppose $\mathbf{C}$ contains $k$ distinct atomic species $\{T_{1},\dots,T_{k}\}$. Its chemical system is denoted by $T_{1}$–$T_{2}$–$\cdots$–$T_{k}$, and its composition is represented by $\bm{f}=(f_{1},\dots,f_{k})$, where $f_{i}\geq 0$ and $\sum_{i}f_{i}=1$. We associate $\mathbf{C}$ with an entry $\mathcal{P}=(\bm{f},E_{t/\mathrm{atom}})$ in composition–energy space.

The primary requirement of DNG is to generate crystals that are stable from a thermodynamic perspective, which is originally defined based on formation energy. Let $\mu_{i}^{\mathrm{ref}}$ denote the elemental reference energy per atom of species $T_{i}$, typically taken from its stable elemental phase under the same computational setting. The formation energy per atom of $\mathbf{C}$ is defined as

$$E_{f/\mathrm{atom}}(\mathbf{C})=E_{t/\mathrm{atom}}-\sum_{i=1}^{k}f_{i}\mu_{i}^{\mathrm{ref}}.\tag{4}$$

Given a reference dataset $\mathcal{R}$, we construct a phase diagram using all entries in $\mathcal{R}$ that belong to the same chemical system as $\mathbf{C}$ or any of its subsystems. For each reference entry $\mathcal{P}^{j}=(\bm{f}^{j},E_{t/\mathrm{atom}}^{j})$, its formation energy is $E_{f/\mathrm{atom}}^{j}=E_{t/\mathrm{atom}}^{j}-\sum_{i=1}^{k}f_{i}^{j}\mu_{i}^{\mathrm{ref}}$. The convex hull is then defined as the lower convex envelope in composition–formation-energy space. Accordingly, the hull formation energy at composition $\bm{f}$ is

$$E_{f/\mathrm{atom}}^{\mathrm{hull}}(\bm{f})=\min_{\{\lambda_{j}\}}\sum_{j}\lambda_{j}E_{f/\mathrm{atom}}^{j}\quad\text{s.t.}\quad\sum_{j}\lambda_{j}\bm{f}^{j}=\bm{f},\;\;\sum_{j}\lambda_{j}=1,\;\;\lambda_{j}\geq 0.\tag{5}$$

The energy above hull is defined as

$$\Delta E=E_{f/\mathrm{atom}}(\mathbf{C})-E_{f/\mathrm{atom}}^{\mathrm{hull}}(\bm{f}).\tag{6}$$

In Appendix A, we derive that $\Delta E$ can also be represented via total energy per atom,

$$\Delta E=E_{t/\mathrm{atom}}-E_{t/\mathrm{atom}}^{\mathrm{hull}}(\bm{f}),\tag{7}$$

where $E_{t/\mathrm{atom}}^{\mathrm{hull}}(\bm{f})=\sum_{j}\lambda_{j}E_{t/\mathrm{atom}}^{j}$. In this work, we regard a crystal as thermodynamically stable if $\Delta E<\epsilon$, where $\epsilon=0.1$ eV/atom following prior studies [53].
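To make Eqs. (5) and (6) concrete, the following minimal sketch evaluates the energy above hull for a binary system in plain Python. The function names and the reference entries are illustrative assumptions, not values from any dataset. The pairwise scan relies on the fact that, with a single composition variable, the linear program in Eq. (5) has an optimal solution that mixes at most two reference entries; a general multi-component system would need a proper convex-hull or LP solver instead.

```python
def hull_energy_binary(x, refs):
    """Lower convex envelope at composition x for a binary system A-B.

    refs: list of (x_j, e_j), where x_j is the fraction of B and e_j the
    formation energy per atom of reference entry j.
    """
    best = float("inf")
    for xj, ej in refs:
        for xk, ek in refs:
            if xj <= x <= xk:
                if xk == xj:
                    e = min(ej, ek)
                else:
                    lam = (xk - x) / (xk - xj)      # weight on entry j
                    e = lam * ej + (1.0 - lam) * ek
                best = min(best, e)
    return best

def energy_above_hull(x, e_form, refs):
    """Delta E of Eq. (6): formation energy minus hull energy."""
    return e_form - hull_energy_binary(x, refs)

# Made-up references: elements A and B (zero formation energy) and compound AB.
refs = [(0.0, 0.0), (1.0, 0.0), (0.5, -1.0)]

# Candidate A2B sits at x = 1/3, where the hull interpolates A and AB.
delta_e = energy_above_hull(1 / 3, -0.60, refs)
is_stable = delta_e < 0.1                            # threshold from the text
```

In this toy case the hull energy at x = 1/3 is −2/3 eV/atom, so a candidate at −0.60 eV/atom lies roughly 0.067 eV/atom above the hull and passes the 0.1 eV/atom criterion.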

3 Stability and Novelty Trade-off: An Experimental Investigation

In this section, we investigate how log-likelihood maximization relates to the evaluation criteria V.S.U.N, with a particular focus on stability and novelty.

Discovering trade-off in crystal generation.

We reproduce multiple baselines trained on the MP-20 dataset [48], including CDVAE [48], DiffCSP [21], DiffCSP++ [22], FlowMM [33], FlowLLM [42], SymmCD [28], ADiT [23], CrysLLMGen (7B) [24], SGEquiDiff [8], and MatterGen [53]. Detailed descriptions of these models can be found in Appendix E.

For evaluation, we repeat sampling 10 times and collect 1,000 crystals in each run. We then report the mean and standard deviation of stability (S) and novelty (N), and of two combined metrics, S.U.N. and V.S.U.N. In this case study, however, our main focus is on stability and novelty. During evaluation, structure relaxation and energy prediction are performed using MatterSim-v1-1M [52]. Following prior work [33], we use MP-2023 as the reference dataset and regard crystals with energy above hull below 0.1 eV/atom as stable [53].

Figure 1: (a) The trade-off between stability and novelty exhibited by current crystal generative models. (b) The trends of stability, novelty, and S.U.N. as generated crystals move from regions closer to the observed distribution toward regions farther away, measured using the proxy distance defined in this section.

The results are summarized in Table 1. We further visualize the relationship between stability and novelty in Fig. 1(a). CDVAE [48], for example, produces highly novel crystals, but their stability is relatively limited. In contrast, ADiT [23] appears to stay closer to the training distribution, producing more stable but less exploratory outputs. The remaining models lie between these two extremes, showing an overall negative correlation between stability and novelty. This trend suggests that current crystal generative models struggle to balance these two objectives.

Why does the trade-off arise?

Training crystals represent high-density regions within the empirical data distribution $p(x)$. Since maximizing the log-likelihood, $\log p(x)$, compels a model to prioritize these high-density regions [5], we hypothesize that the stability–novelty trade-off can be understood by examining how metrics evolve as generated samples deviate from the training distribution.

To test this, we collect 100,000 crystals generated by the aforementioned models, denoted as $\{\mathbf{C}_{\mathrm{gen}}\}$. Because these models were optimized to maximize $\log p(x)$, the samples in $\{\mathbf{C}_{\mathrm{gen}}\}$ are approximately drawn from the same underlying distribution as the ground-truth training set $\{\mathbf{C}_{\mathrm{gt}}\}$. We then rank $\{\mathbf{C}_{\mathrm{gen}}\}$ by their proximity to the high-density regions defined by $\{\mathbf{C}_{\mathrm{gt}}\}$.

Following the Precision metric [48], we define a fingerprint-based distance $\mathcal{D}$ between $\{\mathbf{C}_{\mathrm{gen}}\}$ and $\{\mathbf{C}_{\mathrm{gt}}\}$ (see Appendix D.4 for details on the fingerprint descriptors and the calculation of $\mathcal{D}$). As a non-learned metric, $\mathcal{D}$ provides a consistent measurement across different models and is less susceptible to the irregular behavior often seen in models processing out-of-distribution inputs. We order $\{\mathbf{C}_{\mathrm{gen}}\}$ according to $\mathcal{D}$ to observe how stability, novelty, and S.U.N. evolve across the distribution.

The resulting trends, shown in Fig. 1(b), illustrate the cumulative values of each metric across percentiles of $\mathcal{D}$. As we move from regions near the observed distribution toward more distant regions, stability decreases consistently while novelty increases. Notably, we find no effective range where novelty improves substantially without a corresponding sacrifice in stability. Consequently, their intersection (S.U.N.) remains relatively stagnant. This pattern underscores a fundamental challenge in crystal generation: even minor deviations from the observed training distribution significantly degrade thermodynamic stability.
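The cumulative curves described above can be sketched in a few lines of plain Python. This is only an illustration of the bookkeeping: the function name and the toy inputs are hypothetical, the per-sample stability and novelty flags are assumed to be precomputed, and the uniqueness component of S.U.N. is omitted for brevity.

```python
def cumulative_rates(distances, stable, novel):
    """Sort samples by proxy distance and return running fractions of
    stable, novel, and stable-and-novel samples."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    curve, n_s, n_n, n_sn = [], 0, 0, 0
    for rank, i in enumerate(order, start=1):
        n_s += int(stable[i])
        n_n += int(novel[i])
        n_sn += int(stable[i] and novel[i])
        curve.append((n_s / rank, n_n / rank, n_sn / rank))
    return curve

# Toy data: the two samples closest to the training set are stable but
# not novel; the two farthest are novel but not stable.
distances = [0.3, 0.1, 0.2, 0.4]
stable = [False, True, True, False]
novel = [True, False, False, True]
curve = cumulative_rates(distances, stable, novel)
```

In this toy input the joint fraction stays at zero along the whole curve, which mirrors, in exaggerated form, the stagnant S.U.N. trend reported for Fig. 1(b).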

Figure 2: Overview of (a) the JEPA architecture and (b) the proposed Crys-JEPA.

4 Mitigating the Stability–Novelty Trade-off

As shown in Fig. 1(b), novelty starts from a low initial value because crystals in MP-20 are already known materials, and their local neighborhoods are therefore not novel. Without changing the likelihood-based nature of the underlying generator or the intrinsic fragility of crystals, our goal is to shift the novelty curve upward so that there exists a wider regime in which generated crystals can be both stable and novel. To this end, we screen for promising V.S.U.N. crystals among model generations and reintroduce them to refine the generator. In this section, we present Crys-JEPA, which serves as a practical stability surrogate during screening, together with the resulting screening-and-refinement pipeline.

4.1 Crys-JEPA: Energy-aware Crystal Latent Space via JEPA

As discussed in Section 2, stability evaluation fundamentally relies on comparing the formation energies of generated crystals against those of reference crystals. In this work, we construct a unified latent space for crystals guided by formation energy per atom, $E_{f/\mathrm{atom}}$, using the JEPA framework [27]. The goal is to learn an embedding space in which crystals with similar formation energies are close, while crystals with larger energy differences are farther apart. The overall frameworks of JEPA and Crys-JEPA are illustrated in Fig. 2.

Context construction.

Within JEPA, a context is a compatible transformation of the target input that preserves its underlying semantics. While some domains provide natural context–target pairs, such as question–answer [9] or text–code [19], here we construct such pairs through data augmentation [2]. Because the latent space is intended to be energy-aware, we restrict the augmentations to transformations that preserve formation energy. Specifically, we apply translation and rotation to the target crystal $\mathbf{C}_{t}$, rather than the masking-based augmentations commonly used in vision JEPA models [2,3,4].

Translation acts on fractional coordinates as $\mathcal{T}(\bm{X}+\bm{t})=(\bm{X}+\bm{t})-\lfloor\bm{X}+\bm{t}\rfloor$, $\bm{t}\in[0,1)^{1\times 3}$, while rotation acts on the lattice matrix as $\mathcal{R}(\tilde{\bm{L}})=\tilde{\bm{L}}\bm{U}$, where $\bm{U}\in\mathrm{SO}(3)$ (the special orthogonal group) is sampled from the Haar-uniform distribution, parameterized by $\bm{r}=(r_{1},r_{2},r_{3})\sim\mathcal{U}([0,1]^{3})$ (the uniform distribution) [40]. The resulting context crystal $\mathbf{C}_{c}$ is obtained via $\mathbf{C}_{c}=\mathcal{R}\circ\mathcal{T}(\mathbf{C}_{t})$.
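The two augmentations can be sketched in NumPy as follows. The translation is a direct transcription of the wrap-around formula; for the rotation, the sketch uses one standard way of turning three uniform numbers into a Haar-uniform rotation, namely a uniformly distributed unit quaternion (Shoemake's construction). Whether this matches the exact parameterization cited in the text is an assumption, as are the function names and the toy inputs.

```python
import numpy as np

def translate(X, t):
    """Shift fractional coordinates by t and wrap back into [0, 1)."""
    Y = np.asarray(X, dtype=float) + np.asarray(t, dtype=float)
    return Y - np.floor(Y)

def haar_rotation(r):
    """Map r = (r1, r2, r3) in [0, 1]^3 to a rotation matrix through a
    uniformly distributed unit quaternion (x, y, z, w)."""
    r1, r2, r3 = r
    a, b = np.sqrt(1.0 - r1), np.sqrt(r1)
    x, y = a * np.sin(2 * np.pi * r2), a * np.cos(2 * np.pi * r2)
    z, w = b * np.sin(2 * np.pi * r3), b * np.cos(2 * np.pi * r3)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ])

# Toy target crystal (made-up values) and one sampled augmentation.
rng = np.random.default_rng(0)
t = rng.uniform(size=(1, 3))          # translation parameters
r = rng.uniform(size=3)               # rotation parameters
X = np.array([[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]])
L_tilde = np.diag([4.0, 4.5, 5.0])

X_c = translate(X, t)                 # context coordinates
U = haar_rotation(r)
L_c = L_tilde @ U                     # context lattice, R(L) = L U
```

Both operations leave interatomic distances unchanged, which is why they are natural candidates for energy-preserving augmentations; the pair `(t, r)` is kept because the predictor is conditioned on it.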

Encoder and predictor.

Given the augmented representations $\bm{V}_{c}$ and $\bm{V}_{t}$ as in Eq. (3), we encode both using the same Transformer [45] (refer to Appendix B). The resulting embeddings are denoted as $\bm{H}_{c},\bm{H}_{t}\in\mathbb{R}^{N\times d}$, where $d$ represents the JEPA embedding dimension. A predictor network is trained to infer $\bm{H}_{t}$ from $\bm{H}_{c}$ conditioned on the augmentation parameters $(\bm{t},\bm{r})$, which explicitly encode the relation between context and target. We implement the predictor as a multilayer perceptron (MLP):

$$\bm{P}(\bm{H}_{c},\bm{t},\bm{r})=\bm{W}_{2}\,\sigma\!\left(\bm{H}_{c}+\bm{W}_{1}[\bm{t}\,\|\,\bm{r}]+\bm{b}_{1}\right)+\bm{b}_{2},\tag{8}$$

where $\{\bm{W}_{1},\bm{W}_{2},\bm{b}_{1},\bm{b}_{2}\}$ are learnable parameters and $\sigma(\cdot)$ denotes the SiLU activation [17].
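A forward pass of the predictor in Eq. (8) can be sketched in NumPy as below. The parameter shapes (a $d\times 6$ matrix for $\bm{W}_{1}$ and a $d\times d$ matrix for $\bm{W}_{2}$), the broadcasting of the conditioning term over atoms, and the random initial values are assumptions of this sketch; the source specifies only the functional form.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def predictor(H_c, t, r, W1, b1, W2, b2):
    """Eq. (8): W2 * silu(H_c + W1 [t || r] + b1) + b2, applied per atom."""
    cond = np.concatenate([np.ravel(t), np.ravel(r)])   # [t || r], shape (6,)
    hidden = silu(H_c + W1 @ cond + b1)                 # broadcast over N atoms
    return hidden @ W2.T + b2

# Toy shapes: N = 5 atoms, d = 4 embedding dimensions (made-up values).
rng = np.random.default_rng(0)
N, d = 5, 4
H_c = rng.normal(size=(N, d))
t, r = rng.uniform(size=3), rng.uniform(size=3)
W1, b1 = rng.normal(size=(d, 6)), np.zeros(d)
W2, b2 = rng.normal(size=(d, d)), np.zeros(d)
P_hat = predictor(H_c, t, r, W1, b1, W2, b2)            # shape (N, d)
```

The conditioning vector enters additively before the nonlinearity, so the same context embedding yields different predictions for different augmentation parameters.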

Optimization.

JEPA training aims to (i) align predicted and target embeddings, $\min D(\bm{P}(\bm{H}_{c},\bm{t},\bm{r}),\bm{H}_{t})$, and (ii) prevent representation collapse. We jointly achieve both objectives using an energy-weighted InfoNCE loss [36,10]:

$$\mathcal{L}=-\frac{1}{B}\sum_{i}\log\frac{\exp\!\left(\mathrm{sim}(\bm{P}_{i},\bm{H}_{t}^{i})/\tau\right)}{\sum_{k=1}^{B}\exp\!\left(\omega_{ik}\cdot\mathrm{sim}(\bm{P}_{i},\bm{H}_{t}^{k})/\tau\right)},\tag{9}$$

where $\bm{P}_{i}=\bm{P}(\bm{H}_{c}^{i},\bm{t},\bm{r})$, $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity, $\tau>0$ is a temperature parameter, and $B$ is the batch size. The energy-aware weight $\omega_{ik}$ is defined as

$$\omega_{ik}=1-\exp\!\left(-\left|E_{f/\mathrm{atom}}^{i}-E_{f/\mathrm{atom}}^{k}\right|\right)\ \text{for } k\neq i,\quad\text{and}\quad\omega_{ii}=1.\tag{10}$$

This weighting scheme enforces stronger repulsion between embeddings of crystals with larger formation energy differences, while allowing energetically similar crystals to remain close. As a result, the learned latent space both avoids collapse and encodes a meaningful energy-aware structure, enabling embedding-based energy comparison.
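The loss in Eqs. (9) and (10) can be written compactly in NumPy. The sketch below assumes that each crystal has already been reduced to a single $d$-dimensional vector (for example by pooling over atoms); the source does not state how per-atom embeddings are aggregated, so that step, the function name, the temperature value, and the toy batch are all assumptions.

```python
import numpy as np

def energy_weighted_infonce(P, H_t, e_form, tau=0.1):
    """Energy-weighted InfoNCE (Eqs. 9-10) for a batch of B crystals.

    P, H_t : (B, d) predicted and target embeddings, one row per crystal.
    e_form : (B,) formation energies per atom.
    """
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    H = H_t / np.linalg.norm(H_t, axis=1, keepdims=True)
    sim = P @ H.T                                   # sim[i, k] = cos(P_i, H_t^k)
    omega = 1.0 - np.exp(-np.abs(e_form[:, None] - e_form[None, :]))
    np.fill_diagonal(omega, 1.0)                    # omega_ii = 1
    logits = omega * sim / tau
    log_denom = np.log(np.exp(logits).sum(axis=1))
    # The numerator of Eq. (9) is exp(sim_ii / tau), so -log of the ratio is:
    return float(np.mean(log_denom - np.diag(sim) / tau))

# Toy batch (made-up values): targets are noisy copies of the predictions.
rng = np.random.default_rng(0)
P = rng.normal(size=(4, 8))
H_t = P + 0.1 * rng.normal(size=(4, 8))
e_form = np.array([-1.2, -1.1, 0.4, 2.0])
loss = energy_weighted_infonce(P, H_t, e_form)
```

One consequence of Eq. (10) is visible in the code: for two crystals with identical formation energy the weight is zero, so that pair contributes a constant term to the denominator and receives no repulsive gradient.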

Space visualization

Figure 3: Visualization of the latent space learned by Crys-JEPA for 100,000 crystal structures.

We pre-train Crys-JEPA on Materials Project v.2022.10.28 [20] and MPtrj [11]. The former provides high-quality stable structures, while the latter contributes trajectory-level, structurally unstable variations. Owing to resource limitations, we retain only crystals with at most 20 atoms in the unit cell. After training, we visualize the latent space of Crys-JEPA in Fig. 3. Specifically, we randomly sample 100,000 crystals from the two datasets, obtain their Crys-JEPA embeddings together with their $E_{f/\mathrm{atom}}$, and then reduce the embedding dimension using UMAP [31]. The samples are partitioned by $E_{f/\mathrm{atom}}$ into five equal groups of 20,000 crystals, each shown in its own color.

As shown in the figure, the visualized crystals form a structured manifold spanning from low $E_{f/\mathrm{atom}}$ (red) to high $E_{f/\mathrm{atom}}$ (yellow). Crystals with similar formation energies tend to be located closer together, whereas crystals with larger energy gaps tend to be more separated. This visualization provides qualitative evidence that Crys-JEPA embedding comparison can act as a surrogate for the $E_{f/\mathrm{atom}}$ difference.

4.2 Screening-and-Refinement Pipeline

Taking training on MP-20 as an example, the screening-and-refinement pipeline proceeds as follows:

1. Pre-train a target generative model $\mathcal{G}$ on MP-20.
2. Use the model $\mathcal{G}$ to generate $\mathcal{N}$ crystals, relax them using a machine-learning force field, and retain those that are V.U.N. relative to MP-20.
3. For each retained crystal $\mathbf{C}$, identify its reference set, consisting of training crystals that belong to the same chemical system as $\mathbf{C}$ or to any of its subsystems. For example, if the formula of $\mathbf{C}$ is $A_{2}B$, a possible reference set from the training data is $\mathrm{Ref}=\{A,B,AB_{2},A_{2}B_{2}\}$.
4. Obtain the Crys-JEPA embedding $\bm{H}_{C}$ of the target crystal $\mathbf{C}$, as well as the embeddings $\{\bm{H}_{i}^{\mathrm{Ref}}\}$ of the reference crystals. Based on these embeddings, we heuristically define the average distance of $\mathbf{C}$ to MP-20 as $$D_{C}=\frac{1}{N_{\mathrm{Ref}}}\sum_{i\in\mathrm{Ref}}\left\|\bm{H}_{C}-\bm{H}_{i}^{\mathrm{Ref}}\right\|^{2},\tag{11}$$ where $N_{\mathrm{Ref}}$ denotes the number of reference candidates for $\mathbf{C}$.
5. Finally, rank the retained crystals from Step 2 according to $D_{C}$, select the top $k\%$ of crystals with the smallest distances, and fine-tune the base model $\mathcal{G}$ on the selected crystals.
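Steps 3 to 5 above can be sketched in plain Python as follows. The helper names, the toy training entries, and the two-dimensional embeddings are hypothetical, and the sketch assumes each crystal is summarized by a single fixed-length embedding vector; it is meant to show the bookkeeping of the ranking signal in Eq. (11), not the authors' implementation.

```python
def reference_set(candidate_elements, training):
    """Step 3: training entries whose elements form a subset of the
    candidate's chemical system (the system itself or a subsystem)."""
    cand = frozenset(candidate_elements)
    return [entry for entry in training if frozenset(entry["elements"]) <= cand]

def mean_sq_distance(h_c, refs):
    """Step 4, Eq. (11): average squared Euclidean distance to the references."""
    total = 0.0
    for entry in refs:
        total += sum((a - b) ** 2 for a, b in zip(h_c, entry["embedding"]))
    return total / len(refs)

def select_top_k(scores, k_percent):
    """Step 5: keep the k% of candidates with the smallest distance."""
    n_keep = max(1, int(len(scores) * k_percent / 100))
    return sorted(scores, key=scores.get)[:n_keep]

# Made-up training entries with 2-D embeddings.
training = [
    {"formula": "A", "elements": {"A"}, "embedding": [0.0, 0.0]},
    {"formula": "B", "elements": {"B"}, "embedding": [1.0, 0.0]},
    {"formula": "AB2", "elements": {"A", "B"}, "embedding": [0.0, 1.0]},
    {"formula": "AC", "elements": {"A", "C"}, "embedding": [5.0, 5.0]},
]

refs = reference_set({"A", "B"}, training)          # excludes the A-C entry
candidates = {"A2B_near": [0.5, 0.5], "A2B_far": [3.0, 3.0]}
scores = {name: mean_sq_distance(h, refs) for name, h in candidates.items()}
selected = select_top_k(scores, k_percent=50)
```

A candidate whose chemical system has no matching training entries would produce an empty reference set; how such cases are handled is not described in the text, so the sketch leaves it to the caller.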

In Step 2, we again use MatterSim-v1-1M [52], as in Section 3. As introduced in Section 2, the phase diagram depends only on crystals within the same chemical system or its subsystems, which motivates the construction of the reference set in Step 3. In Step 4, instead of explicitly solving the convex-hull optimization in Eq. (5), we treat all candidates in the reference set uniformly and use the resulting average embedding distance as a simple ranking signal. Experimental results in Section 5 support the effectiveness of this simplification.

4.3 Why Use Embedding-Based Comparison for Stability Screening?

Figure 4: Distributions of $E_{f/\mathrm{atom}}$ and $E_{t/\mathrm{atom}}$ for crystals in the available datasets.

Besides Crys-JEPA, another option for stability screening is to use machine-learning force fields (MLFFs) [52,14]. We compare Crys-JEPA with MLFF-based screening from two perspectives:

(1) Effective comparison space. Fig. 4 shows the distributions of $E_{f/\mathrm{atom}}$ and $E_{t/\mathrm{atom}}$ in the two datasets. In both cases, most crystals are concentrated in a relatively narrow scalar range, which makes fine-grained comparison more error-prone. In contrast, Crys-JEPA performs comparison in a much broader latent space. Since our pipeline only requires relative ranking rather than absolute energy prediction, such a space can provide a more expressive signal for distinguishing candidates.

(2) Information content. MLFFs output scalar energy estimates (and optionally forces and stresses). By contrast, Crys-JEPA produces crystal embeddings that are trained to reflect formation-energy differences while still encoding structural information from the input crystal. Consequently, this approach provides a much richer informational basis for estimating thermodynamic stability than a simple scalar prediction.

In Section 5, we compare Crys-JEPA against two MLFFs, namely eSEN [14] and CHGNet [11], and show that Crys-JEPA provides a stronger screening signal in our refinement pipeline.

5 Numerical Experiments

In this section, we evaluate the effectiveness of Crys-JEPA and the proposed screening-and-refinement pipeline. As the target generator $\mathcal{G}$ in the pipeline, we use a basic generative model consisting of a denoising diffusion probabilistic model (DDPM) [18] with a vanilla Transformer [45] as the denoising network. Full model details are provided in Appendix B.

5.1 Crystal Generation on MP-20

Comparing with Generative Models.

We first evaluate how much improvement the proposed pipeline brings. We train the base model and evaluate it under the same experimental protocol as in Section 3. The results are reported in Table 1. We compare with the baselines from Section 3, and further perform density functional theory (DFT) [38] calculations for more accurate stability evaluation. Details are given in Appendix D.3. Due to the high cost, we only perform DFT on the 1,000 generated crystals from both the strongest baseline (MatterGen) and our model.

As shown in Table 1, the base model (DDPM+Transformer) achieves higher stability but lower novelty than MatterGen, resulting in inferior overall S.U.N. and V.S.U.N. performance. After applying the screening-and-refinement pipeline, most metrics improve substantially, including the most important one, V.S.U.N., which surpasses MatterGen.

Table 1: Generation performance on MP-20 across 10,000 generated crystals for MLFF metrics, and with 1,000 samples used for DFT evaluation. Results are reported as the mean percentage ± standard deviation. (S: stability; N: novelty; V: validity; U: uniqueness.)

| Model | S | N | S.U.N (MLFF) | V.S.U.N (MLFF) | S.U.N (DFT) | V.S.U.N (DFT) |
|---|---|---|---|---|---|---|
| CDVAE† [48] | 29.9 ± 1.2 | 96.5 ± 0.6 | 27.0 ± 1.3 | 22.8 ± 1.1 | - | - |
| DiffCSP [21] | 45.9 ± 1.8 | 83.6 ± 0.6 | 30.9 ± 1.8 | 25.6 ± 1.3 | - | - |
| DiffCSP++ [22] | 39.7 ± 2.0 | 82.8 ± 1.0 | 23.8 ± 1.3 | 20.3 ± 1.2 | - | - |
| FlowMM† [33] | 40.8 ± 2.0 | 83.1 ± 1.1 | 25.3 ± 1.6 | 21.0 ± 1.3 | - | - |
| FlowLLM† [42] | 36.5 ± 1.5 | 86.4 ± 1.5 | 25.1 ± 0.6 | 21.3 ± 0.8 | - | - |
| SymmCD† [28] | 34.7 ± 1.4 | 85.1 ± 1.5 | 19.0 ± 0.9 | 15.7 ± 0.8 | - | - |
| ADiT [23] | 69.5 ± 1.1 | 58.9 ± 0.9 | 30.3 ± 0.8 | 27.2 ± 1.3 | - | - |
| CrysLLMGen (7B)† [24] | 35.1 ± 1.9 | 87.4 ± 1.1 | 22.9 ± 0.9 | 20.4 ± 0.9 | - | - |
| SGEquiDiff [8] | 46.5 ± 1.9 | 74.8 ± 1.2 | 23.5 ± 0.9 | 20.0 ± 1.0 | - | - |
| MatterGen [53] | 47.0 ± 1.1 | 86.6 ± 1.0 | 34.6 ± 1.2 | 29.2 ± 0.9 | 30.8 | 26.4 |
| Base model | 48.4 ± 1.4 | 78.0 ± 1.6 | 27.8 ± 1.1 | 23.3 ± 1.3 | - | - |
| w/ Crys-JEPA | 77.7 ± 0.4 | 79.1 ± 1.4 | 44.2 ± 1.5 | 44.1 ± 1.5 | 48.1 | 47.9 |

We then collect all 10,000 crystals generated by the base model and repeat the analysis in Fig. 1(b). The resulting trends are shown in Fig. 5(a). Compared with Fig. 1(b), the novelty curve starts from a noticeably higher value after refinement. In addition, there exists a range, approximately $(0,0.5)$, in which stability remains high while novelty continues to increase. This regime allows the model to maintain both stability and novelty simultaneously, thereby sustaining a higher V.S.U.N. As shown in Fig. 5(b), the proposed pipeline pushes the base model beyond the stability–novelty trade-off.

Figure 5: (a) Trends of the evaluation metrics along the proxy distance after refinement. (b) Comparison of the stability–novelty trade-off before and after refinement.

Comparing Crys-JEPA with fingerprints and MLFFs

In this experiment, we compare Crys-JEPA with two MLFFs, eSEN [14] and CHGNet [11]. Specifically, we use each MLFF to predict crystal energies, compute $\Delta E$ for generated crystals relative to the training set, and rank the generated candidates by $\Delta E$ before selection. For completeness, we also consider the fingerprint-based distance introduced in Section 3 as an alternative ranking signal at Step 4 of the pipeline. The results are summarized in Table 2. Besides generation metrics, we also report the training-set size used by each model and the corresponding inference cost. To ensure a fair runtime comparison and eliminate the effect of batch size, we disable batched inference by setting the batch size to 1. Each model sequentially processes 10,000 crystals, and the total inference time is reported.

Table 2: Comparison of different screening methods on MP-20, including fingerprint-based ranking, MLFF-based ranking, and the proposed Crys-JEPA. Results are averaged over 10,000 sampled crystals and reported as mean percentage ± standard deviation.

| Method | V (%) | S (%) | N (%) | S.U.N (%) | V.S.U.N (%) | Inference time (s) for 10,000 crystals | # Training samples |
|---|---|---|---|---|---|---|---|
| Base model | 82.1 ± 1.5 | 48.4 ± 1.4 | 78.0 ± 1.6 | 27.8 ± 1.1 | 23.3 ± 1.3 | - | 27,136 |
| w/ Fingerprints | 94.1 ± 1.0 | 53.1 ± 1.4 | 72.9 ± 1.2 | 22.9 ± 0.7 | 21.9 ± 0.7 | 1651.69 | - |
| w/ CHGNet [11] | 96.8 ± 0.3 | 74.7 ± 1.3 | 77.4 ± 0.6 | 39.8 ± 0.9 | 38.2 ± 0.7 | 339.49 | 1,580,395 |
| w/ eSEN-30M-MP [14] | 95.8 ± 0.3 | 77.2 ± 1.0 | 80.2 ± 0.9 | 44.0 ± 0.8 | 41.0 ± 1.1 | 843.53 | 1,580,395 |
| w/ Crys-JEPA | 98.9 ± 0.2 | 77.7 ± 0.4 | 79.1 ± 1.4 | 44.2 ± 1.5 | 44.1 ± 1.5 | 48.87 | 839,568 |

1. Fingerprints capture structural and compositional similarity, but they are not directly energy-aware. Consequently, they are less reliable for screening stable crystals.
2. Crys-JEPA substantially improves the validity of the base model. This is because the Crys-JEPA embedding also encodes structural information. When Crys-JEPA is used to screen generated crystals, the selected candidates are more likely to be structurally close to the valid training crystals.
3. eSEN is a strong baseline for filtering high-quality crystals. However, it explicitly models all interatomic interactions within a cutoff radius, which leads to a high inference cost. In contrast, Crys-JEPA adopts a vanilla Transformer encoder and focuses only on atoms within the unit cell, resulting in a much more efficient screening procedure.
4. Both MLFFs are trained on the full MPtrj dataset with force, stress, and energy labels. By contrast, as described in Section 4.1, Crys-JEPA uses roughly half of the crystals and only energy labels. This suggests that scaling Crys-JEPA with richer supervision and more training data could further improve its performance.
5. A key requirement for computing $\Delta E$ via MLFFs is that the reference crystals (in this case, the training set) must have DFT-calculated energies for an accurate convex hull. In contrast, Crys-JEPA relies solely on crystal structures and the simple Euclidean metric $D_{C}$ of Eq. (11).

5.2 Crystal Generation Based on Alex-MP-20

Table 3: Evaluation of 10,000 crystals generated by models trained on Alex-MP-20.

| Model | S | N | S.U.N (MLFF) | V.S.U.N (MLFF) | S.U.N (DFT) | V.S.U.N (DFT) |
|---|---|---|---|---|---|---|
| MatterGen | 73.1 ± 1.0 | 67.9 ± 1.4 | 43.5 ± 1.2 | 37.4 ± 1.4 | 40.1 | 35.1 |
| Base model | 69.0 ± 0.9 | 61.8 ± 1.2 | 33.2 ± 1.5 | 28.8 ± 1.0 | - | - |
| w/ Crys-JEPA | 78.6 ± 1.2 | 84.2 ± 1.5 | 63.4 ± 2.1 | 63.2 ± 2.0 | 64.1 | 64.1 |

We further verify the effectiveness of Crys-JEPA and the screening-and-refinement pipeline on the Alex-MP-20 dataset [53], using the MP2020correction reference following [53]. The results are presented in Table 3. Once again, the proposed pipeline yields a substantial improvement in metrics such as V.S.U.N. In particular, compared to MatterGen, V.S.U.N under DFT evaluation rises from 35.1 to 64.1, a relative increase of 82.6%.
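The 82.6% figure is a relative gain, not a difference in percentage points. A short check of the arithmetic, using only the DFT columns of Table 3 (the helper name is ours):

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0

# DFT-evaluated values taken from Table 3.
vsun_gain = relative_improvement(baseline=35.1, improved=64.1)  # V.S.U.N (DFT)
sun_gain = relative_improvement(baseline=40.1, improved=64.1)   # S.U.N (DFT)
print(f"V.S.U.N: {vsun_gain:.1f}%  S.U.N: {sun_gain:.1f}%")
```

The absolute V.S.U.N gain is 29.0 percentage points; dividing by the MatterGen baseline of 35.1 gives the relative figure quoted in the text.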

6 Conclusion

In this paper, we investigated the stability–novelty trade-off in de novo crystal generation and proposed Crys-JEPA, an energy-aware embedding model for efficient stability screening. By mapping generated crystals alongside known training references within a learned latent space, Crys-JEPA enables a screening-and-refinement pipeline that overcomes this inherent trade-off.

Limitations and broader impact.

While Crys-JEPA was trained on a relatively limited dataset, its performance could be further enhanced through larger-scale data and advanced architectural refinements. Although this work significantly reduces the computational cost of materials screening, candidates selected by the surrogate model should be rigorously validated via first-principles methods before being considered definitive material discoveries.

Acknowledgments and Disclosure of Funding

XB is supported by MOE AcRF T1 Grant ID 251RES2423 and NRF AI4SCT Grant ID 20250024. KSN acknowledges support by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG3-RP-2022-028), by the Ministry of Education, Singapore under Research Centre of Excellence award to the Institute for Functional Intelligent Materials, I-FIM (project No. EDUNC-33-18-279-V12) and by the Tier 3 program (MOE-MOET32024-0001). The DFT calculation was performed on resources of the National Supercomputing Centre (NSCC), Singapore.

References

[1] J. Abramson, J. Adler, J. Dunger, R. Evans, T. Green, A. Pritzel, O. Ronneberger, L. Willmore, A. J. Ballard, J. Bambrick, et al. (2024) Accurate structure prediction of biomolecular interactions with alphafold 3. Nature 630 (8016), pp. 493–500.
[2] (2023) Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15619–15629.
[3] R. Balestriero and Y. LeCun (2025) Lejepa: provable and scalable self-supervised learning without the heuristics. arXiv preprint arXiv:2511.08544.
[4] A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y. LeCun, M. Assran, and N. Ballas (2024) Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471.
[5] C. M. Bishop and N. M. Nasrabadi (2006) Pattern recognition and machine learning. Vol. 4, Springer.
[6] D. Bo, C. Shi, L. Wang, and R. Liao (2023) Specformer: spectral graph neural networks meet transformers. arXiv preprint arXiv:2303.01028.
[7] K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, and A. Walsh (2018) Machine learning for molecular and materials science. Nature 559 (7715), pp. 547–555.
[8] R. Chang, A. Pak, A. Guerra, N. Zhan, N. Richardson, E. Ertekin, and R. P. Adams (2025) Space group equivariant crystal diffusion. arXiv preprint arXiv:2505.10994.
[9] D. Chen, M. Shukor, T. Moutakanni, W. Chung, J. Yu, T. Kasarla, A. Bolourchi, Y. LeCun, and P. Fung (2025) Vl-jepa: joint embedding predictive architecture for vision-language. arXiv preprint arXiv:2512.10942.
[10] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607.
[11] B. Deng, P. Zhong, K. Jun, J. Riebesell, K. Han, C. J. Bartel, and G. Ceder (2023) CHGNet as a pretrained universal neural network potential for charge-informed atomistic modelling. Nature Machine Intelligence 5 (9), pp. 1031–1041.
[12] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[13] V. P. Dwivedi and X. Bresson (2020) A generalization of transformer networks to graphs. arXiv preprint arXiv:2012.09699.
[14] X. Fu, B. M. Wood, L. Barroso-Luque, D. S. Levine, M. Gao, M. Dzamba, and C. L. Zitnick (2025) Learning smooth and expressive interatomic potentials for physical property prediction. arXiv preprint arXiv:2502.12147.
[15] A. M. Ganose, H. Sahasrabuddhe, M. Asta, K. Beck, T. Biswas, A. Bonkowski, J. Bustamante, X. Chen, Y. Chiang, D. C. Chrzan, et al. (2025) Atomate2: modular workflows for materials science. Digital Discovery 4 (7), pp. 1944–1973.
[16] N. Gruver, A. Sriram, A. Madotto, A. Wilson, C. Zitnick, and Z. Ulissi (2024) Fine-tuned language models generate stable inorganic materials as text. arXiv preprint arXiv:2402.04379.
[17] D. Hendrycks and K. Gimpel (2016) Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
[18] J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. Advances in neural information processing systems 33, pp. 6840–6851.
[19] H. Huang, Y. LeCun, and R. Balestriero (2025) Llm-jepa: large language models meet joint embedding predictive architectures. arXiv preprint arXiv:2509.14252.
[20] A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, et al. (2013) Commentary: the materials project: a materials genome approach to accelerating materials innovation. APL materials 1 (1).
[21] R. Jiao, W. Huang, P. Lin, J. Han, P. Chen, Y. Lu, and Y. Liu (2023) Crystal structure prediction by joint equivariant diffusion. Advances in Neural Information Processing Systems 36, pp. 17464–17497.
[22] R. Jiao, W. Huang, Y. Liu, D. Zhao, and Y. Liu (2024) Space group constrained crystal generation. arXiv preprint arXiv:2402.03992.
[23] C. K. Joshi, X. Fu, Y. Liao, V. Gharakhanyan, B. K. Miller, A. Sriram, and Z. W. Ulissi (2025) All-atom diffusion transformers: unified generative modelling of molecules and materials. arXiv preprint arXiv:2503.03965.
[24] S. Khastagir, K. Das, P. Goyal, S. Lee, S. Bhattacharjee, and N. Ganguly (2025) LLM meets diffusion: a hybrid framework for crystal material generation. arXiv preprint arXiv:2510.23040.
[25] G. Kresse and J. Furthmüller (1996-10) Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set. Phys. Rev. B 54, pp. 11169–11186.
[26] G. Kresse and D. Joubert (1999-01) From ultrasoft pseudopotentials to the projector augmented-wave method. Phys. Rev. B 59, pp. 1758–1775.
[27] Y. LeCun (2022) A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review 62 (1), pp. 1–62.
[28] D. Levy, S. S. Panigrahi, S. Kaba, Q. Zhu, K. L. K. Lee, M. Galkin, S. Miret, and S. Ravanbakhsh (2025) SymmCD: symmetry-preserving crystal generation with diffusion models. arXiv preprint arXiv:2502.03638.
[29] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2022) Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.
[30] L. Maes, Q. L. Lidec, D. Scieur, Y. LeCun, and R. Balestriero (2026) Leworldmodel: stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312.
[31] L. McInnes, J. Healy, N. Saul, and L. Grossberger (2018) UMAP: uniform manifold approximation and projection. The Journal of Open Source Software 3 (29), pp. 861.
[32] A. Merchant, S. Batzner, S. S. Schoenholz, M. Aykol, G. Cheon, and E. D. Cubuk (2023) Scaling deep learning for materials discovery. Nature 624 (7990), pp. 80–85.
[33] B. K. Miller, R. T. Chen, A. Sriram, and B. M. Wood (2024) Flowmm: generating materials with riemannian flow matching. In Forty-first International Conference on Machine Learning.
[34] A. Q. Nichol and P. Dhariwal (2021) Improved denoising diffusion probabilistic models. In International conference on machine learning, pp. 8162–8171.
[35] S. P. Ong, W. D. Richards, A. Jain, G. Hautier, M. Kocher, S. Cholia, D. Gunter, V. L. Chevrier, K. A. Persson, and G. Ceder (2013) Python materials genomics (pymatgen): a robust, open-source python library for materials analysis. Computational Materials Science 68, pp. 314–319.
[36] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
[37] W. Peebles and S. Xie (2023) Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4195–4205.
[38] J. P. Perdew, K. Burke, and M. Ernzerhof (1996-10) Generalized gradient approximation made simple. Phys. Rev. Lett. 77, pp. 3865–3868.
[39] J. Schmidt, T. F. Cerqueira, A. H. Romero, A. Loew, F. Jäger, H. Wang, S. Botti, and M. A. Marques (2024) Improving machine-learning models in materials science through large datasets. Materials Today Physics 48, pp. 101560.
[40] K. Shoemake (1992) Uniform random rotations. In Graphics Gems III (IBM Version), pp. 124–132.
[41] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256–2265.
[42] A. Sriram, B. K. Miller, R. T. Chen, and B. M. Wood (2024) Flowllm: flow matching for material generation with large language models as base distributions. Advances in Neural Information Processing Systems 37, pp. 46025–46046.
[43] Y. Sui and B. Hooi (2026) Conversation for non-verifiable learning: self-evolving llms through meta-evaluation. arXiv preprint arXiv:2601.21464.
[44] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. (2023) Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
[45] A. Vaswani (2017) Attention is all you need. Advances in Neural Information Processing Systems.
[46] L. Ward, A. Agrawal, A. Choudhary, and C. Wolverton (2016) A general-purpose machine learning framework for predicting properties of inorganic materials. npj Computational Materials 2 (1), pp. 16028.
[47] L. Ward, A. Dunn, A. Faghaninia, N. E. Zimmermann, S. Bajaj, Q. Wang, J. Montoya, J. Chen, K. Bystrom, M. Dylla, et al. (2018) Matminer: an open source toolkit for materials data mining. Computational Materials Science 152, pp. 60–69.
[48] T. Xie, X. Fu, O. Ganea, R. Barzilay, and T. Jaakkola (2021) Crystal diffusion variational autoencoder for periodic material generation. arXiv preprint arXiv:2110.06197.
[49] Y. Xing, X. Wang, Y. Li, H. Huang, and C. Shi (2024) Less is more: on the over-globalizing problem in graph transformers. arXiv preprint arXiv:2405.01102.
[50] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu (2020) On layer normalization in the transformer architecture. In International conference on machine learning, pp. 10524–10533.
[51] M. Xu, A. S. Powers, R. O. Dror, S. Ermon, and J. Leskovec (2023) Geometric latent diffusion models for 3d molecule generation. In International Conference on Machine Learning, pp. 38592–38610.
[52] H. Yang, C. Hu, Y. Zhou, X. Liu, Y. Shi, J. Li, G. Li, Z. Chen, S. Chen, C. Zeni, et al. (2024) Mattersim: a deep learning atomistic model across elements, temperatures and pressures. arXiv preprint arXiv:2405.04967.
[53] C. Zeni, R. Pinsler, D. Zügner, A. Fowler, M. Horton, X. Fu, S. Shysheya, J. Crabbé, L. Sun, J. Smith, et al. (2023) Mattergen: a generative model for inorganic materials design. arXiv preprint arXiv:2312.03687.
[54] Z. Zhang, X. Wang, M. Zhang, J. Tan, and C. Shi (2026) Toward graph-tokenizing large language models with reconstructive graph instruction tuning. In Proceedings of the ACM Web Conference 2026, pp. 430–441.
[55] B. Zhu, D. Bo, D. C. Zhang, and X. Wang (2026) Graph-grpo: training graph flow models with reinforcement learning. arXiv preprint arXiv:2603.10395.
[56] N. E. Zimmermann and A. Jain (2020) Local structure order parameters and site fingerprints for quantification of coordination environment and crystal structure similarity. RSC advances 10 (10), pp. 6063–6081.

Appendix A Thermodynamic Stability Calculation

In this section, we give a brief derivation showing that the energy above hull can also be measured using the total energy per atom. First, we restate Eq. (5) and Eq. (6) here:

$$E_{f/\mathrm{atom}}^{\mathrm{hull}}(\bm{f}) = \min_{\{\lambda_j\}} \sum_j \lambda_j E_{f/\mathrm{atom}}^{j} \quad \text{s.t.} \quad \sum_j \lambda_j \bm{f}^{j} = \bm{f},\;\; \sum_j \lambda_j = 1,\;\; \lambda_j \geq 0, \tag{12a}$$

$$\Delta E = E_{f/\mathrm{atom}}(\mathbf{C}) - E_{f/\mathrm{atom}}^{\mathrm{hull}}(\bm{f}). \tag{12b}$$

Then, substituting $E_{f/\mathrm{atom}}^{j} = E_{t/\mathrm{atom}}^{j} - \sum_{i=1}^{k} f_i^{j} \mu_i^{\mathrm{ref}}$ into Eq. (12a) gives

$$\begin{aligned}
E_{f/\mathrm{atom}}^{\mathrm{hull}}(\bm{f}) &= \min_{\{\lambda_j\}} \sum_j \lambda_j \left( E_{t/\mathrm{atom}}^{j} - \sum_{i=1}^{k} f_i^{j} \mu_i^{\mathrm{ref}} \right) \\
&= \min_{\{\lambda_j\}} \left[ \sum_j \lambda_j E_{t/\mathrm{atom}}^{j} - \sum_{i=1}^{k} \left( \sum_j \lambda_j f_i^{j} \right) \mu_i^{\mathrm{ref}} \right].
\end{aligned} \tag{13}$$

Using the composition-matching constraint $\sum_j \lambda_j \bm{f}^{j} = \bm{f}$ in Eq. (5), we have $\sum_j \lambda_j f_i^{j} = f_i$ for all $i$. The reference term therefore does not depend on $\{\lambda_j\}$ and can be taken out of the minimization, and thus

$$E_{f/\mathrm{atom}}^{\mathrm{hull}}(\bm{f}) = E_{t/\mathrm{atom}}^{\mathrm{hull}}(\bm{f}) - \sum_{i=1}^{k} f_i \mu_i^{\mathrm{ref}}, \tag{14}$$

where

$$E_{t/\mathrm{atom}}^{\mathrm{hull}}(\bm{f}) = \min_{\{\lambda_j\}} \sum_j \lambda_j E_{t/\mathrm{atom}}^{j} \quad \text{s.t.} \quad \sum_j \lambda_j \bm{f}^{j} = \bm{f},\;\; \sum_j \lambda_j = 1,\;\; \lambda_j \geq 0 \tag{15}$$

is the convex hull energy in composition–total-energy space. Combining this with Eq. (12b), the elemental reference term cancels:

$$\begin{aligned}
\Delta E &= \left( E_{t/\mathrm{atom}} - \sum_{i=1}^{k} f_i \mu_i^{\mathrm{ref}} \right) - \left( E_{t/\mathrm{atom}}^{\mathrm{hull}}(\bm{f}) - \sum_{i=1}^{k} f_i \mu_i^{\mathrm{ref}} \right) \\
&= E_{t/\mathrm{atom}} - E_{t/\mathrm{atom}}^{\mathrm{hull}}(\bm{f}).
\end{aligned} \tag{16}$$

Therefore, although the phase diagram is originally defined using formation energies, the energy above hull can be computed equivalently as the difference between the crystal total energy per atom and the convex hull energy at the same composition. This way of computing $\Delta E$ has been adopted internally by pymatgen [35].
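The cancellation in Eq. (16) can be checked numerically on a toy binary system. The sketch below is an illustration under stated assumptions: the elemental reference energies, the three reference phases, and the candidate crystal are hypothetical numbers, and the hull is found by brute force over pairs of phases, which is sufficient only because a binary system has a single composition variable. It is not how pymatgen builds phase diagrams.

```python
from itertools import combinations

def hull_energy(x, phases):
    """Lower convex hull energy at composition x for a binary A_(1-x)B_x system.

    phases is a list of (x_j, energy_j) reference points. With one composition
    variable, the optimum of the linear program in Eq. (12a) mixes at most two
    phases, so searching single phases and bracketing pairs is enough.
    """
    best = min((e for xj, e in phases if abs(xj - x) < 1e-12), default=float("inf"))
    for (x1, e1), (x2, e2) in combinations(sorted(phases), 2):
        if x1 < x < x2:
            lam = (x - x1) / (x2 - x1)  # weight on the second phase
            best = min(best, (1 - lam) * e1 + lam * e2)
    return best

# Hypothetical elemental reference energies mu_i^ref (eV/atom).
MU_REF = {"A": -3.0, "B": -5.0}

def formation_energy(x, e_total):
    """E_f = E_t - sum_i f_i * mu_i^ref for composition A_(1-x)B_x."""
    return e_total - ((1 - x) * MU_REF["A"] + x * MU_REF["B"])

# Hypothetical reference phases: (fraction of B, total energy per atom).
refs_total = [(0.0, -3.0), (0.5, -4.6), (1.0, -5.0)]
refs_form = [(x, formation_energy(x, e)) for x, e in refs_total]

# Hypothetical candidate crystal at x = 0.25.
x_c, e_c = 0.25, -3.7
delta_e_total = e_c - hull_energy(x_c, refs_total)
delta_e_form = formation_energy(x_c, e_c) - hull_energy(x_c, refs_form)
print(round(delta_e_total, 6), round(delta_e_form, 6))
```

Both routes give the same energy above hull (about 0.1 eV/atom for these made-up numbers), as Eq. (16) predicts.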

Appendix B Model Details

B.1 Diffusion-Based Generation

Diffusion models [41] have recently become a leading framework for generative modeling. In this work, we adopt the denoising diffusion probabilistic model (DDPM) [18] as our generative backbone.

DDPM consists of a forward noising process and a learned reverse denoising process. In the forward process, Gaussian noise is gradually added to a clean sample $\bm{x}_0$ over $T$ timesteps:

$$\bm{x}_t = \sqrt{1-\beta_t}\,\bm{x}_{t-1} + \sqrt{\beta_t}\,\bm{\epsilon}_t = \sqrt{\bar{\alpha}_t}\,\bm{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\bm{\epsilon}, \quad \bm{\epsilon} \sim \mathcal{N}(0, \bm{I}), \tag{17}$$

where $\{\beta_t\}_{t=1}^{T}$ denotes a predefined variance schedule with $\beta_1 < \cdots < \beta_T$, $\alpha_t := 1 - \beta_t$, and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$. This formulation yields a closed-form expression for the conditional distribution $q(\bm{x}_t \mid \bm{x}_0)$.

The reverse process reconstructs $\bm{x}_0$ from noise using a parameterized model. Specifically, a neural network $\bm{\epsilon}_\theta(\bm{x}_t, t)$ is trained to estimate the injected noise at each timestep. The corresponding training objective is:

$$\mathcal{L}_{\text{DDPM}}(\theta) = \mathbb{E}_{t, \bm{x}_0, \bm{\epsilon}} \left[ \left\| \bm{\epsilon} - \bm{\epsilon}_\theta(\bm{x}_t, t) \right\|_2^2 \right], \tag{18}$$

where $t$ is sampled uniformly from $\{1, \ldots, T\}$.

During sampling, generation starts from Gaussian noise $\bm{x}_T \sim \mathcal{N}(0, \bm{I})$ and iteratively applies the learned reverse transitions to recover $\bm{x}_0$. The reverse transition is modeled as:

$$q(\bm{x}_{t-1} \mid \bm{x}_t, \bm{x}_0) = \mathcal{N}\bigl(\bm{x}_{t-1}; \tilde{\bm{\mu}}_t(\bm{x}_t, \bm{x}_0), \tilde{\beta}_t \bm{I}\bigr), \tag{19}$$

with

$$\tilde{\bm{\mu}}_t = \frac{1}{\sqrt{\alpha_t}} \left( \bm{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \bm{\epsilon}_\theta(\bm{x}_t, t) \right), \quad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\, \beta_t. \tag{20}$$
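The closed form in Eq. (17) can be verified by iterating the one-step rule and tracking the signal coefficient and noise variance. The sketch below assumes a simple linear variance schedule with illustrative endpoints; it is only meant to check the algebra, not to reproduce the schedule used in the experiments.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Hypothetical increasing variance schedule beta_1 < ... < beta_T."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def iterated_coefficients(betas):
    """Track x_t = m_t * x_0 + sqrt(v_t) * eps by applying the one-step rule."""
    m, v, out = 1.0, 0.0, []
    for beta in betas:
        m *= math.sqrt(1.0 - beta)   # signal coefficient shrinks each step
        v = (1.0 - beta) * v + beta  # variances of independent noises add
        out.append((m, v))
    return out

def posterior_variance(betas, abar, t):
    """beta_tilde_t from Eq. (20); t is 1-indexed and alpha_bar_0 = 1."""
    prev = abar[t - 2] if t > 1 else 1.0
    return (1.0 - prev) / (1.0 - abar[t - 1]) * betas[t - 1]

betas = linear_betas(T=256)
abar = alpha_bars(betas)
max_err = max(
    max(abs(m - math.sqrt(a)), abs(v - (1.0 - a)))
    for (m, v), a in zip(iterated_coefficients(betas), abar)
)
print(f"max deviation from closed form: {max_err:.2e}")
```

The iterated coefficients agree with $\sqrt{\bar{\alpha}_t}$ and $1-\bar{\alpha}_t$ up to floating-point error, which is why $\bm{x}_t$ can be sampled directly from $\bm{x}_0$ during training.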

B.2 Transformer as Backbones

Figure 6: Two backbone Transformers.

Our framework employs two Transformer-based architectures: a crystal encoder in Crys-JEPA and a denoising network in DDPM. The detailed designs are shown in Fig. 6. Both are built upon the standard Transformer [45], with several practical modifications.

Pre-normalization.

We adopt pre-normalization before both the self-attention and feed-forward layers, following [50], to improve optimization stability.

Global Representation.

To obtain a global crystal representation in Crys-JEPA, we introduce a learnable [CLS] token [12]. This token is concatenated with atom embeddings and processed through the Transformer. Its final output serves as the global embedding.

Timestep Conditioning.

In the DDPM denoising network, the timestep $t$ is encoded using sinusoidal embeddings followed by a linear projection. The resulting representation is added to the input features at each Transformer layer.
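As an illustration of the timestep conditioning described above, here is a minimal sketch using the standard Transformer-style sinusoidal features. The frequency base of 10000, the sin-then-cos layout, the identity matrix standing in for the learned projection, and the toy token features are all assumptions; the paper does not specify these details.

```python
import math

def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Sinusoidal features for an integer timestep t (dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

def linear(x, weight, bias):
    """Plain matrix-vector product plus bias, standing in for a learned layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

dim = 8
emb = sinusoidal_embedding(t=37, dim=dim)

# Identity weights are a placeholder for the learned linear projection.
identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
cond = linear(emb, identity, [0.0] * dim)

# Two toy atom feature vectors; the same conditioning vector is added to each.
tokens = [[0.5] * dim, [-0.5] * dim]
conditioned = [[x + c for x, c in zip(tok, cond)] for tok in tokens]
```

Adding one shared vector to every token keeps the conditioning independent of the number of atoms in the unit cell.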

B.3 Hyperparameter Settings

For the Crys-JEPA encoder, we use an 8-layer Transformer with hidden dimension 512 and 16 attention heads, without dropout.

For the diffusion model, we set the number of timesteps to $T=256$ and adopt a cosine noise schedule [34]. The denoising Transformer consists of 12 layers with hidden dimension 1024 and 8 attention heads, and a dropout rate of 0.01.
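For reference, a cosine schedule in the form proposed in [34] can be sketched as follows for $T=256$. The offset `s=0.008` and the clipping value `0.999` are the defaults suggested in that work; whether the present experiments use exactly these values is an assumption.

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """alpha_bar_t = f(t) / f(0), with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Per-step betas from consecutive alpha_bar ratios, clipped near t = T."""
    return [
        min(1 - cosine_alpha_bar(t, T, s) / cosine_alpha_bar(t - 1, T, s), max_beta)
        for t in range(1, T + 1)
    ]

betas = cosine_betas(T=256)
print(f"beta_1 = {betas[0]:.2e}, beta_T = {betas[-1]:.3f}")
```

Compared with a linear schedule, the cosine form adds noise more gradually at early timesteps, and the clip keeps the last steps from reaching $\beta_t = 1$.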

Appendix C More Experiments

C.1 Hyper-Parameter Sensitivity Analysis

Figure 7: Trends of S.U.N and V.S.U.N as $k$ or $\mathcal{N}$ increases. Panels: (a) $k$ on MP-20; (b) $k$ on Alex-MP-20; (c) $\mathcal{N}$ on MP-20.

In the screening-and-refinement pipeline, two key hyper-parameters, $k$ and $\mathcal{N}$, play important roles. The parameter $k$ governs the number of crystals selected at step 5 for fine-tuning, while $\mathcal{N}$ determines the number of initial crystals at step 1. The results are presented in Fig. 7. As $k$ increases, more crystals that are distant from the training set are included in fine-tuning. This is a double-edged sword: it brings in more diverse crystals but decreases stability. Consequently, S.U.N and V.S.U.N generally rise and then fall. Increasing $\mathcal{N}$ generally improves both S.U.N and V.S.U.N, indicating that a larger candidate pool provides more high-quality samples for subsequent screening. This observation is consistent with the intuition that the screening-and-refinement pipeline benefits from a richer pool, as it increases the probability of discovering crystals that simultaneously satisfy stability, uniqueness, and novelty.

C.2 Further Analysis of Crys-JEPA Representation

Energy-aware JEPA representation.

The main Crys-JEPA model is trained with the energy-aware InfoNCE objective in Eq. (9), where the weighting term in Eq. (10) modulates the repulsion between crystal embeddings according to their formation-energy differences. This design aims to organize the latent space such that crystals with similar formation energies are mapped close to each other, while crystals with larger formation-energy gaps are pushed farther apart. As shown in the main text, this representation can serve as an effective surrogate for stability screening in the proposed screening-and-refinement pipeline.

Figure 8: (a) Visualization of the latent space learned by Crys-JEPA without $\omega_{ik}$. (b) Distribution of top four most frequent atom types.

To further understand the learned representation, we analyze the latent space under a related setting where the energy-aware weighting term is removed and all negative samples are treated equally. The resulting latent space is visualized in Fig. 8(a). Compared with the energy-aware representation, the embeddings no longer exhibit a clear organization with respect to formation energy. Instead, the model appears to capture more generic structural and compositional patterns induced by the augmentation-based JEPA objective.

We further inspect atom-level embeddings by visualizing several frequent elements, including O, Mg, F, and S, as shown in Fig. 8(b). The embeddings show partial separation among different element types, suggesting that the model still retains chemically meaningful information even without explicit energy-aware weighting. This observation is consistent with the intuition that translation and rotation augmentations encourage the encoder to preserve intrinsic crystal information.

Limitation.

However, these results should be interpreted with caution. Although the visualizations suggest that the learned embeddings contain certain structural, compositional, or chemical signals, they do not by themselves provide conclusive evidence that the representation is globally well-structured. In particular, JEPA-style objectives may suffer from representation collapse or partial collapse, where embeddings occupy a low-dimensional subspace or concentrate around a limited set of directions. In such cases, low-dimensional visualization methods such as UMAP may still produce visually separable clusters, but the apparent structure may not faithfully reflect a robust or uniformly informative latent space.

This issue is especially important for our setting because Crys-JEPA is used as a ranking signal for stability screening. The screening pipeline requires not only that the representation separates some obvious chemical patterns, but also that distances in the embedding space provide a reliable and fine-grained comparison between generated crystals and reference crystals. Therefore, while the above analysis provides useful qualitative evidence, it remains insufficient for establishing that the learned representation is free from collapse or that all embedding dimensions contribute meaningfully to the energy-aware comparison.

Figure 9: Visualization of the latent space learned by Crys-JEPA with SIGReg and MSE loss. Although formation-energy labels are not used during training, the learned latent space still exhibits a partial organization with respect to formation energy.

Collapse-resistant JEPA with SIGReg and MSE loss.

Motivated by this limitation, we further explore an alternative JEPA training objective based on SIGReg [3]. SIGReg is a recently proposed regularization method for preventing representation collapse in joint-embedding predictive architectures. Instead of relying on negative samples, SIGReg projects a batch of embeddings onto multiple random one-dimensional directions and encourages each projected distribution to match a standard Gaussian. By enforcing Gaussianity across random projections, the batch-level embedding distribution is encouraged to approximate an isotropic Gaussian. This regularization helps prevent trivial collapse and encourages information to be spread across multiple embedding dimensions.

In this variant, we replace the energy-aware InfoNCE objective in Eq. (9) with a combination of mean squared error (MSE) alignment and SIGReg regularization. The MSE term aligns the predicted target embedding with the encoded target embedding:

$$\mathcal{L}_{\mathrm{MSE}} = \left\| P(H_c, t, r) - H_t \right\|_2^2,$$

where $H_c$ and $H_t$ denote the context and target embeddings, respectively, and $P(\cdot)$ is the predictor conditioned on the augmentation parameters. The full objective is then defined as

$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \lambda_{\mathrm{SIG}} \mathcal{L}_{\mathrm{SIGReg}},$$

where $\lambda_{\mathrm{SIG}}$ controls the strength of the SIGReg regularization.
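The structure of this objective can be sketched in a few lines. The regularizer below is a deliberately simplified stand-in, not SIGReg itself: it only penalizes the mean and variance of each random one-dimensional projection for deviating from those of a standard Gaussian, whereas SIGReg as described in [3] enforces Gaussianity of the full projected distribution. The function names, the weight `lam_sig=0.1`, the choice to regularize the target embeddings, and the batch averaging of the MSE term are all assumptions.

```python
import math
import random

def mse_loss(pred, target):
    """Batch mean of the squared L2 distance between predicted and target embeddings."""
    return sum(
        sum((p - q) ** 2 for p, q in zip(a, b)) for a, b in zip(pred, target)
    ) / len(pred)

def random_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def gaussianity_penalty(embeddings, num_directions=16, seed=0):
    """Moment-matching surrogate for L_SIGReg (NOT the actual SIGReg statistic).

    Projects the batch onto random unit directions and penalizes projected
    mean != 0 and projected variance != 1.
    """
    rng = random.Random(seed)
    dim, n = len(embeddings[0]), len(embeddings)
    total = 0.0
    for _ in range(num_directions):
        u = random_unit_vector(dim, rng)
        proj = [sum(e_i * u_i for e_i, u_i in zip(e, u)) for e in embeddings]
        mean = sum(proj) / n
        var = sum((p - mean) ** 2 for p in proj) / n
        total += mean ** 2 + (var - 1.0) ** 2
    return total / num_directions

def total_loss(pred, target, lam_sig=0.1):
    """L = L_MSE + lambda_SIG * regularizer, applied here to the target embeddings."""
    return mse_loss(pred, target) + lam_sig * gaussianity_penalty(target)

rng = random.Random(1)
spread = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(512)]
collapsed = [[0.0] * 4 for _ in range(512)]
print(gaussianity_penalty(spread), gaussianity_penalty(collapsed))
```

A fully collapsed batch makes the MSE term vanish but is penalized by the regularizer, while a well-spread batch incurs almost no penalty. This is the mechanism that discourages the trivial constant solution without negative samples.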

Unlike the energy-aware InfoNCE objective used in the main model, this variant does not use formation-energy labels. Therefore, it provides a fully self-supervised JEPA training paradigm based only on augmentation consistency and collapse prevention. The resulting latent space is visualized in Fig. 9. Interestingly, although formation energy is not explicitly used during training, the learned embeddings still exhibit a partial organization with respect to formation energy. This suggests that energy-related information may be partially recoverable from the structural and compositional patterns preserved by the self-supervised objective.

Nevertheless, we view this SIGReg-based variant as complementary to, rather than a replacement for, the energy-aware Crys-JEPA objective used in the main experiments. The energy-aware objective directly injects formation-energy differences into the latent geometry and is therefore better aligned with our stability-screening goal. By contrast, SIGReg and MSE provide a promising collapse-resistant representation learning framework, but currently lack an explicit mechanism for ordering crystals by formation-energy differences. Future work will investigate how SIGReg-style regularization can be combined with energy supervision to obtain a latent space that is both collapse-resistant and thermodynamically informative.

Appendix D Experimental Details

D.1 Dataset Descriptions

We consider the following two datasets in the numerical experiments:

• MP-20 is a realistic benchmark curated from the Materials Project [20] and introduced by CDVAE [48]. MP-20 contains 45,231 inorganic crystal structures with up to 20 atoms per unit cell, spanning 89 elements. Following CDVAE, we use the standard 60/20/20 split for training, validation, and testing. Since MP-20 consists mostly of experimentally known and globally stable materials, it provides a challenging benchmark for de novo crystal generation.
• Alex-MP-20 is a large-scale dataset of inorganic crystal structures curated in MatterGen [53] by combining data from the Alexandria database [39] and the Materials Project [20]. The dataset includes 607,684 structures with at most 20 atoms per unit cell, ensuring compatibility with standard crystal generation benchmarks. Additional filtering is applied to retain thermodynamically stable or near-stable materials, typically defined as having energy above hull below 0.1 eV/atom, based on density functional theory (DFT) calculations. Materials containing radioactive atoms are removed.

D.2 Evaluation Metric

Our goal in the de novo generation task is to generate valid, stable, unique, and novel materials. These four basic metrics are defined as follows:

• Validity (V). We assess the validity of a crystal in terms of both structure and composition [48]. For structure, a valid crystal should have a volume larger than 0.1 Å³, and the minimal distance among all atom pairs should be larger than 0.5 Å. For composition, we check charge neutrality and electronegativity difference. A crystal is considered valid overall only if it satisfies both criteria.
• Stability (S). Stability measures whether a generated crystal is thermodynamically feasible. For each generated structure, we compute its energy above the convex hull, denoted as $\Delta E$, using a surrogate model (e.g., MLFF) or DFT when available. A structure is considered stable if $\Delta E < \epsilon$, where $\epsilon$ is a small threshold.
• Uniqueness (U). Uniqueness measures the diversity of generated structures by removing duplicates within the generated set. Two generated structures are considered identical if they match under the same structural equivalence criterion.
• Novelty (N). Novelty evaluates whether generated crystals are distinct from the reference dataset. A generated structure is considered non-novel if it matches any structure in the reference set under a structural equivalence criterion (e.g., using StructureMatcher). Otherwise, it is regarded as novel.

This paper focuses mainly on stability and novelty. To further evaluate generation quality, we also report compound metrics that measure the joint satisfaction of multiple criteria. Let $\{\mathbf{C}_{gen}\}_{i=1}^{N}$ denote $N$ generated crystals. For each crystal, we define indicator functions:

$$S_i = \mathbb{I}(E_{\text{hull}}^{(i)} \leq \epsilon), \quad N_i = \mathbb{I}(\mathbf{C}_i \notin \mathcal{D}_{\text{ref}}), \quad U_i = \mathbb{I}(\mathbf{C}_i \text{ is unique in } \mathcal{C}), \quad V_i = \mathbb{I}(\mathbf{C}_i \text{ is valid}). \tag{21}$$

Then, we define

• Stable & Unique & Novel (S.U.N) measures the fraction of generated crystals that are simultaneously stable, unique, and novel: $$\texttt{S.U.N} = \frac{1}{N} \sum_{i=1}^{N} S_i \cdot U_i \cdot N_i. \tag{22}$$
• Valid & Stable & Unique & Novel (V.S.U.N) further incorporates validity, measuring the fraction of samples that satisfy all four criteria: $$\texttt{V.S.U.N} = \frac{1}{N} \sum_{i=1}^{N} V_i \cdot S_i \cdot U_i \cdot N_i. \tag{23}$$
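Eqs. (22) and (23) reduce to counting crystals whose indicator flags are all true. A minimal sketch, assuming the four flags have already been computed for each generated crystal (the `CrystalFlags` container and the example values are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class CrystalFlags:
    valid: bool   # V_i
    stable: bool  # S_i
    unique: bool  # U_i
    novel: bool   # N_i

def sun(flags):
    """Fraction of crystals that are stable, unique, and novel (Eq. 22)."""
    return sum(f.stable and f.unique and f.novel for f in flags) / len(flags)

def vsun(flags):
    """Fraction of crystals that are also valid (Eq. 23)."""
    return sum(f.valid and f.stable and f.unique and f.novel for f in flags) / len(flags)

flags = [
    CrystalFlags(valid=True, stable=True, unique=True, novel=True),
    CrystalFlags(valid=False, stable=True, unique=True, novel=True),
    CrystalFlags(valid=True, stable=True, unique=True, novel=False),
    CrystalFlags(valid=True, stable=False, unique=True, novel=True),
]
print(sun(flags), vsun(flags))
```

Because V.S.U.N multiplies in one more indicator, it can never exceed S.U.N on the same set of crystals.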

These compound metrics explicitly characterize the trade-off between stability and novelty, which is often overlooked when reporting marginal metrics independently.

D.3Density Functional Theory Settings

We use DFT settings from Materials Projecthttps://docs.materialsproject.org/methodology/materials-methodology/calculation-details/gga+u-calculations/parameters-and-convergencefor structure relaxation and energy computation. In particular, we do GGA and GGA+U calculations withatomate2.vasp.flows.mp. MPGGADoubleRelaxStaticMaker[15], which in turn relies onpymatgen.io.vasp.sets.MPRelaxSetandpymatgen.io.vasp.sets.MPStaticSet[35]. Computations themselves were done with VASP[25]version 5.4.4. with the plane-wave basis set[25]. The electron-ion interaction is described by the projector augmented wave (PAW) pseudo-potentials[26]. The exchange-correlation of valence electrons is treated with the Perdew-Burke-Ernzerhof (PBE) functional within the generalized gradient approximation (GGA)[38]. The raw total energies computed by DFT were corrected withMaterialsProject2020Compatibilitybefore putting into thePhaseDiagramto obtain the DFTEhullE_{\text{hull}}.

We first run DFT relaxation directly on the generated crystal structures. For crystals whose DFT calculation fails, we follow previous studies [33, 23], pre-relax them with MatterSim-v1-1M [52], and then rerun DFT.
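The two-stage procedure can be sketched as simple control flow. In the snippet below, `dft_relax` and `pre_relax` are hypothetical callables standing in for the DFT workflow and the MatterSim pre-relaxation. Their signatures, and the convention that a failed DFT run returns `None`, are assumptions for illustration only.

```python
from typing import Callable, Optional, Tuple, TypeVar

S = TypeVar("S")  # structure type
R = TypeVar("R")  # DFT result type


def relax_with_fallback(
    structure: S,
    dft_relax: Callable[[S], Optional[R]],
    pre_relax: Callable[[S], S],
) -> Tuple[Optional[R], bool]:
    """Run DFT directly; on failure, pre-relax the structure and retry once.

    Returns the DFT result (None if both attempts fail) and a flag that
    records whether the pre-relaxation fallback was used.
    """
    result = dft_relax(structure)
    if result is not None:
        return result, False
    return dft_relax(pre_relax(structure)), True


# Stub implementations so that the sketch runs without any DFT code.
def fake_dft(structure: dict) -> Optional[dict]:
    if structure["hard"] and not structure["pre_relaxed"]:
        return None  # simulate a failed DFT calculation
    return {"id": structure["id"], "energy": -1.0}


def fake_pre_relax(structure: dict) -> dict:
    return dict(structure, pre_relaxed=True)


easy = {"id": "a", "hard": False, "pre_relaxed": False}
hard = {"id": "b", "hard": True, "pre_relaxed": False}
print(relax_with_fallback(easy, fake_dft, fake_pre_relax))
print(relax_with_fallback(hard, fake_dft, fake_pre_relax))
```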

D.4 Distance based on Fingerprints

D.4.1 CrystalNN and Magpie Fingerprints

CrystalNN fingerprints [56] are local structural descriptors implemented in matminer [47], based on the neighbor-finding algorithm CrystalNN proposed in pymatgen [35]. For each atomic site, CrystalNN identifies neighboring atoms using a combination of distance-based criteria and Voronoi-like weighting, yielding a robust coordination environment even in distorted structures. Based on the identified neighbors, the fingerprint encodes features such as coordination number distributions, local bonding environments, and weighted neighbor statistics. These site-level features are then aggregated (e.g., mean, variance) across all atoms in the unit cell to obtain a fixed-length vector representation for the entire crystal. CrystalNN fingerprints primarily capture local geometric and coordination information, making them effective for representing structural environments.

Magpie fingerprints [46] are composition-based descriptors available in matminer and originally proposed in the Magpie framework. Given only the chemical composition of a material, Magpie computes statistical summaries of elemental properties, including atomic number, electronegativity, atomic radius, melting temperature, and valence electron counts. For each property, Magpie derives aggregate statistics such as mean, variance, minimum, maximum, and range. These statistics form a fixed-length feature vector describing the composition. Magpie fingerprints capture global compositional characteristics but do not encode explicit structural information.
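The aggregation step behind composition-based descriptors can be illustrated for a single elemental property. The sketch below is not the matminer implementation. The function name `composition_stats` and the choice of fraction-weighted mean and variance are assumptions, and only Pauling electronegativity for Na and Cl is included as example data.

```python
from typing import Dict

# Pauling electronegativities for the two elements used in the example.
ELECTRONEGATIVITY: Dict[str, float] = {"Na": 0.93, "Cl": 3.16}


def composition_stats(
    amounts: Dict[str, float], prop: Dict[str, float]
) -> Dict[str, float]:
    """Summary statistics of one elemental property over a composition."""
    total = sum(amounts.values())
    if total <= 0:
        raise ValueError("composition must contain a positive amount")
    # Normalize stoichiometric amounts to atomic fractions.
    fractions = {el: amt / total for el, amt in amounts.items()}
    values = [prop[el] for el in fractions]

    mean = sum(frac * prop[el] for el, frac in fractions.items())
    variance = sum(
        frac * (prop[el] - mean) ** 2 for el, frac in fractions.items()
    )
    return {
        "mean": mean,
        "variance": variance,
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


# NaCl: one Na and one Cl per formula unit.
stats = composition_stats({"Na": 1, "Cl": 1}, ELECTRONEGATIVITY)
print(stats)
```

Repeating this for every elemental property and concatenating the results gives a fixed-length vector that depends only on composition, which is why such a descriptor cannot distinguish polymorphs of the same formula.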

D.4.2 Definition of Distance

To quantify the distance between $\{\mathbf{C}_{gen}\}$ and $\{\mathbf{C}_{gt}\}$ as in Section 3, we use a fingerprint-based distance inspired by the Precision metric [48], computed as follows:

• For each crystal $\mathbf{C}_{i}\in\{\mathbf{C}_{gen}\}$, we compute two vectors, $FP_{i}^{str}\in\mathbb{R}^{132}$ and $FP_{i}^{com}\in\mathbb{R}^{61}$. The former is the CrystalNN fingerprint [56], which encodes structural information, and the latter is the normalized Magpie fingerprint [46], which describes the elemental composition.
• Compute the structure distance $\mathcal{D}_{i}^{str}=\min_{\mathbf{C}_{j}\in\{\mathbf{C}_{gt}\}}\|FP_{i}^{str}-FP_{j}^{str}\|^{2}$ and the composition distance $\mathcal{D}_{i}^{com}=\min_{\mathbf{C}_{k}\in\{\mathbf{C}_{gt}\}}\|FP_{i}^{com}-FP_{k}^{com}\|^{2}$.
• Get the overall distance $\mathcal{D}_{i}=\mathcal{D}_{i}^{str}/\alpha+\mathcal{D}_{i}^{com}/\beta$, where $\alpha=\max_{\mathbf{C}_{j}\in\{\mathbf{C}_{gen}\}}\mathcal{D}_{j}^{str}$ and $\beta=\max_{\mathbf{C}_{k}\in\{\mathbf{C}_{gen}\}}\mathcal{D}_{k}^{com}$ are used for normalization.

The distance $\mathcal{D}_{i}$ jointly captures structural and compositional deviation from the training set.
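The three steps above can be sketched with the standard library alone. The function names and the tiny two-dimensional and one-dimensional toy fingerprints are assumptions; real inputs would be the 132-dimensional CrystalNN and 61-dimensional normalized Magpie vectors. The guard against a zero normalizer is also an added assumption, since the degenerate case is not discussed in the text.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two fingerprints."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def min_sq_dist(v: Vector, references: Sequence[Vector]) -> float:
    """Squared distance from v to its nearest reference fingerprint."""
    return min(sq_dist(v, r) for r in references)


def fingerprint_distances(
    gen_str: Sequence[Vector],
    gen_com: Sequence[Vector],
    gt_str: Sequence[Vector],
    gt_com: Sequence[Vector],
) -> List[float]:
    """Overall distance D_i for every generated crystal."""
    # Nearest training crystal is searched separately per fingerprint type.
    d_str = [min_sq_dist(v, gt_str) for v in gen_str]
    d_com = [min_sq_dist(v, gt_com) for v in gen_com]

    # Normalizers: the largest distance observed among generated crystals.
    alpha, beta = max(d_str), max(d_com)

    def scaled(d: float, norm: float) -> float:
        return d / norm if norm > 0 else 0.0  # assumed handling of norm == 0

    return [scaled(ds, alpha) + scaled(dc, beta) for ds, dc in zip(d_str, d_com)]


# Toy fingerprints for two generated and two training crystals.
distances = fingerprint_distances(
    gen_str=[[0.0, 0.0], [3.0, 4.0]],
    gen_com=[[1.0], [2.0]],
    gt_str=[[0.0, 1.0], [3.0, 0.0]],
    gt_com=[[0.0], [4.0]],
)
print(distances)
```

Note that the nearest training crystal in structure space and the nearest one in composition space may differ, as the independent indices $j$ and $k$ in the definition indicate. After normalization each term lies in $[0, 1]$, so $\mathcal{D}_{i}$ lies in $[0, 2]$.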

D.5 Operating Environment

Our code runs in the following environment:

• Operating system: Linux version 6.8.0-63-generic
• CPU information: AMD EPYC 9554 64-Core Processor
• GPU information: NVIDIA Corporation AD102GL [L40S]

Appendix E Related Work

Generation on Scientific Discovery

Recent advances in deep generative models, particularly diffusion-based approaches, have significantly accelerated progress in scientific discovery, including molecule and material design. AlphaFold3 [1] leverages diffusion to enable accurate all-atom biomolecular complex generation. GeoLDM [51] demonstrates the effectiveness of diffusion models for scientific data by capturing 3D geometric structures and producing physically consistent molecular samples. Graph-GRPO [55] further advances molecule generation by aligning graph flow models with task-specific objectives through reinforcement learning [43].

In the domain of crystal generation, CDVAE [48], FlowMM [33], and ADiT [23] introduce variational, flow matching, and latent diffusion paradigms, respectively. DiffCSP [21] proposes CSPNet, which has become a widely adopted equivariant denoising backbone. Subsequent works, such as DiffCSP++ [22], SGEquiDiff [8], and SymmCD [28], incorporate physical priors, including space group constraints and crystallographic symmetry, into the generation process. MatterGen [53] explores conditional diffusion for inverse material design based on target properties and symmetry. More recently, several studies integrate large language models with crystal generation [16, 42, 24].

Overall, prior work primarily focuses on designing more advanced generative architectures and objectives. In contrast, our work shifts the focus toward improving the quality of generated samples via a screening-and-refinement pipeline. The substantial improvements achieved even with a simple base model highlight the general applicability of our approach across different generative frameworks.

Joint-Embedding Predictive Architecture

Joint-Embedding Predictive Architecture (JEPA) [27] is a self-supervised learning paradigm that learns representations by predicting target embeddings from context embeddings in latent space, instead of reconstructing raw inputs. This design encourages the model to capture high-level semantic information while discarding irrelevant details.

JEPA has been successfully extended to multiple domains. I-JEPA [2] learns semantic image representations by predicting masked regions from partial context. V-JEPA [4] scales this paradigm to video data through temporal prediction. VL-JEPA [9] aligns visual and textual modalities in a shared embedding space, while LLM-JEPA [19] extends the framework to language and code modeling. LeJEPA [3] further introduces regularization techniques to stabilize representation learning and prevent collapse. Additionally, LeWorldModel [30] connects JEPA with world modeling by predicting future latent states.

In contrast to these works, which primarily focus on representation learning, we explore the application of JEPA to scientific data, specifically crystal modeling. Our work demonstrates that JEPA can serve as an effective surrogate for energy-aware comparison and play a crucial role in generative screening.

Transformer

Transformers [45] have become a general-purpose architecture for modeling structured data due to their ability to capture global dependencies via self-attention. Representative variants include LLaMA [44], which introduces architectural improvements for large-scale autoregressive generation, and Vision Transformer (ViT) [12], which extends Transformers to image modeling and enables advances in multimodal learning. Transformers have also been adapted for generative modeling, such as Diffusion Transformers (DiT) [37], which provide a strong backbone for diffusion models. In graph domains [54], Graph Transformer (GT) [13] first introduces self-attention to graph data, while subsequent works address its limitations. For example, CoBFormer [49] is the first to reveal the over-globalization issue and enhances local inductive biases with theoretical guarantees, and Specformer [6] makes the first attempt to learn eigenvalue interactions with a Transformer, prompting the fusion of attention with the spectral domain.

In this work, we adopt a standard Transformer architecture as the backbone. Exploring more advanced Transformer variants to further improve Crys-JEPA remains an important direction for future work.