CrystalReasoner: Reasoning and RL for Property-Conditioned Crystal Structure Generation

arXiv cs.AI Papers

Summary

CrystalReasoner is an LLM framework that generates crystal structures from natural language by using physical priors as thinking tokens and reinforcement learning to ensure validity, stability, and property-conditioned generation.

arXiv:2605.14344v1 Announce Type: new Abstract: Generative modeling has emerged as a promising approach for crystal structure discovery. However, existing LLM-based generative models struggle with low-level atomic precision, while diffusion-based methods fall short in integrating high-level scientific knowledge. As a result, generated structures are often invalid, unstable, or do not possess desirable properties. To address this gap, we propose CrystalReasoner (CrysReas), an end-to-end LLM framework that generates crystal structures from natural language instructions through reasoning and alignment. CrysReas introduces physical priors as thinking tokens, which include crystallographic symmetry, local coordination environments and predicted physical properties before generating atomic coordinates. This bridges the gap between natural language and 3D structures. CrysReas then employs reinforcement learning (RL) with a multi-objective, dense reward function to align generation with physical validity, chemical consistency, and thermodynamic stability. For property-conditioned tasks, we design task-specific reward functions and train specialized models for discrete constraints (e.g., space group) and continuous properties (e.g., elasticity, thermal expansion). Empirical results demonstrate that compared to prior works and baselines without thinking traces or RL, CrysReas obtains better performance on diverse metrics, triples S.U.N. ratio, and achieves better performance for property conditioned generation. CrysReas also exhibits adaptive reasoning, increasing reasoning lengths as the number of atoms increases. Our work demonstrates the potential of leveraging thinking traces and RL for generating valid, stable, and property-conditioned crystal structures. Please see our work at https://crystalreasoner.github.io/ .


# CrystalReasoner: Reasoning and RL for Property-Conditioned Crystal Structure Generation
Source: [https://arxiv.org/html/2605.14344](https://arxiv.org/html/2605.14344)
Yuyang Wu¹ (Tsinghua University, Beijing, China; yy-wu23@mails.tsinghua.edu.cn); Stefano Falletta² (Radical AI; sfalletta@radical-ai.com); Delia McGrath² (Radical AI; dmcgrath@radical-ai.com); Sherry Yang³ (New York University, New York, NY, USA; sherryyang@nyu.edu)

###### Abstract

Generative modeling has emerged as a promising approach for crystal structure discovery\. However, existing LLM\-based generative models struggle with low\-level atomic precision, while diffusion\-based methods fall short in integrating high\-level scientific knowledge\. As a result, generated structures are often invalid, unstable, or do not possess desirable properties\. To address this gap, we propose CrystalReasoner \(CrysReas\), an end\-to\-end LLM framework that generates crystal structures from natural language instructions through reasoning and alignment\. CrysReas introduces physical priors as thinking tokens, which include crystallographic symmetry, local coordination environments and predicted physical properties before generating atomic coordinates\. This bridges the gap between natural language and 3D structures\. CrysReas then employs reinforcement learning \(RL\) with a multi\-objective, dense reward function to align generation with physical validity, chemical consistency, and thermodynamic stability\. For property\-conditioned tasks, we design task\-specific reward functions and train specialized models for discrete constraints \(e\.g\., space group\) and continuous properties \(e\.g\., elasticity, thermal expansion\)\. Empirical results demonstrate that compared to prior works and baselines without thinking traces or RL, CrysReas obtains better performance on diverse metrics, triples S\.U\.N\. ratio, and achieves better performance for property conditioned generation\. CrysReas also exhibits adaptive reasoning, increasing reasoning lengths as the number of atoms increases\. Our work demonstrates the potential of leveraging thinking traces and RL for generating valid, stable, and property\-conditioned crystal structures\. Please see our work at https://crystalreasoner\.github\.io/ \.

## 1 Introduction

Modern technologies increasingly rely on the development of new materials, such as solid-state electrolytes for batteries (Zhao et al., [2020](https://arxiv.org/html/2605.14344#bib.bib9)), high-performance catalysts (Goldsmith et al., [2018](https://arxiv.org/html/2605.14344#bib.bib10)), and functional semiconductors (Davies et al., [2018](https://arxiv.org/html/2605.14344#bib.bib11)). Traditional computational methods for crystal structure discovery, such as random search (Pickard and Needs, [2011](https://arxiv.org/html/2605.14344#bib.bib44)) and particle swarm optimization (Wang et al., [2010](https://arxiv.org/html/2605.14344#bib.bib46)), are computationally intensive due to explicit energy evaluation in each search iteration. In contrast, generative models offer a scalable alternative by bypassing the costly search and energy evaluation steps (De Breuck et al., [2025](https://arxiv.org/html/2605.14344#bib.bib45)).

Despite the progress in generative modeling, existing generative models for crystal structures are limited. For example, diffusion-based models (Yang et al., [2023](https://arxiv.org/html/2605.14344#bib.bib1); Xie et al., [2021](https://arxiv.org/html/2605.14344#bib.bib20); Chen et al., [2025](https://arxiv.org/html/2605.14344#bib.bib26); Jiao et al., [2023](https://arxiv.org/html/2605.14344#bib.bib32), [2024](https://arxiv.org/html/2605.14344#bib.bib33); Kelvinius et al., [2025](https://arxiv.org/html/2605.14344#bib.bib34); Joshi et al., [2025](https://arxiv.org/html/2605.14344#bib.bib47)), which operate in the 3D structure space or a latent space, cannot easily integrate rich textual knowledge (e.g., compositions, properties from textbooks). To incorporate scientific knowledge, some works (Yang et al., [2024c](https://arxiv.org/html/2605.14344#bib.bib6); Inizan et al., [2025](https://arxiv.org/html/2605.14344#bib.bib4); Khastagir et al., [2025](https://arxiv.org/html/2605.14344#bib.bib48)) use LLMs to generate chemical formulas and then use diffusion to generate structures conditioned on those formulas. However, these decoupled architectures separate semantic reasoning and structural generation into distinct modules, preventing end-to-end training and joint optimization.

On the other hand, finetuning LLMs to directly generate crystallographic information files (CIFs) holds great promise for integrating scientific knowledge, as most LLMs are pretrained on science textbooks. However, recent attempts (Antunes et al., [2024](https://arxiv.org/html/2605.14344#bib.bib3); Gruver et al., [2024](https://arxiv.org/html/2605.14344#bib.bib5); Mohanty et al., [2026](https://arxiv.org/html/2605.14344#bib.bib30); Gan et al., [2025](https://arxiv.org/html/2605.14344#bib.bib31); Xu et al., [2025](https://arxiv.org/html/2605.14344#bib.bib25)) face a critical challenge: the LLM tokenizer flattens 3D coordinates into strings, losing symmetry and spatial constraints, which results in low space-group accuracy (e.g., 24% in CrystalTextLLM). Furthermore, LLM-based approaches generally suffer from a lack of precision in generated atom locations, and they lack mechanisms to enforce physical validity, stability, and property conditioning in the generated structures.

To address this gap, we draw insights from the development of LLMs around reasoning and RL alignment with verifiable feedback\. We propose CrystalReasoner \(CrysReas\), an end\-to\-end framework that converts high\-level textual instructions into high\-fidelity low\-level crystal structures through reasoning and alignment, as shown in Figure[1](https://arxiv.org/html/2605.14344#S1.F1)\. First, CrysReas is finetuned to generate physical priors as thinking traces before outputting atomic coordinates, following an abstract\-to\-concrete progression through reasoning about crystallographic symmetry, local coordination environments, and predicted properties \(e\.g\., structure volume, formation energy\)\. By introducing symbolic representations of the 3D structure through text, LLMs can first reason about 3D structure before generating the structure itself, making structure generation more tractable\.

Second, to improve the precision of the generated atom locations, we apply RL with a carefully designed multi-objective dense reward function covering physical validity, chemical validity, and thermodynamic stability, guiding generation toward valid, low-energy configurations. To enable property-conditioned generation, CrysReas employs RL with property-specific rewards, supporting optimization with respect to both discrete constraints (e.g., space group) and continuous properties (e.g., elasticity, thermal expansion) calculated using surrogate MLIPs (Yang et al., [2024b](https://arxiv.org/html/2605.14344#bib.bib15)). By combining stability rewards with property-specific objectives, CrysReas can be specialized for diverse material design scenarios without architectural modifications.

![Refer to caption](https://arxiv.org/html/2605.14344v1/x1.png)

Figure 1: Overview of the CrystalReasoner pipeline. An LLM is finetuned to first generate thinking traces in an abstract-to-concrete manner before outputting atomic coordinates. A multi-objective dense reward is used for RL (GRPO) alignment. The model can be used for formula-conditioned generation, and can be further specialized with property-specific rewards for property-conditioned generation.

Our evaluation shows that CrysReas consistently outperforms model variants without thinking or RL in generating valid and low-energy structures, as verified by Density Functional Theory (DFT) calculations (Hohenberg and Kohn, [1964](https://arxiv.org/html/2605.14344#bib.bib35); Kresse and Furthmüller, [1996](https://arxiv.org/html/2605.14344#bib.bib36)). Furthermore, CrysReas triples the stable, unique, and novel (S.U.N.) discovery ratio compared to previous LLM-based crystal generation approaches. Notably, CrysReas also exhibits adaptive reasoning, increasing reasoning lengths as the number of atoms increases. For property-conditioned generation, we find that RL against elasticity and thermal expansion consistently improves the chance that the generated structures fall into the specified range of these properties.

In summary, our contributions are fourfold:

1. Physical Priors as Thinking Tokens: A novel strategy that instructs the LLM to generate explicit physical priors before atomic coordinates, improving 3D reasoning.
2. RL Global Alignment: An RL framework with a multi-objective dense reward, improving numerical precision and guiding generated structures toward thermodynamic equilibrium.
3. Task-Specialized Property Conditioning: Individual reward designs for property-conditioned generation without requiring architectural modifications.
4. Overall Better Performance: Compared to prior works and baselines, CrysReas achieves superior performance across diverse metrics, triples the S.U.N. discovery ratio, and improves property-conditioned generation quality.

## 2 Preliminaries

In this section, we define notations and provide background on LLMs for crystal structure generation and RL for LLMs\.

### 2.1 LLMs for Crystal Structure Generation

Following prior work (Gruver et al., [2024](https://arxiv.org/html/2605.14344#bib.bib5); Antunes et al., [2024](https://arxiv.org/html/2605.14344#bib.bib3)), we formulate crystal structure generation as token-sequence generation with an LLM $\pi_{\theta}$. Given a natural language description $c$ (e.g., formula, space group), the LLM autoregressively generates a token sequence $a_{0:N}$ representing lattice parameters and atomic coordinates:

$$\pi_{\theta}(a_{0:N} \mid c) = \prod_{t=0}^{N} P(a_{t} \mid a_{<t}, c)$$

After training, the generated structures are evaluated by validity checkers or MLIPs on multiple metrics, including the structural validity $R_{\text{structural}}$ (satisfying geometric constraints), chemical validity $R_{\text{chemical}}$ (oxidation states consistent with electroneutrality), composition consistency $R_{\text{consistency}}$ (following user constraints), and thermodynamic stability $R_{\text{stability}}$ of the generated structures.

### 2.2 RL for Language Models

RL has been an effective technique for refining LLMs, ensuring the models are specifically optimized for targeted objectives such as human preferences (Ouyang et al., [2022](https://arxiv.org/html/2605.14344#bib.bib2)) or verifiable rewards (Shao et al., [2024](https://arxiv.org/html/2605.14344#bib.bib19)). Among different RL algorithms, Group Relative Policy Optimization (GRPO) (Shao et al., [2024](https://arxiv.org/html/2605.14344#bib.bib19)) has proven useful in domains that require long thinking traces, most notably mathematical reasoning. GRPO is a policy gradient RL algorithm for LLMs that eliminates the need for a critic network by comparing multiple outputs sampled from the same input. For each input $c$, it samples $G$ candidate outputs $\{a_{i}\}$, each receiving a reward $R_{i}$, and optimizes a clipped objective with KL regularization to a reference policy:

$$\mathcal{J}(\theta) = \mathbb{E}_{c \sim \mathcal{D},\, \{a_{i}\} \sim \pi_{\theta}} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \mathcal{L}_{\text{clip},i}(\theta) - \beta\, \mathbb{D}_{\text{KL}}\left( \pi_{\theta}(\cdot \mid c) \,\|\, \pi_{\text{ref}}(\cdot \mid c) \right) \right) \right]$$

where $\mathcal{L}_{\text{clip},i}(\theta)$ is the standard PPO-style clipped surrogate objective (Schulman et al., [2017](https://arxiv.org/html/2605.14344#bib.bib29)) adapted with GRPO's group-relative advantage, using the normalized rewards $\tilde{R}_{i} = \frac{R_{i} - \text{mean}(R)}{\text{std}(R)}$ within the group of $G$ samples.
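The group-relative normalization and the clipped term can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the use of the population standard deviation, the stabilizer `eps`, and the clipping width `clip_eps=0.2` are assumptions made for illustration, and the KL penalty is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of G samples drawn for the same input c.

    eps is an assumed stabilizer that avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped term for one sample.

    ratio is pi_theta(a|c) / pi_old(a|c); clip_eps=0.2 is an assumed default.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Toy group of G = 4 scalar rewards for one prompt.
rewards = [0.0, 1.0, 2.0, 3.0]
advantages = group_relative_advantages(rewards)
```

Because the advantage is computed only from rewards inside the group, no learned value function is needed, which is the property the text attributes to GRPO.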

## 3 Method

In this section, we introduce core methods for addressing limitations of LLMs in generating physically plausible crystal structures, including embedding progressive thinking tokens to reason between high\-level physical properties and low\-level atomic coordinates \(Section[3\.1](https://arxiv.org/html/2605.14344#S3.SS1)\), and designing an RL framework for validity optimization \(Section[3\.2](https://arxiv.org/html/2605.14344#S3.SS2)\) and property\-conditioned generation \(Section[3\.3](https://arxiv.org/html/2605.14344#S3.SS3)\)\.

### 3.1 Enable High-Level to Low-Level Thinking

Treating 3D lattice coordinates as discrete 1D tokens obscures the implicit structural dependencies and periodic symmetries inherent in crystals\. Therefore, LLMs often violate physical constraints when generating crystal structures directly, resulting in poor physical validity\. We address this problem by embedding thinking traces as physical priors before the final crystal structure, enabling LLMs to reason about the connection between high\-level physical information and low\-level atomic coordinates\.

#### Progressive Reasoning\.

It is natural for humans to reason progressively from high-level concepts (e.g., space groups) to low-level properties (e.g., structure volume), and pre-trained LLMs also learn this pattern from large-scale text. Therefore, to more effectively leverage the LLM's language capabilities for 3D structure generation, we embed progressive thinking tokens as physical priors before atomic coordinates in the training data, as illustrated in Figure [2](https://arxiv.org/html/2605.14344#S3.F2). This design also uses the resulting intermediate physical priors to constrain the token search space, significantly increasing the probability of producing structurally plausible lattices. The thinking tokens contain three parts that progress from abstract to concrete: the model first determines the abstract symmetries (e.g., space group), then describes local atomic environments (e.g., connectivity, bond length distribution), and finally reasons about the concrete expected physical properties (e.g., structure volume, formation energy). To synthesize such thinking tokens in the training data, we generate the first and third parts using fixed rules, and copy the second part directly from Robocrystallographer (Ganose and Jain, [2019](https://arxiv.org/html/2605.14344#bib.bib8)). Additional details and examples for thinking traces can be found in Appendix [B](https://arxiv.org/html/2605.14344#A2).
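To make the abstract-to-concrete ordering concrete, the following Python sketch assembles one training target from the three thinking segments followed by the structure. It is purely illustrative: the `<think>` delimiters, the function name, and the placeholder strings are hypothetical and are not taken from the paper, whose actual trace format is given in Appendix B.

```python
def build_training_target(symmetry, local_env, properties, cif_text):
    """Concatenate the three thinking segments, abstract to concrete, before the CIF.

    The <think> ... </think> delimiters are a hypothetical choice for this sketch.
    """
    segments = [symmetry.strip(), local_env.strip(), properties.strip()]
    thinking = "\n".join(segments)
    return f"<think>\n{thinking}\n</think>\n{cif_text.strip()}"


# Placeholder content only; real segments would come from fixed rules
# (parts one and three) and from Robocrystallographer (part two).
example = build_training_target(
    symmetry="Symmetry: (placeholder space group)",
    local_env="Local environments: (placeholder description)",
    properties="Predicted properties: (placeholder volume, formation energy)",
    cif_text="(placeholder simplified CIF)",
)
```

Placing the structure last means every coordinate token is generated with the symmetry, coordination, and property statements already in context.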

![Refer to caption](https://arxiv.org/html/2605.14344v1/x2.png)

Figure 2: LLMs are required to generate thinking tokens before outputting atomic coordinates. The first part encodes abstract physical knowledge (e.g., formula, space group). The second characterizes local coordination environments (e.g., bond length distribution). The last part reasons about physical properties (e.g., structural volume, stability, and electronic properties). Finally, the model outputs the crystal structure in a simplified Crystallographic Information File similar to that in CrystalTextLLM (Gruver et al., [2024](https://arxiv.org/html/2605.14344#bib.bib5)).

### 3.2 RL for Validity and Stability Optimization

Although thinking traces help LLMs to provide high\-level physical priors and improve the validity, as stochastic models, LLMs still suffer from numerical imprecision, requiring final alignment for precise atom location generation\. Moreover, thinking traces only provide supplementary reasoning information, but do not guarantee that generated structures conform to the specifications in the reasoning trace\.

To bridge this gap, we propose joint optimization of the thinking trace and generated structures through RL with verifiable feedback. We design multi-objective and dense reward signals that not only push atomic arrangements to comply with crystallographic symmetries and physical validity, but also steer crystal structures to lie near or below the convex hull.

#### RL for Jointly Optimizing Thinking Trace and Crystal Structure\.

As discussed in Section [2.2](https://arxiv.org/html/2605.14344#S2.SS2), GRPO is capable of working with only a scalar reward per output and handling long reasoning traces. We therefore apply it to crystal structure generation, which allows us to directly optimize the thinking tokens introduced in Section [3.1](https://arxiv.org/html/2605.14344#S3.SS1) based solely on the final reward of the generated structure. This approach requires evaluating only the final generated structure rather than intermediate tokens, while still enabling the thinking trace to be refined through policy gradients.

#### Multi\-Objective Reward for Validity and Stability\.

It is desirable to align structure generation so that generated structures are both valid and stable\. To achieve this, we design a multi\-objective reward function as follows:

$$R_{\text{target}} = \alpha_{\text{validity}} R_{\text{validity}} + \alpha_{\text{stability}} \mathbf{1}_{\text{validity}} R_{\text{stability}} \tag{1}$$

Here $R_{\text{validity}} = R_{\text{instruction}} + R_{\text{structural}} + R_{\text{chemical}}$, where $R_{\text{instruction}}$ is a binary reward for following the target composition, and $R_{\text{structural}}$ and $R_{\text{chemical}}$ come from validity checkers. $R_{\text{stability}}$ quantifies energetic favorability via the energy above the hull ($E_{\text{hull}}$), is calculated by MLIPs, and contributes only when $\mathbf{1}_{\text{validity}} = 1$, i.e., basic validity holds. We set $\alpha_{\text{validity}} \ll \alpha_{\text{stability}}$ because $R_{\text{stability}}$ is continuous and offers more room for improvement than binary rewards that saturate quickly. This makes stability the primary reward, and its dependence on $E_{\text{hull}}$ preserves sensitivity to atomic changes. Different weight settings are explored in experiments, and additional details (e.g., the exact formulation) are in Appendix [C](https://arxiv.org/html/2605.14344#A3).
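The gating structure of Equation (1) can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's code: the function and argument names are hypothetical, the three validity terms are assumed to be 0/1 checks, the indicator is assumed to require all three checks to pass, and `r_stability` is taken as an input because its exact form is deferred to Appendix C. The default weights of 1 and 10 mirror the Mixed Reward setting reported in Section 4.3.

```python
def target_reward(follows_composition, structurally_valid, chemically_valid,
                  r_stability, alpha_validity=1.0, alpha_stability=10.0):
    """Sketch of Equation (1): weighted validity plus validity-gated stability."""
    # R_validity = R_instruction + R_structural + R_chemical (each assumed binary).
    r_validity = (float(follows_composition)
                  + float(structurally_valid)
                  + float(chemically_valid))
    # Indicator 1_validity: assumed here to require every check to pass.
    gate = 1.0 if (follows_composition and structurally_valid
                   and chemically_valid) else 0.0
    return alpha_validity * r_validity + alpha_stability * gate * r_stability


# A fully valid structure earns the stability term; an invalid one does not.
valid_case = target_reward(True, True, True, r_stability=0.5)
invalid_case = target_reward(True, False, True, r_stability=0.5)
```

The gate keeps the stability term from rewarding structures whose energies are not meaningful because a basic validity check has failed.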

### 3.3 RL for Property-Conditioned Crystal Structure Generation

#### Range Constrained Property Optimization\.

Beyond validity and stability, it is crucial to support generation conditioned on target properties (e.g., low-temperature conductivity) for target-driven material design. We categorize conditioning tasks into two families: discrete symmetry constraints (e.g., space group) and continuous property variables (e.g., elasticity, thermal expansion). For discrete symmetry constraints, we use standard binary indicator rewards. For continuous properties, targeting an exact scalar property value is difficult due to generation and prediction noise. Therefore, we reformulate continuous property-conditioned generation as a range-constraint problem. Specifically, a user of CrysReas can specify a target property range $P_{\text{specified}} = [L, R]$ in the input, and the objective is to enforce that the properties of the generated structures, $P_{\text{generated}}$, fall into this range.

We design a bounded dense reward $R_{\text{range}}(P_{\text{generated}}, P_{\text{specified}} = [L, R])$ in Appendix [C](https://arxiv.org/html/2605.14344#A3) that outputs values in $[-1, 1]$. $R_{\text{range}}$ is positive when $P_{\text{generated}} \in P_{\text{specified}}$ and negative otherwise, with its maximum at $\frac{L+R}{2}$ (chosen for convenience, without physical preference). This choice provides a single, unambiguous target within the interval, avoiding a flat reward plateau that would weaken learning signals.
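One function with the stated properties (bounded in $[-1, 1]$, positive inside the range, negative outside, peaked at the midpoint) can be sketched as follows. The exact formulation is in Appendix C and is not reproduced here; the piecewise linear and `tanh` shape below, along with the function and argument names, is a hypothetical stand-in chosen only to satisfy those properties.

```python
import math


def range_reward(value, low, high):
    """Hypothetical bounded dense reward for a target range [low, high]."""
    if high <= low:
        raise ValueError("expected low < high")
    half_width = (high - low) / 2.0
    midpoint = (low + high) / 2.0
    distance = abs(value - midpoint)
    if distance <= half_width:
        # Inside the range: 1.0 at the midpoint, falling linearly to 0.5 at the bounds.
        return 1.0 - 0.5 * distance / half_width
    # Outside the range: negative, approaching -1 as the value moves further away.
    excess = (distance - half_width) / half_width
    return -math.tanh(excess)


# Target range [0, 10]: midpoint, boundary, and an out-of-range value.
at_midpoint = range_reward(5.0, 0.0, 10.0)
at_boundary = range_reward(0.0, 0.0, 10.0)
outside = range_reward(15.0, 0.0, 10.0)
```

In this sketch the reward never flattens inside the interval, so two in-range candidates still receive different signals, which matches the motivation given for peaking at the midpoint.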

Table 1: Comparison of our model CrysReas to our implementations of prior works, including PLAID++ Wyckoff Base and CrystalTextLLM. Our model achieves the best overall performance.

#### Reward Combining Stability and Property Conditioning\.

We can further combine the property reward with the stability reward to ensure that the generated structures not only follow the specified properties but are also likely to be stable. We formulate this target reward as

$$R_{\text{target}} = \mathbf{1}_{\text{valid}} \cdot R_{\text{stability}} + \beta \cdot R_{\text{property}} \tag{2}$$

The property reward component $R_{\text{property}}$ is defined separately for each task. For tasks requiring specific structural symmetries, $R_{\text{property}}$ is a binary indicator that yields $1$ if the generated structure belongs to the target space group and $0$ otherwise. For conditioning on elastic properties, the model targets specific ranges for the bulk modulus $K$ and the shear modulus $G$, where $R_{\text{property}} = R_{\text{range}}(K) + R_{\text{range}}(G)$. For tasks conditioning on thermal expansion, the model targets the volumetric thermal expansion coefficient $\alpha$, where $R_{\text{property}} = R_{\text{range}}(\alpha)$. We use MatterSim (Yang et al., [2024b](https://arxiv.org/html/2605.14344#bib.bib15)) for property calculations as described in Section [3.2](https://arxiv.org/html/2605.14344#S3.SS2).
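The way Equation (2) composes the task-specific terms can be sketched in Python. The names below are hypothetical, the per-property range rewards are passed in as precomputed numbers, and `beta=1.0` is a placeholder since the weight is not stated in this section.

```python
def space_group_reward(generated_sg, target_sg):
    """Binary indicator: 1 if the generated space group matches the target."""
    return 1.0 if generated_sg == target_sg else 0.0


def property_target_reward(is_valid, r_stability, property_rewards, beta=1.0):
    """Sketch of Equation (2): validity-gated stability plus weighted property reward.

    property_rewards holds one term per conditioned property, e.g. the range
    rewards for bulk and shear modulus, or a single space-group indicator.
    """
    r_property = sum(property_rewards)
    gate = 1.0 if is_valid else 0.0
    return gate * r_stability + beta * r_property


# Elasticity: two range rewards (for K and G) are summed before weighting.
elastic = property_target_reward(True, 0.4, [0.5, 0.25], beta=2.0)
# Space group: one binary term; the stability term is dropped when invalid.
symmetry = property_target_reward(False, 0.4, [space_group_reward(194, 194)])
```

Only the list of property terms changes between tasks, which reflects the text's point that specialization happens through the reward rather than the architecture.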

## 4 Experiments

In this section, we systematically evaluate CrysReas on the task of generating valid, stable, and property compliant crystal structures from natural language instructions\. First, we evaluate the success of end\-to\-end generation in Section[4\.1](https://arxiv.org/html/2605.14344#S4.SS1)\. We then investigate the effect of individual components of CrysReas, including thinking traces \(Section[4\.2](https://arxiv.org/html/2605.14344#S4.SS2)\) and RL optimization \(Section[4\.3](https://arxiv.org/html/2605.14344#S4.SS3)\)\. We finally evaluate the success of property\-conditioned generation in Section[4\.4](https://arxiv.org/html/2605.14344#S4.SS4)\.

### 4.1 End-to-End Evaluation of Validity, Instruction Following, and Stability

#### Baselines and Setups\.

We aim to evaluate CrysReas's ability to generate unique, valid, and stable structures under textual specifications. We implement CrystalTextLLM (Gruver et al., [2024](https://arxiv.org/html/2605.14344#bib.bib5)) and the Wyckoff Base model of PLAID++ (Xu et al., [2025](https://arxiv.org/html/2605.14344#bib.bib25)) as prior-work baselines, keeping the dataset, the floating-point precision of the training data, the base model (all initialized from Qwen2.5-3B (Qwen et al., [2024](https://arxiv.org/html/2605.14344#bib.bib39))), and the hyperparameters the same. We also use our model variants, CrysReas-Base (SFT only), CrysReas-Thinking (SFT with thinking traces), and CrysReas-RL (RL on base model), as ablation baselines. See additional details of the baselines in Appendix [A](https://arxiv.org/html/2605.14344#A1.SS0.SSS0.Px3). We compare the models on the following metrics: (i) structural and chemical validity, following the definition in prior work (Xie et al., [2021](https://arxiv.org/html/2605.14344#bib.bib20)); (ii) instruction following for composition, space group, elasticity, and thermal expansion, which verifies whether the generated structure follows the given constraints; (iii) uniqueness, which measures the percentage of unique structures in the generated set; (iv) formation energy; and (v) S.U.N. ratio. The last three are calculated by MatterGen (Zeni et al., [2023](https://arxiv.org/html/2605.14344#bib.bib7)). A structure is considered stable when its energy above the hull is less than 0.016 eV/atom, following the Materials Project's convention (Jain et al., [2013](https://arxiv.org/html/2605.14344#bib.bib27)). More details of these metrics can be found in Appendix [C](https://arxiv.org/html/2605.14344#A3).
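As a small illustration of how the stability threshold enters the S.U.N. ratio, the sketch below counts structures that are stable, unique, and novel at the same time. It is not the MatterGen evaluation code: the record layout, the function name, and the choice of dividing by the total number of generated structures are assumptions, and the uniqueness and novelty flags are taken as given inputs.

```python
STABLE_E_HULL_MAX = 0.016  # eV/atom, the stability threshold quoted in the text


def sun_ratio(structures):
    """Fraction of generated structures that are stable, unique, and novel.

    Each record is a dict with keys "e_hull" (eV/atom), "unique", and "novel";
    this layout is hypothetical.
    """
    if not structures:
        return 0.0
    hits = sum(
        1 for s in structures
        if s["e_hull"] < STABLE_E_HULL_MAX and s["unique"] and s["novel"]
    )
    return hits / len(structures)


# Toy records: only the first satisfies all three conditions.
records = [
    {"e_hull": 0.010, "unique": True, "novel": True},
    {"e_hull": 0.050, "unique": True, "novel": True},
    {"e_hull": 0.000, "unique": False, "novel": True},
    {"e_hull": 0.005, "unique": True, "novel": False},
]
ratio = sun_ratio(records)
```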

#### Comparison Against Prior Works\.

We compare our model against our implementations of CrystalTextLLM (Gruver et al., [2024](https://arxiv.org/html/2605.14344#bib.bib5)) and the Wyckoff Base model of PLAID++ (Xu et al., [2025](https://arxiv.org/html/2605.14344#bib.bib25)) in Table [1](https://arxiv.org/html/2605.14344#S3.T1). CrysReas outperforms both prior works on multiple metrics.

The choice of intermediate representation critically impacts performance. Our thinking traces yield the best space-group consistency, while CrystalTextLLM yields the worst. This is because CrystalTextLLM has no structural prior, PLAID++ adopts Wyckoff representations to encode symmetry, and our thinking traces preserve the intrinsic structural characteristics of crystals better than both baselines.

We implement prior works under their original precision settings (2 or 3 decimal places). When we increase the precision to 8 decimal places, their performance degrades noticeably, revealing that it is not always better for LLMs to generate more digits.

Table 2: Performance comparison of model variants: CrysReas-Base (SFT baseline), CrysReas-Thinking (SFT + thinking traces), CrysReas-RL (SFT + RL), and full CrysReas. Thinking traces improve instruction following and validity; RL boosts uniqueness and stability; the full model achieves the best overall performance.

#### Comparison for Model Variants\.

We compare four variants: the baseline CrysReas-Base (SFT only), CrysReas-Thinking (SFT with thinking traces), CrysReas-RL (RL on base model), and the full CrysReas (both). Evaluation covers validity (structural, chemical), instruction following (composition, space group), and stability (uniqueness, energy, S.U.N. ratio). As shown in Table [2](https://arxiv.org/html/2605.14344#S4.T2), the full CrysReas model outperforms all variants across nearly all metrics, and both thinking traces and RL improve performance across the metrics. Notably, compared to the model variants, our model triples the S.U.N. ratio and doubles the uniqueness, even though we only use RL to optimize stability and never explicitly optimize for uniqueness or novelty. This highlights the ability of thinking traces to improve structural validity and the ability of RL to explore a diverse crystal space.

### 4.2 Evaluate the Effect of Thinking Traces

We now perform ablations and qualitative analysis to better understand the effect of the thinking traces\.

![Refer to caption](https://arxiv.org/html/2605.14344v1/x3.png)

Figure 3: Performance comparison of CrysReas-Base vs. CrysReas-Thinking across varying complexity. (a) Structural validity and (b) composition consistency vs. number of atoms: CrysReas-Thinking consistently outperforms the baseline, especially as complexity increases. (c) Space-group consistency across symmetry groups: CrysReas-Thinking shows stronger symmetry understanding, particularly for challenging semi-constrained groups (e.g., $C2/c$, $Amm2$, $I4m2$).

#### Varying Atom Count and Space-Group Complexity.

To understand the importance of the thinking trace across different levels of complexity, we vary the number of atoms in the test set and the complexity of the space group, and measure the performance of the generated structures. Figure [3](https://arxiv.org/html/2605.14344#S4.F3) compares CrysReas-Base and CrysReas-Thinking across different atom counts. CrysReas-Thinking consistently outperforms the no-thinking baseline in structural validity and composition consistency. As the number of atoms increases, model performance drops, but the effect of thinking becomes more pronounced (for systems with 10-21 atoms, CrysReas-Thinking significantly outperforms CrysReas-Base, as shown in Figure [3](https://arxiv.org/html/2605.14344#S4.F3)(b)).

For space-group consistency, we observe that LLMs are generally better at generating structures that follow the specified space group for more common space groups (e.g., $P6_{3}/mmc$) and struggle with space groups that are less common in the Materials Project (e.g., $P3$). However, thinking improves space-group consistency across all space groups, as shown in Figure [3](https://arxiv.org/html/2605.14344#S4.F3)(c), indicating that thinking traces help enforce symmetry constraints. The difference between thinking and no thinking is larger for semi-constrained groups (e.g., $C2/c$, $Amm2$, $I4m2$), suggesting that thinking traces are most beneficial when symmetry constraints are neither trivial nor overwhelmingly strict.

![Refer to caption](https://arxiv.org/html/2605.14344v1/x4.png)

Figure 4: (a) Thinking trace length scales with the number of atoms, showing an adaptive reasoning budget. (b) Ablation on three segments by removing each of them: (1) crystallographic symmetry, (2) local coordination environments, and (3) predicted functional properties. Earlier tokens affect space-group consistency more, indicating that hierarchical reasoning from high to low levels is important for space-group consistency.

#### Length Scaling and Sub\-Components of the Thinking Traces\.

To understand the contribution of thinking traces to the final atomic coordinates, we measure the lengths of the thinking tokens when varying the number of atoms in Figure[4](https://arxiv.org/html/2605.14344#S4.F4)\(a\)\. We observe that more atoms require longer thinking traces, indicating that CrysReas can perform adaptive reasoning according to the complexity of the generation task\.

We then perform ablations on three components of the thinking trace, namely crystallographic symmetry, local coordination environments, and predicted functional properties\. Specifically, we remove each component from the thinking trace during inference and assess its individual contribution\. As shown in Figure [4](https://arxiv.org/html/2605.14344#S4.F4)\(b\), earlier segments in the thinking trace are more critical to space\-group consistency\. This hierarchy suggests that the model first establishes high\-level structural framing before progressing to localized physical parameters\. It also supports the view that the thinking traces are not merely stochastic outputs but act as a hierarchical physical prior that improves the structural validity of the generated crystals\.

#### Physical Properties Are Predicted Before Generation\.

In Appendix [D](https://arxiv.org/html/2605.14344#A4), we show that thinking traces are able to predict physical values \(sites, volume, bounds\) with low error\.

### 4\.3Evaluate the Effect of RL Optimization for Validity and Stability

Table 3: Ablation of reward designs: Validity Only \(structural \+ chemical\), Energy Only \(energy minimization\), and Mixed Reward \(both\)\. Energy objectives drive exploration and more than double uniqueness, and mixing the validity and the energy term as a regularizer achieves the best overall stability and S\.U\.N\. ratio\.

#### Ablation Study on Reward Components\.

To understand the role of each component \(validity and stability\) of the total reward and to find the best configuration, we compare three reward configurations \(Table [3](https://arxiv.org/html/2605.14344#S4.T3)\): the Validity Only model \($\alpha_{\text{validity}}=1$, $\alpha_{\text{stability}}=0$\), the Energy Only model \($\alpha_{\text{validity}}=0$, $\alpha_{\text{stability}}=1$\), and the Mixed Reward model \($\alpha_{\text{validity}}=1$, $\alpha_{\text{stability}}=10$\)\. The Validity Only model achieves high structural validity but suffers from low uniqueness, indicating mode collapse\. Energy rewards more than double uniqueness, which suggests that the energy objective encourages more diverse generations\. Among the three, the Mixed Reward model delivers the best overall performance, with the highest stability and S\.U\.N\. ratio, suggesting that the validity term acts as an effective regularizer that balances physical realism and exploratory diversity\.
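The three configurations can be written down as weight pairs. The sketch below assumes the total reward is a plain weighted sum of a validity term and a stability term; that functional form, and the names `RewardWeights` and `total_reward`, are illustrative assumptions rather than the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    validity: float   # alpha_validity
    stability: float  # alpha_stability


# The three configurations compared in the ablation.
VALIDITY_ONLY = RewardWeights(validity=1.0, stability=0.0)
ENERGY_ONLY = RewardWeights(validity=0.0, stability=1.0)
MIXED = RewardWeights(validity=1.0, stability=10.0)


def total_reward(r_validity: float, r_stability: float, w: RewardWeights) -> float:
    """Weighted sum of the two reward terms (assumed functional form)."""
    return w.validity * r_validity + w.stability * r_stability
```

Under this assumption, the Mixed Reward setting weights the stability term ten times more heavily than the validity term, so the validity term contributes a comparatively small correction.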

#### DFT Verification for Energy\.

We evaluate the energy above the hull via DFT calculations for the four model variants by sampling 128 queries and comparing the resulting distributions\. As illustrated by the $E_{\text{hull}}$ distributions in Figure [5](https://arxiv.org/html/2605.14344#S4.F5)\(a\), both RL alignment and thinking traces shift the energy distribution toward lower values\. Specifically, CrysReas\-RL shifts the distribution toward a lower energy regime, while the inclusion of thinking traces further refines the generated candidates\.

To provide a more granular comparison, we present parity plots for CrysReas\-Base vs\. CrysReas\-Thinking \(Figure [5](https://arxiv.org/html/2605.14344#S4.F5)\(b\)\) and CrysReas\-Base vs\. CrysReas\-RL \(Figure [5](https://arxiv.org/html/2605.14344#S4.F5)\(c\)\)\. The majority of data points lie below the diagonal line $y=x$, showing that both CrysReas\-Thinking and CrysReas\-RL achieve lower $E_{\text{hull}}$ values than CrysReas\-Base in most cases\. These findings suggest that incorporating reasoning traces and policy optimization guides the model toward more thermodynamically stable crystal structures\.
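The parity-plot reading can be reduced to a single number: the fraction of paired points that fall below the diagonal. The following is a minimal sketch, assuming the x-axis holds the CrysReas-Base value and the y-axis the variant's value for the same query; the function name and the sample values in the usage note are made up for illustration.

```python
def fraction_below_diagonal(base_ehull, variant_ehull):
    """Fraction of paired points with variant E_hull strictly below base E_hull."""
    if len(base_ehull) != len(variant_ehull):
        raise ValueError("inputs must be paired, one value per query")
    if not base_ehull:
        raise ValueError("need at least one pair")
    below = sum(1 for x, y in zip(base_ehull, variant_ehull) if y < x)
    return below / len(base_ehull)
```

For example, with illustrative values `base = [0.10, 0.20, 0.05, 0.30]` and `variant = [0.05, 0.25, 0.01, 0.10]`, three of the four points lie below $y=x$, so the function returns 0.75.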

![Refer to caption](https://arxiv.org/html/2605.14344v1/x5.png)Figure 5: We evaluate CrysReas\-Base, CrysReas\-Thinking, CrysReas\-RL, and CrysReas on 128 queries, reporting the distributions of energy above the hull \($E_{\text{hull}}$\) for DFT\-validated structures \(count $n$, mean $\mu$, variance $\sigma$\)\. Both thinking traces and RL improve energy over the base model, with RL achieving the most significant gains\. Scatter plots \(b\) and \(c\) further confirm that CrysReas\-Thinking and CrysReas\-RL consistently yield lower $E_{\text{hull}}$ than CrysReas\-Base\.

### 4\.4Evaluating Property Conditioned Generation

![Refer to caption](https://arxiv.org/html/2605.14344v1/x6.png)Figure 6: Performance of specialized models on three conditioning tasks: space group \(left\), elasticity \(middle\), and thermal expansion \(right\)\. Each specialist improves over the baseline CrysReas on its target metric, confirming that reward\-shaped RL effectively enforces discrete or continuous property constraints\.

For conditioned generation tasks, we use CrysReas \(with both thinking and RL\) as the baseline and investigate three specialized models trained with property\-conditioned RL: CrysReas\-space\-group, CrysReas\-ElasticProperties, and CrysReas\-ThermalExpansion \(Figure [6](https://arxiv.org/html/2605.14344#S4.F6)\)\. All three specialists achieve notable improvements on their respective conditioning targets, demonstrating that RL with specialized rewards enhances adherence to specific property constraints\.

However, specialization comes with trade\-offs\. For elasticity conditioning, while CrysReas\-ElasticProperties outperforms the baseline on follow\-elasticity rate, it achieves slightly lower structural validity \(Table [4](https://arxiv.org/html/2605.14344#S4.T4)\)\. This suggests that optimizing for specific property requirements may modestly impact general structural quality\.

Table 4: Elasticity\-conditioned generation: trade\-off between target adherence and structural validity\. The specialist CrysReas\-ElasticProperties improves follow\-elasticity rate at a small cost to structural validity compared to the baseline CrysReas\.

## 5Related Work

#### Purely Diffusion\-Based Crystal Generation\.

Diffusion models have been successfully applied to crystal structure generation, learning to reverse a noise process on atomic coordinates and lattice parameters \(Yang et al\. \([2023](https://arxiv.org/html/2605.14344#bib.bib1)\); Xie et al\. \([2021](https://arxiv.org/html/2605.14344#bib.bib20)\); Chen et al\. \([2025](https://arxiv.org/html/2605.14344#bib.bib26)\); Jiao et al\. \([2023](https://arxiv.org/html/2605.14344#bib.bib32), [2024](https://arxiv.org/html/2605.14344#bib.bib33)\); Kelvinius et al\. \([2025](https://arxiv.org/html/2605.14344#bib.bib34)\); Joshi et al\. \([2025](https://arxiv.org/html/2605.14344#bib.bib47)\)\)\. These methods achieve strong performance on structural validity and serve as standard baselines\. However, diffusion models operate solely on structural representations and do not explicitly integrate text\-based knowledge that connects to higher\-level concepts such as chemical compositions or materials semantics\.

#### LLM\-Related Crystal Generation\.

To incorporate semantic information, some approaches adopt decoupled architectures where LLMs generate formulas and separate diffusion models predict structures from those formulas \(Yang et al\. \([2024c](https://arxiv.org/html/2605.14344#bib.bib6)\); Inizan et al\. \([2025](https://arxiv.org/html/2605.14344#bib.bib4)\); Khastagir et al\. \([2025](https://arxiv.org/html/2605.14344#bib.bib48)\)\)\. This design enables textual priors but splits reasoning and generation into independent modules, making joint optimization infeasible\. More recent end\-to\-end approaches use a single LLM to directly output crystal structures as flattened coordinate strings \(Antunes et al\. \([2024](https://arxiv.org/html/2605.14344#bib.bib3)\); Gruver et al\. \([2024](https://arxiv.org/html/2605.14344#bib.bib5)\); Mohanty et al\. \([2026](https://arxiv.org/html/2605.14344#bib.bib30)\); Gan et al\. \([2025](https://arxiv.org/html/2605.14344#bib.bib31)\); Xu et al\. \([2025](https://arxiv.org/html/2605.14344#bib.bib25)\)\)\. However, flattening 3D coordinates disrupts crystallographic symmetries and spatial constraints, frequently leading to physically invalid configurations\.

#### Chain\-of\-Thought Reasoning for Complex Tasks\.

Chain\-of\-thought \(CoT\) reasoning \(Wei et al\. \([2022](https://arxiv.org/html/2605.14344#bib.bib28)\)\) improves LLM performance on complex multi\-step tasks by decomposing problems into intermediate reasoning steps\. This approach has been successfully applied to mathematical reasoning, logical deduction, and code generation, where explicit intermediate states bridge abstract inputs and concrete outputs \(Sprague et al\. \([2024](https://arxiv.org/html/2605.14344#bib.bib40)\); Ling et al\. \([2023](https://arxiv.org/html/2605.14344#bib.bib41)\); Yang et al\. \([2024a](https://arxiv.org/html/2605.14344#bib.bib42)\)\)\. However, applying CoT to crystal structure generation remains underexplored, particularly for tasks requiring precise 3D spatial reasoning\.

## 6Conclusion

In this work, we proposed CrysReas, an end\-to\-end framework that enables LLMs to directly generate stable crystal structures from natural language instructions\. By introducing physical priors as thinking tokens, GRPO\-based alignment with MLIP rewards, and task\-specific training for property conditioning, we establish a new paradigm for integrating textual knowledge with crystallographic generation\.

#### Limitations and Future Work\.

Despite these advances, our framework has several limitations that point to promising future directions\. First, due to computational constraints, all models, including re\-implemented prior works, are evaluated using the Qwen2\.5\-3B architecture, which limits direct comparison with the results originally reported by prior works; a more comprehensive comparison could be achieved with additional prior works, parameter tuning, and multiple experimental runs\. Second, our framework requires training a specialized model for each property\-conditioning task rather than supporting all conditions within a single unified model; a multi\-task or adapter\-based framework may reduce training overhead in multi\-task scenarios\. Third, all experiments are conducted solely on the CDVAE MP\-20 split, leaving generalization to other material families \(e\.g\., oxides, halides, 2D materials\) unvalidated; evaluating CrysReas on broader datasets with diverse chemical compositions would better assess its generalization capability\.

## References

- L\. M\. Antunes, K\. T\. Butler, and R\. Grau\-Crespo \(2024\) Crystal structure generation with autoregressive large language modeling\. Nature Communications 15 \(1\), pp\. 10570\. Cited by: [§1](https://arxiv.org/html/2605.14344#S1.p3.1), [§2\.1](https://arxiv.org/html/2605.14344#S2.SS1.p1.3), [§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px2.p1.1)\.
- J\. Chen, J\. Guo, E\. Fako, and P\. Schwaller \(2025\)Accelerating inverse materials design using generative diffusion models with reinforcement learning\.arXiv preprint arXiv:2511\.03112\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p2.1),[§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px1.p1.1)\.
- D\. W\. Davies, K\. T\. Butler, A\. J\. Jackson, J\. M\. Skelton, K\. Morita, and A\. Walsh \(2019\)Smact: semiconducting materials by analogy and chemical theory\.Journal of Open Source Software4\(38\),pp\. 1361\.Cited by:[Appendix C](https://arxiv.org/html/2605.14344#A3.SS0.SSS0.Px1.p1.5)\.
- D\. W\. Davies, K\. T\. Butler, J\. M\. Skelton, C\. Xie, A\. R\. Oganov, and A\. Walsh \(2018\)Computer\-aided design of metal chalcohalide semiconductors: from chemical composition to crystal structure\.Chemical science9\(4\),pp\. 1022–1030\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p1.1)\.
- P\. De Breuck, H\. Wang, G\. Rignanese, S\. Botti, and M\. A\. Marques \(2025\)Generative ai for crystal structures: a review\.npj Computational Materials\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p1.1)\.
- R\. Fan, Z\. Wang, and P\. Liu \(2025\)MegaScience: pushing the frontiers of post\-training datasets for science reasoning\.arXiv preprint arXiv:2507\.16812\.Cited by:[Appendix A](https://arxiv.org/html/2605.14344#A1.SS0.SSS0.Px3.p1.1)\.
- J\. Gan, P\. Zhong, Y\. Du, Y\. Zhu, C\. Duan, H\. Wang, D\. Schwalbe\-Koda, C\. P\. Gomes, K\. A\. Persson, and W\. Wang \(2025\)MatLLMSearch: crystal structure discovery with evolution\-guided large language models\.arXiv preprint arXiv:2502\.20933\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p3.1),[§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px2.p1.1)\.
- A\. M\. Ganose and A\. Jain \(2019\)Robocrystallographer: automated crystal structure text descriptions and analysis\.MRS Communications9\(3\),pp\. 874–881\.Cited by:[Appendix A](https://arxiv.org/html/2605.14344#A1.SS0.SSS0.Px2.p1.1),[Appendix B](https://arxiv.org/html/2605.14344#A2.p3.1),[§3\.1](https://arxiv.org/html/2605.14344#S3.SS1.SSS0.Px1.p1.1)\.
- B\. R\. Goldsmith, J\. Esterhuizen, J\. Liu, C\. J\. Bartel, and C\. A\. Sutton \(2018\)Machine learning for heterogeneous catalyst design and discovery\.AIChE\-Journal64\(7\),pp\. 2311–2323\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p1.1)\.
- N\. Gruver, A\. Sriram, A\. Madotto, A\. G\. Wilson, C\. L\. Zitnick, and Z\. Ulissi \(2024\)Fine\-tuned language models generate stable inorganic materials as text\.arXiv preprint arXiv:2402\.04379\.Cited by:[Appendix A](https://arxiv.org/html/2605.14344#A1.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.14344#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.14344#S2.SS1.p1.3),[Figure 2](https://arxiv.org/html/2605.14344#S3.F2),[Figure 2](https://arxiv.org/html/2605.14344#S3.F2.3.2),[§4\.1](https://arxiv.org/html/2605.14344#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.14344#S4.SS1.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px2.p1.1)\.
- T\. Hahn, U\. Shmueli, and J\. W\. Arthur \(1983\)International tables for crystallography\.Vol\.1,Reidel Dordrecht\.Cited by:[Appendix B](https://arxiv.org/html/2605.14344#A2.p2.1)\.
- P\. Hohenberg and W\. Kohn \(1964\)Inhomogeneous electron gas\.Physical review136\(3B\),pp\. B864\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p6.1)\.
- T\. J\. Inizan, S\. Yang, A\. Kaplan, Y\. Lin, J\. Yin, S\. Mirzaei, M\. Abdelgaid, A\. H\. Alawadhi, K\. Cho, Z\. Zheng,et al\.\(2025\)System of agentic ai for the discovery of metal\-organic frameworks\.arXiv preprint arXiv:2504\.14110\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p2.1),[§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px2.p1.1)\.
- A\. Jain, S\. P\. Ong, G\. Hautier, W\. Chen, W\. D\. Richards, S\. Dacek, S\. Cholia, D\. Gunter, D\. Skinner, G\. Ceder,et al\.\(2013\)Commentary: the materials project: a materials genome approach to accelerating materials innovation\.APL materials1\(1\)\.Cited by:[Appendix A](https://arxiv.org/html/2605.14344#A1.SS0.SSS0.Px1.p1.1),[Appendix C](https://arxiv.org/html/2605.14344#A3.SS0.SSS0.Px2.p1.2),[§4\.1](https://arxiv.org/html/2605.14344#S4.SS1.SSS0.Px1.p1.1)\.
- R\. Jiao, W\. Huang, P\. Lin, J\. Han, P\. Chen, Y\. Lu, and Y\. Liu \(2023\)Crystal structure prediction by joint equivariant diffusion\.Advances in Neural Information Processing Systems36,pp\. 17464–17497\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p2.1),[§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px1.p1.1)\.
- R\. Jiao, W\. Huang, Y\. Liu, D\. Zhao, and Y\. Liu \(2024\)Space group constrained crystal generation\.arXiv preprint arXiv:2402\.03992\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p2.1),[§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px1.p1.1)\.
- C\. K\. Joshi, X\. Fu, Y\. Liao, V\. Gharakhanyan, B\. K\. Miller, A\. Sriram, and Z\. W\. Ulissi \(2025\)All\-atom diffusion transformers: unified generative modelling of molecules and materials\.arXiv preprint arXiv:2503\.03965\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p2.1),[§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px1.p1.1)\.
- F\. E\. Kelvinius, O\. B\. Andersson, A\. S\. Parackal, D\. Qian, R\. Armiento, and F\. Lindsten \(2025\)WyckoffDiff–a generative diffusion model for crystal symmetry\.arXiv preprint arXiv:2502\.06485\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p2.1),[§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px1.p1.1)\.
- S\. Khastagir, K\. Das, P\. Goyal, S\. Lee, S\. Bhattacharjee, and N\. Ganguly \(2025\)LLM meets diffusion: a hybrid framework for crystal material generation\.arXiv preprint arXiv:2510\.23040\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p2.1),[§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px2.p1.1)\.
- G\. Kresse and J\. Furthmüller \(1996\)Efficient iterative schemes for ab initio total\-energy calculations using a plane\-wave basis set\.Physical review B54\(16\),pp\. 11169\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p6.1)\.
- Z\. Ling, Y\. Fang, X\. Li, Z\. Huang, M\. Lee, R\. Memisevic, and H\. Su \(2023\)Deductive verification of chain\-of\-thought reasoning\.Advances in Neural Information Processing Systems36,pp\. 36407–36433\.Cited by:[§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px3.p1.1)\.
- T\. Mohanty, M\. Mehta, H\. M\. Sayeed, B\. Oded, I\. Pitussi, A\. Borenstein, V\. Srikumar, and T\. D\. Sparks \(2026\)CrysText: a generative ai approach for text\-conditioned crystal structure generation using llm\.Integrating Materials and Manufacturing Innovation,pp\. 1–15\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p3.1),[§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px2.p1.1)\.
- P\. Moritz, R\. Nishihara, S\. Wang, A\. Tumanov, R\. Liaw, E\. Liang, M\. Elibol, Z\. Yang, W\. Paul, M\. I\. Jordan, et al\. \(2018\) Ray: a distributed framework for emerging AI applications\. In 13th USENIX symposium on operating systems design and implementation \(OSDI 18\), pp\. 561–577\. Cited by: [Appendix A](https://arxiv.org/html/2605.14344#A1.SS0.SSS0.Px7.p1.1)\.
- S\. P\. Ong, W\. D\. Richards, A\. Jain, G\. Hautier, M\. Kocher, S\. Cholia, D\. Gunter, V\. L\. Chevrier, K\. A\. Persson, and G\. Ceder \(2013\)Python materials genomics \(pymatgen\): a robust, open\-source python library for materials analysis\.Computational Materials Science68,pp\. 314–319\.Cited by:[Appendix A](https://arxiv.org/html/2605.14344#A1.SS0.SSS0.Px2.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray,et al\.\(2022\)Training language models to follow instructions with human feedback\.Advances in neural information processing systems35,pp\. 27730–27744\.Cited by:[§2\.2](https://arxiv.org/html/2605.14344#S2.SS2.p1.4)\.
- C\. J\. Pickard and R\. Needs \(2011\)Ab initio random structure searching\.Journal of Physics: Condensed Matter23\(5\),pp\. 053201\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p1.1)\.
- Qwen, A\. Yang, B\. Yang, B\. Zhang, B\. Hui,et al\.\(2024\)Qwen2\.5 technical report\.arXiv preprint arXiv:2412\.15115\.Cited by:[Appendix A](https://arxiv.org/html/2605.14344#A1.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2605.14344#S4.SS1.SSS0.Px1.p1.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§2\.2](https://arxiv.org/html/2605.14344#S2.SS2.p3.3)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§2\.2](https://arxiv.org/html/2605.14344#S2.SS2.p1.4)\.
- G\. Sheng, C\. Zhang, Z\. Ye, X\. Wu, W\. Zhang, R\. Zhang, Y\. Peng, H\. Lin, and C\. Wu \(2024\) HybridFlow: a flexible and efficient rlhf framework\. arXiv preprint arXiv:2409\.19256\. Cited by: [Appendix A](https://arxiv.org/html/2605.14344#A1.SS0.SSS0.Px5.p1.4), [Appendix A](https://arxiv.org/html/2605.14344#A1.SS0.SSS0.Px7.p1.1)\.
- Z\. Sprague, F\. Yin, J\. D\. Rodriguez, D\. Jiang, M\. Wadhwa, P\. Singhal, X\. Zhao, X\. Ye, K\. Mahowald, and G\. Durrett \(2024\)To cot or not to cot? chain\-of\-thought helps mainly on math and symbolic reasoning\.arXiv preprint arXiv:2409\.12183\.Cited by:[§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px3.p1.1)\.
- A\. Togo, K\. Shinohara, and I\. Tanaka \(2024\)Spglib: a software library for crystal symmetry search\.Science and Technology of Advanced Materials: Methods4\(1\),pp\. 2384822\.Cited by:[Appendix C](https://arxiv.org/html/2605.14344#A3.SS0.SSS0.Px3.p1.1)\.
- Y\. Wang, J\. Lv, L\. Zhu, and Y\. Ma \(2010\)Crystal structure prediction via particle\-swarm optimization\.Physical Review B\-Condensed Matter and Materials Physics82\(9\),pp\. 094116\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px3.p1.1)\.
- T\. Xie, X\. Fu, O\. Ganea, R\. Barzilay, and T\. Jaakkola \(2021\)Crystal diffusion variational autoencoder for periodic material generation\.arXiv preprint arXiv:2110\.06197\.Cited by:[Appendix A](https://arxiv.org/html/2605.14344#A1.SS0.SSS0.Px1.p1.1),[Appendix C](https://arxiv.org/html/2605.14344#A3.SS0.SSS0.Px1.p1.5),[§1](https://arxiv.org/html/2605.14344#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.14344#S4.SS1.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px1.p1.1)\.
- A\. Xu, R\. Desai, L\. Wang, G\. Hope, and E\. Ritz \(2025\)Plaid\+\+: a preference aligned language model for targeted inorganic materials design\.arXiv preprint arXiv:2509\.07150\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p3.1),[§4\.1](https://arxiv.org/html/2605.14344#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.14344#S4.SS1.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px2.p1.1)\.
- G\. Yang, Y\. Zhou, X\. Chen, X\. Zhang, T\. Y\. Zhuo, and T\. Chen \(2024a\)Chain\-of\-thought in neural code generation: from and for lightweight language models\.IEEE Transactions on Software Engineering50\(9\),pp\. 2437–2457\.Cited by:[§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px3.p1.1)\.
- H\. Yang, C\. Hu, Y\. Zhou, X\. Liu, Y\. Shi, J\. Li, G\. Li, Z\. Chen, S\. Chen, C\. Zeni,et al\.\(2024b\)Mattersim: a deep learning atomistic model across elements, temperatures and pressures\.arXiv preprint arXiv:2405\.04967\.Cited by:[Appendix A](https://arxiv.org/html/2605.14344#A1.SS0.SSS0.Px8.p1.1),[§1](https://arxiv.org/html/2605.14344#S1.p5.1),[§3\.3](https://arxiv.org/html/2605.14344#S3.SS3.SSS0.Px2.p3.9)\.
- S\. Yang, S\. Batzner, R\. Gao, M\. Aykol, A\. Gaunt, B\. McMorrow, D\. Rezende, D\. Schuurmans, I\. Mordatch, and E\. D\. Cubuk \(2024c\)Generative hierarchical materials search\.Advances in Neural Information Processing Systems37,pp\. 38799–38819\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p2.1),[§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px2.p1.1)\.
- S\. Yang, K\. Cho, A\. Merchant, P\. Abbeel, D\. Schuurmans, I\. Mordatch, and E\. D\. Cubuk \(2023\)Scalable diffusion for materials generation\.arXiv preprint arXiv:2311\.09235\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p2.1),[§5](https://arxiv.org/html/2605.14344#S5.SS0.SSS0.Px1.p1.1)\.
- C\. Zeni, R\. Pinsler, D\. Zügner, A\. Fowler, M\. Horton, X\. Fu, S\. Shysheya, J\. Crabbé, L\. Sun, J\. Smith,et al\.\(2023\)Mattergen: a generative model for inorganic materials design\.arXiv preprint arXiv:2312\.03687\.Cited by:[Appendix A](https://arxiv.org/html/2605.14344#A1.SS0.SSS0.Px8.p1.1),[Appendix C](https://arxiv.org/html/2605.14344#A3.SS0.SSS0.Px5.p1.1),[§4\.1](https://arxiv.org/html/2605.14344#S4.SS1.SSS0.Px1.p1.1)\.
- Q\. Zhao, S\. Stalin, C\. Zhao, and L\. A\. Archer \(2020\)Designing solid\-state electrolytes for safe, energy\-dense batteries\.Nature Reviews Materials5\(3\),pp\. 229–252\.Cited by:[§1](https://arxiv.org/html/2605.14344#S1.p1.1)\.

## Appendix AExperimental Details

#### Data\.

All experiments are conducted on Materials Project \(Jain et al\. \[[2013](https://arxiv.org/html/2605.14344#bib.bib27)\]\) structures stored in our CrysReas database\. We use the CDVAE MP\-20 split \(Xie et al\. \[[2021](https://arxiv.org/html/2605.14344#bib.bib20)\]\) as the upstream data source\. For supervised fine\-tuning, we use split\_cdvae\.json, which contains 24,231 training structures and 8,141 test structures\. We further construct task\-specific subsets for stability optimization and property conditioning, deliberately limiting their size to avoid excessive training time\. For stability optimization, we select 8,000 training and 512 test structures to form split\_rl\.json, which contains structures whose phases are valid in the ReferenceMP2020Correction phase diagram, so that energy above the hull can be evaluated consistently\. For property\-conditioned generation, we use split\_elastic\.json with 4,000 training and 256 test structures, and split\_cte\.json with 4,000 training and 256 test structures; both contain only structures that pass the corresponding MLIP calculations\.

#### Instruction and Trace Construction\.

The user instructions randomly combine constraints over composition, space groups, stability\-related quantities, and other physical properties, following a format similar to CrystalTextLLM \(Gruver et al\. \[[2024](https://arxiv.org/html/2605.14344#bib.bib5)\]\)\. For property\-conditioned generation, we use MLIP\-predicted property ranges rather than arbitrary target values, ensuring that the instruction is physically reasonable and that at least one feasible reference structure exists\. For thinking\-trace supervision, we use Pymatgen \(Ong et al\. \[[2013](https://arxiv.org/html/2605.14344#bib.bib22)\]\) and Robocrystallographer \(Ganose and Jain \[[2019](https://arxiv.org/html/2605.14344#bib.bib8)\]\) to build rule\-based traces that describe structural, electronic, stability, and mechanical information before emitting the final CIF\-like structure\. For the final CIF\-like structure, lattice lengths are rounded to 6 decimal places, and atomic coordinates are rounded to 8 decimal places\.
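The rounding convention above is simple to express in code. This is a minimal sketch using fixed-width decimal formatting; the helper names are hypothetical, and the exact layout of the CIF-like string is not specified here, so only the numeric formatting is shown.

```python
def format_lattice_length(value: float) -> str:
    """Lattice lengths are written with 6 decimal places."""
    return f"{value:.6f}"


def format_coordinate(value: float) -> str:
    """Atomic coordinates are written with 8 decimal places."""
    return f"{value:.8f}"
```

Fixed-width formatting keeps trailing zeros, so every number of a given kind occupies the same number of decimal digits in the text the LLM sees.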

#### Models and Baselines\.

All language\-model variants are initialized from Qwen2\.5\-3B \(Qwen et al\. \[[2024](https://arxiv.org/html/2605.14344#bib.bib39)\]\) using the MegaScience\-fine\-tuned checkpoint \(Fan et al\. \[[2025](https://arxiv.org/html/2605.14344#bib.bib21)\]\)\. We first train two SFT baselines: CrysReas\-Base, which directly emits the final CIF\-like structure, and CrysReas\-Thinking, which first generates thinking traces before producing the structure\. Our main stability\-optimized model, CrysReas, starts from the thinking SFT model and is optimized with GRPO\. We also train CrysReas\-RL as an RL counterpart of the no\-thinking baseline\. For property\-conditioned generation, we train CrysReas\-Space\-group, CrysReas\-ElasticProperties, and CrysReas\-ThermalExpansion\.

We compare against two prior\-work\-style representations implemented in the same CrysReas pipeline: CrystalTextLLM and PLAID\+\+ Wyckoff Base\. We use the same dataset, floating\-point precision \(8 decimal places for coordinates and 6 decimal places for lattice lengths\) in the final CIF\-like structure, models, and hyperparameters\.

#### Supervised Fine\-Tuning\.

For SFT, we apply full\-parameter fine\-tuning for 2 epochs with a global batch size of 32, per\-GPU micro\-batch size 1, maximum sequence length 4096, learning rate $1\times 10^{-4}$, Adam betas $(0.9, 0.95)$, weight decay 0\.01, cosine learning\-rate decay, 10% warmup, and gradient clipping at 1\.0\. Training uses FSDP, bf16 precision, and gradient checkpointing\. The no\-thinking and thinking models are trained on conditional structure\-generation prompts\. The CrystalTextLLM and PLAID\+\+ Wyckoff Base comparison models are trained on the same structures but use their corresponding text representations; their SFT data mixes generation and infilling examples with a 66/34 ratio\.
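The learning-rate schedule described above (10% warmup followed by cosine decay from a peak of $1\times 10^{-4}$) can be sketched as a function of the training step. The linear shape of the warmup and the decay to exactly zero are assumptions; the text specifies only the warmup fraction, the peak value, and the cosine decay.

```python
import math


def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 1e-4, warmup_fraction: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```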

#### Reinforcement Learning\.

We use the Verl \(Sheng et al\. \[[2024](https://arxiv.org/html/2605.14344#bib.bib23)\]\) PPO trainer stack with the GRPO advantage estimator \(Sheng et al\. \[[2024](https://arxiv.org/html/2605.14344#bib.bib23)\]\)\. RL training runs for 1 epoch with batch size 64, group size 8, maximum prompt length 256, maximum response length 4096, actor learning rate $1\times 10^{-5}$, PPO mini\-batch size 32, per\-GPU micro\-batch size 1, clip ratio 0\.2, entropy coefficient 0, and weight decay 0\.01\. The GRPO configuration uses $\gamma=0.98$, $\lambda=0.9$, normalized group advantages, and an adaptive KL controller with initial coefficient 0\.001, target KL 0\.05, and horizon 10,000\. Rollouts are sampled with temperature 1\.0 and top\-$p=1.0$\.
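To make "normalized group advantages" and the clip ratio concrete, the sketch below shows the usual GRPO-style computation for one prompt: the rewards of the rollouts in a group (8 here) are centered and scaled by the group's standard deviation, and the policy-ratio term is clipped at $1 \pm 0.2$. This illustrates the general technique and is not Verl's code; the use of the population standard deviation and the `eps` value are assumptions.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-6):
    """Center and scale the rewards of one rollout group (one prompt)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_ratio=0.2):
    """PPO-style clipped objective for a single token or sample."""
    clipped = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)
    return min(ratio * advantage, clipped * advantage)
```

When all rollouts in a group receive the same reward, every advantage is zero and the group contributes no gradient, which is one reason a dense, multi-objective reward is useful in this setting.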

#### Generation and Evaluation\.

At evaluation time, each model generates 16 samples per prompt\. Stability and general structure\-generation models are evaluated on split\_generation\.json, which contains 1,024 test prompts\. Elasticity\-conditioned generation is evaluated on split\_generation\_elastic\.json with 512 test prompts, and thermal\-expansion\-conditioned generation is evaluated on split\_generation\_cte\.json with 256 test prompts\. We sample with temperature 1\.0 and top\-$p=0.7$\. Generated structures are parsed into the common CrysReas structure format and evaluated using the same downstream metric pipeline for all models\. All training is performed on two A100 GPUs and requires less than 40 GPU hours in total\.

#### Reward Calculations\.

To make online reward computation efficient, we use a CPU/GPU workload sharding strategy\. Lightweight symbolic and structural checks, such as parsing, composition matching, space\-group matching, and SMACT validity, are parallelized on CPU workers with Ray by splitting the rollout batch into DataFrame chunks\. Expensive MLIP\-based metrics are handled separately as heavy metrics\. For these metrics, structures are dispatched to Ray \(Moritz et al\. \[[2018](https://arxiv.org/html/2605.14344#bib.bib49)\]\) GPU workers, where MatterSim\-based calculations are performed in batches\. We also enable the Verl \(Sheng et al\. \[[2024](https://arxiv.org/html/2605.14344#bib.bib23)\]\) framework to launch reward calculation asynchronously during the computation of log probabilities under the current policy\.
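The chunk-and-parallelize pattern for the lightweight checks can be sketched with the standard library. Here `ThreadPoolExecutor` stands in for Ray CPU workers and plain lists stand in for DataFrame chunks; `cheap_check` is a hypothetical placeholder for the parsing, composition, space-group, and SMACT checks, none of which are reproduced.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, num_chunks):
    """Split items into at most num_chunks contiguous, near-equal chunks."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be positive")
    size, extra = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when items are scarce
            chunks.append(items[start:end])
        start = end
    return chunks


def cheap_check(rollout: str) -> bool:
    # Hypothetical placeholder for a lightweight symbolic/structural check.
    return rollout.strip() != ""


def score_chunk(chunk):
    return [cheap_check(r) for r in chunk]


def run_light_checks(rollouts, num_workers=4):
    """Score chunks in parallel and flatten results in the original order."""
    chunks = split_into_chunks(rollouts, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(score_chunk, chunks))  # map preserves order
    return [flag for chunk_result in results for flag in chunk_result]
```

Keeping results in rollout order matters because each reward must be matched back to the response that produced it.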

#### MLIP Settings\.

Direct first\-principles evaluation of crystal stability and functional properties is too expensive to use during training\. We therefore employ MatterSim \(Yang et al\. \[[2024b](https://arxiv.org/html/2605.14344#bib.bib15)\]\) as the MLIP backend for structure relaxation and property evaluation\. In our pipeline, candidate structures are first relaxed with MatterSim, after which energy above the hull is computed from the relaxed structures and their predicted energies\. For this step, we use the ReferenceMP2020Correction reference set from MatterGen \(Zeni et al\. \[[2023](https://arxiv.org/html/2605.14344#bib.bib7)\]\) to construct the phase diagram and evaluate hull distance\.

To estimate the elastic properties at 0 K, we first perform an additional MLIP\-based structural relaxation tailored for elastic analysis, allowing both atomic positions and lattice parameters to adjust until a force threshold of $10^{-4}$ eV/Å is reached\. We then compute the stress response of the relaxed structure and estimate the full $6\times 6$ elastic tensor in Voigt notation using symmetry\-aware elastic analysis\. From this tensor, we derive the bulk modulus and shear modulus\.
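One common way to obtain bulk and shear moduli from a $6\times 6$ elastic tensor is the Voigt average, sketched below. The text does not state which averaging scheme (Voigt, Reuss, or Hill) is used, so this choice is an assumption made for illustration; the function name is hypothetical.

```python
def voigt_moduli(c):
    """Voigt-average bulk and shear moduli from a 6x6 tensor in Voigt notation.

    `c` is a nested list (row-major); the result has the same unit as `c`.
    """
    if len(c) != 6 or any(len(row) != 6 for row in c):
        raise ValueError("expected a 6x6 elastic tensor")
    normal_diag = c[0][0] + c[1][1] + c[2][2]   # C11 + C22 + C33
    normal_off = c[0][1] + c[1][2] + c[0][2]    # C12 + C23 + C13
    shear_diag = c[3][3] + c[4][4] + c[5][5]    # C44 + C55 + C66
    bulk = (normal_diag + 2.0 * normal_off) / 9.0
    shear = (normal_diag - normal_off + 3.0 * shear_diag) / 15.0
    return bulk, shear
```

For an isotropic tensor, where $C_{44} = (C_{11} - C_{12})/2$, this reduces to $K = (C_{11} + 2C_{12})/3$ and $G = C_{44}$, which is a convenient sanity check.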

For the coefficient of thermal expansion at 300 K, we use a quasi\-harmonic approximation \(QHA\) workflow driven by MLIP\-predicted energies and forces, and report the volumetric thermal expansion coefficient\. To ensure the reliability of our framework, we conduct post\-hoc DFT validation on a representative subset of generated structures, verifying both the model’s effectiveness and the predictive accuracy of the MLIP\.
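As a rough illustration of the final QHA step, the volumetric thermal expansion coefficient is $\alpha_V(T) = \frac{1}{V}\left(\frac{\partial V}{\partial T}\right)_P$, evaluated on the equilibrium volumes $V(T)$ that the workflow produces. The sketch below assumes those volumes are already available and uses a central finite difference; the function name and the sample $V(T)$ values are made up.

```python
def volumetric_cte(temps, volumes, index):
    """Central-difference estimate of alpha_V = (1/V) dV/dT at temps[index]."""
    if not 0 < index < len(temps) - 1:
        raise ValueError("index must be an interior point")
    dv = volumes[index + 1] - volumes[index - 1]
    dt = temps[index + 1] - temps[index - 1]
    return dv / dt / volumes[index]


# Made-up equilibrium volumes (in cubic angstroms) growing linearly with temperature.
temps = [290.0, 300.0, 310.0]
volumes = [100.0 * (1.0 + 3e-5 * t) for t in temps]
alpha_300 = volumetric_cte(temps, volumes, 1)
```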

## Appendix B Thinking Traces Design

Before generating explicit coordinates, the LLM is required to generate thinking traces\. We explicitly guide the LLM to produce a structured material report as a chain\-of\-thought\. To help the LLM gradually learn how expert knowledge, such as space groups and bond lengths, relates to atomic coordinates, the thinking tokens progress from abstract to concrete\.

First, the LLM determines the space group and appropriate Wyckoff sites \(Hahn et al\. \[[1983](https://arxiv.org/html/2605.14344#bib.bib37)\]\)\. This step establishes the fundamental symmetry constraints and the general symbolic arrangement of atoms, preventing the model from generating physically inconsistent configurations in subsequent stages\.

Second, to map these abstract symmetry constraints to a precise geometric realization, we incorporate the descriptive logic derived from Robocrystallographer \(Ganose and Jain \[[2019](https://arxiv.org/html/2605.14344#bib.bib8)\]\)\. By training the LLM to mimic these structural descriptors, including coordination motifs, bond length distributions, and connectivity, the model learns to transform abstract symmetry into a coherent topological representation\. These generated descriptors serve as intermediate physical constraints that assist the LLM in resolving the precise spatial coordinates\.

Finally, the model reasons about the expected physical properties, such as electronic band structures and thermodynamic stability\. This stage represents the most concrete reasoning phase, where the abstract symmetry and geometric description are translated into observable physical quantities\. By predicting these functional outcomes as intermediate physical constraints, the model checks that the intended crystal structure satisfies the initial input constraints before committing to specific atomic coordinates\.

Table [5](https://arxiv.org/html/2605.14344#A2.T5) shows an example prompt and response\.

Table 5: Example of crystal structure generation prompt and response for CaCO3\.

| Role | Content |
| --- | --- |
| User Prompt | Below is a description of a bulk material\. The chemical formula is CaCO3\. The formation energy per atom is \-2\.6875\. The space\-group number is 167\. The energy above the convex hull is 0\.0\. The band gap is 4\.9995\. Generate a description of the lengths and angles of the lattice vectors and then the element type and coordinates for each atom within the lattice: |
| Assistant Response | Let’s generate a material report first, according to the given information\.<br>Material Report:<br>Crystal Structure<br>First, consider space groups and atom numbers\. This material CaCO3 should have the space group R\-3c \(id 167\)\. Since for Ca, 2=2, for C, 2=2, for O, 6=6, 2\*\(\+2\)\+2\*\(\+4\)\+6\*\(\-2\)=0, the structure is like this: Ca has 1 sites: one site has 2 atoms, oxidation state \+2\. C has 1 sites: one site has 2 atoms, oxidation state \+4\. O has 1 sites: one site has 6 atoms, oxidation state \-2\.<br>Second, consider band gaps\. $\mathrm{CaCO_3}$ is Calcite structured and crystallizes in the trigonal R\-3c space group\. $\mathrm{Ca^{2+}}$ is bonded to six equivalent $\mathrm{O^{2-}}$ atoms to form corner\-sharing $\mathrm{CaO_6}$ octahedra\. The corner\-sharing octahedral tilt angles are $62^{\circ}$\. All Ca\-O bond lengths are 2\.36 Å\. $\mathrm{C^{4+}}$ is bonded in a trigonal planar geometry to three equivalent $\mathrm{O^{2-}}$ atoms\. All C\-O bond lengths are 1\.29 Å\. $\mathrm{O^{2-}}$ is bonded in a trigonal planar geometry to two equivalent $\mathrm{Ca^{2+}}$ and one $\mathrm{C^{4+}}$ atom\.<br>Third, consider structure validity\. The structure is reasonable, because the band lengths are all greater than 0\.5, and the structure’s volume 122\.95 is larger than 0\.1\.<br>Electronic Properties<br>Classification: It is an insulator \(wide band gap\) with a calculated band gap \($E_g$\) of 4\.999 eV\.<br>Stability<br>Thermodynamic Status: It is predicted to be thermodynamically stable \(on the hull\)\. The formation energy per atom is \-2\.688 eV/atom\.<br>CIF File<br><CIF\><br>P1<br>6\.35844783 6\.35844725 6\.35844589<br>46\.3714 46\.3714 46\.3714<br>Ca 1 0\.50000000 0\.50000000 0\.50000000<br>Ca 1 \-0\.00000000 0\.00000000 \-0\.00000000<br>C 1 0\.75000000 0\.75000000 0\.75000000<br>C 1 0\.25000000 0\.25000000 0\.25000000<br>O 1 0\.75000000 0\.49216771 0\.00783229<br>O 1 0\.00783229 0\.75000000 0\.49216771<br>O 1 0\.50783229 0\.99216771 0\.25000000<br>O 1 0\.25000000 0\.50783229 0\.99216771<br>O 1 0\.99216771 0\.25000000 0\.50783229<br>O 1 0\.49216771 0\.00783229 0\.75000000<br></CIF\> |

## Appendix C Metrics and Rewards

We detail the metrics used for evaluation and the reward components that constitute the RL objective\.

#### Structural and Chemical Consistency\.

These metrics assess the physical plausibility of generated crystals\. A structure is considered structurally valid if it satisfies basic geometric constraints: all interatomic distances $> 2$ Å, cell volume $V > 4.0$ Å$^3$, lattice lengths $a, b, c > 1.1$ Å, and lattice angles $\alpha, \beta, \gamma \in [20^{\circ}, 160^{\circ}]$, following the checkers of CDVAE \(Xie et al\. \[[2021](https://arxiv.org/html/2605.14344#bib.bib20)\]\)\. Using SMACT \(Davies et al\. \[[2019](https://arxiv.org/html/2605.14344#bib.bib38)\]\), a structure is chemically valid if its constituent elements can be assigned oxidation states that satisfy electroneutrality and yield stable charge configurations\.
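The geometric part of this check can be sketched as a single predicate. The thresholds below are the ones quoted in the paragraph above; the function name, its signature, and the assumption that the minimum interatomic distance (including periodic images) has already been computed elsewhere are illustrative choices, not the CDVAE implementation.

```python
def structural_reward(min_distance, volume, lengths, angles,
                      d_min=2.0, v_min=4.0, l_min=1.1, angle_range=(20.0, 160.0)):
    """Return 1.0 if every geometric constraint holds, else 0.0.

    min_distance: smallest interatomic distance in angstroms (precomputed, assumed).
    volume: cell volume in cubic angstroms.
    lengths: (a, b, c) in angstroms; angles: (alpha, beta, gamma) in degrees.
    """
    ok = (
        min_distance > d_min
        and volume > v_min
        and all(x > l_min for x in lengths)
        and all(angle_range[0] <= a <= angle_range[1] for a in angles)
    )
    return 1.0 if ok else 0.0


# Made-up cubic cell with a = 4.2 angstroms and nearest-neighbour distance a / 2.
good = structural_reward(2.1, 4.2 ** 3, (4.2, 4.2, 4.2), (90.0, 90.0, 90.0))
# Same cell with one angle outside the allowed interval.
bad = structural_reward(2.1, 4.2 ** 3, (4.2, 4.2, 4.2), (90.0, 90.0, 170.0))
```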

The corresponding reward components are defined as:

$$R_{\text{structural}} = \mathbf{1}_{\{\text{all geometric constraints met}\}} \tag{3}$$

$$R_{\text{chemical}} = \mathbf{1}_{\{\text{charge neutrality and oxidation state plausible}\}} \tag{4}$$

Both are binary indicators, yielding $1$ when the condition holds and $0$ otherwise\. They provide immediate, interpretable feedback on basic crystal quality\.
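The charge-neutrality part of the chemical indicator can be illustrated with a brute-force search over oxidation states. This is a simplified stand-in, not SMACT: SMACT applies further checks that are not reproduced here, and the small oxidation-state table below is a hypothetical lookup written for this example. The CaCO3 counts match the balance shown in Table 5.

```python
from itertools import product


def is_charge_neutral(counts, oxidation_states):
    """Return True if one oxidation state per element can sum to zero charge.

    counts: atoms per element, e.g. {"Ca": 2, "C": 2, "O": 6}.
    oxidation_states: candidate states per element (hypothetical lookup table).
    """
    elements = list(counts)
    for combo in product(*(oxidation_states[e] for e in elements)):
        if sum(q * counts[e] for q, e in zip(combo, elements)) == 0:
            return True
    return False


def chemical_reward(counts, oxidation_states):
    """Binary indicator built on the simplified neutrality check above."""
    return 1.0 if is_charge_neutral(counts, oxidation_states) else 0.0


# Hypothetical candidate oxidation states for the elements used below.
STATES = {"Ca": [2], "C": [-4, 2, 4], "O": [-2]}

calcite = chemical_reward({"Ca": 2, "C": 2, "O": 6}, STATES)  # 2*(+2) + 2*(+4) + 6*(-2) = 0
unbalanced = chemical_reward({"Ca": 1, "O": 2}, STATES)       # +2 - 4 = -2
```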

#### Energy and Thermodynamic Stability\.

The primary stability metric is the energy above the convex hull $E_{\text{hull}}$ \(eV/atom\), computed via a surrogate MLIP \(MatterSim\) during training and verified by DFT post\-hoc\. A structure is considered stable if $E_{\text{hull}} < 0.016$ eV/atom, following the Materials Project \(Jain et al\. \[[2013](https://arxiv.org/html/2605.14344#bib.bib27)\]\) convention\.

A raw negative energy reward \($-E_{\text{hull}}$\) suffers from three drawbacks: \(i\) it cannot provide a signal when the MLIP fails to produce a valid $E_{\text{hull}}$ \(e\.g\., for highly distorted structures\); \(ii\) its unbounded range leads to unstable training; and \(iii\) its gradient is small, offering insufficient sensitivity near the optimum\. We therefore design a bounded, smooth, and sensitive reward function:

$$R_{\text{stability}} = \begin{cases} 1 - \dfrac{1}{2E_0} E_{\text{hull}}, & E_{\text{hull}} \leq E_0 \\[6pt] \dfrac{E_0}{2E_{\text{hull}}}, & E_{\text{hull}} \geq E_0 \end{cases}$$

where we set $E_0 = 1$ eV/atom, matching the typical scale of pre\-trained model outputs\. This design has three advantages: it is bounded in $[0, 1]$ for $E_{\text{hull}} \geq 0$, stabilizing training; it is highly sensitive when $E_{\text{hull}}$ is small \(linear slope $-\frac{1}{2E_0}$\); it provides a smooth but decaying gradient for large $E_{\text{hull}}$, preventing outlier domination while still penalizing instability\.
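A direct transcription of this piecewise definition is short enough to check by hand. The function name is illustrative; the default `e0=1.0` follows the value stated above. The two branches agree at $E_{\text{hull}} = E_0$, where both give $0.5$.

```python
def stability_reward(e_hull, e0=1.0):
    """Piecewise stability reward: linear below e0, reciprocal decay above it."""
    if e_hull <= e0:
        return 1.0 - e_hull / (2.0 * e0)
    return e0 / (2.0 * e_hull)


on_hull = stability_reward(0.0)    # 1.0
at_e0 = stability_reward(1.0)      # 0.5, where the two branches meet
far_off = stability_reward(10.0)   # 0.05
```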

#### Instruction Following\.

The model must adhere to user\-specified constraints, including target composition and space group\. The metric Composition Consistency requires the generated chemical formula to exactly match the target\. The metric Space\-group Consistency requires the generated structure to belong to the target space group, as determined by `spglib` \(Togo et al\. \[[2024](https://arxiv.org/html/2605.14344#bib.bib24)\]\)\.

For validity optimization, the instruction\-following reward contains only composition matching: a subtle change in the coordinates can change the detected space group, which makes space\-group consistency a difficult training signal\.

$$R_{\text{instruction}} = \mathbf{1}_{\{\text{composition matches}\}}$$

#### Range Constraint Reward\.

We define a bounded dense reward $R_{\text{range}}(P_{\text{generated}}, P_{\text{specified}} = [L, R]) \in [-1, 1]$ as follows\. Let $z = \frac{P_{\text{generated}} - \frac{L+R}{2}}{R - L}$\. Then:

$$R_{\text{range}} = \begin{cases} 1 - 2z^{2}, & \text{if } |z| \leq \frac{1}{\sqrt{2}} \\ e^{1 - 2z^{2}} - 1, & \text{otherwise} \end{cases}$$

The reward attains its maximum value of 1 at $z = 0$, i\.e\., $P_{\text{generated}} = \frac{L+R}{2}$ \(the center of the specified range\)\. This midpoint is chosen as the unique optimum to provide a single, unambiguous target within the interval, avoiding a flat reward plateau that would dilute learning signals\.

The reward is positive whenever $P_{\text{generated}} \in [L, R]$ \(i\.e\., $|z| \leq 0.5$\), where it is at least $0.5$\. With the definition above, it reaches zero at $|z| = \frac{1}{\sqrt{2}}$ and is negative beyond that point\. The exponential tail ensures smooth gradient information for values far outside the range\.
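Evaluating the definition at a few points makes its shape concrete. The sketch below transcribes the formula for $R_{\text{range}}$ as written; the function name and the example range of 100 to 200 are illustrative.

```python
import math


def range_reward(p, low, high):
    """Range-constraint reward: quadratic near the midpoint, exponential tail outside."""
    z = (p - 0.5 * (low + high)) / (high - low)
    if abs(z) <= 1.0 / math.sqrt(2.0):
        return 1.0 - 2.0 * z * z
    return math.exp(1.0 - 2.0 * z * z) - 1.0


center = range_reward(150.0, 100.0, 200.0)  # z = 0, maximum reward of 1.0
edge = range_reward(200.0, 100.0, 200.0)    # z = 0.5, reward 0.5
far = range_reward(300.0, 100.0, 200.0)     # z = 1.5, negative and approaching -1
```

The two branches meet at $|z| = \frac{1}{\sqrt{2}}$ with value $0$ and matching slope, so the reward is continuous and smooth there.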

#### Uniqueness, Novelty, and S\.U\.N\.

To evaluate diversity and discovery capability, we adopt three metrics\. Uniqueness is the proportion of generated structures that are distinct according to the disordered structure matcher of MatterGen \(Zeni et al\. \[[2023](https://arxiv.org/html/2605.14344#bib.bib7)\]\)\. Novelty is the proportion of generated structures not present in the training set, matched via fingerprint similarity\. S\.U\.N\. refers to structures that are simultaneously stable \($E_{\text{hull}} < 0.016$ eV/atom\), unique, and novel\. This ratio directly measures the model’s ability to discover new viable materials\. These metrics are computed after DFT verification; they are not used as rewards during RL\.
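Once the per-structure flags are known, the S.U.N. ratio is a simple count. The sketch below assumes the uniqueness and novelty flags were already produced by a structure matcher and that the ratio is taken over all generated structures; the record layout and function name are hypothetical.

```python
def sun_ratio(records, threshold=0.016):
    """Fraction of generated structures that are stable, unique, and novel.

    Each record is a dict with "e_hull" (eV/atom) and boolean "unique" and "novel"
    flags, assumed to be precomputed by a structure matcher.
    """
    if not records:
        return 0.0
    hits = sum(
        1 for r in records
        if r["e_hull"] < threshold and r["unique"] and r["novel"]
    )
    return hits / len(records)


# Made-up evaluation records.
records = [
    {"e_hull": 0.000, "unique": True, "novel": True},    # counts as S.U.N.
    {"e_hull": 0.010, "unique": True, "novel": False},   # stable but already known
    {"e_hull": 0.200, "unique": True, "novel": True},    # novel but unstable
    {"e_hull": 0.005, "unique": False, "novel": True},   # duplicate of another sample
]
ratio = sun_ratio(records)
```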

#### Combined Reward for RL\.

The final reward combines validity and stability with a gated mechanism:

$$R_{\text{target}} = \alpha_{\text{validity}} R_{\text{validity}} + \alpha_{\text{stability}} \mathbf{1}_{\text{validity}} R_{\text{stability}}$$

where $R_{\text{validity}} = R_{\text{instruction}} + R_{\text{structural}} + R_{\text{chemical}}$, and $\mathbf{1}_{\text{validity}}$ is the indicator that all validity components are satisfied \(i\.e\., $R_{\text{structural}} = R_{\text{chemical}} = 1$ and the composition part of $R_{\text{instruction}}$ is non\-zero\)\. Empirically, we set $\alpha_{\text{validity}} \ll \alpha_{\text{stability}}$ so that the stability reward dominates while validity terms act as a gate\. This encourages the model to first generate plausible structures and then optimize their thermodynamic stability\.
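The gating can be sketched in a few lines. The weights used here (`alpha_validity=0.1`, `alpha_stability=1.0`) are hypothetical placeholders chosen only to respect $\alpha_{\text{validity}} \ll \alpha_{\text{stability}}$; the actual values are not given in the text, and the function name is illustrative.

```python
def combined_reward(r_instruction, r_structural, r_chemical, r_stability,
                    alpha_validity=0.1, alpha_stability=1.0):
    """Gated combination: stability only counts when every validity term is satisfied."""
    r_validity = r_instruction + r_structural + r_chemical
    gate = 1.0 if (r_instruction > 0 and r_structural == 1 and r_chemical == 1) else 0.0
    return alpha_validity * r_validity + alpha_stability * gate * r_stability


# Fully valid structure: the stability term passes through the gate.
valid = combined_reward(1.0, 1.0, 1.0, r_stability=0.8)
# Structurally invalid structure: the stability term is zeroed out.
invalid = combined_reward(1.0, 0.0, 1.0, r_stability=0.8)
```

With these placeholder weights, a valid structure scores about 1.1 and an otherwise identical invalid one about 0.2, which shows how the gate keeps stability from rewarding implausible structures.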

## Appendix D Evaluate the Effect of Thinking Traces

#### Physical Properties Are Predicted Before Generation\.

To understand the relation between thinking traces and the final atomic coordinates, we compare the difference between the predicted physical properties in thinking tokens \(bond length and volume\) and the actual physical properties of the generated structure in Table [6](https://arxiv.org/html/2605.14344#A4.T6)\. The consistently low error on sites, structure volume and bond length confirms that thinking traces accurately pre\-determine physical properties, demonstrating their role as effective physical priors\. We also show qualitative examples across different space groups in Figure [7](https://arxiv.org/html/2605.14344#A4.F7.fig1)\.

Table 6: Comparison between predicted properties \(site, structure volume, bond length\) in thinking traces and actual properties of generated structures across different space groups\.

![Refer to caption](https://arxiv.org/html/2605.14344v1/assets/1.png) \(a\) Fm\-3m
![Refer to caption](https://arxiv.org/html/2605.14344v1/assets/3.png) \(b\) Fd\-3m
![Refer to caption](https://arxiv.org/html/2605.14344v1/assets/4.png) \(c\) P3m1

Figure 7:Selected generated structures\.
