Unlocking Latent Dimensions: Exploring Representations of Large-Scale X-ray Scattering Data using Variational Autoencoders

arXiv cs.LG Papers

Summary

This paper explores the use of variational autoencoders to learn latent representations of large-scale X-ray scattering data, enabling efficient data compression and analysis.

arXiv:2606.14999v1 Announce Type: new Abstract: Scientific user facilities generate X-ray scattering data faster than traditional workflows can process them. We address this challenge across two settings, offline dataset exploration and live on-the-fly analysis. We train a domain-specific attention-based Convolutional Variational Autoencoder (C-VAE) on 1.5 million X-ray scattering images to learn low-dimensional representations capturing structural variation across diverse experimental conditions. The learned latent space reveals well-organized clusters and smooth trajectories reflecting experimental progression. It further supports controlled synthetic scattering image generation across diverse structural states. When deployed without retraining, the model organizes time-resolved film formation experiments at two synchrotron facilities into interpretable latent structures. Benchmarking against DINOv3 (ViT-7B), a general-purpose vision foundation model, demonstrates that domain-specific training yields more interpretable latent organization for scattering data. Both workflows are integrated within Latent Space Explorer, a component of the MLExchange platform, supporting interactive structural exploration across archived datasets and live experiments.

Cached at: 06/16/26, 11:36 AM

# Exploring Representations of Large-Scale X-ray Scattering Data using Variational Autoencoders
Source: [https://arxiv.org/html/2606.14999](https://arxiv.org/html/2606.14999)

- Xiaoya Chong (Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley, CA, USA)
- Runbo Jiang (Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley, CA, USA)
- Wiebke Koepp (Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley, CA, USA)
- Petrus H. Zwart (Center for Advanced Mathematics for Energy Research Applications, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Berkeley Synchrotron Infrared Structural Biology program, Lawrence Berkeley National Laboratory, Berkeley, CA, USA)
- Damon English (Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley, CA, USA)
- Gregory M. Su (Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA)
- Eric Schaible (Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley, CA, USA)
- Chenhui Zhu (Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley, CA, USA)
- Mostafa Nassr (McKetta Department of Chemical Engineering, University of Texas, Austin, TX, USA)
- Noah P. Wamble (McKetta Department of Chemical Engineering, University of Texas, Austin, TX, USA)
- Kelvin Kam-Yun Li (Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Department of Chemistry, University of California, Berkeley, CA, USA)
- Jonathan M. Chan (Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley, CA, USA)
- Jose Carlos Diaz (Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; McKetta Department of Chemical Engineering, University of Texas, Austin, TX, USA)
- Cameron McKay (Maseeh Department of Civil, Architectural, and Environmental Engineering, University of Texas, Austin, TX, USA)
- Lynn Katz (Maseeh Department of Civil, Architectural, and Environmental Engineering, University of Texas, Austin, TX, USA)
- Benny Freeman (McKetta Department of Chemical Engineering, University of Texas, Austin, TX, USA)
- Guillaume Freychet (National Synchrotron Light Source II, Brookhaven National Laboratory, Upton, NY, USA; University Grenoble Alpes, CEA, Leti, F-38000 Grenoble, France)
- Yevgen Matviychuk (National Synchrotron Light Source II, Brookhaven National Laboratory, Upton, NY, USA)
- Eliot Gann (National Synchrotron Light Source II, Brookhaven National Laboratory, Upton, NY, USA)
- Daniel B. Allan (National Synchrotron Light Source II, Brookhaven National Laboratory, Upton, NY, USA)
- Benedikt Sochor (Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Deutsches Elektronen-Synchrotron DESY, Notkestr. 85, 22607 Hamburg, Germany)
- Frank Schluenzen (Deutsches Elektronen-Synchrotron DESY, Notkestr. 85, 22607 Hamburg, Germany)
- Stephan V. Roth (Deutsches Elektronen-Synchrotron DESY, Notkestr. 85, 22607 Hamburg, Germany; Department of Fibre and Polymer Technology, Royal Institute of Technology KTH, Teknikringen 34–35, Stockholm, Sweden)
- Ethan Crumlin (Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Chemical Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA)
- Dylan McReynolds (Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley, CA, USA)
- Tanny Chavez, corresponding author: tanchavez@lbl.gov (Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley, CA, USA)
- Alexander Hexemer, corresponding author: ahexemer@lbl.gov (Advanced Light Source, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Center for Advanced Mathematics for Energy Research Applications, Lawrence Berkeley National Laboratory, Berkeley, CA, USA)

###### Abstract

Scientific user facilities generate X-ray scattering data faster than traditional workflows can process them. We address this challenge across two settings, offline dataset exploration and live on-the-fly analysis. We train a domain-specific attention-based Convolutional Variational Autoencoder (C-VAE) on 1.5 million X-ray scattering images to learn low-dimensional representations capturing structural variation across diverse experimental conditions. The learned latent space reveals well-organized clusters and smooth trajectories reflecting experimental progression. It further supports controlled synthetic scattering image generation across diverse structural states. When deployed without retraining, the model organizes time-resolved film formation experiments at two synchrotron facilities into interpretable latent structures. Benchmarking against DINOv3 (ViT-7B), a general-purpose vision foundation model, demonstrates that domain-specific training yields more interpretable latent organization for scattering data. Both workflows are integrated within Latent Space Explorer, a component of the MLExchange platform, supporting interactive structural exploration across archived datasets and live experiments.

Keywords: variational autoencoder, X-ray scattering, latent space, representation learning, synchrotron, on-the-fly analysis, dimensionality reduction

## 1 Introduction

The rapid growth of data volumes at synchrotron facilities has fundamentally changed how experimental data are collected and analyzed. Detectors at Scientific User Facilities (SUFs) generate millions of high-resolution images under diverse experimental conditions, resulting in heterogeneous, high-dimensional datasets. Traditional workflows based on manual inspection or offline batch processing are increasingly insufficient to keep pace with modern data acquisition rates. As experimental workflows move toward autonomous and adaptive data collection, there is a growing need for computational models and software infrastructure capable of extracting interpretable structure from large scientific datasets in real time \[[2](https://arxiv.org/html/2606.14999#bib.bib1), [32](https://arxiv.org/html/2606.14999#bib.bib2), [30](https://arxiv.org/html/2606.14999#bib.bib28)\]. Recent work has shown that machine learning (ML) can help analyze complex X-ray data and reveal underlying dynamics \[[15](https://arxiv.org/html/2606.14999#bib.bib5)\], including classification of large diffraction datasets \[[35](https://arxiv.org/html/2606.14999#bib.bib41)\], physics-aware real-time analysis of nanodiffraction patterns \[[26](https://arxiv.org/html/2606.14999#bib.bib35)\], and closed-loop feedback control at synchrotron beamlines \[[33](https://arxiv.org/html/2606.14999#bib.bib32)\]. In particular, representation learning techniques based on deep neural networks enable the discovery of meaningful structure in high-dimensional scientific data without requiring domain experts to manually define or select the features to be analyzed.
For imaging data, architectures such as Convolutional Neural Networks (CNNs) \[[20](https://arxiv.org/html/2606.14999#bib.bib7)\] effectively capture local spatial correlations, while attention mechanisms \[[44](https://arxiv.org/html/2606.14999#bib.bib8)\] extend this capability by modeling long-range dependencies and hierarchical structure \[[10](https://arxiv.org/html/2606.14999#bib.bib9), [5](https://arxiv.org/html/2606.14999#bib.bib20)\]. Building on these advances, modern approaches aim to encode complex data into compact, low-dimensional latent representations that capture the dominant variations within a dataset.

Such latent representations provide a natural framework for organizing, visualizing, and exploring large collections of scattering data \[[19](https://arxiv.org/html/2606.14999#bib.bib3), [28](https://arxiv.org/html/2606.14999#bib.bib11)\]. Variational Autoencoders (VAEs) \[[19](https://arxiv.org/html/2606.14999#bib.bib3)\] offer a probabilistic approach to learning these structured latent spaces and have been successfully applied to a wide range of scientific imaging problems, including materials characterization and scattering experiments \[[9](https://arxiv.org/html/2606.14999#bib.bib25), [23](https://arxiv.org/html/2606.14999#bib.bib4), [18](https://arxiv.org/html/2606.14999#bib.bib6), [16](https://arxiv.org/html/2606.14999#bib.bib29), [42](https://arxiv.org/html/2606.14999#bib.bib37), [17](https://arxiv.org/html/2606.14999#bib.bib40), [3](https://arxiv.org/html/2606.14999#bib.bib46)\] and manifold-aware synthetic data generation \[[6](https://arxiv.org/html/2606.14999#bib.bib47)\]. In parallel, self-supervised vision transformer (ViT) models such as Distillation with No Labels (DINO) have demonstrated strong performance in learning transferable image representations \[[10](https://arxiv.org/html/2606.14999#bib.bib9), [31](https://arxiv.org/html/2606.14999#bib.bib19)\]. Together, these learned representations enable a broad range of downstream tasks, including structural phase tracking \[[40](https://arxiv.org/html/2606.14999#bib.bib39)\], unsupervised anomaly detection \[[39](https://arxiv.org/html/2606.14999#bib.bib42)\], segmentation \[[34](https://arxiv.org/html/2606.14999#bib.bib43), [47](https://arxiv.org/html/2606.14999#bib.bib44)\], and interactive region-of-interest selection \[[36](https://arxiv.org/html/2606.14999#bib.bib38)\].
The generative capability of learned latent spaces further enables synthetic scattering image generation, offering a route to augmenting underrepresented structural states in experimental datasets \[[49](https://arxiv.org/html/2606.14999#bib.bib45), [6](https://arxiv.org/html/2606.14999#bib.bib47), [18](https://arxiv.org/html/2606.14999#bib.bib6)\]. These representations are commonly visualized using dimensionality reduction techniques such as Principal Component Analysis (PCA) \[[27](https://arxiv.org/html/2606.14999#bib.bib10)\], Uniform Manifold Approximation and Projection (UMAP) \[[28](https://arxiv.org/html/2606.14999#bib.bib11)\], and t-distributed stochastic neighbor embedding (t-SNE) \[[43](https://arxiv.org/html/2606.14999#bib.bib12)\], enabling visualization of global relationships between observations. Clustering methods such as HDBSCAN \[[4](https://arxiv.org/html/2606.14999#bib.bib13)\] further allow identification of structurally similar patterns within large datasets. Interactive tools for navigating these learned representations have been developed across several domains, including biomedical imaging \[[22](https://arxiv.org/html/2606.14999#bib.bib30)\] and molecular discovery \[[46](https://arxiv.org/html/2606.14999#bib.bib31)\], demonstrating the value of visual analytics for high-dimensional scientific data.

Despite these advances, extracting interpretable structure from large-scale scattering datasets remains challenging in two distinct settings. In post-experiment analysis, where all data are available offline, latent representations enable interactive exploration of large experimental archives. The MLExchange platform \[[48](https://arxiv.org/html/2606.14999#bib.bib14)\] is a web-based environment for exchangeable machine learning workflows at scientific user facilities, and its Latent Space Explorer component ([https://github.com/mlexchange/mlex_latent_explorer](https://github.com/mlexchange/mlex_latent_explorer)) provides interactive dimensionality reduction, clustering, and visualization of learned embeddings for offline dataset exploration \[[8](https://arxiv.org/html/2606.14999#bib.bib16)\]. In on-the-fly analysis, data arrive continuously during an active experiment, requiring a pre-trained model to be deployed before data acquisition begins. Recent efforts have demonstrated ML-guided on-the-fly analysis in related scattering settings, including autonomous phase identification in X-ray diffraction \[[41](https://arxiv.org/html/2606.14999#bib.bib34)\] and real-time structural tracking during thin film crystallization \[[38](https://arxiv.org/html/2606.14999#bib.bib33)\]. In this setting, a key question arises: is a general-purpose vision model sufficient, or does the domain-specific character of X-ray scattering data require a model trained specifically on scattering images? Domain-specific models have shown clear advantages for scattering data; for example, a dedicated denoising model for SAXS/WAXD images outperforms general-purpose approaches by capturing scattering-specific textural features \[[50](https://arxiv.org/html/2606.14999#bib.bib36)\].
General-purpose self-supervised models such as DINOv3 \[[37](https://arxiv.org/html/2606.14999#bib.bib18)\], a family of large vision foundation models of which we use the ViT-7B variant, learn powerful representations from large natural image collections, but may not capture the structural variations most relevant for scattering analysis. DINOv3 builds on the DINO framework \[[31](https://arxiv.org/html/2606.14999#bib.bib19), [5](https://arxiv.org/html/2606.14999#bib.bib20)\], in which a student network is trained to match the outputs of a momentum-based teacher network using multiple augmented views of each image \[[5](https://arxiv.org/html/2606.14999#bib.bib20)\]. This strategy produces powerful general-purpose visual features without requiring labeled data.

![Refer to caption](https://arxiv.org/html/2606.14999v1/figures/Introduction_fig_compressed.jpg)

Figure 1: C-VAE scattering data analysis pipeline. (1) A large-scale archive of 1.5 million X-ray scattering images collected at the Advanced Light Source (ALS) serves as the training dataset. (2) A domain-specific attention-based convolutional variational autoencoder (C-VAE) with windowed self-attention is trained on this archive using 80 NVIDIA A100 GPUs at NERSC Perlmutter, learning a 512-dimensional latent representation of the scattering data. (3) The learned latent space organizes scattering patterns into structured clusters with smooth trajectories reflecting experimental progression, visualized here via UMAP projection. (4) The pre-trained C-VAE model (123M parameters, 512-dimensional latent space) is deployed without retraining across downstream applications. (5a) Offline post-experiment analysis: archived datasets from multiple facilities are interactively explored through the Latent Space Explorer interface, supporting clustering, latent trajectory analysis, and cluster-to-pattern inspection. (5b) On-the-fly live analysis: detector images streamed during active experiments at ALS beamline 7.3.3 and the NSLS-II SMI beamline are embedded in real time, enabling immediate structural insights and trajectory monitoring. (5c) Synthetic scattering image generation via UMAP-guided PCA sampling and conditional flow matching. (5d) Benchmarking against DINOv3, a general-purpose vision foundation model, to evaluate the benefit of domain-specific training for scattering data analysis.

In this work, we address both settings by training a domain-specific attention-based convolutional variational autoencoder (C-VAE) on 1.5 million historical X-ray scattering images collected at the Advanced Light Source. We first demonstrate that the C-VAE learns a well-organized latent space from this historical dataset, in which clusters correspond to distinct scattering regimes and trajectories reflect experimental progression.
This pre-trained model is then deployed for on-the-fly analysis of previously unseen experiments at two synchrotron facilities. To directly evaluate the benefit of domain-specific training, we benchmark the C-VAE against DINOv3 (ViT-7B) \[[37](https://arxiv.org/html/2606.14999#bib.bib18)\] on the same on-the-fly data, testing whether a model trained on scattering images captures more interpretable latent structure than a large general-purpose vision model. Both workflows are integrated within the Latent Space Explorer component of the MLExchange platform, supporting interactive exploration during both offline post-experiment analysis and live experimental sessions.

This work makes the following contributions:

- Development of an attention-based convolutional variational autoencoder (C-VAE) for learning compact, domain-specific latent representations of large-scale X-ray scattering datasets.
- Large-scale latent representation learning from 1.5 million scattering images collected under diverse experimental conditions, demonstrating that the learned latent space captures structured scattering regimes, experimental trajectories, and continuous morphological transitions.
- Deployment of the pre-trained C-VAE for on-the-fly analysis at two synchrotron facilities, with a systematic benchmark against DINOv3 (ViT-7B) \[[37](https://arxiv.org/html/2606.14999#bib.bib18)\] as a general-purpose baseline, providing evidence for the advantage of domain-specific training for scattering data.
- Development and comparison of two latent space synthetic scattering image generation strategies: UMAP-guided PCA sampling and conditional flow matching. Flow matching achieves superior conditioning fidelity while both strategies produce physically realistic on-manifold outputs across diverse scattering regimes.
- Integration of representation learning and interactive visualization into the MLExchange platform through Latent Space Explorer, supporting real-time embedding visualization during on-the-fly analysis as well as offline analysis of scattering data collected across multiple experimental facilities.

## 2 Methods

### 2.1 C-VAE Architecture

X-ray scattering images contain rich structural information that varies across samples, measurement geometries, and experimental conditions. Extracting compact and meaningful representations from such data requires models that capture both local spatial features and longer-range structural patterns. We employ a Variational Autoencoder \[[19](https://arxiv.org/html/2606.14999#bib.bib3)\] to learn compact latent representations of X-ray scattering images. VAEs provide a probabilistic framework that enforces a structured and continuous latent space, making them well-suited for downstream tasks such as clustering, visualization, and exploration. The resulting latent space captures dominant structural variations across experiments and supports analysis of structural relationships between scattering patterns.

The C-VAE is a convolutional architecture augmented with localized self-attention mechanisms. CNNs \[[20](https://arxiv.org/html/2606.14999#bib.bib7)\] are effective for modeling local spatial structure in image data, while attention mechanisms \[[44](https://arxiv.org/html/2606.14999#bib.bib8)\] enable modeling of longer-range dependencies and hierarchical relationships. In the encoder, convolutional down-sampling extracts multiscale spatial features, while attention modules refine contextual relationships within local neighborhoods. These window-based attention blocks, inspired by Swin-style windowed attention \[[25](https://arxiv.org/html/2606.14999#bib.bib23), [10](https://arxiv.org/html/2606.14999#bib.bib9)\], allow the model to capture spatial correlations while remaining computationally efficient for large images. This hybrid design is particularly well-suited for scattering data, where both localized features (e.g., peaks) and global structural patterns (e.g., symmetry and orientation) are important.

Given an input image $x \in \mathbb{R}^{1 \times H \times W}$, the encoder consists of a hierarchy of strided convolution layers interleaved with localized attention blocks. Each attention block partitions the feature map into non-overlapping windows of size $w \times w$ and applies self-attention independently within each window. For a window $X_w \in \mathbb{R}^{N \times d}$, where $N = w^2$ is the number of tokens in the window and $d$ is the feature dimension, self-attention is computed as:

$$\mathrm{Attn}(X_w) = \mathrm{softmax}\!\left(\frac{Q_w K_w^{\top}}{\sqrt{d_k}}\right) V_w, \tag{1}$$

where

$$Q_w = X_w W_q, \qquad K_w = X_w W_k, \qquad V_w = X_w W_v \tag{2}$$

and $W_q, W_k \in \mathbb{R}^{d \times d_k}$, $W_v \in \mathbb{R}^{d \times d_v}$ are learnable projection matrices. The encoded feature tensor is flattened and mapped to the mean and log-variance of a diagonal Gaussian latent distribution. Sampling in the latent space is performed using the reparameterization trick:

$$z = \mu + \sigma \odot \varepsilon, \qquad \sigma = \exp\!\left(\tfrac{1}{2}\log\sigma^{2}\right), \qquad \varepsilon \sim \mathcal{N}(0, I), \tag{3}$$

where $z$ is the sampled latent vector, $\mu$ and $\sigma$ are the mean and standard deviation of the latent distribution, $\varepsilon$ is a random variable drawn from a standard normal distribution, and $\odot$ denotes element-wise multiplication. This operation enables backpropagation through stochastic sampling by expressing the latent variable as a deterministic function of $\mu$, $\sigma$, and random noise. The decoder mirrors the encoder through transposed convolution layers and symmetric windowed-attention blocks, progressively reconstructing spatial structure from the latent vector. The overall model architecture is shown in Figure [2](https://arxiv.org/html/2606.14999#S2.F2).
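The window partitioning, Eqs. (1) and (2), and the reparameterization step in Eq. (3) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the toy feature-map size, the window size `w = 4`, the projection widths, and the helper names `window_partition`, `window_attention`, and `reparameterize` are all assumptions chosen for illustration, and the random matrices stand in for learned weights.

```python
import numpy as np

def window_partition(feat, w):
    """Split a (H, W, d) feature map into non-overlapping w x w windows.

    Returns an array of shape (num_windows, w*w, d)."""
    H, W, d = feat.shape
    assert H % w == 0 and W % w == 0, "H and W must be divisible by w"
    x = feat.reshape(H // w, w, W // w, w, d)
    x = x.transpose(0, 2, 1, 3, 4)            # (H/w, W/w, w, w, d)
    return x.reshape(-1, w * w, d)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(windows, Wq, Wk, Wv):
    """Eqs. (1)-(2): scaled dot-product attention inside each window."""
    Q = windows @ Wq                          # (nW, N, d_k)
    K = windows @ Wk                          # (nW, N, d_k)
    V = windows @ Wv                          # (nW, N, d_v)
    d_k = Wq.shape[1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (nW, N, N)
    return softmax(scores, axis=-1) @ V

def reparameterize(mu, log_var, rng):
    """Eq. (3): z = mu + sigma * eps, with sigma = exp(0.5 * log_var)."""
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 16))        # toy feature map (assumed size)
w, d, d_k, d_v = 4, 16, 8, 16                 # illustrative dimensions only
Wq = rng.standard_normal((d, d_k))            # random stand-ins for learned weights
Wk = rng.standard_normal((d, d_k))
Wv = rng.standard_normal((d, d_v))

windows = window_partition(feat, w)           # 4 windows of N = 16 tokens each
out = window_attention(windows, Wq, Wk, Wv)
z = reparameterize(np.zeros(512), np.zeros(512), rng)  # 512-D latent sample
```

Because attention is computed per window, the score matrix has size $N \times N$ with $N = w^2$ rather than $(HW) \times (HW)$, which is what keeps the cost manageable for large images.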

![Refer to caption](https://arxiv.org/html/2606.14999v1/x1.png)

Figure 2: Architecture of the convolutional VAE with windowed attention blocks. Conv2D/ConvT2D denote strided and transposed convolution; FC denotes fully connected layers.

The objective is to analyze the learned latent representations rather than to maximize reconstruction fidelity. Although the decoder reconstructs scattering images from the latent vector, the encoder embeddings provide the main representation used for downstream analysis. These latent embeddings can be used for clustering, visualization, and exploration of relationships between scattering patterns.

### 2.2 Model Training

This study uses a dataset of approximately 1.5 million X-ray scattering images collected over the past decade at the Advanced Light Source (ALS). The images are a randomized subset of a larger archive containing more than two million images acquired at the SAXS/WAXS beamline 7.3.3 \[[12](https://arxiv.org/html/2606.14999#bib.bib22)\], and each has a resolution of $1475 \times 1679$ pixels. Prior to model input, images are resized to $512 \times 512$ pixels to match the fixed input dimensions of the C-VAE architecture. The dataset spans a wide range of material systems including crystalline powders, liquid crystals, amorphous materials, and thin films, measured in transmission and grazing-incidence geometries. All images were recorded using a PILATUS3 2M detector composed of 24 modules arranged in three columns of eight vertically stacked units. Due to the diversity of samples and experimental conditions, the dataset contains a broad spectrum of scattering signatures, including isotropic rings, sharp diffraction peaks, streaked patterns, and other complex structures associated with different forms of material organization.

Model training was performed on the National Energy Research Scientific Computing Center (NERSC) Perlmutter supercomputer \[[29](https://arxiv.org/html/2606.14999#bib.bib21)\]. We used 20 compute nodes, each equipped with four NVIDIA A100 GPUs (80 GB), for a total of 80 GPUs. Distributed training was implemented using PyTorch's Distributed Data Parallel (DDP) framework ([https://docs.pytorch.org/tutorials/intermediate/ddp_tutorial.html](https://docs.pytorch.org/tutorials/intermediate/ddp_tutorial.html)) to synchronize model updates across GPUs and enable efficient multi-node scaling. Job scheduling and resource allocation were managed using the Slurm workload manager. Under this configuration, training on 1.5 million images required approximately 48 hours.

All models were trained using the Adam optimizer together with a cosine annealing learning rate scheduler\. Input images were normalized prior to training, and distributed data samplers were used to ensure balanced workloads across GPUs\. To improve computational efficiency and reduce GPU memory usage, automatic mixed precision \(AMP\) was enabled during training\. Additional training details, including hyperparameter settings and architecture specifications, are provided in Supplementary Note 3\.
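As a rough illustration of the cosine annealing schedule mentioned above, the sketch below evaluates the standard closed-form cosine decay (without restarts). The peak learning rate and step count are hypothetical placeholders, since the actual hyperparameters are deferred to Supplementary Note 3; only the GPU count is taken from the text.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Closed-form cosine annealing: decays smoothly from lr_max to lr_min."""
    cos_term = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos_term)

# Hypothetical values, NOT from the paper (see Supplementary Note 3 there).
LR_MAX = 1e-4
TOTAL_STEPS = 1000

schedule = [cosine_annealing_lr(s, TOTAL_STEPS, LR_MAX)
            for s in range(TOTAL_STEPS + 1)]

# Resource arithmetic stated in the text: 20 nodes with 4 GPUs each.
total_gpus = 20 * 4
```

The schedule starts at `LR_MAX`, passes through half of it at the midpoint, and reaches `lr_min` at the final step, which avoids the abrupt drops of step-wise decay.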

### 2.3 Latent Manifold Sampling for Synthetic Data Generation

The trained C\-VAE decoder can generate scattering images from arbitrary points in the learned latent space, enabling controlled sampling of the scattering phase space\. Two complementary strategies are implemented for this purpose, both conditioned on a two\-dimensional UMAP coordinate and producing a 512\-dimensional latent vector that is subsequently decoded by the trained C\-VAE decoder to yield a synthetic scattering image\.

In the first strategy, UMAP-guided PCA sampling, cluster-aware PCA models are fitted to the subset of core latent vectors whose HDBSCAN cluster membership strength exceeds 0.8. For each cluster, the centroid is computed in the two-dimensional UMAP embedding using these core samples. To generate a new sample, Euclidean distances from a query UMAP coordinate to all cluster centroids are computed and converted into a probability distribution over clusters using a temperature-controlled top-$k$ softmax weighting scheme, which assigns higher probability to nearby clusters while allowing limited contributions from neighbouring clusters. In our implementation we use $k = 2$ and a temperature of $T = 0.1$. Intra-cluster variability is modelled using PCA fitted to the core latent vectors of each cluster. New latent candidates are sampled by drawing Gaussian noise in the PCA coordinate space and inverse-transforming to the full latent space. To reduce out-of-distribution artefacts, per-dimension clamping is applied, restricting each latent dimension to the interval $\mu \pm 2.5\sigma$, where $\mu$ and $\sigma$ denote the global mean and standard deviation of the core latent vectors. Full pipeline details are provided in Supplementary Note 2.1.
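A minimal NumPy sketch of this sampling procedure follows, under several stated assumptions: the softmax is taken over negative distances divided by the temperature, a single cluster is drawn from the resulting distribution, PCA is computed via SVD, and the Gaussian noise is scaled by each component's standard deviation. The paper's exact choices are in its Supplementary Note 2.1; the toy clusters, the 16-dimensional latent size, and all function names here are hypothetical.

```python
import numpy as np

def topk_softmax_weights(query_xy, centroids, k=2, temperature=0.1):
    """Probability over clusters from distances to UMAP centroids.

    Only the k nearest clusters receive non-zero probability."""
    dists = np.linalg.norm(centroids - query_xy, axis=1)
    nearest = np.argsort(dists)[:k]
    logits = -dists[nearest] / temperature    # assumed form of the weighting
    logits -= logits.max()                    # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    weights = np.zeros(len(centroids))
    weights[nearest] = p
    return weights

def fit_pca(latents, n_components):
    """PCA via SVD; returns mean, components, and per-component std."""
    mean = latents.mean(axis=0)
    _, s, vt = np.linalg.svd(latents - mean, full_matrices=False)
    std = s[:n_components] / np.sqrt(len(latents) - 1)
    return mean, vt[:n_components], std

def sample_latent(query_xy, centroids, pcas, lo, hi, rng, k=2, temperature=0.1):
    weights = topk_softmax_weights(query_xy, centroids, k, temperature)
    c = rng.choice(len(centroids), p=weights)       # draw one cluster (assumption)
    mean, comps, std = pcas[c]
    coords = rng.standard_normal(len(std)) * std    # Gaussian noise in PCA space
    z = mean + coords @ comps                       # inverse transform to latent space
    return np.clip(z, lo, hi)                       # per-dimension clamp

rng = np.random.default_rng(0)
latent_dim = 16                                     # toy size; the paper uses 512
core = [rng.standard_normal((200, latent_dim)) + 5.0 * i for i in range(3)]
centroids = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])  # toy UMAP centroids
pcas = [fit_pca(c, n_components=4) for c in core]

all_core = np.vstack(core)
mu, sigma = all_core.mean(axis=0), all_core.std(axis=0)
lo, hi = mu - 2.5 * sigma, mu + 2.5 * sigma         # global mu +/- 2.5 sigma bounds

query = np.array([0.2, 0.0])
weights = topk_softmax_weights(query, centroids)
z = sample_latent(query, centroids, pcas, lo, hi, rng)
```

With a low temperature such as $T = 0.1$, nearly all of the probability mass goes to the nearest centroid, so neighbouring clusters contribute only when the query sits close to a boundary between them.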

In the second strategy, conditional flow matching, a dedicated velocity network is trained to learn a continuous normalizing flow \[[24](https://arxiv.org/html/2606.14999#bib.bib49)\] from a standard Gaussian prior $\mathcal{N}(\mathbf{0}, \mathbf{I})$ to the training latent distribution, conditioned on the two-dimensional UMAP coordinate of the desired scattering state. The velocity network consists of six residual blocks with adaptive layer normalization, conditioned on a fused embedding of a sinusoidal time encoding and a learned UMAP encoder, both projected to a 256-dimensional conditioning space. To support conditioning strength control at inference, classifier-free guidance \[[14](https://arxiv.org/html/2606.14999#bib.bib50)\] is incorporated during training by randomly zeroing the UMAP conditioning with probability $p = 0.15$, training the network simultaneously as a conditional and unconditional model. At inference, a guided velocity is constructed as a weighted combination of conditional and unconditional predictions, with guidance scale $s = 5.0$, and the resulting ordinary differential equation is integrated over 20 Euler steps to produce the synthetic latent vector. The model was trained on high-confidence cluster members with HDBSCAN membership strength exceeding 0.5, using the AdamW optimiser with a learning rate of $3 \times 10^{-4}$ and weight decay of $10^{-4}$, combined with a cosine annealing learning rate scheduler. Full architecture and training details are provided in Supplementary Note 2.2.
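The inference step can be sketched as follows. The text does not spell out the exact weighted combination, so this sketch assumes the common classifier-free guidance form $v = v_{\text{uncond}} + s\,(v_{\text{cond}} - v_{\text{uncond}})$. The function `toy_velocity` is a hypothetical stand-in for the trained velocity network (it simply flows along a straight line toward a target), and the 4-dimensional latent is a toy size.

```python
import numpy as np

def guided_velocity(v_fn, z, t, cond, scale=5.0):
    """Classifier-free guidance (assumed form): extrapolate from the
    unconditional prediction toward the conditional one."""
    v_uncond = v_fn(z, t, None)
    v_cond = v_fn(z, t, cond)
    return v_uncond + scale * (v_cond - v_uncond)

def euler_sample(v_fn, z0, cond, n_steps=20, scale=5.0):
    """Integrate dz/dt = v(z, t) from t = 0 to t = 1 with fixed Euler steps."""
    z = z0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        z = z + dt * guided_velocity(v_fn, z, i * dt, cond, scale)
    return z

def toy_velocity(z, t, cond):
    """Hypothetical stand-in for the trained network: a straight-line flow
    toward a target that depends on the 2-D condition (zeros if dropped)."""
    if cond is None:
        target = np.zeros_like(z)
    else:
        target = np.tile(cond, len(z) // len(cond))
    return (target - z) / (1.0 - t)

rng = np.random.default_rng(0)
z0 = rng.standard_normal(4)                  # sample from the N(0, I) prior
umap_xy = np.array([0.5, -1.0])              # query UMAP coordinate (toy value)

z_plain = euler_sample(toy_velocity, z0, umap_xy, scale=1.0)   # no extrapolation
z_guided = euler_sample(toy_velocity, z0, umap_xy, scale=5.0)  # s = 5.0 as in the text
```

With `scale=1.0` the guided velocity reduces to the conditional prediction; larger scales push the sample further from the unconditional solution, which is how the guidance scale controls conditioning strength.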

### 2.4 Software Infrastructure

The Latent Space Explorer application is integrated within the MLExchange platform and shares infrastructure with several web-based applications for workflow orchestration, data management, and visualization, as shown in Figure [3](https://arxiv.org/html/2606.14999#S2.F3). It extends interactive latent space exploration to large-scale X-ray scattering datasets, integrating domain-specific model training, distributed workflow execution, and real-time experimental data streams within a unified environment.

![Refer to caption](https://arxiv.org/html/2606.14999v1/x2.png)

Figure 3: Software architecture diagram of Latent Space Explorer within the MLExchange platform. The system integrates workflow orchestration (Prefect), data management (Tiled), and model tracking (MLflow, Data Clinic).

#### Data access

Latent Space Explorer supports data access from both the local file system and Tiled ([https://blueskyproject.io/tiled/](https://blueskyproject.io/tiled/)). Tiled is a data access service within the Bluesky ecosystem \[[1](https://arxiv.org/html/2606.14999#bib.bib15)\] that provides HTTP-based access to scientific datasets. It is widely used at synchrotron facilities to manage and serve experimental data generated during beamline operations. Data loading from these sources is enabled through an internally developed tool called File Manager ([https://github.com/mlexchange/mlex_file_manager](https://github.com/mlexchange/mlex_file_manager)) \[[7](https://arxiv.org/html/2606.14999#bib.bib48)\], which provides a unified interface for seamless data integration across these sources. File system access is currently restricted to common image formats including PNG, JPEG, and TIF files. In contrast, Tiled supports a broader range of dataset types through native compatibility and customizable ingestion mechanisms, allowing it to accommodate structured experimental data produced by beamline instruments.

For results data, Latent Space Explorer writes and retrieves outputs exclusively through Tiled. Generated results are cataloged using a human-readable hashed identifier derived from the dataset name, enabling the system to handle multiple datasets or directories consistently.

#### Job management

Formerly, the execution of ML jobs was managed by MLExCompute, an internally developed orchestration system within MLExchange \[[48](https://arxiv.org/html/2606.14999#bib.bib14),[7](https://arxiv.org/html/2606.14999#bib.bib48)\]\. This system has since been superseded by workflow orchestration based on Prefect \([https://docs\.prefect\.io/v3/get\-started](https://docs.prefect.io/v3/get-started)\)\. Prefect is a workflow management system that enables distributed execution of computational tasks while maintaining clear dependencies between processing stages\. In the current setup, ML job requests are handled through a central \(parent\) Prefect flow that packages the job according to user\-specified parameters\. Depending on the task, this flow may trigger multiple subordinate workflows \(subflows\)\. For example, during model training two subflows are typically executed: one responsible for the training stage and another responsible for performing partial inference on a subset of the dataset\. Job execution is carried out by Prefect workers, which support multiple execution backends including Docker, Podman, Slurm, and Conda environments\.
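The parent/subflow pattern described above can be sketched in plain Python\. This is not the MLExchange implementation and does not use the Prefect API; in Prefect, functions like these would typically be declared as flows\. All names \(`JobRequest`, `train_subflow`, `partial_inference_subflow`, `parent_flow`\) and parameters are hypothetical\.

```python
from dataclasses import dataclass, field


@dataclass
class JobRequest:
    task: str                                   # e.g. "train" or "inference"
    params: dict = field(default_factory=dict)  # user-specified parameters


def train_subflow(params: dict) -> str:
    # Stand-in for the training stage; returns a label for the trained model.
    return f"trained(epochs={params.get('epochs', 1)})"


def partial_inference_subflow(model: str, fraction: float) -> str:
    # Stand-in for inference on a subset of the dataset.
    return f"embedded {fraction:.0%} of data with {model}"


def parent_flow(request: JobRequest) -> list:
    """Package the request and trigger the subflows it requires, in order."""
    results = []
    if request.task == "train":
        model = train_subflow(request.params)
        results.append(model)
        subset = request.params.get("subset", 0.1)
        results.append(partial_inference_subflow(model, subset))
    elif request.task == "inference":
        results.append(partial_inference_subflow(request.params["model"], 1.0))
    else:
        raise ValueError(f"unknown task: {request.task}")
    return results


outputs = parent_flow(JobRequest("train", {"epochs": 5, "subset": 0.1}))
```

Keeping the dispatch logic in a single parent makes the dependency explicit: partial inference only starts once the training subflow has returned a model\.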

#### User interface

The Latent Space Explorer application provides an interactive environment for exploring the latent space representations of scientific datasets, as shown in Figure [4](https://arxiv.org/html/2606.14999#S2.F4)\. Through the web interface, users can load datasets, configure analysis parameters, and execute dimensionality reduction workflows\. The interface supports PCA and UMAP, which can be applied to selected datasets with configurable parameters\. Once dimensionality reduction has been completed, the resulting embeddings can be visualized in a two\- or three\-dimensional latent space\. Users may interactively select regions of interest within this space to investigate groups of similar scattering patterns\. Statistical summaries such as the mean and standard deviation of selected data points can then be computed and visualized\. In addition, clustering algorithms such as DBSCAN and HDBSCAN can be applied to identify groups of structurally related scattering patterns\.
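The region\-of\-interest summary can be illustrated with a minimal sketch: points inside a rectangular selection of the 2\-D embedding are collected, and the pixel\-wise mean and standard deviation of the corresponding images are computed\. The rectangular selection, the function names, and the use of the population standard deviation are assumptions; the interface may use other selection shapes or estimators\.

```python
from statistics import fmean, pstdev


def select_region(points, x_range, y_range):
    """Return indices of 2-D embedding points inside a rectangular selection."""
    (x0, x1), (y0, y1) = x_range, y_range
    return [i for i, (x, y) in enumerate(points)
            if x0 <= x <= x1 and y0 <= y <= y1]


def region_summary(images, indices):
    """Pixel-wise mean and population standard deviation over selected images."""
    selected = [images[i] for i in indices]
    pixels = list(zip(*selected))  # transpose: one tuple per pixel position
    return [fmean(p) for p in pixels], [pstdev(p) for p in pixels]


# Toy data: three embedded points, each backed by a flattened 2-pixel "image".
points = [(0.1, 0.2), (0.15, 0.25), (3.0, 3.0)]
images = [[1.0, 10.0], [3.0, 10.0], [100.0, 0.0]]

idx = select_region(points, (0.0, 1.0), (0.0, 1.0))
mean_img, std_img = region_summary(images, idx)
```

In this toy case the first two points fall inside the selection, so the summary reflects only their images and ignores the distant third point\.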

![Refer to caption](https://arxiv.org/html/2606.14999v1/figures/LSE_interface.png) \(a\) User interface of Latent Space Explorer\. The left sidebar enables users to select datasets, configure parameters, and initiate dimensionality reduction and clustering algorithms\. The right panel presents an overview of the selected dataset and supports interactive exploration of the latent space, including the visualization of statistical summaries\.
![Refer to caption](https://arxiv.org/html/2606.14999v1/figures/LSE_live_mode.png) \(b\) Live mode operation in Latent Space Explorer\. Dataset information and corresponding feature vectors are streamed in real time through a websocket connection\. Users can visualize and interact with regions of interest directly within the plot\. The frame index corresponds to the chronological order of incoming data\. Live mode is activated by clicking the Live button located in the top\-right bar of the interface\.

Figure 4: The Latent Space Explorer interface in standard and live modes\.

Optionally, users may incorporate latent embeddings generated by pretrained autoencoders developed within the Data Clinic \[[8](https://arxiv.org/html/2606.14999#bib.bib16)\] application of MLExchange\. Data Clinic offers a web\-based interface for interactively training and evaluating tunable autoencoders on scientific datasets\. However, through integration with MLflow, users are not limited to models developed within Data Clinic and can register and use more complex or externally trained models\.

#### Model registration

To enable seamless exchange of trained autoencoders across different components of the MLExchange platform, an MLflow server \([https://mlflow\.org](https://mlflow.org/)\) is used to register trained models together with their associated training parameters \[[45](https://arxiv.org/html/2606.14999#bib.bib17)\]\. This approach ensures versioned storage of trained models, enabling reproducibility and consistent reuse across multiple applications\. Once a model has been trained and registered with the ALS MLflow server \([https://mlflow\.computing\.als\.lbl\.gov](https://mlflow.computing.als.lbl.gov/)\), it becomes available within the Latent Space Explorer interface\. Users can select a registered model through a drop\-down menu and apply the corresponding encoder to generate latent embeddings prior to dimensionality reduction or clustering\.

#### On\-the\-fly capabilities

Latent Space Explorer supports on\-the\-fly capabilities for real\-time data embedding, enabling visualization of experimental data as it is generated\. The inference times for the autoencoder and the dimensionality reduction model are approximately 0\.04 s and 0\.005 s, respectively, resulting in a total processing time of about 0\.05 s per sample\. Additional compute resources can be allocated to scale inference to faster acquisition rates\. To support streaming data processing, we use ArroyoPy \([https://pypi\.org/project/arroyopy/](https://pypi.org/project/arroyopy/)\), a framework that integrates with messaging systems and supports flexible composition of processing steps\. ML models are executed directly within the streaming pipeline, and the resulting latent embeddings are projected using a pre\-fitted dimensionality reduction model before being transmitted to the frontend\.
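The timing figures above translate into a simple throughput estimate\. The sketch below uses the two per\-sample times quoted in the text and assumes ideal linear scaling across parallel workers, which is an idealization: transfer, queuing, and rendering overheads are ignored\.

```python
import math

ENCODER_S = 0.04   # autoencoder inference time per sample (from the text)
REDUCER_S = 0.005  # dimensionality-reduction time per sample (from the text)


def per_sample_latency() -> float:
    """Total processing time per sample in seconds (about 0.05 s)."""
    return ENCODER_S + REDUCER_S


def max_frame_rate(workers: int = 1) -> float:
    """Sustained frames per second, assuming ideal linear scaling."""
    return workers / per_sample_latency()


def workers_needed(acquisition_hz: float) -> int:
    """Smallest number of parallel workers that keeps up with the detector."""
    return math.ceil(acquisition_hz * per_sample_latency())
```

Under these assumptions a single worker sustains roughly 22 frames per second, and a detector running at 100 Hz would need about five workers\.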

From the perspective of the web interface, a websocket connection continuously listens for incoming messages containing a Tiled URI together with the corresponding two\-dimensional feature vector\. These vectors are dynamically aggregated and rendered in the frontend, enabling users to observe the evolving latent space in real time\. Users can interact directly with this visualization by selecting regions of interest and computing statistical summaries\. In this real\-time workflow, data retrieval is performed exclusively through the Tiled data service, ensuring consistent access to experimental datasets generated during beamline operations\.
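The aggregation step on the receiving side can be sketched as follows\. The JSON message layout, the field names `tiled_uri` and `feature_vector`, and the example URI are hypothetical; the text only states that each message carries a Tiled URI and a two\-dimensional feature vector\. The websocket transport itself is omitted\.

```python
import json


class LiveEmbeddingBuffer:
    """Accumulates streamed 2-D feature vectors in arrival order."""

    def __init__(self):
        self.uris = []
        self.points = []

    def on_message(self, raw: str) -> int:
        """Parse one message and return the frame index assigned to it."""
        msg = json.loads(raw)
        x, y = msg["feature_vector"]        # hypothetical field name
        self.uris.append(msg["tiled_uri"])  # hypothetical field name
        self.points.append((float(x), float(y)))
        return len(self.points) - 1


buffer = LiveEmbeddingBuffer()
first = buffer.on_message(
    '{"tiled_uri": "http://example.org/frames/0", "feature_vector": [0.1, -0.3]}'
)
second = buffer.on_message(
    '{"tiled_uri": "http://example.org/frames/1", "feature_vector": [0.2, -0.1]}'
)
```

Storing the URI next to each point means the frontend only needs to fetch the underlying image from the data service when a user actually selects that point\.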

## 3 Results

The results are organized to reflect the two deployment settings introduced in Section[1](https://arxiv.org/html/2606.14999#S1)\. We first characterize the latent space learned from the historical training dataset \(Section[3\.1](https://arxiv.org/html/2606.14999#S3.SS1)\), demonstrating that the C\-VAE produces a well\-organized representation across 1\.5 million scattering images\. We then deploy this pre\-trained model for on\-the\-fly analysis of previously unseen scattering data collected during live experiments at two synchrotron facilities \(Section[3\.2](https://arxiv.org/html/2606.14999#S3.SS2)\), and benchmark its performance against DINOv3 \(ViT\-7B\)\[[37](https://arxiv.org/html/2606.14999#bib.bib18)\]as a general\-purpose baseline to evaluate the benefit of domain\-specific training\. Finally, we demonstrate the generative capabilities of the learned representation through UMAP\-guided latent sampling \(Section[3\.3](https://arxiv.org/html/2606.14999#S3.SS3)\)\. All detector images are displayed using a plasma colormap for visualization, although the underlying data are grayscale intensity images\.

Additionally, Latent Space Explorer was deployed for post\-experiment data analysis at ALS beamline 9\.3\.1 and at the P03 micro\- and nanofocus small\- and wide\-angle X\-ray scattering \(MiNaXS\) beamline at PETRA III \(DESY, Germany\)\. While highly promising, these results are beyond the scope of this methods\-focused study and will be reported separately in a science\-driven publication\.

### 3\.1 Exploring X\-ray Scattering Data

![Refer to caption](https://arxiv.org/html/2606.14999v1/x3.png) \(a\) 1\.5M C\-VAE embeddings projected via UMAP, colored by HDBSCAN cluster assignment\.
![Refer to caption](https://arxiv.org/html/2606.14999v1/x4.png) \(b\) Zoomed view of two well\-separated experiment clusters in C\-VAE latent space\.
![Refer to caption](https://arxiv.org/html/2606.14999v1/x5.png) \(c\) Single\-cluster trajectory reflecting temporal progression within one experiment\.

Figure 5: Latent space structure of C\-VAE embeddings projected via UMAP\. \(a\) The full training dataset exhibits a continuous yet structured manifold, with HDBSCAN clusters corresponding to groups of structurally similar scattering patterns\. \(b\) Independent experiments form well\-separated clusters in the latent space\. \(c\) Within a single cluster, embeddings follow a smooth trajectory reflecting the temporal progression of the experiment\.

We begin by analyzing the structure of the training dataset used to train the C\-VAE model\. The dataset contains approximately 1\.5 million X\-ray scattering images collected under diverse experimental conditions \(see Methods\)\. The images were stored in RGBA format with dimensions of $1475\times 1679$ pixels and were resized to $512\times 512$ prior to encoding with the C\-VAE model\. The encoder generated a 512\-dimensional latent representation for each image, providing a compact description of the structural features present in the scattering patterns\. To visualize the global structure of this high\-dimensional latent space, the embeddings were projected into two dimensions using UMAP \[[28](https://arxiv.org/html/2606.14999#bib.bib11)\]\. The resulting visualization revealed a highly non\-uniform distribution of points, reflecting the diversity of scattering patterns present in the dataset\. To identify recurring structural motifs, we applied the HDBSCAN clustering algorithm \[[4](https://arxiv.org/html/2606.14999#bib.bib13)\] to the latent space\. Unlike $k$\-means clustering, HDBSCAN automatically determines the number of clusters and identifies ambiguous or noisy samples as outliers\. Figure [5](https://arxiv.org/html/2606.14999#S3.F5) shows the UMAP projection of the full training dataset together with representative clusters\.
The visualization revealed a continuous yet structured latent space in which clusters corresponded to groups of scattering patterns with similar structural characteristics, an organization that could not be recovered by projecting raw pixel intensities directly through UMAP without using a trained encoder \(Supplementary Note 1\.2\)\. As illustrated in Figure [5\(b\)](https://arxiv.org/html/2606.14999#S3.F5.sf2), independent experiments formed well\-separated clusters, indicating that the latent representation encoded experiment\-specific information\. Notably, neighboring clusters remained close in latent space, reflecting similarities between related experiments\. In addition, Figure [5\(c\)](https://arxiv.org/html/2606.14999#S3.F5.sf3) shows that each cluster followed a line\-like trajectory corresponding to the temporal evolution of individual experiments, demonstrating that the latent space preserved both structural similarity and experimental progression\. A PCA analysis of the 512\-dimensional latent vectors is provided in Supplementary Note 1\.1\.

![Refer to caption](https://arxiv.org/html/2606.14999v1/x6.png) \(a\) UMAP trajectory of an experiment with representative scattering images overlaid\.
![Refer to caption](https://arxiv.org/html/2606.14999v1/x7.png) \(b\) 3D cluster view revealing similarity and dissimilarity of clusters in latent space\.

Figure 6: Complementary views of C\-VAE latent structure\. \(a\) UMAP projection of a single experimental run, with representative scattering images overlaid at selected frames to illustrate how structural features evolve along the latent trajectory\. \(b\) Three\-dimensional visualization of latent clusters, showing separation between groups that appear overlapping in two\-dimensional projections\.

Figure [6](https://arxiv.org/html/2606.14999#S3.F6) shows two complementary views of the latent structure\. Figure [6\(a\)](https://arxiv.org/html/2606.14999#S3.F6.sf1) illustrates the progression of a scattering experiment within the learned latent space\. The 512\-dimensional embeddings are projected into two dimensions using UMAP, where each point corresponds to a single scattering image\. Representative images along the trajectory reveal a smooth transition between scattering patterns\. For example, the initial image \(00023\) exhibits a semicircular scattering pattern, while subsequent images gradually introduce additional structural features such as localized intensity variations and concentric rings\. The smooth trajectory in latent space indicates that the C\-VAE captures gradual and meaningful variations in the scattering patterns\.

While two\-dimensional visualizations provide an intuitive overview of the latent structure, dimensionality reduction may introduce apparent overlap between clusters that are well separated in higher dimensions\. To illustrate this effect, Figure [6\(b\)](https://arxiv.org/html/2606.14999#S3.F6.sf2) presents a three\-dimensional visualization of the latent clusters\. In this representation, clusters that appear overlapping in two\-dimensional projections become clearly separated along additional axes\. For example, clusters C084, C345, and C401, which appear closely spaced in the two\-dimensional embedding, are more clearly distinguished in three dimensions\. Similarly, clusters C335, C373, and C374 become separable when additional latent dimensions are considered\. This observation highlights the importance of analyzing the latent structure beyond low\-dimensional projections\.

The structured latent space learned by the C\-VAE provides a compact and interpretable organization of large scattering datasets, in which clusters correspond to distinct scattering regimes and trajectories reflect the temporal progression of individual experiments\. These results confirm that training on 1\.5 million historical scattering images produces a latent space that captures physically meaningful structure across diverse experimental conditions\. This pre\-trained model is therefore well\-suited for deployment in on\-the\-fly analysis of new experiments, without any retraining\.

### 3\.2 On\-the\-Fly Analysis

Having established that the C\-VAE learns a well\-organized latent space from historical training data, we now evaluate its deployment for on\-the\-fly analysis of previously unseen experiments\. We consider time\-resolved grazing\-incidence X\-ray scattering measurements of perfluorosulfonic acid \(PFSA, Nafion\) ionomer films, performed to probe the morphological evolution of PFSA ionomer films during blade coating and drying \[[11](https://arxiv.org/html/2606.14999#bib.bib27),[21](https://arxiv.org/html/2606.14999#bib.bib26)\]\. Sequential scattering images were acquired throughout film formation, capturing transitions from solvent\-dominated states to semicrystalline aggregates and fully formed structures\. The trained C\-VAE model is applied without retraining to data acquired in on\-the\-fly mode at beamline 7\.3\.3 at the Advanced Light Source \(Case Study 1\) and at the Soft Matter Interfaces \(SMI\) beamline at NSLS\-II \(Case Study 2\)\. In this setting, detector images captured during the experiment are streamed directly into the analysis pipeline through a websocket listener\. Each image is embedded into the learned latent space in real time and visualized through the Latent Space Explorer interface, enabling immediate inspection of latent trajectories and clustering behavior without requiring offline post\-processing\. Each captured high\-resolution image \($1475\times 1679$\) is first resized to $512\times 512$ and then encoded into a 512\-dimensional latent mean vector, which serves as the representation used for all downstream analysis\.
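The distinction between a sampled latent vector and the latent mean can be made concrete with a small sketch\. It assumes the standard Gaussian VAE parameterization, in which the encoder outputs a mean and a log\-variance per latent dimension; the function names and the toy two\-dimensional latent are illustrative and are not taken from the C\-VAE code\.

```python
import math
import random


def sample_latent(mu, logvar, rng):
    """Reparameterized draw z = mu + sigma * eps, as used during VAE training."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def latent_representation(mu, logvar):
    """Deterministic embedding: the posterior mean, ignoring the variance."""
    return list(mu)


# Toy encoder output for one image (2 dimensions instead of 512).
mu, logvar = [0.5, -1.0], [0.0, -2.0]
rng = random.Random(0)

z_a = sample_latent(mu, logvar, rng)
z_b = sample_latent(mu, logvar, rng)
embedding = latent_representation(mu, logvar)
```

Using the mean gives the same coordinates every time an image is encoded, which keeps live trajectories free of sampling jitter; stochastic draws such as `z_a` and `z_b` differ from call to call\.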

##### Case Study 1: Real\-Time Analysis at the ALS 7\.3\.3 Beamline

The trained C\-VAE model is applied to two experimental runs of PFSA ionomer film formation at ALS beamline 7\.3\.3, producing smooth latent trajectories that track the structural evolution of the sample throughout the drying process\. Figure [7](https://arxiv.org/html/2606.14999#S3.F7)a shows the C\-VAE PCA trajectory with representative scattering images overlaid at selected frames\. At early time points, when the system is in its dispersion state, the scattering pattern is dominated by a diffuse halo at low $q$ \(scattering vector or momentum transfer\), reflecting solvent scattering\. As the dispersion dries, the polymer begins to form semi\-crystalline aggregates, producing an arch\-shaped feature within the initial solvent scattering peak\. Once most of the solvent has evaporated, the system self\-assembles and hydrophilic domains form throughout the material, generating a new scattering feature at higher $q$\. The smooth trajectory in latent space directly tracks this physical evolution, confirming that the C\-VAE captures meaningful structural transitions during film formation\.

The full comparison across PCA, UMAP, and t\-SNE projections is shown in Figure[8](https://arxiv.org/html/2606.14999#S3.F8)a \(top row\)\. Each point corresponds to a detector image acquired sequentially during the experiment, with color encoding the frame index within each run\. Across all projections, the scans form smooth trajectories in latent space, indicating that the model captures gradual structural evolution occurring during the film formation\. The two runs follow similar but distinct trajectories, suggesting that the representation captures consistent experimental progression while preserving run\-specific characteristics\. Cluster assignments obtained through unsupervised clustering are indicated by marker shape; the kinetically active stage of the sample in both runs is grouped together \(circles\) and the kinetically stable stage is grouped together \(stars\)\.

![Refer to caption](https://arxiv.org/html/2606.14999v1/x8.png) \(a\) ALS beamline 7\.3\.3 \(C\-VAE, PCA\)
![Refer to caption](https://arxiv.org/html/2606.14999v1/x9.png) \(b\) NSLS\-II SMI beamline \(C\-VAE, PCA\)

Figure 7: C\-VAE PCA trajectories with representative scattering images overlaid at selected frames for \(a\) ALS beamline 7\.3\.3 and \(b\) NSLS\-II SMI beamline\. Color encodes temporal progression \(0% = start, 100% = end\); marker shape indicates cluster identity \(circle: cluster 0, star: cluster 1\)\. Image insets link latent positions to the underlying detector patterns, directly connecting structural evolution in the scattering data to the trajectory path in latent space\.

##### Case Study 2: Generalization to NSLS\-II Experiments

Having established that the C\-VAE captures consistent latent structure at the ALS, we next ask whether these representations transfer to a different facility with distinct detector hardware and beamline geometry\. To evaluate this, the trained C\-VAE model is applied without retraining to PFSA ionomer film formation experiments at the NSLS\-II SMI beamline, a setting the model had not encountered during training\. Figure[7](https://arxiv.org/html/2606.14999#S3.F7)b shows the C\-VAE PCA trajectories with representative detector images overlaid at selected frames, linking latent positions to the underlying scattering patterns\.

The C\-VAE latent trajectories across all projections are shown in Figure[8](https://arxiv.org/html/2606.14999#S3.F8)b \(top row\)\. As in the ALS experiments, the scans form smooth trajectories in latent space, confirming that the model captures the continuous structural evolution of scattering patterns at a different facility and detector configuration\. The separation between runs indicates that the representation distinguishes between different experimental conditions or progression stages\. Cluster assignments remain well separated across PCA, UMAP, and t\-SNE projections, grouping scans with similar structural features into distinct regions along the trajectories\.

##### Comparison with a General\-Purpose Vision Model

![Refer to caption](https://arxiv.org/html/2606.14999v1/x10.png) \(a\) ALS beamline 7\.3\.3: C\-VAE \(top row\) vs\. DINOv3 \(bottom row\)
![Refer to caption](https://arxiv.org/html/2606.14999v1/x11.png) \(b\) NSLS\-II SMI beamline: C\-VAE \(top row\) vs\. DINOv3 \(bottom row\)

Figure 8: Latent\-space trajectories of PFSA ionomer film formation at \(a\) ALS beamline 7\.3\.3 and \(b\) NSLS\-II SMI beamline, comparing C\-VAE \(top row\) and DINOv3 \(bottom row\) embeddings projected via PCA \(left\), UMAP \(centre\), and t\-SNE \(right\)\. Point colour encodes temporal frame index within each run \(colorbars\); marker shape indicates cluster identity \(circle: cluster 0, star: cluster 1\); hue family distinguishes experimental runs\. START and END mark the global first and last frames of each acquisition trajectory\.

To evaluate the benefit of domain\-specific training and directly test the hypothesis raised in Section [1](https://arxiv.org/html/2606.14999#S1), we benchmark the C\-VAE against DINOv3 \(ViT\-7B\) \[[37](https://arxiv.org/html/2606.14999#bib.bib18)\], a large self\-supervised vision transformer trained on natural images\. Embeddings from DINOv3 are computed for the same scattering images used in both case studies and projected via PCA, UMAP, and t\-SNE for direct comparison with the C\-VAE\. The results for both facilities are shown in Figure [8](https://arxiv.org/html/2606.14999#S3.F8) \(bottom rows of each panel\)\.

At the ALS beamline, the DINOv3 embeddings produce coherent trajectories that reflect the broad progression of the experiment\. However, the cluster assignments appear more fragmented than those of the C\-VAE\. DINOv3 captures general visual differences between frames but does not necessarily organize scattering\-specific structural transitions as precisely as the domain\-specific model\. At the NSLS\-II SMI beamline, a similar pattern is observed\. The DINOv3 embeddings form well\-separated groups corresponding to the two experimental runs, confirming that the model captures meaningful visual variation between scattering patterns\. Compared with the C\-VAE, however, the clusters are more distributed and the within\-run trajectories less smooth\. This reflects the general\-purpose nature of the model, which was not trained to distinguish the scattering\-specific structural features, such as diffraction rings, peak symmetry, and orientation, that dominate variation in these data\.

While the analyses above were generated offline to allow direct comparison across multiple dimensionality reduction methods, the same representations can be explored interactively through the Latent Space Explorer interface during live experiments, as shown in Figure [9](https://arxiv.org/html/2606.14999#S3.F9)\.

![Refer to caption](https://arxiv.org/html/2606.14999v1/figures/lse_live.png)

Figure 9: Latent Space Explorer interface for interactive exploration of learned latent embeddings during NSLS\-II experiments\. A selected region of the latent space \(left\) yields mean or standard deviation of the corresponding scattering patterns \(right\)\.

Together, these results demonstrate that the trained C\-VAE generalises beyond the training dataset and provides real\-time organization of previously unseen scattering data at two independent synchrotron facilities\. The benchmarking against DINOv3 supports the value of domain\-specific training for on\-the\-fly scattering analysis, and the Latent Space Explorer interface makes these representations accessible during live experiments\.

### 3\.3 Synthetic Scattering Image Generation via Latent Manifold Sampling

The structured latent space learned by the C\-VAE supports controlled generation of synthetic scattering images through conditioned latent sampling\. Two complementary strategies are implemented: UMAP\-guided PCA sampling, which constructs cluster\-aware PCA models from core latent vectors and draws new samples using a temperature\-controlled top\-$k$ softmax weighting scheme over nearby clusters in UMAP space \($k=2$, $T=0.1$\), and conditional flow matching \(CFM\), which trains a UMAP\-conditioned velocity network to learn a continuous normalizing flow \[[24](https://arxiv.org/html/2606.14999#bib.bib49)\] from a standard Gaussian prior to the training latent distribution, with classifier\-free guidance \[[14](https://arxiv.org/html/2606.14999#bib.bib50)\] incorporated during training\. At inference, the ordinary differential equation \(ODE\) is integrated over 20 Euler steps with guidance scale 5\.0\. Figure [10](https://arxiv.org/html/2606.14999#S3.F10) shows synthetic scattering images generated by both strategies across diverse structural clusters\. The generated images display physically plausible features including concentric diffraction rings, diffuse halos, and anisotropic intensity distributions consistent with the training distribution, and occupy the same regions of the UMAP embedding as training data, confirming that the sampling strategies preserve manifold structure without out\-of\-distribution drift\. A quantitative comparison of both strategies against retrieval and unconditional baselines using a cluster\-stratified evaluation protocol is provided in Supplementary Note 2, demonstrating that flow matching achieves superior conditioning fidelity while both strategies produce structurally realistic outputs across the full diversity of scattering patterns\.
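Two numerical ingredients of these strategies can be sketched in a few lines: the temperature\-controlled top\-$k$ softmax weighting and the guided Euler integration\. Both functions rest on assumptions that the text does not confirm: the softmax logits are taken to be negative distances to cluster centers in UMAP space, and the guided velocity uses the common form $v = v_{\mathrm{uncond}} + w\,(v_{\mathrm{cond}} - v_{\mathrm{uncond}})$\. The velocity fields in the example are toy constants standing in for the trained network\.

```python
import math


def topk_softmax_weights(distances, k=2, temperature=0.1):
    """Weights over the k nearest clusters; all other clusters get weight 0.

    Assumption: the logit of a cluster is its negative distance to the
    query point in UMAP space, divided by the temperature.
    """
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    logits = [-distances[i] / temperature for i in nearest]
    peak = max(logits)                       # subtract for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [0.0] * len(distances)
    for i, e in zip(nearest, exps):
        weights[i] = e / total
    return weights


def guided_euler(x0, v_cond, v_uncond, steps=20, guidance=5.0):
    """Integrate dx/dt = v from t=0 to t=1 with classifier-free guidance."""
    x, dt = list(x0), 1.0 / steps
    for n in range(steps):
        t = n * dt
        vc, vu = v_cond(x, t), v_uncond(x, t)
        # Guided velocity: extrapolate from unconditional toward conditional.
        v = [u + guidance * (c - u) for c, u in zip(vc, vu)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


weights = topk_softmax_weights([0.1, 0.3, 2.0])
x_end = guided_euler([0.0], lambda x, t: [1.0], lambda x, t: [0.0])
```

With $T=0.1$ the weighting is sharp: a distance gap of 0\.2 between the two nearest clusters already puts most of the weight on the closest one, while the third cluster is excluded by the top\-$k$ cut\.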

![Refer to caption](https://arxiv.org/html/2606.14999v1/figures/generated_images.png)

Figure 10: Synthetic scattering images generated via latent sampling of the C\-VAE latent space\. Diverse structures confirm that the sampling preserves structural variety present in the training distribution\.

## 4 Discussion

We trained a domain\-specific attention\-based C\-VAE on 1\.5 million X\-ray scattering images collected at the Advanced Light Source beamline 7\.3\.3, and applied it to two distinct settings: post\-experiment exploration of large experimental archives and on\-the\-fly analysis during live experiments\. The coherent cluster structure and smooth latent trajectories emerging from a decade of ALS measurements suggest that dominant structural variation in X\-ray scattering data is low\-dimensional, a finding that justifies the use of compact 512\-dimensional embeddings for real\-time analysis without significant loss of discriminative information\. Successful deployment at NSLS\-II without retraining indicates that the dominant structural variation captured by the C\-VAE is facility\-independent, likely reflecting the physics of scattering rather than detector or beamline\-specific artifacts\. The advantage over DINOv3 \(ViT\-7B\) in cluster separation and trajectory smoothness further supports this interpretation: natural image features such as texture and color contrast may not be well suited for scattering\-specific structure like diffraction ring symmetry, peak sharpness, and orientation, which the domain\-trained model learns directly\.

The Latent Space Explorer application within the MLExchange platform makes these representations accessible during both offline and live experiments\. With an inference time of approximately 0\.05 seconds per image, the system supports real\-time monitoring of structural transitions, inspection of representative patterns, and interactive selection of regions of interest in the latent space as data arrive\. For higher\-throughput scenarios, additional compute resources can be allocated to scale inference to faster acquisition rates\. The learned latent space supports synthetic scattering image generation through two complementary strategies\. UMAP\-guided PCA sampling requires no additional training beyond the C\-VAE itself and provides broad coverage of the scattering phase space, making it practical for exploratory applications\. Conditional flow matching achieves superior conditioning fidelity, producing synthetic images that more closely match target structural states, at the cost of a dedicated training step\. Both strategies preserve manifold structure without out\-of\-distribution drift, as confirmed by quantitative evaluation across diverse structural clusters \(Supplementary Note 2\)\. Beyond data exploration, these capabilities have practical utility for augmenting underrepresented structural states in class\-imbalanced training sets, pre\-testing on\-the\-fly analysis pipelines before beamtime, and generating synthetic experimental trajectories for experiment planning\.

Several directions are worth pursuing in future work\. The most immediate extension is integrating the learned latent representations with autonomous experimental control\. The C\-VAE already detects structural transitions in real time, and its latent trajectories could inform adaptive responses during an experiment\. These include adjusting acquisition parameters when a transition is detected, redirecting the beam to a region of interest, or automatically initiating a complementary measurement when an unexpected structural state appears\. Moving in this direction would shift the system from passive monitoring toward active, data\-driven experimental decision making\.

A second direction is incorporating experimental metadata into the learned representation\. The current model encodes only detector images, and integrating sample metadata such as temperature, humidity, solvent composition, or deposition speed into the latent space would allow the model to distinguish structural states that look similar in scattering but arise from different experimental conditions\. This would make the latent trajectories more physically interpretable and could help reveal correlations between processing parameters and structural outcomes that are not visible from scattering data alone\.

Third, while the ALS\-trained model generalised to NSLS\-II without retraining, the degree to which facility\-specific fine\-tuning improves latent organization remains an open question\. Transfer learning from the pre\-trained C\-VAE using a small number of images from a new facility or sample system could reduce the cost of adapting the model to new experimental contexts\. Understanding how few examples are needed to recover well\-organized latent structure at a new beamline would make the approach practical for broader deployment across the synchrotron community\.

Finally, the structured outputs of the latent space analysis provide a natural interface for language model and agent\-based workflows\. Current large language models struggle to reason directly over high\-dimensional detector images, but they can readily process structured descriptions of experimental state such as cluster assignments, trajectory positions, detected transitions, and anomaly flags, which are exactly the outputs that Latent Space Explorer already produces in real time\. Connecting these representations to a language model agent could support natural language interaction with ongoing experiments\. A scientist could ask whether the current scattering state resembles a known phase, request a comparison between the current run and historical trajectories, or trigger alerts for specific structural transitions\. More broadly, an agent with access to both the latent space and experimental metadata could assist with experimental planning by suggesting acquisition parameters based on the current structural state or retrieving relevant experiments from the archive\. The latent clusters and trajectories built in this work provide the structured foundation that such a system would require\.

## Acknowledgements

This work was performed and partially supported by the U\.S\. Department of Energy \(DOE\), Office of Science, Office of Basic Energy Sciences, Data, Artificial Intelligence and Machine Learning at DOE Scientific User Facilities program under the MLExchange Project \(Award No\. 107514\)\. Support was also provided by the Center for Materials for Water and Energy Systems \(M\-WET\), an Energy Frontier Research Center funded by DOE, Office of Science, Basic Energy Sciences under Award No\. DE\-SC0019272\. This research used resources of the Advanced Light Source, a DOE Office of Science User Facility under contract No\. DE\-AC02\-05CH11231; the National Synchrotron Light Source II, a DOE Office of Science User Facility operated by Brookhaven National Laboratory under Contract No\. DE\-SC0012704; and the National Energy Research Scientific Computing Center \(NERSC\), a DOE Office of Science User Facility, under NERSC Award No\. BES\-ERCAP0027412\. S\.V\.R\. acknowledges financial support from the German Bundesministerium für Bildung und Forschung \(now: Bundesministerium für Forschung, Technologie und Raumfahrt \(BFTR\)\) within the ErUM\-Data framework under grant No\. 13D22CH7 \(“Versatile Inverse Problem Framework”\)\. The authors thank the staff at ALS beamline 7\.3\.3, NSLS\-II SMI beamline, and PETRA III P03 \(MiNaXS\) beamline for support during experimental data collection\. GPT\-5 from OpenAI and Claude Sonnet 4\.6 from Anthropic were used for minor text editing purposes in this manuscript\.

## Author contributions

M\.C\.: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing\. X\.C\.: Software, Writing – review & editing\. R\.J\.: Software\. W\.K\.: Data curation, Investigation, Software, Writing – review & editing\. P\.H\.Z\.: Conceptualization, Formal analysis, Methodology, Software, Writing – review & editing\. D\.E\.: Resources, Software\. G\.M\.S\.: Data curation, Investigation, Writing – review & editing\. E\.S\.: Investigation, Resources\. C\.Z\.: Investigation, Resources\. M\.N\.: Investigation, Resources\. N\.P\.W\.: Data curation, Investigation, Resources\. K\.K\.L\.: Data curation, Investigation, Writing – review & editing\. J\.M\.C\.: Data curation, Investigation, Writing – review & editing\. J\.C\.D\.: Data curation, Investigation, Writing – review & editing\. C\.M\.: Data curation, Investigation\. L\.K\.: Data curation, Investigation\. B\.F\.: Funding acquisition, Investigation\. G\.F\.: Data curation, Investigation\. Y\.M\.: Data curation, Investigation\. E\.G\.: Data curation, Investigation, Resources\. D\.B\.A\.: Data curation, Investigation, Resources, Software\. F\.S\.: Resources\. B\.S\.: Data curation, Investigation, Resources, Writing – review & editing\. S\.V\.R\.: Data curation, Investigation, Resources\. E\.C\.: Data curation, Resources, Writing – review & editing\. D\.M\.: Conceptualization, Project administration, Software, Supervision, Writing – review & editing\. T\.C\.: Conceptualization, Methodology, Project administration, Software, Supervision, Writing – review & editing\. A\.H\.: Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing\.

## Competing interests

The authors declare no competing interests\.

## Data availability

The X\-ray scattering training dataset used in this study and the synthetic generated data are available upon reasonable request from the corresponding author\.

## Code availability

The Latent Space Explorer application and associated MLExchange platform components used in this study are openly available at [https://github\.com/mlexchange](https://github.com/mlexchange)\. The C\-VAE model training and inference scripts are also available there\.

## References

- \[1\] D\. Allan, T\. Caswell, S\. Campbell, and M\. Rakitin \(2019\) Bluesky’s ahead: a multi\-facility collaboration for an a la carte software project for data acquisition and management\. Synchrotron Radiation News 32\(3\), pp\. 19–22\. Cited by: [§2\.4](https://arxiv.org/html/2606.14999#S2.SS4.SSSx1.p1.1)\.
- \[2\] A\. Barbour, S\. Campbell, T\. Caswell, M\. Fukuto, M\. Hanwell, A\. Kiss, T\. Konstantinova, R\. Laasch, P\. Maffettone, B\. Ravel, et al\. \(2022\) Advancing discovery with artificial intelligence and machine learning at nsls\-ii\. Synchrotron Radiation News 35\(4\), pp\. 44–50\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p1.1)\.
- \[3\] M\. Calvat, C\. Bean, D\. Anjaria, H\. Park, H\. Wang, K\. Vecchio, and J\. Stinville \(2025\) Learning metal microstructural heterogeneity through spatial mapping of diffraction latent space features\. npj Computational Materials 11\(1\), pp\. 284\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1)\.
- \[4\] R\. J\. Campello, D\. Moulavi, and J\. Sander \(2013\) Density\-based clustering based on hierarchical density estimates\. In Pacific\-Asia conference on knowledge discovery and data mining, pp\. 160–172\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1), [§3\.1](https://arxiv.org/html/2606.14999#S3.SS1.p1.3)\.
- \[5\] M\. Caron, H\. Touvron, I\. Misra, H\. Jégou, J\. Mairal, P\. Bojanowski, and A\. Joulin \(2021\) Emerging properties in self\-supervised vision transformers\. In Proceedings of the IEEE/CVF international conference on computer vision, pp\. 9650–9660\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p1.1), [§1](https://arxiv.org/html/2606.14999#S1.p3.1)\.
- \[6\] C\. Chadebec and S\. Allassonnière \(2021\) Data augmentation with variational autoencoders and manifold sampling\. In MICCAI Workshop on Deep Generative Models, pp\. 184–192\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1)\.
- \[7\] T\. Chavez, Z\. Zhao, R\. Jiang, W\. Koepp, D\. McReynolds, P\. H\. Zwart, D\. B\. Allan, E\. H\. Gann, N\. Schwarz, D\. Ushizima, et al\. \(2025\) A machine\-learning\-driven data labeling pipeline for scientific analysis in mlexchange\. Applied Crystallography 58\(3\)\. Cited by: [§2\.4](https://arxiv.org/html/2606.14999#S2.SS4.SSSx1.p1.1), [§2\.4](https://arxiv.org/html/2606.14999#S2.SS4.SSSx2.p1.1)\.
- \[8\] T\. Chavez, Z\. Zhao, R\. Jiang, W\. Koepp, D\. McReynolds, P\. H\. Zwart, D\. B\. Allan, E\. H\. Gann, N\. Schwarz, D\. Ushizima, E\. S\. Barnard, A\. Mehta, S\. Sankaranarayanan, and A\. Hexemer \(2024\) A machine\-learning\-driven data labeling pipeline for scientific analysis in mlexchange\. Journal of Applied Crystallography\. Note: In press\. External Links: [Document](https://dx.doi.org/10.1107/S1600576725002328)\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p3.1), [§2\.4](https://arxiv.org/html/2606.14999#S2.SS4.SSSx3.p2.1)\.
- \[9\] R\. Cohn and E\. Holm \(2021\) Unsupervised machine learning via transfer learning and k\-means clustering to classify materials image data\. Integrating Materials and Manufacturing Innovation 10\(2\), pp\. 231–244\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1)\.
- \[10\] A\. Dosovitskiy, L\. Beyer, A\. Kolesnikov, D\. Weissenborn, X\. Zhai, T\. Unterthiner, M\. Dehghani, M\. Minderer, G\. Heigold, S\. Gelly, et al\. \(2021\) An image is worth 16x16 words: transformers for image recognition at scale\. In International Conference on Learning Representations \(ICLR\)\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p1.1), [§1](https://arxiv.org/html/2606.14999#S1.p2.1), [§2\.1](https://arxiv.org/html/2606.14999#S2.SS1.p2.1)\.
- \[11\] P\. J\. Dudenas and A\. Kusoglu \(2019\) Evolution of ionomer morphology from dispersion to film: an in situ x\-ray study\. Macromolecules 52\(20\), pp\. 7779–7785\. Cited by: [§3\.2](https://arxiv.org/html/2606.14999#S3.SS2.p1.3)\.
- \[12\] A\. Hexemer, W\. Bras, J\. Glossinger, E\. Schaible, E\. Gann, R\. Kirian, A\. MacDowell, M\. Church, B\. Rude, and H\. Padmore \(2010\) A saxs/waxs/gisaxs beamline with multilayer monochromator\. In Journal of Physics: Conference Series, Vol\. 247, pp\. 012007\. Cited by: [§2\.2](https://arxiv.org/html/2606.14999#S2.SS2.p1.2)\.
- \[13\] I\. Higgins, L\. Matthey, A\. Pal, C\. P\. Burgess, X\. Glorot, M\. M\. Botvinick, S\. Mohamed, and A\. Lerchner \(2017\) β\-VAE: learning basic visual concepts with a constrained variational framework\. In International Conference on Learning Representations\. External Links: [Link](https://api.semanticscholar.org/CorpusID:46798026)\. Cited by: [§3\.1](https://arxiv.org/html/2606.14999#S3.SS1a.p4.3)\.
- \[14\] J\. Ho and T\. Salimans \(2022\) Classifier\-free diffusion guidance\. arXiv preprint arXiv:2207\.12598\. Cited by: [§2\.3](https://arxiv.org/html/2606.14999#S2.SS3.p3.5), [§2](https://arxiv.org/html/2606.14999#S2.SSx2.SSS0.Px3.p1.1), [§3\.3](https://arxiv.org/html/2606.14999#S3.SS3.p1.2)\.
- \[15\] J\. P\. Horwath, X\. Lin, H\. He, Q\. Zhang, E\. M\. Dufresne, M\. Chu, S\. K\. Sankaranarayanan, W\. Chen, S\. Narayanan, and M\. J\. Cherukara \(2024\) AI\-nerd: elucidation of relaxation dynamics beyond equilibrium through ai\-informed x\-ray photon correlation spectroscopy\. Nature Communications 15\(1\), pp\. 5945\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p1.1)\.
- \[16\] X\. Huang, S\. Jamonnak, Y\. Zhao, B\. Wang, M\. Hoai, K\. Yager, and W\. Xu \(2020\) Interactive visual study of multiple attributes learning model of x\-ray scattering images\. IEEE Transactions on Visualization and Computer Graphics 27\(2\), pp\. 1312–1321\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1)\.
- \[17\] S\. V\. Kalinin, O\. Dyck, S\. Jesse, and M\. Ziatdinov \(2021\) Exploring order parameters and dynamic processes in disordered systems via variational autoencoders\. Science Advances 7\(17\), pp\. eabd5084\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1)\.
- \[18\] Y\. Kim, H\. K\. Park, J\. Jung, P\. Asghari\-Rad, S\. Lee, J\. Y\. Kim, H\. G\. Jung, and H\. S\. Kim \(2021\) Exploration of optimal microstructure and mechanical properties in continuous microstructure space using a variational autoencoder\. Materials & Design 202, pp\. 109544\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1)\.
- \[19\] D\. P\. Kingma and M\. Welling \(2014\) Auto\-encoding variational bayes\. In International Conference on Learning Representations \(ICLR\)\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1), [§2\.1](https://arxiv.org/html/2606.14999#S2.SS1.p1.1)\.
- \[20\] A\. Krizhevsky, I\. Sutskever, and G\. E\. Hinton \(2012\) Imagenet classification with deep convolutional neural networks\. Advances in neural information processing systems 25\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p1.1), [§2\.1](https://arxiv.org/html/2606.14999#S2.SS1.p2.1)\.
- \[21\] A\. Kusoglu and A\. Z\. Weber \(2017\) New insights into perfluorinated sulfonic\-acid ionomers\. Chemical reviews 117\(3\), pp\. 987–1104\. Cited by: [§3\.2](https://arxiv.org/html/2606.14999#S3.SS2.p1.3)\.
- \[22\] B\. C\. Kwon, S\. Friedman, K\. Xu, S\. A\. Lubitz, A\. Philippakis, P\. Batra, P\. T\. Ellinor, and K\. Ng \(2023\) Latent space explorer: visual analytics for multimodal latent space exploration\. arXiv preprint arXiv:2312\.00857\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1)\.
- \[23\] J\. Lee, W\. B\. Park, J\. H\. Lee, S\. P\. Singh, and K\. Sohn \(2020\) A deep\-learning technique for phase identification in multiphase inorganic compounds using synthetic xrd powder patterns\. Nature communications 11\(1\), pp\. 86\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1)\.
- \[24\] Y\. Lipman, R\. T\. Chen, H\. Ben\-Hamu, M\. Nickel, and M\. Le \(2022\) Flow matching for generative modeling\. arXiv preprint arXiv:2210\.02747\. Cited by: [§2\.3](https://arxiv.org/html/2606.14999#S2.SS3.p3.5), [§2](https://arxiv.org/html/2606.14999#S2.SSx2.p1.2), [§3\.3](https://arxiv.org/html/2606.14999#S3.SS3.p1.2)\.
- \[25\] Z\. Liu, Y\. Lin, Y\. Cao, H\. Hu, Y\. Wei, Z\. Zhang, S\. Lin, and B\. Guo \(2021\) Swin transformer: hierarchical vision transformer using shifted windows\. In Proceedings of the IEEE/CVF International Conference on Computer Vision \(ICCV\), pp\. 10012–10022\. Cited by: [§2\.1](https://arxiv.org/html/2606.14999#S2.SS1.p2.1)\.
- \[26\] A\. Luo, T\. Zhou, M\. Du, M\. V\. Holt, A\. Singer, and M\. J\. Cherukara \(2025\) DONUT: physics\-aware machine learning for real\-time x\-ray nanodiffraction analysis\. npj Computational Materials 11\(1\), pp\. 380\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p1.1)\.
- \[27\] A\. Maćkiewicz and W\. Ratajczak \(1993\) Principal components analysis \(pca\)\. Computers & Geosciences 19\(3\), pp\. 303–342\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1)\.
- \[28\] L\. McInnes, J\. Healy, and J\. Melville \(2018\) Umap: uniform manifold approximation and projection for dimension reduction\. arXiv preprint arXiv:1802\.03426\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1), [§3\.1](https://arxiv.org/html/2606.14999#S3.SS1.p1.3)\.
- \[29\] National Energy Research Scientific Computing Center \(NERSC\) \(2024\) NERSC perlmutter architecture\. Note: [https://docs\.nersc\.gov/systems/perlmutter/architecture/](https://docs.nersc.gov/systems/perlmutter/architecture/) Accessed: 2026\-01\-25\. Cited by: [§2\.2](https://arxiv.org/html/2606.14999#S2.SS2.p2.1)\.
- \[30\] M\. M\. Noack, K\. G\. Yager, M\. Fukuto, G\. S\. Doerk, R\. Li, and J\. A\. Sethian \(2019\) A kriging\-based approach to autonomous experimentation with applications to x\-ray scattering\. Scientific reports 9\(1\), pp\. 11809\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p1.1)\.
- \[31\] M\. Oquab, T\. Darcet, T\. Moutakanni, H\. Vo, M\. Szafraniec, V\. Khalidov, P\. Fernandez, D\. Haziza, F\. Massa, A\. El\-Nouby, et al\. \(2023\) Dinov2: learning robust visual features without supervision\. arXiv preprint arXiv:2304\.07193\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1), [§1](https://arxiv.org/html/2606.14999#S1.p3.1)\.
- \[32\] D\. Y\. Parkinson, T\. Chavez, M\. Choudhary, D\. English, G\. Hao, T\. Hellert, S\. C\. Leemann, S\. Nemsak, E\. Rotenberg, A\. L\. Taylor, et al\. \(2024\) AI@ als workshop report: machine learning needs at the advanced light source\. Taylor & Francis\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p1.1)\.
- \[33\] L\. Pithan, V\. Starostin, D\. Mareček, L\. Petersdorf, C\. Völter, V\. Munteanu, M\. Jankowski, O\. Konovalov, A\. Gerlach, A\. Hinderhofer, et al\. \(2023\) Closing the loop: autonomous experiments enabled by machine\-learning\-based online data analysis in synchrotron beamline environments\. Synchrotron Radiation 30\(6\), pp\. 1064–1075\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p1.1)\.
- \[34\] F\. Ren, T\. Williams, J\. Hattrick\-Simpers, and A\. Mehta \(2017\) On\-the\-fly segmentation approaches for x\-ray diffraction datasets for metallic glasses\. MRS Communications 7\(3\), pp\. 613–620\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1)\.
- \[35\] J\. E\. Salgado, S\. Lerman, Z\. Du, C\. Xu, and N\. Abdolrahim \(2023\) Automated classification of big x\-ray diffraction data using deep learning models\. npj Computational Materials 9\(1\), pp\. 214\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p1.1)\.
- \[36\] M\. Seifi, D\. Dalle Nogare, J\. M\. Battagliotti, V\. Galinova, A\. K\. Rao, P\. Jouneau, A\. Archit, C\. Pape, J\. Decelle, et al\. \(2025\) FeatureForest: the power of foundation models, the usability of random forests\. npj Imaging 3\(1\), pp\. 32\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1)\.
- \[37\] O\. Simeoni, H\. V\. Vo, M\. Seitzer, F\. Baldassarre, M\. Oquab, C\. Jose, V\. Khalidov, M\. Szafraniec, S\. Yi, M\. Ramamonjisoa, et al\. \(2025\) DINOv3: a family of large self\-supervised vision foundation models\. arXiv preprint arXiv:2508\.10104\. Cited by: [3rd item](https://arxiv.org/html/2606.14999#S1.I1.i3.p1.1), [§1](https://arxiv.org/html/2606.14999#S1.p3.1), [§1](https://arxiv.org/html/2606.14999#S1.p4.1), [§3\.2](https://arxiv.org/html/2606.14999#S3.SS2.SSS0.Px3.p1.1), [§3](https://arxiv.org/html/2606.14999#S3.p1.1)\.
- \[38\] V\. Starostin, V\. Munteanu, A\. Greco, E\. Kneschaurek, A\. Pleli, F\. Bertram, A\. Gerlach, A\. Hinderhofer, and F\. Schreiber \(2022\) Tracking perovskite crystallization via deep learning\-based feature detection on 2d x\-ray scattering data\. npj Computational Materials 8\(1\), pp\. 101\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p3.1)\.
- \[39\] T\. Strohmann, P\. Barriobero\-Vila, J\. Gussone, D\. Melching, A\. Stark, N\. Schell, and G\. Requena \(2023\) Can unsupervised machine learning boost the on\-site analysis of in situ synchrotron diffraction data?\. Scripta materialia 226, pp\. 115238\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1)\.
- \[40\] D\. R\. Sutherland, R\. Ford, Y\. Liu, T\. B\. Martin, and P\. A\. Beaucage \(2025\) AutoSAS: a new human\-aside\-the\-loop paradigm for automated sas fitting for high throughput and autonomous experimentation\. APL Machine Learning 3\(3\)\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1)\.
- \[41\] N\. J\. Szymanski, C\. J\. Bartel, Y\. Zeng, M\. Diallo, H\. Kim, and G\. Ceder \(2023\) Adaptively driven x\-ray diffraction guided by machine learning for autonomous phase identification\. npj Computational Materials 9\(1\), pp\. 31\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p3.1)\.
- \[42\] M\. Valleti, M\. Ziatdinov, Y\. Liu, and S\. V\. Kalinin \(2024\) Physics and chemistry from parsimonious representations: image analysis via invariant variational autoencoders\. npj Computational Materials 10\(1\), pp\. 183\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1)\.
- \[43\] L\. Van der Maaten and G\. Hinton \(2008\) Visualizing data using t\-sne\. Journal of machine learning research 9\(11\)\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1)\.
- \[44\] A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\) Attention is all you need\. Advances in neural information processing systems 30\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p1.1), [§2\.1](https://arxiv.org/html/2606.14999#S2.SS1.p2.1)\.
- \[45\] M\. Zaharia, A\. Chen, A\. Davidson, A\. Ghodsi, S\. A\. Hong, A\. Konwinski, S\. Murching, T\. Nykodym, P\. Ogilvie, M\. Parkhe, et al\. \(2018\) Accelerating the machine learning lifecycle with mlflow\. IEEE Data Eng\. Bull\. 41\(4\), pp\. 39–45\. Cited by: [§2\.4](https://arxiv.org/html/2606.14999#S2.SS4.SSSx4.p1.1)\.
- \[46\] Y\. Zhang, J\. Li, and X\. Chao \(2024\) ChemNav: an interactive visual tool to navigate in the latent space for chemical molecules discovery\. Visual Informatics 8\(4\), pp\. 60–70\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1)\.
- \[47\] Z\. Zhang, C\. Li, W\. Wang, Z\. Dong, G\. Liu, Y\. Dong, and Y\. Zhang \(2024\) Towards full\-stack deep learning\-empowered data processing pipeline for synchrotron tomography experiments\. The Innovation 5\(1\)\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1)\.
- \[48\] Z\. Zhao, T\. Chavez, E\. A\. Holman, G\. Hao, A\. Green, H\. Krishnan, D\. McReynolds, R\. J\. Pandolfi, E\. J\. Roberts, P\. H\. Zwart, et al\. \(2022\) MLExchange: a web\-based platform enabling exchangeable machine learning workflows for scientific studies\. In 2022 4th Annual Workshop on Extreme\-scale Experiment\-in\-the\-Loop Computing \(XLOOP\), pp\. 10–15\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p3.1), [§2\.4](https://arxiv.org/html/2606.14999#S2.SS4.SSSx2.p1.1)\.
- \[49\] Z\. Zhao, X\. Chong, T\. Chavez, and A\. Hexemer \(2024\) Generating realistic x\-ray scattering images using stable diffusion and human\-in\-the\-loop annotations\. arXiv preprint arXiv:2408\.12720\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p2.1)\.
- \[50\] Z\. Zhou, C\. Li, X\. Bi, C\. Zhang, Y\. Huang, J\. Zhuang, W\. Hua, Z\. Dong, L\. Zhao, Y\. Zhang, et al\. \(2023\) A machine learning model for textured x\-ray scattering and diffraction image denoising\. npj Computational Materials 9\(1\), pp\. 58\. Cited by: [§1](https://arxiv.org/html/2606.14999#S1.p3.1)\.

Supplementary Information

Unlocking Latent Dimensions: Exploring Representations of Large\-Scale X\-ray Scattering Data using Variational Autoencoders

Monika Choudhary et al\.

## 1 Latent Space Structure and Representation Validation

A central question in applying a learned encoder to scientific image data is whether the resulting representation captures physically meaningful structure, or merely compresses pixel information without semantic organization\. This note addresses that question from two complementary directions: first by examining the linear structure of the latent space through PCA, and second by comparing the C\-VAE projection against a direct pixel\-based UMAP baseline\.

### 1\.1 Principal Component Analysis of the Latent Space

To characterize the linear structure of the learned representation, we perform PCA on the 512\-dimensional latent vectors extracted from the 1\.5 million training images\. Figure [S1\(a\)](https://arxiv.org/html/2606.14999#S1.F1.sf1) shows the variance explained by the leading principal components\. The first few components capture a disproportionately large fraction of the total variance, indicating that the latent representation possesses meaningful low\-dimensional structure despite its high dimensionality\. This compression is a consequence of the variational objective, which penalizes redundant latent dimensions and encourages the encoder to represent only the dominant modes of image variation\.

To examine the relationship between linear and nonlinear structure in the latent space, the per\-sample scores along PC0 and PC1 are used to color each point in the UMAP embedding, as shown in Figure [S1](https://arxiv.org/html/2606.14999#S1.F1a)\. In Figure [S1\(b\)](https://arxiv.org/html/2606.14999#S1.F1.sf2), a clear and continuous gradient is visible across the UMAP layout, indicating that PC0, the dominant mode of linear variation, is well\-aligned with the global structure captured by UMAP\. This suggests the leading principal component corresponds to a dominant and smoothly varying physical mode such as changes in intensity distribution, ring curvature, or feature orientation\. Figure [S1\(c\)](https://arxiv.org/html/2606.14999#S1.F1.sf3) shows a less pronounced but spatially structured gradient for PC1, consistent with a secondary mode of variation more diffusely distributed across the latent manifold\. Together, these results confirm that the latent space organizes images along directions that correspond to interpretable physical variation rather than arbitrary encoder outputs\.
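The PCA step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function name `pca_spectrum_and_scores` is hypothetical, and the toy latents (three strong factors embedded in 512 dimensions) merely stand in for real encoder outputs.

```python
import numpy as np

def pca_spectrum_and_scores(latents, n_scores=2):
    """Return explained-variance ratios and scores along the leading PCs."""
    centered = latents - latents.mean(axis=0, keepdims=True)
    # SVD of the centered data: squared singular values are proportional
    # to the variance along each principal direction.
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2
    ratio = variance / variance.sum()
    scores = centered @ components[:n_scores].T  # columns: PC0, PC1, ...
    return ratio, scores

# Hypothetical stand-in for encoder outputs (not data from the paper).
rng = np.random.default_rng(0)
factors = rng.normal(size=(2000, 3)) * np.array([10.0, 5.0, 2.0])
latents = factors @ rng.normal(size=(3, 512)) + 0.1 * rng.normal(size=(2000, 512))

ratio, scores = pca_spectrum_and_scores(latents)
cumulative = np.cumsum(ratio)
print(f"variance captured by the first 3 PCs: {cumulative[2]:.4f}")
```

The `scores[:, 0]` and `scores[:, 1]` columns are the per-sample PC0 and PC1 values that would be used to color a UMAP scatter plot.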

![Refer to caption](https://arxiv.org/html/2606.14999v1/figures_appendix/pc_variance_distribution.png) \(a\) Variance explained by leading PCs\.
![Refer to caption](https://arxiv.org/html/2606.14999v1/figures_appendix/umap_colored_by_pc0_20pc.png) \(b\) UMAP embedding colored by PC0 scores\.
![Refer to caption](https://arxiv.org/html/2606.14999v1/figures_appendix/umap_colored_by_pc1_20pc.png) \(c\) UMAP embedding colored by PC1 scores\.

Figure S1: PCA of the C\-VAE latent space from 1\.5 million training images\. \(a\) Variance explained by the leading principal components: a small number of components capture most of the latent variance, confirming low\-dimensional structure\. \(b\) UMAP embedding colored by PC0 scores: the smooth gradient confirms strong alignment between the dominant linear mode and global UMAP structure\. \(c\) UMAP embedding colored by PC1 scores: a weaker but spatially structured gradient corresponding to a secondary mode of variation\.

### 1\.2 Comparison with Direct Image UMAP

The PCA results above confirm that the latent space is structured and low\-dimensional\. A complementary question is whether this structure is necessary or whether projecting raw pixel intensities directly through UMAP would yield equivalent groupings without a trained encoder\. Direct image UMAP is appealing in its simplicity, requiring no training or architectural choices\. However, pixel similarity and structural similarity are not equivalent in scientific imaging: two scattering patterns can share the same morphological class while differing in brightness, beam flux, or detector noise, and the variational objective of the C\-VAE explicitly encourages the encoder to suppress such low\-level variation in favor of structural modes\.

To test this, each image was resized to $128\times 128$ pixels, flattened, L2\-normalized to remove global illumination differences, and projected to 2D using a UMAP model fitted on a random subset of 20 000 images and then applied to the full 1\.5\-million\-image dataset\. Figure [S2](https://arxiv.org/html/2606.14999#S1.F2) shows the result for six representative HDBSCAN clusters\. In the C\-VAE latent space, all six clusters form compact, well\-separated manifolds whose elongated geometry reflects the smooth temporal evolution of scattering patterns within each run, consistent with the structured low\-dimensional organization confirmed by PCA above\. In the direct image UMAP, the same clusters either collapse into isolated points or spread without recoverable structure, even after L2 normalization\.
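The preprocessing for the pixel baseline (resize, flatten, L2\-normalize) can be illustrated as follows. This is a sketch under assumptions: the resizing method is not stated in the text, so block averaging is used here as a stand\-in, the helper names are hypothetical, and the UMAP projection itself is omitted. The example shows that L2 normalization removes a global brightness factor but not a structural change such as a shifted ring radius.

```python
import numpy as np

def block_mean_resize(image, out_size):
    """Downsample a square image by averaging non-overlapping blocks (assumed method)."""
    factor = image.shape[0] // out_size
    cropped = image[: factor * out_size, : factor * out_size]
    return cropped.reshape(out_size, factor, out_size, factor).mean(axis=(1, 3))

def to_pixel_vector(image, out_size=128):
    """Resize, flatten, and L2-normalize an image into a unit-length pixel vector."""
    flat = block_mean_resize(np.asarray(image, dtype=np.float64), out_size).ravel()
    norm = np.linalg.norm(flat)
    return flat / norm if norm > 0 else flat

# Hypothetical ring-shaped scattering patterns on a 256 x 256 detector.
yy, xx = np.mgrid[0:256, 0:256]
radius = np.hypot(yy - 128, xx - 128)
ring = np.exp(-((radius - 60.0) ** 2) / (2 * 5.0 ** 2))
wider_ring = np.exp(-((radius - 75.0) ** 2) / (2 * 5.0 ** 2))

v_ring = to_pixel_vector(ring)
v_bright = to_pixel_vector(3.5 * ring)   # same pattern, higher flux
v_wider = to_pixel_vector(wider_ring)    # same morphology, shifted ring

print("brightness change:", np.linalg.norm(v_ring - v_bright))
print("ring shift:", np.linalg.norm(v_ring - v_wider))
```

The two rings belong to the same morphological class, yet their pixel vectors are far apart, which is the kind of mismatch between pixel similarity and structural similarity discussed above.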

![Refer to caption](https://arxiv.org/html/2606.14999v1/x12.png)

Figure S2: Comparison of UMAP projections for six representative HDBSCAN clusters \(263, 381, 424, 359, 213, 170\), colored consistently across panels\. Left: C\-VAE latent space: each cluster forms a compact, well\-separated manifold consistent with the structured low\-dimensional organization of the learned representation\. Right: Direct image UMAP applied to L2\-normalized $128\times 128$ pixel vectors: the same clusters collapse into isolated points or spread without recoverable structure, demonstrating that raw pixel similarity is insufficient to reproduce the structural organization recovered by the C\-VAE encoder\.

## 2 Latent Manifold Sampling Pipelines for Synthetic Scattering Image Generation

This note describes the two generation strategies evaluated in Section 2\.3 of the main text: UMAP\-guided PCA sampling and conditional flow matching\. Both strategies receive a two\-dimensional UMAP coordinate as conditioning input and produce a 512\-dimensional latent vector that is decoded by the trained C\-VAE decoder to yield a synthetic grayscale scattering image\. An overview of the UMAP\-guided PCA pipeline is shown in Figure S3, and an overview of the flow matching pipeline in Figure S4\. The conditional flow matching architecture is summarised in Table S1\.

### UMAP\-Guided PCA Sampling

Let $Z\in\mathbb{R}^{N\times D}$ denote the latent representation of the training data, where $D=512$\. Each latent vector is associated with a cluster label and a cluster membership strength obtained from the HDBSCAN clustering analysis\. To restrict sampling to meaningful regions of the latent space, only core latent vectors whose cluster membership strength meets or exceeds a threshold of 0\.8 are retained\. For each sufficiently populated cluster, the centroid is computed in the two\-dimensional UMAP embedding using these core samples\. These centroids provide a coarse representation of the global structure of the latent manifold\.
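The core\-sample filter and centroid computation can be sketched as below. The function name and the `min_members` cutoff are hypothetical (the text says only "sufficiently populated"), and noise points are assumed to carry the label −1, as in common HDBSCAN implementations.

```python
import numpy as np

def core_centroids(umap_xy, labels, strengths, min_strength=0.8, min_members=5):
    """Keep core points (strength >= min_strength) and average them per cluster in UMAP space."""
    core = (labels >= 0) & (strengths >= min_strength)  # label -1 is treated as noise
    centroids = {}
    for label in np.unique(labels[core]):
        members = core & (labels == label)
        if members.sum() >= min_members:  # hypothetical "sufficiently populated" cutoff
            centroids[int(label)] = umap_xy[members].mean(axis=0)
    return core, centroids

# Toy example: two clusters plus one noise point.
umap_xy = np.array([[0.0, 0.0], [0.2, 0.0], [0.1, 0.3], [5.0, 5.0], [5.2, 5.0], [9.0, 9.0]])
labels = np.array([0, 0, 0, 1, 1, -1])
strengths = np.array([0.9, 0.95, 0.5, 0.85, 0.8, 0.0])

core, centroids = core_centroids(umap_xy, labels, strengths, min_members=2)
print(centroids)
```

The third point is excluded because its membership strength (0.5) falls below the 0.8 threshold, so it does not pull the centroid of cluster 0.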

Synthetic samples are generated by providing a query UMAP coordinate $\mathbf{u}\in\mathbb{R}^{2}$\. The Euclidean distance from $\mathbf{u}$ to each cluster centroid $\mathbf{c}_{j}$ is computed, and these distances \(squared, as in Eq\. S1\) are converted into a probability distribution over clusters using a temperature\-controlled top\-$k$ softmax weighting scheme:

$$w_{j}=\frac{\exp\!\left(-\|\mathbf{u}-\mathbf{c}_{j}\|^{2}\,/\,T\right)}{\sum_{l\in\mathcal{K}}\exp\!\left(-\|\mathbf{u}-\mathbf{c}_{l}\|^{2}\,/\,T\right)},\quad j\in\mathcal{K}, \tag{S1}$$

where $\mathcal{K}$ denotes the set of $k$ nearest cluster centroids to $\mathbf{u}$, and $T$ is the temperature parameter\. In our implementation, $k=2$ and $T=0.1$\. Weights for all clusters outside $\mathcal{K}$ are set to zero\. A cluster is selected stochastically according to the resulting distribution $\{w_{j}\}$\.
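As a worked instance of Eq\. S1, the following sketch computes the top\-$k$ softmax weights for a toy set of centroids and then draws a cluster. The function name and the centroid values are illustrative assumptions; only $k=2$ and $T=0.1$ come from the text.

```python
import numpy as np

def topk_softmax_weights(u, centroids, k=2, temperature=0.1):
    """Eq. S1: softmax over negative squared distances, restricted to the k nearest centroids."""
    sq_dist = np.sum((centroids - u) ** 2, axis=1)
    nearest = np.argsort(sq_dist)[:k]
    logits = -sq_dist[nearest] / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    w /= w.sum()
    weights = np.zeros(len(centroids))  # clusters outside the top-k get zero weight
    weights[nearest] = w
    return weights

centroids = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])  # toy UMAP centroids
u = np.array([0.4, 0.0])                                      # query UMAP coordinate

weights = topk_softmax_weights(u, centroids)
rng = np.random.default_rng(0)
chosen = rng.choice(len(centroids), p=weights)  # stochastic cluster selection
print(weights, chosen)
```

Here the squared distances to the two nearest centroids are 0.16 and 0.36, so with $T=0.1$ the logits differ by 2 and the nearer cluster receives weight $1/(1+e^{-2})$. A small temperature makes the selection close to nearest\-centroid assignment while still allowing occasional mixing near cluster boundaries.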

To reconstruct high\-dimensional latent vectors from the selected cluster, intra\-cluster variability is modelled using PCA\. For each cluster, a PCA model is fitted to its core latent vectors using up to 64 principal components\. New latent candidates are sampled by drawing Gaussian noise in the PCA coordinate space, scaled by the per\-component standard deviations with a scale factor of 0\.75\. The sampled vector is inverse\-transformed to the full latent space to yield a candidate latent vector\. To reduce out\-of\-distribution artefacts, per\-dimension clamping is applied, restricting each latent dimension to the interval $\mu\pm k\sigma$, where $\mu$ and $\sigma$ denote the global mean and standard deviation of the core latent vectors and $k=2.5$ \(a clamping factor distinct from the top\-$k$ count used for cluster selection\)\. The resulting latent vectors are decoded in batches by the trained C\-VAE decoder to produce grayscale scattering images\.
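The cluster\-wise PCA sampling and clamping step might look like the sketch below. It assumes a plain SVD\-based PCA and hypothetical function names; the toy clusters are synthetic, and decoding through the C\-VAE decoder is omitted.

```python
import numpy as np

def fit_cluster_pca(core_latents, max_components=64):
    """Fit PCA to one cluster's core latents; return mean, components, per-component std."""
    mean = core_latents.mean(axis=0)
    _, s, components = np.linalg.svd(core_latents - mean, full_matrices=False)
    n = min(max_components, len(s))
    std = s[:n] / np.sqrt(max(len(core_latents) - 1, 1))
    return mean, components[:n], std

def sample_cluster_latents(pca, n_samples, global_mean, global_std,
                           noise_scale=0.75, clamp_k=2.5, rng=None):
    """Draw scaled Gaussian noise in PCA space, map back to latent space, and clamp."""
    rng = np.random.default_rng() if rng is None else rng
    mean, components, std = pca
    coords = rng.normal(size=(n_samples, len(std))) * std * noise_scale
    candidates = mean + coords @ components  # inverse PCA transform
    lower = global_mean - clamp_k * global_std
    upper = global_mean + clamp_k * global_std
    return np.clip(candidates, lower, upper)

# Toy core latents for two clusters (hypothetical data).
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=-2.0, scale=1.0, size=(300, 512))
cluster_b = rng.normal(loc=2.0, scale=0.5, size=(300, 512))
all_core = np.vstack([cluster_a, cluster_b])
global_mean, global_std = all_core.mean(axis=0), all_core.std(axis=0)

pca_a = fit_cluster_pca(cluster_a)
samples = sample_cluster_latents(pca_a, 16, global_mean, global_std, rng=rng)
print(samples.shape)
```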

![Refer to caption](https://arxiv.org/html/2606.14999v1/x13.png)

Figure S3: Overview of the UMAP\-guided latent manifold sampling pipeline\. Cluster\-aware PCA models are fitted to core latent vectors, new samples are drawn using a temperature\-controlled top\-$k$ softmax weighting scheme in UMAP space, and the resulting latent vectors are decoded by the trained C\-VAE decoder to produce synthetic scattering images\.

##### Comparison with simpler baselines\.

To contextualise this approach, three simpler baseline strategies were evaluated: global PCA sampling, clusterwise PCA sampling, and interpolation between cluster centroids\. Global PCA sampling tended to oversmooth distinct structural modes by averaging across all clusters\. Clusterwise PCA sampling limited diversity by restricting samples to individual clusters without interpolation\. Mean\-based interpolation between centroids reduced variability and produced less realistic reconstructions\. Relative to these baselines, the UMAP\-guided strategy better preserved the structural diversity of the training distribution\.

### Conditional Flow Matching

Conditional flow matching learns a continuous normalizing flow \[[24](https://arxiv.org/html/2606.14999#bib.bib49)\] that transports samples from a standard Gaussian prior $\mathcal{N}(\mathbf{0},\mathbf{I})$ to the training latent distribution, conditioned on a two\-dimensional UMAP coordinate\. This provides a more expressive generative model than PCA\-based sampling by learning the full conditional density $p(\mathbf{z}\mid\mathbf{u})$ rather than approximating it with a Gaussian in PCA space\.

##### Training objective\.

Given a training latent vector $\mathbf{z}_{1}$ and noise $\boldsymbol{\varepsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$, the linear interpolation path is defined as:

$$\mathbf{z}_{t}=(1-t)\,\boldsymbol{\varepsilon}+t\,\mathbf{z}_{1},\qquad t\sim\mathrm{Uniform}[0,1], \tag{S2}$$

with velocity target $\mathbf{v}=\mathbf{z}_{1}-\boldsymbol{\varepsilon}$ along this path\. A velocity network $\mathbf{v}_{\theta}(\mathbf{z}_{t},t,\mathbf{u})$ is trained to predict this target:

$$\mathcal{L}(\theta)=\mathbb{E}_{t,\,\mathbf{z}_{1},\,\boldsymbol{\varepsilon},\,\mathbf{u}}\left\|\mathbf{v}_{\theta}(\mathbf{z}_{t},t,\mathbf{u})-(\mathbf{z}_{1}-\boldsymbol{\varepsilon})\right\|^{2}. \tag{S3}$$
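The interpolation path and loss in Eqs\. S2 and S3 can be written out numerically. This is a framework\-free sketch: `velocity_fn` is a placeholder for the trained network, the function names are hypothetical, and the zero\-velocity callable only serves to evaluate the loss on a toy batch.

```python
import numpy as np

def flow_matching_batch(z1, rng):
    """Build (z_t, t, target) for one batch following Eq. S2."""
    eps = rng.normal(size=z1.shape)                      # noise sample
    t = rng.uniform(0.0, 1.0, size=(z1.shape[0], 1))     # one time per example
    zt = (1.0 - t) * eps + t * z1                        # linear interpolation path
    target = z1 - eps                                    # constant velocity along the path
    return zt, t, target

def flow_matching_loss(velocity_fn, z1, u, rng):
    """Monte Carlo estimate of Eq. S3 for a callable velocity_fn(z_t, t, u)."""
    zt, t, target = flow_matching_batch(z1, rng)
    pred = velocity_fn(zt, t, u)
    return np.mean(np.sum((pred - target) ** 2, axis=1))

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 512))              # toy "training latents"
u = rng.uniform(-1.0, 1.0, size=(8, 2))     # toy UMAP coordinates

zt, t, target = flow_matching_batch(z1, rng)
zero_velocity = lambda z, t, u: np.zeros_like(z)  # placeholder for an untrained network
loss = flow_matching_loss(zero_velocity, z1, u, rng)
print(loss)
```

A useful sanity check follows directly from Eq\. S2: moving from $\mathbf{z}_{t}$ along the target velocity for the remaining time $1-t$ recovers the data point, since $\mathbf{z}_{t}+(1-t)(\mathbf{z}_{1}-\boldsymbol{\varepsilon})=\mathbf{z}_{1}$.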

##### Velocity network architecture\.

The network consists of six residual blocks, each applying adaptive layer normalisation \(AdaLN\) conditioned on a fused 256\-dimensional embedding of time and UMAP coordinate\. The time coordinate is encoded via a sinusoidal embedding of dimension 128, projected through a two\-layer MLP\. The UMAP coordinate is encoded through a three\-layer MLP \($2\to 64\to 256\to 256$ with SiLU activations\)\. Both embeddings are concatenated and merged through a two\-layer MLP to produce the shared conditioning vector\. Each residual block applies AdaLN normalisation, a $4\times$ expansion feedforward network with SiLU activation and dropout of 0\.1, and a residual connection\. The output projection is initialised to zero to ensure the network starts from the identity flow\. Architecture hyperparameters are summarised in Table S1\.
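To make the block structure concrete, here is a forward\-pass sketch of one AdaLN residual block and the sinusoidal time embedding. Several details are assumptions rather than facts from the text: the scale\-and\-shift form `layer_norm(x) * (1 + scale) + shift` follows a common AdaLN convention, the embedding frequencies use a conventional `max_period` of 10000, the parameter names are hypothetical, weights are random, and dropout is omitted as it would be at inference.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sinusoidal_embedding(t, dim=128, max_period=10000.0):
    """Map scalar times to [sin, cos] features at geometrically spaced frequencies."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = t.reshape(-1, 1) * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def adaln_block(x, cond, params):
    """Residual block: conditioning-dependent scale/shift, then a 4x feedforward network."""
    scale, shift = np.split(cond @ params["w_mod"], 2, axis=1)
    h = layer_norm(x) * (1.0 + scale) + shift       # assumed AdaLN parameterisation
    h = silu(h @ params["w_in"]) @ params["w_out"]  # 4x expansion and projection back
    return x + h                                    # residual connection

rng = np.random.default_rng(0)
hidden, cond_dim, batch = 512, 256, 4
params = {
    "w_mod": rng.normal(scale=0.02, size=(cond_dim, 2 * hidden)),
    "w_in": rng.normal(scale=0.02, size=(hidden, 4 * hidden)),
    "w_out": rng.normal(scale=0.02, size=(4 * hidden, hidden)),
}
x = rng.normal(size=(batch, hidden))
cond = rng.normal(size=(batch, cond_dim))   # stand-in for the fused time/UMAP embedding

y = adaln_block(x, cond, params)
t_emb = sinusoidal_embedding(np.array([0.0, 0.5, 1.0]))
w_final = np.zeros((hidden, 512))           # zero-initialised output projection
velocity_at_init = y @ w_final
print(y.shape, t_emb.shape, np.abs(velocity_at_init).max())
```

With the output projection at zero, the predicted velocity is zero everywhere at initialisation, so integrating the ODE leaves every sample unchanged, which is what "starting from the identity flow" means.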

##### Classifier\-free guidance\.

To support fidelity\-diversity trade\-off control at inference, classifier\-free guidance \[[14](https://arxiv.org/html/2606.14999#bib.bib50)\] is incorporated during training by randomly zeroing the UMAP conditioning with probability $p=0.15$, training the network simultaneously as a conditional and unconditional model\. At inference, the guided velocity is constructed as:

$$\mathbf{v}_{\mathrm{guided}}=\mathbf{v}_{\mathrm{uncond}}+s\cdot\left(\mathbf{v}_{\mathrm{cond}}-\mathbf{v}_{\mathrm{uncond}}\right), \tag{S4}$$

where $s=5.0$ is the guidance scale\. This formulation allows trading sample diversity for conditioning fidelity at inference time without retraining\.

##### Training details\.

The model was trained on high\-confidence cluster members with HDBSCAN membership strength exceeding 0\.5, corresponding to approximately 643,000 points, using a 90/10 training and validation split\. UMAP coordinates were normalized to $[-1,1]$ and latent vectors were standardized to zero mean and unit variance prior to training\. Optimization used AdamW with a learning rate of $3\times 10^{-4}$ and weight decay of $10^{-4}$, combined with a cosine annealing learning rate scheduler and automatic mixed precision\. At inference, the ODE is integrated over 20 Euler steps with guidance scale $s=5.0$\.
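The inference procedure (guided velocity, 20 Euler steps) can be sketched as follows. The sampler is generic; `toy_velocity` is a hypothetical closed\-form field, namely the exact flow\-matching velocity when the conditional target is the single point `u @ A`, and it stands in for the trained network so that the result can be checked by hand. The unconditional branch is obtained by zeroing the conditioning, mirroring the training\-time dropout described above.

```python
import numpy as np

def guided_velocity(velocity_fn, z, t, u, scale=5.0):
    """Classifier-free guidance: v_uncond + s * (v_cond - v_uncond)."""
    v_cond = velocity_fn(z, t, u)
    v_uncond = velocity_fn(z, t, np.zeros_like(u))  # zeroed conditioning = unconditional
    return v_uncond + scale * (v_cond - v_uncond)

def euler_sample(velocity_fn, u, latent_dim=512, n_steps=20, scale=5.0, rng=None):
    """Integrate dz/dt = v_guided from t=0 (Gaussian noise) to t=1 with fixed Euler steps."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.normal(size=(u.shape[0], latent_dim))
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = np.full((u.shape[0], 1), step * dt)
        z = z + dt * guided_velocity(velocity_fn, z, t, u, scale)
    return z

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 512))  # hypothetical linear map from UMAP space to latent space

def toy_velocity(z, t, u):
    # Exact velocity for a point-mass target u @ A on the linear path.
    return (u @ A - z) / (1.0 - t)

u = np.array([[0.3, -0.7], [1.0, 0.5]])
z_plain = euler_sample(toy_velocity, u, scale=1.0, rng=rng)
z_guided = euler_sample(toy_velocity, u, scale=5.0, rng=rng)
print(np.abs(z_plain - u @ A).max(), np.abs(z_guided - 5.0 * u @ A).max())
```

In this toy field the unconditional target is the origin, so a guidance scale of $s$ moves the endpoint to $s$ times the conditional target. That makes the trade\-off visible: larger $s$ pushes samples further from the unconditional solution in the direction indicated by the conditioning.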

![Refer to caption](https://arxiv.org/html/2606.14999v1/x14.png)

Figure S4: Overview of the conditional flow matching pipeline. During training (top), the velocity network learns to predict the true velocity $\mathbf{z}_{1}-\boldsymbol{\varepsilon}$ from interpolated inputs conditioned on time $t$ and UMAP coordinate $\mathbf{u}$, with classifier-free guidance dropout $p=0.15$. The trained network weights are transferred directly to inference (bottom), where a query UMAP coordinate conditions the ODE integration over 20 Euler steps to produce a synthetic latent vector, which is decoded by the trained C-VAE decoder to yield a synthetic scattering image.

Table S1: Conditional flow matching architecture and training hyperparameters.

| Hyperparameter | Value |
|---|---|
| Latent dimension | 512 |
| UMAP conditioning dimension | 2 |
| Hidden dimension | 512 |
| Conditioning dimension | 256 |
| Number of residual blocks | 6 |
| Time embedding dimension | 128 |
| Dropout | 0.1 |
| CFG dropout probability | 0.15 |
| Guidance scale (inference) | 5.0 |
| ODE integration steps | 20 |
| Training points | $\sim 643{,}000$ |
| Training / validation split | 90 / 10 |
| Optimiser | AdamW |
| Learning rate | $3\times 10^{-4}$ |
| Weight decay | $10^{-4}$ |
| LR scheduler | CosineAnnealingLR |
| Training epochs | 500 |

### Quantitative Evaluation of Generation Strategies

To evaluate generation quality, a real scattering image is selected from the training dataset at a known UMAP location; its two-dimensional UMAP coordinate is provided to each generation method as the sole input, and the resulting synthetic image is compared against the real image as ground truth. Twenty real images are sampled per structural cluster, yielding a cluster-stratified evaluation set that spans the full diversity of scattering patterns in the training distribution. We report two complementary metrics: the latent $\ell_{2}$ distance between the generated and real latent vectors, which measures conditioning fidelity, and the pixel mean squared error (MSE) between the generated and real decoded images, which measures image-level fidelity. Two baselines bound the comparison. PCA k-NN averages the $k=5$ real training latents nearest to the query UMAP coordinate and adds PCA-space noise; because it has direct access to real training data, it serves as a retrieval upper bound and does not produce novel samples. PCA random samples from the global PCA distribution without any UMAP conditioning, serving as the unconditional lower bound.
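The two metrics and the retrieval step of the PCA k-NN baseline are simple enough to write out. The sketch below uses hypothetical function names and plain Python lists; it deliberately omits the PCA-space noise that the baseline adds, since the source does not specify how that noise is drawn.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two vectors (used for latent fidelity)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pixel_mse(img_a, img_b):
    """Mean squared error over all pixels of two equally sized 2-D images."""
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)

def knn_mean_latent(query_u, train_u, train_z, k=5):
    """Average the latents of the k training points nearest to query_u in UMAP space."""
    order = sorted(range(len(train_u)),
                   key=lambda i: l2_distance(query_u, train_u[i]))[:k]
    dim = len(train_z[0])
    return [sum(train_z[i][d] for i in order) / len(order) for d in range(dim)]
```

Because `knn_mean_latent` reads real training latents directly, any score it achieves reflects retrieval rather than generation, which is the reason the text treats it as a bound and not a competitor.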

Results are summarised in Table S2. Flow matching achieves the lowest latent $\ell_{2}$ distance (mean $\pm$ s.d.: $2.48\pm 1.82$, median 1.84), outperforming UMAP-guided PCA ($5.39\pm 3.41$) and, on average, falling below the retrieval reference set by PCA k-NN ($2.99\pm 0.55$), although with a larger spread. On pixel MSE, PCA k-NN scores lowest ($4.3\times 10^{-3}$); however, this reflects near-copy retrieval of real training images rather than generative capability. Flow matching ($7.6\times 10^{-3}$) substantially outperforms PCA random and UMAP-guided PCA (both $4.9\times 10^{-2}$) while producing genuinely novel samples. The unconditional baseline has the highest latent $\ell_{2}$ distance and, at the reported precision, ties UMAP-guided PCA for the highest pixel MSE, supporting the conclusion that UMAP conditioning is necessary for generating structurally relevant scattering patterns.

Table S2: Conditional generation quality across methods (mean $\pm$ s.d. across all query points).

| Method | Latent $\ell_{2}$ $\downarrow$ | Pixel MSE $\downarrow$ |
|---|---|---|
| PCA k-NN | $2.99\pm 0.55$ | $(4.3\pm 6.0)\times 10^{-3}$ |
| PCA random | $6.28\pm 1.07$ | $(4.9\pm 4.0)\times 10^{-2}$ |
| UMAP-guided PCA | $5.39\pm 3.41$ | $(4.9\pm 5.9)\times 10^{-2}$ |
| Conditional flow matching | $\mathbf{2.48\pm 1.82}$ | $\mathbf{(7.6\pm 27.2)\times 10^{-3}}$ |

## 3 Hyperparameter Selection and Evaluation Metrics

This note provides additional details regarding the training procedure, hyperparameter selection, and quantitative evaluation metrics for the C\-VAE model\.

### 3.1 Hyperparameter Selection

Hyperparameters were chosen based on reconstruction performance, KL divergence behavior during training, and the structural quality of the resulting latent space representations. Candidate values and final selections are summarized in Table [S3](https://arxiv.org/html/2606.14999#S3.T3).

Table S3: Hyperparameter configurations explored for the C-VAE model.

| Hyperparameter | Candidates | Chosen |
|---|---|---|
| Latent dimension (`latent_dim`) | 128, 256, 512 | 512 |
| Encoder/decoder depth | 3, 4, 5 | 5 |
| Image size | 256, 512 | 512 |
| Learning rate | $1\times 10^{-4}$ | $1\times 10^{-4}$ |
| Optimizer | Adam | Adam |
| KL weight ($\beta$) | 0.25, 0.5, 1.0 | 0.5 |
| Training epochs | 125 | 125 |
| Training batch size | 16, 32, 64 | 64 |
| Validation batch size | 16, 32, 64 | 64 |
| LR scheduler | CosineAnnealingLR | CosineAnnealingLR |

The latent dimensionality was evaluated at 128, 256, and 512. Figure [S5](https://arxiv.org/html/2606.14999#S3.F5a) shows the reconstruction loss (MSE) and KL divergence behavior during training for all three configurations. Increasing the latent dimension consistently reduces the reconstruction loss for both the training and validation sets. The model with a latent dimension of 512 achieves the lowest reconstruction MSE and demonstrates stable convergence across training epochs. Although all configurations exhibit similar convergence trends, the 512-dimensional model maintains slightly lower and more stable KL divergence values toward the end of training, indicating improved latent regularization.

Quantitative reconstruction and latent space metrics further support this selection, as summarized in Table [S5](https://arxiv.org/html/2606.14999#S3.T5). The 512-dimensional configuration achieves the lowest reconstruction MSE and highest PSNR among the evaluated models. Its latent space also shows favorable clustering characteristics: the Silhouette score (0.41) matches the best value among the three models, and the Davies–Bouldin index (0.98) is the lowest. Although the latent dimensionality is set to 512, approximately 404 latent units (about 79%) remain active, indicating that the model utilizes a large fraction of the available representational capacity while maintaining effective regularization. Based on these results, a latent dimension of 512 was selected for the final model.

The encoder and decoder depth were varied between 3 and 5 layers, with a depth of 5 providing improved reconstruction quality without signs of overfitting. All models were trained on images resized to $512\times 512$ pixels. During training, input images were augmented using random rotations and horizontal or vertical flips to improve robustness. Optimization was performed using the Adam optimizer with a learning rate of $1\times 10^{-4}$, combined with a CosineAnnealingLR scheduler to ensure smooth convergence throughout training. The KL divergence weight in the VAE loss function was fixed at $\beta=0.5$, balancing reconstruction fidelity and latent regularization \[[13](https://arxiv.org/html/2606.14999#bib.bib24)\]. Training was performed for 125 epochs using batch sizes of 64 for both training and validation datasets.
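To make the role of the KL weight concrete, the following sketch evaluates a $\beta$-weighted VAE objective for a single sample. It assumes the standard form (mean squared reconstruction error plus $\beta$ times the KL divergence between a diagonal Gaussian posterior and a standard normal prior); the source does not state the exact reduction or scaling used, so the function names and those choices are illustrative only.

```python
import math

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, logvar))

def beta_vae_loss(x, x_hat, mu, logvar, beta=0.5):
    """Reconstruction MSE plus beta-weighted KL term (assumed loss form)."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return recon + beta * kl_diag_gaussian(mu, logvar)
```

With $\beta<1$, as in the chosen setting of 0.5, the KL penalty is down-weighted relative to the standard VAE objective, so the optimizer is allowed to spend more latent capacity on reconstruction.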

Beyond model training, the downstream dimensionality reduction steps require their own parameter choices. Table [S4](https://arxiv.org/html/2606.14999#S3.T4) summarizes the settings used for PCA, UMAP, and t-SNE applied to the learned latent representations. These are analysis decisions made after training and do not affect the learned embeddings themselves. PCA was used as a linear baseline for variance analysis and two-dimensional visualization. For UMAP, a higher neighbor count of 30 was selected over smaller values to better preserve global structure across the latent space, and cosine distance was used as the metric to account for the directional nature of the L2-normalized embeddings. A minimum distance of 0.1 was chosen to allow moderate separation between clusters while avoiding over-compression. For t-SNE, a perplexity of 30 was found to produce stable and interpretable cluster separation. The number of iterations was set to 1500 to ensure convergence, and PCA initialization was used to improve stability and reproducibility of the embedding. All three methods were applied to L2-normalized latent vectors extracted from the final C-VAE model.
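The link between L2 normalization and the cosine metric can be checked numerically. The sketch below (hypothetical helper names, plain Python) normalizes two vectors and compares their cosine distance with their squared Euclidean distance; for unit vectors the latter is exactly twice the former.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    """Squared Euclidean distance between a and b."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])   # -> [0.6, 0.8]
b = l2_normalize([1.0, 0.0])   # -> [1.0, 0.0]
# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b)).
gap = squared_euclidean(a, b) - 2.0 * cosine_distance(a, b)
```

One consequence is that, on L2-normalized latents, cosine and Euclidean distances rank neighbors identically, so the metric choice mainly affects how distances are scaled rather than which points count as neighbors.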

Table S4: Hyperparameter configurations for PCA, UMAP, and t-SNE.

| Method | Hyperparameter | Candidates | Chosen |
|---|---|---|---|
| PCA | Number of components | – | 2 |
| UMAP | Number of neighbors | 15, 30 | 30 |
| UMAP | Minimum distance | 0.05, 0.1 | 0.1 |
| UMAP | Number of components | – | 2 |
| UMAP | Metric | cosine, euclidean | cosine |
| t-SNE | Perplexity | 20, 30, 50 | 30 |
| t-SNE | Number of iterations | 1000, 1500 | 1500 |
| t-SNE | Initialization | random, PCA | PCA |
| t-SNE | Learning rate | – | auto |
| t-SNE | Number of components | – | 2 |

![Refer to caption](https://arxiv.org/html/2606.14999v1/figures_appendix/mse_vs_epoch_latent_sweep_semilog.png)

(a) Reconstruction loss (MSE) plotted on a logarithmic scale for training and validation sets.

![Refer to caption](https://arxiv.org/html/2606.14999v1/figures_appendix/kld_vs_epoch_latent_sweep_semilog.png)

(b) KL divergence loss plotted on a logarithmic scale across training epochs.

Figure S5: Training behavior of C-VAE models with latent dimensions of 128, 256, and 512.

Table S5: Latent space and reconstruction metrics for models with different latent dimensions.

| Latent dim | Active units | Recon MSE ($10^{-3}$) | PSNR | SSIM | KL loss | Silhouette | Davies–Bouldin | Calinski–Harabasz | PCA PR |
|---|---|---|---|---|---|---|---|---|---|
| 128 | 128 | 5.04 | 23.01 | 0.64 | 241.03 | 0.41 | 1.07 | 550.01 | 3.08 |
| 256 | 256 | 5.18 | 22.91 | 0.65 | 237.68 | 0.40 | 1.06 | 506.91 | 3.42 |
| 512 | 404 | 4.98 | 23.08 | 0.64 | 218.56 | 0.41 | 0.98 | 527.33 | 3.86 |

### 3.2 Latent Space Evaluation Metrics

To evaluate the structure of the learned latent representations, we report several commonly used clustering and representation quality metrics\.

The Silhouette score measures how well samples are separated between clusters. For each sample, it compares the mean distance to the other points in the same cluster with the mean distance to the points in the nearest neighboring cluster. Values range from $-1$ to $1$, and higher values indicate more clearly separated clusters.
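For readers who want the definition spelled out, here is a small reference implementation of the mean Silhouette score using Euclidean distance. It is a didactic sketch with quadratic cost, not the routine used by the authors (which the source does not name); samples in singleton clusters are given a score of 0, a common convention.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all samples; requires at least two clusters."""
    n = len(points)
    clusters = sorted(set(labels))
    total = 0.0
    for i in range(n):
        # Mean distance from sample i to each cluster (excluding i itself).
        mean_dist = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(math.dist(points[i], points[j])
                                   for j in members) / len(members)
        own = labels[i]
        if own not in mean_dist:
            continue  # singleton cluster: contributes 0 by convention
        a = mean_dist[own]                                    # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)  # separation
        if max(a, b) > 0.0:
            total += (b - a) / max(a, b)
    return total / n

# Two tight, well-separated groups score close to 1 ...
good = silhouette_score([[0.0], [1.0], [10.0], [11.0]], [0, 0, 1, 1])
# ... while a labelling that mixes the groups scores below 0.
bad = silhouette_score([[0.0], [1.0], [10.0], [11.0]], [0, 1, 0, 1])
```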

Cluster separation is further characterized by two complementary indices. The Davies–Bouldin index measures the ratio of within-cluster scatter to between-cluster distance, where lower values indicate more compact and well-separated clusters. The Calinski–Harabasz score measures the ratio of between-cluster variance to within-cluster variance, where larger values indicate stronger separation. Together they provide convergent evidence of cluster quality from different geometric perspectives.

The PCA participation ratio (PCA PR) estimates the effective dimensionality of the latent representation. This metric measures how many principal components contribute significantly to the total variance of the latent space. Higher values indicate that the latent representation captures richer structural variability in the data.
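The source does not give a formula for the participation ratio; a commonly used definition, assumed here, is $\mathrm{PR}=\left(\sum_i\lambda_i\right)^2/\sum_i\lambda_i^2$, where $\lambda_i$ are the eigenvalues of the latent covariance matrix. Since $\sum_i\lambda_i=\mathrm{tr}(C)$ and $\sum_i\lambda_i^2=\lVert C\rVert_F^2$, the ratio can be computed without an explicit eigendecomposition, as in this plain-Python sketch (hypothetical function name; a practical pipeline would use an array library).

```python
def participation_ratio(latents):
    """PR = tr(C)^2 / ||C||_F^2 for the sample covariance C of `latents`."""
    n, d = len(latents), len(latents[0])
    means = [sum(row[j] for row in latents) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in latents]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    trace = sum(cov[a][a] for a in range(d))                         # sum of eigenvalues
    frob_sq = sum(cov[a][b] ** 2 for a in range(d) for b in range(d))  # sum of squared eigenvalues
    return trace ** 2 / frob_sq

# Variance spread evenly over two axes gives PR = 2 ...
isotropic = participation_ratio([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
# ... while data confined to a single line gives PR = 1.
line = participation_ratio([[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]])
```

Under this definition PR ranges from 1 (all variance along one direction) up to the latent dimension (variance spread equally over all directions).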
