Residual-Space Evolutionary Optimization via Flow-based Generative Models


Summary

Introduces a framework combining flow-based generative editing with evolutionary algorithms to perform optimization in residual space, enabling controllable data editing with non-differentiable objectives. Validated on MorphoMNIST and crystal data.

arXiv:2606.20084v1 Announce Type: new Abstract: Data editing with generative methods typically requires differentiable objectives and gradient-based search. However, these assumptions break down in flow-based settings, where edits are performed through forward and backward integration and often involve non-differentiable or black-box objectives. We introduce residual-space evolutionary optimization, a model-agnostic framework that addresses this gap by combining flow-based generative editing with evolutionary algorithms. Building on the observation that conditional flow matching (CFM) can disentangle condition-controlled factors from instance-specific residuals, our framework directly operates in residual space and separates two complementary search regimes: self-pollination performs local exploitation through feature-preserving residual refinement, and cross-pollination promotes broader exploration by recombining residuals across heterogeneous samples. As a proof of concept, we validate on MorphoMNIST, a benchmark dataset for counterfactual generation, and on crystal data, demonstrating that this exploration--exploitation decomposition provides a useful mechanism for balancing target alignment, instance preservation, and diversity, and extends beyond images to real-world scientific domains.

Cached at: 06/20/26, 02:35 PM

# Residual-Space Evolutionary Optimization via Flow-based Generative Models
Source: [https://arxiv.org/html/2606.20084](https://arxiv.org/html/2606.20084)
###### Abstract

Data editing with generative methods typically requires differentiable objectives and gradient\-based search\. However, these assumptions break down in flow\-based settings, where edits are performed through forward and backward integration and often involve non\-differentiable or black\-box objectives\. We introduce residual\-space evolutionary optimization, a model\-agnostic framework that addresses this gap by combining flow\-based generative editing with evolutionary algorithms\. Building on the observation that conditional flow matching \(CFM\) can disentangle condition\-controlled factors from instance\-specific residuals, our framework directly operates in residual space and separates two complementary search regimes: *self\-pollination* performs local exploitation through feature\-preserving residual refinement, and *cross\-pollination* promotes broader exploration by recombining residuals across heterogeneous samples\. As a proof of concept, we validate on MorphoMNIST, a benchmark dataset for counterfactual generation, and on crystal data, demonstrating that this exploration–exploitation decomposition provides a useful mechanism for balancing target alignment, instance preservation, and diversity, and extends beyond images to real\-world scientific domains\.

data optimization, conditional flow matching, counterfactual explanations, evolutionary algorithms

## 1 Introduction

Controllable data editing, i\.e\., modifying targeted attributes while preserving instance\-specific structure, is a core operation in machine learning, with applications ranging from counterfactual explanations to data augmentation\. Beyond images, it is equally important in scientific domains such as drug discovery, crystal structure prediction, and materials optimization, where controllable edits can steer valid samples toward desired functional properties\. Most existing approaches treat editing as gradient\-based optimization, implicitly assuming that objectives are differentiable and that the generative pipeline is fully transparent\. A broad line of work in the image domain, including feature visualization and network dissection \(Mahendran and Vedaldi, [2015](https://arxiv.org/html/2606.20084#bib.bib5); Olah et al\., [2017](https://arxiv.org/html/2606.20084#bib.bib6); Carter et al\., [2019](https://arxiv.org/html/2606.20084#bib.bib15); Bau et al\., [2017](https://arxiv.org/html/2606.20084#bib.bib16), [2019](https://arxiv.org/html/2606.20084#bib.bib17); Selvaraju et al\., [2020](https://arxiv.org/html/2606.20084#bib.bib13)\), style transfer and image\-to\-image translation \(Gatys et al\., [2016](https://arxiv.org/html/2606.20084#bib.bib7); Zhu et al\., [2017](https://arxiv.org/html/2606.20084#bib.bib12); Isola et al\., [2017](https://arxiv.org/html/2606.20084#bib.bib18); Park et al\., [2020](https://arxiv.org/html/2606.20084#bib.bib19)\), GAN inversion and latent editing \(Abdal et al\., [2019](https://arxiv.org/html/2606.20084#bib.bib8); Shen et al\., [2020](https://arxiv.org/html/2606.20084#bib.bib20); Härkönen et al\., [2020](https://arxiv.org/html/2606.20084#bib.bib21); Patashnik et al\., [2021](https://arxiv.org/html/2606.20084#bib.bib22); Roich et al\., [2022](https://arxiv.org/html/2606.20084#bib.bib24); Pan et al\., [2023](https://arxiv.org/html/2606.20084#bib.bib25)\), and diffusion\-based image editing and controllable generation \(Dhariwal and Nichol, [2021](https://arxiv.org/html/2606.20084#bib.bib10); Meng et al\., [2022](https://arxiv.org/html/2606.20084#bib.bib26); Hertz et al\., [2023](https://arxiv.org/html/2606.20084#bib.bib29); Mokady et al\., [2023](https://arxiv.org/html/2606.20084#bib.bib30); Brooks et al\., [2023](https://arxiv.org/html/2606.20084#bib.bib31); Zhang et al\., [2023](https://arxiv.org/html/2606.20084#bib.bib33); Parmar et al\., [2023](https://arxiv.org/html/2606.20084#bib.bib34); Mou et al\., [2024](https://arxiv.org/html/2606.20084#bib.bib35)\), shares this assumption, treating editing as an optimization problem over pixels, features, or latent variables\. This assumption does not hold in flow\-based generative editing, where edits are implemented through forward and backward numerical integration and objectives are often non\-differentiable or black\-box\. Recent work shows that conditional flow matching \(CFM\) disentangles condition\-controlled factors from instance\-specific residual information \(Li et al\., [2024](https://arxiv.org/html/2606.20084#bib.bib14); Cao et al\., [2025b](https://arxiv.org/html/2606.20084#bib.bib11)\), enabling iterative editing through repeated integration\. This iterative mechanism is naturally well\-suited to evolutionary algorithms \(Holland, [1975](https://arxiv.org/html/2606.20084#bib.bib36); Goldberg, [1989](https://arxiv.org/html/2606.20084#bib.bib37); Bäck, [1996](https://arxiv.org/html/2606.20084#bib.bib42); Eiben and Smith, [2015](https://arxiv.org/html/2606.20084#bib.bib43); Hansen and Ostermeier, [2001](https://arxiv.org/html/2606.20084#bib.bib44)\), which operate through repeated proposal, evaluation, and refinement, so that residual\-space edits act as genotype\-like variations that can be selected to optimize target properties\.

We propose *residual\-space evolutionary optimization*, a model\-agnostic framework that combines flow\-based generative editing with evolutionary algorithms\. Given a fixed conditional generator, our method maps data into residual states, edits these states through mutation and crossover, and decodes the resulting candidates under a target condition\. Selection is then performed using task\-specific criteria such as target validity, instance preservation, feature control, or diversity \(see [Figure 1](https://arxiv.org/html/2606.20084#S2.F1)\), without requiring gradient information from the generator\. Thus, the method acts as a lightweight optimization layer on top of an existing generator, rather than a new generative model training objective\.

A central perspective of our framework is that residual\-space evolution factorizes the classical exploration–exploitation trade\-off into two pollination mechanisms\. Self\-pollination exploits the local residual neighborhood of an existing sample, making it suitable for refinement problems where preserving the source instance is important\. Cross\-pollination explores a broader residual search space by recombining information across heterogeneous samples, which can help discover diverse candidates and mitigate premature convergence to a local optimum\. Importantly, we do not claim that cross\-pollination guarantees a global optimum; rather, it provides a mechanism for increasing coverage of the target\-conditioned solution space before selection\.

We instantiate the framework based on the existing work LeapFactual \(Cao et al\., [2025b](https://arxiv.org/html/2606.20084#bib.bib11)\) and evaluate it on MorphoMNIST \(Castro et al\., [2019](https://arxiv.org/html/2606.20084#bib.bib46)\) as a controlled image\-editing testbed\. Although images provide a convenient visualization domain, the framework is not image\-specific and can apply to any conditional data editing setting with an editable latent or residual representation\. We further validate the framework on the Wyckoff inorganic crystal generator \(WyCryst\) \(Zhu et al\., [2024](https://arxiv.org/html/2606.20084#bib.bib64)\), demonstrating applicability beyond the image domain to real\-world scientific data\. Our results demonstrate that residual states exposed by flow\-based generative editors constitute effective search spaces for controlled editing, with the exploration\-exploitation decomposition providing explicit mechanisms for balancing target alignment, instance preservation, and diversity\.

## 2 Method

![Refer to caption](https://arxiv.org/html/2606.20084v1/figures/leap_GA_illustration_v4.png)

Figure 1: Comparison of leap\-only, self\-pollination, and cross\-pollination\. Colored boxes visualize the individual methods\.

### 2\.1 Preliminaries

#### Evolutionary Algorithms\.

Evolutionary algorithms are population\-based optimization methods inspired by natural selection\. Given a population of candidate solutions, they iteratively generate new candidates through stochastic variation operators, such as *mutation* and *crossover*, and retain promising candidates through *selection*\. Mutation perturbs an individual candidate to explore its local neighborhood, while crossover recombines information from multiple candidates to produce new offspring\. Selection then evaluates candidates according to a task\-specific fitness function and keeps those that best satisfy the desired objective\. In our framework, residual states serve as the candidate representation, flow\-based edits act as variation operators, and task\-specific criteria define the fitness function\.

#### LeapFactual\.

Our framework builds on the flow\-based editing formulation of Cao et al\. \([2025b](https://arxiv.org/html/2606.20084#bib.bib11)\), which we briefly review\. Let $x$ denote an input image, $z = E(x)$ its autoencoder latent representation, $\hat{c} = f(x)$ its predicted source class, and $c_{\mathrm{tgt}}$ a user\-specified target class\. We assume a single shared conditional flow model $v_{\theta}(z_t, t, c)$ trained with class conditions\. At editing time, the same flow is used in two integration directions\.

The source\-conditioned reverse integration, referred to as *lifting*, removes class\-related information from the latent $z$,

$$z_{\mathrm{res}} = \operatorname{Lift}(z, \hat{c}), \tag{1}$$

returning the residual state $z_{\mathrm{res}}$, where the flow is integrated backward from $t = 1$ to $t = 0$\. The target\-conditioned forward integration, referred to as *landing*, reconstructs a complete latent $z'$ under the desired target condition $c_{\mathrm{tgt}}$:

$$z' = \operatorname{Land}(z_{\mathrm{res}}, c_{\mathrm{tgt}}), \tag{2}$$

where the flow is integrated forward from $t = 0$ to $t = 1$\. The combination of a *Lift* and a *Land* operation forms a *Leap*, and the edited image is then obtained as $x' = D(z')$ using the autoencoder’s decoder $D$\.
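The Lift, Land, and Leap operations of Eqs\. \(1\) and \(2\) can be sketched as numerical integration of a class\-conditional velocity field\. The sketch below is illustrative only: it assumes a fixed\-step Euler solver, and `toy_velocity` with its `CLASS_DRIFT` table is a hypothetical stand\-in for the trained model $v_{\theta}$, not the LeapFactual implementation\.

```python
from typing import Callable, Dict, List

Vector = List[float]
VelocityField = Callable[[Vector, float, int], Vector]

# Hypothetical per-class drift; a trained v_theta would replace this table.
CLASS_DRIFT: Dict[int, Vector] = {0: [1.0, 0.0], 1: [0.0, 2.0]}


def toy_velocity(z_t: Vector, t: float, c: int) -> Vector:
    """Toy stand-in for v_theta(z_t, t, c): a constant class-dependent drift."""
    return list(CLASS_DRIFT[c])


def integrate(v: VelocityField, z: Vector, c: int,
              t_start: float, t_end: float, steps: int = 50) -> Vector:
    """Fixed-step Euler integration of dz/dt = v(z, t, c) from t_start to t_end."""
    dt = (t_end - t_start) / steps  # negative when integrating backward
    z_t, t = list(z), t_start
    for _ in range(steps):
        drift = v(z_t, t, c)
        z_t = [zi + dt * di for zi, di in zip(z_t, drift)]
        t += dt
    return z_t


def lift(v: VelocityField, z: Vector, c_src: int, steps: int = 50) -> Vector:
    """Eq. (1): integrate backward from t=1 to t=0 under the source class."""
    return integrate(v, z, c_src, 1.0, 0.0, steps)


def land(v: VelocityField, z_res: Vector, c_tgt: int, steps: int = 50) -> Vector:
    """Eq. (2): integrate forward from t=0 to t=1 under the target class."""
    return integrate(v, z_res, c_tgt, 0.0, 1.0, steps)


def leap(v: VelocityField, z: Vector, c_src: int, c_tgt: int,
         steps: int = 50) -> Vector:
    """A Leap is a Lift under the source class followed by a Land under the target."""
    return land(v, lift(v, z, c_src, steps), c_tgt, steps)


z = [3.0, 5.0]                                    # latent of a class-0 sample
z_res = lift(toy_velocity, z, c_src=0)            # class-0 drift removed
z_edit = leap(toy_velocity, z, c_src=0, c_tgt=1)  # class-1 drift added instead
```

With this toy field, lifting subtracts the source\-class drift and landing adds the target\-class drift, so `z_res` is approximately `[2.0, 5.0]` and `z_edit` approximately `[2.0, 7.0]`; landing `z_res` under the source class recovers `z` up to floating\-point error\.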

#### Design Principle\.

In the experiments below, we use this formulation as a concrete instantiation of a broader residual\-space optimization principle\. All search operations are performed in $z_{\mathrm{res}}$, rather than in image space or in the latent space of the autoencoder, preserving a clean separation: class\-related information is controlled by the source and target conditions, while instance\-specific residual variation is manipulated by the search procedure\. Although source and target conditions may be identical, we show empirically that allowing class changes enables more data\-efficient use of residual information across instances\.

### 2\.2Residual\-Space Evolutionary Optimization

We introduce an evolutionary layer on top of a frozen conditional flow model, treating the residual state $z_{\mathrm{res}}$ as the searchable genome of a sample, while leaving the conditional flow model responsible for imposing the source and target conditions through *Lift* and *Land*\. This design separates two roles: the flow model controls semantic conditioning, whereas the evolutionary layer searches over instance\-specific variation\.

Depending on how the residual population is constructed, this framework leads to two complementary search regimes\. *Self\-pollination* instantiates local exploitation: it starts from a single residual and uses mutation to refine candidates within the neighborhood of an existing solution\. *Cross\-pollination* instantiates broader exploration: it starts from multiple residuals and recombines them through crossover, allowing residual information from heterogeneous sources to generate candidates in different regions of the target\-conditioned space\. Pseudocode is in Appendix [A](https://arxiv.org/html/2606.20084#A1)\.

#### Self\-pollination\.

For a single input $x$, self\-pollination first computes its residual state through the source\-conditioned *Lift* operation: $z_{\mathrm{res}} = \operatorname{Lift}(E(x), \hat{c})$\. It then constructs a child pool of size $m$ by sampling perturbed residuals,

$$\tilde{z}^{(m)}_{\mathrm{res}} = z_{\mathrm{res}} + \epsilon^{(m)}, \qquad \epsilon^{(m)} \sim \mathcal{N}(0, \sigma^{2} I), \tag{3}$$

where alternative perturbations, such as feature swapping, can also be used when residual dimensions are treated as exchangeable genes\. Each child residual is then *landed* under the target condition and decoded: $x'^{(m)} = D(\operatorname{Land}(\tilde{z}^{(m)}_{\mathrm{res}}, c_{\mathrm{tgt}}))$\.

Selection keeps the best candidates according to a user\-defined fitness score\. Thus, self\-pollination performs local residual exploitation around one input and is mainly used for feature\-preserving refinement\. This makes it a natural fit for attractive objectives such as counterfactual explanation \(Dombrowski et al\., [2023](https://arxiv.org/html/2606.20084#bib.bib53); Samangouei et al\., [2018](https://arxiv.org/html/2606.20084#bib.bib54); Singla et al\., [2019](https://arxiv.org/html/2606.20084#bib.bib55); Nemirovsky et al\., [2020](https://arxiv.org/html/2606.20084#bib.bib56); Kim et al\., [2021](https://arxiv.org/html/2606.20084#bib.bib57); Hvilshøj et al\., [2021](https://arxiv.org/html/2606.20084#bib.bib59); Cao et al\., [2025a](https://arxiv.org/html/2606.20084#bib.bib63), [b](https://arxiv.org/html/2606.20084#bib.bib11)\), where the search should converge toward a target condition without unnecessarily drifting away from the source instance\.
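One generation of self\-pollination, i\.e\., Gaussian mutation as in Eq\. \(3\) followed by decoding and top\-k selection, might look as follows\. This is a minimal sketch under stated assumptions: `land_and_decode` and `fitness` are hypothetical user\-supplied callables standing in for $D(\operatorname{Land}(\cdot, c_{\mathrm{tgt}}))$ and the task\-specific score, and the toy usage replaces both with trivial functions\.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def mutate(z_res: Sequence[float], sigma: float, rng: random.Random) -> Vector:
    """Eq. (3): add isotropic Gaussian noise with standard deviation sigma."""
    return [zi + rng.gauss(0.0, sigma) for zi in z_res]


def self_pollination_step(
    z_res: Sequence[float],
    land_and_decode: Callable[[Vector], Vector],
    fitness: Callable[[Vector], float],
    pool_size: int,
    sigma: float,
    top_k: int,
    rng: random.Random,
) -> List[Tuple[float, Vector]]:
    """Mutate one residual, decode each child, and keep the top_k by fitness."""
    children = [mutate(z_res, sigma, rng) for _ in range(pool_size)]
    scored = [(fitness(land_and_decode(child)), child) for child in children]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best score first
    return scored[:top_k]


# Toy usage: identity decoder and a fitness that rewards closeness to a reference.
rng = random.Random(0)
reference = [0.5, -0.5]
survivors = self_pollination_step(
    z_res=[0.0, 0.0],
    land_and_decode=lambda z: z,
    fitness=lambda x: -sum((a - b) ** 2 for a, b in zip(x, reference)),
    pool_size=16,
    sigma=0.3,
    top_k=4,
    rng=rng,
)
```

Because the fitness is only ever called, never differentiated, any black\-box scoring function can be plugged in\. Repeating the step from the best survivor gives an iterative refinement loop\.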

#### Cross\-pollination\.

For a population $\{x_i\}_{i=1}^{N}$, cross\-pollination first computes source\-conditioned residuals

$$z_{\mathrm{res},i} = \operatorname{Lift}(E(x_i), \hat{c}_i), \qquad i = 1, \dots, N. \tag{4}$$

Each residual is then paired with a partner residual and recombined through crossover,

$$\tilde{z}_{\mathrm{res},i}^{(m)} = \mathrm{Crossover}(z_{\mathrm{res},i}, z_{\mathrm{res},j}^{(m)}, \alpha) + \epsilon^{(m)}, \tag{5}$$

where $\alpha$ controls the contribution of each parent, and $\epsilon^{(m)}$ is an optional mutation term\. Different crossover mechanisms can be applied; details can be found in [Appendix A](https://arxiv.org/html/2606.20084#A1)\.

The resulting child residuals are landed under the same target condition and decoded\. Selection again keeps the top\-k candidates according to the defined fitness score\. Unlike self\-pollination, which preserves the identity of one source sample, cross\-pollination uses residual diversity across multiple sources to explore a broader target\-conditioned search space\. Therefore, diversity is important to prevent early convergence\. Optionally, advanced selection mechanisms, such as tournament and diverse greedy selection \(Graham et al\., [2011](https://arxiv.org/html/2606.20084#bib.bib62); Liu et al\., [2024](https://arxiv.org/html/2606.20084#bib.bib60); Wulandari et al\., [2024](https://arxiv.org/html/2606.20084#bib.bib61)\), can be applied\. In our experiments, we demonstrate that the simple top\-k mechanism works well since the diversity induced by cross\-pollination prevents premature convergence \([Section 3](https://arxiv.org/html/2606.20084#S3)\)\.
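The crossover and top\-k steps of cross\-pollination can be illustrated with a short sketch\. The two crossover operators below \(a convex blend and a gene\-wise swap\) are assumptions chosen for illustration; the operators actually used are specified in the paper's appendix and may differ\. Partner sampling by uniform random choice is likewise an assumption\.

```python
import random
from typing import List, Sequence

Vector = List[float]


def blend_crossover(z_i: Sequence[float], z_j: Sequence[float],
                    alpha: float) -> Vector:
    """Convex combination: alpha weights parent i, (1 - alpha) weights parent j."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(z_i, z_j)]


def uniform_crossover(z_i: Sequence[float], z_j: Sequence[float],
                      alpha: float, rng: random.Random) -> Vector:
    """Gene-wise swap: each dimension comes from parent i with probability alpha."""
    return [a if rng.random() < alpha else b for a, b in zip(z_i, z_j)]


def cross_pollinate(residuals: Sequence[Sequence[float]], i: int, alpha: float,
                    sigma: float, pool_size: int,
                    rng: random.Random) -> List[Vector]:
    """Eq. (5) with blend crossover: pair residual i with random partners j != i."""
    partners = [k for k in range(len(residuals)) if k != i]
    children = []
    for _ in range(pool_size):
        j = rng.choice(partners)
        child = blend_crossover(residuals[i], residuals[j], alpha)
        # Optional mutation term epsilon; sigma = 0 disables it.
        children.append([c + rng.gauss(0.0, sigma) for c in child])
    return children


def top_k(children: Sequence[Vector], scores: Sequence[float],
          k: int) -> List[Vector]:
    """Keep the k children with the highest fitness scores."""
    order = sorted(range(len(children)), key=lambda idx: scores[idx], reverse=True)
    return [children[idx] for idx in order[:k]]
```

In a full generation, each child returned by `cross_pollinate` would be landed under the shared target condition, decoded, and scored before `top_k` is applied\.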

## 3 Domain Modelling

In the following, we introduce the experimental setups for both the image \(Sec\. [3\.1](https://arxiv.org/html/2606.20084#S3.SS1)\) and scientific \(Sec\. [3\.2](https://arxiv.org/html/2606.20084#S3.SS2)\) domains\.

### 3\.1 Image Domain: MorphoMNIST

We use MorphoMNIST \(Castro et al\., [2019](https://arxiv.org/html/2606.20084#bib.bib46)\) as the image domain because it provides interpretable scalar morphological attributes such as thickness, slant, and width\. Our model stack consists of an image classifier, a VAE\-style latent encoder–decoder, and a single class\-conditional CFM model used for source\-conditioned lifting and target\-conditioned landing\. The framework supports arbitrary source\-to\-target digit pairs\.

Our study is designed to demonstrate the complementary roles of the two proposed variants through an exploration–exploitation decomposition\.

#### Self\-pollination\.

For self\-pollination, we test whether residual\-space mutation can refine a leap\-only edit while better preserving instance\-specific information from the input sample\. Specifically, we select edits that maximize source similarity while encouraging target\-class confidence:

$$S_{\mathrm{self}}(x', x) = \mathrm{sim}(x', x) + \lambda\, p_{\mathrm{clf}}(c_{\mathrm{tgt}} \mid x'), \tag{6}$$

where $\mathrm{sim}(x', x)$ is an image similarity measure defined in [Appendix B](https://arxiv.org/html/2606.20084#A2), $p_{\mathrm{clf}}(c_{\mathrm{tgt}} \mid x') = \mathrm{softmax}(C(x'))_{c_{\mathrm{tgt}}}$, and $\lambda > 0$ controls the trade\-off between target confidence and source preservation\.
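As an illustration of Eq\. \(6\), the score can be computed from a flattened image pair and the classifier logits\. The sketch assumes that $\mathrm{sim}(x', x)$ is the negative RMSE, which is a guess motivated by the RMSE\-based similarity metric reported in the experiments; the definition actually used is given in the paper's appendix\. The function names are hypothetical\.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over classifier logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rmse(a: Sequence[float], b: Sequence[float]) -> float:
    """Root-mean-square error between two flattened images of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


def self_score(x_edit: Sequence[float], x_src: Sequence[float],
               logits: Sequence[float], c_tgt: int, lam: float) -> float:
    """Eq. (6) with sim(x', x) assumed to be the negative RMSE."""
    similarity = -rmse(x_edit, x_src)
    return similarity + lam * softmax(logits)[c_tgt]
```

Under this assumption, an edit identical to the source with uniform two\-class logits and `lam=2.0` scores `1.0`, and any pixel drift away from the source lowers the score\.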

#### Cross\-pollination\.

For cross\-pollination, we shift the objective from instance preservation to feature exploration\. Here, we test whether residual information from non\-target classes can serve as genetic material for target\-conditioned synthesis\. Rather than optimizing only over samples that already belong to the target digit, cross\-pollination supports broader exploration by collecting residual “genes” from a mixed\-source population and landing all offspring under the same target condition\. Specifically, we select edits that maximize digit thickness while encouraging target\-class confidence\. Thickness is measured with the MorphoMNIST morphometrics \(Castro et al\., [2019](https://arxiv.org/html/2606.20084#bib.bib46)\), which illustrates that the framework supports non\-differentiable, black\-box fitness functions:

$$S_{\mathrm{cross}}(x') = \mathrm{thickness}(x') + \lambda\, p_{\mathrm{clf}}(c_{\mathrm{tgt}} \mid x'), \tag{7}$$

where $\mathrm{thickness}(x')$ is the morphological thickness and $p_{\mathrm{clf}}(c_{\mathrm{tgt}} \mid x') = \mathrm{softmax}(C(x'))_{c_{\mathrm{tgt}}}$\. More details can be found in Appendix [B](https://arxiv.org/html/2606.20084#A2) and [C](https://arxiv.org/html/2606.20084#A3)\.
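To illustrate how a non\-differentiable measurement enters Eq\. \(7\), the sketch below uses `stroke_fraction`, a hypothetical proxy that counts pixels above a hard threshold\. It is not the MorphoMNIST thickness measurement; it only serves to show that the score needs function evaluations rather than gradients\.

```python
from typing import Callable, Sequence

Image = Sequence[Sequence[float]]


def stroke_fraction(image: Image, threshold: float = 0.5) -> float:
    """Hypothetical thickness proxy: fraction of pixels above a hard threshold.

    The hard threshold makes the value piecewise constant in the pixel
    intensities, so it offers no useful gradient signal.
    """
    pixels = [p for row in image for p in row]
    return sum(1 for p in pixels if p > threshold) / len(pixels)


def cross_score(image: Image, p_target: float, lam: float,
                feature: Callable[[Image], float] = stroke_fraction) -> float:
    """Eq. (7): black-box feature value plus lam times target-class probability."""
    return feature(image) + lam * p_target


# Two toy 3x3 "digits": a one-pixel-wide and a two-pixel-wide vertical stroke.
thin = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
thick = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
```

At equal target\-class probability, `thick` scores higher than `thin`\. Any scalar callable can be passed as `feature`, so the same scoring function accepts other black\-box measurements\.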

#### Baselines\.

The two variants differ fundamentally in their starting population: self\-pollination always begins from a single source sample, whereas cross\-pollination begins from a set of samples whose composition defines the search regime\. For self\-pollination, the main baseline is a leap\-only method without residual\-space evolution\. For cross\-pollination, the baseline uses a homogeneous population drawn entirely from the target class, while the proposed diverse cross\-pollination draws from a mixed\-source population and decodes every child under the same target condition, allowing residual information from other classes to contribute to target\-conditioned generation\.

#### Metrics\.

For self\-pollination, we report validity and similarity\. Validity is the fraction of generated images classified as the target class, reported for both variants\. Similarity is measured as mean RMSE between the generated and the input image\. For cross\-pollination, we report validity, feature value, and diversity\. Feature value reports mean digit thickness in the top 95th percentile of the population, measuring the framework’s ability to produce high\-attribute\-value candidates under the target condition\. Diversity is measured as the angular distance between normalized flattened images from the input and generated populations\. More details can be found in Appendix [B](https://arxiv.org/html/2606.20084#A2)\.
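The validity and diversity metrics can be sketched as follows\. The aggregation used for diversity here, the mean pairwise angular distance within one population, is an assumption; the exact definition is given in the paper's appendix and may compare the input and generated populations differently\.

```python
import math
from itertools import combinations
from typing import Sequence


def validity(predicted: Sequence[int], c_tgt: int) -> float:
    """Fraction of generated samples that the classifier assigns to c_tgt."""
    return sum(1 for c in predicted if c == c_tgt) / len(predicted)


def angular_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Angle in radians between two nonzero flattened samples."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard rounding error
    return math.acos(cosine)


def mean_pairwise_angular_distance(population: Sequence[Sequence[float]]) -> float:
    """Assumed diversity score: average angle over all unordered pairs."""
    pairs = list(combinations(population, 2))
    return sum(angular_distance(u, v) for u, v in pairs) / len(pairs)
```

Normalizing by the vector norms makes the angle invariant to overall image brightness, so the score reflects differences in shape rather than intensity\.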

### 3\.2 Scientific Domain: Crystal Data

We conduct the experiment using WyCryst \(Zhu et al\., [2024](https://arxiv.org/html/2606.20084#bib.bib64)\), a VAE\-based model for encoding and decoding material structures\. The dataset used to learn the latent space was provided by the authors via the project repository; it was originally queried from the Materials Project database \(v\.2023\.7\.4\) \(Jain et al\., [2013](https://arxiv.org/html/2606.20084#bib.bib65)\) and contains 66,643 ternary inorganic compounds\. After filtering to structures with at most 20 atoms per unit cell, formation energy $\leq 1$ eV/atom, and energy above the convex hull $E_{\mathrm{hull}} < 0.1$ eV/atom, the working set comprises 28,318 crystal structures spanning 87 unique elements and all seven crystal systems: cubic, hexagonal, trigonal, tetragonal, orthorhombic, monoclinic, and triclinic\. In the latent space learned by WyCryst, we train three additional models: a classifier that predicts the crystal system; a regressor that predicts the band gap, which serves as the scalar material property and optimization objective; and a conditional flow matching model\. Unlike the image\-domain setting, all components in this setting are trained and operate directly in the material latent space\.
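The three filtering criteria stated above can be written as a simple predicate\. The record layout and field names below are hypothetical and do not reflect the Materials Project or WyCryst schemas; the toy records are placeholders, not real compounds\.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CrystalRecord:
    """Hypothetical record layout; field names are illustrative only."""
    label: str
    n_atoms: int               # atoms per unit cell
    formation_energy: float    # eV/atom
    energy_above_hull: float   # eV/atom


def passes_filters(record: CrystalRecord) -> bool:
    """At most 20 atoms, formation energy <= 1 eV/atom, E_hull < 0.1 eV/atom."""
    return (
        record.n_atoms <= 20
        and record.formation_energy <= 1.0
        and record.energy_above_hull < 0.1
    )


def filter_records(records: Iterable[CrystalRecord]) -> List[CrystalRecord]:
    """Keep only the records that satisfy all three criteria."""
    return [r for r in records if passes_filters(r)]


toy_records = [
    CrystalRecord("toy-a", 12, -1.5, 0.00),  # kept
    CrystalRecord("toy-b", 24, -1.5, 0.00),  # too many atoms
    CrystalRecord("toy-c", 12, 1.2, 0.00),   # formation energy too high
    CrystalRecord("toy-d", 12, -1.5, 0.10),  # E_hull not strictly below 0.1
]
kept = filter_records(toy_records)
```

Note that the hull criterion is a strict inequality while the other two are inclusive, so a record with an energy above hull of exactly 0\.1 eV/atom is excluded\.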

#### Cross\-pollination\.

We apply cross\-pollination to maximize the predicted band gap, steering structures toward wider\-gap insulating phases, while simultaneously changing the crystal system of the material\.

This allows the algorithm to reuse residual information, or “genetic” features, from materials belonging to different crystal systems, expanding the search space beyond within\-class variations\. Specifically, we select edits that maximize band gap while encouraging crystal system confidence:

$$S_{\mathrm{cross}}(x') = \mathrm{bandgap}(x') + \lambda\, p_{\mathrm{clf}}(c_{\mathrm{tgt}} \mid x'), \tag{8}$$

where $\mathrm{bandgap}(x')$ is predicted by the trained regressor and $p_{\mathrm{clf}}(c_{\mathrm{tgt}} \mid x') = \mathrm{softmax}(C(x'))_{c_{\mathrm{tgt}}}$\.

#### Baselines\.

As with the image\-domain experiments, cross\-pollination begins from a set of samples whose composition defines the search regime\. The baseline uses a homogeneous population drawn from the target crystal system\. The diverse cross\-pollination variant draws from a mixed\-source population spanning all crystal systems, allowing residual information from structurally distinct materials to contribute to target\-conditioned generation and to expand the search space beyond within\-class variation\.

#### Metrics\.

We report validity, feature value, and diversity\. Validity is the fraction of generated crystals classified as the target crystal system\. Feature value is the mean band gap among samples in the top 95th percentile of the population\. Diversity is measured as the angular distance between normalized latent representations of the generated populations\.
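The feature\-value metric can be sketched as a top\-percentile mean\. The sketch assumes one particular reading of "top 95th percentile", namely all samples at or above the 95th\-percentile value, and it assumes a nearest\-rank percentile; neither choice is confirmed by the text\.

```python
import math
from typing import Sequence


def top_percentile_mean(values: Sequence[float], percentile: float = 95.0) -> float:
    """Mean of all values at or above the nearest-rank percentile threshold."""
    ordered = sorted(values)
    # Nearest-rank definition: smallest rank covering `percentile` percent.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    threshold = ordered[rank - 1]
    top = [v for v in ordered if v >= threshold]
    return sum(top) / len(top)
```

For the values 1 through 100, the threshold is 95 and the metric is the mean of 95 through 100, i\.e\., 97\.5\. The same function applies to digit thickness on MorphoMNIST and to predicted band gap on crystal data\.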

## 4 Results

#### Self\-pollination supports local exploitation\.

As shown in [Table 1](https://arxiv.org/html/2606.20084#S4.T1), self\-pollination maintains near\-perfect validity \($> 0.99$\) while consistently improving source\-instance preservation\. Specifically, self\-pollination improves similarity by 3% over the leap\-only baseline, indicating that residual\-space mutation refines the target\-conditioned edit while better preserving input\-specific structure\.

The qualitative examples in [Figure 2](https://arxiv.org/html/2606.20084#S4.F2) support this interpretation\. While both leap\-only and self\-pollination successfully reach the desired target classes, self\-pollination better preserves instance\-level characteristics such as stroke thickness, slant, and writing style from the input images\. A per\-digit breakdown is provided in [Table 6](https://arxiv.org/html/2606.20084#A4.T6) in the appendix, where self\-pollination improves similarity for every target digit and maintains or improves validity across all digits\. These results suggest that self\-pollination acts as an effective local exploitation mechanism, improving source\-instance preservation without sacrificing target validity\.

![Refer to caption](https://arxiv.org/html/2606.20084v1/figures/self_pollination_demo_cut.png)

Figure 2: Qualitative comparison of leap\-only and self\-pollination across target digits\. Self\-pollination better preserves input\-specific stroke style while achieving the target digit\. Columns are target digits \(0\-9\)\. Rows are input \(top\), leap\-only \(middle\), and self\-pollination \(bottom\)\.

Table 1: Aggregated comparison between leap\-only and self\-pollination on MorphoMNIST\. Self\-pollination is reported as the relative improvement over leap\-only\. Results are shown as mean ± STE over three random seeds\.

Table 2: Aggregated comparison of homogeneous vs\. diverse cross\-pollination\. Feature value and diversity differ by domain: MorphoMNIST reports optimized morphological feature and image\-space diversity; crystal data reports predicted band gap and latent diversity\. Diverse is reported as the relative improvement over homogeneous\. The final generation results are shown as mean ± STE over all target classes/systems and three random seeds\.

#### Cross\-pollination supports global exploration\.

As shown in [Table 2](https://arxiv.org/html/2606.20084#S4.T2), diverse cross\-pollination preserves perfect validity across both domains while consistently improving population diversity\. On MorphoMNIST, diverse cross\-pollination improves both thickness and image\-space diversity over the homogeneous baseline, with both metrics improving steadily across generations\. The per\-digit breakdown in [Table 7](https://arxiv.org/html/2606.20084#A4.T7) \(Appendix [D](https://arxiv.org/html/2606.20084#A4)\) further supports this trend, with diverse cross\-pollination maintaining perfect validity and improving diversity across all digits\.

On crystal data, diverse cross\-pollination substantially increases latent diversity \($+1.2$\) but yields a slight reduction in band gap \($-0.082$\), reflecting a stronger exploration\-exploitation tension in a more heterogeneous domain\.

Notably, [Figure 3](https://arxiv.org/html/2606.20084#S4.F3) shows that diverse cross\-pollination reaches target validity more slowly in the crystal setting, suggesting that residual combination across structurally different crystal systems introduces broader but slower targeted variation\. Importantly, validity converges to the same level in both settings by generation 10, confirming that diverse cross\-pollination expands the search space without sacrificing target conditioning\.

![Refer to caption](https://arxiv.org/html/2606.20084v1/figures/aggregated_trend_plot.png)

Figure 3: Metrics across generations\. Rows: MorphoMNIST \(top\) and WyCryst \(bottom\) experiments\. Colors show search regimes for cross\-pollination\. Columns show validity, feature value, and diversity\. MorphoMNIST optimizes thickness and image diversity and WyCryst optimizes band gap and latent diversity\.

## 5 Discussion and Limitations

Our results demonstrate that the exploration\-exploitation decomposition is an effective mechanism for residual\-space generative editing across both image and scientific domains\. Self\-pollination and cross\-pollination address different search regimes without sacrificing target validity\. This is enabled by the residual structure exposed by conditional flow matching, which can be treated as the genotype for mutation, crossover, and selection, while the condition variable controls semantic attributes such as class identity\.

Beyond images, we demonstrate that the framework extends to crystal data, where conditions encode the crystal system and residual states provide a search space for evolutionary optimization toward desired band gap properties\. We apply cross\-pollination to maximize the predicted band gap of crystal structures, demonstrating that the framework also supports scalar objectives\. We note that the band gap is evaluated via a surrogate regressor rather than first\-principles calculations, and that the physical stability of generated structures is not explicitly enforced; we therefore use this study as a demonstration of the optimization mechanism rather than a novel materials discovery modeling approach\. These findings indicate that any domain with an editable residual representation is a candidate for residual\-space evolutionary optimization, including molecular design and materials discovery\.

Our current study is a controlled proof of concept, and several limitations remain\. First, while we validate on both MorphoMNIST and crystal data, further validation on larger datasets and more complex generation tasks remains open\. Second, key hyperparameters, including mutation strength, population size, selection criteria, and crossover design, require systematic ablation to assess robustness, efficiency, and failure modes\. Addressing these questions is a promising direction for future work\.

## References

- R. Abdal, Y. Qin, and P. Wonka (2019). Image2StyleGAN: how to embed images into the stylegan latent space? In IEEE/CVF International Conference on Computer Vision. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- T. Bäck (1996). Evolutionary algorithms in theory and practice: evolution strategies, evolutionary programming, genetic algorithms. Oxford University Press, New York. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba (2017). Network dissection: quantifying interpretability of deep visual representations. In IEEE Conference on Computer Vision and Pattern Recognition. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- D. Bau, J. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba (2019). GAN dissection: visualizing and understanding generative adversarial networks. In International Conference on Learning Representations. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- T. Brooks, A. Holynski, and A. A. Efros (2023). InstructPix2Pix: learning to follow image editing instructions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- Z. Cao, L. Krieger, H. Scharr, and I. Assent (2025a). Galaxy morphology classification with counterfactual explanation. arXiv preprint arXiv:2510.14655. Cited by: [§2.2](https://arxiv.org/html/2606.20084#S2.SS2.SSS0.Px1.p2.1).
- Z. Cao, X. Zhao, L. Krieger, H. Scharr, and I. Assent (2025b). LeapFactual: reliable visual counterfactual explanation using conditional flow matching. Note: Accepted for publication at The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS) 2025. External Links: [Link](https://neurips.cc/virtual/2025/poster/119174). Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1), [§1](https://arxiv.org/html/2606.20084#S1.p4.1), [§2.1](https://arxiv.org/html/2606.20084#S2.SS1.SSS0.Px2.p1.5), [§2.2](https://arxiv.org/html/2606.20084#S2.SS2.SSS0.Px1.p2.1).
- S. Carter, Z. Armstrong, L. Schubert, I. Johnson, and C. Olah (2019). Activation atlas. Distill. External Links: [Document](https://dx.doi.org/10.23915/distill.00015). Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- D. C. Castro, J. Tan, B. Kainz, E. Konukoglu, and B. Glocker (2019). Morpho-MNIST: quantitative assessment and diagnostics for representation learning. Journal of Machine Learning Research 20(178). External Links: arXiv:1809.10780. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p4.1), [§3.1](https://arxiv.org/html/2606.20084#S3.SS1.SSS0.Px2.p1.3), [§3.1](https://arxiv.org/html/2606.20084#S3.SS1.p1.1).
- P. Dhariwal and A. Q. Nichol (2021). Diffusion models beat gans on image synthesis. In Advances in Neural Information Processing Systems. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- A. Dombrowski, J. E. Gerken, K. Müller, and P. Kessel (2023). Diffeomorphic Counterfactuals with Generative Models. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: [§2.2](https://arxiv.org/html/2606.20084#S2.SS2.SSS0.Px1.p2.1).
- A. E. Eiben and J. E. Smith (2015). Introduction to evolutionary computing. 2nd edition, Springer. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- L. A. Gatys, A. S. Ecker, and M. Bethge (2016). Image style transfer using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- D. E. Goldberg (1989). Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, Reading, MA. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- L. Graham, J. Borbone, and G. Parker (2011). Comparison of a greedy selection operator to tournament selection and a hill climber. In 2011 IEEE Congress of Evolutionary Computation (CEC), pp. 1504–1508. External Links: [Document](https://dx.doi.org/10.1109/CEC.2011.5949793). Cited by: [§2.2](https://arxiv.org/html/2606.20084#S2.SS2.SSS0.Px2.p2.1).
- N. Hansen and A. Ostermeier (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation 9(2), pp. 159–195. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- E. Härkönen, A. Hertzmann, J. Lehtinen, and S. Paris (2020). GANSpace: discovering interpretable gan controls. In Advances in Neural Information Processing Systems. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or (2023). Prompt-to-prompt image editing with cross-attention control. In International Conference on Learning Representations. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- J. H. Holland (1975). Adaptation in natural and artificial systems. University of Michigan Press, Ann Arbor, MI. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- F. Hvilshøj, A. Iosifidis, and I. Assent (2021). ECINN: Efficient Counterfactuals from Invertible Neural Networks. arXiv preprint arXiv:2103.13701. Cited by: [§2.2](https://arxiv.org/html/2606.20084#S2.SS2.SSS0.Px1.p2.1).
- P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017). Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, and K. A. Persson (2013). Commentary: the materials project: a materials genome approach to accelerating materials innovation. APL Materials 1(1), pp. 011002. Cited by: [§3.2](https://arxiv.org/html/2606.20084#S3.SS2.p1.2).
- H. Kim, S. Shin, J. Jang, K. Song, W. Joo, W. Kang, and I. Moon (2021). Counterfactual Fairness with Disentangled Causal Effect Variational Autoencoder. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 8128–8136. Cited by: [§2.2](https://arxiv.org/html/2606.20084#S2.SS2.SSS0.Px1.p2.1).
- T. Li, D. Katabi, and K. He (2024). Return of unconditional generation: a self-supervised representation generation method. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. External Links: [Link](https://openreview.net/forum?id=clTa4JFBML). Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- Y. Liu, H. Yin, Z. Huang, and Y. Wu (2024). Enhanced genetic algorithm for traveling salesman problem. In 2024 4th International Conference on Artificial Intelligence, Robotics, and Communication (ICAIRC), pp. 785–790. External Links: [Document](https://dx.doi.org/10.1109/ICAIRC64177.2024.10900033). Cited by: [§2.2](https://arxiv.org/html/2606.20084#S2.SS2.SSS0.Px2.p2.1).
- A. Mahendran and A. Vedaldi (2015). Understanding deep image representations by inverting them. In IEEE Conference on Computer Vision and Pattern Recognition. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- C. Meng, Y. He, Y. Song, J. Song, J. Wu, J. Zhu, and S. Ermon (2022). SDEdit: guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- R. Mokady, A. Hertz, K. Aberman, Y. Pritch, and D. Cohen-Or (2023). Null-text inversion for editing real images using guided diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- C. Mou, X. Wang, L. Xie, Y. Wu, J. Zhang, Z. Qi, Y. Shan, and X. Qie (2024). T2I-adapter: learning adapters to dig out more controllable ability for text-to-image diffusion models. In AAAI Conference on Artificial Intelligence. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- D. Nemirovsky, N. Thiebaut, Y. Xu, and A. Gupta (2020). CounteRGAN: generating Realistic Counterfactuals with Residual Generative Adversarial Nets. arXiv preprint arXiv:2009.05199. Cited by: [§2.2](https://arxiv.org/html/2606.20084#S2.SS2.SSS0.Px1.p2.1).
- C. Olah, A. Mordvintsev, and L. Schubert (2017). Feature visualization. Distill. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- X. Pan, A. Tewari, T. Leimkühler, L. Liu, A. Meka, and C. Theobalt (2023). Drag your gan: interactive point-based manipulation on the generative image manifold. ACM Transactions on Graphics. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- T. Park, A. A. Efros, R. Zhang, and J. Zhu (2020). Contrastive learning for unpaired image-to-image translation. In European Conference on Computer Vision. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- G. Parmar, K. K. Singh, R. Zhang, Y. Li, J. Lu, and J. Zhu (2023). Zero-shot image-to-image translation. ACM Transactions on Graphics. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski (2021). StyleCLIP: text-driven manipulation of stylegan imagery. In IEEE/CVF International Conference on Computer Vision. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- D. Roich, R. Mokady, A. H. Bermano, and D. Cohen-Or (2022). Pivotal tuning for latent-based editing of real images. ACM Transactions on Graphics. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- P. Samangouei, A. Saeedi, L. Nakagawa, and N. Silberman (2018). ExplainGAN: Model Explanation via Decision Boundary Crossing Transformations. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 666–681. Cited by: [§2.2](https://arxiv.org/html/2606.20084#S2.SS2.SSS0.Px1.p2.1).
- R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2020). Grad-cam: visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision 128(2), pp. 336–359. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- Y. Shen, J. Gu, X. Tang, and B. Zhou (2020). Interpreting the latent space of gans for semantic face editing. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- S. Singla, B. Pollack, J. Chen, and K. Batmanghelich (2019). Explanation by Progressive Exaggeration. arXiv preprint arXiv:1911.00483. Cited by: [§2.2](https://arxiv.org/html/2606.20084#S2.SS2.SSS0.Px1.p2.1).
- N. S. Wulandari, Z. Zainuddin, and M. Yusuf (2024). Optimizing genetic algorithms for tsp: evaluating greedy permuting method with diverse selection and crossover techniques. In 2024 8th International Conference on Information Technology, Information Systems and Electrical Engineering (ICITISEE), pp. 191–196. External Links: [Document](https://dx.doi.org/10.1109/ICITISEE63424.2024.10730362). Cited by: [§2.2](https://arxiv.org/html/2606.20084#S2.SS2.SSS0.Px2.p2.1).
- L. Zhang, A. Rao, and M. Agrawala (2023). Adding conditional control to text-to-image diffusion models. In IEEE/CVF International Conference on Computer Vision. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- J. Zhu, T. Park, P. Isola, and A. A. Efros (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p1.1).
- R. Zhu, W. Nong, S. Yamazaki, and K. Hippalgaonkar (2024). WyCryst: wyckoff inorganic crystal generator framework. Matter 7(10), pp. 3469–3488. Cited by: [§1](https://arxiv.org/html/2606.20084#S1.p4.1), [§3.2](https://arxiv.org/html/2606.20084#S3.SS2.p1.2).

## Appendix A The Algorithm

The complete algorithm is described in [Algorithm 1](https://arxiv.org/html/2606.20084#alg1).

**Algorithm 1** Residual-Space Evolutionary Optimization

**Input:** Initial inputs $\mathcal{P}_{0}=\{x_{j}^{(0)}\}_{j=1}^{N}$, encoder $E$, decoder $D$, classifier $C$, frozen flow model $v_{\theta}$, target condition $c_{\mathrm{tgt}}$, generations $G$, population size $K$, child pool size $M$, mode $\in\{\mathrm{self},\mathrm{cross}\}$

**Output:** Retained target-conditioned samples

1. **for** $g=1,\dots,G$ **do**
2. Initialize child set $\widetilde{\mathcal{P}}_{g}\leftarrow\emptyset$
3. Encode current population: $z_{j}^{(g-1)}\leftarrow E(x_{j}^{(g-1)})$ for all $x_{j}^{(g-1)}\in\mathcal{P}_{g-1}$
4. Predict source conditions: $\hat{c}_{j}^{(g-1)}\leftarrow\arg\max_{c}C(x_{j}^{(g-1)})_{c}$
5. Lift current population: $z_{\mathrm{res},j}^{(g-1)}\leftarrow\operatorname{Lift}\!\left(z_{j}^{(g-1)},\hat{c}_{j}^{(g-1)}\right)$
6. **for** $m=1,\dots,M$ **do**
7. **if** $\mathrm{mode}=\mathrm{self}$ **then**
8. Sample one parent residual $z_{\mathrm{res},i}^{(g-1)}$
9. Sample mutation noise $\epsilon^{(m)}\sim\mathcal{N}(0,\sigma^{2}I)$
10. Generate child residual: $\tilde{z}_{\mathrm{res}}^{(m)}\leftarrow z_{\mathrm{res},i}^{(g-1)}+\epsilon^{(m)}$
11. **else**
12. Sample two parent residuals $z_{\mathrm{res},i}^{(g-1)}$ and $z_{\mathrm{res},j}^{(g-1)}$
13. Sample mixing weight $\alpha$ and mutation noise $\epsilon^{(m)}$
14. Generate child residual: $\tilde{z}_{\mathrm{res}}^{(m)}\leftarrow\operatorname{Crossover}\!\left(z_{\mathrm{res},i}^{(g-1)},z_{\mathrm{res},j}^{(g-1)},\alpha\right)+\epsilon^{(m)}$
15. **end if**
16. Land child residual under target condition: $\tilde{z}^{(m)}\leftarrow\operatorname{Land}(\tilde{z}_{\mathrm{res}}^{(m)},c_{\mathrm{tgt}})$
17. Decode child sample: $\tilde{x}^{(m)}\leftarrow D(\tilde{z}^{(m)})$
18. Add $\tilde{x}^{(m)}$ to $\widetilde{\mathcal{P}}_{g}$
19. **end for**
20. Score all children in $\widetilde{\mathcal{P}}_{g}$ using $S_{\mathrm{self}}$ or $S_{\mathrm{cross}}$
21. Select $K$ children as the next data population: $\mathcal{P}_{g}\leftarrow\operatorname{Select}\!\left(\widetilde{\mathcal{P}}_{g}\right)$
22. **end for**
23. **return** final population $\mathcal{P}_{G}$
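The loop in Algorithm 1 can be sketched in a few lines of NumPy. This is only an illustrative reading, not the authors' implementation: the encoder, decoder, classifier, Lift, Land, and score are passed in as callables, and the toy versions below (an identity autoencoder, a nearest-mean classifier, and a "flow" that only shifts latents by a hypothetical class mean `mu`) are assumptions made so the example runs on its own. The uniform draw for the mixing weight and top-$K$ selection are likewise assumed choices, and the score takes only the child sample, so any source-dependent term would have to be bound by the caller.

```python
import numpy as np

def evolve(pop, encode, decode, classify, lift, land, score, c_tgt,
           generations=5, K=8, M=32, mode="cross", sigma=0.1, seed=0):
    """One reading of Algorithm 1; every callable is a user-supplied stand-in."""
    rng = np.random.default_rng(seed)
    for _ in range(generations):
        # Steps 3-5: encode, predict source conditions, lift to residual space.
        z_res = lift(encode(pop), classify(pop))
        n, d = z_res.shape
        children = np.empty((M, d))
        for m in range(M):                          # steps 6-19
            eps = rng.normal(0.0, sigma, size=d)    # mutation noise
            if mode == "self":
                children[m] = z_res[rng.integers(n)] + eps
            else:
                i, j = rng.choice(n, size=2, replace=False)
                alpha = rng.uniform()               # mixing weight (assumed uniform)
                children[m] = (1 - alpha) * z_res[i] + alpha * z_res[j] + eps
        # Steps 16-17: land every child under the target condition and decode.
        x_children = decode(land(children, np.full(M, c_tgt)))
        # Steps 20-21: score the children and keep the K best (top-K selection).
        keep = np.argsort(score(x_children))[::-1][:K]
        pop = x_children[keep]
    return pop

# Toy stand-ins (all hypothetical).
mu = np.array([[0.0, 0.0], [3.0, 3.0]])             # class means of the toy "flow"
identity = lambda x: x
classify = lambda x: np.argmin(((x[:, None, :] - mu[None]) ** 2).sum(-1), axis=1)
lift = lambda z, c: z - mu[c]
land = lambda z_res, c: z_res + mu[c]
score = lambda x: x[:, 0]                           # black-box objective

pop0 = 0.3 * np.random.default_rng(1).normal(size=(16, 2))
final = evolve(pop0, identity, identity, classify, lift, land, score, c_tgt=1)
print(final.shape, classify(final))
```

Because the objective is only ever called on decoded samples, it never needs to be differentiable, which is the property the framework relies on.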

### A.1 Lift–Land Operations

#### Lift operation.

Let $x$ denote an input image and $z=E(x)$ its autoencoder latent representation. The source condition is inferred from the classifier as

$$\hat{c}=\arg\max_{c}C(x)_{c}.$$

Given the conditional velocity field $v_{\theta}(z_{t},t,c)$, the source-conditioned *Lift* operation removes condition-specific information by integrating the flow backward from $t=1$ to $t=0$:

$$z_{\mathrm{res}}=\operatorname{Lift}(z,\hat{c}).$$

The resulting residual state $z_{\mathrm{res}}$ is the representation used for mutation, crossover, and selection.

#### Land operation\.

Given a target condition $c_{\mathrm{tgt}}$, the *Land* operation injects target-specific information by integrating the same flow forward from $t=0$ to $t=1$:

$$z'=\operatorname{Land}(z_{\mathrm{res}},c_{\mathrm{tgt}}).$$

The edited image is obtained as

$$x'=D(z').$$

#### Leap step\.

A complete target-directed leap is the composition of Lift and Land:

$$z'=\operatorname{Land}\!\left(\operatorname{Lift}(E(x),\hat{c}),c_{\mathrm{tgt}}\right),\qquad x'=D(z').$$

Thus, the model first removes the source condition by lifting into residual space, and then imposes the target condition by landing under $c_{\mathrm{tgt}}$.
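As a rough illustration of Lift, Land, and their composition, the sketch below integrates a conditional velocity field with explicit Euler steps, backward in time for Lift and forward for Land. The solver, the step count, and the constant toy field `v` built from a hypothetical table of class offsets `mu` are all assumptions; the text does not specify which ODE solver is used, and a trained $v_{\theta}$ would replace `v`.

```python
import numpy as np

def integrate(v, z, c, t0, t1, steps=100):
    """Explicit Euler integration of dz/dt = v(z, t, c) from t0 to t1."""
    dt = (t1 - t0) / steps          # negative when integrating backward
    t = t0
    for _ in range(steps):
        z = z + dt * v(z, t, c)
        t += dt
    return z

def lift(v, z, c_src, steps=100):
    """Remove the source condition: integrate from t=1 back to t=0."""
    return integrate(v, z, c_src, 1.0, 0.0, steps)

def land(v, z_res, c_tgt, steps=100):
    """Impose the target condition: integrate from t=0 forward to t=1."""
    return integrate(v, z_res, c_tgt, 0.0, 1.0, steps)

def leap(v, z, c_src, c_tgt, steps=100):
    """Lift under the source condition, then land under the target condition."""
    return land(v, lift(v, z, c_src, steps), c_tgt, steps)

# Toy field (hypothetical): each condition pushes latents along a fixed offset.
mu = np.array([[0.0, 0.0], [2.0, -1.0]])
v = lambda z, t, c: np.broadcast_to(mu[c], z.shape)

z = np.array([0.5, 0.5])
print(leap(v, z, c_src=0, c_tgt=1))
```

For this constant field the Euler steps are exact, so lifting and landing under the same condition returns the input. With a learned, state-dependent field the round trip is only approximate and depends on the solver and step count.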

### A.2 Cross-pollination Crossover

Given two parent residual states $z_{\mathrm{res},i}$ and $z_{\mathrm{res},j}$, cross-pollination constructs a child residual $\tilde{z}_{\mathrm{res}}$ by combining information from both parents, and then maps it back to the target-conditioned latent space through the Land operator.

#### Linear crossover.

Linear crossover forms a convex interpolation between two parent residuals:

$$\tilde{z}_{\mathrm{res}}=(1-\alpha)z_{\mathrm{res},i}+\alpha z_{\mathrm{res},j},$$

where $\alpha\in[0,1]$ is the crossover mixing ratio. This operation produces a child residual that lies on the line segment between the two parents in residual space.

#### Dimension\-wise crossover\.

Dimension-wise crossover combines two parent residuals coordinate by coordinate. Let $m\in\{0,1\}^{d}$ be a binary mask sampled independently for each residual dimension:

$$m_{k}\sim\mathrm{Bernoulli}(\alpha),$$

where $\alpha$ is the probability of inheriting dimension $k$ from the second parent. The child residual is

$$\tilde{z}_{\mathrm{res},k}=(1-m_{k})z_{\mathrm{res},i,k}+m_{k}z_{\mathrm{res},j,k},\quad k=1,\ldots,d.$$

Equivalently,

$$\tilde{z}_{\mathrm{res}}=(1-m)\odot z_{\mathrm{res},i}+m\odot z_{\mathrm{res},j},$$

where $\odot$ denotes element-wise multiplication.
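Both crossover operators reduce to a line or two of array code. The following is a minimal sketch with hypothetical function names; the mask is drawn by comparing uniform samples against $\alpha$, which is one standard way to sample independent Bernoulli variables.

```python
import numpy as np

def linear_crossover(z_i, z_j, alpha):
    """Convex interpolation between two parent residuals."""
    return (1.0 - alpha) * z_i + alpha * z_j

def dimensionwise_crossover(z_i, z_j, alpha, rng):
    """Inherit each coordinate from z_j with probability alpha, else from z_i."""
    m = rng.random(z_i.shape) < alpha      # m_k ~ Bernoulli(alpha)
    return np.where(m, z_j, z_i)

rng = np.random.default_rng(0)
z_i = np.zeros(6)
z_j = np.ones(6)
print(linear_crossover(z_i, z_j, 0.25))
print(dimensionwise_crossover(z_i, z_j, 0.5, rng))
```

The linear child always lies between its parents, whereas the dimension-wise child copies each coordinate from exactly one parent and can therefore land off the connecting segment.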

### A.3 Selection

#### Top-$k$ selection.

Given a candidate pool $\mathcal{C}=\{x'_{i}\}_{i=1}^{N}$ and a scalar selection score $S(x'_{i})$ for each candidate, top-$k$ selection keeps the $K$ candidates with the largest scores:

$$\mathcal{P}_{g}=\operatorname{TopK}_{x'_{i}\in\mathcal{C}}\bigl(S(x'_{i}),K\bigr).$$

This mode greedily preserves the highest-scoring candidates.

#### Tournament selection\.

Tournament selection repeatedly samples a small subset of candidates and keeps the best candidate within that subset. For each selected individual, we sample a tournament set

$$\mathcal{T}_{m}\subset\mathcal{C},\qquad|\mathcal{T}_{m}|=M.$$

The winner is

$$x'^{(m)}=\arg\max_{x'_{i}\in\mathcal{T}_{m}}S(x'_{i}).$$

After each winner is selected, it is removed from the remaining pool. Repeating this process $K$ times gives

$$\mathcal{P}_{g}=\{x'^{(1)},\ldots,x'^{(K)}\}.$$

Compared with top-$k$ selection, tournament selection introduces stochasticity while still favoring high-scoring candidates.
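The two selection rules can be sketched over an array of precomputed scores, returning the indices of the retained candidates. The function names and the handling of a pool that becomes smaller than the tournament size (the tournament simply shrinks) are assumptions made for this example.

```python
import numpy as np

def top_k_select(scores, K):
    """Indices of the K largest scores, best first."""
    return np.argsort(scores)[::-1][:K]

def tournament_select(scores, K, tournament_size, rng):
    """Run K tournaments; each winner is removed from the remaining pool."""
    remaining = list(range(len(scores)))
    winners = []
    for _ in range(K):
        size = min(tournament_size, len(remaining))   # assumed: shrink if pool is small
        contenders = rng.choice(remaining, size=size, replace=False)
        best = int(max(contenders, key=lambda idx: scores[idx]))
        winners.append(best)
        remaining.remove(best)
    return np.array(winners)

scores = np.array([0.1, 0.9, 0.5, 0.7])
rng = np.random.default_rng(0)
print(top_k_select(scores, 2))
print(tournament_select(scores, 2, tournament_size=2, rng=rng))
```

When the tournament covers the whole pool, tournament selection reduces to top-$k$; smaller tournaments let lower-scoring candidates survive occasionally.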

## Appendix B Definitions

#### Target validity.

For a generated image $x'$, the target-class probability used in the main text is

$$p_{\mathrm{clf}}(c_{\mathrm{tgt}}\mid x')=\mathrm{softmax}\!\left(C(x')\right)_{c_{\mathrm{tgt}}}.$$

The validity indicator is

$$\mathbb{I}_{\mathrm{valid}}(x')=\mathbf{1}\left[\arg\max_{c}C(x')_{c}=c_{\mathrm{tgt}}\right].$$

For a generated population $\mathcal{P}=\{x'_{i}\}_{i=1}^{N}$, validity is reported as

$$\mathrm{Validity}(\mathcal{P})=\frac{1}{|\mathcal{P}|}\sum_{x'_{i}\in\mathcal{P}}\mathbb{I}_{\mathrm{valid}}(x'_{i}).$$

#### Classification margin\.

When used for selection, the classifier margin is

$$m_{c_{\mathrm{tgt}}}(x')=C(x')_{c_{\mathrm{tgt}}}-\max_{c\neq c_{\mathrm{tgt}}}C(x')_{c}.$$

A larger margin indicates that the classifier assigns the generated image more confidently to the target class.
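The three classifier-based quantities defined so far (target-class probability, validity, and margin) can be computed from a matrix of classifier outputs. The sketch below assumes that $C(x')$ returns unnormalised logits, with one row per generated sample, which is consistent with the softmax in the definition above but is not stated explicitly; the function names are illustrative.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def target_probability(logits, c_tgt):
    """p_clf(c_tgt | x') for every row of logits."""
    return softmax(logits)[:, c_tgt]

def validity(logits, c_tgt):
    """Fraction of samples whose arg-max class equals the target."""
    return float(np.mean(np.argmax(logits, axis=1) == c_tgt))

def margin(logits, c_tgt):
    """Target logit minus the largest non-target logit."""
    others = np.delete(logits, c_tgt, axis=1)
    return logits[:, c_tgt] - others.max(axis=1)

logits = np.array([[2.0, 0.0, 1.0],
                   [0.0, 3.0, 1.0]])
print(target_probability(logits, 0), validity(logits, 0), margin(logits, 0))
```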

#### Image similarity\.

For self-pollination, similarity measures how much the generated image preserves the source image. Given source image $x$ and edited image $x'$, both of size $H\times W$, we use

$$\mathrm{sim}(x',x)=1-\frac{1}{HW}\sum_{u=1}^{H}\sum_{v=1}^{W}\left(x'_{u,v}-x_{u,v}\right)^{2}.$$

For a population, we report the mean similarity over source–edit pairs. This is the similarity term used in the score $S_{\mathrm{self}}(x',x)$ in the main text.
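In code, this similarity is one minus the per-image mean squared error. The sketch below assumes images are stored as arrays whose last two axes are height and width, so that a whole batch of source–edit pairs can be scored at once.

```python
import numpy as np

def similarity(x_edit, x_src):
    """1 - mean squared pixel difference, computed over the last two axes."""
    return 1.0 - np.mean((x_edit - x_src) ** 2, axis=(-2, -1))

x_src = np.zeros((2, 28, 28))                  # batch of two source images
x_edit = np.stack([np.zeros((28, 28)),         # unchanged image
                   np.ones((28, 28))])         # every pixel off by 1
print(similarity(x_edit, x_src))               # per-pair similarity
print(similarity(x_edit, x_src).mean())        # population mean
```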

#### Morphological feature value\.

For cross-pollination, we evaluate a MorphoMNIST feature

$$\phi(x')\in\{\mathrm{area},\mathrm{length},\mathrm{thickness},\mathrm{slant},\mathrm{width},\mathrm{height}\}.$$

By default, we use thickness, so the feature term in the main text is

$$\mathrm{thickness}(x')=\phi(x').$$

For a population $\mathcal{P}$, the mean feature value is

$$\mathrm{MeanFeature}(\mathcal{P})=\frac{1}{|\mathcal{P}|}\sum_{x'_{i}\in\mathcal{P}}\phi(x'_{i}).$$

#### Top\-percentile feature value\.

For trend plots, we also report the mean feature value among the top feature percentile. Let $q_{\rho}$ be the $\rho$-th percentile of feature values in the population. The top-percentile feature metric is

$$\mathrm{TopFeature}_{\rho}(\mathcal{P})=\frac{1}{|\mathcal{P}_{\rho}|}\sum_{x'_{i}\in\mathcal{P}_{\rho}}\phi(x'_{i}),$$

where

$$\mathcal{P}_{\rho}=\left\{x'_{i}\in\mathcal{P}:\phi(x'_{i})\geq q_{\rho}\right\}.$$

In our experiments, $\rho=95$ by default, so this reports the mean feature value of the top $5\%$ of the population.
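Given an array of precomputed feature values $\phi(x'_{i})$, the mean and top-percentile metrics are straightforward to compute. This sketch uses `np.percentile` with its default linear interpolation to obtain $q_{\rho}$; the interpolation rule is an assumption, since the text does not say how the percentile is estimated.

```python
import numpy as np

def mean_feature(values):
    """Mean feature value over the population."""
    return float(np.mean(values))

def top_feature(values, rho=95.0):
    """Mean feature value among samples at or above the rho-th percentile."""
    q = np.percentile(values, rho)     # q_rho, default linear interpolation
    return float(values[values >= q].mean())

values = np.arange(1, 101, dtype=float)    # toy feature values 1, 2, ..., 100
print(mean_feature(values), top_feature(values, rho=95.0))
```

On this toy population the threshold falls between 95 and 96, so the top set is the five largest values.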

#### Image diversity\.

In the main text, diversity denotes image-space diversity. We compute it as the mean pairwise cosine distance (one minus cosine similarity) between flattened images:

$$\mathrm{Diversity}(\mathcal{P})=\frac{1}{|\mathcal{P}|^{2}}\sum_{x'_{i},x'_{j}\in\mathcal{P}}\left(1-\frac{\langle\operatorname{vec}(x'_{i}),\operatorname{vec}(x'_{j})\rangle}{\left\lVert\operatorname{vec}(x'_{i})\right\rVert_{2}\,\left\lVert\operatorname{vec}(x'_{j})\right\rVert_{2}}\right).$$

#### Crystal latent diversity\.

For the crystal experiments, diversity is measured in the crystal latent space rather than image space. Given a generated population $\mathcal{P}=\{z'_{i}\}_{i=1}^{|\mathcal{P}|}$, where each $z'_{i}$ is a crystal latent vector, we compute latent diversity as the mean pairwise Euclidean distance:

$$\mathrm{Diversity}_{\mathrm{latent}}(\mathcal{P})=\frac{1}{|\mathcal{P}|^{2}}\sum_{z'_{i},z'_{j}\in\mathcal{P}}\left\lVert z'_{i}-z'_{j}\right\rVert_{2}.$$

This metric quantifies how broadly the generated candidates spread in the learned crystal latent space.
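Both diversity measures average a pairwise quantity over all $|\mathcal{P}|^{2}$ ordered pairs, including each sample paired with itself. A minimal NumPy sketch follows; it assumes no image is identically zero (otherwise the normalisation would divide by zero), and the function names are illustrative.

```python
import numpy as np

def image_diversity(images):
    """Mean of (1 - cosine similarity) over all ordered pairs of flattened images."""
    flat = images.reshape(len(images), -1).astype(float)
    unit = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    return float(np.mean(1.0 - unit @ unit.T))

def latent_diversity(latents):
    """Mean Euclidean distance over all ordered pairs of latent vectors."""
    diff = latents[:, None, :] - latents[None, :, :]
    return float(np.mean(np.linalg.norm(diff, axis=-1)))

images = np.array([[[1.0, 0.0], [0.0, 0.0]],     # two orthogonal 2x2 "images"
                   [[0.0, 1.0], [0.0, 0.0]]])
latents = np.array([[0.0, 0.0], [3.0, 4.0]])
print(image_diversity(images), latent_diversity(latents))
```

Because the self-pairs contribute zero, both values are slightly lower than an average over distinct pairs would be; for the two-sample toy inputs the results are 0.5 and 2.5 rather than 1 and 5.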

#### Self\-pollination selection score\.

The main text uses the simplified score

$$S_{\mathrm{self}}(x',x)=\mathrm{sim}(x',x)+\lambda\,p_{\mathrm{clf}}(c_{\mathrm{tgt}}\mid x').$$

In the implementation, each term carries its own weight and we additionally include the classifier margin:

$$S_{\mathrm{self}}^{\mathrm{impl}}(x',x)=\lambda_{\mathrm{sim}}\,\mathrm{sim}(x',x)+\lambda_{\mathrm{conf}}\,p_{\mathrm{clf}}(c_{\mathrm{tgt}}\mid x')+\lambda_{\mathrm{margin}}\,m_{c_{\mathrm{tgt}}}(x').$$

#### Cross\-pollination selection score\.

The main text uses the simplified score

$$S_{\mathrm{cross}}(x')=\mathrm{thickness}(x')+\lambda\,p_{\mathrm{clf}}(c_{\mathrm{tgt}}\mid x').$$

In the implementation, each term carries its own weight and we additionally include the classifier margin:

$$S_{\mathrm{cross}}^{\mathrm{impl}}(x')=\lambda_{\mathrm{feat}}\,\phi(x')+\lambda_{\mathrm{conf}}\,p_{\mathrm{clf}}(c_{\mathrm{tgt}}\mid x')+\lambda_{\mathrm{margin}}\,m_{c_{\mathrm{tgt}}}(x').$$
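The implementation scores are weighted sums of quantities defined earlier in this appendix, and the simplified main-text scores are the special case with unit weight on the first term and zero margin weight. The sketch below makes that relationship explicit; the function names and default weights are placeholders for illustration, not the values used in the experiments.

```python
import numpy as np

def s_self_impl(sim, p_tgt, margin, lam_sim=1.0, lam_conf=1.0, lam_margin=0.0):
    """Weighted self-pollination score (default weights are placeholders)."""
    return lam_sim * sim + lam_conf * p_tgt + lam_margin * margin

def s_cross_impl(feature, p_tgt, margin, lam_feat=1.0, lam_conf=1.0, lam_margin=0.0):
    """Weighted cross-pollination score (default weights are placeholders)."""
    return lam_feat * feature + lam_conf * p_tgt + lam_margin * margin

# Per-candidate inputs for two hypothetical children.
sim = np.array([0.9, 0.6])
thickness = np.array([2.5, 3.0])
p_tgt = np.array([0.8, 0.4])
margin = np.array([2.0, -1.0])

lam = 0.5
print(s_self_impl(sim, p_tgt, margin, lam_conf=lam))           # simplified S_self
print(s_cross_impl(thickness, p_tgt, margin, lam_conf=lam))    # simplified S_cross
print(s_self_impl(sim, p_tgt, margin, lam_margin=0.1))         # with a margin term
```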

## Appendix C Experiment Setup

### C.1 Self-Pollination Experiment Setup

The experiment evaluates all target digits $0,\ldots,9$. For each target digit, source digits are randomly sampled from the non-target digits. We use $2048$ source samples per target digit, resulting in $20480$ evaluated source–target edits in total. The underlying autoencoder, classifier, and conditional flow model are fixed during this experiment. The hyperparameters are shown in [Table 3](https://arxiv.org/html/2606.20084#A3.T3).

Table 3: Hyperparameters used for the self-pollination experiment.

### C.2 Cross-Pollination Experiment Setup

#### MorphoMNIST

The experiment evaluates all target digits $0,\ldots,9$. For each target digit, we compare homogeneous and diverse cross-pollination. The homogeneous population is initialized from samples of the target digit, whereas the diverse population is initialized from randomly sampled non-target digits. The underlying autoencoder, classifier, and conditional flow model are fixed during this experiment. The hyperparameters are shown in [Table 4](https://arxiv.org/html/2606.20084#A3.T4).

Table 4: Hyperparameters used for the cross-pollination experiment.

#### Crystal structures.

The crystal experiment evaluates all seven crystal systems: Cubic, Hexagonal, Monoclinic, Orthorhombic, Tetragonal, Triclinic, and Trigonal. For each target crystal system, we compare homogeneous and diverse cross-pollination. The homogeneous population is initialized from structures belonging to the target crystal system, whereas the diverse population is initialized from structures sampled from non-target crystal systems. The optimization objective is the predicted band gap. Before optimization, an initial candidate pool is filtered by tournament selection using the true band gap, and the selected candidates are used as the initial population. The underlying crystal latent representation, crystal-system classifier, band-gap regressor, and conditional flow model are fixed during this experiment. The hyperparameters are shown in [Table 5](https://arxiv.org/html/2606.20084#A3.T5).

Table 5: Hyperparameters used for the crystal cross-pollination experiment.

## Appendix D More Results

We present per-digit results for the self-pollination and cross-pollination experiments in [Tables 6](https://arxiv.org/html/2606.20084#A4.T6) and [7](https://arxiv.org/html/2606.20084#A4.T7). Furthermore, we provide the per-digit results for each cross-pollination generation in [Figures 4](https://arxiv.org/html/2606.20084#A4.F4), [5](https://arxiv.org/html/2606.20084#A4.F5), [6](https://arxiv.org/html/2606.20084#A4.F6), [7](https://arxiv.org/html/2606.20084#A4.F7), [8](https://arxiv.org/html/2606.20084#A4.F8), [9](https://arxiv.org/html/2606.20084#A4.F9), [10](https://arxiv.org/html/2606.20084#A4.F10), [11](https://arxiv.org/html/2606.20084#A4.F11), [12](https://arxiv.org/html/2606.20084#A4.F12) and [13](https://arxiv.org/html/2606.20084#A4.F13).

For the WyCryst experiment, we provide a per-target-crystal-system comparison in [Table 8](https://arxiv.org/html/2606.20084#A4.T8).

Table 6: Per-target-digit comparison between leap-only and self-pollination on MorphoMNIST. Results are reported as mean ± STE. Self-pollination denotes the improvement over leap-only.

Table 7: Per-target-digit comparison between homogeneous and diverse cross-pollination. Results are reported as mean ± STE over three random seeds. Diverse denotes the improvement over homogeneous cross-pollination. Validity denotes target-class success rate. Feature value corresponds to the optimized MorphoMNIST thickness value. Diversity denotes image-space diversity.

Table 8: Per-target-crystal-system comparison between homogeneous and diverse cross-pollination. Results are reported as mean ± STE over three random seeds. Diverse denotes the improvement over homogeneous cross-pollination. Validity denotes target-crystal-system success rate. Band gap corresponds to the top-percentile mean predicted band gap. Diversity denotes latent-space diversity.

![Refer to caption](https://arxiv.org/html/2606.20084v1/figures/Cross_figure_0.png)

Figure 4: TBF

![Refer to caption](https://arxiv.org/html/2606.20084v1/figures/Cross_figure_1.png)

Figure 5: Same as [Figure 4](https://arxiv.org/html/2606.20084#A4.F4) but for digit 1.

![Refer to caption](https://arxiv.org/html/2606.20084v1/figures/Cross_figure_2.png)

Figure 6: Same as [Figure 4](https://arxiv.org/html/2606.20084#A4.F4) but for digit 2.

![Refer to caption](https://arxiv.org/html/2606.20084v1/figures/Cross_figure_3.png)

Figure 7: Same as [Figure 4](https://arxiv.org/html/2606.20084#A4.F4) but for digit 3.

![Refer to caption](https://arxiv.org/html/2606.20084v1/figures/Cross_figure_4.png)

Figure 8: Same as [Figure 4](https://arxiv.org/html/2606.20084#A4.F4) but for digit 4.

![Refer to caption](https://arxiv.org/html/2606.20084v1/figures/Cross_figure_5.png)

Figure 9: Same as [Figure 4](https://arxiv.org/html/2606.20084#A4.F4) but for digit 5.

![Refer to caption](https://arxiv.org/html/2606.20084v1/figures/Cross_figure_6.png)

Figure 10: Same as [Figure 4](https://arxiv.org/html/2606.20084#A4.F4) but for digit 6.

![Refer to caption](https://arxiv.org/html/2606.20084v1/figures/Cross_figure_7.png)

Figure 11: Same as [Figure 4](https://arxiv.org/html/2606.20084#A4.F4) but for digit 7.

![Refer to caption](https://arxiv.org/html/2606.20084v1/figures/Cross_figure_8.png)

Figure 12: Same as [Figure 4](https://arxiv.org/html/2606.20084#A4.F4) but for digit 8.

![Refer to caption](https://arxiv.org/html/2606.20084v1/figures/Cross_figure_9.png)

Figure 13: Same as [Figure 4](https://arxiv.org/html/2606.20084#A4.F4) but for digit 9.
