In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models

Summary

This paper replicates the Picbreeder human-driven open-ended image evolution process using large vision-language models, analyzing differences and exploring factors like exploratory noise, behavioral diversity, and memory.

arXiv:2605.23908v1 Announce Type: new

# In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models
Source: [https://arxiv.org/html/2605.23908](https://arxiv.org/html/2605.23908)
\(5 June 2009\)

###### Abstract\.

We are in the midst of large\-scale industrial and academic efforts to automate the processes of scientific, technological and creative production through AI\-driven assistants\. Historically, a fundamental property of these processes in their human form has been their open\-endedness: their capacity for generating a seemingly endless supply of novel and meaningful new forms\. Do artificial agents have any capacity for such fruitful unguided discovery? To answer this question, we turn to Picbreeder, the canonical exemplar of human\-driven open\-ended search, in which users collaboratively generated a diverse library of images through interactive evolution of small neural networks\. We replicate Picbreeder, replacing human users with frontier Vision Language Models \(VLMs\)\. We observe clear qualitative differences between the output of our system and the historical human baseline, and attempt to characterize them using metrics of phylogenetic complexity and visual and semantic salience and novelty\. In an effort to identify some of the causal factors contributing to these differences, we study the addition of exploratory noise to the agents’ selection process, of behavioral diversity between agents, and of narrative momentum in the form of memory of past actions\. We make our code available at [https://github\.com/smearle/picbreeder\-vlm](https://github.com/smearle/picbreeder-vlm)\.

Open\-Endedness, Vision\-Language Models, Picbreeder

Conference: The Genetic and Evolutionary Computation Conference; July 13–17, 2026; San José, Costa Rica

CCS Concepts: Computing methodologies, Multi\-agent systems; Computing methodologies, Cognitive science

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000234.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000162.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_001942.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_001410.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000487.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000732.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000200.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000233.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_001951.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_001978.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000259.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_001916.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000292.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000360.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000129.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000125.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000093.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000457.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_001998.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000026.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_001446.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000991.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_001583.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000971.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_001509.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_001750.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000063.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_001731.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000911.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000829.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_001831.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_003315.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_002468.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_001896.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_001169.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000130.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000201.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_001896_2.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000491.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000083.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_001980.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/teaser/img_000797.png)

Figure 1\. Large Vision\-Language Models play Picbreeder and discover novel images\. Cherry\-picked examples\.

## 1\.Introduction

Open\-ended processes of learning and discovery are crucial for civilization\. In science, mathematics, art, and technology, prior works act as stepping stones for new advances and paradigm shifts\. However, the path between these has often been indirect, with serendipity and curiosity playing roles beyond pure optimization of objectives \(Stanley and Lehman, [2015](https://arxiv.org/html/2605.23908#bib.bib29)\)\. In contrast to the prevalent paradigm of machine learning, open\-ended search is a divergent process that constructs an ever\-growing tree of novel artifacts\.

The reliance of machine learning on datasets is becoming a crutch: we are rapidly running out of data for training large models \(Villalobos et al\., [2022](https://arxiv.org/html/2605.23908#bib.bib7)\), and constructing meaningful reinforcement learning tasks is hard and expensive\. Given this, artificial open\-ended systems are more relevant than ever, as they can continue to generate novel artifacts, bypassing the data bottleneck\. However, creating functioning open\-ended artificial systems remains a long\-standing grand challenge \(Stanley et al\., [2017](https://arxiv.org/html/2605.23908#bib.bib5); Stepney and Hickinbotham, [2024](https://arxiv.org/html/2605.23908#bib.bib6)\)\. Although the judgment is subjective, most researchers agree that fully automated artificial open\-endedness has not yet been achieved\.

However, there exist interactive computer\-based systems that achieve a degree of open\-endedness with humans in the loop\. Picbreeder \(Secretan et al\., [2008](https://arxiv.org/html/2605.23908#bib.bib11), [2011](https://arxiv.org/html/2605.23908#bib.bib12)\) is the canonical example of such a system\. In Picbreeder, humans collaboratively create interesting images in an interactive evolutionary loop\. Users can start creating from others’ published images, and follow their own notions of interestingness\. If we were able to recreate such a system in a purely computational substrate, it could serve as a kind of model organism, allowing us to experiment with its components and parameters so as to better understand the building blocks of open\-endedness\.

In this paper we describe a fully artificial recreation of Picbreeder, in which we use Vision\-Language Models \(VLMs\) \(Bordes et al\., [2024](https://arxiv.org/html/2605.23908#bib.bib30)\) in place of humans\. We analyze this system both quantitatively and qualitatively, and vary key components to understand how its output is affected\. At the most immediate level, we answer the question “what happens when a VLM plays Picbreeder?” At a deeper level, we propose a strategy for understanding open\-endedness by recreating systems that rely on humans in the loop as fully artificial systems, and varying their components and parameters\. Our research questions center around which design choices allow the system to create a meaningful diversity of artifacts\. To this end, we ask:

1. Do VLM agents need access to *history*? Does access to a context \(i\.e\., a memory\) of their past turns encourage meaningful divergence by allowing them to recognize and steer away from existing patterns in the system? Or does this increased exposure merely reinforce existing biases, leading to mode collapse?
2. Do VLM agents need explicit *exploration* strategies to help them explore the space of artifacts more effectively, by forcing them into parts of the search space they otherwise would not have visited? Or are they inherently capable of balancing discovery and optimization?
3. Do we need a *multi\-agent* system? Does the simulation of multiple personalities produce open\-ended creative/competitive dynamics, or does it merely define a set of fixed attractors in the search space?

We find that small amounts of exploratory noise can increase diversity of generated archives, but at the cost of the quality of images therein; surprisingly little history is necessary for optimal performance, with greater context lengths leading to pathological behavior; and increasing the number of unique agents contributes to exploration without sacrificing quality according to our quantitative metrics, but results in the propagation of qualitatively nondescript, noisy, and potentially adversarial images among the archive \([Fig\. A19](https://arxiv.org/html/2605.23908#A2.F19)\)\.

![Refer to caption](https://arxiv.org/html/2605.23908v1/x1.png) \(a\) $NA=10$
![Refer to caption](https://arxiv.org/html/2605.23908v1/x2.png) \(b\) $NA=100$
![Refer to caption](https://arxiv.org/html/2605.23908v1/x3.png) \(c\) $NA=1{,}000$

\(d\) Most semantically salient images in the archive, from seeds with the highest Semantic Recall\.
![Refer to caption](https://arxiv.org/html/2605.23908v1/x4.png) \(e\) $NA=10$
![Refer to caption](https://arxiv.org/html/2605.23908v1/x5.png) \(f\) $NA=100$
![Refer to caption](https://arxiv.org/html/2605.23908v1/x6.png) \(g\) $NA=1{,}000$

\(h\) Visually representative images from the archive, from seeds with the highest Visual Coverage\.

Figure 3\. Qualitative effect of varying the Number of Agents \($NA$\), by sampling from variably\-sized pools of \(LLM\-generated\) personality traits and prepending these to system prompts during VLM\-Picbreeder sessions\. Archives with highest Semantic Recall \([3\(b\)](https://arxiv.org/html/2605.23908#S1.F3.sf2)\) and Visual Coverage \([3\(g\)](https://arxiv.org/html/2605.23908#S1.F3.sf7)\) are outlined\.
## 2\.Related Work

Picbreeder \(Secretan et al\., [2008](https://arxiv.org/html/2605.23908#bib.bib11), [2011](https://arxiv.org/html/2605.23908#bib.bib12)\) is a tool for interactive \(human\-in\-the\-loop\) evolutionary computation involving crowdsourced image selection at scale\. It was used to study the nature of serendipity among human collaborators\. Case studies showed that image sharing/publication amongst a large userbase enabled the discovery of many interestingly diverse images that would resonate with humans\. Our work pursues a complementary line of inquiry that looks into an abstraction of open\-ended \(computational\) creativity \(Soros et al\., [2024](https://arxiv.org/html/2605.23908#bib.bib16)\) beyond human\-in\-the\-loop systems\. Additionally, we empirically study domain\-general elements of computational creativity \(namely serendipity, memory, and personality\) in hopes of formalizing the essential aspects of autotelic processes in the real world \(Oudeyer and Kaplan, [2007](https://arxiv.org/html/2605.23908#bib.bib26); Colas et al\., [2023](https://arxiv.org/html/2605.23908#bib.bib27)\)\.

Other works have studied different aspects of automated search in Picbreeder\. The Innovation Engine \(Nguyen et al\., [2016](https://arxiv.org/html/2605.23908#bib.bib9)\) demonstrated the importance of diversity in image classes/targets for the proliferation of diverse artifacts, while Gaier et al\. \([2019](https://arxiv.org/html/2605.23908#bib.bib25)\) demonstrated the importance of intermediate solution/stepping stone diversity to escape local optima inherent in objective optimization \(Woolley and Stanley, [2011](https://arxiv.org/html/2605.23908#bib.bib39)\)\. Our work differs in that we aim for more open\-ended discovery without optimization targets \(e\.g\. specific goal images\) or pre\-specified niches in our system design\. For example, our system incentivizes discovery via natural language directives involving minimal criteria or filters \(Lehman and Stanley, [2010](https://arxiv.org/html/2605.23908#bib.bib24)\) rather than goal states or optimization metrics\.

Recently, large pre\-trained models \(e\.g\., LLMs, VLMs\) have been leveraged as a means of automating elements of evolutionary or agentic creativity via model\-based search\. They have been shown to be effective as selection operators based on the queried interestingness of artifacts and concepts \(Zhang et al\., [2024](https://arxiv.org/html/2605.23908#bib.bib17); Klissarov et al\., [2024](https://arxiv.org/html/2605.23908#bib.bib18); Faldor et al\., [2024](https://arxiv.org/html/2605.23908#bib.bib19)\), as evaluators of behavior characteristics and traits of diversity \(Bradley et al\., [2023](https://arxiv.org/html/2605.23908#bib.bib20); Pourcel et al\., [2023](https://arxiv.org/html/2605.23908#bib.bib21)\), and as intuitive mutation operators \(Lehman et al\., [2023](https://arxiv.org/html/2605.23908#bib.bib22); Meyerson et al\., [2023](https://arxiv.org/html/2605.23908#bib.bib23)\), thanks to the general\-purpose utility of off\-the\-shelf models\. Existing work leveraging large models primarily focuses on introducing new algorithmic components as model\-based operations\. In contrast, our work studies the nature of innovation holistically: we test a new VLM\-based system on the Picbreeder domain in order to gain an abstracted understanding of open\-ended discovery pertaining to both humans and AI agents\.

## 3\.Methods

Our primary aim is to faithfully replicate the human Picbreeder experiment in purely computational form\. We do not seek, necessarily, to replicate the *results* of Picbreeder \(an archive of images, the quality of their representation, or the genetic relationships between them\) but rather the *conditions* that enabled open\-ended discovery in the original system\. We provide minimal guidance to the VLMs at the helm of our system, instead allowing them to explore according to their own preferences, context, and a brief description of the system’s operation\.

We are able to do so as VLMs are capable of following ambiguous instructions, making assumptions about how to act even when the task at hand is underspecified\. \(Of course, whether these assumptions are optimal or aligned with human behavior is another matter\.\) Rather than engineering the mechanics of an automatic search process so that it may, hopefully, turn out to be open\-ended, we can simply, implicitly ask VLMs to perform open\-ended search on their own\.

### 3\.1\.Re\-implementing Picbreeder

Using the neat\-python library \([McIntyre et al\.,](https://arxiv.org/html/2605.23908#bib.bib1)\), we carefully follow Picbreeder’s implementation, using CPPNs \(Stanley, [2007](https://arxiv.org/html/2605.23908#bib.bib4)\) to represent images and the NeuroEvolution of Augmenting Topologies \(NEAT\) algorithm \(Stanley and Miikkulainen, [2002](https://arxiv.org/html/2605.23908#bib.bib2)\) to evolve them\. Each CPPN is a neural network that takes as input an $(x, y, r)$ tuple, where $x$ and $y$ are 2D coordinates and $r$ is the distance from the center of the image \(to enable radial symmetry\)\. The CPPN outputs hue, saturation, and brightness for each input tuple\. In our experiments, we fix the resolution of generated images to $128 \times 128$ during evolution\.
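As a rough illustration of this input encoding, the sketch below builds the $(x, y, r)$ grid for a $128 \times 128$ image and queries a stand\-in network\. The coordinate range of $[-1, 1]$ and the function `toy_cppn` are assumptions made for illustration only; an evolved CPPN would be a neat\-python network rather than this fixed formula\.

```python
import numpy as np

def coordinate_inputs(resolution=128):
    """Return an array of shape (resolution, resolution, 3) holding (x, y, r)."""
    # Assumption: coordinates are normalized to [-1, 1] along each axis.
    axis = np.linspace(-1.0, 1.0, resolution)
    x, y = np.meshgrid(axis, axis)
    r = np.sqrt(x**2 + y**2)  # distance from the image center
    return np.stack([x, y, r], axis=-1)

def toy_cppn(inputs):
    """Hypothetical stand-in for an evolved CPPN.

    Returns raw (hue, saturation, brightness) activations per pixel.
    """
    x, y, r = inputs[..., 0], inputs[..., 1], inputs[..., 2]
    brightness = 1.0 / (1.0 + np.exp(-np.sin(4.0 * r)))  # sigmoid output node
    hue = np.cos(3.0 * x) + brightness                   # identity output node
    saturation = y + brightness                          # identity output node
    return np.stack([hue, saturation, brightness], axis=-1)

inputs = coordinate_inputs()
hsb = toy_cppn(inputs)
```

Because $r$ depends only on distance from the center, any structure the network builds on it is radially symmetric by construction\.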

An implementation detail that is specific to Picbreeder is initializing the brightness node with outgoing connections to the hue and saturation nodes\. This biases initial images such that color gradients tend to follow or reflect grayscale structure\. The brightness node is assigned the sigmoid activation function, while other output nodes have the identity function, and hidden nodes have activation functions randomly chosen from sigmoid, sine, cosine, and identity\. To produce a grayscale image \(when Picbreeder’s “color mode” is toggled off\), we sample exclusively from the brightness node\. To produce a color image, we map the activations of the hue and saturation to $[0,1]$ by wrapping and clamping, respectively, before converting them to RGB\. Connection weights are marked as belonging to either the structure or color subnetwork\. When the user is in structure\- or color\-only mutation modes, they may only mutate or add weights belonging to these subnetworks\.
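The output mapping described above can be sketched as follows\. The helper name `hsb_to_rgb` is hypothetical, and the use of Python’s standard `colorsys` module is our choice for illustration; we also assume that brightness already lies in $[0,1]$ because it comes from a sigmoid node\.

```python
import colorsys
import numpy as np

def hsb_to_rgb(hue, saturation, brightness, color_mode=True):
    """Map raw CPPN outputs to an RGB array with values in [0, 1]."""
    if not color_mode:
        # Grayscale: sample exclusively from the brightness node.
        return np.repeat(brightness[..., None], 3, axis=-1)
    h = np.mod(hue, 1.0)                # wrap hue into [0, 1)
    s = np.clip(saturation, 0.0, 1.0)   # clamp saturation into [0, 1]
    # colorsys works on scalars, so vectorize it over the pixel arrays.
    to_rgb = np.vectorize(colorsys.hsv_to_rgb)
    r, g, b = to_rgb(h, s, brightness)
    return np.stack([r, g, b], axis=-1)
```

Wrapping suits hue because it is a cyclic quantity, whereas saturation has a natural lower and upper bound and is therefore clamped\.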

### 3\.2\.Historical Picbreeder data

We make use of the same historical data as Kumar et al\. \([2025](https://arxiv.org/html/2605.23908#bib.bib10)\), which contains the complete lineages of a large number of the images published to the Picbreeder website between its launch in 2008 and its death around 2016\. This amounts to 9,758 published images and their ancestry\. This allows us to reconstruct the phylogenetic tree of published images for comparison against the output of VLMs playing Picbreeder\. The lineage files are ordered chronologically \(i\.e\., in publication order\), which allows us to retroactively plot various metrics over time as the archive grows, for fine\-grained comparison, and to compare against VLM runs involving fewer publications \(a few thousand, in our experiments\)\.

### 3\.3\.Playing Picbreeder with VLMs

We conceive of a *session* as the core unit of the Picbreeder loop\. A session begins when an agent chooses to either branch an image from the existing archive of images published thus far, or begin with a fresh, randomly initialized population of CPPNs\. Following the original Picbreeder implementation, the agent may select a single CPPN\-image for branching, producing a population of offspring resulting from random mutations of the selected parent\.

At the following step, the agent is presented with the resultant initial population \(either of mutants resulting from branching, or of random initial CPPN\-images\) and asked to select one or several CPPN\-images from the population as parents for the next generation\. Subsequent steps proceed in the same way, with mutation \(and, with some probability, crossover in the case of multiple parents\) applied to produce the subsequent generation\. At each generation, the population consists of 15 CPPN\-images\. In addition to mutant offspring, exact copies of the selected parent\(s\) are always included in the subsequent generation’s population\. At the 20th generation of evolution in the session, the agent is asked to select an image for publication to the archive and give it a title\. \(We initially allowed agents more freedom over when to publish, but agents chose to publish rapidly and end their session, hence we enforced longer sessions\.\)
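The session loop described above might be sketched as below\. This is a schematic under several assumptions: `select`, `mutate`, and `crossover` are hypothetical callables standing in for the VLM’s choice and the NEAT operators, the crossover probability of 0\.5 is a placeholder \(the text only says “some probability”\), and we assume crossover offspring are also mutated\. The exact way generations are counted toward the 20th is likewise an assumption\.

```python
import random

POPULATION_SIZE = 15       # CPPN-images shown per generation
SESSION_GENERATIONS = 20   # selection steps before publication

def next_generation(parents, mutate, crossover, crossover_prob, rng):
    """Build the next population from the selected parents."""
    population = list(parents)  # exact copies of the parents are kept
    while len(population) < POPULATION_SIZE:
        if len(parents) > 1 and rng.random() < crossover_prob:
            a, b = rng.sample(parents, 2)
            child = mutate(crossover(a, b))
        else:
            child = mutate(rng.choice(parents))
        population.append(child)
    return population

def run_session(initial_population, select, mutate, crossover,
                crossover_prob=0.5, seed=0):
    """Run one session and return the final population for publication."""
    rng = random.Random(seed)
    population = initial_population
    for _ in range(SESSION_GENERATIONS):
        parents = select(population)  # list of one or several parents
        population = next_generation(parents, mutate, crossover,
                                     crossover_prob, rng)
    return population
```

In the real system the agent would then pick one image from the returned population, title it, and publish it to the shared archive\.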

At each selection step, the agent may alternatively toggle color mode on or off \(which does not count toward the number of generations in the session\)\. When color mode is off, only the brightness output node is sampled to produce a grayscale image\. Random initial populations default to grayscale, and branched images retain the color mode under which they were published\. During selection steps, the agent may, in addition to selecting parents, set the strength of mutation \(which defaults to $0.5$\) to any value in $[0,1]$ \(where $0$ still results in an effective mutation strength of $0.01$ in the system’s backend\)\. When color mode is active, the agent may additionally toggle mutation mode between color\- or structure\-only mutations, or both \(when color mode is off, mutations default to structure\-only\)\. All of these controls and their defaults mirror equivalent controls that were available to human Picbreeder users\.

![Refer to caption](https://arxiv.org/html/2605.23908v1/x7.png) \(a\) Most semantically salient images
![Refer to caption](https://arxiv.org/html/2605.23908v1/x8.png) \(b\) Visually representative images

Figure 4\. Samples from the historical human archive after 2,000 user sessions\.

During branching, agents are presented with a 100\-image sample of the archive published thus far\. This is broken down into the following 5 equal parts: a set of “top\-rated” images, having accrued the highest \(VLM\-generated\) ratings thus far; a set of “best new” images, comprising the top\-rated of the 100 most recently published images; a set of “most branched” images, comprising those images which have been selected the most for branching; a set of “latest” images, comprising the most recently published images; and a set of “random” images, drawn uniformly from all images in the archive\. These sets are mutually exclusive in the order presented above, such that landing in a higher\-priority group precludes membership in a subsequent lower\-priority group \(i\.e\. an overall “top\-rated” image may not appear as a “best new” image, a “best new” image may not appear as one of the “latest”, and so forth\)\. This mimics the categories of images presented to users on the home screen of the original Picbreeder website \([Fig\. A2](https://arxiv.org/html/2605.23908#A2.F2)\) save for the “editor’s choice” category\. While these initial sets of images were smaller on the original website \(8 in each\), human users were able to selectively “see more” of a given category; we split the difference by presenting larger sets for each category\.
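One way to realize this prioritized, mutually exclusive sampling is sketched below\. The archive record layout \(dictionaries with `id`, `rating`, and `branches` keys, listed in publication order\) and the function name `branching_sample` are assumptions for illustration; five equal parts of a 100\-image sample give 20 images per category\.

```python
import random

def branching_sample(archive, per_category=20, recent_window=100, seed=0):
    """Return a dict of five mutually exclusive image groups.

    archive: list of dicts in publication order (oldest first), each with
    hypothetical keys 'id', 'rating', and 'branches'.
    """
    rng = random.Random(seed)
    taken = set()
    sample = {}

    def take(name, candidates):
        # Skip images already claimed by a higher-priority category.
        picked = [img for img in candidates if img["id"] not in taken]
        picked = picked[:per_category]
        taken.update(img["id"] for img in picked)
        sample[name] = picked

    def by_rating(images):
        return sorted(images, key=lambda img: img["rating"], reverse=True)

    take("top_rated", by_rating(archive))
    take("best_new", by_rating(archive[-recent_window:]))
    take("most_branched",
         sorted(archive, key=lambda img: img["branches"], reverse=True))
    take("latest", archive[::-1])
    shuffled = list(archive)
    rng.shuffle(shuffled)  # uniform order over the whole archive
    take("random", shuffled)
    return sample
```

Processing the categories in priority order and filtering against the `taken` set is what enforces the exclusivity rule: an image claimed as “top\-rated” can never reappear in a later group\.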

To solicit ratings for each image published to the collaborative archive, we ask a new VLM instance \(with no context of prior interactions with the system\) to rate a subset of images from the archive with integer scores between 1 and 5\. This sample is drawn in the same way as the branching sample described above\. This ensures that newly rated images are compared against a variety of existing images in the archive\. We initiate this rating process after every 5 new publications to the archive, once the archive has reached a size of at least 100 images\. During branching, selection, publication, and rating, agents are prompted to provide 1\-2 sentences of rationale for their decisions, allowing us a window \(albeit a potentially fallible one\) into their decision process, and to observe failure modes, e\.g\., where agents may become overwhelmed by their context and begin mixing up the positions of images or mischaracterizing what these images depict\. We run 10 VLM agents in parallel, which branch from, publish to, and rate a shared online archive\.

## 4\.Metrics & Interventions

Intuitively, the historical human archive is more evocative and diverse than what is produced by VLMs\. We can see this both by comparing \(representative samples of\) archives of either origin \([Fig\. 3](https://arxiv.org/html/2605.23908#S1.F3), [Fig\. 4](https://arxiv.org/html/2605.23908#S3.F4)\), and by comparing ancestral lineages within them, tracing back from two quite similar images that represent an instance of convergent human\- and VLM\-driven evolution \([Fig\. 5](https://arxiv.org/html/2605.23908#S4.F5)\)\. Humans tend to take bigger leaps between publications, despite human and VLM sessions comprising roughly the same number of generations on average \([Fig\. A3](https://arxiv.org/html/2605.23908#A2.F3), [Table A1](https://arxiv.org/html/2605.23908#A2.T1)\), and land on sharper, more refined images\. But why? What is this *x\-factor*, this quality of boldness and discernment? In this section, we establish evaluation metrics that attempt to quantify this x\-factor, and experimental interventions that might allow us to come closer to replicating it synthetically\.

### 4\.1\.Evaluation Metrics

Our evaluation metrics comprise attempts to abstract and mechanize what we think gives the human archive its special quality\. In particular, we try to measure the degree of fidelity with which agents are able to depict a diverse set of visually/semantically distinct forms\. We can think of these metrics roughly as capturing the quality \(recall\) and diversity \($k$\-covering radii in embedding spaces and the $J^1$ index of Tree Balance\) of the Picbreeder archive\.

#### 4\.1\.1\.Semantic Recall

To measure Semantic Recall \(the VLM’s ability to rediscover a set of known common objects\), we gather a large list of nouns/noun phrases that can plausibly be depicted as images\. Here, we use the 1,824 unique class names in the THINGS dataset \(Hebart et al\., [2023](https://arxiv.org/html/2605.23908#bib.bib15)\), after deduplication\. In a joint text\-vision embedding space \(we use SigLIP2\-B \(Tschannen et al\., [2025](https://arxiv.org/html/2605.23908#bib.bib14)\)\), we embed each of these classes, and each of the images published to the archive during a \(human\- or VLM\-driven\) Picbreeder run\. We then compute the cosine distance between each image and each class in this embedding space\. For each class, we take the minimum distance between it and any image, then sum this value over all classes\.
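The aggregation step can be sketched in a few lines of NumPy, assuming the image and class embeddings have already been computed by the joint text\-vision model\. The function name `semantic_recall_score` is hypothetical, and any further normalization or sign convention applied when reporting the metric is not covered here\.

```python
import numpy as np

def semantic_recall_score(image_embeddings, class_embeddings):
    """Sum over classes of the minimum cosine distance to any image.

    image_embeddings: array of shape (n_images, d)
    class_embeddings: array of shape (n_classes, d)
    """
    img = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    cls = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    cosine_distance = 1.0 - cls @ img.T   # shape (n_classes, n_images)
    return float(cosine_distance.min(axis=1).sum())
```

As written, this raw sum is a distance, so a smaller value means that more classes have at least one nearby image in the archive\.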

![Refer to caption](https://arxiv.org/html/2605.23908v1/x9.png) \(a\) Lineage of human\-discovered car
![Refer to caption](https://arxiv.org/html/2605.23908v1/x10.png) \(b\) Lineage of VLM\-discovered car

Figure 5\. Ancestral lineages of semantically similar images generated by human and VLM interactions with Picbreeder \(only published images are displayed\)\. Humans seem to take larger steps in semantic space \(from face, to eye, to frog, to car\) than VLMs \(from abstract bird, to car seat, to dashboard, to hood ornament, to car\)\.

Table 1\. Summary of results of various interventions on VLM\-driven Picbreeder in terms of their impact on our evaluation metrics \(mean $\pm$ standard error\)\. Overall best results highlighted in green; best results among each hyperparameter sweep appear in bold\. The default setting, recurring across sweeps, is highlighted in grey\.
#### 4\.1\.2\.Visual Novelty

To measure the visual novelty of the generated archive, we embed all published images into an image embedding space \(we use SigLIP\-2\-B\-alignet \(Muttenthaler et al\., [2025](https://arxiv.org/html/2605.23908#bib.bib13)\)\)\. We then use greedy farthest\-point sampling \(Gonzalez, [1985](https://arxiv.org/html/2605.23908#bib.bib34)\) to generate a set of $k$ representative images: starting with a set that includes a random datapoint, we repeatedly add to this set the point that has the greatest minimum distance to any point in the set\. Given these $k$ representatives, we then grow spheres about these points with equal radii, until all embedded images are contained within some sphere\. The resultant radius is the $k$\-covering radius \(we report results for $k=100$\)\. To visualize archives \(or $k$ representative points from these archives\), we use Rasterfairy \(Mario Klingemann, [2015](https://arxiv.org/html/2605.23908#bib.bib8)\) to render images in a rectangular grid, arranged therein to reflect their relative distances in embedding space, allowing for more visually intuitive, ordered snapshots of the archive\.
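A compact sketch of greedy farthest\-point sampling and the resulting covering radius is given below\. It assumes Euclidean distance between embeddings \(the distance function is not specified above\), and the function name `k_covering_radius` is hypothetical\.

```python
import numpy as np

def k_covering_radius(embeddings, k, seed=0):
    """Greedy farthest-point sampling; returns (center indices, radius).

    embeddings: array of shape (n_points, d)
    """
    rng = np.random.default_rng(seed)
    first = int(rng.integers(len(embeddings)))  # random starting point
    centers = [first]
    # Distance from every point to its nearest chosen center so far.
    nearest = np.linalg.norm(embeddings - embeddings[first], axis=1)
    while len(centers) < min(k, len(embeddings)):
        farthest = int(np.argmax(nearest))  # point worst covered so far
        centers.append(farthest)
        dist = np.linalg.norm(embeddings - embeddings[farthest], axis=1)
        nearest = np.minimum(nearest, dist)
    # Smallest shared radius such that every point lies in some sphere.
    return centers, float(nearest.max())
```

A larger radius at fixed $k$ indicates that the archive is spread more widely in embedding space, since $k$ representatives are not enough to cover it tightly\.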

#### 4\.1\.3\.Semantic Novelty

To measure the semantic novelty of the generated archive, we have a VLM \(gemini\-2\.5\-pro\) generate short \(1\-sentence\) captions for each image in the archive, then map these captions to a text embedding space \(gemini\-embedding\-001; Lee et al\., [2025](https://arxiv.org/html/2605.23908#bib.bib35)\) and measure the $k$\-covering radius at $k=100$ over these points\.

#### 4\.1\.4\.Analysis of Phylogenetic Trees

We construct the phylogenetic trees of all published images, treating as roots those images that resulted from sessions that started from a random initial population of CPPN\-images, and defining a child\-parent relationship between any published image \(child\) and the previously published image \(parent\) whose branching began the session that produced it\. We then compute the $J^1$ index, a robust measure of Tree Balance \(Lemant et al\., [2022](https://arxiv.org/html/2605.23908#bib.bib3)\)\.
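For intuition, a sketch of the $J^1$ computation on a single rooted tree follows\. It uses the general definition from Lemant et al\.: each internal node contributes the normalized Shannon entropy of how its descendants’ total size is split among its children, weighted by that total size\. Giving every published image a node size of 1 and handling only one root are assumptions of this sketch; how the forest of multiple roots is combined is not specified here\.

```python
import math

def j1_index(children, root, node_size=1.0):
    """J^1 tree balance of the tree rooted at `root`.

    children: dict mapping a node to the list of its child nodes.
    node_size: size assigned to every node (assumption: uniform).
    """
    # Pre-order traversal; reversing it visits children before parents.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))
    subtree = {}  # S_i: total size of the subtree rooted at i
    for node in reversed(order):
        subtree[node] = node_size + sum(subtree[c] for c in children.get(node, []))

    numerator = denominator = 0.0
    for node in order:
        kids = children.get(node, [])
        if not kids:
            continue
        s_star = subtree[node] - node_size  # S_i^*: descendants only
        denominator += s_star
        if len(kids) >= 2:
            # Entropy of the split among children, in base out-degree.
            balance = -sum(
                (subtree[c] / s_star) * math.log(subtree[c] / s_star, len(kids))
                for c in kids
            )
            numerator += s_star * balance
    return numerator / denominator if denominator > 0 else 0.0
```

Under this definition a perfectly even branching pattern scores 1, while a purely linear chain of single\-child nodes scores 0\.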

### 4\.2\.Experimental Interventions

Our interventions on the VLM\-driven Picbreeder pipeline derive from reasoning about what might give humans an edge over VLMs in matters of creative, open\-ended discovery\. We identify a few key factors that would seem to set humans apart from VLMs in this context, and identify analogous knobs that we can tune on the VLM side to try to mimic these features of human behavior\.

#### 4\.2\.1\.Memory and Context Length

First, the way in which humans *remember* their recent past experiences would seem quite distinct from either of the two modes VLMs have for doing the same, namely, via storage in their weights \(during training\) or in their context \(during inference\)\. When a human plays Picbreeder, they are inevitably exposed to a sample of the online archive before embarking on an evolutionary session, and this impression will surely influence their judgements about the novelty of newly\-generated images\. The same goes for their selections and all candidate CPPN\-images viewed during their session\. But these past impressions are neither “wired” in their synapses, nor continually presented to them all\-at\-once on a single display\.

Still, we wonder what effect *memory* might have on our Picbreeder\-playing VLMs, and ask this question by simulating something like the latter case: controlling the number of previous system interaction steps that are included in a VLM’s context\. In the simplest setting, the VLM is only presented with the current population, and cannot see any prior populations or its decisions pertaining to these; we refer to this as a *Context Length* \($CL$\) of 0\. In the default setting, with $CL=1$, the agent can see the current turn and the one prior\. In general, when the branching step is captured by the context window as defined here \(e\.g\., when an agent with $CL=1$ is faced with its first population after branching\), the agent is able to see this sample of the online archive\. The context window never extends back beyond the branching step: every user session belongs to a “fresh” instance of the VLM\.
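The windowing rule can be stated as a one\-line slice over the turns of the current session\. The function name `build_context` and the representation of a session as a list of turns \(with the branching step first and the current turn last\) are assumptions for illustration\.

```python
def build_context(session_turns, context_length):
    """Return the turns visible to the agent at the current step.

    session_turns: turns of the current session, oldest first; element 0 is
    the branching step and the last element is the current turn.
    context_length: number of prior turns to include (CL).
    """
    if context_length < 0:
        raise ValueError("context_length must be non-negative")
    # The slice never reaches past index 0, i.e. past the branching step.
    return session_turns[-(context_length + 1):]
```

For example, with turns `["branch", "gen1", "gen2"]`, a context length of 0 yields only `"gen2"`, while a context length of 1 yields `"gen1"` and `"gen2"`\. Because each session has its own list, nothing from earlier sessions can leak in\.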

We select $CL=1$ as our default because it is the cheapest setting that still gives the agent the opportunity to notice if it might be stuck in a local minimum, repeating its selection decision across multiple turns\. At the extreme of maximum remembrance, $CL=20$, the agent’s context will always include the full chat history of the current session, including its branching step\. In this case, we append a special directive to the prompt which asks the agent, when justifying its publication decision, to additionally explain how this publication is novel with respect to the initial archive \([17\(b\)](https://arxiv.org/html/2605.23908#A2.F17.sf2)\)\. This was motivated by the empirical observation that agents with less context would often publish identical copies of, or minor variations of, the same image dozens of times\.

We expect that larger context lengths should help agents escape local minima and implicitly incentivize them to explore more aggressively\. On the other hand, we’re wary of the possibility that overwhelming agents with excessively lengthy prompts might limit their ability to effectively reason and discern among images\.

#### 4\.2\.2\.Exploration and Selection Noise

In human decision\-making, given the inherent noisiness and complexity of the physical world, a great many auxiliary random variables may come into play\. This may be all the more consequential in Picbreeder, where the CPPN mutations effected by the user are themselves noisy, and the interactive evolutionary process is inherently difficult to control\. In such noisy domains, users are liable to exert less effort in long\-term planning \([Lei et al\.](https://arxiv.org/html/2605.23908#bib.bib36)\), and may therefore be more likely to make decisions “on a whim” in a quasi\-random fashion\. The machinery of large neural networks, by contrast, elides much of the complexity of the physical world, and LLMs may be even more limited than humans in their ability to emulate true randomness \(Harrison, [2024](https://arxiv.org/html/2605.23908#bib.bib38); Van Koevering and Kleinberg, [2024](https://arxiv.org/html/2605.23908#bib.bib37)\)\. It’s also conceivable that they may be less prone to make decisions “on a whim” in general, given their having been fine\-tuned to be maximally helpful in domains with verifiable solutions\.

We therefore use an $\epsilon$\-greedy exploration strategy \(Sutton et al\., [1998](https://arxiv.org/html/2605.23908#bib.bib33)\) to inject randomness into the agents’ selection process\. The hope is that this will facilitate productive exploration, allowing the agent to escape attractors in search space\. Conversely, an over\-abundance of such noise will presumably lead to an archive of similarly noisy, meaningless images\.

Concretely, at each selection step, with probability $\epsilon$, the VLM query is replaced with a random action, viz\. the uniform random selection of a parent from the current generation and/or the random adjustment of settings \(color toggle; mutation mode and strength\) that would otherwise have been available to the VLM\. During an $\epsilon$\-random action, color mode is toggled \(forgoing the selection of a random parent\) with probability 0\.1\. With probability 0\.2 \(provided color mode is active\), a new mutation mode is selected uniformly at random from the set of 3 possible modes\. And with probability 0\.2, a new continuous mutation strength is drawn from a uniform distribution over $[0,1]$\. These random actions can only occur during selection steps; even when $\epsilon=1$, a VLM will be queried to make branching and rating decisions\.
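The selection\-step procedure can be sketched as below. This is one reading of the description, not the released code, and it rests on several assumptions that the text leaves open: the three random draws are treated as independent, only the color toggle forgoes parent selection, and the "color mode is active" condition is checked after any toggle. The names `Settings`, `Action`, `random_action`, `selection_step` and the mutation\-mode labels are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional

# Placeholder labels: the text only says there are 3 mutation modes.
MUTATION_MODES = ("mode_a", "mode_b", "mode_c")


@dataclass
class Settings:
    color: bool = False
    mutation_mode: str = MUTATION_MODES[0]
    mutation_strength: float = 0.5


@dataclass
class Action:
    parent: Optional[int]  # index into the current population, or None
    settings: Settings


def random_action(pop_size: int, settings: Settings, rng: random.Random) -> Action:
    """Sample an epsilon-random action using the probabilities given in the text."""
    new = Settings(settings.color, settings.mutation_mode, settings.mutation_strength)
    parent: Optional[int] = None
    if rng.random() < 0.1:
        new.color = not new.color          # toggling color forgoes parent selection
    else:
        parent = rng.randrange(pop_size)   # uniform random parent
    if new.color and rng.random() < 0.2:   # assumption: checked after any toggle
        new.mutation_mode = rng.choice(MUTATION_MODES)
    if rng.random() < 0.2:
        new.mutation_strength = rng.uniform(0.0, 1.0)
    return Action(parent, new)


def selection_step(
    epsilon: float,
    pop_size: int,
    settings: Settings,
    query_vlm: Callable[[int, Settings], Action],
    rng: random.Random,
) -> Action:
    """With probability epsilon act randomly; otherwise defer to the VLM."""
    if rng.random() < epsilon:
        return random_action(pop_size, settings, rng)
    return query_vlm(pop_size, settings)
```

Because `rng.random()` returns values in the half\-open interval from 0 to 1, `epsilon=0` never triggers a random action and `epsilon=1` always does. Branching and rating decisions would bypass `selection_step` entirely, consistent with the VLM still being queried for those when $\epsilon=1$.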

#### 4\.2\.3\.Multiple Agents and Promptable Inclinations

Human populations are diverse, and this diversity likely translates to myriad distinct Picbreeder “playstyles”\. Anecdotally, one prolific user of the original Picbreeder system \(“BurnedDirt”\) fixated on minimal forms \(clean basic shapes, patterns of straight lines\), while other users gravitated toward forms resembling insects or faces\. Several thousand distinct human users engaged with Picbreeder, but we’d be hard\-pressed to find a comparable variety of distinct VLMs to deploy in our system in parallel\.

We opt instead to simply prompt for diverse playstyles\. To this end, we feed the Picbreeder system prompt \(which details the system and its interface for VLM agents\) to an LLM \(we use `gemini-3-pro-preview`\) and ask it to come up with distinct personality traits that may indirectly affect a user’s behavior in such a system, while avoiding the specification of concrete objectives \([Fig\. A18](https://arxiv.org/html/2605.23908#A2.F18)\)\. Traits are generated in batches of 50, up to a total of 1,000 traits, with the LLM viewing its previous 10 batches of generated traits at each step\. We then run experiments controlling the number of such traits that we use to parameterize distinct “agents” during a VLM\-Picbreeder run, randomly selecting this many traits from the overall pool at the beginning of the experiment\. At the beginning of each agent session, a personality prompt is drawn at random from this subset, and prepended to that agent’s system prompt for the duration of their session\. A random sample of the generated prompts is given in [Table A2](https://arxiv.org/html/2605.23908#A2.T2)\.
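The two sampling stages (once per experiment, then once per session) might look roughly like this. The function names and the prompt\-joining format are hypothetical, the pool contents are placeholders, and the sketch does not address how the single\-agent default is handled, which this section does not specify.

```python
import random
from typing import List


def choose_agent_traits(trait_pool: List[str], num_agents: int, rng: random.Random) -> List[str]:
    """Once per experiment: pick the subset of traits that define this run's agents."""
    return rng.sample(trait_pool, num_agents)  # sampling without replacement


def session_system_prompt(base_prompt: str, agent_traits: List[str], rng: random.Random) -> str:
    """Once per session: draw a personality and prepend it to the system prompt."""
    trait = rng.choice(agent_traits)
    return f"{trait}\n\n{base_prompt}"


rng = random.Random(0)
pool = [f"placeholder trait {i}" for i in range(1000)]  # stands in for the generated traits
traits = choose_agent_traits(pool, 10, rng)
prompt = session_system_prompt("<Picbreeder system prompt>", traits, rng)
```

Since the per\-session draw is with replacement, the same personality can drive many sessions within a run, which matters for the small\-agent\-count behavior reported in the results.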

## 5\.Results & Discussion

Images in the human Picbreeder archive tend to be diverse, aesthetically refined, and often evocative, in contrast to those in our VLM\-generated archives \(cf\. Figs\. [3](https://arxiv.org/html/2605.23908#S1.F3) and [4](https://arxiv.org/html/2605.23908#S3.F4)\)\. Our evaluation metrics would appear to reflect aspects of this qualitative discrepancy; indeed, the historical human baseline dominates most of our quantitative evaluation metrics \(see the summary in [Table 1](https://arxiv.org/html/2605.23908#S4.T1)\)\. In the following sections, we detail these results, and describe correlations between changes in our evaluation metrics and qualitative changes in generated archives, with reference to additional qualitative archive samples and quantitative visualizations \(Figs\. [A4](https://arxiv.org/html/2605.23908#A2.F4)\-[A10](https://arxiv.org/html/2605.23908#A2.F10)\) in the Appendix\.

A fully random baseline \(in which all selections, branching, and publication decisions are sampled uniformly\) acts as a lower bound, achieving low scores on all metrics, save for Tree Balance \(since uniform random branching decisions lead to highly balanced trees in expectation\)\.

In our VLM experiments, we default to Context Length $CL=1$, random selection probability $\epsilon=0$, and Number of Agents $NA=1$\. We use gemini\-2\.5\-pro \(Comanici et al\., [2025](https://arxiv.org/html/2605.23908#bib.bib28)\)\. We run each experiment for a total of $2{,}000$ sessions \(resulting in an archive of as many images\), repeat each experiment with $6$ random seeds, and report/plot the mean and its standard error over these seeds\.
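For reference, the per\-setting aggregation over seeds amounts to the following. This sketch assumes the sample standard deviation (with Bessel's correction) is used for the standard error, which the text does not state; the metric values are illustrative only and are not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_sem(values: Sequence[float]) -> Tuple[float, float]:
    """Mean over seeds and the standard error of that mean."""
    n = len(values)
    mean = statistics.fmean(values)
    # Sample standard deviation divided by sqrt(n); undefined for a single seed.
    sem = statistics.stdev(values) / math.sqrt(n) if n > 1 else float("nan")
    return mean, sem


# Six made-up per-seed scores for one metric under one setting.
per_seed_scores = [0.41, 0.39, 0.44, 0.40, 0.43, 0.42]
mean, sem = mean_and_sem(per_seed_scores)
print(f"{mean:.3f} +/- {sem:.3f}")
```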

### 5\.1\.Exploration

Without exploratory noise \($\epsilon=0$\), the VLM\-generated archives are exceedingly likely to contain many dozens of insignificant variations of the same form\. We see this quantitatively \([6\(b\)](https://arxiv.org/html/2605.23908#A2.F6.sf2)\), where $\epsilon=0$ is nearly as weak as Random in terms of Semantic Coverage; and in the visually representative sample from such an archive in [6\(e\)](https://arxiv.org/html/2605.23908#A2.F6.sf5), with the repetition of a few fox\- and fishbone\-like forms in particular \(see also [Fig\. A16](https://arxiv.org/html/2605.23908#A2.F16)\)\. \(Footnote 3: Since these representatives are chosen to be maximally distant from one another from among the set of images, this indicates that an abundance of forms closely related to these exemplars live in the archive, i\.e\., mode collapse\.\)

The use of noise in the selection process encourages exploration and seems to avoid such mode collapse \(cf\. the relatively diverse representatives in [6\(f\)](https://arxiv.org/html/2605.23908#A2.F6.sf6)\), but this comes at a direct tradeoff w\.r\.t\. the legibility of generated images: when $\epsilon \leq 0.25$, Semantic Recall is roughly comparable to that of $\epsilon=0$, though recall suffers under larger $\epsilon$ \([6\(b\)](https://arxiv.org/html/2605.23908#A2.F6.sf2)\)\. Though these more exploratory settings generate a greater diversity of images, as evidenced by their high visual \([6\(a\)](https://arxiv.org/html/2605.23908#A2.F6.sf1)\) and semantic \([6\(b\)](https://arxiv.org/html/2605.23908#A2.F6.sf2)\) coverage, the images tend to be less sharp and refined than their noiseless counterparts\.

A setting of $\epsilon=1$ drastically lowers the Semantic Recall score, but not to the point of matching Random, and in [6\(g\)](https://arxiv.org/html/2605.23908#A2.F6.sf7) we indeed note a few recognizable or interesting forms\. That is, even when VLMs are confined to only branching and rating, they can steer the evolutionary process \(though perhaps very slowly\) toward meaningful artifacts\. \(Footnote 4: One way of thinking about $\epsilon$ is that, roughly, it increases mutation strength and shortens session length, so that when $\epsilon=1$, the agent makes only one selection choice \(during branching\), and the remaining 20 generations of random selection are akin to one very drastic mutation step\.\)

In general, we note how very imbalanced VLM\-generated phylogenetic trees are relative to the human baseline \([6\(c\)](https://arxiv.org/html/2605.23908#A2.F6.sf3)\), suggesting that VLMs tend to be far more homogeneous in their branching decisions \(viz\. prone to repeatedly branch from the same image\(s\), and less likely to start from random initial populations\) as compared to human users\. Ramping up $\epsilon$ only slightly increases Tree Balance\. So, while increasing $\epsilon$ leads to individuals looking more diverse, these individuals still tend to be related to one another\. In other words, exploratory noise increases phenotypic diversity, but not genetic diversity \(where the latter might arguably lead to deeper and more meaningful long\-term variation\)\. It’s interesting that this selection bias is robust to noise in the first place; that even from among more diverse sets of images \(and in the extreme, sets of highly noisy images\), the VLM always has a clear favorite\. This kind of stubborn favoritism could be a barrier to using VLMs as an engine of open\-ended search\.

### 5\.2\.History

Removing all history beyond the current turn \($CL=0$\) collapses Semantic Recall \([7\(a\)](https://arxiv.org/html/2605.23908#A2.F7.sf1)\), because agents in this setting are prone to publishing duplicates \(note the redundancy among visual representatives in [9\(e\)](https://arxiv.org/html/2605.23908#A2.F9.sf5)\)\. Setting $CL=1$ proves surprisingly effective in terms of Semantic Recall, in spite of the relative homogeneity of the archive \(as discussed above for $\epsilon=0$, these two settings corresponding to the same set of experiments under default hyperparameters\)\. Still, the archive is less homogeneous when $CL=1$ than when $CL=0$, both in terms of visual \([9\(a\)](https://arxiv.org/html/2605.23908#A2.F9.sf1)\) and semantic \([9\(b\)](https://arxiv.org/html/2605.23908#A2.F9.sf2)\) diversity, suggesting that a little context goes a long way in preventing the agent from spiraling into a pattern of repeated or near\-identical selections\.

But such benefits neither scale nor are sustained: even incrementing to $CL=2$ results in a large hit to Semantic Recall with hardly any gain in terms of diversity\. Indeed, the archives produced when $CL=2$ remain similarly homogeneous, though the forms appear slightly less refined, and messier, presumably owing to the detrimental effect of information overload in the agent’s context\. Increasing context to $CL=10$ again results in a sharp decrease in Semantic Recall, as well as a decrease in diversity\. But it’s not quite the case that the published images under $CL=10$ simply become messier: some are quite refined, and in particular we see a huge number of duplicate entries of near\-photorealistic top\-down views of soda cans across multiple seeds \([16\(a\)](https://arxiv.org/html/2605.23908#A2.F16.sf1)\), these being almost entirely absent from runs with other $CL$\. It may be that given these larger contexts, the agent begins to fall into auto\-sycophantic loops which reinforce its own predetermined objectives, collapsing diversity but sometimes resulting in a handful of refined forms\.

With full history, $CL=20$, diversity/coverage scores reach an apex relative to other context lengths\. This may well be attributable to the additional, novelty\-pleading prompt appended in these cases \([17\(b\)](https://arxiv.org/html/2605.23908#A2.F17.sf2)\)\. But Semantic Recall, while improving over $CL=10$, does not reach the height of $CL=1$, likely owing to the often abstract and high\-frequency publications in these archives\. In general, we note that increasing context pushes the agent toward busier, apparently more complex \(and sometimes downright noisy\) images, while lower $CL$ tends to elicit starker forms, with archives often nearly entirely absent of color\.

### 5\.3\.Multiple Agents

Adding agents \(via LLM\-generated behavioral idiosyncrasies\) does not noticeably improve recall \([10\(a\)](https://arxiv.org/html/2605.23908#A2.F10.sf1)\), but results in a marked improvement in the archive’s diversity in terms of Visual and Semantic Coverage, in addition to Tree Balance \([10\(e\)](https://arxiv.org/html/2605.23908#A2.F10.sf5)\) at high agent counts\.

A small number of agents \($NA=10$\) results in a drop in Visual Coverage \([10\(b\)](https://arxiv.org/html/2605.23908#A2.F10.sf2)\), seemingly because these agents carve the grid up into subregions corresponding to their individual preferences, and remain in these subregions since constituents thereof are always likely to appear in the 100\-image archive sample during branching\. In [3\(e\)](https://arxiv.org/html/2605.23908#S1.F3.sf5), for example, we see the work of an agent who, according to their personality prompt, is “searching for the dry red color of terracotta clay”, and has accordingly produced a large set of near\-identical solid swatches of such colors\.

With a large number of agents \($NA=1{,}000$\), we see the highest Semantic Coverage and Tree Balance of any of our experimental settings\. Further analysis is needed to determine whether these agents are similarly carving up the archive into a large number of subregions\. But even so, given that there is low probability that a constituent of any given one of these $1{,}000$ supposed subregions will appear in the archive sample during branching, these agents would likely be forced to branch from an image from beyond their comfort zone\. These dynamics of artificial collaborative friction, in which agents with idiosyncratic personal agendas are forced to work with disagreeable raw materials, could provide some hope of recovering the boldness of human leaps of invention; trading self\-satisfaction for energizing internal conflict\.

And yet, unfortunately, these many\-agent grids are rife with high\-frequency, uninterpretable \(usually grayscale\) psychedelic patterns, these distinct forms making up 10\-20% of the archive while virtually absent from other experimental settings\. It’s interesting that this noise in particular \(versus that of Random, or under large $\epsilon$\) attains the highest Semantic Coverage by a wide margin\. Clearly, these images are something like adversarial attacks \(Nguyen et al\., [2015](https://arxiv.org/html/2605.23908#bib.bib40)\)\. In future work, a complementary metric measuring semantic variance of a single image over repeated rounds of captioning could help determine whether each such adversarial image consistently maps to a single agent\-preference \(and textual description\) or, perhaps more likely given the collaborative friction described above, these images serve as adversarial “hubs”, massaged by many agents to at once conform to multiple distinct preferences \(and analogously mapping to many distinct textual descriptions\)\.

## 6\.Conclusion

Are large Vision\-Language Models capable of open\-ended discovery? Can they thereby be used to automate processes that hinge on the kind of boundless creativity that has until now been viewed as a uniquely human capacity? Leveraging Picbreeder as a minimal substrate for the potential expression of such open\-ended evolutionary processes, and faithfully placing VLMs in the role of human users, we bring these models’ output into direct contact with their historical human counterparts and our own intuitions about what constitutes the special open\-ended quality of this output\.

We test these intuitions by modeling them computationally\. We find that, taken together, our metrics of Semantic Recall, phylogenetic Tree Balance, and Visual and Semantic Coverage capture a large part of our sense of meaningful qualitative variation among Picbreeder archives\. Accordingly, we find that a number of separate interventions—striking a balance in terms of the injection of exploratory noise into agents’ interactions with the system and the amount of history provided each agent, and maximizing the effective number of such agents in terms of behavioral diversity—can lead to the appearance of increased open\-ended potential in generated archives\.

These experiments also reveal phenomena demanding further investigation, such as the reliable emergence of unexpected idiosyncrasies or pathologies arising under certain settings, or the propagation of apparent adversarial images in the highly multi\-agent setting, which may present dangers or opportunities in further scaling such systems\. Above all, experimentation under more diverse conditions \(i\.e\., combining the separate insights gleaned from our interventions above\) and at larger scales \(most of the experimental settings we present here feel worthy of running for longer\) is necessary to gain a better sense of the current state of VLMs’ true open\-ended potential\. Carefully designed user studies, casting humans as judges or \(once again\) as users of the system in a controlled setting, may also prove instrumental in calibrating evaluation metrics and motivating new interventions\.

By scaling the methods developed here, we may begin to provide meaningful answers to the question of VLMs’ capacity for human\-level open\-ended discovery, and develop design principles that will allow us to augment and accelerate open\-ended processes—of scientific discovery, of the infinite generation of procedural interactive worlds, of human thought—in good conscience\.

## References

- S\. Bai, Y\. Cai, R\. Chen, K\. Chen, X\. Chen, Z\. Cheng, L\. Deng, W\. Ding, C\. Gao, C\. Ge, W\. Ge, Z\. Guo, Q\. Huang, J\. Huang, F\. Huang, B\. Hui, S\. Jiang, Z\. Li, M\. Li, M\. Li, K\. Li, Z\. Lin, J\. Lin, X\. Liu, J\. Liu, C\. Liu, Y\. Liu, D\. Liu, S\. Liu, D\. Lu, R\. Luo, C\. Lv, R\. Men, L\. Meng, X\. Ren, X\. Ren, S\. Song, Y\. Sun, J\. Tang, J\. Tu, J\. Wan, P\. Wang, P\. Wang, Q\. Wang, Y\. Wang, T\. Xie, Y\. Xu, H\. Xu, J\. Xu, Z\. Yang, M\. Yang, J\. Yang, A\. Yang, B\. Yu, F\. Zhang, H\. Zhang, X\. Zhang, B\. Zheng, H\. Zhong, J\. Zhou, F\. Zhou, J\. Zhou, Y\. Zhu, and K\. Zhu \(2025\) Qwen3\-vl technical report\. arXiv preprint arXiv:2511\.21631\. Cited by: [Appendix B](https://arxiv.org/html/2605.23908#A2.p1.1)\.
- F\. Bordes, R\. Y\. Pang, A\. Ajay, A\. C\. Li, A\. Bardes, S\. Petryk, O\. Mañas, Z\. Lin, A\. Mahmoud, B\. Jayaraman, et al\. \(2024\) An introduction to vision\-language modeling\. arXiv preprint arXiv:2405\.17247\. Cited by: [§1](https://arxiv.org/html/2605.23908#S1.p4.1)\.
- H\. Bradley, A\. Dai, H\. Teufel, J\. Zhang, K\. Oostermeijer, M\. Bellagente, J\. Clune, K\. Stanley, G\. Schott, and J\. Lehman \(2023\) Quality\-diversity through ai feedback\. Cited by: [§2](https://arxiv.org/html/2605.23908#S2.p3.1)\.
- C\. Colas, L\. Teodorescu, P\. Oudeyer, X\. Yuan, and M\. Côté \(2023\) Augmenting autotelic agents with large language models\. In Conference on Lifelong Learning Agents, pp\. 205–226\. Cited by: [§2](https://arxiv.org/html/2605.23908#S2.p1.1)\.
- G\. Comanici, E\. Bieber, M\. Schaekermann, I\. Pasupat, N\. Sachdeva, I\. Dhillon, M\. Blistein, O\. Ram, D\. Zhang, E\. Rosen, et al\. \(2025\) Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\. arXiv preprint arXiv:2507\.06261\. Cited by: [§5](https://arxiv.org/html/2605.23908#S5.p3.5)\.
- M\. Faldor, J\. Zhang, A\. Cully, and J\. Clune \(2024\) OMNI\-epic: open\-endedness via models of human notions of interestingness with environments programmed in code\. External Links: 2405\.15568\. Cited by: [§2](https://arxiv.org/html/2605.23908#S2.p3.1)\.
- A\. Gaier, A\. Asteroth, and J\. Mouret \(2019\) Are quality diversity algorithms better at generating stepping stones than objective\-based search?\. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp\. 115–116\. Cited by: [§2](https://arxiv.org/html/2605.23908#S2.p2.1)\.
- T\. F\. Gonzalez \(1985\) Clustering to minimize the maximum intercluster distance\. Theoretical computer science 38, pp\. 293–306\. Cited by: [§4\.1\.2](https://arxiv.org/html/2605.23908#S4.SS1.SSS2.p1.5)\.
- R\. M\. Harrison \(2024\) A comparison of large language model and human performance on random number generation tasks\. arXiv preprint arXiv:2408\.09656\. Cited by: [§4\.2\.2](https://arxiv.org/html/2605.23908#S4.SS2.SSS2.p1.1)\.
- M\. N\. Hebart, O\. Contier, L\. Teichmann, A\. H\. Rockter, C\. Y\. Zheng, A\. Kidder, A\. Corriveau, M\. Vaziri\-Pashkam, and C\. I\. Baker \(2023\) THINGS\-data, a multimodal collection of large\-scale datasets for investigating object representations in human brain and behavior\. Elife 12, pp\. e82580\. Cited by: [§4\.1\.1](https://arxiv.org/html/2605.23908#S4.SS1.SSS1.p1.1)\.
- M\. Klissarov, P\. D’Oro, S\. Sodhani, R\. Raileanu, P\. Bacon, P\. Vincent, A\. Zhang, and M\. Henaff \(2024\) Motif: intrinsic motivation from artificial intelligence feedback\. Cited by: [§2](https://arxiv.org/html/2605.23908#S2.p3.1)\.
- A\. Kumar, J\. Clune, J\. Lehman, and K\. O\. Stanley \(2025\) Questioning representational optimism in deep learning: the fractured entangled representation hypothesis\. arXiv preprint arXiv:2505\.11581\. Cited by: [Figure A1](https://arxiv.org/html/2605.23908#A2.F1), [Figure A1](https://arxiv.org/html/2605.23908#A2.F1.2.1), [§B\.1](https://arxiv.org/html/2605.23908#A2.SS1.p1.1), [§B\.1](https://arxiv.org/html/2605.23908#A2.SS1.p3.1), [§B\.1](https://arxiv.org/html/2605.23908#A2.SS1.p4.1), [§3\.2](https://arxiv.org/html/2605.23908#S3.SS2.p1.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\) Efficient memory management for large language model serving with pagedattention\. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles\. Cited by: [Appendix B](https://arxiv.org/html/2605.23908#A2.p1.1)\.
- J\. Lee, F\. Chen, S\. Dua, D\. Cer, M\. Shanbhogue, I\. Naim, G\. H\. Ábrego, Z\. Li, K\. Chen, H\. S\. Vera, et al\. \(2025\) Gemini embedding: generalizable embeddings from gemini\. arXiv preprint arXiv:2503\.07891\. Cited by: [§4\.1\.3](https://arxiv.org/html/2605.23908#S4.SS1.SSS3.p1.2)\.
- J\. Lehman, J\. Gordon, S\. Jain, K\. Ndousse, C\. Yeh, and K\. O\. Stanley \(2023\) Evolution through large models\. In Handbook of Evolutionary Machine Learning, pp\. 331–366\. Cited by: [§2](https://arxiv.org/html/2605.23908#S2.p3.1)\.
- J\. Lehman and K\. O\. Stanley \(2010\) Revising the evolutionary computation abstraction: minimal criteria novelty search\. In Proceedings of the 12th annual conference on Genetic and evolutionary computation, pp\. 103–110\. Cited by: [§2](https://arxiv.org/html/2605.23908#S2.p2.1)\.
- \[17\] J\. Lei, J\. Olieslagers, N\. Arfaei, and W\. J\. Ma\. Human planning in stochastic environments\. Cited by: [§4\.2\.2](https://arxiv.org/html/2605.23908#S4.SS2.SSS2.p1.1)\.
- J\. Lemant, C\. Le Sueur, V\. Manojlović, and R\. Noble \(2022\) Robust, universal tree balance indices\. Systematic biology 71\(5\), pp\. 1210–1224\. Cited by: [§4\.1\.4](https://arxiv.org/html/2605.23908#S4.SS1.SSS4.p1.1)\.
- Mario Klingemann \(2015\) Rasterfairy\. External Links: [Link](https://github.com/Quasimondo/RasterFairy)\. Cited by: [§4\.1\.2](https://arxiv.org/html/2605.23908#S4.SS1.SSS2.p1.5)\.
- \[20\] neat\-python\. Cited by: [§3\.1](https://arxiv.org/html/2605.23908#S3.SS1.p1.5)\.
- E\. Meyerson, M\. J\. Nelson, H\. Bradley, A\. Moradi, A\. K\. Hoover, and J\. Lehman \(2023\) Language model crossover: variation through few\-shot prompting\. arXiv preprint arXiv:2302\.12170\. Cited by: [§2](https://arxiv.org/html/2605.23908#S2.p3.1)\.
- L\. Muttenthaler, K\. Greff, F\. Born, B\. Spitzer, S\. Kornblith, M\. C\. Mozer, K\. Müller, T\. Unterthiner, and A\. K\. Lampinen \(2025\) Aligning machine and human visual representations across abstraction levels\. Nature 623, pp\. 349–355\. Cited by: [§4\.1\.2](https://arxiv.org/html/2605.23908#S4.SS1.SSS2.p1.5)\.
- A\. Nguyen, J\. Yosinski, and J\. Clune \(2015\) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images\. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp\. 427–436\. Cited by: [§5\.3](https://arxiv.org/html/2605.23908#S5.SS3.p4.1)\.
- A\. Nguyen, J\. Yosinski, and J\. Clune \(2016\) Understanding innovation engines: automated creativity and improved stochastic optimization via deep learning\. Evolutionary computation 24\(3\), pp\. 545–572\. Cited by: [§2](https://arxiv.org/html/2605.23908#S2.p2.1)\.
- P\. Oudeyer and F\. Kaplan \(2007\) What is intrinsic motivation? a typology of computational approaches\. Frontiers in Neurorobotics 1\. External Links: [Link](https://www.frontiersin.org/articles/10.3389/neuro.12.006.2007), [Document](https://dx.doi.org/10.3389/neuro.12.006.2007), ISSN 1662\-5218\. Cited by: [§2](https://arxiv.org/html/2605.23908#S2.p1.1)\.
- J\. Pourcel, C\. Colas, P\. Oudeyer, and L\. Teodorescu \(2023\) ACES: generating diverse programming puzzles with autotelic language models and semantic descriptors\. arXiv preprint arXiv:2310\.10692\. Cited by: [§2](https://arxiv.org/html/2605.23908#S2.p3.1)\.
- J\. Secretan, N\. Beato, D\. B\. D’Ambrosio, A\. Rodriguez, A\. Campbell, and K\. O\. Stanley \(2008\) Picbreeder: evolving pictures collaboratively online\. In Proceedings of the SIGCHI conference on human factors in computing systems, pp\. 1759–1768\. Cited by: [§1](https://arxiv.org/html/2605.23908#S1.p3.1), [§2](https://arxiv.org/html/2605.23908#S2.p1.1)\.
- J\. Secretan, N\. Beato, D\. B\. D’Ambrosio, A\. Rodriguez, A\. Campbell, J\. T\. Folsom\-Kovarik, and K\. O\. Stanley \(2011\) Picbreeder: a case study in collaborative evolutionary exploration of design space\. Evolutionary computation 19\(3\), pp\. 373–403\. Cited by: [§1](https://arxiv.org/html/2605.23908#S1.p3.1), [§2](https://arxiv.org/html/2605.23908#S2.p1.1)\.
- L\. Soros, A\. Adams, S\. Kalonaris, O\. Witkowski, and C\. Guckelsberger \(2024\) On creativity and open\-endedness\. arXiv preprint arXiv:2405\.18016\. Cited by: [§2](https://arxiv.org/html/2605.23908#S2.p1.1)\.
- K\. O\. Stanley, J\. Lehman, and L\. Soros \(2017\) Open\-endedness: the last grand challenge you’ve never heard of\. While open\-endedness could be a force for discovering intelligence, it could also be a component of AI itself\. Cited by: [§1](https://arxiv.org/html/2605.23908#S1.p2.1)\.
- K\. O\. Stanley and J\. Lehman \(2015\) Why greatness cannot be planned: the myth of the objective\. Springer, Switzerland\. Cited by: [§1](https://arxiv.org/html/2605.23908#S1.p1.1)\.
- K\. O\. Stanley and R\. Miikkulainen \(2002\) Evolving neural networks through augmenting topologies\. Evolutionary computation 10\(2\), pp\. 99–127\. Cited by: [§3\.1](https://arxiv.org/html/2605.23908#S3.SS1.p1.5)\.
- K\. O\. Stanley \(2007\) Compositional pattern producing networks: a novel abstraction of development\. Genetic programming and evolvable machines 8\(2\), pp\. 131–162\. Cited by: [§3\.1](https://arxiv.org/html/2605.23908#S3.SS1.p1.5)\.
- S\. Stepney and S\. Hickinbotham \(2024\) On the open\-endedness of detecting open\-endedness\. Artificial Life 30\(3\), pp\. 390–416\. Cited by: [§1](https://arxiv.org/html/2605.23908#S1.p2.1)\.
- R\. S\. Sutton, A\. G\. Barto, et al\. \(1998\) Reinforcement learning: an introduction\. Vol\. 1, MIT Press, Cambridge\. Cited by: [§4\.2\.2](https://arxiv.org/html/2605.23908#S4.SS2.SSS2.p2.1)\.
- M\. Tschannen, A\. Gritsenko, X\. Wang, M\. F\. Naeem, I\. Alabdulmohsin, N\. Parthasarathy, T\. Evans, L\. Beyer, Y\. Xia, B\. Mustafa, et al\. \(2025\) Siglip 2: multilingual vision\-language encoders with improved semantic understanding, localization, and dense features\. arXiv preprint arXiv:2502\.14786\. Cited by: [§4\.1\.1](https://arxiv.org/html/2605.23908#S4.SS1.SSS1.p1.1)\.
- K\. Van Koevering and J\. Kleinberg \(2024\) How random is random? evaluating the randomness and humaness of llms’ coin flips\. arXiv preprint arXiv:2406\.00092\. Cited by: [§4\.2\.2](https://arxiv.org/html/2605.23908#S4.SS2.SSS2.p1.1)\.
- P\. Villalobos, A\. Ho, J\. Sevilla, T\. Besiroglu, L\. Heim, and M\. Hobbhahn \(2022\) Will we run out of data? limits of llm scaling based on human\-generated data\. arXiv preprint arXiv:2211\.04325\. Cited by: [§1](https://arxiv.org/html/2605.23908#S1.p2.1)\.
- B\. G\. Woolley and K\. O\. Stanley \(2011\) On the deleterious effects of a priori objectives on evolution and representation\. In Proceedings of the 13th annual conference on Genetic and evolutionary computation, pp\. 957–964\. Cited by: [§2](https://arxiv.org/html/2605.23908#S2.p2.1)\.
- J\. Zhang, J\. Lehman, K\. Stanley, and J\. Clune \(2024\) OMNI: open\-endedness via models of human notions of interestingness\. In The Twelfth International Conference on Learning Representations\. Cited by: [§2](https://arxiv.org/html/2605.23908#S2.p3.1)\.

## Appendix ALimitations & Future Work

We could let the agents evolve indefinitely\. Let them restart, quit, publish multiple times, or not at all\. This is probably quite important, and likely a bottleneck on the open\-ended potential of the current system\. In this work, we constrain the agent along these lines because it makes for cleaner comparisons in terms of our evaluation metrics by guaranteeing that archives of the same size will have resulted from the same number of evolutionary generations\. However, the less constrained alternative is supported in our codebase \(by specifying `fixed_session_length=False`\)\.

During preliminary experiments, we found that agents tended to make redundant publications in quick succession under this setting \(even when warned against it\), often seeing every minor variation upon an image with which they were already satisfied as something worth sharing with the world\. This was true even when giving agents unlimited context of their current session \(though such a failure case would be even more fundamentally difficult to avoid with more limited context\), which in itself would grow prohibitively expensive with unconstrained context length \(our code falls back to trimming the oldest chat turns from history when encountering model\-specific token limits\)\.

Regarding memory, we might streamline what we put into the agent’s context, e\.g\., always keeping the archive sample present to encourage diversity, and/or keeping the agent’s original \(branching\) selection \(and/or subsequent selections\), to allow for exploration without overwhelming the agent’s context and splitting its attention to detrimental effect\.

We could perhaps fine\-tune VLMs on Picbreeder trajectories, giving them something like a long\-term memory of past experiences with the system and the running online archive, and potentially differentially imbuing variants of the same VLM with certain behavioral preferences\.

## Appendix BAdditional Results

We implement support for local models using vLLM \[Kwon et al\., [2023](https://arxiv.org/html/2605.23908#bib.bib31)\]\. Unfortunately, qwen3\-vl\-8b and qwen3\-vl\-30b\-fp8 \[Bai et al\., [2025](https://arxiv.org/html/2605.23908#bib.bib32)\] generate mostly high\-frequency noise\. Perhaps larger open\-source models will lead to better results\. This is a crucial line of future inquiry due to the expensive nature \(in terms of API queries\) of the current overall loop\.

Complementary to Semantic Recall, we implement a Semantic Fidelity metric, which takes the average of the best similarity in text\-image embedding space between each image and any noun \(as opposed to between each noun and any image\)\. Results are shown in Figs\. [A11](https://arxiv.org/html/2605.23908#A2.F11)\-[A13](https://arxiv.org/html/2605.23908#A2.F13)\. However, this metric can easily be gamed by endlessly reproducing a single semantically salient image\.
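
To make the contrast between the two metrics concrete, the sketch below computes both from a matrix of image-noun similarities. It is an illustration under stated assumptions, not the authors' implementation: cosine similarity is assumed as the similarity measure, the function names are hypothetical, and the two-dimensional embeddings are toy values standing in for real text-image embeddings.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(images, nouns) -> List[List[float]]:
    """Rows index images, columns index nouns."""
    return [[cosine(img, noun) for noun in nouns] for img in images]


def semantic_recall(sim: List[List[float]]) -> float:
    """Average over nouns of the best similarity to any image."""
    n_nouns = len(sim[0])
    return sum(max(row[j] for row in sim) for j in range(n_nouns)) / n_nouns


def semantic_fidelity(sim: List[List[float]]) -> float:
    """Average over images of the best similarity to any noun."""
    return sum(max(row) for row in sim) / len(sim)


# Toy embeddings: two nouns, and an archive that repeats one salient image.
nouns = [[1.0, 0.0], [0.0, 1.0]]
collapsed_archive = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
diverse_archive = [[1.0, 0.0], [0.0, 1.0]]

for name, archive in [("collapsed", collapsed_archive), ("diverse", diverse_archive)]:
    sim = similarity_matrix(archive, nouns)
    print(name, semantic_recall(sim), semantic_fidelity(sim))
```

In this toy case the collapsed archive keeps a perfect Fidelity score while its Recall drops, which is the gaming behavior described above.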

### B\.1\. Internal CPPN Representations

Kumar et al\. \[[2025](https://arxiv.org/html/2605.23908#bib.bib10)\] argue that optimizing toward fixed objectives via Stochastic Gradient Descent \(SGD\) leads to models with “fractured, entangled” representations\. With the human Picbreeder experiment as their counterexample, they argue that, by contrast, open\-ended search results in models with “unified” representations more aligned with human intuition\. The models in question here are CPPNs, but their argument extends to deep neural networks in general; they point to similar “entanglement” among GPT\-3’s representations, e\.g\. its inability to count farm animals, in contrast to its ability to count office supplies\.

In this light, the present work can be seen as asking whether we can hope to automate open\-ended search over neural network substrates—sans any humans\-in\-the\-loop—with the same beneficial outcome for the internal representations of these models\. If we can automate such a search, we may by extension be able to automate AI research itself, using VLMs to guide the training of new generations of VLMs\. In such an open\-ended AI\-generating system, we could imagine for example that these next\-generation VLMs would be trained by their forebears with a series of diverse objectives, or on a curriculum of diverse corpora, as opposed to over a monolithic corpus with a single autoregressive objective\.

With this question in mind, we replicate the analysis of internal CPPN representations in \[Kumar et al\., [2025](https://arxiv.org/html/2605.23908#bib.bib10)\] on a CPPN representation of a skull resulting from VLM\-driven Picbreeder in [Fig\. A1](https://arxiv.org/html/2605.23908#A2.F1)\. Using a CPPN that generates an image of a skull resembling that from the human experiment \(though considerably less refined\), we sweep the value of each individual weight of the network, adding offsets in $[-1, 1]$ to the weight’s original value\. We visualize the weight sweeps that lead to the greatest difference from the original image, in terms of pixel distance, at the extremes of the weight’s range\. We find that sweeping individual weights leads to relatively smooth changes to the image, and that the perturbed images still mostly resemble skulls, whereas perturbations applied to the weights of the SGD\-generated CPPN in \[Kumar et al\., [2025](https://arxiv.org/html/2605.23908#bib.bib10)\] led to more chaotic and destructive changes to the image\. However, we do not find any of the clean semantic labels \(like “mouth opening” or “eye winking”\) recovered from the human\-generated CPPN in \[Kumar et al\., [2025](https://arxiv.org/html/2605.23908#bib.bib10)\]\.
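
The sweep procedure described above can be sketched as follows. This is a hedged illustration rather than the analysis code used for the figure: `render` is a hypothetical stand-in for rendering a CPPN from a flat weight list, `toy_render` is an invented three-weight function (not a real CPPN), mean absolute difference is assumed as the pixel distance, and scoring each weight by the larger of its two extreme distances is likewise an assumption.

```python
import math
from typing import Callable, List, Sequence, Tuple


def pixel_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute per-pixel difference (one possible pixel distance)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def rank_weight_sweeps(
    render: Callable[[Sequence[float]], List[float]],
    weights: Sequence[float],
    delta: float = 1.0,
) -> List[Tuple[int, float]]:
    """Rank weights by how far the image moves at the sweep extremes."""
    base = render(weights)
    scores = []
    for i in range(len(weights)):
        extremes = []
        for offset in (-delta, delta):  # the two ends of the [-1, 1] sweep
            perturbed = list(weights)
            perturbed[i] += offset
            extremes.append(pixel_distance(render(perturbed), base))
        scores.append((i, max(extremes)))
    # Most influential weights first.
    return sorted(scores, key=lambda s: s[1], reverse=True)


def toy_render(weights: Sequence[float], size: int = 8) -> List[float]:
    """Invented stand-in for a CPPN: weight 2 is deliberately unused."""
    image = []
    for row in range(size):
        for col in range(size):
            x = 2 * col / (size - 1) - 1
            y = 2 * row / (size - 1) - 1
            image.append(math.tanh(weights[0] * x) + 0.01 * weights[1] * y)
    return image


for index, score in rank_weight_sweeps(toy_render, [0.5, 0.2, 0.3]):
    print(index, round(score, 4))
```

The top-ranked weights from such a ranking are the ones whose intermediate sweep images would then be rendered and inspected by eye.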

Overall, this would seem to be a promising result, which suggests that by further refining the strategies introduced here for VLM\-guided open\-ended search, we could produce models with increasingly unified representations\. It is worth noting some potentially confounding factors, though, namely the use of NEAT\-style evolution compared to SGD over a fixed\-topology network, and the initial difference between the skulls here and in \[Kumar et al\., [2025](https://arxiv.org/html/2605.23908#bib.bib10)\]\.

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/weight_sweeps/weight_modulation_grid_grayscale.png)

Figure A1\. Visualization of internal representations resulting from VLM\-driven evolution\. We apply perturbations from $[-1, 1]$ to each weight in the CPPN, and display the weights that lead to the greatest difference in terms of pixel distance from the initial image at the extremes of this range\. These representations are not nearly as “fractured” as those resulting from SGD over a fixed\-topology CPPN in \[Kumar et al\., [2025](https://arxiv.org/html/2605.23908#bib.bib10)\], but neither are they so neatly factorized as to correspond to features like mouths or eyes opening or closing\.

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/Picbreeder-2_p0.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/Picbreeder-2_p1.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/Picbreeder-2_p2.png)

Figure A2\. Snapshots of the original Picbreeder webpage, recovered via the Wayback Machine\. We mimic a human user’s exposure to this home page by displaying, at the beginning of each VLM agent’s Picbreeder session, a sample of the archive generated thus far comprising top rated, best new, most branched, and random subsamples\. Absent from our re\-implementation are semantic tags, “Editor’s Picks”, user information, image titles, and the ability to further browse the site/archive\.

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/human_generations_hist.png)

Figure A3\. Distribution of human Picbreeder session lengths\.

Table A1\. Human Picbreeder session length statistics\.

Table A2\. Sample of LLM\-generated personality traits used when $NA > 0$\. To generate these traits, we give gemini\-3\-pro\-preview the Picbreeder VLM system prompt \([Fig\. A17](https://arxiv.org/html/2605.23908#A2.F17)\) and ask for personality traits that may implicitly affect an agent’s behavior on this task \([Fig\. A18](https://arxiv.org/html/2605.23908#A2.F18)\)\.

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/noise/aggregate_noun_similarity_combined_things_deduped_full_rand_select_prob_ViT-SO400M-14-SigLIP2.png) \(a\) Semantic Recall within the Picbreeder archive over the course of collaborative evolution\.
![Refer to caption](https://arxiv.org/html/2605.23908v1/x11.png) \(b\) $\epsilon = 0$
![Refer to caption](https://arxiv.org/html/2605.23908v1/x12.png) \(c\) $\epsilon = 0.25$
![Refer to caption](https://arxiv.org/html/2605.23908v1/x13.png) \(d\) $\epsilon = 1$

\(e\)Most semantically salient images in the archive, from seeds with the highest Semantic Recall\. Archive with highest Semantic Recall \([4\(c\)](https://arxiv.org/html/2605.23908#A2.F4.sf3)\) is outlined\.

Figure A4\. Effect of exploration \($\epsilon$\-greedy\) on Semantic Recall within the Picbreeder archive\. Forcing agents to make some random selections \(i\.e\., with probability 0\.25\) can improve the quality of the archive, with the Semantic Recall score approaching that of the historical human baseline\. Large amounts of random parent selection \($\epsilon \geq 0.5$\) and a fully random baseline \(in which branching, publication, and archive rating decisions are also random\) are detrimental\.

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/noise/aggregate_visual_k_covering_k100_final_full_rand_select_prob_SigLIP2-B-alignet.png) \(a\) Visual Coverage
![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/noise/aggregate_caption_k_covering_k100_final_full_rand_select_prob_gemini-2.5-pro_gemini-embedding-001.png) \(b\) Semantic Coverage
![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/noise/aggregate_j1_index_full_rand_select_prob.png) \(c\) Phylogenetic Tree Balance

\(d\) Diversity measures of Picbreeder archives after 2,000 agent sessions\.
![Refer to caption](https://arxiv.org/html/2605.23908v1/x14.png) \(e\) $\epsilon = 0$
![Refer to caption](https://arxiv.org/html/2605.23908v1/x15.png) \(f\) $\epsilon = 0.25$
![Refer to caption](https://arxiv.org/html/2605.23908v1/x16.png) \(g\) $\epsilon = 1$

\(h\)Visually representative images from the archive, from seeds with the highest Visual Coverage\. Archive with highest Visual Coverage \([6\(f\)](https://arxiv.org/html/2605.23908#A2.F6.sf6)\) is outlined\.

Figure A6\. Effect of exploration \($\epsilon$\-greedy\) on the diversity of the Picbreeder archive\. A moderate amount of noise can increase Visual and Semantic Coverage and Tree Balance, but, in excess, reduces the legibility of generated images \(cf\. [6\(g\)](https://arxiv.org/html/2605.23908#A2.F6.sf7), [Fig\. A4](https://arxiv.org/html/2605.23908#A2.F4)\)\.

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/momentum/aggregate_noun_similarity_combined_things_deduped_chat_history_turns_ViT-SO400M-14-SigLIP2.png) \(a\) Semantic Recall score within the Picbreeder archive over the course of collaborative evolution\.
![Refer to caption](https://arxiv.org/html/2605.23908v1/x17.png) \(b\) Context length = 0
![Refer to caption](https://arxiv.org/html/2605.23908v1/x18.png) \(c\) Context length = 10
![Refer to caption](https://arxiv.org/html/2605.23908v1/x19.png) \(d\) Context length = 20 \(full\)

\(e\)Most semantically salient images in the archive, from seeds with the highest Semantic Recall\. Archive with highest Semantic Recall \([7\(d\)](https://arxiv.org/html/2605.23908#A2.F7.sf4)\) is outlined\.

Figure A7\. Effect of history \(i\.e\., Context Length $CL$, the number of previous actions included in an agent’s context\) on Semantic Recall within the Picbreeder archive\. Without any context, mode collapse is common, leading to reduced recall\. $CL = 1$ proves to be a surprisingly effective sweet spot, with larger $CL$ leading to overly noisy/abstract forms\.

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/momentum/aggregate_visual_k_covering_k100_final_chat_history_turns_SigLIP2-B-alignet.png) \(a\) Visual Coverage
![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/momentum/aggregate_caption_k_covering_k100_final_chat_history_turns_gemini-2.5-pro_gemini-embedding-001.png) \(b\) Semantic Coverage
![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/momentum/aggregate_j1_index_chat_history_turns.png) \(c\) Phylogenetic Tree Balance

\(d\) Diversity measures of Picbreeder archives after 2,000 agent sessions\.
![Refer to caption](https://arxiv.org/html/2605.23908v1/x20.png) \(e\) Context length = 0
![Refer to caption](https://arxiv.org/html/2605.23908v1/x21.png) \(f\) Context length = 10
![Refer to caption](https://arxiv.org/html/2605.23908v1/x22.png) \(g\) Context length = 20 \(full\)

\(h\)Visually representative images from the archive, from seeds with the highest visual coverage\. Archive with highest Visual Coverage \([9\(g\)](https://arxiv.org/html/2605.23908#A2.F9.sf7)\) is outlined\.

Figure A9\. Effect of history \(i\.e\., Context Length $CL$, the number of previous actions included in an agent’s context\) on diversity within the Picbreeder archive\. Increasing $CL$ increases diversity, but with rapidly diminishing returns\. $CL = 20$ is an exception; here, diversity peaks, likely because in this case the agent is prompted with an additional note encouraging its publication to be novel w\.r\.t\. the still\-visible archive sample \([17\(b\)](https://arxiv.org/html/2605.23908#A2.F17.sf2)\)\. But the images in these archives are more noisy/abstract, as reflected in their low Semantic Recall scores \([7\(a\)](https://arxiv.org/html/2605.23908#A2.F7.sf1)\)\. This may be due to the context window being overloaded, leading to decreased VLM performance in general; or due to the incentivization of novelty combined with forced publication steps effectively leading to the premature publication of semantically ill\-defined works\-in\-progress\.

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/traits/aggregate_noun_similarity_combined_things_deduped_traits_ViT-SO400M-14-SigLIP2.png) \(a\) Semantic Recall score within the Picbreeder archive over the course of collaborative evolution\.
![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/traits/aggregate_visual_k_covering_k100_final_traits_SigLIP2-B-alignet.png) \(b\) Visual Coverage
![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/traits/aggregate_caption_k_covering_k100_final_traits_gemini-2.5-pro_gemini-embedding-001.png) \(c\) Semantic Coverage
![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/traits/aggregate_j1_index_traits.png) \(d\) Phylogenetic Tree Balance

\(e\) Diversity measures of Picbreeder archives after 2,000 agent sessions\.

Figure A10\. Effect of the number of agents $NA$ \(i\.e\., the number of distinct personality traits, see [Table A2](https://arxiv.org/html/2605.23908#A2.T2), distributed among Picbreeder sessions\) on Semantic Recall score and diversity metrics\. Increasing $NA$ improves various metrics of diversity without harming Semantic Recall\.

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/noise/aggregate_noun_per_image_combined_things_deduped_full_rand_select_prob_ViT-SO400M-14-SigLIP2.png)

Figure A11\. Effect of exploration \($\epsilon$\-greedy\) on the Fidelity of the Picbreeder archive\. Greedy strategies can game this metric by refining a small set of images and flooding the archive with near\-duplicates\.

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/momentum/aggregate_noun_per_image_combined_things_deduped_chat_history_turns_ViT-SO400M-14-SigLIP2.png)

Figure A12\. Effect of history \(number of previous actions included in an agent’s context/memory\) on the Fidelity of the Picbreeder archive\. Agents without history perform best by refining the current form, without any incentive to push past local optima\.

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/traits/aggregate_noun_per_image_combined_things_deduped_traits_ViT-SO400M-14-SigLIP2.png)

Figure A13\. Effect of multiple agents \(number of distinct personality traits assigned\) on the Fidelity of the Picbreeder archive\. Adding agents reduces Fidelity by inhibiting mode collapse\.

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/model/aggregate_noun_similarity_combined_things_deduped_model_ViT-SO400M-14-SigLIP2.png)

Figure A14\. Effect of the choice of VLM model on the Semantic Recall of the Picbreeder archive\. In the gemini\-random setting, each agent is randomly assigned to one of the other gemini models shown in this plot\. Surprisingly, gemini\-2\.5\-pro significantly outperforms all other model choices, including gemini\-3\-pro\-preview\.

![Refer to caption](https://arxiv.org/html/2605.23908v1/x23.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/x24.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/x25.png)

\(a\) gemini\-3\-pro\-preview
![Refer to caption](https://arxiv.org/html/2605.23908v1/x26.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/x27.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/x28.png)

\(b\) gemini\-2\.5\-pro
![Refer to caption](https://arxiv.org/html/2605.23908v1/x29.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/x30.png)

![Refer to caption](https://arxiv.org/html/2605.23908v1/x31.png)

\(c\) gemini\-2\.5\-flash\-lite

Figure A15\. Effect of the choice of VLM model on Picbreeder archives after 500 agent sessions\. For each model, samples of 3 archives from different random seeds are shown\. Samples are generated by selecting images at uniform intervals with respect to publication order\. We note that gemini\-3\-pro\-preview is prone to a kind of mode collapse in the collaborative archive, often obsessing over mushroom\-like forms in particular\. gemini\-2\.5\-flash\-lite, meanwhile, tends to flood the archive with abstract, high\-frequency, psychedelic patterns \(though it also discovers skull\-like forms, whose internal representations we evaluate in [Fig\. A1](https://arxiv.org/html/2605.23908#A2.F1), resembling the admittedly more refined skull generated in the human baseline\)\.

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/attractors/embed_grid_rect_SigLIP2-B-alignet_umap_seed4_soda_attractor.png) \(a\) Soda can pull tabs
![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/attractors/embed_grid_rect_SigLIP2-B-alignet_umap_seed3_mask_attractor.png) \(b\) Masks
![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/attractors/embed_grid_rect_SigLIP2-B-alignet_umap_seed8_cars.png) \(c\) Cars
![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/attractors/embed_grid_rect_SigLIP2-B-alignet_umap_seed5_fish_attractor_2.png) \(d\) Fish bones
![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/attractors/embed_grid_rect_SigLIP2-B-alignet_umap_seed3_goose_attractor.png) \(e\) Geese
![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/attractors/embed_grid_rect_SigLIP2-B-alignet_umap_seed5_fox_attractor.png) \(f\) Foxes

Figure A16\. Semantic attractors\. A common failure case of VLMs when playing Picbreeder is their tendency to fall into apparent attractors \(mode collapse\) in CPPN\-image space\. We show snapshots of various archives, with images arranged according to visual embedding distance, selecting subregions that showcase such attractors\.

Picbreeder VLM System Prompt

You are playing with a collaborative online platform which allows users to interactively evolve small neural networks called Compositional Pattern Producing Networks \(CPPNs\) for generating images\. Your goal is to evolve images that resemble familiar real\-world objects\. At the first generation the initial grid will display an archive of images published by prior users as favorites \(unless you are the first user\)\. You may choose to "branch" one of these images, or start instead from a random initial population\. At each subsequent generation, you will be shown a set of numbered images produced by CPPNs\. Pick one or several images by their numeric labels–the corresponding CPPNs will be used as the parents of the next generation \(using both mutation and crossover\)\. Your session will last 20 generations\. At generation 19 \(the final generation\), you will select one image to publish to the online archive\. Respond with JSON only: \{"selected": \[indices\], "rationale": "brief explanation"\}\. \(During branching, you may select only one image from which to branch; set "selected" to null to start from a fresh population\.\) When publishing, include a "publish" field in the JSON response to publish an image from the current population\. It should have the form: \{"index": image\_index, "title": "Image Title", "reason": "Brief publication note\."\}\. By default, you will be presented with grayscale versions of the images\. Respond with a JSON containing a single "color" field set to true/false to switch between color/grayscale images\. \(This response does not affect which images are selected for breeding; it only changes how the current grid is displayed\. Include no other fields in the JSON in this case\.\) You should work in grayscale around 78% of the time\. Color images should comprise 64% of the final archive\. If "color" is on, then at each generation, you may choose to mutate only an isolated subnetwork of the CPPN affecting color or structure, or to mutate the entire CPPN\. Indicate your choice in a "mutation\_mode" field in your JSON response, set to either "color\_only", "structure\_only", or "all"\. You also control a mutation\-strength slider: set a "mutation\_strength" value between 0\.0 \("Small Changes" – extremely gentle mutations\) and 1\.0 \("Big Changes" – very strong mutations\)\. If you omit the field, the slider remains at its previous value\.

\(a\) VLM system prompt\.

Picbreeder VLM Novelty Prompt

When justifying your publication choice, explain why the selected contribution is valuable to the archive\. Identify the most similar entry in the archive \(or the most similar of your prior publications\) and explain how your selection meaningfully differs from it\. Do not publish images that are redundant or boring\. You will be judged by a discerning online community for your contributions\.

\(b\) VLM novelty prompt\. Appended to the system prompt when Context Length $CL = 20$, i\.e\., when the VLM agent always receives the full history of the current Picbreeder session as well as the archive sample with which it was initially presented for branching\.

Figure A17\. Prompt components used by the Picbreeder VLM agent\.

Personality Generation Prompt

You are an assistant that produces creative and diverse personality traits for an AI agent\. The agent will be performing a task described in the prompt provided\. For each request, return strictly valid JSON: an array of strings \(length exactly equal to the requested batch size\)\. Each string should be a personality trait in the second person \(e\.g\., ‘You like driving at night’\)\. The traits should be unique and distinct\. Do not include commentary or explanatory text outside the JSON array\.
Here is the system prompt for the task the agent will be performing:
— START SYSTEM PROMPT —\[SYSTEM PROMPT HERE\]— END SYSTEM PROMPT —
Please generate a list of 100 unique personality traits, which you think may implicitly/indirectly affect the agent’s behavior on the task at hand\. They can be positive, negative, ambivalent, ambiguous; abstract or concrete; related or unrelated to the task at hand in a literal sense\. Anything that might add some unique, more or less subtle quirk or quality to the agent’s behavior in the given task\. The traits should be written in second person: ‘You like driving at night’, ‘Your favorite ice cream flavor is rocky road\.’, ‘Sunsets remind you of your ex’ etc\. Avoid giving explicit goals or optimization objectives, focus on individual traits that might influence behavior in various ways\.

Figure A18\. Prompt for generating personality traits used when $NA > 0$ \(see [Table A2](https://arxiv.org/html/2605.23908#A2.T2) for sample output from gemini\-3\-pro\-preview\)\. The VLM Picbreeder system prompt \([17\(a\)](https://arxiv.org/html/2605.23908#A2.F17.sf1)\) is injected where indicated\.

![Refer to caption](https://arxiv.org/html/2605.23908v1/figs/attractors/embed_grid_rect_SigLIP2-B-alignet_umap_1000-traits_mush.png)

Figure A19\. Sample of an archive with Number of Agents $NA = 1{,}000$, with images arranged according to visual embedding distance\. We select a region of the archive that showcases the noisy, potentially adversarial images that seem to emerge at large $NA$\. These might be owing to a large number of traits imbuing agents with an eye for relatively abstract properties of an image, where VLMs focused on roleplaying may be keen to project the satisfaction of such abstract inclinations onto otherwise meaningless forms\. See [Table A2](https://arxiv.org/html/2605.23908#A2.T2); traits like 32 “You are drawn to the aesthetic of bad analog TV reception” even explicitly incentivize noise, despite our prohibiting such explicit search objectives \([Fig\. A18](https://arxiv.org/html/2605.23908#A2.F18)\)\.
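
As an aside on the response format specified in the VLM system prompt (Fig. A17(a)), a reply in that format could be validated along the following lines. This is only a sketch built from the fields the prompt names; the function `parse_agent_response`, the `valid_labels` argument, the returned dictionary layout, and the rule that a selection list must be non-empty are assumptions for illustration, and the handling in the released codebase may differ.

```python
import json
from typing import Any, Dict, Set

MUTATION_MODES = {"color_only", "structure_only", "all"}


def parse_agent_response(raw: str, valid_labels: Set[int]) -> Dict[str, Any]:
    """Validate one JSON reply against the fields named in the system prompt."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")

    # A display toggle must be the only field in the response.
    if "color" in data:
        if set(data) != {"color"} or not isinstance(data["color"], bool):
            raise ValueError("color toggle must be a lone true/false field")
        return {"kind": "color_toggle", "color": data["color"]}

    selected = data.get("selected")
    if selected is not None:  # null means: start from a fresh population
        if not isinstance(selected, list) or not selected:
            raise ValueError("'selected' must be a non-empty list or null")
        if any(type(i) is not int or i not in valid_labels for i in selected):
            raise ValueError("'selected' contains an unknown image label")

    mode = data.get("mutation_mode")
    if mode is not None and (not isinstance(mode, str) or mode not in MUTATION_MODES):
        raise ValueError("unknown 'mutation_mode'")

    strength = data.get("mutation_strength")
    if strength is not None:
        is_number = isinstance(strength, (int, float)) and not isinstance(strength, bool)
        if not is_number or not 0.0 <= strength <= 1.0:
            raise ValueError("'mutation_strength' must lie in [0.0, 1.0]")

    publish = data.get("publish")
    if publish is not None:
        if not isinstance(publish, dict) or not {"index", "title", "reason"} <= set(publish):
            raise ValueError("'publish' needs index, title, and reason")
        if type(publish["index"]) is not int or publish["index"] not in valid_labels:
            raise ValueError("'publish' index is not a visible image label")

    return {
        "kind": "selection",
        "selected": selected,
        "rationale": data.get("rationale", ""),
        "mutation_mode": mode,  # None means the field was omitted
        "mutation_strength": strength,  # None means: keep the previous value
        "publish": publish,
    }


reply = '{"selected": [1, 3], "rationale": "closest to a face", "mutation_strength": 0.25}'
print(parse_agent_response(reply, valid_labels=set(range(12))))
```

Passing the set of currently visible labels explicitly avoids assuming whether the grid is numbered from zero or from one, which the prompt does not state.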
