I built my 'first' flow matching image generator, here's what I learned [P]

Reddit r/MachineLearning 07/04/26, 05:46 AM Models

Summary

The author shares their experience building a small flow matching image generation model trained on Apple emoji images, describing the initial failed approach and the successful pivot using RGB channels, residual blocks, and attention.

Today I put out my first flow matching image generation model! This is a toy example trained on a 2024 MacBook Pro (MPS backend) using a small sample of images: specifically, the Apple emoji library and their text labels. Because of this, it's not a massive model (clocking in at ~4.7 million parameters), but it was an incredible learning experience.

My original approach (which failed): I initially tried taking the emoji images, converting them to grayscale, and using an extremely basic CNN to expand the single gray channel into 64 feature maps. I applied a ReLU activation, repeated this three times, and then collapsed the result back into a single channel to get the final prediction of the velocity field (the network with parameters theta). I coupled this with CLIP text embeddings (computed from the official Apple emoji descriptions) and a simple time encoding for the time t, where the state at time t is an interpolation between the noise sample (x_0) and the target image (x_1). This approach wasn't expressive enough for the model to actually learn to predict the velocity field, especially since I was using float32 to keep the model lightweight.

The Pivot (What worked): To fix this, I switched to using full RGB channels instead of grayscale, implemented residual blocks, and added self/cross-attention. I also increased the number of feature channels to allow the network to retain more information about the emojis themselves. This worked much better. When predicting a velocity field for emojis, color is an incredibly important cue, and having more capacity allowed the text embeddings to form a much more meaningful relationship with the visual features during inference.

The model is completely free to play around with here: https://emoji-generator-69.web.app/
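For anyone new to flow matching, the interpolation between x_0 and x_1 and the velocity target mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code behind the model in this post: it assumes the common straight-line path x_t = (1 - t) * x_0 + t * x_1, and every name here (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`, `oracle_velocity`) is hypothetical. A real implementation would use tensors and a trained network in place of the oracle.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to image x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t for this path."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


# Toy data (assumption): a flattened 2x2 RGB "image" and Gaussian noise.
rng = random.Random(0)
x1 = [rng.random() for _ in range(12)]
x0 = [rng.gauss(0.0, 1.0) for _ in range(12)]


def oracle_velocity(x, t):
    """Stand-in for a trained network: it already knows the target x1.

    On the straight path, (x1 - x_t) / (1 - t) equals x1 - x0 for t < 1.
    """
    return [(b - a) / (1.0 - t) for a, b in zip(x, x1)]


loss = flow_matching_loss(oracle_velocity, x0, x1, t=0.3)
sample = euler_sample(oracle_velocity, x0, steps=10)
```

In training, t would be drawn at random for each example and the loss minimized over the network's parameters; at generation time only `euler_sample` is needed, starting from fresh noise. The text conditioning described in the post would enter as an extra argument to the velocity predictor, which this sketch leaves out.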

