I built my 'first' flow matching image generator, here's what I learned [P]

Reddit r/MachineLearning Models

Summary

The author shares their experience building a small flow matching image generation model trained on Apple emoji images, describing the initial failed approach and the successful pivot using RGB channels, residual blocks, and attention.

Today I put out my first flow matching image generation model! This is a toy example trained on a 2024 MacBook Pro (MPS backend) using a small set of images: the Apple emoji library and their text labels. Because of this, it's not a massive model (about 4.7 million parameters), but it was an incredible learning experience.

My original approach (which failed): I initially converted the emoji images to grayscale and used an extremely basic CNN to expand the single gray channel into 64 feature maps. I applied a ReLU activation, repeated this three times, and then collapsed the features back into a single channel to get the model's final prediction (parameterized by theta) of the velocity field. I coupled this with CLIP text embeddings (computed from the official Apple emoji descriptions) and a simple encoding of the time t, which sets how far the current state sits along the interpolation between the noise sample (x_0) and the target image (x_1). This approach wasn't expressive enough for the model to actually learn to predict the velocity field, especially since I was using float32 to keep the model lightweight.

The pivot (what worked): To fix this, I switched to full RGB channels instead of grayscale, implemented residual blocks, and added self-attention and cross-attention. I also increased the number of feature channels so the network could retain more information about the emojis themselves. This worked much better. When predicting a velocity field for emojis, color is an incredibly important signal, and the extra capacity let the text embeddings form a much more meaningful relationship with the visual features during inference.

The model is completely free to play around with here: https://emoji-generator-69.web.app/
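For anyone who wants to see the interpolation and velocity target in code, here is a minimal NumPy sketch of the flow matching objective and a basic Euler sampler. It is not the code behind the emoji model: it assumes the common straight-line path x_t = (1 - t) * x_0 + t * x_1 with target velocity x_1 - x_0, it leaves out the CLIP text conditioning, and every name in it (interpolate, target_velocity, flow_matching_loss, euler_sample, predict_velocity) is made up for illustration.

```python
import numpy as np


def interpolate(x0, x1, t):
    """State x_t on the straight line from noise x0 (t=0) to image x1 (t=1)."""
    t = t.reshape(-1, 1, 1, 1)  # broadcast one time per sample over (C, H, W)
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Velocity of the straight-line path; it does not depend on t."""
    return x1 - x0


def flow_matching_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and target velocity.

    predict_velocity(x_t, t) stands in for the network (hypothetical callable).
    """
    x0 = rng.standard_normal(x1.shape)           # fresh noise for each image
    t = rng.uniform(0.0, 1.0, size=x1.shape[0])  # one random time per image
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    return float(np.mean((v_pred - target_velocity(x0, x1)) ** 2))


def euler_sample(predict_velocity, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = np.full(x.shape[0], i * dt)
        x = x + dt * predict_velocity(x, t)
    return x
```

In training, flow_matching_loss would be minimized with respect to the network's parameters. At generation time, euler_sample starts from pure noise and follows the learned velocity field toward an image. In a text-conditioned version, the prompt embedding would be passed to predict_velocity as an extra input.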
