@stephenbtl: My talk at @aiDotEngineer is now online. I talked about our research and where @bfl_ml is heading. Thanks @swyx for the…
Summary
Black Forest Labs shared the evolution of the Flux series models at the AI Engineer Conference and released the SelfFlow research paper, proposing a self-supervised multimodal training method that does not require external encoders.
My talk at @aiDotEngineer is now online.
I talked about our research and where @bfl_ml is heading.
Thanks @swyx for the invite
https://t.co/Qs4Hv0UDb1
Black Forest Labs and Flux Model Background
Black Forest Labs (BFL) is made up of the core team behind Stable Diffusion, Latent Diffusion, and the Flux models. The team’s publications have been cited more than 200,000 times, and BFL both develops models and works with enterprises to apply them. Customers include Microsoft, Adobe, Canva, and Mistral.
BFL’s primary operating principle is to release state-of-the-art (SOTA) models and to advance the field by openly sharing its research. The company started with Flux 1, released in August 2024, which was its first major breakthrough.
Evolution of the Flux Series Models
Flux 1: Open Source Breakthrough
Flux 1 focuses on text-to-image generation. It was a game-changer because it is small enough to run on a laptop, yet it compares well with larger models on generation quality, especially human anatomy. At release, it was the most liked model on Hugging Face, a huge success for BFL as a new company.
Flux Kontext: Editing and Narrative
The next release, Flux Kontext (FLUX.1 Kontext), was presented as the world’s first open-source editing model. It combines text-to-image generation and image editing in a single model, which was a major breakthrough at the time.
- Speed Advantage: At a time when generating or editing an image with the early GPT Image model took 40 to 50 seconds, Kontext needed only 7 to 8 seconds.
- Consistent Editing: The model keeps characters highly consistent across edits. Examples include removing snowflakes from a face, moving the character to take a selfie on the streets of Freiburg (where BFL is headquartered), and changing the scene to snowfall so that snowflakes naturally cover the character’s face.
- Narrative Application: The model is useful upstream of video and animation models. Starting from a single image (e.g., a seagull wearing a VR headset drinking in a bar), users can build a storyboard by adding friends or changing the scene (e.g., walking outside). The resulting images can then be passed to a video model as start or end frames to generate coherent content.
Flux 2: Visual Intelligence and Multi-Image Reference
In November 2025, BFL released Flux 2, a step towards “Visual Intelligence.” As a foundation model, Flux 2 reaches a level of image quality at which its outputs are difficult to identify as AI-generated, with fine detail such as hand veins, worn bracelets, and animal fur.
- Multimodal Capabilities: Beyond people and animals, it also handles professional product photography; examples shown include waffles and a person tied to balloons on a scooter.
- Multi-Image Reference Editing: Flux 2 was presented as the first model to support multi-image references, accepting up to 10 images at once.
  - Outfit Generation: Given six input images and the prompt “Create an outfit using these images,” the model produces a sensible combination (such as a proper jacket and tie).
  - Product Placement: For example, placing an image of a sofa into a customer’s living room so they can see how it would look.
  - Consistency: Characters, products, and styles stay highly consistent.
Interactive Generation Speed
The updated version of Flux 2, released in January 2026, moves towards interactive editing and generation.
- Real-Time Speed: Images can be generated and edited in under one second. The fastest edits take as little as 500 milliseconds, and generation as little as 300 milliseconds.
Challenges in Model Training: Representation Alignment
As a research company, BFL puts a core focus on publishing its results. Training generative models for images, video, and audio runs into a fundamental problem: the models do not acquire physical common sense.
- Nature of the Problem: Training usually consists of adding random noise to images and learning to remove it. Nothing in that objective explicitly teaches the model that “a glass shouldn’t pass through a table” or that a person sitting on a chair shouldn’t clip through it.
- Traditional Solution: Use “Representation Alignment.” An external model (such as an image encoder) that already captures these relationships guides the generative model towards the correct relationships between objects.
- Effect: With external alignment, convergence (the rate at which the loss falls) can be 70 times faster.
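As a rough illustration of the two ingredients above, the sketch below pairs a flow-matching style noising step with an alignment term that pulls the generator's hidden features towards those of a frozen external encoder. It is a toy under stated assumptions, not BFL's training code: the linear noising schedule, the cosine-based alignment term, the `weight` parameter, and all function names are hypothetical choices made for illustration.

```python
import math
import random


def noisy_sample(x0, noise, t):
    """Blend clean data (t=0) with pure noise (t=1); the linear schedule is an assumption."""
    return [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]


def velocity_target(x0, noise):
    """Regression target for a flow-matching style objective: noise minus data."""
    return [n - a for a, n in zip(x0, noise)]


def mse(pred, target):
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def aligned_loss(pred_velocity, hidden, x0, noise, encoder_features, weight=0.5):
    """Generation loss plus a hypothetical alignment penalty.

    `hidden` stands in for the generator's internal features and
    `encoder_features` for the output of a frozen external encoder.
    """
    generation = mse(pred_velocity, velocity_target(x0, noise))
    alignment = 1.0 - cosine_similarity(hidden, encoder_features)
    return generation + weight * alignment


rng = random.Random(0)
x0 = [0.2, -0.4, 0.9]                       # toy "image"
noise = [rng.gauss(0.0, 1.0) for _ in x0]   # random noise of the same shape
x_t = noisy_sample(x0, noise, t=0.7)        # what the generator would see

# A perfect velocity prediction with perfectly aligned features costs almost nothing.
best = aligned_loss(velocity_target(x0, noise), [1.0, 2.0], x0, noise, [2.0, 4.0])
# Same prediction, but features orthogonal to the encoder's: only the alignment term is paid.
misaligned = aligned_loss(velocity_target(x0, noise), [1.0, 0.0], x0, noise, [0.0, 3.0])
```

The second call shows why the extra term matters: a model can predict the denoising direction perfectly and still be penalized if its internal features disagree with what the encoder "knows" about the scene.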
Limitations of External Encoders
Although effective, relying on external encoders has significant drawbacks:
- Scaling Ceiling: External models are usually fixed-weight checkpoints. As the generative model scales up, the frozen encoder beside it does not scale with it and becomes the limiting factor.
- Modality Specialization: Encoders are usually specialized for one modality (e.g., DINOv2 for images). A model that generates images, audio, and video needs an encoder for every modality, so the architecture ends up stitched together like a “Frankenstein.”
- Objective Mismatch: The generative model is trained to generate content, while the encoder may have been trained for something else, such as segmenting objects. The objectives differ, so the synergy is imperfect.
  - Case: DINOv3 is technically better than DINOv2 but performs worse when used to train generative models. There are currently no clear rules explaining why some encoders help and others do not.
SelfFlow: A New Generation Training Method
To address these problems, BFL released the research paper “SelfFlow” about a month and a half before the talk. The method teaches the model representations directly, without external encoders.
Core Mechanism
SelfFlow is a scalable, self-supervised method for training multimodal generative models; no other model is needed to assist training.
- Joint Learning: Combines representation learning and generation in the same process.
- Dual Model Collaboration:
  - Student Model: Always receives the noisier images and attempts to denoise them.
  - Teacher Model: Essentially a more stable version of the student model; it always receives low-noise images.
- Loss Function: The student minimizes a generation loss and a representation loss at the same time.
- Advantage: There is effectively one model, since the teacher is a version of the student, and no external components are needed. Scaling up means scaling only the student and teacher, with no external encoder to hold them back.
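The student and teacher roles described above can be sketched in a few lines. This is a minimal toy, not the method from the paper: the talk only says the teacher is "a more stable version" of the student, so the exponential moving average in `ema_update` is an assumption, as are the cosine-based representation loss, the `rep_weight` and `decay` parameters, the one-layer `toy_features` stand-in, and every function name.

```python
import math
import random


def interpolate(x0, noise, t):
    """Blend clean data (t=0) with pure noise (t=1)."""
    return [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]


def toy_features(weights, x):
    """Hypothetical stand-in for a network's hidden features: tanh(w * x) per element."""
    return [math.tanh(w * v) for w, v in zip(weights, x)]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return max(0.0, 1.0 - dot / norm) if norm > 0.0 else 1.0


def student_loss(student_w, teacher_w, x0, noise, t_student, t_teacher, rep_weight=0.5):
    """Generation loss plus representation loss, both paid by the student."""
    if t_student <= t_teacher:
        raise ValueError("the student must see the noisier input")
    h_student = toy_features(student_w, interpolate(x0, noise, t_student))
    # The teacher only provides a target; in real training no gradient would flow through it.
    h_teacher = toy_features(teacher_w, interpolate(x0, noise, t_teacher))
    # Toy shortcut: the student's features double as its velocity prediction.
    generation = mse(h_student, [n - a for a, n in zip(x0, noise)])
    representation = cosine_distance(h_student, h_teacher)
    return generation + rep_weight * representation


def ema_update(teacher_w, student_w, decay=0.99):
    """Assumed teacher update: an exponential moving average of the student's weights."""
    return [decay * tw + (1.0 - decay) * sw for tw, sw in zip(teacher_w, student_w)]


rng = random.Random(0)
x0 = [0.2, -0.4, 0.9]
noise = [rng.gauss(0.0, 1.0) for _ in x0]
student_w = [0.5, -1.0, 1.5]
teacher_w = list(student_w)   # the teacher starts as a copy of the student

loss = student_loss(student_w, teacher_w, x0, noise, t_student=0.9, t_teacher=0.2)
teacher_w = ema_update(teacher_w, [0.6, -0.9, 1.4])   # after a student step (not shown)
```

Because the teacher is built from the student's own weights, making the student larger makes the teacher larger too, which is the scaling property the list above highlights.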
Performance and Results
BFL trained a single model across all modalities and compared it with standard Flow Matching training:
- Multimodal Improvement: Outperforms the baseline on audio, images, and video.
- Convergence Speed: The baseline plateaus, while SelfFlow converges faster and its loss keeps dropping. At two million training steps, the baseline may stop improving or even get worse, while the SelfFlow loss is still decreasing.
- Text Generation Improvement:
  - Baseline Issues: Rendered text is imperfect, with missing letters in “The future is flux” and spelling errors in “worlds.”
  - SelfFlow Effect: Letters are placed sensibly, and mirrored text and text on trees come out correct. Having learned representations, the model knows that letters should appear in sequence.
- Anatomy Improvement: Human facial anatomy looks more natural than with the baseline model.
- Video Generation: The same model also generates video. In a push-up example, the baseline produces strange postures, while SelfFlow produces a standard posture with correct arm movement and properly handled hair.
Research Disclaimer
The SelfFlow-related models shown are research models and are not intended for direct use in production environments. However, this represents the future direction BFL believes in: moving away from external encoders and achieving better multimodal generation through self-supervised learning.
Source: https://youtu.be/x8Yb4RidLgM?is=ad8jtmUrL5boUbTU