MIMFlow: Integrating Masked Image Modeling with Normalizing Flows for End-to-End Image Generation
Summary
MIMFlow integrates Masked Image Modeling with Normalizing Flows for end-to-end image generation, achieving an FID of 2.50 on ImageNet 256×256 with 128 tokens, 50% fewer than standard models.
Source: https://huggingface.co/papers/2606.26016
Abstract
MIMFlow combines Normalizing Flows with Masked Image Modeling to improve generative modeling by decoupling semantic representation from pixel-level details, achieving better performance with fewer tokens.
Normalizing Flows (NFs) are powerful generative models capable of exact density estimation and sampling. However, their strict invertibility often forces the model to exhaust its capacity on low-level pixel details, hindering the capture of high-level semantic structures. While Masked Image Modeling (MIM) has excelled in representation learning, its integration into generative pipelines has remained largely modular and disjointed. In this paper, we propose MIMFlow, a unified end-to-end framework that jointly optimizes latent semantics, pixel reconstruction, and generative flow. By employing a VAE encoder to infer semantic latent from masked images, MIMFlow achieves a principled decoupling of the generative task: the Normalizing Flow focuses on modeling a simplified, low-frequency semantic manifold, while a specialized decoder handles high-frequency synthesis. This design effectively resolves the inherent capacity bottleneck of NFs, allowing the model to prioritize global structural coherence over redundant noise. Empirical results on ImageNet 256×256 show that MIMFlow-L reaches 71.3% linear probing accuracy and an FID of 2.50. Despite using only 128 tokens (50% fewer than standard models), it yields a 32.8% performance gain over similar-scale NF baselines. Our code is available at https://github.com/MCG-NJU/MIMFlow.
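The abstract's claim that Normalizing Flows allow exact density estimation and sampling follows from their invertibility. The sketch below illustrates that property with a generic RealNVP-style affine coupling layer in NumPy. It is not MIMFlow's architecture: the class name `AffineCoupling`, the tiny random MLP, the latent dimension of 8, and the standard Gaussian prior are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


class AffineCoupling:
    """Generic affine coupling layer (illustrative, not the paper's model)."""

    def __init__(self, dim, hidden=16):
        assert dim % 2 == 0, "dim must be even so the vector splits in half"
        self.half = dim // 2
        # Tiny random MLP mapping the first half to (log-scale, shift).
        self.w1 = rng.normal(scale=0.1, size=(self.half, hidden))
        self.w2 = rng.normal(scale=0.1, size=(hidden, 2 * self.half))

    def _params(self, x_a):
        out = np.tanh(x_a @ self.w1) @ self.w2
        log_s, t = out[..., : self.half], out[..., self.half :]
        return np.tanh(log_s), t  # bounded log-scale for numerical stability

    def forward(self, x):
        """Map data x to latent z and return log|det J| of the mapping."""
        x_a, x_b = x[..., : self.half], x[..., self.half :]
        log_s, t = self._params(x_a)
        z_b = x_b * np.exp(log_s) + t
        return np.concatenate([x_a, z_b], axis=-1), log_s.sum(axis=-1)

    def inverse(self, z):
        """Exactly invert forward(); the first half passes through unchanged."""
        z_a, z_b = z[..., : self.half], z[..., self.half :]
        log_s, t = self._params(z_a)
        x_b = (z_b - t) * np.exp(-log_s)
        return np.concatenate([z_a, x_b], axis=-1)


def log_prob(layer, x):
    """Exact log-density via change of variables under a standard normal prior."""
    z, log_det = layer.forward(x)
    d = z.shape[-1]
    log_pz = -0.5 * (z ** 2).sum(axis=-1) - 0.5 * d * np.log(2.0 * np.pi)
    return log_pz + log_det


layer = AffineCoupling(dim=8)
x = rng.normal(size=(4, 8))
z, log_det = layer.forward(x)
x_rec = layer.inverse(z)          # recovers x up to floating-point error
densities = log_prob(layer, x)    # one exact log-density per sample
```

Because every layer must be invertible with a tractable Jacobian, a flow applied directly to pixels has to account for every input dimension. That is the capacity bottleneck the abstract describes, and the motivation for running the flow on a compact semantic latent instead.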
Similar Articles
Masked Language Flow Models
This paper introduces Masked Language Flow Models (MLFMs), which incorporate masking into flow-based language models to enable continuous flow for conditional generation and allow pretrained Masked Diffusion Models to be converted. The authors propose a novel sampler that alternates continuous denoising with discrete unmasking, demonstrating for the first time that flow-based language models can scale to downstream reasoning and instruction-following tasks.
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
STARFlow2 is a new research paper introducing an architecture that bridges language models and autoregressive normalizing flows for unified multimodal generation. It addresses structural mismatches in existing systems by using a shared causal masking mechanism for interleaved text-image sequences.
@jiqizhixin: What if you could generate high-quality images in one step instead of hundreds? Stanford and ByteDance introduce W-Flow…
Stanford and ByteDance introduce W-Flow, a single-step generative model that uses Wasserstein gradient flows to achieve state-of-the-art one-step ImageNet 256x256 generation (1.29 FID) with 100x faster sampling than multi-step diffusion models.
Multi-Resolution Flow Matching: Training-Free Diffusion Acceleration via Staged Sampling
MrFlow is a training-free multi-resolution acceleration strategy for flow-matching text-to-image models that combines low-resolution generation with pixel-space super-resolution and noise injection, achieving up to 25x end-to-end speedup without training or runtime modifications.
I built my 'first' flow matching image generator, here's what I learned [P]
The author shares their experience building a small flow matching image generation model trained on Apple emoji images, describing the initial failed approach and the successful pivot using RGB channels, residual blocks, and attention.