MIMFlow: Integrating Masked Image Modeling with Normalizing Flows for End-to-End Image Generation

Hugging Face Daily Papers 06/24/26, 12:00 AM Papers

Summary

MIMFlow integrates Masked Image Modeling with Normalizing Flows for end-to-end image generation, achieving a FID of 2.50 on ImageNet 256x256 with 50% fewer tokens than standard models.

Normalizing Flows (NFs) are powerful generative models capable of exact density estimation and sampling. However, their strict invertibility often forces the model to exhaust its capacity on low-level pixel details, hindering the capture of high-level semantic structures. While Masked Image Modeling (MIM) has excelled in representation learning, its integration into generative pipelines has remained largely modular and disjointed. In this paper, we propose MIMFlow, a unified end-to-end framework that jointly optimizes latent semantics, pixel reconstruction, and generative flow. By employing a VAE encoder to infer semantic latents from masked images, MIMFlow achieves a principled decoupling of the generative task: the Normalizing Flow focuses on modeling a simplified, low-frequency semantic manifold, while a specialized decoder handles high-frequency synthesis. This design effectively resolves the inherent capacity bottleneck of NFs, allowing the model to prioritize global structural coherence over redundant noise. Empirical results on ImageNet 256×256 show that MIMFlow-L reaches 71.3% linear probing accuracy and an FID of 2.50. Despite using only 128 tokens (50% fewer than standard models), it yields a 32.8% performance gain over similar-scale NF baselines. Our code is available at https://github.com/MCG-NJU/MIMFlow.
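To make "exact density estimation" and "strict invertibility" concrete, here is a minimal NumPy sketch of a single affine coupling layer, a common building block of normalizing flows. It is not MIMFlow's architecture, which the abstract does not specify; the function names, the tanh conditioner, and the toy dimensions are all assumptions chosen for illustration.

```python
import numpy as np

def coupling_forward(x, w, b):
    """Map data x to latent z and return (z, log|det J|) for each sample."""
    x1, x2 = np.split(x, 2, axis=-1)
    # Scale and shift for the second half depend only on the first half,
    # which keeps the Jacobian triangular and the layer invertible.
    h = np.tanh(x1 @ w + b)  # hypothetical one-layer conditioner
    log_s, t = np.split(h, 2, axis=-1)
    z2 = x2 * np.exp(log_s) + t
    return np.concatenate([x1, z2], axis=-1), log_s.sum(axis=-1)

def coupling_inverse(z, w, b):
    """Invert coupling_forward exactly (used for sampling)."""
    z1, z2 = np.split(z, 2, axis=-1)
    h = np.tanh(z1 @ w + b)
    log_s, t = np.split(h, 2, axis=-1)
    x2 = (z2 - t) * np.exp(-log_s)
    return np.concatenate([z1, x2], axis=-1)

def log_prob(x, w, b):
    """Exact log-density via change of variables with a standard normal base."""
    z, log_det = coupling_forward(x, w, b)
    d = z.shape[-1]
    log_pz = -0.5 * (z ** 2).sum(axis=-1) - 0.5 * d * np.log(2 * np.pi)
    return log_pz + log_det

rng = np.random.default_rng(0)
dim = 4  # toy dimensionality, an assumption
w = rng.normal(size=(dim // 2, dim)) * 0.5
b = np.zeros(dim)
x = rng.normal(size=(3, dim))

z, _ = coupling_forward(x, w, b)
x_rec = coupling_inverse(z, w, b)
max_err = float(np.abs(x - x_rec).max())  # round-trip reconstruction error
```

Because every input dimension must be recoverable from the output, a flow applied directly to pixels has to account for all pixel-level detail. That is the capacity bottleneck the abstract describes and the motivation for letting the flow model a compact semantic latent instead.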

Cached at: 06/30/26, 03:33 AM

Source: https://huggingface.co/papers/2606.26016

Abstract

MIMFlow combines Normalizing Flows with Masked Image Modeling to improve generative modeling by decoupling semantic representation from pixel-level details, achieving better performance with fewer tokens.

Normalizing Flows (NFs) are powerful generative models capable of exact density estimation and sampling. However, their strict invertibility often forces the model to exhaust its capacity on low-level pixel details, hindering the capture of high-level semantic structures. While Masked Image Modeling (MIM) has excelled in representation learning, its integration into generative pipelines has remained largely modular and disjointed. In this paper, we propose MIMFlow, a unified end-to-end framework that jointly optimizes latent semantics, pixel reconstruction, and generative flow. By employing a VAE encoder to infer semantic latent from masked images, MIMFlow achieves a principled decoupling of the generative task: the Normalizing Flow focuses on modeling a simplified, low-frequency semantic manifold, while a specialized decoder handles high-frequency synthesis. This design effectively resolves the inherent capacity bottleneck of NFs, allowing the model to prioritize global structural coherence over redundant noise. Empirical results on ImageNet 256×256 show that MIMFlow-L reaches 71.3% linear probing accuracy and an FID of 2.50. Despite using only 128 tokens (50% fewer than standard models), it yields a 32.8% performance gain over similar-scale NF baselines. Our code is available at https://github.com/MCG-NJU/MIMFlow.
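The abstract says the encoder infers latents "from masked images" but does not give the masking scheme. As a rough illustration of the input side of Masked Image Modeling in general, the sketch below splits an image into patch tokens and hides a random subset. The patch size of 16, the mask ratio of 0.75, and the helper names `patchify` and `random_mask` are assumptions for this example, not values taken from the paper.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patch tokens."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def random_mask(tokens, mask_ratio, rng):
    """Keep a random subset of tokens; return (visible tokens, boolean mask).

    mask[i] is True where token i is hidden from the encoder.
    """
    n = tokens.shape[0]
    n_keep = int(round(n * (1.0 - mask_ratio)))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], mask

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))        # stand-in for an ImageNet 256x256 image
tokens = patchify(image, patch=16)       # 16 x 16 grid of patch tokens
visible, mask = random_mask(tokens, mask_ratio=0.75, rng=rng)
```

In a MIM setup, only `visible` would be fed to the encoder, and a reconstruction loss on the hidden positions pushes the representation toward global structure rather than local texture.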

Get this paper in your agent:

hf papers read 2606.26016

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Masked Language Flow Models

STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

@jiqizhixin: What if you could generate high-quality images in one step instead of hundreds? Stanford and ByteDance introduce W-Flow…

Multi-Resolution Flow Matching: Training-Free Diffusion Acceleration via Staged Sampling

I built my 'first' flow matching image generator, here's what I learned [P]
