Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
Summary
Researchers extend MeanFlow one-step image generation from class labels to flexible text inputs by integrating highly discriminative LLM-based text encoders, enabling efficient text-conditioned synthesis with improved performance.
Source: https://huggingface.co/papers/2604.18168
Abstract
Researchers extend MeanFlow generation from class labels to text inputs by integrating powerful LLM-based text encoders, overcoming limitations of few-step refinement through enhanced semantic feature representation.
Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily focuses on class-to-image generation. However, an intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared to the limited class labels, text conditions pose greater challenges to the model's understanding capability, necessitating the effective integration of powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders using conventional training strategies results in unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and reveal that, due to the extremely limited number of refinement steps in MeanFlow generation, such as only one step, the text feature representations are required to possess sufficiently high discriminability. This also explains why discrete and easily distinguishable class features perform well within the MeanFlow framework. Guided by these insights, we leverage a powerful LLM-based text encoder validated to possess the required semantic properties and adapt the MeanFlow generation process to this framework, resulting in efficient text-conditioned synthesis for the first time. Furthermore, we validate our approach on widely used diffusion models, demonstrating significant generation performance improvements. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation. The code is available at https://github.com/AMAP-ML/EMF.
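To make the idea concrete, here is a minimal, self-contained sketch (not the authors' implementation) of one-step text-conditioned generation in a MeanFlow-style setup: the network predicts an average velocity u(z, r, t, c) over a time interval, so a sample can be drawn in a single step as z_0 = z_1 - u(z_1, 0, 1, c), where z_1 is Gaussian noise and c is a conditioning vector from a frozen text encoder. All class and function names below are hypothetical, and the toy embedding encoder merely stands in for the LLM-based text encoder the paper advocates.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a frozen LLM-based text encoder. In practice this
# would be a pretrained language model whose pooled hidden states serve as the
# conditioning vector c; here it is a plain embedding table for illustration.
class DummyTextEncoder(nn.Module):
    def __init__(self, vocab_size=1000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    @torch.no_grad()
    def forward(self, token_ids):
        # Mean-pool token embeddings into one conditioning vector per prompt.
        return self.embed(token_ids).mean(dim=1)

# Minimal conditional average-velocity network u(z, r, t, c).
class MeanVelocityNet(nn.Module):
    def __init__(self, data_dim=64, cond_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(data_dim + cond_dim + 2, hidden),
            nn.SiLU(),
            nn.Linear(hidden, data_dim),
        )

    def forward(self, z, r, t, c):
        rt = torch.stack([r, t], dim=-1)          # interval endpoints as inputs
        return self.net(torch.cat([z, c, rt], dim=-1))

@torch.no_grad()
def one_step_sample(model, text_encoder, token_ids, data_dim=64):
    """One-step MeanFlow-style sampling: start from noise z_1 and jump to z_0
    using the predicted average velocity over the full interval [0, 1]."""
    b = token_ids.shape[0]
    c = text_encoder(token_ids)                    # text condition
    z1 = torch.randn(b, data_dim)                  # Gaussian noise endpoint
    r = torch.zeros(b)
    t = torch.ones(b)
    return z1 - model(z1, r, t, c)                 # z_0 = z_1 - u(z_1, 0, 1, c)

if __name__ == "__main__":
    enc = DummyTextEncoder()
    net = MeanVelocityNet()
    fake_prompts = torch.randint(0, 1000, (4, 16))  # 4 tokenized prompts
    samples = one_step_sample(net, enc, fake_prompts)
    print(samples.shape)  # torch.Size([4, 64])
```

The point the abstract stresses carries over directly to this sketch: because there is only a single refinement step, the conditioning vector c must already be highly discriminative, since the network gets no further opportunity to disambiguate weakly separated text features during sampling.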
Get this paper in your agent:
hf papers read 2604.18168
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Hierarchical text-conditional image generation with CLIP latents
OpenAI proposes a hierarchical two-stage model for text-conditional image generation using CLIP latents: a prior that generates CLIP image embeddings from text captions, and a diffusion-based decoder that generates images from embeddings. The approach improves image diversity and enables zero-shot language-guided image manipulations.
Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models
This paper investigates how semantic information is distributed across textual tokens in text-to-image models, finding that information concentration and cross-item interactions significantly affect image generation alignment. The authors use patching techniques to demonstrate that simple encoding-stage interventions can improve alignment quality.
Towards Scalable One-Step Generative Modeling for Autoregressive Dynamical System Forecasting
This paper introduces MeLISA, a latent-free autoregressive generative surrogate for forecasting high-dimensional physical dynamics that uses pixel-space MeanFlow to achieve efficient one-step generation. It demonstrates superior long-horizon statistical accuracy and inference speed compared to neural operators on turbulent flow benchmarks.
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
Proposes TextPro-SLM, a speech large language model that minimizes the modality gap by processing spoken input to resemble prosody-aware text input, achieving strong paralinguistic understanding with low training data.
@_vmlops: How LLMs Generate Text End-to-End Inference Pipeline A Mock Interview Guide https://drive.google.com/file/d/1eDqEtWWtIe…
This guide explains the end-to-end inference pipeline of LLMs, serving as a mock interview resource for understanding text generation.