@xichen_pan: Modern text-to-image models are increasingly powered by large pretrained LLMs. But there is a curious mismatch: the LLM…
Summary
RepFusion introduces a method to use pretrained multimodal LLMs as noisy representation encoders in diffusion transformers for text-to-image generation, outperforming baselines with similar compute.
Modern text-to-image models are increasingly powered by large pretrained LLMs.
But there is a curious mismatch: the LLM typically encodes the prompt only once, while the evolving noisy latent states are handled entirely by a newly trained generative backbone.
Can pretrained multimodal priors participate in the denoising process?
Introducing RepFusion. (1/12)
https://arxiv.org/abs/2606.14700 https://xichenpan.com/repfusion/
RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space
Source: https://arxiv.org/html/2606.14700
Aashu Singh, Satya Narayan Shukla, Xiangjun Fan, Shlok Kumar Mishra, Saining Xie
1 Meta AI, 2 New York University
* Work done at Meta. † Equal advising.
(June 12, 2026)
Abstract
Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structured visual representations, creating a latent space that is more compatible with pretrained LLM priors. Inspired by multimodal LLMs (MLLMs), where an MLP projector is sufficient to align clean visual representations with a pretrained LLM, we repurpose the MLLM itself as a noisy representation encoder, extending this mechanism from clean to noisy inputs. We present RepFusion, which uses the resulting MLLM outputs as the conditioning signal for a diffusion transformer. In controlled comparisons at similar inference budgets, RepFusion outperforms baselines that devote comparable capacity to newly initialized denoisers. These results demonstrate that MLLMs provide strong priors for denoising visual representations and that, by conditioning on evolving noisy representations, test-time compute can be productively spent on repeated MLLM conditioning in modern T2I systems.
1 Introduction
Text-to-image (T2I) generation is commonly formulated as conditional image generation, where image generators are conditioned on the outputs of text encoders. Alongside the evolution of image generators from GANs (gan) to diffusion models (ddpm), text encoders have also progressed from LSTMs (lstm) to CLIP (clip) and T5 (t5). Recently, many systems have replaced these encoders with large language models (LLMs) (gpt; llama; llama3) due to their stronger representational capacity, richer world knowledge, in-context learning ability, and compatibility with unified multimodal models (metaquery). However, in recent pipelines (pixart; luminanext; sana; qwenimage; flux2; zimage), LLMs still primarily act as static text encoders that produce text embeddings, while diffusion transformers (DiTs) (dit) carry out the denoising trajectory and image synthesis.
This division of labor made sense in the VAE (vae) era. Diffusion models typically denoise VAE latents, and these latents were never designed to be “read” by pretrained language priors. They are low-dimensional, local, and optimized for reconstruction rather than semantics. As a result, even if one aims to bring an LLM closer to the denoising loop, it is unclear what the LLM should consume or why doing so would be beneficial.
![[Uncaptioned image]](https://arxiv.org/html/2606.14700v1/x1.png)
Figure 1. GenEval comparison when switching from VAEs to RAEs for three conditioning strategies: TextEmbed (conditioning a DiT with an LLM’s last-layer text token embeddings following recent T2I practice (sana; qwenimage; flux2; zimage)), Transfusion (transfusion), and RepFusion. All three variants in this comparison use a 7B LLM; TextEmbed and RepFusion also use a 1.3B DiT. RepFusion feeds noisy visual representations into a pretrained MLLM and uses the resulting outputs to condition a DiT. It benefits most from the transition, achieving a 30% relative gain (+0.16 absolute improvement), compared to 21% (+0.10) for TextEmbed and 11% (+0.06) for Transfusion.
![[Uncaptioned image]](https://arxiv.org/html/2606.14700v1/x2.png)
Figure 2. GenEval comparison across different conditioning strategies under similar inference FLOPs. Circle size denotes total parameters, and the inner disk denotes trainable parameters. Each method allocates roughly 8B parameters to modules that either process noisy visual latents or denoise them. RepFusion fine-tunes only a 1.3B DiT and an MLP projector, yet outperforms TextEmbed and Transfusion, both of which train 8B parameters (a larger DiT and an LLM, respectively). This cross-method comparison suggests that MLLMs provide strong priors for denoising visual representations, and that repurposing them to encode noisy representations can be a more effective use of parameters than scaling newly initialized denoisers.
Representation autoencoders (RAEs) (rae) change this picture. By moving generation from VAE latents to semantically structured visual representations, such as CLIP (clip) or DINO (dino) features, RAEs provide a denoising space that is both easier to optimize and more semantically meaningful. Furthermore, these developments bridge T2I and the feature spaces currently utilized by multimodal LLMs (MLLMs).
In the multimodal understanding community, pretrained LLMs have demonstrated a simple yet powerful property: with an MLP projector, they can ingest clean visual representations and immediately become strong sequence models over multimodal tokens (llava). This observation is usually discussed in the context of understanding and reasoning. Here, we take it as a design principle for generation: if an LLM can perceive clean visual representations, can it also process noisy counterparts during denoising?
Our answer is yes. As shown in Figure 1, the resulting system is highly effective and best suited to the RAE latent space. We present RepFusion, a T2I model that treats a pretrained MLLM as a noisy representation encoder. In addition to text inputs, we feed noisy RAE latents into an off-the-shelf MLLM by reusing its MLP projector. We keep the pretrained LLM backbone frozen and fine-tune only its projector. We then use the MLLM’s output to condition a DiT that denoises in the same latent space. Conceptually, this design allows the pretrained MLLM to focus on what it does best: modeling structured visual representations.
This design first changes the capacity allocation picture beyond the standard “make the denoiser bigger” recipe. As shown in Figure 2, under similar inference FLOPs, all compared systems allocate roughly 8B parameters to modules that either process noisy visual latents or denoise them: TextEmbed uses a 7B frozen MLLM text encoder and an 8B DiT, Transfusion uses an 8B joint denoising transformer, and RepFusion uses the same 7B frozen MLLM together with a 1.3B DiT. We provide details on training the TextEmbed and Transfusion baselines in Appendix 6. Despite fine-tuning only the DiT and an MLP projector, RepFusion outperforms these baselines, showing that, across model families, allocating substantial model capacity to a frozen pretrained conditional encoder can outperform spending nearly the entire parameter budget on newly initialized denoising modules. This suggests that pretrained MLLMs carry priors that transfer beyond multimodal understanding: once the representation space is compatible, those priors can directly help denoise noisy visual representations.
RepFusion also introduces a distinct axis for scaling at test time. In TextEmbed pipelines, the conditional encoder is run once to produce static text embeddings that are reused across all denoising steps. In contrast, RepFusion feeds evolving noisy RAE latents into the MLLM, making the conditioning signal change along the denoising trajectory and making per-step MLLM recomputation useful.
We also compare against unified architectures such as Transfusion (transfusion), which can be viewed as another way of exposing noisy visual information to language models. As shown in Figure 1, even when we upgrade such baselines to operate in the RAE latent space, the gains are smaller than those obtained by explicitly repurposing a frozen MLLM as a noisy encoder. In other words, moving from VAE to RAE helps, but by itself it does not unlock the full benefit of pretrained language priors.
In summary, this paper argues for a simple shift in perspective: many modern T2I systems already allocate substantial capacity to huge LLM text encoders, and RAEs provide a representation space where these encoders can do more than encode text. By letting frozen MLLMs take noisy visual representations as input, we obtain a strong and efficient prior for denoising in representation space. The main contributions are:
- We show that frozen pretrained MLLMs can encode noisy RAE latents and provide useful denoising priors beyond static text conditioning.
- We demonstrate that allocating parameters to a frozen pretrained conditional encoder can outperform static text embedding baselines that spend comparable capacity on newly initialized denoisers.
- We show that noisy representation inputs unlock a way to scale test-time compute by making MLLM conditioning evolve across denoising steps.
- We show that the pretrained MLLM prior is strong: freezing it outperforms further jointly optimizing it for generation.
2 Related Work
Text encoders in T2I. Early conditional GANs use small text encoders such as LSTMs (lstm), producing either global sentence embeddings (ganintcls; stackgan) or token-level embeddings (attngan). Diffusion models later standardized text conditioning with frozen pretrained encoders that provide token embeddings for cross-attention. Stable Diffusion 1.5 (sd1p5) popularized a CLIP (clip) text encoder. Recent systems increasingly scale the text encoder: Imagen (imagen) moves beyond CLIP to LLMs such as T5-XXL (t5), and PixArt-α (pixart), Stable Diffusion 3 (sd3), and FLUX.1 (flux) follow suit, shipping with large T5-family encoders. Recent open-source models such as Lumina-Next (luminanext) and Sana (sana) adopt LLM encoders, and FLUX.2 (flux2) further scales the LLM to a 24B-parameter Mistral Small 3 (mistralsmall3). Overall, modern T2I pipelines often devote billions of parameters to text encoders, motivating methods that better utilize their capacity.
From VAEs to RAEs. Latent diffusion (sd1p5) popularized a key design choice in modern T2I models: instead of diffusing in pixel space, models denoise in the latent space of an autoencoder, making high-resolution generation tractable. Most systems adopt VAEs (vae) for this purpose, but VAE latents are heavily compressed and optimized for reconstruction, which limits their semantic expressiveness. RAEs (rae) avoid this bottleneck by pairing a decoder with a frozen pretrained encoder (e.g., CLIP (clip) or DINO (dino)), working with semantically rich latents that are easier to denoise. This shift removes the VAE bottleneck and brings T2I into representation spaces that pretrained MLLMs already handle well, creating a natural opportunity to leverage their priors beyond static text conditioning.
Integration of Language Models and Denoisers. A growing line of work seeks tighter integration between conditional encoders and denoisers. Unified architectures such as Transfusion (transfusion) train a large transformer to jointly model language outputs and denoise VAE latents, aiming for a single modeling stack across modalities. Another direction builds compact interfaces between MLLMs and diffusion backbones, for example via learnable queries (metaquery; blip3o; scalerae) or joint attention (lmfusion; bagel). In contrast, our focus is not on the conditioning mechanism, but on changing the content of the condition itself. We push MLLMs beyond text encoding, repurposing them to encode noisy representations and condition DiTs (dit).
Figure 3: Overview of RepFusion. Blue modules are frozen, while red modules are trainable. We reuse a pretrained MLLM to process the text prompt and noisy RAE latents. The noisy RAE latents are projected into the MLLM input space through an MLP projector, and the resulting outputs condition every DiT block via AdaLN modulation.
3 RepFusion
This section first formalizes diffusion in visual representation space (Section 3.1) and describes how RepFusion uses an MLLM to encode noisy representations for DiT conditioning (Section 3.2). We then use controlled ablations to isolate the role of noisy representation inputs (Section 3.3) and multimodal perception pretraining (Section 3.4). Finally, we break down how these ingredients improve over TextEmbed and Transfusion baselines (Section 3.5). Unless otherwise specified, the variants discussed in this section use a 7B LLM backbone paired with a 1.3B DiT denoiser.
3.1 Preliminary
A flow matching T2I model is a conditional generative model. Given a text prompt $y$, we first obtain a text embedding $\bm{c}=E_{\phi}(y)$ with a typically frozen text encoder. The generative network is then conditioned on $\bm{c}$, either through cross attention (sd1p5) or adaptive normalization (dit). In our setting, diffusion operates in a visual representation space: let $\bm{x}$ denote a clean visual representation, let $t$ be the timestep, and let $\bm{\epsilon}$ be the Gaussian noise. We adopt the $\bm{v}$-prediction parameterization (lipman2022flow; liu2022flow; albergo2022building):
$$\bm{z}_{t}=t\,\bm{x}+(1-t)\,\bm{\epsilon},\qquad\bm{x}\sim p_{\text{data}}(\bm{x}). \tag{1}$$
We follow the timestep shifting strategy of rae; sd3. For a base dimension $n$ and an effective data dimension $m$, the sampled timestep $t_{n}\sim\mathcal{U}(0,1)$ is shifted to $t=\frac{\alpha t_{n}}{1+(\alpha-1)t_{n}}$, where $\alpha=\sqrt{m/n}$. Following rae; sd3, we use $n=4{,}096$ and set $m$ to the effective dimensionality of the visual representation; in our RAE setup, this gives $\alpha=12$.
The flow velocity is given by the time derivative of $\bm{z}_{t}$:
$$\bm{v}\;=\;\bm{z}_{t}^{\prime}\;=\;\bm{x}-\bm{\epsilon}.$$
We learn a conditional velocity field $\bm{v}_{\theta}(\bm{z}_{t},t,\bm{c})$ by minimizing the standard flow matching objective (lipman2022flow; albergo2022building):
$$\mathcal{L}:=\mathbb{E}_{t,\bm{x},\bm{\epsilon}}\big\|\bm{v}_{\theta}(\bm{z}_{t},t,\bm{c})-\bm{v}\big\|^{2},$$
where $\bm{v}_{\theta}$ is predicted by the diffusion model.
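The training objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `shift_timestep` and `flow_matching_loss` are hypothetical, the velocity network is passed in as an arbitrary callable, and the loss is averaged over all elements, which differs from the squared norm only by a constant factor.

```python
import numpy as np

def shift_timestep(t_n, alpha):
    """Shifted timestep t = alpha * t_n / (1 + (alpha - 1) * t_n)."""
    return alpha * t_n / (1.0 + (alpha - 1.0) * t_n)

def flow_matching_loss(v_theta, x, cond, rng, alpha=12.0):
    """One Monte Carlo estimate of the flow matching objective.

    v_theta: callable (z_t, t, cond) -> predicted velocity, same shape as x.
    x:       clean representations, shape (batch, tokens, dim).
    cond:    conditioning passed through to v_theta unchanged.
    """
    batch = x.shape[0]
    t_n = rng.uniform(0.0, 1.0, size=(batch, 1, 1))  # t_n ~ U(0, 1)
    t = shift_timestep(t_n, alpha)                   # one timestep per sample
    eps = rng.standard_normal(x.shape)               # Gaussian noise
    z_t = t * x + (1.0 - t) * eps                    # Equation (1)
    v = x - eps                                      # target velocity
    # Mean over elements: a rescaled version of the squared norm in the text.
    return np.mean((v_theta(z_t, t, cond) - v) ** 2)
```

With $\alpha=1$ the shift is the identity; with $\alpha=12$ a uniformly sampled $t_n=0.5$ maps to $12/13$.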
3.2 Methods
In standard approaches, the conditioning $\bm{c}$ relies solely on the text $y$. In RepFusion, as shown in Figure 3, we augment the conditioning to also include the noisy visual representations, $\bm{z}_{t}$. This design allows the LLM to perceive the denoising trajectory.
Specifically, the LLM input consists of a sequence of text tokens followed by projected noisy visual tokens. Let $E_{\text{LLM}}$ denote the LLM, $P_{\psi}$ an MLP projector, and $\bm{e}_{t}$ the timestep embedding; we use the same notation for its projected forms in the visual representation space and the LLM hidden space. The conditioning $\bm{c}_{t}$ is defined as
$$\bm{c}_{t}=\operatorname{Last}_{N}\!\left(E_{\text{LLM}}\left([y,P_{\psi}(\bm{z}_{t}+\bm{e}_{t})]\right)\right) \tag{2}$$
where $[\cdot,\cdot]$ denotes sequence concatenation, and $\operatorname{Last}_{N}$ selects the final $N$ hidden states corresponding to the noisy visual tokens. We inject timestep information before the MLLM by adding the embedding $\bm{e}_{t}$ to $\bm{z}_{t}$ before applying the projection $P_{\psi}$, following transfusion. During sampling, $\bm{z}_{t}$ evolves at each denoising step, so $\bm{c}_{t}$ is recomputed accordingly. The LLM remains causal.
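Equation 2 can be illustrated with a shape-level sketch. Everything below the function definition is a hypothetical stand-in chosen only to make the example runnable: the projector is a single linear map rather than an MLP, and the "LLM" is a causal running mean rather than a pretrained transformer.

```python
import numpy as np

def repfusion_condition(llm, projector, text_emb, z_t, e_t):
    """c_t = Last_N(E_LLM([y, P_psi(z_t + e_t)])) for a single sample.

    text_emb: (L, D_c) embedded prompt tokens y.
    z_t:      (N, D_x) noisy visual representation.
    e_t:      (D_x,)   timestep embedding, broadcast over the N tokens.
    """
    n = z_t.shape[0]
    visual = projector(z_t + e_t)                     # (N, D_c)
    seq = np.concatenate([text_emb, visual], axis=0)  # text first, then visual
    hidden = llm(seq)                                 # (L + N, D_c)
    return hidden[-n:]                                # keep the last N states

# Toy stand-ins (assumptions, not the paper's modules).
rng = np.random.default_rng(0)
D_x, D_c, L, N = 8, 16, 5, 4
W = rng.standard_normal((D_x, D_c)) / np.sqrt(D_x)

def toy_projector(u):
    return u @ W

def toy_causal_llm(seq):
    # Position i only aggregates positions <= i, mimicking causal attention.
    return np.cumsum(seq, axis=0) / np.arange(1, len(seq) + 1)[:, None]

c_t = repfusion_condition(toy_causal_llm, toy_projector,
                          rng.standard_normal((L, D_c)),
                          rng.standard_normal((N, D_x)),
                          rng.standard_normal(D_x))
```

Because the visual tokens come last in a causal sequence, each selected hidden state can depend on the full prompt and on the preceding noisy tokens.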
Following Decoupled Diffusion Transformer (DDT) (ddt), we condition the DiT using adaptive layernorm (dit) without introducing additional cross-attention modules. Concretely, we adopt the AdaLN-Single variant from PixArt-α (pixart): a shared projection produces token-wise modulation parameters reused across blocks, and each transformer block adds a lightweight learned offset table $\bm{T}^{(\ell)}\in\mathbb{R}^{6\times D}$ before splitting the resulting parameters into $(\bm{\beta},\bm{\gamma},\bm{\alpha})$ for the MSA and MLP branches.
Let $\bm{h}^{(\ell)}_{t}\in\mathbb{R}^{N\times D}$ denote the intermediate states at DiT block $\ell$ and timestep $t$, and let $\bm{c}_{t}\in\mathbb{R}^{N\times D_{c}}$ denote the selected LLM hidden states for the noisy visual tokens. In our setting, the token counts are aligned ($N=576$). Following DiT (dit), we inject the noise level by adding the timestep embedding and applying a SiLU nonlinearity:
$$\tilde{\bm{c}}_{t}=\mathrm{SiLU}(\bm{c}_{t}+\bm{e}_{t}),\qquad\tilde{\bm{c}}_{t}\in\mathbb{R}^{N\times D_{c}}.$$
A shared linear layer predicts modulation parameters:
$$\bm{m}_{t}=\mathrm{Linear}(\tilde{\bm{c}}_{t})\in\mathbb{R}^{N\times 6D}.$$
For each block $\ell$, we interpret $\bm{m}_{t}$ as six token-wise $D$-dimensional modulation matrices and add a lightweight block-specific table $\bm{T}^{(\ell)}\in\mathbb{R}^{6\times D}$ (broadcast over tokens) to obtain the final modulation parameters for the block. Splitting along the last channel dimension yields
$$(\bm{\beta}^{(\ell)}_{t,\mathrm{msa}},\bm{\gamma}^{(\ell)}_{t,\mathrm{msa}},\bm{\alpha}^{(\ell)}_{t,\mathrm{msa}},\bm{\beta}^{(\ell)}_{t,\mathrm{mlp}},\bm{\gamma}^{(\ell)}_{t,\mathrm{mlp}},\bm{\alpha}^{(\ell)}_{t,\mathrm{mlp}}),$$
where each element lies in $\mathbb{R}^{N\times D}$. Here, $\bm{\beta}$ and $\bm{\gamma}$ denote the shift and scale terms, respectively, and $\bm{\alpha}$ denotes the residual gate.
The modulation operator can be defined as $\mathrm{Mod}(\bm{u};\bm{\gamma},\bm{\beta})=\bm{u}\odot(1+\bm{\gamma})+\bm{\beta}$. Each block then computes:
$$\begin{gathered}\tilde{\bm{h}}=\mathrm{Mod}\!\left(\mathrm{RMSNorm}(\bm{h}^{(\ell)}_{t});\ \bm{\gamma}^{(\ell)}_{t,\mathrm{msa}},\ \bm{\beta}^{(\ell)}_{t,\mathrm{msa}}\right),\\ \bm{h}^{\prime(\ell)}_{t}=\bm{h}^{(\ell)}_{t}+\bm{\alpha}^{(\ell)}_{t,\mathrm{msa}}\odot\mathrm{MSA}(\tilde{\bm{h}}),\\ \tilde{\bm{h}}^{\prime}=\mathrm{Mod}\!\left(\mathrm{RMSNorm}(\bm{h}^{\prime(\ell)}_{t});\ \bm{\gamma}^{(\ell)}_{t,\mathrm{mlp}},\ \bm{\beta}^{(\ell)}_{t,\mathrm{mlp}}\right),\\ \bm{h}^{(\ell+1)}_{t}=\bm{h}^{\prime(\ell)}_{t}+\bm{\alpha}^{(\ell)}_{t,\mathrm{mlp}}\odot\mathrm{MLP}(\tilde{\bm{h}}^{\prime}).\end{gathered}$$
Crucially, $\tilde{\bm{c}}_{t}$ is token-aligned with $\bm{h}^{(\ell)}_{t}$. As a result, the scale, shift, and residual gates are applied independently to each token rather than being broadcast over the sequence dimension.
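The token-wise modulation described above can be sketched as a single block forward pass. This is an illustrative NumPy sketch under stated assumptions: `block_forward` and its argument names are hypothetical, RMSNorm is written without a learnable weight, `msa` and `mlp` are arbitrary callables, and the shared linear layer is evaluated inside the function for brevity even though the text describes it as shared across blocks.

```python
import numpy as np

def silu(u):
    return u / (1.0 + np.exp(-u))

def rms_norm(u, eps=1e-6):
    # RMSNorm over the channel dimension, without a learnable weight (assumption).
    return u / np.sqrt(np.mean(u ** 2, axis=-1, keepdims=True) + eps)

def mod(u, gamma, beta):
    # Mod(u; gamma, beta) = u * (1 + gamma) + beta, elementwise per token.
    return u * (1.0 + gamma) + beta

def block_forward(h, c_t, e_t, W, b, T, msa, mlp):
    """h: (N, D), c_t: (N, D_c), e_t: (D_c,), W: (D_c, 6D), b: (6D,), T: (6, D)."""
    n, d = h.shape
    c_tilde = silu(c_t + e_t)                # (N, D_c)
    m = (c_tilde @ W + b).reshape(n, 6, d)   # six D-dim chunks per token
    m = m + T[None]                          # block table, broadcast over tokens
    b_msa, g_msa, a_msa, b_mlp, g_mlp, a_mlp = (m[:, i] for i in range(6))
    h = h + a_msa * msa(mod(rms_norm(h), g_msa, b_msa))
    h = h + a_mlp * mlp(mod(rms_norm(h), g_mlp, b_mlp))
    return h
```

Each of the six modulation tensors has shape (N, D), so every token receives its own shift, scale, and gate, in contrast to a sequence-level AdaLN where one vector would be broadcast over all N tokens.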
[Figure 4 panels: (a) MetaQuery, training FLOPs: 35 T, matched inference FLOPs; (b) RepFusion, training FLOPs: 35 T, inference FLOPs: 547 T, GenEval: 0.70.]
Figure 4: High-level comparison between MetaQuery-style (metaquery) architectures (e.g., BLIP-3o (blip3o) and Scale-RAE (scalerae)) and RepFusion. During training, both methods backpropagate gradients through the conditional encoder and denoiser, resulting in similar training FLOPs. At inference time, we rerun the MetaQuery conditional encoder with different timestep embeddings to match RepFusion’s inference budget. This increases compute from 113 to 552 TFLOPs, but GenEval does not improve (0.55 to 0.54), because the MLLM still does not observe the evolving noisy representation. In contrast, noisy representation inputs make RepFusion’s condition depend on the denoising state, enabling useful test-time compute scaling through repeated MLLM conditioning and improving GenEval to 0.70.
(a) Effect of multimodal perception pretraining in the LLM backbone. Replacing an LLM with a perception-pretrained MLLM improves both Transfusion-RAE and RepFusion under settings with frozen and trainable LLMs.
(b) Effect of fine-tuning the LLM backbone. Fine-tuning helps when starting from a language-only LLM, but can hurt when starting from a perception-pretrained MLLM backbone in RepFusion-RAE.
Figure 5: Ablations on multimodal perception pretraining and LLM fine-tuning. Perception-pretrained backbones consistently improve RAE diffusion, while fine-tuning benefits text-only backbones but may degrade performance when the backbone is already multimodally pretrained. This indicates that perception pretraining is a strong prior that can outperform joint optimization for generation.
3.3 Noisy Representation Input
To assess the importance of conditioning the MLLM on noisy representations, as illustrated in Figure 4, we construct a learnable query baseline by replacing the projected noisy RAE latents $P_{\psi}(\bm{z}_{t}+\bm{e}_{t})$ in Equation 2 with $N$ learnable queries $\bm{Q}_{\eta}\in\mathbb{R}^{N\times D_{c}}$, following MetaQuery (metaquery), while keeping the rest of the architecture unchanged. This baseline closely resembles BLIP-3o (blip3o) and Scale-RAE (scalerae). With a frozen 7B MLLM and a 1.3B DiT, it reaches 0.55 on GenEval, whereas RepFusion reaches 0.70.
Crucially, this gap is not due to additional training compute. RepFusion and the learnable query baseline have similar training FLOPs: for each sampled timestep, both run the same forward pass through the frozen LLM and the denoiser, and both backpropagate through these computations while updating only the conditioning inputs or projector and the denoiser. The difference is that noisy representation inputs make the condition depend on the current denoising state. Since $\bm{z}_{t}$ evolves during sampling, RepFusion spends additional test-time compute on a changing, input-dependent conditioning signal. In contrast, learnable queries do not expose the LLM to $\bm{z}_{t}$, so repeated inference has no evolving visual signal to re-encode. To isolate recomputation from noisy representation input, we also make the learnable queries timestep-dependent, closely matching RepFusion in inference FLOPs. This variant reaches only 0.54 on GenEval, below the original learnable query baseline, indicating that the gain comes from recomputing an input-dependent condition over evolving noisy representations, not from recomputation alone.
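The difference in inference-time compute between a static condition and a per-step condition can be sketched with a simple sampler. This is a hedged illustration only: the paper does not specify its sampler in this section, so the Euler integrator, the uniform time grid, and the names `sample`, `encode`, and `recompute` are assumptions. Under Equation 1, $t=0$ corresponds to pure noise and $t=1$ to clean data, so integration runs from 0 to 1.

```python
import numpy as np

def sample(v_theta, encode, z0, num_steps, recompute=True):
    """Euler integration of dz/dt = v_theta from t=0 (noise) to t=1 (data).

    encode:    callable (z_t, t) -> conditioning c_t (stands in for the MLLM pass).
    recompute: True  -> re-encode the current z_t at every step (RepFusion-style);
               False -> encode once and reuse the result (static conditioning).
    Returns the final sample and the number of encoder calls.
    """
    z, calls, c = z0.copy(), 0, None
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        if recompute or c is None:
            c = encode(z, t)          # conditioning sees the current noisy state
            calls += 1
        z = z + (t_next - t) * v_theta(z, t, c)
    return z, calls
```

With `recompute=True` the encoder runs once per denoising step, which is where the additional test-time compute discussed above is spent; with `recompute=False` it runs exactly once.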
3.4 Multimodal Perception Pretraining
The gains above depend on the conditional encoder being able to interpret structured visual representations along a denoising trajectory. We therefore examine what makes an LLM better at interpreting noisy RAE latents.
In particular, our conditional encoder is an MLLM whose backbone has been pretrained to perceive visual representations. We therefore study the role of multimodal perception pretraining. To isolate the effect of this capability, we compare a language-only LLM backbone with a perception-pretrained MLLM backbone, while keeping the denoiser and token budget the same. As shown in Figure 5(a), replacing the language-only LLM with a perception-pretrained MLLM improves both Transfusion-RAE and RepFusion under settings with frozen and trainable LLMs. This indicates that perception pretraining provides a transferable prior for diffusion in RAE space: an MLLM that can interpret clean visual representations also better supports encoding their noisy counterparts.
We further compare preserving a perception-pretrained LLM with jointly optimizing the LLM for generation. Following Transfusion (transfusion), when fine-tuning the LLM, we add an auxiliary language modeling (LM) loss on the caption tokens and allow the injected noisy visual tokens to use bidirectional attention, while keeping the caption stream causal. Figure 5(b) shows a consistent pattern: fine-tuning helps when starting from a language-only LLM backbone (in both VAE and RAE setups), but it degrades performance when starting from a perception-pretrained MLLM backbone in RepFusion-RAE. This suggests that multimodal perception pretraining is a strong prior that is best preserved rather than further re-optimized for generation.
(a) Path from TextEmbed to RepFusion.
(b) Path from Transfusion to RepFusion.
Figure 6: Step-by-step ablations from (a) TextEmbed and (b) Transfusion toward RepFusion. Bars show GenEval scores, and hatched bars denote modifications that are evaluated but not adopted.
3.5 Breaking down the improvement
We use these observations to break down the improvements of RepFusion over TextEmbed and Transfusion.
As shown in Figure 6(a), starting from a standard text embedding baseline with a GenEval score of 0.47, feeding noisy VAE latents into the LLM improves the score to 0.54. Replacing VAE latents with RAE latents, which are easier to denoise and more compatible with LLMs, further improves the score to 0.64. Jointly training the LLM with an LM loss and bidirectional attention gives a minor gain to 0.65. Finally, adding multimodal perception pretraining improves denoising in RAE space, but the best performance is achieved when the LLM backbone remains frozen, preserving the pretrained prior.
We also trace the path from the Transfusion baseline. Transfusion can be viewed as another way of exposing noisy visual latents to language models, but this unified baseline can be improved by using the LLM explicitly as a conditional encoder rather than as the denoiser itself. As shown in Figure 6(b), starting from the Transfusion baseline at 0.56, replacing VAE latents with RAE latents and adopting a perception-pretrained LLM improves performance to 0.62. We then replicate the last 6 layers of the LLM to construct a stronger 8.0B Transfusion-RAE baseline, which reaches 0.64. Reallocating the same 1.3B trainable parameters to a separate DiT, while using the LLM as the conditional encoder, improves performance from 0.64 to 0.68; freezing the LLM further increases it to 0.70.
4 Experiments
4.1 Experimental Setup
Model. Unless otherwise specified, we follow the MLLM setup of llava: a causal LLM backbone is paired with a CLIP-L/14 vision tower (clip) through an MLP projector, which provides a clean interface for our RAE setup. We follow this simple architecture because many recent MLLMs introduce fine-tuned vision towers, any-resolution support, token compression (gemma3), and deep stacks (qwen3vl), which are tailored for multimodal understanding and are non-trivial to adopt for denoising purposes. We set the input resolution to 336, producing $N=576$ visual tokens. For VAE-based experiments, we use DC-AE (dcae) with a spatial downsampling factor of 32. We set the input resolution to 512, which yields $N=256$ latent tokens. This setup keeps output resolutions comparable across different latent spaces; we include a token-matched DC-AE comparison in Appendix 8. For both RAE and VAE settings, we set the DiT patch size to 1.
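The token counts above follow from simple grid arithmetic, shown here as a small check. The helper `num_tokens` is hypothetical. The last line is an assumption rather than something the text states: it takes the effective dimensionality $m$ from Section 3.1 to be tokens times channels, with 1,024 channels per CLIP-L/14 token, which happens to reproduce the reported $\alpha=12$ for $n=4{,}096$.

```python
import math

def num_tokens(resolution, downsample):
    """Tokens in a square grid: (resolution / downsample) squared."""
    side = resolution // downsample
    return side * side

rae_tokens = num_tokens(336, 14)   # CLIP-L/14 patch grid: 24 x 24
vae_tokens = num_tokens(512, 32)   # DC-AE with 32x downsampling: 16 x 16

# Assumption: m = tokens * channels with 1,024 channels per CLIP-L token.
alpha = math.sqrt(rae_tokens * 1024 / 4096)
```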
Data. We pretrain all models on the BLIP-3o 31M dataset (blip3o), which is recaptioned with MLLMs and contains 27M long- and 4M short-caption pairs. For supervised fine-tuning (SFT), we combine BLIP-3o 60k (blip3o), ShareGPT4o-Image (sharegpt4oimage), and Echo-4o (echo4o) into a 200k synthetic dataset. SFT images are sourced from GPT-4o Image (gpt4oimagegeneration).
Training. For all pretraining experiments, we train the models on 128 H200 GPUs with a global batch size of 2,048. Models are trained for 10 epochs (160k steps) with a learning rate of $3\times 10^{-4}$. They are optimized using the AdamW (adamw) optimizer with $\beta_{1}=0.9$, $\beta_{2}=0.95$, and a weight decay of 0.1. The learning rate follows a cosine decay schedule with a 10k-step warmup period. For SFT experiments, we use a learning rate of $1\times 10^{-4}$ and train the model for 64 epochs.
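The pretraining learning-rate schedule described above can be sketched as follows. The peak rate, warmup length, and total step count come from the text; the linear shape of the warmup and the decay floor of zero are assumptions, since the text does not state them, and the function name `learning_rate` is hypothetical.

```python
import math

def learning_rate(step, peak=3e-4, warmup=10_000, total=160_000, floor=0.0):
    """Warmup followed by cosine decay from `peak` to `floor`."""
    if step < warmup:
        return peak * step / warmup                  # linear warmup (assumed)
    progress = (step - warmup) / (total - warmup)    # 0 at end of warmup, 1 at end
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate peaks at step 10,000 and falls to half of the peak at step 85,000, the midpoint of the decay phase.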
Figure 7: Qualitative T2I samples generated by RepFusion. Some prompts are adapted from moviegen.
Representation Decoders. To decode the visual representations back into pixels, we employ two different strategies: an RAE decoder (rae) and a diffusion decoder (emu). Unless otherwise specified, all parameter counts reported exclude decoder parameters, and the decoder is the RAE decoder by default. For the RAE decoder, we follow standard RAE practice by using a ViT-XL (mae) decoder and a DINO (dino) GAN discriminator. The decoder patch size is set to 24 to produce images at a resolution of 576. We train the RAE decoder on ImageNet-22k (imagenet) for 16 epochs. For the diffusion decoder, we follow Emu (emu), starting from the SANA 1.6B checkpoint and replacing the text conditioning with CLIP features. The output resolution is 512. We train the diffusion decoder on ImageNet-22k for 10 epochs.
Table 1: T2I generation results on GenEval (geneval), GenEval++ (echo4o), GenEval2 (geneval2), and DPG-Bench (dpg). † denotes rewritten prompts. For GenEval2, we report the prompt-level metric Soft-TIFA$_{\text{GM}}$.
4.2 Prompt Alignment
With only around 30M image-caption pairs, RepFusion achieves strong T2I prompt alignment, with qualitative samples shown in Figure 7. We evaluate our largest configuration, which uses a 7B MLLM and a 3.2B DiT, on four representative benchmarks: GenEval (geneval), GenEval++ (echo4o), GenEval2 (geneval2), and DPG-Bench (dpg). As shown in Table 1, RepFusion achieves competitive performance across all of them, and RepFusion-SFT further improves the performance to state-of-the-art levels. Notably, benchmarks such as GenEval and DPG-Bench are increasingly subject to benchmark-specific optimization in the current synthetic data era (reca): many pipelines perform SFT on synthetic images sourced from GPT-4o (gpt4oimagegeneration) and Nano Banana (nanobanana), or directly apply RL with GenEval as a verifiable reward. To address this benchmark drift issue, GenEval2 was recently proposed with a more robust evaluation protocol, Soft-TIFA. Consistent with this motivation, we find that RepFusion-SFT yields only limited improvements on GenEval2, while the pretrained RepFusion remains strong. RepFusion also compares favorably with BAGEL (bagel) with Self-CoT, which is pretrained on over 1 billion web-scale examples.
4.3 Reasoning-based Generation
We find that, similar to learnable query methods such as MetaQuery (metaquery) and BLIP-3o (blip3o), RepFusion can also effectively leverage the capabilities of a frozen LLM. This enables the model to better understand and follow complex prompts, including those requiring world knowledge and reasoning. To quantitatively evaluate RepFusion’s world-knowledge reasoning capability, we employ the WISE (wise) benchmark. As shown in Table 2, RepFusion matches state-of-the-art performance.
Table 2: Reasoning-based generation on WISE (wise).
4.4 Conditioning Interface
We ablate how the MLLM hidden states are injected into the DiT. Cross attention provides a general conditioning mechanism, but it adds extra attention projections and treats the conditioning stream as a separate context. In our RAE setting, the $N$ MLLM outputs are naturally aligned with the $N$ DiT tokens, so a token-wise adaptive normalization interface can use this correspondence directly. As shown in Table 3, AdaLN-Single (pixart) achieves a slightly higher GenEval score with fewer parameters, and we therefore use it as the default conditioning interface.
Table 3: Ablation on the interface used to inject MLLM hidden states into the DiT. Parameter counts include the DiT and the conditioning interface.
4.5Scaling Behavior
RepFusion has two scaling axes at inference time: the frozen MLLM that repeatedly reads evolving noisy representations, and the DiT denoiser that predicts the velocity. We therefore study how performance changes when scaling either component in the billion-parameter regime. Figure8shows that increasing either the MLLM or DiT size can improve performance, with the clearest scaling trends appearing on GenEval and GenEval++.
We further study how to allocate inference compute across these two components under iso-FLOPs settings in Table 4. At around 280T FLOPs, allocating more compute to the DiT outperforms the configuration that allocates more compute to the MLLM across all metrics. The same trend holds at around 540T FLOPs: configurations with larger DiTs achieve stronger GenEval++ and GenEval2 scores. These within-family comparisons answer a different question from Figure 1: among RepFusion variants, scaling the DiT is generally more effective under a fixed inference budget, but compared with TextEmbed, RepFusion remains stronger even when the baseline spends nearly all sampling compute on the DiT. Thus, RepFusion benefits from allocating part of the test-time compute budget to repeated MLLM conditioning relative to static text embedding pipelines, while still following the broader trend that denoiser capacity is highly valuable.
Figure 8: MLLM and DiT co-scaling results. Increasing either component can improve performance, with clearer trends on GenEval and GenEval++.
Table 4: Iso-FLOPs comparison. Given a fixed inference budget, we compare different allocations between MLLMs and DiTs within RepFusion, and include TextEmbed as a reference baseline. Scaling the DiT is generally more favorable than scaling the MLLM within RepFusion, but RepFusion remains substantially stronger than TextEmbed, whose static text embedding design leaves nearly all sampling compute in the DiT.
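The iso-FLOPs framing can be illustrated with a rough accounting sketch. Everything numerical here is an assumption: the "2 FLOPs per parameter per token" rule ignores attention cost, the step count and prompt length are hypothetical, and the sketch assumes the MLLM reprocesses prompt and image tokens at every step with no caching and no guidance. It is not how the paper's reported budgets were computed and will not reproduce them.

```python
def forward_flops(params, tokens):
    """Rule-of-thumb forward cost: about 2 FLOPs per parameter per token."""
    return 2.0 * params * tokens

def repeated_conditioning_flops(mllm_params, dit_params, text_tokens, image_tokens, steps):
    """Encoder and denoiser both run at every sampling step."""
    mllm = steps * forward_flops(mllm_params, text_tokens + image_tokens)
    dit = steps * forward_flops(dit_params, image_tokens)
    return {"mllm": mllm, "dit": dit, "total": mllm + dit}

def static_text_flops(llm_params, dit_params, text_tokens, image_tokens, steps):
    """Text encoder runs once; only the denoiser runs at every step."""
    llm = forward_flops(llm_params, text_tokens)
    dit = steps * forward_flops(dit_params, image_tokens)
    return {"mllm": llm, "dit": dit, "total": llm + dit}

def dit_share(budget):
    """Fraction of the total inference budget spent in the denoiser."""
    return budget["dit"] / budget["total"]

# Model sizes and N = 576 follow the text; steps and prompt length are hypothetical.
STEPS, TEXT_TOKENS, IMAGE_TOKENS = 50, 64, 576
repeated = repeated_conditioning_flops(7e9, 3.2e9, TEXT_TOKENS, IMAGE_TOKENS, STEPS)
static = static_text_flops(7e9, 3.2e9, TEXT_TOKENS, IMAGE_TOKENS, STEPS)

print(f"repeated conditioning: DiT share = {dit_share(repeated):.3f}")
print(f"static text encoder:   DiT share = {dit_share(static):.3f}")
```

Under these assumptions a static pipeline leaves almost its entire budget in the denoiser, so matching its total cost with repeated conditioning means pairing the MLLM with a smaller DiT, fewer steps, or both.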
5 Conclusion
We study a simple but underused degree of freedom in modern T2I systems: the conditional encoder. By allowing a frozen MLLM to read noisy visual representations, the encoder becomes an active component of the denoising loop rather than a static text encoder. The resulting gains suggest a practical recipe for future models: expose pretrained MLLM priors to evolving noisy representations, spend test-time compute on repeated conditioning only when it carries input-dependent information, and preserve those priors. We hope this perspective helps shift how the community thinks about the role of MLLMs in T2I generation.
References
Appendix
6 Details on Training TextEmbed and Transfusion Baselines
We train the TextEmbed and Transfusion baselines with the same training setup as RepFusion. The controlled comparisons use the same text encoder family and newly initialized denoising components at the corresponding model scale. For TextEmbed, we follow the recent T2I practice used in Sana (sana): the LLM is used as a static text encoder, and its last-layer text-token embeddings condition a newly initialized DiT denoiser. Following Sana, we also apply RMSNorm after the decoder-only text encoder to normalize the variance of the text embeddings to 1.0. For Transfusion (transfusion), in addition to the diffusion training described above, we perform interleaved image-captioning training for the same number of training steps, using the same image-caption data.
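As a small illustration of the normalization step, the sketch below applies RMSNorm to a single text-token embedding. It is a generic RMSNorm without a learned gain, which is an assumption; the baseline's actual layer may include one. RMSNorm sets the mean square of the vector to approximately 1, which equals unit variance when the embedding has zero mean.

```python
import math

def rms_norm(vec, eps=1e-6):
    """Divide a vector by its root mean square (no learned gain, by assumption)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]

def mean_square(vec):
    """Average of squared entries; equals the variance for zero-mean vectors."""
    return sum(v * v for v in vec) / len(vec)

# Toy embedding with a large scale, as decoder-only LLM hidden states can have.
embedding = [30.0, -50.0, 10.0, 80.0]
normalized = rms_norm(embedding)

print(f"mean square before: {mean_square(embedding):.1f}")
print(f"mean square after:  {mean_square(normalized):.6f}")
```

The operation rescales each token independently and preserves its direction, so the relative structure of the embedding is unchanged.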
7 Impact of Decoders
We report RepFusion results with both the RAE decoder and the diffusion decoder in Table 1. We observe a performance gap between these two decoders. However, when we use RepFusion to generate a CLIP feature and then apply these two decoders to the same feature, the resulting images appear very similar. The overall layout and colors are largely determined by the CLIP feature, while only fine-grained textures differ slightly (Figure 9). This suggests that the choice of decoder does not affect the prompt-following ability of RepFusion. Instead, part of the performance gap on GenEval and DPG-Bench appears to arise because images decoded by the RAE decoder are blurrier in texture and therefore harder for detectors or vision-language models (VLMs) to evaluate reliably.
[Figure 9 image: top row, diffusion decoder ("Diff Decoder"); bottom row, RAE decoder ("RAE").]
Figure 9: Same CLIP representation decoded by the diffusion decoder and the RAE decoder. Since both decoders are optimized for reconstruction, the mapping from CLIP features to pixels is largely deterministic: object layout and colors are determined upstream when the model denoises the CLIP representation. The choice of decoder mainly affects fine-grained appearance (e.g., textures), and thus has little impact on object-level prompt-following.
Table 5: Token-matched latent-space comparison for TextEmbed. Increasing the DC-AE sequence length to match the RAE setting does not close the gap to RAE latents.
8 Additional Latent Space Comparison
In the main experiments, we keep the output resolutions comparable across latent spaces, which gives $N=256$ tokens for DC-AE and $N=576$ tokens for RAE. To isolate the effect of token count, we also evaluate a DC-AE setting with $N=576$ tokens by increasing its output resolution. As shown in Table 5, matching the DC-AE token count does not improve the TextEmbed baseline.
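For a sense of scale, the short sketch below works out what the two token counts imply if the tokens form a square grid, which is an assumption the text does not state. It also reports the ratio of pairwise token interactions, the quantity that dominates self-attention cost as sequences grow. The helper names are hypothetical.

```python
import math

def grid_side(num_tokens):
    """Side length of a square token grid; raises if the count is not a square."""
    side = math.isqrt(num_tokens)
    if side * side != num_tokens:
        raise ValueError("token count is not a perfect square")
    return side

def pairwise_ratio(n_small, n_large):
    """Ratio of token-pair counts, which scale as N squared."""
    return (n_large / n_small) ** 2

small, large = 256, 576
print(f"{small} tokens -> {grid_side(small)} x {grid_side(small)} grid")
print(f"{large} tokens -> {grid_side(large)} x {grid_side(large)} grid")
print(f"token ratio: {large / small:.2f}, pairwise ratio: {pairwise_ratio(small, large):.4f}")
```

Under the square-grid assumption, moving from 256 to 576 tokens means a 1.5 times longer grid side, so the token-matched DC-AE setting costs noticeably more per step than the default DC-AE setting.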