Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

arXiv cs.AI 05/19/26, 04:00 AM Papers
Summary
This paper proposes HT-GRPO, a hierarchical reinforcement learning method for diffusion multi-modal large language models that uses a sketch-then-paint training scheme and hierarchical credit assignment to improve image generation quality and reward alignment.
arXiv:2605.16842v1 Announce Type: new Abstract: Diffusion Multi-Modal Large Language Models (dMLLMs) are powerful for image generation, but optimizing them through reinforcement learning (RL) remains a major challenge. One primary difficulty is that a single image can be generated through many different unmasking sequences, which makes calculating importance ratios often intractable. Additionally, existing methods tend to ignore the hierarchical generation process of dMLLMs, where early tokens define the global layout and later tokens focus on local details. By assigning uniform rewards to all tokens, these current methods fail to reflect the actual contribution of each token to the final image. To address these issues, we propose Hierarchical Token GRPO (HT-GRPO), which integrates this hierarchy directly into the policy optimization process. Our approach features a Sketch-Then-Paint training scheme that organizes updates into three distinct stages: global, structure, and refinement. We also use a prompt-conditioned estimator to calculate importance ratios starting from a fully masked state. Furthermore, we introduce a Hierarchical Credit Assignment mechanism that prioritizes key structural tokens to ensure accurate reward propagation. Experiments using two popular dMLLM backbones, MMaDA and Lumina-DiMOO, demonstrate that HT-GRPO achieves substantial gains on the GenEval and DPG benchmarks. Evaluations across six additional metrics confirm significant improvements in image quality, aesthetics, and human preference.
Cached at: 05/19/26, 06:36 AM
# Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models
Source: [https://arxiv.org/html/2605.16842](https://arxiv.org/html/2605.16842)
Siqi Luo1,2,\*, Jianghan Shen2,3,\*, Yi Xin3,4, Huayu Zheng1, Haoxing Chen3, Yan Tai1, Qi Qin2, Yue Li5, Junjun He2,4, Yihao Liu2, Guangtao Zhai1, Yuewen Cao2, Xiaohong Liu1,4,†

1 Shanghai Jiao Tong University, 2 Shanghai Artificial Intelligence Laboratory, 3 Nanjing University, 4 Shanghai Innovation Institute, 5 Peking University

Code: [https://github.com/Alpha-VLLM/Lumina-DiMOO](https://github.com/Alpha-VLLM/Lumina-DiMOO)

###### Abstract

Diffusion Multi-Modal Large Language Models (dMLLMs) are powerful for image generation, but optimizing them through reinforcement learning (RL) remains a major challenge. One primary difficulty is that a single image can be generated through many different unmasking sequences, which often makes calculating importance ratios intractable. Additionally, existing methods tend to ignore the hierarchical generation process of dMLLMs, where early tokens define the global layout and later tokens focus on local details. By assigning uniform rewards to all tokens, these current methods fail to reflect the actual contribution of each token to the final image. To address these issues, we propose Hierarchical Token GRPO (HT-GRPO), which integrates this hierarchy directly into the policy optimization process. Our approach features a Sketch-Then-Paint training scheme that organizes updates into three distinct stages: global, structure, and refinement. We also use a prompt-conditioned estimator to calculate importance ratios starting from a fully masked state. Furthermore, we introduce a Hierarchical Credit Assignment mechanism that prioritizes key structural tokens to ensure accurate reward propagation. Experiments using two popular dMLLM backbones, MMaDA and Lumina-DiMOO, demonstrate that HT-GRPO achieves substantial gains on the GenEval and DPG benchmarks. Evaluations across six additional metrics confirm significant improvements in image quality, aesthetics, and human preference.

\*Equal contribution. †Corresponding author: xiaohongliu@sjtu.edu.cn

## 1 Introduction

Recent advancements in text-to-image (T2I) generation, driven by powerful diffusion models Cai et al. ([2025](https://arxiv.org/html/2605.16842#bib.bib49)); Qin et al. ([2025](https://arxiv.org/html/2605.16842#bib.bib48)); Wu et al. ([2025](https://arxiv.org/html/2605.16842#bib.bib50)); Team et al. ([2025](https://arxiv.org/html/2605.16842#bib.bib51)) and autoregressive (AR) models Wang et al. ([2024](https://arxiv.org/html/2605.16842#bib.bib55)); Cui et al. ([2025](https://arxiv.org/html/2605.16842#bib.bib52)); Xin et al. ([2025c](https://arxiv.org/html/2605.16842#bib.bib54)); Liu et al. ([2026a](https://arxiv.org/html/2605.16842#bib.bib53)), have achieved significant breakthroughs. This success is largely built upon the scaling of computational resources, model parameters, and training data. Alongside these efforts, the integration of reinforcement learning (RL) Liu et al. ([2025](https://arxiv.org/html/2605.16842#bib.bib56)); Xue et al. ([2025](https://arxiv.org/html/2605.16842#bib.bib57)); Geng et al. ([2025](https://arxiv.org/html/2605.16842#bib.bib58)); Yuan et al. ([2025](https://arxiv.org/html/2605.16842#bib.bib59)) has also become an essential component. By effectively aligning generation outputs with human preferences (e.g., prompt adherence, aesthetics), RL significantly improves the performance of T2I models. More recently, Diffusion Multi-Modal Large Language Models (dMLLMs) Yang et al. ([2025b](https://arxiv.org/html/2605.16842#bib.bib36)); Xin et al. ([2025b](https://arxiv.org/html/2605.16842#bib.bib38)); Shi et al. ([2026](https://arxiv.org/html/2605.16842#bib.bib37)); Bie et al. ([2026](https://arxiv.org/html/2605.16842#bib.bib61)); Li et al. ([2025](https://arxiv.org/html/2605.16842#bib.bib63)); You et al. ([2026](https://arxiv.org/html/2605.16842#bib.bib62)) have emerged as a significant architectural evolution.
By performing parallel iterative denoising over discrete tokens, dMLLMs offer a novel approach to visual generation\. This architectural shift naturally requires researchers to design RL strategies specifically tailored for dMLLMs\. The generation process of dMLLMs is inherently hierarchical: tokens decoded early shape global layout under high uncertainty, while tokens decoded late refine local details under low uncertainty\.

![Refer to caption](https://arxiv.org/html/2605.16842v1/x1.png)

Figure 1: Comparison of RL inner-loop strategies for dMLLMs.

Existing RL approaches generally follow two routes, but both fail to account for this structure:

1) Random remasking methods (shown in Figure [1](https://arxiv.org/html/2605.16842#S1.F1)(a)) Ma et al. ([2025a](https://arxiv.org/html/2605.16842#bib.bib27)); Zhao et al. ([2025](https://arxiv.org/html/2605.16842#bib.bib29)); Liu et al. ([2026b](https://arxiv.org/html/2605.16842#bib.bib30)) approximate intermediate states by randomly remasking tokens in generated images and evaluating selected tokens under the remaining visible context. However, this earliest naive paradigm leaks future information into the selected step (Limitation: future-token contamination). It also arbitrarily mixes structural tokens responsible for global layout with refinement tokens responsible for local details in the same gradient update, preventing the model from optimizing each group at its appropriate stage (Limitation: stage conflation).

2) Trajectory recording methods (shown in Figure [1](https://arxiv.org/html/2605.16842#S1.F1)(b)) Wang et al. ([2025b](https://arxiv.org/html/2605.16842#bib.bib44)); Yang et al. ([2025a](https://arxiv.org/html/2605.16842#bib.bib31)); Pan et al. ([2025](https://arxiv.org/html/2605.16842#bib.bib32)); Zhan ([2025](https://arxiv.org/html/2605.16842#bib.bib11)) build a more explicit decision process by storing the actual denoising trajectory. While this prevents information leakage, the inner-loop updates for each generated image are bound to a single recorded ordering, ignoring the many other valid unmasking orderings (Limitation: narrow path coverage).

Beyond these specific issues, both routes assign uniform image-level rewards to all tokens, ignoring the fact that different tokens have entirely different impacts on the final image quality (Shared Limitation: uniform token reward).

To overcome these limitations, we propose Hierarchical Token GRPO (HT-GRPO), a novel RL method that explicitly encodes the hierarchical structure of dMLLM generation into inner-loop policy optimization. HT-GRPO consists of two core components.

First, we propose a Sketch-Then-Paint staged training framework. This organizes the inner-loop updates into three consecutive stages (shown in Figure [1](https://arxiv.org/html/2605.16842#S1.F1)(c)): a Global stage that updates all tokens jointly to establish a stable coarse foundation, a Structure stage that focuses exclusively on structural tokens to sharpen global composition, and a Refinement stage that polishes local details only after the structure is settled, preventing structural and refinement tokens from being mixed in the same gradient update. Within each stage, random subset sampling covers multiple valid generation paths at the same semantic level. This enables the model to explore and optimize over a rich set of semantically consistent generation paths. To support this staged optimization, we further introduce a prompt-conditioned estimator that computes token-level importance ratios under a unified fully masked state using only the text prompt as condition. This simultaneously eliminates future-token contamination, prevents low-entropy degradation when evaluating refinement tokens, and removes the need to cache any intermediate denoising states.

Second, we introduce Hierarchical Credit Assignment. This assigns higher credit weights to early structural tokens and lower weights to later refinement tokens, so that image-level rewards are distributed in proportion to each token's actual contribution.

Extensive experiments confirm that HT\-GRPO delivers superior text\-image alignment on multiple benchmarks\. On the MMaDA model, it raises GenEval from 0\.56 to 0\.83 and DPG from 70\.51 to 81\.09\. When applied to Lumina\-DiMOO, these scores further improve to 0\.92 and 84\.47 respectively\. Additionally, HT\-GRPO significantly boosts performance across six metrics covering image quality, aesthetics, and human preference\. These results confirm that hierarchical token optimization is a versatile and effective approach\. Overall, our contributions can be summarized as follows:

- We reveal that the intrinsic hierarchical structure of dMLLM generation is essential. Existing RL methods ignore it, leading to stage conflation and inaccurate reward broadcasts. Instead, we explicitly leverage this structure to achieve principled hierarchical policy optimization.
- We propose Sketch-Then-Paint staged training, which divides inner-loop updates into Global, Structure, and Refinement stages by generation order, and uses a prompt-conditioned estimator to prevent future-token contamination without caching trajectories.
- We design Hierarchical Credit Assignment to allocate generation-order-aware weights to different tokens. This effectively aligns the reward propagation with the actual contribution of each token.
- We show that HT-GRPO consistently enhances MMaDA and Lumina-DiMOO across various benchmarks. This confirms the strong generalization of hierarchical token optimization across dMLLM architectures, outperforming leading T2I models.

![Refer to caption](https://arxiv.org/html/2605.16842v1/x2.png)

Figure 2: Visualization of the image generation process in dMLLMs. Top: early (red, structure) tokens set the layout, and later (blue, refinement) tokens add detail. Bottom: the corresponding outputs.

## 2 Preliminaries

### 2.1 Image Generation Process of dMLLMs

dMLLMs generate images by iteratively denoising masked visual tokens. An image is represented as a token sequence of length $N$ with text condition $c$. Starting from a fully masked state $\mathbf{x}^{(T)}$, the model repeatedly unmasks a subset of positions until the final image $\mathbf{x}^{(0)}=(v_{1},\ldots,v_{N})$ is produced:

$$
\mathbf{x}^{(t-1)}\sim\pi_{\theta}\left(\cdot\mid\mathbf{x}^{(t)},c\right),\qquad t=T,T-1,\ldots,1. \tag{1}
$$

Unlike autoregressive models, which generate tokens in a fixed left-to-right order, dMLLMs can produce the same image through many different unmasking orderings. We track this ordering with the generation-order rank $\rho_{g,i}\in\{1,\ldots,N\}$: rank 1 is the first position unmasked and rank $N$ the last. Tokens unmasked in the same step receive consecutive ranks under an arbitrary fixed order within that step.
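The rank bookkeeping described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `generation_order_ranks`, the 0-based position indexing, and the choice of ascending position index as the "arbitrary fixed order" within a step are all assumptions made for the example.

```python
def generation_order_ranks(unmask_steps, num_tokens):
    """Return ranks[i] = generation-order rank (1..N) of position i.

    unmask_steps[t] lists the positions revealed at denoising step t,
    ordered from the first step to the last (hypothetical input format).
    """
    ranks = [0] * num_tokens
    next_rank = 1
    for positions in unmask_steps:
        # Ties inside one step are broken by position index (an assumption;
        # the text only requires some arbitrary fixed order).
        for pos in sorted(positions):
            ranks[pos] = next_rank
            next_rank += 1
    if next_rank != num_tokens + 1 or 0 in ranks:
        raise ValueError("every position must be unmasked exactly once")
    return ranks


# Four tokens decoded in three steps: {0, 2}, then {3}, then {1}.
ranks = generation_order_ranks([[2, 0], [3], [1]], num_tokens=4)
```

Positions 0 and 2 receive the smallest ranks here, so they would count as the earliest, most structural tokens of this rollout.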

The rank $\rho_{g,i}$ reflects a fundamental asymmetry in what the model knows at prediction time. A token with small $\rho_{g,i}$ is unmasked early, when most of the image is still masked. It faces high uncertainty, and its choice largely determines where objects appear and how the scene is organized. A token with large $\rho_{g,i}$ is unmasked late, after most of the image has been revealed. It faces much lower uncertainty and mainly refines color, texture, and local detail. We call tokens with small $\rho_{g,i}$ *structural tokens* and tokens with large $\rho_{g,i}$ *refinement tokens* (Figure [2](https://arxiv.org/html/2605.16842#S1.F2)). [Proposition C.1](https://arxiv.org/html/2605.16842#A3.SS0.SSS0.Px1) provides a formal characterization of this entropy gap. This rank-induced asymmetry becomes a key challenge when applying RL to dMLLMs, as formalized in Section [2.2](https://arxiv.org/html/2605.16842#S2.SS2).

### 2.2 Group Relative Policy Optimization

GRPO Shao et al. ([2024](https://arxiv.org/html/2605.16842#bib.bib46)) maximizes a clipped surrogate objective over $G$ rollouts sampled from a behavior policy $\pi_{\theta_{\mathrm{old}}}$, using per-token importance ratios and group-relative advantages; the standard formulation is recalled in Appendix [D](https://arxiv.org/html/2605.16842#A4).

Applying GRPO to dMLLMs. Extending the token-level GRPO objective to dMLLMs rests on two properties of Eq. ([1](https://arxiv.org/html/2605.16842#S2.E1)). First, *Markov chain decomposition*: along a fixed unmasking trajectory, the joint probability factorizes step-wise via the chain rule. Second, *within-step conditional independence*: conditioned on the current state $\mathbf{x}^{(t)}$ and text prompt $c$, the tokens unmasked at step $t$ are sampled independently across positions. Together, these two properties enable a per-token formulation of the importance ratio in dMLLMs, analogous to the autoregressive case (Appendix [D](https://arxiv.org/html/2605.16842#A4)).

###### Assumption 1 (Stage-varying optimization).

Because dMLLMs decode tokens across $T$ stages under progressively changing visual contexts, the per-step optimization objective is inherently non-uniform. We therefore introduce an *optimization support set* $\mathcal{M}_{g}^{(k)}$ and a *conditioning context* $\mathbf{C}_{g,i}^{(k)}$ to characterize each gradient step $k$. All methods discussed in this paper are special instantiations of this paradigm. A unified comparison is given in Appendix [B](https://arxiv.org/html/2605.16842#A2).

Under Assumption [1](https://arxiv.org/html/2605.16842#Thmassumption1), the objective at gradient step $k\in\{1,\ldots,K\}$ is:

$$
\begin{aligned}
\mathcal{J}^{(k)}(\theta)=\mathbb{E}_{c,\,\{\mathbf{x}^{(0)}_{g}\}\sim\pi_{\theta_{\mathrm{old}}}}\Bigg[\;&\frac{1}{G}\sum_{g=1}^{G}\frac{1}{|\mathcal{M}_{g}^{(k)}|}\sum_{i\in\mathcal{M}_{g}^{(k)}}\min\Big(r_{g,i}^{(k)}(\theta)\,A_{g},\;\mathrm{clip}\left(r_{g,i}^{(k)}(\theta),1-\epsilon,1+\epsilon\right)A_{g}\Big)\\
&-\;\beta\,\mathbb{D}_{\mathrm{KL}}\left(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\right)\Bigg],
\end{aligned}
\tag{2}
$$

where the per-token importance ratio

$$
r_{g,i}^{(k)}(\theta)=\frac{\pi_{\theta}\left(v_{g,i}\mid\mathbf{C}_{g,i}^{(k)},\,c\right)}{\pi_{\theta_{\mathrm{old}}}\left(v_{g,i}\mid\mathbf{C}_{g,i}^{(k)},\,c\right)} \tag{3}
$$

measures the relative likelihood of visual token $v_{g,i}$ under the current versus behavior policy, conditioned on the visual context $\mathbf{C}_{g,i}^{(k)}$ and text context $c$. The group-relative advantage $A_{g}=(R_{g}-\mathrm{mean}(\{R_{j}\}))/(\mathrm{std}(\{R_{j}\})+\delta)$ is a scalar broadcast uniformly to every token in rollout $g$.
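For concreteness, the group-relative advantage can be computed as in the sketch below. It assumes the population standard deviation and $\delta=10^{-6}$; the text fixes neither choice, and the function name is illustrative.

```python
import statistics


def group_relative_advantages(rewards, delta=1e-6):
    """A_g = (R_g - mean(R)) / (std(R) + delta) for one group of rollouts.

    Population std and delta=1e-6 are assumptions for this sketch.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + delta) for r in rewards]


# Three rollouts of the same prompt with scalar image-level rewards.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

The rollout with the mean reward receives zero advantage, and $\delta$ keeps the division finite when every rollout in the group scores the same.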

### 2.3 Limitations of Existing Methods

As noted in Remark [1](https://arxiv.org/html/2605.16842#Thmremark1), $\mathcal{J}^{*}$ is intractable in practice. Existing methods therefore approximate it by making different choices of $\mathcal{M}_{g}^{(k)}$ and $\mathbf{C}_{g,i}^{(k)}$. Random remasking methods construct diverse pseudo-trajectories through random masking, while trajectory recording methods condition on the single observed trajectory. Each approximation carries its own limitations, analyzed below. A unified formal comparison is provided in Appendix [B](https://arxiv.org/html/2605.16842#A2).

Random Remasking Methods. MaskGRPO Ma et al. ([2025a](https://arxiv.org/html/2605.16842#bib.bib27)), D1 Zhao et al. ([2025](https://arxiv.org/html/2605.16842#bib.bib29)), and UniGRPO Liu et al. ([2026b](https://arxiv.org/html/2605.16842#bib.bib30)) schedule $\mathcal{M}_{g}^{(k)}$ by progressively increasing the mask ratio across the $K$ gradient steps. At each step $k$, the conditioning context $\mathbf{C}_{g,i}^{(k)}$ consists of the randomly retained visible tokens together with the text prompt $c$. This construction introduces two limitations. First, *future-token contamination*: the randomly retained tokens are not ordered by generation rank. The context $\mathbf{C}_{g,i}^{(k)}$ may therefore include tokens not yet unmasked when position $i$ was originally generated, making it causally inconsistent, as shown in [Proposition C.2](https://arxiv.org/html/2605.16842#A3.SS0.SSS0.Px3). Second, *stage conflation*: the progressive mask schedule assigns tokens to $\mathcal{M}_{g}^{(k)}$ without regard to generation-order rank $\rho_{g,i}$. Structural tokens and refinement tokens are thus mixed in the same gradient update despite their fundamentally different entropy levels, as established in Section [2.1](https://arxiv.org/html/2605.16842#S2.SS1) and formalized in [Proposition C.1](https://arxiv.org/html/2605.16842#A3.SS0.SSS0.Px1). On the other hand, the random remasking strategy provides multi-path coverage. By randomly varying the masked tokens across gradient steps, these methods expose the model to multiple partial views of the same generated image, providing broader context diversity than a single fixed trajectory.

Trajectory Recording Methods. TraceRL Wang et al. ([2025b](https://arxiv.org/html/2605.16842#bib.bib44)), CJ-GRPO Yang et al. ([2025a](https://arxiv.org/html/2605.16842#bib.bib31)), d-TreeRPO Pan et al. ([2025](https://arxiv.org/html/2605.16842#bib.bib32)), and AGRPO Zhan ([2025](https://arxiv.org/html/2605.16842#bib.bib11)) set $\mathcal{M}_{g}^{(k)}$ to all $N$ token positions at every gradient step. The conditioning context $\mathbf{C}_{g,i}^{(k)}$ is the set of tokens already unmasked just before position $i$ was revealed during the rollout trajectory. By grounding the context in the actual generation order, these methods avoid both future-token contamination and stage conflation. They introduce two limitations of their own, however. First, *limited path coverage*: as discussed in Remark [1](https://arxiv.org/html/2605.16842#Thmremark1), the ideal $\mathcal{J}^{*}$ requires averaging over all valid trajectories. Each rollout records only one ordering, so all alternatives remain unseen and the expectation is approximated by a single sample. Second, computational overhead: recording the full denoising trajectory requires storing all intermediate states and performing $O(T)$ forward passes per rollout. This substantially increases both memory usage and compute cost per training cycle.

Shared Limitation. Neither family accounts for the difference between structural and refinement tokens in the advantage assignment. The scalar $A_{g}$ is broadcast uniformly to every token regardless of denoising stage, assigning equal credit to structural and refinement tokens despite their vastly different entropy levels.

## 3 Method

The limitations in Section [2.3](https://arxiv.org/html/2605.16842#S2.SS3) fall into two categories. The first concerns the approximation of $\mathcal{J}^{*}$: existing choices of $\mathcal{M}_{g}^{(k)}$ and $\mathbf{C}_{g,i}^{(k)}$ each introduce stage conflation, limited path coverage, or future-token contamination. The second concerns reward attribution: the image-level advantage $A_{g}$ is broadcast uniformly, ignoring the entropy gap between structural and refinement tokens.

HT-GRPO addresses both categories through two components. Section [3.1](https://arxiv.org/html/2605.16842#S3.SS1) presents Sketch-Then-Paint staged training with a prompt-conditioned estimator. Inter-stage partitioning of $\mathcal{M}_{g}^{(k)}$ resolves stage conflation, intra-stage random subset sampling alleviates limited path coverage, and fixing $\mathbf{C}_{g,i}^{(k)}=\mathbf{C}_{\emptyset}$ eliminates future-token contamination. Section [3.2](https://arxiv.org/html/2605.16842#S3.SS2) replaces the uniform $A_{g}$ with Hierarchical Credit Assignment.

![Refer to caption](https://arxiv.org/html/2605.16842v1/x3.png)

Figure 3: HT-GRPO framework. (1) Prompt and masked image tokens are rolled out to produce $G$ sample groups. (2) Tokens are partitioned into structural (early, high-entropy) and refinement (later, low-entropy) sets via generation-order rank. (3) The image-level advantage $A_{g}$ is reweighted by per-token credits $w_{g,i}$ to form the hierarchical advantage $\tilde{A}_{g,i}$. (4) Inner-loop updates are scheduled into three stages, each randomly sampling from its corresponding token set.

### 3.1 Hierarchical Token GRPO

Visual generation is a progressive process from global structure to local detail, encoded in the distribution of $\rho_{g,i}$. As established in Section [2.1](https://arxiv.org/html/2605.16842#S2.SS1), tokens with small $\rho_{g,i}$ are structural tokens facing high uncertainty, while those with large $\rho_{g,i}$ are refinement tokens operating under rich context. HT-GRPO encodes this ordering into both the optimization schedule and the importance ratio estimator.

#### 3.1.1 Sketch-Then-Paint Staged Training

Under standard GRPO, one policy optimization cycle is:

$$
\pi_{\theta_{\mathrm{old}}}\xrightarrow{\text{rollout}}\bigl\{\mathbf{x}^{(0)}_{g}\bigr\}_{g=1}^{G}\xrightarrow{\text{optimize}}\theta\xrightarrow{\text{synchronize}}\theta_{\mathrm{old}}\leftarrow\theta. \tag{5}
$$

The policy performs $K$ inner-loop gradient updates on the same rollout batch. Although some methods vary which tokens enter each update, none account for the entropy-level distinction between structural and refinement tokens.

We organize the $K$ inner-loop updates into three consecutive stages:

$$
\text{Global}\rightarrow\text{Structure}\rightarrow\text{Refinement}. \tag{6}
$$

This creates a coarse-to-fine curriculum from high entropy to low entropy. The Global stage first optimizes all tokens jointly to provide a stable starting point. The Structure stage then focuses on high-entropy tokens, and the Refinement stage finally optimizes low-entropy detail after the skeleton has been established. This progression reduces gradient interference between token groups with fundamentally different entropy regimes.

Let $N_{s}=\lfloor\alpha N\rfloor$ with $\alpha\in(0,1)$ denoting the structure fraction. We define the stage-specific token sets as

$$
\mathcal{S}_{g,\mathrm{global}}=\{1,\ldots,N\},\qquad\mathcal{S}_{g,\mathrm{structure}}=\{i\mid\rho_{g,i}\leq N_{s}\},\qquad\mathcal{S}_{g,\mathrm{refinement}}=\{i\mid\rho_{g,i}>N_{s}\}. \tag{7}
$$

Here $N_{s}$ is a token-count cutoff rather than a denoising-step cutoff. $\mathcal{S}_{g,\mathrm{structure}}$ collects the $N_{s}$ structural tokens with the smallest $\rho_{g,i}$ values, and $\mathcal{S}_{g,\mathrm{refinement}}$ the remaining $N-N_{s}$ refinement tokens, following the taxonomy in Section [2.1](https://arxiv.org/html/2605.16842#S2.SS1). This partition depends only on the rollout's own unmasking order and requires no external annotation. Using unmasking rank as an entropy proxy is justified by [Proposition C.1](https://arxiv.org/html/2605.16842#A3.SS0.SSS0.Px1).
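The partition in Eq. (7) amounts to a threshold on the rank vector. The sketch below is illustrative only: the function name is hypothetical, and positions are indexed from 0 rather than from 1 as in the paper.

```python
import math


def partition_by_rank(ranks, alpha):
    """Split positions into the three stage-specific token sets of Eq. (7).

    ranks[i] is the generation-order rank (1..N) of position i.
    Positions are 0-indexed here, unlike the 1-indexed notation of the text.
    """
    n = len(ranks)
    n_s = math.floor(alpha * n)  # token-count cutoff N_s
    global_set = list(range(n))
    structure = [i for i, rho in enumerate(ranks) if rho <= n_s]
    refinement = [i for i, rho in enumerate(ranks) if rho > n_s]
    return global_set, structure, refinement


# Ranks of a 4-token rollout; alpha = 0.5 gives N_s = 2.
global_set, structure, refinement = partition_by_rank([1, 4, 2, 3], alpha=0.5)
```

Because the ranks are a permutation of $1,\ldots,N$, the structure set always holds exactly $N_s$ positions and the two sets are disjoint.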

We split the total inner-loop budget as $K=n_{\mathrm{global}}+n_{\mathrm{structure}}+n_{\mathrm{refinement}}$. For the $k_{s}$-th update inside stage $s$, where $k_{s}=0,\ldots,n_{s}-1$, we independently sample a random subset $\mathcal{M}_{g}^{(k_{s})}\subseteq\mathcal{S}_{g,s}$ according to an annealed sampling rate $\gamma_{k_{s}}^{(s)}$. Only tokens in $\mathcal{M}_{g}^{(k_{s})}$ contribute gradients in that update:

$$
\gamma_{k_{s}}^{(s)}=\gamma_{\min}^{(s)}+\bigl(\gamma_{\max}^{(s)}-\gamma_{\min}^{(s)}\bigr)\frac{\max(1,n_{s}-1)-k_{s}}{\max(1,n_{s}-1)}. \tag{8}
$$

The sampling rate decays linearly from $\gamma_{\max}^{(s)}$ to $\gamma_{\min}^{(s)}$. Early updates cover more tokens and provide stable gradient directions, while later updates reduce computation and sharpen focus. Sampling different subsets $\mathcal{M}_{g}^{(k_{s})}$ across inner-loop updates ensures that distinct portions of the token space receive gradient signal, rather than committing to a fixed subset throughout optimization. This intra-stage sampling acts as a Monte Carlo estimator over the token space within each stage, partially alleviating the limited path coverage identified in Section [2.3](https://arxiv.org/html/2605.16842#S2.SS3).
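The annealing schedule of Eq. (8) and the per-update subset draw can be sketched as below. The rate formula follows Eq. (8) directly. How the subset is drawn is not specified in the text, so the fixed-size draw without replacement in `sample_subset` is an assumption, as are both function names.

```python
import random


def sampling_rate(k_s, n_s, gamma_max=1.0, gamma_min=0.5):
    """Eq. (8): linear decay from gamma_max (k_s = 0) to gamma_min (k_s = n_s - 1)."""
    denom = max(1, n_s - 1)
    return gamma_min + (gamma_max - gamma_min) * (denom - k_s) / denom


def sample_subset(stage_tokens, rate, rng):
    """Draw the active set M_g for one inner-loop update.

    Assumption: a fixed-size draw without replacement of about rate * |S| tokens.
    """
    if not stage_tokens:
        return []
    size = max(1, round(rate * len(stage_tokens)))
    return sorted(rng.sample(stage_tokens, size))


rng = random.Random(0)
stage_tokens = list(range(10))
for k_s in range(4):  # a stage with n_s = 4 updates
    rate = sampling_rate(k_s, n_s=4)
    active = sample_subset(stage_tokens, rate, rng)
```

With $n_s=4$, the rate falls from 1.0 on the first update to 0.5 on the last, so the final update of the stage touches about half of the stage's tokens.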

#### 3.1.2 Prompt-Conditioned Estimator

Random remasking methods set $\mathbf{C}_{g,i}^{(k)}$ to a randomly retained subset of the generated image, which may expose tokens not yet visible at generation time and cause future-token contamination, as shown in [Proposition C.2](https://arxiv.org/html/2605.16842#A3.SS0.SSS0.Px3). Trajectory recording methods avoid contamination by using the actual trajectory state, but require storing all intermediate denoising states and performing $O(T)$ forward passes per rollout cycle.

We instead use the fully masked initial state $\mathbf{x}^{(T)}$ as a unified conditioning context, denoted $\mathbf{C}_{\emptyset}$ to emphasize that all $N$ positions are masked, and define the prompt-conditioned importance ratio as

$$
r_{g,i}(\theta)=\frac{\pi_{\theta}\left(v_{g,i}\mid\mathbf{C}_{\emptyset},c\right)}{\pi_{\theta_{\mathrm{old}}}\left(v_{g,i}\mid\mathbf{C}_{\emptyset},c\right)}. \tag{9}
$$

Under a fully masked input, dMLLMs predict all $N$ token distributions in a single forward pass, so all probabilities are obtained without caching any intermediate states. Using $\mathbf{C}_{\emptyset}$ eliminates future-token contamination by construction, preserves predictive entropy for refinement tokens as shown in [Proposition C.3](https://arxiv.org/html/2605.16842#A3.SS0.SSS0.Px4), and avoids the memory cost of trajectory recording.
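Given per-position log-probabilities from one fully masked forward pass of each policy, Eq. (9) reduces to an exponentiated log-probability difference. The sketch below assumes those log-probabilities are already available as nested lists; the model forward pass itself is not shown, and the function name and input layout are hypothetical.

```python
import math


def prompt_conditioned_ratios(logp_new, logp_old, tokens):
    """Eq. (9): r_i = pi_theta(v_i | C_empty, c) / pi_old(v_i | C_empty, c).

    logp_new[i][v] and logp_old[i][v] are log-probabilities of token id v at
    position i, both taken under the same fully masked input (assumed layout).
    """
    return [
        math.exp(logp_new[i][v] - logp_old[i][v])
        for i, v in enumerate(tokens)
    ]


# Two positions, vocabulary of size two.
logp_old = [[math.log(0.5), math.log(0.5)], [math.log(0.25), math.log(0.75)]]
logp_new = [[math.log(0.6), math.log(0.4)], [math.log(0.25), math.log(0.75)]]
ratios = prompt_conditioned_ratios(logp_new, logp_old, tokens=[0, 1])
```

Working in log space avoids underflow for low-probability tokens, and a ratio of 1 signals that the update left that token's probability unchanged.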

### 3.2 Hierarchical Credit Assignment

Both families of existing methods uniformly broadcast the image-level advantage $A_{g}$ to all tokens, assigning equal credit to structural and refinement tokens despite their fundamentally different roles. We instead assign token-level credit weights according to generation order:

$$
\tilde{A}_{g,i}=A_{g}\cdot w_{g,i},\qquad w_{g,i}=\begin{cases}\lambda_{\mathrm{s}},&i\in\mathcal{S}_{g,\mathrm{structure}},\\ \lambda_{\mathrm{r}},&i\in\mathcal{S}_{g,\mathrm{refinement}},\end{cases} \tag{10}
$$

where $\lambda_{\mathrm{s}}>1>\lambda_{\mathrm{r}}\geq 0$. Structural tokens receive weights above the uniform baseline to amplify updates on global composition and layout, while refinement tokens receive smaller weights to attenuate updates on local detail. All three training stages share the same weighting scheme. The only difference across stages is which active token set $\mathcal{S}_{g,s}$ is selected.
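Eq. (10) can be illustrated with a few lines of code. The defaults $\lambda_{\mathrm{s}}=1.5$ and $\lambda_{\mathrm{r}}=0.5$ are the values reported in the implementation details; the function name and the rank-based membership test are choices made for this sketch.

```python
def hierarchical_advantages(advantage, ranks, n_s, lambda_s=1.5, lambda_r=0.5):
    """Eq. (10): scale the image-level advantage A_g by a per-token credit weight.

    A token is structural when its generation-order rank is at most n_s,
    which matches the set definitions in Eq. (7).
    """
    return [
        advantage * (lambda_s if rho <= n_s else lambda_r)
        for rho in ranks
    ]


# One rollout with A_g = 2.0, four tokens, and cutoff N_s = 2.
token_advantages = hierarchical_advantages(2.0, ranks=[1, 4, 2, 3], n_s=2)
```

The sign of the advantage is preserved for every token. Only the magnitude of the update changes between structural and refinement positions.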

The HT-GRPO per-step objective is Eq. ([2](https://arxiv.org/html/2605.16842#S2.Ex1)) with $A_{g}$ replaced by $\tilde{A}_{g,i}$:

$$
\begin{aligned}
\mathcal{J}^{(k)}(\theta)=\mathbb{E}_{c,\,\{\mathbf{x}^{(0)}_{g}\}\sim\pi_{\theta_{\mathrm{old}}}}\Bigg[\;&\frac{1}{G}\sum_{g=1}^{G}\frac{1}{|\mathcal{M}_{g}^{(k)}|}\sum_{i\in\mathcal{M}_{g}^{(k)}}\min\Big(r_{g,i}(\theta)\,\tilde{A}_{g,i},\;\mathrm{clip}\left(r_{g,i}(\theta),1-\epsilon,1+\epsilon\right)\tilde{A}_{g,i}\Big)\\
&-\;\beta\,\mathbb{D}_{\mathrm{KL}}\left(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\right)\Bigg],
\end{aligned}
\tag{11}
$$

where the $K$ updates are ordered as Global $\to$ Structure $\to$ Refinement, consistent with Appendix [B](https://arxiv.org/html/2605.16842#A2).
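The inner sum of Eq. (11) for a single rollout can be sketched as follows. The KL penalty and the expectation over prompts and rollouts are omitted, and $\epsilon=0.2$ is an assumed value, since the clipping range is not stated in this section.

```python
def clipped_surrogate(ratios, advantages, active, eps=0.2):
    """Clipped surrogate of Eq. (11) for one rollout, without the KL term.

    ratios[i]     : prompt-conditioned importance ratio r_{g,i}
    advantages[i] : hierarchical advantage A~_{g,i}
    active        : positions in the sampled set M_g for this update
    eps = 0.2 is an assumption for this sketch.
    """
    total = 0.0
    for i in active:
        r, a = ratios[i], advantages[i]
        clipped_r = min(max(r, 1.0 - eps), 1.0 + eps)
        total += min(r * a, clipped_r * a)
    return total / len(active)


# Position 2 is outside the active set and contributes nothing.
value = clipped_surrogate(
    ratios=[1.5, 0.5, 1.0],
    advantages=[1.0, -1.0, 2.0],
    active=[0, 1],
)
```

In this example the first token is clipped from above (its ratio exceeds $1+\epsilon$ with a positive advantage) and the second from below (its ratio is under $1-\epsilon$ with a negative advantage), so neither can push the policy further in the same direction.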

Table 1: Performance comparison on GenEval across various dMLLMs and RL settings.

## 4 Experiments

### 4.1 Experimental Setup

Models. We validate our proposed HT-GRPO on two popular open-source pre-trained dMLLMs: MMaDA Yang et al. ([2025b](https://arxiv.org/html/2605.16842#bib.bib36)) and Lumina-DiMOO Xin et al. ([2025b](https://arxiv.org/html/2605.16842#bib.bib38)). Since other notable dMLLMs, such as LaViDa-O Li et al. ([2025](https://arxiv.org/html/2605.16842#bib.bib63)), are not open-source, we cannot include them in our experiments. Additionally, models like LLaDA-o You et al. ([2026](https://arxiv.org/html/2605.16842#bib.bib62)) are not strictly dMLLMs because they rely on continuous diffusion for image generation, making them unsuitable as base models.

Baselines. Our primary baselines include specialized T2I models (e.g., SDXL Podell et al. ([2023](https://arxiv.org/html/2605.16842#bib.bib7)), FLUX.1-dev Black Forest Labs ([2024](https://arxiv.org/html/2605.16842#bib.bib41)), DALL-E 3 Betker et al. ([2023](https://arxiv.org/html/2605.16842#bib.bib8)), Janus-Pro Chen et al. ([2025](https://arxiv.org/html/2605.16842#bib.bib40))), as well as the base versions of MMaDA and Lumina-DiMOO. Furthermore, we compare various RL training strategies. Specifically, we evaluate MaskGRPO Ma et al. ([2025a](https://arxiv.org/html/2605.16842#bib.bib27)) on the same dMLLM backbones to highlight the advantages of HT-GRPO. We additionally compare against a trajectory-aware baseline that constructs causally consistent contexts from the recorded unmasking order; because it follows the implementation of TraceRL Wang et al. ([2025b](https://arxiv.org/html/2605.16842#bib.bib44)), we refer to it as TraceGRPO. We also test UniGRPO to establish the performance bound for single-stage RL.

Benchmarks & Metrics. To evaluate T2I generation performance, we employ two related benchmarks: GenEval Ghosh et al. ([2023](https://arxiv.org/html/2605.16842#bib.bib35)) and DPG-Bench Hu et al. ([2024](https://arxiv.org/html/2605.16842#bib.bib43)). GenEval, consisting of 553 text prompts, is the most widely used benchmark to assess object-centric generation. In contrast, DPG-Bench includes 1,065 dense and complex prompts to evaluate how well the generated images align with the text. Beyond benchmark performance, we also measure the human preference, visual quality, and aesthetics of the generated images using ImageReward Xu et al. ([2023](https://arxiv.org/html/2605.16842#bib.bib68)), DeQA You et al. ([2025b](https://arxiv.org/html/2605.16842#bib.bib69)), HPSv3 Ma et al. ([2025b](https://arxiv.org/html/2605.16842#bib.bib70)), and UniPercept Cao et al. ([2025](https://arxiv.org/html/2605.16842#bib.bib67)).

Implementation Details. HT-GRPO is implemented on top of the MaskGRPO Ma et al. ([2025a](https://arxiv.org/html/2605.16842#bib.bib27)) codebase, inheriting its optimizer, reward, and RL loop. Specifically, we generate $G=9$ rollouts per prompt. The final reward score is a combination of HPSv3 Ma et al. ([2025b](https://arxiv.org/html/2605.16842#bib.bib70)), CLIP score Zhengwentai ([2023](https://arxiv.org/html/2605.16842#bib.bib71)), and UniReward Wang et al. ([2025a](https://arxiv.org/html/2605.16842#bib.bib66)). We keep the original KL penalty coefficient and fix the classifier-free guidance scale at 3.5. HT-GRPO-specific settings are: structure ratio $\alpha=0.3$, three-stage budget $n_{\mathrm{global}}:n_{\mathrm{structure}}:n_{\mathrm{refinement}}=2:4:2$ ($K=8$), token-level credit weights $\lambda_{\mathrm{s}}=1.5$ and $\lambda_{\mathrm{r}}=0.5$, and linear-decay annealing ($\gamma_{\max}=1.0$, $\gamma_{\min}=0.5$, down mode). All experiments run on 8$\times$A100-80G GPUs. More details are given in Appendix [F](https://arxiv.org/html/2605.16842#A6).
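
To make the schedule settings concrete, the sketch below expands the 2:4:2 budget into $K=8$ ordered stage labels and pairs each update with a linearly decaying $\gamma$. It is a sketch under stated assumptions: the helper names are hypothetical, and evenly spaced interpolation from $\gamma_{\max}$ to $\gamma_{\min}$ over the $K$ updates is only a guess at what "linear-decay, down mode" means, since the exact rule is deferred to Appendix F.

```python
def stage_schedule(n_global=2, n_structure=4, n_refinement=2):
    """Expand the stage budget into an ordered list of K stage labels."""
    return (["global"] * n_global
            + ["structure"] * n_structure
            + ["refinement"] * n_refinement)


def linear_decay(num_steps, gamma_max=1.0, gamma_min=0.5):
    """Evenly spaced values from gamma_max down to gamma_min (assumed form)."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if num_steps == 1:
        return [gamma_max]
    step = (gamma_max - gamma_min) / (num_steps - 1)
    return [gamma_max - k * step for k in range(num_steps)]


stages = stage_schedule()            # K = 2 + 4 + 2 = 8 inner-loop updates
gammas = linear_decay(len(stages))   # one gamma value per update
for k, (stage, gamma) in enumerate(zip(stages, gammas), start=1):
    print(f"update {k}: stage={stage:<10} gamma={gamma:.3f}")
```

Changing the three budget arguments gives other allocations of the same kind as those compared in the ablation on stage organization.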

Table 2: Performance comparison on DPG-Bench across various dMLLMs and RL settings.

Table 3: Human preference, image quality, and aesthetics comparison on DPG-Bench across various dMLLMs and RL settings.

### 4.2 Main Results

GenEval. As shown in Table [1](https://arxiv.org/html/2605.16842#S3.SS2), the GenEval results highlight three advantages of HT-GRPO. First, HT-GRPO outperforms previous RL methods. For MMaDA, our method raises the base overall score from 0.56 to 0.83, ahead of UniGRPO (0.63), MaskGRPO (0.80), and TraceGRPO (0.79). Second, the gain is consistent across dMLLMs. When applied to the stronger Lumina-DiMOO, HT-GRPO improves the overall score from 0.83 to 0.92 and outperforms MaskGRPO (0.88) and TraceGRPO (0.87), which indicates that the method generalizes across backbones. Finally, models trained with HT-GRPO are competitive with state-of-the-art T2I models. Leading models like FLUX.1-dev achieve an overall score of 0.82; MMaDA with HT-GRPO reaches 0.83, slightly above it, and Lumina-DiMOO with HT-GRPO sets a new high score of 0.92.

DPG-Bench. DPG-Bench is more challenging than GenEval because it evaluates models on dense and complex prompts. As shown in Table [2](https://arxiv.org/html/2605.16842#S4.SS1), both MaskGRPO and TraceGRPO yield only marginal improvements over the Lumina-DiMOO baseline (reaching 82.24 and 82.62, respectively), with TraceGRPO even underperforming MaskGRPO on MMaDA (74.37 vs. 75.81). In contrast, HT-GRPO overcomes this bottleneck, raising the score to 84.47. This result surpasses leading models such as DALL-E 3 (83.50) and FLUX.1-dev (83.84), with similar gains observed when using the MMaDA base model. Furthermore, Table [3](https://arxiv.org/html/2605.16842#S4.T3) highlights another key advantage: HT-GRPO improves semantic alignment without sacrificing visual quality. It achieves the highest or near-highest scores across human preference metrics (e.g., ImageReward, HPSv3) and image quality and aesthetic evaluations (e.g., DeQA, UniPercept), indicating that HT-GRPO balances complex instruction following with high-fidelity image generation.

### 4.3 Ablation Study

Figure [4](https://arxiv.org/html/2605.16842#S4.F4) and Table [5](https://arxiv.org/html/2605.16842#A4.T5) (shown in Appendix [E](https://arxiv.org/html/2605.16842#A5)) summarize five ablation studies on MMaDA ($K=8$, $\alpha=0.3$). (a) Stage organization and budget: The full Global $\to$ Structure $\to$ Refinement schedule with a 2:4:2 budget reaches 83.3 overall, outperforming all single-stage and two-stage variants (78.3–80.2); the coarse-to-fine ordering itself is critical. (b) Structure ratio: Performance peaks at $\alpha=0.3$ and degrades for narrower ($\alpha=0.1$, 80.2) and broader ($\alpha=0.5$) boundaries, confirming that a well-calibrated hierarchy is essential. (c) Component analysis: Removing hierarchical credit weighting ($\lambda_{\mathrm{s}}=\lambda_{\mathrm{r}}=1$) drops the score to 80.8, and replacing $\mathbf{C}_{\emptyset}$ with revealed structural contexts reduces it to 80.6, validating both designs independently. The linear-decay annealing schedule (Table [5](https://arxiv.org/html/2605.16842#A4.T5), 83.31) outperforms static and ascending baselines, confirming the value of broad initial coverage for stable gradient directions. Full per-ablation details are provided in Appendix [E](https://arxiv.org/html/2605.16842#A5).

![Refer to caption](https://arxiv.org/html/2605.16842v1/x4.png)

Figure 4: (a) Stage organization and budget allocation: The full Global $\to$ Structure $\to$ Refinement schedule with a structure-biased 2:4:2 budget consistently outperforms single-stage and two-stage variants. (b) Sensitivity to structure ratio $\alpha$: Performance peaks at $\alpha=0.3$ and degrades when the boundary is too narrow or too broad. (c) Component analysis: Hierarchical Credit Assignment and Prompt-Conditioned Estimator each provide independent gains, validating both design choices.

## 5 Conclusion

We revisit the application of GRPO to dMLLMs and identify two key challenges: multiple valid unmasking orderings make importance\-ratio estimation difficult, while existing methods uniformly assign image\-level rewards to tokens with heterogeneous roles\. To address these issues, we propose HT\-GRPO, which uses generation\-order rank to characterize token uncertainty and role, organizes inner\-loop updates into Global, Structure, and Refinement stages, estimates importance ratios under a fully masked prompt\-conditioned context, and performs generation\-order\-aware credit assignment\. Experiments on MMaDA and Lumina\-DiMOO show consistent improvements across GenEval, DPG\-Bench, and six human preference and visual quality metrics, demonstrating the effectiveness of hierarchical token optimization\. A current limitation is that HT\-GRPO uses a fixed structure ratio and fixed credit weights, which may not optimally adapt to prompts with different compositional complexity or rollout uncertainty\. In the future, we will advance HT\-GRPO through the integration of adaptive token grouping and dynamic credit assignment\.

## References

- \[1\]\(2021\)Structured denoising diffusion models in discrete state\-spaces\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§A\.1](https://arxiv.org/html/2605.16842#A1.SS1.p1.1)\.
- \[2\]J\. Betker, G\. Goh, L\. Jing, T\. Brooks, J\. Wang, L\. Li, L\. Ouyang, J\. Zhuang, J\. Lee, Y\. Tian,et al\.\(2023\)Improving image generation with better captions\.Technical Report, OpenAI\.Cited by:[§3\.2](https://arxiv.org/html/2605.16842#S3.SS2.7.7.7.10.3.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.6.6.6.9.3.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p2.1)\.
- \[3\]T\. Bie, H\. Chen, T\. Chen, Z\. Cheng, L\. Cui, K\. Gan, Z\. Huang, Z\. Lan, H\. Li, J\. Li, T\. Lin, Q\. Qin, H\. Wang, X\. Wang, H\. Wu, Y\. Xin, and J\. Zhao\(2026\)LLaDA2\.0\-uni: unifying multimodal understanding and generation with diffusion large language model\.arXiv preprint arXiv:2604\.20796\.Cited by:[§A\.1](https://arxiv.org/html/2605.16842#A1.SS1.p1.1),[§1](https://arxiv.org/html/2605.16842#S1.p1.1)\.
- \[4\]Black Forest Labs\(2024\)FLUX\.1\.Note:Official model releaseExternal Links:[Link](https://blackforestlabs.io/flux-1/)Cited by:[§3\.2](https://arxiv.org/html/2605.16842#S3.SS2.7.7.7.12.5.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.6.6.6.11.5.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p2.1)\.
- \[5\]H\. Cai, S\. Cao, R\. Du, P\. Gao, S\. Hoi, Z\. Hou, S\. Huang, D\. Jiang, X\. Jin, L\. Li,et al\.\(2025\)Z\-image: an efficient image generation foundation model with single\-stream diffusion transformer\.arXiv preprint arXiv:2511\.22699\.Cited by:[§1](https://arxiv.org/html/2605.16842#S1.p1.1)\.
- \[6\]S\. Cao, J\. Li, X\. Li, Y\. Pu, K\. Zhu, Y\. Gao, S\. Luo, Y\. Xin, Q\. Qin, Y\. Zhou, X\. Chen, W\. Zhang, B\. Fu, Y\. Qiao, and Y\. Liu\(2025\)UniPercept: towards unified perceptual\-level image understanding across aesthetics, quality, structure, and texture\.arXiv preprint arXiv:2512\.21675\.Cited by:[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p3.1)\.
- \[7\]X\. Chen, Z\. Wu, X\. Liu, Z\. Pan, W\. Liu, Z\. Xie, X\. Yu, and C\. Ruan\(2025\)Janus\-pro: unified multimodal understanding and generation with data and model scaling\.arXiv preprint arXiv:2501\.17811\.Cited by:[§3\.2](https://arxiv.org/html/2605.16842#S3.SS2.7.7.7.11.4.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.6.6.6.10.4.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p2.1)\.
- \[8\]Y\. Cui, H\. Chen, H\. Deng, X\. Huang, X\. Li, J\. Liu, Y\. Liu, Z\. Luo,et al\.\(2025\)Emu3\.5: native multimodal models are world learners\.arXiv preprint arXiv:2510\.26583\.Cited by:[§1](https://arxiv.org/html/2605.16842#S1.p1.1)\.
- \[9\]Z\. Geng, Y\. Wang, Y\. Ma, C\. Li, Y\. Rao, S\. Gu, Z\. Zhong, Q\. Lu, H\. Hu, X\. Zhang,et al\.\(2025\)X\-omni: reinforcement learning makes discrete autoregressive image generative models great again\.arXiv preprint arXiv:2507\.22058\.Cited by:[§1](https://arxiv.org/html/2605.16842#S1.p1.1)\.
- \[10\]D\. Ghosh, H\. Hajishirzi, and L\. Schmidt\(2023\)GenEval: an object\-focused framework for evaluating text\-to\-image alignment\.InAdvances in Neural Information Processing Systems,Vol\.36,pp\. 76341–76366\.Cited by:[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p3.1)\.
- \[11\]S\. Gong, S\. Agarwal, Y\. Zhang, J\. Ye, L\. Zheng, M\. Li, C\. An, P\. Zhao, W\. Bi, J\. Han,et al\.\(2025\)Scaling diffusion language models via adaptation from autoregressive models\.InProceedings of the International Conference on Learning Representations \(ICLR\),Cited by:[§A\.1](https://arxiv.org/html/2605.16842#A1.SS1.p1.1)\.
- \[12\]X\. Hu, R\. Wang, Y\. Fang, B\. Fu, P\. Cheng, and G\. Yu\(2024\)ELLA: equip diffusion models with llm for enhanced semantic alignment\.arXiv preprint arXiv:2403\.05135\.Note:Introduces DPG\-BenchCited by:[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p3.1)\.
- \[13\]S\. Li, J\. Gu, K\. Liu, Z\. Lin, Z\. Wei, A\. Grover, and J\. Kuen\(2025\)Lavida\-o: elastic large masked diffusion models for unified multimodal understanding and generation\.arXiv preprint arXiv:2509\.19244\.Cited by:[§1](https://arxiv.org/html/2605.16842#S1.p1.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p1.1)\.
- \[14\]D\. Liu, Y\. Xin, S\. Zhao, L\. Zhuo, W\. Lin, X\. Li, Q\. Qin, G\. Zhai, X\. Liu, H\. Li,et al\.\(2026\)Lumina\-mgpt: flexible photorealistic autoregressive text\-to\-image generation\.International Journal of Computer Vision \(IJCV\)\.Cited by:[§1](https://arxiv.org/html/2605.16842#S1.p1.1)\.
- \[15\]J\. Liu, G\. Liu, J\. Liang, Y\. Li, J\. Liu, X\. Wang, P\. Wan, D\. Zhang, and W\. Ouyang\(2025\)Flow\-grpo: training flow matching models via online rl\.arXiv preprint arXiv:2505\.05470\.Cited by:[§1](https://arxiv.org/html/2605.16842#S1.p1.1)\.
- \[16\]J\. Liu, Z\. Ye, L\. Yuan, S\. Zhu, Y\. Gao, J\. Wu, K\. Li, X\. Wang, X\. Nie, W\. Huang, and W\. Ouyang\(2026\)UniGRPO: unified policy optimization for reasoning\-driven visual generation\.arXiv preprint arXiv:2603\.23500\.Cited by:[§A\.2](https://arxiv.org/html/2605.16842#A1.SS2.p1.1),[§B\.1](https://arxiv.org/html/2605.16842#A2.SS1.p1.2),[§1](https://arxiv.org/html/2605.16842#S1.p2.1),[§2\.3](https://arxiv.org/html/2605.16842#S2.SS3.p2.9)\.
- \[17\]A\. Lou, C\. Meng, and S\. Ermon\(2024\)Discrete diffusion modeling by estimating the ratios of the data distribution\.InProceedings of the International Conference on Machine Learning \(ICML\),Cited by:[§A\.1](https://arxiv.org/html/2605.16842#A1.SS1.p1.1)\.
- \[18\]T\. Ma, M\. Zhang, Y\. Wang, and Q\. Ye\(2025\)Consolidating reinforcement learning for multimodal discrete diffusion models\.arXiv preprint arXiv:2510\.02880\.Cited by:[§A\.1](https://arxiv.org/html/2605.16842#A1.SS1.p1.1),[§A\.2](https://arxiv.org/html/2605.16842#A1.SS2.p1.1),[§B\.1](https://arxiv.org/html/2605.16842#A2.SS1.p1.2),[§1](https://arxiv.org/html/2605.16842#S1.p2.1),[§2\.3](https://arxiv.org/html/2605.16842#S2.SS3.p2.9),[§3\.2](https://arxiv.org/html/2605.16842#S3.SS2.7.7.7.16.9.1),[§3\.2](https://arxiv.org/html/2605.16842#S3.SS2.7.7.7.20.13.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.6.6.6.14.8.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.6.6.6.18.12.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p2.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p4.9),[Table 3](https://arxiv.org/html/2605.16842#S4.T3.6.6.12.6.1),[Table 3](https://arxiv.org/html/2605.16842#S4.T3.6.6.8.2.1)\.
- \[19\]Y\. Ma, X\. Wu, K\. Sun, and H\. Li\(2025\)HPSv3: towards wide\-spectrum human preference score\.arXiv preprint arXiv:2508\.03789\.Cited by:[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p3.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p4.9)\.
- \[20\]S\. Nie, F\. Zhu, Z\. You, X\. Zhang, J\. Ou, J\. Hu, J\. Zhou, Y\. Lin, J\. Wen, and C\. Li\(2025\)Large language diffusion models\.arXiv preprint arXiv:2502\.09992\.Cited by:[§A\.1](https://arxiv.org/html/2605.16842#A1.SS1.p1.1)\.
- \[21\]L\. Pan, S\. Tao, Y\. Zhai, Z\. Fu, L\. Fang, M\. He, L\. Zhang, Z\. Liu, B\. Ding, A\. Liu, and L\. Wen\(2025\)D\-treerpo: towards more reliable policy optimization for diffusion language models\.arXiv preprint arXiv:2512\.09675\.Cited by:[§A\.2](https://arxiv.org/html/2605.16842#A1.SS2.p1.1),[§B\.2](https://arxiv.org/html/2605.16842#A2.SS2.p1.1),[§1](https://arxiv.org/html/2605.16842#S1.p2.1),[§2\.3](https://arxiv.org/html/2605.16842#S2.SS3.p3.6)\.
- \[22\]D\. Podell, Z\. English, K\. Lacey, A\. Blattmann, T\. Dockhorn, J\. Müller, J\. Penna, and R\. Rombach\(2023\)SDXL: improving latent diffusion models for high\-resolution image synthesis\.arXiv preprint arXiv:2307\.01952\.Cited by:[§3\.2](https://arxiv.org/html/2605.16842#S3.SS2.7.7.7.9.2.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.6.6.6.8.2.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p2.1)\.
- \[23\]Q\. Qin, L\. Zhuo, Y\. Xin, R\. Du, Z\. Li, B\. Fu, Y\. Lu, J\. Yuan, X\. Li, D\. Liu,et al\.\(2025\)Lumina\-image 2\.0: a unified and efficient image generative framework\.Proceedings of the IEEE International Conference on Computer Vision \(ICCV\)\.Cited by:[§1](https://arxiv.org/html/2605.16842#S1.p1.1)\.
- \[24\]Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\.K\. Li, Y\. Wu, and D\. Guo\(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[Appendix D](https://arxiv.org/html/2605.16842#A4.p1.4),[§2\.2](https://arxiv.org/html/2605.16842#S2.SS2.p1.2)\.
- \[25\]Q\. Shi, J\. Bai, Z\. Zhao, W\. Chai, K\. Yu, J\. Wu, S\. Song, Y\. Tong, X\. Li, X\. Li, and S\. Yan\(2025\)Muddit: liberating generation beyond text\-to\-image with a unified discrete diffusion model\.External Links:2505\.23606,[Link](https://arxiv.org/abs/2505.23606)Cited by:[§A\.1](https://arxiv.org/html/2605.16842#A1.SS1.p1.1)\.
- \[26\]Q\. Shi, J\. Bai, Z\. Zhao, W\. Chai, K\. Yu, J\. Wu, S\. Song, Y\. Tong, X\. Li, X\. Li, and S\. Yan\(2026\)Muddit: liberating generation beyond text\-to\-image with a unified discrete diffusion model\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.16842#S1.p1.1)\.
- \[27\]M\. L\. Team, H\. Ma, H\. Tan, J\. Huang, J\. Wu, J\. He, L\. Gao, S\. Xiao, X\. Wei, X\. Ma,et al\.\(2025\)Longcat\-image technical report\.arXiv preprint arXiv:2512\.07584\.Cited by:[§1](https://arxiv.org/html/2605.16842#S1.p1.1)\.
- \[28\]Y\. Tian, L\. Yang, J\. Yang, A\. Wang, Y\. Tian, J\. Zheng, H\. Wang, Z\. Teng, Z\. Wang, Y\. Wang, Y\. Tong, M\. Wang, and X\. Li\(2025\)MMaDA\-parallel: multimodal large diffusion language models for thinking\-aware editing and generation\.arXiv preprint arXiv:2511\.09611\.Cited by:[§A\.1](https://arxiv.org/html/2605.16842#A1.SS1.p1.1)\.
- \[29\]X\. Wang, X\. Zhang, Z\. Luo, Q\. Sun, Y\. Cui, J\. Wang, F\. Zhang, Y\. Wang, Z\. Li, Q\. Yu,et al\.\(2024\)Emu3: next\-token prediction is all you need\.arXiv preprint arXiv:2409\.18869\.Cited by:[§1](https://arxiv.org/html/2605.16842#S1.p1.1)\.
- \[30\]Y\. Wang, Y\. Zang, H\. Li, C\. Jin, and J\. Wang\(2025\)Unified reward model for multimodal understanding and generation\.arXiv preprint arXiv:2503\.05236\.Cited by:[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p4.9)\.
- \[31\]Y\. Wang, L\. Yang, B\. Li, Y\. Tian, K\. Shen, and M\. Wang\(2025\)Revolutionizing reinforcement learning framework for diffusion large language models\.arXiv preprint arXiv:2509\.06949\.Cited by:[§A\.2](https://arxiv.org/html/2605.16842#A1.SS2.p1.1),[§B\.2](https://arxiv.org/html/2605.16842#A2.SS2.p1.1),[§1](https://arxiv.org/html/2605.16842#S1.p2.1),[§2\.3](https://arxiv.org/html/2605.16842#S2.SS3.p3.6),[§3\.2](https://arxiv.org/html/2605.16842#S3.SS2.7.7.7.17.10.1),[§3\.2](https://arxiv.org/html/2605.16842#S3.SS2.7.7.7.21.14.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.6.6.6.15.9.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.6.6.6.19.13.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p2.1),[Table 3](https://arxiv.org/html/2605.16842#S4.T3.6.6.13.7.1),[Table 3](https://arxiv.org/html/2605.16842#S4.T3.6.6.9.3.1)\.
- \[32\]C\. Wu, J\. Li, J\. Zhou, J\. Lin, K\. Gao, K\. Yan, S\. Yin, S\. Bai, X\. Xu, Y\. Chen,et al\.\(2025\)Qwen\-image technical report\.arXiv preprint arXiv:2508\.02324\.Cited by:[§1](https://arxiv.org/html/2605.16842#S1.p1.1)\.
- \[33\]Y\. Xin, S\. Luo, Q\. Qin, H\. Chen, K\. Zhu, Z\. Zhang, Y\. He, R\. Zhang, J\. Bai, S\. Cao,et al\.\(2025\)DMLLM\-tts: self\-verified and efficient test\-time scaling for diffusion multi\-modal large language models\.arXiv preprint arXiv:2512\.19433\.Cited by:[§A\.1](https://arxiv.org/html/2605.16842#A1.SS1.p1.1)\.
- \[34\]Y\. Xin, Q\. Qin, S\. Luo, K\. Zhu, J\. Yan, Y\. Tai, J\. Lei, Y\. Cao, K\. Wang, Y\. Wang,et al\.\(2025\)Lumina\-dimoo: an omni diffusion large language model for multi\-modal generation and understanding\.arXiv preprint arXiv:2510\.06308\.Cited by:[§A\.1](https://arxiv.org/html/2605.16842#A1.SS1.p1.1),[§1](https://arxiv.org/html/2605.16842#S1.p1.1),[§3\.2](https://arxiv.org/html/2605.16842#S3.SS2.7.7.7.19.12.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.6.6.6.17.11.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p1.1),[Table 3](https://arxiv.org/html/2605.16842#S4.T3.6.6.11.5.1)\.
- \[35\]Y\. Xin, J\. Yan, Q\. Qin, Z\. Li, D\. Liu, S\. Li, V\. S\. Huang, Y\. Zhou, R\. Zhang, L\. Zhuo,et al\.\(2025\)Lumina\-mgpt 2\.0: stand\-alone autoregressive image modeling\.arXiv preprint arXiv:2507\.17801\.Cited by:[§1](https://arxiv.org/html/2605.16842#S1.p1.1)\.
- \[36\]J\. Xu, X\. Liu, Y\. Wu, Y\. Tong, Q\. Li, M\. Ding, J\. Tang, and Y\. Dong\(2023\)ImageReward: learning and evaluating human preferences for text\-to\-image generation\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p3.1)\.
- \[37\]Z\. Xue, J\. Wu, Y\. Gao, F\. Kong, L\. Zhu, M\. Chen, Z\. Liu, W\. Liu, Q\. Guo, W\. Huang,et al\.\(2025\)DanceGRPO: unleashing grpo on visual generation\.arXiv preprint arXiv:2505\.07818\.Cited by:[§1](https://arxiv.org/html/2605.16842#S1.p1.1)\.
- \[38\]J\. Yang, G\. Chen, X\. Hu, and J\. Shao\(2025\)Taming masked diffusion language models via consistency trajectory reinforcement learning with fewer decoding step\.arXiv preprint arXiv:2509\.23924\.Cited by:[§A\.2](https://arxiv.org/html/2605.16842#A1.SS2.p1.1),[§B\.2](https://arxiv.org/html/2605.16842#A2.SS2.p1.1),[§1](https://arxiv.org/html/2605.16842#S1.p2.1),[§2\.3](https://arxiv.org/html/2605.16842#S2.SS3.p3.6)\.
- \[39\]L\. Yang, Y\. Tian, B\. Li, X\. Zhang, K\. Shen, Y\. Tong, and M\. Wang\(2025\)MMaDA: multimodal large diffusion language models\.InAdvances in Neural Information Processing Systems,Vol\.38\.Cited by:[§A\.1](https://arxiv.org/html/2605.16842#A1.SS1.p1.1),[§1](https://arxiv.org/html/2605.16842#S1.p1.1),[§3\.2](https://arxiv.org/html/2605.16842#S3.SS2.7.7.7.14.7.1),[§3\.2](https://arxiv.org/html/2605.16842#S3.SS2.7.7.7.15.8.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.6.6.6.13.7.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p1.1),[Table 3](https://arxiv.org/html/2605.16842#S4.T3.6.6.7.1.1)\.
- \[40\]Z\. You, S\. Nie, X\. Zhang, J\. Hu, J\. Zhou, Z\. Lu, J\. Wen, and C\. Li\(2025\)LLaDA\-V: large language diffusion models with visual instruction tuning\.arXiv preprint arXiv:2505\.16933\.Cited by:[§A\.1](https://arxiv.org/html/2605.16842#A1.SS1.p1.1)\.
- \[41\]Z\. You, X\. Zhang, J\. Zhou, C\. Li, and J\. Wen\(2026\)LLaDA\-o: an effective and length\-adaptive omni diffusion model\.arXiv preprint arXiv:2603\.01068\.Cited by:[§1](https://arxiv.org/html/2605.16842#S1.p1.1),[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p1.1)\.
- \[42\]Z\. You, X\. Cai, J\. Gu, T\. Xue, and C\. Dong\(2025\)Teaching large language models to regress accurate image quality scores using score distribution\.InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\),Cited by:[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p3.1)\.
- \[43\]S\. Yuan, Y\. Liu, Y\. Yue, J\. Zhang, W\. Zuo, Q\. Wang, F\. Zhang, and G\. Zhou\(2025\)AR\-grpo: training autoregressive image generation models via reinforcement learning\.arXiv preprint arXiv:2508\.06924\.Cited by:[§1](https://arxiv.org/html/2605.16842#S1.p1.1)\.
- \[44\]A\. Zhan\(2025\)Simple policy gradients for reasoning with diffusion language models\.arXiv preprint arXiv:2510\.04019\.Cited by:[§A\.2](https://arxiv.org/html/2605.16842#A1.SS2.p1.1),[§B\.2](https://arxiv.org/html/2605.16842#A2.SS2.p1.1),[§1](https://arxiv.org/html/2605.16842#S1.p2.1),[§2\.3](https://arxiv.org/html/2605.16842#S2.SS3.p3.6)\.
- \[45\]S\. Zhang, Z\. Zhang, C\. Dai, and Y\. Duan\(2026\)E\-GRPO: high entropy steps drive effective reinforcement learning for flow models\.arXiv preprint arXiv:2601\.00423\.Cited by:[§A\.2](https://arxiv.org/html/2605.16842#A1.SS2.p1.1)\.
- \[46\]S\. Zhao, D\. Gupta, Q\. Zheng, and A\. Grover\(2025\)D1: scaling reasoning in diffusion large language models via reinforcement learning\.arXiv preprint arXiv:2504\.12216\.Cited by:[§A\.2](https://arxiv.org/html/2605.16842#A1.SS2.p1.1),[§B\.1](https://arxiv.org/html/2605.16842#A2.SS1.p1.2),[§1](https://arxiv.org/html/2605.16842#S1.p2.1),[§2\.3](https://arxiv.org/html/2605.16842#S2.SS3.p2.9)\.
- \[47\]S\. Zhengwentai\(2023\-03\)clip\-score: CLIP Score for PyTorch\.Note:Version 0\.2\.1[https://github\.com/taited/clip\-score](https://github.com/taited/clip-score)Cited by:[§4\.1](https://arxiv.org/html/2605.16842#S4.SS1.p4.9)\.
- \[48\]F\. Zhu, R\. Wang, S\. Nie, X\. Zhang, C\. Wu, J\. Hu, J\. Zhou, J\. Chen, Y\. Lin, J\. Wen, and C\. Li\(2025\)LLaDA 1\.5: variance\-reduced preference optimization for large language diffusion models\.arXiv preprint arXiv:2505\.19223\.Cited by:[§A\.1](https://arxiv.org/html/2605.16842#A1.SS1.p1.1)\.
- \[49\]K\. Zhu, Q\. Zeng, Y\. Pu, S\. Cao, X\. Li, Y\. Xin, Q\. Qin, J\. Li, Y\. Qiao, J\. Gu, and Y\. Liu\(2025\)Accelerating masked image generation by learning latent controlled dynamics\.arXiv preprint arXiv:2602\.23996\.Cited by:[§A\.1](https://arxiv.org/html/2605.16842#A1.SS1.p1.1)\.

Supplementary Material

## Appendix A Related Work

### A.1 Diffusion Multi-Modal Large Language Models

Discrete diffusion modeling is emerging as a promising paradigm with the potential to replace traditional autoregressive (AR) modeling. Early discrete diffusion methods \[[1](https://arxiv.org/html/2605.16842#bib.bib72),[17](https://arxiv.org/html/2605.16842#bib.bib73)\] built the foundation for this shift by establishing the core principles of token-level denoising. Building on these methods, diffusion large language models (dLLMs) \[[11](https://arxiv.org/html/2605.16842#bib.bib79),[20](https://arxiv.org/html/2605.16842#bib.bib78),[48](https://arxiv.org/html/2605.16842#bib.bib77)\] show that text generation can be treated as a parallel and iterative denoising process. Expanding this framework to multi-modal domains, LLaDA-V \[[40](https://arxiv.org/html/2605.16842#bib.bib75)\] used visual instruction tuning to show that diffusion models can effectively combine visual and textual data. Driven by this progress, recent studies focus on developing unified diffusion multi-modal large language models (dMLLMs) \[[39](https://arxiv.org/html/2605.16842#bib.bib36),[25](https://arxiv.org/html/2605.16842#bib.bib74),[34](https://arxiv.org/html/2605.16842#bib.bib38),[3](https://arxiv.org/html/2605.16842#bib.bib61)\], which further extend image generation capabilities. This success has also inspired research on downstream applications for dMLLMs. Key areas include reinforcement learning algorithm design \[[18](https://arxiv.org/html/2605.16842#bib.bib27),[39](https://arxiv.org/html/2605.16842#bib.bib36)\], image editing \[[28](https://arxiv.org/html/2605.16842#bib.bib82)\], test-time scaling \[[33](https://arxiv.org/html/2605.16842#bib.bib80)\], and sampling acceleration \[[49](https://arxiv.org/html/2605.16842#bib.bib81)\].

### A.2 Reinforcement Learning Alignment for dMLLMs

Applying Reinforcement Learning (RL) to discrete diffusion models introduces unique challenges, primarily because dynamic masking breaks the standard policy gradient assumption. To bypass this, existing work generally follows one of two directions. Random remasking methods \[[18](https://arxiv.org/html/2605.16842#bib.bib27),[46](https://arxiv.org/html/2605.16842#bib.bib29),[16](https://arxiv.org/html/2605.16842#bib.bib30)\] approximate intermediate states by randomly masking tokens. While this keeps the efficiency of a single forward pass, the approximated contexts often deviate from the actual denoising trajectory. Conversely, trajectory recording methods \[[31](https://arxiv.org/html/2605.16842#bib.bib44),[38](https://arxiv.org/html/2605.16842#bib.bib31),[21](https://arxiv.org/html/2605.16842#bib.bib32),[44](https://arxiv.org/html/2605.16842#bib.bib11)\] cache the entire denoising process to build a precise Markov Decision Process (MDP). However, this rigor comes at a heavy price, as the computational and memory costs scale linearly with the number of denoising steps. Crucially, both paradigms share a fundamental flaw: they uniformly assign the final reward to all tokens, ignoring prior insights \[[45](https://arxiv.org/html/2605.16842#bib.bib26)\] that early, high-entropy steps matter more for generation quality.

## Appendix B A Unified Theoretical Framework for dMLLM RL Methods

This section formalizes the two existing approaches described in Section [2](https://arxiv.org/html/2605.16842#S2), random remasking methods and trajectory recording methods, together with HT-GRPO. We use $g$, $i$, $k$, and $c$ for the rollout index, token position, inner-loop update index, and text prompt condition, respectively. All other symbols follow the notation in the main text.

As established in Section [2.3](https://arxiv.org/html/2605.16842#S2.SS3), the surrogate ideal objective $\mathcal{J}^{*}$ (Eq. ([4](https://arxiv.org/html/2605.16842#S2.E4))) requires averaging over all valid denoising trajectories $\tau\in\mathcal{T}$, which is combinatorially intractable. All three paradigms therefore share the same clipped surrogate objective and differ only in three design choices: the conditioning context $\mathbf{C}_{g,i}^{(k)}$, the optimization support set $\mathcal{M}_{g}^{(k)}$, and the token-level weight $w_{g,i}$. Setting these to the choices in Section [3](https://arxiv.org/html/2605.16842#S3) recovers HT-GRPO exactly.

$$\mathcal{L}^{(k)}(\theta)=\frac{1}{G}\sum_{g=1}^{G}\frac{1}{|\mathcal{M}_{g}^{(k)}|}\sum_{i\in\mathcal{M}_{g}^{(k)}}\min\left(r_{g,i}^{(k)}(\theta)\,\tilde{A}_{g,i},\;\mathrm{clip}\left(r_{g,i}^{(k)}(\theta),1-\epsilon,1+\epsilon\right)\tilde{A}_{g,i}\right),\tag{12}$$

where

$$r_{g,i}^{(k)}(\theta)=\frac{\pi_{\theta}\left(v_{g,i}\mid\mathbf{C}_{g,i}^{(k)},c\right)}{\pi_{\theta_{\mathrm{old}}}\left(v_{g,i}\mid\mathbf{C}_{g,i}^{(k)},c\right)}.\tag{13}$$

Here, $k\in\{1,\ldots,K\}$ indexes the inner-loop gradient step, and the same rollout batch is reused for all $K$ updates. $v_{g,i}$ is the token value generated at position $i$ in rollout $g$. $\mathcal{M}_{g}^{(k)}$ is the optimization support set, i.e., the set of token positions that contribute gradients in the $k$-th inner-loop update. The conditioning context $\mathbf{C}_{g,i}^{(k)}$ is the partially observed token configuration used to evaluate the importance ratio for position $i$ in update $k$. The token-level weighted advantage $\tilde{A}_{g,i}=w_{g,i}A_{g}$ scales the group-relative advantage $A_{g}$ by a per-token credit weight $w_{g,i}$, where

$$A_{g}=\frac{R_{g}-\mathrm{mean}(\{R_{j}\})}{\mathrm{std}(\{R_{j}\})+\delta_{m}}.\tag{14}$$

The three methods differ only in how they instantiate these three quantities.
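
The group-relative advantage in Eq. (14) can be illustrated with a short Python sketch. The function name is hypothetical, the use of the population standard deviation is an assumption (the text does not say which estimator is used), and `delta=1e-6` is an assumed stand-in for $\delta_{m}$.

```python
import statistics


def group_relative_advantages(rewards, delta=1e-6):
    """Standardize the rewards of one prompt's rollout group (Eq. 14).

    Assumptions: population standard deviation and delta=1e-6; neither
    choice is specified in the surrounding text.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # delta keeps the division finite when all rewards are equal.
    return [(r - mean) / (std + delta) for r in rewards]


# Toy rewards R_j for a group of three rollouts.
advantages = group_relative_advantages([0.2, 0.5, 0.8])
```

Rollouts scoring above the group mean receive positive advantages and those below receive negative ones. When every rollout gets the same reward, all advantages are zero and the group contributes no gradient signal.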

### B.1 Random Remasking Methods

Random remasking methods, including MaskGRPO \[[18](https://arxiv.org/html/2605.16842#bib.bib27)\], D1 \[[46](https://arxiv.org/html/2605.16842#bib.bib29)\], and UniGRPO \[[16](https://arxiv.org/html/2605.16842#bib.bib30)\], retain only the final image $\mathbf{x}_{g}$ and construct synthetic contexts by independently remasking each position with probability $\gamma_{k}$:

$$\begin{aligned}
\mathcal{M}_{g}^{(k)} &= \bigl\{i : i \text{ is remasked with prob. } \gamma_{k}\bigr\} && \text{(support set)},\\
\mathbf{C}_{g,i}^{(k)} &= \mathrm{Mask}(\mathbf{x}_{g},\mathcal{M}_{g}^{(k)}) && \text{(mask $\mathcal{M}_{g}^{(k)}$; retain token values elsewhere)},\\
w_{g,i} &= 1 && \text{(uniform weight)}.
\end{aligned} \tag{15}$$

Stage conflation. Because remasking is independent of the generation-order rank $\rho_{g,i}$, the support set $\mathcal{M}_{g}^{(k)}$ mixes structural tokens (small $\rho_{g,i}$) and refinement tokens (large $\rho_{g,i}$) in the same update with equal probability. By [Proposition C.1](https://arxiv.org/html/2605.16842#A3.SS0.SSS0.Px1), these two groups carry fundamentally different levels of uncertainty, so their gradients operate at very different scales within a single update step.

Future-token contamination. The context $\mathbf{C}_{g,i}^{(k)}$ retains all positions not in $\mathcal{M}_{g}^{(k)}$. For position $i$, define the future token set as $\mathcal{F}_{i}^{(g)}=\{j:\rho_{g,j}>\rho_{g,i}\}$. Since remasking ignores generation order, each position in $\mathcal{F}_{i}^{(g)}$ is retained in $\mathbf{C}_{g,i}^{(k)}$ with probability $1-\gamma_{k}$, potentially exposing tokens that had not yet been generated when position $i$ was predicted. [Proposition C.2](https://arxiv.org/html/2605.16842#A3.SS0.SSS0.Px3) shows this occurs with strictly positive probability for any non-final position.

Multi-path coverage (partial). Different random masks induce diverse conditioning contexts, exploring a broader range of token-context combinations than a single fixed trajectory. Random remasking is agnostic to generation order, however, so the resulting contexts are not guaranteed to correspond to any valid denoising trajectory. Because the context may include future tokens, as shown in the contamination analysis above, the diverse paths explored do not respect the causal ordering of the original generation process. The approximation of $\mathcal{J}^{*}$ is therefore partial: broader in context diversity than a single trajectory, but biased away from the causal structure that $\mathcal{J}^{*}$ requires.

Uniform token reward (unresolved). All positions receive the same credit weight $w_{g,i}=1$ regardless of their generation-order rank. Structural and refinement tokens carry fundamentally different entropy levels and play different roles in image generation, as established in Section [2.3](https://arxiv.org/html/2605.16842#S2.SS3). Equal credit assignment ignores this asymmetry entirely.

### B.2 Trajectory Recording Methods

Trajectory recording methods, including TraceRL \[[31](https://arxiv.org/html/2605.16842#bib.bib44)\], CJ-GRPO \[[38](https://arxiv.org/html/2605.16842#bib.bib31)\], d-TreeRPO \[[21](https://arxiv.org/html/2605.16842#bib.bib32)\], and AGRPO \[[44](https://arxiv.org/html/2605.16842#bib.bib11)\], record the denoising order and evaluate each token under its trajectory-consistent conditioning context:

$$\begin{aligned}
\mathcal{M}_{g}^{(k)} &= \{1,\ldots,N\} && \text{(all $N$ tokens per update)},\\
\mathbf{C}_{g,i}^{(k)} &= \mathbf{x}_{g}^{(\prec i)} && \text{(reveals $\{j:\rho_{g,j}<\rho_{g,i}\}$; remaining positions masked)},\\
w_{g,i} &= 1 && \text{(uniform weight)}.
\end{aligned} \tag{16}$$

Stage conflation (avoided). Unlike random remasking methods, which apply the same synthetic context to all tokens in $\mathcal{M}_{g}^{(k)}$ regardless of generation rank, trajectory recording methods set $\mathcal{M}_{g}^{(k)}=\{1,\ldots,N\}$ and assign each token its own trajectory-consistent context $\mathbf{C}_{g,i}^{(k)}=\mathbf{x}_{g}^{(\prec i)}$. Structural tokens therefore receive sparse contexts reflecting their early generation stage, while refinement tokens receive progressively richer contexts. Each importance ratio is computed under a context that matches the entropy level of the corresponding token, avoiding the uniform-context mismatch that characterizes stage conflation.

Future-token contamination (avoided). The trajectory-consistent context $\mathbf{C}_{g,i}^{(k)}=\mathbf{x}_{g}^{(\prec i)}$ reveals only positions with $\rho_{g,j}<\rho_{g,i}$. All positions in the future token set $\mathcal{F}_{i}^{(g)}=\{j:\rho_{g,j}>\rho_{g,i}\}$ are masked by construction, so future-token contamination cannot occur.
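The difference between the two context constructions can be illustrated on a toy rollout. The sketch below is an assumption-laden illustration: the `MASK` sentinel, the list-based token layout, and all function names are hypothetical and not taken from any of the cited codebases.

```python
import random

MASK = None  # hypothetical sentinel standing in for the mask token


def random_remask_context(tokens, gamma, rng):
    """Eq. (15): remask each position independently with probability gamma.

    Returns the support set and the single shared context.
    """
    support = [i for i in range(len(tokens)) if rng.random() < gamma]
    hidden = set(support)
    context = [MASK if i in hidden else t for i, t in enumerate(tokens)]
    return support, context


def trajectory_context(tokens, rank, i):
    """Eq. (16): reveal only positions unmasked strictly before position i."""
    return [t if rank[j] < rank[i] else MASK for j, t in enumerate(tokens)]


def leaked_future(context, rank, i):
    """Positions in the future-token set of i that are visible in the context."""
    return [j for j, t in enumerate(context)
            if t is not MASK and rank[j] > rank[i]]
```

With `tokens = [10, 11, 12, 13]` and `rank = [3, 1, 4, 2]`, `trajectory_context(tokens, rank, 0)` reveals only positions 1 and 3, so `leaked_future` returns an empty list for it, whereas a random-remask context can leave position 2 (rank 4) visible.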

Limited path coverage. Both $\mathcal{M}_{g}^{(k)}$ and $\mathbf{C}_{g,i}^{(k)}$ are derived from a single rollout trajectory, so each update evaluates only one element of the expectation in Eq. ([4](https://arxiv.org/html/2605.16842#S2.E4)). The many other valid trajectories $\tau\in\mathcal{T}$ that could produce the same image are never seen, restricting the diversity of generation paths the model learns from.

Uniform token reward (unresolved). All positions receive the same credit weight $w_{g,i}=1$. Although the trajectory-consistent context $\mathbf{x}_{g}^{(\prec i)}$ naturally differentiates the conditioning richness of structural versus refinement tokens, the advantage broadcast to each token remains uniform. There is no explicit mechanism to amplify the learning signal for high-uncertainty structural tokens or attenuate it for near-deterministic refinement tokens.

### B.3 HT-GRPO

HT-GRPO retains the final sample together with the generation-order rank $\rho_{g,i}\in\{1,\ldots,N\}$, which records the unmasking step at which position $i$ was revealed in rollout $g$, with $\rho_{g,i}=1$ for the first token and $\rho_{g,i}=N$ for the last. Positions unmasked in the same denoising step share the same context and entropy level despite receiving different rank values. HT-GRPO partitions tokens into global, structural, and refinement groups:

$$\begin{aligned}
\mathcal{S}_{g,\mathrm{global}} &= \{1,\ldots,N\},\\
\mathcal{S}_{g,\mathrm{structure}} &= \{i:\rho_{g,i}\leq N_{s}\},\\
\mathcal{S}_{g,\mathrm{refinement}} &= \{i:\rho_{g,i}>N_{s}\},
\end{aligned} \tag{17}$$

where $N_{s}=\lfloor\alpha N\rfloor$ and $K=n_{\mathrm{global}}+n_{\mathrm{structure}}+n_{\mathrm{refinement}}$ is the total number of inner-loop updates. The $K$ steps follow a fixed Global $\to$ Structure $\to$ Refinement schedule, and the support set at step $k$ is a random subset drawn from the active stage:

$$\mathcal{M}_{g}^{(k)}\subseteq\begin{cases}\mathcal{S}_{g,\mathrm{global}},&1\leq k\leq n_{\mathrm{global}},\\ \mathcal{S}_{g,\mathrm{structure}},&n_{\mathrm{global}}<k\leq n_{\mathrm{global}}+n_{\mathrm{structure}},\\ \mathcal{S}_{g,\mathrm{refinement}},&n_{\mathrm{global}}+n_{\mathrm{structure}}<k\leq K.\end{cases} \tag{18}$$

All tokens share the same fully masked conditioning context, defined as the state in which all $N$ image positions remain masked:

$$\mathbf{C}_{g,i}^{(k)}\equiv\mathbf{C}_{\emptyset}. \tag{19}$$

Token weights are generation-order-aware rather than uniform:

$$\begin{aligned}
\tilde{A}_{g,i} &= w_{g,i}A_{g},\\
w_{g,i} &= \begin{cases}\lambda_{\mathrm{s}}, & i\in\mathcal{S}_{g,\mathrm{structure}},\\ \lambda_{\mathrm{r}}, & i\in\mathcal{S}_{g,\mathrm{refinement}}.\end{cases}
\end{aligned} \tag{20}$$
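
The partition in Eq. (17), the stage schedule in Eq. (18), and the weights in Eq. (20) can be sketched as follows. This is an illustrative sketch only: positions are 0-indexed while ranks stay 1-indexed as in the text, and the names `partition`, `active_stage`, and `credit_weights` are hypothetical. The default weights 1.5 and 0.5 are the values reported in Table 6.

```python
import math


def partition(rank, alpha):
    """Eq. (17): split positions by generation-order rank (ranks are 1..N)."""
    n = len(rank)
    n_s = math.floor(alpha * n)
    return {
        "global": list(range(n)),
        "structure": [i for i, r in enumerate(rank) if r <= n_s],
        "refinement": [i for i, r in enumerate(rank) if r > n_s],
    }


def active_stage(k, n_global, n_structure, n_refinement):
    """Eq. (18): stage that owns inner-loop step k (1-indexed)."""
    if not 1 <= k <= n_global + n_structure + n_refinement:
        raise ValueError("k must lie in 1..K")
    if k <= n_global:
        return "global"
    if k <= n_global + n_structure:
        return "structure"
    return "refinement"


def credit_weights(rank, alpha, lam_s=1.5, lam_r=0.5):
    """Eq. (20): lambda_s for structural tokens, lambda_r for refinement tokens."""
    n_s = math.floor(alpha * len(rank))
    return [lam_s if r <= n_s else lam_r for r in rank]
```

Every position belongs to exactly one of the structure and refinement groups, so the same weights apply during the global stage as well.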

Stage conflation (resolved). At each inner-loop step $k$, the support set $\mathcal{M}_{g}^{(k)}$ is drawn exclusively from the token set of the active stage. The Global $\to$ Structure $\to$ Refinement schedule partitions the $K$ updates by generation rank, so high-entropy structural tokens and low-entropy refinement tokens are never placed in the same $\mathcal{M}_{g}^{(k)}$. The entropy gap between these two groups is established in [Proposition C.1](https://arxiv.org/html/2605.16842#A3.SS0.SSS0.Px1). This resolves the stage conflation identified in Section [2.3](https://arxiv.org/html/2605.16842#S2.SS3).

Future-token contamination (eliminated). Setting $\mathbf{C}_{g,i}^{(k)}\equiv\mathbf{C}_{\emptyset}$ masks all $N$ positions uniformly for every token and every update, making future-token exposure impossible by construction. Contamination is eliminated without requiring trajectory storage, as formalized in [Proposition C.2](https://arxiv.org/html/2605.16842#A3.SS0.SSS0.Px3).

Limited path coverage (alleviated). Within each stage, $\mathcal{M}_{g}^{(k)}$ is re-sampled independently at every inner-loop step according to an annealed sampling rate $\gamma_{k_{s}}^{(s)}$ (Appendix [F.1](https://arxiv.org/html/2605.16842#A6.SS1)). Different token subsets receive gradient signal across the $K$ updates, providing Monte Carlo coverage of the token space within each stage and partially alleviating the single-trajectory limitation. Because $\mathbf{C}_{\emptyset}$ decouples the importance ratio from any particular ordering, this intra-stage sampling introduces no causal inconsistency.
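One way the intra-stage re-sampling could look is sketched below. The text does not specify whether the subset is drawn by independent per-token Bernoulli trials or as a fixed-size sample; this sketch assumes the former, and the fallback to a single random position (so that the $1/|\mathcal{M}_{g}^{(k)}|$ normalization stays defined) is likewise an assumption. The function name `sample_support` is hypothetical.

```python
import random


def sample_support(stage_positions, gamma, rng):
    """Draw a random subset of the active stage's positions at rate gamma.

    Assumption: independent Bernoulli(gamma) inclusion per position, with a
    fallback to one random position if the draw comes back empty.
    """
    chosen = [i for i in stage_positions if rng.random() < gamma]
    if not chosen:
        chosen = [rng.choice(stage_positions)]
    return chosen
```

Calling this once per inner-loop step with a fresh draw gives different token subsets across the steps of a stage, which is the Monte Carlo coverage described above.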

Uniform token reward (resolved). The per-token weight $w_{g,i}$ assigns $\lambda_{\mathrm{s}}>1$ to structural tokens and $\lambda_{\mathrm{r}}<1$ to refinement tokens, replacing the uniform $w_{g,i}=1$ used by both existing families. This amplifies gradient updates on global composition and attenuates updates on near-deterministic local detail, following the rationale in Section [3.2](https://arxiv.org/html/2605.16842#S3.SS2). Trajectory recording methods avoid stage conflation and future-token contamination but leave credit assignment uniform. HT-GRPO is therefore the only paradigm that addresses all four limitations.

### B.4 Comparison

Table [4](https://arxiv.org/html/2605.16842#A2.T4) summarizes the three paradigms along eight dimensions. The formal basis for the entropy-related entries is provided in Section [C](https://arxiv.org/html/2605.16842#A3) ([Propositions C.1](https://arxiv.org/html/2605.16842#A3.SS0.SSS0.Px1)–[C.3](https://arxiv.org/html/2605.16842#A3.SS0.SSS0.Px4)). Among the four limitations identified in Section [2.3](https://arxiv.org/html/2605.16842#S2.SS3), random remasking methods introduce stage conflation and future-token contamination while achieving only partial path coverage. Trajectory recording methods avoid the first two but sacrifice path diversity and require $O(T)$ forward passes per rollout. Neither family addresses the uniform token reward problem. HT-GRPO resolves all four: inter-stage partitioning eliminates stage conflation, $\mathbf{C}_{\emptyset}$ eliminates future-token contamination, intra-stage random subsets partially alleviate limited path coverage, and generation-order-aware weights $w_{g,i}$ replace uniform credit assignment.

Table 4: Comparison of three dMLLM RL paradigms under a unified formulation. Limitation rows correspond to the four issues identified in Section [2.3](https://arxiv.org/html/2605.16842#S2.SS3).

| Dimension | Random Remasking | Trajectory Recording | HT-GRPO |
|---|---|---|---|
| *Design choices* | | | |
| Support set $\mathcal{M}_{g}^{(k)}$ | Remasked positions | All $N$ tokens | Random subset of active stage |
| Conditioning context $\mathbf{C}_{g,i}^{(k)}$ | Randomly retained tokens after remasking | True rollout state $\mathbf{x}_{g}^{(\prec i)}$ | Fully masked state $\mathbf{C}_{\emptyset}$ |
| Token weight $w_{g,i}$ | Uniform ($=1$) | Uniform ($=1$) | Generation-order-aware ($\lambda_{\mathrm{s}}>1,\,\lambda_{\mathrm{r}}<1$) |
| *Operational cost* | | | |
| Trajectory storage | Final image only | All $T$ intermediate states | Generation-order ranks $\rho_{g,i}$ |
| Forward passes per inner loop | $1$ | $T$ | $1$ |
| *Limitation analysis* | | | |
| Stage conflation | Yes | No | Resolved |
| Future-token contamination | Yes | No | Eliminated |
| Multi-path coverage | Partial (causally inconsistent) | Limited (single trajectory) | Alleviated (intra-stage sampling) |
| Uniform token reward | Unresolved | Unresolved | Resolved |

## Appendix C Information-Theoretic Basis of Hierarchical Token Grouping and the Prompt-Conditioned Estimator

We establish three formal results underpinning HT-GRPO’s design.

##### Assumption (Approximate Consistency).

Let $V_{i}$ be the token at position $i$, $p_{\theta}$ the joint generation distribution, and $\mathbf{C}^{[n]}$ any partially revealed context with $n$ positions unmasked and position $i$ still masked. We assume the model’s prediction matches the true conditional:

$$\pi_{\theta}\!\left(\cdot\mid\mathbf{C}^{[n]},c\right)=p_{\theta}\!\left(\cdot\mid\mathbf{C}^{[n]},c\right). \tag{21}$$

This connects the model’s output to standard information-theoretic quantities. We write $H(\pi_{\theta}(\cdot\mid\mathbf{C},c))$ for the predictive entropy, which measures how uncertain the model is about position $i$ given visible context $\mathbf{C}$. [Propositions C.1](https://arxiv.org/html/2605.16842#A3.SS0.SSS0.Px1) and [C.3](https://arxiv.org/html/2605.16842#A3.SS0.SSS0.Px4) additionally require that $\mathbf{C}^{[n']}$, where $n'>n$, is obtained from $\mathbf{C}^{[n]}$ by revealing more tokens along the same trajectory while keeping position $i$ masked.

##### Proposition C.1 (Entropy Monotonicity).

Seeing more context reduces uncertainty. Under Assumption ([21](https://arxiv.org/html/2605.16842#A3.E21)), for any position $i$ and any $0\leq n<n'\leq N-1$,

$$\mathbb{E}_{\mathbf{C}^{[n]}}\!\left[H\!\left(\pi_{\theta}(\cdot\mid\mathbf{C}^{[n]},c)\right)\right]\geq\mathbb{E}_{\mathbf{C}^{[n']}}\!\left[H\!\left(\pi_{\theta}(\cdot\mid\mathbf{C}^{[n']},c)\right)\right]. \tag{22}$$

Proof. Let $\mathbf{D}$ be the additional tokens revealed between $\mathbf{C}^{[n]}$ and $\mathbf{C}^{[n']}$. Since mutual information is non-negative, $I(V_{i};\mathbf{D}\mid\mathbf{C}^{[n]},c)=H(V_{i}\mid\mathbf{C}^{[n]},c)-H(V_{i}\mid\mathbf{C}^{[n']},c)\geq 0$. Applying Assumption ([21](https://arxiv.org/html/2605.16842#A3.E21)) to both sides yields the stated inequality. $\square$

Tokens unmasked early (small $\rho_{g,i}$, few context tokens visible) therefore carry higher uncertainty than those unmasked late. Tokens sharing the same denoising step share the same context and entropy level despite having different $\rho_{g,i}$ values. This justifies treating early tokens as structural tokens and late tokens as refinement tokens ($\mathcal{S}_{g,\mathrm{structure}}$, $\mathcal{S}_{g,\mathrm{refinement}}$ in Section [3.1](https://arxiv.org/html/2605.16842#S3.SS1)).
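
The inequality behind Proposition C.1 (conditioning cannot increase entropy in expectation) can be checked numerically on a toy joint distribution. The sketch below uses a hypothetical two-variable table `joint[a][b]` for $P(V_{i}=a,\mathbf{D}=b)$; it illustrates the information-theoretic fact only and says nothing about any trained model.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def marginal_and_conditional_entropy(joint):
    """Return (H(V), H(V | D)) for joint[a][b] = P(V = a, D = b)."""
    num_v, num_d = len(joint), len(joint[0])
    p_v = [sum(row) for row in joint]
    p_d = [sum(joint[a][b] for a in range(num_v)) for b in range(num_d)]
    h_cond = 0.0
    for b, pb in enumerate(p_d):
        if pb > 0:
            # Entropy of V given D = b, weighted by P(D = b).
            h_cond += pb * entropy([joint[a][b] / pb for a in range(num_v)])
    return entropy(p_v), h_cond
```

For the correlated table `[[0.4, 0.1], [0.1, 0.4]]` the marginal entropy is 1 bit while the conditional entropy is about 0.72 bits; for an independent table the two coincide, which is the equality case of the proposition.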

##### Definition (Future-token set).

For rollout $g$ and position $i$, define $\mathcal{F}_{i}^{(g)}=\{j:\rho_{g,j}>\rho_{g,i}\}$ as the set of tokens unmasked *after* $i$. These tokens are still masked when $i$ is predicted; if any appear in the conditioning context, they inject future information into the current optimization step.

##### Proposition C.2 (Future-Token Contamination).

Random remasking exposes future tokens with nonzero probability, misspecifying the conditioning distribution. In random remasking methods, each position is independently remasked with probability $p_{k}\in(0,1)$. For any non-final position $i$, at least one future token from $\mathcal{F}_{i}^{(g)}$ remains visible in $\mathbf{C}_{g,i}^{(k)}$ with probability

$$\Pr\!\bigl[\mathbf{C}_{g,i}^{(k)}\not\equiv\mathbf{x}_{g}^{(\prec i)}\bigr]=1-p_{k}^{|\mathcal{F}_{i}^{(g)}|}>0. \tag{23}$$

In the true trajectory context $\mathbf{x}_{g}^{(\prec i)}$ all future positions are masked; when contamination occurs, the ratio $\tilde{r}_{g,i}(\theta)=\pi_{\theta}(v_{g,i}\mid\mathbf{C}_{g,i}^{(k)},c)/\pi_{\theta_{\mathrm{old}}}(v_{g,i}\mid\mathbf{C}_{g,i}^{(k)},c)$ is evaluated under a causally inconsistent context, leading to a misspecified policy gradient estimator.

Proof. The $|\mathcal{F}_{i}^{(g)}|$ future positions are remasked independently; all are correctly masked with probability $p_{k}^{|\mathcal{F}_{i}^{(g)}|}$, so the complement gives the contamination probability. In $\mathbf{x}_{g}^{(\prec i)}$, all future positions are masked by definition, confirming the context mismatch. $\square$

Thus, random remasking methods evaluate the likelihood ratio under a conditioning distribution that differs from the trajectory-consistent one.
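
The closed form in Eq. (23) is easy to sanity-check by simulation. The sketch below is illustrative; the function names and the seeded Monte Carlo routine are assumptions introduced here, not part of the paper.

```python
import random


def contamination_probability(p_remask, num_future):
    """Eq. (23): probability that at least one future token stays visible."""
    return 1.0 - p_remask ** num_future


def simulate_contamination(p_remask, num_future, trials, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # A future token stays visible exactly when it is NOT remasked.
        if any(rng.random() >= p_remask for _ in range(num_future)):
            hits += 1
    return hits / trials
```

For example, with $p_{k}=0.5$ and three future tokens the closed form gives $1-0.5^{3}=0.875$, so contamination is the typical case rather than a rare event for early positions.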

Remark. Let $r^{*}_{g,i}(\theta)=\pi_{\theta}(v_{g,i}\mid\mathbf{x}_{g}^{(\prec i)},c)/\pi_{\theta_{\mathrm{old}}}(v_{g,i}\mid\mathbf{x}_{g}^{(\prec i)},c)$ denote the trajectory-consistent ratio, where $\mathbf{x}_{g}^{(\prec i)}$ is the true denoising state just before token $i$ is revealed (Section [B](https://arxiv.org/html/2605.16842#A2)). The claim above concerns the *conditioning distribution*, not a guaranteed pointwise inequality: $\tilde{r}_{g,i}$ and $r^{*}_{g,i}$ could coincide numerically for specific $\theta$, but whenever contamination occurs the estimator is structurally inconsistent with the trajectory-consistent objective.

Unlike $\mathbf{x}_{g}^{(\prec i)}$, $\mathbf{C}_{\emptyset}$ additionally masks past tokens, so HT-GRPO does not recover $r^{*}_{g,i}$. This is a deliberate design choice: exact computation of $r^{*}_{g,i}$ requires marginalizing over all mask configurations consistent with the generation order, which is combinatorially intractable (Section [3](https://arxiv.org/html/2605.16842#S3)). HT-GRPO instead defines a *unified* estimator using the same context $\mathbf{C}_{\emptyset}$ for every token:

$$\hat{r}_{g,i}(\theta)=\frac{\pi_{\theta}(v_{g,i}\mid\mathbf{C}_{\emptyset},c)}{\pi_{\theta_{\mathrm{old}}}(v_{g,i}\mid\mathbf{C}_{\emptyset},c)}. \tag{24}$$

This estimator has three properties. First, it is contamination-free by construction. Second, it is self-consistent: the conditioning context is identical for both policies, ensuring a well-defined likelihood ratio. Third, it is entropy-preserving: predictive entropy under $\mathbf{C}_{\emptyset}$ is not smaller in expectation than under contexts that condition on additional structural token information ([Proposition C.3](https://arxiv.org/html/2605.16842#A3.SS0.SSS0.Px4)), which maintains reward variance and prevents GRPO advantage collapse.
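
As a minimal sketch of how Eq. (24) might be evaluated for one position, assume each policy has produced a vector of vocabulary logits at that position from the same fully masked, prompt-conditioned input. The helper names and the plain-list logits are hypothetical; a real implementation would operate on batched tensors.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def unified_ratio(logits_new, logits_old, token):
    """Eq. (24): pi_theta(v | C_empty, c) / pi_theta_old(v | C_empty, c).

    Both logit vectors are assumed to come from the same fully masked
    context, which is what makes the ratio self-consistent.
    """
    return math.exp(log_softmax(logits_new)[token] - log_softmax(logits_old)[token])
```

Because the context is the same for every token, the logits for all positions can be read off a single forward pass per policy, which matches the cost of 1 listed for HT-GRPO in Table 4.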

##### Proposition C.3 (Entropy Lower Bound under Full Masking).

Revealing structural tokens reduces uncertainty over refinement tokens. Let $\mathbf{S}=\{V_{j}:j\in\mathcal{S}_{g,\mathrm{structure}}\}$ denote the structural token values and $\mathbf{x}^{\mathrm{structure}}(\mathbf{S})$ the context revealing only those positions. For any refinement token $i\notin\mathcal{S}_{g,\mathrm{structure}}$,

$$H\!\left(\pi_{\theta}(\cdot\mid\mathbf{C}_{\emptyset},c)\right)\geq\mathbb{E}_{\mathbf{S}}\!\left[H\!\left(\pi_{\theta}(\cdot\mid\mathbf{x}^{\mathrm{structure}}(\mathbf{S}),c)\right)\right]. \tag{25}$$

Proof. By Assumption ([21](https://arxiv.org/html/2605.16842#A3.E21)), the left side equals $H(V_{i}\mid c)$ and the right side equals $H(V_{i}\mid\mathbf{S},c)$. Non-negativity of $I(V_{i};\mathbf{S}\mid c)=H(V_{i}\mid c)-H(V_{i}\mid\mathbf{S},c)\geq 0$ gives the result. $\square$

In practice, once structural layout is fixed, refinement tokens become nearly deterministic: most rollouts look alike, reward variance collapses, and the GRPO advantage shrinks toward zero. This is conditional low-entropy degradation (Section [3](https://arxiv.org/html/2605.16842#S3)). Using $\mathbf{C}_{\emptyset}$ avoids revealing any structural layout, keeping entropy high and preserving the learning signal.

## Appendix D Standard GRPO Objective

Given a prompt $q$, GRPO \[[24](https://arxiv.org/html/2605.16842#bib.bib46)\] samples $G$ complete outputs $\{o_{i}\}_{i=1}^{G}$ from the behavior policy $\pi_{\theta_{\mathrm{old}}}$ and maximizes the clipped surrogate objective:

$$\begin{aligned}
\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{q,\,\{o_{i}\}\sim\pi_{\theta_{\mathrm{old}}}}\Bigg[\,&\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_{i}|}\sum_{t=1}^{|o_{i}|}\min\Big(r_{i,t}(\theta)\,\hat{A}_{i},\;\mathrm{clip}\!\left(r_{i,t}(\theta),1-\epsilon,1+\epsilon\right)\hat{A}_{i}\Big)\\
&-\;\beta\,\mathbb{D}_{\mathrm{KL}}\!\left(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\right)\Bigg],
\end{aligned} \tag{26}$$

where the importance ratio

$$r_{i,t}(\theta)=\frac{\pi_{\theta}\!\left(o_{i,t}\mid q,\,o_{i,<t}\right)}{\pi_{\theta_{\mathrm{old}}}\!\left(o_{i,t}\mid q,\,o_{i,<t}\right)} \tag{27}$$

measures the relative likelihood of token $o_{i,t}$ under the current policy versus the behavior policy, given the same causal prefix $o_{i,<t}$. The group-relative advantage

$$\hat{A}_{i}=\frac{R_{i}-\mathrm{mean}\!\left(\{R_{j}\}_{j=1}^{G}\right)}{\mathrm{std}\!\left(\{R_{j}\}_{j=1}^{G}\right)+\delta} \tag{28}$$

is a scalar derived from the outcome reward $R_{i}$ and broadcast uniformly to all tokens in output $o_{i}$. In practice, the same sampled batch $\{o_{i}\}$ is reused for $K$ successive gradient steps, with all importance ratios computed relative to the fixed behavior policy $\pi_{\theta_{\mathrm{old}}}$.
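
A small sketch of Eq. (28) follows. The text does not say whether the standard deviation is the population or the sample estimator; this sketch assumes the population form via `statistics.pstdev`, and the default `delta=1e-6` is likewise an assumed value.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, delta=1e-6):
    """Eq. (28): (R_i - mean) / (std + delta) over one prompt's rollout group.

    Assumption: population standard deviation; delta guards against a
    zero denominator when all rewards in the group coincide.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + delta) for r in rewards]
```

When every rollout in the group earns the same reward, all advantages are zero and the update carries no signal; this is the advantage collapse that Appendix C associates with low reward variance.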

Table 5: Ablation study on random-subset annealing. All experiments use $\alpha=0.3$ and $n_{g}:n_{s}:n_{r}=2:4:2$.

## Appendix E Ablation Study Details

All ablations run on MMaDA with $K=8$ and $\alpha=0.3$ unless otherwise stated.

A1: Staged RL Training is Necessary for dMLLMs. Figure [4](https://arxiv.org/html/2605.16842#S4.F4)(a) compares various budget allocation strategies under a fixed inner-loop budget $K=8$. Both single-stage and two-stage variants yield limited overall performance, scoring only between 78.3 and 80.2. The full three-stage Sketch-Then-Paint schedule (Global $\to$ Structure $\to$ Refinement with a 2:4:2 budget) breaks this bottleneck, reaching 83.3 overall, confirming that the coarse-to-fine ordering itself, and not merely token-group separation, drives the gain.

![Refer to caption](https://arxiv.org/html/2605.16842v1/x5.png)

Figure 5: DPG-Bench counting examples on Lumina-DiMOO. For each prompt, we compare four samples per method: MaskGRPO on the left and HT-GRPO on the right. HT-GRPO more consistently preserves the requested object count.

A2: Structure-biased Budget Allocation Maximizes Returns. Figure [4](https://arxiv.org/html/2605.16842#S4.F4)(a) also explores budget distribution across stages. The default 2:4:2 allocation achieves the best overall score of 83.3. Refinement-biased (2:2:4) and global-biased (4:2:2) allocations underperform, indicating that structural tokens require more optimization steps to resolve layout and compositional relations.

![Refer to caption](https://arxiv.org/html/2605.16842v1/x6.png)

Figure 6: DPG-Bench scene-level structural completeness on Lumina-DiMOO. Each prompt compares MaskGRPO and HT-GRPO with four samples per method. HT-GRPO better preserves relative scale, foreground–background structure, and subject pose across complex scene descriptions.

A3: Superiority of Coarse-to-fine Annealing Schedule. Table [5](https://arxiv.org/html/2605.16842#A4.T5) evaluates how the annealing schedule controls token sampling within each stage. Static strategies, namely constant full coverage ($\gamma_{\max}=\gamma_{\min}=1.0$, 79.82) and fixed sparse sampling ($\gamma_{\max}=\gamma_{\min}=0.2$, 80.43), both underperform dynamic variants. Ascending annealing ($\gamma:\,0.5\to 1.0$, 82.41) improves over static baselines but lags behind our method: starting with sparse coverage causes unstable gradient directions in early updates. Our linear decay schedule ($\gamma:\,1.0\to 0.5$) achieves the best overall score of 83.31.

A4: Sensitivity to Structure Ratio $\alpha$. Figure [4](https://arxiv.org/html/2605.16842#S4.F4)(b) evaluates the structural-group boundary. Performance peaks at $\alpha=0.3$ (83.3). A smaller ratio ($\alpha=0.1$) leaves many layout-relevant tokens under-optimized (80.2), while a larger ratio ($\alpha=0.5$) dilutes the hierarchy by including tokens with rich visual context, both degrading performance.

A5: Ablation on Credit Weighting & Ratio Conditioning. Figure [4](https://arxiv.org/html/2605.16842#S4.F4)(c) examines the two components of Section [3.2](https://arxiv.org/html/2605.16842#S3.SS2) and Section [3.1.2](https://arxiv.org/html/2605.16842#S3.SS1.SSS2). Discarding credit weights ($\lambda_{\mathrm{s}}=\lambda_{\mathrm{r}}=1$) drops the overall score to 80.8, confirming that amplifying structural-token updates is critical for layout composition. Replacing $\mathbf{C}_{\emptyset}$ with revealed structural contexts yields 80.6: once the structural layout is exposed, refinement-token distributions become overly sharp, collapsing ratio variation and weakening the RL signal. This is precisely the conditional low-entropy degradation identified in [Proposition C.3](https://arxiv.org/html/2605.16842#A3.SS0.SSS0.Px4).

Table 6: Hyperparameter settings used in all experiments.

| Category | Hyperparameter | Value |
|---|---|---|
| HT-GRPO | Structure ratio $\alpha$ | $0.3$ |
| HT-GRPO | Stage budget $n_{\mathrm{global}}:n_{\mathrm{structure}}:n_{\mathrm{refinement}}$ | $2:4:2$ |
| HT-GRPO | Structural weight $\lambda_{\mathrm{s}}$ | $1.5$ |
| HT-GRPO | Refinement weight $\lambda_{\mathrm{r}}$ | $0.5$ |
| Annealing | Schedule mode $\phi$ | down |
| Annealing | $\gamma_{\max}$ (all stages) | $1.0$ |
| Annealing | $\gamma_{\min}$ (all stages) | $0.5$ |
| RL training | Rollouts per prompt $G$ | $9$ |
| RL training | Classifier-free guidance scale | $3.5$ |
| RL training | Reward | HPSv3 + CLIP (ViT-L/14) + UniRwd |
| Hardware | GPUs | 8 $\times$ A100-80G |

## Appendix F Implementation Supplement

### F.1 Random-Subset Scheduling Functions

Let $k_{s}\in\{0,\ldots,n_{s}-1\}$ denote the within-stage update counter for stage $s$. The general sampling rate is

$$\gamma_{k_{s}}^{(s)}=\gamma_{\min}^{(s)}+\bigl(\gamma_{\max}^{(s)}-\gamma_{\min}^{(s)}\bigr)\cdot\phi\!\left(\frac{k_{s}}{\max(1,n_{s}-1)}\right), \tag{29}$$

where $\phi:[0,1]\to[0,1]$ is a shape function. The main text uses $\phi(p)=1-p$ (down mode), which specializes to Eq. ([8](https://arxiv.org/html/2605.16842#S3.E8)):

$$\gamma_{k_{s}}^{(s)}=\gamma_{\min}^{(s)}+\bigl(\gamma_{\max}^{(s)}-\gamma_{\min}^{(s)}\bigr)\cdot\frac{\max(1,n_{s}-1)-k_{s}}{\max(1,n_{s}-1)}. \tag{30}$$

We additionally support the following variants (Table [7](https://arxiv.org/html/2605.16842#A6.T7)):
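
Eq. (29) with the down-mode shape function can be sketched as below. The function name `sampling_rate` and the idea of passing the shape function as a callable are illustrative assumptions; only the down mode $\phi(p)=1-p$ is taken from the text, and `n_stage` denotes the stage budget $n_{s}$, not the structural token count $N_{s}$.

```python
def sampling_rate(k_s, n_stage, gamma_min, gamma_max, phi=lambda p: 1.0 - p):
    """Eq. (29): annealed sampling rate at within-stage step k_s.

    k_s runs over 0..n_stage-1; the default phi is the "down" mode,
    which reproduces Eq. (30). Other shape functions can be passed in.
    """
    if not 0 <= k_s < n_stage:
        raise ValueError("k_s must lie in 0..n_stage-1")
    progress = k_s / max(1, n_stage - 1)
    return gamma_min + (gamma_max - gamma_min) * phi(progress)
```

With the Table 6 settings ($\gamma_{\max}=1.0$, $\gamma_{\min}=0.5$) and a four-step stage, the rate decays linearly from 1.0 at `k_s = 0` to 0.5 at `k_s = 3`; a one-step stage stays at $\gamma_{\max}$ because of the `max(1, ...)` guard.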

Table 7: Scheduling functions for random-subset annealing.

Each stage maintains its own $\gamma_{\max}^{(s)}$ and $\gamma_{\min}^{(s)}$, allowing different coverage ranges for the global, structural, and refinement stages.

### F.2 Hyperparameter Settings

Table [6](https://arxiv.org/html/2605.16842#A5.T6) lists all hyperparameters used in our experiments. HT-GRPO is implemented on top of the MaskGRPO codebase, inheriting its optimizer, KL penalty coefficient, and RL loop without modification; only the HT-GRPO-specific settings below are new.

![Refer to caption](https://arxiv.org/html/2605.16842v1/x7.png)

Figure 7: GenEval qualitative comparison. From left to right: base model, MaskGRPO, and HT-GRPO. HT-GRPO improves visual fidelity, object counting, and spatial relation grounding while producing more natural scene compositions.

Algorithm 1: HT-GRPO Training

1: Policy $\pi_{\theta}$, reference policy $\pi_{\mathrm{ref}}$, reward function $R(\cdot)$, prompts $\{c\}$
2: Structure fraction $\alpha$, stage budgets $n_{\mathrm{global}}, n_{\mathrm{structure}}, n_{\mathrm{refinement}}$
3: Annealing rates $\gamma_{\mathrm{max}}^{(s)}, \gamma_{\mathrm{min}}^{(s)}$, credit weights $\lambda_{\mathrm{s}}, \lambda_{\mathrm{r}}$, clip ratio $\epsilon$, KL coefficient $\beta$, stability constant $\delta$
4: for each training iteration do
5:   $\theta_{\mathrm{old}} \leftarrow \theta$
6:   // Phase 1: Rollout
7:   for $g = 1, \ldots, G$ do
8:     Sample $\mathbf{x}_{g}^{(0)}$ from $\pi_{\theta_{\mathrm{old}}}$; record unmasking rank $\rho_{g,i}$ for each token $i$
9:     Compute reward $R_{g} \leftarrow R(\mathbf{x}_{g}^{(0)}, c)$
10:   end for
11:   // Phase 2: Token Partitioning
12:   $N_{s} \leftarrow \lfloor \alpha N \rfloor$
13:   $\mathcal{S}_{g,\mathrm{structure}} \leftarrow \{i \mid \rho_{g,i} \leq N_{s}\}$, $\mathcal{S}_{g,\mathrm{refinement}} \leftarrow \{i \mid \rho_{g,i} > N_{s}\}$, $\mathcal{S}_{g,\mathrm{global}} \leftarrow \{1, \ldots, N\}$
14:   // Phase 3: Reward & Hierarchical Credit Assignment
15:   $A_{g} \leftarrow \bigl(R_{g} - \mathrm{mean}(\{R_{j}\})\bigr) \;/\; \bigl(\mathrm{std}(\{R_{j}\}) + \delta\bigr)$
16:   $\tilde{A}_{g,i} \leftarrow \begin{cases} A_{g} \cdot \lambda_{\mathrm{s}} & i \in \mathcal{S}_{g,\mathrm{structure}} \\ A_{g} \cdot \lambda_{\mathrm{r}} & i \in \mathcal{S}_{g,\mathrm{refinement}} \end{cases}$
17:   // Phase 4: Sketch-Then-Paint Staged Optimization
18:   for $s \in [\mathrm{Global},\, \mathrm{Structure},\, \mathrm{Refinement}]$ do
19:     for $k_{s} = 0, \ldots, n_{s} - 1$ do
20:       $\gamma_{k_{s}}^{(s)} \leftarrow \gamma_{\mathrm{min}}^{(s)} + \bigl(\gamma_{\mathrm{max}}^{(s)} - \gamma_{\mathrm{min}}^{(s)}\bigr) \cdot \dfrac{\max(1,\, n_{s} - 1) - k_{s}}{\max(1,\, n_{s} - 1)}$
21:       for $g = 1, \ldots, G$ do
22:         Sample $\mathcal{M}_{g}^{(k_{s})} \subseteq \mathcal{S}_{g,s}$ with rate $\gamma_{k_{s}}^{(s)}$
23:       end for
24:       Compute $\pi_{\theta}(v_{g,i} \mid \mathbf{C}_{\emptyset}, c)$ for all $i \in \bigcup_{g} \mathcal{M}_{g}^{(k_{s})}$   $\triangleright$ single forward pass
25:       $r_{g,i}(\theta) \leftarrow \pi_{\theta}(v_{g,i} \mid \mathbf{C}_{\emptyset}, c) \;/\; \pi_{\theta_{\mathrm{old}}}(v_{g,i} \mid \mathbf{C}_{\emptyset}, c)$
26:       $\mathcal{J}^{(k_{s})} \leftarrow \dfrac{1}{G} \sum_{g=1}^{G} \dfrac{1}{|\mathcal{M}_{g}^{(k_{s})}|} \sum_{i \in \mathcal{M}_{g}^{(k_{s})}} \min\bigl(r_{g,i}\, \tilde{A}_{g,i},\; \mathrm{clip}(r_{g,i}, 1 - \epsilon, 1 + \epsilon)\, \tilde{A}_{g,i}\bigr) - \beta\, \mathbb{D}_{\mathrm{KL}}(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}})$
27:       $\theta \leftarrow \theta + \eta\, \nabla_{\theta} \mathcal{J}^{(k_{s})}$
28:     end for
29:   end for
30: end for
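Phases 2 and 3 of Algorithm 1 (token partitioning and hierarchical credit assignment) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation. It assumes unmasking ranks are 1-based (so that $\rho_{g,i} \leq N_s$ selects exactly $N_s$ tokens), uses 0-based token indices, and takes the population standard deviation for $\mathrm{std}$, which the algorithm does not specify. The function names and the example values of `lam_s` and `lam_r` are hypothetical.

```python
import math
import statistics


def partition_tokens(ranks, alpha):
    """Split token indices by unmasking rank (steps 12-13).

    ranks[i] is the 1-based unmasking rank of token i (assumption).
    """
    n = len(ranks)
    n_s = math.floor(alpha * n)
    structure = {i for i, r in enumerate(ranks) if r <= n_s}
    refinement = {i for i, r in enumerate(ranks) if r > n_s}
    return {
        "global": set(range(n)),
        "structure": structure,
        "refinement": refinement,
    }


def group_advantages(rewards, delta=1e-6):
    """Group-normalized advantage A_g for each rollout (step 15)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption
    return [(r - mean) / (std + delta) for r in rewards]


def token_advantages(a_g, parts, lam_s, lam_r):
    """Scale A_g per token by its partition weight (step 16)."""
    n = len(parts["global"])
    return [a_g * (lam_s if i in parts["structure"] else lam_r) for i in range(n)]


if __name__ == "__main__":
    ranks = [3, 1, 4, 2, 6, 5]        # token 1 was unmasked first, token 4 last
    parts = partition_tokens(ranks, alpha=0.5)
    adv = group_advantages([1.0, 0.0, 1.0, 0.0])
    print(parts["structure"], parts["refinement"])
    print(token_advantages(adv[0], parts, lam_s=1.5, lam_r=0.5))
```

Every token belongs to exactly one of the structure and refinement sets, so the per-token advantage is defined for all $N$ tokens, including those visited during the Global stage.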

## Appendix G Qualitative Results

### G.1 DPG-Bench: Counting Accuracy

Figure [5](https://arxiv.org/html/2605.16842#A5.F5) compares MaskGRPO and HT-GRPO, both trained on Lumina-DiMOO, on two DPG-Bench prompts that require reliable counting. The first prompt asks for a person pointing toward a trio of birds, so the model must preserve both the object count and the human–object spatial relation. The second prompt asks for exactly three eggplants arranged on a table. MaskGRPO often produces an incorrect number of target objects, while HT-GRPO more consistently satisfies the requested count.

This illustrates two benefits of the Sketch\-Then\-Paint hierarchy\. First, counting is treated as a structural decision rather than a refinement detail\. HT\-GRPO updates all tokens in the Global stage and then focuses on structural tokens in the Structure stage, allowing the model to determine how many objects should appear and where they should be placed before local appearance is refined\. Second, the staged inner loop prevents counting\-related structural decisions from being diluted by refinement updates\. MaskGRPO can mix structure and texture tokens in the same update, so the learning signal for object count and spatial placement competes with color and surface\-detail optimization\. HT\-GRPO separates these roles by optimizing structure before refinement, making the correct count and layout more stable across samples\.
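The stage ordering described above can be sketched as a generator that visits Global, then Structure, then Refinement, and draws a random token subset at each inner step. This is a hedged illustration only: the algorithm says subsets are sampled "with rate" $\gamma$, and the sketch interprets that as keeping a fixed fraction of the stage's tokens (at least one), which is an assumption. The names `sample_subset` and `staged_subsets`, along with the budgets and rates in the usage example, are hypothetical.

```python
import random


def sample_subset(token_set, rate, rng):
    """Keep roughly `rate` of the tokens in `token_set`.

    Keeping at least one token (when any exist) is an assumption, made so
    that a 1/|M| normalization would stay well defined.
    """
    tokens = sorted(token_set)
    k = max(1, round(rate * len(tokens)))
    return set(rng.sample(tokens, min(k, len(tokens))))


def staged_subsets(parts, budgets, rates, rng):
    """Yield (stage, step, subset) in Global -> Structure -> Refinement order."""
    for stage in ("global", "structure", "refinement"):
        n_s = budgets[stage]
        g_max, g_min = rates[stage]
        denom = max(1, n_s - 1)
        for k in range(n_s):
            # Linear annealing from g_max down to g_min within the stage.
            gamma = g_min + (g_max - g_min) * (denom - k) / denom
            yield stage, k, sample_subset(parts[stage], gamma, rng)


if __name__ == "__main__":
    parts = {
        "global": {0, 1, 2, 3, 4, 5},
        "structure": {0, 1, 3},
        "refinement": {2, 4, 5},
    }
    budgets = {"global": 1, "structure": 2, "refinement": 2}
    rates = {stage: (1.0, 0.5) for stage in parts}
    for stage, k, subset in staged_subsets(parts, budgets, rates, random.Random(0)):
        print(stage, k, sorted(subset))
```

Because Structure-stage subsets are drawn only from structural tokens, no refinement token can contribute to those updates, which is the separation of roles the paragraph above describes.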

### G.2 DPG-Bench: Scene-Level Structural Completeness

Figure [6](https://arxiv.org/html/2605.16842#A5.F6) compares MaskGRPO and HT-GRPO on two DPG-Bench prompts that require coherent scene-level structure and foreground–background composition. For the first prompt, both methods can generate the pigeon, oak tree, and village background, but MaskGRPO shows weaker spatial grounding: the tree roots are not always naturally connected to the ground, and the foreground tree can appear flattened against the village backdrop. The pigeon is also sometimes disproportionately large relative to the tree and cottages, further weakening the scene hierarchy. HT-GRPO better preserves the spatial relationship among the pigeon, oak branches, tree trunk, roots, and village background, producing a more coherent foreground–background structure.

For the second prompt, which requires a parrot with fully outstretched wings flying above hills and forest, MaskGRPO can generate the target bird but often renders it with a flatter, less three\-dimensional foreground structure\. Its wing geometry and flying pose are less naturally integrated with the landscape\. HT\-GRPO more reliably preserves the outstretched wing structure, flying pose, and background scene context, resulting in a stronger sense of depth and scene\-level coherence\. These examples suggest that the Sketch\-Then\-Paint hierarchy helps establish global scene layout and object configuration before local appearance refinement\.

### G.3 GenEval: Spatial Grounding and Visual Fidelity

Figure [7](https://arxiv.org/html/2605.16842#A6.F7) compares the base model, MaskGRPO, and HT-GRPO on representative GenEval examples covering single-object generation, object counting, and spatial relation understanding. Beyond improving correctness, HT-GRPO also produces more natural and visually diverse images. In the single-object example, all methods generate a baseball bat, while HT-GRPO renders a cleaner object with a more realistic scene composition. In the counting example, HT-GRPO preserves the requested three buses while producing a more diverse urban scene. The spatial-relation example is particularly illustrative: MaskGRPO tends to entangle the book and laptop into a single top-down composition, making the “book above laptop” relation ambiguous. In contrast, HT-GRPO separates the two objects into a clear vertical arrangement, placing the book above the laptop and enriching the scene with natural contextual elements. This suggests that HT-GRPO improves not only object-level accuracy, but also spatial grounding and aesthetic scene construction.

## Appendix H Complete Training Algorithm

Algorithm [1](https://arxiv.org/html/2605.16842#alg1) presents the full HT-GRPO training procedure. For completeness, it summarizes the entire training pipeline in step-by-step form.
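To make steps 25 and 26 of Algorithm 1 concrete, the sketch below evaluates the clipped surrogate for a single rollout over a sampled token subset. It is an illustration under stated assumptions rather than the authors' code: it works with per-token log-probabilities (so the importance ratio is the exponential of a difference), it omits the KL penalty term $\beta\, \mathbb{D}_{\mathrm{KL}}(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}})$, and the function name `clipped_token_objective` is hypothetical.

```python
import math


def clipped_token_objective(logp_new, logp_old, adv, subset, eps):
    """Clipped surrogate averaged over the sampled tokens of one rollout.

    logp_new[i], logp_old[i] -- log-probabilities of token i under the current
                                and old policy, both conditioned on the fully
                                masked state and the prompt
    adv[i]                   -- hierarchical per-token advantage
    subset                   -- non-empty set of sampled token indices
    eps                      -- clip ratio
    """
    total = 0.0
    for i in subset:
        ratio = math.exp(logp_new[i] - logp_old[i])      # r_{g,i}(theta)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)  # clip(r, 1-eps, 1+eps)
        total += min(ratio * adv[i], clipped * adv[i])
    return total / len(subset)


if __name__ == "__main__":
    logp_old = [-1.2, -0.7, -2.0]
    logp_new = [-1.0, -0.7, -2.3]
    adv = [1.5, 1.5, 0.5]
    print(clipped_token_objective(logp_new, logp_old, adv, {0, 2}, eps=0.2))
```

When the current and old policies agree, every ratio equals 1 and the objective reduces to the mean advantage over the subset. For a positive advantage, clipping caps the gain once the ratio exceeds $1 + \epsilon$.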