MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale
Summary
MRT is a 20B-parameter masked region diffusion model that unifies text-to-layers, image-to-layers, and layers-to-layers tasks for scalable multi-layer transparent image generation and editing, achieving state-of-the-art performance.
Cached at: 05/27/26, 02:47 AM
Source: https://huggingface.co/papers/2605.27235
Abstract
A 20B-parameter masked region diffusion model enables scalable multi-layer transparent image generation and editing through unified task handling and efficient canvas management.
Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content, analogous to word-level editing in natural language. Despite its importance, this remains an underexplored area at scale. To address this gap, we present MRT, a 20B-parameter masked region diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts. To fully leverage this scale, we make two key technical contributions. First, we unify three complementary tasks including text-to-layers, image-to-layers, and layers-to-layers within a shared masked region diffusion framework, where selective token masking enables flexible layer-wise generation and editing. Second, to enable overflow layer generation, we introduce an overflow-aware canvas layer that handles boundary inconsistencies and supports semi-transparent background synthesis, enabling complete editable layers extending beyond visible canvas boundaries. Additionally, we apply diffusion distillation to achieve 8-step, real-time multi-layer generation with minimal quality degradation. Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches, including various commercial systems, across all three tasks, establishing a new benchmark for multi-layer transparent image generation. Notably, our model significantly outperforms the concurrent Qwen-Image-Layered model in image-to-layers quality according to user-study results, while achieving 10-100× faster inference and reducing activation GPU memory consumption by 50-90% during image-to-layer inference.
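The abstract does not spell out how selective token masking maps onto the three tasks, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a hypothetical token layout in which region 0 holds the composite image and regions 1..N hold the individual layers, and it assumes a simple linear interpolation between clean tokens and Gaussian noise as the corruption step. The names `build_region_mask`, `noise_masked_tokens`, and `edit_layers` are illustrative and do not come from the paper.

```python
import numpy as np

TASKS = ("text-to-layers", "image-to-layers", "layers-to-layers")


def build_region_mask(task, num_layers, tokens_per_region, edit_layers=()):
    """Return a boolean mask of shape (num_layers + 1, tokens_per_region).

    True marks a token the model must generate; False marks a token that is
    given as clean conditioning. Row 0 is the composite image and rows 1..N
    are the layers (an assumed layout, not taken from the paper).
    """
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")

    # text-to-layers: nothing visual is given, so every region is generated.
    mask = np.ones((num_layers + 1, tokens_per_region), dtype=bool)

    if task == "image-to-layers":
        # The composite image is the input; only the layers are generated.
        mask[0] = False
    elif task == "layers-to-layers":
        # Existing layers are kept, except the ones selected for editing.
        # The composite stays masked here because it changes with the edit
        # (an assumption made for this sketch).
        mask[1:] = False
        for i in edit_layers:
            if not 0 <= i < num_layers:
                raise IndexError(f"layer index out of range: {i}")
            mask[1 + i] = True
    return mask


def noise_masked_tokens(tokens, mask, t, rng):
    """Corrupt only the masked tokens and leave conditioning tokens untouched.

    tokens: float array of shape (regions, tokens_per_region, dim)
    mask:   boolean array of shape (regions, tokens_per_region)
    t:      noise level in [0, 1]; 0 keeps the clean tokens, 1 is pure noise
    """
    noise = rng.standard_normal(tokens.shape)
    noisy = (1.0 - t) * tokens + t * noise
    # mask[..., None] broadcasts the per-token flag across the feature axis.
    return np.where(mask[..., None], noisy, tokens)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((4, 16, 8))  # composite + 3 layers
    mask = build_region_mask("layers-to-layers", 3, 16, edit_layers=[1])
    noisy = noise_masked_tokens(tokens, mask, t=0.7, rng=rng)
    print("regions to generate:", np.flatnonzero(mask.any(axis=1)).tolist())
```

Under these assumptions, a single denoiser can serve all three tasks because the only thing that changes between them is which regions are flagged for generation.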
Similar Articles
Remask, Don't Replace: Token-to-Mask Refinement in Masked Diffusion Language Models
Introduces Token-to-Mask (T2M) remasking to fix generation errors in masked diffusion LMs by resetting suspect tokens to mask state instead of overwriting, yielding up to +5.92 accuracy on CMATH without extra training or parameters.
MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings
MMCORE introduces a unified multimodal image generation and editing framework that aligns VLM semantic embeddings with diffusion conditioning, achieving state-of-the-art fidelity without costly fusion or from-scratch training.
M2Retinexformer: Multi-Modal Retinexformer for Low-Light Image Enhancement
M2Retinexformer extends the Retinexformer framework for low-light image enhancement by incorporating depth, luminance, and semantic cues via cross-attention and adaptive gating, achieving state-of-the-art results on multiple benchmarks.
RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models
RT-Lynx proposes using activation sparsity instead of weight sparsity to accelerate diffusion models, achieving up to 1.55× linear-layer speedup while maintaining generation quality, and is accepted at ICML 2026.
Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
This paper introduces DRoRAE, a method that improves visual tokenization by fusing multi-layer features from pretrained vision encoders rather than relying solely on the last layer. It demonstrates significant improvements in reconstruction and generation quality on ImageNet and establishes a scaling law between fusion capacity and performance.