MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale

Hugging Face Daily Papers

Summary

MRT is a 20B-parameter masked region diffusion model that unifies text-to-layers, image-to-layers, and layers-to-layers tasks for scalable multi-layer transparent image generation and editing, achieving state-of-the-art performance.

Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content, analogous to word-level editing in natural language. Despite its importance, this remains an underexplored area at scale. To address this gap, we present MRT, a 20B-parameter masked region diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts. To fully leverage this scale, we make two key technical contributions. First, we unify three complementary tasks, text-to-layers, image-to-layers, and layers-to-layers, within a shared masked region diffusion framework, where selective token masking enables flexible layer-wise generation and editing. Second, to enable overflow layer generation, we introduce an overflow-aware canvas layer that handles boundary inconsistencies and supports semi-transparent background synthesis, enabling complete editable layers extending beyond visible canvas boundaries. Additionally, we apply diffusion distillation to achieve 8-step, real-time multi-layer generation with minimal quality degradation. Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches, including various commercial systems, across all three tasks, establishing a new benchmark for multi-layer transparent image generation. Notably, our model significantly outperforms the concurrent Qwen-Image-Layered model in image-to-layers quality according to user-study results, while achieving 10-100× faster inference and reducing activation GPU memory consumption by 50-90% during image-to-layers inference.
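
The abstract says that selective token masking lets one framework cover text-to-layers, image-to-layers, and layers-to-layers, but it does not spell out the token layout. The sketch below is only an illustration of that idea under stated assumptions: the `region_mask` helper, the `composite` region, and the `layer_i` names are hypothetical and do not come from the paper. A masked region stands for tokens the model would generate, and an unmasked region stands for tokens kept as clean conditioning.

```python
from typing import Dict, Iterable

TASKS = ("text-to-layers", "image-to-layers", "layers-to-layers")


def region_mask(task: str, num_layers: int, edit: Iterable[int] = ()) -> Dict[str, bool]:
    """Return {region name: True if masked (generated), False if kept as conditioning}.

    Hypothetical layout (an assumption, not from the paper): one "composite"
    region plus "layer_0" .. "layer_{num_layers - 1}".
    """
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")

    layers = [f"layer_{i}" for i in range(num_layers)]
    if task == "text-to-layers":
        # Only the text prompt is given, so every visual region is generated.
        masked = {"composite", *layers}
    elif task == "image-to-layers":
        # The flattened image is given: keep it and generate all layers.
        masked = set(layers)
    else:
        # layers-to-layers: regenerate only the layers selected for editing.
        edit_ids = set(edit)
        bad = sorted(i for i in edit_ids if not 0 <= i < num_layers)
        if bad:
            raise ValueError(f"layer indices out of range: {bad}")
        # Assumption: the composite changes whenever a layer changes, so mask it too.
        masked = {"composite", *(f"layer_{i}" for i in edit_ids)}

    return {name: name in masked for name in ["composite", *layers]}


if __name__ == "__main__":
    for task in TASKS:
        print(task, region_mask(task, num_layers=3, edit=[1]))
```

In this reading, the three tasks differ only in which regions are masked, which is one plausible way a single diffusion model could serve all of them. The actual region definitions and conditioning scheme in MRT may differ.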

Cached at: 05/27/26, 02:47 AM


Source: https://huggingface.co/papers/2605.27235

Abstract

A 20B-parameter masked region diffusion model enables scalable multi-layer transparent image generation and editing through unified task handling and efficient canvas management.




Similar Articles

Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

Hugging Face Daily Papers

This paper introduces DRoRAE, a method that improves visual tokenization by fusing multi-layer features from pretrained vision encoders rather than relying solely on the last layer. It demonstrates significant improvements in reconstruction and generation quality on ImageNet and establishes a scaling law between fusion capacity and performance.