This paper proposes Nemotron-Labs-Diffusion-Image, a masked discrete diffusion model for high-resolution text-to-image synthesis. It introduces a token-editing mechanism and a grouped cross-entropy objective to improve token refinement and training efficiency.
This work introduces High-Res Neural Cellular Automata, a method that operates on a coarse lattice and uses a Local Pattern Producing Network to generate high-resolution outputs, enabling efficient procedural generation.
Ideogram 4 is an open-weight text-to-image model trained from scratch, featuring structured JSON prompting, best-in-class multilingual text rendering, bounding-box layout controls, color-palette controls, and native 2K resolution output.
NVIDIA Spatial Intelligence Lab proposes PiD, which redesigns the decoding stage of latent diffusion models as a conditional pixel diffusion process, unifying decoding and upsampling to achieve low-latency, high-resolution decoding.
PiD introduces a pixel diffusion decoder that reformulates latent decoding as conditional pixel diffusion, enabling fast and high-quality image synthesis at high resolutions with reduced computational requirements. It decodes latents into 4x or 8x upscaled images in under a second on consumer hardware.
HL-OutPaint is a coarse-to-fine video outpainting framework for high-resolution long-range videos, using global coarse guidance to enable large spatial extrapolation while maintaining spatio-temporal consistency.
Microsoft releases Lens, a 3.8B-parameter foundational text-to-image model with efficient training and fast high-resolution generation, featuring dense-caption pre-training and mixed-resolution learning.
The L2P paper introduces a Latent-to-Pixel transfer paradigm that leverages pre-trained latent diffusion models to create efficient pixel-space models capable of 4K generation with minimal training overhead.
This paper introduces LLaVA-UHD v4, which improves visual encoding efficiency in multimodal large language models by using slice-based encoding and intra-ViT early compression. It reduces computational costs by over 55% while maintaining or improving performance on high-resolution image tasks.
HiDream-ai has released HiDream-O1-Image-Dev, an 8B-parameter open-source image generation model that uses a pixel-level unified transformer without external VAEs. It ranks #8 in the Artificial Analysis Text to Image Arena and supports high-resolution generation up to 2048×2048.
HiDream-ai has open-sourced HiDream-O1-Image (8B), a unified image generative foundation model built on a Pixel-level Unified Transformer (UiT) that natively handles text-to-image, image editing, and subject-driven personalization at up to 2048×2048 resolution without external VAEs or disjoint text encoders. It debuted at #8 in the Artificial Analysis Text to Image Arena and is positioned as a leading open-weights text-to-image model.
SwiftI2V is a new efficient framework for high-resolution image-to-video generation that uses conditional segment-wise generation to achieve 2K synthesis with significantly reduced computational costs. It enables practical generation on single consumer or datacenter GPUs while maintaining input fidelity.