The L2P paper introduces a Latent-to-Pixel transfer paradigm that leverages pre-trained latent diffusion models to create efficient pixel-space models capable of 4K generation with minimal training overhead.
This paper introduces LLaVA-UHD v4, which improves visual encoding efficiency in multimodal large language models by using slice-based encoding and intra-ViT early compression. It reduces computational costs by over 55% while maintaining or improving performance on high-resolution image tasks.
HiDream-ai has open-sourced HiDream-O1-Image (8B, released as HiDream-O1-Image-Dev), a unified image generative foundation model built on a Pixel-level Unified Transformer (UiT) that natively handles text-to-image generation, image editing, and subject-driven personalization at resolutions up to 2,048×2,048, without external VAEs or disjoint text encoders. It debuted at #8 in the Artificial Analysis Text to Image Arena and is positioned as a leading open-weights text-to-image model.
SwiftI2V is a new efficient framework for high-resolution image-to-video generation that uses conditional segment-wise generation to achieve 2K synthesis at significantly reduced computational cost. It enables practical generation on a single consumer or datacenter GPU while maintaining input fidelity.