TAPS proposes a target-aware prefix tree selection method for diffusion-drafted speculative decoding, achieving up to 7.9x lossless end-to-end speedup by improving the acceptance-cost tradeoff over prior methods.
NVIDIA presents OmniDreams, a generative world model built from the Cosmos diffusion model for real-time action-conditioned video generation, enabling closed-loop simulation for autonomous driving policy evaluation in complex unseen scenarios.
PrismML has released Bonsai Image 4B, a diffusion model quantized to 1-bit binary and ternary weights. At only 3GB (0.93GB for the 1-bit version), it achieves over 8x compression compared to the same-parameter FLUX.2 Klein 4B at 16GB, and it can run entirely locally in the browser.
This paper introduces TabKG, a knowledge-graph-guided framework for generating logically consistent synthetic supply chain tabular data. It uses an LLM ensemble to discover operational dependencies and a latent diffusion model to generate independent columns, achieving high logical consistency while preserving statistical fidelity.
Proposes DDGAD, a diffusion-based framework for graph anomaly detection that uses trajectory dynamics to distinguish normal from anomalous nodes, mitigating contamination propagation via a reliability-aware consensus mechanism and three complementary anomaly signals.
Introduces AIMS-Fold, an inference-time guided-diffusion framework that integrates cross-linking mass spectrometry (XL-MS) and hydrogen-deuterium exchange (HDX-MS) data to improve protein co-folding predictions for induced proximity drug targets.
AirCast-SR is a diffusion-based foundation model that downscales global AI weather forecasts from 0.25° to 1 km resolution at hourly cadence, producing 67-hour forecasts with near-zero bias and structural realism, while running inference in minutes on a single commodity GPU.
MRT is a 20B-parameter masked region diffusion model that unifies text-to-layers, image-to-layers, and layers-to-layers tasks for scalable multi-layer transparent image generation and editing, achieving state-of-the-art performance.
AutoRubric-T2I automatically generates and selects explicit rubrics to guide Vision-Language Model judges for text-to-image generation, achieving high-quality reward signals with minimal human annotation and improving generation quality in downstream tasks.
Proposes MacroDiff+, a physics-guided geometric diffusion framework for macro placement in VLSI design, achieving 6.1–6.2% wirelength reduction on ISPD2005 benchmarks with superior stability and scalability.
LongLive-2.0 introduces an NVFP4-based parallel infrastructure for long video generation, achieving up to 2.15x training speedup and 1.84x inference speedup with a 5B model reaching 45.7 FPS.
Stability AI releases Stable Audio 3, a family of latent diffusion models for variable-length audio generation and editing, with weights for small and medium models available on Hugging Face.
Scenema AI releases Scenema Audio, an open-source diffusion-based model for zero-shot expressive voice cloning and speech generation, separating emotional performance from voice identity to allow any voice to perform any emotion.
Kaiming He's team at MIT released ELF, a new language model that generates text with a diffusion process in continuous vector space, bypassing standard autoregressive architectures and significantly reducing data requirements.
AnyFlow introduces a novel any-step video diffusion distillation framework that optimizes full ODE sampling trajectories through flow-map transition learning and backward simulation, achieving performance that matches or surpasses consistency-based counterparts while scaling with sampling step budgets.
Qwen-Image-VAE-2.0 is a high-compression Variational Autoencoder suite that improves reconstruction fidelity and diffusability through enhanced architecture, large-scale training, and semantic alignment strategies.
Tsinghua University's OpenBMB has released VoxCPM2, an open-source multilingual TTS model with 20 billion parameters. It is tokenizer-free, using diffusion-autoregressive generation in a continuous latent space, and offers 48kHz studio-quality audio along with strong voice cloning and voice design capabilities.
NVIDIA releases PiD (Pixel Diffusion Decoder), a conditional pixel-space diffusion model that unifies latent-to-pixel decoding and upsampling into one generative module, producing super-resolved images in one pass. Model checkpoints and VAE weights are released under a non-commercial license.
Z-Anime is a full fine-tune of Alibaba's Z-Image Base model, specialized for high-quality anime generation with support for natural language prompts and low VRAM usage.
NVIDIA releases Nemotron-Labs-Diffusion, a family of tri-mode language models (3B, 8B, 14B) supporting AR, diffusion, and self-speculation decoding, achieving 2.7x-4x speed-ups over standard AR decoding.