Tag
NVIDIA released Nemotron-TwoTower-30B-A3B-Base-BF16, a diffusion-based language model that uses block-wise autoregressive diffusion to generate text by iterative denoising of token blocks, achieving 2.42× the generation throughput of the autoregressive baseline while retaining 98.7% of benchmark quality.
This paper introduces an atomistic language model that integrates a 3D atom encoder, Qwen LLM, and diffusion crystal generator to natively handle multimodal materials data, achieving state-of-the-art crystal structure prediction and de novo generation.
This paper introduces DigenRL, a disaggregated RL framework for diffusion-based generative LLMs that uses generation-axis pipeline parallelism and trainer-assisted generation to improve throughput by 1.56-2.10x over existing systems.
This paper introduces Approximate Structured Diffusion, a method that combines conditional random fields (CRFs) with discrete diffusion for sequence labelling. It uses a CRF conditioned on noisy label sequences and approximate mean-field inference, achieving a 16.5% error reduction on POS tagging.
JanusMesh is a fast, training-free framework that generates text-driven 3D visual illusions—a single mesh revealing different semantics from different viewing angles—by decoupling generation into cross-space dual-branch denoising and view-conditioned texture synthesis, achieving high realism in just 3-5 minutes.
MiniT2I is a minimalist direct-RGB text-to-image generator using a pixel-space MM-JiT denoiser with flow matching and frozen FLAN-T5-Large text tokens, with open-source JAX/Flax and PyTorch implementations released along with checkpoints.
Moebius is a 0.22B parameter image inpainting framework that rivals 10B-level models like FLUX.1-Fill-Dev, achieving over 15x faster inference through novel local-global interaction blocks and adaptive distillation strategies.
RepFusion introduces a method to use pretrained multimodal LLMs as noisy representation encoders in diffusion transformers for text-to-image generation, outperforming baselines with similar compute.
The author launches a weekly Video Model Journal Club covering video generation, world models, physical reasoning, diffusion, flow matching, etc. The first in-person talk will be by Yilun Du on Embodied Reasoning with World Models.
This paper introduces SP³, a method using Spherical Encoder priors for Plug-and-Play image restoration, achieving perceptual quality comparable to zero-shot diffusion priors while being 3–630× faster across tasks.
MoVerse generates real-time interactive video from single images by creating 360° panoramas and 3D Gaussian scaffolds, enabling efficient rendering through diffusion-based techniques.
VideoMDM trains 3D human motion priors from 2D poses using a diffusion framework with 2D reprojection loss and 3D motion regularizers, achieving near-3D supervised performance without requiring 3D ground truth.
Google released DiffusionGemma, an open-weight text generation model (26B parameters, 4B active) under Apache 2 license, demonstrating high inference speeds via NVIDIA's NIM cloud API.
DiffusionGemma, a 26B MoE model based on Gemma 4, achieves over 1000 tokens per second using diffusion for text generation in 256-token blocks, fitting in 18GB VRAM with quantization, released under Apache 2.0.
New paper shows how to optimize flow matching actors for reinforcement learning by approximating the Jacobian of the flow denoising process with the identity matrix, making training feasible.
Google DeepMind releases DiffusionGemma, a 26B-parameter Mixture-of-Experts model that uses discrete diffusion for faster text generation, supporting multimodal inputs and a 256K token context.
An essay explaining the physical constraints on cell size, focusing on surface-area-to-volume ratio and diffusion limits that make cells small.
SwiftVR is a real-time one-step generative video restoration framework that achieves high frame rates on consumer GPUs using efficient attention mechanisms and a lightweight restoration-aware autoencoder.
MaskAlign proposes a token-subset representation alignment method that improves diffusion transformer training by reducing reliance on complete token sets and maintaining stable alignment under perturbations.
Introduces SelfBootTok, a self-bootstrapped tokenization method that separates global and local information, reducing generator computation by ~40% and achieving a new state-of-the-art gFID of 1.56 with only 64 tokens.