Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators
Summary
This paper introduces Live Music Diffusion Models (LMDMs), which modify the diffusion process to enable efficient block-wise processing and novel training paradigms for real-time interactive music generation on consumer hardware, outperforming discrete autoregressive models in inference complexity and enabling stable post-training alignment.
Source: https://huggingface.co/papers/2605.22717
Abstract
Audio diffusion models are adapted for interactive music generation through efficient block-wise processing and novel training paradigms that enable real-time performance on consumer hardware.
Interactive streaming music generation promises the use of generative models for live performance and co-creation that is impossible with offline models. However, SOTA models exist in the discrete-AR regime, requiring industrial levels of compute for both training and inference. In this work, we investigate whether audio diffusion models, with their wide support in the open-source community but non-streaming bidirectional nature, can be repurposed efficiently into interactive models accessible on consumer hardware. By taking a critical look at the modern pipeline for block-wise outpainting diffusion, we identify critical inefficiencies during inference that result in strictly worse computational efficiency than their discrete-AR counterparts. We propose Live Music Diffusion Models (LMDMs), a simple modification of the generative diffusion process that recovers, and then outperforms, the inference complexity of the discrete Live Music Models (LMMs) through block-wise KV Caching. Unlike LMMs, LMDMs further enable stable post-training alignment through our novel ARC-Forcing paradigm, reducing error accumulation without any explicit RL or reward models. We demonstrate the application of LMDMs in a number of creative domains, including text-conditioned generation, sketch-based music synthesis, and jamming. We finally show how LMDMs can be used as a generative instrument in a real artist-AI collaboration, utilizing LMDMs as a “generative delay” to transform musicians’ improvisation live for variable timbral effects while running locally on a consumer gaming laptop.
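The abstract names block-wise KV caching but does not spell out the mechanism, so the following is only a generic toy sketch of the idea, not the LMDM architecture. It assumes a single attention layer in which each new block attends to itself and to all earlier blocks. The projection matrices `W_K` and `W_V`, the helper names, and the block sizes are all hypothetical. The sketch compares projecting each frame's keys and values once and caching them against re-projecting the whole prefix for every new block.

```python
import math
import random

random.seed(0)
DIM = 4
# Hypothetical fixed matrices standing in for learned key/value projections.
W_K = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
W_V = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]

projection_calls = 0  # counts how many frame projections are performed


def project(x, w):
    """Compute the vector-matrix product x @ w and count the call."""
    global projection_calls
    projection_calls += 1
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]


def attend(queries, keys, values):
    """Softmax attention of every query over all supplied keys/values."""
    scale = 1.0 / math.sqrt(DIM)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) / total
                    for d in range(DIM)])
    return out


def generate_cached(blocks):
    """Project each frame once; later blocks reuse the cached keys/values."""
    cache_k, cache_v, outputs = [], [], []
    for block in blocks:
        cache_k += [project(x, W_K) for x in block]
        cache_v += [project(x, W_V) for x in block]
        # Toy simplification: raw frames act as queries (no query projection).
        outputs.append(attend(block, cache_k, cache_v))
    return outputs


def generate_recomputed(blocks):
    """Baseline: re-project the entire prefix for every new block."""
    outputs = []
    for i, block in enumerate(blocks):
        prefix = [x for b in blocks[: i + 1] for x in b]
        keys = [project(x, W_K) for x in prefix]
        values = [project(x, W_V) for x in prefix]
        outputs.append(attend(block, keys, values))
    return outputs


def count_calls(fn, blocks):
    """Run fn on blocks and return (outputs, number of projection calls)."""
    global projection_calls
    projection_calls = 0
    result = fn(blocks)
    return result, projection_calls


# Four blocks of three random frames each (illustrative sizes only).
blocks = [[[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)]
          for _ in range(4)]
cached_out, cached_calls = count_calls(generate_cached, blocks)
full_out, full_calls = count_calls(generate_recomputed, blocks)
print(cached_calls, full_calls)
```

With four blocks of three frames, the cached path performs 24 projections (each frame projected once for keys and once for values), while recomputation performs 60, and both paths produce the same outputs. The gap grows quadratically with the number of blocks, which is the kind of redundant work a cache avoids in long streaming sessions.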
Similar Articles
Learnability-Informed Fine-Tuning of Diffusion Language Models
We propose LIFT, a learnability-informed fine-tuning algorithm for diffusion language models that aligns training with token difficulty and time step, achieving substantial gains on reasoning benchmarks.
DEMON: Diffusion Engine for Musical Orchestrated Noise
DEMON presents a real-time diffusion engine that enables live musical performance by controlling the denoising process, achieving up to 12.3 decoder completions per second on a single RTX 5090. It introduces heterogeneous scheduling, shared mutable state, per-frame blending, and windowed VAE decode for responsive control.
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
This paper reformulates language generation as a stochastic optimal control problem, addressing limitations of autoregressive and diffusion models, and proposes a closed-loop diffusion method in latent control space using Flow Matching, achieving high-fidelity generation and efficient parallel sampling.
FlowLM: Few-Step Language Modeling via Diffusion-to-Flow Adaptation
FlowLM introduces a flow matching language model derived from pre-trained diffusion models via efficient fine-tuning, enabling high-quality few-step text generation that rivals 2,000-step diffusion sampling with far fewer training epochs.
Towards Closing the Autoregressive Gap in Language Modeling via Entropy-Gated Continuous Bitstream Diffusion
This paper introduces a diffusion language model that treats text as a continuous process over binary bitstreams, using entropy-gated stochastic sampling to close the performance gap with autoregressive models. It achieves state-of-the-art results on LM1B and OWT benchmarks while reducing memory footprint.