stabilityai/stable-audio-3-medium

Summary

Stability AI releases Stable Audio 3, a family of latent diffusion models for variable-length audio generation and editing, with weights for small and medium models available on Hugging Face.

Task: text-to-audio
Tags: stable-audio-3, safetensors, audio-generation, music, sound-effects, diffusion, text-to-audio, en, arxiv:2605.17991, base_model:stabilityai/stable-audio-3-medium-base, base_model:finetune:stabilityai/stable-audio-3-medium-base, license:other, region:us

Cached at: 05/24/26, 01:50 AM

Source: https://huggingface.co/stabilityai/stable-audio-3-medium

Please note: For commercial use, please refer to https://stability.ai/license

Model Description

Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Because our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of producing a full-length output for short sounds. We also support inpainting, which enables targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post-training to reduce the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data to generate music and sounds in less than 2 s on an H200 GPU and in a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipeline.

Usage

This model can be used with:

  1. the stable-audio-3 inference and fine-tuning library
  2. the stable-audio-tools research library

Using with stable-audio-3

from stable_audio_3 import StableAudioModel

model = StableAudioModel.from_pretrained("medium")
audio = model.generate(
    prompt=(
        "House music that encapsulates the feeling of being at a festival "
        "in the sunny weather with all your friends 124 BPM"
    ),
    duration=180
)

Using with stable-audio-tools

import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond_inpaint

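device = "cuda" if torch.cuda.is_available() else "cpu"
# Use half precision only on GPU. The flag is always defined,
# so the later check cannot raise a NameError on CPU.
model_half = device == "cuda"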

# Download model
model, model_config = get_pretrained_model("stabilityai/stable-audio-3-medium")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

model = model.to(device)
if model_half:
    model = model.to(torch.float16)

# Set up text and timing conditioning
conditioning = [{
    "prompt": (
        "A dream-like Synthpop instrumental that would accompany "
        "a dream-sequence in a surrealist movie 120 BPM"
    ),
    "seconds_total": 380
}]

# Generate stereo audio
output = generate_diffusion_cond_inpaint(
    model,
    steps=8,
    cfg_scale=1.0,
    conditioning=conditioning,
    sample_size=sample_size,
    sampler_type="pingpong",
    device=device
)

# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")

# Peak normalize, clip, convert to int16, and save to file
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
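
The final step above divides by the peak absolute sample, which would be a division by zero if the generated audio were completely silent. As a minimal sketch of the same normalize, clip, and quantize sequence with a guard for that case, the following uses plain Python lists instead of tensors. The function name `peak_normalize_to_int16` and the `eps` parameter are hypothetical and are not part of stable-audio-tools.

```python
from typing import List


def peak_normalize_to_int16(samples: List[float], eps: float = 1e-9) -> List[int]:
    """Scale samples so the loudest one reaches full scale, then quantize to int16.

    Hypothetical helper for illustration only; it mirrors the tensor
    one-liner above but returns zeros for silent input.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak < eps:
        # Silent or empty input: avoid dividing by zero.
        return [0] * len(samples)

    quantized = []
    for s in samples:
        scaled = max(-1.0, min(1.0, s / peak))  # normalize, then clip to [-1, 1]
        quantized.append(int(scaled * 32767))   # int() truncates toward zero
    return quantized
```

The same guard could be applied to the tensor version by checking the peak before dividing; whether silent output can occur in practice is not stated in the model card.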

Model Details

We use a publicly available pre-trained T5Gemma model (t5gemma-b-b-ul2) for text conditioning. T5Gemma is redistributed under the Gemma Terms of Use.

Training dataset

Datasets Used

Our dataset consists of 1,278,902 audio recordings, of which 806,284 are licensed from AudioSparx and a further 472,618 are from Freesound. The Freesound portion consists of recordings licensed under CC-0, CC-BY, or CC-Sampling+. To ensure no copyrighted content was present in the Freesound data, music recordings were identified using the PANNs [89] tagger. We flagged audio that activated music-related tags for at least 30 s (threshold of 0.15) and sent it to a trusted content detection company to verify the absence of copyrighted material. All identified copyrighted content was removed. After filtering, the Freesound part includes 266,324 CC-0, 194,840 CC-BY, and 11,454 CC-Sampling+ recordings. This is the same subset of Freesound audio that we used to train Stable Audio Open: https://info.stability.ai/attributions.
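
The flagging rule described above (music-related tags active for at least 30 s at a threshold of 0.15) can be sketched as follows. This is only an illustration under stated assumptions: the card does not say how the tagger output is framed, whether the 30 s must be contiguous, or whether the comparison is strict. The sketch assumes one music score per fixed-length frame, counts active time cumulatively, and uses `>=`. The function names and the `hop_seconds` parameter are hypothetical.

```python
from typing import Sequence


def music_active_seconds(
    frame_probs: Sequence[float],
    hop_seconds: float,
    threshold: float = 0.15,
) -> float:
    """Total time during which the music score is at or above the threshold.

    Assumption: frame_probs holds one music-tag score per frame, and each
    frame covers hop_seconds of audio.
    """
    active_frames = sum(1 for p in frame_probs if p >= threshold)
    return active_frames * hop_seconds


def should_flag(
    frame_probs: Sequence[float],
    hop_seconds: float,
    threshold: float = 0.15,
    min_seconds: float = 30.0,
) -> bool:
    """Flag a recording for review when music is active for min_seconds or more."""
    return music_active_seconds(frame_probs, hop_seconds, threshold) >= min_seconds
```

For example, a recording with 30 one-second frames scoring 0.2 followed by 30 frames scoring 0.05 would be flagged under these assumptions, while one with only 29 active frames would not.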
