MaineCoon is a 22B real-time text-to-audio-video model that achieves up to 47.5 FPS on a single H100 GPU, enabling low-cost, long-duration streaming with synchronized speech and visuals for live AI characters.
SwanSphere proposes a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts using causal autoregressive diffusion transformers and multimodal learning strategies, achieving superior performance in both video-to-spatial and text-to-spatial audio tasks.
Google DeepMind released Magenta RealTime 2, an open music generation model for on-device streaming with low-latency control via text, audio examples, and MIDI.
Stability AI released Stable Audio 3.0, an open-weight model family for variable-length audio generation up to six minutes, with support for LoRA fine-tuning and audio inpainting, trained on fully licensed data.
Stability AI released Stable Audio 3 with open-source variants for music and VFX, offering fast, high-quality audio generation.
Stable Audio 3 introduces a family of fast latent diffusion models for variable-length audio generation and editing, with open-source release of small and medium model weights.
WavFlow generates high-fidelity audio directly in raw waveform space using waveform patchify and amplitude lifting, achieving competitive performance on video-to-audio and text-to-audio benchmarks without intermediate latent representations.