Tag
Scenema Audio is a zero-shot expressive voice cloning and speech generation model that produces speech with emotional arcs, pacing, and breath control from text prompts. Built on an audio diffusion transformer, it supports multilingual generation, voice cloning from 10-20 seconds of reference audio, and scene-aware audio with ambient effects.