ScenemaAI/scenema-audio
Summary
Scenema Audio is a zero-shot expressive voice cloning and speech generation model that produces speech with emotional arcs, pacing, and breath control from text prompts. Built on an audio diffusion transformer, it supports multilingual generation, voice cloning from 10-20 seconds of reference audio, and scene-aware audio with ambient effects.
Cached at: 05/15/26, 06:19 PM
Source: https://huggingface.co/ScenemaAI/scenema-audio
Zero-shot expressive voice cloning and speech generation.
Visit scenema.ai/audio to hear all demos and try it out.
Watch the demo video on YouTube
Every existing text-to-speech system converts words into sound, but none of them perform. Scenema Audio generates speech with intention, pacing, breath control, and emotional arcs that shift within a single generation, all from a text prompt that describes not just what to say but how to say it.
Built on an audio diffusion transformer extracted from LTX 2.3’s 22B parameter audiovisual model, it learned how people actually sound in real scenes: angry, laughing, whispering, crying, exhausted, terrified.
Capabilities
- Emotional acting: Rage, grief, joy, fear, exhaustion. Emotional state shifts within a single generation via action tags.
- Child voices: Six-year-olds, toddlers, teenagers. Naturally voiced, not pitch-shifted adults.
- Scene-aware audio: Describe the environment and the model generates speech with rain, thunder, crowds, or any ambient audio alongside the voice.
- Zero-shot voice cloning: Provide 10-20 seconds of reference audio with some emotional variability. The model transfers the voice identity onto any emotional performance. No fine-tuning, no enrollment.
- Long-form narration: Generates any length of audio by automatically splitting text and maintaining voice continuity across segments.
- Multilingual: English, German, French, Spanish, Italian, Portuguese, Japanese, Chinese, Korean, Russian, Arabic, Hindi, Swahili.
Model Checkpoints
| File | Size | Description |
|---|---|---|
| scenema-audio-transformer.safetensors | 9.8 GB | Audio diffusion transformer (bf16) |
| scenema-audio-transformer-int8.safetensors | 4.9 GB | Audio diffusion transformer (INT8, identical quality) |
| scenema-audio-pipeline.safetensors | 6.7 GB | Audio VAE decoder + vocoder + text projection |
| scenema-audio-vae-encoder.safetensors | 42.7 MB | Audio VAE encoder for reference voice encoding |
Quick Start
git clone https://github.com/ScenemaAI/scenema-audio.git
cd scenema-audio
export HF_TOKEN=your_huggingface_token
docker compose up
Models are downloaded on first start (~38 GB) and cached in a Docker volume. See the GitHub repo for full documentation.
Prompt Format
<speak voice="VOICE_DESCRIPTION" gender="male|female"
scene="OPTIONAL_SCENE" language="OPTIONAL_LANG_CODE">
<action>Performance direction.</action>
Speech text here.
</speak>
| Attribute | Required | Default | Description |
|---|---|---|---|
| voice | Yes | | Detailed voice description. Drives vocal quality, emotion, accent, age, timbre, delivery style. |
| gender | Yes | | "male" or "female". Controls pronoun assignment in compiled prompts. |
| scene | No | | Environmental context. Conditions the ambient audio around the speech. |
| language | No | "en" | Language code. |
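A prompt with these attributes can be assembled programmatically so that quotes and ampersands in a voice description do not break the markup. The sketch below is illustrative only: `build_speak_prompt` is a hypothetical helper, not part of the project, and it assumes the server parses the prompt as standard XML (so `&amp;` is read back as `&`).

```python
from xml.sax.saxutils import escape, quoteattr


def build_speak_prompt(voice, gender, text, scene=None, language=None):
    """Assemble a <speak> prompt string from the documented attributes.

    Hypothetical helper; assumes the server accepts standard XML escaping.
    """
    if gender not in ("male", "female"):
        raise ValueError('gender must be "male" or "female"')
    # quoteattr() adds the surrounding quotes and escapes the value.
    attrs = [f"voice={quoteattr(voice)}", f"gender={quoteattr(gender)}"]
    if scene is not None:
        attrs.append(f"scene={quoteattr(scene)}")
    if language is not None:
        attrs.append(f"language={quoteattr(language)}")
    # escape() protects &, < and > in the spoken text.
    return f"<speak {' '.join(attrs)}>\n{escape(text)}\n</speak>"


print(build_speak_prompt("Warm, calm narrator", "female", "Hello & welcome.", language="en"))
```

Optional attributes are left out entirely when not given, so the server-side defaults from the table apply.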
Voice Description
The voice attribute is the primary control. The richer and more specific, the better:
- Vocal qualities: timbre, pitch, breathiness, rasp, resonance
- Emotional state: rage, tenderness, exhaustion, excitement, grief
- Speaking style: pacing, emphasis, pauses, enunciation
- Character archetypes: “Think Tony Soprano having a breakdown”
- Age and gender: child, elderly, young woman, teenage boy
- Accents: British, Southern American, New Jersey Italian American
Action Tags
<action> tags are stage directions that shape HOW speech is delivered. Place them between speech segments to direct emotional shifts, pacing, and physical delivery:
<speak voice="Middle-aged man, warm but weathered." gender="male">
<action>Calm, almost casual. Staring at his hands.</action>
I used to think I had all the time in the world.
<action>Voice tightens. Fighting to stay composed.</action>
Then one Tuesday morning, the doctor said three words that changed everything.
<action>Long pause. Deep breath. Raw but steady.</action>
And I realized I hadn't called my son in six months.
</speak>
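Before sending a long prompt, it can help to check that directions and spoken lines alternate the way you intended. The sketch below lists the segments of a prompt in order using Python's standard `xml.etree.ElementTree`. `list_segments` is a hypothetical helper, not part of the project, and it assumes the prompt is well-formed XML.

```python
import xml.etree.ElementTree as ET


def list_segments(prompt):
    """Return (kind, text) pairs in document order for a <speak> prompt."""
    root = ET.fromstring(prompt)
    segments = []
    # Speech before the first tag is stored on the root's .text.
    if root.text and root.text.strip():
        segments.append(("speech", root.text.strip()))
    for child in root:
        segments.append((child.tag, (child.text or "").strip()))
        # Speech following a tag is stored on that tag's .tail.
        if child.tail and child.tail.strip():
            segments.append(("speech", child.tail.strip()))
    return segments


prompt = (
    '<speak voice="Middle-aged man, warm but weathered." gender="male">'
    "<action>Calm, almost casual.</action>"
    "I used to think I had all the time in the world."
    "<action>Voice tightens.</action>"
    "Then one Tuesday morning, everything changed."
    "</speak>"
)
for kind, text in list_segments(prompt):
    print(f"{kind:>7}: {text}")
```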
Voice Cloning
Provide 10-20 seconds of reference audio with some emotional variability. The model generates expressive speech from the prompt and transfers the reference voice’s identity onto the performance.
{
"prompt": "<speak voice=\"Gravelly male voice, fast talking, rough.\" gender=\"male\"><action>He completely loses it</action>What are you waiting for?!</speak>",
"reference_voice_url": "https://example.com/reference.wav"
}
Any voice can perform any emotion, even if that voice has never been recorded in that emotional state.
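A request body like the one above could be sent from Python with the standard library alone. This is a minimal sketch under stated assumptions: the base URL and port are guesses (the model card does not state where the Docker service listens), the JSON content type is assumed from the example body, and `make_clone_request` is a hypothetical helper.

```python
import json
import urllib.request

# Assumption: host and port are placeholders; check the repository docs.
BASE_URL = "http://localhost:8000"


def make_clone_request(prompt, reference_voice_url, base_url=BASE_URL):
    """Build (but do not send) a POST /generate request for voice cloning."""
    body = json.dumps({
        "prompt": prompt,
        "reference_voice_url": reference_voice_url,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = make_clone_request(
    '<speak voice="Gravelly male voice, fast talking, rough." gender="male">'
    "<action>He completely loses it</action>What are you waiting for?!</speak>",
    "https://example.com/reference.wav",
)
print(req.get_method(), req.full_url)
```

To actually send it, pass `req` to `urllib.request.urlopen` and parse the reply with `json.load`. Building the prompt as a Python string and letting `json.dumps` handle the quoting avoids hand-escaping the inner double quotes.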
Examples
Emotional Acting
<speak voice="A man on the edge. Explosive rage. Italian-American inflection."
gender="male" scene="A dimly lit office, late at night">
<action>He stands up slowly, voice dangerously low</action>
You come into my house, you eat my food, and then you got the nerve
to tell me how to run my business.
<action>Voice rising, finger pointing</action>
I built this thing from nothing while you were sitting on your ass.
</speak>
Child Voice
<speak voice="A six-year-old girl, bright and excited, speaking fast
with breathless enthusiasm. Slight lisp on S sounds."
gender="female">
Mommy look! There is a rainbow and it goes all the way across the whole sky!
</speak>
Scene-Aware Audio
<speak voice="Male, mid 40s. Weathered. Urgent, projecting over wind."
gender="male" scene="Open dock in a thunderstorm, heavy rain"
shot="scene">
<sound>Heavy rain and wind howling</sound>
<action>He shouts over the storm</action>
Get the lines! She is pulling loose!
<sound>Thunder cracks overhead</sound>
Move! I said move!
</speak>
API Reference
POST /generate
| Field | Type | Default | Description |
|---|---|---|---|
| prompt | string | required | <speak> XML string |
| mode | string | "generate" | "generate" for full pipeline. "voice_design" for 15s voice preview. |
| reference_voice_url | string | null | URL to reference audio for zero-shot voice cloning. 10-20 seconds with emotional variability is ideal. |
| background_sfx | bool | false | Keep generated sound effects in the output. |
| validate | bool | true | Whisper speech validation with retry on garbled output. |
| seed | int | -1 | Generation seed. -1 for random. |
| pace | float | 1.5 | Duration allocation multiplier. Higher = slower speech. |
| min_match_ratio | float | 0.90 | Whisper validation threshold (0.0-1.0). |
| skip_vc | bool | false | Skip voice conversion post-processing. |
| vc_steps | int | 25 | SeedVC diffusion steps (10-50). |
| vc_cfg_rate | float | 0.5 | SeedVC guidance rate (0.0-1.0). |
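The documented fields, defaults, and ranges can be mirrored in a small client-side type so that out-of-range values are caught before a request is sent. This is a sketch, not the project's own client: `GenerateRequest` is a hypothetical name, and whether the server itself rejects out-of-range values is not stated in the model card.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class GenerateRequest:
    """Client-side mirror of the POST /generate fields (hypothetical helper)."""

    prompt: str
    mode: str = "generate"
    reference_voice_url: Optional[str] = None
    background_sfx: bool = False
    validate: bool = True
    seed: int = -1
    pace: float = 1.5
    min_match_ratio: float = 0.90
    skip_vc: bool = False
    vc_steps: int = 25
    vc_cfg_rate: float = 0.5

    def __post_init__(self):
        # Ranges below are the ones listed in the field descriptions.
        if self.mode not in ("generate", "voice_design"):
            raise ValueError('mode must be "generate" or "voice_design"')
        if not 0.0 <= self.min_match_ratio <= 1.0:
            raise ValueError("min_match_ratio must be within 0.0-1.0")
        if not 10 <= self.vc_steps <= 50:
            raise ValueError("vc_steps must be within 10-50")
        if not 0.0 <= self.vc_cfg_rate <= 1.0:
            raise ValueError("vc_cfg_rate must be within 0.0-1.0")

    def to_payload(self):
        """Return a plain dict ready for json.dumps()."""
        return asdict(self)


request = GenerateRequest(prompt='<speak voice="Calm narrator" gender="female">Hello.</speak>', seed=42)
print(request.to_payload())
```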
Response
Returns JSON with base64-encoded WAV audio:
{
"status": "succeeded",
"audio": "<base64-encoded WAV>",
"content_type": "audio/wav",
"metadata": {
"duration_s": 12.4,
"sample_rate": 48000,
"processing_ms": 8200,
"seed": 42
}
}
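Once the JSON reply has been parsed into a dict, the audio field decodes to ordinary WAV bytes. The sketch below shows one way to do that and to read the duration back from the WAV header. Both function names are hypothetical, and the duration check assumes PCM-encoded WAV, which is what Python's `wave` module reads.

```python
import base64
import io
import wave


def decode_audio(response):
    """Return the raw WAV bytes from a succeeded /generate response dict."""
    if response.get("status") != "succeeded":
        raise RuntimeError(f"generation did not succeed: {response.get('status')!r}")
    return base64.b64decode(response["audio"])


def wav_duration_s(wav_bytes):
    """Duration in seconds, read from the WAV header (assumes PCM WAV)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

The decoded bytes can be written straight to disk, for example with `pathlib.Path("out.wav").write_bytes(decode_audio(response))`.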
Architecture
XML prompt (voice + scene + action tags + text)
-> Gemma 3 12B text encoding
-> 8-step distilled latent diffusion
-> Audio VAE decoding
-> MelBandRoFormer vocal separation (strips SFX unless background_sfx=true)
-> SeedVC voice identity transfer (when reference provided or multi-chunk)
-> Output WAV (48kHz stereo)
For longer text, the system splits at sentence boundaries using Kokoro phoneme-level duration estimation and maintains voice continuity between segments via A2V latent conditioning.
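The splitting step can be illustrated with a simplified stand-in. The sketch below packs whole sentences greedily into segments under the ~15-second cap, using a fixed words-per-second rate in place of Kokoro's phoneme-level estimate. The rate, the regular expression, and both function names are assumptions for illustration, not the project's implementation.

```python
import re

MAX_SEGMENT_S = 15.0    # per-segment cap stated in the model card
WORDS_PER_SECOND = 2.5  # assumption: crude speaking rate, stands in for Kokoro


def estimate_duration_s(text):
    """Very rough duration estimate from word count."""
    return len(text.split()) / WORDS_PER_SECOND


def split_into_segments(text, max_s=MAX_SEGMENT_S):
    """Greedily pack whole sentences into segments of at most max_s seconds.

    A single sentence longer than max_s is kept whole as its own segment.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and estimate_duration_s(candidate) > max_s:
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


sentence = " ".join(["word"] * 19) + " end."
for segment in split_into_segments(" ".join([sentence] * 3)):
    print(round(estimate_duration_s(segment), 1), "s")
```

This covers only where to cut. Keeping the voice consistent across the resulting segments is handled by the model's latent conditioning and is not shown here.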
VRAM Requirements
| VRAM | Audio Model | Gemma | Notes |
|---|---|---|---|
| 16 GB | INT8 (4.9 GB) | CPU streaming | Needs 32 GB system RAM. ~7s/chunk encode. |
| 24 GB | INT8 (4.9 GB) | NF4 on GPU (~8 GB) | Default config. ~0.2s/chunk encode. |
| 48 GB | bf16 (9.8 GB) | bf16 on GPU (24 GB) | Best quality. All models resident. |

VRAM strategy is auto-detected. SageAttention 2 recommended for all configurations.
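Read as a lookup, the table maps available GPU memory to a model configuration. The sketch below encodes only the three rows listed. The project's own auto-detection may use different thresholds or support other combinations, and `pick_config` and its dictionary keys are hypothetical.

```python
def pick_config(vram_gb):
    """Map available GPU memory (GB) to the configurations listed in the table.

    Illustrative only; not the project's auto-detection logic.
    """
    if vram_gb >= 48:
        return {"audio_model": "bf16", "gemma": "bf16 on GPU"}
    if vram_gb >= 24:
        return {"audio_model": "INT8", "gemma": "NF4 on GPU"}
    if vram_gb >= 16:
        return {"audio_model": "INT8", "gemma": "CPU streaming"}
    raise ValueError("below the smallest configuration listed (16 GB)")


print(pick_config(24))
```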
Performance
Benchmarked on NVIDIA RTX 4090 (24 GB), ~55 seconds of output audio:
| Configuration | Total Time | Real-Time Factor |
|---|---|---|
| bf16 + bf16 streaming | 83s | 0.66x |
| INT8 + NF4 (all GPU) | 35s | 1.57x |
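The real-time factors in the table are consistent with dividing seconds of audio produced by seconds of processing, so values above 1x mean faster than real time. That definition is inferred from the numbers rather than stated in the model card. A quick check:

```python
def real_time_factor(audio_s, processing_s):
    """Seconds of audio produced per second of compute (higher is faster)."""
    return audio_s / processing_s


print(round(real_time_factor(55, 83), 2))  # 0.66
print(round(real_time_factor(55, 35), 2))  # 1.57
```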
Limitations
- Pronunciation: Occasionally garbles complex multi-syllable words and proper nouns.
- 15-second generation window: Each segment capped at ~15s. Longer text splits automatically.
- Emotional range with voice cloning: Identity transfer can reduce emotional extremes. Use a strong archetype in the voice description and provide reference audio with natural emotional variability (10-20 seconds, not monotone).
- Multilingual pronunciation: Language switching mid-speech may cause phonetic drift. Use separate requests per language.
- Generation speed: 3-8 seconds per 15-second segment depending on hardware.
- Reference audio quality: Low-quality references degrade output. Use clean audio with some emotional variability.
- Gemma 3 12B is gated: Requires accepting Google’s terms of use and a HuggingFace token with access.
Acknowledgments
- LTX-2 by Lightricks for the base audiovisual model
- Gemma 3 by Google for the text encoder
- SeedVC by Plachta for voice refinement
- Kokoro by hexgrad for duration estimation
- SageAttention for attention acceleration
License
The model weights are released under the LTX-2 Community License Agreement. Scenema Audio’s audio diffusion transformer is derived from LTX 2.3’s audiovisual model, and its weights are subject to the same terms.
The inference code and server are released under the MIT License.
Gemma 3 12B (text encoder) is a gated model requiring acceptance of Google’s terms of use.