jdopensource/JoyAI-Echo
Summary
JD Open Source releases JoyAI-Echo (Echo-LongVideo), a text-to-audio-video diffusion model capable of generating minute-level multi-shot videos with consistent character identity and voice, using DMD distillation for 7.5x speedup.
Cached at: 06/08/26, 03:15 PM
Source: https://huggingface.co/jdopensource/JoyAI-Echo

Echo-LongVideo
Pushing the Frontier of Long Video Generation
Official model weights for minute-level multi-shot audio-video generation with a distilled DMD generator, paired cross-modal memory, and story-level consistency.
Model Summary
Echo-LongVideo (a.k.a. JoyAI-Echo) is a long-form, multi-shot, audio-video generation model. A cross-modal audio-visual memory bank preserves character appearance and voice timbre consistently across up to five-minute videos, and a post-training pipeline combining memory-based reinforcement learning with distribution matching distillation (DMD) delivers a 7.5× inference speedup without sacrificing quality.
In human evaluation, Echo-LongVideo decisively outperforms HappyOyster (directing mode) on long-form generation and surpasses the short-video specialist Wan 2.6 on human-centric tasks.
This repository hosts the released checkpoint. Inference code is released separately; see the Usage section.
Model Details
- **Developed by:** Echo Team @ Joy Future Academy, JD
- **Model type:** Text-to-(Audio+Video) diffusion transformer, DMD 8-step
- **Modality:** Text → synchronized video + audio
- **Backbone:** Built on top of LTX-Video
- **Text encoder:** google/gemma-3-12b-it (downloaded separately)
- **Resolution / length (by default):** 1280 × 736, 241 frames @ 25 fps per shot
- **Max story length:** up to 5 minutes (multi-shot)
- **License:** LTX-2 Community License Agreement
Highlights
- Minute-level multi-shot stories from a single prompt JSON.
- DMD-distilled few-step inference, ~7.5× faster than the original pipeline.
- Joint audio-video generation in a single pipeline.
- Paired cross-modal memory bank for story-level identity and voice consistency.
Usage
Inference is run with the standalone Echo-LongVideo inference repository.
1. Download the checkpoint
huggingface-cli download <org>/Echo-LongVideo \
--local-dir checkpoints
Also download the Gemma text encoder:
huggingface-cli download google/gemma-3-12b-it \
--local-dir checkpoints/gemma-3-12b
Expected layout:
checkpoints/
├── echo-longvideo-release.safetensors
└── gemma-3-12b/
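Before running inference, it can help to confirm that both downloads landed where the layout above expects them. The following is a minimal sketch, not part of the inference repository: the helper name `missing_checkpoint_entries` is hypothetical, and it only checks that the two entries exist, not that their contents are complete.

```python
from pathlib import Path

# Entries taken from the "Expected layout" listing above.
EXPECTED_ENTRIES = ("echo-longvideo-release.safetensors", "gemma-3-12b")


def missing_checkpoint_entries(root="checkpoints"):
    """Return the expected entries that do not exist under `root`.

    Hypothetical helper: it checks presence only, not file integrity.
    """
    root_path = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root_path / name).exists()]


if __name__ == "__main__":
    missing = missing_checkpoint_entries()
    if missing:
        print("Missing under checkpoints/:", ", ".join(missing))
    else:
        print("Checkpoint layout looks complete.")
```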
2. Get the inference code
git clone https://github.com/Echo-Team-Joy-Future-Academy-JD/JoyAI-Echo.git
cd JoyAI-Echo
Environment: Python 3.11 + PyTorch 2.8 + CUDA 12.8 (see the inference repo's environment.yml / requirements.txt).
3. Write a story prompt
Enhance your prompt first. We provide prompt enhancers: system prompts that expand a short story or idea into well-formed shot prompts. Use prompts/long_story_writer_system_prompt.md for long, multi-shot video and prompts/short_story_writer_system_prompt.md for single-shot short video. We strongly recommend running your input through the matching enhancer before inference; un-enhanced prompts tend to produce noticeably weaker results.
Create a JSON file under prompts/. Each file is a single object with a prompts list, where every string is one complete shot. A single string produces one shot; multiple strings produce a multi-shot story, with each new shot conditioned on the previous ones through the paired audio-video memory bank.
Inside each string, write these parts in order:

| Part | What to describe |
| --- | --- |
| Roles & Subjects | Describe the appearance of all visible people, including age, build, hair, face, wardrobe, and speaking voice timbre when applicable. |
| Action & Dialogue | What the subject does and speaks. |
| Style | The overall visual and emotional aesthetic, e.g. realistic motorsport film language, cool daylight, restrained cinematic tension. |
| Camera Movement | The shot type and framing or movement, e.g. a stable close-up on the face, or a medium shot from the waist up. |
| Background | The setting and scene details behind the subject. |
| Sound Effects & BGM | The sounds in the scene and the background music, e.g. room tone, wind, footsteps and fabric, with a soft low music bed under the dialogue or no background music. |

A more convenient prompt-writing workflow will be released as a director agent for everyone to use.
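The prompt file format described above (one object with a prompts list, one string per shot) can be sketched as follows. This is an illustration under stated assumptions: the helper `write_story_prompt`, the file name, and the shot text are hypothetical, and the shots have not been run through the prompt enhancers. Each example string follows the part order from the table.

```python
import json
from pathlib import Path

# Illustrative shots only (hypothetical text). Each string is one complete shot and
# follows the order: roles, action and dialogue, style, camera, background, sound.
EXAMPLE_SHOTS = [
    (
        "A woman in her early thirties with a slim build, short black hair, and a gray "
        "wool coat; her voice is low and calm. She looks up from a paper map and says, "
        "'We are almost there.' Realistic film look, cool daylight, restrained tension. "
        "Stable medium shot from the waist up. A windy mountain road with a parked car "
        "behind her. Wind, fabric rustle, footsteps on gravel, no background music."
    ),
    (
        "The same woman in her early thirties with short black hair and a gray wool "
        "coat; her voice is low and calm. She folds the map and says, 'Let's keep "
        "moving.' Realistic film look, cool daylight, restrained tension. Stable "
        "close-up on the face. The mountain road curving into fog. Wind and distant "
        "birds, with a soft low music bed under the dialogue."
    ),
]


def write_story_prompt(path, shots):
    """Write a JSON object of the form {"prompts": [...]} with one string per shot."""
    if not shots or not all(isinstance(s, str) and s.strip() for s in shots):
        raise ValueError("every shot must be a non-empty string")
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"prompts": list(shots)}
    path.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```

Calling `write_story_prompt("prompts/example_story.json", EXAMPLE_SHOTS)` would produce a two-shot story file; passing a single string in the list would produce a one-shot file.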
4. Run
python inference.py
Outputs land in inference_result/outputs/<prompt-name>/inference_<timestamp>/.
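Because each run writes to a new inference_<timestamp>/ directory, a small helper can locate the most recent one for a given prompt. This is a sketch under assumptions: the function name `latest_inference_dir` is hypothetical, and since the timestamp format is not documented here, it orders runs by filesystem modification time rather than by parsing the directory name.

```python
from pathlib import Path


def latest_inference_dir(prompt_name, outputs_root="inference_result/outputs"):
    """Return the most recently modified inference_* directory, or None if absent."""
    prompt_dir = Path(outputs_root) / prompt_name
    if not prompt_dir.is_dir():
        return None
    runs = [p for p in prompt_dir.glob("inference_*") if p.is_dir()]
    # Order by modification time; the timestamp naming scheme is not assumed.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)
```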
Hardware
Peak GPU memory is **~46–50 GB** at the default 1280 × 736 × 241 frame setting, so a single H100/A100 (80 GB) or 48 GB GPU is sufficient. For smaller GPUs, lower the resolution or frame count:
python inference.py --num-frames 121 --video-height 480 --video-width 832
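To get a rough sense of how much smaller the reduced setting is, the sketch below compares the raw width × height × frames volume of the two settings from this section. It is only arithmetic: peak memory also includes model weights and other fixed costs, so it should not be read as a memory prediction. The divisibility check reflects a pattern that both settings here happen to share (dimensions that are multiples of 32, frame counts of the form 8k + 1); whether inference.py requires it is an assumption.

```python
def pixel_volume(width, height, num_frames):
    """Raw number of pixel positions across all frames."""
    return width * height * num_frames


def follows_observed_pattern(width, height, num_frames):
    """True if width/height are multiples of 32 and num_frames is 8*k + 1.

    Assumption: this pattern is inferred from the two settings in this card,
    not from a documented requirement of inference.py.
    """
    return width % 32 == 0 and height % 32 == 0 and num_frames % 8 == 1


DEFAULT = (1280, 736, 241)   # default setting from the Hardware section
REDUCED = (832, 480, 121)    # reduced setting from the example command

ratio = pixel_volume(*DEFAULT) / pixel_volume(*REDUCED)
print(f"default / reduced pixel volume: {ratio:.2f}x")
print("pattern holds:", follows_observed_pattern(*DEFAULT), follows_observed_pattern(*REDUCED))
```

The reduced setting processes roughly 4.7 times fewer pixel positions than the default.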
Results
Reported Scale

| Item | Value |
| --- | --- |
| Long-form coherent story length | 5 min |
| Speedup over the original multi-step pipeline | 7.5× |
| Benchmark stories | 100 |
| Generated evaluation shots | 3,000 |
| Frames per shot | 241 @ 25 fps |
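The reported figures can be cross-checked with a little arithmetic. The sketch below assumes the 3,000 evaluation shots are spread evenly across the 100 benchmark stories and that shots are concatenated without overlap; neither assumption is stated in the card.

```python
FRAMES_PER_SHOT = 241
FPS = 25
BENCHMARK_STORIES = 100
EVALUATION_SHOTS = 3000

# Duration of one shot at the default setting.
seconds_per_shot = FRAMES_PER_SHOT / FPS

# Average shots per story, assuming an even split (an assumption, see above).
shots_per_story = EVALUATION_SHOTS / BENCHMARK_STORIES

# Resulting story length if shots are simply played back to back.
story_seconds = shots_per_story * seconds_per_shot
story_minutes = story_seconds / 60

print(f"{seconds_per_shot:.2f} s per shot")
print(f"{shots_per_story:.0f} shots per story on average")
print(f"{story_seconds:.1f} s per story, about {story_minutes:.2f} min")
```

Under these assumptions, each shot lasts 9.64 s and an average 30-shot story runs about 4.82 minutes, which is consistent with the stated five-minute story length.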
Human Evaluation
GSB (Good/Same/Bad) user study. Values are the percentage of user preferences.

| Aspect (Long Video) | JoyAI-Echo | Tie | HappyOyster (Directing) |
| --- | --- | --- | --- |
| Visual aesthetics | 63.6% | 8.8% | 27.6% |
| Audio quality | 81.7% | 6.5% | 11.8% |
| Prompt following | 80.6% | 13.5% | 5.9% |
| IP consistency | 59.4% | 12.9% | 27.7% |

| Aspect (Short Video) | JoyAI-Echo | Tie | Wan 2.6 |
| --- | --- | --- | --- |
| Visual aesthetics | 58.8% | 14.7% | 26.5% |
| Audio quality | 32.3% | 30.9% | 36.8% |
| Prompt following | 33.8% | 36.8% | 29.4% |
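One common way to summarize a preference study like this is the net margin, the share preferring JoyAI-Echo minus the share preferring the baseline. The sketch below computes it from the percentages reported above; the net margin is not a metric reported by the authors, and the names used are hypothetical.

```python
# (JoyAI-Echo preferred, tie, baseline preferred), in percent, from the tables above.
LONG_VIDEO = {
    "Visual aesthetics": (63.6, 8.8, 27.6),
    "Audio quality": (81.7, 6.5, 11.8),
    "Prompt following": (80.6, 13.5, 5.9),
    "IP consistency": (59.4, 12.9, 27.7),
}
SHORT_VIDEO = {
    "Visual aesthetics": (58.8, 14.7, 26.5),
    "Audio quality": (32.3, 30.9, 36.8),
    "Prompt following": (33.8, 36.8, 29.4),
}


def net_preference(good, same, bad):
    """Net margin in percentage points: wins minus losses (ties are ignored)."""
    return round(good - bad, 1)


for title, table in (("Long video vs HappyOyster", LONG_VIDEO),
                     ("Short video vs Wan 2.6", SHORT_VIDEO)):
    print(title)
    for aspect, row in table.items():
        print(f"  {aspect}: {net_preference(*row):+.1f} pts")
```

A positive margin favors JoyAI-Echo and a negative one favors the baseline. On the short-video audio quality row the margin is negative, so the short-video comparison is mixed rather than uniform.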
Acknowledgements
We gratefully acknowledge LTX-Video for the base video generator and Gemma for the text encoder, along with the broader open-source community.
Citation
If Echo-LongVideo helps your research or products, please cite:
@techreport{echo2026longvideo,
title = {Echo-LongVideo: Pushing the Frontier of Long Video Generation},
author = {{Echo Team @ Joy Future Academy, JD}},
institution = {Joy Future Academy, JD},
year = {2026},
month = {June},
url = {https://github.com/Echo-Team-Joy-Future-Academy-JD/JoyAI-Echo}
}
License
Released under the LTX-2 Community License Agreement. By downloading or using these weights, you agree to its terms. The Gemma text encoder, which is downloaded separately, is governed by Google's separate Gemma license.