jdopensource/JoyAI-Echo


Summary

JD Open Source releases JoyAI-Echo (Echo-LongVideo), a text-to-audio-video diffusion model capable of generating minute-level multi-shot videos with consistent character identity and voice, using DMD distillation for 7.5x speedup.

Task: text-to-video
Tags: ltx-video, JoyAI-Echo, text-to-video, video-generation, audio-video-generation, long-video, multi-shot, dmd, license:other, region:us

Cached at: 06/08/26, 03:15 PM


Source: https://huggingface.co/jdopensource/JoyAI-Echo

Echo-LongVideo

🎬 Pushing the Frontier of Long Video Generation

Official model weights for minute-level multi-shot audio-video generation with a distilled DMD generator, paired cross-modal memory, and story-level consistency.

Model Summary

Echo-LongVideo (a.k.a. JoyAI-Echo) is a long-form, multi-shot, audio-video generation model. A cross-modal audio-visual memory bank preserves character appearance and voice timbre consistently across videos of up to five minutes, and a post-training pipeline combining memory-based reinforcement learning with distribution matching distillation (DMD) delivers a 7.5× inference speedup without sacrificing quality.

In human evaluation, Echo-LongVideo decisively outperforms HappyOyster (directing mode) on long-form generation and surpasses the short-video specialist Wan 2.6 on human-centric tasks.

This repository hosts the released checkpoint. Inference code is released separately; see the Usage section.

Model Details

  • **Developed by:** Echo Team @ Joy Future Academy, JD
  • **Model type:** Text-to-(Audio+Video) diffusion transformer, DMD 8-step
  • **Modality:** Text → synchronized video + audio
  • **Backbone:** Built on top of LTX-Video
  • **Text encoder:** google/gemma-3-12b-it (downloaded separately)
  • **Resolution / length (by default):** 1280 × 736, 241 frames @ 25 fps per shot
  • **Max story length:** up to 5 minutes (multi-shot)
  • **License:** LTX-2 Community License Agreement

Highlights

  • 🎞️ Minute-level multi-shot stories from a single prompt JSON.
  • ⚡ DMD-distilled few-step inference, ~7.5× faster than the original pipeline.
  • 🔊 Joint audio-video generation in a single pipeline.
  • 🧠 Paired cross-modal memory bank for story-level identity and voice consistency.

Usage

Inference is run with the standalone Echo-LongVideo inference repository.

1. Download the checkpoint

huggingface-cli download <org>/Echo-LongVideo \
  --local-dir checkpoints

Also download the Gemma text encoder:

huggingface-cli download google/gemma-3-12b-it \
  --local-dir checkpoints/gemma-3-12b

Expected layout:

checkpoints/
├── echo-longvideo-release.safetensors
└── gemma-3-12b/
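
Before running inference, it can help to confirm that both downloads landed where the layout above expects them. The following is a minimal sketch, not part of the inference repository: the helper name `missing_checkpoint_items` is hypothetical, and it only checks that the two entries exist, not that their contents are complete.

```python
from pathlib import Path

# Entry names taken from the expected layout above.
CHECKPOINT_FILE = "echo-longvideo-release.safetensors"
TEXT_ENCODER_DIR = "gemma-3-12b"


def missing_checkpoint_items(root: str = "checkpoints") -> list[str]:
    """Return the expected entries that are absent under `root`.

    Hypothetical helper for illustration; it does not validate file contents.
    """
    base = Path(root)
    missing = []
    if not (base / CHECKPOINT_FILE).is_file():
        missing.append(CHECKPOINT_FILE)
    if not (base / TEXT_ENCODER_DIR).is_dir():
        missing.append(TEXT_ENCODER_DIR + "/")
    return missing


if __name__ == "__main__":
    problems = missing_checkpoint_items()
    print("layout OK" if not problems else f"missing: {problems}")
```

An empty list means both the checkpoint file and the text encoder directory are present.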

2. Get the inference code

git clone https://github.com/Echo-Team-Joy-Future-Academy-JD/JoyAI-Echo.git
cd JoyAI-Echo

Environment: Python 3.11 + PyTorch 2.8 + CUDA 12.8 (see the inference repo's environment.yml / requirements.txt).

3. Write a story prompt

Enhance your prompt first. We provide prompt enhancers: system prompts that expand a short story or idea into well-formed shot prompts. Use prompts/long_story_writer_system_prompt.md for long, multi-shot video and prompts/short_story_writer_system_prompt.md for single-shot short video. We strongly recommend running your input through the matching enhancer before inference; un-enhanced prompts tend to produce noticeably weaker results.

Create a JSON file under prompts/. Each file is a single object with a prompts list, where every string is one complete shot. A single string produces one shot; multiple strings produce a multi-shot story, with each new shot conditioned on the previous ones through the paired audio-video memory bank.

Inside each string, write these parts in order:

| Part | What to describe |
| --- | --- |
| Roles & Subjects | Describe the appearance of all visible people, including age, build, hair, face, wardrobe, and speaking voice timbre when applicable. |
| Action & Dialogue | What the subject does and speaks. |
| Style | The overall visual and emotional aesthetic, e.g. realistic motorsport film language, cool daylight, restrained cinematic tension. |
| Camera Movement | The shot type and framing or movement, e.g. a stable close-up on the face, or a medium shot from the waist up. |
| Background | The setting and scene details behind the subject. |
| Sound Effects & BGM | The sounds in the scene and the background music, e.g. room tone, wind, footsteps and fabric, with a soft low music bed under the dialogue or no background music. |

A more convenient prompt-writing workflow will be released as a director agent for everyone to use.
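
The prompt file format described above (one object with a `prompts` list, one string per shot, six parts in order) can be assembled programmatically. The sketch below is illustrative only: the dictionary keys in `PART_ORDER`, the helper names `build_shot` and `write_story`, and the choice to join parts with a single space are assumptions, since the model card specifies only the part order and the top-level JSON shape.

```python
import json
from pathlib import Path

# Hypothetical keys, one per documented part, in the documented order.
PART_ORDER = [
    "roles_subjects",
    "action_dialogue",
    "style",
    "camera_movement",
    "background",
    "sound_bgm",
]


def build_shot(parts: dict[str, str]) -> str:
    """Join the six parts into one shot string, in the documented order."""
    missing = [name for name in PART_ORDER if not parts.get(name, "").strip()]
    if missing:
        raise ValueError(f"missing parts: {missing}")
    # Joining with a space is an assumption; the card only fixes the order.
    return " ".join(parts[name].strip() for name in PART_ORDER)


def write_story(shots: list[dict[str, str]], path: str) -> None:
    """Write {"prompts": [...]} with exactly one string per shot."""
    payload = {"prompts": [build_shot(shot) for shot in shots]}
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
```

Calling `write_story([shot_1, shot_2], "prompts/my_story.json")` with two part dictionaries would produce a two-shot story file; the file name here is a placeholder. Raising on a missing part is a design choice that catches incomplete shots before a long inference run.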

4. Run

python inference.py

Outputs land in inference_result/outputs/<prompt-name>/inference_<timestamp>/.
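
Because each run writes to a new timestamped directory, a small helper can locate the most recent one for a given prompt. This is a sketch under assumptions: `latest_run` is a hypothetical name, and since the timestamp format is not documented here, it orders runs by directory modification time rather than by parsing the directory name.

```python
from pathlib import Path
from typing import Optional


def latest_run(prompt_name: str, root: str = "inference_result/outputs") -> Optional[Path]:
    """Return the most recently modified inference_* directory, or None.

    Orders by modification time because the timestamp format in the
    directory name is not assumed.
    """
    base = Path(root) / prompt_name
    if not base.is_dir():
        return None
    runs = [p for p in base.glob("inference_*") if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)
```

A `None` result means no run directory exists yet for that prompt name.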

Hardware

Peak GPU memory is ~46–50 GB at the default 1280 × 736 × 241 frame setting, so a single H100/A100 (80 GB) or 48 GB GPU is sufficient. For smaller GPUs, lower resolution or frame count:

python inference.py --num-frames 121 --video-height 480 --video-width 832

Results

Reported Scale

| Item | Value |
| --- | --- |
| 🎬 Long-form coherent story length | 5 min |
| ⚡ Speedup over the original multi-step pipeline | 7.5× |
| 📚 Benchmark stories | 100 |
| 🎞️ Generated evaluation shots | 3,000 |
| 🕒 Frames per shot | 241 @ 25 fps |
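
The reported figures can be cross-checked with simple arithmetic. The sketch below assumes that shots are played back to back with no overlap or trimming and that the 3,000 evaluation shots are spread evenly over the 100 benchmark stories; neither assumption is stated in the model card.

```python
import math

FRAMES_PER_SHOT = 241  # default frames per shot, from the model card
FPS = 25               # default frame rate, from the model card


def shot_seconds(frames: int = FRAMES_PER_SHOT, fps: int = FPS) -> float:
    """Duration of one shot in seconds."""
    return frames / fps


def shots_needed(target_seconds: float) -> int:
    """Smallest number of default-length shots covering `target_seconds`.

    Assumes shots are concatenated with no overlap or trimming.
    """
    return math.ceil(target_seconds / shot_seconds())


if __name__ == "__main__":
    per_shot = shot_seconds()
    shots_per_story = 3000 // 100  # assumes an even split across stories
    print(f"{per_shot:.2f} s per shot")
    print(f"{shots_per_story * per_shot / 60:.2f} min per story")
    print(f"{shots_needed(5 * 60)} shots to cover 5 minutes")
```

Under these assumptions each shot lasts 9.64 s, so 30 shots per story come to roughly 4.8 minutes, which is consistent with the stated 5-minute story length.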

Human Evaluation

GSB (Good/Same/Bad) user study. Values are the percentage of user preferences for each outcome.

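| Aspect (Long Video) | JoyAI-Echo | Tie | HappyOyster (Directing) |
| --- | --- | --- | --- |
| Visual aesthetics | 63.6% | 8.8% | 27.6% |
| Audio quality | 81.7% | 6.5% | 11.8% |
| Prompt following | 80.6% | 13.5% | 5.9% |
| IP consistency | 59.4% | 12.9% | 27.7% |

| Aspect (Short Video) | JoyAI-Echo | Tie | Wan 2.6 |
| --- | --- | --- | --- |
| Visual aesthetics | 58.8% | 14.7% | 26.5% |
| Audio quality | 32.3% | 30.9% | 36.8% |
| Prompt following | 33.8% | 36.8% | 29.4% |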
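
One common way to condense a GSB row into a single number is the net preference, (good − bad) / (good + same + bad). The model card does not report this metric, so the sketch below is only an illustrative reading of the percentages above; `net_preference` is a hypothetical helper name.

```python
# (JoyAI-Echo preferred, tie, baseline preferred) percentages from the tables above.
LONG_VIDEO = {
    "Visual aesthetics": (63.6, 8.8, 27.6),
    "Audio quality": (81.7, 6.5, 11.8),
    "Prompt following": (80.6, 13.5, 5.9),
    "IP consistency": (59.4, 12.9, 27.7),
}
SHORT_VIDEO = {
    "Visual aesthetics": (58.8, 14.7, 26.5),
    "Audio quality": (32.3, 30.9, 36.8),
    "Prompt following": (33.8, 36.8, 29.4),
}


def net_preference(good: float, same: float, bad: float) -> float:
    """Net preference in [-1, 1]; positive favours JoyAI-Echo."""
    return (good - bad) / (good + same + bad)


if __name__ == "__main__":
    for label, table in (("long", LONG_VIDEO), ("short", SHORT_VIDEO)):
        for aspect, row in table.items():
            print(f"{label:5s} {aspect:18s} {net_preference(*row):+.3f}")
```

By this measure every long-video aspect is positive, while the short-video audio quality row is slightly negative because Wan 2.6 was preferred more often (36.8% vs 32.3%).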

Acknowledgements

We gratefully acknowledge LTX-Video for the base video generator and Gemma for the text encoder, along with the broader open-source community.

Citation

If Echo-LongVideo helps your research or products, please cite:

@techreport{echo2026longvideo,
  title        = {Echo-LongVideo: Pushing the Frontier of Long Video Generation},
  author       = {{Echo Team @ Joy Future Academy, JD}},
  institution  = {Joy Future Academy, JD},
  year         = {2026},
  month        = {June},
  url          = {https://github.com/Echo-Team-Joy-Future-Academy-JD/JoyAI-Echo}
}

License

Released under the LTX-2 Community License Agreement. By downloading or using these weights, you agree to its terms. The bundled Gemma text encoder is governed by Google's separate Gemma license.
