LTX-2: Efficient Joint Audio-Visual Foundation Model

Papers with Code Trending 01/06/26, 06:24 PM Papers

Summary

LTX-2 is an efficient, open-source foundation model that jointly generates video and audio. The cached page text mixes the paper abstract with an unrelated video script about countries facing existential threats; the paper is the primary subject.

Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.
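The coupling between the two streams can be illustrated with a toy example. The following is a minimal NumPy sketch of bidirectional cross-attention between a wider video stream and a narrower audio stream, not the released LTX-2 implementation. The function names, the single attention head, the token counts, and the widths (64 and 32) are all assumptions chosen for illustration, and the temporal positional embeddings and cross-modality AdaLN described in the abstract are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context, w_q, w_k, w_v, w_o):
    """Single-head cross-attention: `queries` tokens read from `context` tokens."""
    q = queries @ w_q                                   # (n_q, d_attn)
    k = context @ w_k                                   # (n_c, d_attn)
    v = context @ w_v                                   # (n_c, d_attn)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n_q, n_c)
    return (weights @ v) @ w_o                          # back to the query stream's width

def make_weights(rng, d_query, d_context, d_attn, scale=0.02):
    # Hypothetical random projections; a real model would learn these.
    return (rng.normal(0, scale, (d_query, d_attn)),
            rng.normal(0, scale, (d_context, d_attn)),
            rng.normal(0, scale, (d_context, d_attn)),
            rng.normal(0, scale, (d_attn, d_query)))

def bidirectional_exchange(video, audio, v_from_a, a_from_v):
    # Both directions read the pre-update tokens, then add a residual.
    video_out = video + cross_attend(video, audio, *v_from_a)
    audio_out = audio + cross_attend(audio, video, *a_from_v)
    return video_out, audio_out

rng = np.random.default_rng(0)
d_video, d_audio, d_attn = 64, 32, 16    # toy widths; the video stream is deliberately wider
video = rng.normal(size=(12, d_video))   # 12 video tokens
audio = rng.normal(size=(20, d_audio))   # 20 audio tokens
v_from_a = make_weights(rng, d_video, d_audio, d_attn)
a_from_v = make_weights(rng, d_audio, d_video, d_attn)

video_out, audio_out = bidirectional_exchange(video, audio, v_from_a, a_from_v)
print(video_out.shape, audio_out.shape)  # (12, 64) (20, 32)
```

Each stream keeps its own width, so capacity can be allocated unevenly between modalities, while the shared attention dimension is the only place where the two must agree.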

Original Article

Cached at: 05/08/26, 08:56 AM

Paper page - LTX-2: Efficient Joint Audio-Visual Foundation Model

Source: https://huggingface.co/papers/2601.03233

Create a video in sequence (12 images total); first create the 18:9 format.

FINAL SCRIPT: “5 Countries That Could Disappear in Our Lifetime”

Runtime: ~1:45 – 2:00
Tone: Alarming, but factual

HOOK (0:00 – 0:10)

Image: World map with pieces beginning to disappear. Ominous music.

Text: “You look at a world map and think it’s permanent. It’s not. Some countries we know today might not exist when you’re old. Here are 5 nations fighting for survival right now.”

PLACE No. 5: Maldives

Image: Paradise islands, ocean, waves, people on the beach.

Text: “Number 5: The Maldives. The most beautiful islands in the Indian Ocean. Average height above sea level? Just 1.5 meters. Scientists say if sea levels keep rising, the Maldives could be underwater by the end of this century. The government is already buying land in other countries to move its people. A paradise that’s disappearing.”

PLACE No. 4: Taiwan

Image: Map showing Taiwan next to China, flags.

Text: “Number 4: Taiwan. This is not about climate — it’s about politics. Taiwan has been independent in practice for decades, but China claims it as its territory. Tensions are rising. If China decides to take control by force, Taiwan as an independent country could cease to exist.”

PLACE No. 3: Kiribati

Image: Pacific Ocean, tiny islands, map.

Text: “Number 3: Kiribati. A nation of 33 islands in the Pacific Ocean. Most of them are barely above water. Their president bought land in Fiji just to have somewhere to move when the ocean swallows them. They might be the first country to disappear completely. And it’s happening now.”

PLACE No. 2: Bangladesh

Image: Floods, people waist-deep in water, map of Bangladesh.

Text: “Number 2: Bangladesh. One of the most densely populated countries on Earth. 170 million people living on a giant river delta. Every year, floods get worse. By 2050, scientists predict 20% of the country could be underwater. That’s 30 million climate refugees. One of the poorest nations could simply become unlivable.”

PLACE No. 1: Tuvalu

Image: Tiny island in the middle of the ocean, waves, sun.

Text: “Number 1: Tuvalu. A tiny island nation in the Pacific. The highest point is 4.5 meters above sea level. But when high tides come, the whole country floods. The government is building seawalls, but it might not be enough. Tuvalu could be the first country to lose its land completely. And the scariest part? It could happen in the next 30 years.”

OUTRO (1:45 – 2:00)

Image: World map with a question mark. Music grows quieter.

Text: “Which of these countries would you save? Let me know in the comments. And if you want more geography and history — subscribe. The next video will be about what happens when a country disappears completely.”

Similar Articles

Lightricks/LTX-2

Audio-Visual Intelligence in Large Foundation Models

Lightricks/LTX-2.3

Lightricks/LTX-2.3-22b-IC-LoRA-LipDub

When Vision Speaks for Sound
