LTX-2: Efficient Joint Audio-Visual Foundation Model

Papers with Code Trending Papers

Summary

LTX-2 is an efficient joint audio-visual foundation model. The cached page text mixes the paper reference with an unrelated video script about countries facing existential threats; the paper is the primary subject of this entry.

Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.
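
The abstract names modality-CFG but does not give its formula. As a rough illustration only, the sketch below extends ordinary classifier-free guidance with a second guidance term, so that text conditioning and cross-modal conditioning each get their own scale. The function name, the three input predictions, and both scale parameters are hypothetical, chosen for this example; they are not taken from the paper or from the released code.

```python
from typing import List


def guided_prediction(
    full: List[float],
    no_text: List[float],
    no_cross_modal: List[float],
    text_scale: float,
    modality_scale: float,
) -> List[float]:
    """Combine three denoiser outputs using two independent guidance scales.

    Assumed inputs (hypothetical, for illustration):
      full           -- prediction with text and cross-modal conditioning
      no_text        -- prediction with the text prompt dropped
      no_cross_modal -- prediction with the other modality's stream dropped
    """
    if not (len(full) == len(no_text) == len(no_cross_modal)):
        raise ValueError("all predictions must have the same length")
    # Push the fully conditioned prediction away from each ablated prediction,
    # weighting the text and cross-modal directions separately.
    return [
        f + text_scale * (f - nt) + modality_scale * (f - nm)
        for f, nt, nm in zip(full, no_text, no_cross_modal)
    ]


if __name__ == "__main__":
    out = guided_prediction([1.0, 2.0], [0.5, 1.0], [1.0, 1.5], 2.0, 1.0)
    print(out)
```

With both scales set to 0 the function returns the fully conditioned prediction unchanged. Raising one scale strengthens only that conditioning signal, which is the kind of separate control the abstract attributes to modality-CFG.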

Cached at: 05/08/26, 08:56 AM


Source: https://huggingface.co/papers/2601.03233

Create a video in sequence (12 images total), first create the 18:9 format

FINAL SCRIPT: “5 Countries That Could Disappear in Our Lifetime”

Runtime: ~1:45 – 2:00
Tone: Alarming, but factual


HOOK (0:00 – 0:10)

Image: World map with pieces beginning to disappear. Ominous music.

Text: “You look at a world map and think it’s permanent. It’s not. Some countries we know today might not exist when you’re old. Here are 5 nations fighting for survival right now.”


PLACE No. 5: Maldives

Image: Paradise islands, ocean, waves, people on the beach.

Text: “Number 5: The Maldives. The most beautiful islands in the Indian Ocean. Average height above sea level? Just 1.5 meters. Scientists say if sea levels keep rising, the Maldives could be underwater by the end of this century. The government is already buying land in other countries to move its people. A paradise that’s disappearing.”


PLACE No. 4: Taiwan

Image: Map showing Taiwan next to China, flags.

Text: “Number 4: Taiwan. This is not about climate — it’s about politics. Taiwan has been independent in practice for decades, but China claims it as its territory. Tensions are rising. If China decides to take control by force, Taiwan as an independent country could cease to exist.”


PLACE No. 3: Kiribati

Image: Pacific Ocean, tiny islands, map.

Text: “Number 3: Kiribati. A nation of 33 islands in the Pacific Ocean. Most of them are barely above water. Their president bought land in Fiji just to have somewhere to move when the ocean swallows them. They might be the first country to disappear completely. And it’s happening now.”


PLACE No. 2: Bangladesh

Image: Floods, people waist-deep in water, map of Bangladesh.

Text: “Number 2: Bangladesh. One of the most densely populated countries on Earth. 170 million people living on a giant river delta. Every year, floods get worse. By 2050, scientists predict 20% of the country could be underwater. That’s 30 million climate refugees. One of the poorest nations could simply become unlivable.”


PLACE No. 1: Tuvalu

Image: Tiny island in the middle of the ocean, waves, sun.

Text: “Number 1: Tuvalu. A tiny island nation in the Pacific. The highest point is 4.5 meters above sea level. But when high tides come, the whole country floods. The government is building seawalls, but it might not be enough. Tuvalu could be the first country to lose its land completely. And the scariest part? It could happen in the next 30 years.”


OUTRO (1:45 – 2:00)

Image: World map with a question mark. Music grows quieter.

Text: “Which of these countries would you save? Let me know in the comments. And if you want more geography and history — subscribe. The next video will be about what happens when a country disappears completely.”

Similar Articles

Lightricks/LTX-2

GitHub Trending (daily)

LTX-2 is the first DiT-based audio-video foundation model from Lightricks, offering synchronized audio and video generation, high fidelity, and production-ready outputs, with open-source code and open model weights.

Audio-Visual Intelligence in Large Foundation Models

Hugging Face Daily Papers

This survey paper provides a comprehensive review of audio-visual intelligence within large foundation models, establishing a unified taxonomy, synthesizing core methodologies, and outlining key datasets, benchmarks, and open research challenges.

Lightricks/LTX-2.3

Hugging Face Models Trending

Lightricks released LTX-2.3, an open-weight diffusion-based audio-video foundation model with improved quality and prompt adherence, available in multiple checkpoints including distilled and LoRA variants for local execution.

Lightricks/LTX-2.3-22b-IC-LoRA-LipDub

Hugging Face Models Trending

This Hugging Face model page introduces an IC-LoRA trained on top of LTX-2.3-22b for lip dubbing, with a project page, paper, and inference pipeline available.

When Vision Speaks for Sound

Hugging Face Daily Papers

This paper identifies that video-capable multimodal LLMs often appear to understand audio but actually rely on visual cues, a failure mode termed the audio-visual Clever Hans effect. It introduces Thud, an intervention-driven probing framework to diagnose this issue, and proposes an alignment recipe that improves audio-visual consistency by 28 percentage points.