video-to-speech

Hierarchical Codec Diffusion for Video-to-Speech Generation

Hugging Face Daily Papers · 2026-04-17

HiCoDiT (Hierarchical Codec Diffusion Transformer) is a video-to-speech model that exploits the hierarchical structure of discrete speech tokens produced by a residual vector quantization (RVQ) codec. It uses coarse-to-fine conditioning with dual-scale normalization to achieve strong audio-visual alignment.
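
For readers unfamiliar with the hierarchy the summary refers to, the following is a minimal sketch of generic residual vector quantization, not HiCoDiT's codec or training code. It illustrates why RVQ tokens are naturally ordered from coarse to fine: each level quantizes whatever error the previous levels left behind. The function names, the random codebooks, the shrinking per-level scale, and the zero entry in each codebook are all assumptions made for this toy example.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Encode x (N, D) into one index per level; returns an array of shape (L, N)."""
    residual = x.copy()
    codes = []
    for cb in codebooks:  # cb has shape (K, D)
        # Squared distance from every residual vector to every codebook entry.
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        # The next level only sees what this level failed to capture.
        residual = residual - cb[idx]
    return np.stack(codes, axis=0)


def rvq_decode(codes, codebooks, num_levels):
    """Reconstruct from the first num_levels levels (coarse first)."""
    return sum(codebooks[level][codes[level]] for level in range(num_levels))


# Toy setup (hypothetical sizes, random rather than learned codebooks).
rng = np.random.default_rng(0)
dim, num_levels, codebook_size = 8, 4, 16
codebooks = []
for level in range(num_levels):
    cb = rng.normal(scale=0.5 ** level, size=(codebook_size, dim))
    cb[0] = 0.0  # assumption: a zero entry, so a level can never increase the error
    codebooks.append(cb)

x = rng.normal(size=(32, dim))
codes = rvq_encode(x, codebooks)

# Mean squared reconstruction error when decoding with 1, 2, ... levels.
errors = [
    float(np.mean((x - rvq_decode(codes, codebooks, k)) ** 2))
    for k in range(1, num_levels + 1)
]
print(errors)
```

Decoding with only the first level gives a rough reconstruction, and each further level refines it. A coarse-to-fine generator can use the same ordering by producing the early levels first and conditioning later levels on them.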
