Tag
HiCoDiT is a novel Hierarchical Codec Diffusion Transformer for video-to-speech generation that leverages the hierarchical structure of RVQ-based codec discrete speech tokens, using coarse-to-fine conditioning with dual-scale normalization to achieve strong audio-visual alignment.