Tag
LatentOmni is a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states, outperforming explicit text-based chain-of-thought methods on audio-visual reasoning tasks.