Tag
This paper constructs a multimodal dataset of 1,000 academic papers with text, images, and audio to study keyword extraction, and shows that fusing multiple modalities improves extraction performance.