Multimodal Claim Extraction for Fact-Checking

arXiv cs.CL Papers

Summary

Researchers present the first benchmark for multimodal claim extraction from social media, evaluate state-of-the-art multimodal LLMs on it, and introduce MICE, an intent-aware framework that better captures rhetorical intent and contextual cues in combined text-image posts.

arXiv:2604.16311v1

Abstract: Automated Fact-Checking (AFC) relies on claim extraction as a first step, yet existing methods largely overlook the multimodal nature of today's misinformation. Social media posts often combine short, informal text with images such as memes, screenshots, and photos, creating challenges that differ from both text-only claim extraction and well-studied multimodal tasks like image captioning or visual question answering. In this work, we present the first benchmark for multimodal claim extraction from social media, consisting of posts containing text and one or more images, annotated with gold-standard claims derived from real-world fact-checkers. We evaluate state-of-the-art multimodal LLMs (MLLMs) under a three-part evaluation framework (semantic alignment, faithfulness, and decontextualization) and find that baseline MLLMs struggle to model rhetorical intent and contextual cues. To address this, we introduce MICE, an intent-aware framework that shows improvements in intent-critical cases.
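The abstract names the three evaluation axes but not how each is computed. As a rough illustration only, the sketch below scores one axis, semantic alignment, as cosine similarity between sentence embeddings of an extracted claim and a fact-checker's gold claim; the encoder choice, function name, and example strings are assumptions for the sketch, not the paper's actual metric.

```python
# Minimal sketch of a "semantic alignment" score between an extracted claim
# and a gold claim, using sentence-embedding cosine similarity.
# Assumptions (not from the paper): the encoder model, the function name,
# and the example claims below are all illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

def semantic_alignment(extracted_claim: str, gold_claim: str) -> float:
    """Cosine similarity between the embeddings of the two claims."""
    emb = model.encode([extracted_claim, gold_claim], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# Example: a claim extracted from a text+image post vs. a fact-checker's gold claim.
extracted = "The photo shows flooding in Venice in 2023."
gold = "A widely shared photo claimed to show 2023 flooding in Venice."
print(f"semantic alignment: {semantic_alignment(extracted, gold):.3f}")
```

The paper's other two axes, faithfulness and decontextualization, would need different signals (e.g., checking the claim against the post's content and whether it stands alone without the image), which a single embedding similarity cannot capture.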

Similar Articles

CFMS: Towards Explainable and Fine-Grained Chinese Multimodal Sarcasm Detection Benchmark

arXiv cs.CL

Researchers from Peking University introduce CFMS, the first fine-grained Chinese multimodal sarcasm detection benchmark with 2,796 image-text pairs and a triple-level annotation framework (sarcasm identification, target recognition, explanation generation), along with a novel RL-augmented in-context learning method (PGDS) that significantly outperforms existing baselines.