ReMMD: Realistic Multilingual Multi-Image Agentic Verification for Multimodal Misinformation Detection
Summary
ReMMD introduces a realistic multilingual multi-image agentic verification framework for multimodal misinformation detection, including a benchmark (ReMMDBench) with 500 samples and 2,756 images, and an agent (ReMMD-Agent) that achieves superior veracity performance with reduced costs.
Source: https://huggingface.co/papers/2606.24112
Abstract
A comprehensive multimodal misinformation detection framework is introduced that handles complex, multilingual content with multiple images and diverse verification approaches, achieving superior performance while reducing computational costs.
Multimodal misinformation detection is increasingly important because viral posts now combine long multilingual narratives, several images, mixed provenance, and subtle text-image framing errors. Existing benchmarks and methods remain poorly matched to this setting: they usually isolate short captions, single images, binary labels, or one manipulation source, while agentic verification remains costly under realistic evidence search. We present ReMMD, a realistic multilingual multi-image agentic verification framework for multimodal misinformation detection. ReMMD includes ReMMDBench, a real-world multimodal misinformation detection benchmark with 500 samples, 2,756 images, five monolingual languages, two cross-lingual settings, three text-length tiers, multi-image posts, five-way veracity labels, eight distortion labels, evidence provenance, and rationales. It also includes ReMMD-Agent, a persistent-memory verifier that decomposes posts into atomic points, builds a reusable evidence set, and predicts structured L1/L2/L3 outputs. Across proprietary systems, open LVLMs, MMD-Agent, and T2-Agent, ReMMD-Agent obtains the best five-way veracity performance, with 41.80% accuracy and 39.12% macro-F1 using GPT-5.2, while reducing cost by 17.5% relative to MMD-Agent and 79.9% relative to T2-Agent. The project is available at https://dang-ai.github.io/ReMMD.
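The abstract reports five-way veracity results as accuracy and macro-F1. As a minimal sketch of how those two metrics are conventionally computed for a multi-class task, the Python below may help. It is not the authors' evaluation script: the function name, the placeholder label names `v1` to `v5`, and the convention that a class with no gold and no predicted instances scores an F1 of 0 are all assumptions made for illustration.

```python
from collections import Counter


def accuracy_and_macro_f1(gold, pred, labels):
    """Return (accuracy, macro-F1) for a single-label multi-class task.

    Hypothetical helper for illustration; not taken from the ReMMD codebase.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed

    accuracy = sum(tp.values()) / len(gold)

    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # Assumed convention: F1 is 0 when the class never appears at all.
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)

    # Macro-F1 is the unweighted mean of per-class F1 over all labels.
    return accuracy, sum(f1_scores) / len(labels)


# Placeholder label names; the page does not list the five veracity labels.
LABELS = ["v1", "v2", "v3", "v4", "v5"]
gold = ["v1", "v1", "v2", "v3", "v4", "v5"]
pred = ["v1", "v2", "v2", "v3", "v4", "v1"]
acc, macro_f1 = accuracy_and_macro_f1(gold, pred, LABELS)
```

Because macro-F1 weights every class equally, a class the system never predicts correctly pulls the score down regardless of how rare it is, which is one reason the two numbers can differ on an imbalanced five-way label set.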
Datasets citing this paper: 1
DDAI-D/ReMMDBench
Similar Articles
SynCred-Bench: Benchmarking Synthetic Credibility in AI-Generated Visual Misinformation
Introduces SynCred-Bench, a benchmark of 600 AI-generated misinformation images across six credible-form categories, showing that existing detectors (including MLLMs, open-source AIGC detectors, and commercial APIs) perform poorly, with human annotators also struggling.
MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
MemEye is a visual-centric evaluation framework that assesses multimodal agent memory by measuring visual evidence granularity and retrieval complexity across 8 life-scenario tasks, revealing that current architectures struggle to preserve fine-grained visual details and reason about state changes over time.
Reinforcing Multimodal Reasoning Against Visual Degradation
This paper introduces ROMA, an RL fine-tuning framework that enhances the robustness of multimodal large language models against visual degradations like blur and compression artifacts. It achieves this through a dual-forward-pass strategy and specialized regularization techniques, improving performance on reasoning benchmarks without sacrificing accuracy on clean inputs.
InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning
InternVideo3 introduces Multimodal Contextual Reasoning (MCR) and efficient attention mechanisms to enhance long-horizon multimodal tasks, achieving strong results on video understanding benchmarks and demonstrating video agent capabilities.
MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA
MARDoc is a memory-aware refinement agent framework for multimodal long document question answering, evaluated on MMLongBench-Doc and DocBench benchmarks using Qwen3-VL models, showing consistent improvements over MLLM-based, RAG-based, and agent-based baselines.