Evaluation Pitfalls and Challenges in Multimedia Event Extraction

arXiv cs.CL 06/26/26, 04:00 AM Papers
Summary
This paper presents a systematic analysis of evaluation pitfalls in multimedia event extraction, identifying issues such as inconsistent data processing, inconsistent task assumptions, and overly relaxed evaluation settings that can lead to overestimated performance.
arXiv:2606.26775v1 Announce Type: new Abstract: Multimedia event extraction aims to jointly identify events and their arguments across multiple modalities, such as text and images, to support more comprehensive event understanding. While recent work reports steady and substantial progress, the reliability and comparability of these results critically depend on consistent and rigorous evaluation. In this work, we present the first systematic analysis of evaluation pitfalls in multimedia event extraction and identify three major sources of issues: inconsistent data processing, inconsistent task assumptions, and overly relaxed evaluation settings. We demonstrate, through a series of controlled experiments under a strict evaluation framework, that minor evaluation choices can cause large performance variations and lead to overestimation of a model's ability to ground real-world events across modalities. Our findings highlight the need for comparable evaluation standards and encourage a shift toward more rigorous evaluation in multimedia event extraction.
# Evaluation Pitfalls and Challenges in Multimedia Event Extraction
Source: [https://arxiv.org/html/2606.26775](https://arxiv.org/html/2606.26775)
Philipp Seeberger, Steffen Freisinger, Tobias Bocklet, Korbinian Riedhammer. Technische Hochschule Nürnberg Georg Simon Ohm. \{philipp\.seeberger,steffen\.freisinger,tobias\.bocklet,korbinian\.riedhammer\}@th\-nuernberg\.de

###### Abstract

Multimedia event extraction aims to jointly identify events and their arguments across multiple modalities, such as text and images, to support more comprehensive event understanding\. While recent work reports steady and substantial progress, the reliability and comparability of these results critically depend on consistent and rigorous evaluation\. In this work, we present the first systematic analysis of evaluation pitfalls in multimedia event extraction and identify three major sources of issues: inconsistent data processing, inconsistent task assumptions, and overly relaxed evaluation settings\. We demonstrate, through a series of controlled experiments under a strict evaluation framework, that minor evaluation choices can cause large performance variations and lead to overestimation of a model’s ability to ground real\-world events across modalities\. Our findings highlight the need for comparable evaluation standards and encourage a shift toward more rigorous evaluation in multimedia event extraction\. \(Footnote 1: [https://github\.com/seebergerph/StrictEval](https://github.com/seebergerph/StrictEval)\)

## 1 Introduction

Event extraction \(EE\) is a fundamental task in natural language processing and information extraction, aiming to identify, structure, and organize event\-related knowledge from documentsAhn \([2006](https://arxiv.org/html/2606.26775#bib.bib19)\)\. While the majority of existing EE research has focused on textsPenget al\.\([2023a](https://arxiv.org/html/2606.26775#bib.bib32)\); Huanget al\.\([2024](https://arxiv.org/html/2606.26775#bib.bib20)\), recent work has increasingly explored the integration of additional modalitiesSunet al\.\([2024](https://arxiv.org/html/2606.26775#bib.bib5)\); Zhanget al\.\([2024](https://arxiv.org/html/2606.26775#bib.bib18)\)\. This shift is motivated by the growing prevalence of multimodal content in contemporary news media and online platformsLiet al\.\([2020](https://arxiv.org/html/2606.26775#bib.bib1)\), where images, videos, and audio provide complementary information that can support more accurate and comprehensive event understanding\.

Prior research has investigated EE or closely related tasks within individual modalitiesYatskaret al\.\([2016](https://arxiv.org/html/2606.26775#bib.bib21)\); Waddenet al\.\([2019](https://arxiv.org/html/2606.26775#bib.bib25)\); Sadhuet al\.\([2021](https://arxiv.org/html/2606.26775#bib.bib26)\); Wanget al\.\([2024](https://arxiv.org/html/2606.26775#bib.bib27)\), including text, images, video, and audio, or has leveraged cross\-modal cues to address specific challenges such as ambiguityZhanget al\.\([2017](https://arxiv.org/html/2606.26775#bib.bib24)\); Tonget al\.\([2020](https://arxiv.org/html/2606.26775#bib.bib23)\)\. However, evaluation mostly remains restricted to a single target modality\. Multimedia event extraction \(MEE\) has recently attracted attention and adopts a holistic view by jointly extracting and evaluating events across multiple modalities, typically combining textual and visual inputsLiet al\.\([2020](https://arxiv.org/html/2606.26775#bib.bib1)\); Chenet al\.\([2021](https://arxiv.org/html/2606.26775#bib.bib28)\); Sanderset al\.\([2024](https://arxiv.org/html/2606.26775#bib.bib29)\)\. Despite this progress, existing MEE benchmarks remain limited and evaluation challenging, most notably due to annotation scarcity, lack of train splits, and increased evaluation complexity inherent to multimodal settingsLiet al\.\([2020](https://arxiv.org/html/2606.26775#bib.bib1)\); Sanderset al\.\([2024](https://arxiv.org/html/2606.26775#bib.bib29)\)\.

Prior studies have demonstrated that even traditional textual EE suffers from substantial and often overlooked evaluation challengesZhenget al\.\([2021](https://arxiv.org/html/2606.26775#bib.bib31)\); Penget al\.\([2023b](https://arxiv.org/html/2606.26775#bib.bib30)\); Huanget al\.\([2024](https://arxiv.org/html/2606.26775#bib.bib20)\), making it prone to hidden pitfalls\. These include discrepancies in data and task assumptions as well as metric design choices that can distort model comparisons and fail to reflect real\-world performanceHuanget al\.\([2024](https://arxiv.org/html/2606.26775#bib.bib20)\)\. Crucially, extending textual EE to the multimodal setting not only inherits existing evaluation issues but also introduces additional pitfalls\. These arise from factors such as data scarcity, heterogeneous modalities, and multi\-stage pipelines commonly employed in MEELiet al\.\([2020](https://arxiv.org/html/2606.26775#bib.bib1)\); Liuet al\.\([2022](https://arxiv.org/html/2606.26775#bib.bib2)\); Duet al\.\([2023](https://arxiv.org/html/2606.26775#bib.bib4)\); Caoet al\.\([2025](https://arxiv.org/html/2606.26775#bib.bib8)\)\. As a consequence, inconsistent and under\-specified evaluation settings can easily emerge, posing a potential obstacle to reliably assessing progress in MEE research\.

Motivated by concerns about the reliability and comparability of current evaluations, this work systematically investigates hidden pitfalls and challenges in MEE evaluation, with the goal of raising awareness and encouraging a shift toward more rigorous evaluation practices\. Through an in\-depth analysis of the widely used M2E2 benchmark, we first identify three major categories of issues: inconsistent data processing, inconsistent task assumptions, and relaxed evaluation settings\. Building on this analysis, we introduce a more rigorous evaluation framework, StrictEval, and use it to examine how hidden pitfalls influence reported performance\. Finally, we show that minor experimental design choices can substantially affect evaluation outcomes\.

In summary, our contributions are twofold: \(1\) We conduct a systematic analysis of evaluation pitfalls and challenges in MEE and propose a more rigorous evaluation framework \(StrictEval\)\. \(2\) We systematically quantify how hidden evaluation pitfalls affect reported performance and reevaluate recent MEE approaches to highlight limitations\.

## 2 Background and Related Work

### 2\.1 Background

Textual EE is commonly formulated as a two\-stage pipeline Ahn \([2006](https://arxiv.org/html/2606.26775#bib.bib19)\) consisting of event detection \(ED\) and event argument extraction \(EAE\)\. Event detection aims to identify event mentions, typically grounded to trigger spans, and classify them into predefined event types\. Event argument extraction focuses on identifying argument spans and assigning them semantic roles conditioned on the detected event mentions\. Analogously, visual EE decomposes the task into detecting events grounded in images and linking their associated semantic roles to visual regions, such as objects Pratt et al\. \([2020](https://arxiv.org/html/2606.26775#bib.bib22)\)\. Building on these two research directions, MEE integrates textual and visual information to jointly extract events and their arguments across modalities Li et al\. \([2020](https://arxiv.org/html/2606.26775#bib.bib1)\); Chen et al\. \([2021](https://arxiv.org/html/2606.26775#bib.bib28)\)\. This multimodal integration introduces the additional subtask of cross\-modal event coreference resolution, which aims to unify event mentions from different modalities that refer to the same real\-world event into a coherent multimedia event representation \(see [Figure 1](https://arxiv.org/html/2606.26775#S2.F1)\)\.

### 2\.2 Related Work

#### Multimedia Event Extraction Benchmarks

While most EE benchmarks focus exclusively on text Walker, Christopher et al\. \([2006](https://arxiv.org/html/2606.26775#bib.bib34)\); Song et al\. \([2015](https://arxiv.org/html/2606.26775#bib.bib35)\); Wang et al\. \([2020](https://arxiv.org/html/2606.26775#bib.bib36)\); Huang et al\. \([2024](https://arxiv.org/html/2606.26775#bib.bib20)\), early multimodal extensions augment textual datasets with images, but evaluation remains limited to textual events Zhang et al\. \([2017](https://arxiv.org/html/2606.26775#bib.bib24)\); Tong et al\. \([2020](https://arxiv.org/html/2606.26775#bib.bib23)\)\. To overcome unimodal limitations, Li et al\. \([2020](https://arxiv.org/html/2606.26775#bib.bib1)\) introduce the first MEE benchmark, M2E2, which evaluates event and argument extraction for both texts and images\. In addition, M2E2 includes cross\-modal event coreferences, analogous to cross\-document coreferences in text Nath et al\. \([2024](https://arxiv.org/html/2606.26775#bib.bib37)\)\. Subsequent work extends to images and videos, such as VM2E2 Chen et al\. \([2021](https://arxiv.org/html/2606.26775#bib.bib28)\), CMMEvent Liu et al\. \([2025b](https://arxiv.org/html/2606.26775#bib.bib14)\), TVEE Wang et al\. \([2023](https://arxiv.org/html/2606.26775#bib.bib33)\), and MultiVENT\-G Sanders et al\. \([2024](https://arxiv.org/html/2606.26775#bib.bib29)\)\. More recently, Zhang et al\. \([2024](https://arxiv.org/html/2606.26775#bib.bib18)\) propose a comprehensive benchmark covering textual, visual, and audio inputs by integrating datasets such as M2E2 and ACE with recorded speech\. However, only M2E2 and MultiVENT\-G publicly release complete data, while other benchmarks are still closed\-source Wang et al\. \([2023](https://arxiv.org/html/2606.26775#bib.bib33)\); Liu et al\. \([2025b](https://arxiv.org/html/2606.26775#bib.bib14)\) or lack critical annotations Chen et al\. \([2021](https://arxiv.org/html/2606.26775#bib.bib28)\); Zhang et al\. \([2024](https://arxiv.org/html/2606.26775#bib.bib18)\)\.
Moreover, complex annotation formats, the absence of train splits, and missing evaluation scripts further hinder reliable benchmarking\.

![Refer to caption](https://arxiv.org/html/2606.26775v1/x1.png)

Figure 1: Overview of the M2E2 multimedia event extraction pipeline\. The example illustrates a Transport event grounded in text and image\. P markers indicate stages at which pitfalls occur\. TED, TEAE, VED, VEAE, MED, and MEAE denote the textual, visual, and multimedia event detection and argument extraction subtasks, respectively\.

#### Multimedia Event Extraction

Early approaches focus on cross\-modal correlations and align visual and textual representations using large\-scale unlabeled news corpora \(e\.g\., VOA\)Chenet al\.\([2021](https://arxiv.org/html/2606.26775#bib.bib28)\); Liuet al\.\([2022](https://arxiv.org/html/2606.26775#bib.bib2),[2024](https://arxiv.org/html/2606.26775#bib.bib6)\), often in combination with contrastive learning objectivesLiet al\.\([2020](https://arxiv.org/html/2606.26775#bib.bib1),[2022](https://arxiv.org/html/2606.26775#bib.bib3),[2023](https://arxiv.org/html/2606.26775#bib.bib39)\)\. Subsequent studies explore complementary directions, including augmenting training data with synthetically generated image–text pairsDuet al\.\([2023](https://arxiv.org/html/2606.26775#bib.bib4)\), designing sophisticated multi\-grained fusion mechanismsWanget al\.\([2025](https://arxiv.org/html/2606.26775#bib.bib9)\); Liuet al\.\([2025a](https://arxiv.org/html/2606.26775#bib.bib15)\), or leveraging multi\-task learning with pseudo labeling strategiesCaoet al\.\([2025](https://arxiv.org/html/2606.26775#bib.bib8)\)\. Other works narrow their focus to specific subtasks, such as EDSunet al\.\([2023](https://arxiv.org/html/2606.26775#bib.bib38)\)or EAESeebergeret al\.\([2024](https://arxiv.org/html/2606.26775#bib.bib7)\)\. With recent advances in multimodal large language models \(MLLMs\), several instruction\-following approaches have been proposed to enable more universal information extractionSunet al\.\([2024](https://arxiv.org/html/2606.26775#bib.bib5)\); Zhanget al\.\([2024](https://arxiv.org/html/2606.26775#bib.bib18)\); Yuanet al\.\([2025](https://arxiv.org/html/2606.26775#bib.bib12)\); Chenet al\.\([2025](https://arxiv.org/html/2606.26775#bib.bib11)\); Yuet al\.\([2025](https://arxiv.org/html/2606.26775#bib.bib10)\)\. 
However, most of these methods primarily work with given image–text pairs and do not explicitly address broader MEE settings such as cross\-modal event coreference resolutionSunet al\.\([2024](https://arxiv.org/html/2606.26775#bib.bib5)\); Yuanet al\.\([2025](https://arxiv.org/html/2606.26775#bib.bib12)\); Chenet al\.\([2025](https://arxiv.org/html/2606.26775#bib.bib11)\)\. Notably, the majority of existing approaches are evaluated on the M2E2 benchmark, underscoring its role in advancing MEE research\. Despite substantial progress, prior methods adopt diverse task formulations and evaluation protocols, which hinders fair comparison across different modeling approaches\.

#### Evaluation Pitfalls

Recent studies have highlighted numerous issues in the evaluation of textual EE models, including inconsistent data assumptions, processing steps, output space discrepancies, and relaxed evaluation metrics Zheng et al\. \([2021](https://arxiv.org/html/2606.26775#bib.bib31)\); Peng et al\. \([2023b](https://arxiv.org/html/2606.26775#bib.bib30),[a](https://arxiv.org/html/2606.26775#bib.bib32)\); Huang et al\. \([2024](https://arxiv.org/html/2606.26775#bib.bib20)\)\. While these works clearly highlight significant differences in textual EE benchmarking, issues in the evaluation of visual and multimedia EE remain relatively underexplored\.

## 3 Pitfalls and Challenges in Evaluation

Motivated by evaluation concerns for MEE, we first present our investigation setup \(§[3\.1](https://arxiv.org/html/2606.26775#S3.SS1)\) and systematic analysis to identify evaluation issues \(§[3\.2](https://arxiv.org/html/2606.26775#S3.SS2)\)\. We then provide a detailed analysis of common pitfalls across three major categories: data processing \(§[3\.3](https://arxiv.org/html/2606.26775#S3.SS3)\), task assumptions \(§[3\.4](https://arxiv.org/html/2606.26775#S3.SS4)\), and relaxed evaluation settings \(§[3\.5](https://arxiv.org/html/2606.26775#S3.SS5)\)\. Lastly, building on the insights from this analysis, we introduce StrictEval as a rigorous evaluation framework to address hidden pitfalls \(§[3\.7](https://arxiv.org/html/2606.26775#S3.SS7)\)\.

### 3\.1 Preliminaries

To examine the sources and impact of evaluation issues in MEE, we adopt the M2E2 benchmarkLiet al\.\([2020](https://arxiv.org/html/2606.26775#bib.bib1)\)\. Our choice is motivated by two main considerations: \(1\) M2E2 is publicly available and, to the best of our knowledge, the most widely used benchmark in MEE research\. \(2\) As discussed in §[2\.2](https://arxiv.org/html/2606.26775#S2.SS2), alternative benchmarks are often incompleteChenet al\.\([2021](https://arxiv.org/html/2606.26775#bib.bib28)\); Zhanget al\.\([2024](https://arxiv.org/html/2606.26775#bib.bib18)\)or not accessibleWanget al\.\([2023](https://arxiv.org/html/2606.26775#bib.bib33)\); Liuet al\.\([2025b](https://arxiv.org/html/2606.26775#bib.bib14)\)\.

#### M2E2 Dataset

In[Figure 1](https://arxiv.org/html/2606.26775#S2.F1), we show the complete task and involved components\. The M2E2 benchmark comprises 6,167 sentences and 1,014 images from 245 multimedia news documents collected from 108k Voice of America \(VOA\) documents\. Overall, the events cover 8 event types and 15 argument roles, with 1297 textual and 391 visual events\. Thereby, there exist 309 multimedia events which are coreferenced by 192 textual and 203 visual events\. As no training data exists, the benchmark adopts ACEWalker, Christopheret al\.\([2006](https://arxiv.org/html/2606.26775#bib.bib34)\)for textual and imSituYatskaret al\.\([2016](https://arxiv.org/html/2606.26775#bib.bib21)\), with object groundings from SWiGPrattet al\.\([2020](https://arxiv.org/html/2606.26775#bib.bib22)\), for visual EE training\. Annotation mappings to the M2E2 schema are provided by the original workLiet al\.\([2020](https://arxiv.org/html/2606.26775#bib.bib1)\)\.

#### M2E2 Evaluation

Following Li et al\. \([2020](https://arxiv.org/html/2606.26775#bib.bib1)\), evaluation is conducted separately for textual, visual, and multimedia EE with precision \(P\), recall \(R\), and F1 for the subtasks ED and EAE\. In textual EE, an event mention is correct if its type and trigger offsets match the reference, while arguments must additionally match argument offsets and role types\. For visual EE, a visual event mention is correct if its type matches the reference image, and a visual argument if its event type, role label, and bounding box match a reference argument with IoU \> 0\.5\. Lastly, a multimedia event mention is correct if its event type and trigger offsets \(or image\) match the reference trigger \(or image\)\. The inherited textual and visual arguments are evaluated using the same criteria as in the textual and visual modality\. However, in our preliminary analysis we observe inconsistencies in the evaluation criteria for multimedia events, which we discuss in detail in §[3\.4](https://arxiv.org/html/2606.26775#S3.SS4)\.
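
The visual argument criterion above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's evaluation script: the box layout `(x_min, y_min, x_max, y_max)`, the dictionary keys, and the function names are assumptions made for the example.

```python
from typing import Dict, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def visual_argument_matches(pred: Dict, gold: Dict, threshold: float = 0.5) -> bool:
    """Event type, role, and box overlap must all agree (hypothetical keys)."""
    return (
        pred["event_type"] == gold["event_type"]
        and pred["role"] == gold["role"]
        and iou(pred["box"], gold["box"]) > threshold
    )
```

Note that this check only decides whether a single prediction is compatible with a single reference argument; how compatible pairs are counted is a separate choice, which is where the relaxed settings in §3\.5 arise.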

### 3\.2 Systematic Analysis

We collect peer\-reviewed MEE studies evaluated on M2E2 published between 2020 and 2025 through keyword and citation\-based searches, resulting in 18 articles across multiple venues \(e\.g\., ACL, ACM, AAAI\)\. Of these, we analyze 15 studies and exclude threeMoghimifaret al\.\([2023](https://arxiv.org/html/2606.26775#bib.bib17)\); Zhanget al\.\([2024](https://arxiv.org/html/2606.26775#bib.bib18)\); Xinget al\.\([2025](https://arxiv.org/html/2606.26775#bib.bib16)\)due to reliance on custom train\-test splits or newly introduced evaluation metrics, which hinder direct comparison\. The complete set of methods and reported evaluation scores is summarized in[Table 5](https://arxiv.org/html/2606.26775#A1.T5)\. For each study, we review the article, supplementary materials, and, when available, its public codebase, with particular attention on the training and evaluation stages \(see[Figure 1](https://arxiv.org/html/2606.26775#S2.F1)\)\. We focus on data processing, experimental setups, and evaluation protocolsPenget al\.\([2023b](https://arxiv.org/html/2606.26775#bib.bib30)\); Huanget al\.\([2024](https://arxiv.org/html/2606.26775#bib.bib20)\)\. Through this analysis, we uncover three major categories of evaluation issues: inconsistent data processing, inconsistent task assumptions, and overly relaxed evaluation settings\. These issues largely stem from the inherent complexity of MEE, which relies on external training datasets, multi\-stage pipelines, and heterogeneous modalities\. Nevertheless, such inconsistencies can lead to unfair comparisons and performance estimates that do not reflect real\-world scenarios\.

### 3\.3 Inconsistent Data Processing

Due to the absence of standardized preprocessing and the reliance on external training datasets, we observe substantial variation in data assumptions across studies\. These differences include training set construction, preprocessing, and postprocessing procedures\.

#### \[P1\] Train Size Discrepancies

As described in §[3\.1](https://arxiv.org/html/2606.26775#S3.SS1), M2E2 relies on external datasets \(e\.g\., ACE and SWiG\) for training, which provide predefined train, development, and test splits\. While the original M2E2 benchmark uses only train splits, subsequent work often incorporates other sets as additional training data\. Consequently, models are optimized on differing numbers of samples \(e\.g\., 75k vs\. 100k images for SWiG\)\.

#### \[P2\] Oracle Trigger Refinement

Due to distributional annotation differences between ACE and M2E2, most reported evaluation scores are obtained after a postprocessing step that adjusts textual ED predictions using M2E2 ground truth annotations\. For example, this script removes predicted event mentions with "deadly" as the trigger span\. However, many studies do not clearly specify this experimental setting or report results exclusively with postprocessing applied\. We argue that ground truth\-based postprocessing does not reflect real\-world conditions\.

#### \[P3\] Verb Mapping Refinement

Because the label ontologies of SWiG and M2E2 differ, the original authors provide a verb and role mapping to align SWiG annotations with the M2E2 schema\. We notice that some studies adopt refined versions of this mapping\. For example, one refined mapping aligns 73 verbs rather than the original 67 verbs to the M2E2 event types\. Such discrepancies in label alignment can lead to performance differences driven more by data engineering than by modeling improvements\.

### 3\.4 Inconsistent Task Assumptions

We identify inconsistencies in task assumptions\. For instance, some studies evaluate events only present in texts and imagesLiet al\.\([2020](https://arxiv.org/html/2606.26775#bib.bib1)\); Liuet al\.\([2022](https://arxiv.org/html/2606.26775#bib.bib2)\); Duet al\.\([2023](https://arxiv.org/html/2606.26775#bib.bib4)\), while others focus exclusively on multimedia eventsSunet al\.\([2024](https://arxiv.org/html/2606.26775#bib.bib5)\); Yuanet al\.\([2025](https://arxiv.org/html/2606.26775#bib.bib12)\); Chenet al\.\([2025](https://arxiv.org/html/2606.26775#bib.bib11)\)\(cf\.[Table 5](https://arxiv.org/html/2606.26775#A1.T5)\)\. Similarly, some methods filter test data to exclude samples without events, whereas others evaluate on the full set\. Consequently, reported results are often not directly comparable\.

#### \[P4\] Test Subset Selection

Recent works restrict test\-time predictions to sentences or images containing at least one event\. This filtering reduces the number of sentences from 6,167 to 1,086 and images from 1,014 to 391, whereas other work evaluates on the full set\. Moreover, methods focusing on the multimedia events task often only evaluate on 309 image\-text pairs derived from the event coreference annotations, further reducing the test set to 192 sentences and 203 images, respectively\.
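
To see why this filtering matters, consider a small worked example. The counts below are hypothetical toy numbers chosen for illustration, not results from the paper: the same detector output is scored once on the full set and once after dropping all inputs without a gold event.

```python
def prf(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical detector output (toy numbers, not from the paper).
tp = 60               # correctly detected triggers
fn = 40               # missed gold triggers
fp_event_sents = 20   # false alarms in sentences that contain a gold event
fp_empty_sents = 40   # false alarms in sentences without any gold event

full = prf(tp, fp_event_sents + fp_empty_sents, fn)
filtered = prf(tp, fp_event_sents, fn)  # no-event sentences dropped [P4]

print(f"full set:     P={full[0]:.3f} R={full[1]:.3f} F1={full[2]:.3f}")
print(f"filtered set: P={filtered[0]:.3f} R={filtered[1]:.3f} F1={filtered[2]:.3f}")
```

In this toy setting recall is unchanged, while precision (and therefore F1) rises simply because false alarms on no-event inputs are no longer counted.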

#### \[P5\] MEE Task Discrepancies

Due to the absence of standardized evaluation scripts for multimedia events, follow\-up work has adopted different task definitions\. The original M2E2 benchmark considers a multimedia event correct if either the textual or visual event matches the reference and treats cross\-modal coreference as a separate task\. More recent work, however, introduces a stricter setting that additionally requires a correct event coreference prediction as an attachment\. In contrast, some methods assume gold coreference links and evaluate only aligned image\-text pairs\. Despite these substantial differences in task formulation, results are often compared directly\.

![Refer to caption](https://arxiv.org/html/2606.26775v1/x2.png)

Figure 2: Illustration of relaxed evaluation settings\. Red edges denote predictions that are incorrectly counted as correct, while red colored text indicates ignored trigger attachments \(offsets such as 20, 21\)\. We eliminate these issues in StrictEval\.

### 3\.5 Relaxed Evaluation Settings

Similar to observations in TextEE Huang et al\. \([2024](https://arxiv.org/html/2606.26775#bib.bib20)\), we find that MEE evaluation metrics are often imprecise due to relaxed matching criteria or missing structural constraints\. In [Figure 2](https://arxiv.org/html/2606.26775#S3.F2), we illustrate common issues and discuss them below:

#### \[P6\] Relaxed Textual Evaluation

Some studies ignore trigger offsets during textual EAE evaluation\. When multiple events of the same type appear in a sentence, this relaxation allows arguments to be matched to any event of that type, potentially inflating reported performance, as discussed by Huang et al\. \([2024](https://arxiv.org/html/2606.26775#bib.bib20)\)\.
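
The effect can be illustrated with a minimal sketch that compares strict and relaxed matching on a toy sentence containing two events of the same type. The `Arg` tuple, the token offsets, and the `count_correct` helper are hypothetical and only serve this example.

```python
from typing import Iterable, NamedTuple, Tuple


class Arg(NamedTuple):
    event_type: str
    trigger: Tuple[int, int]   # token offsets of the event trigger
    span: Tuple[int, int]      # token offsets of the argument
    role: str


def count_correct(pred: Iterable[Arg], gold: Iterable[Arg], use_trigger: bool = True) -> int:
    """Count predicted arguments that match gold; optionally ignore trigger offsets."""
    def key(a: Arg):
        return a if use_trigger else (a.event_type, a.span, a.role)
    return len({key(a) for a in pred} & {key(a) for a in gold})


# Toy sentence with two Attack events (hypothetical offsets).
gold = [
    Arg("Attack", (3, 4), (0, 1), "Attacker"),
    Arg("Attack", (10, 11), (8, 9), "Attacker"),
]
# The model attaches each argument to the wrong Attack trigger.
pred = [
    Arg("Attack", (10, 11), (0, 1), "Attacker"),
    Arg("Attack", (3, 4), (8, 9), "Attacker"),
]

strict_tp = count_correct(pred, gold, use_trigger=True)
relaxed_tp = count_correct(pred, gold, use_trigger=False)
```

Here both predicted arguments are attached to the wrong event, so strict matching credits none of them, whereas the relaxed variant credits both.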

#### \[P7\] Relaxed Visual Evaluation

Most prior work employs a many\-to\-many matching that considers a visual argument correct if it has the correct role and its IoU exceeds a threshold\. However, this approach allows multiple predictions to match one gold argument \(see [Figure 4](https://arxiv.org/html/2606.26775#A1.F4)\), effectively rewarding recall\-oriented models for visual EAE\. To address this issue, we propose a one\-to\-one strategy via bipartite matching that penalizes redundant predictions \(see Appendix [A\.2](https://arxiv.org/html/2606.26775#A1.SS2)\)\.
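
The difference between the two counting schemes can be sketched as follows. This is an illustrative assumption about how a one-to-one count could be computed, using a plain maximum-cardinality bipartite matching (augmenting paths); the paper's exact procedure is described in its appendix and may differ. The input `candidates[i]` is a hypothetical list of gold argument indices that prediction `i` is compatible with (correct role and sufficient IoU).

```python
from typing import List


def many_to_many_tp(candidates: List[List[int]]) -> int:
    """Relaxed count: every prediction with at least one compatible gold is a hit."""
    return sum(1 for c in candidates if c)


def one_to_one_tp(candidates: List[List[int]], num_gold: int) -> int:
    """Strict count: size of a maximum matching, so each gold is used at most once."""
    match_of_gold = [-1] * num_gold  # prediction index currently assigned to each gold

    def try_assign(p: int, seen: List[bool]) -> bool:
        for g in candidates[p]:
            if seen[g]:
                continue
            seen[g] = True
            # Take a free gold, or re-route the prediction currently holding it.
            if match_of_gold[g] == -1 or try_assign(match_of_gold[g], seen):
                match_of_gold[g] = p
                return True
        return False

    return sum(try_assign(p, [False] * num_gold) for p in range(len(candidates)))


# Three redundant boxes all overlap gold argument 0; gold argument 1 is missed.
redundant = [[0], [0], [0]]
relaxed = many_to_many_tp(redundant)          # counts all three predictions
strict = one_to_one_tp(redundant, num_gold=2)  # counts only one of them
```

Under the relaxed count all three redundant boxes are treated as correct, while the one-to-one count credits a single prediction and leaves the other two as false positives.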

#### \[P8\] Relaxed Multimedia Evaluation

We also observe missing attachment constraints in multimedia ED evaluation\. For instance, ignoring trigger offsets allows incorrect event predictions to be counted as correct at the sentence level, rather than requiring span\-level accuracy \(see[Figure 2](https://arxiv.org/html/2606.26775#S3.F2)\)\. This relaxed assumption can substantially overestimate real\-world performance\.

### 3\.6 Data Leakage

#### \[P9\] Test Data Leakage

M2E2 forms a subset of the collected VOA samples and several works utilize this image\-caption dataset during training\. We find that some of these works include images and captions from test documents in their training data\. Although no ground\-truth annotations are used, exposure to test images or captions can still result in information leakage\.

### 3\.7 StrictEval

Our analysis demonstrates that prior MEE evaluations vary widely in data processing, task assumptions, and matching criteria\. To systematize these discrepancies, we annotate each study with the evaluation pitfalls it exhibits \(P1–P9\) in [Figure 1](https://arxiv.org/html/2606.26775#S2.F1) and cluster them into six distinct setups \(see [Table 1](https://arxiv.org/html/2606.26775#S3.T1)\)\. Based on this analysis, we propose StrictEval, a more rigorous evaluation framework that eliminates all identified pitfalls\. StrictEval enforces \(1\) consistent data usage without oracle postprocessing or test exposure, \(2\) a clearly specified task definition evaluated on the full benchmark, and \(3\) strict matching criteria that preserve structural constraints across textual, visual, and multimedia predictions\. As shown in [Table 1](https://arxiv.org/html/2606.26775#S3.T1), StrictEval represents the strictest setup with the goal of providing a reproducible evaluation framework designed to more faithfully reflect real\-world performance\.

Table 1: Evaluation settings used in recent work\. Pxx indicates that a specific setting is used, while ? denotes that the setting is unspecified\. PC evaluates with predicted coreference resolution, whereas GC uses gold coreferences\.

Table 2: Unimodal evaluation results of Single Task models under different setups \(averaged over 3 runs\)\. Starting from the StrictEval setting, each identified issue \(P\) is applied independently\. The EAE relaxed settings only affect the EAE performance\. ΔF1 denotes the absolute difference to StrictEval\.

## 4 Experiments and Analysis

The analysis in §[3](https://arxiv.org/html/2606.26775#S3) reveals notable discrepancies and pitfalls in MEE evaluation, raising concerns about the extent to which evaluation design choices influence reported performance\. Starting with StrictEval, we conduct a series of controlled experiments in which each evaluation factor is examined in isolation\.

![Refer to caption](https://arxiv.org/html/2606.26775v1/x3.png)

Figure 3: Multimedia ED and EAE scores using different coreference resolution techniques\. Threshold Matching uses CLIP\-based similarity with a threshold of 20\. CLIP$_{\text{Raw}}$ denotes the pretrained model and CLIP$_{\text{VOA}}$ is further fine\-tuned on the VOA image\-caption dataset\. Greedy Matching and Bipartite Matching follow Liu et al\. \([2022](https://arxiv.org/html/2606.26775#bib.bib2)\)\.

### 4\.1 Experimental Setup

#### MEE Model

We adopt Single Task models to avoid complex architectural choices, which ensures that performance differences stem from evaluation setups rather than model capacity\. We train independent single\-task models for each textual and visual subtask\. Textual and visual ED are implemented as token\-level and image\-level classification models\. For textual and visual EAE, we follow Liu et al\. \([2022](https://arxiv.org/html/2606.26775#bib.bib2)\); Du et al\. \([2023](https://arxiv.org/html/2606.26775#bib.bib4)\); Cao et al\. \([2025](https://arxiv.org/html/2606.26775#bib.bib8)\) and classify ground\-truth textual entities and detected visual objects \(via YOLO\) into argument roles\. Multimedia events are constructed through an event coreference resolution step Du et al\. \([2023](https://arxiv.org/html/2606.26775#bib.bib4)\), described below\. Further implementation details are provided in Appendix [A\.3](https://arxiv.org/html/2606.26775#A1.SS3) and [A\.4](https://arxiv.org/html/2606.26775#A1.SS4)\.

#### Event Coreference Resolution

Following prior work Liu et al\. \([2022](https://arxiv.org/html/2606.26775#bib.bib2)\); Du et al\. \([2023](https://arxiv.org/html/2606.26775#bib.bib4)\); Seeberger et al\. \([2024](https://arxiv.org/html/2606.26775#bib.bib7)\); Cao et al\. \([2025](https://arxiv.org/html/2606.26775#bib.bib8)\), we perform event coreference resolution by computing CLIP\-based similarity scores between image–sentence pairs Radford et al\. \([2021](https://arxiv.org/html/2606.26775#bib.bib42)\)\. We note that StrictEval itself does not introduce a new coreference model, but instead relies on coreference predictions from existing approaches, which we compare under a unified evaluation setting\. A textual and a visual event are merged when they share the same predicted event type and their image\-text similarity exceeds a threshold of 20\. We adopt this hyperparameter following Du et al\. \([2023](https://arxiv.org/html/2606.26775#bib.bib4)\); Seeberger et al\. \([2024](https://arxiv.org/html/2606.26775#bib.bib7)\); Cao et al\. \([2025](https://arxiv.org/html/2606.26775#bib.bib8)\)\. The resulting multimedia event inherits all associated arguments from both modalities\.
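
The merging rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: similarity scores are taken as precomputed and on the same scale as the threshold, and the tuple layouts, identifiers, and the `merge_events` name are hypothetical rather than taken from any released code.

```python
from typing import Dict, List, Tuple

TextEvent = Tuple[str, str, str]    # (event_id, event_type, sentence_id), assumed layout
ImageEvent = Tuple[str, str, str]   # (event_id, event_type, image_id), assumed layout


def merge_events(
    text_events: List[TextEvent],
    image_events: List[ImageEvent],
    similarity: Dict[Tuple[str, str], float],
    threshold: float = 20.0,
) -> List[Tuple[str, str, str]]:
    """Link a textual and a visual event if types agree and similarity exceeds the threshold."""
    merged = []
    for t_id, t_type, sent_id in text_events:
        for v_id, v_type, img_id in image_events:
            # Pairs without a precomputed score are never merged.
            score = similarity.get((sent_id, img_id), float("-inf"))
            if t_type == v_type and score > threshold:
                merged.append((t_id, v_id, t_type))
    return merged


text_events = [("t1", "Attack", "s1"), ("t2", "Transport", "s1")]
image_events = [("v1", "Attack", "i1")]
links = merge_events(text_events, image_events, {("s1", "i1"): 25.0})
```

Because the rule depends on predicted event types in both modalities, errors in textual or visual ED propagate directly into the resulting multimedia events.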

#### Evaluation Metrics

As outlined in §[3\.1](https://arxiv.org/html/2606.26775#S3.SS1), we report micro\-averaged P, R, and F1\. Unlike Li et al\. \([2020](https://arxiv.org/html/2606.26775#bib.bib1)\), we follow subsequent work and require multimedia events to match event mentions and their coreference links Liu et al\. \([2022](https://arxiv.org/html/2606.26775#bib.bib2)\); Du et al\. \([2023](https://arxiv.org/html/2606.26775#bib.bib4)\); Seeberger et al\. \([2024](https://arxiv.org/html/2606.26775#bib.bib7)\); Cao et al\. \([2025](https://arxiv.org/html/2606.26775#bib.bib8)\); Wang et al\. \([2025](https://arxiv.org/html/2606.26775#bib.bib9)\)\. Unless otherwise specified, all experiments use our introduced evaluation framework StrictEval\.

### 4\.2Unimodal Evaluation and Analysis

In this section, we focus on unimodal EE results\. This allows us to isolate modality\-specific behavior before examining the extraction of multimedia events\. An overview of the unimodal results is presented in [Table 2](https://arxiv.org/html/2606.26775#S3.T2) and an extended analysis in Appendix [A\.6](https://arxiv.org/html/2606.26775#A1.SS6)\.

#### Impact of Data Processing

Data processing choices lead to substantial performance variations\. Applying the oracle trigger refinement step \[P2\] and verb refinements \[P3\] yields improvements of up to \+13\.9 F1 for textual and \+7\.0 F1 for visual ED\. In contrast, incorporating the development sets of ACE and SWiG during training \[P1\] results in only marginal gains \(up to \+1\.3 F1\)\. These processing decisions also yield improvements of up to \+6\.8 and \+3\.1 F1 for textual and visual EAE, respectively\. Importantly, these discrepancies extend to multimedia ED and EAE since both depend on textual and visual predictions\. These findings underscore the importance of consistent data processing, as new state\-of\-the\-art results may not solely reflect advances in modeling techniques\.

#### Impact of Task Assumptions

Inconsistent task assumptions, particularly test subset selection \[P4\], significantly boost unimodal EE evaluation scores\. Restricting training and evaluation to texts and images that contain at least one event instance increases F1 scores by up to \+27\.8 and \+30\.0 for textual and visual ED, respectively, and improves EAE scores by up to \+14\.5\. These gains are primarily driven by higher precision, as no\-event instances are excluded from evaluation\. Notably, this issue propagates to downstream multimedia EE and is further amplified when evaluating on smaller subsets, such as the 309 gold event coreference pairs \[P4\]\. However, these results do not reflect real\-world performance, where sentences and images without any targeted events are prevalent\.
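The precision effect can be illustrated with a small worked example. The counts below are invented for illustration and are not results from the paper, and the sketch covers only the evaluation side of subset selection, not the matching change in training data. Dropping no-event instances removes the false positives predicted on them, while true positives and false negatives stay the same.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical counts for one ED model.
tp = 60                 # correct predictions on instances that contain events
fn = 40                 # gold events the model missed
fp_event_instances = 20     # spurious predictions on instances that contain events
fp_no_event_instances = 40  # spurious predictions on instances without any event

# Full test set: all false positives count.
full = prf(tp, fp_event_instances + fp_no_event_instances, fn)

# Filtered subset: false positives on no-event instances disappear.
subset = prf(tp, fp_event_instances, fn)
```

With these counts, recall is 0.6 in both settings, while precision rises from 0.5 to 0.75 on the filtered subset, which lifts F1 without any change to the model.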

#### Impact of Relaxed Evaluation

We further examine the effect of relaxed evaluation settings for textual and visual EAE, as described in §[3\.5](https://arxiv.org/html/2606.26775#S3.SS5)\. For textual EAE, ignoring the trigger span attachments \[P6\] introduces discrepancies of up to 6\.4 F1\. For visual EAE, many\-to\-many matching \[P7\] yields only marginal improvements \(\+0\.3 F1\)\. However, this effect depends heavily on the object detector and confidence threshold\. As shown in [Table 6](https://arxiv.org/html/2606.26775#A1.T6), object detectors trained on OpenImages often predict overlapping object categories \(e\.g\., person and suit\), which are counted as multiple correct predictions under relaxed evaluation \(e\.g\., \+8\.0 F1 for yolo\-x OI and $\tau=0.5$\)\. This highlights that relaxed evaluation metrics can inflate the apparent quality of EAE\.
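To make the many-to-many effect concrete, the sketch below counts correct visual arguments under a relaxed and a one-to-one regime. It is a simplified illustration under stated assumptions: the IoU criterion with `min_iou=0.5`, the greedy one-to-one assignment, and the box coordinates are hypothetical choices and may differ from the matching procedure actually used in StrictEval.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0


def count_correct(gold, pred, one_to_one, min_iou=0.5):
    """Count predicted (role, box) pairs that match a gold argument.

    With one_to_one=False, several predictions may match the same gold
    argument (relaxed). With one_to_one=True, each gold argument can be
    matched at most once (greedy, in prediction order).
    """
    used = set()
    correct = 0
    for role, box in pred:
        for g_idx, (g_role, g_box) in enumerate(gold):
            if role != g_role or iou(box, g_box) < min_iou:
                continue
            if one_to_one and g_idx in used:
                continue
            used.add(g_idx)
            correct += 1
            break
    return correct


# One gold argument; the detector returns two overlapping boxes for it
# (think of a "person" box and a "suit" box on the same region).
gold = [("Agent", (10, 10, 110, 210))]
pred = [("Agent", (12, 8, 108, 212)), ("Agent", (20, 40, 105, 200))]

relaxed = count_correct(gold, pred, one_to_one=False)
strict = count_correct(gold, pred, one_to_one=True)
```

Under the relaxed count both boxes are credited, giving a precision of 2/2, whereas the one-to-one count credits only one of them, giving 1/2.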

Table 3: Multimedia ED scores with and without training only on samples with at least one event instance\.

### 4\.3Multimedia Evaluation and Analysis

In this section, we empirically analyze how varying task assumptions and evaluation settings affect multimedia EE results\. Real\-world applications typically do not provide oracle image\-text pairs or prior knowledge about whether a text or image contains a multimedia event\.

#### Experimental Setup

To study the impact of task discrepancies \[P5\] and missing trigger attachments \[P8\], we evaluate three event coreference resolution strategies: threshold\-based, greedy, and bipartite matching approaches proposed in prior work Liu et al\. \([2022](https://arxiv.org/html/2606.26775#bib.bib2)\); Du et al\. \([2023](https://arxiv.org/html/2606.26775#bib.bib4)\); Cao et al\. \([2025](https://arxiv.org/html/2606.26775#bib.bib8)\)\. We further compare evaluations on the full dataset against subsets restricted to samples containing at least one multimedia event, and analyze the effect of applying the same subset selection during training\.
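The three linking strategies can be contrasted on a small similarity matrix. The sketch below is one plausible reading of each strategy rather than a reproduction of the cited implementations: rows are textual events, columns are visual events, all candidate pairs are assumed to be type-compatible already, and the bipartite variant uses brute-force enumeration as a stand-in for an assignment solver, which is only practical for tiny inputs.

```python
from itertools import permutations


def threshold_links(sim, threshold):
    """Link every pair whose similarity exceeds the threshold."""
    return {(i, j) for i, row in enumerate(sim)
            for j, s in enumerate(row) if s > threshold}


def greedy_links(sim, threshold):
    """Pick pairs in descending similarity; each event is linked at most once."""
    pairs = sorted(((s, i, j) for i, row in enumerate(sim)
                    for j, s in enumerate(row) if s > threshold), reverse=True)
    used_t, used_v, links = set(), set(), set()
    for _, i, j in pairs:
        if i not in used_t and j not in used_v:
            links.add((i, j))
            used_t.add(i)
            used_v.add(j)
    return links


def bipartite_links(sim, threshold):
    """One-to-one assignment maximizing total similarity (brute force)."""
    n_t = len(sim)
    n_v = len(sim[0]) if sim else 0
    if n_t <= n_v:
        candidates = (list(enumerate(p)) for p in permutations(range(n_v), n_t))
    else:
        candidates = ([(i, j) for j, i in enumerate(p)]
                      for p in permutations(range(n_t), n_v))
    best_links, best_total = set(), 0.0
    for assignment in candidates:
        links = {(i, j) for i, j in assignment if sim[i][j] > threshold}
        total = sum(sim[i][j] for i, j in links)
        if total > best_total:
            best_links, best_total = links, total
    return best_links


# Hypothetical similarity scores between 2 textual and 2 visual events.
sim = [[30.0, 29.0],
       [28.0, 10.0]]

by_threshold = threshold_links(sim, threshold=20.0)
by_greedy = greedy_links(sim, threshold=20.0)
by_bipartite = bipartite_links(sim, threshold=20.0)
```

On this input the threshold rule produces three links, the greedy rule commits to the single best pair and blocks the rest, and the bipartite assignment recovers two links with a higher total similarity. Differences of this kind are why the choice of strategy affects the precision-recall balance.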

Table 4: Experimental results on the M2E2 benchmark under the original and our StrictEval evaluation settings\. In line with Seeberger et al\. \([2024](https://arxiv.org/html/2606.26775#bib.bib7)\), the MMUTF model results are based on event predictions from CAMEL\.

#### Impact on Multimedia Results

[Figure 3](https://arxiv.org/html/2606.26775#S4.F3) presents the multimedia evaluation results\. Consistent with textual evaluation, ignoring trigger offsets for textual events leads to a drop of up to 3\.3 F1 for ED, indicating correct sentence\-level predictions with incorrect trigger spans\. Larger differences arise across task setups\. Restricting evaluation to samples containing multimedia events leads to gains of up to \+21\.2 F1 for threshold\-based methods, although such prior knowledge would not be available at deployment\. Applying the same subset selection during training further amplifies this effect and reveals opposing trends between full and subset evaluations \(see [Table 3](https://arxiv.org/html/2606.26775#S4.T3)\)\. Under realistic conditions, bipartite matching performs best, whereas threshold\-based methods outperform it on the filtered subset \(30\.9 vs\. 20\.4 F1\), which we attribute to increased recall\. Beyond multimedia ED, we observe similar patterns for EAE \(bottom row of [Figure 3](https://arxiv.org/html/2606.26775#S4.F3)\), which is tied to ED performance due to the pipeline setup\. Finally, experiments using gold image\-text coreference annotations show the highest scores, highlighting that subset\-based evaluation can substantially overestimate real\-world performance\.

## 5Consistent Evaluation

Our analysis reveals several limitations in current MEE evaluation practices but so far ignores the combination of hidden pitfalls and advanced modeling techniques\. Therefore, we reproduce and reevaluate recent methods Du et al\. \([2023](https://arxiv.org/html/2606.26775#bib.bib4)\); Seeberger et al\. \([2024](https://arxiv.org/html/2606.26775#bib.bib7)\); Cao et al\. \([2025](https://arxiv.org/html/2606.26775#bib.bib8)\); Chen et al\. \([2025](https://arxiv.org/html/2606.26775#bib.bib11)\) under their original evaluation settings and compare them with our proposed framework StrictEval\. \(Our selection focuses on methods with complete publicly available code and instructions, except for SSGPF, which is included as an MLLM\-based method\.\) Notably, SSGPF supports only multimedia evaluation and assumes given image\-text pairs\. Reproduction details are provided in Appendix [A\.5](https://arxiv.org/html/2606.26775#A1.SS5)\.

#### Reevaluation Results and Discrepancies

The results in [Table 4](https://arxiv.org/html/2606.26775#S4.T4) highlight discrepancies between the Original and StrictEval settings\. First, absolute evaluation scores change substantially\. Specifically, textual and visual ED scores differ significantly, primarily due to trigger postprocessing \[P2\], verb refinement \[P3\], and test subset selection \[P4\] \(cf\. [Table 2](https://arxiv.org/html/2606.26775#S3.T2)\)\. This indicates that minor implementation choices can dominate reported gains\. Second, removing these pitfalls consistently leads to lower EAE performance due to the pipeline setup\. As a result, reported advances in multimedia EAE can be overestimated, since they are often driven by increased ED performance\. Third, we observe the largest degradation in multimedia ED and EAE, with drops of up to 56\.8 and 33\.6 F1, respectively\. When neither test subset selection nor gold event coreference annotations are used, precision decreases dramatically, revealing that cross\-modal event coreference resolution remains the largest challenge in a more realistic setting\. Further intuition about these low scores and performance drops is provided in Appendix [A\.6](https://arxiv.org/html/2606.26775#A1.SS6)\.

#### Implications for Evaluation and Future Work

Overall, these findings underscore the need for standardized and transparent evaluation protocols\. We encourage the research community to report detailed evaluation choices, avoid reliance on gold annotations, and adopt unified evaluation pipelines to ensure fair comparison and reproducibility across MEE methods\. Our analysis further indicates that progress in multimedia event extraction requires a stronger focus on accurate cross\-modal event coreference and semantic alignment, which remain underexplored in recent work\. Finally, future advances would benefit from more comprehensive multimodal datasets with explicit coreference annotations, high\-quality training data, and standardized splits to support robust and comparable evaluation\.

## 6Conclusion

In this work, we present the first systematic analysis of evaluation pitfalls and challenges in MEE and reveal substantial gaps between reported performance and a model’s actual ability to ground events across textual and visual modalities\. Our analysis of the M2E2 benchmark uncovers three major sources of discrepancies: inconsistent data processing, inconsistent task assumptions, and overly relaxed evaluation settings\. To address this, we propose the evaluation framework StrictEval, which enforces strict evaluation constraints for more challenging evaluation\. Controlled experiments show that minor evaluation choices can significantly affect performance, highlighting that cross\-modal event coreference resolution and precise MEE remain open challenges\.

## Limitations

In this work, we focus on analyzing evaluation pitfalls primarily based on the M2E2 benchmark, which, despite being the most widely used public benchmark for MEE, represents only a subset of possible settings\. As a result, some identified issues and recommended practices may not fully generalize to newer benchmarks, annotation schemes, or domains beyond news media\. Extending our analysis to additional datasets, modalities \(e\.g\., videos or audio\), and task formulations remains an important direction for future work\. Moreover, our empirical analysis relies on a relatively simple MEE model and a limited set of reproduced recent methods\. While this design choice allows us to isolate the impact of evaluation decisions, it does not capture the full diversity of modeling approaches, particularly recent instruction\-following MLLMs\. Although we expect the identified evaluation pitfalls to persist across architectures, their quantitative impact may vary with different modeling paradigms\. Finally, our proposed evaluation framework adopts stricter evaluation settings and results in substantially lower absolute performance scores\. While this may hinder direct comparisons with prior work, our goal is to promote more realistic and transparent evaluation that better reflects the real\-world challenges of MEE\.

## Ethical Considerations

The results reported in this paper are intended to improve evaluation transparency in MEE and should not be interpreted as implying misconduct by prior work\. The identified evaluation pitfalls are subtle and often related to underspecified benchmark evaluation standards, making them easy to overlook\. Therefore, the motivation of this work is to raise awareness about these issues and to promote more reliable evaluation practices\.

## References

- D\. Ahn \(2006\)The stages of event extraction\.InProceedings of the Workshop on Annotating and Reasoning about Time and Events,B\. Boguraev, R\. Muñoz, and J\. Pustejovsky \(Eds\.\),Sydney, Australia,pp\. 1–8\.External Links:[Link](https://aclanthology.org/W06-0901/)Cited by:[§1](https://arxiv.org/html/2606.26775#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.26775#S2.SS1.p1.1)\.
- J\. Cao, Y\. Hu, Z\. Tan, and X\. Zhao \(2025\)Cross\-modal multi\-task learning for multimedia event extraction\.Proceedings of the AAAI Conference on Artificial Intelligence39\(11\),pp\. 11454–11462\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/33246),[Document](https://dx.doi.org/10.1609/aaai.v39i11.33246)Cited by:[§A\.3](https://arxiv.org/html/2606.26775#A1.SS3.SSS0.Px2.p1.1),[§A\.3](https://arxiv.org/html/2606.26775#A1.SS3.SSS0.Px3.p1.1),[§A\.5](https://arxiv.org/html/2606.26775#A1.SS5.SSS0.Px3.p1.1),[Table 5](https://arxiv.org/html/2606.26775#A1.T5.3.3.15.10.1),[§1](https://arxiv.org/html/2606.26775#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.26775#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.26775#S4.SS1.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.26775#S4.SS1.SSS0.Px3.p1.1),[§4\.3](https://arxiv.org/html/2606.26775#S4.SS3.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2606.26775#S5.p1.1)\.
- B\. Chen, X\. Lin, C\. Thomas, M\. Li, S\. Yoshida, L\. Chum, H\. Ji, and S\. Chang \(2021\)Joint multimedia event extraction from video and article\.InFindings of the Association for Computational Linguistics: EMNLP 2021,M\. Moens, X\. Huang, L\. Specia, and S\. W\. Yih \(Eds\.\),Punta Cana, Dominican Republic,pp\. 74–88\.External Links:[Link](https://aclanthology.org/2021.findings-emnlp.8/),[Document](https://dx.doi.org/10.18653/v1/2021.findings-emnlp.8)Cited by:[§1](https://arxiv.org/html/2606.26775#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.26775#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px1.p1.1),[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2606.26775#S3.SS1.p1.1)\.
- X\. Chen, X\. Yuan, H\. Li, H\. Yang, G\. Wang, W\. Li, and T\. Mo \(2025\)Stepwise schema\-guided prompting framework with parameter efficient instruction tuning for multimedia event extraction\.In2025 IEEE International Conference on Multimedia and Expo \(ICME\),Vol\.,pp\. 1–6\.External Links:[Document](https://dx.doi.org/10.1109/ICME59968.2025.11210082)Cited by:[§A\.5](https://arxiv.org/html/2606.26775#A1.SS5.SSS0.Px4.p1.1),[Table 5](https://arxiv.org/html/2606.26775#A1.T5.3.3.18.13.1),[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px2.p1.1),[§3\.4](https://arxiv.org/html/2606.26775#S3.SS4.p1.1),[§5](https://arxiv.org/html/2606.26775#S5.p1.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)BERT: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),J\. Burstein, C\. Doran, and T\. Solorio \(Eds\.\),Minneapolis, Minnesota,pp\. 4171–4186\.External Links:[Link](https://aclanthology.org/N19-1423/),[Document](https://dx.doi.org/10.18653/v1/N19-1423)Cited by:[§A\.3](https://arxiv.org/html/2606.26775#A1.SS3.SSS0.Px1.p1.1),[§A\.4](https://arxiv.org/html/2606.26775#A1.SS4.p1.1)\.
- Z\. Du, Y\. Li, X\. Guo, Y\. Sun, and B\. Li \(2023\)Training multimedia event extraction with generated images and captions\.InProceedings of the 31st ACM International Conference on Multimedia,MM ’23,New York, NY, USA,pp\. 5504–5513\.External Links:ISBN 9798400701085,[Link](https://doi.org/10.1145/3581783.3612526),[Document](https://dx.doi.org/10.1145/3581783.3612526)Cited by:[§A\.3](https://arxiv.org/html/2606.26775#A1.SS3.SSS0.Px3.p1.1),[§A\.5](https://arxiv.org/html/2606.26775#A1.SS5.SSS0.Px1.p1.1),[Table 5](https://arxiv.org/html/2606.26775#A1.T5.3.3.7.2.1),[§1](https://arxiv.org/html/2606.26775#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px2.p1.1),[§3\.4](https://arxiv.org/html/2606.26775#S3.SS4.p1.1),[§4\.1](https://arxiv.org/html/2606.26775#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.26775#S4.SS1.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.26775#S4.SS1.SSS0.Px3.p1.1),[§4\.3](https://arxiv.org/html/2606.26775#S4.SS3.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2606.26775#S5.p1.1)\.
- K\. Huang, I\. Hsu, T\. Parekh, Z\. Xie, Z\. Zhang, P\. Natarajan, K\. Chang, N\. Peng, and H\. Ji \(2024\)TextEE: benchmark, reevaluation, reflections, and future challenges in event extraction\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 12804–12825\.External Links:[Link](https://aclanthology.org/2024.findings-acl.760/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.760)Cited by:[§1](https://arxiv.org/html/2606.26775#S1.p1.1),[§1](https://arxiv.org/html/2606.26775#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px1.p1.1),[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px3.p1.1),[§3\.2](https://arxiv.org/html/2606.26775#S3.SS2.p1.1),[§3\.5](https://arxiv.org/html/2606.26775#S3.SS5.SSS0.Px1.p1.1),[§3\.5](https://arxiv.org/html/2606.26775#S3.SS5.p1.1)\.
- H\. W\. Kuhn \(1955\)The Hungarian method for the assignment problem\.Naval Research Logistics Quarterly2\(1\-2\),pp\. 83–97\(en\)\.External Links:ISSN 0028\-1441, 1931\-9193,[Link](https://onlinelibrary.wiley.com/doi/10.1002/nav.3800020109),[Document](https://dx.doi.org/10.1002/nav.3800020109)Cited by:[§A\.2](https://arxiv.org/html/2606.26775#A1.SS2.SSS0.Px1.p1.1)\.
- A\. Kuznetsova, H\. Rom, N\. Alldrin, J\. Uijlings, I\. Krasin, J\. Pont\-Tuset, S\. Kamali, S\. Popov, M\. Malloci, A\. Kolesnikov, T\. Duerig, and V\. Ferrari \(2020\)The Open Images Dataset V4: Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale\.International Journal of Computer Vision128\(7\),pp\. 1956–1981\(en\)\.External Links:ISSN 0920\-5691, 1573\-1405,[Link](http://link.springer.com/10.1007/s11263-020-01316-z),[Document](https://dx.doi.org/10.1007/s11263-020-01316-z)Cited by:[§A\.2](https://arxiv.org/html/2606.26775#A1.SS2.SSS0.Px2.p1.1)\.
- J\. Li, C\. Zhang, M\. Du, D\. Min, Y\. Chen, and G\. Qi \(2023\)Three stream based multi\-level event contrastive learning for text\-video event extraction\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 1666–1676\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.103/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.103)Cited by:[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px2.p1.1)\.
- M\. Li, R\. Xu, S\. Wang, L\. Zhou, X\. Lin, C\. Zhu, M\. Zeng, H\. Ji, and S\. Chang \(2022\)CLIP\-event: connecting text and images with event structures\.In2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),Vol\.,pp\. 16399–16408\.External Links:[Document](https://dx.doi.org/10.1109/CVPR52688.2022.01593)Cited by:[Table 5](https://arxiv.org/html/2606.26775#A1.T5.3.3.3.1),[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px2.p1.1)\.
- M\. Li, A\. Zareian, Q\. Zeng, S\. Whitehead, D\. Lu, H\. Ji, and S\. Chang \(2020\)Cross\-media structured common space for multimedia event extraction\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\),Online,pp\. 2557–2568\.External Links:[Link](https://aclanthology.org/2020.acl-main.230/),[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.230)Cited by:[§A\.3](https://arxiv.org/html/2606.26775#A1.SS3.SSS0.Px1.p1.1),[§A\.6](https://arxiv.org/html/2606.26775#A1.SS6.p1.1),[Table 5](https://arxiv.org/html/2606.26775#A1.T5.1.1.1.1),[Table 5](https://arxiv.org/html/2606.26775#A1.T5.2.2.2.1),[§1](https://arxiv.org/html/2606.26775#S1.p1.1),[§1](https://arxiv.org/html/2606.26775#S1.p2.1),[§1](https://arxiv.org/html/2606.26775#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.26775#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px1.p1.1),[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2606.26775#S3.SS1.SSS0.Px1.p1.1),[§3\.1](https://arxiv.org/html/2606.26775#S3.SS1.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2606.26775#S3.SS1.p1.1),[§3\.4](https://arxiv.org/html/2606.26775#S3.SS4.p1.1),[§4\.1](https://arxiv.org/html/2606.26775#S4.SS1.SSS0.Px3.p1.1)\.
- T\. Lin, M\. Maire, S\. Belongie, J\. Hays, P\. Perona, D\. Ramanan, P\. Dollár, and C\. L\. Zitnick \(2014\)Microsoft coco: common objects in context\.InComputer Vision – ECCV 2014,D\. Fleet, T\. Pajdla, B\. Schiele, and T\. Tuytelaars \(Eds\.\),Cham,pp\. 740–755\.External Links:ISBN 978\-3\-319\-10602\-1Cited by:[§A\.2](https://arxiv.org/html/2606.26775#A1.SS2.SSS0.Px2.p1.1)\.
- J\. Liu, Y\. Chen, and J\. Xu \(2022\)Multimedia event extraction from news with a unified contrastive learning framework\.InProceedings of the 30th ACM International Conference on Multimedia,MM ’22,New York, NY, USA,pp\. 1945–1953\.External Links:ISBN 9781450392037,[Link](https://doi.org/10.1145/3503161.3548132),[Document](https://dx.doi.org/10.1145/3503161.3548132)Cited by:[§A\.3](https://arxiv.org/html/2606.26775#A1.SS3.SSS0.Px3.p1.1),[Table 5](https://arxiv.org/html/2606.26775#A1.T5.3.3.6.1.1),[§1](https://arxiv.org/html/2606.26775#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px2.p1.1),[§3\.4](https://arxiv.org/html/2606.26775#S3.SS4.p1.1),[Figure 3](https://arxiv.org/html/2606.26775#S4.F3),[§4\.1](https://arxiv.org/html/2606.26775#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.26775#S4.SS1.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.26775#S4.SS1.SSS0.Px3.p1.1),[§4\.3](https://arxiv.org/html/2606.26775#S4.SS3.SSS0.Px1.p1.1)\.
- M\. Liu, Z\. Hu, B\. Zhou, H\. Hu, C\. Qiu, and X\. Zhang \(2025a\)Cross\-modal event extraction based on adaptive feature selection and semantic\-aware graph\.Knowledge\-Based Systems326,pp\. 114038\.External Links:ISSN 0950\-7051,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.knosys.2025.114038),[Link](https://www.sciencedirect.com/science/article/pii/S0950705125010834)Cited by:[Table 5](https://arxiv.org/html/2606.26775#A1.T5.3.3.13.8.1),[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px2.p1.1)\.
- M\. Liu, B\. Zhou, H\. Hu, C\. Qiu, and X\. Zhang \(2025b\)Cross\-modal event extraction via visual event grounding and semantic relation filling\.Information Processing & Management62\(3\),pp\. 104027\.External Links:ISSN 0306\-4573,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.ipm.2024.104027),[Link](https://www.sciencedirect.com/science/article/pii/S0306457324003868)Cited by:[Table 5](https://arxiv.org/html/2606.26775#A1.T5.3.3.12.7.1),[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px1.p1.1),[§3\.1](https://arxiv.org/html/2606.26775#S3.SS1.p1.1)\.
- Y\. Liu, F\. Liu, L\. Jiao, Q\. Bao, L\. Sun, S\. Li, L\. Li, and X\. Liu \(2024\)Multi\-grained gradual inference model for multimedia event extraction\.IEEE Transactions on Circuits and Systems for Video Technology34\(10\),pp\. 10507–10520\.External Links:[Document](https://dx.doi.org/10.1109/TCSVT.2024.3402242)Cited by:[Table 5](https://arxiv.org/html/2606.26775#A1.T5.3.3.9.4.1),[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px2.p1.1)\.
- F\. Moghimifar, F\. Shiri, V\. Nguyen, Y\. Li, and G\. Haffari \(2023\)Theia: weakly supervised multimodal event extraction from incomplete data\.InProceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia\-Pacific Chapter of the Association for Computational Linguistics \(Volume 2: Short Papers\),J\. C\. Park, Y\. Arase, B\. Hu, W\. Lu, D\. Wijaya, A\. Purwarianti, and A\. A\. Krisnadhi \(Eds\.\),Nusa Dua, Bali,pp\. 139–145\.External Links:[Link](https://aclanthology.org/2023.ijcnlp-short.16/),[Document](https://dx.doi.org/10.18653/v1/2023.ijcnlp-short.16)Cited by:[§3\.2](https://arxiv.org/html/2606.26775#S3.SS2.p1.1)\.
- A\. Nath, H\. Jamil, S\. R\. Ahmed, G\. A\. Baker, R\. Ghosh, J\. H\. Martin, N\. Blanchard, and N\. Krishnaswamy \(2024\)Multimodal cross\-document event coreference resolution using linear semantic transfer and mixed\-modality ensembles\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),N\. Calzolari, M\. Kan, V\. Hoste, A\. Lenci, S\. Sakti, and N\. Xue \(Eds\.\),Torino, Italia,pp\. 11901–11916\.External Links:[Link](https://aclanthology.org/2024.lrec-main.1039/)Cited by:[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px1.p1.1)\.
- H\. Peng, X\. Wang, F\. Yao, Z\. Wang, C\. Zhu, K\. Zeng, L\. Hou, and J\. Li \(2023a\)OmniEvent: a comprehensive, fair, and easy\-to\-use toolkit for event understanding\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations,Y\. Feng and E\. Lefever \(Eds\.\),Singapore,pp\. 508–517\.External Links:[Link](https://aclanthology.org/2023.emnlp-demo.46/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-demo.46)Cited by:[§1](https://arxiv.org/html/2606.26775#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px3.p1.1)\.
- H\. Peng, X\. Wang, F\. Yao, K\. Zeng, L\. Hou, J\. Li, Z\. Liu, and W\. Shen \(2023b\)The devil is in the details: on the pitfalls of event extraction evaluation\.InFindings of the Association for Computational Linguistics: ACL 2023,A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 9206–9227\.External Links:[Link](https://aclanthology.org/2023.findings-acl.586/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.586)Cited by:[§1](https://arxiv.org/html/2606.26775#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px3.p1.1),[§3\.2](https://arxiv.org/html/2606.26775#S3.SS2.p1.1)\.
- S\. Pratt, M\. Yatskar, L\. Weihs, A\. Farhadi, and A\. Kembhavi \(2020\)Grounded Situation Recognition\.InComputer Vision – ECCV 2020,A\. Vedaldi, H\. Bischof, T\. Brox, and J\. Frahm \(Eds\.\),Vol\.12349,pp\. 314–332\(en\)\.Note:Series Title: Lecture Notes in Computer ScienceExternal Links:ISBN 978\-3\-030\-58547\-1 978\-3\-030\-58548\-8,[Link](https://link.springer.com/10.1007/978-3-030-58548-8_19),[Document](https://dx.doi.org/10.1007/978-3-030-58548-8%5F19)Cited by:[§2\.1](https://arxiv.org/html/2606.26775#S2.SS1.p1.1),[§3\.1](https://arxiv.org/html/2606.26775#S3.SS1.SSS0.Px1.p1.1)\.
- A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark, G\. Krueger, and I\. Sutskever \(2021\)Learning Transferable Visual Models From Natural Language Supervision\.arXiv\.Note:Version Number: 1External Links:[Link](https://arxiv.org/abs/2103.00020),[Document](https://dx.doi.org/10.48550/ARXIV.2103.00020)Cited by:[§A\.3](https://arxiv.org/html/2606.26775#A1.SS3.SSS0.Px2.p1.1),[§A\.4](https://arxiv.org/html/2606.26775#A1.SS4.p1.1),[§4\.1](https://arxiv.org/html/2606.26775#S4.SS1.SSS0.Px2.p1.1)\.
- J\. Redmon, S\. Divvala, R\. Girshick, and A\. Farhadi \(2016\)You only look once: unified, real\-time object detection\.In2016 IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\),Vol\.,pp\. 779–788\.External Links:[Document](https://dx.doi.org/10.1109/CVPR.2016.91)Cited by:[§A\.2](https://arxiv.org/html/2606.26775#A1.SS2.SSS0.Px2.p1.1)\.
- S\. Ren, K\. He, R\. Girshick, and J\. Sun \(2015\)Faster r\-cnn: towards real\-time object detection with region proposal networks\.InAdvances in Neural Information Processing Systems,C\. Cortes, N\. Lawrence, D\. Lee, M\. Sugiyama, and R\. Garnett \(Eds\.\),Vol\.28,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2015/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf)Cited by:[§A\.2](https://arxiv.org/html/2606.26775#A1.SS2.SSS0.Px2.p1.1)\.
- A\. Sadhu, T\. Gupta, M\. Yatskar, R\. Nevatia, and A\. Kembhavi \(2021\)Visual semantic role labeling for video understanding\.InThe IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\),Cited by:[§1](https://arxiv.org/html/2606.26775#S1.p2.1)\.
- K\. Sanders, R\. Kriz, D\. Etter, H\. Recknor, A\. Martin, C\. Carpenter, J\. Lin, and B\. Van Durme \(2024\)Grounding partially\-defined events in multimodal data\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 15905–15927\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.934/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.934)Cited by:[§1](https://arxiv.org/html/2606.26775#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px1.p1.1)\.
- P\. Seeberger, D\. Wagner, and K\. Riedhammer \(2024\)MMUTF: multimodal multimedia event argument extraction with unified template filling\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 6539–6548\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.381/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.381)Cited by:[§A\.3](https://arxiv.org/html/2606.26775#A1.SS3.SSS0.Px3.p1.1),[§A\.5](https://arxiv.org/html/2606.26775#A1.SS5.SSS0.Px2.p1.1),[Table 5](https://arxiv.org/html/2606.26775#A1.T5.3.3.10.5.1),[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.26775#S4.SS1.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.26775#S4.SS1.SSS0.Px3.p1.1),[Table 4](https://arxiv.org/html/2606.26775#S4.T4),[§5](https://arxiv.org/html/2606.26775#S5.p1.1)\.
- Z\. Song, A\. Bies, S\. Strassel, T\. Riese, J\. Mott, J\. Ellis, J\. Wright, S\. Kulick, N\. Ryant, and X\. Ma \(2015\)From light to rich ERE: annotation of entities, relations, and events\.InProceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation,E\. Hovy, T\. Mitamura, and M\. Palmer \(Eds\.\),Denver, Colorado,pp\. 89–98\.External Links:[Link](https://aclanthology.org/W15-0812/),[Document](https://dx.doi.org/10.3115/v1/W15-0812)Cited by:[§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px1.p1.1)\.
- L\. Sun, K\. Zhang, Q\. Li, and R\. Lou \(2024\) UMIE: unified multimodal information extraction with instruction tuning\. Proceedings of the AAAI Conference on Artificial Intelligence 38 \(17\), pp\. 19062–19070\. External Links: [Link](https://ojs.aaai.org/index.php/AAAI/article/view/29873), [Document](https://dx.doi.org/10.1609/aaai.v38i17.29873) Cited by: [Table 5](https://arxiv.org/html/2606.26775#A1.T5.3.3.8.3.1), [§1](https://arxiv.org/html/2606.26775#S1.p1.1), [§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px2.p1.1), [§3\.4](https://arxiv.org/html/2606.26775#S3.SS4.p1.1)\.
- Y\. Sun, K\. Zhang, and Y\. Su \(2023\) Multimodal question answering for unified information extraction\. External Links: 2310\.03017, [Link](https://arxiv.org/abs/2310.03017) Cited by: [§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px2.p1.1)\.
- M\. Tong, S\. Wang, Y\. Cao, B\. Xu, J\. Li, L\. Hou, and T\. Chua \(2020\) Image enhanced event detection in news articles\. Proceedings of the AAAI Conference on Artificial Intelligence 34 \(05\), pp\. 9040–9047\. External Links: [Link](https://ojs.aaai.org/index.php/AAAI/article/view/6437), [Document](https://dx.doi.org/10.1609/aaai.v34i05.6437) Cited by: [§1](https://arxiv.org/html/2606.26775#S1.p2.1), [§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px1.p1.1)\.
- R\. Varghese and S\. M\. \(2024\) YOLOv8: a novel object detection algorithm with enhanced performance and robustness\. In 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems \(ADICS\), pp\. 1–6\. External Links: [Document](https://dx.doi.org/10.1109/ADICS58448.2024.10533619) Cited by: [§A\.2](https://arxiv.org/html/2606.26775#A1.SS2.SSS0.Px2.p1.1), [§A\.4](https://arxiv.org/html/2606.26775#A1.SS4.p1.1)\.
- D\. Wadden, U\. Wennberg, Y\. Luan, and H\. Hajishirzi \(2019\) Entity, relation, and event extraction with contextualized span representations\. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\), K\. Inui, J\. Jiang, V\. Ng, and X\. Wan \(Eds\.\), Hong Kong, China, pp\. 5784–5789\. External Links: [Link](https://aclanthology.org/D19-1585/), [Document](https://dx.doi.org/10.18653/v1/D19-1585) Cited by: [§1](https://arxiv.org/html/2606.26775#S1.p2.1)\.
- C\. Walker, S\. Strassel, J\. Medero, and K\. Maeda \(2006\) ACE 2005 Multilingual Training Corpus\. Linguistic Data Consortium\. External Links: [Link](https://catalog.ldc.upenn.edu/LDC2006T06), [Document](https://dx.doi.org/10.35111/MWXC-VH88) Cited by: [§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px1.p1.1), [§3\.1](https://arxiv.org/html/2606.26775#S3.SS1.SSS0.Px1.p1.1)\.
- B\. Wang, M\. Zhang, H\. Fei, Y\. Zhao, B\. Li, S\. Wu, W\. Ji, and M\. Zhang \(2024\) SpeechEE: a novel benchmark for speech event extraction\. In Proceedings of the 32nd ACM International Conference on Multimedia, MM ’24, New York, NY, USA, pp\. 10449–10458\. External Links: ISBN 9798400706868, [Link](https://doi.org/10.1145/3664647.3680669), [Document](https://dx.doi.org/10.1145/3664647.3680669) Cited by: [§1](https://arxiv.org/html/2606.26775#S1.p2.1)\.
- S\. Wang, M\. Ju, Y\. Zhang, Y\. Zheng, M\. Wang, and G\. Qi \(2023\) Cross\-modal contrastive learning for event extraction\. In Database Systems for Advanced Applications, X\. Wang, M\. L\. Sapino, W\. Han, A\. El Abbadi, G\. Dobbie, Z\. Feng, Y\. Shao, and H\. Yin \(Eds\.\), Cham, pp\. 699–715\. External Links: ISBN 978\-3\-031\-30675\-4 Cited by: [§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px1.p1.1), [§3\.1](https://arxiv.org/html/2606.26775#S3.SS1.p1.1)\.
- X\. Wang, T\. Sun, G\. Liu, Z\. Yang, J\. Liu, and Z\. Xu \(2025\) MGFSG\-ee: a method based on multi\-grained fusion and scene graph enhancement for event extraction\. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, CIKM ’25, New York, NY, USA, pp\. 3103–3112\. External Links: ISBN 9798400720406, [Link](https://doi.org/10.1145/3746252.3761235), [Document](https://dx.doi.org/10.1145/3746252.3761235) Cited by: [Table 5](https://arxiv.org/html/2606.26775#A1.T5.3.3.16.11.1), [§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px2.p1.1), [§4\.1](https://arxiv.org/html/2606.26775#S4.SS1.SSS0.Px3.p1.1)\.
- X\. Wang, Z\. Wang, X\. Han, W\. Jiang, R\. Han, Z\. Liu, J\. Li, P\. Li, Y\. Lin, and J\. Zhou \(2020\) MAVEN: A Massive General Domain Event Detection Dataset\. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), B\. Webber, T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\), Online, pp\. 1652–1671\. External Links: [Link](https://aclanthology.org/2020.emnlp-main.129/), [Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.129) Cited by: [§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px1.p1.1)\.
- T\. Wolf, L\. Debut, V\. Sanh, J\. Chaumond, C\. Delangue, A\. Moi, P\. Cistac, T\. Rault, R\. Louf, M\. Funtowicz, J\. Davison, S\. Shleifer, P\. von Platen, C\. Ma, Y\. Jernite, J\. Plu, C\. Xu, T\. Le Scao, S\. Gugger, M\. Drame, Q\. Lhoest, and A\. Rush \(2020\) Transformers: state\-of\-the\-art natural language processing\. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp\. 38–45\. External Links: [Link](https://aclanthology.org/2020.emnlp-demos.6), [Document](https://dx.doi.org/10.18653/v1/2020.emnlp-demos.6) Cited by: [§A\.4](https://arxiv.org/html/2606.26775#A1.SS4.p1.1)\.
- F\. Xing, Z\. Wang, W\. Wang, and H\. Zhang \(2025\) Benchmarking and improving LVLMs on event extraction from multimedia documents\. In Proceedings of the 18th International Natural Language Generation Conference, L\. Flek, S\. Narayan, L\. H\. Phương, and J\. Pei \(Eds\.\), Hanoi, Vietnam, pp\. 734–742\. External Links: [Link](https://aclanthology.org/2025.inlg-main.42/) Cited by: [§3\.2](https://arxiv.org/html/2606.26775#S3.SS2.p1.1)\.
- M\. Yatskar, L\. Zettlemoyer, and A\. Farhadi \(2016\) Situation recognition: visual semantic role labeling for image understanding\. In Conference on Computer Vision and Pattern Recognition\. Cited by: [§1](https://arxiv.org/html/2606.26775#S1.p2.1), [§3\.1](https://arxiv.org/html/2606.26775#S3.SS1.SSS0.Px1.p1.1)\.
- J\. Yu, Y\. Lin, Z\. Gao, X\. Qiu, and L\. Rui \(2025\) Multimedia event extraction with LLM knowledge editing\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 4116–4124\. External Links: [Link](https://aclanthology.org/2025.emnlp-main.205/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.205), ISBN 979\-8\-89176\-332\-6 Cited by: [Table 5](https://arxiv.org/html/2606.26775#A1.T5.3.3.17.12.1), [§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px2.p1.1)\.
- L\. Yuan, Y\. Cai, X\. Shen, Q\. Li, Q\. Huang, Z\. Deng, and T\. Wang \(2025\) Collaborative multi\-lora experts with achievement\-based multi\-tasks loss for unified multimodal information extraction\. In Proceedings of the Thirty\-Fourth International Joint Conference on Artificial Intelligence, IJCAI\-25, J\. Kwok \(Ed\.\), pp\. 6940–6948\. Note: Main Track\. External Links: [Document](https://dx.doi.org/10.24963/ijcai.2025/772), [Link](https://doi.org/10.24963/ijcai.2025/772) Cited by: [Table 5](https://arxiv.org/html/2606.26775#A1.T5.3.3.14.9.1), [§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px2.p1.1), [§3\.4](https://arxiv.org/html/2606.26775#S3.SS4.p1.1)\.
- M\. Zhang, H\. Fei, B\. Wang, S\. Wu, Y\. Cao, F\. Li, and M\. Zhang \(2024\) Recognizing everything from all modalities at once: grounded multimodal universal information extraction\. In Findings of the Association for Computational Linguistics: ACL 2024, L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 14498–14511\. External Links: [Link](https://aclanthology.org/2024.findings-acl.863/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.863) Cited by: [§1](https://arxiv.org/html/2606.26775#S1.p1.1), [§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px1.p1.1), [§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px2.p1.1), [§3\.1](https://arxiv.org/html/2606.26775#S3.SS1.p1.1), [§3\.2](https://arxiv.org/html/2606.26775#S3.SS2.p1.1)\.
- T\. Zhang, S\. Whitehead, H\. Zhang, H\. Li, J\. Ellis, L\. Huang, W\. Liu, H\. Ji, and S\. Chang \(2017\) Improving event extraction via multimodal integration\. In Proceedings of the 25th ACM International Conference on Multimedia, MM ’17, New York, NY, USA, pp\. 270–278\. External Links: ISBN 9781450349062, [Link](https://doi.org/10.1145/3123266.3123294), [Document](https://dx.doi.org/10.1145/3123266.3123294) Cited by: [§1](https://arxiv.org/html/2606.26775#S1.p2.1), [§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px1.p1.1)\.
- Y\. Zhang, Y\. Xu, M\. Tang, X\. Lin, Y\. Wang, H\. Xu, and G\. Gou \(2025\) RDA: regularized domain adaptation for multimedia event extraction\. In Advanced Intelligent Computing Technology and Applications, D\. Huang, C\. Zhang, Q\. Zhang, and Y\. Pan \(Eds\.\), Singapore, pp\. 308–319\. External Links: ISBN 978\-981\-96\-9884\-4 Cited by: [Table 5](https://arxiv.org/html/2606.26775#A1.T5.3.3.11.6.1)\.
- S\. Zheng, W\. Cao, W\. Xu, and J\. Bian \(2021\) Revisiting the evaluation of end\-to\-end event extraction\. In Findings of the Association for Computational Linguistics: ACL\-IJCNLP 2021, C\. Zong, F\. Xia, W\. Li, and R\. Navigli \(Eds\.\), Online, pp\. 4609–4617\. External Links: [Link](https://aclanthology.org/2021.findings-acl.405/), [Document](https://dx.doi.org/10.18653/v1/2021.findings-acl.405) Cited by: [§1](https://arxiv.org/html/2606.26775#S1.p3.1), [§2\.2](https://arxiv.org/html/2606.26775#S2.SS2.SSS0.Px3.p1.1)\.

## Appendix A

### A\.1 Articles for Systematic Analysis

[Table 5](https://arxiv.org/html/2606.26775#A1.T5) presents a comprehensive list of studies included in our systematic analysis, along with their reported evaluation scores\. These works cover diverse modeling paradigms, ranging from early pipeline\-based approaches to recent instruction\-following models and MLLMs\.

Table 5: Reported F1 scores of methods on M2E2\.

### A\.2 One\-to\-One Matching Strategy

As described in §[3\.5](https://arxiv.org/html/2606.26775#S3.SS5), visual arguments are initially evaluated using a many\-to\-many matching strategy\. In this setting, multiple predicted arguments may be matched to the same ground\-truth argument \(many\-to\-one\) and a single predicted argument may match multiple ground\-truth arguments \(one\-to\-many\)\. While this approach captures all overlapping predictions, it can inflate evaluation scores and does not enforce a strict mapping between predicted and gold arguments\.

#### One\-to\-One Matching

To address this limitation, we adopt a one\-to\-one matching strategy based on the Hungarian algorithm Kuhn \([1955](https://arxiv.org/html/2606.26775#bib.bib48)\)\. Predicted and ground\-truth arguments are modeled as nodes in a bipartite graph and a globally optimal matching is computed between the two sets\. This formulation guarantees that each predicted argument is matched to at most one ground\-truth argument and vice versa, yielding a stricter and more interpretable evaluation\.
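
To make the difference between the two counting schemes concrete, the following is a minimal sketch rather than the evaluation code used in this work. It assumes that a prediction is compatible with a gold argument when the roles agree and the box IoU is at least 0.5, and it replaces the Hungarian algorithm with a simple augmenting-path maximum matching, which yields the same number of matches when compatibility is binary. The function names, the role label, and the boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def compatible(pred, gold, min_iou=0.5):
    """Assumed match criterion: same role and box IoU of at least min_iou."""
    return pred["role"] == gold["role"] and iou(pred["box"], gold["box"]) >= min_iou


def many_to_many_correct(preds, golds):
    """Count every prediction that matches any gold argument (no exclusivity)."""
    return sum(any(compatible(p, g) for g in golds) for p in preds)


def one_to_one_correct(preds, golds):
    """Size of a maximum matching: each prediction and each gold is used at most once."""
    owner = [None] * len(golds)  # owner[g] = index of the prediction matched to gold g

    def try_assign(p, visited):
        for g, gold in enumerate(golds):
            if g in visited or not compatible(preds[p], gold):
                continue
            visited.add(g)
            # Take a free gold, or re-route its current owner to another gold.
            if owner[g] is None or try_assign(owner[g], visited):
                owner[g] = p
                return True
        return False

    return sum(try_assign(p, set()) for p in range(len(preds)))


# Hypothetical example: one gold argument, two overlapping detections.
gold = [{"role": "Agent", "box": (0, 0, 10, 10)}]
pred = [
    {"role": "Agent", "box": (0, 0, 10, 10)},  # e.g. a detected person
    {"role": "Agent", "box": (0, 1, 10, 10)},  # e.g. an overlapping suit
]
print(many_to_many_correct(pred, gold))  # 2
print(one_to_one_correct(pred, gold))    # 1
```

In this toy case the second detection is credited under many-to-many counting but not under one-to-one matching, which mirrors the overlapping-category effect discussed for OpenImages.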

#### Empirical Impact Analysis

In [Table 6](https://arxiv.org/html/2606.26775#A1.T6), we show the impact for object detectors YOLOv8 Redmon et al\. \([2016](https://arxiv.org/html/2606.26775#bib.bib46)\); Varghese and M\. \([2024](https://arxiv.org/html/2606.26775#bib.bib47)\) and Faster R\-CNN Ren et al\. \([2015](https://arxiv.org/html/2606.26775#bib.bib45)\), trained on COCO Lin et al\. \([2014](https://arxiv.org/html/2606.26775#bib.bib44)\) and OpenImages Kuznetsova et al\. \([2020](https://arxiv.org/html/2606.26775#bib.bib43)\), respectively\. Models trained on COCO show robust performance under the stricter matching strategy, whereas results on OpenImages exhibit substantial differences\. For example, we observe an F1 increase of \+8\.0 for YOLO\-X OI at $\tau = 0.5$\. This behavior can be attributed to overlapping object categories in OpenImages \(e\.g\., person and suit\) which are otherwise counted as multiple correct predictions under many\-to\-many matching \(see [Figure 4](https://arxiv.org/html/2606.26775#A1.F4)\)\.

Table 6: Visual EAE scores using ground truth events with the many\-to\-many \(original\) and one\-to\-one \(ours\) evaluation\. Models labeled CC and OI correspond to object detectors trained on COCO \(80 classes\) and OpenImages \(600 classes\), respectively\. The parameter $\tau$ denotes the chosen minimum confidence threshold for each object\.

### A\.3 Proposed Single Task Models

In this section, we describe the Single Task models used in the experiments reported in §[4](https://arxiv.org/html/2606.26775#S4)\. Each subtask is modeled independently, including textual ED / EAE, visual ED / EAE, and Event Coreference Resolution\.

#### Textual Event Extraction

For textual ED and EAE, we employ BERT Devlin et al\. \([2019](https://arxiv.org/html/2606.26775#bib.bib41)\) as the text encoder\. Given an input sentence $s$, BERT produces contextualized subtoken representations, which are mean\-pooled to obtain token\-level embeddings\. Textual ED is formulated as a sequence labeling task where each token is classified into an event type using a linear classifier\. For textual EAE, following Li et al\. \([2020](https://arxiv.org/html/2606.26775#bib.bib1)\), we assume gold entity mentions and perform role classification by mean\-pooling the subtoken representations of each entity mention\. Textual ED and EAE are trained as separate models using cross\-entropy loss\.
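
The subtoken-to-token pooling step can be sketched as follows. This is an illustrative, dependency-free version that operates on plain Python lists; the function name `mean_pool_subtokens` and the `word_ids` alignment list are assumptions for the example. Fast tokenizers in the Transformers library expose such an alignment through `BatchEncoding.word_ids()`, but the exact mechanism used in this work is not specified.

```python
def mean_pool_subtokens(subtoken_vectors, word_ids):
    """Average subtoken vectors that belong to the same word.

    word_ids[k] is the index of the word that subtoken k came from, or None
    for special tokens such as [CLS] and [SEP], which are skipped.
    """
    sums, counts = {}, {}
    for vec, wid in zip(subtoken_vectors, word_ids):
        if wid is None:
            continue
        if wid not in sums:
            sums[wid] = [0.0] * len(vec)
            counts[wid] = 0
        sums[wid] = [s + v for s, v in zip(sums[wid], vec)]
        counts[wid] += 1
    # One pooled vector per word, in word order.
    return [[s / counts[wid] for s in sums[wid]] for wid in sorted(sums)]


# Toy input: [CLS], two subtokens of word 0, one subtoken of word 1, [SEP].
vectors = [[9.0, 9.0], [1.0, 0.0], [3.0, 2.0], [5.0, 5.0], [9.0, 9.0]]
alignment = [None, 0, 0, 1, None]
print(mean_pool_subtokens(vectors, alignment))  # [[2.0, 1.0], [5.0, 5.0]]
```

The same averaging applies to an entity mention for EAE, with the pooled set being all subtokens inside the mention span rather than those of a single word.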

#### Visual Event Extraction

For visual ED and EAE, we adopt the CLIP Radford et al\. \([2021](https://arxiv.org/html/2606.26775#bib.bib42)\) vision encoder\. Given an input image $i$, CLIP produces global and patch\-level representations\. Visual ED is treated as image\-level classification by feeding the \[CLS\] token representation into a linear classifier\. For visual EAE, we first detect objects using an offline detector Cao et al\. \([2025](https://arxiv.org/html/2606.26775#bib.bib8)\)\. Patch representations corresponding to each object are mean\-pooled to form object embeddings, which are then classified into argument roles using a linear layer\. Analogous to the textual tasks, visual ED and EAE are trained independently with cross\-entropy loss\.
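
One possible way to form an object embedding from patch representations is sketched below. How patches are assigned to a detected box is not described here, so the rule used in the sketch (every patch whose grid cell overlaps the box) is an assumption, as are the function names and the use of integer pixel boxes in the resized image. The defaults correspond to a 224 x 224 input split into 16 x 16 patches, which gives a 14 x 14 grid.

```python
def patch_indices_for_box(box, image_size=224, patch_size=16):
    """Row-major indices of all patches overlapped by an integer pixel box.

    box is (x1, y1, x2, y2) with exclusive right and bottom edges and x2 > x1, y2 > y1.
    """
    grid = image_size // patch_size
    x1, y1, x2, y2 = box
    col_lo = max(0, x1 // patch_size)
    col_hi = min(grid - 1, (x2 - 1) // patch_size)
    row_lo = max(0, y1 // patch_size)
    row_hi = min(grid - 1, (y2 - 1) // patch_size)
    return [r * grid + c
            for r in range(row_lo, row_hi + 1)
            for c in range(col_lo, col_hi + 1)]


def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def object_embedding(patch_vectors, box, image_size=224, patch_size=16):
    """Mean of the patch vectors (without the [CLS] vector) covered by the box."""
    indices = patch_indices_for_box(box, image_size, patch_size)
    return mean_pool([patch_vectors[i] for i in indices])


print(patch_indices_for_box((0, 0, 32, 16)))    # [0, 1]
print(patch_indices_for_box((16, 16, 48, 32)))  # [15, 16]
```

The resulting vector would then be passed to the linear role classifier; `patch_vectors` is expected to hold one vector per patch in row-major order.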

#### Multimedia Event Extraction

As described in §[4\.1](https://arxiv.org/html/2606.26775#S4.SS1), event coreference resolution between textual and visual events is performed using CLIP\-based similarity scores\. Following prior work Du et al\. \([2023](https://arxiv.org/html/2606.26775#bib.bib4)\); Seeberger et al\. \([2024](https://arxiv.org/html/2606.26775#bib.bib7)\); Cao et al\. \([2025](https://arxiv.org/html/2606.26775#bib.bib8)\), we construct a multimedia event when a text–image pair shares the same predicted event type and their similarity score exceeds a threshold of 20\. The resulting multimedia event aggregates all associated textual and visual arguments\. In addition to this heuristic approach, we also report results obtained using greedy and bipartite matching strategies Liu et al\. \([2022](https://arxiv.org/html/2606.26775#bib.bib2)\)\.
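
The threshold-based heuristic can be illustrated with a short sketch. The data layout (dictionaries with `type`, `sentence_id`, `image_id`, and `arguments` keys) and the precomputed `similarity` lookup are hypothetical; only the rule itself (same predicted event type and a similarity score above 20) is taken from the description above.

```python
def build_multimedia_events(text_events, image_events, similarity, threshold=20.0):
    """Merge a text event and an image event when types agree and similarity > threshold."""
    merged = []
    for t in text_events:
        for v in image_events:
            score = similarity[(t["sentence_id"], v["image_id"])]
            if t["type"] == v["type"] and score > threshold:
                merged.append({
                    "type": t["type"],
                    "sentence_id": t["sentence_id"],
                    "image_id": v["image_id"],
                    # The multimedia event aggregates arguments from both modalities.
                    "arguments": t["arguments"] + v["arguments"],
                })
    return merged


# Hypothetical predictions and similarity scores.
text_events = [{"type": "Attack", "sentence_id": "s1", "arguments": ["text:Attacker"]}]
image_events = [
    {"type": "Attack", "image_id": "i1", "arguments": ["box:Instrument"]},
    {"type": "Attack", "image_id": "i2", "arguments": ["box:Target"]},
    {"type": "Meet", "image_id": "i3", "arguments": []},
]
similarity = {("s1", "i1"): 24.5, ("s1", "i2"): 12.0, ("s1", "i3"): 30.0}

events = build_multimedia_events(text_events, image_events, similarity)
print([(e["sentence_id"], e["image_id"]) for e in events])  # [('s1', 'i1')]
```

Note that this rule does not limit how many images a sentence can be linked to, which is where the greedy and bipartite matching strategies differ.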

#### Training

We train all models using a learning rate of $1\times 10^{-5}$ for encoder parameters and $1\times 10^{-4}$ for classifier parameters\. Textual models are trained with a batch size of 16 for 20 epochs, while visual models use a batch size of 64 and are trained for 10 epochs\. Model performance is evaluated at the end of each epoch on the ACE development set for textual tasks and the SWiG development set for visual tasks\. We select the checkpoint with the best development set performance for final evaluation\.
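
A minimal sketch of the two-learning-rate setup and checkpoint selection is given below. The list of dictionaries follows the per-parameter options format accepted by PyTorch optimizers, but the sketch itself does not import PyTorch. The `encoder.` name prefix, the helper names, and the toy scores are assumptions; the optimizer used in this work is not stated.

```python
ENCODER_LR = 1e-5     # learning rate for encoder parameters (from the text above)
CLASSIFIER_LR = 1e-4  # learning rate for classifier parameters (from the text above)


def build_param_groups(named_parameters, encoder_prefix="encoder."):
    """Split (name, parameter) pairs into two groups with separate learning rates."""
    encoder, classifier = [], []
    for name, param in named_parameters:
        if name.startswith(encoder_prefix):
            encoder.append(param)
        else:
            classifier.append(param)
    return [
        {"params": encoder, "lr": ENCODER_LR},
        {"params": classifier, "lr": CLASSIFIER_LR},
    ]


def best_epoch(dev_scores):
    """Index of the epoch with the highest development score (earliest on ties)."""
    return max(range(len(dev_scores)), key=lambda i: dev_scores[i])


# Placeholder strings stand in for real parameter tensors.
groups = build_param_groups([
    ("encoder.layer.0.weight", "w_enc"),
    ("classifier.weight", "w_cls"),
    ("classifier.bias", "b_cls"),
])
print([len(g["params"]) for g in groups])  # [1, 2]
print(best_epoch([0.50, 0.70, 0.70, 0.60]))  # 1
```

With a real model, the pairs would come from `model.named_parameters()` and the returned groups would be passed to the optimizer constructor.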

### A\.4 Implementation Details

All models are implemented using the Transformers \(Wolf et al\., [2020](https://arxiv.org/html/2606.26775#bib.bib40)\) \(v4\.55\.0\) library in conjunction with PyTorch \(v2\.8\.0\)\. Unless otherwise specified, we use bert\-base\-uncased Devlin et al\. \([2019](https://arxiv.org/html/2606.26775#bib.bib41)\) with 110M parameters as text encoder and clip\-vit\-base\-patch16 Radford et al\. \([2021](https://arxiv.org/html/2606.26775#bib.bib42)\) with 85M parameters as vision encoder\. Object detections are obtained using YOLOv8 Varghese and M\. \([2024](https://arxiv.org/html/2606.26775#bib.bib47)\) \([https://docs\.ultralytics\.com/models/yolov8](https://docs.ultralytics.com/models/yolov8)\) trained on COCO, and detections with confidence scores below 0\.8 are discarded\. All experiments are conducted on NVIDIA A100 GPUs within a single compute node running CUDA 12\.3\. We run each experiment with three random seeds and report the average performance across runs\.

### A\.5 Reproduction Details

In this section, we provide detailed reproduction information for the models discussed in §[5](https://arxiv.org/html/2606.26775#S5), along with potential explanations for any observed differences in performance scores\.

#### CAMEL

We use the official released code \([https://github\.com/ZILIN003/CAMEL](https://github.com/ZILIN003/CAMEL)\) provided by Du et al\. \([2023](https://arxiv.org/html/2606.26775#bib.bib4)\)\. Our reproduced F1 scores largely align with the reported results, except for multimedia ED \(48\.3 vs\. 57\.5 F1\) and EAE \(26\.5 vs\. 33\.2 F1\)\. We attribute these discrepancies primarily to the absence of balanced visual ED training in our reproduction, which reduced recall in favor of precision\.

#### MMUTF

We use the official released code \([https://github\.com/seebergerph/MMUTF](https://github.com/seebergerph/MMUTF)\) provided by Seeberger et al\. \([2024](https://arxiv.org/html/2606.26775#bib.bib7)\)\. Following the original paper, which mainly focuses on EAE, we use ED predictions from CAMEL\. Our reproduced scores closely match the reported results with minor decreases in multimedia ED and EAE, which we also attribute to the lower recall of CAMEL predictions\.

#### X\-MTL

We use the official released code \([https://github\.com/aoine\-dev/X\-MTL](https://github.com/aoine-dev/X-MTL)\) provided by Cao et al\. \([2025](https://arxiv.org/html/2606.26775#bib.bib8)\)\. Our reproduced results show only minor deviations from the originally reported results \(e\.g\., 56\.6 vs\. 52\.9 F1 for textual ED\)\. We believe these differences are related to inconsistencies in the pseudo\-labeled VOA image\-caption dataset, caused by broken links during dataset construction\.

#### SSGPF

We re\-implement SSGPF using LLaVA\-v1\.5\-7B as described in the paper \([https://github\.com/MartinYuanNJU/SSGPF](https://github.com/MartinYuanNJU/SSGPF)\) Chen et al\. \([2025](https://arxiv.org/html/2606.26775#bib.bib11)\)\. The model assumes aligned image\-sentence pairs and requires manually written event and role descriptions\. We obtain comparable performance on multimedia ED \(65\.7 vs\. 61\.0 F1\) but observe a substantial drop on EAE \(36\.0 vs\. 14\.1 F1\), which we attribute to implementation differences in EAE evaluation, visual grounding \(SEEM\), and role descriptions\. For StrictEval, we use the proposed fine\-tuned cross\-modal retrieval model to construct image\-sentence pairs\.

### A\.6 Analysis of StrictEval

We base our experiments on the human\-annotated M2E2 dataset \(Inter\-Annotator Agreement of 81\.2%\) Li et al\. \([2020](https://arxiv.org/html/2606.26775#bib.bib1)\) and remove relaxed assumptions used in prior work to ensure more consistent comparison\. The resulting substantially lower scores are primarily due to stricter evaluation protocols that expose false positives otherwise ignored, which we detail next\. Using our proposed Single Task models, we provide further intuition for these effects\.

#### Textual Evaluation

In textual evaluation, oracle trigger refinement \([https://github\.com/jianliu\-ml/Multimedia\-EE/blob/main/code/textualEE/refine\_result\.py](https://github.com/jianliu-ml/Multimedia-EE/blob/main/code/textualEE/refine_result.py)\) and test subset selection remove a large number of incorrect predictions and negative samples\. For example, the Single Task model predicts 3,969 text events, of which refinement removes 1,636 false positive events and 1,885 arguments prior to evaluation\. Furthermore, our manual analysis suggests annotation discrepancies between ACE and M2E2, potentially contributing to the large number of false positive events\.

#### Visual Evaluation

Similarly, test subset selection excludes 623 images without annotated events, restricting evaluation to only 391 positive samples\. As a result, false positives from the excluded images are not counted\. For instance, the Single Task model predicts 88 false positive visual events and 341 arguments on these omitted images\. Evaluating only on positive samples therefore inflates precision by ignoring predictions on negative images\.
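
The size of this effect can be illustrated with a small calculation. Only the count of 88 false positive visual events on the excluded images comes from the text above; the true positive and false positive counts on the 391 positive images are hypothetical placeholders chosen to make the arithmetic easy to follow.

```python
def precision(true_positives, false_positives):
    """Fraction of predicted events that are correct."""
    predicted = true_positives + false_positives
    return true_positives / predicted if predicted else 0.0


FP_ON_EXCLUDED_IMAGES = 88  # reported for the Single Task model on images without events

# Hypothetical counts on the positive subset, for illustration only.
tp_on_positive_images = 200
fp_on_positive_images = 50

subset_precision = precision(tp_on_positive_images, fp_on_positive_images)
full_precision = precision(
    tp_on_positive_images, fp_on_positive_images + FP_ON_EXCLUDED_IMAGES
)
print(round(subset_precision, 3), round(full_precision, 3))  # 0.8 0.592
```

Recall is unaffected in this setting because images without annotated events contribute no gold events; only the denominator of precision grows.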

#### Multimedia Evaluation

For multimedia evaluation, recent work often assumes access to ground truth image–sentence pairs or applies post\-hoc filtering using gold sentence and image IDs \(e\.g\., via coreference annotations\)\. This substantially reduces the number of evaluated pairs and removes negative candidates\. In contrast, StrictEval constructs and evaluates over all possible image–sentence pairs\. Consequently, models must handle a much larger and noisier candidate space, leading to a significant increase in false positives\. For our baseline models, post\-hoc filtering excludes 1,118 false positive multimedia events; counting them under StrictEval results in a substantial drop in precision and F1 scores\.

### A\.7 Additional Experimental Results

This section supplements the multimedia results presented in §[4\.3](https://arxiv.org/html/2606.26775#S4.SS3) by providing EAE scores along with precision and recall\. We report evaluations for different coreference resolution strategies: threshold\-based \([Table 7](https://arxiv.org/html/2606.26775#A1.T7) and [Table 8](https://arxiv.org/html/2606.26775#A1.T8)\), greedy matching \([Table 9](https://arxiv.org/html/2606.26775#A1.T9)\), and bipartite matching \([Table 10](https://arxiv.org/html/2606.26775#A1.T10)\)\.

![Refer to caption](https://arxiv.org/html/2606.26775v1/x4.png)

Figure 4: Qualitative error analysis of visual EAE\. Gold rectangles denote ground\-truth argument roles while red rectangles indicate correctly counted predictions\. The top row illustrates a common failure case in which overlapping objects \(often other persons, suits, or shirts\) are incorrectly counted as matches\. This inflates performance metrics such as recall\. In contrast, the bottom row shows our proposed one\-to\-one version which alleviates these error cases\.

Table 7: Evaluation results of MEE model with threshold\-based coreference resolution \(CLIP$_{\text{Raw}}$\)\.

Table 8: Evaluation results of MEE model with threshold\-based coreference resolution \(CLIP$_{\text{VOA}}$\)\.

Table 9: Evaluation results of MEE model with greedy matching coreference resolution\.

Table 10: Evaluation results of MEE model with bipartite matching coreference resolution\.