Strikingness-Aware Evaluation for Temporal Knowledge Graph Reasoning

arXiv cs.AI 05/14/26, 04:00 AM Papers
Summary
This paper proposes a strikingness-aware evaluation framework for Temporal Knowledge Graph Reasoning (TKGR) that weights events by rarity to better assess model reasoning, addressing overestimation from trivial repeated events.
arXiv:2605.13153v1 Announce Type: new Abstract: Temporal Knowledge Graph Reasoning (TKGR) aims at inferring missing (especially future) events from historical data. Current evaluation in TKGR weights all events uniformly, ignoring that most are trivial repetitions, which leads to overestimating true reasoning ability. Therefore, the rare outstanding events, whose prediction demands deeper reasoning, should be distinguished and emphasized. To this end, we propose a strikingness-aware evaluation framework, which introduces a rule-based strikingness measuring framework (RSMF) to quantify event strikingness by comparing its expected occurrence with peer events derived from temporal rules. Strikingness is then integrated as a weighting factor into metrics like weighted MRR and Hits@k. Experiments on four TKG benchmarks reveal: 1) all representative models perform worse as event strikingness increases, 2) path-based methods excel on low-strikingness events and representation-based ones on high-strikingness events, 3) the gains of an ensemble method we design stem from fitting trivial events rather than from improved reasoning. Our framework provides a more rigorous evaluation, refocusing the field on predicting outstanding events.
Cached at: 05/14/26, 06:15 AM
# Strikingness-Aware Evaluation for Temporal Knowledge Graph Reasoning
Source: [https://arxiv.org/html/2605.13153](https://arxiv.org/html/2605.13153)
Shengzhe Zhang (2) & Wei Wei (1, 2, 3); (1) School of Computer Science & Technology, Huazhong University of Science and Technology; (2) Institute of Artificial Intelligence, Huazhong University of Science and Technology; (3) School of Artificial Intelligence & Automation, Huazhong University of Science and Technology; \{huangrk, zsz, weiw\}@hust.edu.com; Corresponding author

###### Abstract

Temporal Knowledge Graph Reasoning (TKGR) aims at inferring missing (especially future) events from historical data. Current evaluation in TKGR weights all events uniformly, ignoring that most are trivial repetitions, which leads to overestimating true reasoning ability. Therefore, the rare outstanding events, whose prediction demands deeper reasoning, should be distinguished and emphasized. To this end, we propose a strikingness-aware evaluation framework, which introduces a rule-based strikingness measuring framework (RSMF) to quantify event strikingness by comparing its expected occurrence with peer events derived from temporal rules. Strikingness is then integrated as a weighting factor into metrics like weighted MRR and Hits@k. Experiments on four TKG benchmarks reveal: 1) all representative models perform worse as event strikingness increases, 2) path-based methods excel on low-strikingness events and representation-based ones on high-strikingness events, 3) the gains of an ensemble method we design stem from fitting trivial events rather than from improved reasoning. Our framework provides a more rigorous evaluation, refocusing the field on predicting outstanding events.

## 1 Introduction

Temporal Knowledge Graph Reasoning (TKGR) has made substantial progress in recent years. Existing methods fall into two classes according to whether they forecast future events: interpolation and extrapolation reasoning Jin et al. ([2020](https://arxiv.org/html/2605.13153#bib.bib22)). The former infers missing historical facts, while the latter predicts future events and is also called temporal knowledge graph forecasting Sun et al. ([2021](https://arxiv.org/html/2605.13153#bib.bib32)). This work focuses on extrapolation reasoning, which is essential for many high-risk applications such as financial risk control Aven ([2013](https://arxiv.org/html/2605.13153#bib.bib20)).

Despite promising empirical results Liang et al. ([2024](https://arxiv.org/html/2605.13153#bib.bib14)), many reported gains may largely arise from data biases, leading to a misjudgment of progress in the field Kervadec et al. ([2021](https://arxiv.org/html/2605.13153#bib.bib69)). A historical parallel exists in static KGR, where potential data leakage in well-known benchmarks (WN18, FB15k) led to an overestimation of models' reasoning capabilities Toutanova and Chen ([2015](https://arxiv.org/html/2605.13153#bib.bib70)); Dettmers et al. ([2018](https://arxiv.org/html/2605.13153#bib.bib52)). Over 94% of queries in WN18 and over 81% in FB15k, such as (A, hypernym, ?), can easily be mapped to a training triple (B, hyponym, A) once it is known that hyponym is the inverse of hypernym. A similar phenomenon is now emerging in TKGR: over 80% of events in the ICEWS data have already occurred in prior history Zhu et al. ([2021](https://arxiv.org/html/2605.13153#bib.bib71)). Under the existing TKGR evaluation framework, this may enable heuristic-based predictions that inflate state-of-the-art (SOTA) performance on common events while obscuring poor accuracy on the fewer than 10% of truly challenging, striking cases. For example, given a query such as $(A, \textit{MakeVisit}, ?, T_q)$, a model may output an answer $B$ simply by selecting the most frequently occurring historical event $(A, \textit{MakeVisit}, B, T_i)$ Lee et al. ([2023](https://arxiv.org/html/2605.13153#bib.bib29)); Xu et al. ([2023](https://arxiv.org/html/2605.13153#bib.bib74)). This raises doubts about the predictive quality of current TKGR methods and about whether the current evaluation framework reasonably reflects these models' forecasting capabilities Gastinger et al. ([2024b](https://arxiv.org/html/2605.13153#bib.bib30)).

This flaw in static KGR has been addressed by removing inverse-relation triples from the training sets, yielding WN18RR Dettmers et al. ([2018](https://arxiv.org/html/2605.13153#bib.bib52)) and FB15k-237 Toutanova and Chen ([2015](https://arxiv.org/html/2605.13153#bib.bib70)). However, this straightforward removal strategy is fundamentally inapplicable to TKGR, since all historical events, even repetitive ones, constitute essential evidence for forecasting the future. This dilemma raises a critical question: how can we construct a more meaningful evaluation framework for TKGR without deleting data?

A principled and feasible alternative is to re-weight the gains of test instances rather than treating all instances uniformly. Specifically, trivial events such as (A, MakeVisit, B) that occur frequently across different timestamps should be assigned lower weights. In contrast, rarer outstanding events such as (A, Sign Agreement, B), which require deeper temporal reasoning, should be emphasized. In general, accurately inferring outstanding events offers far greater practical value than merely predicting numerous trivial ones. However, measuring these weights and automatically identifying outstanding events among a large volume of trivial ones is nontrivial. While there have been some efforts to measure the strikingness of facts in static KGs, studies on TKGs are notably lacking. This gap stems from two challenges. First, beyond statistical features, a comprehensive TKGR evaluation framework must incorporate both semantic and temporal relevance. Second, since the ground-truth impact of a future event is unknowable in advance, any measure of its strikingness can only be derived from observable historical patterns.

To address this, we propose a Rule-based Strikingness Measuring Framework (RSMF) that measures the strikingness of future events based on historical evidence. RSMF first leverages first-order temporal rules to retrieve peer events for the target event. It then computes the expected occurrence of candidate events from the semantic confidence of rules, the temporal characteristics of events, and the frequency of event repetition. The strikingness of the future event is derived by contrasting its expected occurrence with that of its peer events. Finally, we construct a strikingness-aware evaluation framework by using strikingness as a weighting factor. Experimentally, we evaluate eight representative baselines under the strikingness-aware evaluation framework across four widely adopted TKG datasets, including three path-based, three representation-based, and two large language model (LLM)-based approaches. Our contributions and key findings can be summarized as follows:

- We propose RSMF to quantify event strikingness in TKGs, and build a corresponding strikingness-aware TKGR evaluation framework that re-weights test instances by their strikingness.
- For the evaluated models, reasoning performance decreases as event strikingness increases, i.e., events with higher strikingness are more difficult to predict.
- We find distinct performance patterns across baselines: path-based methods perform better on low-strikingness events, whereas representation-based approaches excel on high-strikingness events.
- We design an ensemble method combining path- and representation-based models, aiming to leverage their complementary strengths. It obtains significant gains under the original evaluation framework but only marginal gains under our strikingness-aware framework. Analysis reveals that the method's gains come from the dominant low-strikingness events, whereas its performance on rare high-strikingness events decreases.

## 2 Related Work

##### Temporal Knowledge Graph Reasoning Evaluation

In recent years, researchers have proposed various extrapolation TKGR methods, including graph neural network-based Li et al. ([2021](https://arxiv.org/html/2605.13153#bib.bib23), [2022](https://arxiv.org/html/2605.13153#bib.bib31)); Chen et al. ([2024](https://arxiv.org/html/2605.13153#bib.bib33)), rule-based Liu et al. ([2022](https://arxiv.org/html/2605.13153#bib.bib27)); Huang et al. ([2024](https://arxiv.org/html/2605.13153#bib.bib24)), reinforcement learning-based Sun et al. ([2021](https://arxiv.org/html/2605.13153#bib.bib32)); Zheng et al. ([2023](https://arxiv.org/html/2605.13153#bib.bib56)); Dong et al. ([2023](https://arxiv.org/html/2605.13153#bib.bib73)), and the increasingly popular large language model-based methods Lee et al. ([2023](https://arxiv.org/html/2605.13153#bib.bib29)); Liao et al. ([2024](https://arxiv.org/html/2605.13153#bib.bib25)); Xia et al. ([2024](https://arxiv.org/html/2605.13153#bib.bib57)). Alongside these advancements, the common rank-based evaluation methods for link prediction tasks such as KGR are undergoing continuous refinement. Initially, to address the issue of multiple answers for a single query, correct answers other than the target answer were filtered out during ranking to avoid underestimating model performance Bordes et al. ([2013](https://arxiv.org/html/2605.13153#bib.bib34)). Subsequently, to accommodate TKGR, time-aware filtering Han et al. ([2021](https://arxiv.org/html/2605.13153#bib.bib44)) and time interval prediction evaluation Jain et al. ([2020](https://arxiv.org/html/2605.13153#bib.bib59)) were introduced. Considerable efforts have also been made to re-evaluate the performance of various models and establish fair comparisons Sun et al. ([2020](https://arxiv.org/html/2605.13153#bib.bib45)); Ruffinelli et al. ([2020](https://arxiv.org/html/2605.13153#bib.bib58)); Gastinger et al. ([2023](https://arxiv.org/html/2605.13153#bib.bib26)).
In addition, a valuable baseline named Recurrency highlighted flaws in datasets and offered significant insights Gastinger et al. ([2024b](https://arxiv.org/html/2605.13153#bib.bib30)). To explore the capability boundaries of TKGR models, some studies build new benchmark datasets tailored to different scenarios, such as context-aware Ma et al. ([2023](https://arxiv.org/html/2605.13153#bib.bib49)), multi-modal Li et al. ([2024](https://arxiv.org/html/2605.13153#bib.bib50)), and large-scale settings Gastinger et al. ([2024a](https://arxiv.org/html/2605.13153#bib.bib51)). However, these datasets may also suffer from the data biases described above, as the repetitive pattern is an inherent characteristic of TKGs.

##### Outstanding Facts Mining in Knowledge Graph

Outstanding facts (OFs) mining focuses on quantifying the strikingness of facts. Early research focused on extracting OFs from unstructured data (e.g., text) Angiulli et al. ([2009](https://arxiv.org/html/2605.13153#bib.bib28)); Hassan et al. ([2014](https://arxiv.org/html/2605.13153#bib.bib60)); Wu et al. ([2012](https://arxiv.org/html/2605.13153#bib.bib61)). Maverick Zhang et al. ([2018](https://arxiv.org/html/2605.13153#bib.bib16)) was the first to measure event strikingness in static KGs, using the specific attribute values of entities. FMINER Yang et al. ([2021](https://arxiv.org/html/2605.13153#bib.bib17)) introduced context entity constraints and designed a pattern relevance model to optimize the event search process. The robustness of measured outstanding events has been further explored using perturbation analysis Xiao et al. ([2024](https://arxiv.org/html/2605.13153#bib.bib18)). To the best of our knowledge, our framework is the first principled extension of the established Outstanding Fact Mining paradigm from static KGs to TKGs. Furthermore, we transform a mining technique into a comprehensive evaluation framework, creating weighted metrics that reorient the TKGR field towards valuing outstanding reasoning.

## 3 Strikingness-Aware Evaluation

### 3.1 Preliminaries

##### Temporal Knowledge Graph Reasoning

A TKG can be represented as a sequence of timestamped KGs, denoted as $\mathcal{G}=\{\mathcal{G}_1,\mathcal{G}_2,\ldots,\mathcal{G}_t\}$. Each KG at a specific timestamp $t$ is defined as $\mathcal{G}_t=(\mathcal{E},\mathcal{R},\mathcal{F}_t)$, where $\mathcal{E}$ is the set of entities, $\mathcal{R}$ is the set of relations, and $\mathcal{F}_t=\{(s,r,o,t)\}$ is the set of events observed at timestamp $t$. Given a query $(s,r,?,t)$, a TKGR model infers the object $o$ from the facts observed before $t$, where $s$ and $o$ are the subject and object entities, $r$ is a relation, and $t$ is a timestamp. For instance, the query (Markieff Morris, join, ?, 2025-02) requires the model to predict the Lakers based on events before 2025-02, which validates its forecasting capability in practice.

![Refer to caption](https://arxiv.org/html/2605.13153v1/x1.png)

Figure 1: An example of strikingness measuring for the target future event $(\textit{South Korea}, \textit{Sign Agreements}, \textit{North Korea}, \text{2025/08/15})$ (replacing the object). In Peer Events Retrieval, RSMF retrieves the historical events and constructs the peer events with the rule set. The Expectation of Occurrence and Strikingness Calculation steps are computed by rule grounding and the strikingness scoring function, respectively.

##### Strikingness of Events

Strikingness quantifies how outstanding a target event $f=(s,r,o,t)$ is, as a continuous value in the range [0, 1]. The closer the value is to 1, the more outstanding the event, and vice versa. Events with low strikingness are referred to as *trivial events*, while events with high strikingness are referred to as *outstanding events*. Since it is not meaningful to compare two entirely unrelated events, such as "Markieff Morris will join the Lakers in 2025" and "The Federal Reserve will cut interest rates in 2026", strikingness is defined by comparing the target event with its peer events $\mathcal{P}$.

##### Peer Events

A peer event is an event related to the target event, generated by replacing one of its entities or its relation. For a target future event $f=(s,r,o,t)$, its peer events are defined as $\mathcal{P}_f^s=\{(s',r,o,t)\mid s'\in\mathcal{E}\}$, $\mathcal{P}_f^o=\{(s,r,o',t)\mid o'\in\mathcal{E}\}$, and $\mathcal{P}_f^r=\{(s,r',o,t)\mid r'\in\mathcal{R}\}$.

### 3.2 Strikingness Measuring

To measure the strikingness of an event, three key challenges must be addressed: 1) constructing a set of peer events that can be compared with the target future event, 2) computing the expectation of occurrence of the target event and its peer events, and 3) calculating the strikingness score of the target event from the expectation of occurrence. The overall procedure for RSMF is outlined in Figure [1](https://arxiv.org/html/2605.13153#S3.F1).

##### Peer Events Retrieval

Peer events can be obtained by replacing the entities or relations of the target event\. However, direct substitution may generate many meaningless peer events, such as \(Markieff Morris,join,Microsoft, 2025\-02\)\. Therefore, we utilize temporal rules to constrain the generation of peer events from historical KGs\.

For a target future event $f=(s,r,o,t)$, we first obtain the rule set $TR$ corresponding to the relation $r$ through rule mining Liu et al. ([2022](https://arxiv.org/html/2605.13153#bib.bib27)). While higher-order rules could capture more complex patterns, they also introduce exponential computational complexity and a risk of overfitting. Thus, as a practical measure, we only use rules of length 1. A detailed complexity analysis is provided in Appendix [C](https://arxiv.org/html/2605.13153#A3). A temporal rule is defined as follows:

$$(E_1, r_h, E_2, T_2) \leftarrow (E_1, r_b, E_2, T_1) \tag{1}$$

where $T_1<T_2$, $r_h$ and $r_b$ denote the rule head and rule body relations, and $E_i$ and $T_i$ indicate entity and timestamp variables.

For ease of understanding, we take the retrieval of the object entity as an example. We first mask the object entity in the target event to convert it into a query $f_q=(s,r,?,t)$, and then use the historical KG sequences and temporal rules to search for historical events that support this query. For a rule $tr\in TR$, we ground the rule body in the historical KG sequences $\{\mathcal{G}_i\}_{i=t-w}^{t-1}$ to obtain the grounded historical events:

$$\mathcal{F}_{f,tr}^{o}=\{(s,r_b,o',t') \mid t-w\leq t'<t\} \tag{2}$$

where $r_b$ represents the relation of the rule body of $tr$, while $w$ controls the window of the historical KG sequences. The grounded events of all rule bodies in $TR$ are then collected as:

$$\mathcal{F}_{f}^{o}=\bigcup_{tr\in TR}\mathcal{F}_{f,tr}^{o} \tag{3}$$

Then, we take the object entities in $\mathcal{F}_f^o$ as the candidate set $\mathcal{C}_f^o=\{o'\mid(s,r_b,o',t')\in\mathcal{F}_f^o\}$ for object entity replacement. The peer object events $\mathcal{P}_f^o=\{(s,r,o',t)\mid o'\in\mathcal{C}_f^o\}$ of the target event $f$ are then generated by substituting the object entity $o$. The peer relation events $\mathcal{P}_f^r$ and the peer subject events $\mathcal{P}_f^s$ are obtained similarly.
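The object-side retrieval in Eqs. (2) and (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: timestamps are integers, the mined length-1 rules for the head relation are given as a plain set of body relations, and the names `Event` and `peer_object_events` are hypothetical rather than taken from the released RSMF code.

```python
from collections import namedtuple

# Hypothetical event container: (subject, relation, object, timestamp).
Event = namedtuple("Event", ["s", "r", "o", "t"])


def peer_object_events(target, history, body_relations, window):
    """Build peer object events for a target event (s, r, o, t).

    history: iterable of Event objects (the historical KG sequence).
    body_relations: body relations r_b of the length-1 rules whose head
        relation is target.r (assumed to be mined beforehand).
    window: the history window w, in timestamp units.
    """
    s, r, _, t = target
    # Eq. (2)-(3): ground every rule body (s, r_b, o', t') with t - w <= t' < t.
    grounded = [
        e for e in history
        if e.s == s and e.r in body_relations and t - window <= e.t < t
    ]
    # Candidate objects C_f^o, sorted only to make the output deterministic.
    candidates = sorted({e.o for e in grounded})
    # Substitute each candidate object into the target event.
    return [Event(s, r, o_prime, t) for o_prime in candidates]
```

Restricting candidates to rule-grounded objects is what keeps implausible substitutions out of the peer set; subject and relation peers would follow the same pattern with the roles swapped.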

##### Expectation of Occurrence

Strikingness is related to human expectations regarding the occurrence of an event. To this end, we employ a rule-based approach to compute the expected scores of the target event and its peer events. For a peer event $f'$, we ground each rule $tr\in TR$ to obtain a rule grounding, i.e., a pair of events:

$$rg_{f'}^{tr}=(s,r,o',t_h)\leftarrow(s,r_b,o',t_y) \tag{4}$$

Each $rg_{f'}^{tr}$ consists of a rule body and a rule head, meaning that the body event supports the occurrence of the head event.

Furthermore, we also consider the impact of event frequency on strikingness measurement. Intuitively, the more frequently a rule grounding is observed, the higher the expectation of event $f'$. Therefore, we iteratively collect rule groundings to obtain the set of rule groundings:

$$RG_{f'}^{tr}=\{(s,r,o',t_{h_j})\leftarrow(s,r_b,o',t_{y_j})\}_{j=1}^{n},\quad \text{with}\ t_{y_1}<t_{h_1}\leq t_{y_2}<t_{h_2}\leq\ldots\leq t_{y_n}<t_{h_n} \tag{5}$$

where $t_{y_1}\geq t-w$, and $n$ represents the number of rule groundings. The temporal constraint is used to avoid overlap and redundancy. Additionally, we set $t_{h_n}=t$, indicating that in the temporally closest grounding, only the rule body is a historical event, which provides support for reasoning about the potential future event $f'$.

After obtaining the rule grounding set, we compute the expectation score of the target event and its peer events. Following rule-based reasoning methods Ott et al. ([2023](https://arxiv.org/html/2605.13153#bib.bib66)); Huang et al. ([2024](https://arxiv.org/html/2605.13153#bib.bib24)), we design the expectation score from two aspects. First, the effectiveness of rule-based reasoning decays over time, and we use an exponential distribution to model this decay. Second, since different rules contribute differently to the expectation, we use the confidence of each rule to reflect its contribution. The expectation score is calculated as follows:

$$sc_{f'}=\sum_{tr\in TR}\ \sum_{rg_{f'}^{tr}\in RG_{f'}^{tr}} conf(tr)\cdot e^{-\lambda(t-t_y)} \tag{6}$$

where $conf(tr)$ represents the confidence of rule $tr$, $\lambda>0$ is the temporal decay coefficient, and $t_y$ denotes the timestamp of the rule body in the rule grounding $rg_{f'}^{tr}$.
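As a rough illustration of Eq. (6), the sketch below sums the confidence-weighted, exponentially decayed contributions of all rule groundings of one event. It assumes the groundings have already been collected under the ordering constraint of Eq. (5) and are represented only by their body timestamps; the function name and the dictionary layout are hypothetical.

```python
import math


def expectation_score(groundings, confidences, t, decay=0.1):
    """Expectation score of one (peer) event following Eq. (6).

    groundings: dict mapping a rule id to the list of body timestamps t_y,
        one entry per rule grounding (assumed already filtered by Eq. (5)).
    confidences: dict mapping a rule id to conf(tr).
    t: timestamp of the target future event.
    decay: temporal decay coefficient lambda (> 0).
    """
    score = 0.0
    for rule_id, body_times in groundings.items():
        for t_y in body_times:
            # Older evidence contributes less; confident rules contribute more.
            score += confidences[rule_id] * math.exp(-decay * (t - t_y))
    return score
```

An event with no groundings gets a score of zero, which later makes it maximally surprising relative to well-supported peers.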

According to the element being replaced, we obtain the corresponding score sets of the target event $f$ and transform them into vectors, denoted as $\mathbf{sc}_f^s\in\mathbb{R}^{|\mathcal{C}_f^s|}$, $\mathbf{sc}_f^r\in\mathbb{R}^{|\mathcal{C}_f^r|}$, and $\mathbf{sc}_f^o\in\mathbb{R}^{|\mathcal{C}_f^o|}$. These three sets of scores estimate the expectation of events from different perspectives.

##### Strikingness Calculation

The expectation score reflects the prior perception of the likelihood of an event’s occurrence\. Strikingness measures the degree to which the event exceeds the prior expectation and is thus inversely correlated with the expectation score\. That is, the higher the expectation score assigned to the eventf′f^\{\\prime\}, the less striking the occurrence of the event\. To constrain strikingness within the range \[0, 1\], the obtained score sets need to be normalized as follows:

$$\mathbf{sc}_{norm}^{be}=\mathbf{sc}_{f}^{be}/\|\mathbf{sc}_{f}^{be}\|_2 \tag{7}$$

where the body element $be\in\{s,r,o\}$ is the replaced element, and $\|\cdot\|_2$ is the L2 norm.

Then, we compare the normalized scores of the peer events $f'$ to highlight the prominence of the target event $f$. The strikingness scoring function accounts for both the magnitude of peer event scores and the differences between them. Based on the strikingness measure proposed in Angiulli et al. ([2009](https://arxiv.org/html/2605.13153#bib.bib28)), we adopt the following function to calculate the strikingness of the body elements:

$$sk_{f}^{be}=\sum sc_{f'}^{be}\cdot(sc_{f'}^{be}-sc_{f}^{be})\cdot\mathbb{I}(sc_{f'}^{be}>sc_{f}^{be}) \tag{8}$$

where $sc_{f'}^{be}\in\mathbf{sc}_{norm}^{be}$, and $\mathbb{I}(\cdot)$ is the indicator function that returns 1 if the condition is true and 0 otherwise.

Finally, we weight the strikingness of all body elements to obtain the final strikingness of potential future eventff:

$$sk_{f}=\alpha^{s}sk_{f}^{s}+\alpha^{o}sk_{f}^{o}+\alpha^{r}sk_{f}^{r} \tag{9}$$

where $\alpha^{s},\alpha^{o},\alpha^{r}\in[0,1]$ are the weights of the body elements and $\alpha^{s}+\alpha^{o}+\alpha^{r}=1$. We provide the proof for the bound $sk_f\in[0,1]$ in Appendix [A](https://arxiv.org/html/2605.13153#A1).
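Eqs. (7) to (9) can be sketched as follows. The code makes two assumptions that the text does not spell out: a target whose replaced element is absent from the candidate set is given an expectation score of 0, and an all-zero score vector yields a strikingness of 0 to avoid dividing by zero. Function names are hypothetical; the default weights are the values reported in the hyperparameter settings (0.4, 0.4, 0.2).

```python
import math


def element_strikingness(scores, target):
    """Strikingness of one replaced element (s, r, or o), Eq. (7)-(8).

    scores: dict mapping each candidate to its expectation score.
    target: the candidate that appears in the target event.
    """
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return 0.0  # assumption: no evidence at all, nothing to compare against
    normed = {k: v / norm for k, v in scores.items()}  # Eq. (7)
    sc_f = normed.get(target, 0.0)  # assumption: unseen target scores 0
    # Eq. (8): only peers that were more expected than the target contribute.
    return sum(v * (v - sc_f) for v in normed.values() if v > sc_f)


def strikingness(sk_s, sk_o, sk_r, alpha=(0.4, 0.4, 0.2)):
    """Final strikingness, Eq. (9); alpha = (alpha_s, alpha_o, alpha_r)."""
    a_s, a_o, a_r = alpha
    return a_s * sk_s + a_o * sk_o + a_r * sk_r
```

Because the normalized scores are non-negative with squared sum 1, each per-element value stays in [0, 1]: the most expected candidate gets 0, and a candidate with no support at all gets 1.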

### 3.3 Strikingness-Aware Evaluation Framework

Given a query $(s_q,r_q,?,t_q)$, a TKGR model outputs a score vector $\mathbf{y}\in\mathbb{R}^{|\mathcal{E}|}$. Ranking $\mathbf{y}$ yields the rank of the answer entity. The original evaluation method computes the Mean Reciprocal Rank (MRR) and Hits@k from these ranks. However, this approach assigns equal weight to all future events, making it unable to capture a model's ability to predict outstanding events. To address this limitation, we propose a strikingness-aware evaluation framework to evaluate existing TKGR baselines. Specifically, the computed strikingness scores are used as weighting factors to calculate the Weighted MRR (WMRR) and Weighted Hits@k (WHits@k) metrics, as described below:

$$\mathrm{WMRR}=\frac{\sum_{i=1}^{|N|}(s_i+b)\cdot\frac{1}{rank_i}}{\sum_{i=1}^{|N|}(s_i+b)} \tag{10}$$

$$\mathrm{WHits@k}=\frac{\sum_{i=1}^{|N|}(s_i+b)\cdot\mathbb{I}(rank_i\leq k)}{\sum_{i=1}^{|N|}(s_i+b)} \tag{11}$$

where $|N|$ is the size of the test set, $rank_i$ is the rank of the correct answer for the $i$-th test event, and $s_i$ is the strikingness of that event, so that high-strikingness events contribute more to the metric. The offset $b$ controls how much more a misprediction on an outstanding event costs than one on a trivial event. Specifically, with $b=0.1$, an event with strikingness $s_i=1$ receives eleven times the weight of an event with $s_i=0$, since $(1+b)/(0+b)=11$. This reflects our value judgment that the utility of correctly predicting a critical outstanding event far outweighs that of correctly predicting a routine trivial one. The parameter $b$ thus quantifies and incorporates this value judgment into the evaluation framework.
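The weighted metrics in Eqs. (10) and (11) reduce to a few lines of Python. The sketch below assumes that ranks are 1-based and that strikingness values are supplied per test event in the same order as the ranks; the function names are illustrative only.

```python
def weighted_mrr(ranks, strikingness, b=0.1):
    """WMRR, Eq. (10): reciprocal ranks weighted by (s_i + b)."""
    weights = [s + b for s in strikingness]
    return sum(w / r for w, r in zip(weights, ranks)) / sum(weights)


def weighted_hits_at_k(ranks, strikingness, k, b=0.1):
    """WHits@k, Eq. (11): weighted fraction of events ranked within the top k."""
    weights = [s + b for s in strikingness]
    hit_weight = sum(w for w, r in zip(weights, ranks) if r <= k)
    return hit_weight / sum(weights)
```

When every event has the same strikingness the weights cancel and both functions fall back to the ordinary MRR and Hits@k, so the weighted scores differ from the originals only to the extent that errors concentrate on high-strikingness events.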

In addition, we introduce a simple ensemble method. Instead of constructing complex networks, we follow Meilicke et al. ([2021](https://arxiv.org/html/2605.13153#bib.bib65)); Liu et al. ([2023](https://arxiv.org/html/2605.13153#bib.bib63)); Wang et al. ([2024](https://arxiv.org/html/2605.13153#bib.bib64)) and directly combine the output scores of existing path-based and representation-based methods to obtain the final score:

$$\mathbf{y}_{ensem}=\eta\,\mathbf{y}_{path}+(1-\eta)\,\mathbf{y}_{representation} \tag{12}$$

where $\eta\in[0,1]$ is a hyperparameter. We search for $\eta$ on the validation set and then apply the selected value to the test set. Our purpose in constructing the ensemble method is not to pursue higher performance; throughout the paper, we focus on investigating the boundaries of TKGR models' reasoning capabilities.
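A minimal sketch of the score combination in Eq. (12) and of a grid search over $\eta$ is given below. Several details are assumptions made for illustration: the two score vectors are taken to be on comparable scales, the search criterion is plain MRR on the validation queries, ties are ranked optimistically, and all function names are hypothetical.

```python
def ensemble_scores(y_path, y_repr, eta):
    """Eq. (12): convex combination of the two models' entity scores."""
    return [eta * p + (1.0 - eta) * q for p, q in zip(y_path, y_repr)]


def rank_of(scores, answer_idx):
    """1-based rank of the answer; ties are resolved optimistically (assumption)."""
    target = scores[answer_idx]
    return 1 + sum(1 for s in scores if s > target)


def search_eta(val_queries, grid):
    """Pick the eta from `grid` with the best validation MRR.

    val_queries: list of (y_path, y_repr, answer_idx) triples.
    """
    def mrr(eta):
        total = sum(
            1.0 / rank_of(ensemble_scores(p, q, eta), a)
            for p, q, a in val_queries
        )
        return total / len(val_queries)

    return max(grid, key=mrr)
```

The selected value would then be reused unchanged on the test set, and the strikingness weights only enter afterwards when the resulting ranks are scored.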

## 4 Experiments

### 4.1 Implementation Settings

##### Datasets

Extensive experiments are conducted on four TKG datasets: ICEWS14, ICEWS18, ICEWS05\-15, and GDELT\. The datasets are divided in chronological order\.

![Refer to caption](https://arxiv.org/html/2605.13153v1/x2.png)

![Refer to caption](https://arxiv.org/html/2605.13153v1/x3.png)

![Refer to caption](https://arxiv.org/html/2605.13153v1/x4.png)

Figure 2: Group performances on ICEWS14 and ICEWS18. In each group, the bars denote the number of test events, while the lines indicate the average performance.
##### Baselines

We conducted comparisons under a unified experimental framework, focusing on reproducible approaches, including path-based methods: Recurrency Gastinger et al. ([2024b](https://arxiv.org/html/2605.13153#bib.bib30)), TITer Sun et al. ([2021](https://arxiv.org/html/2605.13153#bib.bib32)), and TLogic Liu et al. ([2022](https://arxiv.org/html/2605.13153#bib.bib27)); representation-based methods: RE-GCN Li et al. ([2021](https://arxiv.org/html/2605.13153#bib.bib23)), TiRGN Li et al. ([2022](https://arxiv.org/html/2605.13153#bib.bib31)), and LogCL Chen et al. ([2024](https://arxiv.org/html/2605.13153#bib.bib33)); and LLM-based methods: ICL Lee et al. ([2023](https://arxiv.org/html/2605.13153#bib.bib29)) and GenTKG Liao et al. ([2024](https://arxiv.org/html/2605.13153#bib.bib25)). We follow the time-aware filtering settings described in Gastinger et al. ([2023](https://arxiv.org/html/2605.13153#bib.bib26)) ([https://github.com/nec-research/TKG-Forecasting-Evaluation](https://github.com/nec-research/TKG-Forecasting-Evaluation)). The code and strikingness weights are available at [https://github.com/PersimmonZ1/RSMF](https://github.com/PersimmonZ1/RSMF).

##### Hyperparameters

We set $\lambda$ to 0.1 following TLogic Liu et al. ([2022](https://arxiv.org/html/2605.13153#bib.bib27)). The weights $\alpha^{s}$, $\alpha^{o}$, and $\alpha^{r}$ are set to 0.4, 0.4, and 0.2, respectively. These recommended parameter settings are provided to establish a well-calibrated evaluation framework, facilitate reproducibility, and promote community adoption. More implementation details are provided in Appendix [B](https://arxiv.org/html/2605.13153#A2).

### 4.2 TKGR Performance on Group Strikingness

##### Performance Across Groups

Unlike previous works that only report overall metrics for model performance, we group the data by strikingness and compute the average performance within each group. Figure [2](https://arxiv.org/html/2605.13153#S4.F2) presents the grouped MRR results for six baseline models and our proposed ensemble model on ICEWS14. The number of events decreases as strikingness increases. That is, low-strikingness events dominate the test set, whereas high-strikingness events are scarce, which aligns with human intuition. In terms of performance, all models exhibit a decline with increasing strikingness, indicating that events with higher strikingness are more difficult to predict.
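The grouped analysis can be reproduced with a simple binning routine. The sketch below is an assumption-laden illustration: it uses ten equal-width strikingness bins (matching the 0.1-wide groups reported in the tables), assigns a value that falls exactly on a boundary to the upper bin, and places 1.0 in the last bin. The function name is hypothetical.

```python
def grouped_mrr(ranks, strikingness, n_bins=10):
    """Per-group event count and mean reciprocal rank.

    Returns a list of (count, mean_rr) pairs, one per strikingness bin;
    mean_rr is None for empty bins.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for rank, s in zip(ranks, strikingness):
        # Equal-width bins over [0, 1]; clamp s = 1.0 into the last bin.
        idx = min(int(s * n_bins), n_bins - 1)
        sums[idx] += 1.0 / rank
        counts[idx] += 1
    return [
        (counts[i], sums[i] / counts[i] if counts[i] else None)
        for i in range(n_bins)
    ]
```

The counts correspond to the bars and the per-bin means to the lines in a plot of this kind.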

Overall, we find that path-based methods excel at predicting events with low strikingness, while representation-based methods demonstrate superior performance on high-strikingness events. This indicates that different categories of methods have distinct advantages in forecasting future events with different levels of strikingness. Based on this finding, we investigate whether a simple ensemble method can combine the advantages of both categories. Results on ICEWS05-15 and GDELT lead to consistent conclusions. Due to space limitations, they are shown in Appendix [D](https://arxiv.org/html/2605.13153#A4).

For the ensemble method, we observe two phenomena: 1\) Trade\-off: for events with $sk < 0.1$ or $sk > 0.5$, the performance of the ensemble method lies between those of the individual methods\. 2\) Enhancement: for events with $0.1 < sk < 0.5$, the ensemble method outperforms both individual methods\. These observations suggest that models relying on ensemble\-style combination mainly improve performance on low\-strikingness events, which are predominant\. For instance, comparing RE\-GCN with TiRGN, which incorporates additional information on repeated events, TiRGN significantly outperforms RE\-GCN on low\-strikingness events while achieving similar performance on high\-strikingness events\.

| Model | Type | $S(0.6,0.7)$ | $S(0.7,0.8)$ | $S(0.8,0.9)$ | $S(0.9,1.0)$ |
| --- | --- | --- | --- | --- | --- |
| Recurrency | High $NO_f$ | 19.10 | 13.46 | 12.05 | 5.23 |
| Recurrency | Low $NO_f$ | 12.28 | 5.77 | 5.36 | 1.65 |
| TITer | High $NO_f$ | 27.61 | 21.92 | 22.89 | 16.67 |
| TITer | Low $NO_f$ | 17.51 | 11.41 | 9.13 | 7.44 |
| TLogic | High $NO_f$ | 28.96 | 21.28 | 19.08 | 14.60 |
| TLogic | Low $NO_f$ | 15.57 | 10.90 | 8.73 | 4.55 |
| RE-GCN | High $NO_f$ | 38.36 | 31.03 | 30.92 | 24.57 |
| RE-GCN | Low $NO_f$ | 17.81 | 19.23 | 17.86 | 14.98 |
| TiRGN | High $NO_f$ | 35.97 | 31.15 | 29.12 | 21.78 |
| TiRGN | Low $NO_f$ | 19.31 | 20.26 | 18.25 | 14.88 |
| LogCL | High $NO_f$ | 49.40 | 42.56 | 43.98 | 33.45 |
| LogCL | Low $NO_f$ | 27.10 | 28.46 | 29.56 | 25.83 |
| Ensemble | High $NO_f$ | 49.70 | 40.64 | 42.77 | 32.60 |
| Ensemble | Low $NO_f$ | 26.65 | 25.90 | 28.37 | 24.69 |

Table 1: The Hits@3 metric of High and Low $NO_f$ events within the high\-strikingness groups on ICEWS14\.

| Dataset | Model | (W)MRR ORG↑ | (W)MRR SK↑ | (W)MRR Δ↓ | (W)Hits@1 ORG↑ | (W)Hits@1 SK↑ | (W)Hits@1 Δ↓ | (W)Hits@3 ORG↑ | (W)Hits@3 SK↑ | (W)Hits@3 Δ↓ | (W)Hits@10 ORG↑ | (W)Hits@10 SK↑ | (W)Hits@10 Δ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ICEWS14 | ICL | - | - | - | 32.40 | 15.95 | 50.77% | 45.94 | 26.18 | 43.01% | 56.59 | 36.59 | 35.34% |
| ICEWS14 | GenTKG | - | - | - | 37.04 | 18.28 | 50.65% | 48.43 | 28.07 | 42.04% | 53.62 | 33.96 | 36.67% |
| ICEWS14 | Recurrency | 37.12 | 19.47 | 47.55% | 29.69 | 13.62 | 54.13% | 40.75 | 21.49 | 47.26% | 51.26 | 30.75 | 40.01% |
| ICEWS14 | TITer | 41.87 | 25.46 | 39.19% | 32.97 | 17.55 | 46.77% | 46.45 | 28.39 | 38.88% | 58.31 | 40.72 | 30.17% |
| ICEWS14 | TLogic | 42.52 | 24.92 | 41.39% | 33.19 | 16.68 | 49.74% | 47.63 | 28.53 | 40.10% | 60.27 | 41.41 | 31.29% |
| ICEWS14 | RE-GCN | 42.43 | 29.92 | 29.48% | 31.90 | 19.76 | 38.06% | 47.59 | 33.86 | 28.85% | 62.74 | 49.90 | 20.47% |
| ICEWS14 | TiRGN | 44.45 | 30.43 | 31.54% | 33.77 | 20.30 | 39.89% | 49.57 | 34.09 | 31.23% | 64.89 | 50.69 | 21.88% |
| ICEWS14 | LogCL | 48.84 | 38.17 | 21.85% | 37.76 | 26.88 | 28.81% | 54.60 | 43.23 | 20.82% | 70.43 | 60.78 | 13.70% |
| ICEWS14 | Ensemble | 51.35 | 38.67 | 24.69% | 40.23 | 27.18 | 32.44% | 57.60 | 44.07 | 23.49% | 72.56 | 61.49 | 15.26% |
| ICEWS18 | ICL | - | - | - | 19.27 | 9.42 | 51.12% | 31.35 | 17.87 | 43.0% | 43.97 | 28.58 | 35.00% |
| ICEWS18 | GenTKG | - | - | - | 21.36 | 11.04 | 48.31% | 33.51 | 20.35 | 39.27% | 40.03 | 26.68 | 33.35% |
| ICEWS18 | Recurrency | 28.66 | 15.97 | 44.28% | 20.77 | 10.22 | 50.79% | 32.25 | 18.05 | 44.03% | 43.54 | 26.83 | 38.38% |
| ICEWS18 | TITer | 29.65 | 18.60 | 37.27% | 21.58 | 12.10 | 43.93% | 33.06 | 20.46 | 38.11% | 44.98 | 31.32 | 30.37% |
| ICEWS18 | TLogic | 29.59 | 17.30 | 41.53% | 20.42 | 10.21 | 50.00% | 33.60 | 19.61 | 41.64% | 48.06 | 32.05 | 33.31% |
| ICEWS18 | RE-GCN | 32.78 | 23.96 | 26.91% | 22.54 | 14.87 | 34.03% | 36.91 | 26.83 | 27.31% | 52.74 | 42.03 | 20.31% |
| ICEWS18 | TiRGN | 33.54 | 23.59 | 29.67% | 22.92 | 14.21 | 38.00% | 38.09 | 26.67 | 29.98% | 54.38 | 42.25 | 22.31% |
| ICEWS18 | LogCL | 35.43 | 28.35 | 19.98% | 24.09 | 17.79 | 26.15% | 40.22 | 32.11 | 20.16% | 58.04 | 49.66 | 14.44% |
| ICEWS18 | Ensemble | 37.74 | 28.49 | 24.51% | 26.19 | 17.77 | 32.15% | 42.84 | 32.31 | 24.58% | 60.70 | 50.15 | 17.38% |
| ICEWS0515 | Recurrency | 44.39 | 26.66 | 39.94% | 35.68 | 18.89 | 47.06% | 49.26 | 30.05 | 39.00% | 60.54 | 41.61 | 31.27% |
| ICEWS0515 | TITer | 48.03 | 31.39 | 34.65% | 38.61 | 22.43 | 41.91% | 53.03 | 34.85 | 34.28% | 65.42 | 48.81 | 25.39% |
| ICEWS0515 | TLogic | 46.56 | 30.65 | 34.17% | 35.48 | 20.50 | 42.22% | 53.27 | 35.65 | 33.08% | 67.25 | 50.69 | 24.62% |
| ICEWS0515 | RE-GCN | 47.93 | 33.73 | 29.63% | 37.32 | 23.39 | 37.33% | 53.78 | 38.33 | 28.73% | 68.16 | 54.23 | 20.44% |
| ICEWS0515 | TiRGN | 49.90 | 35.01 | 29.84% | 38.95 | 24.12 | 38.07% | 56.13 | 40.08 | 28.59% | 70.69 | 56.52 | 20.05% |
| ICEWS0515 | LogCL | 56.95 | 45.39 | 20.30% | 45.88 | 33.62 | 26.72% | 63.73 | 51.75 | 18.80% | 77.79 | 68.38 | 12.10% |
| ICEWS0515 | Ensemble | 58.53 | 46.10 | 21.24% | 47.48 | 34.17 | 28.03% | 65.43 | 52.61 | 19.59% | 79.23 | 69.41 | 12.39% |
| GDELT | Recurrency | 24.37 | 16.53 | 32.17% | 16.43 | 9.96 | 39.38% | 26.79 | 17.98 | 32.89% | 39.70 | 29.10 | 26.70% |
| GDELT | TITer | 20.17 | 13.16 | 34.75% | 14.23 | 8.18 | 42.52% | 21.98 | 14.06 | 36.03% | 30.67 | 22.07 | 28.04% |
| GDELT | TLogic | 19.77 | 12.44 | 37.08% | 12.23 | 6.53 | 46.61% | 21.67 | 13.43 | 38.02% | 35.62 | 25.01 | 29.79% |
| GDELT | RE-GCN | 19.73 | 14.39 | 27.07% | 12.50 | 7.85 | 37.20% | 20.96 | 15.21 | 27.43% | 33.89 | 27.10 | 20.04% |
| GDELT | TiRGN | 21.25 | 15.41 | 27.48% | 13.27 | 8.38 | 36.85% | 22.81 | 16.38 | 28.19% | 37.01 | 29.20 | 21.10% |
| GDELT | LogCL | 23.74 | 19.47 | 17.99% | 14.62 | 10.79 | 26.20% | 25.57 | 20.88 | 18.34% | 42.33 | 37.14 | 12.26% |
| GDELT | Ensemble | 25.26 | 19.69 | 22.05% | 15.60 | 10.71 | 31.35% | 27.58 | 21.37 | 22.52% | 45.03 | 38.00 | 15.61% |

Table 2: Performance comparison of original \(‘ORG’\) and striking\-aware \(‘SK’\) evaluation; higher values mean better performance \($\uparrow$\)\. $\Delta$ represents the relative performance decrease between the two evaluation settings; smaller values indicate that the model is less affected by repetitive bias \($\downarrow$\)\. The best results are bolded, and the second\-best results are underlined\.

![Refer to caption](https://arxiv.org/html/2605.13153v1/x5.png)

Figure 3: WMRR with different bias $b$ on ICEWS18\.

##### Predictability

To better understand the predictability of high\-strikingness events, we introduce Neighborhood Overlap \($NO_f$\), a structural metric that quantifies the richness of historical interactions between the subject and object entities\. As shown in Table [1](https://arxiv.org/html/2605.13153#S4.T1), within each high\-strikingness interval, events with higher $NO_f$ \(richer historical evidence\) are consistently more predictable than those with lower $NO_f$\. This confirms that even among outstanding events, those supported by sufficient historical evidence remain learnable\. The findings validate that our strikingness measure aligns not only with event rarity but also with predictive difficulty rooted in evidence scarcity\. The definition of $NO_f$ and more analysis are provided in Appendix [E](https://arxiv.org/html/2605.13153#A5)\.

### 4\.3Strikingness\-Aware Evaluation for TKGR

To mitigate the low\-strikingness bias in the dataset and more comprehensively evaluate the ability of TKGR models to forecast future events, we propose a striking\-aware evaluation framework\. We evaluate existing models using weighted MRR \(WMRR\) and weighted Hits@k\. The hyperparameter $b$ determines the extent to which the evaluation emphasizes the model’s ability to predict outstanding events\. By adjusting $b$, the framework shifts weight toward or away from outstanding events, thereby controlling the balance between performance on all events and performance on outstanding events\. Figure [3](https://arxiv.org/html/2605.13153#S4.F3) shows the WMRR of the models on ICEWS18 under different $b$\. A small $b$ means that WMRR focuses on the model’s ability to predict outstanding events\. When $b$ becomes very large, i\.e\., $b \geq 100$, WMRR is close to the original MRR\. Moreover, the relative ordering of models reverses as $b$ changes\. When $b > 0.1$, Ensemble outperforms LogCL, and TiRGN outperforms RE\-GCN\. When $b < 0.1$, however, LogCL surpasses Ensemble, and RE\-GCN outperforms TiRGN\. This demonstrates that LogCL has a superior ability to predict outstanding events compared to the Ensemble method, and a similar conclusion holds for RE\-GCN and TiRGN\.
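The role of $b$ can be illustrated with a small sketch\. The exact weighting is defined earlier in the paper and is not restated in this section, so the code below assumes a hypothetical weight $w_i = sk_i + b$ that reproduces the two limiting behaviours described above: a small $b$ makes the score track the striking events, and a large $b$ makes it approach the plain MRR\. The function names and the toy data are illustrative assumptions, not the released implementation\.

```python
def mrr(ranks):
    """Plain mean reciprocal rank over 1-based ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def weighted_mrr(ranks, strikingness, b):
    """Strikingness-weighted MRR under the ASSUMED weight w_i = sk_i + b.

    This weighting is a stand-in chosen for illustration; the paper's
    own definition may differ.
    """
    if b < 0:
        raise ValueError("b must be non-negative")
    weights = [sk + b for sk in strikingness]
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero")
    return sum(w / r for w, r in zip(weights, ranks)) / total


# Toy data (invented): trivial events are ranked well, striking ones poorly.
ranks = [1, 1, 2, 10, 20]
sk = [0.01, 0.02, 0.05, 0.80, 0.95]

plain = mrr(ranks)
for b in (0.01, 0.1, 1.0, 100.0):
    print(f"b={b:<6} WMRR={weighted_mrr(ranks, sk, b):.4f}  MRR={plain:.4f}")
```

On this toy data the weighted score increases with $b$ and converges toward the unweighted MRR, which mirrors the trend reported for Figure 3 under the stated assumption\.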

We empirically select $b = 0.1$ to provide a unified evaluation result and facilitate fair comparison\. This setting balances the metric’s emphasis across events with varying levels of strikingness\. Table [2](https://arxiv.org/html/2605.13153#S4.T2) shows the experimental results of the models under both the original and the striking\-aware evaluation frameworks\. The absolute values of the striking\-aware metrics are significantly lower than those of the original evaluation framework, which aligns more closely with the recognized difficulty of the future event forecasting task\. Nevertheless, our motivation is not to lower the scores on purpose\. Instead, we aim to reveal the models’ comprehensive predictive capabilities through the differences between the original and striking\-aware metrics\. Furthermore, adjustments to other hyperparameters have a negligible impact on model rankings, demonstrating the robustness of the strikingness\-aware evaluation framework\. Further details are provided in Appendix [F](https://arxiv.org/html/2605.13153#A6)\.

As shown in Table [2](https://arxiv.org/html/2605.13153#S4.T2), in terms of \(W\)MRR, path\-based methods exhibit a reduction of over 30% on all datasets, with the heuristic baseline Recurrency showing a particularly large decrease, close to 50% on ICEWS14\. In contrast, methods based on evolutionary representations experience a notably smaller decline, with reductions of roughly 30% or less\. Among them, LogCL achieves the most robust performance, with its reduction consistently at around 20%\. This suggests that the existing path\-based methods have not demonstrated the claimed multi\-hop reasoning on TKGs\. Methods based on LLMs behave similarly to path\-based approaches because the context window size constrains their reasoning, with the input typically limited to first\-order neighborhood information\.
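The relative decrease $\Delta$ can be reproduced from the ORG and SK columns\. The sketch below assumes $\Delta = (\text{ORG} - \text{SK}) / \text{ORG}$, which is inferred from the caption of Table 2 rather than stated as a formula in this section; the helper name is hypothetical, and the numbers are the \(W\)MRR values on ICEWS14 copied from Table 2\.

```python
def relative_decrease(org, sk):
    """Relative drop, in percent, from the original to the striking-aware score."""
    return (org - sk) / org * 100.0


# (ORG, SK, reported delta in percent) for (W)MRR on ICEWS14, from Table 2.
rows = {
    "Recurrency": (37.12, 19.47, 47.55),
    "TLogic": (42.52, 24.92, 41.39),
    "RE-GCN": (42.43, 29.92, 29.48),
    "LogCL": (48.84, 38.17, 21.85),
}
for model, (org, sk, reported) in rows.items():
    computed = relative_decrease(org, sk)
    print(f"{model:<10} computed={computed:.2f}%  reported={reported:.2f}%")
```

Under this assumed definition, the computed values agree with the reported ones to two decimal places for the rows shown\.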

The ensemble method still achieves state\-of\-the\-art performance on most striking\-aware metrics, but its improvement is smaller than under the original metrics\. This is because the ensemble approach primarily enhances predictive performance on events with low strikingness, whereas the striking\-aware evaluation framework focuses on the model’s ability to predict outstanding events\. By limiting the gains that come from low\-strikingness events, the striking\-aware evaluation allows researchers to assess progress in the field more objectively\. The results of other ensemble combinations are reported in Appendix [G](https://arxiv.org/html/2605.13153#A7)\.

### 4\.4Outstanding Event Mining

By quantifying the strikingness of future events and extracting the top\-ranked ones, we can mine outstanding events\. Example events from ICEWS18 with different strikingness are shown in Table [3](https://arxiv.org/html/2605.13153#S4.T3)\. Events with high strikingness are more likely to capture people’s attention and may have a significant impact on the future\.

| Events | $sk$ |
| --- | --- |
| $\text{Commando (Kosovo)} \xrightarrow[\text{2018-10-04}]{\text{Occupy territory}} \text{Serbia}$ | 1.0 |
| $\text{Taliban} \xrightarrow[\text{2018-9-28}]{\text{Threaten with military force}} \text{Military (Afghanistan)}$ | 0.819 |
| $\text{Buhari} \xrightarrow[\text{2018-10-12}]{\text{Mobilize or increase armed forces}} \text{Nigeria}$ | 0.704 |
| $\text{Malaysia} \xrightarrow[\text{2018-10-15}]{\text{Sign agreement}} \text{Hong Kong}$ | 0.682 |
| $\text{Russia} \xrightarrow[\text{2018-9-29}]{\text{Engage in material cooperation}} \text{China}$ | 0.575 |
| $\text{Moon Jae-in} \xrightarrow[\text{2018-10-09}]{\text{intent to negotiate}} \text{Italy}$ | 0.349 |
| $\text{India} \xrightarrow[\text{2018-10-05}]{\text{Make statement}} \text{Russia}$ | 0.103 |
| $\text{France} \xrightarrow[\text{2018-10-27}]{\text{Consult}} \text{Germany}$ | 0.005 |

Table 3: Case events with different strikingness\.

To validate the effectiveness of the proposed RSMF in measuring event strikingness, we conducted a human evaluation study involving six volunteers\. The six volunteers are all graduate students \(Master’s or Ph\.D\. candidates\) in artificial intelligence, with research backgrounds spanning temporal knowledge graphs, knowledge graphs, relation extraction, and named entity recognition\. Specifically, we randomly sampled 3000 target events and one peer event for each target event, and provided the contextual information for all events\. The volunteers were asked to judge which event in each pair exhibited higher strikingness based on the given context\.

| H1 | H2 | H3 | H4 | H5 | H6 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| 0.683 | 0.726 | 0.698 | 0.667 | 0.703 | 0.696 | 0.696 |

Table 4: Cohen’s Kappa between humans and RSMF\.

Table [4](https://arxiv.org/html/2605.13153#S4.T4) reports the Cohen’s Kappa coefficients between the evaluations of individual annotators and the proposed RSMF\. Cohen’s Kappa is a widely used measure of inter\-rater agreement, with values ranging from \-1 to 1, where values below 0 indicate less\-than\-chance agreement and 1 indicates perfect agreement\. The average Cohen’s Kappa coefficient between RSMF and the human evaluators is 0\.696, indicating that RSMF achieves “substantial agreement” with human evaluators on the strikingness of events\.
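For reference, Cohen’s Kappa compares the observed agreement $p_o$ with the agreement $p_e$ expected by chance, $\kappa = (p_o - p_e) / (1 - p_e)$\. The sketch below computes it for one annotator against RSMF on pairwise judgments; the label encoding and the toy data are assumptions made for illustration and are not the study’s annotations\.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Toy pairwise judgments (invented): 'T' means the target event is judged
# more striking, 'P' means the peer event is judged more striking.
human = ["T", "T", "P", "P", "T", "P", "T", "P", "T", "T"]
rsmf = ["T", "T", "P", "T", "T", "P", "T", "P", "P", "T"]
kappa = cohens_kappa(human, rsmf)
print(f"kappa = {kappa:.3f}")
```

Repeating this computation per annotator and averaging the six coefficients would give a summary of the kind reported in Table 4\.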

We also conduct analysis experiments to validate four aspects of outstanding events: Novelty, Rarity, Context Dependence, and Time Sensitivity\. The characteristics analysis is provided in Appendix[F](https://arxiv.org/html/2605.13153#A6)\.

## 5Conclusion

We observe that the current TKGR evaluation overweights trivial repetitive events, overshadowing models’ ability to predict rare yet meaningful ones\. To rectify this, we introduce a strikingness\-aware evaluation framework that quantifies event strikingness through rule\-based peer\-event comparison and incorporates it as a dynamic weight into ranking metrics\. Experiments on four benchmarks demonstrate: 1\) a consistent performance drop as event strikingness increases, 2\) a clear divide between path\-based methods \(strong on low\-strikingness events\) and representation\-based methods \(superior on high\-strikingness events\), and 3\) that a simple ensemble method we design mainly improves the prediction of repetitive events, with limited gains on rare, striking events\. Our framework recenters evaluation on the forecasting of outstanding events, offering a rigorous and meaningful benchmark for TKGR, and calls for future work to prioritize reasoning beyond repetition\.

## Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant No\. 62276110 and in part by the fund of Joint Laboratory of HUST and Pingan Property & Casualty Research \(HPL\)\. The authors would also like to thank the anonymous reviewers for their comments on improving the quality of this paper\.

## References

- F\. Angiulli, F\. Fassetti, and L\. Palopoli \(2009\)Detecting outlying properties of exceptional objects\.ACM Transactions on Database Systems34\(1\),pp\. 7:1–7:62\.External Links:[Link](https://doi.org/10.1145/1508857.1508864),[Document](https://dx.doi.org/10.1145/1508857.1508864)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2605.13153#S3.SS2.SSS0.Px3.p2.2)\.
- T\. Aven \(2013\)On the meaning of a black swan in a risk context\.Safety science57,pp\. 44–51\.Cited by:[§1](https://arxiv.org/html/2605.13153#S1.p1.1)\.
- A\. Bordes, N\. Usunier, A\. García\-Durán, J\. Weston, and O\. Yakhnenko \(2013\)Translating embeddings for modeling multi\-relational data\.In Proceedings of the 27th Conference on Neural Information Processing Systems \(NeurIPS\),External Links:[Link](https://proceedings.neurips.cc/paper/2013/hash/1cecc7a77928ca8133fa24680a88d2f9-Abstract.html)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1)\.
- W\. Chen, H\. Wan, Y\. Wu, S\. Zhao, J\. Cheng, Y\. Li, and Y\. Lin \(2024\)Local\-global history\-aware contrastive learning for temporal knowledge graph reasoning\.In40th IEEE International Conference on Data Engineering \(ICDE\),External Links:[Link](https://doi.org/10.1109/ICDE60146.2024.00062),[Document](https://dx.doi.org/10.1109/ICDE60146.2024.00062)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.13153#S4.SS1.SSS0.Px2.p1.1)\.
- T\. Dettmers, P\. Minervini, P\. Stenetorp, and S\. Riedel \(2018\)Convolutional 2d knowledge graph embeddings\.InProceedings of the 32nd Conference on Artificial Intelligence, \(AAAI\),External Links:[Link](https://doi.org/10.1609/aaai.v32i1.11573),[Document](https://dx.doi.org/10.1609/AAAI.V32I1.11573)Cited by:[§1](https://arxiv.org/html/2605.13153#S1.p2.3),[§1](https://arxiv.org/html/2605.13153#S1.p3.1)\.
- H\. Dong, Z\. Ning, P\. Wang, Z\. Qiao, P\. Wang, Y\. Zhou, and Y\. Fu \(2023\)Adaptive path\-memory network for temporal knowledge graph reasoning\.InProceedings of the 32nd International Joint Conference on Artificial Intelligence \(IJCAI\),External Links:[Link](https://doi.org/10.24963/ijcai.2023/232),[Document](https://dx.doi.org/10.24963/IJCAI.2023/232)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Gastinger, S\. Huang, M\. Galkin, E\. Loghmani, A\. Parviz, F\. Poursafaei, J\. Danovitch, E\. Rossi, I\. Koutis, H\. Stuckenschmidt, R\. Rabbany, and G\. Rabusseau \(2024a\)TGB 2\.0: A benchmark for learning on temporal knowledge graphs and heterogeneous graphs\.In Proceedings of 38th Conference on Neural Information Processing Systems \(NeurIPS\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/fda026cf2423a01fcbcf1e1e43ee9a50-Abstract-Datasets%5C_and%5C_Benchmarks%5C_Track.html)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Gastinger, C\. Meilicke, F\. Errica, T\. Sztyler, A\. Schülke, and H\. Stuckenschmidt \(2024b\)History repeats itself: A baseline for temporal knowledge graph forecasting\.InProceedings of the 33rd International Joint Conference on Artificial Intelligence \(IJCAI\),External Links:[Link](https://www.ijcai.org/proceedings/2024/444)Cited by:[§1](https://arxiv.org/html/2605.13153#S1.p2.3),[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.13153#S4.SS1.SSS0.Px2.p1.1)\.
- J\. Gastinger, T\. Sztyler, L\. Sharma, A\. Schuelke, and H\. Stuckenschmidt \(2023\)Comparing apples and oranges? on the evaluation of methods for temporal knowledge graph forecasting\.InMachine Learning and Knowledge Discovery in Databases: Research Track \- European Conference \(ECML\-PKDD\),Lecture Notes in Computer Science, Vol\.14171,pp\. 533–549\.External Links:[Link](https://doi.org/10.1007/978-3-031-43418-1%5C_32),[Document](https://dx.doi.org/10.1007/978-3-031-43418-1%5F32)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.13153#S4.SS1.SSS0.Px2.p1.1)\.
- Z\. Han, P\. Chen, Y\. Ma, and V\. Tresp \(2021\)Explainable subgraph reasoning for forecasting on temporal knowledge graphs\.In9th International Conference on Learning Representations \(ICLR\),External Links:[Link](https://openreview.net/forum?id=pGIHq1m7PU)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1)\.
- N\. Hassan, A\. Sultana, Y\. Wu, G\. Zhang, C\. Li, J\. Yang, and C\. Yu \(2014\)Data in, fact out: automated monitoring of facts by factwatcher\.Proceedings of the VLDB Endowment7\(13\),pp\. 1557–1560\.External Links:[Link](http://www.vldb.org/pvldb/vol7/p1557-hassan.pdf),[Document](https://dx.doi.org/10.14778/2733004.2733029)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px2.p1.1)\.
- R\. Huang, W\. Wei, X\. Qu, S\. Zhang, D\. Chen, and Y\. Cheng \(2024\)Confidence is not timeless: modeling temporal validity for rule\-based temporal knowledge graph forecasting\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\),External Links:[Link](https://doi.org/10.18653/v1/2024.acl-long.580),[Document](https://dx.doi.org/10.18653/V1/2024.ACL-LONG.580)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2605.13153#S3.SS2.SSS0.Px2.p4.6)\.
- P\. Jain, S\. Rathi, Mausam, and S\. Chakrabarti \(2020\)Temporal knowledge base completion: new algorithms and evaluation protocols\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),External Links:[Link](https://doi.org/10.18653/v1/2020.emnlp-main.305),[Document](https://dx.doi.org/10.18653/V1/2020.EMNLP-MAIN.305)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1)\.
- W\. Jin, M\. Qu, X\. Jin, and X\. Ren \(2020\)Recurrent event network: autoregressive structure inference over temporal knowledge graphs\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),External Links:[Link](https://doi.org/10.18653/v1/2020.emnlp-main.541),[Document](https://dx.doi.org/10.18653/V1/2020.EMNLP-MAIN.541)Cited by:[§1](https://arxiv.org/html/2605.13153#S1.p1.1)\.
- C\. Kervadec, G\. Antipov, M\. Baccouche, and C\. Wolf \(2021\)Roses are red, violets are blue… but should VQA expect them to?\.InIEEE Conference on Computer Vision and Pattern Recognition \(CVPR\),External Links:[Link](https://openaccess.thecvf.com/content/CVPR2021/html/Kervadec%5C_Roses%5C_Are%5C_Red%5C_Violets%5C_Are%5C_Blue...%5C_but%5C_Should%5C_VQA%5C_Expect%5C_CVPR%5C_2021%5C_paper.html),[Document](https://dx.doi.org/10.1109/CVPR46437.2021.00280)Cited by:[§1](https://arxiv.org/html/2605.13153#S1.p2.3)\.
- D\. Lee, K\. Ahrabian, W\. Jin, F\. Morstatter, and J\. Pujara \(2023\)Temporal knowledge graph forecasting without knowledge using in\-context learning\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),External Links:[Link](https://doi.org/10.18653/v1/2023.emnlp-main.36),[Document](https://dx.doi.org/10.18653/V1/2023.EMNLP-MAIN.36)Cited by:[§1](https://arxiv.org/html/2605.13153#S1.p2.3),[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.13153#S4.SS1.SSS0.Px2.p1.1)\.
- H\. Li, Z\. Yang, Y\. Ma, Y\. Bin, Y\. Yang, and T\. Chua \(2024\)MM\-forecast: A multimodal approach to temporal event forecasting with large language models\.InProceedings of the 32nd ACM International Conference on Multimedia \(MM\),pp\. 2776–2785\.External Links:[Link](https://doi.org/10.1145/3664647.3681593),[Document](https://dx.doi.org/10.1145/3664647.3681593)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Li, S\. Sun, and J\. Zhao \(2022\)TiRGN: time\-guided recurrent graph network with local\-global historical patterns for temporal knowledge graph reasoning\.InProceedings of the 31st International Joint Conference on Artificial Intelligence \(IJCAI\),External Links:[Link](https://doi.org/10.24963/ijcai.2022/299),[Document](https://dx.doi.org/10.24963/IJCAI.2022/299)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.13153#S4.SS1.SSS0.Px2.p1.1)\.
- Z\. Li, X\. Jin, W\. Li, S\. Guan, J\. Guo, H\. Shen, Y\. Wang, and X\. Cheng \(2021\)Temporal knowledge graph reasoning based on evolutional representation learning\.InThe 44th International ACM SIGIR Conference on Research and Development in Information Retrieval \(SIGIR\),External Links:[Link](https://doi.org/10.1145/3404835.3462963),[Document](https://dx.doi.org/10.1145/3404835.3462963)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.13153#S4.SS1.SSS0.Px2.p1.1)\.
- K\. Liang, L\. Meng, M\. Liu, Y\. Liu, W\. Tu, S\. Wang, S\. Zhou, X\. Liu, F\. Sun, and K\. He \(2024\)A survey of knowledge graph reasoning on graph types: static, dynamic, and multi\-modal\.IEEE Transactions on Pattern Analysis and Machine Intelligence \(TPAMI\)46\(12\),pp\. 9456–9478\.External Links:[Link](https://doi.org/10.1109/TPAMI.2024.3417451),[Document](https://dx.doi.org/10.1109/TPAMI.2024.3417451)Cited by:[§1](https://arxiv.org/html/2605.13153#S1.p2.3)\.
- R\. Liao, X\. Jia, Y\. Li, Y\. Ma, and V\. Tresp \(2024\)GenTKG: generative forecasting on temporal knowledge graph with large language models\.InFindings of the Association for Computational Linguistics \(NAACL\),External Links:[Link](https://doi.org/10.18653/v1/2024.findings-naacl.268),[Document](https://dx.doi.org/10.18653/V1/2024.FINDINGS-NAACL.268)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.13153#S4.SS1.SSS0.Px2.p1.1)\.
- Y\. Liu, Y\. Ma, M\. Hildebrandt, M\. Joblin, and V\. Tresp \(2022\)TLogic: temporal logical rules for explainable link forecasting on temporal knowledge graphs\.In36th Conference on Artificial Intelligence \(AAAI\),External Links:[Link](https://doi.org/10.1609/aaai.v36i4.20330),[Document](https://dx.doi.org/10.1609/AAAI.V36I4.20330)Cited by:[§B\.2](https://arxiv.org/html/2605.13153#A2.SS2.p1.3),[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2605.13153#S3.SS2.SSS0.Px1.p2.3),[§4\.1](https://arxiv.org/html/2605.13153#S4.SS1.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2605.13153#S4.SS1.SSS0.Px3.p1.4)\.
- Z\. Liu, L\. Tan, M\. Li, Y\. Wan, H\. Jin, and X\. Shi \(2023\)SiMFy: A simple yet effective approach for temporal knowledge graph reasoning\.InFindings of the Association for Computational Linguistics \(EMNLP\),External Links:[Link](https://doi.org/10.18653/v1/2023.findings-emnlp.249),[Document](https://dx.doi.org/10.18653/V1/2023.FINDINGS-EMNLP.249)Cited by:[§3\.3](https://arxiv.org/html/2605.13153#S3.SS3.p3.3)\.
- Y\. Ma, C\. Ye, Z\. Wu, X\. Wang, Y\. Cao, and T\. Chua \(2023\)Context\-aware event forecasting via graph disentanglement\.InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining \(KDD\),pp\. 1643–1652\.External Links:[Link](https://doi.org/10.1145/3580305.3599285),[Document](https://dx.doi.org/10.1145/3580305.3599285)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Meilicke, P\. Betz, and H\. Stuckenschmidt \(2021\)Why a naive way to combine symbolic and latent knowledge base completion works surprisingly well\.In3rd Conference on Automated Knowledge Base Construction \(AKBC\),External Links:[Link](https://doi.org/10.24432/C5PK5V),[Document](https://dx.doi.org/10.24432/C5PK5V)Cited by:[§3\.3](https://arxiv.org/html/2605.13153#S3.SS3.p3.3)\.
- S\. Ott, P\. Betz, D\. Stepanova, M\. H\. Gad\-Elrab, C\. Meilicke, and H\. Stuckenschmidt \(2023\)Rule\-based knowledge graph completion with canonical models\.InProceedings of the 32nd ACM International Conference on Information and Knowledge Management \(CIKM\),External Links:[Link](https://doi.org/10.1145/3583780.3615042),[Document](https://dx.doi.org/10.1145/3583780.3615042)Cited by:[§3\.2](https://arxiv.org/html/2605.13153#S3.SS2.SSS0.Px2.p4.6)\.
- D\. Ruffinelli, S\. Broscheit, and R\. Gemulla \(2020\)You CAN teach an old dog new tricks\! on training knowledge graph embeddings\.In8th International Conference on Learning Representations \(ICLR\),External Links:[Link](https://openreview.net/forum?id=BkxSmlBFvr)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Sun, J\. Zhong, Y\. Ma, Z\. Han, and K\. He \(2021\)TimeTraveler: reinforcement learning for temporal knowledge graph forecasting\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),External Links:[Link](https://doi.org/10.18653/v1/2021.emnlp-main.655),[Document](https://dx.doi.org/10.18653/V1/2021.EMNLP-MAIN.655)Cited by:[§1](https://arxiv.org/html/2605.13153#S1.p1.1),[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.13153#S4.SS1.SSS0.Px2.p1.1)\.
- Z\. Sun, S\. Vashishth, S\. Sanyal, P\. P\. Talukdar, and Y\. Yang \(2020\)A re\-evaluation of knowledge graph completion methods\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics \(ACL\),External Links:[Link](https://doi.org/10.18653/v1/2020.acl-main.489),[Document](https://dx.doi.org/10.18653/V1/2020.ACL-MAIN.489)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1)\.
- K\. Toutanova and D\. Chen \(2015\)Observed versus latent features for knowledge base and text inference\.InProceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality,pp\. 57–66\.External Links:[Link](https://doi.org/10.18653/v1/W15-4007),[Document](https://dx.doi.org/10.18653/V1/W15-4007)Cited by:[§1](https://arxiv.org/html/2605.13153#S1.p2.3),[§1](https://arxiv.org/html/2605.13153#S1.p3.1)\.
- J\. Wang, K\. Sun, L\. Luo, W\. Wei, Y\. Hu, A\. W\. Liew, S\. Pan, and B\. Yin \(2024\)Large language models\-guided dynamic adaptation for temporal knowledge graph reasoning\.In Proceedings of 38th Conference on Neural Information Processing Systems \(NeurIPS\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/0fd17409385ab9304e5019c6a6eb327a-Abstract-Conference.html)Cited by:[§3\.3](https://arxiv.org/html/2605.13153#S3.SS3.p3.3)\.
- Y\. Wu, P\. K\. Agarwal, C\. Li, J\. Yang, and C\. Yu \(2012\)On ”one of the few” objects\.InThe 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining \(KDD\),pp\. 1487–1495\.External Links:[Link](https://doi.org/10.1145/2339530.2339762),[Document](https://dx.doi.org/10.1145/2339530.2339762)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Xia, D\. Wang, Q\. Liu, L\. Wang, S\. Wu, and X\. Zhang \(2024\)Chain\-of\-history reasoning for temporal knowledge graph forecasting\.InFindings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11\-16, 2024,pp\. 16144–16159\.External Links:[Link](https://doi.org/10.18653/v1/2024.findings-acl.955),[Document](https://dx.doi.org/10.18653/V1/2024.FINDINGS-ACL.955)Cited by:[§2](https://arxiv.org/html/2605.13153#S2.SS0.SSS0.Px1.p1.1)\.
- H. Xiao, Y. Li, Y. Wang, P. Karras, K. Mouratidis, and N. R. Avlona (2024). How to avoid jumping to conclusions: measuring the robustness of outstanding facts in knowledge graphs. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp. 3539–3550. [Link](https://doi.org/10.1145/3637528.3671763)
- Y. Xu, J. Ou, H. Xu, and L. Fu (2023). Temporal knowledge graph reasoning with historical contrastive learning. In 37th Conference on Artificial Intelligence (AAAI). [Link](https://doi.org/10.1609/aaai.v37i4.25601)
- Y. Yang, Y. Li, P. Karras, and A. K. H. Tung (2021). Context-aware outstanding fact mining from knowledge graphs. In The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 2006–2016. [Link](https://doi.org/10.1145/3447548.3467272)
- G. Zhang, D. Jimenez, and C. Li (2018). Maverick: discovering exceptional facts from knowledge graphs. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD), pp. 1317–1332. [Link](https://doi.org/10.1145/3183713.3183730)
- S. Zheng, H. Yin, T. Chen, Q. V. H. Nguyen, W. Chen, and L. Zhao (2023). DREAM: adaptive reinforcement learning based on attention mechanism for temporal knowledge graph reasoning. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 1578–1588. [Link](https://doi.org/10.1145/3539618.3591671)
- C. Zhu, M. Chen, C. Fan, G. Cheng, and Y. Zhang (2021). Learning from history: modeling temporal knowledge graphs with sequential copy-generation networks. In 35th Conference on Artificial Intelligence (AAAI). [Link](https://doi.org/10.1609/aaai.v35i5.16604)

## Appendix A Proof of Strikingness Boundedness under L2 Normalization

###### Lemma 1 (Boundedness of $sk_f^{be}$ and $sk_f$).

Given the L2-normalized score vector $\mathbf{v}=\frac{\mathbf{sc}^{be}_{f}}{\|\mathbf{sc}^{be}_{f}\|_{2}}$ with all $v_i\geq 0$ and $\|\mathbf{v}\|_{2}=1$, the element-wise strikingness $sk_f^{be}$ defined in Eq. (8) satisfies

$$0\leq sk_f^{be}\leq 1.$$

Consequently, the overall strikingness $sk_f$ defined in Eq. (9) also lies in $[0,1]$:

$$0\leq sk_f\leq 1.$$

###### Proof.

We prove the two inequalities separately.

##### 1. Non-negativity of $sk_f^{be}$.

By definition,

$$sk_f^{be}=\sum_{i}v_i(v_i-v_f)\cdot\mathbb{I}(v_i>v_f),$$

where $v_f$ is the score of the target event. Since $v_i\geq 0$ and the indicator function $\mathbb{I}(v_i>v_f)$ ensures that only terms with $v_i>v_f$ are included, each term $v_i(v_i-v_f)$ is non-negative. Hence $sk_f^{be}\geq 0$.

##### 2. Upper bound of $sk_f^{be}$.

Let $S=\{i\mid v_i>v_f\}$. Then

$$sk_f^{be}=\sum_{i\in S}v_i(v_i-v_f).$$

Because $v_f\geq 0$, we have $v_i-v_f\leq v_i$ for all $i\in S$, and therefore

$$v_i(v_i-v_f)\leq v_i^{2}.$$

Summing over $i\in S$ yields

$$sk_f^{be}\leq\sum_{i\in S}v_i^{2}.$$

Since $\mathbf{v}$ is a unit vector in the L2 sense and all its components are non-negative,

$$\sum_{i}v_i^{2}=1\quad\text{and}\quad\sum_{i\in S}v_i^{2}\leq 1.$$

Thus $sk_f^{be}\leq 1$.

##### 3. Boundedness of $sk_f$.

Recall that

$$sk_f=\alpha^{s}sk_f^{s}+\alpha^{o}sk_f^{o}+\alpha^{r}sk_f^{r},$$

where $\alpha^{s},\alpha^{o},\alpha^{r}\in[0,1]$ and $\alpha^{s}+\alpha^{o}+\alpha^{r}=1$. Since each $sk_f^{be}\in[0,1]$ for $be\in\{s,o,r\}$, we have

$$0\leq sk_f\leq\alpha^{s}\cdot 1+\alpha^{o}\cdot 1+\alpha^{r}\cdot 1=1.$$

This completes the proof. ∎
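The lemma can also be probed numerically. The following is a minimal sketch, not the authors' implementation: it assumes the target score $v_f$ is one entry of the normalized vector, and the function names (`l2_normalize`, `element_strikingness`, `overall_strikingness`, `check_bounds`) are hypothetical.

```python
import math
import random


def l2_normalize(scores):
    """Return scores / ||scores||_2, or the zero vector if the norm is zero."""
    norm = math.sqrt(sum(x * x for x in scores))
    if norm == 0.0:
        return [0.0] * len(scores)
    return [x / norm for x in scores]


def element_strikingness(v, target_index):
    """sk_f^{be} = sum_i v_i * (v_i - v_f), restricted to entries with v_i > v_f.

    Assumption: the target event's score v_f is the entry v[target_index].
    """
    v_f = v[target_index]
    return sum(v_i * (v_i - v_f) for v_i in v if v_i > v_f)


def overall_strikingness(sk_s, sk_o, sk_r, alpha_s, alpha_o, alpha_r):
    """Convex combination of the subject, object, and relation values."""
    assert abs(alpha_s + alpha_o + alpha_r - 1.0) < 1e-9
    return alpha_s * sk_s + alpha_o * sk_o + alpha_r * sk_r


def check_bounds(trials=1000, seed=0):
    """Randomly probe the lemma; returns True if no violation was found."""
    rng = random.Random(seed)
    for _ in range(trials):
        raw = [rng.random() for _ in range(rng.randint(1, 20))]
        v = l2_normalize(raw)
        sk = element_strikingness(v, rng.randrange(len(v)))
        if not (0.0 <= sk <= 1.0 + 1e-12):
            return False
    return True
```

For instance, the raw scores $(3, 4)$ normalize to $(0.6, 0.8)$; taking the first entry as the target gives $0.8\cdot(0.8-0.6)=0.16$, while taking the second entry as the target gives 0 because no peer scores higher.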

## Appendix B Experimental Setup

### B.1 Details of Datasets

The ICEWS datasets provide time\-stamped political and socio\-economic events curated from real\-world interactions, while GDELT records a broad range of global events with fine\-grained temporal annotations\. For the TKGR task, the training, validation, and test sets are divided strictly in chronological order\. It is worth mentioning that datasets like YAGO and WIKI are not used because they transform time\-spanning facts into instantaneous ones, which does not align with our focus on event\-centric temporal knowledge graphs\.

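Table 5: Details of the TKG datasets.

| Dataset | ICEWS14 | ICEWS18 | ICEWS05-15 | GDELT |
| --- | --- | --- | --- | --- |
| Entities | 6,869 | 23,033 | 10,094 | 7,691 |
| Relations | 230 | 256 | 251 | 240 |
| Train | 74,845 | 373,018 | 368,868 | 1,734,399 |
| Valid | 8,514 | 45,995 | 46,302 | 238,765 |
| Test | 7,371 | 49,545 | 46,159 | 305,241 |
| Granularity | 24 hours | 24 hours | 24 hours | 15 mins |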

### B.2 Implementation Details

Our models were implemented using Python 3.10 and the PyTorch 1.13.1 framework. All experiments were conducted on a server equipped with an NVIDIA RTX 3090 Ti GPU with 24 GB of memory, an Intel(R) i9-12900K CPU, and 256 GB of RAM. The software environment includes CUDA 11.6 and cuDNN 8.4. For the ensemble method, we perform a grid search for the hyperparameter $\eta$ over the range $[0,1]$ with a step size of $0.1$, and determine its optimal value by evaluating performance on the validation set of each dataset. We employ the rule mining framework from TLogic Liu et al. \[[2022](https://arxiv.org/html/2605.13153#bib.bib27)\] to learn and extract temporal rules, with the distinction that we focus exclusively on rules of length 1. We confine the learning of rules, their confidence scores, and the calculation of strikingness strictly to the training set before each query timestamp, thereby preventing any test data leakage.
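The grid search over $\eta$ described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the ensemble is assumed here to be a convex combination $\eta\cdot a+(1-\eta)\cdot b$ of two models' candidate scores, validation MRR is assumed as the selection criterion, and all function names are hypothetical.

```python
def ensemble_scores(scores_a, scores_b, eta):
    """Hypothetical ensemble: convex combination of two models' candidate scores."""
    return [eta * a + (1.0 - eta) * b for a, b in zip(scores_a, scores_b)]


def rank_of(scores, gold):
    """1-based rank of the gold candidate (ties resolved in its favor)."""
    return 1 + sum(1 for s in scores if s > scores[gold])


def validation_mrr(queries, eta):
    """Mean reciprocal rank over validation queries.

    Each query is a tuple (scores_a, scores_b, gold_index).
    """
    total = 0.0
    for scores_a, scores_b, gold in queries:
        total += 1.0 / rank_of(ensemble_scores(scores_a, scores_b, eta), gold)
    return total / len(queries)


def grid_search_eta(queries, steps=10):
    """Try eta in {0.0, 0.1, ..., 1.0}; return the first value with the best MRR."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda eta: validation_mrr(queries, eta))
```

Building the grid as `i / steps` rather than by repeated addition of 0.1 avoids accumulating floating-point error across the eleven grid points.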

For a target event, $sc_f$ is defined as 0 when its set of peer events is empty. This may appear to assign high strikingness to a target event, which can be justified as follows: 1) If other peer events exist, a high strikingness score is reasonable, as the target event indeed fits the description of being unexpected. 2) If the peer event set for the target is empty, then according to Equation [8](https://arxiv.org/html/2605.13153#S3.E8), its $sk^{be}$ is calculated as 0. This ensures that events truly lacking evidential support are not incorrectly identified as high-strikingness. Moreover, in the ICEWS14 test set, fewer than 15 events (less than 0.2%) have completely empty peer sets (subject, object, and relation). We argue that this negligible proportion of special cases does not affect the validity of our evaluation conclusions.

In addition, if no historical rule supports any candidate replacement for the body element (i.e., $\mathcal{C}^{be}_{f}=\emptyset$), we define $\mathbf{sc}^{be}_{norm}=\mathbf{0}$. Consequently, if the normalized vector is zero, then $sk^{be}_{f}=0$ by Equation [8](https://arxiv.org/html/2605.13153#S3.E8), and the overall strikingness $sk_f$ remains bounded in $[0,1]$.

### B.3 Original Metrics

Given a test event $f=(s,r,o,t)$, its rank is computed by corrupting candidate entities. Specifically, the object entity is replaced by another candidate $e_c$, and the candidate event $f_c=(s,r,e_c,t)$ is scored. By adding the reverse events, i.e., $f'=(o,r^{-1},s,t)$, and replacing $s$ by $e_c$, subject entity prediction is achieved in the same way. Two metrics are widely used to evaluate model performance:

Mean Reciprocal Rank (MRR) evaluates ranking quality by averaging the inverse rank of the correct result:

$$MRR=\frac{1}{2|\mathcal{F}_{test}|}\sum_{f\in\mathcal{F}_{test}}\left(\frac{1}{rank_{f}}+\frac{1}{rank_{f'}}\right)$$

Hits@k (H@k) measures the proportion of queries where the correct result appears in the top-k positions:

$$Hits@k=\frac{1}{2|\mathcal{F}_{test}|}\sum_{f\in\mathcal{F}_{test}}\left(\mathbb{I}\{rank_{f}\leq k\}+\mathbb{I}\{rank_{f'}\leq k\}\right)$$

where $\mathbb{I}\{True\}=1$ and $\mathbb{I}\{False\}=0$.
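Both metrics reduce to simple averages once the ranks are known. A minimal sketch, assuming the ranks of the object query $f$ and of the reverse (subject) query $f'$ have already been computed for every test event; the function names and argument layout are illustrative only.

```python
def mrr(ranks_obj, ranks_subj):
    """Average reciprocal rank over object and subject predictions.

    ranks_obj[i] is rank_f and ranks_subj[i] is rank_{f'} for test event i.
    """
    n = len(ranks_obj)
    total = sum(1.0 / ro + 1.0 / rs for ro, rs in zip(ranks_obj, ranks_subj))
    return total / (2 * n)


def hits_at_k(ranks_obj, ranks_subj, k):
    """Fraction of the 2 * n queries whose correct answer ranks within the top k."""
    n = len(ranks_obj)
    total = sum((ro <= k) + (rs <= k) for ro, rs in zip(ranks_obj, ranks_subj))
    return total / (2 * n)
```

With two test events whose object ranks are 1 and 2 and whose subject ranks are 4 and 1, MRR is $(1+0.25+0.5+1)/4=0.6875$, Hits@1 is $0.5$, and Hits@3 is $0.75$.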

### B.4 Baselines

In this paper, we conducted experiments with three categories of baselines: path-based, representation-based, and LLM-based methods. Previously, inconsistent experimental setups and dataset usage led to unfair comparisons. Therefore, instead of indiscriminately evaluating recent methods, we conducted comparisons under a unified experimental framework, focusing on community-recognized and reproducible approaches ([https://github.com/nec-research/TKG-Forecasting-Evaluation](https://github.com/nec-research/TKG-Forecasting-Evaluation)). Additionally, we evaluate recent state-of-the-art methods, such as LogCL and LLM-based methods, to ensure our study reflects the latest advances in the field. As LLM-based methods directly output the top 100 candidate answers without assigning explicit scores to each candidate, only Hits@k can be computed, whereas MRR is not applicable. Therefore, we did not report the performance of LLM-based methods in the grouped strikingness performance analysis. Moreover, due to the high computational cost of LLM inference, their application to large-scale datasets such as ICEWS05-15 and GDELT remains impractical (see Table [2](https://arxiv.org/html/2605.13153#S4.T2)). The details of the baselines are as follows:

#### B.4.1 Path-based

##### Recurrency

Recurrency assigns scores to candidates by calculating the recency and frequency of events related to the query, requiring only two hyperparameters to be searched in order to achieve baseline performance\.

##### TLogic

TLogic generates answers by learning and applying rules to observed events before the query timestamp and scores the answer candidates relying on the rules’ confidences and time differences\.

##### TITer

TITer treats the historical TKG as the environment and the historical domain as the action space\. Starting from the subject, it employs reinforcement learning to navigate through the graph towards candidate entities\.

#### B.4.2 Representation-based

##### RE\-GCN

RE\-GCN segments the historical TKG into a sequence of KG snapshots based on timestamps\. It models the representations of entities and relations within each snapshot using Graph Convolutional Networks\. Subsequently, the Recurrent Neural Network is employed to capture and compute the temporal evolution of these representations\.

##### TiRGN

TiRGN extends RE\-GCN by introducing a global historical encoder designed to gather repeated historical facts\.

##### LogCL

LogCL leverages contrastive learning between local and global representations to enhance the quality and robustness of the entity representations\.

#### B.4.3 LLM-based

##### ICL

ICL directly leverages the in-context learning ability of LLMs. It formulates historical TKG events into sequential prompts without fine-tuning.

##### GenTKG

GenTKG is a retrieval\-augmented generation framework that combines temporal logical rule\-based retrieval \(TLR\) and few\-shot parameter\-efficient instruction tuning \(FIT\)\. It aligns LLMs with TKG forecasting, outperforming traditional methods with minimal training data\.

## Appendix C Computation Complexity Analysis

In this section, we analyze the computational complexity of the Rule-based Strikingness Measuring Framework (RSMF). Let $N_e$ denote the number of entities, $N_r$ the number of relations, $W$ the historical window size (number of timestamps), $L$ the length of temporal rules (number of events in the rule body), $R$ the number of mined rules, and $E_w$ the average number of events in the historical window. For a target event $f=(s,r,o,t)$, the computation involves:

##### Peer Event Generation

Generating peer events by replacing the subject, object, or relation yields $O(N_e+N_r)$ candidates.

##### Rule Grounding

For each peer event and each rule of length $L$, we search for matching rule bodies in the historical window. The number of possible matches for a length-$L$ rule is bounded by $O\left(E_w\cdot(N_e\cdot W)^{L-1}\right)$, since the first event can be matched to any of the $E_w$ events in the window, and each subsequent event in the rule chain may involve a new entity and timestamp.

![Refer to caption](https://arxiv.org/html/2605.13153v1/x6.png)

![Refer to caption](https://arxiv.org/html/2605.13153v1/x7.png)

![Refer to caption](https://arxiv.org/html/2605.13153v1/x8.png)

![Refer to caption](https://arxiv.org/html/2605.13153v1/x9.png)

Figure 4: Comparing with other strikingness baselines.

![Refer to caption](https://arxiv.org/html/2605.13153v1/x10.png)

![Refer to caption](https://arxiv.org/html/2605.13153v1/x11.png)

![Refer to caption](https://arxiv.org/html/2605.13153v1/x12.png)

Figure 5: Group performances of different models and data volume on ICEWS05-15 and GDELT.

##### Total Complexity

Considering all $R$ rules and $(N_e+N_r)$ peer candidates, the overall complexity per query is $\mathcal{O}\left(R\cdot(N_e+N_r)\cdot E_w\cdot(N_e\cdot W)^{L-1}\right)$. When $L=1$, the term $(N_e\cdot W)^{L-1}$ becomes $O(1)$, simplifying the bound to $\mathcal{O}(R\cdot(N_e+N_r)\cdot E_w)$, which is linear in the number of events in the window and in the entity count. However, for $L\geq 2$, the complexity grows exponentially with $L$, due to the factor $(N_e\cdot W)^{L-1}$.
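To make the gap between $L=1$ and $L\geq 2$ concrete, the bound can be evaluated directly. This is a back-of-the-envelope sketch: the function name is hypothetical, and the bound counts candidate groundings up to a constant factor rather than measuring actual runtime.

```python
def grounding_bound(num_rules, num_entities, num_relations,
                    events_in_window, window_size, rule_length):
    """Evaluate R * (N_e + N_r) * E_w * (N_e * W)^(L - 1)."""
    peers = num_entities + num_relations
    chain_factor = (num_entities * window_size) ** (rule_length - 1)
    return num_rules * peers * events_in_window * chain_factor


def blowup_factor(num_entities, window_size, rule_length):
    """Ratio of the length-L bound to the length-1 bound: (N_e * W)^(L - 1)."""
    return (num_entities * window_size) ** (rule_length - 1)
```

For example, with $N_e=6{,}869$ (ICEWS14 in Table 5) and a hypothetical window of $W=10$ timestamps, moving from $L=1$ to $L=2$ multiplies the bound by $6{,}869\times 10=68{,}690$.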

### C.1 Why First-Order Rules Are Sufficient

While we acknowledge that higher-order temporal rules ($L\geq 2$) can capture more complex multi-hop dependencies, we deliberately restrict RSMF to first-order (length-1) rules for the following reasons, which align with the primary goal of our work: to construct a practical and scalable strikingness-aware evaluation framework rather than to perform exhaustive temporal pattern mining.

Computational Feasibility: As shown above, longer rules lead to exponential growth in grounding complexity, making them infeasible for large-scale TKGs with thousands of entities and fine-grained timestamps (e.g., GDELT with 15-minute granularity).

Adequate Expressiveness for Strikingness: Strikingness is primarily concerned with whether an event is expected given its immediate historical patterns. First-order rules already capture the most direct temporal dependencies (e.g., “if A visited B recently, A may visit B again”), which are sufficient for distinguishing repetitive from outstanding events.

We emphasize that our goal is not to claim that first-order rules are universally optimal for all temporal pattern mining tasks. Rather, we argue that they suffice for the specific purpose of de-emphasizing trivial repetitions in TKGR evaluation. Future work may explore hybrid or approximate higher-order strategies where computational resources permit, but such extensions are orthogonal to the core contribution of this paper.

##### Beyond Circular Reasoning

A legitimate concern is whether our strikingness measure creates a self\-fulfilling evaluation: if high strikingness indicates lack of local evidence, methods relying on such evidence might appear to fail by definition\. We argue this is not the case\.

- Path-based methods are assumed to perform multi-hop reasoning, yet they perform worst on high-strikingness events. In fact, events that can only be supported by multi-hop evidence, rather than one-hop evidence, are more likely to be marked as high-strikingness events by RSMF. This reveals that their real strength lies in fitting shallow, repetitive patterns rather than multi-hop reasoning.
- Representation-based methods also decline with increasing strikingness, though less severely, confirming that high strikingness corresponds to a general prediction challenge, not merely a lack of local evidence.
- Neighborhood overlap ($NO_f$) analysis shows that even among high-strikingness events, those with richer historical interactions remain more predictable. This indicates RSMF captures difficulty beyond mere evidence absence.

Using length-1 rules ensures our measure is method-agnostic and focuses on immediate temporal expectations. The observed performance differences thus reflect true capability gaps, not circularity. Overall, RSMF does not penalize methods for lacking the evidence it uses; it exposes which methods can reason beyond trivial repetitions.

## Appendix D More Group Results and Analysis

Figure [5](https://arxiv.org/html/2605.13153#A3.F5) shows the grouped MRR results for baseline models and our proposed ensemble model on the ICEWS05-15 and GDELT datasets. Consistent with our analytical findings in the main manuscript, path-based methods yield better predictions for low-strikingness events, while representation-based approaches excel in high-strikingness scenarios. The ensemble method strikes a trade-off at both ends of the strikingness range and demonstrates enhanced performance in the mid-range.

### D.1 Measuring Events with Frequency and Recency

To contextualize RSMF, we compare it with two intuitive baseline measures: Frequency Inverse (Freq Inv) and Temporal Inverse (Temp Inv). Both are designed as simple proxies for event strikingness, yet our analysis reveals critical limitations that underscore the necessity of RSMF's rule-based, peer-comparative design:

Frequency Inverse (Freq Inv) measures the historical uncommonness of the $(s,r)$ pair:

$$SK_{\text{freq}}(f)=1-\frac{\text{count}_{\mathcal{H}}(s,r)}{\max_{(s',r')}\text{count}_{\mathcal{H}}(s',r')}. \tag{13}$$

Temporal Inverse (Temp Inv) considers the recency of the exact event $(s,r,o)$:

$$SK_{\text{temp}}(f)=1-\exp\left(-\lambda\cdot(t-t_{\text{last}})\right), \tag{14}$$

where $t_{\text{last}}$ is its most recent occurrence time, and $\lambda=0.005$. If the event never occurred, $SK_{\text{temp}}(f)=1$.
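The two baseline measures in Eq. (13) and Eq. (14) can be sketched in a few lines. This is a minimal illustration, assuming the history $\mathcal{H}$ is a list of `(s, r, o, t)` tuples and that timestamps are numeric and share the unit in which $\lambda$ is expressed; the function names and the behavior on an empty history are assumptions.

```python
import math
from collections import Counter


def freq_inv(history, s, r):
    """Eq. (13): one minus the (s, r) count divided by the largest pair count."""
    counts = Counter((hs, hr) for hs, hr, _ho, _ht in history)
    if not counts:
        return 1.0  # assumption: an empty history makes every pair maximally uncommon
    return 1.0 - counts[(s, r)] / max(counts.values())


def temp_inv(history, s, r, o, t, lam=0.005):
    """Eq. (14): one minus an exponential decay of the time since the last repeat."""
    past = [ht for hs, hr, ho, ht in history
            if (hs, hr, ho) == (s, r, o) and ht < t]
    if not past:
        return 1.0  # the event never occurred before t
    return 1.0 - math.exp(-lam * (t - max(past)))
```

Note that `freq_inv` ignores the object entirely and `temp_inv` returns 1 for every first-time event, which is one way to see why both measures can place many test events at the top of the scale.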

Figure [4](https://arxiv.org/html/2605.13153#A3.F4) shows the group results of different strikingness measurements.

##### Distribution Violates the Rarity Principle

Visual inspection of the volume distributions reveals that both Freq Inv and Temp Inv assign the maximum score (1.0) to a substantially larger proportion of test events than RSMF does. In RSMF's grouping, the volume of events decays sharply as strikingness increases, producing the expected long-tail distribution where truly outstanding events are rare. In contrast, Freq Inv and Temp Inv show a centralized volume distribution across strikingness bins, with a notably high volume remaining even in the highest bin ($[0.9,1.0]$). This contradicts the basic premise that outstanding events should be scarce. The inflated high-strikingness populations arise from inherent simplifications: Freq Inv relies on the long-tail frequency of $(s,r)$ pairs, and Temp Inv treats any non-recent exact repeat as striking, regardless of contextual expectation.

##### Predictive Correlation is Weak or Trivial\.

Model performance grouped by each measure reveals:

- Freq Inv exhibits erratic, non-monotonic MRR trends across bins. The performance curve fluctuates without a clear gradient, indicating that frequency alone does not stably correlate with prediction difficulty.
- Temp Inv shows a modest decline in MRR as strikingness increases, but this primarily reflects the trivial fact that events without recent repetitions are harder to predict.

Crucially, neither baseline can discriminate between model families. Under RSMF, the performance gap between path-based and representation-based methods widens considerably as strikingness increases. Under Freq Inv and Temp Inv, this gap remains narrow and inconsistent across bins, failing to expose the models' distinct capabilities, especially for the ensemble method.

##### Why RSMF is Necessary\.

RSMF's rule-grounded peer-event comparison incorporates both semantic confidence and temporal decay, enabling it to:

1. produce a realistic long-tail strikingness distribution,
2. create a sharp, monotonic gradient of prediction difficulty,
3. reveal systematic differences in model-family performance, and
4. offer explainable strikingness assessments through rules and peer events.

### D.2 Statistical Test Details

We conducted two complementary statistical tests to compare model performance between the lowest strikingness bin ($sk<0.2$) and the highest bin ($sk>0.8$):

- Welch's t-test (independent samples, unequal variances assumed)
- Mann-Whitney U test (non-parametric, one-sided alternative that low-strikingness performance is greater)

Tests were performed separately for the Hits@1 and Hits@3 metrics. Sample sizes are 31,078 for the low bin and 9,316 for the high bin across all models. All comparisons yield highly significant results ($p<0.001$). The t-statistics are exceptionally large ($t>60$), and the U-statistics are consistently on the order of $10^{8}$, reflecting both large effect sizes and substantial sample sizes. The consistency between the parametric (t-test) and non-parametric (U-test) results confirms the robustness of the conclusion. The extreme significance levels ($p\ll 0.001$) are expected given the large sample sizes ($n>40{,}000$ combined per test) and the large observed differences ($\Delta$Hits@3 $>35\%$ for all models). These statistical tests provide formal confirmation that the performance degradation for high-strikingness events is not due to chance variation.
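The two test statistics can be computed from per-query hit indicators (1 if the query is a hit, 0 otherwise). The sketch below uses only the Python standard library and returns the statistics without p-values; the function names are illustrative. In practice, `scipy.stats.ttest_ind` with `equal_var=False` and `scipy.stats.mannwhitneyu` with `alternative="greater"` also report p-values.

```python
import math
from bisect import bisect_left, bisect_right
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def mann_whitney_u(a, b):
    """U statistic of sample a: pairs (x, y) with x > y, counting ties as 0.5."""
    sorted_b = sorted(b)
    u = 0.0
    for x in a:
        below = bisect_left(sorted_b, x)           # number of y strictly below x
        tied = bisect_right(sorted_b, x) - below   # number of y equal to x
        u += below + 0.5 * tied
    return u
```

The U statistic is bounded above by the product of the two sample sizes. For the bins reported here this product is $31{,}078\times 9{,}316\approx 2.9\times 10^{8}$, which is consistent with U values on the order of $10^{8}$.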

| Model | Type | ICEWS18 $S(0.6,0.7)$ | ICEWS18 $S(0.7,0.8)$ | ICEWS18 $S(0.8,0.9)$ | ICEWS18 $S(0.9,1.0)$ |
| --- | --- | --- | --- | --- | --- |
| Recurrency | High $NO_f$ | 10.40 | 8.94 | 7.26 | 4.13 |
| Recurrency | Low $NO_f$ | 8.05 | 5.06 | 2.97 | 2.20 |
| TITer | High $NO_f$ | 15.71 | 13.86 | 14.32 | 10.22 |
| TITer | Low $NO_f$ | 10.19 | 7.81 | 5.11 | 4.51 |
| TLogic | High $NO_f$ | 13.74 | 11.32 | 12.02 | 6.24 |
| TLogic | Low $NO_f$ | 6.56 | 4.71 | 2.82 | 2.05 |
| RE-GCN | High $NO_f$ | 23.55 | 23.19 | 22.36 | 20.43 |
| RE-GCN | Low $NO_f$ | 15.80 | 14.88 | 10.91 | 12.42 |
| TiRGN | High $NO_f$ | 22.12 | 22.98 | 21.37 | 18.89 |
| TiRGN | Low $NO_f$ | 15.65 | 14.86 | 10.80 | 11.90 |
| LogCL | High $NO_f$ | 31.95 | 32.52 | 30.30 | 28.17 |
| LogCL | Low $NO_f$ | 18.96 | 21.77 | 21.45 | 22.71 |
| Ensemble | High $NO_f$ | 29.09 | 29.16 | 28.37 | 25.95 |
| Ensemble | Low $NO_f$ | 17.79 | 19.62 | 20.25 | 21.31 |

Table 6: The Hits@3 metric of High and Low $NO_f$ events within the high-strikingness range on ICEWS18.

| Model Type | ICEWS14 $S(0,0.1)$ | ICEWS14 $S(0.1,0.2)$ | ICEWS18 $S(0,0.1)$ | ICEWS18 $S(0.1,0.2)$ |
| --- | --- | --- | --- | --- |
| 6-model-H@3 | 62.70 | 41.85 | 45.81 | 26.68 |
| 5-model-H@3 | 76.12 | 59.09 | 60.46 | 40.07 |
| 4-model-H@3 | 84.85 | 69.55 | 71.09 | 49.89 |

Table 7: The Hits@3 of multi-models' intersection in the low-strikingness groups on ICEWS14 and ICEWS18.

## Appendix E More Predictability Analysis

The $NO_f$ measures the degree of overlap between the neighbors of the subject entity and the object entity for the test sample $f=(s,r,o,t)$ in the historical KGs, which indicates the existence of more complex multi-hop historical interactions between the subject and object of event $f$. The formula is as follows:

$$NO_f=\frac{|N_s\cap N_o|}{|N_s\cup N_o|} \tag{15}$$

where $N_s=\{o'\mid(s,r',o',t')\}\cup\{o'\mid(o',r',s,t')\}$ and $N_o=\{s'\mid(s',r',o,t')\}\cup\{s'\mid(o,r',s',t')\}$ denote the sets of neighboring entities of $s$ and $o$, and $t-w\leq t'<t$ is consistent with the window for calculating strikingness.
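Eq. (15) is a Jaccard overlap of two neighbor sets restricted to the historical window. A minimal sketch, assuming the history is a list of `(s, r, o, t)` tuples; the function names and the value returned when both neighbor sets are empty are assumptions.

```python
def neighbors(history, entity, t, w):
    """Entities linked to `entity` in either direction within t - w <= t' < t."""
    found = set()
    for s, _r, o, t_prime in history:
        if t - w <= t_prime < t:
            if s == entity:
                found.add(o)
            if o == entity:
                found.add(s)
    return found


def neighborhood_overlap(history, s, o, t, w):
    """NO_f = |N_s intersect N_o| / |N_s union N_o| for the event (s, r, o, t)."""
    n_s = neighbors(history, s, t, w)
    n_o = neighbors(history, o, t, w)
    union = n_s | n_o
    if not union:
        return 0.0  # assumption: no neighbors at all means no overlap
    return len(n_s & n_o) / len(union)
```

A high value means the subject and object share many recent neighbors, that is, many two-hop paths connect them in the window even when no direct one-hop evidence exists.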

The results of the predictability analysis on the high-strikingness groups of ICEWS18 are shown in Table [6](https://arxiv.org/html/2605.13153#A4.T6). Consistent with our prior findings, future events with higher $NO_f$ exhibit higher prediction accuracy.

### E.1 Predict Pattern on Low-strikingness Events across Different Models

We verify whether events in the low-strikingness groups exhibit the same easy patterns, that is, whether different models consistently predict mostly the same trivial events correctly. $S(sk_1,sk_2)$ represents the set of events with strikingness in the range $[sk_1,sk_2)$, and “n-model-H@3” indicates that at least n of the six models achieve Hits@3 on the same event. We calculated the intersection of the Hits@3 metric of models within the low-strikingness groups. The results in Table [7](https://arxiv.org/html/2605.13153#A4.T7) demonstrate that for events with extremely low strikingness $S(0,0.1)$, multiple models consistently make correct predictions. However, for events with strikingness in $S(0.1,0.2)$, the overlap in model predictions drops sharply, indicating that even for trivial events, different models possess distinct prediction patterns. It further explains why the ensemble method in Figure [2](https://arxiv.org/html/2605.13153#S4.F2) exhibits an enhancement pattern.

Let $\mathcal{Q}$ be the set of test queries and $\mathcal{M}=\{M_1,M_2,\dots,M_N\}$ denote $N$ baseline models. For a query $q\in\mathcal{Q}$ and a model $M_i\in\mathcal{M}$, we define an indicator function $\mathbb{I}_{\text{Hits@3}}(M_i,q)$ that equals 1 if model $M_i$ achieves Hits@3 = 1 on query $q$ and 0 otherwise. If at least $n$ models in $\mathcal{M}$ simultaneously achieve Hits@3 = 1 on $q$, i.e., $\sum_{i=1}^{N}\mathbb{I}_{\text{Hits@3}}(M_i,q)\geq n$, we say the query $q$ satisfies the $n$-Model-H@3 condition. Consequently, the overall $n$-Model-Hits@3 performance on the test set $\mathcal{Q}$ is reported as follows:

$$\frac{\sum_{q\in\mathcal{Q}}\mathbb{I}\left\{\sum_{i=1}^{N}\mathbb{I}_{\text{Hits@3}}(M_i,q)\geq n\right\}}{|\mathcal{Q}|}. \tag{16}$$
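The $n$-Model-H@3 score counts the queries on which at least $n$ models score a hit. A minimal sketch, assuming each model's ranks are stored in a list aligned by query index; the function name and data layout are illustrative.

```python
def n_model_hits(ranks_by_model, n, k=3):
    """Fraction of queries on which at least n models rank the answer in the top k.

    ranks_by_model[i][q] is the rank that model i assigns to the correct
    answer of query q.
    """
    num_queries = len(ranks_by_model[0])
    satisfied = 0
    for q in range(num_queries):
        hits = sum(1 for ranks in ranks_by_model if ranks[q] <= k)
        if hits >= n:
            satisfied += 1
    return satisfied / num_queries
```

Because the condition is "at least $n$", the score can only stay equal or grow as $n$ decreases, which matches the ordering of the 6-model, 5-model, and 4-model rows in Table 7.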
![Refer to caption](https://arxiv.org/html/2605.13153v1/x13.png)

Figure 6: Relation between strikingness and novelty.

![Refer to caption](https://arxiv.org/html/2605.13153v1/x14.png)

Figure 7: Strikingness and count of events with different relations on ICEWS18.

![Refer to caption](https://arxiv.org/html/2605.13153v1/x15.png)

(a) ICEWS14

![Refer to caption](https://arxiv.org/html/2605.13153v1/x16.png)

(b) ICEWS05-15

![Refer to caption](https://arxiv.org/html/2605.13153v1/x17.png)

(c) GDELT

![Refer to caption](https://arxiv.org/html/2605.13153v1/x9.png)

Figure 8: WMRR with different bias $b$ on ICEWS14, ICEWS05-15, and GDELT.

## Appendix F Analysis of Strikingness

### F.1 Characteristics of Strikingness

We conduct analysis experiments to validate four aspects of outstanding events: Novelty, Rarity, Context Dependence, and Time Sensitivity\.

##### Novelty of Outstanding Event

In Figure [6](https://arxiv.org/html/2605.13153#A5.F6), we explored the relationship between events' historical repetition count and strikingness on ICEWS14. The boxes represent the distribution of strikingness for events with a given repetition count, while the blue line illustrates the average strikingness. It can be observed that the strikingness of events decreases sharply as the historical repetition count increases, with the most pronounced drop occurring between first-occurring events and those that have occurred before. Nevertheless, the box plot reveals that even first-occurring events can demonstrate low strikingness. This is because, although an event may not have occurred historically, it can still be considered anticipated as long as there exist sufficient rules to support its occurrence. Conversely, highly frequent events are typically unlikely to be perceived as outstanding due to their routine nature.

##### Rarity of Outstanding Event

Additionally, we explored the relationship between strikingness and event relation types in Figure [7](https://arxiv.org/html/2605.13153#A5.F7). As shown, events in TKGs, categorized by relation types, generally follow a long-tailed distribution. Moreover, the strikingness of relation-specific events tends to increase as the count of events associated with a relation decreases, suggesting that events with few-shot relations are more likely to be outstanding. However, it is also observed that some events with few-shot relations exhibit very low strikingness. This is because, although events involving few-shot relations constitute a relatively small proportion, they may occur repeatedly.

![Refer to caption](https://arxiv.org/html/2605.13153v1/x18.png)

![Refer to caption](https://arxiv.org/html/2605.13153v1/x19.png)

![Refer to caption](https://arxiv.org/html/2605.13153v1/x20.png)

![Refer to caption](https://arxiv.org/html/2605.13153v1/x21.png)

Figure 9: Distribution on ICEWS14 with different hyperparameters.

![Refer to caption](https://arxiv.org/html/2605.13153v1/x22.png)

![Refer to caption](https://arxiv.org/html/2605.13153v1/x23.png)

![Refer to caption](https://arxiv.org/html/2605.13153v1/x24.png)

![Refer to caption](https://arxiv.org/html/2605.13153v1/x25.png)

![Refer to caption](https://arxiv.org/html/2605.13153v1/x26.png)

Figure 10: Performance on ICEWS14 under the strikingness-aware evaluation framework with $b=0.1$ and different hyperparameters.

![Refer to caption](https://arxiv.org/html/2605.13153v1/x27.png)

(a) ICEWS14

![Refer to caption](https://arxiv.org/html/2605.13153v1/x28.png)

(b) ICEWS18

![Refer to caption](https://arxiv.org/html/2605.13153v1/x29.png)

(c) ICEWS05-15

![Refer to caption](https://arxiv.org/html/2605.13153v1/x30.png)

(d) GDELT

Figure 11: Ensemble results of different baseline models. The lower triangle (including the diagonal) represents the original metrics and the upper triangle corresponds to the strikingness-aware metrics.

##### Context Dependence and Time Sensitivity of Outstanding Event

In Figure [9](https://arxiv.org/html/2605.13153#A6.F9), we analyze the distribution of event strikingness under different parameter settings. The results demonstrate that the strikingness distribution changes with the variation of parameters, indicating that the proposed RSMF effectively selects the expected outstanding events by adjusting the specific parameters. Specifically, $\tau$ is the confidence threshold for constraining the number of applicable rules, $\alpha^{s}$ denotes the weight of the entity in strikingness calculation, and $w$ influences the window length of historical knowledge graphs. Adjustments to these parameters will affect the context information of future events, reflecting the context dependence of outstanding events. Similarly, $\lambda$ controls the temporal decay rate of rules, showcasing the time sensitivity. It should be emphasized that these parameters, unlike model hyperparameters, are not designed for achieving optimal model performance. Our goal is to establish a well-calibrated evaluation framework. The parameters for strikingness calculation in RSMF are instrumental in implementing its core definition and generating a long-tailed distribution of event strikingness (Figure [9](https://arxiv.org/html/2605.13153#A6.F9)). To facilitate reproducibility and promote community adoption, we provide standardized configurations as recommended defaults.

### F.2 Hyperparameter Sensitivity of Evaluation

The construction of RSMF involves several hyperparameters. As shown in Figure [9](https://arxiv.org/html/2605.13153#A6.F9), these hyperparameters can change the values and distribution of strikingness. Consequently, a natural concern arises: could these hyperparameters affect the outcomes of the strikingness-aware evaluation framework, for example by causing model A to outperform model B under one hyperparameter setting while underperforming it under another? Figure [10](https://arxiv.org/html/2605.13153#A6.F10) shows the influence of the involved hyperparameters on evaluation results on ICEWS14. It can be observed that although the WMRR values change, the relative performance ranking among models remains consistent throughout. This indicates that the strikingness-aware evaluation framework is robust to hyperparameter variations and can be reliably applied to assess TKGR models.

We also report the evaluation results with respect to parameter $b$ on the other datasets in Figure [8](https://arxiv.org/html/2605.13153#A5.F8). As shown, the position at which evaluation results (i.e., model rankings) change varies across different datasets. At $b=0.1$, both the WMRR and the relative model rankings remain comparatively stable, which is why we recommend this as the default.

## Appendix G Ensemble Combinations

We hypothesize that each baseline offers distinct predictive perspectives and thus explore various ensemble combinations for prediction. The results of all possible ensemble combinations are presented in Figure [11](https://arxiv.org/html/2605.13153#A6.F11), where the lower triangle (including the diagonal) represents the original metrics and the upper triangle corresponds to the strikingness-aware metrics. As observed, nearly all ensemble combinations achieve significant improvements on the original metrics. However, such gains are negligible on the strikingness-aware metrics. This is because ensemble methods primarily enhance the prediction of low-strikingness events, while for high-strikingness events, they may even introduce conflicts.