GRID: Graph Representation of Intelligence Data for Security Text Knowledge Graph Construction

arXiv cs.AI Papers

Summary

This paper presents GRID, an end-to-end framework for constructing security knowledge graphs from cyber threat intelligence (CTI) articles using LLMs, introducing a task-bank reward training method to improve precision and recall without expensive LLM-as-judge rewards. The approach achieves strong results on a benchmark of 249 CTI articles from five sources.

arXiv:2605.16714v1 Announce Type: new Abstract: Security knowledge graphs can provide computable external memory for security agents, but constructing them from long-form cyber threat intelligence (CTI) remains difficult: LLMs often lack grounded security-domain knowledge, and end-to-end document-to-graph training is hard to supervise with cheap, stable rewards. We present GRID (Graph Representation of Intelligence Data), an end-to-end framework for security text knowledge graph construction. GRID first builds security-domain supervision from CTI articles by creating traceable article-graph alignments through graph extraction and knowledge-graph-conditioned text revision. It then turns document-to-graph learning into a scripted task bank combining four-option multi-select questions with triple-level regex matching targets, yielding more stable task-specific rewards than repeatedly scoring full graph outputs with an LLM judge. Using this supervision pipeline, we train two Qwen3-4B-Instruct-2507-based 4B extractors: a primary Task-bank Reward model and a secondary End2End Reward model with LLM-as-judge precision/recall rewards. On 249 CTI articles from GRID, CASIE, CTINexus, MalKG, and SecureNLP, the Task-bank Reward model with the ontology-guided GRID extraction pipeline reaches 84.62% source-averaged precision, 64.91% source-averaged recall, and 68.53% Avg F1, achieving the best source-averaged recall and near-top Avg F1 with lower token usage and deployment cost. The End2End Reward model reaches 76.91% precision, 53.85% recall, and 58.06% Avg F1. Further analyses show that task-bank rewards can be built once offline and reused across later post-training runs, outperforming online End2End LLM-as-judge reward and weaker alternatives such as Choice-only Reward and End2End SFT without RL.

Cached at: 05/19/26, 06:35 AM

# GRID: Graph Representation of Intelligence Data for Security Text Knowledge Graph Construction
Source: [https://arxiv.org/html/2605.16714](https://arxiv.org/html/2605.16714)
Liangyi Huang¹, Zichen Liu¹, Fei Shao², Shang Ma³, Mengshi Zhang⁴, Zihao Chen⁵, Yanfang Ye³, Xusheng Xiao¹

¹Arizona State University, ²Case Western Reserve University, ³University of Notre Dame, ⁴TensorBlock, ⁵Facebook

lhuan139@asu\.edu

###### Abstract

Security knowledge graphs can serve as computable and traceable external memory for security agents\. Our goal is to equip LLMs with security\-domain knowledge for knowledge graph extraction from long\-form security text\. However, existing LLMs largely lack such domain knowledge grounded in real security text, and end\-to\-end document\-to\-graph training is difficult to supervise with cheap and stable rewards\. We present Graph Representation of Intelligence Data \(G\.R\.I\.D\.\), an end\-to\-end framework for security text knowledge graph construction\. GRID first constructs security\-domain supervision from security\-related CTI articles in an unsupervised manner by constructing traceable article\-graph alignments through graph extraction and knowledge\-graph\-conditioned text revision\. It then reformulates document\-to\-graph learning into a scripted task bank that combines four\-option multi\-select questions with triple\-level regex matching targets, yielding cheaper and more stable task\-specific rewards than asking an LLM judge to score full graph outputs at every training step\. Based on this supervision pipeline, we train two Qwen3\-4B\-Instruct\-2507\-based 4B extractors: a primary Task\-bank Reward model and a secondary End2End Reward model for direct article\-to\-knowledge\-graph generation with LLM\-as\-judge precision/recall rewards\. On a unified benchmark of 249 CTI articles from five sources: GRID, CASIE, CTINexus, MalKG, and SecureNLP, the post\-trained Task\-bank Reward model together with the ontology\-guided GRID extraction pipeline reaches 84\.62% source\-averaged precision, 64\.91% source\-averaged recall, and 68\.53% Avg F1, achieving the best source\-averaged recall and a near\-tied top Avg F1 with much less token usage and lower deployment cost\. The secondary End2End Reward model reaches 76\.91% source\-averaged precision, 53\.85% source\-averaged recall, and 58\.06% Avg F1\. Further analyses show that the task\-bank reward can be constructed once offline and reused across later post\-training runs while outperforming the online End2End LLM\-as\-judge reward as well as weaker alternatives such as Choice\-only Reward and End2End SFT without RL, and that both article rewriting and article\-complexity\-ordered training are necessary for the best performance\.

## 1 Introduction

In recent years, cyber attacks have become more frequent, more complex, and more expensive Gao et al\. \([2018](https://arxiv.org/html/2605.16714#bib.bib18)\); Hutchins et al\. \([2011](https://arxiv.org/html/2605.16714#bib.bib24)\); dep \([2021](https://arxiv.org/html/2605.16714#bib.bib3)\); Times \([2014](https://arxiv.org/html/2605.16714#bib.bib60)\)\. The 2024 Report on the Cybersecurity Posture of the United States shows that reported ransomware incidents increased by 22% since 2022, and the related costs increased by 74% Office of the National Cyber Director \([2024](https://arxiv.org/html/2605.16714#bib.bib42)\)\. However, many organizations still lack a good understanding of current threats\. A recent study shows that 79% of security decision\-makers often ignore threat actor information, only 35% believe their organizations understand adversaries’ tactics, techniques, and procedures, and 68% think their threat intelligence capabilities need major improvement Mandiant \([2024](https://arxiv.org/html/2605.16714#bib.bib31)\)\.

Cyber Threat Intelligence \(CTI\) is therefore important for cyber defense McMillan \([2013](https://arxiv.org/html/2605.16714#bib.bib32)\); Wagner et al\. \([2019](https://arxiv.org/html/2605.16714#bib.bib62)\)\. Common structured CTI sources, such as Indicators of Compromise \(IOCs\), Common Vulnerabilities and Exposures \(CVEs\), and cyber kill chains, are useful but often miss important attack context Obrst et al\. \([2012](https://arxiv.org/html/2605.16714#bib.bib40)\); Liao et al\. \([2016](https://arxiv.org/html/2605.16714#bib.bib28)\); Catakoglu et al\. \([2016](https://arxiv.org/html/2605.16714#bib.bib6)\); Senki \([2016](https://arxiv.org/html/2605.16714#bib.bib56);[https://arxiv.org/html/2605.16714#bib.bib55](https://arxiv.org/html/2605.16714#bib.bib55)\); MITRE \([2020](https://arxiv.org/html/2605.16714#bib.bib39)\); cyb \([2021](https://arxiv.org/html/2605.16714#bib.bib2)\); Corporation \([2022](https://arxiv.org/html/2605.16714#bib.bib12)\)\. In contrast, unstructured CTI articles, such as technical blogs and threat reports, often describe attack behaviors, attacker goals, exploited vulnerabilities, and malware evolution in much more detail Liao et al\. \([2016](https://arxiv.org/html/2605.16714#bib.bib28)\); Dong et al\. \([2019](https://arxiv.org/html/2605.16714#bib.bib16)\)\. This has motivated efforts to organize CTI knowledge into structured resources such as MITRE ATT&CK and NVD, and to automatically extract threat knowledge from CTI articles Corporation \([2022](https://arxiv.org/html/2605.16714#bib.bib12)\); of Standards & Technology \([2021](https://arxiv.org/html/2605.16714#bib.bib41)\); Li et al\. \([2022](https://arxiv.org/html/2605.16714#bib.bib27)\); Satvat et al\. \([2021](https://arxiv.org/html/2605.16714#bib.bib52)\); Gao et al\. \([2022](https://arxiv.org/html/2605.16714#bib.bib19)\); Wang et al\. \([2023](https://arxiv.org/html/2605.16714#bib.bib63)\); OpenAI \([2023](https://arxiv.org/html/2605.16714#bib.bib43)\); Huang & Xiao \([2024](https://arxiv.org/html/2605.16714#bib.bib22)\)\.

Knowledge graphs are useful for this purpose because they represent both security entities and their relations in one structure\. Prior work has shown that graph\-based threat knowledge helps forensic analysis and attack reconstruction by matching known threat behaviors with system auditing events and provenance graphs Milajerdi et al\. \([2019](https://arxiv.org/html/2605.16714#bib.bib38)\); King & Chen \([2003](https://arxiv.org/html/2605.16714#bib.bib25)\); dep \([2021](https://arxiv.org/html/2605.16714#bib.bib3)\); Xu et al\. \([2022](https://arxiv.org/html/2605.16714#bib.bib65)\)\. More generally, graph\-structured knowledge also helps LLMs and agents reason over connected evidence\. Think\-on\-Graph reports state\-of\-the\-art results on 6 of 9 reasoning benchmarks, G\-Retriever improves valid\-node grounding from 31% to 77% and fully valid graph grounding from 8% to 62%, and recent graph\-memory systems for agents report a 26% relative improvement over memory baselines and up to 18\.5% higher accuracy on long\-horizon tasks Sun et al\. \([2023a](https://arxiv.org/html/2605.16714#bib.bib58)\); He et al\. \([2024](https://arxiv.org/html/2605.16714#bib.bib20)\); Chhikara et al\. \([2025](https://arxiv.org/html/2605.16714#bib.bib8)\); [Rasmussen et al\.](https://arxiv.org/html/2605.16714#bib.bib49)\. Figure [1](https://arxiv.org/html/2605.16714#S1.F1) shows a Log4Shell example with a knowledge graph and its attack summary from CTI articles\.

![Refer to caption](https://arxiv.org/html/2605.16714v1/figs/exampleintro-horizontal.png)

Figure 1: Log4Shell \(CVE\-2021\-44228\) CTI articles: knowledge graph \(left\) and attack summary \(right\)

In this paper, we propose Graph Representation of Intelligence Data \(G\.R\.I\.D\.\), a low\-cost framework for security text knowledge graph construction\. Rather than treating security knowledge graph extraction as a standalone prompting problem, GRID provides a complete pipeline for developing and assessing LLM\-based security knowledge graph extraction systems\. This pipeline spans unsupervised supervision construction from CTI articles, cheaper task\-bank rewards for end\-to\-end document\-to\-graph learning, a fixed two\-prompt inference pipeline, and trustworthy automatic evaluation against human\-annotated CTI data\.

Challenges\. We next summarize the key challenges faced by existing approaches\.

- Lack of Integration of Security Domain Knowledge: Most existing LLMs are designed for general\-purpose use rather than knowledge graph extraction from security text, especially at small model scales, leaving practitioners dependent on expensive commercial APIs\.
- Lack of High\-Quality CTI Article\-Graph Alignment Data: Training LLMs for CTI knowledge graph extraction requires supervision that tightly aligns real CTI text with graph outputs, but there is currently no public high\-quality dataset that provides such article\-graph pairs for end\-to\-end training\.
- Expensive Reward Signals for Open\-Ended Knowledge Graph Extraction: End\-to\-end knowledge graph extraction is an open\-ended generation task, making reward design for reinforcement learning difficult and expensive\. Even when LLM\-as\-judge is available, asking it to score full extracted graphs at training time still incurs high cost\.
- Shallow Shortcut Learning in Relation Extraction: LLMs tend to rely on superficial lexical overlap, local co\-occurrence, or other surface heuristics when predicting relations, instead of deeply understanding entity semantics, aliases, structural hierarchy, and relation constraints in CTI narratives\.

Contributions\.

- Automatic Annotation of Article\-Graph Alignment Data: GRID introduces an automatic data annotation algorithm for CTI knowledge graph extraction\. It first generates a traceable knowledge graph that preserves verbatim evidence anchors from the source text, and then performs knowledge\-graph\-conditioned text revision to remove CTI information that is not captured by the graph while retaining non\-CTI context\. This yields high\-quality article\-graph alignments without requiring large\-scale manual annotation\.
- Low\-Cost Task\-Bank Reformulation for RL Training: GRID reformulates open\-ended knowledge graph extraction into scripted supervision tasks that combine four\-option multi\-select questions with triple\-level regex targets\. This replaces full\-graph scoring with cheaper task\-level checks\.
- Ontology\-Guided CTI Knowledge Graph Extraction: GRID designs a CTI\-oriented ontology that explicitly models entity types, relation categories, aliases, and hierarchy, so that extraction depends on entity semantics and constraints rather than only shallow textual cues; together with the post\-trained model, this lowers deployment cost\.
- Out\-of\-the\-box Benchmark and Trustworthy Automatic Evaluation: To address the lack of an out\-of\-the\-box benchmark for CTI knowledge graph extraction, GRID also builds a test benchmark centered on real CTI articles\. The benchmark combines manually annotated real\-world CTI data with multiple existing security text datasets, and is paired with a trustworthy automatic evaluator based on text\-provable precision and recall\.

Evaluations\. We evaluate GRID on a unified benchmark of 249 CTI articles from five sources, comprising 49 GRID articles \(avg\. 1,102 tokens and 15\.35 ground\-truth edges\), 50 CASIE articles \(avg\. 537 tokens and 7\.94 ground\-truth edges\), 50 CTINexus articles \(avg\. 191 tokens and 11\.80 edges\), 50 MalKG articles \(avg\. 6,632 tokens and 48\.90 edges\), and 50 SecureNLP articles \(avg\. 11,000 tokens and 68\.66 edges\), after removing articles whose ground\-truth knowledge graphs contain fewer than five edges\. For all RQs, we report effectiveness results using a calibrated LLM judge that reaches 86\.0% agreement with annotations from three human reviewers\. Using Qwen3\-4B\-Instruct\-2507, we train two 4B extractors: a primary Task\-bank Reward model and a comparison model trained with online End2End LLM\-as\-judge reward\. On this benchmark, the post\-trained Task\-bank Reward model together with the ontology\-guided GRID pipeline achieves 84\.62% source\-averaged precision, 64\.91% source\-averaged recall, and 68\.53% Avg F1, giving the best source\-averaged recall and a near\-tied top Avg F1 with much less token usage than CTINexus\. The online End2End LLM\-as\-judge reward model reaches 76\.91% precision, 53\.85% recall, and 58\.06% Avg F1\. RQ2 ablations further validate Task\-bank Reward as an effective reward design, outperforming online End2End reward, Choice\-only Reward, End2End SFT without RL, and the base model\. Under the same training budget, GRID’s full setting achieves higher training reward and a higher test\-set score reflecting precision and recall than the variants without article rewriting or article\-complexity ordering\. Code and data are accessible at [https://github\.com/anonymousauthorname/ProjectGRID](https://github.com/anonymousauthorname/ProjectGRID) gri \([2026](https://arxiv.org/html/2605.16714#bib.bib4)\)\.

## 2 Approach

Figure [2](https://arxiv.org/html/2605.16714#S2.F2) summarizes the overall pipeline of GRID, and the rest of this section explains each step in turn\.

![Refer to caption](https://arxiv.org/html/2605.16714v1/figs/overview.png)

Figure 2: Overview of GRID

### 2\.1 Automatic Annotation of Article\-Graph Alignment Data

GRID maps each raw CTI article to an aligned pair $(a', G')$ through a two\-stage annotate\-and\-revise loop\. It first extracts a traceable knowledge graph under a strict text\-provable constraint, and then rewrites the article against that graph so that unsupported security content is removed while graph\-grounded evidence and non\-security context are preserved\. The result is a revised article whose security\-bearing content is explicitly aligned with the extracted graph; Algorithm [1](https://arxiv.org/html/2605.16714#alg1) gives the procedure\.

Algorithm 1: Automatic Annotation of Article\-Graph Alignment Data

Input: Raw CTI article $a$, traceable extraction prompt $P_{\mathrm{trace}}$, revision prompt $P_{\mathrm{rev}}$

Output: Article\-graph alignment $(a', G')$ consisting of revised article $a'$ and text\-grounded knowledge graph $G'$

1. $G \leftarrow \mathrm{LLMExtract}(a, P_{\mathrm{trace}})$
2. Parse $G$ into entity list $E$ and relation list $R$
3. foreach $r \in R$ do
   - keep sentence\-local subject/object mentions in $(r.\mathrm{sub}, r.\mathrm{obj})$
   - keep verbatim evidence anchors in $(r.\mathrm{raw\_sub\_name}, r.\mathrm{raw\_obj\_name}, r.\mathrm{raw\_text\_start}, r.\mathrm{raw\_text\_end})$
4. Mark all anchor spans in $a$ that are protected by $R$
5. $a' \leftarrow \mathrm{LLMRevise}(a, G, P_{\mathrm{rev}})$, deleting unsupported security mentions while keeping protected anchors and non\-security context
6. $G' \leftarrow (E, R)$
7. return $(a', G')$

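The annotate\-and\-revise loop of Algorithm 1 can be sketched in Python as follows\. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Relation` and `Graph` containers, the injected `llm_extract` and `llm_revise` callables, and the reading of `raw_text_start` / `raw_text_end` as verbatim boundary strings of the evidence span are all hypothetical\.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Relation:
    sub: str             # sentence-local subject mention
    rel: str
    obj: str             # sentence-local object mention
    raw_sub_name: str    # verbatim subject string from the article
    raw_obj_name: str    # verbatim object string from the article
    raw_text_start: str  # assumed: verbatim text that opens the evidence span
    raw_text_end: str    # assumed: verbatim text that closes the evidence span


@dataclass
class Graph:
    entities: List[str]
    relations: List[Relation]


Span = Tuple[int, int]


def protected_spans(article: str, relations: List[Relation]) -> List[Span]:
    """Return character spans of the article covered by evidence anchors."""
    spans = []
    for r in relations:
        start = article.find(r.raw_text_start)
        if start < 0:
            continue  # anchor not found verbatim, so nothing to protect
        end = article.find(r.raw_text_end, start)
        if end < 0:
            continue
        spans.append((start, end + len(r.raw_text_end)))
    return spans


def annotate(
    article: str,
    llm_extract: Callable[[str], Graph],
    llm_revise: Callable[[str, Graph, List[Span]], str],
) -> Tuple[str, Graph]:
    """Extract G, mark protected anchors, then rewrite a into a'."""
    graph = llm_extract(article)                      # traceable extraction
    spans = protected_spans(article, graph.relations)  # anchors to keep
    revised = llm_revise(article, graph, spans)        # graph-conditioned revision
    return revised, graph
```

Passing the two LLM calls in as callables keeps the control flow testable with stubs; in practice each would wrap a prompt such as $P_{\mathrm{trace}}$ or $P_{\mathrm{rev}}$\.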
### 2\.2 Task Bank Construction

Rather than directly rewarding full document\-to\-graph generation, GRID converts each article\-graph alignment $(a', G')$ into two easy\-to\-check RL task families\. The first creates four\-option multi\-select questions, where the ground\-truth answer can contain any subset of 0–4 correct options\. The second creates one triple\-level regex target for each ground\-truth KG edge, so that graph supervision can be reduced to per\-edge matching rather than whole\-graph judging\.

On the choice side, distractors are contrastive negatives that may look related in the article but are actually invalid, or may seem reasonable based on real\-world CTI experience even though the article itself does not support them\. On the regex side, matching is normalized at the entity and relation levels but remains edge\-aligned\. Table [1](https://arxiv.org/html/2605.16714#S2.T1) and Table [2](https://arxiv.org/html/2605.16714#S2.T2) summarize the two rule families\.

Table 1: Choice\-question patterns for precision and hallucination checks

| Family | Pattern | Definition |
| --- | --- | --- |
| Precision | Supported triples | Among the four options, 0\-4 are text\-supported; the rest are near\-miss distractors with a wrong subject, object, or relation\. |
| Precision | Incorrect triples | Among the four options, 0\-4 are deliberately incorrect; the rest are supported triples\. |
| Hallucination | Relation illusion | The subject and object are grounded, but their relation is unsupported\. |
| Hallucination | Object illusion | The subject and relation are grounded, but the object is unsupported\. |
| Hallucination | Subject illusion | The relation and object are grounded, but the subject is unsupported\. |
| Hallucination | Total illusion | All three triple elements are unsupported, but the triple remains CTI\-plausible\. |
| Hallucination | Partial illusion | Only one triple element is grounded; the other two are unsupported\. |

Table 2: Regex rules for triple\-level matching

| Aspect | Rule |
| --- | --- |
| Edge alignment | Maintains one regex triplet per ground\-truth edge\. |
| Entity normalization | Normalizes aliases, abbreviations, number, hyphenation, and shortened head nouns for subjects and objects\. |
| Role\-preserving generalization | Permits limited generalization only under role\-preserving equivalence\. |
| Relation normalization | Normalizes inflection, voice, prepositions, and common CTI paraphrases\. |
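The hallucination patterns in Table 1 differ only in which triple elements are grounded in the article, so they can be expressed as a small lookup\. The sketch below assumes that grounding has already been decided per element and passed in as three booleans; the function name and the `None` return for fully grounded triples are illustrative choices, not part of the paper\.

```python
from typing import Optional


def hallucination_pattern(
    sub_grounded: bool, rel_grounded: bool, obj_grounded: bool
) -> Optional[str]:
    """Map per-element grounding flags to a Table 1 hallucination pattern."""
    n_grounded = sum((sub_grounded, rel_grounded, obj_grounded))
    if n_grounded == 3:
        return None  # every element is grounded: not a hallucination pattern
    if n_grounded == 0:
        return "total illusion"
    if n_grounded == 1:
        return "partial illusion"
    # Exactly two elements are grounded; name the one that is missing.
    if not rel_grounded:
        return "relation illusion"
    if not obj_grounded:
        return "object illusion"
    return "subject illusion"
```

For example, a distractor whose subject and object both appear in the article but whose relation is never stated maps to `"relation illusion"`\.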
### 2\.3 Article Complexity Modeling

GRID assigns each source article a scalar complexity score so that the exported training set can be reordered from easy to hard\. In the current training pipeline, complexity is defined only at the article level, and all reward\-related prompts derived from the same article inherit the same score\.

For an article $a$, GRID computes

$$C_{\mathrm{article}}(a) = \frac{1}{2} C_{\mathrm{base}}(a) + \frac{1}{2} C_{\mathrm{graph}}(a),$$

where $C_{\mathrm{base}}$ averages percentile ranks of article length, entity count, relation count, and text\-normalized entity/relation density, while $C_{\mathrm{graph}}$ averages percentile ranks of alias, connectivity, span, and fixed\-order crossing statistics\. This article\-level score is attached to every training item derived from the same article and is later used for article\-complexity\-ordered RL training\.
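A minimal sketch of this score is given below\. It assumes one particular percentile\-rank definition \(the fraction of articles whose value is less than or equal to the current one\) and takes the raw statistics as precomputed per\-article columns; the feature names are placeholders, and the paper does not specify either detail\.

```python
from statistics import mean
from typing import Dict, List


def percentile_ranks(values: List[float]) -> List[float]:
    """Fraction of values <= each value (one simple percentile-rank choice)."""
    n = len(values)
    return [sum(v <= x for v in values) / n for x in values]


def article_complexity(
    base_features: Dict[str, List[float]],
    graph_features: Dict[str, List[float]],
) -> List[float]:
    """Return C_article = 0.5 * C_base + 0.5 * C_graph for every article."""
    base_ranks = [percentile_ranks(col) for col in base_features.values()]
    graph_ranks = [percentile_ranks(col) for col in graph_features.values()]
    n_articles = len(base_ranks[0])
    scores = []
    for i in range(n_articles):
        c_base = mean(ranks[i] for ranks in base_ranks)
        c_graph = mean(ranks[i] for ranks in graph_ranks)
        scores.append(0.5 * c_base + 0.5 * c_graph)
    return scores


# Hypothetical statistics for three articles (columns are per-article values).
base = {"length": [100, 500, 1000], "entity_count": [3, 10, 20]}
graph = {"alias_count": [0, 1, 5]}
scores = article_complexity(base, graph)
# Easy-to-hard curriculum order: article indices sorted by ascending score.
curriculum = sorted(range(len(scores)), key=scores.__getitem__)
```

Because every feature is converted to a rank before averaging, features on very different scales \(token counts versus alias counts\) contribute equally to the final score\.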

### 2\.4 Reward Design for RL Training

The RL stage trains the LLM with Soft Adaptive Policy Optimization \(SAPO\), an improved GRPO\-style variant Gao et al\. \([2025](https://arxiv.org/html/2605.16714#bib.bib17)\)\. In the current implementation, task\-bank rewards are computed locally by task type\. Both task types additionally receive a 0\.1 format reward when the output follows the required reasoning\-before\-answer format\. For a choice task, the main reward is 1\.0 when the predicted answer set exactly matches the ground\-truth answer set, 0\.5 when the two sets overlap but are not identical, and 0\.0 otherwise\.
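The choice\-task reward described above can be written as a short function\. This sketch assumes the 0\.1 format reward is simply added to the main reward and that option labels are compared as sets; the function and constant names are illustrative\.

```python
from typing import Set

FORMAT_BONUS = 0.1  # granted when the reasoning-before-answer format is followed


def choice_reward(predicted: Set[str], gold: Set[str], format_ok: bool) -> float:
    """Main reward (1.0 exact, 0.5 partial overlap, 0.0 otherwise) plus format bonus."""
    if predicted == gold:
        main = 1.0  # exact set match, including the empty-answer case
    elif predicted & gold:
        main = 0.5  # some overlap, but not identical
    else:
        main = 0.0
    return main + (FORMAT_BONUS if format_ok else 0.0)
```

Note that when the ground truth has zero correct options, only an empty prediction earns the main reward, since any selected option cannot overlap with the empty set\.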

For a regex task, the main reward is $r_{\mathrm{regex}} = n_{\mathrm{match}} / n_{\mathrm{gt}}$, where $n_{\mathrm{match}}$ is the number of matched ground\-truth regex triplets, $n_{\mathrm{gt}}$ is the total number of ground\-truth regex triplets, and a ground\-truth regex triplet is counted as matched only when its subject, relation, and object are all matched by the same predicted triple\.
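The regex reward can be sketched as below\. The use of `re.fullmatch` with case\-insensitive matching, the zero reward for an empty target list, and the example patterns are assumptions made for illustration; the paper's actual patterns encode the normalization rules of Table 2\.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def triple_matched(patterns: Triple, predicted: List[Triple]) -> bool:
    """True if one single predicted triple matches all three regex patterns."""
    return any(
        all(
            re.fullmatch(pattern, text.strip(), flags=re.IGNORECASE)
            for pattern, text in zip(patterns, pred)
        )
        for pred in predicted
    )


def regex_reward(gt_patterns: List[Triple], predicted: List[Triple]) -> float:
    """r_regex = n_match / n_gt over the ground-truth regex triplets."""
    if not gt_patterns:
        return 0.0  # assumed convention when there are no targets
    n_match = sum(triple_matched(p, predicted) for p in gt_patterns)
    return n_match / len(gt_patterns)


# Hypothetical targets: alias and hyphenation variants are folded into the regex.
gt = [
    (r"(apt[\s-]?29|cozy bear)", r"use[sd]?", r"wellmess( malware)?"),
    (r"wellmess( malware)?", r"targets?", r"windows"),
]
pred = [("Cozy Bear", "used", "WellMess"), ("WellMess", "uses", "Windows")]
reward = regex_reward(gt, pred)
```

Here the second target is not matched even though its subject and object appear in a predicted triple, because the relation of that same triple does not match; the reward is therefore one matched target out of two\.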

### 2\.5 Ontology\-Guided Security Knowledge Graph Extraction

The ontology is not used as a post\-hoc label list\. Instead, it guides the LLM to recover entity relations that are not directly stated through explicit verbs or relative clauses, by providing typed entity categories, normalized relation families, aliases, and hierarchy constraints\. Appendix Figure [4](https://arxiv.org/html/2605.16714#A2.F4) gives a prompt\-block view, and the full ontology inventory, including predefined entity and relation types, is summarized in Appendix [B](https://arxiv.org/html/2605.16714#A2)\.

It enforces text\-provable truth by forbidding external completion, subject elevation, behavior\-to\-structure conversion, and subject\-changing chain deduction; treats alias, lineage, categorization, and normalized relation types as ontology\-native semantics rather than post\-hoc cleanup; and uses a connectivity recheck to remove unsupported isolated entities instead of hallucinating links\. In GRID inference, Step 1 proposes candidate entities, and Step 2 finalizes entities and relations against the original article\.
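The connectivity recheck can be illustrated with a few lines of Python\. This is a deliberate simplification: it treats an entity as isolated when no relation mentions it and drops every such entity, whereas the paper's recheck is part of the extraction prompt and removes only unsupported isolated entities\. The function name and triple layout are assumptions\.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def connectivity_recheck(entities: List[str], relations: List[Triple]) -> List[str]:
    """Keep only entities that take part in at least one relation."""
    connected = {node for sub, _, obj in relations for node in (sub, obj)}
    return [entity for entity in entities if entity in connected]


kept = connectivity_recheck(
    ["APT29", "WellMess", "Log4j"],
    [("APT29", "uses", "WellMess")],
)
```

The point of the design is that an isolated candidate is discarded rather than connected by an invented edge, which protects precision\.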

### 2\.6 LLM\-Based Automatic Evaluation

For end\-to\-end knowledge graph extraction, GRID uses an LLM\-based evaluator that scores precision and recall with two prompt templates governed by the same text\-provable principle as the extractor\. The precision prompt audits predicted edges one by one and asks whether each predicted edge is either directly supported by the article text or equivalent to a ground\-truth edge under the judge rules\. The recall prompt audits ground\-truth edges one by one and asks whether each ground\-truth edge is captured by the predicted graph, either by a directly equivalent predicted edge or by a text\-supported, subject\-preserving combination of predicted edges allowed by the judge rules\. In both prompts, the judge is required to return the edge index being audited, a binary verdict, and either supporting text evidence or the matched counterpart predicted edge\(s\)\. The full judge rules are summarized in Appendix [C](https://arxiv.org/html/2605.16714#A3), Table [8](https://arxiv.org/html/2605.16714#A3.T8)\.
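Once the judge has returned one indexed verdict per audited edge, per\-article scores follow by counting\. The sketch below assumes a `Verdict` record with the three fields the judge must return and computes F1 as the per\-article harmonic mean of precision and recall; the paper does not spell out the F1 formula, so that part is an assumption\.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Verdict:
    edge_index: int  # index of the audited edge
    accepted: bool   # binary verdict from the judge
    evidence: str    # supporting text or matched counterpart edge(s)


def accept_rate(verdicts: List[Verdict]) -> float:
    """Share of audited edges that the judge accepted."""
    if not verdicts:
        return 0.0
    return sum(v.accepted for v in verdicts) / len(verdicts)


def precision_recall_f1(
    predicted_edge_verdicts: List[Verdict],
    ground_truth_edge_verdicts: List[Verdict],
) -> Tuple[float, float, float]:
    """Precision audits predicted edges; recall audits ground-truth edges."""
    precision = accept_rate(predicted_edge_verdicts)
    recall = accept_rate(ground_truth_edge_verdicts)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Requiring the edge index in every verdict makes it possible to check that each edge was audited exactly once before the counts are trusted\.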

## 3 Evaluation

In the evaluations, we focus on the following research questions:

- RQ1: How do the two main GRID systems compare with representative CTI knowledge\-graph baselines?
- RQ2: How do the main post\-training designs differ in effectiveness and engineering cost?
- RQ3: How much do article rewriting and article\-complexity ordering matter within the primary Task\-bank Reward setup?

### 3\.1 Evaluation Setup

Evaluation Dataset\. Our benchmark contains 249 CTI articles drawn from five sources after removing articles whose ground\-truth knowledge graphs contain fewer than five edges\. We first manually annotate 59 real\-world CTI articles and retain 49 filtered GRID articles \(avg\. 1,102 tokens and 15\.35 edges\)\. We then apply the same ground\-truth\-graph\-size filter to four external sources and retain 50 articles each from CASIE \(avg\. 537 tokens and 7\.94 edges\), CTINexus \(avg\. 191 tokens and 11\.80 edges\), MalKG \(avg\. 6,632 tokens and 48\.90 edges\), and SecureNLP \(avg\. 11,000 tokens and 68\.66 edges\) Satyapanich et al\. \([2020](https://arxiv.org/html/2605.16714#bib.bib53)\); Cheng et al\. \([2025](https://arxiv.org/html/2605.16714#bib.bib7)\); Rastogi et al\. \([2021](https://arxiv.org/html/2605.16714#bib.bib50)\); Phandi et al\. \([2018](https://arxiv.org/html/2605.16714#bib.bib46)\)\.

Implementation\. We implement training with VERL Sheng et al\. \([2024](https://arxiv.org/html/2605.16714#bib.bib57)\) and adopt Qwen3\-4B\-Instruct\-2507 Qwen Team \([2025a](https://arxiv.org/html/2605.16714#bib.bib47)\) as the base extractor model\. This 4B backbone is strong enough for post\-training while still keeping repeated SFT/RL ablations affordable\. The rationale for this backbone choice is further discussed in Section [4](https://arxiv.org/html/2605.16714#S4)\. Additional implementation details are deferred to Appendix [E](https://arxiv.org/html/2605.16714#A5)\.

Training Setup\. For the two main RQ1 systems, the primary Task\-bank Reward model is trained on 800 CTI articles, whereas the secondary End2End Reward model is trained on 1,000 CTI articles\. Task\-bank Reward and End2End Reward use training batch sizes of 16 and 64, respectively, with rollout counts of 4 and 8\. During training\-data construction and the End2End training\-side reward computation, we use GPT\-5\.3 Codex Medium OpenAI \([2026a](https://arxiv.org/html/2605.16714#bib.bib44)\)\. At test time, all models receive the full article and are evaluated on full\-document extraction\. Additional training and testing details are summarized in Appendix [E](https://arxiv.org/html/2605.16714#A5)\.

LLM Judge Calibration\. Because exact string matching would unfairly penalize many semantically correct but surface\-divergent extractions, we adopt the indexed edge\-wise evaluation protocol described in Section[2\.6](https://arxiv.org/html/2605.16714#S2.SS6)\. We first fixed the GPT\-5\.4 mini judge configuration, with reasoning effort set to medium, temperature 0\.1, and a maximum output budget of 65,536 tokens\. We then manually inspected pilot judge outputs and iteratively refined the prompts by adding more detailed decision rules and representative examples\. After freezing the final prompt set, we calibrated the judge against human judgments on 378 manually reviewed audit items from three human reviewers\. The resulting agreement is 80\.6% for precision \(154/191\) and 91\.4% for recall \(171/187\), yielding an overall agreement of 86\.0% \(325/378\) with human annotations; detailed calibration slices and confusion matrices are deferred to Appendix[C\.1](https://arxiv.org/html/2605.16714#A3.SS1)\. The same final judge configuration and prompt set were then used for all RQs\.
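The calibration figures above are internally consistent, which the following arithmetic check makes explicit\. The counts are taken directly from the text; the variable and function names are illustrative\.

```python
# (agreeing items, audited items) per judge prompt, as reported in the text.
counts = {"precision": (154, 191), "recall": (171, 187)}


def agreement_pct(agree: int, total: int) -> float:
    """Agreement rate as a percentage rounded to one decimal place."""
    return round(100 * agree / total, 1)


per_prompt = {name: agreement_pct(*pair) for name, pair in counts.items()}
total_agree = sum(agree for agree, _ in counts.values())
total_items = sum(total for _, total in counts.values())
overall = agreement_pct(total_agree, total_items)
```

The two prompt\-level counts sum to the 325 agreeing items out of 378, so the overall figure is a pooled rate over audit items rather than a mean of the two percentages\.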

Baselines\. For RQ1, we compare two GRID systems with eight representative baselines\. The two GRID models are the primary Task\-bank Reward system and the secondary End2End Reward system\. The eight external baselines are CTINexus Cheng et al\. \([2025](https://arxiv.org/html/2605.16714#bib.bib7)\), CTIKG Huang & Xiao \([2024](https://arxiv.org/html/2605.16714#bib.bib22)\), Cognee Cognee Contributors \([2025](https://arxiv.org/html/2605.16714#bib.bib10)\), LLM\-CAKG Wang et al\. \([2025](https://arxiv.org/html/2605.16714#bib.bib64)\), Graphiti Zep \([2025](https://arxiv.org/html/2605.16714#bib.bib69)\), GraphRAG Microsoft \([2024a](https://arxiv.org/html/2605.16714#bib.bib36)\), KnowGL Rossiello et al\. \([2023](https://arxiv.org/html/2605.16714#bib.bib51)\), and AttacKG\+ Zhang et al\. \([2025](https://arxiv.org/html/2605.16714#bib.bib70)\)\. All LLM\-based systems are instantiated with the same base model as GRID, namely Qwen3\-4B\-Instruct\-2507 Qwen Team \([2025a](https://arxiv.org/html/2605.16714#bib.bib47)\), with temperature 0\.7 and a maximum generation budget of 32,768 tokens\. We additionally surveyed LLM\-TIKG Hu et al\. \([2024](https://arxiv.org/html/2605.16714#bib.bib21)\), CTI\-Thinker Yang et al\. \([2026](https://arxiv.org/html/2605.16714#bib.bib66)\), CodeKGC Bi et al\. \([2024](https://arxiv.org/html/2605.16714#bib.bib5)\), SecTKG Sun et al\. \([2023b](https://arxiv.org/html/2605.16714#bib.bib59)\), and the related resources CS13K Li et al\. \([2024](https://arxiv.org/html/2605.16714#bib.bib26)\), HRTC Yue et al\. \([2024](https://arxiv.org/html/2605.16714#bib.bib68)\), and BVTED Liu et al\. \([2024](https://arxiv.org/html/2605.16714#bib.bib30)\), but did not include them in this work because of incomplete public implementations, unavailable source data, or access restrictions\. Among non\-LLM methods, KnowGL is the only one retained in the main table; REBEL Huguet Cabot & Navigli \([2021](https://arxiv.org/html/2605.16714#bib.bib23)\) and EXTRACTOR Satvat et al\. \([2021](https://arxiv.org/html/2605.16714#bib.bib52)\) are not used because their performance is clearly lower\.

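### 3\.2 RQ1: Comparison with Baselines

Table 3: Per\-source precision, recall, and F1 \(%\)\. Avg columns are arithmetic means over the five sources

| Method | CASIE P | CASIE R | CASIE F1 | CTINexus P | CTINexus R | CTINexus F1 | GRID P | GRID R | GRID F1 | MalKG P | MalKG R | MalKG F1 | SecureNLP P | SecureNLP R | SecureNLP F1 | Avg P | Avg R | Avg F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GRID \(Task\-bank\) | 81\.80 | 77\.68 | 77\.22 | 86\.69 | 81\.10 | 82\.58 | 84\.18 | 78\.11 | 79\.62 | 84\.43 | 38\.55 | 44\.66 | 85\.97 | 49\.10 | 58\.58 | 84\.62 | 64\.91 | 68\.53 |
| GRID \(End2End\) | 80\.88 | 66\.98 | 70\.76 | 80\.30 | 76\.67 | 76\.83 | 78\.71 | 64\.93 | 67\.35 | 74\.94 | 24\.87 | 31\.61 | 69\.71 | 35\.81 | 43\.77 | 76\.91 | 53\.85 | 58\.06 |
| CTINexus | 83\.44 | 61\.41 | 67\.49 | 86\.75 | 91\.02 | 87\.83 | 83\.62 | 71\.64 | 75\.91 | 88\.71 | 39\.42 | 48\.24 | 85\.07 | 53\.88 | 63\.80 | 85\.52 | 63\.47 | 68\.66 |
| Cognee | 48\.87 | 43\.75 | 39\.79 | 63\.98 | 64\.58 | 61\.80 | 57\.13 | 52\.01 | 50\.24 | 63\.74 | 27\.73 | 30\.33 | 68\.04 | 48\.74 | 52\.54 | 60\.35 | 47\.36 | 46\.94 |
| LLM\-CAKG | 85\.39 | 52\.76 | 58\.81 | 79\.90 | 55\.63 | 63\.41 | 80\.65 | 69\.61 | 72\.60 | 78\.49 | 38\.27 | 46\.70 | 80\.35 | 65\.16 | 68\.70 | 80\.96 | 56\.29 | 62\.04 |
| Graphiti | 70\.50 | 24\.39 | 32\.40 | 70\.87 | 35\.05 | 44\.39 | 69\.97 | 46\.13 | 51\.50 | 74\.12 | 32\.14 | 39\.02 | 71\.78 | 33\.88 | 43\.35 | 71\.45 | 34\.32 | 42\.13 |
| CTIKG | 87\.41 | 34\.51 | 44\.57 | 82\.67 | 40\.82 | 50\.25 | 84\.28 | 43\.06 | 52\.30 | 79\.20 | 18\.91 | 26\.29 | 84\.12 | 40\.61 | 51\.12 | 83\.54 | 35\.58 | 44\.91 |
| GraphRAG | 90\.30 | 12\.43 | 16\.69 | 87\.09 | 33\.79 | 43\.81 | 77\.28 | 35\.70 | 44\.07 | 88\.10 | 21\.32 | 30\.25 | 84\.93 | 20\.26 | 25\.62 | 85\.54 | 24\.70 | 32\.09 |
| AttacKG\+ | 26\.35 | 19\.54 | 16\.67 | 25\.65 | 36\.27 | 26\.55 | 26\.50 | 25\.66 | 20\.06 | 48\.22 | 13\.64 | 15\.04 | 48\.47 | 44\.01 | 42\.95 | 35\.04 | 27\.82 | 24\.25 |
| KnowGL | 25\.83 | 0\.40 | 0\.53 | 27\.34 | 1\.42 | 2\.35 | 23\.71 | 1\.93 | 2\.20 | 28\.94 | 2\.05 | 2\.85 | 28\.80 | 5\.58 | 6\.89 | 26\.93 | 2\.28 | 2\.96 |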

Table [3](https://arxiv.org/html/2605.16714#S3.T3) reports per\-source precision, recall, and F1 on the five sources\. In the Avg columns, precision and recall are arithmetic means across the five sources, and Avg F1 is the arithmetic mean of the five per\-source F1 values\. Under this aggregation, Task\-bank achieves the best source\-averaged recall at 64\.91%\. Because recall measures how much ground\-truth threat knowledge is recovered, CTINexus’s lower average recall \(63\.47%\) means that it misses more ground\-truth threat knowledge overall\. CTINexus attains a nearly identical Avg F1 \(68\.66% vs\. 68\.53%\) and slightly higher Avg precision \(85\.52% vs\. 84\.62%\), but its strongest result appears on the CTINexus source itself, which is also the shortest source in our suite at only 191 tokens on average\. By contrast, if we exclude the CTINexus source itself and re\-average only over the other four sources, Task\-bank has higher average recall and Avg F1\. The efficiency gap also matters: in its standard design, CTINexus uses five inference steps, whereas GRID uses a lower\-cost fixed two\-prompt pipeline\. As a result, GRID’s inference token cost is roughly 40% of that of the similarly strong CTINexus pipeline Cheng et al\. \([2025](https://arxiv.org/html/2605.16714#bib.bib7)\)\. End2End Reward reaches 58\.06% Avg F1\. GraphRAG still attains the highest Avg precision \(85\.54%\), but its Avg F1 drops to 32\.09% because recall collapses\. Across the five sources, both GRID variants remain substantially stronger than almost all other baselines\. In particular, Task\-bank achieves the best recall and F1 on both the GRID source \(78\.11%, 79\.62%\) and CASIE \(77\.68%, 77\.22%\), and the best precision on SecureNLP \(85\.97%\)\.
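The source\-averaged figures and the four\-source comparison can be reproduced from the per\-source entries of Table 3\. The sketch below copies the rounded table values for Task\-bank and CTINexus, so recomputed averages may differ from the reported ones in the last digit; the helper name `source_avg` is illustrative\.

```python
from typing import Dict, Iterable, Tuple

Scores = Dict[str, Tuple[float, float, float]]  # source -> (P, R, F1)

TASK_BANK: Scores = {
    "CASIE": (81.80, 77.68, 77.22),
    "CTINexus": (86.69, 81.10, 82.58),
    "GRID": (84.18, 78.11, 79.62),
    "MalKG": (84.43, 38.55, 44.66),
    "SecureNLP": (85.97, 49.10, 58.58),
}
CTINEXUS: Scores = {
    "CASIE": (83.44, 61.41, 67.49),
    "CTINexus": (86.75, 91.02, 87.83),
    "GRID": (83.62, 71.64, 75.91),
    "MalKG": (88.71, 39.42, 48.24),
    "SecureNLP": (85.07, 53.88, 63.80),
}
METRIC_INDEX = {"P": 0, "R": 1, "F1": 2}


def source_avg(rows: Scores, metric: str, exclude: Iterable[str] = ()) -> float:
    """Arithmetic mean of one metric over sources, optionally leaving some out."""
    skipped = set(exclude)
    idx = METRIC_INDEX[metric]
    values = [scores[idx] for source, scores in rows.items() if source not in skipped]
    return sum(values) / len(values)


# Re-average over the four sources other than CTINexus itself.
tb_recall_4 = source_avg(TASK_BANK, "R", exclude=["CTINexus"])
cn_recall_4 = source_avg(CTINEXUS, "R", exclude=["CTINexus"])
tb_f1_4 = source_avg(TASK_BANK, "F1", exclude=["CTINexus"])
cn_f1_4 = source_avg(CTINEXUS, "F1", exclude=["CTINexus"])
```

With these rounded entries, the four\-source means come out to roughly 60\.86 versus 56\.59 for recall and 65\.02 versus 63\.86 for F1, in favor of Task\-bank, which matches the direction of the claim in the text\.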

### 3.3 RQ2: Post-Training Variants in the GRID Framework

Table 4: Post-training variants in GRID across five sources

| Post-training design | Precision (%) | Recall (%) | F1 (%) |
| --- | --- | --- | --- |
| Task-bank | 84.62 | 64.91 | 68.53 |
| End2End | 76.91 | 53.85 | 58.06 |
| Choice-only | 72.03 | 33.73 | 40.56 |
| End2End SFT without RL | 69.75 | 30.48 | 37.13 |
| No post-training | 73.99 | 27.21 | 35.44 |

Table [4](https://arxiv.org/html/2605.16714#S3.T4) compares five post-training variants in the GRID framework. Here, precision and recall denote five-source averages, and F1 denotes the arithmetic mean of the five per-source F1 values. The first two rows are the two main GRID systems already shown in RQ1: the primary Task-bank Reward model and the secondary End2End Reward model. The remaining three rows are weaker variants built around them: Choice-only Reward without the regex branch, End2End SFT without RL, and the original No post-training model. These ablations further validate Task-bank Reward as an effective reward design. It remains the strongest setting within GRID at 68.53% Avg F1 and 64.91% source-averaged recall, outperforming End2End Reward at 58.06% Avg F1 as well as Choice-only Reward, End2End SFT without RL, and No post-training at 40.56%, 37.13%, and 35.44%, respectively. In particular, compared with Choice-only Reward, Task-bank Reward improves source-averaged recall by 31.18 points, showing that the regex branch effectively boosts recall.

Although End2End Reward remains strong, it still trails Task-bank Reward by 11.06 recall points and 10.47 F1 points while incurring much higher online judge cost. Task-bank's one-time offline cost is about $60 in total, including about $33 for multi-select question generation and $27 for regex-target generation; the resulting supervision bank can then be reused across later SFT and RL runs. End2End Reward incurs about $942 of online LLM-as-judge cost, and this cost grows further as the number of RL steps increases. The concrete End2End training-cost calculation is summarized in Appendix [E](https://arxiv.org/html/2605.16714#A5).

### 3.4 RQ3: Ablation of Article Rewriting and Article-Complexity-Ordered Training

RQ3 studies two design choices in the Task-bank Reward pipeline: article rewriting and article-complexity-ordered training. We compare three variants under matched RL settings with the same original Qwen3-4B-Instruct-2507 model, 500 training articles, supervision tasks, reward function, and optimization hyperparameters. Evaluation uses a fixed 25-article subset (five per source; seed 42). The variants differ only in whether supervision is built from revised or raw article–KG pairs and whether article blocks follow article-complexity order or random order. We report the training reward together with the small-evaluation-set score, defined as the average of precision and recall. The three variants are:

- Full setting: revised-article supervision with article-complexity ordering.
- w/o article-complexity ordering: the same supervision, but random article-block order.
- w/o article rewriting: supervision regenerated from raw article–KG pairs, with article-complexity ordering retained.

Table 5: RQ3 ablation under a shared 225-step training budget

| Setting | Train reward | Test score (P+R)/2 |
| --- | --- | --- |
| Full setting | 0.7917 | 0.6641 |
| w/o article rewriting | 0.6265 | 0.6371 |
| w/o article-complexity ordering | 0.3906 | 0.5025 |

![Refer to caption](https://arxiv.org/html/2605.16714v1/fig/rq3.png)

Figure 3: RQ3 ablation curves up to 225 steps. Left: RL training reward (first-order EMA, smoothing weight 0.6). Right: test score (P+R)/2 on the fixed 25-article RQ3 set.

At the shared 225-step budget, the full setting reaches the best test score (P+R)/2 of 0.6641, compared with 0.6371 for w/o article rewriting and 0.5025 for w/o article-complexity ordering. This shows that, without manual supervision, both second-pass article revision and article-complexity-ordered training improve RL reward and test-set precision and recall.

## 4 Discussion

We did not train an agent-loop extractor because agent-loop RL requires a more complex reward interface than GRID, and current open-source stacks remain immature for stable optimization. We therefore focus on a non-agent extractor in this work.

We chose Qwen3-4B-Instruct-2507 Qwen Team ([2025b](https://arxiv.org/html/2605.16714#bib.bib48)) for two reasons. First, full-parameter post-training of 8B and 14B models is much slower on our 4× RTX 6000 Ada setup and cannot sustain prompt and response lengths as long as those used for the 4B model. Second, a recent fine-tuning report ranks it ahead of Llama-3.1-8B-Instruct and Llama-3.2-3B-Instruct. Notably, the latest publicly released small Llama checkpoint still dates back to September 2024 distil labs ([2025](https://arxiv.org/html/2605.16714#bib.bib15)); Meta ([2024a](https://arxiv.org/html/2605.16714#bib.bib33); [b](https://arxiv.org/html/2605.16714#bib.bib34)).

## 5 Related Work

Security knowledge graph extraction. Methods range from supervised pipelines such as EXTRACTOR Satvat et al. ([2021](https://arxiv.org/html/2605.16714#bib.bib52)) to LLM-based systems such as CTIKG Huang & Xiao ([2024](https://arxiv.org/html/2605.16714#bib.bib22)), CTINexus Cheng et al. ([2025](https://arxiv.org/html/2605.16714#bib.bib7)), CTI-Thinker Yang et al. ([2026](https://arxiv.org/html/2605.16714#bib.bib66)), LLM-TIKG Hu et al. ([2024](https://arxiv.org/html/2605.16714#bib.bib21)), LLM-CAKG Wang et al. ([2025](https://arxiv.org/html/2605.16714#bib.bib64)), and AttacKG+ Zhang et al. ([2025](https://arxiv.org/html/2605.16714#bib.bib70)). Most optimize only one stage of the pipeline and rely on general-purpose large commercial LLMs instead of economical small open-source models. In contrast, GRID learns from large volumes of security text to equip the extractor itself with security-domain knowledge.

RL-based relation extraction. R1-RE Dai et al. ([2025](https://arxiv.org/html/2605.16714#bib.bib14)) introduces RLVR into relation extraction, but its main task is relation classification over given entity pairs rather than open triplet extraction or complete article-level knowledge graph extraction from full security reports.

## 6 Conclusion

We presented GRID, a CTI knowledge graph extraction framework combining article–graph alignment, task-bank post-training, and ontology-guided inference. On a unified CTI test set spanning five sources, the Task-bank Reward model with GRID inference achieved the best source-averaged recall and a near-tied top Avg F1 at about 40% of the inference tokens of the similarly strong CTINexus pipeline; RQ2/RQ3 further showed reusable task-bank supervision and gains from KG-conditioned article rewriting plus article-complexity ordering.

## References

- \(1\)Virustotal \- free online virus, malware and url scanner\.[https://www\.virustotal\.com/en/](https://www.virustotal.com/en/)\.
- cyb \(2021\)Cyber kill chain, 2021\.https://www\.lockheedmartin\.com/en\-us/capabilities/cyber/cyber\-kill\-chain\.html\.
- dep \(2021\)DepImpact Project Website, 2021\.https://github\.com/usenixsub/DepImpact\.
- gri \(2026\)Projectgrid repository, 2026\.https://github\.com/anonymousauthorname/ProjectGRID\.
- Bi et al\. \(2024\)Zhen Bi, Jing Chen, Yinuo Jiang, Feiyu Xiong, Wei Guo, Huajun Chen, and Ningyu Zhang\.Codekgc: Code language model for generative knowledge graph construction\.*ACM Transactions on Asian and Low\-Resource Language Information Processing*, 23\(3\):1–16, 2024\.
- Catakoglu et al\. \(2016\)Onur Catakoglu, Marco Balduzzi, and Davide Balzarotti\.Automatic extraction of indicators of compromise for web applications\.In*Proceedings of the 25th international conference on world wide web*, pp\. 333–343, 2016\.
- Cheng et al\. \(2025\)Yutong Cheng, Osama Bajaber, Saimon Amanuel Tsegai, Dawn Song, and Peng Gao\.Ctinexus: Automatic cyber threat intelligence knowledge graph construction using large language models\.In*2025 IEEE 10th European Symposium on Security and Privacy \(EuroS&P\)*, pp\. 923–938\. IEEE, 2025\.
- Chhikara et al\. \(2025\)Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav\.Mem0: Building production\-ready ai agents with scalable long\-term memory\.*arXiv preprint arXiv:2504\.19413*, 2025\.
- Cisco \(2024\)Cisco\.Common types of cyber attacks, 2024\.URL[https://www\.cisco\.com/c/en/us/products/security/common\-cyberattacks\.html](https://www.cisco.com/c/en/us/products/security/common-cyberattacks.html)\.
- Cognee Contributors \(2025\)Cognee Contributors\.Cognee, 2025\.URL[https://github\.com/topoteretes/cognee](https://github.com/topoteretes/cognee)\.GitHub repository\.
- Conference on Language Modeling \(2026\)Conference on Language Modeling\.COLM 2026: Call for papers\.[https://colmweb\.org/cfp\.html](https://colmweb.org/cfp.html), 2026\.Accessed: 2026\-03\-29\.
- Corporation \(2022\)The MITRE Corporation\.Mitre att&ck, 2022\.https://attack\.mitre\.org/\.
- CrowdStrike \(2024\)CrowdStrike\.Crowdstrike 2024 global threat report, 2024\.URL[https://www\.crowdstrike\.com/global\-threat\-report/](https://www.crowdstrike.com/global-threat-report/)\.
- Dai et al\. \(2025\)Runpeng Dai, Tong Zheng, Run Yang, and Hongtu Zhu\.R1\-re: Cross\-domain relationship extraction with rlvr\.*arXiv e\-prints*, pp\. arXiv–2507, 2025\.
- distil labs \(2025\)distil labs\.We benchmarked 12 small language models across 8 tasks to find the best base model for fine\-tuning\.[https://www\.distillabs\.ai/blog/](https://www.distillabs.ai/blog/), December 2025\.Published: 2025\-12\-10; accessed: 2026\-03\-29\.
- Dong et al\. \(2019\)Ying Dong, Wenbo Guo, Yueqi Chen, Xinyu Xing, Yuqing Zhang, and Gang Wang\.Towards the detection of inconsistencies in public security vulnerability reports\.In*28th USENIX security symposium \(USENIX Security 19\)*, pp\. 869–885, 2019\.
- Gao et al\. \(2025\)Chang Gao, Chujie Zheng, Xiong\-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, and Junyang Lin\.Soft adaptive policy optimization\.*arXiv preprint arXiv:2511\.20347*, 2025\.
- Gao et al\. \(2018\)Peng Gao, Xusheng Xiao, Zhichun Li, Fengyuan Xu, Sanjeev R\. Kulkarni, and Prateek Mittal\.AIQL: Enabling efficient attack investigation from system monitoring data\.In*USENIX Annual Technical Conference \(ATC\)*, pp\. 113–126, 2018\.
- Gao et al\. \(2022\)Peng Gao, Xiaoyuan Liu, Edward Choi, Sibo Ma, Xinyu Yang, Zhengjie Ji, Zilin Zhang, and Dawn Song\.Threatkg: A threat knowledge graph for automated open\-source cyber threat intelligence gathering and management\.*arXiv preprint arXiv:2212\.10388*, 2022\.
- He et al\. \(2024\)Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi\.G\-retriever: Retrieval\-augmented generation for textual graph understanding and question answering\.*Advances in Neural Information Processing Systems*, 37:132876–132907, 2024\.
- Hu et al\. \(2024\)Yuelin Hu, Futai Zou, Jiajia Han, Xin Sun, and Yilei Wang\.Llm\-tikg: Threat intelligence knowledge graph construction utilizing large language model\.*Computers & Security*, 145:103999, 2024\.
- Huang & Xiao \(2024\)Liangyi Huang and Xusheng Xiao\.Ctikg: Llm\-powered knowledge graph construction from cyber threat intelligence\.In*First Conference on Language Modeling*, 2024\.
- Huguet Cabot & Navigli \(2021\)Pere\-Lluís Huguet Cabot and Roberto Navigli\.REBEL: Relation extraction by end\-to\-end language generation\.In*Findings of the Association for Computational Linguistics: EMNLP 2021*, pp\. 2370–2381, Punta Cana, Dominican Republic, November 2021\. Association for Computational Linguistics\.URL[https://aclanthology\.org/2021\.findings\-emnlp\.204](https://aclanthology.org/2021.findings-emnlp.204)\.
- Hutchins et al\. \(2011\)Eric M Hutchins, Michael J Cloppert, and Rohan M Amin\.Intelligence\-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains\.*Leading Issues in Information Warfare & Security Research*, 1:80, 2011\.
- King & Chen \(2003\)Samuel T\. King and Peter M\. Chen\.Backtracking intrusions\.In*ACM Symposium on Operating systems principles \(SOSP\)*, pp\. 223–236\. ACM, 2003\.
- Li et al\. \(2024\)Hongyi Li, Ze Shi, Chengwei Pan, Di Zhao, and Nan Sun\.Cybersecurity knowledge graphs construction and quality assessment\.*Complex & Intelligent Systems*, 10\(1\):1201–1217, 2024\.
- Li et al\. \(2022\)Zhenyuan Li, Jun Zeng, Yan Chen, and Zhenkai Liang\.Attackg: Constructing technique knowledge graph from cyber threat intelligence reports\.In*European Symposium on Research in Computer Security*, pp\. 589–609\. Springer, 2022\.
- Liao et al\. \(2016\)Xiaojing Liao, Kan Yuan, XiaoFeng Wang, Zhou Li, Luyi Xing, and Raheem Beyah\.Acing the ioc game: Toward automatic discovery and analysis of open\-source cyber threat intelligence\.In*Proceedings of the 2016 ACM SIGSAC conference on computer and communications security*, pp\. 755–766, 2016\.
- Lim et al\. \(2017\)Swee Kiat Lim, Aldrian Obaja Muis, Wei Lu, and Chen Hui Ong\.Malwaretextdb: A database for annotated malware articles\.In*Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 1557–1567\. Association for Computational Linguistics, 2017\.doi:10\.18653/v1/P17\-1143\.URL[https://aclanthology\.org/P17\-1143/](https://aclanthology.org/P17-1143/)\.
- Liu et al\. \(2024\)Kai Liu, Yi Wang, Zhaoyun Ding, Aiping Li, and Weiming Zhang\.Bvted: A specialized bilingual \(chinese–english\) dataset for vulnerability triple extraction tasks\.*Applied Sciences*, 14\(16\):7310, 2024\.
- Mandiant \(2024\)Mandiant\.Global perspectives on threat intelligence, 2024\.URL[https://assets\.starlinkme\.net/gitex\-vendor\-assets/mandiant/Global%20Perspectives%20on%20Threat%20Intelligence\.pdf](https://assets.starlinkme.net/gitex-vendor-assets/mandiant/Global%20Perspectives%20on%20Threat%20Intelligence.pdf)\.
- McMillan \(2013\)Rob McMillan\.Open threat intelligence, 2013\.https://www\.gartner\.com/doc/2487216/definition\-threat\-intelligence\.
- Meta \(2024a\)Meta\.Llama\-3\.1\-8b\-instruct\.[https://huggingface\.co/meta\-llama/Llama\-3\.1\-8B\-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), July 2024a\.Released: 2024\-07\-23; accessed: 2026\-03\-29\.
- Meta \(2024b\)Meta\.Llama\-3\.2\-3b\.[https://huggingface\.co/meta\-llama/Llama\-3\.2\-3B](https://huggingface.co/meta-llama/Llama-3.2-3B), September 2024b\.Released: 2024\-09\-25, accessed: 2026\-03\-29\.
- Micro \(2023\)Trend Micro\.Ransomware spotlight: Magniber, 2023\.URL[https://www\.trendmicro\.com/vinfo/us/security/news/ransomware\-spotlight/ransomware\-spotlight\-magniber](https://www.trendmicro.com/vinfo/us/security/news/ransomware-spotlight/ransomware-spotlight-magniber)\.
- Microsoft \(2024a\)Microsoft\.Graphrag, 2024a\.URL[https://github\.com/microsoft/graphrag](https://github.com/microsoft/graphrag)\.GitHub repository\.
- Microsoft \(2024b\)Microsoft\.What is a cyberattack?, 2024b\.URL[https://www\.microsoft\.com/en\-us/security/business/security\-101/what\-is\-a\-cyberattack](https://www.microsoft.com/en-us/security/business/security-101/what-is-a-cyberattack)\.
- Milajerdi et al\. \(2019\)Sadegh M\. Milajerdi, Birhanu Eshete, Rigel Gjomemo, and V\.N\. Venkatakrishnan\.Poirot: Aligning attack behavior with kernel audit records for cyber threat hunting\.In*ACM Conference on Computer and Communications Security \(CCS\)*, pp\. 1795–1812, 2019\.
- MITRE \(2020\)MITRE\.Common Vulnerabilities and Exposures \(CVE\), 2020\.https://cve\.mitre\.org/\.
- Obrst et al\. \(2012\)Leo Obrst, Penny Chase, and Richard Markeloff\.Developing an ontology of the cyber security domain\.In*STIDS*, pp\. 49–56\. Citeseer, 2012\.
- of Standards & Technology \(2021\)National Institute of Standards and Technology\.National vulnerability database \(nvd\), 2021\.https://nvd\.nist\.gov/\.
- Office of the National Cyber Director \(2024\)Office of the National Cyber Director\.2024 report on the cybersecurity posture of the united states, 2024\.URL[https://www\.whitehouse\.gov/wp\-content/uploads/2024/05/2024\-Report\-on\-the\-Cybersecurity\-Posture\-of\-the\-United\-States\.pdf](https://www.whitehouse.gov/wp-content/uploads/2024/05/2024-Report-on-the-Cybersecurity-Posture-of-the-United-States.pdf)\.
- OpenAI \(2023\)OpenAI\.Chatgpt: Applications, opportunities, and threats\.*arXiv preprint arXiv:2304\.09103*, 2023\.
- OpenAI \(2026a\)OpenAI\.Introducing gpt\-5\.3\-codex\.[https://openai\.com/index/introducing\-gpt\-5\-3\-codex/](https://openai.com/index/introducing-gpt-5-3-codex/), February 2026a\.February 5, 2026\.
- OpenAI \(2026b\)OpenAI\.Introducing gpt\-5\.4 mini and nano\.[https://openai\.com/index/introducing\-gpt\-5\-4\-mini\-and\-nano/](https://openai.com/index/introducing-gpt-5-4-mini-and-nano/), March 2026b\.March 17, 2026\.
- Phandi et al\. \(2018\)Peter Phandi, Amila Silva, and Wei Lu\.Semeval\-2018 task 8: Semantic extraction from cybersecurity reports using natural language processing \(SecureNLP\)\.In*Proceedings of the 12th International Workshop on Semantic Evaluation*, pp\. 697–706\. Association for Computational Linguistics, 2018\.doi:10\.18653/v1/S18\-1113\.URL[https://aclanthology\.org/S18\-1113/](https://aclanthology.org/S18-1113/)\.
- Qwen Team \(2025a\)Qwen Team\.Qwen3 technical report\.*arXiv preprint arXiv:2505\.09388*, 2025a\.doi:10\.48550/arXiv\.2505\.09388\.URL[https://arxiv\.org/abs/2505\.09388](https://arxiv.org/abs/2505.09388)\.
- Qwen Team \(2025b\)Qwen Team\.Qwen3\-4b\-instruct\-2507\.[https://huggingface\.co/Qwen/Qwen3\-4B\-Instruct\-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507), 2025b\.Accessed: 2026\-03\-29\.
- \(49\)Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef\.Zep: A temporal knowledge graph architecture for agent memory, 2025\.URL[https://arxiv\.org/abs/2501\.13956](https://arxiv.org/abs/2501.13956)\.
- Rastogi et al\. \(2021\)Nidhi Rastogi, Sharmishtha Dutta, Ryan Christian, Mohammad Zaki, Alex Gittens, and Charu Aggarwal\.Information prediction using knowledge graphs for contextual malware threat intelligence\.*arXiv preprint arXiv:2102\.05571*, 2021\.URL[https://arxiv\.org/abs/2102\.05571](https://arxiv.org/abs/2102.05571)\.
- Rossiello et al\. \(2023\)Gaetano Rossiello, Md Faisal Mahbub Chowdhury, Nandana Mihindukulasooriya, Owen Cornec, and Alfio Massimiliano Gliozzo\.Knowgl: Knowledge generation and linking from text\.In*Proceedings of the AAAI Conference on Artificial Intelligence*, volume 37, pp\. 16476–16478, 2023\.
- Satvat et al\. \(2021\)Kiavash Satvat, Rigel Gjomemo, and VN Venkatakrishnan\.Extractor: Extracting attack behavior from threat reports\.In*2021 IEEE European Symposium on Security and Privacy \(EuroS&P\)*, pp\. 598–615\. IEEE, 2021\.
- Satyapanich et al\. \(2020\)Taneeya Satyapanich, Francis Ferraro, and Tim Finin\.Casie: Extracting cybersecurity event information from text\.In*Proceedings of the AAAI conference on artificial intelligence*, volume 34, pp\. 8749–8757, 2020\.
- Securelist \(2024\)Securelist\.Threat categories, 2024\.URL[https://securelist\.com/threat\-categories/](https://securelist.com/threat-categories/)\.
- \(55\)Senki\.Open source threat intelligence feeds\.https://www\.senki\.org/operators\-security\-toolkit/open\-source\-threat\-intelligence\-feeds/\.
- Senki \(2016\)Senki\.Real\-time threat intelligence, 2016\.https://www\.recordedfuture\.com/\.
- Sheng et al\. \(2024\)Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu\.Hybridflow: A flexible and efficient rlhf framework\.*arXiv preprint arXiv: 2409\.19256*, 2024\.
- Sun et al\. \(2023a\)Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel M Ni, Heung\-Yeung Shum, and Jian Guo\.Think\-on\-graph: Deep and responsible reasoning of large language model on knowledge graph, 2024\.URL[https://arxiv\.org/abs/2307\.07697](https://arxiv.org/abs/2307.07697), 2023a\.
- Sun et al\. \(2023b\)Siqi Sun, Cheng Huang, Tiejun Wu, and Yi Shen\.Sectkg: A knowledge graph for open\-source security tools\.*International Journal of Intelligent Systems*, 2023\(1\):4464974, 2023b\.
- Times \(2014\)New York Times\.Target data breach incident, 2014\.http://www\.nytimes\.com/2014/02/27/business/target\-reports\-on\-fourth\-quarter\-earnings\.html?\_r=1\.
- Unit42 by Palo Alto Networks \(2017\)Unit42 by Palo Alto Networks\.How the eitest campaigns path to angler ek evolved over time, 2017\.URL[https://unit42\.paloaltonetworks\.com/unit42\-how\-the\-eltest\-campaigns\-path\-to\-angler\-ek\-evolved\-over\-time/](https://unit42.paloaltonetworks.com/unit42-how-the-eltest-campaigns-path-to-angler-ek-evolved-over-time/)\.
- Wagner et al\. \(2019\)Thomas D Wagner, Khaled Mahbub, Esther Palomar, and Ali E Abdallah\.Cyber threat intelligence sharing: Survey and research directions\.*Computers & Security*, 87:101589, 2019\.
- Wang et al\. \(2023\)Y\. Wang, Y\. Zhang, Y\. Li, and X\. Liu\.A bibliometric review of large language models research from 2017 to 2023\.*arXiv preprint arXiv:2304\.02020*, 2023\.
- Wang et al\. \(2025\)Zhihua Wang, Siyuan Fei, Youlin Hu, Dacheng Shan, Shitao Xiao, Lizhao You, and Peijun Chen\.Automated attack knowledge graph construction with large language models\.In*Proceedings of the 2025 2nd International Conference on Computer and Multimedia Technology*, pp\. 700–706, 2025\.
- Xu et al\. \(2022\)Zhiqiang Xu, Pengcheng Fang, Changlin Liu, Xusheng Xiao, Yu Wen, and Dan Meng\.Depcomm: Graph summarization on system audit logs for attack investigation\.In*2022 IEEE Symposium on Security and Privacy \(SP\)*, pp\. 540–557\. IEEE, 2022\.
- Yang et al\. \(2026\)Xiuzhang Yang, Ruijie Zhong, Yuling Chen, Guojun Peng, Di Yao, Chaofan Chen, Chenyang Wang, Dongni Zhang, Yilin Zhou, and Zixuan Yang\.Cti\-thinker: an llm\-driven system for cti knowledge graph construction and attack reasoning\.*Cybersecurity*, 9\(1\):106, 2026\.
- Yu et al\. \(2025\)Yao\-Ching Yu, Tsun\-Han Chiang, Cheng\-Wei Tsai, Chien\-Ming Huang, and Wen\-Kwang Tsao\.Primus: A pioneering collection of open\-source datasets for cybersecurity llm training\.*arXiv preprint arXiv:2502\.11191*, 2025\.URL[https://arxiv\.org/abs/2502\.11191](https://arxiv.org/abs/2502.11191)\.
- Yue et al\. \(2024\)HuanZhou Yue, XuRen Wang, Rong Chen, ZhengWei Jiang, YuXia Fu, and Jun Jiang\.Hrtc: A triplet joint extraction model based on cyber threat intelligence\.In*International Conference on Knowledge Science, Engineering and Management*, pp\. 214–223\. Springer, 2024\.
- Zep \(2025\)Zep\.Graphiti, 2025\.URL[https://github\.com/getzep/graphiti](https://github.com/getzep/graphiti)\.GitHub repository\.
- Zhang et al\. \(2025\)Yongheng Zhang, Tingwen Du, Yunshan Ma, Xiang Wang, Yi Xie, Guozheng Yang, Yuliang Lu, and Ee\-Chien Chang\.Attackg\+: Boosting attack graph construction with large language models\.*Computers & Security*, 150:104220, 2025\.

## Appendix A LLM Usage Disclosure

In line with the COLM 2026 policy on LLM use disclosure Conference on Language Modeling ([2026](https://arxiv.org/html/2605.16714#bib.bib11)), we disclose the non-minor LLM usage in this work's research pipeline. In training-data construction, we use GPT-5.3 Codex Medium OpenAI ([2026a](https://arxiv.org/html/2605.16714#bib.bib44)) to generate traceable article-to-KG alignments, perform KG-conditioned text revision, create four-option multi-select questions, generate triple-level regex targets, and compute the training-side precision/recall reward used by the End2End setting. In evaluation on the human-annotated benchmark articles, we use GPT-5.4 mini OpenAI ([2026b](https://arxiv.org/html/2605.16714#bib.bib45)) as an LLM-as-judge to measure the effectiveness of different methods by computing precision and recall for their generated knowledge graphs against human-annotated ground-truth graphs.

## Appendix B Security Ontology Inventory

The ontology used by GRID defines both the structured fields attached to graph elements and the controlled vocabularies used during extraction and validation. Each entity records name, type, alias, and parent entity. Each relation records sub, rel, rel_type, and obj. The predefined inventories below follow the GRID ontology used in this paper.
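As a rough illustration of these record shapes, the sketch below models an entity and a relation as Python dataclasses. The field names follow the text; everything else is an assumption made for illustration: the Python types (for example, alias as a list of strings), the reading of rel as a free-text surface relation and rel_type as the controlled label, the small vocabulary subsets, and the choice to coerce unknown labels to `other` instead of rejecting them.

```python
from dataclasses import dataclass, field
from typing import Optional

# Small illustrative subsets of the GRID entity and relation vocabularies;
# the full inventories are listed in the tables of this appendix.
ENTITY_TYPES = {"threat-actor-or-intrusion-set", "malware", "hacker-tool",
                "general-software", "vulnerability", "other"}
RELATION_TYPES = {"uses", "targets", "exploits", "delivers", "alias-of", "other"}


@dataclass
class Entity:
    name: str
    type: str
    alias: list[str] = field(default_factory=list)   # assumed: list of alternative names
    parent_entity: Optional[str] = None               # assumed: name of a parent entity

    def __post_init__(self):
        # Assumption: unknown labels fall back to "other" so extraction stays open-ended.
        if self.type not in ENTITY_TYPES:
            self.type = "other"


@dataclass
class Relation:
    sub: str        # subject entity name
    rel: str        # assumed: surface relation phrase from the article
    rel_type: str   # assumed: normalized label from the relation vocabulary
    obj: str        # object entity name

    def __post_init__(self):
        if self.rel_type not in RELATION_TYPES:
            self.rel_type = "other"


# Hypothetical example records.
actor = Entity(name="ExampleGroup", type="threat-actor-or-intrusion-set", alias=["EG"])
edge = Relation(sub="ExampleGroup", rel="leverages", rel_type="uses", obj="ExampleLoader")
print(actor, edge, sep="\n")
```

Falling back to `other` keeps the ontology a normalization aid instead of a hard filter, which matches the open-ended framing of the extraction task; a stricter pipeline could raise an error instead.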

Our ontology is inspired by STIX rather than being a direct copy\. During manual labeling, we gradually found that CTI articles repeatedly required several practical refinements beyond the raw STIX inventory\. The following types are introduced:

- Merged attacker/activity labels: threat-actor-or-intrusion-set
- Offensive tools vs. legitimate-but-abused software: hacker-tool, general-software
- Component-level types: detailed-part-of-malware-or-hackertool, detailed-part-of-general-software
- Analysis- or fallback-oriented types: security-product, malware-analysis-document-or-publication-or-conference, abstract-concept, generic-noun, noise

These types are introduced to guide the LLM toward deeper structural relations among entities in CTI articles, rather than to turn extraction into a closed label\-only task\. The extraction remains fundamentally open\-ended, with the ontology serving as a lightweight scaffold for reasoning and normalization\.

![Refer to caption](https://arxiv.org/html/2605.16714v1/figs/promptblock.png)

Figure 4: Prompt blocks for the GRID extractor

Table 6: Entity types in the GRID ontology

user-account, identity, threat-actor-or-intrusion-set, campaign, malware, hacker-tool, general-software, security-product, detailed-part-of-malware-or-hackertool, detailed-part-of-general-software, attack-pattern, vulnerability, file, process, windows-registry-key, course-of-action, url, domain-name, ipv4-addr, ipv6-addr, network-traffic, infrastructure, email-address, mac-address, indicator, malware-analysis-document-or-publication-or-conference, credential-value, x509-certificate, location, abstract-concept, generic-noun, other, noise

Table 7: Relation types in the GRID ontology

exploits, bypasses, malicious-investigates-track-detects, impersonates, targets, compromises, leads-to, drops, downloads, executes, delivers, beacons-to, exfiltrate-to, leaks, communicates-with, resolves-to, hosts, provides, authored-by, owns, controls, attributed-to, affiliated-with, cooperates-with, is-part-of, consists-of, has, depends-on, creates-or-generates, modifies-or-removes-or-replaces, uses, variant-of, derived-from, alias-of, compares-to, categorized-as, located-at, originates-from, indicates, mitigates, based-on, research-describes-analysis-of-characterizes-detects, negation, other

## Appendix C Judge Rules for Automatic Evaluation

The automatic evaluator scores precision and recall under an explicit rule set rather than unconstrained semantic similarity. Table [8](https://arxiv.org/html/2605.16714#A3.T8) summarizes the main judge rules.

Table 8: Judge rules in the automatic evaluator

| Rule | Meaning |
| --- | --- |
| Text-Provable Truth | A relation counts only if it can be supported by the article text itself. The judge must not use external world knowledge, domain defaults, or subject elevation to justify a match. |
| Subject-Preserving Chain Reasoning | Multi-hop reasoning is allowed only when the full chain is text-supported and the subject remains the same throughout the deduction. |
| General-Specific Equivalence | Coarser and finer phrasings may match when they refer to the same fact stated in the text rather than to different entities or events. |
| Action-Technique and Relation Normalization | A surface action, a normalized technique name, and a canonical relation label may match when they denote the same behavior in context. |
| Relation Hierarchy Tolerance | Parent-child relation gaps are tolerated when the article supports the specific behavior and therefore also licenses the coarser relation, or vice versa. |
| Attribute-as-Structure Matching | Structural facts may be represented either as edges or as entity attributes. In particular, alias and parent entity fields can satisfy corresponding ground-truth structural relations. |
| Alias and Hierarchy Equivalence | Alias forms and near-family variants can be treated as equivalent only when the text or entity attributes explicitly support that equivalence. |
| Indexed Edge-wise Auditing | Precision is judged edge by edge over predicted triples, while recall is judged edge by edge over ground-truth triples. For each audited edge, the judge must return its index, a binary decision, and either supporting text evidence or the matched counterpart predicted edge(s). |
| Malformed Extraction Rejection | Predictions whose subject or object is a pronoun, a full clause, or another non-entity span are rejected and cannot count as true positives. |

### C.1 Judge Calibration Against Human Annotations

To calibrate the indexed edge\-wise evaluator, we compare the LLM\-as\-judge against human judgments on 378 manually reviewed audit items labeled by three human reviewers, including 191 precision items and 187 recall items\. The reviewed set spans the methods evaluated in RQ1 and RQ2\.

The human interface records whether the annotator agrees or disagrees with the LLM judgment\. We therefore convert these decisions into the corresponding human precision labels \(TP vs\. FP\) and human recall labels \(TP vs\. FN\)\. For precision, agreeing with an LLM\-TP or disagreeing with an LLM\-FP implies a human TP, while agreeing with an LLM\-FP or disagreeing with an LLM\-TP implies a human FP\. For recall, the same conversion yields human TP and FN labels\.
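The conversion rule above can be written as a small function. This is a sketch of the stated logic only; the function name, argument names, and string labels are illustrative assumptions, not the interface of the authors' annotation tool.

```python
def human_label(llm_label: str, annotator_agrees: bool, task: str) -> str:
    """Map an agree/disagree decision onto the implied human label.

    task is "precision" (labels TP/FP) or "recall" (labels TP/FN).
    Agreement keeps the LLM label; disagreement flips it.
    """
    negative = {"precision": "FP", "recall": "FN"}[task]
    if llm_label not in ("TP", negative):
        raise ValueError(f"unexpected label {llm_label!r} for task {task!r}")
    if annotator_agrees:
        return llm_label
    return negative if llm_label == "TP" else "TP"


# The four precision cases described in the text.
for llm, agrees in [("TP", True), ("FP", False), ("FP", True), ("TP", False)]:
    print(llm, agrees, "->", human_label(llm, agrees, "precision"))
```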

Table 9: Judge–human confusion matrices. Rows are LLM labels and columns are human labels.

Precision calibration

| LLM label | Human TP | Human FP |
| --- | --- | --- |
| TP | 87.6% | 12.4% |
| FP | 27.9% | 72.1% |

Recall calibration

| LLM label | Human TP | Human FN |
| --- | --- | --- |
| TP | 78.6% | 21.4% |
| FN | 4.8% | 95.2% |

The resulting agreement is 80\.6% for precision \(154/191\) and 91\.4% for recall \(171/187\), yielding an overall agreement of 86\.0% \(325/378\)\.
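The agreement figures follow directly from the reported counts, as the short check below shows. The counts are taken from the sentence above; the variable names are illustrative.

```python
# (items where human and LLM judge agree, total reviewed items), as reported.
AGREEMENT_COUNTS = {"precision": (154, 191), "recall": (171, 187)}

rates = {task: agree / total for task, (agree, total) in AGREEMENT_COUNTS.items()}
total_agree = sum(agree for agree, _ in AGREEMENT_COUNTS.values())
total_items = sum(total for _, total in AGREEMENT_COUNTS.values())
overall = total_agree / total_items

for task, rate in rates.items():
    print(f"{task}: {rate:.1%}")
print(f"overall: {overall:.1%} ({total_agree}/{total_items})")
```

The overall rate pools all 378 items, so it is a count-weighted combination of the two per-task rates and not their simple average.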

Table 10: Per-source agreement rates between the LLM judge and three human reviewers.

| Source | Reviewer 1 precision agreement | Reviewer 1 recall agreement | Reviewer 2 precision agreement | Reviewer 2 recall agreement | Reviewer 3 precision agreement | Reviewer 3 recall agreement |
| --- | --- | --- | --- | --- | --- | --- |
| GRID | 100.0% | 81.8% | 77.8% | 88.9% | 62.5% | 100.0% |
| CASIE | 87.5% | 80.0% | 81.8% | 100.0% | 100.0% | 80.0% |
| CTINexus | 62.5% | 100.0% | 80.0% | 100.0% | 100.0% | 100.0% |
| MalKG | 70.8% | 100.0% | 77.3% | 88.9% | 84.6% | 92.9% |
| SecureNLP | 84.2% | 93.5% | 85.7% | 88.0% | 77.8% | 86.7% |

## Appendix D Baseline Adaptation Details

For these LLM-based baselines, we first locate and reproduce their public implementations from the corresponding papers, project pages, and code repositories, then apply only the engineering adaptations needed to run them under the same original Qwen3-4B-Instruct-2507 model and unified article-level benchmark protocol without changing each method's core extraction logic. For CTINexus and CTIKG, we retain their released multi-stage extraction workflows and normalize only the final outputs to our common article-level {entities, relations} JSON schema. For GraphRAG, we use the official Microsoft graph-extraction prompt and its zero-shot extraction loop. For Graphiti and Cognee, we preserve their released node/edge or cascade-extraction logic while adapting them to full-document chunked execution and article-level merging so that they can be evaluated on long CTI articles under a unified benchmark protocol. For LLM-CAKG, we keep its appendix-style prompts but adapt it to a full-document chunked processing pipeline. For AttacKG+, we retain the original rewrite-then-extract pipeline and adapt it to the same full-document article-level setting. For KnowGL, we evaluate the released HuggingFace model and parse its native decoded outputs into the same triple schema. Across prompt-based baselines, we additionally use a shared runtime wrapper for robust JSON repair and schema normalization, without changing each method's core extraction logic.
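To make the idea of a shared normalization wrapper concrete, here is a minimal hypothetical sketch of coercing a raw model response into the common {entities, relations} shape. It is not the authors' wrapper: the function name, the fence-stripping heuristic, and the empty-graph fallback are all assumptions, and a real repair step would likely handle more failure modes (truncated output, trailing commas, and so on).

```python
import json
import re

EMPTY_GRAPH = {"entities": [], "relations": []}


def normalize_output(raw: str) -> dict:
    """Best-effort coercion of a raw response into {"entities": [...], "relations": [...]}."""
    # Remove Markdown code-fence markers that models often wrap around JSON.
    text = re.sub(r"`{3}(?:json)?", "", raw)
    # Keep only the outermost {...} span, discarding surrounding chatter.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return dict(EMPTY_GRAPH)
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return dict(EMPTY_GRAPH)
    if not isinstance(data, dict):
        return dict(EMPTY_GRAPH)
    entities = data.get("entities")
    relations = data.get("relations")
    return {
        "entities": entities if isinstance(entities, list) else [],
        "relations": relations if isinstance(relations, list) else [],
    }


print(normalize_output('Result: {"entities": [{"name": "X"}], "relations": []}'))
print(normalize_output("no structured output here"))
```

Returning an empty graph on failure means a malformed response scores zero recall for that article instead of aborting the whole evaluation run.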

## Appendix E Additional Training and Testing Details

We collected CTI articles from public online CTI sources such as Mandiant Mandiant ([2024](https://arxiv.org/html/2605.16714#bib.bib31)), CrowdStrike CrowdStrike ([2024](https://arxiv.org/html/2605.16714#bib.bib13)), Trend Micro Micro ([2023](https://arxiv.org/html/2605.16714#bib.bib35)), Securelist Securelist ([2024](https://arxiv.org/html/2605.16714#bib.bib54)), Unit42 Unit42 by Palo Alto Networks ([2017](https://arxiv.org/html/2605.16714#bib.bib61)), Microsoft Security Microsoft ([2024b](https://arxiv.org/html/2605.16714#bib.bib37)), Cisco Talos Cisco ([2024](https://arxiv.org/html/2605.16714#bib.bib9)), and VirusTotal [Vir](https://arxiv.org/html/2605.16714#bib.bib1) to construct our CTI corpus, reserved the benchmark articles in this corpus for evaluation, and used the remaining corpus articles together with randomly sampled articles from the Primus dataset Yu et al. ([2025](https://arxiv.org/html/2605.16714#bib.bib67)) for training. For the primary Task-bank Reward setting, each training article contributes a supervision bank consisting of up to 20 four-option multi-select questions together with article-level regex targets aligned to the ground-truth KG edges. During VERL post-training, the two internal GRID extraction stages are written into a single ontology-guided extraction prompt and optimized as one generation task, rather than executed as two separate LLM calls during post-training. The 4096-token filter is applied only during training prompt construction; at test time, models receive the full article and are evaluated on full-document extraction.
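The two kinds of scripted targets in the supervision bank can be scored without an LLM judge. The sketch below is hypothetical and only illustrates the general idea: exact-match scoring of a multi-select answer and the fraction of regex targets matched by serialized predicted triples. The function names, the triple serialization, the example patterns, and the absence of any weighting between the two signals are assumptions, not the reward defined by the authors.

```python
import re


def multiselect_score(predicted: set[str], gold: set[str]) -> float:
    """Assumed scoring: 1.0 only if the selected option set matches exactly."""
    return 1.0 if predicted == gold else 0.0


def regex_target_recall(predicted_triples: list[str], patterns: list[str]) -> float:
    """Fraction of regex targets matched by at least one serialized predicted triple."""
    if not patterns:
        return 0.0
    hits = sum(
        any(re.search(pattern, triple, flags=re.IGNORECASE) for triple in predicted_triples)
        for pattern in patterns
    )
    return hits / len(patterns)


# Illustrative, made-up example (not from the GRID task bank).
triples = ["ExampleGroup | uses | ExampleLoader"]
targets = [r"examplegroup.*uses.*exampleloader", r"examplegroup.*targets"]
print(multiselect_score({"A", "C"}, {"A", "C"}))
print(regex_target_recall(triples, targets))
```

Because both checks are deterministic string operations, they can be computed once per rollout at negligible cost, which is consistent with the offline, reusable nature of the task bank described in RQ2.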

Among the external sources used in evaluation, SecureNLP refers to the SemEval-2018 Task 8 shared-task dataset, an extension of MalwareTextDB Lim et al. ([2017](https://arxiv.org/html/2605.16714#bib.bib29)); Phandi et al. ([2018](https://arxiv.org/html/2605.16714#bib.bib46)).

During RL training, the maximum response length is 4096 for Task\-bank Reward and 8192 for End2End Reward\. All experiments are run on a server with Ubuntu 20\.04\.6, an AMD 5955WX CPU, 256GB memory, and four Nvidia RTX 6000 Ada GPUs\.

For the End2End Reward setting discussed in RQ2, we use batch size 64 and rollout count 8 for 13 RL steps\. Under this configuration, the training\-side online LLM\-as\-judge reward computation costs about $942 in total\.
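A back-of-the-envelope breakdown of this cost is sketched below. It assumes that "rollout count 8" means eight rollouts per prompt, that every rollout is judged, and that judge cost scales linearly with the number of RL steps; none of these is stated explicitly, so the per-step and per-rollout figures are rough estimates. The $60 task-bank figure is the one-time offline cost reported in RQ2.

```python
BATCH_SIZE = 64                 # prompts per RL step (from the text)
ROLLOUTS_PER_PROMPT = 8         # assumed meaning of "rollout count 8"
RL_STEPS = 13
END2END_JUDGE_COST_USD = 942.0  # reported online LLM-as-judge cost
TASK_BANK_COST_USD = 33.0 + 27.0  # one-time question + regex-target generation

judged_rollouts = BATCH_SIZE * ROLLOUTS_PER_PROMPT * RL_STEPS
cost_per_step = END2END_JUDGE_COST_USD / RL_STEPS
cost_per_rollout = END2END_JUDGE_COST_USD / judged_rollouts


def projected_end2end_cost(steps: int) -> float:
    """Linear extrapolation of judge cost to a different number of RL steps (assumption)."""
    return cost_per_step * steps


print(f"judged rollouts: {judged_rollouts}")
print(f"approx. cost per step: ${cost_per_step:.2f}")
print(f"approx. cost per rollout: ${cost_per_rollout:.3f}")
print(f"task-bank one-time cost: ${TASK_BANK_COST_USD:.0f}")
```

Under these assumptions, the online judge cost exceeds the one-time task-bank cost within the first RL step, and the gap widens with every additional step.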

## Appendix F Reproducibility Statement

The data artifacts, code, prompts, and evaluation scripts used in this work are available in the anonymous project repository at[https://github\.com/anonymousauthorname/ProjectGRID](https://github.com/anonymousauthorname/ProjectGRID)\. The corresponding post\-trained model weights are available in an anonymous Hugging Face repository, and the access link is also provided in the GitHub repository\.
