GraphDiffMed: Knowledge-Constrained Differential Attention with Pharmacological Graph Priors for Medication Recommendation

arXiv cs.LG Papers

Summary

GraphDiffMed is a medication recommendation framework that uses dual-scale differential attention and pharmacological graph priors to improve recommendation quality and safety on EHR data. Experiments on MIMIC-III show consistent improvements over baselines.

arXiv:2605.20188v1 Announce Type: new

Cached at: 05/21/26, 06:20 AM

# GraphDiffMed: Knowledge-Constrained Differential Attention with Pharmacological Graph Priors for Medication Recommendation
Source: [https://arxiv.org/html/2605.20188](https://arxiv.org/html/2605.20188)
Kyushu Institute of Technology, Kitakyushu, Fukuoka, Japan
saxena.krati536@mail.kyutech.jp, tom@brain.kyutech.ac.jp

###### Abstract

Recommending safe and effective medication combinations from electronic health records \(EHRs\) is a core clinical AI problem, yet it remains difficult because patient trajectories are long, noisy, and clinically heterogeneous\. Existing methods typically excel at either temporal modeling across visits or pharmacological knowledge integration \(e\.g\., drug\-drug interactions, DDIs\), but rarely achieve both while robustly suppressing noise\. We present GraphDiffMed, a knowledge\-constrained medication recommendation framework built on dual\-scale Differential Attention v2\. Differential attention is applied at both intra\-visit and inter\-visit levels to filter spurious signals within encounters and across longitudinal history, while pharmacological constraints are incorporated during learning\. Experiments on MIMIC\-III and ablation studies show that this design consistently improves recommendation quality and ranking over strong baselines while achieving a more favorable safety performance balance\. We further find that the strongest\-performing configuration uses only demographic auxiliary features under our experimental setting\. Overall, GraphDiffMed demonstrates that combining noise\-aware attention with pharmacological constraints yields more reliable and clinically meaningful medication recommendation\. We open\-source our code at[https://github\.com/saxenakrati09/GraphDiffMed](https://github.com/saxenakrati09/GraphDiffMed)\.

## 1 Introduction

Medication recommendation is a central task in clinical decision support\. Its goal is to recommend an appropriate set of medications for the current visit using longitudinal electronic health records \(EHRs\), including diagnoses, procedures, laboratory findings, and prior prescriptions, while controlling the risk of harmful drug\-drug interactions \(DDIs\)\. The task is clinically important because modern care routinely involves polypharmacy, and clinicians must decide under time pressure, incomplete information, and evolving patient conditions\. As a result, medication recommendation has become a standard benchmark in healthcare AI, with mature datasets, widely adopted evaluation protocols, and many competing methods\.

Despite this maturity, the problem remains technically difficult\. EHR data are sparse and noisy, with missing values and documentation inconsistencies that can create spurious patterns\. Treatment effects unfold over time, so models must capture both local signals within a visit and long\-range dependencies across visits\. The output space is combinatorial because multiple medications are prescribed jointly, and safety cannot be separated from efficacy because of interaction risks\. Meanwhile, substantial pharmacological knowledge exists in structured resources such as DDI graphs and molecular relations, but integrating it effectively with data\-driven temporal modeling remains an open challenge\.

Existing approaches provide partial solutions. Graph/knowledge-informed methods \[[11](https://arxiv.org/html/2605.20188#bib.bib31),[7](https://arxiv.org/html/2605.20188#bib.bib6),[6](https://arxiv.org/html/2605.20188#bib.bib4)\] incorporate drug-interaction and/or molecular structure signals, but can still inherit biases from observational EHR data. Sequential and hierarchical EHR models \[[2](https://arxiv.org/html/2605.20188#bib.bib32),[19](https://arxiv.org/html/2605.20188#bib.bib21),[9](https://arxiv.org/html/2605.20188#bib.bib12),[14](https://arxiv.org/html/2605.20188#bib.bib15)\] capture longitudinal visit dynamics (sometimes with patient similarity/attention), yet usually do not explicitly inject structured pharmacological knowledge. Hybrid/multimodal methods \[[15](https://arxiv.org/html/2605.20188#bib.bib10),[1](https://arxiv.org/html/2605.20188#bib.bib25),[3](https://arxiv.org/html/2605.20188#bib.bib2)\] fuse multiple information sources to improve recommendation quality, but their attention is often largely data-driven rather than pharmacologically constrained. In parallel, LLM-distillation approaches \[[8](https://arxiv.org/html/2605.20188#bib.bib5)\] extract clinical semantics via prompting and distill them into smaller recommenders, providing an alternative route to external knowledge beyond explicit pharmacology graphs.

Conventional attention mechanisms often optimize predictive fit but do not reliably separate noisy co-occurrence patterns from clinically meaningful polypharmacy signals, potentially over-penalizing medication combinations that are clinically justified in complex cases. To address this quality-safety tension, we propose GraphDiffMed, a knowledge-constrained framework built on dual-scale Differential Attention v2 (DiffAttn\_v2). DiffAttn\_v2 is used at both intra-visit and inter-visit levels, and pharmacological constraints are incorporated as part of the full training framework.

Our contributions are as follows:

- We introduce the first dual-scale application of Differential Attention v2 for medication recommendation.
- We analyze how knowledge constraints affect the safety-performance balance under different modality settings.
- We provide a transparent decomposition of where gains originate through research question (RQ)-structured ablations.
- We offer a clinical interpretation of the observed DDI profile, showing that slightly higher absolute DDI rates relative to conservative baselines can reflect more complete recommendations for complex polypharmacy cases.

The remainder of this paper is organized as follows. Section [2](https://arxiv.org/html/2605.20188#S2) reviews related work in medication recommendation, attention mechanisms, and knowledge integration. Section [3](https://arxiv.org/html/2605.20188#S3) presents the GraphDiffMed architecture. Section [4](https://arxiv.org/html/2605.20188#S4) describes datasets, settings, and evaluation metrics. Section [5](https://arxiv.org/html/2605.20188#S5) reports main results and ablations. Section [6](https://arxiv.org/html/2605.20188#S6) discusses clinical implications, limitations, and conclusions.

## 2 Related Works

Medication recommendation models. Deep learning methods have steadily improved temporal and structural modeling for medication recommendation. Early sequential models such as RETAIN \[[2](https://arxiv.org/html/2605.20188#bib.bib32)\] introduced interpretable RNN attention for clinical sequences, and LEAP \[[19](https://arxiv.org/html/2605.20188#bib.bib21)\] cast prescribing as sequential decision making. To model complex interactions, graph-based methods including GAMENet \[[11](https://arxiv.org/html/2605.20188#bib.bib31)\] and SafeDrug \[[7](https://arxiv.org/html/2605.20188#bib.bib6)\] integrated external knowledge, explicitly representing drug-drug interaction (DDI) networks and molecular structure. Later architectures diversified: SHAPE \[[9](https://arxiv.org/html/2605.20188#bib.bib12)\], DAPSNet \[[14](https://arxiv.org/html/2605.20188#bib.bib15)\], and A-GSTCN \[[18](https://arxiv.org/html/2605.20188#bib.bib16)\] strengthened hierarchical temporal learning and patient similarity; CIDGMed \[[6](https://arxiv.org/html/2605.20188#bib.bib4)\] used causal inference to correct historical bias; and multimodal models such as PROMISE \[[15](https://arxiv.org/html/2605.20188#bib.bib10)\] and MIFNet \[[3](https://arxiv.org/html/2605.20188#bib.bib2)\] fused heterogeneous EHR modalities. More recently, LEADER \[[8](https://arxiv.org/html/2605.20188#bib.bib5)\] explored distilling clinical semantics from large language models. Despite strong empirical results, most methods still rely on largely data-driven temporal and structural attention, and even knowledge-aware models often enforce validated pharmacological rules only weakly.

Attention mechanisms and differential attention. This limitation motivates closer examination of attention design itself. In healthcare modeling, standard paradigms (self-attention, cross-attention, and multi-head attention \[[13](https://arxiv.org/html/2605.20188#bib.bib28)\]) are widely used to emphasize relevant patient history. Yet these unconstrained mechanisms often overfit noise and institution-specific co-prescription artifacts in EHRs. To improve robustness, Differential Attention \[[17](https://arxiv.org/html/2605.20188#bib.bib34)\] introduced subtractive noise cancellation. Differential Transformer v2 \[[16](https://arxiv.org/html/2605.20188#bib.bib26)\] extended this with query-dependent, per-token gating using a sigmoid-constrained $\lambda$ for fine-grained suppression. Although developed for natural language and vision, differential attention is now entering clinical modeling. Recently, DADA-MED \[[10](https://arxiv.org/html/2605.20188#bib.bib30)\] added laboratory events and applied foundational differential attention at the intra-visit level. However, vanilla differential attention remains agnostic to external pharmacological structure, leaving it vulnerable to learning clinically unsafe correlations in rare drug combinations.

Knowledge graph integration and DDI modeling. This remaining gap points to explicit pharmacological grounding. Integrating external pharmacological knowledge is a standard approach to improving recommendation safety. Databases such as DrugBank \[[5](https://arxiv.org/html/2605.20188#bib.bib35)\] and TWOSIDES \[[12](https://arxiv.org/html/2605.20188#bib.bib36)\] are commonly used through embedding pretraining, graph neural network (GNN) message passing, or structural attention biasing. To reduce adverse events, state-of-the-art models (e.g., GAMENet \[[11](https://arxiv.org/html/2605.20188#bib.bib31)\], PROMISE \[[15](https://arxiv.org/html/2605.20188#bib.bib10)\], and REFINE \[[1](https://arxiv.org/html/2605.20188#bib.bib25)\]) usually frame DDI reduction as regularization by adding post hoc loss penalties. This penalty-centric design creates a clinical tension: trading broad DDI minimization against preserving therapeutically necessary, closely monitored polypharmacy. GraphDiffMed addresses this broader gap with a dual-scale differential-attention backbone and pharmacological constraints.

## 3 Methodology

### 3.1 Problem Formulation

Let a patient's clinical history be represented as a sequence of visits: $\mathcal{P}=\{V_1,V_2,\ldots,V_{T-1}\}$, where $V_t$ denotes the $t$-th visit. Each visit $V_t$ contains:

- $D_t$: set of diagnoses (ICD codes)
- $P_t$: set of procedures (CPT/ICD-9 procedure codes)
- $M_t$: set of prescribed medications
- $L_t$: set of laboratory events (test ID, value pairs)
- $G_t$: patient gender (binary)
- $A_t$: patient age at visit $t$

Given patient visits up to time $T$, the medication recommendation task predicts the medication set $M_T$ for visit $T$. Following standard benchmark practice in prior work, the model uses non-medication modalities from visits $1\ldots T$ (diagnoses/procedures in the default setting, and optionally demographics/laboratory events in additional-modality experiments), while medication inputs are restricted to visits $1\ldots T-1$. Thus, $M_T$ is never used as input, and no visits after $T$ are used.
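The input/target split described above can be sketched as follows. This is a minimal illustration, not the released implementation: the `Visit` container and the `build_example` helper are hypothetical names introduced here, and only the default modalities (diagnoses, procedures, medications) are shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Visit:
    """Hypothetical container for one visit (default modalities only)."""
    diagnoses: Set[str]
    procedures: Set[str]
    medications: Set[str]


def build_example(visits: List[Visit]) -> Tuple[Dict[str, List[Set[str]]], Set[str]]:
    """Split visits 1..T into model inputs and the prediction target M_T."""
    if len(visits) < 2:
        # Single-visit patients are filtered out during preprocessing.
        raise ValueError("need at least two visits")
    inputs = {
        # Non-medication modalities use visits 1..T (current visit included).
        "diagnoses": [v.diagnoses for v in visits],
        "procedures": [v.procedures for v in visits],
        # Medication inputs stop at visit T-1, so M_T never leaks into the input.
        "medications": [v.medications for v in visits[:-1]],
    }
    target = visits[-1].medications
    return inputs, target
```

For a two-visit patient, the diagnosis and procedure inputs have length two, while the medication input contains only the first visit's prescriptions.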

### 3.2 GraphDiffMed Architecture Overview

GraphDiffMed follows a staged pipeline from representation learning to clinically informed prediction. First, a multi-modal embedding layer encodes diagnoses, procedures, medications, laboratory events, and demographic signals in a shared latent space. For intra-visit reasoning, each modality is represented as a single visit-level vector obtained by summing entity embeddings after graph processing, and graph-biased differential cross-attention computes a $1\times 1$ attention between the pooled medication vector (query) and pooled diagnosis/procedure vectors (key/value), effectively learning a gated projection of clinical context into medication representation space. Inter-visit encoding then captures longitudinal dependencies across historical visits. Effective graph bias is injected in inter-visit attention as a visit-set prior projected to medication-token positions. Finally, the aggregated patient representation is passed to a medication prediction head, followed by a causal review module that adjusts scores using diagnosis-medication and procedure-medication causal effects. An overview is shown in [Figure 1](https://arxiv.org/html/2605.20188#S3.F1).

Multi-Modal Embedding Layer. This component follows prior designs and is adapted with explicit source separation. The modality processing pipeline for diagnoses, procedures, medications, laboratory events, and demographics is adopted from DADA-MED \[[10](https://arxiv.org/html/2605.20188#bib.bib30)\]: diagnosis/procedure/medication codes are mapped to learnable embeddings ($\mathbf{D}_e,\mathbf{P}_e,\mathbf{M}_e$), laboratory events are encoded from normalized (test ID, value) pairs as $\mathbf{L}_e=\mathrm{ReLU}(\mathbf{W}_{\text{lab}}[\mathrm{ID}_{\text{norm}},\mathrm{value}_{\text{norm}}])$ and aggregated within each visit, and demographics are represented by a gender embedding and a linear age projection ($\mathbf{G}_e,\mathbf{A}_e$).

![Refer to caption](https://arxiv.org/html/2605.20188v1/images/AIAI2026.drawio.png)

Figure 1: Overview of GraphDiffMed

We use the CIDGMed pipeline \[[6](https://arxiv.org/html/2605.20188#bib.bib4)\] for multimodal representation learning that jointly models diagnoses, procedures, and medications for visit-level recommendation (the second box from the top in [Figure 1](https://arxiv.org/html/2605.20188#S3.F1) is adapted from CIDGMed). For medication representation, we draw on CIDGMed's fine-grained molecular branch, which maps each medication to molecular structures, performs GIN-style message passing over molecular nodes, and aggregates molecular information into medication embeddings through a learnable medication-molecule relation matrix. In CIDGMed, this medication branch is not used in isolation: diagnoses and procedures are also embedded separately and connected to medications through causal-effect matrices, after which dual-granularity representation learning integrates diagnosis, procedure, and medication information for visit-level recommendation.
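As a rough illustration of the laboratory-event encoding, the sketch below applies the bias-free linear map followed by ReLU to each normalized (test ID, value) pair. The text only says events are "aggregated within each visit"; summation is assumed here, and the function names and the weight layout (one row of two weights per embedding dimension) are hypothetical.

```python
from typing import List, Sequence, Tuple


def embed_lab_event(w_lab: Sequence[Sequence[float]],
                    id_norm: float, value_norm: float) -> List[float]:
    """ReLU(W_lab [id_norm, value_norm]) with W_lab given as d rows of 2 weights."""
    return [max(0.0, row[0] * id_norm + row[1] * value_norm) for row in w_lab]


def embed_visit_labs(w_lab: Sequence[Sequence[float]],
                     events: Sequence[Tuple[float, float]]) -> List[float]:
    """Pool all lab-event embeddings of one visit (sum pooling is an assumption)."""
    pooled = [0.0] * len(w_lab)
    for id_norm, value_norm in events:
        event_vec = embed_lab_event(w_lab, id_norm, value_norm)
        pooled = [a + b for a, b in zip(pooled, event_vec)]
    return pooled
```

How test IDs and values are normalized is not specified in this section, so the inputs are taken as already normalized.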

Graph-Biased Differential Attention. Our proposed method GraphDiffMed builds on DiffAttn\_v2 \[[16](https://arxiv.org/html/2605.20188#bib.bib26)\] and injects pharmacological structure directly into attention formation. Let $\mathbf{X}$ be the query-side input and $\mathbf{Y}$ the key-value input. Following DiffAttn\_v2, queries are projected to a doubled head space (for paired heads), i.e., $\mathbf{Q}=\mathbf{W}_Q\mathbf{X}$ with $2H$ heads, while keys/values are $\mathbf{K}=\mathbf{W}_K\mathbf{Y}$ and $\mathbf{V}=\mathbf{W}_V\mathbf{Y}$ with $H$ heads that are repeat-interleaved to align with the $2H$ query heads. After reshaping, attention logits are

$$\mathbf{S}=\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_h}}+\lambda_{\text{graph}}\mathbf{B}_{\text{graph}},\qquad\mathbf{A}=\mathrm{softmax}(\mathbf{S}),\qquad\mathbf{C}=\mathbf{A}\mathbf{V},$$

where $d_h$ is the per-head dimension, $\lambda_{\text{graph}}$ is a fixed scaling hyperparameter, and $\mathbf{B}_{\text{graph}}$ is the DDI-derived bias term. The context tensor $\mathbf{C}$ has $2H$ heads and is split into even/odd head pairs:

$$\mathbf{C}_1=\mathbf{C}[:,0::2,:,:],\qquad\mathbf{C}_2=\mathbf{C}[:,1::2,:,:].$$

Thus, $\mathbf{C}_1$ and $\mathbf{C}_2$ are not additional variables, but the two paired attention contexts produced by DiffAttn\_v2. A query-dependent gate is then computed per token and per head,

$$\boldsymbol{\lambda}=\sigma(\mathbf{W}_{\lambda}\mathbf{X}),$$

and differential denoising is applied as

$$\mathbf{C}_{\text{diff}}=\mathbf{C}_1-\boldsymbol{\lambda}\odot\mathbf{C}_2.$$

The final attention output is obtained with an output projection, $\mathbf{O}=\mathbf{W}_O\mathbf{C}_{\text{diff}}$. This subtraction suppresses shared noise while preserving query-relevant signal.
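To make the equations above concrete, here is a minimal pure-Python sketch for a single query token and a single even/odd head pair. It is a simplification under stated assumptions: the projections $\mathbf{W}_Q,\mathbf{W}_K,\mathbf{W}_V,\mathbf{W}_O$ are omitted (already-projected vectors are passed in), the gate pre-activation $\mathbf{W}_{\lambda}\mathbf{X}$ is supplied as the scalar `gate_logit`, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(logits: Vector) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q: Vector, keys: Sequence[Vector], values: Sequence[Vector],
           bias: Vector, lam_graph: float) -> List[float]:
    """One head: softmax(q.k / sqrt(d_h) + lam_graph * bias) applied to the values."""
    scale = math.sqrt(len(q))
    logits = [
        sum(qi * ki for qi, ki in zip(q, k)) / scale + lam_graph * b
        for k, b in zip(keys, bias)
    ]
    weights = softmax(logits)
    width = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(width)]


def diff_attention_pair(q_even: Vector, q_odd: Vector,
                        keys: Sequence[Vector], values: Sequence[Vector],
                        bias: Vector, gate_logit: float,
                        lam_graph: float = 0.1) -> List[float]:
    """C_diff = C_1 - lambda * C_2 for one pair of query heads sharing keys/values."""
    c1 = attend(q_even, keys, values, bias, lam_graph)  # even head context
    c2 = attend(q_odd, keys, values, bias, lam_graph)   # odd head context
    lam = 1.0 / (1.0 + math.exp(-gate_logit))           # sigmoid gate in (0, 1)
    return [a - lam * b for a, b in zip(c1, c2)]


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
# Identical heads and a gate of 0.5 leave half of the shared context.
out_same = diff_attention_pair([0.0, 0.0], [0.0, 0.0], keys, values,
                               bias=[0.0, 0.0], gate_logit=0.0)
```

When both heads produce the same context, the output is scaled by $1-\lambda$, which shows how the subtraction removes the component the paired heads agree on in proportion to the gate.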

The key idea is that graph bias is added before softmax so pharmacological structure can shape attention before differential subtraction\. In the current model, only the inter\-visit bias matrix has this effect, because it varies across visit pairs\. A single uniform scalar bias \(as would be in the case of intra\-visit attention\) does not change attention weights after softmax, so graph bias is applied only in inter\-visit attention\. We use a visit\-set\-level bias projected to medication\-related entries of the inter\-visit attention matrix,

$$\mathbf{B}_{\text{graph}}^{\text{inter}}[i,j]=\mathbb{I}_{\text{med}}(i,j)\cdot\frac{1}{|M_{\text{query}}||M_{\text{key}}|}\sum_{m_q\in M_{\text{query}}}\sum_{m_k\in M_{\text{key}}}\mathbf{A}_{DDI}[m_q,m_k],$$

where $\mathbb{I}_{\text{med}}(i,j)=1$ if both positions $i$ and $j$ correspond to medication channels, and $0$ otherwise. For each current-visit/historical-visit pair, we compute one scalar: the mean DDI density over all medication pairs across the two sets. This scalar is written to the corresponding entries of the inter-visit attention matrix. As a result, the prior can vary across historical visits, but not across individual drug pairs within a given visit pair. This is a design choice that follows the representation level: the temporal encoder attends over visit-pooled GRU states, not raw medication-token sequences. Therefore, the graph-prior effects observed in this study come from inter-visit bias at visit-set granularity. Achieving pair-level inter-visit bias would require attention over medication-level sequences instead of pooled visit states.
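The visit-pair scalar and the softmax argument above can be illustrated with a short sketch. It computes one mean DDI density per historical visit and omits the medication-channel indicator $\mathbb{I}_{\text{med}}$ for brevity. Returning zero for an empty medication set is an assumption, and the function names and toy adjacency matrix are hypothetical.

```python
import math
from typing import List, Sequence


def ddi_density(a_ddi: Sequence[Sequence[int]],
                meds_query: Sequence[int], meds_key: Sequence[int]) -> float:
    """Mean of A_DDI[m_q, m_k] over all cross pairs of the two medication sets."""
    if not meds_query or not meds_key:
        return 0.0  # assumption: no bias when either set is empty
    hits = sum(a_ddi[mq][mk] for mq in meds_query for mk in meds_key)
    return hits / (len(meds_query) * len(meds_key))


def inter_visit_bias(a_ddi: Sequence[Sequence[int]], meds_query: Sequence[int],
                     history: Sequence[Sequence[int]]) -> List[float]:
    """One scalar bias per historical visit (visit-set granularity)."""
    return [ddi_density(a_ddi, meds_query, meds_key) for meds_key in history]


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy symmetric DDI adjacency: only drugs 0 and 1 interact.
A_DDI = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
bias = inter_visit_bias(A_DDI, meds_query=[0], history=[[1], [2], [1, 2]])

logits = [0.3, -0.2, 0.1]
plain = softmax(logits)
# The same scalar added to every logit cancels inside softmax.
uniform = softmax([x + 0.1 * 0.5 for x in logits])
# A bias that differs across historical visits does shift the weights.
varying = softmax([x + 0.1 * b for x, b in zip(logits, bias)])
```

Here `uniform` matches `plain` up to rounding, whereas `varying` moves weight toward the first historical visit, which has the highest DDI density with the query-side set.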

Model Pipeline, Objective, and Implementation\.

The remaining modeling pipeline follows DADA-MED \[[10](https://arxiv.org/html/2605.20188#bib.bib30)\], while GraphDiffMed replaces only the attention operator with the graph-biased differential attention introduced above. Specifically, five components are retained: (1) *intra-visit encoding*, where visit-level diagnosis, procedure, and prior-medication embeddings are aggregated as $\mathbf{h}_D^{(t)}=\sum_{d\in D_t}\mathbf{D}_e[d]$, $\mathbf{h}_P^{(t)}=\sum_{p\in P_t}\mathbf{P}_e[p]$, and $\mathbf{h}_M^{(t)}=\sum_{m\in M_{t-1}}\mathbf{M}_e[m]$, then refined by modality-specific homo-graphs (causal graphs for diagnoses/procedures and DDI graph for medications). Intra-visit differential cross-attention is then computed between pooled vectors (medication query, diagnosis/procedure key-value), yielding a $1\times 1$ attention map per head rather than token-sequence attention; (2) *sequential temporal encoding* with modality-wise GRUs, e.g., $\mathbf{O}_D,\mathbf{h}_D^{\text{final}}=\mathrm{GRU}_D([\mathbf{h}_D^{(1)},\ldots,\mathbf{h}_D^{(T)}])$ (analogously for procedures and prior-medication states); (3) *inter-visit attention*, where the current-visit query $\mathbf{q}_{\text{visit}}$ (formed from current diagnosis/procedure states together with prior-medication and demographic embeddings $\mathbf{G}_e,\mathbf{A}_e$) attends over historical key-value context $\mathbf{kv}_{\text{prev}}$ (from $\mathbf{O}_D,\mathbf{O}_P,\mathbf{O}_M$) with $\mathbf{B}_{\text{graph}}^{\text{inter}}$; (4) *patient representation aggregation*, where final hidden states, demographic embeddings, intra-/inter-visit attention summaries, and last-visit states are concatenated into $\mathbf{r}_{\text{patient}}$; and (5) *medication prediction with causal review*, where logits $\mathbf{z}=\mathbf{W}_{\text{out}}\,\mathrm{ReLU}(\mathbf{r}_{\text{patient}})$ are adjusted using diagnosis/procedure-to-medication causal effects (via $c_m^{D/P}=\max_{d/p\in D_T/P_T}\mathrm{CausalEffect}(d/p,m)$) before obtaining probabilities $\mathbf{p}=\sigma(\mathbf{z})$. Key symbols used in this block are summarized in Table [1](https://arxiv.org/html/2605.20188#S3.T1).

Table 1: Notation used in model pipeline and objective.

Training uses the multi-term objective

$$\mathcal{L}=\mathcal{L}_{\text{BCE}}+\beta(t)\mathcal{L}_{\text{DDI}}+\alpha\mathcal{L}_{\text{reg}},$$

with binary cross-entropy

$$\mathcal{L}_{\text{BCE}}=-\frac{1}{n_M}\sum_{m=1}^{n_M}\left[y_m\log p_m+(1-y_m)\log(1-p_m)\right],$$

DDI regularization

$$\mathcal{L}_{\text{DDI}}=\frac{0.0005}{|M_{\text{pred}}|^{2}}\sum_{i,j\in M_{\text{pred}}}p_i p_j\mathbf{A}_{DDI}[i,j],$$

annealed DDI weighting

$$\beta(t)=\beta_0\left(1-\exp\left(-\gamma\frac{\mathrm{DDI}_{\text{current}}-\mathrm{DDI}_{\text{target}}}{\mathrm{DDI}_{\text{target}}}\right)\right),$$

and L2 regularization $\mathcal{L}_{\text{reg}}=\|\theta\|_2^2$, where $\mathrm{DDI}_{\text{target}}=0.06$ and $\gamma=2.5$. The annealed $\mathcal{L}_{\text{DDI}}$ term is active during training in all reported variants and directly optimizes lower interacting co-prescriptions at the output level. To avoid ambiguity, safety-aware behavior in GraphDiffMed comes from two complementary mechanisms: (i) output-level annealed DDI regularization (direct optimization), and (ii) attention-level graph bias (a pharmacological structural prior whose effect on DDI is emergent and configuration-dependent rather than directly optimized).

Here, $t$ is the training step (or epoch) index; $\beta(t)$ is the dynamic weight for the DDI penalty; $\beta_0$ is the base DDI-penalty coefficient; and $\alpha$ is the L2-regularization coefficient. In $\mathcal{L}_{\text{BCE}}$, $n_M$ is the medication-vocabulary size, $m$ indexes medications, $y_m\in\{0,1\}$ is the ground-truth multi-label target for medication $m$, and $p_m\in(0,1)$ is its predicted probability. In $\mathcal{L}_{\text{DDI}}$, $M_{\text{pred}}$ is the predicted medication set, $|M_{\text{pred}}|$ is its cardinality, $i,j\in M_{\text{pred}}$ index predicted medications, $p_i,p_j$ are their predicted probabilities, and $\mathbf{A}_{DDI}[i,j]\in\{0,1\}$ indicates whether the pair has a known interaction. $\mathrm{DDI}_{\text{current}}$ is the current batch/model DDI rate, $\mathrm{DDI}_{\text{target}}$ is the desired target DDI rate, $\gamma$ controls annealing sharpness, and $\theta$ denotes all trainable parameters.
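The objective can be traced term by term with a small stdlib-only sketch. Two points are assumptions rather than statements from the text: the predicted set $M_{\text{pred}}$ is formed by thresholding probabilities at 0.5, and probabilities are clamped away from 0 and 1 for numerical safety. The value of $\beta_0$ is not given in this section, so it is left as a required argument. All function names are hypothetical.

```python
import math
from typing import Sequence


def bce_loss(probs: Sequence[float], targets: Sequence[int]) -> float:
    """Mean binary cross-entropy over the medication vocabulary."""
    eps = 1e-12  # numerical guard (assumption, not part of the formula)
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)


def ddi_loss(probs: Sequence[float], a_ddi: Sequence[Sequence[int]],
             threshold: float = 0.5) -> float:
    """0.0005 / |M_pred|^2 * sum of p_i * p_j * A_DDI[i, j] over predicted pairs."""
    pred = [i for i, p in enumerate(probs) if p >= threshold]  # assumed thresholding
    if not pred:
        return 0.0
    total = sum(probs[i] * probs[j] * a_ddi[i][j] for i in pred for j in pred)
    return 0.0005 * total / len(pred) ** 2


def beta_weight(ddi_current: float, beta0: float,
                ddi_target: float = 0.06, gamma: float = 2.5) -> float:
    """Annealed DDI weight beta(t), evaluated exactly as the formula is written."""
    return beta0 * (1.0 - math.exp(-gamma * (ddi_current - ddi_target) / ddi_target))


def total_loss(probs: Sequence[float], targets: Sequence[int],
               a_ddi: Sequence[Sequence[int]], params: Sequence[float],
               ddi_current: float, beta0: float, alpha: float = 0.005) -> float:
    """L = L_BCE + beta(t) * L_DDI + alpha * ||theta||_2^2."""
    l2 = sum(w * w for w in params)
    return (bce_loss(probs, targets)
            + beta_weight(ddi_current, beta0) * ddi_loss(probs, a_ddi)
            + alpha * l2)
```

Evaluated literally, $\beta(t)$ is zero when the current DDI rate equals the target and becomes negative below it. Whether the released code clamps the weight at zero is not stated in this section, so the sketch leaves it unclamped.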

Implementation is configured with embedding dimension $d=64$, differential attention heads $H=8$ (thus $2H=16$), GRU hidden size $64$, dropout $0.7$, and fixed graph-bias scale $\lambda_{\text{graph}}=0.1$ (not optimized during training). Training uses Adam with learning rate $5\times 10^{-4}$, patient-level batching (one patient per batch), $20$ epochs, and regularization weight $\alpha=0.005$. Diagnoses, procedures, and medications are always included as standard modalities; we then evaluate additional-modality settings G (gender only), GY (demographics: gender+age), L (lab events only), and LGY (labs+demographics), plus a full setting using all modalities jointly.

## 4 Experimental Setup

Table 2: Dataset preprocessing, statistics, and split configuration.

| Category | Item | Value |
| --- | --- | --- |
| Preprocessing | Single-visit patients | Filtered out |
| Preprocessing | Medication code system | ATC level 3 |
| Preprocessing | Diagnosis code system | ICD-9 |
| Preprocessing | Procedure code system | ICD-9 procedure codes |
| Preprocessing | Laboratory events | MIMIC-III LABEVENTS (test ID, value) |
| Preprocessing | Demographics | Gender (binary), age (years) |
| Statistics | Total patients | 6,350 |
| Statistics | Total visits | 15,032 |
| Statistics | Average visits per patient | 2.37 |
| Statistics | Diagnosis vocabulary size ($n_D$) | 1,958 |
| Statistics | Procedure vocabulary size ($n_P$) | 1,426 |
| Statistics | Medication vocabulary size ($n_M$) | 145 |
| Statistics | DDI pairs | 1,318 |
| Split | Train | 5,080 patients |
| Split | Validation | 635 patients |
| Split | Test | 635 patients |

We evaluate on MIMIC-III \[[4](https://arxiv.org/html/2605.20188#bib.bib29)\]. Preprocessing, statistics, and data splits are summarized in Table [2](https://arxiv.org/html/2605.20188#S4.T2). We compare GraphDiffMed with DADA-MED \[[10](https://arxiv.org/html/2605.20188#bib.bib30)\], CIDGMed \[[6](https://arxiv.org/html/2605.20188#bib.bib4)\], LEADER \[[8](https://arxiv.org/html/2605.20188#bib.bib5)\], LEAP \[[19](https://arxiv.org/html/2605.20188#bib.bib21)\], REFINE \[[1](https://arxiv.org/html/2605.20188#bib.bib25)\], MIFNet \[[3](https://arxiv.org/html/2605.20188#bib.bib2)\], SHAPE \[[9](https://arxiv.org/html/2605.20188#bib.bib12)\], DAPSNet \[[14](https://arxiv.org/html/2605.20188#bib.bib15)\], PROMISE \[[15](https://arxiv.org/html/2605.20188#bib.bib10)\], and A-GSTCN \[[18](https://arxiv.org/html/2605.20188#bib.bib16)\]. We report Jaccard (primary), DDI Rate, F1, and PRAUC, plus Avg #Meds as an auxiliary prescribing-intensity indicator. Avg #Meds is the mean number of medications recommended per visit and is reported only as an *auxiliary* contextual metric (not a primary model-selection target, as is standard in state-of-the-art reporting). Results are mean $\pm$ standard deviation over five seeds (1, 3, 16, 18, 1234); model selection uses validation Jaccard, and test estimates use bootstrap resampling (10 iterations, 80% test subset each).
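For readers unfamiliar with the set-based metrics, the following stdlib-only sketch shows per-visit Jaccard and F1, a pairwise DDI rate, and the bootstrap protocol (10 iterations over 80% subsets). The metric definitions follow common usage in this line of work and are assumed rather than quoted from the paper. Drawing each subset without replacement is also an assumption, and all function names are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean, pstdev
from typing import List, Sequence, Set, Tuple


def jaccard(pred: Set[int], true: Set[int]) -> float:
    """|pred & true| / |pred | true| for one visit."""
    union = pred | true
    return len(pred & true) / len(union) if union else 0.0


def f1_score(pred: Set[int], true: Set[int]) -> float:
    """Harmonic mean of set precision and recall for one visit."""
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(true)
    return 2 * precision * recall / (precision + recall)


def ddi_rate(pred_sets: Sequence[Set[int]], ddi_pairs: Set[Tuple[int, int]]) -> float:
    """Share of recommended drug pairs with a known interaction (assumed definition)."""
    total = hits = 0
    for pred in pred_sets:
        for a, b in combinations(sorted(pred), 2):
            total += 1
            if (a, b) in ddi_pairs or (b, a) in ddi_pairs:
                hits += 1
    return hits / total if total else 0.0


def bootstrap_jaccard(preds: List[Set[int]], trues: List[Set[int]],
                      iterations: int = 10, fraction: float = 0.8,
                      seed: int = 0) -> Tuple[float, float]:
    """Mean and standard deviation of Jaccard over random test subsets."""
    rng = random.Random(seed)
    n = len(preds)
    k = max(1, round(fraction * n))
    scores = []
    for _ in range(iterations):
        idx = rng.sample(range(n), k)  # subset drawn without replacement (assumption)
        scores.append(mean(jaccard(preds[i], trues[i]) for i in idx))
    return mean(scores), pstdev(scores)
```

PRAUC is omitted because it needs ranked probabilities rather than thresholded sets.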

#### 4\.4 Research Questions

Our experiments are designed to answer the following research questions:

- RQ1: How does GraphDiffMed compare against state-of-the-art medication recommendation methods across all metrics?
- RQ2: What is the contribution of the graph bias component, isolated per modality configuration, through ablation studies?
- RQ3: How do different modality combinations (demographics, lab events) affect performance, both with and without graph bias?
- RQ4: What is the isolated contribution of applying DiffAttn\_v2 at both temporal scales (without graph bias) compared to the full GraphDiffMed model?
- RQ5: Does architectural innovation (dual-scale differential attention + graph bias) provide gains over simpler data-side engineering such as training set augmentation?

## 5 Results and Analysis

#### 5\.1 Main Results \(RQ1\)

Table [3](https://arxiv.org/html/2605.20188#S5.T3) reports the primary comparison on MIMIC-III. GraphDiffMed (GY) is best on Jaccard, F1, and PRAUC, while PROMISE achieves the lowest DDI. This pattern indicates that the proposed model improves recommendation quality and ranking, with a safety profile that must be interpreted jointly with coverage rather than as an isolated minimum-DDI objective.

Table 3: Main results on MIMIC-III test set. The best-performing model is GraphDiffMed (GY).

#### 5\.2 Ablation and Modality Effects \(RQ2, RQ3, RQ4\)

Table [4](https://arxiv.org/html/2605.20188#S5.T4) jointly answers RQ2, RQ3, and RQ4. Here, *Dual v2* denotes the dual-scale DiffAttn\_v2 architecture (intra-visit + inter-visit differential attention) *without* graph bias; GraphDiffMed adds graph bias on top of the same backbone. First, compared with the baseline (v1), Dual v2 variants improve Jaccard and F1, confirming that dual-scale denoising is the primary performance driver. Second, the baseline yields the lowest DDI, but this comes with clearly lower Jaccard and F1, indicating a conservative quality-safety point. Third, the best Jaccard/F1 are achieved by Dual v2 (LGY), but with higher DDI. Notably, GraphDiffMed (GY) attains the second-best Jaccard/F1 together with the best PRAUC and a substantially lower DDI than Dual v2 (LGY), indicating a strong quality-safety trade-off.

Table 4: Ablation across dual-scale DiffAttn\_v2 and graph bias. Here, $(-)$ denotes the default modality set $(D,P,M)$, while L, GY, and LGY indicate additional modalities appended to this base. Bold and italic values show the best and second-best results, respectively.

#### 5\.3 Comparison with Data Augmentation \(RQ5\)

Table[5](https://arxiv.org/html/2605.20188#S5.T5)compares the augmentation\-only variants against the proposed architecture\. Relative to Table[3](https://arxiv.org/html/2605.20188#S5.T3)and Table[4](https://arxiv.org/html/2605.20188#S5.T4), augmentation alone yields smaller gains, indicating that architectural changes \(dual\-scale differential attention with graph bias\) are the dominant source of improvement\.

Table 5: Augmentation\-only variants \(without graph\-biased differential attention\)\.
#### 5\.4 Statistical Testing, Attention, and Error Profile

Statistical testing supports the same overall pattern\. Augmentation\-only variants are mostly non\-significant on primary quality metrics \(Jaccard, F1, PRAUC\), with isolated effects mainly on DDI or Avg meds \(e\.g\., LGY augmentation: DDI $p = 0.0086$\)\. In contrast, dual\-scale DiffAttn\_v2 variants show consistent and significant gains in Jaccard and F1 versus baseline for the base setting $(D, P, M)$, GY, and LGY, both without and with graph bias\. The strongest significance appears in the graph\-biased GY setting \(Jaccard $p = 0.0030$, F1 $p = 0.0016$\)\. PRAUC significance is mainly observed in the base $(D, P, M)$ and GY settings, whereas DDI differences are usually non\-significant except in LGY variants\. Overall, the most reliable effect is improvement in predictive quality, while DDI trade\-offs are concentrated in higher\-modality settings\.
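
This section does not state which significance test produced the p-values above. As one hedged illustration of how a variant could be compared against a baseline on paired scores (for example, the same seeds or the same test patients), the sketch below uses a two-sided paired sign-flip permutation test. The choice of test, the function name, and the toy scores are all assumptions and do not reproduce the paper's numbers.

```python
import random

def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided paired sign-flip permutation test on matched scores."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis each paired difference is equally
        # likely to have either sign, so flip signs at random.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            extreme += 1
    # The add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_resamples + 1)

# Toy paired Jaccard scores (hypothetical values, not taken from the paper).
variant = [0.55, 0.56, 0.54, 0.57, 0.55, 0.56, 0.58, 0.55, 0.57, 0.56]
baseline = [0.53, 0.54, 0.53, 0.54, 0.53, 0.55, 0.55, 0.53, 0.54, 0.54]

p_value = paired_permutation_test(variant, baseline)
print(f"p = {p_value:.4f}")
```

A paired design is used here because both models are scored on the same items, so the test compares per-item differences rather than two independent samples.
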

Qualitatively, inter\-visit attention is clinically coherent, preserving continuity of chronic high\-risk medications; with graph bias, medication\-to\-medication focus is less diffuse and better aligned with known DDI\-relevant pairs, supporting its role as a structural prior rather than a hard rule\. Errors are also clinically plausible: false negatives are more common for rare or episodic medications, while false positives are often reasonable alternatives under physician\-level prescribing variability\. Some predicted interacting pairs may reflect overfitting, but many are combinations used in monitored ICU practice, consistent with the observed quality\-safety trade\-off\.

## 6 Discussion, Limitations, and Conclusion

GraphDiffMed improves medication recommendation primarily through dual\-scale Differential Attention v2, which strengthens noise suppression within encounters and across longitudinal history\. Ablations indicate that most Jaccard/F1 gains come from this architectural change, while knowledge constraints provide a secondary, configuration\-dependent contribution to the quality\-safety balance \(most clearly in the GY setting\)\.

The empirical profile is clinically meaningful rather than purely metric\-driven\. Within ablations, Dual v2 \(LGY\) reaches the top Jaccard and F1, but at a higher DDI level\. GraphDiffMed \(GY\), which uses only demographics as auxiliary inputs, preserves near\-peak Jaccard/F1 \(second\-best\), achieves the best PRAUC, and lowers DDI relative to Dual v2 \(LGY\)\. The baseline remains the lowest\-DDI setting, but its lower Jaccard/F1 indicates under\-recommendation\. In ICU practice, some interacting pairs are clinically necessary and managed through monitoring, dose adjustment, and scheduling; accordingly, the binary DDI metric should be interpreted together with recommendation completeness because it does not distinguish contraindicated from manageable interactions\.

This demographics finding is notable\. Age and gender provide stable, low\-noise stratification signals that are broadly available across EHR systems, whereas laboratory features are high\-dimensional, sparse, and temporally noisy in MIMIC\-III ICU trajectories\. Under fixed model capacity and strong regularization, cleaner auxiliary signals appear to be more useful than noisier high\-dimensional inputs\. This also helps explain why graph bias contributes more reliably in cleaner\-feature configurations and is diluted when noisy modalities dominate attention patterns\.

Several limitations remain\. Evaluation is restricted to MIMIC\-III ICU data, so external validity to MIMIC\-IV, eICU, and outpatient settings is still open\. ATC level\-3 aggregation simplifies the medication space and omits dosage/formulation granularity\. Agreement\-based metrics \(Jaccard/F1/PRAUC\) proxy clinician prescribing behavior rather than causal treatment optimality, and prospective outcome validation is still needed\. The DDI knowledge used in this study follows the baseline preprocessing pipeline \(e\.g\., CIDGMed\) and is a binary adjacency matrix without severity or mechanism weights; therefore, severity\-aware comparison would require matched reimplementation across baselines for strict fairness\. In addition, inter\-visit graph bias is computed at visit\-set granularity on pooled visit states, not individual drug\-pair granularity, which limits its influence relative to explicit pairwise DDI constraints\. We also fix $\lambda_{\text{graph}}$ as a hyperparameter \(0\.1\) rather than learning it, which may understate or overstate the attainable impact of the graph prior\.
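
To make the role of the fixed graph weight concrete, the sketch below adds a visit-level bias matrix, scaled by `lam_graph = 0.1`, to the logits of a differential attention layer over pooled visit states. It follows the original Differential Transformer formulation (the difference of two softmax attention maps). The v2 details, the attention map that receives the bias, the value of `lam_diff`, and the way `graph_bias` is built are all assumptions for illustration. Causal masking across visits is left out for brevity.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with max subtraction for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def graph_biased_diff_attention(q1, k1, q2, k2, v, graph_bias,
                                lam_diff=0.5, lam_graph=0.1):
    """Differential attention with an additive, fixed-weight graph bias.

    q1, k1, q2, k2: (n_visits, d) pooled visit states under two projections.
    v: (n_visits, d_v) values.
    graph_bias: (n_visits, n_visits) hypothetical visit-level prior.
    """
    d = q1.shape[-1]
    # Assumption: the prior is added to the first attention map only.
    logits1 = q1 @ k1.T / np.sqrt(d) + lam_graph * graph_bias
    logits2 = q2 @ k2.T / np.sqrt(d)
    # Subtracting the second map cancels attention mass shared by both maps.
    weights = softmax(logits1) - lam_diff * softmax(logits2)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_visits, d = 4, 8
q1, k1, q2, k2, v = (rng.normal(size=(n_visits, d)) for _ in range(5))

# Toy prior: visit 0 is nudged toward visit 1.
bias = np.zeros((n_visits, n_visits))
bias[0, 1] = 1.0

out_plain, w_plain = graph_biased_diff_attention(
    q1, k1, q2, k2, v, np.zeros_like(bias))
out_graph, w_graph = graph_biased_diff_attention(q1, k1, q2, k2, v, bias)
print(w_plain[0, 1], w_graph[0, 1])
```

Since the bias enters additively before the softmax, it shifts attention toward the favored visit pairs without forbidding others, which matches the description of the graph prior as a structural prior rather than a hard rule. With a fixed `lam_graph`, the size of that shift cannot adapt to the data.
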

Future work should prioritize adding an explicit intra\-visit medication\-pair graph\-bias mechanism \(e\.g\., a sparse current\-visit DDI submatrix over medication tokens\), implementing medication\-level \(non\-pooled\) inter\-visit attention to enable pair\-level cross\-visit graph bias, severity\-aware DDI modeling, multitask learning with adverse\-event prediction, personalized DDI risk, cross\-dataset validation, stronger explanation methods \(including counterfactuals\), and prospective longitudinal evaluation; integration with language models for clinician\-facing explanations and online adaptation are additional promising directions\.

In summary, GraphDiffMed demonstrates that dual\-scale differential denoising is the main driver of improved recommendation quality, with knowledge constraints contributing to safety\-performance trade\-offs in selected settings\. These trends are supported by statistical testing: Jaccard/F1 gains are consistently significant for Dual v2 and GraphDiffMed in the GY and LGY settings, whereas DDI differences are mostly non\-significant except in higher\-modality LGY variants\. The results show that clean, interpretable auxiliary features and noise\-robust attention can outperform more complex but noisier multimodal settings\.

## References

- \[1\] S\. Bhoi, M\. L\. Lee, W\. Hsu, and N\. C\. Tan \(2024\) REFINE: a fine\-grained medication recommendation system using deep learning and personalized drug interaction modeling\. Advances in Neural Information Processing Systems 36\.
- \[2\] E\. Choi, M\. T\. Bahadori, J\. Sun, J\. Kulas, A\. Schuetz, and W\. Stewart \(2016\) Retain: an interpretable predictive model for healthcare using reverse time attention mechanism\. Advances in Neural Information Processing Systems 29\.
- \[3\] J\. Huo, Z\. Hong, M\. Chen, and Y\. Duan \(2024\) MIFNet: multimodal interactive fusion network for medication recommendation\. The Journal of Supercomputing, pp\. 1–33\.
- \[4\] A\. E\. Johnson, T\. J\. Pollard, L\. Shen, L\. H\. Lehman, M\. Feng, M\. Ghassemi, B\. Moody, P\. Szolovits, L\. Anthony Celi, and R\. G\. Mark \(2016\) MIMIC\-III, a freely accessible critical care database\. Scientific Data 3\(1\), pp\. 1–9\.
- \[5\] C\. Knox, M\. Wilson, C\. M\. Klinger, M\. Franklin, E\. Oler, A\. Wilson, A\. Pon, J\. Cox, N\. E\. Chin, S\. A\. Strawbridge, et al\. \(2024\) DrugBank 6\.0: the DrugBank knowledgebase for 2024\. Nucleic Acids Research 52\(D1\), pp\. D1265–D1275\.
- \[6\] S\. Liang, X\. Li, S\. Mu, C\. Li, Y\. Lei, Y\. Hou, and T\. Ma \(2025\) CIDGMed: causal inference\-driven medication recommendation with enhanced dual\-granularity learning\. Knowledge\-Based Systems 309, pp\. 112685\.
- \[7\] J\. Liu, Z\. Wan, X\. Hu, and Q\. Zhu \(2024\) Safe drug recommendation through forward data imputation and recurrent residual neural network\. Applied Soft Computing 161, pp\. 111723\.
- \[8\] Q\. Liu, X\. Wu, X\. Zhao, Y\. Zhu, Z\. Zhang, F\. Tian, and Y\. Zheng \(2024\) Large language model distilling medication recommendation model\. arXiv preprint arXiv:2402\.02803\.
- \[9\] S\. Liu, X\. Wang, J\. Du, Y\. Hou, X\. Zhao, H\. Xu, H\. Wang, Y\. Xiang, and B\. Tang \(2023\) SHAPE: a sample\-adaptive hierarchical prediction network for medication recommendation\. IEEE Journal of Biomedical and Health Informatics\.
- \[10\] K\. Saxena and T\. Shibata \(2025\) DADA\-med: data\-augmented dual attention model for enhanced medication recommendations\. In IFIP International Conference on Artificial Intelligence Applications and Innovations, pp\. 83–97\.
- \[11\] J\. Shang, C\. Xiao, T\. Ma, H\. Li, and J\. Sun \(2019\) Gamenet: graph augmented memory networks for recommending medication combination\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 33, pp\. 1126–1133\.
- \[12\] N\. P\. Tatonetti, P\. P\. Ye, R\. Daneshjou, and R\. B\. Altman \(2012\) Data\-driven prediction of drug effects and interactions\. Science Translational Medicine 4\(125\), pp\. 125ra31–125ra31\.
- \[13\] A\. Vaswani \(2017\) Attention is all you need\. Advances in Neural Information Processing Systems\.
- \[14\] J\. Wu, Y\. Dong, Z\. Gao, T\. Gong, and C\. Li \(2023\) Dual attention and patient similarity network for drug recommendation\. Bioinformatics 39\(1\), pp\. btad003\.
- \[15\] J\. Wu, X\. Yu, K\. He, Z\. Gao, and T\. Gong \(2024\) PROMISE: a pre\-trained knowledge\-infused multimodal representation learning framework for medication recommendation\. Information Processing & Management 61\(4\), pp\. 103758\.
- \[16\] T\. Ye, L\. Dong, Y\. Sun, and F\. Wei \(2026\-01\-20\) Differential transformer v2 \(Website\)\. [Link](https://aka.ms/diff-transformer-v2)\.
- \[17\] T\. Ye, L\. Dong, Y\. Xia, Y\. Sun, Y\. Zhu, G\. Huang, and F\. Wei \(2024\) Differential transformer\. arXiv preprint arXiv:2410\.05258\.
- \[18\] W\. Yue, M\. Wang, L\. Zhang, L\. Zhang, J\. Huang, J\. Wan, N\. Xiong, and A\. V\. Vasilakos \(2023\) A\-gstcn: an augmented graph structural–temporal convolution network for medication recommendation based on electronic health records\. Bioengineering 10\(11\), pp\. 1241\.
- \[19\] Y\. Zhang, R\. Chen, J\. Tang, W\. F\. Stewart, and J\. Sun \(2017\) LEAP: learning to prescribe effective and safe treatment combinations for multimorbidity\. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp\. 1315–1324\.
