Context-aware Modality-Topology Co-Alignment for Multimodal Attributed Graphs
Summary
Proposes CoMAG, a unified backbone for multimodal attributed graphs that learns task-adaptive reliable contexts and performs modality-preserving alignment, achieving state-of-the-art results on graph-level prediction, modality matching, and graph-conditioned generation.
# Context-aware Modality-Topology Co-Alignment for Multimodal Attributed Graphs
Source: [https://arxiv.org/html/2606.14172](https://arxiv.org/html/2606.14172)
###### Abstract
Multimodal Attributed Graphs (MAGs) model real-world entities by coupling graph topology with heterogeneous semantic attributes such as text and images. This data structure naturally supports both graph-centric tasks, which require structural and class-discriminative representations, and modality-centric tasks, which require fine-grained cross-modal correspondence. However, existing MAG methods often rely on a fixed graph context or a uniformly fused multimodal representation. Such designs can lead to task-agnostic context propagation and over-compressed cross-modal fusion, making it difficult to satisfy different task requirements while preserving modality-specific evidence. To address this challenge, we propose CoMAG (Context-aware Modality-Topology Co-Alignment), a unified MAG backbone that learns task-adaptive reliable contexts and performs modality-preserving alignment within those contexts. Concretely, CoMAG first performs Reliable Context Learning by estimating edge reliability from multimodal semantic consistency, supplementing the raw topology with semantic neighbors, and selecting context components through a task-aware gate. It then performs Modality-preserving Hop-token Alignment by maintaining modality-specific multi-hop trajectories, matching modality-hop tokens across modalities, and using shared-private representation decoupling to produce graph representations and modality representations from the same forward pass. We further provide theoretical analysis of stable propagation, over-smoothing mitigation, and modality-collapse control. Experiments on nine OpenMAG datasets compare CoMAG with feature-only, graph-only, multimodal, and unified MAG baselines across graph-level prediction, modality matching, and graph-conditioned generation protocols.
The results show that CoMAG achieves the best reported performance among compared methods, indicating that task-adaptive reliable contexts and modality-preserving alignment jointly strengthen structural prediction, cross-modal matching, and graph-conditioned generation while retaining sparse edge-linear complexity. Code is available at [https://anonymous.4open.science/r/CoMAG](https://anonymous.4open.science/r/CoMAG).
## I Introduction
Multimodal Attributed Graphs (MAGs) provide a natural abstraction for real-world entities whose relational dependencies and semantic attributes are inseparable. In e-commerce \[[43](https://arxiv.org/html/2606.14172#bib.bib3)\], products are connected by co-purchase relations while being described by titles, reviews, and images. In cellular networks \[[9](https://arxiv.org/html/2606.14172#bib.bib4)\], cell infrastructures can be modeled as connected nodes described by heterogeneous attributes such as radio signals and geographic context. MAG learning therefore aims to jointly exploit graph topology and multimodal semantics encoded by modern vision, language, or vision-language models \[[50](https://arxiv.org/html/2606.14172#bib.bib8),[44](https://arxiv.org/html/2606.14172#bib.bib9)\]. This setting supports two complementary task families. Graph-centric tasks, such as node classification, link prediction, and node clustering, evaluate whether representations preserve structural, relational, and class-discriminative information. Modality-centric tasks, such as cross-modal retrieval, modality matching, modality alignment, and graph-conditioned generation, evaluate whether representations capture fine-grained semantic correspondences across modalities. A unified MAG backbone is therefore expected to support graph reasoning and multimodal semantic matching from the same representation learning pipeline.
Despite notable progress, existing MAG methods remain limited by the assumption that a single graph context or a single fused representation can serve all downstream objectives. (1) Task-agnostic context propagation. Graph-centric and modality-centric tasks often require different neighborhoods. Classification may prefer label-consistent neighbors, link prediction may rely on structural complementarity, clustering may require compact yet separable communities, and retrieval may benefit from semantically aligned nodes that are not directly connected in the original graph. However, many methods propagate over the original topology or a uniformly learned topology, which can amplify noisy edges, overlook missing semantic relations, and treat task-conflicting edges in the same way. (2) Over-compressed cross-modal fusion. Existing fusion or alignment mechanisms often encourage different modalities to collapse into a common representation space. While such compression can highlight shared semantics, it may also remove modality-specific details that are essential for retrieval, matching, generation, and fine-grained alignment. Recent studies on neighborhood tokenization, modality-aware propagation, task-specific topology, and shared-private disentanglement provide useful components \[[8](https://arxiv.org/html/2606.14172#bib.bib17),[16](https://arxiv.org/html/2606.14172#bib.bib16),[17](https://arxiv.org/html/2606.14172#bib.bib15),[49](https://arxiv.org/html/2606.14172#bib.bib52),[28](https://arxiv.org/html/2606.14172#bib.bib42),[57](https://arxiv.org/html/2606.14172#bib.bib12),[12](https://arxiv.org/html/2606.14172#bib.bib13)\], but they do not fully resolve how to learn task-adaptive graph contexts while preserving modality-specific evidence.
Therefore, we center this research around the following question: How can a MAG backbone learn reliable contexts for different tasks while aligning modalities without erasing their distinctive information? Our starting point is that context construction and modality alignment should be treated as two coupled but distinct problems. Rather than directly aggregating all neighbors from the raw graph, the model should identify which structural edges are reliable, recover semantic neighbor relations that the original graph misses, and adapt the resulting context to the target task. Rather than forcing all modalities into a single fused embedding, the model should maintain modality-specific propagation trajectories and align them at a finer granularity, so that cross-modal consensus and modality-private evidence can both be retained. This perspective shifts MAG learning from simple topology-modality fusion toward context-aware modality-topology co-alignment.
To this end, we propose CoMAG (Context-aware Modality-topology co-Alignment for multimodal attributed Graphs), a unified framework designed to serve graph-centric and modality-centric tasks through a shared backbone. CoMAG is built on two technical pillars. Reliable Context Learning estimates edge reliability from multimodal semantic consistency, supplements the original graph with semantic neighbors, and selects an appropriate context according to the task. Modality-preserving Hop-token Alignment propagates each modality along the learned context as a multi-hop trajectory, treats each modality-hop pair as an alignment token, and decouples the resulting representation into shared consensus and modality-specific residual components. The shared component supports graph-oriented reasoning, while the modality-specific components preserve the discriminative information required by retrieval, matching, and generation. Under the OpenMAG protocol, these design choices yield the best reported performance among evaluated methods on graph-level prediction and modality-level matching and generation, suggesting that reliable topology, semantic context recovery, and modality-private evidence are complementary for unified MAG learning. Our main contributions are summarized below.
1. Problem Reframing. We identify unified MAG representation learning as a problem of task-adaptive context construction and modality-preserving alignment, rather than a direct extension of graph aggregation or multimodal fusion.
2. Novel Framework. We introduce CoMAG, which integrates reliable context learning, modality-specific hop trajectories, hop-token cross-modal matching, and shared-private decoupling into a unified MAG backbone.
3. SOTA Performance. We validate CoMAG through extensive evaluations, which demonstrate superior performance over competitive baselines and robustness in various challenging scenarios.
## II Related Work
### II-A Multimodal Attributed Graph Learning
Multimodal attributed graph learning considers relational data whose nodes carry heterogeneous semantic evidence\. Recent surveys cast this setting as a general multimodal graph\-learning problem rather than a recommendation\-only formulation\[[7](https://arxiv.org/html/2606.14172#bib.bib6),[36](https://arxiv.org/html/2606.14172#bib.bib7)\]\. New benchmarks further broaden MAG evaluation across heterogeneous modalities, graph\-centric prediction, and modality\-centric reasoning\[[47](https://arxiv.org/html/2606.14172#bib.bib1),[58](https://arxiv.org/html/2606.14172#bib.bib2),[54](https://arxiv.org/html/2606.14172#bib.bib5)\]\. Representative models inject modality\-aware convolution, attention, and recommendation objectives into message passing\[[50](https://arxiv.org/html/2606.14172#bib.bib8),[44](https://arxiv.org/html/2606.14172#bib.bib9),[18](https://arxiv.org/html/2606.14172#bib.bib10),[11](https://arxiv.org/html/2606.14172#bib.bib11)\]\. Other studies target clustering, biomedical analysis, or architecture search over multimodal graph structures\[[57](https://arxiv.org/html/2606.14172#bib.bib12),[12](https://arxiv.org/html/2606.14172#bib.bib13),[21](https://arxiv.org/html/2606.14172#bib.bib23),[1](https://arxiv.org/html/2606.14172#bib.bib24)\]\. Recent graph\-transformer and unified\-embedding methods further bind multimodal graph signals in shared spaces\[[14](https://arxiv.org/html/2606.14172#bib.bib14),[17](https://arxiv.org/html/2606.14172#bib.bib15),[16](https://arxiv.org/html/2606.14172#bib.bib16)\]\. Foundation\-model and generative directions extend this line to graph\-enhanced multimodal understanding, graph\-language assistants, and image generation from MAGs\[[8](https://arxiv.org/html/2606.14172#bib.bib17),[33](https://arxiv.org/html/2606.14172#bib.bib18),[55](https://arxiv.org/html/2606.14172#bib.bib22)\]\. 
Together, these works confirm the need to model topology and modality semantics jointly, but most still rely on task\-specific architectures or a fixed graph context\. CoMAG retains the broad MAG objective while learning task\-adaptive reliable contexts and modality\-preserving outputs within one forward pipeline\.
### II-B Reliable Graph Context and Multi-hop Propagation
Graph neural networks supply the standard machinery for propagating information over relational structures\. Classical message\-passing models aggregate local neighborhoods through convolution, attention, sampling, or higher\-order aggregation\[[19](https://arxiv.org/html/2606.14172#bib.bib25),[5](https://arxiv.org/html/2606.14172#bib.bib26),[45](https://arxiv.org/html/2606.14172#bib.bib27),[13](https://arxiv.org/html/2606.14172#bib.bib28),[53](https://arxiv.org/html/2606.14172#bib.bib31)\]\. Later analyses identify over\-smoothing and depth limits, while residual and reversible designs improve deep propagation\[[24](https://arxiv.org/html/2606.14172#bib.bib29),[34](https://arxiv.org/html/2606.14172#bib.bib30),[2](https://arxiv.org/html/2606.14172#bib.bib32),[22](https://arxiv.org/html/2606.14172#bib.bib36)\]\. Propagation\-filter methods further control how signals accumulate across multiple hops\[[20](https://arxiv.org/html/2606.14172#bib.bib33),[51](https://arxiv.org/html/2606.14172#bib.bib34),[4](https://arxiv.org/html/2606.14172#bib.bib35)\]\. Heterophily studies and graph transformer variants show that useful context can differ sharply from local homophilous neighborhoods\[[32](https://arxiv.org/html/2606.14172#bib.bib37),[25](https://arxiv.org/html/2606.14172#bib.bib38),[29](https://arxiv.org/html/2606.14172#bib.bib40),[38](https://arxiv.org/html/2606.14172#bib.bib41)\]\. In MAGs, context reliability is even more task\-dependent because an edge should be trusted only when its structural relation is compatible with multimodal semantics\. Task\-specific topology modification and multimodal graph transformers indicate that context should adapt to the objective\[[28](https://arxiv.org/html/2606.14172#bib.bib42),[16](https://arxiv.org/html/2606.14172#bib.bib16),[17](https://arxiv.org/html/2606.14172#bib.bib15)\]\. 
CoMAG follows this principle by constructing a reliability\-gated structural graph, a semantic complement graph, and a task\-conditioned context mixture before propagation, making context learning part of the backbone itself\.
### II-C Cross-modal Alignment and Shared-private Representation Learning
Cross\-modal alignment seeks comparable representations for heterogeneous modalities while retaining evidence that is unique to each source\. Vision\-language pretraining and multimodal binding models provide strong alignment priors\[[40](https://arxiv.org/html/2606.14172#bib.bib43),[10](https://arxiv.org/html/2606.14172#bib.bib44),[3](https://arxiv.org/html/2606.14172#bib.bib45)\], while modern text and vision encoders supply expressive node attributes for MAGs\[[6](https://arxiv.org/html/2606.14172#bib.bib46),[35](https://arxiv.org/html/2606.14172#bib.bib47),[42](https://arxiv.org/html/2606.14172#bib.bib48),[41](https://arxiv.org/html/2606.14172#bib.bib49),[27](https://arxiv.org/html/2606.14172#bib.bib50)\]\. Graph\-aware alignment methods inject relational evidence into image\-text matching, multimodal path alignment, and graph\-enhanced multimodal understanding\[[26](https://arxiv.org/html/2606.14172#bib.bib51),[49](https://arxiv.org/html/2606.14172#bib.bib52),[33](https://arxiv.org/html/2606.14172#bib.bib18),[8](https://arxiv.org/html/2606.14172#bib.bib17)\]\. Graph contrastive pretraining similarly learns robust representations by contrasting contextual signals\[[56](https://arxiv.org/html/2606.14172#bib.bib53),[39](https://arxiv.org/html/2606.14172#bib.bib54),[46](https://arxiv.org/html/2606.14172#bib.bib55),[37](https://arxiv.org/html/2606.14172#bib.bib56)\], and masked graph modeling reconstructs hidden evidence from surrounding context\[[15](https://arxiv.org/html/2606.14172#bib.bib57),[23](https://arxiv.org/html/2606.14172#bib.bib58)\]\. Yet forcing all modalities into one fused embedding can obscure modality\-specific cues, a concern also studied in shared\-specific and missing\-modality learning\[[48](https://arxiv.org/html/2606.14172#bib.bib59),[31](https://arxiv.org/html/2606.14172#bib.bib60),[30](https://arxiv.org/html/2606.14172#bib.bib61),[52](https://arxiv.org/html/2606.14172#bib.bib62)\]\. 
This issue is especially visible in MAGs, where cross-modal agreement may emerge at different propagation depths. A textual neighborhood can become category-discriminative after several hops, whereas visual evidence may remain local and appearance-driven. Final-embedding alignment alone cannot tell whether agreement comes from shared semantics or from useful modality-private evidence that should be preserved. Consequently, a unified MAG backbone needs alignment that is fine-grained enough to compare modality-hop evidence while still separating consensus from modality-specific residual information for $z_i^g$ and $e_i^{(m)}$.
## III Preliminaries
### III-A Multimodal Attributed Graph
A Multimodal Attributed Graph (MAG) is denoted as $\mathcal{G}=(\mathcal{V},\mathcal{E},\{X^{(m)}\}_{m=1}^{M})$, where $\mathcal{V}$ contains $N=|\mathcal{V}|$ nodes, $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ is represented by the adjacency matrix $A\in\mathbb{R}^{N\times N}$, and $X^{(m)}\in\mathbb{R}^{N\times d_m}$ is the node feature matrix for modality $m$. The $i$-th node has modality observations $\{x_i^{(m)}\}_{m=1}^{M}$, where $x_i^{(m)}$ is the $i$-th row of $X^{(m)}$. Different modalities may have different raw dimensions, encoders, and noise patterns, so a MAG learner must combine relational and modality evidence without assuming equal reliability for every task. Given a task index $\tau$, a unified backbone should produce $z_i^g\in\mathbb{R}^{d}$ for graph-centric objectives and $e_i^{(m)}\in\mathbb{R}^{d}$ for modality-centric objectives. We write $Z^g=[z_1^g,\ldots,z_N^g]^\top$ and $E^{(m)}=[e_1^{(m)},\ldots,e_N^{(m)}]^\top$. Our experiments use text and visual features, while the notation allows arbitrary $M$.
### III-B Learning Objectives
CoMAG is trained to support graph-centric and modality-centric objectives within one representation-learning problem. For graph-centric supervision, $z_i^g$ supports node classification, link prediction, and node clustering. We summarize these terms as $\mathcal{L}_{\mathrm{graph}}=\mathcal{L}_{\mathrm{cls}}+\mathcal{L}_{\mathrm{lp}}+\mathcal{L}_{\mathrm{cluster}}$, where $\mathcal{L}_{\mathrm{cls}}$ predicts node labels, $\mathcal{L}_{\mathrm{lp}}$ scores whether a pair $(i,j)$ should be connected, and $\mathcal{L}_{\mathrm{cluster}}$ encourages structurally and semantically related nodes to form coherent groups. This objective family requires $Z^g$ to retain topology, neighborhood context, and class-level semantics.
For modality-centric supervision, $\{e_i^{(m)}\}_{m=1}^{M}$ should make different modalities comparable without erasing modality-specific evidence. For $m\neq n$, the pair $(e_i^{(m)},e_i^{(n)})$ is positive because both representations describe node $i$, while $(e_i^{(m)},e_j^{(n)})$ with $j\neq i$ gives a negative pair. The retrieval or matching objective therefore increases similarity for aligned modality pairs and decreases it for mismatched pairs, and the full training loss instantiates this behavior through the modality term in Section [IV-F](https://arxiv.org/html/2606.14172#S4.SS6). Generation and alignment tasks reuse the same modality-aware representations as graph-conditioned semantic evidence.
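The positive/negative pairing described above can be illustrated with a small contrastive loss. The sketch below is only an assumed instantiation: it uses a symmetric InfoNCE loss with cosine similarity and a hypothetical `temperature` parameter, none of which is specified in this part of the paper, so the actual modality term may differ.

```python
import numpy as np

def symmetric_info_nce(E_m, E_n, temperature=0.1):
    """Contrastive matching loss between two modality embedding matrices.

    Row i of E_m and row i of E_n describe the same node (positive pair);
    every other row pairing is treated as a negative pair.
    ASSUMPTION: InfoNCE with cosine similarity; the paper's loss may differ.
    """
    E_m = E_m / np.linalg.norm(E_m, axis=1, keepdims=True)
    E_n = E_n / np.linalg.norm(E_n, axis=1, keepdims=True)
    logits = E_m @ E_n.T / temperature  # (N, N) similarity scores

    def row_nll(z):
        z = z - z.max(axis=1, keepdims=True)  # stabilise the log-sum-exp
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))  # positives sit on the diagonal

    # Average the m -> n and n -> m retrieval directions.
    return 0.5 * (row_nll(logits) + row_nll(logits.T))

rng = np.random.default_rng(0)
E_text = rng.normal(size=(6, 8))
# Nearly identical embeddings stand in for a well-aligned second modality.
loss_aligned = symmetric_info_nce(E_text, E_text + 0.01 * rng.normal(size=(6, 8)))
# Shifting the rows pairs every node with the wrong partner.
loss_shuffled = symmetric_info_nce(E_text, np.roll(E_text, 1, axis=0))
```

In this toy setup the aligned pairing should yield a much smaller loss than the shifted pairing, which is the behavior the paragraph asks of the retrieval objective.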
Two auxiliary objectives regularize the shared backbone. The orthogonality objective $\mathcal{L}_{\perp}$ separates cross-modal consensus from modality-private residuals, preventing alignment from collapsing all modalities into one over-compressed embedding. The smoothness objective $\mathcal{L}_{\mathrm{smooth}}$ regularizes $Z^g$ over the learned context graph so reliable neighbors have compatible graph representations. Together, CoMAG optimizes $\mathcal{L}=\lambda_g\mathcal{L}_{\mathrm{graph}}+\lambda_m\mathcal{L}_{\mathrm{ret}}+\lambda_{\perp}\mathcal{L}_{\perp}+\lambda_s\mathcal{L}_{\mathrm{smooth}}$, with detailed definitions in Section [IV-F](https://arxiv.org/html/2606.14172#S4.SS6). Thus, the preliminary objective is to learn graph and modality representations jointly without sacrificing structural discrimination or modality-specific evidence.
Figure 1: CoMAG architecture. CoMAG builds reliable task-adaptive contexts, propagates modality-specific hop trajectories, aligns hop tokens, and decouples shared and private representations for graph- and modality-level outputs.
## IV Methodology
### IV-A Overview of CoMAG
CoMAG addresses the two limitations identified earlier, namely fixed context propagation and over-compressed modality fusion. Given a MAG with $M$ modalities, it first learns a task-adaptive context graph, then propagates each modality as a separate multi-hop trajectory, aligns modality-hop evidence across modalities, and finally separates shared graph evidence from modality-private residual evidence. Graph construction, propagation, alignment, and decoupling remain distinct stages, so each stage addresses one failure mode while the full pipeline is optimized end to end. Algorithm [1](https://arxiv.org/html/2606.14172#alg1) summarizes the resulting inference and training procedure at the module level.
### IV-B Reliable Context Learning
Reliable context learning constructs the graph along which information will propagate\. Instead of treating every observed edge as equally useful, CoMAG asks whether the edge is supported by the modalities attached to its endpoints\. Each modality is first mapped into the same hidden dimension, making textual and visual evidence comparable while preserving modality\-specific noise patterns\.
Edge reliability estimation. For each observed edge $(i,j)\in\mathcal{E}$, CoMAG compares whether same-modality and cross-modality evidence agree across the two endpoints. Agreement suggests that the edge reflects a meaningful relation, while conflict tells the model to reduce the edge influence before message passing. This reliability decision is parameterized as

$$R_{ij}=\sigma\!\Big(g_R\big([A_{ij},\,S_{ij}^{\mathrm{intra}},\,S_{ij}^{\mathrm{cross}},\,|S_{ij}^{\mathrm{intra}}-S_{ij}^{\mathrm{cross}}|]\big)\Big), \tag{1}$$

where $S_{ij}^{\mathrm{intra}}$ and $S_{ij}^{\mathrm{cross}}$ denote within-modality and cross-modality consistency, and $g_R(\cdot)$ is a lightweight scoring network. The resulting value $R_{ij}\in[0,1]$ determines how strongly the observed edge should participate in propagation.
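As a rough illustration of Eq. (1), the sketch below scores a few edges of a two-modality graph. Several pieces are assumptions that the text does not fix: cosine similarity is used for both consistency scores, the two modalities are averaged, and $g_R$ is reduced to a single linear layer with hypothetical parameters `w` and `b`.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    """Row-wise cosine similarity between two (E, d) arrays."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return num / den

def edge_reliability(edges, a_vals, H_text, H_img, w, b):
    """R_ij = sigmoid(g_R([A_ij, S_intra, S_cross, |S_intra - S_cross|])).

    ASSUMPTIONS (not from the paper): cosine similarity for both
    consistency scores and one linear layer (w, b) standing in for g_R.
    """
    i, j = edges[:, 0], edges[:, 1]
    # Same-modality agreement across the endpoints, averaged over modalities.
    s_intra = 0.5 * (cosine(H_text[i], H_text[j]) + cosine(H_img[i], H_img[j]))
    # Cross-modality agreement: text of one endpoint vs. image of the other.
    s_cross = 0.5 * (cosine(H_text[i], H_img[j]) + cosine(H_img[i], H_text[j]))
    feats = np.stack([a_vals, s_intra, s_cross, np.abs(s_intra - s_cross)], axis=1)
    logits = feats @ w + b
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid keeps R_ij in [0, 1]

rng = np.random.default_rng(0)
H_text = rng.normal(size=(5, 8))   # hidden text features for 5 nodes
H_img = rng.normal(size=(5, 8))    # hidden image features, same hidden size
edges = np.array([[0, 1], [1, 2], [3, 4]])
a_vals = np.ones(len(edges))       # observed edge weights A_ij
w = rng.normal(size=4)             # hypothetical scorer weights
R = edge_reliability(edges, a_vals, H_text, H_img, w, b=0.0)
```

The returned vector holds one reliability value per observed edge, which is then used to reweight the adjacency before propagation.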
Semantic context recovery. Filtering unreliable edges improves the observed topology but does not restore relations that are missing from the graph. CoMAG therefore builds a semantic complement graph from cross-modal similarity and keeps only high-confidence neighbors. The reliability-filtered topology protects message passing from spurious edges that already exist, while the semantic complement supplies context that the raw topology fails to expose. The two context components are

$$\begin{aligned} A^{\mathrm{rel}} &= \operatorname{Norm}(A\odot R), \\ A^{\mathrm{sem}} &= \operatorname{Norm}\left(\operatorname{TopK}\left(\bar{S}^{\mathrm{cross}}\right)\right). \end{aligned} \tag{2}$$

Here $\bar{S}^{\mathrm{cross}}$ denotes the cross-modal semantic similarity matrix used to recover high-confidence missing neighbors.
Task-adaptive context gate. Different downstream objectives do not always need the same structural prior. CoMAG uses the task embedding and a graph-level summary to choose how much mass to assign to reliable topology, semantic complement, and self information. The resulting context graph is

$$Q_\tau=\pi_{\tau,1}A^{\mathrm{rel}}+\pi_{\tau,2}A^{\mathrm{sem}}+\pi_{\tau,3}I. \tag{3}$$

Here $A^{\mathrm{rel}}$ is the normalized reliability-filtered topology, $A^{\mathrm{sem}}$ is the normalized semantic complement graph, and $\pi_\tau$ is a task-conditioned simplex weight. The identity channel keeps self evidence available, which prevents the context graph from forcing every node to rely only on neighbors.
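Eqs. (2) and (3) can be sketched on dense matrices as follows. This is an illustrative reading under stated assumptions: $\operatorname{Norm}$ is taken to be row normalization (consistent with the row-normalized $Q_\tau$ assumed later in Theorem 1), $\operatorname{TopK}$ keeps the `k` largest off-diagonal similarities per row, and the gate weights `pi` come from a fixed softmax instead of a learned task embedding.

```python
import numpy as np

def row_normalize(M, eps=1e-12):
    """Scale each row to sum to one; all-zero rows are left as zeros."""
    return M / np.maximum(M.sum(axis=1, keepdims=True), eps)

def top_k_neighbors(S, k):
    """Keep the k largest off-diagonal similarities in each row of S."""
    S = S.astype(float).copy()
    np.fill_diagonal(S, -np.inf)  # a node is not its own semantic neighbor
    idx = np.argpartition(-S, k - 1, axis=1)[:, :k]
    rows = np.arange(S.shape[0])[:, None]
    out = np.zeros_like(S)
    # Clip negative similarities so the graph weights stay non-negative.
    out[rows, idx] = np.maximum(S[rows, idx], 0.0)
    return out

def context_graph(A, R, S_cross, pi, k):
    """Q_tau = pi_1 * A_rel + pi_2 * A_sem + pi_3 * I (Eqs. 2 and 3)."""
    A_rel = row_normalize(A * R)                        # reliability-filtered topology
    A_sem = row_normalize(top_k_neighbors(S_cross, k))  # semantic complement graph
    return pi[0] * A_rel + pi[1] * A_sem + pi[2] * np.eye(A.shape[0])

rng = np.random.default_rng(0)
N = 5
A = np.roll(np.eye(N), 1, axis=1) + np.roll(np.eye(N), -1, axis=1)  # ring graph
R = rng.uniform(0.1, 1.0, size=(N, N))        # stand-in edge reliabilities
S_cross = rng.uniform(0.0, 1.0, size=(N, N))  # stand-in cross-modal similarity
gate_logits = np.array([1.0, 0.5, -0.5])      # hypothetical task-gate logits
pi = np.exp(gate_logits) / np.exp(gate_logits).sum()
Q = context_graph(A, R, S_cross, pi, k=2)
```

Because each component is row-stochastic here and `pi` lies on the simplex, every row of `Q` sums to one. A sparse implementation would be needed to keep the edge-linear cost the paper mentions; the dense form is only for readability.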
### IV-C Modality-specific Multi-hop Context Trajectory
Once $Q_\tau$ is built, CoMAG propagates each modality through its own trajectory. Every modality receives the same task-aware context graph while keeping its own hidden state across hops, allowing each modality to react differently to the same graph context.
Task-conditioned propagation coefficients. For task $\tau$, CoMAG generates modality-specific coefficients by

$$(\gamma_{\tau,m},\alpha_{\tau,m},\beta_{\tau,m})=\operatorname{softmax}\left(W_c^{(m)}q_\tau\right), \tag{4}$$

so each modality can adjust how much it relies on its original signal, its own propagated context, and contextual evidence from other modalities. The softmax form yields a convex update, which is the condition used later in Theorem [1](https://arxiv.org/html/2606.14172#Thmtheorem1).
Multi-hop context propagation. Starting from $H_{\tau,0}^{(m)}=H_0^{(m)}$, CoMAG updates modality $m$ as

$$\begin{aligned} \bar{H}_{\tau,k}^{(-m)} &= \frac{1}{M-1}\sum_{n\neq m}H_{\tau,k}^{(n)}, \\ H_{\tau,k+1}^{(m)} &= \gamma_m H_0^{(m)}+\alpha_m Q_\tau H_{\tau,k}^{(m)}+\beta_m Q_\tau\bar{H}_{\tau,k}^{(-m)}, \end{aligned} \tag{5}$$

where $\gamma_m$, $\alpha_m$, and $\beta_m$ are the coefficients of Eq. (4) with the task index $\tau$ omitted. The first term re-injects the original modality evidence, the second propagates same-modality context, and the third lets other modalities correct the trajectory through the learned graph. Each hop also uses normalization and non-linearity in implementation. CoMAG keeps the whole sequence of hidden states after $K$ steps, so early states preserve local modality evidence while later states encode broader graph context.
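The linear core of Eqs. (4) and (5) can be sketched as below. The sketch deliberately omits the per-hop normalization and non-linearity mentioned above, assumes at least two modalities, and uses randomly drawn stand-ins for the learned matrices $W_c^{(m)}$ and the task embedding $q_\tau$.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def propagate(Q, H0, W_c, q_tau, K):
    """Linear core of Eq. (5). H0 is a list of M >= 2 arrays of shape (N, d).

    Returns traj, where traj[m][k] is the hop-k state of modality m
    (k = 0..K), and coef, the (gamma, alpha, beta) triple per modality.
    NOTE: normalization and non-linearity are omitted in this sketch.
    """
    M = len(H0)
    coef = [softmax(W_c[m] @ q_tau) for m in range(M)]  # Eq. (4)
    traj = [[H0[m]] for m in range(M)]
    for k in range(K):
        current = [traj[m][k] for m in range(M)]
        total = sum(current)
        for m in range(M):
            gamma, alpha, beta = coef[m]
            H_bar = (total - current[m]) / (M - 1)  # mean of the other modalities
            traj[m].append(
                gamma * H0[m] + alpha * (Q @ current[m]) + beta * (Q @ H_bar)
            )
    return traj, coef

rng = np.random.default_rng(0)
N, d, M, K = 4, 3, 2, 3
Q = rng.uniform(size=(N, N))
Q = Q / Q.sum(axis=1, keepdims=True)                 # row-normalized context graph
H0 = [rng.normal(size=(N, d)) for _ in range(M)]
W_c = [rng.normal(size=(3, 5)) for _ in range(M)]    # hypothetical coefficient heads
q_tau = rng.normal(size=5)                           # hypothetical task embedding
traj, coef = propagate(Q, H0, W_c, q_tau, K)
```

With a row-stochastic `Q` and coefficients on the simplex, every hop state is a convex combination of earlier rows, so its entries never exceed the largest magnitude in the initial features. This is the intuition behind the stability claim analyzed in Section V.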
### IV-D Hop-token Cross-modal Alignment
Cross\-modal correspondence need not be hop\-synchronous\. CoMAG treats each modality\-hop state as an alignment token and attaches modality and hop\-position embeddings, so matching remains aware of both evidence source and receptive\-field depth\. The token sequence is
$$\begin{aligned} u_i^{(m,k)} &= H_{\tau,k}^{(m)}[i]+r_m+\ell_k, \\ U_i^{(m)} &= \left[u_i^{(m,0)},\ldots,u_i^{(m,K)}\right], \end{aligned} \tag{6}$$

where $r_m$ and $\ell_k$ are learnable modality and hop-position embeddings.
Distance-penalized cross-modal attention. For each ordered modality pair $(m,n)$, CoMAG compares the token sequence of modality $m$ against that of modality $n$. The attention score may match across hops, but the distance penalty prevents remote hops from dominating without strong semantic support. The matching matrix is

$$B_i^{m\leftarrow n}=\operatorname{softmax}\left(\frac{Q_i^{(m)}(K_i^{(n)})^{\top}}{\sqrt{d_h}}-\lambda_h D_{\mathrm{hop}}\right), \tag{7}$$

where $D_{\mathrm{hop}}[k,l]=|k-l|$, and $Q_i^{(m)}$ and $K_i^{(n)}$ are projected token sequences. The coefficient $\lambda_h$ encourages nearby-hop matching when semantic scores are comparable while still allowing cross-hop transfer when the evidence justifies it.
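A single-node, single-head sketch of Eqs. (6) and (7) is given below. It assumes the softmax is taken row-wise over the hops of modality $n$, and it uses random stand-ins for the learnable embeddings $r_m$, $\ell_k$ and for the projections (named `W_q` and `W_k` here for illustration).

```python
import numpy as np

def hop_tokens(H_traj_i, r_m, ell):
    """Eq. (6): u^(m,k) = H_k^(m)[i] + r_m + ell_k, stacked for k = 0..K.

    H_traj_i has shape (K+1, d): the hop states of one node in one modality.
    """
    return H_traj_i + r_m[None, :] + ell

def hop_penalized_attention(U_m, U_n, W_q, W_k, lam):
    """Eq. (7): softmax(Q K^T / sqrt(d_h) - lam * D_hop), row-wise over hops of n."""
    Qm = U_m @ W_q
    Kn = U_n @ W_k
    d_h = Qm.shape[1]
    hops = np.arange(U_m.shape[0])
    D_hop = np.abs(hops[:, None] - hops[None, :])    # D_hop[k, l] = |k - l|
    scores = Qm @ Kn.T / np.sqrt(d_h) - lam * D_hop
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    B = np.exp(scores)
    return B / B.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
K, d, d_h = 3, 8, 4
r_text, r_img = rng.normal(size=d), rng.normal(size=d)  # modality embeddings
ell = rng.normal(size=(K + 1, d))                       # hop-position embeddings
U_text = hop_tokens(rng.normal(size=(K + 1, d)), r_text, ell)
U_img = hop_tokens(rng.normal(size=(K + 1, d)), r_img, ell)
W_q = 0.1 * rng.normal(size=(d, d_h))
W_k = 0.1 * rng.normal(size=(d, d_h))
B = hop_penalized_attention(U_text, U_img, W_q, W_k, lam=0.5)
B_strict = hop_penalized_attention(U_text, U_img, W_q, W_k, lam=50.0)
```

With a moderate `lam`, mass can still flow to other hops when the semantic scores favor them; with a very large `lam`, each hop attends almost only to the same hop of the other modality, which recovers hop-synchronous matching as a limiting case.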
Aligned token construction. The matching matrix imports value tokens from modality $n$ into the hop positions of modality $m$. CoMAG blends the imported evidence with the original tokens, keeping each modality trajectory visible while adding cross-modal context.
Algorithm 1: CoMAG Training Pipeline
Input: MAG $\mathcal{G}$, task embedding $q_\tau$, hop depth $K$, semantic budget TopK, weights $\lambda_g,\lambda_m,\lambda_\perp,\lambda_s$
Output: Context graph $Q_\tau$, graph outputs $\{z_i^g\}$, modality outputs $\{e_i^{(m)}\}$, and loss $\mathcal{L}$
1: Encode each raw modality into hidden features $\{H_0^{(m)}\}_{m=1}^{M}$.
2: Estimate edge reliability $R_{ij}$ for observed edges with Eq. ([1](https://arxiv.org/html/2606.14172#S4.E1)).
3: Build context components by Eq. ([2](https://arxiv.org/html/2606.14172#S4.E2)).
4: Mix the two graphs with self connections to obtain $Q_\tau$ by Eq. ([3](https://arxiv.org/html/2606.14172#S4.E3)).
5: Generate task-conditioned propagation coefficients by Eq. ([4](https://arxiv.org/html/2606.14172#S4.E4)).
6: for $k=0,\ldots,K-1$ do
7:     Update each modality trajectory with Eq. ([5](https://arxiv.org/html/2606.14172#S4.E5)).
8: end for
9: Build modality-hop tokens by Eq. ([6](https://arxiv.org/html/2606.14172#S4.E6)) and align them with Eq. ([7](https://arxiv.org/html/2606.14172#S4.E7)).
10: Fuse reliable consensus tokens into $s_i$ by Eq. ([8](https://arxiv.org/html/2606.14172#S4.E8)).
11: Extract modality-private residuals $p_i^{(m)}$ by Eq. ([9](https://arxiv.org/html/2606.14172#S4.E9)).
12: Decode $z_i^g$ and $e_i^{(m)}$ by Eq. ([10](https://arxiv.org/html/2606.14172#S4.E10)).
13: if training then
14:     Apply the orthogonal constraint in Eq. ([11](https://arxiv.org/html/2606.14172#S4.E11)).
15:     Optimize the total objective in Eq. ([12](https://arxiv.org/html/2606.14172#S4.E12)).
16: end if
17: return $Q_\tau,\{z_i^g\},\{e_i^{(m)}\}$, and $\mathcal{L}$ if training.
### IV-E Shared-private Representation Decoupling
Shared\-private decoupling prevents cross\-modal alignment from erasing useful modality\-specific evidence\. The aligned tokens form a shared consensus, while modality\-level tasks retain details that may not be useful for graph\-centric prediction\. CoMAG therefore constructs one shared representation and one private residual for each modality\.
Shared fusion\.For each token, CoMAG measures how consistent the original modality\-hop state is with its aligned counterpart\. Consistent tokens receive larger weights, while unreliable or weakly aligned tokens contribute less to the shared representation\. The shared representation is
si=∑m=1M∑k=0Kai\(m,k\)fi\(m,k\)\.s\_\{i\}=\\sum\_\{m=1\}^\{M\}\\sum\_\{k=0\}^\{K\}a\_\{i\}^\{\(m,k\)\}f\_\{i\}^\{\(m,k\)\}\.\(8\)Hereai\(m,k\)a\_\{i\}^\{\(m,k\)\}is a learned consistency\-aware token weight, andfi\(m,k\)f\_\{i\}^\{\(m,k\)\}is the fused token feature built from the original and aligned tokens\. This weighted fusion gives graph\-centric tasks a stable consensus representation\.
Private residual extraction\.The private branch pools the original trajectory of each modality and removes the part already explained by the shared representation\.
pi\(m\)=hpool,i\(m\)−Pmsi,p\_\{i\}^\{\(m\)\}=h\_\{\\mathrm\{pool\},i\}^\{\(m\)\}\-P\_\{m\}s\_\{i\},\(9\)wherehpool,i\(m\)h\_\{\\mathrm\{pool\},i\}^\{\(m\)\}is the pooled trajectory feature andPmP\_\{m\}is a modality\-specific projection\. The subtraction makes the private branch carry residual evidence\.
Output representations and orthogonality. CoMAG uses $s_i$ as the graph-centric representation $z_i^g$ and combines $s_i$ with $p_i^{(m)}$ to obtain the modality-centric representation $e_i^{(m)}$:

$$\begin{aligned} z_i^g &= s_i,\\ e_i^{(m)} &= \operatorname{Norm}\left(W_s s_i+W_p p_i^{(m)}\right). \end{aligned} \tag{10}$$

This output design lets graph prediction rely on cross-modal consensus while retrieval, matching, and generation retain modality-specific cues. CoMAG further limits redundancy between shared and private subspaces with

$$\mathcal{L}_{\perp}=\frac{1}{NM}\sum_{i=1}^{N}\sum_{m=1}^{M}\left(\frac{s_i^{\top}p_i^{(m)}}{\|s_i\|_2\,\|p_i^{(m)}\|_2}\right)^{2}. \tag{11}$$

Proposition [4](https://arxiv.org/html/2606.14172#Thmtheorem4) explains why reducing this alignment helps preserve modality-private information.
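The orthogonality loss of Eq. (11) is the mean squared cosine between each shared vector and each private residual. A minimal sketch follows; the function name and the small `eps` guard against zero norms are assumptions added for illustration and are not taken from the paper.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def orthogonality_loss(shared, private, eps=1e-12):
    """Eq. (11): shared[i] is s_i, private[i][m] is p_i^(m)."""
    total, count = 0.0, 0
    for s_i, residuals in zip(shared, private):
        for p_im in residuals:
            inner = sum(a * b for a, b in zip(s_i, p_im))
            # eps is an illustrative guard against division by zero.
            cos = inner / (l2_norm(s_i) * l2_norm(p_im) + eps)
            total += cos * cos
            count += 1
    return total / count

shared = [[1.0, 0.0], [0.0, 2.0]]
private = [
    [[0.0, 3.0], [1.0, 0.0]],   # orthogonal to s_1, then parallel to s_1
    [[1.0, 1.0], [5.0, 0.0]],   # 45 degrees from s_2, then orthogonal to s_2
]
loss = orthogonality_loss(shared, private)   # squared cosines 0, 1, 0.5, 0
```

The loss is zero only when every private residual is orthogonal to its shared vector, and it is insensitive to the norms of either vector.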
### IV-F Optimization
CoMAG is trained with graph supervision, modality supervision, and two auxiliary regularizers. The graph term covers available classification, link prediction, and clustering targets, while the modality term pulls paired node modalities together and separates mismatched pairs. The smoothness term regularizes $Z^g$ over $Q_\tau$, so compatibility follows reliable context edges rather than the raw graph alone. The complete objective is

$$\mathcal{L}=\lambda_g\mathcal{L}_{\mathrm{graph}}+\lambda_m\mathcal{L}_{\mathrm{ret}}+\lambda_{\perp}\mathcal{L}_{\perp}+\lambda_s\mathcal{L}_{\mathrm{smooth}}. \tag{12}$$

The weights $\lambda_g$, $\lambda_m$, $\lambda_{\perp}$, and $\lambda_s$ balance task supervision, modality alignment, shared-private separation, and context smoothness. The graph and modality terms train the two output families, while the orthogonal and smoothness terms keep consensus and context reliable without forcing both objectives into the same representation.
TABLE I: OpenMAG dataset statistics.
## V Theoretical Analysis
We analyze the linear core of Eq\. \(5\)\. The following results provide compact theoretical justification for stable propagation, over\-smoothing mitigation, hop\-distance regularization, and shared\-private collapse control\.
###### Theorem 1 (Stable Context Propagation)
Assume $Q_\tau$ is non-negative and row-normalized, and the propagation coefficients in Eq. ([4](https://arxiv.org/html/2606.14172#S4.E4)) satisfy $\gamma_m+\alpha_m+\beta_m=1$ with $\gamma_{\min}=\min_m\gamma_m>0$. Then the stacked propagation core of Eq. ([5](https://arxiv.org/html/2606.14172#S4.E5)) can be written as

$$\mathbf{H}_{k+1}=P_\tau\mathbf{H}_k+\Gamma\mathbf{H}_0, \tag{13}$$

where $\rho(P_\tau)\leq 1-\gamma_{\min}<1$. Therefore, $\{\mathbf{H}_k\}$ converges to the unique fixed point

$$\mathbf{H}^{*}=(I-P_\tau)^{-1}\Gamma\mathbf{H}_0, \tag{14}$$

with asymptotic linear rate at most $1-\gamma_{\min}$.

Proof sketch. Stacking modalities gives a homogeneous operator with row mass at most $1-\gamma_{\min}$. Because the spectral radius of a matrix is bounded by its maximum absolute row sum, $\rho(P_\tau)\leq 1-\gamma_{\min}<1$, and the Neumann series yields Eq. ([14](https://arxiv.org/html/2606.14172#S5.E14)). The result separates adaptivity from instability because the task gate may change the context mixture and the limiting representation, but the residual mass keeps the propagation operator contractive under the stated normalization.
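The contraction in Theorem 1 can be checked numerically on a toy instance. The sketch below makes simplifying assumptions that are not the paper's full operator: a single modality, $P_\tau=(1-\gamma)Q_\tau$, and $\Gamma=\gamma I$, with a hand-written 3-node row-stochastic $Q_\tau$. It iterates Eq. (13), uses a long run as a stand-in for the fixed point of Eq. (14), and records the max-entry error of the early iterates.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def step(Q, H, H0, gamma):
    """One update H <- (1 - gamma) * Q @ H + gamma * H0 (assumed simplified core)."""
    QH = matmul(Q, H)
    return [[(1.0 - gamma) * q + gamma * h0 for q, h0 in zip(q_row, h0_row)]
            for q_row, h0_row in zip(QH, H0)]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Hypothetical row-normalized context graph and initial features.
Q = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
H0 = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
gamma = 0.2

trajectory = [H0]
for _ in range(400):
    trajectory.append(step(Q, trajectory[-1], H0, gamma))
H_star = trajectory[-1]            # numerical stand-in for the fixed point

# Distance of the first 20 iterates from the fixed point.
errors = [max_abs_diff(H, H_star) for H in trajectory[:20]]
```

Under these assumptions each update shrinks the max-entry error by a factor of at most $1-\gamma$, which is the linear rate stated in the theorem; changing `Q` alters the limit but not the contraction bound.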
###### Theorem 2 (Residual Mitigates Over-smoothing)
Let $\Pi$ be the centering projection that removes the constant node-wise component, and let $\tilde{\mathbf{H}}_k=\Pi\mathbf{H}_k$. For any centered eigencomponent of the stacked propagation matrix $P_\tau$ with eigenvalue $\mu$ satisfying $|\mu|<1$, propagation without residual has $K$-hop response

$$g_{\mathrm{nores}}(\mu,K)=\mu^{K}, \tag{15}$$

which vanishes as $K$ grows. With residual injection, the corresponding fixed-point response is

$$g_{\mathrm{res}}(P_\tau)=(I-P_\tau)^{-1}\Gamma, \tag{16}$$

which is finite because $\rho(P_\tau)<1$ and remains non-zero for centered input components retained by $\Gamma$.

Proof sketch. Without residual injection, centered components decay as Eq. ([15](https://arxiv.org/html/2606.14172#S5.E15)). CoMAG re-injects the initial signal, giving the finite geometric filter in Eq. ([16](https://arxiv.org/html/2606.14172#S5.E16)). This explains why CoMAG keeps the whole trajectory rather than using only the final propagated state. Early hops preserve local modality evidence, later hops encode task-conditioned context, and the residual channel prevents the trajectory from degenerating into a purely low-frequency signal.
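For a single eigencomponent, the contrast between Eqs. (15) and (16) reduces to scalar arithmetic. The sketch below assumes a scalar residual weight $\gamma$ in place of $\Gamma$, so the fixed-point response becomes $\gamma/(1-\mu)$; the values of $\mu$ and $\gamma$ are illustrative only.

```python
def response_without_residual(mu, hops):
    """Eq. (15): the component is multiplied by mu at every hop."""
    return mu ** hops

def response_with_residual(mu, gamma, hops):
    """Scalar recursion h <- mu * h + gamma, starting from a unit input."""
    h = 1.0
    for _ in range(hops):
        h = mu * h + gamma
    return h

def fixed_point_response(mu, gamma):
    """Scalar form of Eq. (16): gamma / (1 - mu), assuming Gamma = gamma * I."""
    return gamma / (1.0 - mu)

mu, gamma = 0.6, 0.2
table = [(hops,
          response_without_residual(mu, hops),
          response_with_residual(mu, gamma, hops))
         for hops in (1, 5, 20, 50)]
limit = fixed_point_response(mu, gamma)
```

With these values the residual-free response is numerically negligible after 50 hops, whereas the residual recursion settles at the non-zero limit $\gamma/(1-\mu)=0.5$.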
###### Proposition 3 (Hop-distance Penalty Regularizes Cross-hop Alignment)
For a fixed query hop $k$ in Eq. ([7](https://arxiv.org/html/2606.14172#S4.E7)), let $a_{kl}$ denote the scaled semantic attention score before applying the hop-distance penalty. For two candidate hops $l_1$ and $l_2$, the attention ratio satisfies

$$\frac{B_i^{m\leftarrow n}[k,l_2]}{B_i^{m\leftarrow n}[k,l_1]}=\exp\left(a_{kl_2}-a_{kl_1}-\lambda_h\Delta d\right), \tag{17}$$

where $\Delta d=|k-l_2|-|k-l_1|$. If $l_2$ is farther from $k$ than $l_1$, then it must overcome an extra factor $\exp(-\lambda_h\Delta d)$ to receive larger attention.

Proof sketch. The same-row softmax normalizer cancels, leaving the semantic-score difference minus the hop-distance penalty in Eq. ([17](https://arxiv.org/html/2606.14172#S5.E17)). The proposition makes $\lambda_h$ interpretable as a distance prior. Increasing it favors hop-synchronous matching, while smaller values allow stronger cross-hop transfers when the semantic attention score warrants them.
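The ratio in Eq. (17) can be verified directly. The sketch assumes that one attention row has the form $\operatorname{softmax}_l(a_{kl}-\lambda_h|k-l|)$, which is what the proposition implies about Eq. (7); the scores, the query hop, and $\lambda_h$ below are made-up illustrative values.

```python
import math

def hop_attention_row(semantic_scores, query_hop, lambda_h):
    """Softmax over candidate hops l of a_kl - lambda_h * |k - l| (assumed form)."""
    logits = [a - lambda_h * abs(query_hop - l) for l, a in enumerate(semantic_scores)]
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.3, 1.2, -0.5, 0.8]     # a_kl for candidate hops l = 0..3
k, lambda_h = 1, 0.7
row = hop_attention_row(scores, k, lambda_h)

l1, l2 = 2, 3                      # l2 is one hop farther from k than l1
delta_d = abs(k - l2) - abs(k - l1)
predicted_ratio = math.exp(scores[l2] - scores[l1] - lambda_h * delta_d)
observed_ratio = row[l2] / row[l1]
```

Because both entries share the same row normalizer, the observed ratio matches the closed form regardless of the other scores in the row.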
###### Proposition 4 (Orthogonal Decoupling Controls Modality Collapse)
Assume $\|p_i^{(m)}\|_2\leq R_p$ for all nodes and modalities. Under the orthogonal loss in Eq. ([11](https://arxiv.org/html/2606.14172#S4.E11)), the mean projection of private residuals onto shared directions is bounded by

$$\frac{1}{NM}\sum_{i,m}\left\|\operatorname{proj}_{s_i}\left(p_i^{(m)}\right)\right\|_2\leq R_p\sqrt{\mathcal{L}_{\perp}}. \tag{18}$$

An analogous bound holds for projecting $s_i$ onto the direction of $p_i^{(m)}$ when $\|s_i\|_2$ is bounded.

Proof sketch. Eq. ([11](https://arxiv.org/html/2606.14172#S4.E11)) averages squared cosines. Projection length is bounded by the private norm times the absolute cosine, and Cauchy–Schwarz gives Eq. ([18](https://arxiv.org/html/2606.14172#S5.E18)). The bound clarifies the role of $\mathcal{L}_{\perp}$ in the output layer. If private residuals align too closely with $s_i$, then $e_i^{(m)}$ can be dominated by duplicated shared information. Reducing the projection term keeps modality-specific semantics for downstream tasks.
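The inequality of Eq. (18) can be sanity-checked on synthetic vectors. The sketch below takes $R_p$ as the largest private norm in the sample and compares the mean projection length with $R_p\sqrt{\mathcal{L}_{\perp}}$; the random Gaussian data and the function name are illustrative assumptions and have no connection to the paper's experiments.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def collapse_bound(shared, private):
    """Return (mean projection length, R_p * sqrt(L_perp)) for Eq. (18)."""
    cos_sq, proj_len, r_p = [], [], 0.0
    for s_i, residuals in zip(shared, private):
        for p_im in residuals:
            cos = dot(s_i, p_im) / (norm(s_i) * norm(p_im))
            cos_sq.append(cos * cos)
            # Length of the projection of p onto the direction of s.
            proj_len.append(abs(dot(s_i, p_im)) / norm(s_i))
            r_p = max(r_p, norm(p_im))
    loss = sum(cos_sq) / len(cos_sq)
    return sum(proj_len) / len(proj_len), r_p * math.sqrt(loss)

random.seed(0)
shared = [[random.gauss(0, 1) for _ in range(8)] for _ in range(50)]
private = [[[random.gauss(0, 1) for _ in range(8)] for _ in range(2)]
           for _ in range(50)]
lhs, rhs = collapse_bound(shared, private)

# A single pair attains the bound with equality: |cos| = 0.6 and R_p = 5.
tight_lhs, tight_rhs = collapse_bound([[1.0, 0.0]], [[[3.0, 4.0]]])
```

The bound is tight when every private residual has the same norm and the same absolute cosine, as in the single-pair case; with heterogeneous norms it is looser.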
Together, the results justify CoMAG modules through stable context, residual multi-hop trajectories, distance-regularized hop matching, and orthogonal decoupling of $s_i$ and $p_i^{(m)}$.
## VI Experiments
We conduct experiments to provide a comprehensive evaluation of CoMAG from five perspectives. Q1. How does CoMAG perform compared with representative baselines across graph-level and modality-level tasks, as reported in Tables [II](https://arxiv.org/html/2606.14172#S6.T2) and [III](https://arxiv.org/html/2606.14172#S6.T3)? Q2. Do the key designed modules contribute to the overall effectiveness of CoMAG? Q3. How sensitive is CoMAG to important hyperparameter settings? Q4. How robust is CoMAG compared with baselines under text, image, edge, and label noise? Q5. What is the theoretical computational complexity of CoMAG relative to representative MAG baselines, as summarized in Table [V](https://arxiv.org/html/2606.14172#S6.T5)?
### VI-A Experimental Setup
Datasets. We evaluate CoMAG on nine multimodal attributed graph datasets from the OpenMAG benchmark [[47](https://arxiv.org/html/2606.14172#bib.bib1)], including Movies, Grocery, Toys, DY, KU, Bili_dance, RedditS, Flickr30k, and SemArt. As summarized in Table [I](https://arxiv.org/html/2606.14172#S4.T1), these datasets cover e-commerce, social media, video recommendation, image networks, and art networks. All datasets contain text and visual node modalities and follow the train/validation/test splits provided by OpenMAG.
Baselines. We compare CoMAG with feature-only, graph-only, and multimodal attributed graph baselines. For feature-only learning, we include MLP. For graph-only topology learning, we include GCN [[19](https://arxiv.org/html/2606.14172#bib.bib25)] and GAT [[45](https://arxiv.org/html/2606.14172#bib.bib27)]. For representative MAG methods, we follow the model set analyzed in Table [V](https://arxiv.org/html/2606.14172#S6.T5), using DMGC [[12](https://arxiv.org/html/2606.14172#bib.bib13)] and DGF [[57](https://arxiv.org/html/2606.14172#bib.bib12)] as graph-enhanced methods, and MMGCN [[50](https://arxiv.org/html/2606.14172#bib.bib8)], MGAT [[44](https://arxiv.org/html/2606.14172#bib.bib9)], and LGMRec [[11](https://arxiv.org/html/2606.14172#bib.bib11)] as multimodal-enhanced methods. We also include UniGraph2 [[14](https://arxiv.org/html/2606.14172#bib.bib14)] as a recent unified multimodal graph baseline. CoMAG is evaluated as the proposed sparse-context modality-topology co-alignment model.
Evaluation metrics. We use task-specific metrics throughout the experiments. For node classification, accuracy and F1-score indicate whether the learned representation supports reliable class prediction. For link prediction, MRR and Hits@3 measure whether true edges are ranked near the top among candidate relations. For node clustering, NMI and ARI reflect how well the discovered groups agree with ground-truth categories. For modality matching, AUC and AP measure the separation between matched and mismatched cross-modal pairs. For G2Text, BLEU-4 and CIDEr evaluate the lexical and consensus quality of graph-conditioned text generation. For G2Image, CLIP-S and DINO-S indicate whether generated images preserve semantic alignment and visual consistency with the graph-conditioned target.
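As a concrete reading of the link-prediction metrics, MRR and Hits@3 can be computed from the rank of each true edge among its candidates. The sketch below is a generic implementation, not the OpenMAG evaluation code; the tie-handling rule (a candidate outranks the true edge only if its score is strictly higher) and the toy scores are assumptions.

```python
def rank_of_true(scores, true_index):
    """1-based rank of the true candidate; ties are resolved optimistically."""
    true_score = scores[true_index]
    return 1 + sum(1 for s in scores if s > true_score)

def mrr_and_hits(ranks, k=3):
    """Mean reciprocal rank and the fraction of queries ranked within the top k."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits

# Each query: (candidate scores, index of the true edge). Values are illustrative.
queries = [
    ([0.9, 0.1, 0.4], 0),              # true edge ranked 1st
    ([0.2, 0.8, 0.5], 2),              # true edge ranked 2nd
    ([0.7, 0.6, 0.5, 0.4, 0.1], 4),    # true edge ranked 5th
]
ranks = [rank_of_true(scores, true_index) for scores, true_index in queries]
mrr, hits_at_3 = mrr_and_hits(ranks, k=3)
```

MRR rewards placing the true edge at the very top, while Hits@3 only asks whether it appears among the first three candidates, so the two can move differently across methods.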
Implementation details. Unless otherwise specified, all trainable models use a hidden dimension of 256 and a dropout rate of 0.3. We follow the official hyperparameter settings of each baseline; where such guidance is missing, we optimize the baseline with the Adam optimizer.
Experiment environment. The experiments are conducted on a machine with an AMD EPYC 7J13 64-Core Processor, 503GB of memory, and an NVIDIA GeForce RTX 4090 GPU with 24GB of memory and CUDA 12.6. The operating system is Ubuntu 22.04.5 LTS.
TABLE II: Graph-level downstream task performance comparison. The best result is bold and the second best result is underlined.

TABLE III: Performance comparison on modality-level tasks.
### VI-B Performance Comparison
To answer Q1, Tables [II](https://arxiv.org/html/2606.14172#S6.T2) and [III](https://arxiv.org/html/2606.14172#S6.T3) compare CoMAG with representative baselines on graph-level and modality-level tasks.
Graph-level performance. Table [II](https://arxiv.org/html/2606.14172#S6.T2) shows that CoMAG achieves the strongest graph-level performance among the compared baselines across classification, link prediction, and clustering settings. On Grocery, CoMAG reaches 87.75, outperforming the strongest non-CoMAG result of 84.78. On DY, it reaches 95.75 compared with 92.51 from DGF. This advantage is not shared by all baselines. Feature-only and graph-only models perform poorly when either semantic evidence or reliable neighborhood selection is required, while several multimodal baselines remain competitive on individual datasets but fall behind when structural prediction and clustering are considered together. The observed behavior is consistent with CoMAG’s reliable context design, where edge filtering suppresses noisy topology, semantic neighbors supplement missing relations, and the task-adaptive context keeps propagation aligned with the downstream objective.
Modality-level performance. Table [III](https://arxiv.org/html/2606.14172#S6.T3) further shows that CoMAG remains the best among compared baselines on modality-level tasks. The reported tasks emphasize cross-modal matching and generation quality, where fine-grained multimodal evidence creates clearer separation among methods. On KU, CoMAG reports 93.81 while LGMRec gives 90.64. On Flickr30k, CoMAG improves the graph-conditioned text generation score from DGF’s 44.28 to 45.83. Methods that rely mainly on graph aggregation or direct multimodal fusion can still perform reasonably, but they show weaker modality preservation when fine-grained cross-modal evidence is needed. CoMAG’s hop-token alignment and shared-private decoupling explain this pattern by aligning contextual evidence across modalities while keeping modality-specific cues available for matching and generation.
### VI-C Ablation Study
To answer Q2, Table [IV](https://arxiv.org/html/2606.14172#S6.T4) compares the complete CoMAG with three variants that remove one key module at a time. The full model remains strongest across both the graph-level and generation settings, showing that the gains do not come from a single isolated design choice. Removing edge reliability weakens graph-level prediction, with the Grocery result falling from 87.75 to 85.82, because noisy observed edges are no longer filtered before propagation. Removing the semantic graph hurts both tasks by discarding missing semantic neighbors, which limits the context available beyond the raw topology. Removing the private residual causes the clearest degradation in G2Text quality, where the result falls to 40.86, indicating that shared consensus alone is insufficient for preserving modality-specific generation evidence. Overall, the ablation study confirms that reliable topology, semantic context recovery, and modality-private evidence are complementary components of CoMAG.
### VI-D Hyperparameter Analysis
The auxiliary loss weights $\lambda_{\perp}$ (orthogonal constraint) and $\lambda_s$ (graph smoothing) control the balance between graph-centric and modality-centric quality. A higher $\lambda_{\perp}$ pushes the shared and private subspaces toward orthogonality, while $\lambda_s$ controls how strongly shared representations follow the learned context graph. We explore $\lambda_{\perp}\in\{0.000,0.005,0.010,0.015,0.020\}$ and $\lambda_s\in\{0.00,0.05,0.10,0.15,0.20\}$.
TABLE IV: Ablation studies on key modules of CoMAG.

Figure 2: Hyperparameter sensitivity of two key parameters.

Figure 3: Robustness comparison under text, image, label, and edge noise.

To answer Q3, Fig. [2](https://arxiv.org/html/2606.14172#S6.F2) shows that CoMAG is stable over a broad region rather than depending on one fragile setting. Movies classification is strongest around $\lambda_{\perp}=0.015$, where moderate shared-private separation improves discriminative structure without suppressing shared graph evidence. SemArt favors moderate-to-strong separation with $\lambda_s$ near 0.10–0.15, indicating that smoothing helps image generation when it supports the learned context without over-constraining the shared representation. This pattern is consistent with CoMAG’s design, where orthogonal separation preserves private modality evidence and context smoothness is useful only when it remains aligned with reliable semantic neighborhoods.
### VI-E Robustness Analysis
To answer Q4, Fig. [3](https://arxiv.org/html/2606.14172#S6.F3) compares CoMAG with representative baselines under text, image, label, and edge noise. CoMAG keeps the strongest curve across all four settings, showing that its performance does not rely on a single clean source of evidence. Label noise is the hardest setting because corrupted supervision directly changes the target signal, while image noise is comparatively mild because text and graph context can still stabilize the representation. Edge noise especially separates CoMAG from topology-dependent baselines, as reliable edge filtering and semantic context recovery reduce the impact of corrupted observed topology. The robustness trends also support the shared-private design, since modality-private residuals help preserve useful evidence when one modality or structural channel becomes unreliable.
### VI-F Theoretical Complexity Analysis
We analyze the backbone-level complexity of representative MAG methods in Table [V](https://arxiv.org/html/2606.14172#S6.T5). Let $|\mathcal{V}|$ and $|\mathcal{E}|$ denote the number of nodes and edges, $M$ the number of modalities, $L$ the number of baseline GNN layers, $d$ the hidden dimension, $K$ a baseline-specific cluster or iteration count, and $|Q_\tau|$ the number of nonzero edges in CoMAG’s learned sparse context graph. For CoMAG, the hop depth $K_h$ is small and fixed, so its hop-token alignment overhead is absorbed into constant factors.
TABLE V: Theoretical complexity analysis between baselines. $Q_\tau$ denotes the learned sparse context graph.

Since $M$ and $K_h$ are small in practice and $|Q_\tau|$ is controlled by sparse semantic neighbor selection, CoMAG has edge-linear complexity comparable to sparse multimodal GNNs while avoiding the dense $O(|\mathcal{V}|^2)$ memory cost of structure-learning methods.
## VII Conclusion
We introduced CoMAG, a unified framework for multimodal attributed graph learning that serves graph-centric and modality-centric tasks from a single forward pass. Through four integrated modules (edge-reliability context graph, modality-specific hop trajectories, hop-token cross-modal matching, and shared-private decoupling), CoMAG provides task-adaptive context construction and modality-preserving alignment, supported by a theoretical analysis of its linear propagation core that covers linear convergence to a unique fixed point, over-smoothing mitigation, and bounded modality collapse. Experiments against representative feature-only, graph-only, multimodal, and unified MAG baselines on graph-level and modality-level OpenMAG tasks show consistent advantages for CoMAG. Future work will pursue broader domain evaluations and joint multi-task training with a calibrated loss schedule to further strengthen cross-task generalization.
## References
- [1] J. Cai, X. Wang, H. Li, Z. Zhang, and W. Zhu (2024) Multimodal graph neural architecture search under distribution shifts. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8227–8235.
- [2] M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li (2020) Simple and deep graph convolutional networks. In International Conference on Machine Learning (ICML).
- [3] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML).
- [4] E. Chien, J. Peng, P. Li, and O. Milenkovic (2021) Adaptive universal generalized pagerank graph neural network. In International Conference on Learning Representations (ICLR).
- [5] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, Vol. 29.
- [6] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations (ICLR).
- [7] Y. Ektefaie, G. Dasoulas, A. Noori, M. R. Farhat, and M. Zitnik (2023) Multimodal learning with graphs. Nature Machine Intelligence 5(4), pp. 340–350.
- [8] Y. Fang, B. Jin, J. Shen, S. Ding, Q. Tan, and J. Han (2025) GraphGPT-o: synergistic multimodal comprehension and generation on graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19467–19476.
- [9] Z. Fang, G. Yang, W. Lyu, Z. Hong, S. Zhong, W. Zuo, Y. Xie, Y. Yang, G. Wang, Y. Liu, and D. Zhang (2025) Cellular infrastructure sharing for network robustness: a citywide empirical study. IEEE Transactions on Mobile Computing 24(11), pp. 11386–11400. External Links: [Document](https://dx.doi.org/10.1109/TMC.2025.3580605).
- [10] R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, and I. Misra (2023) ImageBind: one embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15180–15190.
- [11] Z. Guo, J. Li, G. Li, C. Wang, S. Shi, and B. Ruan (2024) LGMRec: local and global graph learning for multimodal recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8454–8462.
- [12] Z. Guo, Z. Shen, X. Xie, L. Wen, and Z. Kang (2025) Disentangling homophily and heterophily in multimodal graph clustering. arXiv preprint arXiv:2507.15253.
- [13] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, Vol. 30.
- [14] Y. He, Y. Sui, X. He, Y. Liu, Y. Sun, and B. Hooi (2025) UniGraph2: learning a unified embedding space to bind multimodal graphs. In Proceedings of the ACM Web Conference, pp. 1759–1770.
- [15] Z. Hou, Y. He, Y. Cen, X. Liu, Y. Dong, E. Kharlamov, and J. Tang (2023) GraphMAE2: a decoding-enhanced masked self-supervised graph learner. In Proceedings of the ACM Web Conference, pp. 737–746.
- [16] J. Hu, Y. He, Y. Li, B. Hooi, and B. He (2026) NTSFormer: a self-teaching graph transformer for multimodal isolated cold-start node classification. In Proceedings of the AAAI Conference on Artificial Intelligence.
- [17] J. Hu, B. Hooi, B. He, and Y. Wei (2025) Modality-independent graph neural networks with global transformers for multimodal recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 11790–11798.
- [18] X. Jia, M. Jiang, Y. Dong, F. Zhu, H. Lin, Y. Xin, and H. Chen (2023) Multimodal heterogeneous graph attention network. Neural Computing and Applications 35(4), pp. 3357–3372.
- [19] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).
- [20] J. Klicpera, A. Bojchevski, and S. Günnemann (2019) Predict then propagate: graph neural networks meet personalized pagerank. In International Conference on Learning Representations (ICLR).
- [21] Z. Kong, L. Sun, H. Peng, L. Zhan, Y. Chen, and L. He (2021) Multiplex graph networks for multimodal brain network analysis. arXiv preprint arXiv:2108.00158.
- [22] G. Li, M. Müller, B. Ghanem, and V. Koltun (2021) Training graph neural networks with 1000 layers. In International Conference on Machine Learning (ICML), pp. 6437–6449.
- [23] J. Li, R. Wu, W. Sun, L. Chen, S. Tian, L. Zhu, C. Meng, Z. Zheng, and W. Wang (2023) What’s behind the mask: understanding masked graph modeling for graph autoencoders. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
- [24] Q. Li, Z. Han, and X.-M. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
- [25] D. Lim, F. Hohne, X. Li, S.-L. Huang, V. Gupta, O. Bhalerao, and S.-N. Lim (2021) Large scale learning on non-homophilous graphs: new benchmarks and strong simple methods. In Advances in Neural Information Processing Systems.
- [26] C. Liu, Z. Mao, T. Zhang, H. Xie, B. Wang, and Y. Zhang (2020) Graph structured network for image-text matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10918–10927.
- [27] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- [28] Z. Liu, C. Chen, L. Li, J. Zhou, X. Li, L. Song, and Y. Qi (2024) Task-specific topology modification for few-shot node classification on graphs. In Proceedings of the ACM Web Conference.
- [29] S. Luan, C. Hua, Q. Lu, J. Zhu, M. Zhao, S. Zhang, X.-W. Chang, and D. Precup (2022) Revisiting heterophily for graph neural networks. In Advances in Neural Information Processing Systems.
- [30] M. Ma, J. Ren, L. Zhao, D. Testuggine, and X. Peng (2022) Are multimodal transformers robust to missing modality? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18177–18186.
- [31] M. Ma, J. Ren, L. Zhao, S. Tulyakov, C. Wu, and X. Peng (2021) SMIL: multimodal learning with severely missing modality. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2302–2310.
- [32] D. Q. Nguyen, T. D. Nguyen, and D. Phung (2022) Universal graph transformer self-attention networks. In Companion Proceedings of the Web Conference, pp. 193–196.
- [33] X. Ning, D. Fu, T. Wei, W. Xu, and J. He (2025) Graph4MM: weaving multimodal learning with structural information. In Proceedings of the International Conference on Machine Learning (ICML).
- [34] K. Oono and T. Suzuki (2020) Graph neural networks exponentially lose expressive power for node classification. In International Conference on Learning Representations (ICLR).
- [35] M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. Huang, S.-W. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski (2024) DINOv2: learning robust visual features without supervision. Transactions on Machine Learning Research.
- [36] C. Peng, J. He, and F. Xia (2024) Learning on multimodal graphs: a survey. arXiv preprint arXiv:2402.05322.
- [37] Z. Peng, W. Huang, M. Luo, Q. Zheng, Y. Rong, T. Xu, and J. Huang (2020) Graph representation learning via graphical mutual information maximization. In Proceedings of The Web Conference.
- [38] O. Platonov, D. Kuznedelev, M. Diskin, A. Babenko, and L. Prokhorenkova (2023) A critical look at the evaluation of gnns under heterophily: are we really making progress? In International Conference on Learning Representations (ICLR).
- [39] J. Qiu, Q. Chen, Y. Dong, J. Zhang, H. Yang, M. Ding, K. Wang, and J. Tang (2020) GCC: graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1150–1160.
- [40] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever (2021) Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML).
- [41] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21(140), pp. 1–67.
- [42] N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In Proceedings of EMNLP-IJCNLP, pp. 3980–3990.
- [43] M. Tang, M. Liu, H. Li, J. Yang, C. Wei, B. Li, D. Li, R. Xu, Y. Xu, Z. Zhang, X. Wang, L. Liu, Y. Xie, C. Liu, L. Fawaz, L. Li, H. Wang, B. Zhu, and S. Reddy (2024) Async learned user embeddings for ads delivery optimization. External Links: 2406.05898.
- [44] Z. Tao, Y. Wei, X. Wang, X. He, X. Huang, and T.-S. Chua (2020) MGAT: multimodal graph attention network for recommendation. Information Processing & Management 57(5), pp. 102277.
- [45] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In International Conference on Learning Representations (ICLR).
- [46] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm (2019) Deep graph infomax. In International Conference on Learning Representations (ICLR).
- [47] C. Wan, X. Li, Y. Zuo, H. Deng, S. Li, B. Fan, H. Qin, R. Li, and G. Wang (2026) OpenMAG: a comprehensive benchmark for multimodal-attributed graph. arXiv preprint arXiv:2602.05576.
- [48] H. Wang, Y. Chen, C. Ma, J. Avery, L. Hull, and G. Carneiro (2023) Multi-modal learning with missing modality via shared-specific feature modelling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15878–15887.
- [49] S. Wang, R. Wang, Z. Yao, S. Shan, and X. Chen (2020) Cross-modal scene graph matching for relationship-aware image-text retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1508–1517.
- [50] Y. Wei, X. Wang, L. Nie, X. He, R. Hong, and T.-S. Chua (2019) MMGCN: multi-modal graph convolution network for personalized recommendation of micro-video. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 1437–1445.
- \[51\]F\. Wu, A\. Souza, T\. Zhang, C\. Fifty, T\. Yu, and K\. Weinberger\(2019\)Simplifying graph convolutional networks\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§II\-B](https://arxiv.org/html/2606.14172#S2.SS2.p1.1)\.
- \[52\]R\. Wu, H\. Wang, H\. Chen, and G\. Carneiro\(2024\)Deep multimodal learning with missing modality: a survey\.arXiv preprint arXiv:2409\.07825\.Cited by:[§II\-C](https://arxiv.org/html/2606.14172#S2.SS3.p1.2)\.
- \[53\]K\. Xu, W\. Hu, J\. Leskovec, and S\. Jegelka\(2019\)How powerful are graph neural networks?\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§II\-B](https://arxiv.org/html/2606.14172#S2.SS2.p1.1)\.
- \[54\]H\. Yan, C\. Li, Z\. Yu, J\. Yin, R\. Liu, P\. Zhang, W\. Han, M\. Li, Z\. Zeng, H\. Sun, W\. Deng, F\. Sun, Q\. Zhang, and S\. Wang\(2024\)When graph meets multimodal: benchmarking on multimodal attributed graphs learning\.arXiv preprint arXiv:2410\.09132\.Cited by:[§II\-A](https://arxiv.org/html/2606.14172#S2.SS1.p1.1)\.
- \[55\]M\. Yoon, J\. Y\. Koh, B\. Hooi, and R\. Salakhutdinov\(2023\)Multimodal graph learning for generative tasks\.arXiv preprint arXiv:2310\.07478\.Cited by:[§II\-A](https://arxiv.org/html/2606.14172#S2.SS1.p1.1)\.
- \[56\]Y\. You, T\. Chen, Y\. Sui, T\. Chen, Z\. Wang, and Y\. Shen\(2020\)Graph contrastive learning with augmentations\.InAdvances in Neural Information Processing Systems,Cited by:[§II\-C](https://arxiv.org/html/2606.14172#S2.SS3.p1.2)\.
- \[57\]H\. Zheng, R\. Yang, H\. Wang, and J\. Xu\(2025\)Cross\-contrastive clustering for multimodal attributed graphs with dual graph filtering\.arXiv preprint arXiv:2511\.20030\.Cited by:[§I](https://arxiv.org/html/2606.14172#S1.p2.1),[§II\-A](https://arxiv.org/html/2606.14172#S2.SS1.p1.1),[§VI\-A](https://arxiv.org/html/2606.14172#S6.SS1.p2.1)\.
- \[58\]J\. Zhu, Y\. Zhou, S\. Qian, Z\. He, T\. Zhao, N\. Shah, and D\. Koutra\(2025\)Mosaic of modalities: a comprehensive benchmark for multimodal graph learning\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),pp\. 14215–14224\.Cited by:[§II\-A](https://arxiv.org/html/2606.14172#S2.SS1.p1.1)\.