HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning

arXiv cs.CL 04/20/26, 04:00 AM Papers

Summary

HyperGVL introduces the first benchmark for evaluating Large Vision-Language Models on hypergraph understanding and reasoning, featuring 84,000 QA samples across 12 tasks and real-world applications. The paper also proposes WiseHyGR, a generalizable router that enhances LVLM performance through adaptive hypergraph representations.

arXiv:2604.15648v1 Announce Type: new Abstract: Large Vision-Language Models (LVLMs) consistently require new evaluation domains to push their capabilities further, yet their proficiency with hypergraphs remains unexplored. In practice, hypergraphs have significant applications in life sciences and social networks. While recent advances in LVLMs have shown promise in understanding complex topologies, a critical gap exists: there is no comprehensive benchmark to assess LVLM capabilities with hypergraphs, leaving the scope of their abilities undefined. To address this gap, we introduce HyperGVL, the first benchmark for evaluating LVLM proficiency in hypergraph understanding and reasoning. HyperGVL provides a comprehensive evaluation of 12 state-of-the-art LVLMs across 84,000 vision-language question-answering (QA) samples spanning 12 tasks, from basic component counting to complex NP-hard problem solving. The hypergraphs included encompass multiscale synthetic structures as well as real-world citation and protein networks. Additionally, we investigate the effectiveness of 12 textual and visual hypergraph representations and introduce WiseHyGR, a generalizable router that improves LVLM performance on hypergraphs through learning adaptive representations. This work represents a significant step toward bridging hypergraphs and LVLMs.


# HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning
Source: https://arxiv.org/html/2604.15648
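Authors: Yanbin Wei (1, 2; equal contribution), Chun Kang (4; equal contribution), Siwei Li (4), Haoxuan Che (3), Yang Chen (1), Hua Liu (1), Jian Liu (4), Zhuang Liu (4), Can Ouyang (4), Fei Xing (1), Lei Sha (4; corresponding author), Rui Liu (3; corresponding author), Yu Zhang (1; corresponding author), James Kwok (2)

Affiliations: (1) Southern University of Science and Technology; (2) Hong Kong University of Science and Technology; (3) Huawei Research; (4) Beihang University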

###### Abstract

Large Vision-Language Models (LVLMs) consistently require new evaluation arenas to push their expanding boundaries, yet their capabilities in hypergraph understanding remain unexplored. In the real world, hypergraphs have significant practical applications in areas such as life sciences and social networks. Recent advancements in LVLMs have shown promise in understanding complex topologies, yet there remains a lack of a benchmark to assess the capabilities of LVLMs on hypergraphs, leaving the boundaries of their abilities unclear. To fill this gap, we introduce HyperGVL, the first benchmark to evaluate the proficiency of LVLMs in hypergraph understanding and reasoning. HyperGVL provides a comprehensive assessment of 12 state-of-the-art LVLMs across 84,000 vision-language question-answering (QA) samples spanning 12 tasks, ranging from basic component counting to complex NP-hard problem reasoning. The involved hypergraphs contain multiscale synthetic structures and real-world citation and protein networks. Moreover, we examine the effects of 12 textual and visual hypergraph representations and introduce a generalizable router WiseHyGR that improves LVLMs in hypergraph understanding via learning adaptive representations. We believe this work is a step forward in connecting hypergraphs with LVLMs.


## 1 Introduction

Graphs serve as a fundamental data structure for modeling relationships between abstract concepts or tangible objects in the real world. Among graph subcategories, hypergraphs are significant because their hyperedges can effectively model high-order correlations among three or more entities. Hypergraph applications are widespread in the real world. For example, in social networks, hypergraphs can naturally represent community interactions, where a hyperedge can connect an arbitrary number of vertices, reflecting complex, high-order relationships within communities Contisciani et al. (2022 (https://arxiv.org/html/2604.15648#bib.bib3)). Similarly, in life sciences, hypergraphs excel at modeling interactions such as catalytic triplets in protein structures Ravetz et al. (2019 (https://arxiv.org/html/2604.15648#bib.bib12)), where ordinary graphs focusing only on pairwise relations fall short. Recent work also shows that hypergraphs can model complex relationships among pieces of information in retrieval-augmented generation more accurately Feng et al. (2025a (https://arxiv.org/html/2604.15648#bib.bib13)).

Figure 1: Overview of the HyperGVL benchmark.

Benchmark | Evaluation Capability | Graph Source | Structure Perception | High-Order | #Tasks | #Samples
---|---|---|---|---|---|---
GVLQA | Reasoning | Synthetic | Visual & Textual | ✗ | 7 | 157,896
VisionGraph | Understanding & Reasoning | Synthetic | Visual | ✗ | 10 | 3,000
VGCure | Understanding & Reasoning | Synthetic & Real-world | Visual | ✗ | 2 | 223,646
LLM4Hypergraph | Mainly Understanding | Synthetic & Real-world | Textual | ✓ | 15 | 21,500
HyperGVL (Ours) | Understanding & Reasoning | Synthetic & Real-world | Visual & Textual | ✓ | 12 | 84,000

Table 1: Comparison between HyperGVL and related graph analysis benchmarks for LVLMs/LLMs. #Tasks: number of response types; #Samples: total number of test samples; ✓/✗: supports / does not support high-order relationships.

Meanwhile, large vision-language models (LVLMs) exhibit outstanding performance across a wide range of downstream tasks with human-like understanding and reasoning abilities Li et al. (2025b (https://arxiv.org/html/2604.15648#bib.bib6)). This has triggered growing interest in employing LVLMs for graph learning problems, since the vision modality offers a natural way to comprehend structural information and facilitate graph-related reasoning. GVLQA Wei et al. (2024 (https://arxiv.org/html/2604.15648#bib.bib76)), VisionGraph Li et al. (2024 (https://arxiv.org/html/2604.15648#bib.bib77)), and VGCure Zhu et al. (2025b (https://arxiv.org/html/2604.15648#bib.bib11)) are among the first such efforts. However, these works limit their scope to ordinary graphs and do not explore the potential of LVLMs for high-order relationships in hypergraphs.

To address this gap, we introduce HyperGVL (Fig. 1 (https://arxiv.org/html/2604.15648#S1.F1)), the first comprehensive benchmark dataset designed to evaluate the capabilities of LVLMs on hypergraphs. HyperGVL consists of 84,000 vision-language question-answering (QA) pairs, covering both multiscale synthetic hypergraphs and real-world hypergraphs from citation and protein networks. The evaluation spans 12 tasks of varying difficulty levels, from fundamental hypergraph component understanding to challenging NP-hard problem reasoning. Additionally, HyperGVL integrates seven textual and five visual representations of hypergraphs, offering insights into task preferences and model capability boundaries across these diverse representations.

Based on the performance of LVLMs under different hypergraph representations, we train WiseHyGR, a generalizable router that can select appropriate hypergraph representations for given hypergraph problems. Experimental results validate that WiseHyGR generally enhances the hypergraph understanding and reasoning abilities of LVLMs, and the benefits generalize to downstream out-of-domain tasks.

The contributions of this work are threefold.

- We construct the HyperGVL benchmark, a new evaluation arena to assess LVLMs' capabilities in hypergraph understanding and reasoning.
- We extensively evaluate 12 leading LVLMs on HyperGVL and expose their actual capabilities. These evaluations, conducted from several perspectives, yield 14 observations.
- Based on model performance across hypergraph representations, we train WiseHyGR, a generalizable router to enhance LVLMs' performance on hypergraph understanding and reasoning tasks.

## 2 The HyperGVL Benchmark

In this section, we introduce the HyperGVL Benchmark, designed to delineate the ability boundaries of LVLMs in handling higher-order structures of hypergraphs.

### 2.1 Benchmark Uniqueness

Table 1 (https://arxiv.org/html/2604.15648#S1.T1) underscores the distinct role of HyperGVL within the landscape of existing benchmarks. Unlike conventional graph-related LVLM benchmarks, HyperGVL covers the higher-order relationships inherent in hypergraphs, which ordinary graphs cannot express. In contrast to text-only hypergraph benchmarks for large language models (LLMs), HyperGVL also integrates the visual perception effects unique to LVLMs and increases task diversity and complexity by including intricate reasoning challenges. More related works are introduced in Appendix A (https://arxiv.org/html/2604.15648#A1).

### 2.2 Hypergraph Organization

The hypergraphs in HyperGVL are organized with care. First, the benchmark comprises equal proportions of synthetic and real-world hypergraphs. Synthetic hypergraphs are generated using both random and regular-structured methods, providing a controlled environment for testing. In contrast, real-world hypergraphs are drawn from anonymized citation and protein networks, offering practical insights into real-world applications.

To ensure balanced complexity for comprehensive evaluation, we employed the scale partition protocol from Feng et al. (2025b (https://arxiv.org/html/2604.15648#bib.bib5)), and organized hypergraphs by vertex count into three scale groups: small, medium, and large, with a distribution ratio of 1:2:1. This categorization facilitates the assessment of model performance across varying levels of complexity. Detailed descriptions of the processes used to obtain these hypergraphs are provided in Appendix B (https://arxiv.org/html/2604.15648#A2).

### 2.3 Benchmark Tasks

In this section, we introduce the tasks with design considerations in HyperGVL.

#### 2.3.1 Design Principle

The tasks in HyperGVL are designed around two core dimensions: assessed capability and response type.

For assessed capabilities, tasks are divided into two main categories: understanding and reasoning. Understanding tasks evaluate three key atomic abilities: (1) basic element capture, which involves recognizing vertices and hyperedges; (2) adjacency perception, which entails understanding adjacency relationships among vertices; and (3) heuristic computation, which includes computing heuristics such as vertex degree and hyperedge order (i.e., the number of vertices in a hyperedge). On the other hand, reasoning tasks assess model abilities in terms of (1) algorithms, which involve solving problems with definitive algorithms, and (2) planning, where problems are NP-hard and lack definitive algorithms, requiring models to actively plan and devise valid solutions.

Based on these assessed capabilities, all tasks are categorized into a four-level difficulty hierarchy: Level-1 (querying a single atomic capability), Level-2 (combining multiple atomic capabilities), Level-3 (polynomial-solvable algorithms), and Level-4 (NP-hard planning). This stratification aligns with task complexity Bylander (1994 (https://arxiv.org/html/2604.15648#bib.bib291)), and we aim to verify whether it is consistent with the actual capability spectrum of LVLMs.

For response types, tasks are categorized into four types: (1) counting, (2) computing, (3) decision, and (4) descriptive tasks. This taxonomy enables a comprehensive evaluation of LVLMs across diverse cognitive processes.

Unlike LLM4Hypergraph Feng et al. (2025b (https://arxiv.org/html/2604.15648#bib.bib5)), which presents relatively simple tasks for recent models (e.g., Gemini-3 Flash achieved over 90% zero-shot accuracy in 13 out of 15 tasks in our testing), the proposed benchmark introduces more challenging tasks that require reasoning beyond structural understanding. This design aligns with evolving model capabilities and leaves ample room for their future improvement. Overall, the task distribution in HyperGVL encompasses a wider range of difficulty and diversity, establishing a comprehensive evaluation framework for LVLMs.

Task | Capability | Response Type | Difficulty | Example | #Sample
---|---|---|---|---|---
Understanding Tasks | | | | | 42,000
VC | Element | Counting | Level-1 | Q: How many vertices are in the hypergraph G? A: 15. | 7,000
HEC | Element | Counting | Level-1 | Q: How many hyperedges are in the hypergraph G? A: 23. | 7,000
Ne | Adjacency | Descriptive | Level-1 | Q: What are the direct neighbors of vertex v4 in hypergraph G? A: v0, v3, v5. | 7,000
DVC | Heuristic & Element | Counting | Level-2 | Q: How many vertices have degree 3 in hypergraph G? A: 7. | 7,000
OEC | Heuristic & Element | Counting | Level-2 | Q: How many hyperedges have order 4 in hypergraph G? A: 8. | 7,000
ONe | Heuristic & Adjacency | Descriptive | Level-2 | Q: What are the neighbors of vertex v5 when only considering hyperedges with order >= 2 in hypergraph G? A: v0, v3. | 7,000
Reasoning Tasks | | | | | 42,000
OSP | Algorithm | Computing | Level-3 | Q: What is the order-weighted shortest path length from v4 to v8? A: 8. | 7,000
OMF | Algorithm | Computing | Level-3 | Q: What is the order-weighted maximum flow from v4 to v8? A: 19. | 7,000
ISM | Algorithm | Decision | Level-3 | Q: Are these two hypergraphs isomorphic? A: Yes. | 7,000
3-CL | Planning | Descriptive | Level-4 (NP-hard) | Q: Please provide a 3-coloring strategy such that each hyperedge contains nodes with at least 2 different colors. A: Coloring: [v0:c0,v1:c1,v2:c2,v3:c0,v4:c1,v5:c2]. | 7,000
SHC | Planning | Descriptive | Level-4 (NP-hard) | Q: Please identify a strict hypercycle in the hypergraph G. A: Cycle: [e0,e3,e2,e4,e1]. | 7,000
HHM | Planning | Descriptive | Level-4 (NP-hard) | Q: Please provide a valid Hamiltonian path from v1 to v0. (Hamiltonian path = path visiting all vertices exactly once). A: Path: [e0,e1,e2,e3]. | 7,000

Table 2: Properties, statistics, and examples of all hypergraph understanding and reasoning tasks in HyperGVL.
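The descriptive Level-4 answers in Table 2 can be checked mechanically even though finding them is hard. As a minimal sketch, the Python below verifies a 3-CL answer against the condition stated in the example question (every hyperedge contains at least two different colors). The function name `is_valid_3_coloring`, the dictionary encoding of the answer, and the toy hyperedges are assumptions made for illustration; the benchmark's own evaluation code is not shown in this section.

```python
def is_valid_3_coloring(hyperedges, coloring, num_colors=3):
    """Check a proposed coloring (hypothetical helper, not the benchmark's scorer).

    hyperedges: list of sets of vertex names.
    coloring:   dict mapping vertex name -> color label.
    """
    vertices = set().union(*hyperedges)
    # Every vertex that appears in some hyperedge must receive a color.
    if not all(v in coloring for v in vertices):
        return False
    # At most `num_colors` distinct colors may be used on those vertices.
    if len({coloring[v] for v in vertices}) > num_colors:
        return False
    # No hyperedge may be monochromatic.
    return all(len({coloring[v] for v in e}) >= 2 for e in hyperedges)


# The coloring from the Table 2 example, written as a dict.
coloring = {"v0": "c0", "v1": "c1", "v2": "c2",
            "v3": "c0", "v4": "c1", "v5": "c2"}

# Toy hyperedges (assumed; the hypergraph behind the Table 2 example is not given).
good_edges = [{"v0", "v1", "v2"}, {"v2", "v3", "v4"}, {"v0", "v3", "v5"}]
bad_edges = good_edges + [{"v0", "v3"}]  # v0 and v3 share color c0

print(is_valid_3_coloring(good_edges, coloring))  # True
print(is_valid_3_coloring(bad_edges, coloring))   # False
```

Verification runs in time linear in the total size of the hyperedges, which is why a planning task can be NP-hard to solve yet cheap to score.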

#### 2.3.2 Task Descriptions

All tasks in HyperGVL are introduced briefly in this section. More details are provided in Appendix C (https://arxiv.org/html/2604.15648#A3).

Hypergraph understanding tasks are designed to evaluate the composition, topology, and fundamental heuristics of hypergraphs. These tasks are mainly categorized into six types as follows.

- **Vertex Counting (VC)**: Counting the number of vertices in a given hypergraph.
- **Hyperedge Counting (HEC)**: Counting the number of hyperedges in a given hypergraph.
- **Neighbors (Ne)**: Identifying direct neighbors of a specified vertex connected by hyperedges.
- **Degree-specified Vertex Counting (DVC)**: Counting vertices with a specific degree value in the hypergraph.
- **Order-specified Hyperedge Counting (OEC)**: Counting hyperedges with a specific order in the hypergraph.
- **Order-filtered Neighbors (ONe)**: Identifying neighbors of a vertex when only considering hyperedges with orders no smaller than a specified threshold.

The associated assessed capabilities, difficulty levels, and response types of these tasks are detailed at the top of Tab. 2 (https://arxiv.org/html/2604.15648#S2.T2), along with examples.
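To make the six understanding tasks concrete, the following minimal Python sketch computes a reference answer for each on a toy hypergraph stored as a list of vertex sets. It assumes the usual definitions: a vertex's degree is the number of hyperedges containing it, and two vertices are neighbors if they share at least one hyperedge. The function names and the toy data are illustrative assumptions; the exact task definitions are in the paper's Appendix C.

```python
# Toy hypergraph (hypothetical): each hyperedge is a set of vertex ids.
hyperedges = [{0, 1, 2}, {1, 3}, {2, 3, 4, 5}, {0, 5}]


def vertices(hyperedges):
    """All vertices that appear in at least one hyperedge."""
    return set().union(*hyperedges)


def vertex_count(hyperedges):            # VC
    return len(vertices(hyperedges))


def hyperedge_count(hyperedges):         # HEC
    return len(hyperedges)


def neighbors(hyperedges, v, min_order=0):
    """Ne when min_order is 0; ONe when min_order is the order threshold."""
    result = set()
    for e in hyperedges:
        if v in e and len(e) >= min_order:
            result |= e
    result.discard(v)
    return result


def degree(hyperedges, v):
    """Number of hyperedges containing v (assumed definition of degree)."""
    return sum(1 for e in hyperedges if v in e)


def degree_vertex_count(hyperedges, d):  # DVC
    return sum(1 for v in vertices(hyperedges) if degree(hyperedges, v) == d)


def order_edge_count(hyperedges, k):     # OEC
    return sum(1 for e in hyperedges if len(e) == k)


print(vertex_count(hyperedges))                        # 6
print(hyperedge_count(hyperedges))                     # 4
print(sorted(neighbors(hyperedges, 1)))                # [0, 2, 3]
print(sorted(neighbors(hyperedges, 1, min_order=3)))   # [0, 2]
print(degree_vertex_count(hyperedges, 2))              # 5
print(order_edge_count(hyperedges, 2))                 # 2
```

The Level-2 tasks (DVC, OEC, ONe) reuse the Level-1 primitives plus one heuristic (degree or order), which mirrors the "combining atomic capabilities" framing of the difficulty hierarchy.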

Hypergraph reasoning tasks are designed to tackle complex, multi-step inferential challenges within hypergraphs. Beyond understanding hypergraph structures and computing heuristics, these tasks require organizing atomic capabilities into a sophisticated iterative process to tackle complex hypergraph problems. These tasks are mainly classified into six types as follows.

- **Order-weighted Shortest Path (OSP)**: Computing the shortest path length between two vertices, where the hyperedge order serves as the distance.
- **Order-weighted Maximum Flow (OMF)**: Computing the maximum flow between two vertices, where the hyperedge order determines the edge capacity.
- **Isomorphism Recognition (ISM)**: Determining whether two hypergraphs are isomorphic.
- **Hypergraph 3-Coloring (3-CL)**: Providing a valid 3-coloring, where each hyperedge contains at least two distinct colors.
- **Strict Hypercycle (SHC)**: Searching for a strict hypercycle in the hypergraph, where adjacent hyperedges share exactly one vertex.
- **Hypergraph Hamilton Path (HHM)**: Planning a Hamiltonian path, which visits all vertices in the hypergraph exactly once, with a given vertex as the starting point and another as the ending point.

The associated assessed capabilities, difficulty levels, and response types of these tasks are detailed at the bottom of Tab. 2 (https://arxiv.org/html/2604.15648#S2.T2).
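As an illustration of a Level-3 task, the sketch below computes an OSP-style answer with Dijkstra's algorithm. It assumes one plausible reading of the task description: moving between two vertices that share a hyperedge costs that hyperedge's order (its number of vertices). This reading, the function name, and the toy hypergraph are assumptions; the precise definition is in the paper's Appendix C.

```python
import heapq


def order_weighted_shortest_path(hyperedges, source, target):
    """Shortest path length where traversing a hyperedge costs its order.

    Returns None when the target is unreachable from the source.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for e in hyperedges:
            if u not in e:
                continue
            nd = d + len(e)  # cost of one hop through e is the order of e
            for w in e:
                if w != u and nd < dist.get(w, float("inf")):
                    dist[w] = nd
                    heapq.heappush(heap, (nd, w))
    return None


# Toy hypergraph (hypothetical).
hyperedges = [{0, 1, 2}, {2, 3}, {0, 3, 4, 5}, {5, 6}]

print(order_weighted_shortest_path(hyperedges, 0, 6))  # 6: order 4, then order 2
print(order_weighted_shortest_path(hyperedges, 1, 3))  # 5: order 3, then order 2
print(order_weighted_shortest_path(hyperedges, 0, 7))  # None: vertex 7 is absent
```

Because large hyperedges are expensive under this weighting, the cheapest route is not always the one with the fewest hops, so a model has to combine adjacency perception with order computation at every step.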

### 2.4 Hypergraph Representations

Figure 2: The 7 textual hypergraph representations and 5 visual hypergraph representations in HyperGVL.

Hypergraph representations are crucial for evaluating the capabilities of LVLMs within hypergraphs, as distinct representations introduce unique perceptual biases Wei et al. (2024 (https://arxiv.org/html/2604.15648#bib.bib76)); Feng et al. (2025b (https://arxiv.org/html/2604.15648#bib.bib5)). Unlike LLMs that rely solely on text, LVLMs benefit from synergistic perception of both visual and textual information. Therefore, testing LVLMs on HyperGVL should not only include
