HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning
Summary
HyperGVL introduces the first benchmark for evaluating Large Vision-Language Models on hypergraph understanding and reasoning, featuring 84,000 QA samples across 12 tasks and real-world applications. The paper also proposes WiseHyGR, a generalizable router that enhances LVLM performance through adaptive hypergraph representations.
# HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning

Source: https://arxiv.org/html/2604.15648

Yanbin Wei¹,²\*, Chun Kang⁴\*, Siwei Li⁴, Haoxuan Che³, Yang Chen¹, Hua Liu¹, Jian Liu⁴, Zhuang Liu⁴, Can Ouyang⁴, Fei Xing¹, Lei Sha⁴†, Rui Liu³†, Yu Zhang¹†, James Kwok²

¹Southern University of Science and Technology, ²Hong Kong University of Science and Technology, ³Huawei Research, ⁴Beihang University

\*Equal contribution. †Corresponding author.

###### Abstract

Large Vision-Language Models (LVLMs) consistently require new evaluation arenas to push their expanding boundaries, yet their capabilities in hypergraph understanding remain unexplored. In the real world, hypergraphs have significant practical applications in areas such as life sciences and social networks. Recent advancements in LVLMs have shown promise in understanding complex topologies, yet no benchmark assesses the capabilities of LVLMs on hypergraphs, leaving the boundaries of their abilities unclear. To fill this gap, we introduce HyperGVL, the first benchmark to evaluate the proficiency of LVLMs in hypergraph understanding and reasoning. HyperGVL provides a comprehensive assessment of 12 state-of-the-art LVLMs across 84,000 vision-language question-answering (QA) samples spanning 12 tasks, ranging from basic component counting to complex NP-hard problem reasoning. The involved hypergraphs contain multiscale synthetic structures and real-world citation and protein networks. Moreover, we examine the effects of 12 textual and visual hypergraph representations and introduce a generalizable router, WiseHyGR, that improves LVLMs in hypergraph understanding by learning adaptive representations. We believe this work is a step forward in connecting hypergraphs with LVLMs.

## 1 Introduction

Graphs serve as a fundamental data structure for modeling relationships between abstract concepts or tangible objects in the real world. Among graph subcategories, hypergraphs are significant because their hyperedges can effectively model high-order correlations among three or more entities. Hypergraph applications are widespread in the real world. For example, in social networks, hypergraphs can naturally represent community interactions, where a hyperedge can connect an arbitrary number of vertices, reflecting complex, high-order relationships within communities Contisciani et al. (2022 (https://arxiv.org/html/2604.15648#bib.bib3)). Similarly, in life sciences, hypergraphs excel at modeling interactions such as catalytic triplets in protein structures Ravetz et al. (2019 (https://arxiv.org/html/2604.15648#bib.bib12)), where ordinary graphs focusing only on pairwise relations fall short. Recent work also shows that hypergraphs can model complex relationships among pieces of information in retrieval-augmented generation more accurately Feng et al. (2025a (https://arxiv.org/html/2604.15648#bib.bib13)).

Figure 1: Overview of the HyperGVL benchmark.

Benchmark | Evaluation Capability | Graph Source | Structure Perception | High-Order | #Tasks | #Samples
---|---|---|---|---|---|---
GVLQA | Reasoning | Synthetic | Visual & Textual | ✗ | 7 | 157,896
VisionGraph | Understanding & Reasoning | Synthetic | Visual | ✗ | 10 | 3,000
VGCure | Understanding & Reasoning | Synthetic & Real-world | Visual | ✗ | 2 | 223,646
LLM4Hypergraph | Mainly Understanding | Synthetic & Real-world | Textual | ✓ | 15 | 21,500
HyperGVL (Ours) | Understanding & Reasoning | Synthetic & Real-world | Visual & Textual | ✓ | 12 | 84,000

Table 1: Comparisons between HyperGVL and related graph analysis benchmarks for LVLMs/LLMs. #Tasks: number of response types; #Samples: total number of test samples; ✓/✗: support/not support high-order relationships.

On the other hand, large vision-language models (LVLMs) exhibit outstanding performance across a wide range of downstream tasks with human-like understanding and reasoning abilities Li et al. (2025b (https://arxiv.org/html/2604.15648#bib.bib6)). This has triggered growing interest in employing LVLMs for graph learning problems, as the vision modality offers a natural way to comprehend structural information and facilitate graph-related reasoning, with GVLQA Wei et al. (2024 (https://arxiv.org/html/2604.15648#bib.bib76)), VisionGraph Li et al. (2024 (https://arxiv.org/html/2604.15648#bib.bib77)), and VGCure Zhu et al. (2025b (https://arxiv.org/html/2604.15648#bib.bib11)) among the first batch of such methods. However, they limit their scope to ordinary graphs and do not explore the potential of LVLMs for high-order relationships in hypergraphs.

To address this gap, we introduce HyperGVL (Fig. 1 (https://arxiv.org/html/2604.15648#S1.F1)), the first comprehensive benchmark dataset designed to evaluate the capabilities of LVLMs on hypergraphs.
HyperGVL consists of 84,000 vision-language question-answering (QA) pairs, covering both multiscale synthetic hypergraphs and real-world hypergraphs from citation and protein networks. The evaluation spans 12 tasks of varying difficulty levels, from fundamental hypergraph component understanding to challenging NP-hard problem reasoning. Additionally, HyperGVL integrates seven textual and five visual representations of hypergraphs, offering insights into task preferences and model capability boundaries across these diverse representations.

Based on the performance of LVLMs under different hypergraph representations, we train WiseHyGR, a generalizable router that can select appropriate hypergraph representations for given hypergraph problems. Experimental results validate that WiseHyGR generally enhances the hypergraph understanding and reasoning abilities of LVLMs, and the benefits generalize to downstream out-of-domain tasks.

The contributions of this work are threefold.

- We construct the HyperGVL benchmark, a new evaluation arena to assess LVLMs' capabilities in hypergraph understanding and reasoning.
- We extensively evaluate 12 leading LVLMs on HyperGVL and expose their actual capabilities. The dedicated evaluations from various perspectives contribute 14 valuable observations.
- Based on model performance across hypergraph representations, we train WiseHyGR, a generalizable router to enhance LVLMs' performance on hypergraph understanding and reasoning tasks.

## 2 The HyperGVL Benchmark

In this section, we introduce the HyperGVL Benchmark, designed to delineate the ability boundaries of LVLMs in handling higher-order structures of hypergraphs.

### 2.1 Benchmark Uniqueness

Table 1 (https://arxiv.org/html/2604.15648#S1.T1) underscores the distinct role of HyperGVL within the landscape of existing benchmarks.
Unlike conventional graph-related LVLM benchmarks, HyperGVL delves into higher-order relationships inherent in hypergraphs, surpassing the limitations of ordinary graphs. Additionally, in contrast to text-only hypergraph benchmarks for large language models (LLMs), HyperGVL integrates visual perception effects unique to LVLMs and enhances task diversity and complexity through the inclusion of intricate reasoning challenges. More related works are introduced in Appendix A (https://arxiv.org/html/2604.15648#A1).

### 2.2 Hypergraph Organization

The hypergraphs in HyperGVL are organized with several considerations in mind. First, the HyperGVL benchmark comprises an equal proportion of synthetic and real-world hypergraphs. Synthetic hypergraphs are generated using both random and regular-structured methods, providing a controlled environment for testing. In contrast, real-world hypergraphs are from anonymized citation and protein networks, offering practical insights into real-world applications. To ensure balanced complexity for comprehensive evaluation, we employed the scale partition protocol from Feng et al. (2025b (https://arxiv.org/html/2604.15648#bib.bib5)), and organized hypergraphs by vertex count into three scale groups: small, medium, and large, with a distribution ratio of 1:2:1. This categorization facilitates the assessment of model performance across varying levels of complexity. Detailed descriptions of the processes used to obtain these hypergraphs are provided in Appendix B (https://arxiv.org/html/2604.15648#A2).

### 2.3 Benchmark Tasks

In this section, we introduce the tasks in HyperGVL and the considerations behind their design.

#### 2.3.1 Design Principle

The tasks in HyperGVL are designed around two core dimensions: assessed capability and response type. For assessed capabilities, tasks are divided into two main categories: understanding and reasoning.
Understanding tasks evaluate three key atomic abilities: (1) basic element capture, which involves recognizing vertices and hyperedges; (2) adjacency perception, which entails understanding adjacency relationships among vertices; and (3) heuristic computation, which includes computing heuristics such as vertex degree and hyperedge order (i.e., the number of vertices in a hyperedge). On the other hand, reasoning tasks assess model abilities in terms of (1) algorithms, which involve solving problems with definitive algorithms, and (2) planning, where problems are NP-hard and lack definitive algorithms, requiring models to actively plan and devise valid solutions.

Based on these assessed capabilities, all tasks are categorized into a four-level difficulty hierarchy: Level-1 (querying single atomic capability), Level-2 (combining compound atomic capabilities), Level-3 (polynomial-solvable algorithms), and Level-4 (NP-hard planning). This stratification aligns with task complexity Bylander (1994 (https://arxiv.org/html/2604.15648#bib.bib291)), and we aim to verify whether it is consistent with the actual capability spectrum of LVLMs.

For response types, tasks are categorized into four types: (1) counting, (2) computing, (3) decision, and (4) descriptive tasks. This taxonomy enables a comprehensive evaluation of LVLMs across diverse cognitive processes.

Unlike LLM4Hypergraph Feng et al. (2025b (https://arxiv.org/html/2604.15648#bib.bib5)), which presents relatively simple tasks for recent models (e.g., Gemini-3 Flash achieved over 90% zero-shot accuracy in 13 out of 15 tasks in our testing), the proposed benchmark introduces more challenging tasks that require reasoning beyond structural understanding. This design aligns with evolving model capabilities and leaves ample room for their future improvement. Overall, the task distribution in HyperGVL encompasses a wider range of difficulty and diversity, establishing a comprehensive evaluation framework for LVLMs.
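
The three atomic understanding abilities described above (element capture, adjacency perception, and heuristic computation) can be illustrated with a minimal Python sketch. It is not the benchmark's implementation: the dictionary-based hypergraph layout, the toy hypergraph `G`, and all function names are hypothetical, and vertex degree is assumed to mean the number of hyperedges containing the vertex (the usual convention, which this section does not spell out). Hyperedge order follows the definition given in the text.

```python
from typing import Dict, Set

# Assumed layout: hyperedge id -> set of member vertices (illustrative only).
Hypergraph = Dict[str, Set[str]]


def vertices(hg: Hypergraph) -> Set[str]:
    """Element capture: collect every vertex that appears in some hyperedge."""
    return set().union(*hg.values())


def vertex_degree(hg: Hypergraph, v: str) -> int:
    """Heuristic: number of hyperedges that contain v (assumed definition)."""
    return sum(1 for members in hg.values() if v in members)


def hyperedge_order(hg: Hypergraph, e: str) -> int:
    """Heuristic: number of vertices in hyperedge e."""
    return len(hg[e])


def neighbors(hg: Hypergraph, v: str, min_order: int = 0) -> Set[str]:
    """Adjacency: vertices sharing a hyperedge of order >= min_order with v."""
    result: Set[str] = set()
    for members in hg.values():
        if v in members and len(members) >= min_order:
            result |= members
    result.discard(v)  # a vertex is not its own neighbor
    return result


# Hypothetical toy hypergraph with three hyperedges over six vertices.
G: Hypergraph = {
    "e0": {"v0", "v1", "v2"},
    "e1": {"v2", "v3"},
    "e2": {"v0", "v3", "v4", "v5"},
}

print(len(vertices(G)), len(G))            # vertex and hyperedge counts
print(vertex_degree(G, "v0"))              # degree of v0
print(hyperedge_order(G, "e2"))            # order of e2
print(sorted(neighbors(G, "v2")))          # all neighbors of v2
print(sorted(neighbors(G, "v2", 3)))       # neighbors via order >= 3 only
```

Level-1 tasks query one of these functions on its own, while Level-2 tasks combine them, for example by counting the vertices whose degree equals a given value or by filtering neighbors with `min_order`.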

Task | Capability | Response Type | Difficulty | Example | #Sample
---|---|---|---|---|---
**Understanding Tasks** | | | | | 42,000
VC | Element | Counting | Level-1 | Q: How many vertices are in the hypergraph G? A: 15. | 7,000
HEC | Element | Counting | Level-1 | Q: How many hyperedges are in the hypergraph G? A: 23. | 7,000
Ne | Adjacency | Descriptive | Level-1 | Q: What are the direct neighbors of vertex v4 in hypergraph G? A: v0, v3, v5. | 7,000
DVC | Heuristic & Element | Counting | Level-2 | Q: How many vertices have degree 3 in hypergraph G? A: 7. | 7,000
OEC | Heuristic & Element | Counting | Level-2 | Q: How many hyperedges have order 4 in hypergraph G? A: 8. | 7,000
ONe | Heuristic & Adjacency | Descriptive | Level-2 | Q: What are the neighbors of vertex v5 when only considering hyperedges with order >= 2 in hypergraph G? A: v0, v3. | 7,000
**Reasoning Tasks** | | | | | 42,000
OSP | Algorithm | Computing | Level-3 | Q: What is the order-weighted shortest path length from v4 to v8? A: 8. | 7,000
OMF | Algorithm | Computing | Level-3 | Q: What is the order-weighted maximum flow from v4 to v8? A: 19. | 7,000
ISM | Algorithm | Decision | Level-3 | Q: Are these two hypergraphs isomorphic? A: Yes. | 7,000
3-CL | Planning | Descriptive | Level-4 (NP-hard) | Q: Please provide a 3-coloring strategy such that each hyperedge contains nodes with at least 2 different colors. A: Coloring: [v0:c0,v1:c1,v2:c2,v3:c0,v4:c1,v5:c2]. | 7,000
SHC | Planning | Descriptive | Level-4 (NP-hard) | Q: Please identify a strict hypercycle in the hypergraph G. A: Cycle: [e0,e3,e2,e4,e1]. | 7,000
HHM | Planning | Descriptive | Level-4 (NP-hard) | Q: Please provide a valid Hamiltonian path from v1 to v0. (Hamiltonian path = path visiting all vertices exactly once). A: Path: [e0,e1,e2,e3]. | 7,000

Table 2: Properties, statistics, and examples of all hypergraph understanding and reasoning tasks in HyperGVL.

#### 2.3.2 Task Descriptions

All tasks in HyperGVL are introduced briefly in this section.
More details are provided in Appendix C (https://arxiv.org/html/2604.15648#A3).

Hypergraph understanding tasks are designed to evaluate the composition, topology, and fundamental heuristics of hypergraphs. These tasks are mainly categorized into six types as follows.

- **Vertex Counting (VC)**: Counting the number of vertices in a given hypergraph.
- **Hyperedge Counting (HEC)**: Counting the number of hyperedges in a given hypergraph.
- **Neighbors (Ne)**: Identifying direct neighbors of a specified vertex connected by hyperedges.
- **Degree-specified Vertex Counting (DVC)**: Counting vertices with a specific degree value in the hypergraph.
- **Order-specified Hyperedge Counting (OEC)**: Counting hyperedges with a specific order in the hypergraph.
- **Order-filtered Neighbors (ONe)**: Identifying neighbors of a vertex when only considering hyperedges with orders no smaller than a specified threshold.

The associated assessed capabilities, difficulty levels, and response types of these tasks are detailed at the top of Tab. 2 (https://arxiv.org/html/2604.15648#S2.T2), along with examples.

Hypergraph reasoning tasks are designed to tackle complex, multi-step inferential challenges within hypergraphs. Beyond understanding hypergraph structures and computing heuristics, these tasks require organizing atomic capabilities into a sophisticated iterative process to tackle complex hypergraph problems. These tasks are mainly classified into six types as follows.

- **Order-weighted Shortest Path (OSP)**: Computing the shortest path length between two vertices, where the hyperedge order serves as the distance.
- **Order-weighted Maximum Flow (OMF)**: Computing the maximum flow between two vertices, where the hyperedge order determines the edge capacity.
- **Isomorphism Recognition (ISM)**: Determining whether two hypergraphs are isomorphic.
- **Hypergraph 3-Coloring (3-CL)**: Providing a valid 3-coloring, where each hyperedge contains at least two distinct colors.
- **Strict Hypercycle (SHC)**: Searching for a strict hypercycle in the hypergraph, where adjacent hyperedges share exactly one vertex.
- **Hypergraph Hamilton Path (HHM)**: Planning a Hamiltonian path, which visits all vertices in the hypergraph exactly once, with a given vertex as the starting point and another as the ending point.

The associated assessed capabilities, difficulty levels, and response types of these tasks are detailed at the bottom of Tab. 2 (https://arxiv.org/html/2604.15648#S2.T2).

### 2.4 Hypergraph Representations

Figure 2: The 7 textual hypergraph representations and 5 visual hypergraph representations in HyperGVL.

Hypergraph representations are crucial for evaluating the capabilities of LVLMs within hypergraphs, as distinct representations introduce unique perceptual biases Wei et al. (2024 (https://arxiv.org/html/2604.15648#bib.bib76)); Feng et al. (2025b (https://arxiv.org/html/2604.15648#bib.bib5)). Unlike LLMs that rely solely on text, LVLMs benefit from synergistic perception of both visual and textual information. Therefore, testing LVLMs on HyperGVL should not only include