Two-Stage Learned Decomposition for Scalable Routing on Multigraphs

arXiv cs.LG Papers

Summary

This paper proposes Node-Edge Policy Factorization (NEPF) to address scalability issues in solving Vehicle Routing Problems on multigraphs. It combines pre-encoding edge aggregation with a hierarchical reinforcement learning method to achieve state-of-the-art solution quality with faster training and inference.

arXiv:2605.05389v1 Announce Type: new Abstract: Most neural methods for Vehicle Routing Problems (VRPs) are limited to Euclidean settings or simple graphs. In this work, we instead consider multigraphs, where parallel edges represent distinct travel options with varying trade-offs (e.g., distance vs time). Few methods are designed for such formulations and those that do exist face major scalability issues. We mitigate these scalability issues via a Node-Edge Policy Factorization (NEPF) approach, which splits the routing policy into a node permutation stage and an edge selection stage. To enable the decomposition, we introduce a pre-encoding edge aggregation scheme and a non-autoregressive architecture for the edge stage, as well as a hierarchical reinforcement learning method to train the stages jointly. Our experiments across six VRP variants demonstrate that NEPF matches or outperforms the state-of-the-art in terms of solution quality, while being significantly faster in training and inference.
Original Article
View Cached Full Text

Cached at: 05/08/26, 07:14 AM

# Two-Stage Learned Decomposition for Scalable Routing on Multigraphs
Source: [https://arxiv.org/abs/2605.05389](https://arxiv.org/abs/2605.05389)
[View PDF](https://arxiv.org/pdf/2605.05389)

> Abstract:Most neural methods for Vehicle Routing Problems \(VRPs\) are limited to Euclidean settings or simple graphs\. In this work, we instead consider multigraphs, where parallel edges represent distinct travel options with varying trade\-offs \(e\.g\., distance vs time\)\. Few methods are designed for such formulations and those that do exist face major scalability issues\. We mitigate these scalability issues via a Node\-Edge Policy Factorization \(NEPF\) approach, which splits the routing policy into a node permutation stage and an edge selection stage\. To enable the decomposition, we introduce a pre\-encoding edge aggregation scheme and a non\-autoregressive architecture for the edge stage, as well as a hierarchical reinforcement learning method to train the stages jointly\. Our experiments across six VRP variants demonstrate that NEPF matches or outperforms the state\-of\-the\-art in terms of solution quality, while being significantly faster in training and inference\.

## Submission history

From: Filip Rydin \[[view email](https://arxiv.org/show-email/4d4fdb48/2605.05389)\] **\[v1\]**Wed, 6 May 2026 19:23:09 UTC \(68 KB\)

Similar Articles

Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO

Hugging Face Daily Papers

This paper identifies surrogate hacking and temporal uncertainty as failure modes in multi-timescale RL, and proposes a Target Decoupling architecture that removes routing from the actor, using the critic for auxiliary representation learning. The method eliminates policy collapse on the LunarLander-v2 benchmark and stably surpasses the 'Environment Solved' threshold without hyperparameter hacking.

GraphPO: Graph-based Policy Optimization for Reasoning Models

arXiv cs.CL

GraphPO is a novel graph-based reinforcement learning framework that represents rollouts as a directed acyclic graph, merging semantically equivalent reasoning paths to reduce redundant exploration and improve credit assignment for large reasoning models.

COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space

arXiv cs.AI

COAgents is a cooperative multi-agent framework for solving Vehicle Routing Problems that models search as a graph, using specialized agents for node selection, move selection, and jumps to escape local minima. It achieves state-of-the-art results on CVRP and VRPTW benchmarks, reducing the gap to best-known solutions by up to 44% compared to prior learning-based methods.