Two-Stage Learned Decomposition for Scalable Routing on Multigraphs
Summary
This paper proposes Node-Edge Policy Factorization (NEPF), which tackles the scalability issues of neural solvers for Vehicle Routing Problems on multigraphs by splitting the routing policy into a node permutation stage and an edge selection stage. A pre-encoding edge aggregation scheme, a non-autoregressive edge-stage architecture, and a hierarchical reinforcement learning procedure train the two stages jointly, matching or outperforming the state of the art in solution quality with significantly faster training and inference.
# Two-Stage Learned Decomposition for Scalable Routing on Multigraphs

Source: [https://arxiv.org/abs/2605.05389](https://arxiv.org/abs/2605.05389) · [View PDF](https://arxiv.org/pdf/2605.05389)

> Abstract: Most neural methods for Vehicle Routing Problems (VRPs) are limited to Euclidean settings or simple graphs. In this work, we instead consider multigraphs, where parallel edges represent distinct travel options with varying trade-offs (e.g., distance vs time). Few methods are designed for such formulations and those that do exist face major scalability issues. We mitigate these scalability issues via a Node-Edge Policy Factorization (NEPF) approach, which splits the routing policy into a node permutation stage and an edge selection stage. To enable the decomposition, we introduce a pre-encoding edge aggregation scheme and a non-autoregressive architecture for the edge stage, as well as a hierarchical reinforcement learning method to train the stages jointly. Our experiments across six VRP variants demonstrate that NEPF matches or outperforms the state-of-the-art in terms of solution quality, while being significantly faster in training and inference.

## Submission history

From: Filip Rydin [[view email](https://arxiv.org/show-email/4d4fdb48/2605.05389)]

**[v1]** Wed, 6 May 2026 19:23:09 UTC (68 KB)
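To make the factorization concrete, the sketch below shows one way the two stages described in the abstract could fit together in PyTorch. It is an illustration built only from what the abstract states, not the paper's actual architecture: the module sizes, the mean-pooled edge aggregation, the greedy node decoder, and every name (`NEPFSketch`, `edge_scorer`, etc.) are assumptions.

```python
import torch
import torch.nn as nn


class NEPFSketch(nn.Module):
    """Minimal two-stage sketch: (1) aggregate parallel edges before encoding,
    (2) pick a node permutation autoregressively, (3) pick one parallel edge
    per tour leg non-autoregressively. All design details are assumptions."""

    def __init__(self, edge_feats=2, hidden=128, heads=8):
        super().__init__()
        self.edge_proj = nn.Linear(edge_feats, hidden)  # per-edge features, e.g. (distance, time)
        self.node_in = nn.Linear(2, hidden)             # 2-D node coordinates
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, heads, batch_first=True), num_layers=3
        )
        self.edge_scorer = nn.Linear(2 * hidden + edge_feats, 1)

    def aggregate_edges(self, ef, mask):
        """Pre-encoding edge aggregation: pool the K parallel edges of each
        node pair into one summary, so the node stage sees a simple graph."""
        # ef: (B, N, N, K, F) edge features; mask: (B, N, N, K) edge existence
        h = self.edge_proj(ef) * mask.unsqueeze(-1)
        return h.sum(3) / mask.sum(3, keepdim=True).clamp(min=1)  # (B, N, N, H)

    def node_stage(self, coords, agg):
        """Node permutation stage (greedy rollout for brevity; a learned
        policy would sample actions and be trained with RL)."""
        B, N, _ = coords.shape
        emb = self.encoder(self.node_in(coords) + agg.mean(2))  # fold in edge context
        visited = torch.zeros(B, N, dtype=torch.bool)
        cur = torch.zeros(B, dtype=torch.long)  # start every tour at node 0
        tour = [cur]
        for _ in range(N - 1):
            visited[torch.arange(B), cur] = True
            scores = torch.einsum("bd,bnd->bn", emb[torch.arange(B), cur], emb)
            cur = scores.masked_fill(visited, float("-inf")).argmax(-1)
            tour.append(cur)
        return torch.stack(tour, 1), emb  # (B, N) node order

    def edge_stage(self, tour, emb, ef, mask):
        """Non-autoregressive edge stage: score every parallel edge on every
        consecutive leg of the tour in a single parallel pass."""
        B = tour.shape[0]
        b = torch.arange(B).unsqueeze(1)
        src, dst = tour[:, :-1], tour[:, 1:]
        pair = torch.cat([emb[b, src], emb[b, dst]], -1)  # (B, N-1, 2H)
        legs = ef[b, src, dst]                            # (B, N-1, K, F)
        k = legs.shape[2]
        logits = self.edge_scorer(
            torch.cat([pair.unsqueeze(2).expand(-1, -1, k, -1), legs], -1)
        ).squeeze(-1)                                     # (B, N-1, K)
        logits = logits.masked_fill(mask[b, src, dst] == 0, float("-inf"))
        return logits.argmax(-1)  # chosen parallel-edge index per leg

    def forward(self, coords, ef, mask):
        agg = self.aggregate_edges(ef, mask)
        tour, emb = self.node_stage(coords, agg)
        return tour, self.edge_stage(tour, emb, ef, mask)
```

The split mirrors the claimed speed advantage: only the node stage is sequential, while the edge stage resolves all legs at once, so adding parallel travel options does not lengthen the decoding loop. Training the two stages jointly would use the hierarchical reinforcement learning scheme the abstract names, which is beyond this sketch.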
Similar Articles
Near-Future Policy Optimization
Proposes Near-Future Policy Optimization (NPO), a mixed-policy RL method that accelerates convergence by learning from a later checkpoint of the same training run, boosting Qwen3-VL-8B-Instruct performance from 57.88 to 62.84.
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
RAD-2 presents a unified generator-discriminator framework for autonomous driving that combines diffusion-based trajectory generation with RL-optimized reranking, achieving a 56% reduction in collision rate compared to diffusion-based planners. The approach introduces techniques such as Temporally Consistent Group Relative Policy Optimization and a BEV-Warp simulation environment for efficient large-scale training.
Evolved Policy Gradients
OpenAI introduces Evolved Policy Gradients (EPG), a meta-learning approach that learns loss functions through evolution rather than learning policies directly, enabling RL agents to generalize better across tasks by leveraging prior experience similar to how humans transfer skills.
Expert Routing for Communication-Efficient MoE via Finite Expert Banks
The paper introduces an information-theoretic framework for communication-efficient expert routing in sparse mixture-of-experts models, treating the gate as a stochastic channel and deriving practical mutual information estimators to analyze accuracy-rate tradeoffs over finite expert banks.
Reinforcement Learning via Value Gradient Flow
Value Gradient Flow (VGF) presents a scalable approach to behavior-regularized reinforcement learning by formulating it as an optimal transport problem solved through discrete gradient flow, achieving state-of-the-art results on offline RL and LLM RL benchmarks. The method eliminates explicit policy parameterization while enabling adaptive test-time scaling by controlling transport budget.