A Unified Knowledge Embedded Reinforcement Learning-based Framework for Generalized Capacitated Vehicle Routing Problems
Summary
This paper proposes a unified knowledge-embedded reinforcement learning framework for generalized capacitated vehicle routing problems, combining route-first cluster-second heuristics with dynamic programming to achieve superior solution quality and strong generalization across diverse variants.
View Cached Full Text
Cached at: 05/15/26, 06:24 AM
# A Unified Knowledge Embedded Reinforcement Learning-based Framework for Generalized Capacitated Vehicle Routing Problems
Source: [https://arxiv.org/html/2605.14416](https://arxiv.org/html/2605.14416)
Xiangchen Wu1Liang Wang1Hao Hu1&Xianping Tao11Nanjing University \{wangwen, ixc\}@smail\.nju\.edu\.cn, \{wl, myou, txp\}@nju\.edu\.cn
###### Abstract
The Capacitated Vehicle Routing Problem \(CVRP\) is a fundamental NP\-hard problem with broad applications in logistics and transportation\. Real\-world CVRPs often involve diverse objectives and complex constraints, such as time windows or backhaul requirements, motivating the development of a unified solution framework\. Recent reinforcement learning \(RL\) approaches have shown promise in combinatorial optimization, yet they rely on end\-to\-end learning and lack explicit problem\-solving knowledge, limiting solution quality\. In this paper, we propose a knowledge\-embedded framework inspired by the Route\-First Cluster\-Second heuristics\. It incorporates knowledge at two levels: \(1\) decomposing CVRPs into the route\-first and cluster\-second subproblems, and \(2\) leveraging dynamic programming to solve the second subproblem, whose results guide the RL\-based constructive solver to solve the first problem\. To mitigate partial observability caused by problem decomposition, we introduce a unified history\-enhanced context processing module\. Extensive experiments show that this framework achieves superior solution quality compared with state\-of\-the\-art learning\-based methods, with a smaller gap to classical heuristics, demonstrating strong generalization across diverse CVRP variants\.
## 1Introduction
The Capacitated Vehicle Routing Problem \(CVRP\) is a fundamental research topic in combinatorial optimization and serves as a canonical model for numerous real\-world applications, including urban logisticsJeong and Song \([2025](https://arxiv.org/html/2605.14416#bib.bib8)\), last\-mile deliveryTiwari and Sharma \([2023](https://arxiv.org/html/2605.14416#bib.bib9)\), postal distributionSbaiet al\.\([2022](https://arxiv.org/html/2605.14416#bib.bib10)\), and transportation planningLeng and Li \([2021](https://arxiv.org/html/2605.14416#bib.bib11)\), where goods must be delivered from a central depot to a set of geographically distributed customers\. In practical applications, CVRP instances often incorporate additional constraints to reflect real\-world operational requirements\. For example, time window constraintsVidalet al\.\([2014](https://arxiv.org/html/2605.14416#bib.bib12)\)enforce that customers must be served within specified intervals, while open route constraintsLiet al\.\([2007](https://arxiv.org/html/2605.14416#bib.bib40)\)allow vehicles to finish their service without returning to the depot\. Given the NP\-hard natureLenstra and Kan \([1981](https://arxiv.org/html/2605.14416#bib.bib13)\)and the increased complexity introduced by these variants, developing a unified and efficient approximate solver has become an important and long\-standing research objective\.
With the development of learning\-based methods in the field of combinatorial optimization, a number of foundation models for VRPs have recently been proposedBertoet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib41)\); Drakulicet al\.\([2025](https://arxiv.org/html/2605.14416#bib.bib42)\); Zhouet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib43)\)\. These methods introduce advanced architectural designs inspired by Large Language Models \(LLMs\), incorporating components such as multi\-head mixed\-attention blocks, transformer\-based encoders, and mixture\-of\-experts \(MoE\) structures to construct a unified network architecture capable of addressing diverse VRP variants\. From a learning perspective, these models typically adopt POMO\-styleKwonet al\.\([2020](https://arxiv.org/html/2605.14416#bib.bib44)\)reinforcement learning or imitation learning paradigms to train neural networks that directly generate high\-quality routing solutions in an end\-to\-end manner\. In this solving framework, the incorporation of domain knowledge is manifested in two ways: \(1\) the design of network architectures guided by problem\-specific insights; and \(2\) solution samples generated by classical solvers\. Although promising results have been achieved, relying solely on model\- or data\-driven domain knowledge overlooks the importance of algorithmic design in solving classical problems, thereby limiting further improvements in solution quality\.
To directly leverage domain knowledge in solving CVRPs, we exploit the decomposable nature of the problem\. Problem decomposition has a long\-standing history in the study of CVRPs, particularly within the field of evolutionary computationZhanget al\.\([2022](https://arxiv.org/html/2605.14416#bib.bib14)\)\. Among these approaches, the Route\-first Cluster\-second \(RFCS\) heuristicPrinset al\.\([2014](https://arxiv.org/html/2605.14416#bib.bib46)\)represents one of the most influential strategies, and numerous variants have been developed following this principle\. Specifically, the RFCS heuristic first constructs a global route covering all customer nodes \(excluding the depot\) using heuristic methods such as TSP solvers\. It then applies a polynomial\-time optimal algorithm to divide the global route into feasible vehicle tours, thereby producing a valid solution to original CVRP variants\. Manually designed algorithms can naturally incorporate various operational constraints\. More importantly, once the global route is generated, the subsequent process typically runs in polynomial time, ensuring high computational efficiency\. For example,Vidal \([2015](https://arxiv.org/html/2605.14416#bib.bib45)\)give aO\(n\)O\(n\)algorithm given a global route for minsum CVRP, andWanget al\.\([2025](https://arxiv.org/html/2605.14416#bib.bib15)\)gives a binary search\-based splitting algorithm for minmax VRPs\. Nevertheless, such methods generally rely on TSP\-based heuristics to construct the global route, and the optimal TSP tour does not necessarily yield an optimal solution to the original CVRP variants after a cluster\-second operation\. In this paper, we introduce reinforcement learning to automatically learn the route first part, avoiding the limitations of handcrafted TSP\-based heuristics\.
However, the problem decomposition mechanism introduces a partial observability challenge: the context of a partial route cannot be fully assessed until the entire route is generated and the subsequent splitting stage is completed\. To address this issue, we introduce a history\-enhanced module as the core of our context processor\. This module use the LSTMHochreiter and Schmidhuber \([1997](https://arxiv.org/html/2605.14416#bib.bib3)\)architecture to maintain a temporal hidden state that continuously aggregates historical information throughout the route construction process, allowing the model to infer the latent information of the problem even when the final evaluation signal is postponed\. Beyond mitigating partial observability, the history\-enhanced context representation also provides a unified neural architecture across different VRP variants\. By encoding the evolving decision context into a learned hidden state, the framework eliminates the need for manually designed problem\-specific contexts\. As a result, the same network architecture can generalize seamlessly across multiple problem types—such as open\-route, backhaul\-request, and time\-window CVRPs—without additional adaptation\.
The main contributions can be summarized as follows\.
- •We propose a unifiedknowledge\-embedded RFCS frameworkthat integrates RL with the route\-first cluster\-second \(RFCS\) heuristic to address a wide range of CVRP variants\. This framework combines general learning\-based methods and algorithmic design principles, enabling flexible adaptation to various types of problem\-specific constraints\.
- •We introduce ahistory\-enhanced context processingmodule that preserves a unified neural network architecture across general CVRP variants\. This module simplifies the design of context features by exploiting a common representation provided by the route\-first component, enabling effective handling of complex constraints\.
- •We perform extensive experiments to evaluate the solution quality and transfer capabilities of the framework across diverse CVRP settings\. The results demonstrate that the proposed knowledge\-embedded RFCS framework consistently outperforms existing learning\-based methods and exhibits strong zero\-shot generalization to unseen problem variants\.
## 2Related Work
Since the introduction of neural networks into combinatorial optimization by the Pointer NetworkVinyalset al\.\([2015](https://arxiv.org/html/2605.14416#bib.bib16)\), a series of learning\-based combinatorial optimization solvers have subsequently emerged, collectively referred to as Neural Combinatorial Optimization \(NCO\)\. From the perspective of solution paradigms, learning\-based approaches can be broadly classified into three categoriesWuet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib17)\): \(1\) Learning to Construct \(L2C\), which focuses on directly generating solutions; \(2\) Learning to Improve \(L2I\), which aims to refine or enhance existing solutions; and \(3\) Learning to Predict \(L2P\), which leverages neural networks to support and augment traditional operations research methods\. In the L2C paradigm, the Attention Model \(AM\)Koolet al\.\([2018](https://arxiv.org/html/2605.14416#bib.bib18)\)proposed a Transformer\-based encoder\-decoder architecture to directly generate solutions\. Subsequently, POMOKwonet al\.\([2020](https://arxiv.org/html/2605.14416#bib.bib44)\)and Sym\-NCOKimet al\.\([2022a](https://arxiv.org/html/2605.14416#bib.bib19)\)introduced multi\-start and symmetry\-basedShiet al\.\([2025](https://arxiv.org/html/2605.14416#bib.bib1)\); Yuet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib2)\)training strategies, respectively, during the solution generation process, further enhancing solution quality\. More recent studies have focused on improving generalization capabilityGaoet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib24)\), addressing diverse objectives and constraintsSonet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib20)\); Zhenget al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib21)\); Biet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib22)\), and scaling up to larger problem instancesLuoet al\.\([2023](https://arxiv.org/html/2605.14416#bib.bib23)\); Kimet al\.\([2022b](https://arxiv.org/html/2605.14416#bib.bib25)\)\. This paradigm achieves a favorable balance between solution quality and inference efficiency, and it is also adopted in this work\. The L2I paradigm originated from learning local rewriting operationsChen and Tian \([2019](https://arxiv.org/html/2605.14416#bib.bib26)\), and was later extended to the learning of heuristic operators such as 2\-OPT and cross\-exchanged O Costaet al\.\([2020](https://arxiv.org/html/2605.14416#bib.bib27)\); Hottung and Tierney \([2020](https://arxiv.org/html/2605.14416#bib.bib28)\); Suiet al\.\([2021](https://arxiv.org/html/2605.14416#bib.bib29)\); Wuet al\.\([2021](https://arxiv.org/html/2605.14416#bib.bib30)\)\. In recent years, research in this direction has increasingly focused on large\-scale problem solvingChenget al\.\([2023](https://arxiv.org/html/2605.14416#bib.bib31)\); Liet al\.\([2025](https://arxiv.org/html/2605.14416#bib.bib32)\)\. Although this paradigm is more suitable for handling large\-scale problems, it is relatively difficult to employ a unified framework to address diverse types of constraints\. The L2P paradigm integrates traditional algorithmic approaches with learning\-based methods, such as combining learning with the LKH algorithmXinet al\.\([2021](https://arxiv.org/html/2605.14416#bib.bib33)\); Zhenget al\.\([2021](https://arxiv.org/html/2605.14416#bib.bib34)\), integrating learning with dynamic programmingKoolet al\.\([2022](https://arxiv.org/html/2605.14416#bib.bib35)\); Xuet al\.\([2020](https://arxiv.org/html/2605.14416#bib.bib36)\), and leveraging large language models \(LLMs\) and meta\-learning for automated algorithm designLiuet al\.\([2024b](https://arxiv.org/html/2605.14416#bib.bib37)\); Derneddeet al\.\([2025](https://arxiv.org/html/2605.14416#bib.bib38)\)\. Methods of this type, by incorporating principles of algorithm design, have the potential to improve solution quality\. In this work, we adopt a similar approach within the L2C framework, introducing knowledge as a reward signal, which facilitates the handling of general forms of constraints\.
## 3Preliminaries
In this section, we first introduce the general form CVRPs\. Then, we summarize the RL formulation for CVRPs, describing how solutions can be modeled as sequential actions and optimized under the RL paradigm\.
### 3\.1General Form CVRPs
The CVRPs can be defined on a complete graph𝒢=\(𝒱,ℰ\)\\mathcal\{G\}=\(\\mathcal\{V\},\\mathcal\{E\}\), where𝒱=\{v0,v1,⋯,vn\}\\mathcal\{V\}=\\\{v\_\{0\},v\_\{1\},\\cdots,v\_\{n\}\\\}is the set of nodes, withv0v\_\{0\}representing the depot and\{v1,⋯,vn\}\\\{v\_\{1\},\\cdots,v\_\{n\}\\\}representing the customer nodes\. Each edge\(i,j\)∈ℰ\(i,j\)\\in\\mathcal\{E\}is associated with a non\-negative costcijc\_\{ij\}\. As a common practice, each node is assigned a two\-dimensional coordinate, and the Euclidean distance is used as the cost\. Vehicles with identical capacityQQare stationed at the depot\. Each customer nodei∈\{1,⋯,n\}i\\in\\\{1,\\cdots,n\\\}has a nonzero demandqi≤Qq\_\{i\}\\leq Qthat must be satisfied by exactly one vehicle\. Each vehicle travels from the depot node, serves customers and returns to the depot\. The total demand of the served nodes should not exceed the capacity\.
FollowingBertoet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib41)\), we consider a generalized formulation of the CVRP that incorporates multiple additional constraints: \(1\) Open routes — vehicles are not required to return to the depot; \(2\) Backhaul demands — customer requests may involve both delivery and pickup operations; \(3\) Distance limits — the travel distance of each route must not exceed a predefined threshold; and \(4\) Time windows — each customer must be served within a specified time interval\. Each additional constraint can be independently activated or deactivated, resulting in sixteen problem variants, denoted as CVRP, OVRP, VRPB, VRPL, VRPTW, etc\.
We aim to minimize the total travel cost over all vehicle routes\. Let\{ℛk\}k=1K\\\{\\mathcal\{R\}\_\{k\}\\\}\_\{k=1\}^\{K\}denote a feasible routing plan, whereℛk\\mathcal\{R\}\_\{k\}is the sequence of nodes served by vehiclekk\. The cost of routeℛk\\mathcal\{R\}\_\{k\}is given by
Ck=∑\(i,j\)∈ℛkci,j,C\_\{k\}=\\sum\_\{\(i,j\)\\in\\mathcal\{R\}\_\{k\}\}c\_\{i,j\},\(1\)withci,jc\_\{i,j\}representing the travel cost between nodesiiandjj\. The min\-sum objective is then formulated as
min\{ℛk\}k=1K∑k=1KCk,\\min\_\{\\\{\\mathcal\{R\}\_\{k\}\\\}\_\{k=1\}^\{K\}\}\\;\\sum\_\{k=1\}^\{K\}C\_\{k\},\(2\)subject to the following feasibility conditions: 1\) each customer is visited exactly once; 2\) the total demand on each route does not exceed vehicle capacity; and 3\) all additional constraints are satisfied\. Here,KKdenotes the number of active vehicles in the solution\.
### 3\.2Reinforcement Learning for CVRPs
In the RL framework for CVRPs, a constructive solver sequentially builds a solution by selecting one customer node at a time\. Formally, at steptt, the solver selects an actionat∈Vta\_\{t\}\\in V\_\{t\}based on a contextctc\_\{t\}, which corresponds to visiting a customer node next, according to a stochastic policyat∼πθ\(at∣ct\),a\_\{t\}\\sim\\pi\_\{\\theta\}\(a\_\{t\}\\mid c\_\{t\}\),parameterized by neural network parametersθ\\theta\. After visiting nodeata\_\{t\}, the state is updated toct\+1c\_\{t\+1\}, and the set of available actions is reduced:Vt\+1=Vt∖\{at\}V\_\{t\+1\}=V\_\{t\}\\setminus\\\{a\_\{t\}\\\}\.
Once all customers are visited and feasible routes are constructed, the solver receives a rewardRRbased on the objective\. The RL training objective is to maximize the expected reward over all possible construction sequences:
J\(θ\)=𝔼τ∼πθ\[R\(τ\)\],J\(\\theta\)=\\mathbb\{E\}\_\{\\tau\\sim\\pi\_\{\\theta\}\}\\left\[R\(\\tau\)\\right\],\(3\)whereτ=\(a0,a1,⋯,aT\)\\tau=\(a\_\{0\},a\_\{1\},\\cdots,a\_\{T\}\)denotes a trajectory generated by the policy\. Gradient\-based optimization can be performed using policy gradient methods, e\.g\., REINFORCEWilliams \([1992](https://arxiv.org/html/2605.14416#bib.bib7)\), optionally enhanced with baselines to reduce variance:
∇θJ\(θ\)=𝔼τ∼πθ\[∑t=0T∇θlogπθ\(at∣ct\)\(R\(τ\)−b\)\],\\nabla\_\{\\theta\}J\(\\theta\)=\\mathbb\{E\}\_\{\\tau\\sim\\pi\_\{\\theta\}\}\\Big\[\\sum\_\{t=0\}^\{T\}\\nabla\_\{\\theta\}\\log\\pi\_\{\\theta\}\(a\_\{t\}\\mid c\_\{t\}\)\\,\(R\(\\tau\)\-b\)\\Big\],\(4\)
wherebbis a baseline value\.
## 4The proposed method
Figure 1:The Overall Framework of the Proposed Knowledge\-embedded RFCS method\.Unlike a fully end\-to\-end learning paradigm, we propose a knowledge\-embedded RL framework for solving generalized CVRPs\. The framework integrates problem\-solving principles with algorithmic design insights into the training process to improve solution quality\. It also simplifies transition modeling in the training environment, reduces the complexity of incorporating new constraints, and enables a unified network architecture applicable across multiple CVRP variants\.
### 4\.1The Overall Framework
The overall framework is illustrated in Figure[1](https://arxiv.org/html/2605.14416#S4.F1)\. We adopt aroute\-first, cluster\-secondheuristic structure\. The RL component focuses on the route\-first stage, where the objective is to generate a sequence that visits all customer nodes exactly once, resulting in a permutation of customer indices\. All complex constraints are handled by a dynamic programming\-based cluster\-second module\. Given any customer permutation, this module determines the optimal splitting of the sequence into feasible vehicle routes while satisfying additional constraints such as time windows and duration limits\.
Although the framework consists of two components, the reward signal is computed based on the solution produced by the cluster\-second module\. This reward is then used to optimize the route\-first policy through RL, thereby aligning the optimization objective of the RL model with that of the original problem formulation\. In this way, the framework eliminates the dependence on heuristic TSP solvers and achieves a direct, learning\-based formulation of the routing problem\.
### 4\.2The Unified Environment
The framework allows problems with heterogeneous constraints or objectives to be reduced to an identical learning structure, thus we adopt a unified formulation for constructing the RL environments under different constraint settings, as described below\.
States:At steptt, the environment maintains a state or contextsts\_\{t\}that summarizes all relevant information needed for decision\-making:
st=\{𝐗,𝐪,𝐓𝐖,l,n0,n1,vt,𝐦,𝐟\},s\_\{t\}=\\\{\\mathbf\{X\},\\mathbf\{q\},\\mathbf\{TW\},l,n\_\{0\},n\_\{1\},v\_\{t\},\\mathbf\{m\},\\mathbf\{f\}\\\},\(5\)
where:
- •𝐗∈ℝn×2\\mathbf\{X\}\\in\\mathbb\{R\}^\{n\\times 2\}denotes the coordinates of depot and customer nodes\. We usexi,yix\_\{i\},y\_\{i\}for theii\-th node;
- •𝐪∈ℝn×1\\mathbf\{q\}\\in\\mathbb\{R\}^\{n\\times 1\}represents the customer demands, including negative values for pickup nodes\. We useqiq\_\{i\}for theii\-th node;
- •𝐓𝐖∈ℝn×3\\mathbf\{TW\}\\in\\mathbb\{R\}^\{n\\times 3\}encodes time windows and service durations\. We useri,si,dir\_\{i\},s\_\{i\},d\_\{i\}to denote the start time, the service duration, and the deadline for nodeii\.
- •l∈ℝl\\in\\mathbb\{R\}represents distance limits;
- •n0,n1∈ℝn\_\{0\},n\_\{1\}\\in\\mathbb\{R\}track the cumulative delivered and picked\-up quantities for each route;
- •vt∈\{0,…,n\}v\_\{t\}\\in\\\{0,\\dots,n\\\}is the current node being visited;
- •𝐦∈\{0,1\}n\\mathbf\{m\}\\in\\\{0,1\\\}^\{n\}is the action mask, indicating feasible next visited nodes;
- •𝐟∈\{0,1\}4\\mathbf\{f\}\\in\\\{0,1\\\}^\{4\}encodes the variant features \(O, B, L, TW\) of the current instance, where each element represents a binary flag indicating whether the corresponding constraint is active\.
When certain attributes are absent, default values are used as placeholders to maintain a unified data structure\.
Actions:The action at stepttis selecting the next nodeat∈\{1,…,n\+1\}a\_\{t\}\\in\\\{1,\\dots,n\+1\\\}to visit\.
Transitions:Benefiting from the proposed design, state updates can be performed through a simple and efficient procedure\. The variablesn0,n1,vtn\_\{0\},n\_\{1\},v\_\{t\}are updated directly according to their definitions, while𝐦\\mathbf\{m\}is refreshed by setting the value of each visited node to zero\. The responsibility for checking action feasibility is fully delegated to the cluster\-second module\.
Rewards:We employ a dynamic programming\-based solver \(Section[4\.3](https://arxiv.org/html/2605.14416#S4.SS3)\) to optimally convert the solutions to the original problem\. The reward is defined as the negative of the total cost computed by this solver\.
We hypothesize that this structural consistency contributes substantially to the effectiveness and generality of the proposed framework\.
### 4\.3Dynamic Programming Solver
To compute the optimal cost of the route\-first sequence, we develop a dynamic programming \(DP\)\-based approach capable of naturally handling diverse constraint types\. Let a given sequence of customer nodes \(excluding the depot\) be denoted asπ=\(a1,a2,…,an\)\\pi=\(a\_\{1\},a\_\{2\},\\dots,a\_\{n\}\), and letp\[i\]p\[i\]represent the minimum total cost to serve the firstiinodes\. The DP formulation recursively computes
p\[i\]=minj<i\{p\[j\]\+C¯\(j\+1,i\)\},p\[i\]=\\min\_\{j<i\}\\\{p\[j\]\+\\overline\{C\}\(j\+1,i\)\\\},\(6\)whereC¯\(j\+1,i\)\\overline\{C\}\(j\+1,i\)is the cost of serving nodes\(πj\+1,…,πi\)\(\\pi\_\{j\+1\},\\dots,\\pi\_\{i\}\)as a single vehicle route, subject to capacity, pickup\-and\-delivery precedence, distance, and time window constraints\. To naturally handle these constraints, a simple yet effective strategy is adopted: if the sequence\(πj\+1,…,πi\)\(\\pi\_\{j\+1\},\\dots,\\pi\_\{i\}\)cannot form a feasible single vehicle route,C¯\(j\+1,i\)\\overline\{C\}\(j\+1,i\)is defined as\+∞\+\\infty\. Algorithm[1](https://arxiv.org/html/2605.14416#alg1)details the procedure\. Specifically, lines 5\-6 manage the backhaul\-related variants, and all other constraints, including capacity, time windows, and duration limits, are handled through their respective conditional checks\. It is straightforward to extend this algorithm to incorporate additional constraints\. For example, adding support for the mixed\-backhaul constraintZhouet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib43)\)can be achieved by simply adjusting the condition in line 6\.
Algorithm 1Dynamic Programming\-based Solver0:Customer sequence
π=\(a1,a2,⋯,an\)\\pi=\(a\_\{1\},a\_\{2\},\\cdots,a\_\{n\}\), node attributes, variant flags
0:Minimum total route cost
p\[n\]p\[n\]
1:Initialize
p\[0\]←0p\[0\]\\leftarrow 0,
p\[i\]←\+∞p\[i\]\\leftarrow\+\\inftyfor all
i=1,…,ni=1,\\dots,n
2:for
t=0t=0to
n−1n\-1do
3:Initialize
lo←0l\_\{o\}\\leftarrow 0,
li←0l\_\{i\}\\leftarrow 0,
tnow←0t\_\{now\}\\leftarrow 0,cost
←0\\leftarrow 0
4:for
i=t\+1i=t\+1to
nndo
5:
lo←lo\+max\(qai,0\)l\_\{o\}\\leftarrow l\_\{o\}\+\\max\(q\_\{a\_\{i\}\},0\),
li←li\+max\(−qai,0\)l\_\{i\}\\leftarrow l\_\{i\}\+\\max\(\-q\_\{a\_\{i\}\},0\)
6:if
lo\>Ql\_\{o\}\>Qor
li\>Ql\_\{i\}\>Qorprecedence violatedthen
7:break
8:endif
9:
cost←cost\+cai−1,ai\\textit\{cost\}\\leftarrow\\textit\{cost\}\+c\_\{a\_\{i\-1\},a\_\{i\}\}
10:ifTW activethen
11:
tnow←max\(tnow,rai\)\+sait\_\{now\}\\leftarrow\\max\(t\_\{now\},r\_\{a\_\{i\}\}\)\+s\_\{a\_\{i\}\}
12:if
tnow\>dait\_\{now\}\>d\_\{a\_\{i\}\}then
13:break
14:endif
15:endif
16:ifL active and
cost\>l\\textit\{cost\}\>lthen
17:break
18:endif
19:ifOactivethen
20:
p\[i\]←min\(p\[i\],p\[t\]\+cost\)p\[i\]\\leftarrow\\min\(p\[i\],\\;p\[t\]\+\\textit\{cost\}\)
21:else
22:
p\[i\]←min\(p\[i\],p\[t\]\+cost\+cai,0\)p\[i\]\\leftarrow\\min\(p\[i\],\\;p\[t\]\+\\textit\{cost\}\+c\_\{a\_\{i\},0\}\)
23:endif
24:endfor
25:endfor
26:return
p\[n\]p\[n\]
### 4\.4Model Architecture
The route\-first module is parameterized by a neural network consisting of an transformer\-based encoder\-decoder architecture augmented with history\-enhanced context representation\. The architecture is designed to handle various CVRP variants within a unified network architecture\.
Encoder:The encoder maps the raw node features into a latent embedding space\. Let the node attributes for theii\-th customer be𝐱i=\[xi,yi,qi,ri,si,di\]∈ℝ6\\mathbf\{x\}\_\{i\}=\[x\_\{i\},y\_\{i\},q\_\{i\},r\_\{i\},s\_\{i\},d\_\{i\}\]\\in\\mathbb\{R\}^\{6\}\. For the depot node, the attribute is represented as𝐱0=\[x0,y0,O,B,L,TW,l\]∈ℝ7\\mathbf\{x\}\_\{0\}=\[x\_\{0\},y\_\{0\},\\mathrm\{O\},\\mathrm\{B\},\\mathrm\{L\},\\mathrm\{TW\},l\]\\in\\mathbb\{R\}^\{7\}\.
The node features are projected via learnable linear layers:
𝐡i\(0\)=\{MLPcustomer\(𝐱i\),i≠0MLPdepot\(𝐱0\),i=0\.\\mathbf\{h\}\_\{i\}^\{\(0\)\}=\\begin\{cases\}\\text\{MLP\}\_\{\\text\{customer\}\}\(\\mathbf\{x\}\_\{i\}\),&i\\neq 0\\\\ \\text\{MLP\}\_\{\\text\{depot\}\}\(\\mathbf\{x\}\_\{0\}\),&i=0\\end\{cases\}\.
The sequence of initial embeddings𝐇\(0\)=\[𝐡0\(0\),…,𝐡n\(0\)\]∈ℝ\(n\+1\)×d\\mathbf\{H\}^\{\(0\)\}=\[\\mathbf\{h\}\_\{0\}^\{\(0\)\},\\dots,\\mathbf\{h\}\_\{n\}^\{\(0\)\}\]\\in\\mathbb\{R\}^\{\(n\+1\)\\times d\}is then processed by a stack of multiple Routefinder TransformerBertoet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib41)\)blocks\. Each block consists of:
1. 1\.Multi\-head self\-attention: 𝐇¯\(i\)=𝐇\(i−1\)\+MHA\(RMSNorm\(𝐇\(i−1\)\)\)\\mathbf\{\\overline\{H\}\}^\{\(i\)\}=\\mathbf\{H\}^\{\(i\-1\)\}\+\\text\{MHA\}\(\\text\{RMSNorm\}\(\\mathbf\{H\}^\{\(i\-1\)\}\)\),
2. 2\.Gated feedforward network: 𝐇\(i\)=𝐇¯\(i\)\+GatedMLP\(RMSNorm\(𝐇¯\(i\)\)\)\\mathbf\{H\}^\{\(i\)\}=\\mathbf\{\\overline\{H\}\}^\{\(i\)\}\+\\text\{GatedMLP\}\(\\text\{RMSNorm\}\(\\mathbf\{\\overline\{H\}\}^\{\(i\)\}\)\)\.
A final RMS normalization is applied to produce node embeddings𝐇\(N\)∈ℝ\(n\+1\)×d\\mathbf\{H\}^\{\(N\)\}\\in\\mathbb\{R\}^\{\(n\+1\)\\times d\}\(Hereafter,\(N\)\(N\)is omitted for notational simplicity\), and a graph\-level embedding𝐠\\mathbf\{g\}is computed by mean pooling of𝐇\\mathbf\{H\}\.
Decoder:At each decoding steptt, the policy selects the next nodeata\_\{t\}based on a context embedding𝐜t∈ℝd\\mathbf\{c\}\_\{t\}\\in\\mathbb\{R\}^\{d\}computed from:
𝐜t=LSTM\(\[𝐠,𝐡0,𝐡current,𝐝t\]\),\\mathbf\{c\}\_\{t\}=\\text\{LSTM\}\\big\(\[\\mathbf\{g\},\\mathbf\{h\}\_\{0\},\\mathbf\{h\}\_\{\\text\{current\}\},\\mathbf\{d\}\_\{t\}\]\\big\),where𝐡0\\mathbf\{h\}\_\{0\}and𝐡current\\mathbf\{h\}\_\{\\text\{current\}\}are embeddings of the depot and current nodes, and𝐝t=Linear\(\[n0,n1\]\)\\mathbf\{d\}\_\{t\}=\\text\{Linear\}\(\[n\_\{0\},n\_\{1\}\]\)encodes the cumulative delivery/pickup loads\. This history\-enhanced context enables the agent to maintain a hidden state capturing historical information, mitigating partial observability induced by the knowledge\-embedded RFCS framework\.
A multi\-head attention mechanism is applied between𝐜t\\mathbf\{c\}\_\{t\}and the node embeddings𝐇\\mathbf\{H\}:
𝐲t=MHA\(𝐜t,𝐇,𝐇,mask=𝐦t\),\\mathbf\{y\}\_\{t\}=\\text\{MHA\}\(\\mathbf\{c\}\_\{t\},\\mathbf\{H\},\\mathbf\{H\},\\text\{mask\}=\\mathbf\{m\}\_\{t\}\),where𝐦t\\mathbf\{m\}\_\{t\}is the action mask\. The resulting attention output is projected back into the embedding space, producing a compatibility score for each candidate node:
πθ\(at∣st\)∝exp\(α⋅tanh\(𝐲t⊤𝐇\)\),\\pi\_\{\\theta\}\(a\_\{t\}\\mid s\_\{t\}\)\\propto\\exp\(\\alpha\\cdot\\tanh\(\\mathbf\{y\}\_\{t\}^\{\\top\}\\mathbf\{H\}\)\),with infeasible actions masked by−∞\-\\infty\. Here,α=10\\alpha=10is a temperature scaling factor controlling exploration\.
This architecture ensures that the same network can process all CVRP variants without variant\-specific modifications\. The decoder outputs a probability distribution over feasible next nodes, from which the action is sampled or greedily selected\. The network is trained via REINFORCEWilliams \([1992](https://arxiv.org/html/2605.14416#bib.bib7)\)accroding to equation \([4](https://arxiv.org/html/2605.14416#S3.E4)\)\.
## 5Experiments
We empirically validate the effectiveness of the proposed knowledge\-embedded framework in this section\. Our experimental design examines three key aspects: solution quality, generalization ability, and ablation analysis—corresponding to the following research questions:
1. RQ1Does the proposed knowledge\-enhanced framework improve solution quality?
2. RQ2How well does the proposed framework generalize across different problem instances?
3. RQ3How does the presence or absence of historical information in the network architecture and the learning process influence the framework’s performance?
### 5\.1Experiment Settings
Following the data generation process inBertoet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib41)\), depot and customer coordinates are uniformly sampled within the unit square\. For training instances with 50 and 100 customers, the vehicle capacitiesQQare set to 40 and 50, respectively, while customer demands are uniformly sampled from\{1,2,⋯,9\}\\\{1,2,\\cdots,9\\\}\. In the backhaul scenario, 20% of the customers are randomly assigned negative demands to simulate pickup requests\. For the time\-window constraints, the service duration of each customer is uniformly sampled from\[0\.15,0\.18\]\[0\.15,0\.18\]\. The length of the time window is drawn uniformly from\[0\.18,0\.2\]\[0\.18,0\.2\], and both the starting and ending times are generated by applying a random offset uniformly sampled from the feasible range\. The distance limit is uniformly drawn from\[2Dmax,3\]\[2D\_\{max\},3\], whereDmaxD\_\{max\}denotes the maximum Euclidean distance between the depot and the farthest customer node\.
During training, all problem attributes are randomly generated on the fly, and the constraint flags are uniformly sampled across different variants\. For each training batch, we fix one particular constraint configuration so that all instances in the batch share the same problem setting\. For each problem setting, the model is trained for 300 epochs, with 1,280,000 samples per epoch and a batch size of 512\. To stabilize training and improve final convergence, the learning rate is progressively decreased toward the end of training\. The framework demonstrates better performance without using POMO\-style multi\-startKwonet al\.\([2020](https://arxiv.org/html/2605.14416#bib.bib44)\)\. During training, we simply perform eight stochastic policy rollouts per instance and take their average cost as the baseline\.
We evaluate our framework using the open\-source dataset fromBertoet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib41)\)\. We do not use multi\-start sampling; instead, for each instance, we perform 8\-fold data augmentation and generate as many trajectories as there are nodes, selecting the best among them\. This approach ensures consistent comparison data and produces an identical number of trajectories for all methods\. The code can be found at[https://github\.com/wenwenla/ijcai26\-cvrp\-solver](https://github.com/wenwenla/ijcai26-cvrp-solver)\.
Table 1:Evaluation Results for RQ1\. The results are averaged over 1000 instances per variant\. Lower values indicate better results\. Bold numbers denote the best results among learning\-based methods, while italic numbers represent the best\-known results\.
### 5\.2Baselines
We select the following methods as baselines for comparison, which are described in detail below\. These solvers are open\-source, allowing reproducible comparisons\. We select two classical optimization algorithms, PyVRPWoudaet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib5)\)and OR\-ToolsFurnon and Perron \([2024](https://arxiv.org/html/2605.14416#bib.bib4)\), as non\-learning baselines for comparison\. We also compare with three recent learning\-based VRP solvers: \(1\) MTPOMOLiuet al\.\([2024a](https://arxiv.org/html/2605.14416#bib.bib6)\), a multi\-task VRP solver based on POMOKwonet al\.\([2020](https://arxiv.org/html/2605.14416#bib.bib44)\); \(2\) MVMoEZhouet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib43)\), a multi\-task VRP solver leveraging mixture\-of\-experts networks for unified handling of different VRP variants; and \(3\) RouteFinderBertoet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib41)\), a transformer\-based foundation model designed for general VRPs\.
### 5\.3Results for RQ1
Table[1](https://arxiv.org/html/2605.14416#S5.T1)presents the average costs over 1,000 unseen instancesBertoet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib41)\)for sixteen CVRP variants, including the atomic problems—CVRP, OVRP, VRPB, VRPL, and VRPTW—as well as their various combinations\. The knowledge\-embedded framework outperforms all learning\-based methods across all variants, except for the VRPL variant, where it performs slightly below RouteFinder\. The average optimality gap are shown in Figure[2](https://arxiv.org/html/2605.14416#S5.F2)\. RFCS achieves a marked improvement by reducing the average optimality gap from 2\.60% to 1\.82% on average, highlighting the stronger solution quality achieved by our framework\. Since the proposed framework shares the same neural network forward\-propagation cost as other learning\-based methods, and the dynamic programming\-based splitting algorithm has a time complexity ofO\(n2\)O\(n^\{2\}\), its computational overhead is negligible compared with neural network inference\. Therefore, the overall solving time of our framework is comparable to that of the baseline learning\-based methods\. While classic approaches often yield better solution quality, they usually demand tens of minutes to complete a computation, in contrast to learning\-based methods, which require only a few seconds\. For more discussions regarding runtime and scalability, please refer to the appendix\.
Figure 2:The Average Optimality Gap\.
### 5\.4Results for RQ2
We evaluate the zero\-shot transfer capability of RFCS across different capacity distributions encountered during training\. During training, for scenarios withn=50n=50, the vehicle capacity was set to 40\. We directly evaluate the model obtained in RQ1 on instances with varying vehicle capacities to examine its performance changes\. The target distribution of vehicle capacities ranges from 50 to 90, and the experimental results of zero\-shot transfer evaluation are presented in Table[2](https://arxiv.org/html/2605.14416#S5.T2)\.
Table 2:Evaluation of the Knowledge\-embedded Framework Zero\-Shot Transfer to Unseen Vehicle Capacity Distributions\.The experimental results indicate that when the vehicle capacity distributions differ between training and evaluation, the knowledge\-embedded framework exhibits superior zero\-shot transfer performance compared to existing learning\-based methods\.
A commonly studied CVRP variant introduces mixed linehaul and backhaul demandsZhouet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib43)\), allowing vehicles to alternate between delivery and pickup operations instead of strictly finishing deliveries before pickups\. In this case, the key constraint is that the vehicle load must not violate the capacity limit at any time\. As described in section[4\.3](https://arxiv.org/html/2605.14416#S4.SS3), we only need to modify the Algorithm[1](https://arxiv.org/html/2605.14416#alg1)to adapt the solver to this scenario\. The zero\-shot transfer results to mixed demands are shown in Table[3](https://arxiv.org/html/2605.14416#S5.T3)\. The results show that our method outperforms the baselines when transferring from VRPB and OVRPB to the mixed constraint scenario, achieves comparable performance for VRPBL, and exhibits a notable performance drop for VRPBTW\. While the framework does not surpass the baselines across every test case, the findings highlight its inherent flexibility and adaptability\.
Table 3:Evaluation of the Knowledge\-embedded Framework Zero\-Shot Transfer to Unseen Mixed Backhaul Constraints\.
### 5\.5Results for RQ3
We conduct an ablation study to assess the contribution of the history\-enhanced context in the knowledge\-embedded framework\. The results are summarized in Table[4](https://arxiv.org/html/2605.14416#S5.T4)\. For the variant without historical information, we adopt an identical training setup, except that the LSTM module in the decoder responsible for handling historical information is replaced with a fully connected layer\. The results of this ablation study highlight the necessity of incorporating historical information\. While the LSTM is not the most advanced architecture for modeling history, it was chosen for its ability to incrementally maintain past information\. Further improvements in processing historical information by introduce more advanced network represent an interesting direction for future research\.
Table 4:Effect of Historical Information on Solution Quality: Ablation Results for 50\-node CVRP Variants\. “Gap” denotes the optimality gap relative to the best\-known solutions\.Another interesting ablation concerns whether incorporating the dynamic programming\-based reward during training is necessary, and whether the route\-first component actually learns to cooperate with the dynamic programming to produce higher\-quality solutions\. To address this question, we examine the results obtained by directly applying the cluster\-second dynamic programming module to optimize the solutions for CVRP generated by RouteFinderBertoet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib41)\)\. The corresponding results are shown in Table[5](https://arxiv.org/html/2605.14416#S5.T5)\. From the table, we observe that applying the cluster\-second optimization improves the solutions generated by the learning\-based methods\. However, the improvement is marginal, and the resulting performance remains inferior to that of RFCS trained with the dynamic programming\-based reward\. This not only highlights the optimality improvement provided by the DP procedure, but also demonstrates that incorporating the DP\-based reward during training is necessary\.
Table 5:Ablation Study on the Necessity of Dynamic Programming\-Based Rewards in RFCS Training\.
## 6Conclusion
In this paper, we propose a unified framework for a broad range of CVRP variants\. Inspired by the route\-first cluster\-second \(RFCS\) heuristic, the framework integrates RL with dynamic programming\-based algorithmic design knowledge to solve CVRPs with various additional constraints\. Specifically, the route\-first component leverages RL to generate customer visitation sequences\. The cluster\-second component employs dynamic programming to optimally convert a given sequence into a feasible solution for the original problem, and provides the resulting cost as a reward to guide RL\. The proposed framework not only achieves strong performance in terms of solution quality and transferability but also offers a flexible and unified approach to handling diverse constraints\. In future work, we plan to incorporate more advanced network architectures with this framework to achieve even better results\. We also plan to extend this approach to other domains beyond routing problems\.
## Acknowledgments
This work was supported by the NSFC Major Research Plan Key Program under Grant No\. 92582204, and the Collaborative Innovation Center of Novel Software Technology and Industrialization\.
## Appendix AScalability and Inference Time
To further assess scalability, we conducted additional zero\-shot experiments by training the model on50\-nodeinstances and directly transferring it to500\-node and 1000\-nodeCVRP without fine\-tuning\. The results are reported in Table[6](https://arxiv.org/html/2605.14416#A1.T6)\. For POMO\-based methods, the number of starts typically scales with the number of nodes\. However, due to GPU memory limitations on large\-scale instances, we evaluate these methods using11and100100starts\. As shown in Table[6](https://arxiv.org/html/2605.14416#A1.T6), our method achieves consistently better solution quality on both 500\-node and 1000\-node instances\. These results suggest that the proposed framework maintains stable performance when scaling to larger problem sizes and demonstrates better zero\-shot generalization ability\. The time is for solving 100 instances\.
Table 6:Zero\-shot generalization performance on large\-scale CVRP instances\.sns\_\{n\}denotes the number of POMO starts\. All methods are evaluated with greedy rollout\. m1: RouteFinder, m2: MTPOMO, m3: MvMoE\. Costs are averaged over 100 random instances\.To access inference time, from atheoreticalstandpoint, the proposed frameworkdoes not increase the asymptotic time complexitycompared to existing end\-to\-end learning\-based VRP solvers\. Specifically, the DP\-based clustering stage has complexityO\(n2\)O\(n^\{2\}\), which is of the same asymptotic order as the attention computation in transformer\-based methods\. From apracticalperspective, empirical results on large\-scale instances are provided in Table[6](https://arxiv.org/html/2605.14416#A1.T6)\. The results are obtained on a Linux server with an RTX5090 and 8470Q with 208 cores\. The DP procedure is implemented in C\+\+ and integrated into the training and evaluation pipeline via pybind11\.
## Appendix BMotivation of Learning
We apply LKH3 to solve the same 1,000 TSP instances withn=50n=50\. For each instance, the resulting tour is circularly permuted 50 times so each node serves as the start, and the tours are fed into the DP\-based clustering module\. Results:11\.312vs10\.484\. These results suggest that even high\-quality TSP tours do not necessarily lead to strong CVRP solutions after clustering, highlighting the importance of learning a task\-aware sequence construction policy\.
## Appendix CAdditional Experiment Details
Since RouteFinderBertoet al\.\([2024](https://arxiv.org/html/2605.14416#bib.bib41)\)released the benchmark dataset and pretrained models, we directly report the results of non\-learning solvers \(PyVRP and OR\-Tools\) from the original paper for fair comparison, where the time limits are10s and 20sper instance for n=50 and n=100, respectively\. Regarding training efficiency, training time depends on hardware and implementation and may vary significantly across works\. For example, for n=50, MvMoE reports training times ranging from 38–90 hours for different architectures, while RouteFinder reports 9–24 hours using A100 GPUs\. In comparison, our model is trained on two RTX 5090 GPUs and requires16\.17 hoursfor n=50, which is comparable to existing learning\-based approaches\.
## Appendix DDiscussion about Historical Information
The history\-modeling component is modular and can be readily replaced by otherRNNarchitectures\. We chose LSTM as it maintains hidden statesincrementally, offering better efficiency and memory scalability than Transformers that attend to the full sequence at each step\. Exploring advanced alternatives is a promising future direction\. For VRPBTW, the history ablation \(18\.464→18\.48918\.464\\to 18\.489\) also confirms the module’s effectiveness under mixed constraints\.
## References
- F\. Berto, C\. Hua, N\. G\. Zepeda, A\. Hottung, N\. Wouda, L\. Lan, K\. Tierney, and J\. Park \(2024\)RouteFinder: towards foundation models for vehicle routing problems\.InICML 2024 Workshop on Foundation Models in the Wild,Cited by:[Appendix C](https://arxiv.org/html/2605.14416#A3.p1.1),[§1](https://arxiv.org/html/2605.14416#S1.p2.1),[§3\.1](https://arxiv.org/html/2605.14416#S3.SS1.p2.1),[§4\.4](https://arxiv.org/html/2605.14416#S4.SS4.p4.1),[§5\.1](https://arxiv.org/html/2605.14416#S5.SS1.p1.6),[§5\.1](https://arxiv.org/html/2605.14416#S5.SS1.p3.1),[§5\.2](https://arxiv.org/html/2605.14416#S5.SS2.p1.1),[§5\.3](https://arxiv.org/html/2605.14416#S5.SS3.p1.1),[§5\.5](https://arxiv.org/html/2605.14416#S5.SS5.p2.1)\.
- J\. Bi, Y\. Ma, J\. Zhou, W\. Song, Z\. Cao, Y\. Wu, and J\. Zhang \(2024\)Learning to handle complex constraints for vehicle routing problems\.Advances in Neural Information Processing Systems37,pp\. 93479–93509\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- X\. Chen and Y\. Tian \(2019\)Learning to perform local rewriting for combinatorial optimization\.Advances in neural information processing systems32\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- H\. Cheng, H\. Zheng, Y\. Cong, W\. Jiang, and S\. Pu \(2023\)Select and optimize: learning to solve large\-scale tsp instances\.InInternational conference on artificial intelligence and statistics,pp\. 1219–1231\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- P\. R\. d O Costa, J\. Rhuggenaath, Y\. Zhang, and A\. Akcay \(2020\)Learning 2\-opt heuristics for the traveling salesman problem via deep reinforcement learning\.InAsian conference on machine learning,pp\. 465–480\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- T\. Dernedde, D\. Thyssens, S\. Dittrich, M\. Stubbemann, and L\. Schmidt\-Thieme \(2025\)Moco: a learnable meta optimizer for combinatorial optimization\.InPacific\-Asia Conference on Knowledge Discovery and Data Mining,pp\. 236–248\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- D\. Drakulic, S\. Michel, and J\. Andreoli \(2025\)GOAL: a generalist combinatorial optimization agent learner\.InProceedings of the International Conference on Learning Representations \(ICLR\) 2025,Note:PosterCited by:[§1](https://arxiv.org/html/2605.14416#S1.p2.1)\.
- V\. Furnon and L\. Perron \(2024\)OR\-tools routing libraryGoogle\.External Links:[Link](https://developers.google.com/optimization/routing/)Cited by:[§5\.2](https://arxiv.org/html/2605.14416#S5.SS2.p1.1)\.
- C\. Gao, H\. Shang, K\. Xue, D\. Li, and C\. Qian \(2024\)Towards generalizable neural solvers for vehicle routing problems via ensemble with transferrable local policy\.InProceedings of the Thirty\-Third International Joint Conference on Artificial Intelligence,pp\. 6914–6922\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- S\. Hochreiter and J\. Schmidhuber \(1997\)Long short\-term memory\.Neural computation9\(8\),pp\. 1735–1780\.Cited by:[§1](https://arxiv.org/html/2605.14416#S1.p4.1)\.
- A\. Hottung and K\. Tierney \(2020\)Neural large neighborhood search for the capacitated vehicle routing problem\.InECAI 2020,pp\. 443–450\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- H\. Y\. Jeong and B\. D\. Song \(2025\)Optimizing urban logistics: vehicle routing problem with underground transportation\.IEEE Transactions on Intelligent Transportation Systems\.Cited by:[§1](https://arxiv.org/html/2605.14416#S1.p1.1)\.
- M\. Kim, J\. Park, and J\. Park \(2022a\)Sym\-nco: leveraging symmetricity for neural combinatorial optimization\.Advances in Neural Information Processing Systems35,pp\. 1936–1949\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- M\. Kim, J\. Son, H\. Kim, and J\. Park \(2022b\)Scale\-conditioned adaptation for large scale combinatorial optimization\.InNeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications,Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- W\. Kool, H\. van Hoof, J\. Gromicho, and M\. Welling \(2022\)Deep policy dynamic programming for vehicle routing problems\.InInternational conference on integration of constraint programming, artificial intelligence, and operations research,pp\. 190–213\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- W\. Kool, H\. van Hoof, and M\. Welling \(2018\)Attention, learn to solve routing problems\!\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- Y\. Kwon, J\. Choo, B\. Kim, I\. Yoon, Y\. Gwon, and S\. Min \(2020\)Pomo: policy optimization with multiple optima for reinforcement learning\.Advances in Neural Information Processing Systems33,pp\. 21188–21198\.Cited by:[§1](https://arxiv.org/html/2605.14416#S1.p2.1),[§2](https://arxiv.org/html/2605.14416#S2.p1.1),[§5\.1](https://arxiv.org/html/2605.14416#S5.SS1.p2.1),[§5\.2](https://arxiv.org/html/2605.14416#S5.SS2.p1.1)\.
- K\. Leng and S\. Li \(2021\)Distribution path optimization for intelligent logistics vehicles of urban rail transportation using vrp optimization model\.IEEE Transactions on Intelligent Transportation Systems23\(2\),pp\. 1661–1669\.Cited by:[§1](https://arxiv.org/html/2605.14416#S1.p1.1)\.
- J\. K\. Lenstra and A\. R\. Kan \(1981\)Complexity of vehicle routing and scheduling problems\.Networks11\(2\),pp\. 221–227\.Cited by:[§1](https://arxiv.org/html/2605.14416#S1.p1.1)\.
- F\. Li, B\. Golden, and E\. Wasil \(2007\)The open vehicle routing problem: algorithms, large\-scale test problems, and computational results\.Computers & operations research34\(10\),pp\. 2918–2930\.Cited by:[§1](https://arxiv.org/html/2605.14416#S1.p1.1)\.
- K\. Li, F\. Liu, Z\. Wang, and Q\. Zhang \(2025\)Destroy and repair using hyper\-graphs for routing\.InProceedings of the AAAI Conference on Artificial Intelligence,pp\. 18341–18349\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- F\. Liu, X\. Lin, Q\. Zhang, X\. Tong, and M\. Yuan \(2024a\)Multi\-task learning for routing problem with cross\-problem zero\-shot generalization\.International Conference on Knowledge Discovery and Data Mining \(KDD\)\.Cited by:[§5\.2](https://arxiv.org/html/2605.14416#S5.SS2.p1.1)\.
- F\. Liu, X\. Tong, M\. Yuan, X\. Lin, F\. Luo, Z\. Wang, Z\. Lu, and Q\. Zhang \(2024b\)Evolution of heuristics: towards efficient automatic algorithm design using large language model\.InProceedings of the 41st International Conference on Machine Learning,pp\. 32201–32223\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- F\. Luo, X\. Lin, F\. Liu, Q\. Zhang, and Z\. Wang \(2023\)Neural combinatorial optimization with heavy decoder: toward large scale generalization\.Advances in Neural Information Processing Systems36,pp\. 8845–8864\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- C\. Prins, P\. Lacomme, and C\. Prodhon \(2014\)Order\-first split\-second methods for vehicle routing problems: a review\.Transportation Research Part C\-emerging Technologies40,pp\. 179–200\.External Links:[Link](https://api.semanticscholar.org/CorpusID:110554881)Cited by:[§1](https://arxiv.org/html/2605.14416#S1.p3.1)\.
- I\. Sbai, S\. Krichen, and O\. Limam \(2022\)Two meta\-heuristics for solving the capacitated vehicle routing problem: the case of the tunisian post office\.Operational Research22\(1\),pp\. 507–549\.Cited by:[§1](https://arxiv.org/html/2605.14416#S1.p1.1)\.
- R\. Shi, X\. Yu, Y\. Wang, Y\. Tian, Z\. Liu, W\. Wu, X\. Zhang, and M\. M\. Veloso \(2025\)Symmetry\-informed marl: a decentralized and cooperative uav swarm control approach for communication coverage\.IEEE Transactions on Mobile Computing24\(01\),pp\. 1–18\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- J\. Son, M\. Kim, S\. Choi, H\. Kim, and J\. Park \(2024\)Equity\-transformer: solving np\-hard min\-max routing problems as sequential generation with equity context\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 20265–20273\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- J\. Sui, S\. Ding, R\. Liu, L\. Xu, and D\. Bu \(2021\)Learning 3\-opt heuristics for traveling salesman problem via deep reinforcement learning\.InAsian conference on machine learning,pp\. 1301–1316\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- K\. V\. Tiwari and S\. K\. Sharma \(2023\)An optimization model for vehicle routing problem in last\-mile delivery\.Expert Systems with Applications222,pp\. 119789\.Cited by:[§1](https://arxiv.org/html/2605.14416#S1.p1.1)\.
- T\. Vidal, T\. G\. Crainic, M\. Gendreau, and C\. Prins \(2014\)A unified solution framework for multi\-attribute vehicle routing problems\.European Journal of Operational Research234\(3\),pp\. 658–673\.Cited by:[§1](https://arxiv.org/html/2605.14416#S1.p1.1)\.
- T\. Vidal \(2015\)Technical note: split algorithm in o\(n\) for the capacitated vehicle routing problem\.Comput\. Oper\. Res\.69,pp\. 40–47\.External Links:[Link](https://api.semanticscholar.org/CorpusID:18070450)Cited by:[§1](https://arxiv.org/html/2605.14416#S1.p3.1)\.
- O\. Vinyals, M\. Fortunato, and N\. Jaitly \(2015\)Pointer networks\.Advances in neural information processing systems28\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- W\. Wang, X\. Wu, L\. Wang, H\. Hu, X\. Tao, and L\. Zhang \(2025\)Solving the min\-max multiple traveling salesmen problem via learning\-based path generation and optimal splitting\.arXiv preprint arXiv:2508\.17087\.Cited by:[§1](https://arxiv.org/html/2605.14416#S1.p3.1)\.
- R\. J\. Williams \(1992\)Simple statistical gradient\-following algorithms for connectionist reinforcement learning\.Mach\. Learn\.8\(3–4\),pp\. 229–256\.External Links:ISSN 0885\-6125,[Link](https://doi.org/10.1007/BF00992696),[Document](https://dx.doi.org/10.1007/BF00992696)Cited by:[§3\.2](https://arxiv.org/html/2605.14416#S3.SS2.p2.2),[§4\.4](https://arxiv.org/html/2605.14416#S4.SS4.p7.1)\.
- N\. A\. Wouda, L\. Lan, and W\. Kool \(2024\)PyVRP: a high\-performance VRP solver package\.INFORMS Journal on Computing36\(4\),pp\. 943–955\.External Links:[Document](https://dx.doi.org/10.1287/ijoc.2023.0055),[Link](https://doi.org/10.1287/ijoc.2023.0055)Cited by:[§5\.2](https://arxiv.org/html/2605.14416#S5.SS2.p1.1)\.
- X\. Wu, D\. Wang, L\. Wen, Y\. Xiao, C\. Wu, Y\. Wu, C\. Yu, D\. L\. Maskell, and Y\. Zhou \(2024\)Neural combinatorial optimization algorithms for solving vehicle routing problems: a comprehensive survey with perspectives\.arXiv preprint arXiv:2406\.00415\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- Y\. Wu, W\. Song, Z\. Cao, J\. Zhang, and A\. Lim \(2021\)Learning improvement heuristics for solving routing problems\.IEEE transactions on neural networks and learning systems33\(9\),pp\. 5057–5069\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- L\. Xin, W\. Song, Z\. Cao, and J\. Zhang \(2021\)Neurolkh: combining deep learning model with lin\-kernighan\-helsgaun heuristic for solving the traveling salesman problem\.Advances in Neural Information Processing Systems34,pp\. 7472–7483\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- S\. Xu, S\. S\. Panwar, M\. Kodialam, and T\. Lakshman \(2020\)Deep neural network approximated dynamic programming for combinatorial optimization\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.34,pp\. 1684–1691\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- X\. Yu, R\. Shi, P\. Feng, Y\. Tian, S\. Li, S\. Liao, and W\. Wu \(2024\)Leveraging partial symmetry for multi\-agent reinforcement learning\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 17583–17590\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- H\. Zhang, H\. Ge, J\. Yang, and Y\. Tong \(2022\)Review of vehicle routing problems: models, classification and solving algorithms\.Archives of Computational Methods in Engineering29\(1\),pp\. 195–221\.Cited by:[§1](https://arxiv.org/html/2605.14416#S1.p3.1)\.
- J\. Zheng, K\. He, J\. Zhou, Y\. Jin, and C\. Li \(2021\)Combining reinforcement learning with lin\-kernighan\-helsgaun algorithm for the traveling salesman problem\.InProceedings of the AAAI conference on artificial intelligence,Vol\.35,pp\. 12445–12452\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- Z\. Zheng, S\. Yao, Z\. Wang, T\. Xialiang, M\. Yuan, and K\. Tang \(2024\)DPN: decoupling partition and navigation for neural solvers of min\-max vehicle routing problems\.InInternational Conference on Machine Learning,pp\. 61559–61592\.Cited by:[§2](https://arxiv.org/html/2605.14416#S2.p1.1)\.
- J\. Zhou, Z\. Cao, Y\. Wu, W\. Song, Y\. Ma, J\. Zhang, and X\. Chi \(2024\)MVMoE: multi\-task vehicle routing solver with mixture\-of\-experts\.InInternational Conference on Machine Learning,pp\. 61804–61824\.Cited by:[§1](https://arxiv.org/html/2605.14416#S1.p2.1),[§4\.3](https://arxiv.org/html/2605.14416#S4.SS3.p1.8),[§5\.2](https://arxiv.org/html/2605.14416#S5.SS2.p1.1),[§5\.4](https://arxiv.org/html/2605.14416#S5.SS4.p3.1)\.Similar Articles
Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations
This paper introduces CaTR, a value-decomposed reinforcement learning framework for real-time multi-aircraft taxiway routing that uses hierarchical foresight traffic representation to balance safety and efficiency.
Two-Stage Learned Decomposition for Scalable Routing on Multigraphs
This paper proposes Node-Edge Policy Factorization (NEPF) to address scalability issues in solving Vehicle Routing Problems on multigraphs. It combines pre-encoding edge aggregation with a hierarchical reinforcement learning method to achieve state-of-the-art solution quality with faster training and inference.
COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space
COAgents is a cooperative multi-agent framework for solving Vehicle Routing Problems that models search as a graph, using specialized agents for node selection, move selection, and jumps to escape local minima. It achieves state-of-the-art results on CVRP and VRPTW benchmarks, reducing the gap to best-known solutions by up to 44% compared to prior learning-based methods.
Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty
This paper proposes a multi-agent reinforcement learning framework that co-trains an autonomous vehicle and pedestrians with personality-driven jaywalking behavior, achieving a 30% reduction in collisions compared to single-agent approaches and demonstrating more realistic interaction scenarios.
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
RAD-2 presents a unified generator-discriminator framework for autonomous driving that combines diffusion-based trajectory generation with RL-optimized reranking, achieving 56% collision rate reduction compared to diffusion-based planners. The approach introduces techniques like Temporally Consistent Group Relative Policy Optimization and BEV-Warp simulation environment for efficient large-scale training.