WeCon: An Efficient Weight-Conditioned Neural Solver for Multi-Objective Combinatorial Optimization Problems
Summary
Presents WeCon, a weight-conditioned neural solver for multi-objective combinatorial optimization problems that achieves comparable hypervolume to the state-of-the-art while reducing inference time by 40%.
# WeCon: An Efficient Weight-Conditioned Neural Solver for Multi-Objective Combinatorial Optimization Problems
Source: [https://arxiv.org/html/2605.22876](https://arxiv.org/html/2605.22876)
RESEARCH PAPER
Jinbiao Chen, Yang Li, Lijie Wen, Chunguo Wu, Yuanshu Li, Yubin Xiao, Chunyan Miao, You Zhou, Di Wang
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
Department of Industrial Systems Engineering and Management, National University of Singapore, Singapore 117576, Singapore
College of Software, Jilin University, Changchun 130012, China
School of Software, Tsinghua University, Beijing 100084, China
School of Computing and Information Systems, Singapore Management University, Singapore 178902, Singapore
Joint NTU\-UBC Research Centre of Excellence in Active Living for the Elderly, Nanyang Technological University, Singapore 639798, Singapore
zyou@jlu\.edu\.cn
###### Abstract
Existing neural solvers for Multi\-Objective Combinatorial Optimization Problems \(MOCOPs\) commonly adopt decomposition\-based strategies that scalarize an MOCOP into multiple subproblems associated with distinct weight vectors\. However, they either inject weights only once during decoding, limiting weight\-conditioned context modeling, or primarily during encoding, causing weight\-signal dilution during decoding\. Moreover, preference optimization methods rely on purely random sampling to construct solution pairs for training solvers, which often produces less informative pairs and thus leads to low training effectiveness\. To better address these limitations, we propose an efficient Weight\-Conditioned neural solver \(WeCon\)\. Specifically, we design an encoder layer with three attention blocks and our proposed Gated Residual Fusion \(GRF\) block to facilitate harmonious interaction between instance features and weights, thereby generating informative weight\-conditioned context\. We further introduce a plug\-and\-play Residual Fusion \(RF\) block in the decoder to alleviate weight\-signal dilution\. Finally, we propose Efficient Preference Optimization \(EPO\), which constructs high\-quality solutions, thereby generating more informative pairs to improve training effectiveness\. Experiments on four MOCOP variants across different problem scales and distribution patterns demonstrate that WeCon achieves HyperVolume \(HV\) values comparable to SOTA solver POCCO\-W, while reducing inference time by 40%\. Ablation studies validate the contributions of all designs\.
###### Keywords:
Neural Combinatorial Optimization, Multi\-Objective Problem, Preference Optimization, Attention Model
## 1 Introduction
Combinatorial Optimization Problems \(COPs\) are a core topic of mathematical optimization, aiming to find optimal solutions in the discrete search space, and have long attracted sustained research attention due to their broad real\-world relevance\[[31](https://arxiv.org/html/2605.22876#bib.bib53),[36](https://arxiv.org/html/2605.22876#bib.bib5),[52](https://arxiv.org/html/2605.22876#bib.bib4),[41](https://arxiv.org/html/2605.22876#bib.bib6)\]\. Conventional algorithms for solving COPs can be generally divided into three categories, namely exact, approximation, and heuristic methods\[[42](https://arxiv.org/html/2605.22876#bib.bib31),[1](https://arxiv.org/html/2605.22876#bib.bib52)\]\. However, most cannot derive insights from historical COP instances, leading to substantial computing overhead\[[16](https://arxiv.org/html/2605.22876#bib.bib18),[49](https://arxiv.org/html/2605.22876#bib.bib30),[46](https://arxiv.org/html/2605.22876#bib.bib3)\]\.
To solve COPs efficiently, recent studies have increasingly developed neural solvers that learn from extensive historical instances, enabling effective searches for optimal solutions\[[47](https://arxiv.org/html/2605.22876#bib.bib17),[24](https://arxiv.org/html/2605.22876#bib.bib33),[35](https://arxiv.org/html/2605.22876#bib.bib23),[21](https://arxiv.org/html/2605.22876#bib.bib14),[54](https://arxiv.org/html/2605.22876#bib.bib37),[20](https://arxiv.org/html/2605.22876#bib.bib35),[45](https://arxiv.org/html/2605.22876#bib.bib10)\]\. Despite this progress, existing studies have largely focused on Single\-Objective COPs \(SOCOPs\), leaving Multi\-Objective COPs \(MOCOPs\) relatively underexplored\. However, many real\-world decisions must balance multiple considerations \(e\.g\., cost and convenience\)\. Therefore, developing neural solvers tailored to MOCOPs is of paramount importance\[[19](https://arxiv.org/html/2605.22876#bib.bib38),[5](https://arxiv.org/html/2605.22876#bib.bib60),[42](https://arxiv.org/html/2605.22876#bib.bib31)\]\.
Figure 1: Comparative illustration of our proposed WeCon and existing NCO models for solving MOCOPs\.
To tackle the challenge posed by MOCOPs, certain neural solvers\[[44](https://arxiv.org/html/2605.22876#bib.bib25),[51](https://arxiv.org/html/2605.22876#bib.bib40)\] adopted decomposition\-based methods that decompose an MOCOP into a set of subproblems associated with distinct weight vectors\. Pioneering methods, termed multi\-model solvers, train or fine\-tune a separate model for each subproblem\[[4](https://arxiv.org/html/2605.22876#bib.bib1),[23](https://arxiv.org/html/2605.22876#bib.bib65)\]\. However, they may be impractical due to substantial training or fine\-tuning overhead and limited generalization to weights beyond those seen during training\[[37](https://arxiv.org/html/2605.22876#bib.bib34)\]\. Consequently, recent studies aim to train a single solver that generalizes across a wide range of weights\[[5](https://arxiv.org/html/2605.22876#bib.bib60),[10](https://arxiv.org/html/2605.22876#bib.bib58)\] and can produce solutions that dynamically adapt to the input weights, i\.e\., enable weight\-conditioned decisions\. As shown in Figure[1](https://arxiv.org/html/2605.22876#S1.F1)\(a\) and [1](https://arxiv.org/html/2605.22876#S1.F1)\(b\), existing single\-model solvers typically incorporate weights either only in the decoder \(e\.g\., PMOCO\[[23](https://arxiv.org/html/2605.22876#bib.bib65)\]\) or primarily in the encoder \(e\.g\., WE\-CA\[[3](https://arxiv.org/html/2605.22876#bib.bib7)\]\)\. However, these two designs, while straightforward, can hinder generalization across weights\. Specifically, decoding\-only injection may fail to provide informative weight\-conditioned context, whereas encoding\-primarily injection may dilute the weight signals during decoding\.
To boost performance, Fan et al\.\[[11](https://arxiv.org/html/2605.22876#bib.bib69)\] proposed the Mixture\-of\-Experts \(MoE\)\-based Conditional Computation \(CCO\) block in the decoder to route subproblems to different experts\. Nevertheless, this gating\-and\-routing process introduces additional runtime \(see Section[5\.2](https://arxiv.org/html/2605.22876#S5.SS2)\), which may limit model practicality in time\-sensitive scenarios, e\.g\., traffic signal control\[[28](https://arxiv.org/html/2605.22876#bib.bib62)\]\. This inefficiency suggests that such a design does not fundamentally address the key limitation of existing solvers, i\.e\., the underexploitation of weights in the encoder and decoder\. Therefore, in this work, we investigate the following research question:
Can a solver achieve high\-level performance by effectively producing informative weight\-conditioned context and preventing weight\-signal dilution during decoding, without significantly increasing runtime?
Moreover, the recent study\[[11](https://arxiv.org/html/2605.22876#bib.bib69)\] adopted the Preference Optimization \(PO\) proposed in\[[29](https://arxiv.org/html/2605.22876#bib.bib12)\] to train the solver\. Specifically, PO randomly samples $r$ candidate solutions for each instance under the current policy and performs pairwise comparisons based on their objective values to construct $\frac{r(r-1)}{2}$ preference pairs, where each pair consists of a better solution $\pi_w$ and a worse solution $\pi_l$\. These pairs then provide comparative supervision for training the solver\. However, purely random sampling does not guarantee a sufficient number of \(near\-\)optimal solutions, which limits PO’s exploitation ability and adversely affects training effectiveness\[[22](https://arxiv.org/html/2605.22876#bib.bib11)\]\.
To overcome the limitations discussed above holistically, we propose a Weight\-Conditioned neural solver \(WeCon\)\. As shown in Figure[1](https://arxiv.org/html/2605.22876#S1.F1)\(c\), WeCon gives weight vectors an essential role in both the encoder and the decoder, aiming to achieve high\-level performance by sharing weight\-conditioned context between them\. Specifically, the encoder comprises stacked layers, each of which first applies a Multi\-Head Self\-Attention \(MHSA\) block over instance features and then uses two Multi\-Head Attention \(MHA\) blocks together with our proposed Gated Residual Fusion \(GRF\) block to integrate instance and weight embeddings \(see Section[4\.1](https://arxiv.org/html/2605.22876#S4.SS1)\), thereby producing informative weight\-conditioned context\. In the decoder, WeCon first exploits an MHA layer together with our proposed Residual Fusion \(RF\) block to inject weight signals more effectively at each decoding step \(see Section[4\.2](https://arxiv.org/html/2605.22876#S4.SS2)\)\. By adopting this design, WeCon mitigates weight\-signal dilution and achieves high\-level performance without significantly increasing runtime \(see Section[5\.2](https://arxiv.org/html/2605.22876#S5.SS2)\)\. The adoption of RF in the decoder distinguishes WeCon from existing solvers that primarily utilize weights in the encoder \(see Figure[1](https://arxiv.org/html/2605.22876#S1.F1)\(b\)\)\. For instance, WE\-CA treats the weight embedding as an extra token appended to the instance embeddings along the node dimension \(for vehicle routing problems\), which is then used as the key/value inputs to the decoder’s MHA layer\[[3](https://arxiv.org/html/2605.22876#bib.bib7)\]\. In contrast, WeCon’s decoder first leverages the weight embedding within an MHA layer and subsequently broadcasts it to all nodes, upon which RF explicitly conditions each node embedding according to the weights \(see Figure[2](https://arxiv.org/html/2605.22876#S3.F2)\)\. Moreover, RF is a plug\-and\-play module that can be readily integrated into various decoder architectures\. To demonstrate the generality of RF, we propose a variant, WeCon\-CCO, which adopts the decoder from\[[11](https://arxiv.org/html/2605.22876#bib.bib69)\] and augments it with the RF module\.
To improve training effectiveness, we extend PO and propose Efficient Preference Optimization \(EPO\)\. Specifically, instead of randomly sampling $r$ solutions for each instance, EPO performs guided sampling to generate $\lceil \frac{r}{c} \rceil$ solutions, while the remaining $(r - \lceil \frac{r}{c} \rceil)$ solutions are sampled at random\. This guided sampling constrains each decision step to select the next node only from the top\-$k$ feasible nodes with the highest probabilities \(see Section[4\.3](https://arxiv.org/html/2605.22876#S4.SS3)\)\. This design enables EPO to obtain sufficiently high\-quality solutions, thereby constructing more informative preference pairs with larger quality gaps so as to improve training effectiveness\. To the best of our knowledge, WeCon is the first neural solver to simultaneously achieve SOTA performance and runtime efficiency by effectively leveraging weights in both the encoder and the decoder\.
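The sampling split described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names `split_sample_budget` and `top_k_step`, the example values of `r`, `c`, and `k`, and the choice to renormalize probabilities inside the top-k set are all assumptions made here.

```python
import math
import random


def split_sample_budget(r, c):
    """Split r samples into (guided, random) counts, with ceil(r / c) guided."""
    guided = math.ceil(r / c)
    return guided, r - guided


def top_k_step(probs, feasible, k, rng):
    """Pick the next node among the k most probable feasible nodes.

    Assumption: probabilities are renormalized inside the top-k set before
    sampling; the text only states that the choice is restricted to that set.
    """
    candidates = [i for i, ok in enumerate(feasible) if ok]
    candidates.sort(key=lambda i: probs[i], reverse=True)
    top = candidates[:k]
    return rng.choices(top, weights=[probs[i] for i in top], k=1)[0]


rng = random.Random(0)
guided, unguided = split_sample_budget(r=10, c=4)   # hypothetical r and c
step_probs = [0.10, 0.50, 0.30, 0.10]               # policy output for 4 nodes
step_feasible = [True, False, True, True]           # node 1 is masked
picks = {top_k_step(step_probs, step_feasible, k=2, rng=rng) for _ in range(200)}
```

With these toy inputs, node 1 is never chosen because it is infeasible, and node 3 is never chosen because it falls outside the top-2 feasible set.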
The key contributions of this work are as follows\.
I\) We design an encoder in which each layer applies self\-attention over instance features, then performs bidirectional attention between instance and weight features, and finally integrates them via the proposed GRF module to produce weight\-conditioned context\.
II\) We develop a decoder that combines an MHA layer with the proposed plug\-and\-play RF block to mitigate weight\-signal dilution and enable more effective weight\-conditioned decisions\.
III\) To improve training effectiveness, we propose the EPO strategy, which efficiently generates sufficiently high\-quality solutions and thereby constructs more informative preference pairs\.
IV\) To evaluate the effectiveness of the proposed WeCon and WeCon\-CCO, we conduct extensive experiments on four MOCOPs across different problem scales and distribution patterns\. The experiments demonstrate that WeCon achieves HyperVolume \(HV\) values comparable to the SOTA solver POCCO\-W, while reducing inference time by approximately 40%\. WeCon\-CCO attains the best overall HV performance, albeit at the cost of increased inference time\. Ablation studies validate the effectiveness of our encoder and decoder architectures as well as EPO\.
## 2 Related Work
In this section, we review the relevant literature\.
Neural Solvers for MOCOP: To solve MOCOPs, pioneering studies\[[40](https://arxiv.org/html/2605.22876#bib.bib36),[19](https://arxiv.org/html/2605.22876#bib.bib38),[51](https://arxiv.org/html/2605.22876#bib.bib40)\] adopted the MOEA/D framework\[[50](https://arxiv.org/html/2605.22876#bib.bib54)\], which decomposes an MOCOP into a set of subproblems associated with different weight vectors and trains a separate model for each subproblem\. Such methods are termed multi\-model solvers\. Nonetheless, training multiple models requires substantial computational resources\. To overcome this limitation, Lin et al\.\[[23](https://arxiv.org/html/2605.22876#bib.bib65)\] proposed a single\-model solver that incorporates the weight vector into the decoder, generating solutions that dynamically adapt to different weight vectors, i\.e\., making weight\-conditioned decisions\. Subsequently, single\-model solvers have become the mainstream for solving MOCOPs\[[10](https://arxiv.org/html/2605.22876#bib.bib58),[44](https://arxiv.org/html/2605.22876#bib.bib25),[12](https://arxiv.org/html/2605.22876#bib.bib8)\]\. For example, Chen et al\.\[[3](https://arxiv.org/html/2605.22876#bib.bib7)\] proposed an encoder with a conditional attention mechanism to integrate weights with instance features\. However, existing solvers often leverage weight vectors coarsely, either primarily in the encoder or solely in the decoder \(see Figure[1](https://arxiv.org/html/2605.22876#S1.F1)\), which may dilute the weight signals during decoding or fail to provide informative weight\-conditioned context, respectively\. To overcome this limitation, we rethink the role of weight vectors in MOCOPs and design encoder and decoder architectures that produce informative weight\-conditioned context and mitigate weight\-signal dilution, respectively\.
Preference Optimization: Most neural solvers are trained with Reinforcement Learning \(RL\), while fewer adopt Supervised Learning \(SL\)\[[42](https://arxiv.org/html/2605.22876#bib.bib31),[16](https://arxiv.org/html/2605.22876#bib.bib18)\]\. Because RL does not require optimal solutions as training labels, adopting RL substantially reduces labeling costs\[[13](https://arxiv.org/html/2605.22876#bib.bib21),[33](https://arxiv.org/html/2605.22876#bib.bib15),[43](https://arxiv.org/html/2605.22876#bib.bib2)\]\. However, RL methods, such as the widely used REINFORCE\[[39](https://arxiv.org/html/2605.22876#bib.bib61)\], update the policy by comparing sampled rewards against a baseline\[[16](https://arxiv.org/html/2605.22876#bib.bib18)\]\. As training progresses, the policy\-gradient signal may diminish, leading to slow convergence\[[29](https://arxiv.org/html/2605.22876#bib.bib12)\]\. To mitigate this problem, Pan et al\.\[[29](https://arxiv.org/html/2605.22876#bib.bib12)\] proposed PO as an alternative to RL, which samples multiple solutions for each instance and then constructs preference pairs based on their relative objective values, enabling the solver to learn comparative quality across solutions\. Subsequently, to improve the training effectiveness of PO, BOPO\[[22](https://arxiv.org/html/2605.22876#bib.bib11)\] and VAGPO\[[25](https://arxiv.org/html/2605.22876#bib.bib9)\] select only a subset of the sampled solutions to construct preference pairs, instead of exhaustively considering all sampled solutions\. However, these methods largely overlook the sampling procedure itself\. To improve training efficiency, we propose EPO, which samples diverse high\-quality solutions by constraining each decision step to sample the next node only from the top\-$k$ nodes with the highest probabilities, thereby constructing more informative preference pairs\.
## 3 Preliminaries
An MOCOP instance $G=\{v_1,\cdots,v_n\}$ can be formulated as $\min_{\pi\in\mathcal{X}} F(\pi)=(F_1(\pi),\dots,F_\kappa(\pi))$, where $F$ denotes an objective vector with $\kappa$ objective functions, $\pi$ denotes a feasible solution, and $\mathcal{X}$ denotes the feasible solution space\. Owing to inherent conflicts among objectives, there is typically no single solution optimal for all objectives\. Therefore, we seek Pareto\-optimal solutions that capture trade\-offs among objectives, defined as follows:
Pareto Dominance: A solution $\pi\in\mathcal{X}$ is said to dominate another solution $\pi'\in\mathcal{X}$ \(denoted as $\pi\prec\pi'$\) iff $F_i(\pi)\leq F_i(\pi')$, $\forall i\in\{1,\cdots,\kappa\}$ and $F(\pi)\neq F(\pi')$\.
Pareto Optimality: A solution $\pi^*\in\mathcal{X}$ is Pareto optimal if it is not dominated by any other solution\. Moreover, the Pareto set $\mathcal{P}$ comprises all Pareto optimal solutions, formally, $\mathcal{P}=\{\pi^*\in\mathcal{X}\mid\nexists\;\pi\in\mathcal{X}:\pi\prec\pi^*\}$\. The Pareto front $\mathcal{F}$ is the image of the Pareto set in the objective space, i\.e\., $\mathcal{F}=\{F(\pi^*)\mid\pi^*\in\mathcal{P}\}$\.
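The two definitions above translate directly into a short check. The following is a minimal sketch for minimization problems, where each solution is represented only by its objective vector; the helper names `dominates` and `pareto_front` are illustrative and not taken from the paper.

```python
def dominates(f_a, f_b):
    """True if objective vector f_a Pareto-dominates f_b (minimization)."""
    no_worse = all(a <= b for a, b in zip(f_a, f_b))
    return no_worse and tuple(f_a) != tuple(f_b)


def pareto_front(points):
    """Keep the objective vectors that no other vector in `points` dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Toy bi-objective values F(pi) for four candidate solutions.
objective_vectors = [(1, 4), (2, 2), (3, 3), (4, 1)]
front = pareto_front(objective_vectors)
```

In this toy example, (3, 3) is dominated by (2, 2), while the remaining three vectors are mutually non-dominated and form the front.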
To solve MOCOPs, decomposition\-based methods are widely used to decompose the original problem into a set of subproblems \(i\.e\., SOCOPs\) with much lower complexity and shorter solution time\. For example, the conventional method MOEA/D\[[50](https://arxiv.org/html/2605.22876#bib.bib54)\] scalarizes an MOCOP into $\mathcal{N}$ subproblems using a set of uniformly distributed weight vectors $\{\lambda_1,\cdots,\lambda_{\mathcal{N}}\}$, where each $\lambda_s=(\lambda_s^1,\cdots,\lambda_s^\kappa)^\top$ satisfies $\lambda_s^j\geq 0$ for all $j$ and $\sum\nolimits_{j=1}^{\kappa}\lambda_s^j=1$\. In this paper, following prior studies\[[3](https://arxiv.org/html/2605.22876#bib.bib7),[11](https://arxiv.org/html/2605.22876#bib.bib69)\], we choose Weighted\-Sum \(WS\) as the decomposition technique for all baselines\. Specifically, given an MOCOP, the $s$th subproblem is defined by the weight vector $\lambda_s$ as $\min_{\pi\in\mathcal{X}}\sum\nolimits_{j=1}^{\kappa}\lambda_s^j F_j(\pi)$\. Notably, effectively leveraging $\lambda_s$ is crucial for improving the performance of the neural solver\[[3](https://arxiv.org/html/2605.22876#bib.bib7)\]\.
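As a small worked illustration of weighted-sum decomposition, the sketch below generates evenly spaced bi-objective weight vectors and selects, for each one, the candidate with the lowest scalarized cost. The helper names and the restriction to two objectives are assumptions made for brevity, and a real solver would search the solution space rather than enumerate a fixed candidate list.

```python
def uniform_weights_2d(num_weights):
    """Evenly spaced weight vectors (w, 1 - w) on the 2-D simplex (num_weights >= 2)."""
    step = 1 / (num_weights - 1)
    return [(i * step, 1 - i * step) for i in range(num_weights)]


def weighted_sum(weights, objectives):
    """Scalarized cost: sum_j lambda^j * F_j(pi)."""
    return sum(w * f for w, f in zip(weights, objectives))


def best_for_weight(weights, candidates):
    """Return the candidate objective vector with the lowest weighted-sum cost."""
    return min(candidates, key=lambda f: weighted_sum(weights, f))


candidates = [(1, 4), (2, 2), (4, 1)]      # toy objective vectors
weight_vectors = uniform_weights_2d(3)     # (0, 1), (0.5, 0.5), (1, 0)
chosen = [best_for_weight(w, candidates) for w in weight_vectors]
```

Each weight vector selects a different trade-off here, which is why sweeping the weights yields a set of solutions approximating the Pareto front.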
Figure 2:WeCon network architectures\. \(a\) Illustration of the proposed WeCon tailored for MOCOPs\. Through multiple rounds of interactions between instance features and weight vectors, WeCon’s encoder produces informative weight\-conditioned context\. Subsequently, WeCon’s decoder exploits the proposed RF block to mitigate weight\-signal dilution and enable effective weight\-conditioned decisions\. \(b\) and \(c\) Illustrations of the proposed GRF and RF modules, respectively\.
## 4 Weight\-Conditioned Neural Solver \(WeCon\)
The architecture of WeCon is schematically depicted in Figure[2](https://arxiv.org/html/2605.22876#S3.F2)\(a\)\. In this section, we first describe its encoder and decoder architectures, which produce informative weight\-conditioned context and mitigate weight\-signal dilution, respectively\. We then introduce EPO\. The source code of WeCon is available online at https://github\.com/wuuu110/WeCon\.
### 4\.1 The Encoder of WeCon
Compared with SOCOPs, neural solvers tailored for MOCOPs often benefit from encoding the weight vector $\lambda_s$ during the encoding phase, which enables them to solve decomposed subproblems associated with $\lambda_s$\. However, existing solvers typically perform only a single interaction between instance features and weights in each encoder layer, yielding contexts that may be insufficiently conditioned on the weights \(i\.e\., poor distinction between instance and weight embeddings, see Figure[5](https://arxiv.org/html/2605.22876#S5.F5)\)\. To produce more informative weight\-conditioned context, we design a novel encoder\. Specifically, given a subproblem $(G=\{v_1,\cdots,v_n\},\lambda_s)$, the encoder first exploits two Fully Connected Networks \(FCNs\) to embed the instance features and the weight vector, producing instance embedding $\bm{H}^0$ and weight embedding $\bm{A}^0$, respectively\. Formally,
$$\bm{H}^0=G\bm{W}^0+\bm{b}^0, \tag{1}$$
$$\bm{A}^0=\lambda_s\bm{W}_\lambda^0+\bm{b}_\lambda^0, \tag{2}$$
where the parameters $\bm{W}^0$, $\bm{b}^0$, $\bm{W}_\lambda^0$, and $\bm{b}_\lambda^0$ are learnable\.
Subsequently, the instance and weight embeddings are jointly updated via $L$ layers to produce informative weight\-conditioned context, where each layer consists of a Multi\-Head Self\-Attention \(MHSA\) block, two Multi\-Head Attention \(MHA\) blocks, and our proposed Gated Residual Fusion \(GRF\) block\. Specifically, for the $l$th layer \($l\in\{1,2,\cdots,L\}$\), the instance embeddings $\bm{H}^l_1$ are first computed via an MHSA block, defined as follows:
$$\bar{\bm{H}}^l_1=\text{RMSNorm}\left(\bm{H}^{l-1}+\text{MHSA}\left(\bm{H}^{l-1},\bm{H}^{l-1},\bm{H}^{l-1}\right)\right), \tag{3}$$
$$\bm{H}^l_1=\text{RMSNorm}\left(\bar{\bm{H}}_1^l+\text{FF}\left(\bar{\bm{H}}_1^l\right)\right), \tag{4}$$
$$\text{MHSA}(\bm{Q},\bm{K},\bm{V})=\left(\big\|_{m=1}^{M}\text{ATT}_m(\bm{Q},\bm{K},\bm{V})\right)\bm{W}_O+\bm{b}_O, \tag{5}$$
$$\text{ATT}_m(\bm{Q},\bm{K},\bm{V})=\operatorname{softmax}\left(\frac{\left(\bm{Q}\bm{W}_{Q,m}\right)\left(\bm{K}\bm{W}_{K,m}\right)^{\top}}{\sqrt{d'}}\right)\bm{V}\bm{W}_{V,m}, \tag{6}$$
where RMSNorm and FF denote the Root Mean Square Layer Normalization\[[48](https://arxiv.org/html/2605.22876#bib.bib48)\] and Feed Forward Network \(FFN\), respectively\. The parameters $\bm{W}_O$, $\bm{b}_O$, $\bm{W}_{Q,m}$, $\bm{W}_{K,m}$, and $\bm{W}_{V,m}$ are learnable, where $m$ indexes the attention heads\. Symbol $d'=d/M$ denotes the per\-head dimension, where $d$ denotes the hidden dimension size\. Operators $\|$ and $\top$ denote concatenation and transpose, respectively\. MHSA uses the same source for $Q$, $K$, and $V$, whereas MHA draws $Q$ and $K/V$ from different sources\. Following\[[18](https://arxiv.org/html/2605.22876#bib.bib67),[2](https://arxiv.org/html/2605.22876#bib.bib45)\], we adopt SwiGLU\[[7](https://arxiv.org/html/2605.22876#bib.bib46)\] as the FFN in the MHSA and subsequent MHA blocks, which employs the Sigmoid Linear Unit \(SiLU\)\[[9](https://arxiv.org/html/2605.22876#bib.bib47)\] to model the nonlinearity, defined as follows:
$$\mathrm{SwiGLU}(\bm{X})=\bm{X}\odot\sigma(\bm{X}\bm{W}_1+\bm{b}_1)\otimes\mathrm{SiLU}(\bm{X}\bm{W}_2+\bm{b}_2), \tag{7}$$
where symbols $\odot$, $\otimes$, and $\sigma$ denote element\-wise multiplication, matrix multiplication, and the sigmoid function, respectively\. The parameters $\bm{W}_1$, $\bm{b}_1$, $\bm{W}_2$, and $\bm{b}_2$ are learnable\.
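To make the attention and normalization steps concrete, the following dependency-free sketch implements one attention head as in Eq. (6) and an RMSNorm without the learnable gain, then applies the residual-plus-normalization pattern of Eq. (3). It is a simplified illustration under stated assumptions: a single head, no output projection $\bm{W}_O$, identity projection matrices, and toy two-dimensional embeddings, none of which reflect the actual model configuration.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(q, k, v, w_q, w_k, w_v):
    """One head of Eq. (6): softmax((Q W_Q)(K W_K)^T / sqrt(d')) V W_V."""
    qp, kp, vp = matmul(q, w_q), matmul(k, w_k), matmul(v, w_v)
    d_head = len(qp[0])
    scores = [[sum(x * y for x, y in zip(qi, kj)) / math.sqrt(d_head) for kj in kp]
              for qi in qp]
    return matmul([softmax(row) for row in scores], vp)


def rms_norm(x, eps=1e-6):
    """RMSNorm without the learnable gain (omitted here for brevity)."""
    out = []
    for row in x:
        rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v / rms for v in row])
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
h = [[1.0, 0.0], [0.0, 1.0]]                     # two toy node embeddings
attended = attention_head(h, h, h, identity, identity, identity)
# Residual connection followed by normalization, as in Eq. (3) with one head.
h_bar = rms_norm([[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(h, attended)])
```

Passing the same matrix as query, key, and value gives self-attention (MHSA), whereas passing the weight embedding as the query and the instance embeddings as key and value would give the cross-attention (MHA) used in the subsequent blocks.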
Subsequently, we employ two MHA blocks to facilitate interactions between the instance and weight embeddings\. Formally,
$$\bar{\bm{A}}^l=\text{RMSNorm}\left(\bm{A}^{l-1}+\text{MHA}\left(\bm{A}^{l-1},\bm{H}^l_1,\bm{H}^l_1\right)\right), \tag{8}$$
$$\bm{A}^l=\text{RMSNorm}\left(\bar{\bm{A}}^l+\text{FF}\left(\bar{\bm{A}}^l\right)\right), \tag{9}$$
$$\bar{\bm{H}}^l_2=\text{RMSNorm}\left(\bm{H}^l_1+\text{MHA}\left(\bm{H}^l_1,\bm{A}^l,\bm{A}^l\right)\right), \tag{10}$$
$$\bm{H}^l_2=\text{RMSNorm}\left(\bar{\bm{H}}_2^l+\text{FF}\left(\bar{\bm{H}}_2^l\right)\right). \tag{11}$$
By adopting this design, we obtain the weight embedding $\bm{A}^l$ at the $l$th layer, which is conditioned on the interactions with the instance embedding\. The final instance embedding $\bm{H}^l$ is then produced by the proposed Gated Residual Fusion \(GRF\) block, defined as follows:
$$\bm{H}^l=\bm{H}_2^l+g^l\odot(\bm{A}^l\bm{W}_3), \tag{12}$$
$$g^l=\sigma\Big(\mathrm{GeLU}\big([\bm{H}^l_2\,\|\,\bm{A}_{\rightarrow n}^l]\bm{W}_4+\bm{b}_4\big)\bm{W}_5+\bm{b}_5\Big), \tag{13}$$
where $\bm{A}_{\rightarrow n}^l$ denotes the broadcasted version of $\bm{A}^l$ along the node dimension, so that its shape matches that of $\bm{H}^l_2$\. The parameters $\bm{W}_3$, $\bm{W}_4$, $\bm{b}_4$, $\bm{W}_5$, and $\bm{b}_5$ are learnable\. This gated residual design adaptively controls how much of the weight features is injected into each node embedding\. The architecture of the proposed GRF module is illustrated in Figure[2](https://arxiv.org/html/2605.22876#S3.F2)\(b\)\.
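Equations (12) and (13) can be traced with a small plain-Python sketch. The shapes used below are assumptions rather than details stated in the text: the weight embedding is treated as a single vector of size $d$, the gate is assumed to have one entry per node and per feature, and the hidden width of the gate network is arbitrary. The function name `gated_residual_fusion` is likewise illustrative.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gelu(x):
    """Exact GeLU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_residual_fusion(h2, a, w3, w4, b4, w5, b5):
    """Eq. (12)-(13): H = H_2 + g * (A W_3), with g from a two-layer gate network."""
    concat = [row + a for row in h2]              # [H_2 || A broadcast to n rows]
    hidden = [[gelu(v + b) for v, b in zip(row, b4)] for row in matmul(concat, w4)]
    gate = [[sigmoid(v + b) for v, b in zip(row, b5)] for row in matmul(hidden, w5)]
    injected = matmul([a], w3)[0]                 # A W_3, shared by all nodes
    return [[h + g * inj for h, g, inj in zip(h_row, g_row, injected)]
            for h_row, g_row in zip(h2, gate)]


h2 = [[1.0, 2.0], [3.0, 4.0]]                     # n = 2 nodes, d = 2
a = [2.0, 4.0]                                    # weight embedding
w3 = [[1.0, 0.0], [0.0, 1.0]]                     # identity, for readability
w4 = [[0.0] * 3 for _ in range(4)]                # zero gate network => g = 0.5
b4 = [0.0] * 3
w5 = [[0.0] * 2 for _ in range(3)]
b5 = [0.0] * 2
fused = gated_residual_fusion(h2, a, w3, w4, b4, w5, b5)
```

With the gate network zeroed out, every gate entry equals 0.5, so each node embedding receives exactly half of the projected weight embedding; in a trained model the gate would vary per node and per feature.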
### 4\.2 The Decoder of WeCon
After obtaining instance and weight embeddings $\bm{H}$ and $\bm{A}$, the decoder autoregressively computes the probability $p_i^t$ of selecting the $i$th node at the $t$th step\. To this end, the decoder first produces a vector $\bm{q}_c^t$, then computes attention scores $\alpha^t$ over all candidate nodes to derive $p_i^t$\. To produce $\bm{q}_c^t$, prior studies\[[3](https://arxiv.org/html/2605.22876#bib.bib7),[10](https://arxiv.org/html/2605.22876#bib.bib58)\] rely solely on an MHA layer, whose attention can be dominated by instance embeddings and may therefore dilute the weight signals\. To mitigate such dilution, we propose a Residual Fusion \(RF\) block that further incorporates the weight embedding $\bm{A}$ into the decision process\. As discussed in Section[1](https://arxiv.org/html/2605.22876#S1), the adoption of RF makes WeCon substantially differ from existing neural solvers that primarily utilize weights in the encoder\. Specifically, WeCon first applies an MHA layer with $M$ attention heads, followed by the RF block\. The embedding $\hat{\bm{H}}_1^t$ is computed by the MHA layer as follows:
$$\hat{\bm{H}}_1^t=\text{MHA}\left(\bm{h}_q^t,[\bm{H}\,\|\,\bm{A}],[\bm{H}\,\|\,\bm{A}]\right), \tag{14}$$
where $\bm{h}_q^t$ denotes the query embedding\. Following prior studies\[[3](https://arxiv.org/html/2605.22876#bib.bib7),[11](https://arxiv.org/html/2605.22876#bib.bib69)\], for the Bi\-objective Traveling Salesman Problem \(Bi\-TSP\) and Tri\-objective Traveling Salesman Problem \(Tri\-TSP\), the query embedding $\bm{h}_q^t$ is formed by concatenating the embeddings of the first and last visited nodes\. For the Bi\-objective Capacitated Vehicle Routing Problem \(Bi\-CVRP\), $\bm{h}_q^t$ consists of the embedding of the last visited node together with the remaining vehicle capacity\. In the Bi\-objective Knapsack Problem \(Bi\-KP\), $\bm{h}_q^t$ combines the graph embedding $\widetilde{h}=\frac{1}{n+1}\sum_{i=0}^{n}h_i$ with the remaining knapsack capacity\.
Subsequently, to enable weight\-conditioned decisions, we feed $\hat{\bm{H}}_1^t$ into the RF block\. Formally,
$$\bm{q}_c^t=\hat{\bm{H}}_1^t+\Big(\mathrm{ReLU}\big([\hat{\bm{H}}_1^t\,\|\,\bm{A}_{\rightarrow n}]\bm{W}_6+\bm{b}_6\big)\bm{W}_7+\bm{b}_7\Big), \tag{15}$$
where $\bm{A}_{\rightarrow n}$ denotes the broadcasted version of $\bm{A}$ along the node dimension\. The parameters $\bm{W}_6$, $\bm{b}_6$, $\bm{W}_7$, and $\bm{b}_7$ are learnable\. The architecture of the proposed RF module is illustrated in Figure[2](https://arxiv.org/html/2605.22876#S3.F2)\(c\)\. By adopting this design, WeCon mitigates weight\-signal dilution, enabling weight\-conditioned decisions\.
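The RF computation in Eq. (15) is a residual two-layer network over the concatenated MHA output and weight embedding. The sketch below applies it to a single query vector; treating the inputs as plain vectors of size $d$, as well as the function names and parameter values, are assumptions chosen to keep the example small.

```python
def vecmat(v, w):
    """Multiply a row vector by a matrix stored as a list of rows."""
    return [sum(x * y for x, y in zip(v, col)) for col in zip(*w)]


def residual_fusion(h_hat, a, w6, b6, w7, b7):
    """Eq. (15): q_c = H_hat + (ReLU([H_hat || A] W_6 + b_6) W_7 + b_7)."""
    concat = h_hat + a                                        # list concatenation
    hidden = [max(0.0, v + b) for v, b in zip(vecmat(concat, w6), b6)]
    delta = [v + b for v, b in zip(vecmat(hidden, w7), b7)]
    return [h + d for h, d in zip(h_hat, delta)]


h_hat = [1.0, 2.0]                         # toy MHA output for one decoding step
w6 = [[0.0, 0.0], [0.0, 0.0],              # ignores the H_hat half of the input
      [1.0, 0.0], [0.0, 1.0]]              # and passes the weight half through
b6 = [0.0, 0.0]
w7 = [[1.0, 0.0], [0.0, 1.0]]
b7 = [0.0, 0.0]
q_with_weights = residual_fusion(h_hat, [3.0, -1.0], w6, b6, w7, b7)
q_other_weights = residual_fusion(h_hat, [0.0, -1.0], w6, b6, w7, b7)
```

The two calls share the same MHA output but differ in the weight embedding, and the resulting query vectors differ accordingly, which illustrates how RF gives the weights a direct route into the decoding query.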
Moreover, the effect of RF can be understood through the following intuition\. With RF, $q_c^t=\mathrm{MHA}(\cdot)+\mathrm{RF}(\mathrm{MHA}(\cdot)\,\|\,A_n)$, where $A_n$ denotes the broadcasted weight embedding, so that
$$\frac{\partial q_c^t}{\partial A}=\frac{\partial\mathrm{MHA}}{\partial A}+\frac{\partial\mathrm{RF}}{\partial\mathrm{MHA}}\cdot\frac{\partial\mathrm{MHA}}{\partial A}+\frac{\partial\mathrm{RF}}{\partial A_n}\cdot\frac{\partial A_n}{\partial A}.$$
Compared to the version without RF, the last term introduces an additional direct gradient path from $A$ to $q_c^t$\. Since $\frac{\partial A_n}{\partial A}$ is non\-zero, this path remains locally non\-vanishing under a mild non\-degeneracy assumption, namely $\left\|\frac{\partial\mathrm{RF}}{\partial A_n}\right\|_2\geq c>0$\. Thus, RF helps mitigate weight\-signal dilution during decoding\.
Finally, the probability $p_i^t$ is computed as follows:
$$p_i^t=p_\theta\left(\pi_t=i\mid G,\lambda,\pi_{1:t-1}\right)=\frac{e^{\alpha_i^t}}{\sum_j e^{\alpha_j^t}}, \tag{16}$$
$$\alpha_i^t=\begin{cases}-\infty,&\text{the $i$th node is masked},\\ C\cdot\tanh\left(\frac{(\bm{q}_c^t)^{\top}\bm{h}_i}{\sqrt{d}}\right),&\text{otherwise,}\end{cases} \tag{17}$$
where $\theta$ denotes all learnable model parameters and $C$ is a predefined clipping constant following prior studies\[[3](https://arxiv.org/html/2605.22876#bib.bib7),[10](https://arxiv.org/html/2605.22876#bib.bib58)\]\. Different MOCOP variants impose different feasibility constraints\. For MOTSP, the $i$th node is feasible \(and thus unmasked\) if it has not been visited previously, i\.e\., $i\neq\pi_{t'},\forall t'<t$\.
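Equations (16) and (17) amount to a masked softmax over tanh-clipped compatibility scores. The following sketch shows one decoding step in plain Python; the function name `node_probabilities` and the default value `clip=10.0` are illustrative assumptions, since the text only says that $C$ is a predefined constant, and the sketch assumes at least one node is unmasked.

```python
import math


def node_probabilities(q_c, node_embeddings, masked, clip=10.0):
    """Masked softmax over clipped scores C * tanh(q_c . h_i / sqrt(d))."""
    d = len(q_c)
    logits = []
    for h_i, is_masked in zip(node_embeddings, masked):
        if is_masked:
            logits.append(-math.inf)       # masked nodes get zero probability
        else:
            score = sum(q * h for q, h in zip(q_c, h_i)) / math.sqrt(d)
            logits.append(clip * math.tanh(score))
    peak = max(logits)                      # finite when some node is unmasked
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


q_c = [1.0, 0.0]
nodes = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
visited = [False, True, False]              # node 1 was already visited (TSP mask)
probs = node_probabilities(q_c, nodes, visited)
```

The tanh clipping bounds every unmasked logit to the interval $[-C, C]$, which keeps the distribution from collapsing onto a single node early in training.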
Notably, RF is a plug-and-play module that can be readily integrated into various decoder architectures. To demonstrate its generality, we propose WeCon-CCO, which adopts the decoder from \[[11](https://arxiv.org/html/2605.22876#bib.bib69)\] and inserts RF between its MHA layer and CCO module (see Figure [3](https://arxiv.org/html/2605.22876#S4.F3)). Formally, for WeCon-CCO, the vector $\bm{q}_{c}^{t}$ is computed as follows:

$$\hat{\bm{H}}_{1}^{t}=\mathrm{MHA}\left(\bm{h}_{q}^{t},[\bm{H}\,\|\,\bm{A}],[\bm{H}\,\|\,\bm{A}]\right),\tag{18}$$

$$\hat{\bm{H}}_{2}^{t}=\hat{\bm{H}}_{1}^{t}+\Big(\mathrm{ReLU}\big([\hat{\bm{H}}_{1}^{t}\,\|\,\bm{A}_{\rightarrow n}]\bm{W}_{6}+\bm{b}_{6}\big)\bm{W}_{7}+\bm{b}_{7}\Big),\tag{19}$$

$$\bm{q}_{c}^{t}=\mathrm{RMSNorm}\left(\sum\nolimits_{i=1}^{N_{e}}G_{i}(\hat{\bm{H}}_{2}^{t})E_{i}(\hat{\bm{H}}_{2}^{t})+\hat{\bm{H}}_{2}^{t}\right),\tag{20}$$

where $N_{e}$ denotes the number of experts, and $G_{i}$ and $E_{i}$ denote the outputs of the $i$th gate and expert functions, respectively. In addition, we replace the Instance Normalization used in the original CCO with RMSNorm to remain consistent with the normalization employed in the WeCon encoder. By adopting this enhanced decoder, WeCon-CCO achieves the best overall HV values across different MOCOPs and scales, albeit with significantly higher computational overhead than WeCon (see Section [5.2](https://arxiv.org/html/2605.22876#S5.SS2)).
Figure 3: Illustration of the WeCon-CCO decoder architecture.
### 4.3 Efficient Preference Optimization (EPO)
Recent studies adopt PO to train neural solvers \[[11](https://arxiv.org/html/2605.22876#bib.bib69),[29](https://arxiv.org/html/2605.22876#bib.bib12)\]. Specifically, PO randomly samples multiple solutions for each instance and constructs preference pairs by comparing their objective values pairwise. Each pair consists of a better solution $\pi_{w}$ and a worse solution $\pi_{l}$ (denoted as $\pi_{w}\lessdot\pi_{l}$ under a minimization objective). These preference pairs provide an implicit training signal for optimizing the solver. However, purely random sampling may fail to produce sufficiently high-quality solutions, causing the resulting preference pairs to be weakly informative (i.e., $\pi_{w}$ and $\pi_{l}$ have similar quality) and thereby reducing training effectiveness.
To address this issue, we propose a straightforward yet efficient EPO strategy. Instead of sampling all $r$ solutions purely at random for each instance, EPO uses guided sampling to obtain $\lceil\frac{r}{c}\rceil$ high-quality solutions (see Eq. ([21](https://arxiv.org/html/2605.22876#S4.E21))), while the remaining $r-\lceil\frac{r}{c}\rceil$ solutions are sampled randomly. Here, $c>1$ controls the fraction of guided samples. Specifically, given the probabilities $p_{i}^{t}$ (see Eq. ([16](https://arxiv.org/html/2605.22876#S4.E16))) over all unmasked (feasible) nodes at the $t$th step, guided sampling restricts the next-node decision to the top-$k$ nodes with the highest probabilities. Formally, at the $t$th step, the candidate node set is defined as follows:
$$\mathcal{T}^{t}=\Big\{i\mid i\in\mathop{\operatorname{arg\,top}(k)}\limits_{j\in G_{f}}p_{j}^{t}\Big\},\tag{21}$$

where $G_{f}$ denotes the set of feasible nodes. If fewer than $k$ feasible nodes remain, i.e., $|G_{f}|<k$, we simply set $\mathcal{T}^{t}=G_{f}$. By adopting this strategy, EPO samples a sufficient number of high-quality solutions, yielding more informative preference pairs in which $\pi_{w}$ is substantially better than $\pi_{l}$ in terms of solution quality. Subsequently, the guided and randomly sampled solutions are jointly used to construct $\frac{r(r-1)}{2}$ preference pairs via pairwise comparisons.
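The candidate set of Eq. (21) can be sketched in a few lines. The function names are hypothetical, and one detail is an assumption: the text specifies the candidate set but not how a node is drawn from it, so this sketch samples proportionally to the original probabilities.

```python
import random


def guided_candidates(probs, feasible, k):
    """Candidate set T^t: the k feasible nodes with the highest probability.

    If fewer than k feasible nodes remain, all of them are returned.
    """
    ranked = sorted(feasible, key=lambda i: probs[i], reverse=True)
    return ranked[:k]


def guided_sample_step(probs, feasible, k, rng):
    """Draw the next node from T^t.

    Assumption: sampling inside T^t is proportional to the original
    probabilities; the source only defines the candidate set itself.
    """
    candidates = guided_candidates(probs, feasible, k)
    weights = [probs[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


probs = [0.10, 0.40, 0.05, 0.30, 0.15]   # hypothetical step probabilities
rng = random.Random(0)
top3 = guided_candidates(probs, feasible=[0, 1, 2, 3, 4], k=3)
node = guided_sample_step(probs, feasible=[0, 1, 2, 3, 4], k=3, rng=rng)
```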
To learn from preference pairs, EPO treats the average log-likelihood of a solution as an implicit reward $f_{\theta}$, thereby linking solution preferences to their policy probabilities, defined as follows:

$$f_{\theta}(\pi\mid G,\lambda_{s})=\frac{\log p_{\theta}(\pi\mid G,\lambda_{s})}{|\pi|}=\frac{\sum_{t=1}^{|\pi|}\log p_{\theta}(\pi_{t}\mid\pi_{<t},G,\lambda_{s})}{|\pi|},\tag{22}$$

where $p_{\theta}(\pi\mid G,\lambda_{s})$ denotes the probability of generating solution $\pi$, and $|\pi|$ denotes the sequence length of $\pi$ (used to normalize the implicit reward). Then, we define $g_{\theta}(\cdot)$ to map the reward difference $f_{\theta}(\pi_{w}\mid G,\lambda_{s})-f_{\theta}(\pi_{l}\mid G,\lambda_{s})$ to a preference probability as follows:

$$g_{\theta}\bigl(\pi_{w}\lessdot\pi_{l}\mid G,\lambda_{s}\bigr)=\sigma\Big(\beta\big(f_{\theta}(\pi_{w}\mid G,\lambda_{s})-f_{\theta}(\pi_{l}\mid G,\lambda_{s})\big)\Big),\tag{23}$$

where $\sigma$ denotes the sigmoid function and $\beta$ denotes a predefined temperature parameter that controls the sharpness of the preference comparison. Finally, WeCon is trained by maximizing the log-likelihood of $g_{\theta}(\cdot)$, with the loss function defined as follows:

$$\mathcal{L}(\theta\mid g_{\theta},G,\lambda_{s},\pi_{w},\pi_{l})=-y\log g_{\theta}\bigl(\pi_{w}\lessdot\pi_{l}\mid G,\lambda_{s}\bigr)=-y\log\sigma\bigg(\beta\Big(\frac{\log p_{\theta}(\pi_{w}\mid G,\lambda_{s})}{|\pi_{w}|}-\frac{\log p_{\theta}(\pi_{l}\mid G,\lambda_{s})}{|\pi_{l}|}\Big)\bigg),\tag{24}$$

where $y$ denotes a binary preference label, with $y=1$ if $\pi_{w}\lessdot\pi_{l}$ and $y=0$ otherwise. The pseudocode of EPO is presented in Algorithm [1](https://arxiv.org/html/2605.22876#alg1).
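For a single preference pair with $y=1$, Eqs. (22) to (24) reduce to a length-normalized logistic loss. The sketch below is illustrative only: the function names are hypothetical, the per-step log-probabilities are made-up numbers, and $\beta=3.5$ is the bi-objective setting reported in the experiments.

```python
import math


def implicit_reward(step_log_probs):
    """Eq. (22): average per-step log-probability of a solution."""
    return sum(step_log_probs) / len(step_log_probs)


def epo_pair_loss(logp_w, logp_l, beta=3.5):
    """Eq. (24) with y = 1: -log sigmoid(beta * (f(pi_w) - f(pi_l)))."""
    diff = beta * (implicit_reward(logp_w) - implicit_reward(logp_l))
    # -log(sigmoid(diff)) = log(1 + exp(-diff)), written in a stable form.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Hypothetical per-step log-probabilities of a better and a worse solution.
loss_good = epo_pair_loss([-0.1, -0.2], [-1.0, -2.0])   # policy already prefers pi_w
loss_tie = epo_pair_loss([-1.0], [-1.0])                # no preference yet
```

The loss equals $\log 2$ when both solutions receive the same implicit reward and decreases as the policy assigns a higher average log-likelihood to $\pi_{w}$ than to $\pi_{l}$.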
Algorithm 1: EPO Procedures
1: Input: Instance distribution $\tilde{G}$, weight vector distribution $\tilde{\lambda}$, training steps $E$, batch size $B$, the total number of sampled solutions $r$, and the guidance ratio parameter $c$.
2: Output: Trained policy network parameters $\theta$.
3: Initialize policy network parameters $\theta$.
4: for $e=1$ to $E$ do
5:   for $b=1$ to $B$ do
6:     $\lambda_{b}\sim$ SampleWeightVector($\tilde{\lambda}$)
7:     $G_{b}\sim$ SampleInstance($\tilde{G}$)
8:     $\pi_{i,b}\sim$ GuidedSampleSolution($p_{\theta}(\cdot\mid G_{b},\lambda_{b})$), $\forall i\in\{1+mc\mid m\in\mathbb{Z}_{\geq 0},\ 1+mc\leq r\}$ # Eq. ([21](https://arxiv.org/html/2605.22876#S4.E21))
9:     $\pi_{i,b}\sim$ RandomlySampleSolution($p_{\theta}(\cdot\mid G_{b},\lambda_{b})$), $\forall i\in\{1,\ldots,r\}\setminus\{1+mc\mid m\in\mathbb{Z}_{\geq 0},\ 1+mc\leq r\}$
10:     $y(\pi_{i,b},\pi_{j,b})\leftarrow$ PairwisePreference($1_{[\pi_{i,b}\lessdot\pi_{j,b}]}$), $\forall i,j\in\{1,\cdots,r\}$
11:   end for
12:   Compute gradient $\nabla_{\theta}\mathcal{L}(\theta)$ according to Eq. ([24](https://arxiv.org/html/2605.22876#S4.E24))
13:   $\theta\leftarrow$ Adam($\theta,\nabla_{\theta}\mathcal{L}(\theta)$)
14: end for
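Two bookkeeping steps of Algorithm 1 can be checked with a short sketch: the split of the $r$ sample indices into guided and random ones (lines 8 and 9), and the construction of preference pairs (line 10). The function names are hypothetical, $r=20$ is an arbitrary illustrative value, $c$ is assumed to be an integer (as in the reported setting $c=8$), and skipping tied pairs is an assumption rather than something the algorithm states.

```python
import math
from itertools import combinations


def split_indices(r, c):
    """Indices 1, 1+c, 1+2c, ... (<= r) are guided; the rest are random."""
    guided = [i for i in range(1, r + 1) if (i - 1) % c == 0]
    randomly = [i for i in range(1, r + 1) if (i - 1) % c != 0]
    return guided, randomly


def preference_pairs(objective_values):
    """Return (winner, loser) index pairs under a minimization objective.

    Assumption: pairs with equal objective values are skipped.
    """
    pairs = []
    for i, j in combinations(range(len(objective_values)), 2):
        if objective_values[i] < objective_values[j]:
            pairs.append((i, j))
        elif objective_values[j] < objective_values[i]:
            pairs.append((j, i))
    return pairs


guided, randomly = split_indices(r=20, c=8)      # r = 20 is hypothetical
pairs = preference_pairs([3.0, 1.0, 2.0])        # hypothetical objective values
```

For distinct objective values, `preference_pairs` yields exactly $\frac{r(r-1)}{2}$ pairs, and the number of guided indices equals $\lceil\frac{r}{c}\rceil$.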
## 5 Experimental Results
In this section, we first present the detailed experimental setups. Then, following the experimental settings of prior studies \[[11](https://arxiv.org/html/2605.22876#bib.bib69),[3](https://arxiv.org/html/2605.22876#bib.bib7)\], we compare WeCon and WeCon-CCO with fourteen methods on four MOCOPs, namely Bi-TSP, Bi-CVRP, Bi-KP, and Tri-TSP, across different problem scales and distribution patterns. Furthermore, we provide visualizations to demonstrate the effectiveness of our encoder and decoder architectures and show that the performance gain of WeCon is not merely due to a larger model size. Finally, we conduct ablation studies to validate the effectiveness of the encoder and decoder architectures as well as the proposed EPO strategy.
### 5.1 Experiment Setups
This subsection details the baselines, evaluation metrics, hyperparameter settings, and hardware configuration, and presents the definitions, instance augmentation procedures, and test datasets for the four MOCOP variants\.
Baselines: The baselines include six conventional heuristics, namely WS-LKH, WS-DP, MOEA/D \[[50](https://arxiv.org/html/2605.22876#bib.bib54)\], NSGA-II \[[8](https://arxiv.org/html/2605.22876#bib.bib43)\], MOGLS \[[14](https://arxiv.org/html/2605.22876#bib.bib42)\], and PPLS/D-C \[[32](https://arxiv.org/html/2605.22876#bib.bib41)\], and eight neural solvers, namely DRL-MOA \[[19](https://arxiv.org/html/2605.22876#bib.bib38)\], MDRL \[[51](https://arxiv.org/html/2605.22876#bib.bib40)\], EMNH \[[5](https://arxiv.org/html/2605.22876#bib.bib60)\], PMOCO \[[23](https://arxiv.org/html/2605.22876#bib.bib65)\], CNH \[[10](https://arxiv.org/html/2605.22876#bib.bib58)\], PA-MoE-W \[[26](https://arxiv.org/html/2605.22876#bib.bib63)\], WE-CA \[[3](https://arxiv.org/html/2605.22876#bib.bib7)\], and POCCO-W \[[11](https://arxiv.org/html/2605.22876#bib.bib69)\]. Among these, DRL-MOA, MDRL, and EMNH are multi-model neural solvers that train distinct models for different subproblems specified by weight vectors, whereas PMOCO, CNH, WE-CA, PA-MoE-W, and POCCO-W, as well as our WeCon and WeCon-CCO, are single-model solvers that use a unified model to handle all subproblems. WS-LKH is the SOTA heuristic widely used as a baseline \[[3](https://arxiv.org/html/2605.22876#bib.bib7),[11](https://arxiv.org/html/2605.22876#bib.bib69)\]. POCCO-W is the SOTA neural solver for MOCOPs. Notably, CNH, WE-CA, PA-MoE-W, POCCO-W, WeCon, and WeCon-CCO are trained on multiple problem scales $n\in\{20,21,\cdots,100\}$ (except for Bi-KP, where $n\in\{50,51,\cdots,200\}$) to obtain a unified model applicable across individual scales. The resulting models are then directly evaluated on test datasets of varying sizes without additional fine-tuning for specific scales.
Evaluation Metrics: Following prior work, we evaluate all methods using three metrics, namely average HyperVolume (HV) \[[38](https://arxiv.org/html/2605.22876#bib.bib44)\], average gap, and total runtime per test dataset. HV is a widely used indicator in MOCOPs that reflects both the convergence and diversity of the obtained solution set; higher HV indicates better performance. For consistency, HV is normalized to $[0,1]$ using the same reference and ideal points for all methods (see Table [1](https://arxiv.org/html/2605.22876#S5.T1)). The gap is defined as the relative difference in HV between the corresponding method and WeCon-CCO-Aug. Methods with the “-Aug” suffix apply instance augmentation \[[23](https://arxiv.org/html/2605.22876#bib.bib65)\] to further improve performance and are highlighted with a gray background in the tables. Moreover, we conduct the Wilcoxon rank-sum test at the 1% significance level. In each comparison, the best result is highlighted in bold only when it shows a statistically significant improvement over the other methods, while the second-best result is underlined.
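To make the two quality metrics concrete, the sketch below computes a normalized hypervolume for a bi-objective minimization problem and the HV gap relative to a reference method. Both formulas are assumptions: the normalization by the box spanned by the reference and ideal points is the usual convention rather than something spelled out here, and the gap formula is simply the relative difference that reproduces the reported table entries. Function names are hypothetical.

```python
def hypervolume_2d_min(points, ref, ideal):
    """Normalized hypervolume of bi-objective points under minimization.

    Assumption: HV is divided by the area of the box between `ideal` and `ref`.
    """
    # Only points that are strictly better than the reference point contribute.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_f2 = ref[1]
    for f1, f2 in pts:                  # sweep in increasing order of f1
        if f2 < best_f2:                # skip dominated points
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area / ((ref[0] - ideal[0]) * (ref[1] - ideal[1]))


def hv_gap(hv, hv_reference):
    """Relative HV difference to the reference method, in percent."""
    return (hv_reference - hv) / hv_reference * 100.0


hv = hypervolume_2d_min([(1, 3), (2, 2), (3, 1), (3, 3)], ref=(4, 4), ideal=(0, 0))
gap_nsga2 = hv_gap(0.6120, 0.6417)   # NSGA-II vs. WeCon-CCO-Aug on Bi-TSP50
gap_wslkh = hv_gap(0.7090, 0.7079)   # WS-LKH vs. WeCon-CCO-Aug on Bi-TSP100
```

A negative gap means the method attains a higher HV than WeCon-CCO-Aug, as happens for WS-LKH on Bi-TSP100.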
Hyper-parameters: Unless otherwise specified in sensitivity analyses, we set $k=5$, $c=8$, $L=6$, $M=8$, and $C=10$ in all experiments. To ensure a fair comparison, we set $\beta=3.5$ for bi-objective problems and $\beta=4.5$ for tri-objective problems following the prior art \[[11](https://arxiv.org/html/2605.22876#bib.bib69)\]. In addition, all solvers are trained for 200 epochs, with each epoch processing 100,000 randomly sampled instances and a batch size of $B=64$. We use the Adam optimizer \[[15](https://arxiv.org/html/2605.22876#bib.bib56)\] with a learning rate of $3\times 10^{-4}$ and a weight decay of $10^{-6}$. The weight vectors used for decomposition are generated following \[[6](https://arxiv.org/html/2605.22876#bib.bib57)\], with $\mathcal{N}=101$ for $\kappa=2$ and $\mathcal{N}=105$ for $\kappa=3$. For fair comparison, we set the reference and ideal points following \[[11](https://arxiv.org/html/2605.22876#bib.bib69)\], and these values are reported in Table [1](https://arxiv.org/html/2605.22876#S5.T1).
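The counts $\mathcal{N}=101$ and $\mathcal{N}=105$ are consistent with a uniform simplex-lattice design, in which every weight component is a multiple of $1/H$ and the components sum to one. The sketch below is an assumption in that sense: the construction is not spelled out beyond the citation, and the granularities $H=100$ and $H=13$ are chosen here only because they reproduce the reported counts.

```python
from itertools import combinations


def simplex_lattice(kappa, H):
    """All kappa-dimensional weight vectors with entries in {0, 1/H, ..., 1}
    that sum to 1, enumerated with the stars-and-bars construction."""
    vectors = []
    slots = H + kappa - 1                      # H stars plus kappa-1 bars
    for bars in combinations(range(slots), kappa - 1):
        counts, prev = [], -1
        for b in bars:
            counts.append(b - prev - 1)        # stars between consecutive bars
            prev = b
        counts.append(slots - prev - 1)        # stars after the last bar
        vectors.append(tuple(cnt / H for cnt in counts))
    return vectors


weights_bi = simplex_lattice(kappa=2, H=100)   # H is an assumed granularity
weights_tri = simplex_lattice(kappa=3, H=13)   # H is an assumed granularity
```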
Table 1: Reference points $r^{*}$ and ideal points $z$

| Problem | Size | $r^{*}$ | $z$ |
| --- | --- | --- | --- |
| Bi-TSP | 20 | (20, 20) | (0, 0) |
| Bi-TSP | 50 | (35, 35) | (0, 0) |
| Bi-TSP | 100 | (65, 65) | (0, 0) |
| Bi-TSP | 150 | (85, 85) | (0, 0) |
| Bi-TSP | 200 | (115, 115) | (0, 0) |
| Bi-TSP | 500 | (250, 250) | (0, 0) |
| Bi-TSP | 1000 | (450, 450) | (0, 0) |
| Bi-CVRP | 20 | (30, 4) | (0, 0) |
| Bi-CVRP | 50 | (45, 4) | (0, 0) |
| Bi-CVRP | 100 | (80, 4) | (0, 0) |
| Bi-KP | 50 | (5, 5) | (30, 30) |
| Bi-KP | 100 | (20, 20) | (50, 50) |
| Bi-KP | 200 | (30, 30) | (75, 75) |
| Bi-KP | 500 | (90, 90) | (150, 150) |
| Bi-KP | 1000 | (130, 130) | (260, 260) |
| Tri-TSP | 20 | (20, 20, 20) | (0, 0) |
| Tri-TSP | 50 | (35, 35, 35) | (0, 0) |
| Tri-TSP | 100 | (65, 65, 65) | (0, 0) |
Hardware: All algorithms are evaluated on a machine equipped with an Intel Xeon Gold 6348 CPU and an NVIDIA A800 GPU (80GB).
Definitions of MOCOPs:
MOTSP: An MOTSP (Bi-TSP/Tri-TSP) instance is defined by multiple cost matrices, and the goal is to find a set of Pareto-optimal tours (i.e., node sequences). Specifically, a $\kappa$-objective TSP instance $G$ with $n$ nodes is characterized by cost matrices $\mathcal{C}^{i}=(\mathcal{C}_{j,k}^{i})$, where $i\in\{1,\cdots,\kappa\}$ and $j,k\in\{1,\cdots,n\}$. The $\kappa$ objectives are defined as follows:

$$\min_{\pi\in\mathcal{X}}F(\pi)=\min\big(F_{1}(\pi),F_{2}(\pi),\cdots,F_{\kappa}(\pi)\big),\quad\text{with}\;F_{i}(\pi)=\mathcal{C}_{\pi_{n},\pi_{1}}^{i}+\sum\nolimits_{j=1}^{n-1}\mathcal{C}_{\pi_{j},\pi_{j+1}}^{i},\tag{25}$$

where $\pi=(\pi_{1},\pi_{2},\cdots,\pi_{n})$ and $\mathcal{X}$ denotes the set of all feasible solutions, in which each node is visited exactly once. Following \[[19](https://arxiv.org/html/2605.22876#bib.bib38),[23](https://arxiv.org/html/2605.22876#bib.bib65)\], we consider the Euclidean MOTSP. Each node $j$ is associated with a $2\kappa$-dimensional feature vector $o_{j}=[loc_{j}^{1},loc_{j}^{2},\cdots,loc_{j}^{\kappa}]$, where $loc_{j}^{i}\in\mathbb{R}^{2}$ denotes the coordinate under the $i$th objective. The $i$th objective value is computed as $F_{i}(\pi)=\|loc_{\pi_{n}}^{i}-loc_{\pi_{1}}^{i}\|_{2}+\sum_{j=1}^{n-1}\|loc_{\pi_{j}}^{i}-loc_{\pi_{j+1}}^{i}\|_{2}$.
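The Euclidean MOTSP objectives can be evaluated directly from the per-objective coordinates. The following is a minimal sketch with a hypothetical function name and a hand-made four-node instance; nodes are indexed from 0 for convenience.

```python
import math


def motsp_objectives(tour, coords):
    """Tour length under each objective.

    coords[i][j] is the (x, y) coordinate of node j under objective i;
    `tour` is a permutation of the node indices. The closing edge from the
    last node back to the first one is included.
    """
    n = len(tour)
    values = []
    for locs in coords:
        total = 0.0
        for pos in range(n):
            a = locs[tour[pos]]
            b = locs[tour[(pos + 1) % n]]   # wraps around to the first node
            total += math.dist(a, b)
        values.append(total)
    return values


# Hypothetical Bi-TSP instance with 4 nodes: a unit square and a 3x4 rectangle.
coords = [
    [(0, 0), (1, 0), (1, 1), (0, 1)],
    [(0, 0), (3, 0), (3, 4), (0, 4)],
]
objs = motsp_objectives([0, 1, 2, 3], coords)
```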
Bi-CVRP: A Bi-CVRP instance consists of $n$ customer nodes and a single depot node. Each node $j\in\{0,\cdots,n\}$ is associated with a 3-dimensional feature vector $o_{j}=[loc_{j},\delta_{j}]$, where $loc_{j}$ and $\delta_{j}$ denote the coordinates and demand of node $j$, respectively. For the depot, the demand is set to zero, i.e., $\delta_{0}=0$. A fleet of vehicles with capacity $\mathcal{Q}$ (with $\mathcal{Q}>\delta_{j}$ for all customers) is used to serve all customers via multiple routes, each starting and ending at the depot. The solution must satisfy the following constraints: (i) each customer is visited exactly once, and (ii) the total demand served on each route does not exceed the vehicle capacity. In this paper, following prior studies \[[23](https://arxiv.org/html/2605.22876#bib.bib65),[17](https://arxiv.org/html/2605.22876#bib.bib27)\], we aim to minimize two objectives, namely the total length of all routes and the length of the longest route.
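A short sketch of the two Bi-CVRP objectives together with the two feasibility checks stated above. The function name and the toy instance are hypothetical; node 0 is the depot.

```python
import math


def bicvrp_objectives(routes, loc, demand, capacity):
    """Return (total route length, longest route length).

    `routes` is a list of customer sequences; each route implicitly starts
    and ends at the depot (node 0).
    """
    served = sorted(c for route in routes for c in route)
    if served != list(range(1, len(loc))):
        raise ValueError("each customer must be visited exactly once")
    lengths = []
    for route in routes:
        if sum(demand[c] for c in route) > capacity:
            raise ValueError("route demand exceeds vehicle capacity")
        path = [0] + list(route) + [0]
        lengths.append(sum(math.dist(loc[a], loc[b]) for a, b in zip(path, path[1:])))
    return sum(lengths), max(lengths)


# Hypothetical instance: depot at the origin and three unit-demand customers.
loc = [(0, 0), (3, 4), (0, 1), (0, 2)]
demand = [0, 1, 1, 1]
total_len, longest = bicvrp_objectives([[1], [2, 3]], loc, demand, capacity=2)
```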
Bi-KP: A Bi-KP instance consists of $n$ items, where each item $j\in\{1,\cdots,n\}$ is characterized by a 3-dimensional feature vector $o_{j}=[w_{j},u_{j}]$, with $w_{j}$ denoting its weight and $u_{j}$ its profit vector. Here, $u_{j}\in\mathbb{R}^{2}$ contains the two objective-specific profit values associated with item $j$. We consider a knapsack with capacity $\mathscr{C}$, where $\mathscr{C}>w_{j}$ for each item. The goal is to select a subset of items such that the total weight does not exceed $\mathscr{C}$. Following the prior studies \[[27](https://arxiv.org/html/2605.22876#bib.bib29),[53](https://arxiv.org/html/2605.22876#bib.bib26)\], we aim to simultaneously maximize two objectives, namely the total profit of the selected items under each of the two profit components.
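The Bi-KP objectives and the capacity constraint can be sketched as follows; the function name and the small integer instance are hypothetical.

```python
def bikp_objectives(selected, weights, profits, capacity):
    """Return the two total profits of a feasible item selection.

    profits[j] is the pair of objective-specific profits of item j.
    """
    if sum(weights[j] for j in selected) > capacity:
        raise ValueError("selected items exceed the knapsack capacity")
    first = sum(profits[j][0] for j in selected)
    second = sum(profits[j][1] for j in selected)
    return first, second


# Hypothetical instance with four items.
weights = [2, 3, 4, 5]
profits = [(3, 1), (4, 2), (5, 6), (6, 4)]
objs = bikp_objectives([0, 2], weights, profits, capacity=7)
```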
Instance Augmentation: To further improve the inference performance of the solvers, following the prior studies \[[3](https://arxiv.org/html/2605.22876#bib.bib7),[11](https://arxiv.org/html/2605.22876#bib.bib69),[23](https://arxiv.org/html/2605.22876#bib.bib65)\], we apply the instance augmentation method proposed in \[[16](https://arxiv.org/html/2605.22876#bib.bib18)\]. The rationale behind instance augmentation is that a Euclidean VRP instance can be transformed into multiple equivalent instances that share the same optimal solution, e.g., by reflecting or rotating the coordinates of all nodes. For a node with coordinate $(x,y)$, there are eight standard transformations, i.e., $(x^{\prime},y^{\prime})=(x,y)$, $(y,x)$, $(x,1-y)$, $(y,1-x)$, $(1-x,y)$, $(1-y,x)$, $(1-x,1-y)$, and $(1-y,1-x)$. In this paper, we apply these transformations independently to each coordinate set. Consequently, this yields eight transformations for Bi-CVRP (one coordinate set per node), $8^{2}=64$ transformations for Bi-TSP, and $8^{3}=512$ transformations for Tri-TSP.
Test Datasets: All test datasets are taken from the prior study \[[3](https://arxiv.org/html/2605.22876#bib.bib7)\], except for Bi-TSP500, Bi-TSP1000, Bi-KP500, and Bi-KP1000, because \[[3](https://arxiv.org/html/2605.22876#bib.bib7)\] did not conduct experiments on large-scale instances. We therefore generate the test datasets for Bi-TSP500, Bi-TSP1000, Bi-KP500, and Bi-KP1000 by sampling instances under a uniform distribution, with 20 instances per problem size.
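The eight transformations and their combination across coordinate sets can be sketched as below. The names `TRANSFORMS` and `augment` are hypothetical, and the sketch only builds the transformed instances; it does not reproduce how the solver aggregates solutions across them.

```python
import math
from itertools import product

# The eight symmetries of the unit square listed above.
TRANSFORMS = [
    lambda x, y: (x, y),
    lambda x, y: (y, x),
    lambda x, y: (x, 1 - y),
    lambda x, y: (y, 1 - x),
    lambda x, y: (1 - x, y),
    lambda x, y: (1 - y, x),
    lambda x, y: (1 - x, 1 - y),
    lambda x, y: (1 - y, 1 - x),
]


def augment(coord_sets):
    """Apply every combination of transformations, one per coordinate set.

    coord_sets holds one list of (x, y) coordinates per objective, so the
    result contains 8 ** len(coord_sets) equivalent instances.
    """
    instances = []
    for combo in product(TRANSFORMS, repeat=len(coord_sets)):
        instances.append(
            [[t(x, y) for x, y in coords] for t, coords in zip(combo, coord_sets)]
        )
    return instances


nodes = [(0.2, 0.7), (0.9, 0.1)]          # hypothetical coordinates
one_set = augment([nodes])                # e.g. Bi-CVRP
two_sets = augment([nodes, nodes])        # e.g. Bi-TSP
three_sets = augment([nodes, nodes, nodes])  # e.g. Tri-TSP
```

Each transformation is a reflection or rotation of the unit square, so pairwise distances, and hence tour lengths, are unchanged.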
### 5.2 Performance Comparisons
Table 2: Performance comparison of different methods on Bi-TSP across different problem scales

| Method | Bi-TSP20 HV↑ | Bi-TSP20 Gap↓ | Bi-TSP20 Time↓ | Bi-TSP50 HV↑ | Bi-TSP50 Gap↓ | Bi-TSP50 Time↓ | Bi-TSP100 HV↑ | Bi-TSP100 Gap↓ | Bi-TSP100 Time↓ | Bi-TSP150 HV↑ | Bi-TSP150 Gap↓ | Bi-TSP150 Time↓ | Bi-TSP200 HV↑ | Bi-TSP200 Gap↓ | Bi-TSP200 Time↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WS-LKH | 0.6270 | 0.00% | 10m | 0.6415 | 0.03% | 1.8h | 0.7090 | -0.16% | 6h | 0.7149 | -1.17% | 13h | 0.7490 | -1.15% | 22h |
| NSGA-II (TEVC’02) | 0.6258 | 0.19% | 6.0h | 0.6120 | 4.63% | 6.1h | 0.6692 | 5.47% | 6.9h | 0.6659 | 5.76% | 6.8h | 0.7045 | 4.86% | 6.9h |
| MOGLS (EJOR’02) | 0.6279 | -0.14% | 1.6h | 0.6330 | 1.36% | 3.7h | 0.6854 | 3.18% | 11h | 0.6768 | 4.22% | 22h | 0.7114 | 3.93% | 38h |
| MOEA/D (TEVC’07) | 0.6241 | 0.46% | 1.7h | 0.6316 | 1.57% | 1.8h | 0.6899 | 2.54% | 2.2h | 0.6809 | 3.64% | 2.4h | 0.7139 | 3.59% | 2.7h |
| PPLS/D-C (TCYB’24) | 0.6256 | 0.22% | 26m | 0.6282 | 2.10% | 2.8h | 0.6844 | 3.32% | 11h | 0.6784 | 3.99% | 21h | 0.7106 | 4.04% | 32h |
| DRL-MOA (TCYB’21) | 0.6257 | 0.21% | 6s | 0.6360 | 0.89% | 9s | 0.6970 | 1.54% | 16s | 0.6901 | 2.34% | 36s | 0.7219 | 2.51% | 1.2m |
| MDRL (TNNLS’23) | 0.6271 | -0.02% | 5s | 0.6364 | 0.83% | 8s | 0.6969 | 1.55% | 14s | 0.6922 | 2.04% | 36s | 0.7251 | 2.08% | 1.1m |
| MDRL-Aug (TNNLS’23) | 0.6271 | -0.02% | 47s | 0.6408 | 0.14% | 1.8m | 0.7022 | 0.81% | 5.4m | 0.6976 | 1.27% | 37m | 0.7299 | 1.43% | 1.1h |
| EMNH (NeurIPS’23) | 0.6271 | -0.02% | 5s | 0.6364 | 0.83% | 8s | 0.6969 | 1.55% | 15s | 0.6930 | 1.92% | 37s | 0.7260 | 1.96% | 1.1m |
| EMNH-Aug (NeurIPS’23) | 0.6271 | -0.02% | 46s | 0.6408 | 0.14% | 1.8m | 0.7023 | 0.79% | 5.4m | 0.6983 | 1.17% | 39m | 0.7307 | 1.32% | 1.1h |
| PMOCO (ICLR’22) | 0.6256 | 0.22% | 2s | 0.6354 | 0.98% | 4s | 0.6969 | 1.55% | 11s | 0.6910 | 2.21% | 31s | 0.7231 | 2.35% | 1.0m |
| PMOCO-Aug (ICLR’22) | 0.6270 | 0.00% | 36s | 0.6395 | 0.34% | 1.9m | 0.7037 | 0.59% | 9m | 0.6967 | 1.40% | 32m | 0.7283 | 1.65% | 1.1h |
| CNH (TNNLS’25) | 0.6270 | 0.00% | 3s | 0.6387 | 0.47% | 6s | 0.7019 | 0.85% | 15s | 0.6985 | 1.15% | 35s | 0.7292 | 1.53% | 1.1m |
| CNH-Aug (TNNLS’25) | 0.6271 | -0.02% | 39s | 0.6410 | 0.11% | 2.9m | 0.7054 | 0.35% | 12m | 0.7025 | 0.58% | 34m | 0.7343 | 0.84% | 1.2h |
| WE-CA (ICLR’25) | 0.6271 | -0.02% | 2s | 0.6392 | 0.39% | 4s | 0.7034 | 0.64% | 11s | 0.7008 | 0.82% | 30s | 0.7346 | 0.80% | 1.0m |
| WE-CA-Aug (ICLR’25) | 0.6271 | -0.02% | 35s | 0.6413 | 0.06% | 2.3m | 0.7066 | 0.18% | 10m | 0.7044 | 0.31% | 31m | 0.7381 | 0.32% | 1.0h |
| PA-MoE-W (OpenReview’25) | 0.6271 | -0.02% | 6s | 0.6396 | 0.33% | 13s | 0.7042 | 0.52% | 25s | 0.7019 | 0.67% | 59s | 0.7360 | 0.61% | 1.7m |
| PA-MoE-W-Aug (OpenReview’25) | 0.6271 | -0.02% | 1.6m | 0.6425 | -0.12% | 4.6m | 0.7070 | 0.13% | 19m | 0.7052 | 0.20% | 55m | 0.7391 | 0.19% | 1.9h |
| POCCO-W (NeurIPS’25) | 0.6275 | -0.08% | 6s | 0.6411 | 0.09% | 10s | 0.7054 | 0.35% | 27s | 0.7029 | 0.52% | 1.0m | 0.7362 | 0.58% | 1.8m |
| POCCO-W-Aug (NeurIPS’25) | 0.6270 | 0.00% | 1.5m | 0.6418 | -0.02% | 5.6m | 0.7077 | 0.03% | 21m | 0.7062 | 0.06% | 59m | 0.7399 | 0.08% | 2.1h |
| WeCon (ours) | 0.6271 | -0.02% | 3s | 0.6407 | 0.16% | 6s | 0.7056 | 0.32% | 15s | 0.7035 | 0.44% | 36s | 0.7374 | 0.42% | 1.2m |
| WeCon-Aug (ours) | 0.6270 | 0.00% | 40s | 0.6415 | 0.03% | 3m | 0.7077 | 0.03% | 12m | 0.7063 | 0.04% | 36m | 0.7402 | 0.04% | 1.2h |
| WeCon-CCO (ours) | 0.6273 | -0.05% | 5s | 0.6412 | 0.08% | 11s | 0.7061 | 0.25% | 30s | 0.7040 | 0.37% | 1.1m | 0.7379 | 0.35% | 1.9m |
| WeCon-CCO-Aug (ours) | 0.6270 | 0.00% | 1.4m | 0.6417 | 0.00% | 6.2m | 0.7079 | 0.00% | 23m | 0.7066 | 0.00% | 58m | 0.7405 | 0.00% | 1.8h |
In Tables [2](https://arxiv.org/html/2605.22876#S5.T2) to [5](https://arxiv.org/html/2605.22876#S5.T5), we report the results of different methods on small- and medium-scale Bi-TSP, small-scale Bi-CVRP, Bi-KP, and Tri-TSP instances, respectively. As shown in Table [2](https://arxiv.org/html/2605.22876#S5.T2), our WeCon-Aug achieves HV values comparable to those of POCCO-W-Aug: the Gap differences (POCCO-W-Aug minus WeCon-Aug, in percentage points) are 0.00, -0.05, 0.00, +0.02, and +0.04 on Bi-TSP20, Bi-TSP50, Bi-TSP100, Bi-TSP150, and Bi-TSP200, respectively, while inference time is reduced by approximately 55%, 46%, 42%, 38%, and 43% on the corresponding datasets. This speedup mainly stems from architectural differences. Specifically, POCCO-W employs the MoE-based CCO module, whose gating-and-routing mechanism assigns subproblems to different experts, incurring additional dispatch and aggregation operations. Consequently, POCCO-W incurs higher runtime, which may limit its practicality in time-constrained scenarios. For example, the urban traffic signal control task adjusts signal timing plans using real-time traffic data to balance multiple conflicting objectives such as delay time, queue length, and number of stops. Because traffic conditions typically change rapidly, solutions must be recomputed within a few seconds to remain effective, making this task a representative time-sensitive real-world MOCOP \[[34](https://arxiv.org/html/2605.22876#bib.bib59),[28](https://arxiv.org/html/2605.22876#bib.bib62)\]. In contrast, our RF module avoids such routing overhead, delivering faster inference while maintaining high solution quality and diversity by mitigating weight-signal dilution during decoding. Thus, WeCon is well suited to time-sensitive scenarios such as traffic signal control.
More importantly, without any instance augmentation, WeCon achieves significantly higher HV than POCCO-W, reducing the Gap by 0.08, 0.16, 0.22, and 0.40 percentage points on Bi-TSP150, Bi-TSP200, Tri-TSP50, and Tri-TSP100, respectively. This advantage is mainly attributable to our encoder design, which provides more informative weight-conditioned context. WeCon-CCO achieves the best overall performance across all MOCOP variants and scales, albeit at the cost of increased inference time relative to WeCon (on average, $1.7\times$). Its performance advantage becomes more pronounced as the problem size increases, indicating a stronger ability to explore the solution space and approximate high-quality Pareto fronts across scales (see Section [5.3](https://arxiv.org/html/2605.22876#S5.SS3) for results on large-scale Bi-TSP and Bi-KP instances).
Table 3: Performance comparison of different methods on Bi-CVRP across different problem scales

| Method | Bi-CVRP20 HV↑ | Bi-CVRP20 Gap↓ | Bi-CVRP20 Time↓ | Bi-CVRP50 HV↑ | Bi-CVRP50 Gap↓ | Bi-CVRP50 Time↓ | Bi-CVRP100 HV↑ | Bi-CVRP100 Gap↓ | Bi-CVRP100 Time↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NSGA-II (TEVC’02) | 0.4275 | 0.63% | 6.4h | 0.3896 | 5.21% | 8.8h | 0.3620 | 11.49% | 9.4h |
| MOGLS (EJOR’02) | 0.4278 | 0.56% | 9.0h | 0.3984 | 3.07% | 20h | 0.3875 | 5.26% | 72h |
| MOEA/D (TEVC’07) | 0.4255 | 1.09% | 2.3h | 0.4000 | 2.68% | 2.9h | 0.3953 | 3.35% | 5.0h |
| PPLS/D-C (TCYB’24) | 0.4287 | 0.35% | 1.6h | 0.4007 | 2.51% | 9.7h | 0.3946 | 3.52% | 38h |
| DRL-MOA (TCYB’21) | 0.4287 | 0.35% | 8s | 0.4076 | 0.83% | 12s | 0.4055 | 0.86% | 21s |
| MDRL (TNNLS’23) | 0.4291 | 0.26% | 6s | 0.4082 | 0.68% | 13s | 0.4056 | 0.83% | 22s |
| MDRL-Aug (TNNLS’23) | 0.4294 | 0.19% | 12s | 0.4092 | 0.44% | 36s | 0.4072 | 0.44% | 2.8m |
| EMNH (NeurIPS’23) | 0.4299 | 0.07% | 7s | 0.4098 | 0.29% | 12s | 0.4072 | 0.44% | 22s |
| EMNH-Aug (NeurIPS’23) | 0.4302 | 0.00% | 12s | 0.4106 | 0.10% | 35s | 0.4079 | 0.27% | 2.8m |
| PMOCO (ICLR’22) | 0.4267 | 0.81% | 3s | 0.4036 | 1.80% | 7s | 0.3913 | 4.33% | 16s |
| PMOCO-Aug (ICLR’22) | 0.4294 | 0.19% | 6s | 0.4080 | 0.73% | 21s | 0.3969 | 2.96% | 1.6m |
| CNH (TNNLS’25) | 0.4287 | 0.35% | 4s | 0.4087 | 0.56% | 8s | 0.4065 | 0.61% | 15s |
| CNH-Aug (TNNLS’25) | 0.4299 | 0.07% | 7s | 0.4101 | 0.22% | 23s | 0.4077 | 0.32% | 1.7m |
| WE-CA (ICLR’25) | 0.4290 | 0.28% | 3s | 0.4088 | 0.54% | 7s | 0.4069 | 0.51% | 15s |
| WE-CA-Aug (ICLR’25) | 0.4300 | 0.05% | 6s | 0.4103 | 0.17% | 19s | 0.4081 | 0.22% | 1.5m |
| PA-MoE-W (OpenReview’25) | 0.4291 | 0.26% | 7s | 0.4095 | 0.36% | 14s | 0.4073 | 0.42% | 31s |
| PA-MoE-W-Aug (OpenReview’25) | 0.4301 | 0.02% | 14s | 0.4106 | 0.10% | 41s | 0.4084 | 0.15% | 2.9m |
| POCCO-W (NeurIPS’25) | 0.4294 | 0.19% | 7s | 0.4102 | 0.19% | 15s | 0.4084 | 0.15% | 34s |
| POCCO-W-Aug (NeurIPS’25) | 0.4301 | 0.02% | 18s | 0.4108 | 0.05% | 51s | 0.4089 | 0.02% | 3.3m |
| WeCon (ours) | 0.4298 | 0.09% | 4s | 0.4103 | 0.17% | 9s | 0.4083 | 0.17% | 19s |
| WeCon-Aug (ours) | 0.4303 | -0.02% | 8s | 0.4109 | 0.02% | 25s | 0.4088 | 0.05% | 1.9m |
| WeCon-CCO (ours) | 0.4299 | 0.07% | 8s | 0.4105 | 0.12% | 15s | 0.4085 | 0.12% | 35s |
| WeCon-CCO-Aug (ours) | 0.4302 | 0.00% | 15s | 0.4110 | 0.00% | 47s | 0.4090 | 0.00% | 3.2m |
Table 4: Performance comparison of different methods on Bi-KP across different problem scales

| Method | Bi-KP50 HV↑ | Bi-KP50 Gap↓ | Bi-KP50 Time↓ | Bi-KP100 HV↑ | Bi-KP100 Gap↓ | Bi-KP100 Time↓ | Bi-KP200 HV↑ | Bi-KP200 Gap↓ | Bi-KP200 Time↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WS-DP | 0.3561 | 0.00% | 22m | 0.4532 | 0.02% | 2h | 0.3601 | 0.03% | 5.8h |
| NSGA-II (TEVC’02) | 0.3547 | 0.39% | 7.8h | 0.4520 | 0.29% | 8.0h | 0.3590 | 0.33% | 8.4h |
| MOGLS (EJOR’02) | 0.3540 | 0.59% | 5.8h | 0.4510 | 0.51% | 10h | 0.3582 | 0.56% | 18h |
| MOEA/D (TEVC’07) | 0.3540 | 0.59% | 1.6h | 0.4508 | 0.55% | 1.7h | 0.3581 | 0.58% | 1.8h |
| PPLS/D-C (TCYB’24) | 0.3528 | 0.93% | 18m | 0.4480 | 1.17% | 47m | 0.3541 | 1.69% | 1.5h |
| DRL-MOA (TCYB’21) | 0.3559 | 0.06% | 8s | 0.4531 | 0.04% | 15s | 0.3601 | 0.03% | 32s |
| MDRL (TNNLS’23) | 0.3530 | 0.87% | 7s | 0.4532 | 0.02% | 18s | 0.3601 | 0.03% | 35s |
| EMNH (NeurIPS’23) | 0.3561 | 0.00% | 7s | 0.4535 | -0.04% | 17s | 0.3603 | -0.03% | 48s |
| PMOCO (ICLR’22) | 0.3552 | 0.25% | 5s | 0.4523 | 0.22% | 9s | 0.3595 | 0.19% | 35s |
| CNH (TNNLS’25) | 0.3556 | 0.14% | 5s | 0.4527 | 0.13% | 10s | 0.3598 | 0.11% | 35s |
| WE-CA (ICLR’25) | 0.3558 | 0.08% | 5s | 0.4531 | 0.04% | 10s | 0.3602 | 0.00% | 34s |
| PA-MoE-W (OpenReview’25) | 0.3507 | 1.52% | 10s | 0.4430 | 2.27% | 21s | 0.3309 | 8.13% | 53s |
| POCCO-W (NeurIPS’25) | 0.3561 | 0.00% | 11s | 0.4534 | -0.02% | 23s | 0.3603 | -0.03% | 1.0m |
| WeCon (ours) | 0.3558 | 0.08% | 5s | 0.4531 | 0.04% | 10s | 0.3602 | 0.00% | 35s |
| WeCon-CCO (ours) | 0.3561 | 0.00% | 11s | 0.4533 | 0.00% | 23s | 0.3602 | 0.00% | 1.0m |
Table 5: Performance comparison of different methods on Tri-TSP across different problem scales

| Method | Tri-TSP20 HV↑ | Tri-TSP20 Gap↓ | Tri-TSP20 Time↓ | Tri-TSP50 HV↑ | Tri-TSP50 Gap↓ | Tri-TSP50 Time↓ | Tri-TSP100 HV↑ | Tri-TSP100 Gap↓ | Tri-TSP100 Time↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WS-LKH | 0.4712 | 0.00% | 12m | 0.4440 | -0.11% | 1.9h | 0.5076 | -0.59% | 6.6h |
| NSGA-II (TEVC’02) | 0.4238 | 10.06% | 7.1h | 0.2858 | 35.56% | 7.5h | 0.2824 | 44.03% | 9.0h |
| MOGLS (EJOR’02) | 0.4701 | 0.23% | 1.5h | 0.4211 | 5.05% | 4.1h | 0.4254 | 15.70% | 13h |
| MOEA/D (TEVC’07) | 0.4702 | 0.21% | 1.9h | 0.4314 | 2.73% | 2.2h | 0.4511 | 10.60% | 2.4h |
| PPLS/D-C (TCYB’24) | 0.4698 | 0.30% | 1.4h | 0.4174 | 5.89% | 3.9h | 0.4376 | 13.28% | 14h |
| DRL-MOA (TCYB’21) | 0.4699 | 0.28% | 6s | 0.4303 | 2.98% | 9s | 0.4806 | 4.76% | 18s |
| MDRL (TNNLS’23) | 0.4699 | 0.28% | 5s | 0.4317 | 2.66% | 10s | 0.4852 | 3.84% | 17s |
| MDRL-Aug (TNNLS’23) | 0.4712 | 0.00% | 4.2m | 0.4408 | 0.61% | 25m | 0.4958 | 1.74% | 1.6h |
| EMNH (NeurIPS’23) | 0.4699 | 0.28% | 5s | 0.4324 | 2.50% | 10s | 0.4866 | 3.57% | 17s |
| EMNH-Aug (NeurIPS’23) | 0.4712 | 0.00% | 4.2m | 0.4418 | 0.38% | 25m | 0.4973 | 1.45% | 1.6h |
| PMOCO (ICLR’22) | 0.4693 | 0.40% | 3s | 0.4315 | 2.71% | 6s | 0.4858 | 3.73% | 12s |
| PMOCO-Aug (ICLR’22) | 0.4712 | 0.00% | 4.9m | 0.4409 | 0.59% | 22m | 0.4956 | 1.78% | 1.6h |
| CNH (TNNLS’25) | 0.4698 | 0.30% | 4s | 0.4358 | 1.74% | 6s | 0.4931 | 2.28% | 14s |
| CNH-Aug (TNNLS’25) | 0.4704 | 0.17% | 5.3m | 0.4409 | 0.59% | 24m | 0.4996 | 0.99% | 1.6h |
| WE-CA (ICLR’25) | 0.4707 | 0.11% | 2s | 0.4389 | 1.04% | 5s | 0.4975 | 1.41% | 11s |
| WE-CA-Aug (ICLR’25) | 0.4712 | 0.00% | 4.8m | 0.4432 | 0.07% | 20m | 0.5032 | 0.28% | 1.3h |
| PA-MoE-W (OpenReview’25) | 0.4709 | 0.06% | 6s | 0.4391 | 0.99% | 12s | 0.4975 | 1.41% | 26s |
| PA-MoE-W-Aug (OpenReview’25) | 0.4712 | 0.00% | 9.5m | 0.4432 | 0.07% | 38m | 0.5034 | 0.24% | 2.7h |
| POCCO-W (NeurIPS’25) | 0.4710 | 0.04% | 5s | 0.4403 | 0.72% | 13s | 0.4985 | 1.21% | 28s |
| POCCO-W-Aug (NeurIPS’25) | 0.4712 | 0.00% | 12m | 0.4437 | -0.05% | 44m | 0.5048 | -0.04% | 2.9h |
| WeCon (ours) | 0.4711 | 0.02% | 3s | 0.4413 | 0.50% | 6s | 0.5005 | 0.81% | 15s |
| WeCon-Aug (ours) | 0.4712 | 0.00% | 5.6m | 0.4437 | -0.05% | 25m | 0.5049 | -0.06% | 1.7h |
| WeCon-CCO (ours) | 0.4711 | 0.02% | 6s | 0.4411 | 0.54% | 12s | 0.5001 | 0.89% | 30s |
| WeCon-CCO-Aug (ours) | 0.4712 | 0.00% | 8.3m | 0.4435 | 0.00% | 42m | 0.5046 | 0.00% | 2.9h |
### 5.3 Results on Large-scale MOCOP Instances
In this subsection, following the prior art \[[11](https://arxiv.org/html/2605.22876#bib.bib69)\], we assess the performance of WeCon and WeCon-CCO on large-scale MOCOP instances, including Bi-TSP500, Bi-TSP1000, Bi-KP500, and Bi-KP1000. As shown in Table [6](https://arxiv.org/html/2605.22876#S5.T6), WeCon-CCO achieves the best performance on the highly challenging Bi-TSP500 and Bi-TSP1000 instances, followed by WeCon. On large-scale Bi-KP instances, most methods exhibit substantial performance degradation, whereas WeCon attains the best performance. These findings further validate the effectiveness of our proposed encoder and decoder architectures and EPO strategy.

Table 6: Performance comparison on large-scale instances

| Method | Bi-TSP500 HV↑ | Bi-TSP500 Time↓ | Bi-TSP1000 HV↑ | Bi-TSP1000 Time↓ | Bi-KP500 HV↑ | Bi-KP500 Time↓ | Bi-KP1000 HV↑ | Bi-KP1000 Time↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WE-CA | 0.7335 | 1.3m | 0.6833 | 9.2m | 0.6497 | 41s | 0.3166 | 4.0m |
| PA-MoE-W | 0.7612 | 2.6m | 0.7489 | 15m | 0.2704 | 1.2m | 0.2334 | 4.8m |
| POCCO-W | 0.7442 | 2.2m | 0.7037 | 12m | 0.6577 | 1.1m | 0.5733 | 4.3m |
| WeCon | 0.7664 | 1.4m | 0.7733 | 9.6m | 0.8657 | 45s | 0.7778 | 4.2m |
| WeCon-CCO | 0.7672 | 2.3m | 0.7735 | 12m | 0.6718 | 1.2m | 0.5225 | 4.3m |
### 5.4 Results on Real-world MOCOP Instances
Table 7: Performance comparison on real-world instances

| Method | KroAB100 HV↑ | KroAB100 Time↓ | KroAB150 HV↑ | KroAB150 Time↓ | KroAB200 HV↑ | KroAB200 Time↓ |
| --- | --- | --- | --- | --- | --- | --- |
| WS-LKH | 0.7022 | 2.3m | 0.7017 | 4.0m | 0.7430 | 5.6m |
| MOEA/D | 0.6836 | 5.8m | 0.6710 | 7.1m | 0.7106 | 7.3m |
| NSGA-II | 0.6676 | 7.0m | 0.6552 | 7.9m | 0.7011 | 8.4m |
| MOGLS | 0.6817 | 52m | 0.6671 | 1.3h | 0.7083 | 1.6h |
| PPLS/D-C | 0.6785 | 38m | 0.6659 | 1.4h | 0.7100 | 3.8h |
| DRL-MOA | 0.6903 | 10s | 0.6794 | 12s | 0.7185 | 18s |
| MDRL | 0.6881 | 9s | 0.6831 | 11s | 0.7209 | 16s |
| MDRL-Aug | 0.6950 | 10s | 0.6890 | 16s | 0.7261 | 25s |
| EMNH | 0.6900 | 9s | 0.6832 | 11s | 0.7217 | 16s |
| EMNH-Aug | 0.6958 | 10s | 0.6892 | 16s | 0.7270 | 25s |
| PMOCO | 0.6878 | 7s | 0.6819 | 10s | 0.7193 | 15s |
| PMOCO-Aug | 0.6937 | 11s | 0.6886 | 15s | 0.7251 | 24s |
| CNH | 0.6947 | 8s | 0.6892 | 13s | 0.7250 | 18s |
| CNH-Aug | 0.6980 | 12s | 0.6938 | 16s | 0.7303 | 25s |
| WE-CA | 0.6948 | 7s | 0.6924 | 12s | 0.7317 | 16s |
| WE-CA-Aug | 0.6992 | 10s | 0.6958 | 15s | 0.7347 | 24s |
| PA-MoE-W | 0.6965 | 19s | 0.6926 | 29s | 0.7325 | 38s |
| PA-MoE-W-Aug | 0.6998 | 22s | 0.6969 | 33s | 0.7361 | 48s |
| POCCO-W | 0.6978 | 20s | 0.6938 | 31s | 0.7331 | 41s |
| POCCO-W-Aug | 0.7006 | 22s | 0.6971 | 34s | 0.7360 | 50s |
| WeCon | 0.6974 | 8s | 0.6946 | 13s | 0.7337 | 17s |
| WeCon-Aug | 0.7000 | 10s | 0.6975 | 16s | 0.7369 | 26s |
| WeCon-CCO | 0.6991 | 19s | 0.6948 | 30s | 0.7342 | 39s |
| WeCon-CCO-Aug | 0.7006 | 23s | 0.6981 | 34s | 0.7375 | 50s |
Figure 4: Pareto fronts of benchmark instances \(panels a to c\)\.

In this subsection, we evaluate the cross\-distribution generalization of WeCon and WeCon\-CCO\. Specifically, following prior work \[[3](https://arxiv.org/html/2605.22876#bib.bib7),[11](https://arxiv.org/html/2605.22876#bib.bib69)\], we select three widely used benchmark instances adapted from TSPLIB \[[30](https://arxiv.org/html/2605.22876#bib.bib66)\], namely KroAB100, KroAB150, and KroAB200\. These instances exhibit underlying distributions that differ substantially from those of synthetic datasets and are difficult to characterize\. As shown in Table [7](https://arxiv.org/html/2605.22876#S5.T7), WeCon surpasses POCCO\-W on KroAB150 and KroAB200\. WeCon\-CCO achieves the best overall performance among the neural solvers, demonstrating strong cross\-distribution generalization\. In addition, Figure [4](https://arxiv.org/html/2605.22876#S5.F4) visualizes the Pareto fronts produced by different methods\. Notably, many solutions generated by WeCon\-CCO dominate those of WE\-CA, PA\-MoE\-W, and POCCO\-W, indicating strong performance on the associated decomposed subproblems\.
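To make the dominance claim concrete, the following minimal Python sketch checks Pareto dominance and measures what fraction of one front is dominated by another\. It assumes that every objective is minimized, as with the two tour costs in Bi\-TSP\. The function names and the sample points are hypothetical and are not taken from the paper or its code\.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a Pareto-dominates b when every objective is minimized."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def dominated_fraction(front_a: List[Point], front_b: List[Point]) -> float:
    """Fraction of points in front_b dominated by at least one point in front_a."""
    if not front_b:
        return 0.0
    hits = sum(any(dominates(a, b) for a in front_a) for b in front_b)
    return hits / len(front_b)


# Hypothetical bi-objective tour costs (illustrative values only).
front_a = [(1.0, 4.0), (2.0, 2.5), (4.0, 1.0)]
front_b = [(1.5, 4.5), (2.0, 2.5), (3.0, 3.0)]
print(dominated_fraction(front_a, front_b))
```

In this toy case two of the three points in `front_b` are dominated; the shared point `(2.0, 2.5)` is not, because dominance requires a strict improvement in at least one objective\.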
### 5\.5 Analyses of the Similarity between the Weight and Instance Embeddings
Figure 5: Similarity between weight and instance embeddings \(panels a and b\)\.

In this subsection, we visualize the cosine similarity between the weight and instance embeddings produced by WE\-CA and by WeCon\. As shown in Figure [5](https://arxiv.org/html/2605.22876#S5.F5), WeCon learns more discriminative instance and weight embeddings \(warmer colors\), suggesting that it captures more informative weight\-conditioned context\.
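The quantity behind such a heatmap can be sketched in a few lines of Python\. This is only an illustration of pairwise cosine similarity: the function names, the embedding dimension, and the toy vectors are assumptions, and the paper's actual embeddings come from the trained encoders\.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(weight_embs, inst_embs) -> List[List[float]]:
    """Rows index weight embeddings, columns index instance (node) embeddings."""
    return [[cosine_similarity(w, h) for h in inst_embs] for w in weight_embs]


# Toy 2-dimensional embeddings (hypothetical values).
weight_embs = [[1.0, 0.0], [0.0, 1.0]]
inst_embs = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]
sim = similarity_matrix(weight_embs, inst_embs)
for row in sim:
    print([round(x, 3) for x in row])
```

Each row of `sim` would correspond to one row of a heatmap like Figure 5; values near 1 indicate a weight embedding closely aligned with an instance embedding\.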
### 5\.6 Analyses of Weight\-Signal Dilution during Decoding
Figure 6: Node selection under varying weight vectors \(panels a to c\)\.

In this subsection, we visualize the node selection distributions of different decoders across a sweep of weight vectors \($\mathcal{N}=101$\) to assess whether weight signals are diluted during decoding\. As shown in Figure [6](https://arxiv.org/html/2605.22876#S5.F6), WE\-CA selects nearly the same nodes across weight vectors\. In contrast, WeCon selects different nodes as the weight vector changes, indicating that weight\-signal dilution is mitigated\. Although POCCO\-W exhibits greater variation due to its MoE\-based design, its selections are less consistent across nearby weight vectors\. Moreover, integrating MoE incurs additional runtime, which hinders its applicability to time\-sensitive MOCOPs\.
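The diagnostic can be illustrated with a toy Python sketch\. Two things here are assumptions rather than the paper's procedure: the 101 weight vectors are taken to be evenly spaced pairs $(\lambda, 1-\lambda)$, and `select_node` is a hypothetical stand\-in for a decoder step that simply picks the node with the smallest weighted\-sum cost\.

```python
from typing import List, Tuple


def weight_sweep(n: int = 101) -> List[Tuple[float, float]]:
    """n evenly spaced bi-objective weight vectors (lam, 1 - lam)."""
    return [(i / (n - 1), 1.0 - i / (n - 1)) for i in range(n)]


def select_node(weight: Tuple[float, float],
                node_costs: List[Tuple[float, float]]) -> int:
    """Toy weight-aware rule: index of the node with the smallest weighted-sum cost."""
    scores = [weight[0] * c1 + weight[1] * c2 for c1, c2 in node_costs]
    return min(range(len(scores)), key=scores.__getitem__)


# Hypothetical per-node costs for the two objectives at one decoding step.
node_costs = [(0.1, 0.9), (0.4, 0.4), (0.9, 0.1)]
choices = [select_node(w, node_costs) for w in weight_sweep()]
distinct = sorted(set(choices))
print(distinct)
```

A weight\-sensitive selector visits several nodes over the sweep, as this toy rule does\. A decoder suffering from weight\-signal dilution would instead yield a single distinct choice, which is the symptom described for WE\-CA\.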
### 5\.7 Model Size and Efficiency Analyses
Table 8: Model size and training cost comparison

| Method | \#Param\. | GPU Memory \(MB\) | Training Time |
| --- | --- | --- | --- |
| WE\-CA \(ICLR’25\) | 1\.5M | 5926 | 6\.7h |
| PA\-MoE\-W \(OpenReview’25\) | 4\.3M | 11984 | 22\.3h |
| POCCO\-W \(NeurIPS’25\) | 2\.0M | 11920 | 24\.5h |
| WeCon \(ours\) | 5\.4M | 9016 | 9\.1h |
| WeCon\-CCO \(ours\) | 5\.9M | 14314 | 24\.7h |
Table 9: Performance of different solvers under increased model size

| Method | Bi\-TSP50 HV ↑ | Bi\-TSP50 Time ↓ | Bi\-TSP50 HV ↑ \(aug\.\) | Bi\-TSP50 Time ↓ \(aug\.\) | Bi\-TSP100 HV ↑ | Bi\-TSP100 Time ↓ | Bi\-TSP100 HV ↑ \(aug\.\) | Bi\-TSP100 Time ↓ \(aug\.\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WE\-CA | 0\.6392 | 4s | 0\.6413 | 2\.3m | 0\.7034 | 11s | 0\.7066 | 11m |
| WE\-CA \(5\.6M\) | 0\.6389† | 6s | 0\.6412† | 2\.9m | 0\.7027† | 13s | 0\.7061† | 12m |
| WE\-CA \(4\.9M\) | 0\.6396 | 6s | 0\.6414 | 2\.8m | 0\.7041 | 13s | 0\.7069 | 12m |
| POCCO\-W | 0\.6411 | 10s | 0\.6418 | 5\.6m | 0\.7054 | 27s | 0\.7077 | 21m |
| POCCO\-W \(6\.1M\) | 0\.6409† | 14s | 0\.6418 | 7\.5m | 0\.7052† | 30s | 0\.7076† | 24m |
| POCCO\-W \(4\.4M\) | 0\.6408† | 14s | 0\.6417† | 6\.9m | 0\.7051† | 30s | 0\.7071† | 23m |
| WeCon | 0\.6407 | 6s | 0\.6415 | 3m | 0\.7056 | 15s | 0\.7077 | 12m |

Note\. † denotes a performance decrease compared to the corresponding original model with fewer parameters\. Columns marked \(aug\.\) report results obtained with instance augmentation \(highlighted in gray in the original layout\)\.
In this subsection, we compare the number of parameters, GPU memory consumption, and training time of WE\-CA, PA\-MoE\-W, POCCO\-W, WeCon, and WeCon\-CCO\. As shown in Table [8](https://arxiv.org/html/2605.22876#S5.T8), WeCon uses more parameters \(approximately $2.5\times$ more than POCCO\-W\)\. However, it requires only about 37\.1% and 60% of POCCO\-W’s training and inference time, respectively \(see Tables [2](https://arxiv.org/html/2605.22876#S5.T2) to [5](https://arxiv.org/html/2605.22876#S5.T5) for inference time\), and it also consumes less GPU memory\. This efficiency gap is largely attributable to the MoE structure in POCCO\-W, whose gating\-and\-routing process can incur additional dispatch/aggregation operations, increasing both runtime and GPU memory consumption\. In contrast, the carefully designed architecture enables WeCon to achieve strong performance with faster inference\. Because all solvers have only a few million parameters, the additional parameters introduced by WeCon are acceptable in most practical scenarios\. In such settings, training and inference time are often more critical than parameter count, making WeCon particularly suitable\. Moreover, simply increasing the number of parameters does not necessarily improve HV\. For example, WeCon has a similar parameter budget to PA\-MoE\-W, which incorporates MoE into the encoder, yet PA\-MoE\-W performs substantially worse than WeCon\.
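The ratios quoted from Table 8 can be reproduced with a short calculation\. The sketch below uses only the rounded table entries for POCCO\-W and WeCon; the variable names are illustrative, and the 60% inference\-time figure is not recomputed because it depends on Tables 2 to 5\.

```python
# Values copied from Table 8 (rounded as reported there).
params_m = {"POCCO-W": 2.0, "WeCon": 5.4}        # millions of parameters
train_hours = {"POCCO-W": 24.5, "WeCon": 9.1}    # training time in hours
gpu_mem = {"POCCO-W": 11920, "WeCon": 9016}      # GPU memory as listed

train_ratio = train_hours["WeCon"] / train_hours["POCCO-W"]
mem_ratio = gpu_mem["WeCon"] / gpu_mem["POCCO-W"]
param_ratio = params_m["WeCon"] / params_m["POCCO-W"]

print(f"training time: {train_ratio:.1%} of POCCO-W")   # 37.1%
print(f"GPU memory:    {mem_ratio:.1%} of POCCO-W")     # 75.6%
print(f"parameters:    {param_ratio:.1f}x POCCO-W")     # 2.7x
```

With the rounded entries, the parameter ratio comes out at about 2\.7, slightly above the approximate figure quoted in the text, while the training\-time ratio matches the stated 37\.1%\.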
To further demonstrate that the effectiveness of WeCon is not merely due to having more parameters, we scale up WE\-CA and POCCO\-W in two ways: increasing model depth and increasing model width\. Specifically, we first expand their encoders from 6 to 24 layers, resulting in models with 5\.6M and 6\.1M parameters, respectively\. We then increase their hidden dimensions, yielding models with 4\.9M and 4\.4M parameters, respectively\. As shown in Table [9](https://arxiv.org/html/2605.22876#S5.T9), naively increasing depth degrades the performance of both WE\-CA and POCCO\-W\. Although increasing width improves WE\-CA, its performance still falls short of WeCon, while the width\-scaled POCCO\-W performs below both the original POCCO\-W and its depth\-scaled variant\. These results further support the encoder and decoder architecture design of WeCon\.
### 5\.8 Ablation Studies
In this subsection, we conduct ablation studies to validate the effectiveness of WeCon’s encoder and decoder designs, as well as the proposed EPO strategy\. Specifically, w/o Encoder replaces the proposed encoder with the encoder of \[[3](https://arxiv.org/html/2605.22876#bib.bib7)\]; w/o GRF uses an encoder that comprises only an MHSA block and two MHA blocks; w/o RF uses a decoder that comprises only a single MHA layer to compute the vector $\bm{q}_{c}^{t}$; and w/ RL, w/ PO, and w/ BOPO replace the proposed EPO strategy with REINFORCE, PO \[[29](https://arxiv.org/html/2605.22876#bib.bib12)\], and BOPO \[[22](https://arxiv.org/html/2605.22876#bib.bib11)\], respectively\. As shown in Table [10](https://arxiv.org/html/2605.22876#S5.T10), removing any component of WeCon degrades HV\. Moreover, the proposed EPO strategy outperforms the other training strategies considered\. This improvement stems from EPO’s ability to sample sufficiently high\-quality solutions, thereby constructing more informative preference pairs and improving training effectiveness\. Overall, these results validate the contributions of all proposed components\.
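Table 10: Ablation study results on different design choices

| Method | Bi\-TSP100 HV ↑ | Bi\-TSP100 Time ↓ | Bi\-TSP100 HV ↑ \(aug\.\) | Bi\-TSP100 Time ↓ \(aug\.\) | Bi\-TSP150 HV ↑ | Bi\-TSP150 Time ↓ | Bi\-TSP150 HV ↑ \(aug\.\) | Bi\-TSP150 Time ↓ \(aug\.\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| w/o Encoder | 0\.7050 | 13s | 0\.7073 | 11m | 0\.7028 | 35s | 0\.7058 | 35m |
| w/o GRF | 0\.7053 | 15s | 0\.7075 | 11m | 0\.7032 | 36s | 0\.7060 | 36m |
| w/o RF | 0\.7045 | 12s | 0\.7070 | 9\.8m | 0\.7020 | 30s | 0\.7051 | 30m |
| w/ RL | 0\.7053 | 14s | 0\.7074 | 12m | 0\.7030 | 36s | 0\.7059 | 36m |
| w/ PO | 0\.7052 | 15s | 0\.7075 | 12m | 0\.7030 | 36s | 0\.7060 | 36m |
| w/ BOPO | 0\.7025 | 15s | 0\.7058 | 12m | 0\.6991 | 36s | 0\.7030 | 36m |
| WeCon | 0\.7056 | 15s | 0\.7077 | 12m | 0\.7035 | 36s | 0\.7063 | 36m |

Note\. Columns marked \(aug\.\) report results obtained with instance augmentation \(highlighted in gray in the original layout\)\.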
Table 11: Sensitivity analyses of different parameter choices \(HV\)

| Setting | Bi\-TSP50 | Bi\-TSP100 | Bi\-TSP150 |
| --- | --- | --- | --- |
| $k=4, c=8$ | 0\.6406 | 0\.7052 | 0\.7033 |
| $k=6, c=8$ | 0\.6407 | 0\.7053 | 0\.7031 |
| $k=5, c=6$ | 0\.6404 | 0\.7048 | 0\.7023 |
| $k=5, c=10$ | 0\.6407 | 0\.7054 | 0\.7033 |
| $k=5, c=12$ | 0\.6406 | 0\.7052 | 0\.7031 |
| $k=5, c=8$ \(adopted\) | 0\.6407 | 0\.7056 | 0\.7035 |
In Table [11](https://arxiv.org/html/2605.22876#S5.T11), we present the effect of $k$ and $c$ on the performance of WeCon\. Overall, WeCon is insensitive to these hyperparameters within the tested ranges: HV varies by at most 0\.0012 across settings at any scale\. The adopted configuration \($k=5, c=8$\) achieves the best HV at every scale \(tied on Bi\-TSP50\)\. Hence, our predefined $k$ and $c$ are reasonable\.
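Table 12: Ablation study on different decomposition techniques

| Method | Bi\-TSP50 HV ↑ | Bi\-TSP50 Time ↓ | Bi\-TSP50 HV ↑ \(aug\.\) | Bi\-TSP50 Time ↓ \(aug\.\) | Bi\-TSP100 HV ↑ | Bi\-TSP100 Time ↓ | Bi\-TSP100 HV ↑ \(aug\.\) | Bi\-TSP100 Time ↓ \(aug\.\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WeCon \(TCH\) | 0\.6391 | 6s | 0\.6411 | 3m | 0\.7026 | 15s | 0\.7062 | 13m |
| WeCon \(WS\) | 0\.6407 | 6s | 0\.6415 | 3m | 0\.7056 | 15s | 0\.7077 | 12m |

Note\. Columns marked \(aug\.\) report results obtained with instance augmentation \(highlighted in gray in the original layout\)\.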
Finally, we assess the impact of different decomposition techniques, namely Tchebycheff \(TCH\) and weighted sum \(WS, the adopted method\)\. As shown in Table [12](https://arxiv.org/html/2605.22876#S5.T12), WS consistently outperforms TCH in HV on both Bi\-TSP50 and Bi\-TSP100 at comparable runtime, indicating that WS offers more robust performance across problem scales\. This finding is consistent with the results of prior studies \[[3](https://arxiv.org/html/2605.22876#bib.bib7),[11](https://arxiv.org/html/2605.22876#bib.bib69)\]\.
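For reference, the two scalarizations compared here can be sketched in Python using their standard forms from the decomposition literature \(e\.g\., MOEA/D \[50\]\)\. The paper's own definitions may differ in notation; the function names, the ideal point `z_star`, and the numeric values below are illustrative assumptions\.

```python
from typing import Sequence


def weighted_sum(f: Sequence[float], lam: Sequence[float]) -> float:
    """WS scalarization: sum_i lam_i * f_i(x)."""
    return sum(l * v for l, v in zip(lam, f))


def tchebycheff(f: Sequence[float], lam: Sequence[float],
                z_star: Sequence[float]) -> float:
    """TCH scalarization: max_i lam_i * |f_i(x) - z_i*|, with ideal point z*."""
    return max(l * abs(v - z) for l, v, z in zip(lam, f, z_star))


# Hypothetical objective values of one solution, one weight vector, one ideal point.
f = (3.0, 5.0)
lam = (0.25, 0.75)
z_star = (1.0, 1.0)

ws = weighted_sum(f, lam)          # 0.25*3 + 0.75*5 = 4.5
tch = tchebycheff(f, lam, z_star)  # max(0.25*2, 0.75*4) = 3.0
print(ws, tch)
```

WS needs only the weight vector, whereas TCH additionally depends on an ideal point, which in practice has to be estimated or tracked during training and inference\.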
## 6 Conclusion
In this study, we answer the proposed research question affirmatively, supported by extensive experimental results\. Specifically, we propose WeCon, which comprises a carefully designed encoder that produces informative weight\-conditioned context and a decoder that enables weight\-conditioned decisions\. Moreover, we introduce the EPO strategy for WeCon, which constructs more informative preference pairs and thereby improves training effectiveness\. The experimental results show that i\) WeCon achieves HV values comparable to the SOTA solver POCCO\-W while requiring only about 60% of its inference time and ii\) WeCon\-CCO achieves the best overall performance across the evaluated settings at the cost of runtime efficiency\. Ablation studies validate the effectiveness of our encoder and decoder designs as well as the proposed EPO strategy\. Currently, WeCon and WeCon\-CCO are trained across multiple problem scales to obtain a unified model applicable to all scales\. This design can degrade performance on test instances at a particular scale compared with models trained exclusively on that scale\. Going forward, we believe further improvement is achievable by adopting curriculum learning to progressively train the solver, thereby enhancing generalization across problem scales\.
## References
- \[1\] Y\. Bengio, A\. Lodi, and A\. Prouvost \(2021\) Machine learning for combinatorial optimization: A methodological tour d’horizon\. European Journal of Operational Research 290\(2\), pp\. 405–421\.
- \[2\] F\. Berto, C\. Hua, N\. G\. Zepeda, A\. Hottung, N\. Wouda, L\. Lan, J\. Park, K\. Tierney, and J\. Park \(2025\) RouteFinder: towards foundation models for vehicle routing problems\. Transactions on Machine Learning Research, pp\. 1–41\.
- \[3\] J\. Chen, Z\. Cao, J\. Wang, Y\. Wu, H\. Qin, Z\. Zhang, and Y\. Gong \(2025\) Rethinking neural multi\-objective combinatorial optimization via neat weight embedding\. In The International Conference on Learning Representations, pp\. 1–20\.
- \[4\] J\. Chen, J\. Wang, Z\. Cao, and Y\. Wu \(2025\) Neural multi\-objective combinatorial optimization via graph\-image multimodal fusion\. In The International Conference on Learning Representations, pp\. 1–20\.
- \[5\] J\. Chen, J\. Wang, Z\. Zhang, Z\. Cao, T\. Ye, and C\. Siyuan \(2023\) Efficient meta neural heuristic for multi\-objective combinatorial optimization\. In Proceedings of Advances in Neural Information Processing Systems, pp\. 56825–56837\.
- \[6\] I\. Das and J\. E\. Dennis \(1998\) Normal\-boundary intersection: a new method for generating the pareto surface in nonlinear multicriteria optimization problems\. SIAM Journal on Optimization 8\(3\), pp\. 631–657\.
- \[7\] Y\. N\. Dauphin, A\. Fan, M\. Auli, and D\. Grangier \(2017\) Language modeling with gated convolutional networks\. In Proceedings of International Conference on Machine Learning, pp\. 933–941\.
- \[8\] K\. Deb, A\. Pratap, S\. Agarwal, and T\. Meyarivan \(2002\) A fast and elitist multiobjective genetic algorithm: NSGA\-II\. IEEE Transactions on Evolutionary Computation 6\(2\), pp\. 182–197\.
- \[9\] S\. Elfwing, E\. Uchibe, and K\. Doya \(2018\) Sigmoid\-weighted linear units for neural network function approximation in reinforcement learning\. Neural Networks 107, pp\. 3–11\.
- \[10\] M\. Fan, Y\. Wu, Z\. Cao, W\. Song, G\. Sartoretti, H\. Liu, and G\. Wu \(2025\) Conditional neural heuristic for multiobjective vehicle routing problems\. IEEE Transactions on Neural Networks and Learning Systems 36\(3\), pp\. 4677–4689\.
- \[11\] M\. Fan, J\. Zhou, Y\. Zhang, Y\. Wu, J\. Chen, and G\. A\. Sartoretti \(2025\) Preference\-driven multi\-objective combinatorial optimization with conditional computation\. In Proceedings of Advances in Neural Information Processing Systems\.
- \[12\] X\. Fu, S\. Gu, C\. Chew, and T\. Li \(2025\) Towards generalizable meta\-deep reinforcement learning algorithm for multi\-objective traveling salesman problems\. IEEE Transactions on Artificial Intelligence\.
- \[13\] C\. Gao, H\. Shang, K\. Xue, D\. Li, and C\. Qian \(2024\) Towards generalizable neural solvers for vehicle routing problems via ensemble with transferrable local policy\. In Proceedings of International Joint Conference on Artificial Intelligence, pp\. 6914–6922\.
- \[14\] A\. Jaszkiewicz \(2002\) Genetic local search for multi\-objective combinatorial optimization\. European Journal of Operational Research 137\(1\), pp\. 50–71\.
- \[15\] D\. P\. Kingma \(2015\) Adam: a method for stochastic optimization\. In International Conference on Learning Representations, pp\. 1–15\.
- \[16\] Y\. Kwon, J\. Choo, B\. Kim, I\. Yoon, Y\. Gwon, and S\. Min \(2020\) POMO: Policy optimization with multiple optima for reinforcement learning\. In Proceedings of Advances in Neural Information Processing Systems, pp\. 21188–21198\.
- \[17\] P\. Lacomme, C\. Prins, and M\. Sevaux \(2006\) A genetic algorithm for a bi\-objective capacitated arc routing problem\. Computers & Operations Research 33\(12\), pp\. 3473–3493\.
- \[18\] H\. Li, F\. Liu, Z\. Zheng, Y\. Zhang, and Z\. Wang \(2025\) CaDA: Cross\-problem routing solver with constraint\-aware dual\-attention\. In Proceedings of International Conference on Machine Learning, pp\. 35438–35456\.
- \[19\] K\. Li, T\. Zhang, and R\. Wang \(2021\) Deep reinforcement learning for multiobjective optimization\. IEEE Transactions on Cybernetics 51\(6\), pp\. 3103–3114\.
- \[20\] Q\. Li, Z\. Cao, Y\. Ma, Y\. Wu, and Y\. Gong \(2025\) Diversity optimization for travelling salesman problem via deep reinforcement learning\. In Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp\. 683–694\.
- \[21\] Y\. Li, D\. Wang, W\. Du, X\. Wu, P\. Zhao, Y\. Xiao, and Y\. Zhou \(2026\) Efficient few\-step solution generation via discrete flow matching for combinatorial optimization\. In Proceedings of the AAAI Conference on Artificial Intelligence\.
- \[22\] Z\. Liao, J\. Chen, D\. Wang, Z\. Zhang, and J\. Wang \(2025\) BOPO: neural combinatorial optimization via best\-anchored and objective\-guided preference optimization\. In Proceedings of International Conference on Machine Learning, pp\. 37456–37475\.
- \[23\] X\. Lin, Z\. Yang, and Q\. Zhang \(2022\) Pareto set learning for neural multi\-objective combinatorial optimization\. In The International Conference on Learning Representations, pp\. 1–31\.
- \[24\] Q\. Liu, J\. Lian, C\. Liu, and Z\. Cao \(2025\) Enhancing generalization in large\-scale hcvrp: a rank\-augmented neural solver\. In Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp\. 1845–1856\.
- \[25\] S\. Liu, B\. Tan, Z\. Cao, and Y\. Jin \(2025\) VAGPO: vision\-augmented asymmetric group preference optimization for graph routing problems\. arXiv: 2508\.01774\.
- \[26\] W\. Liu, Y\. Wu, Y\. Zhang, T\. Bäck, and Y\. Fan \(2025\) Preference\-aware mixture\-of\-experts for multi\-objective combinatorial optimization\. OpenReview preprint, submitted to ICLR 2026, Submission \#18931\.
- \[27\] T\. Lust and J\. Teghem \(2010\) Two\-phase Pareto local search for the biobjective traveling salesman problem\. Journal of Heuristics 16\(3\), pp\. 475–510\.
- \[28\] L\. Nie, D\. Qi, B\. Liu, P\. Li, H\. Bao, and H\. He \(2025\) CMRM: Collaborative multi\-agent reinforcement learning for multi\-objective traffic signal control\. IEEE Transactions on Consumer Electronics 71\(2\), pp\. 2793–2805\.
- \[29\] M\. Pan, G\. Lin, Y\. Luo, B\. Zhu, Z\. Dai, L\. Sun, and C\. Yuan \(2025\) Preference optimization for combinatorial optimization problems\. In Proceedings of the International Conference on Machine Learning, pp\. 47672–47696\.
- \[30\] G\. Reinelt \(1991\) TSPLIB—a traveling salesman problem library\. ORSA Journal on Computing 3, pp\. 376–384\.
- \[31\] A\. Schrijver \(2005\) On the history of combinatorial optimization \(till 1960\)\. Handbooks in Operations Research and Management Science 12, pp\. 1–68\.
- \[32\] J\. Shi, J\. Sun, Q\. Zhang, H\. Zhang, and Y\. Fan \(2024\) Improving pareto local search using cooperative parallelism strategies for multiobjective combinatorial optimization\. IEEE Transactions on Cybernetics 54\(4\), pp\. 2369–2382\.
- \[33\] J\. Son, M\. Kim, S\. Choi, H\. Kim, and J\. Park \(2024\) Equity\-transformer: solving np\-hard min\-max routing problems as sequential generation with equity context\. In Proceedings of the AAAI Conference on Artificial Intelligence, pp\. 20265–20273\.
- \[34\] M\. Wang, Q\. Zhou, Y\. Hu, Z\. Lou, and X\. Zhou \(2025\) An optimization framework for multi\-objective traffic signal control using intersection similarity and reinforcement learning\. Alexandria Engineering Journal 133, pp\. 653–672\.
- \[35\] M\. Wang, Y\. Zhou, Z\. Cao, Y\. Xiao, X\. Wu, W\. Pang, Y\. Jiang, H\. Yang, P\. Zhao, and Y\. Li \(2025\) An efficient diffusion\-based non\-autoregressive solver for traveling salesman problem\. In Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp\. 1469–1480\.
- \[36\] R\. Wang, Y\. Li, J\. Yan, and X\. Yang \(2024\) Learning to solve combinatorial optimization under positive linear constraints via non\-autoregressive neural networks \(in Chinese\)\. Science China Information Sciences 54, pp\. 2368–2384\.
- \[37\] Z\. Wang, S\. Yao, G\. Li, and Q\. Zhang \(2024\) Multiobjective combinatorial optimization using a single deep reinforcement learning model\. IEEE Transactions on Cybernetics 54\(3\), pp\. 1984–1996\.
- \[38\] L\. While, P\. Hingston, L\. Barone, and S\. Huband \(2006\) A faster algorithm for calculating hypervolume\. IEEE Transactions on Evolutionary Computation 10\(1\), pp\. 29–38\.
- \[39\] R\. J\. Williams \(1992\) Simple statistical gradient\-following algorithms for connectionist reinforcement learning\. Machine learning 8\(3\), pp\. 229–256\.
- \[40\] H\. Wu, J\. Wang, and Z\. Zhang \(2020\) MODRL/D\-AM: Multiobjective deep reinforcement learning algorithm using decomposition and attention model for multiobjective optimization\. In Proceedings of International Symposium on Intelligence Computation and Applications, pp\. 575–589\.
- \[41\] J\. Wu, C\. Sun, and C\. Yang \(2024\) On the size generalizibility of graph neural networks for learning resource allocation\. Science China Information Sciences 67, pp\. 142301\.
- \[42\] X\. Wu, D\. Wang, L\. Wen, Y\. Xiao, C\. Wu, Y\. Wu, C\. Yu, D\. L\. Maskell, and Y\. Zhou \(2024\) Neural combinatorial optimization algorithms for solving vehicle routing problems: a comprehensive survey with perspectives\. arXiv: 2406\.00415\.
- \[43\] X\. Wu, D\. Wang, C\. Wu, K\. Qi, C\. Miao, Y\. Xiao, J\. Zhang, and Y\. Zhou \(2026\) Efficient neural combinatorial optimization solver for the min\-max heterogeneous capacitated vehicle routing problem\. arXiv: 2507\.21386\.
- \[44\] Y\. Wu, M\. Fan, Z\. Cao, R\. Gao, Y\. Hou, and G\. Sartoretti \(2024\) Collaborative deep reinforcement learning for solving multi\-objective vehicle routing problems\. In Proceedings of International Conference on Autonomous Agents and Multiagent Systems, pp\. 1956–1965\.
- \[45\] Y\. Xiao, D\. Wang, Z\. Cao, R\. Cao, X\. Wu, B\. Li, and Y\. Zhou \(2026\) GELD: A unified neural model for efficiently solving traveling salesman problems across different scales\. Pattern Recognition 173, pp\. 112865\.
- \[46\] Y\. Xiao, D\. Wang, B\. Li, H\. Chen, W\. Pang, X\. Wu, H\. Li, D\. Xu, Y\. Liang, and Y\. Zhou \(2025\) Reinforcement learning\-based nonautoregressive solver for traveling salesman problems\. IEEE Transactions on Neural Networks and Learning Systems 36\(7\), pp\. 13402–13416\.
- \[47\] Y\. Xiao, Y\. Wu, R\. Cao, D\. Wang, Z\. Cao, X\. Wu, P\. Zhao, Y\. Li, Y\. Zhou, and Y\. Jiang \(2025\) DGL: Dynamic global\-local information aggregation for scalable vrp generalization with self\-improvement learning\. In Proceedings of International Joint Conference on Artificial Intelligence, pp\. 8669–8677\.
- \[48\] B\. Zhang and R\. Sennrich \(2019\) Root mean square layer normalization\. In Proceedings of Advances in Neural Information Processing Systems, pp\. 12381–12392\.
- \[49\] C\. Zhang, Y\. Wu, Y\. Ma, W\. Song, Z\. Le, Z\. Cao, and J\. Zhang \(2023\) A review on learning to solve combinatorial optimisation problems in manufacturing\. IET Collaborative Intelligent Manufacturing 5, pp\. e12072\.
- \[50\] Q\. Zhang and H\. Li \(2007\) MOEA/D: A multiobjective evolutionary algorithm based on decomposition\. IEEE Transactions on Evolutionary Computation 11\(6\), pp\. 712–731\.
- \[51\] Z\. Zhang, Z\. Wu, H\. Zhang, and J\. Wang \(2023\) Meta\-learning\-based deep reinforcement learning for multiobjective optimization problems\. IEEE Transactions on Neural Networks and Learning Systems 34\(10\), pp\. 7978–7991\.
- \[52\] J\. Zhong, H\. Ma, M\. Long, and J\. Wang \(2024\) Scheduling approach for aircraft assembly pulsation production lines with deep reinforcement learning and knowledge transfer \(in Chinese\)\. Science China Information Sciences 54, pp\. 1441–1457\.
- \[53\] A\. Zhou, Q\. Zhang, and G\. Zhang \(2012\) A multiobjective evolutionary algorithm based on decomposition and probability model\. In Proceedings of IEEE Congress on Evolutionary Computation, pp\. 1–8\.
- \[54\] Z\. Zong, H\. Wang, J\. Wang, M\. Zheng, and Y\. Li \(2022\) RBG: Hierarchically solving large\-scale routing problems in logistic systems via reinforcement learning\. In Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp\. 4648–4658\.