A Comparative Study of Bayesian Contextual Bandits for Real-Time Warehouse Sorter Optimization
Summary
This paper presents a comparative study of Bayesian Contextual Bandits, XGBoost, and Linear Regression for real-time sorter diversion optimization in e-commerce warehouses, showing BCB achieves 2.03% reward uplift with superior online learning and inference latency.
# A Comparative Study of Bayesian Contextual Bandits for Real-Time Warehouse Sorter Optimization
Source: [https://arxiv.org/html/2606.23977](https://arxiv.org/html/2606.23977)
###### Abstract
Efficient sorter diversion control of automated material handling systems (MHS) is critical for optimizing operational efficiency in large-scale warehouse environments. In this study, we use an inbound receiving sorter at a high-volume e-commerce warehouse as our primary use case, where the sorter diversion system relies on cost functions with static weight configurations that fail to adapt to highly dynamic system contexts, such as volume mode, congestion level, equipment physical status, and upstream/downstream dependencies. To address this real-time sorter diversion optimization challenge, we conducted a comparative study of three candidate hybrid machine learning frameworks: Linear Regression with Gradient Descent Optimization (LR+GDO), XGBoost with Bayesian Optimization (XGB+BO), and Bayesian Contextual Bandits (BCB). Model training and evaluation were enabled by leveraging a high-fidelity physics-aware emulator to overcome the cold-start problem and allow a safe transition from offline to online learning. We performed comprehensive evaluations including reward model predictive accuracy, contextual sensitivity, action distribution, and projected reward uplift. Our results demonstrate that while tree-based reward models offer slightly better predictive power, the BCB framework achieved overall higher performance with a 2.03% reward uplift over the heuristic baseline. Furthermore, BCB exhibits several superior characteristics, such as its decisive time-optimal policy backed by Bang-Bang control theory, continuous online learning capability, strategic balance between exploration and exploitation, and significantly shorter inference latency. These results demonstrate the potential of the BCB framework for real-time control optimization in large-scale warehouse environments, motivating further investigation toward operational deployment.
Index Terms—Bayesian Contextual Bandits, Warehouse Automation, Real\-Time Optimization, Online Learning, Cost Function Tuning\.
## I Introduction
In an e-commerce warehouse environment, various sorters are deployed to manage the diversion of incoming items towards their respective destinations. Each sorter utilizes a customized cost function to decide the diversion direction for incoming items. The cost function calculates a compound cost score for each possible diversion direction by computing a weighted sum of several cost factors in real time. For each incoming item, the sorter evaluates the cost scores across all possible diversion directions and selects the direction with the lowest score. This cost function is tailored to the specific characteristics of each sorter at each site, taking into consideration factors such as sorter type, site physical layout, number of diversion options, and available sensors and system metrics. Although the specific components of each cost function may vary, the general form can be expressed as:

$$\text{Compound Cost Score} = \sum_{i=1}^{n} w_i \cdot \text{cost\_factor}_i \tag{1}$$

where $\text{cost\_factor}_i$ represents the value of the $i$-th cost factor, $w_i$ represents the weight assigned to that cost factor, and $n$ represents the total number of cost factors. Cost factors are operational metrics that reflect key system performance dimensions, for example destination area fullness, throughput, divert continuity, and assignment preferences.
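The selection rule around Eq. (1) can be sketched in a few lines of Python. This is only an illustration: the direction names, factor values, and weights below are hypothetical, and the function names are not taken from the sorter's actual control software.

```python
from typing import Dict, List


def compound_cost_score(weights: List[float], cost_factors: List[float]) -> float:
    """Weighted sum of cost factors for one diversion direction, as in Eq. (1)."""
    if len(weights) != len(cost_factors):
        raise ValueError("weights and cost_factors must have the same length")
    return sum(w * c for w, c in zip(weights, cost_factors))


def select_direction(weights: List[float],
                     factors_by_direction: Dict[str, List[float]]) -> str:
    """Return the diversion direction with the lowest compound cost score."""
    return min(
        factors_by_direction,
        key=lambda d: compound_cost_score(weights, factors_by_direction[d]),
    )


# Hypothetical example: three cost factors and two candidate directions.
weights = [0.5, 0.3, 0.2]
factors = {
    "left": [0.9, 0.2, 0.1],
    "right": [0.4, 0.6, 0.3],
}
best = select_direction(weights, factors)  # "right" has the lower score here
```

With static weights, `select_direction` can only react to changes in the cost factors themselves, which is the limitation the rest of the paper addresses by making `weights` depend on context.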
The warehouse is a highly complex system with interconnected upstream and downstream processes that evolve over time, requiring the sorter diversion system to adapt and make optimal diversion decisions\. However, the current limitation is that the weight configurations in the cost functions have been historically static\. Fixed configurations can render sub\-optimal diversion decisions that lead to high recirculation, congestion, and reduced operational efficiency\. To address this issue, we propose a hybrid machine learning framework that automatically recommends optimal weights given a system context\. This framework incorporates offline model initialization and online continuous learning stages\.
A key challenge in developing such a framework is the cold\-start problem\. Historical operational data lacks variation in cost weight values and is insufficient to train a robust optimization model\. To mitigate this, we leverage a high\-fidelity physics\-aware emulator\. By programmatically assigning random values to the cost weights within the emulator and recording the corresponding system results, we prepare simulated datasets that capture the complex hidden relationships between system dynamics \(context\), cost weights \(action\), and resulting performance metrics \(reward\)\.
In this paper, we conduct a comparative study of three candidate algorithm architectures for the sorter optimization solution:
- Linear Regression + Gradient Descent Optimization (LR+GDO)
- XGBoost + Bayesian Optimization (XGB+BO)
- Bayesian Contextual Bandit (BCB)
Using an inbound receiving sorter at a high\-volume fulfillment environment as our primary use case, we collected 5000 training samples by running the emulator\. During the offline initialization stage, we train a reward model on these samples to learn the relationship between system state, control actions, and resulting operational outcomes\. We then evaluate the trained policy offline by generating actions for held\-out system states and using the learned reward model as a surrogate to estimate the corresponding rewards, comparing against a heuristic baseline recorded in the same dataset\. The three model candidates are compared based on offline predictive accuracy, simulation results, model complexity, potential for continuous online learning, and real\-time inference latency\. Ultimately, the results show that the Bayesian Contextual Bandit \(BCB\) provides the most robust balance of performance and online learning capability for various sorter use cases\.
## II Related Work
Much existing research on warehouse sorter optimization relies on traditional Operations Research (OR) methodologies. These works apply heuristic algorithms to make deterministic assignment or scheduling decisions and often use Mixed-Integer Linear Programming (MILP) to minimize travel time or maximize throughput based on known parameters [[13](https://arxiv.org/html/2606.23977#bib.bib2), [3](https://arxiv.org/html/2606.23977#bib.bib3), [1](https://arxiv.org/html/2606.23977#bib.bib4), [2](https://arxiv.org/html/2606.23977#bib.bib5)]. However, in the highly dynamic warehouse sorter environment, simple heuristic rules rarely hold over time as the system evolves and becomes increasingly stochastic. More importantly, because the sorter is part of a highly context-dependent and interconnected system, the optimal sorter configuration relies on high-dimensional system context inputs. Traditional OR models therefore struggle with the "curse of dimensionality" and with high inference latency, which fails to meet the real-time diversion requirement.
Other studies employ data-driven strategies, training Supervised Learning (SL) models to predict future volume, congestion, or equipment failure, followed by rule-based or OR models such as linear programming for recommendation [[12](https://arxiv.org/html/2606.23977#bib.bib6)]. Although these models can predict certain solver inputs with high accuracy, they are often implemented as open-loop systems that lack a real-time adaptive feedback loop. Our proposed solution builds upon a closed-loop architecture where optimization and learning are tightly integrated.
Contextual Bandits have been successfully applied to many different domains including education [[5](https://arxiv.org/html/2606.23977#bib.bib11)], health [[7](https://arxiv.org/html/2606.23977#bib.bib8), [11](https://arxiv.org/html/2606.23977#bib.bib10)], tourism [[8](https://arxiv.org/html/2606.23977#bib.bib7)], and digital marketing such as ad placement and news recommendation [[6](https://arxiv.org/html/2606.23977#bib.bib9)]. In industry settings, the Bayesian Contextual Bandit (BCB) is emerging as a sample-efficient single-step bandit model that offers a transparent mechanism for uncertainty quantification via Thompson Sampling, in contrast to "black-box" multi-step Deep Reinforcement Learning, which is known to be data hungry and to require hundreds of thousands of samples for training [[4](https://arxiv.org/html/2606.23977#bib.bib12)].
While existing studies have explored using Contextual Bandit for warehouse optimization, they primarily focus on macro logistics and static discrete action planning problems such as order consolidation, picking optimization, and storage allocation\[[9](https://arxiv.org/html/2606.23977#bib.bib13)\]\. By contrast, our study moves beyond simple discrete assignments, and uses the Bayesian Contextual Bandit framework to address the continuous cost\-weight optimization problem in the critical but under\-explored domain of high\-frequency warehouse sorter control\.
## III System Modeling: Context Space, Decision Variables, and Objective Functions
Before we introduce the algorithm designs, we first mathematically formulate this optimization problem that we are addressing\. In this section, we define the system context vector, decision variables, and reward function, taking the inbound receiving sorter as an example\.
### III-A System Context Representation
The system context can be represented as a numerical vector including features such as system throughput, upstream scan count, destination area fullness, recirculation rate, and control override rate. Some features are included as an aggregated sum or mean over the previous $\Delta T$ rolling window, while others are included as time series that provide temporal information to enhance the model's predictive capability.
- Aggregated features include: throughput, recirculation rate, control override rate, routing non-compliance rate, destination fullness. These features capture the surrounding system dynamics of the sorter, and are aggregated over the previous $\Delta T$ time window. Rolling-window aggregation filters out transient noise and provides a stable representation of the true underlying system condition.
- Time series features include: upstream scan volume by input category. These features provide insights into upstream system throughput, which is useful for predicting future sorter arrival volume. Including predicted future arrival volume in the context vector is helpful for action recommendation. Rather than building a separate volume prediction model, we directly include raw upstream throughput signals in the context vector, allowing the model to internalize the temporal relationship between upstream volume and future reward.
### III-B Decision Variables
The decision variables are the weights assigned to the cost factors in the cost function, which is used for making real-time sorter divert decisions. To properly attribute a resulting reward to its corresponding cost weights, the weights should be held at fixed values for a reasonably long duration (e.g., a few minutes) so that the system can react to them and accumulate a reliable effect on system metrics. Therefore, during the simulation phase, we perturb the values of the cost weights every $\Delta$ time window, and during the online learning phase, we likewise have the model generate new weight recommendations every $\Delta$ time window.
### III-C Reward Design
There are several system performance metrics that we would like to improve and balance with this optimization solution. Therefore, we utilize a composite reward score which is defined as a weighted sum of several operational Key Performance Indicators (KPIs):

$$r = \sum_{i=1}^{n} w_i \cdot k_i, \quad \text{subject to } w_i \geq 0,\ \sum_{i=1}^{n} w_i = 1 \tag{2}$$

where $k_i$ represents the $i$-th KPI and $w_i$ is its corresponding weight. The reward score should be calculated as an aggregated value over a $\Delta$ time window, rather than a point value collected at a single timestamp, so that it reflects the true operational efficiency level of the system and filters out noise. We analyzed the definition and business logic of each reward metric included in the composite score and found that the majority of their impact should come from the cost weights in effect during the same time window. Therefore, the reward observation window should align with the time window of its corresponding cost weights. Consequently, this model focuses on optimizing the immediate reward rather than a delayed long-term reward.
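As a rough illustration of Eq. (2), the sketch below aggregates point KPI samples over one window and combines them into a composite reward. The KPI values and weights are hypothetical, and the code assumes each KPI has already been normalized so that a larger value is better, which the text does not specify.

```python
from typing import List, Sequence


def window_mean(samples: Sequence[float]) -> float:
    """Aggregate point observations of one KPI over a single time window."""
    if not samples:
        raise ValueError("window must contain at least one sample")
    return sum(samples) / len(samples)


def composite_reward(kpi_weights: List[float], kpis: List[float],
                     tol: float = 1e-9) -> float:
    """Weighted sum of KPIs, enforcing the constraints of Eq. (2)."""
    if len(kpi_weights) != len(kpis):
        raise ValueError("kpi_weights and kpis must have the same length")
    if any(w < 0 for w in kpi_weights):
        raise ValueError("KPI weights must be non-negative")
    if abs(sum(kpi_weights) - 1.0) > tol:
        raise ValueError("KPI weights must sum to 1")
    return sum(w * k for w, k in zip(kpi_weights, kpis))


# Hypothetical window: two KPIs sampled a few times within one window.
kpi_a = window_mean([0.8, 0.9, 0.7])   # mean is 0.8
kpi_b = window_mean([0.5, 0.7])        # mean is 0.6
reward = composite_reward([0.75, 0.25], [kpi_a, kpi_b])
```

Aggregating before combining keeps a single noisy reading from dominating the reward attributed to one set of cost weights.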
## IV Algorithm Design: Optimization Framework Candidates and Formulations
This section discusses the algorithm architecture and design details of three candidate optimization frameworks for the sorter divert optimization problem\.
### IV-A Linear Regression + Gradient Descent Optimization (LR+GDO)
The LR+GDO architecture first trains a reward model that captures the complex relationship between system context ($C$), cost weights ($W$) and reward outcomes ($R$). The reward model is trained using linear regression with Lasso regularization to reduce feature dimension and make it a sample-efficient baseline. We include interaction terms $C_i \cdot W_j$ between all context variables and weight variables in the regression model so that it can capture how cost weights affect reward differently under different system contexts.

$$\phi(C,W) = \bigl[1,\ C_1,\ \ldots,\ C_{d_c},\ W_1,\ \ldots,\ W_{d_w},\ C_1 W_1,\ \ldots,\ C_{d_c} W_{d_w}\bigr]^{\top} \tag{3}$$

where $d_\phi = 1 + d_c + d_w + d_c \cdot d_w$. The reward model is then formulated as:

$$\hat{R}(C,W) = \beta^{\top} \phi(C,W) \tag{4}$$

The model parameters $\beta$ are learned by minimizing the Lasso-regularized objective:

$$\min_{\beta} \sum_{i=1}^{n} \left(R_i - \beta^{\top} \phi(C_i, W_i)\right)^2 + \alpha \|\beta\|_1 \tag{5}$$

where $n$ is the number of training samples and $\alpha$ is the regularization parameter.
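A minimal sketch of the feature map in Eq. (3) and the linear prediction in Eq. (4) follows. The context-major ordering of the interaction terms ($C_1 W_1, C_1 W_2, \ldots$) is an assumption made here for concreteness. Fitting $\beta$ with Lasso is not shown.

```python
from typing import List, Sequence


def feature_map(context: Sequence[float], weights: Sequence[float]) -> List[float]:
    """Build phi(C, W): bias, context terms, weight terms, then all C_i * W_j."""
    phi = [1.0]
    phi.extend(context)
    phi.extend(weights)
    # Interaction terms in context-major order (an assumed convention).
    phi.extend(c * w for c in context for w in weights)
    return phi


def predict_reward(beta: Sequence[float], context: Sequence[float],
                   weights: Sequence[float]) -> float:
    """Linear reward model R_hat = beta^T phi(C, W), as in Eq. (4)."""
    phi = feature_map(context, weights)
    if len(beta) != len(phi):
        raise ValueError("beta must have 1 + d_c + d_w + d_c * d_w entries")
    return sum(b * f for b, f in zip(beta, phi))


# Small example with d_c = 2 and d_w = 1, giving d_phi = 1 + 2 + 1 + 2 = 6.
phi_small = feature_map([2.0, 3.0], [0.5])
```

For a fixed context, every entry of `feature_map` is either constant or linear in the weights, so the predicted reward is an affine function of $W$.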
Once the reward model is trained, it is used to guide the optimization search for action recommendation. We utilize the PyTorch framework for efficient gradient descent optimization. We extract the learned coefficients from the regression model and port them into a differentiable PyTorch wrapper that reconstructs the reward model within its computation graph, so that the reward model can serve as the objective function when searching for the global optimum. The optimal solution needs to meet a simplex constraint:

$$\max_{W} \hat{R}(C,W) \quad \text{subject to} \quad \sum_{i=1}^{d_w} w_i = 1, \quad w_i \geq 0 \quad \forall i \tag{6}$$

Because the default PyTorch optimization search is performed in an unconstrained latent space $Z \in \mathbb{R}^{d_w}$, we add a Softmax layer to map $Z$ to valid weights $W$, and then use the Adam optimizer to iteratively update $Z$ by backpropagating the gradient calculated from predicted rewards.
### IV-B XGBoost + Bayesian Optimization (XGB+BO)
The XGB+BO framework consists of two key components: a tree-based reward model and a Bayesian optimization solver. The XGBoost reward model, initially trained on simulated data, predicts system performance outcomes from the current context and cost weights.

$$\hat{R}(C,W) = f_{\text{XGB}}([C,W]) \tag{7}$$

where $[C,W]$ denotes the concatenation of context and weight vectors.
This reward model guides the optimum search of a real-time optimization solver that dynamically generates weight recommendations as the system context evolves. We chose Bayesian Optimization (BO) as our real-time solver due to its distinct advantages: 1) our tree-based reward model is non-differentiable, and BO can build a Gaussian process (GP) surrogate model to allow for global optimum search; 2) BO is very sample efficient and handles such expensive-to-evaluate objective functions well; 3) BO provides uncertainty quantification in its predictions. Bayesian Optimization uses a GP to model the objective function:

$$f_{\text{XGB}}(W) \sim \mathcal{GP}\bigl(\mu(W),\ k(W,W')\bigr) \tag{8}$$

where $\mu(W)$ is the mean function and $k(W,W')$ is the covariance kernel. We use Matérn 2.5 as the kernel function in this study. New observations $\mathcal{D}_{1:t} = \{(W^{(i)}, R^{(i)})\}_{i=1}^{t}$ are used to update the posterior distribution:

$$p(R \mid W, \mathcal{D}_{1:t}) = \mathcal{N}\bigl(\mu_t(W),\ \sigma_t^2(W)\bigr) \tag{9}$$

The optimal weights are found by maximizing an acquisition function:

$$W^{(t+1)} = \arg\max_{W \in \mathcal{W}} \alpha(W \mid \mathcal{D}_{1:t}) \tag{10}$$

We use the Upper Confidence Bound (UCB) as the acquisition function $\alpha(W \mid \mathcal{D}_{1:t})$ in this study.
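To make the acquisition step in Eq. (10) concrete, the sketch below scores a few candidates with a common form of UCB, $\mu_t(W) + \kappa\,\sigma_t(W)$. The exploration coefficient `kappa` and the candidate values are assumptions for illustration; the paper does not state the value it used, and a real BO loop would obtain the posterior mean and standard deviation from the GP rather than from hard-coded numbers.

```python
from typing import List, Tuple


def ucb(mean: float, std: float, kappa: float = 2.0) -> float:
    """Upper Confidence Bound: posterior mean plus kappa times posterior std."""
    return mean + kappa * std


def pick_next(candidates: List[Tuple[str, float, float]],
              kappa: float = 2.0) -> str:
    """Return the label of the candidate (label, mean, std) with the highest UCB."""
    return max(candidates, key=lambda c: ucb(c[1], c[2], kappa))[0]


# Hypothetical GP posterior summaries for two candidate weight vectors.
candidates = [
    ("W_a", 0.80, 0.01),  # higher mean, well explored
    ("W_b", 0.78, 0.05),  # slightly lower mean, more uncertain
]
explore_choice = pick_next(candidates, kappa=2.0)  # favors the uncertain W_b
exploit_choice = pick_next(candidates, kappa=0.0)  # reduces to the best mean, W_a
```

Larger `kappa` values push the solver toward uncertain regions of the weight space, while `kappa = 0` makes it purely greedy on the surrogate mean.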
As a proposed extension, once the optimization solver is transitioned to an online setting, real\-world data can be collected to form a feedback loop and enable continuous learning and adaptability\. A separate offline pipeline would be responsible for retraining the XGBoost reward model periodically \(e\.g\., hourly, daily\) by gradually replacing simulated data with live operational data\. After each retraining cycle, the updated reward model becomes available to the real\-time optimization solver to make smarter decisions\. This approach enables the reward model to adapt to data distributional shifts and new operational patterns\.
### IV-C Bayesian Contextual Bandits (BCB)
A core advantage of the BCB framework is that it inherently balances exploration and exploitation by learning a probabilistic belief over the reward function instead of point estimates. The reward function takes the form:

$$R = \beta^{\top} \phi(C,W) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2) \tag{11}$$

where $\phi(C,W)$ is the feature vector including interaction terms as defined in equation ([3](https://arxiv.org/html/2606.23977#S4.E3)). The reward function is trained as a Bayesian Linear Regression (BLR) by maintaining a posterior distribution over the coefficients:

$$p(\beta \mid \mathcal{D}_{1:t}) = \mathcal{N}\bigl(\bm{\mu}_t,\ \bm{\Lambda}_t^{-1}\bigr) \tag{12}$$

where $\bm{\mu}_t \in \mathbb{R}^{d_\phi}$ is the posterior mean and $\bm{\Lambda}_t \in \mathbb{R}^{d_\phi \times d_\phi}$ is the precision matrix. Given a batch of new observations, the model update is calculated as:

$$\bm{\Lambda}_t = \bm{\Lambda}_{t-1} + \bm{\Phi}^{\top} \bm{\Phi} \tag{13}$$

$$\bm{\phi}_t = \bm{\phi}_{t-1} + \bm{\Phi}^{\top} \mathbf{R} \tag{14}$$

$$\bm{\mu}_t = \bm{\Lambda}_t^{-1} \bm{\phi}_t \tag{15}$$

where $\mathbf{R} = [R_1, \ldots, R_N]^{\top}$ is the reward vector, $\bm{\Phi} \in \mathbb{R}^{N \times d_\phi}$ is the feature matrix, and $\bm{\phi}_t \in \mathbb{R}^{d_\phi}$ is the accumulated reward-weighted feature vector.
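The batch update in Eqs. (13) to (15) can be sketched with NumPy as below. The isotropic prior $\bm{\Lambda}_0 = \lambda I$ with $\bm{\phi}_0 = 0$ is an assumption made here to initialize the recursion, and the class and attribute names are illustrative only.

```python
import numpy as np


class BayesianLinearRegression:
    """Posterior over beta, updated with the recursion of Eqs. (13)-(15)."""

    def __init__(self, d_phi: int, prior_precision: float = 1.0):
        # Assumed prior: Lambda_0 = prior_precision * I, phi_0 = 0.
        self.Lambda = prior_precision * np.eye(d_phi)
        self.phi_acc = np.zeros(d_phi)
        self.mu = np.zeros(d_phi)

    def update(self, Phi, R) -> None:
        """Fold in a batch: Phi is (N, d_phi) features, R is (N,) rewards."""
        Phi = np.atleast_2d(np.asarray(Phi, dtype=float))
        R = np.asarray(R, dtype=float)
        self.Lambda = self.Lambda + Phi.T @ Phi         # Eq. (13)
        self.phi_acc = self.phi_acc + Phi.T @ R         # Eq. (14)
        # Eq. (15), solved as a linear system instead of forming the inverse.
        self.mu = np.linalg.solve(self.Lambda, self.phi_acc)


# Synthetic, noise-free check with a known coefficient vector.
rng = np.random.default_rng(0)
beta_true = np.array([0.5, -1.0, 2.0])
Phi_all = rng.normal(size=(200, 3))
R_all = Phi_all @ beta_true

model = BayesianLinearRegression(d_phi=3, prior_precision=1e-6)
model.update(Phi_all[:100], R_all[:100])   # first batch
model.update(Phi_all[100:], R_all[100:])   # second batch
```

Because the update only accumulates sufficient statistics, processing the data in two batches yields the same posterior mean as processing it all at once, which is what makes incremental online updates inexpensive.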
At inference time, BCB uses Thompson Sampling to draw a coefficient vector $\tilde{\beta} \sim \mathcal{N}(\bm{\mu}_t, \bm{\Lambda}_t^{-1})$ from the model's current posterior distribution, which is then used for reward prediction during the optimization search. The optimum is defined as

$$W^{*} = \arg\max_{W \in \mathcal{W}} \tilde{\beta}^{\top} \phi(C,W) \tag{16}$$

The context vector is provided as an input, and the action search is performed in an unconstrained latent space $Z \in \mathbb{R}^{d_w}$. We apply a Softmax transformation to map $Z$ to $W$, and then use the softmax Jacobian to iteratively update $Z$ until the reward converges:

$$Z^{(k+1)} = Z^{(k)} + \eta\, J_{\text{softmax}}(Z)^{\top}\, \nabla_W \left[\tilde{\beta}^{\top} \phi(C,W)\right] \tag{17}$$

where $\eta$ is the learning rate and $J_{\text{softmax}}(Z)_{ij} = w_i(\delta_{ij} - w_j)$ is the Softmax Jacobian. The gradient with respect to $W$ is calculated as $\nabla_W = \beta_W + C^{\top} B$, where $\beta_W$ are the action coefficients and $B \in \mathbb{R}^{d_c \times d_w}$ are the interaction coefficients.
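The inference step in Eqs. (16) and (17) might look roughly like the following NumPy sketch. The coefficient layout (bias, context, weights, then context-major interactions), the learning rate, and the fixed iteration count are all assumptions for illustration, not the authors' settings.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax mapping latent Z onto the simplex."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def thompson_sample(mu, Lambda, rng) -> np.ndarray:
    """Draw beta_tilde ~ N(mu, Lambda^{-1}) from the current posterior."""
    return rng.multivariate_normal(mu, np.linalg.inv(Lambda))


def split_coefficients(beta, d_c: int, d_w: int):
    """Extract beta_W and B, assuming the layout [1, C, W, C_i*W_j context-major]."""
    beta_w = beta[1 + d_c: 1 + d_c + d_w]
    B = beta[1 + d_c + d_w:].reshape(d_c, d_w)
    return beta_w, B


def optimize_weights(beta_w, B, context, eta: float = 0.5, steps: int = 500):
    """Gradient ascent on Z with W = softmax(Z), following Eq. (17)."""
    grad_w = beta_w + context @ B            # constant: reward is linear in W
    z = np.zeros(len(beta_w))                # start from uniform weights
    for _ in range(steps):
        w = softmax(z)
        jac = np.diag(w) - np.outer(w, w)    # J_ij = w_i (delta_ij - w_j)
        z = z + eta * jac.T @ grad_w
    return softmax(z)


# Illustrative run with a standard-normal posterior (d_c = 2, d_w = 3).
rng = np.random.default_rng(0)
d_c, d_w = 2, 3
d_phi = 1 + d_c + d_w + d_c * d_w
beta_tilde = thompson_sample(np.zeros(d_phi), np.eye(d_phi), rng)
beta_w, B = split_coefficients(beta_tilde, d_c, d_w)
w_star = optimize_weights(beta_w, B, np.array([1.0, 0.5]))
```

Since $\tilde{\beta}^{\top}\phi(C,W)$ is linear in $W$ for a fixed context, its maximum over the simplex lies at a vertex, so the iteration drifts toward placing nearly all mass on one weight. This may help explain the near-bimodal action distribution reported in Section V-D, although that link is an observation about this sketch rather than a result from the paper.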
The stochastic sampling allows the model to continue exploring weights in uncertain regions (where the posterior covariance $\bm{\Lambda}_t^{-1}$ has large variance) while naturally shifting to exploitation as the uncertainty shrinks and the posterior distribution narrows. Optionally, we can set weight priors using expert domain knowledge to allow efficient early exploration, or add safety bounds to the search space to guarantee safe exploration.
Due to this closed\-loop design, BCB is able to update its policy in real\-time as it continues to collect online feedback and can rapidly adapt to the evolving system dynamics, such as mechanical wear, floor conditions changes, sensor drift, or new operational patterns\.
## V Simulation Experiment and Comparative Performance Analysis
Using the inbound receiving sorter at the chosen large\-scale e\-commerce warehouse as our primary use case, we compare the three candidate optimization frameworks by evaluating their corresponding reward model’s accuracy, action distribution, sample efficiency, and projected reward lift returned from high fidelity simulation\.
### V-A Emulator data collection
To address the cold-start challenge and generate data samples to initialize reward model training, we leverage a physics-aware emulator that replicates the operational dynamics of the target system. We collect a set of 5000 data samples in the form of $[\text{context},\ \text{action},\ \text{reward}]$ tuples, where each sample captures a snapshot of system state, the applied cost weights, and the resulting operational outcome. In this study, the context vector has dimensionality $d_c = 14$ and the action vector has dimensionality $d_w = 6$, corresponding to the six cost weights in the sorter's cost function. The resulting feature vector $\phi(C,W)$ used in the BCB reward model has dimensionality $d_\phi = 1 + d_c + d_w + d_c \cdot d_w = 105$, including bias, context, action, and all interaction terms. To ensure broad coverage of the action space during emulation, we randomly sample actions from a Dirichlet distribution, which satisfies the simplex constraint ($w_i \geq 0,\ \sum_i w_i = 1$). The emulator is configured to execute a different randomly drawn action over each collection interval and record the corresponding system context and reward metrics. The time window structure is central to the feature engineering and reward calculation process. We define three temporally aligned windows for each sample:
- Context (system conditions): a rolling lookback window ending at decision time $t(0)$
- Action (cost weights): applied over a forward execution window $[t(0),\ t(\Delta)]$
- Reward (resulting system performance): observed over the window $[t(0),\ t(\Delta)]$
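The random action sampling described above can be sketched with NumPy's Dirichlet sampler. The symmetric concentration parameter `alpha = 1.0` (uniform over the simplex) and the seed are assumptions for illustration; the paper does not report the concentration it used.

```python
import numpy as np


def sample_actions(n_samples: int, d_w: int, alpha: float = 1.0, rng=None):
    """Draw cost-weight vectors on the simplex from a symmetric Dirichlet."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.dirichlet(alpha * np.ones(d_w), size=n_samples)


# One action per collection interval: 5000 samples of d_w = 6 weights.
rng = np.random.default_rng(42)
actions = sample_actions(5000, d_w=6, alpha=1.0, rng=rng)
```

Every row of `actions` is non-negative and sums to one, so each draw is a valid cost-weight configuration without any post-hoc normalization. Smaller `alpha` values would concentrate draws near the corners of the simplex and larger values near its center.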
### V-B Reward Model Efficacy Analysis
All three candidate frameworks have reward prediction and action recommendation components. We first evaluate their reward prediction performance using two metrics: Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE). We split the data samples into 80% for training and 20% for testing, then train each model with an increasing percentage of the training data, from 10% to 100%, and test on the same held-out data. Finally, we compare the three reward models against a naive mean baseline. As shown in Fig. [1](https://arxiv.org/html/2606.23977#S5.F1), both RMSE and MAPE improve as training data size increases for XGB+BO and BCB, with diminishing marginal returns. The curves start to plateau after about 70% of the data size, suggesting that the models are sample efficient and model performance is converging. By contrast, LR+GDO maintains consistently high error rates across all training sizes, suggesting its inability to effectively learn from data.

Figure 1: Reward model learning curve comparison

Table [I](https://arxiv.org/html/2606.23977#S5.T1) presents the final evaluation results for models trained with 100% of the training dataset. All three ML reward models outperformed the baseline, with the XGB+BO framework achieving the highest performance score, followed by BCB, while LR+GDO yields the poorest results. This performance ranking is consistent across both the RMSE and MAPE metrics. Given LR+GDO's limited learning capacity and consistently inferior performance across both metrics (44.87% and 48.09% improvement over baseline, compared to over 49% for the other two), the remainder of our study focuses exclusively on comparing XGB+BO and BCB, which demonstrate sufficient predictive power to serve as reliable reward surrogates.

TABLE I: Model Prediction Power Summary

| Metric | XGB+BO | BCB | LR+GDO |
| --- | --- | --- | --- |
| RMSE | 0.0492 | 0.0519 | 0.0566 |
| Baseline RMSE | 0.1027 | 0.1027 | 0.1027 |
| Improvement over baseline (RMSE) | 52.12% | 49.50% | 44.87% |
| MAPE | 4.56% | 4.96% | 5.48% |
| Baseline MAPE | 10.55% | 10.55% | 10.55% |
| Improvement over baseline (MAPE) | 56.74% | 52.98% | 48.09% |
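For reference, the metrics in Table I can be computed as in the sketch below. The improvement formula, relative error reduction against the naive mean baseline, is an assumption inferred from the table. Applying it to the rounded table entries reproduces the reported improvements to within about 0.05 percentage points, with the small gaps presumably due to rounding of the published errors.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root Mean Squared Error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Percentage Error in percent (assumes no zero targets)."""
    n = len(y_true)
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n


def improvement_over_baseline(baseline_err: float, model_err: float) -> float:
    """Relative error reduction versus the baseline, in percent."""
    return 100.0 * (baseline_err - model_err) / baseline_err


# Recompute two Table I rows from the rounded RMSE values.
xgb_rmse_gain = improvement_over_baseline(0.1027, 0.0492)  # close to 52.12
bcb_rmse_gain = improvement_over_baseline(0.1027, 0.0519)  # close to 49.50
```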
### V-C Feature Importance and Action Sensitivity Analysis
By analyzing the coefficients of the trained BCB reward model and the feature importance of the XGBoost model, we verify two critical assumptions underlying this optimization problem: 1) the cost weights do have an impact on reward, and 2) the optimal cost weights do vary with system context. The two models complement each other because they describe the physical system from different perspectives: XGBoost identifies the most influential factors in the system environment, while BCB reveals how the system responds to cost weights and how the impact of those weights changes under different system contexts.
Among the top ranking coefficients returned from BCB \(Table[II](https://arxiv.org/html/2606.23977#S5.T2)\), most are interaction terms between context and actions, which confirms that optimal weights are highly sensitive to the system state\. For example, the impact of routing\-related action weight is conditional on upstream volume indicators and operational performance metrics, with a negative interaction coefficient indicating that the benefit of increasing this weight diminishes during high\-load periods\. For XGBoost, we utilize SHAP \(SHapley Additive exPlanations\) to decompose feature impact into main effects and interaction effects, isolating action\-context sensitivity\. We report a normalized strength score \(mean absolute SHAP value relative to the maximum\) for each feature in Table[III](https://arxiv.org/html/2606.23977#S5.T3)\. Both models consistently identify volume and load\-related features as the most impactful factors for reward outcomes\.
TABLE II: Top 10 Features from BCB Reward Model

| Rank | Feature | Direction |
| --- | --- | --- |
| 1 | [Upstream throughput context (T-1)] × [Volume priority weight] | Negative |
| 2 | (Bias) | Positive |
| 3 | [Recirc context] × [Volume priority weight] | Positive |
| 4 | [Recirc context] × [Throughput weight] | Positive |
| 5 | [Recirc context] × [Assignment preference weight] | Positive |
| 6 | [Fullness context] × [Throughput weight] | Negative |
| 7 | [Fullness context] × [Volume priority weight] | Negative |
| 8 | Fullness context | Positive |
| 9 | [Fullness context] × [Fullness weight] | Negative |
| 10 | [Fullness context] × [Assignment preference weight] | Negative |

TABLE III: Top 10 Features from XGBoost Reward Model (SHAP)

| Rank | Feature | Norm. Strength |
| --- | --- | --- |
| 1 | Context: Routing compliance indicator | 1.000 |
| 2 | Action: Volume priority weight | 0.819 |
| 3 | Context: Destination Fullness | 0.497 |
| 4 | Context: Upstream throughput (T-1) | 0.424 |
| 5 | Context: Sorter throughput | 0.238 |
| 6 | Action: Fullness weight | 0.207 |
| 7 | Context: Control override count | 0.191 |
| 8 | Action: Divert continuity weight | 0.182 |
| 9 | Context: Upstream throughput (T-3) | 0.157 |
| 10 | Context: Destination Fullness | 0.102 |
### V-D Action Distribution Analysis
The histograms in Fig. [2](https://arxiv.org/html/2606.23977#S5.F2) and Fig. [3](https://arxiv.org/html/2606.23977#S5.F3) compare the distributions of actions recommended by XGB+BO and BCB. Overall, both models learn to assign smaller values to Actions 2, 4 and 5. BCB exhibits a "U"-shaped distribution for Actions 1 and 6, identifying specific contexts where the weight must be maxed out to optimize reward, while XGB+BO displays a broader action distribution that favors the conservative middle ground. BCB thus proves to be a more decisive and reactive policy than the conservative XGB+BO.
The bimodal distribution of BCB \(switching abruptly between 0 and 1 for high\-impact actions like Action 1 and 6\) is analogous to Bang\-Bang control, a feedback control strategy that switches between extremes rather than intermediate values\[[10](https://arxiv.org/html/2606.23977#bib.bib1)\]\. While a formal optimal\-control derivation is beyond the scope of this study, the empirical behavior suggests that decisive corrections may be more effective than moderate adjustments in this high\-frequency control setting\. Given our frequent model inference cycles, this approach allows BCB to act as a rapid feedback controller that applies maximum effort to correct system deviations as fast as possible before the next inference cycle, instead of applying a gentle adjustment\.
Figure 2: XGB action histogram

Figure 3: BCB action histogram
### V\-EReward Uplift Estimation
To estimate the projected reward uplift, we conducted Python\-based simulations to compare candidate frameworks with a heuristic baseline\. The heuristic baseline is defined as a fixed weight configuration determined through domain expertise, iterative tuning by operations engineers, and prior operational studies, representing the best\-known static policy prior to the introduction of a data\-driven optimization framework\. This baseline reflects a principled, operationally grounded reference point rather than an arbitrary or naive choice\. In each iteration, we provide the models with one context vector from the test dataset, and ask the model to recommend optimal weights, then we estimate corresponding rewards of policy actions and baseline actions using the trained reward model\. We calculate performance of policy vs baseline using the identical testing context vectors to ensure a fair comparison\. As shown in Table[IV](https://arxiv.org/html/2606.23977#S5.T4), both models significantly outperform the baseline, but BCB proves to be superior by delivering a 2\.03% increase in reward\.
TABLE IV: Reward Uplift Comparison

| Model | Reward Uplift \(%\) |
| --- | --- |
| XGB\+BO | 1\.75% |
| BCB | 2\.03% |
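The evaluation loop described in this subsection can be sketched as follows\. This is a minimal illustration, not the authors' simulation code: the names `estimate_uplift`, `toy_reward_model`, and `toy_policy` are hypothetical, the reward model is a toy linear function, and defining uplift as the relative difference of mean estimated rewards is an assumption, since the text does not state the exact formula\.

```python
import numpy as np

def estimate_uplift(contexts, policy, baseline_action, reward_model):
    """Percent uplift of a policy over a fixed baseline on shared contexts.

    Assumption: uplift is the relative difference of mean estimated rewards.
    """
    # Score the policy and the baseline on the identical context vectors.
    policy_rewards = [reward_model(x, policy(x)) for x in contexts]
    baseline_rewards = [reward_model(x, baseline_action) for x in contexts]
    mean_policy = np.mean(policy_rewards)
    mean_baseline = np.mean(baseline_rewards)
    return 100.0 * (mean_policy - mean_baseline) / mean_baseline

rng = np.random.default_rng(0)
n_actions, n_context = 6, 4
W = 0.1 * rng.normal(size=(n_actions, n_context))  # toy interaction weights

def toy_reward_model(x, a):
    # Stand-in for the trained reward model (hypothetical).
    return 10.0 + a @ (W @ x)

def toy_policy(x):
    # Stand-in for a learned policy: maximizes the toy reward per context.
    return np.where(W @ x > 0, 1.0, 0.0)

test_contexts = rng.normal(size=(500, n_context))
baseline = np.full(n_actions, 0.5)  # fixed "heuristic" weights (hypothetical)

uplift = estimate_uplift(test_contexts, toy_policy, baseline, toy_reward_model)
self_uplift = estimate_uplift(
    test_contexts, lambda x: baseline, baseline, toy_reward_model
)
print(f"toy policy uplift: {uplift:.2f}%")
```

Because the same reward model scores both the policy and the baseline, the resulting uplift is a model\-based projection rather than a measured outcome, and evaluating the baseline against itself returns exactly zero\.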
### V\-F Experiment Summary and Framework Recommendation
Although the XGBoost reward model reported a slightly lower MAPE, the BCB framework achieved a higher final reward uplift over the heuristic baseline, which is the primary optimization objective\. We recommend the BCB framework for real\-time sorter diversion optimization because of the following advantages:
- Superior online learning capacity: The warehouse environment is highly dynamic, and its underlying system may drift over time as new equipment and software are installed or patterns change\. BCB balances exploration and exploitation through Bayesian uncertainty\.
- Contextual sensitivity: Our coefficient analysis demonstrates that BCB effectively captures the interactions between context and actions, making optimal weights sensitive to system dynamics\. In contrast, XGB’s feature importance does not demonstrate that its policy leverages the nuanced relationship between environment and actions\.
- Time\-optimal control: BCB’s bimodal action distribution allows for decisive, timely system corrections\.
- Operational efficiency: BCB has very short inference latency \(on the order of milliseconds\), which facilitates near\-instantaneous adaptation\. By contrast, the XGB\+BO pipeline requires an average of 18 seconds per inference\.
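The exploration and exploitation balance mentioned in the list above can be illustrated with Thompson sampling over a Bayesian linear reward model\. This is a generic sketch of one common way to build such a bandit, not necessarily the formulation used in this study: the class `BayesLinearTS`, the feature map `features`, the prior and noise variances, and the four candidate actions are all assumptions made for illustration\.

```python
import numpy as np

class BayesLinearTS:
    """Bayesian linear regression with Thompson sampling (illustrative)."""

    def __init__(self, dim, prior_var=1.0, noise_var=0.25, seed=0):
        self.precision = np.eye(dim) / prior_var  # posterior precision matrix
        self.b = np.zeros(dim)                    # precision-weighted mean
        self.noise_var = noise_var
        self.rng = np.random.default_rng(seed)

    def posterior(self):
        cov = np.linalg.inv(self.precision)
        cov = (cov + cov.T) / 2.0  # symmetrize against round-off
        return cov @ self.b, cov

    def choose(self, candidate_features):
        """Sample one parameter vector and act greedily with respect to it."""
        mean, cov = self.posterior()
        theta = self.rng.multivariate_normal(mean, cov)
        return int(np.argmax(candidate_features @ theta))

    def update(self, phi, reward):
        """Conjugate Gaussian update after observing one reward."""
        self.precision += np.outer(phi, phi) / self.noise_var
        self.b += phi * reward / self.noise_var

def features(x, a):
    # Intercept plus context-action interaction terms (assumed feature map).
    return np.concatenate(([1.0], np.kron(a, x)))

rng = np.random.default_rng(1)
candidates = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
n_context = 3
dim = 1 + candidates.shape[1] * n_context
true_theta = rng.normal(size=dim)  # hidden toy environment parameters

agent = BayesLinearTS(dim)
initial_trace = np.trace(agent.posterior()[1])

for _ in range(300):
    x = rng.normal(size=n_context)
    phis = np.array([features(x, a) for a in candidates])
    k = agent.choose(phis)
    reward = phis[k] @ true_theta + rng.normal(scale=0.5)
    agent.update(phis[k], reward)  # online learning: one update per decision

final_trace = np.trace(agent.posterior()[1])
print(f"posterior variance (trace): {initial_trace:.2f} -> {final_trace:.2f}")
```

Early on, the wide posterior makes sampled parameters vary, so different actions get tried; as the posterior tightens, the choices concentrate on the actions with the highest estimated reward\. Each decision needs only one posterior sample and a few small matrix products, which is consistent with low inference latency for a model of this kind\.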
## VI Conclusion
The major contribution of this work is a mathematical formulation of the warehouse sorter environment as a structured Context\-Action\-Reward representation, together with a comprehensive evaluation of three end\-to\-end machine learning frameworks for real\-time sorter optimization\. The Bayesian Contextual Bandit \(BCB\) emerged as the superior solution, achieving a 2\.03% reward uplift over the heuristic baseline\. Given its empirical alignment with Bang\-Bang control principles, significantly higher computational efficiency, practical sample efficiency, and continuous online adaptation capability, the BCB model is the better choice for the high\-frequency, context\-aware optimization required in a highly complex, evolving warehouse environment\. Future work will validate the reported reward uplift through closed\-loop emulator testing with repeated randomized trials and statistical significance analysis\.
## Appendix A
### A\-A Action Distributions
Figures [4](https://arxiv.org/html/2606.23977#A1.F4), [5](https://arxiv.org/html/2606.23977#A1.F5), [6](https://arxiv.org/html/2606.23977#A1.F6), and [7](https://arxiv.org/html/2606.23977#A1.F7) show the box plots and correlation matrices of the actions recommended by XGB\+BO and BCB\.
Figure 4: XGB action box plot
Figure 5: BCB action box plot
Figure 6: XGB action correlation matrix
Figure 7: BCB action correlation matrix
## ACKNOWLEDGMENT
The authors would like to thank the leadership team for their support and guidance throughout this project\. We also acknowledge the simulation team and software engineering team for their assistance in developing and validating the emulator used for model training and evaluation\.
## References
- \[1\] N\. Boysen, S\. Fedtke, and F\. Weidinger \(2018\) Optimizing automated sorting in warehouses: the minimum order spread sequencing problem\. European Journal of Operational Research 270\(1\), pp\. 386–400\. External Links: ISSN 0377\-2217, [Document](https://dx.doi.org/10.1016/j.ejor.2018.03.026)\.
- \[2\] N\. Boysen, K\. Stephan, and S\. Schwerdfeger \(2024\) Order consolidation in warehouses: the loop sorter scheduling problem\. European Journal of Operational Research 316\(2\), pp\. 459–472\. External Links: ISSN 0377\-2217, [Document](https://dx.doi.org/10.1016/j.ejor.2024.02.042)\.
- \[3\] J\. C\. Duque\-Jaramillo, J\. M\. Cogollo\-Flórez, C\. G\. Gómez\-Marín, and A\. A\. Correa\-Espinal \(2024\) Warehouse management optimization using a sorting\-based slotting approach\. Journal of Industrial Engineering and Management 17\(1\), pp\. 133–150\. External Links: [Document](https://dx.doi.org/10.3926/jiem.5661)\.
- \[4\] M\. Ghavamzadeh, S\. Mannor, J\. Pineau, and A\. Tamar \(2015\-11\) Bayesian reinforcement learning: a survey\. Foundations and Trends® in Machine Learning 8\(5\-6\), pp\. 359–483\. External Links: [Link](https://arxiv.org/abs/1609.04436), [Document](https://dx.doi.org/10.1561/2200000049)\.
- \[5\] A\. S\. Lan and R\. Baraniuk \(2016\) A contextual bandits framework for personalized learning action selection\. In Proceedings of the 9th International Conference on Educational Data Mining\.
- \[6\] L\. Li, W\. Chu, J\. Langford, and R\. E\. Schapire \(2010\) A contextual\-bandit approach to personalized news article recommendation\. In Proceedings of the 19th International Conference on World Wide Web, New York, NY, USA, pp\. 661–670\. External Links: [Document](https://dx.doi.org/10.1145/1772690.1772758)\.
- \[7\] B\. Liang, L\. Xu, A\. Taneja, M\. Tambe, and L\. Janson \(2024\) Context in public health for underserved communities: a Bayesian approach to online restless bandits\. External Links: [Link](https://arxiv.org/abs/2402.04933), arXiv:2402\.04933\.
- \[8\] S\. Qassimi and S\. Rakrak \(2025\-04\) Multi\-objective contextual bandits in recommendation systems for smart tourism\. Scientific Reports 15\(1\)\. External Links: [Document](https://dx.doi.org/10.1038/s41598-025-89920-2)\.
- \[9\] G\. Siciliano, D\. Braun, K\. Zöls, and J\. Fottner \(2023\-04\) A concept for optimal warehouse allocation using contextual multi\-arm bandits\. In Proceedings of the 25th International Conference on Enterprise Information Systems, pp\. 460–467\. External Links: [Document](https://dx.doi.org/10.5220/0011839700003467), [Link](https://www.researchgate.net/publication/370315477)\.
- \[10\] L\. M\. Sonneborn and F\. S\. Van Vleck \(1964\-01\) The bang\-bang principle for linear control systems\. Journal of the Society for Industrial and Applied Mathematics, Series A: Control 2\(2\), pp\. 151–159\. External Links: [Document](https://dx.doi.org/10.1137/0302013)\.
- \[11\] A\. Tewari and S\. A\. Murphy \(2017\) From ads to interventions: contextual bandits in mobile health\. In Mobile Health, pp\. 495–517\. External Links: ISBN 978\-3\-319\-51393\-5, [Document](https://dx.doi.org/10.1007/978-3-319-51394-2%5F25)\.
- \[12\] H\. Wang, Z\. Liu, Y\. Chen, and X\. Xie \(2025\) LSTM and linear programming\-based optimization for logistics sorting center operations\. In 2025 IEEE 7th International Conference on Communications, Information System and Computer Engineering \(CISCE\), pp\. 862–866\. External Links: [Document](https://dx.doi.org/10.1109/CISCE65916.2025.11064978)\.
- \[13\] Z\. Zhou, N\. Boysen, K\. Stephan, H\. Yu, and Y\. Yu \(2025\) Order consolidation in warehouses with compact 3D sorter modules\. European Journal of Operational Research\. External Links: ISSN 0377\-2217, [Document](https://dx.doi.org/10.1016/j.ejor.2025.12.015), [Link](https://www.sciencedirect.com/science/article/pii/S0377221725009828)\.