STARIXNet: Multivariate and Multi-attribute Deep Learning Approach to Real-Time Resource Allocation in Cloud Platforms

arXiv cs.LG Papers

Summary

STARIXNet is a lightweight neural network that improves cloud resource allocation by capturing multivariate spatio-temporal relationships among system metrics, prioritizing service stability over forecast accuracy. Deployed at Walmart, it achieved 10-50% cost savings while maintaining service reliability.

arXiv:2606.07565v1 Announce Type: new Abstract: Intelligent scaling of microservices in cloud platforms is crucial for mitigating escalating compute costs while avoiding service disruptions. Current solutions are limited to the univariate space, typically focusing on CPU usage alone to drive scaling decisions. Moreover, they address the problem as a purely forecasting task, focusing on prediction precision while neglecting the greater risks of underestimation and delays in system responsiveness. Alternative solutions are computationally complex, making them impractical for large-scale, real-time deployments. To address these challenges, we present STARIXNet, a lightweight neural network that guides resource allocation decisions in the multivariate space by capturing spatio-temporal relationships among multiple system metrics. STARIXNet models multiple quasi-dependent attributes, in particular the (S)easonal, (T)emporal, (A)uto-(R)egressive (I)ntegrated, and e(X)ogenous patterns, then implements an aggregation policy to finalize scaling decisions, prioritizing service stability, followed by cost-efficiency, over raw forecast accuracy. We empirically demonstrate the performance of STARIXNet by benchmarking against existing solutions in real-world settings. STARIXNet is deployed for critical production microservices at Walmart achieving tangible savings ranging from 10\% to 50\%, in addition to intangible benefits through improved service stability and customer experience.


# STARIXNet: Multivariate and Multi-attribute Deep Learning Approach to Real-Time Resource Allocation in Cloud Platforms
Source: [https://arxiv.org/html/2606.07565](https://arxiv.org/html/2606.07565)
###### Abstract\.

Intelligent scaling of microservices in cloud platforms is crucial for mitigating escalating compute costs while avoiding service disruptions\. Current solutions are limited to the univariate space, typically focusing on CPU usage alone to drive scaling decisions\. Moreover, they address the problem as a purely forecasting task, focusing on prediction precision while neglecting the greater risks of underestimation and delays in system responsiveness\. Alternative solutions are computationally complex, making them impractical for large\-scale, real\-time deployments\. To address these challenges, we present STARIXNet, a lightweight neural network that guides resource allocation decisions in the multivariate space by capturing spatio\-temporal relationships among multiple system metrics\. STARIXNet models multiple quasi\-dependent attributes, in particular the \(S\)easonal, \(T\)emporal, \(A\)uto\-\(R\)egressive \(I\)ntegrated, and e\(X\)ogenous patterns, then implements an aggregation policy to finalize scaling decisions, prioritizing service stability, followed by cost\-efficiency, over raw forecast accuracy\. We empirically demonstrate the performance of STARIXNet by benchmarking against existing solutions in real\-world settings\. STARIXNet is deployed for critical production microservices at Walmart achieving tangible savings ranging from 10% to 50%, in addition to intangible benefits through improved service stability and customer experience\.

Horizontal Pod Autoscaling \(HPA\), Multivariate Time Series Forecasting, Spatio\-Temporal Modeling, Lightweight Neural Networks

## 1\. Introduction

Cloud platforms have long relied on intelligent autoscaling to dynamically adjust resources in response to workload fluctuations\. This elasticity is essential for controlling costs \(by releasing idle capacity\) while upholding service reliability under peak loads \(Islam et al\., [2012](https://arxiv.org/html/2606.07565#bib.bib25)\)\. Traditionally, autoscaling strategies are categorized as reactive \(adding or removing resources based on threshold rules\) or proactive \(forecasting future demand and scaling in advance\) \(Lorido\-Botran et al\., [2014](https://arxiv.org/html/2606.07565#bib.bib6)\)\. Reactive mechanisms, such as scaling out when CPU usage exceeds a fixed threshold, are simple but often lag behind sudden load surges, leading to performance degradation or Service Level Agreement \(SLA\) violations \(Chen et al\., [2018](https://arxiv.org/html/2606.07565#bib.bib26)\)\. Conversely, proactive approaches attempt to predict workload trends to preemptively provision capacity \(Islam et al\., [2012](https://arxiv.org/html/2606.07565#bib.bib25)\)\. In theory, proactive scaling can improve responsiveness, but in practice, current implementations suffer notable limitations\.

Most proactive autoscaling systems focus on univariate forecasts, typically tracking a single metric, such as CPU usage or request rate, as the sole indicator of workload \(Lorido\-Botran et al\., [2014](https://arxiv.org/html/2606.07565#bib.bib6)\)\. This narrow view misses the complex multi\-resource dynamics of modern cloud applications\. For example, the performance of an application can be jointly influenced by CPU, memory, network, and I/O; scaling decisions based solely on CPU can become suboptimal if another resource becomes the bottleneck \(Ahuja et al\., [2025](https://arxiv.org/html/2606.07565#bib.bib10)\)\. Relying on one metric can also make autoscaling brittle\. For example, CPU\-only policies may lead to over\-provisioning in response to CPU spikes, even when memory or network resources remain underutilized, resulting in inefficient resource use \(Lorido\-Botran et al\., [2014](https://arxiv.org/html/2606.07565#bib.bib6)\)\. Additionally, many forecast\-based autoscaling solutions update at coarse intervals, ranging from 15 minutes to as long as an hour \(Luo et al\., [2024](https://arxiv.org/html/2606.07565#bib.bib27); Hua et al\., [2023](https://arxiv.org/html/2606.07565#bib.bib35)\), in order to reduce computational costs\. This significantly impedes responsiveness to rapid workload changes\. However, reducing prediction intervals or using high\-frequency metrics to improve agility introduces substantial overhead on the monitoring and scaling infrastructure \(Islam et al\., [2012](https://arxiv.org/html/2606.07565#bib.bib25)\)\. Thus, operators face a trade\-off between responsiveness and system overhead or stability \(Islam et al\., [2012](https://arxiv.org/html/2606.07565#bib.bib25)\)\. Indeed, naive designs reacting to every short\-term fluctuation often lead to oscillations \(thrashing\), where resources are repeatedly added and removed, compromising stability and increasing operational costs\.
In summary, existing univariate forecasting approaches often struggle with slow responsiveness to change, limited view of system state, and costly volatility in scaling behavior\.

This paper introduces STARIXNet, a lightweight spatio\-temporal deep learning solution for real\-time decision\-making, in order to address the aforementioned shortcomings\. Instead of monitoring a single signal, STARIXNet learns a joint model of multiple metrics, including CPU usage, memory, network throughput, and more\. It learns their hidden correlations over time, enabling a holistic view of application load patterns\. Furthermore, STARIXNet provides several forecast options, based on different time series signal attributes and solution objectives\. By employing a lightweight Deep Neural Network \(DNN\) architecture that captures both long\-term trends and short\-term spikes, STARIXNet achieves reliable predictions without the heavy computational footprint of complex generative, recurrent, and/or attention\-based model architectures\. Moreover, STARIXNet moves beyond pure prediction accuracy by tightly integrating the forecasting module with a decision engine that implements an aggregation policy prioritizing stability over point precision\. In principle, the system favors consistent, gradual adjustments informed by spatio\-temporal context, rather than reactive oscillation to every minor forecast deviation\. This policy explicitly balances performance and cost: it reacts promptly to maintain Service Level Agreements \(SLAs\) and meet Service Level Objectives \(SLOs\), while dampening unnecessary resource transitions, a trade\-off often neglected by prior approaches\.

The key contributions of this work are summarized as follows:

Multivariate Lightweight Architecture: We design a deep learning model that fuses multiple resource metrics and external features, learns quasi\-dependent attributes, and captures spatio\-temporal patterns for multivariate workload forecasting\. In contrast to heavy attention\-based models that introduce significant complexity \(Han et al\., [2024](https://arxiv.org/html/2606.07565#bib.bib21)\), our architecture is optimized for real\-time operation with linear scalability in input dimensions\. It learns cross\-metric interactions to anticipate resource needs more accurately than single\-metric predictors, while remaining efficient and deployable\. This design uniquely supports updating decisions at high frequencies and decentralized deployments, potentially as part of a microservice sidecar pattern or an agentic AI solution in future work\.

Customizable Stability\-First Scaling Policy: Our solution employs client\-customizable aggregation and decision\-making policies, with a default policy emphasizing system stability and SLA compliance over naively chasing every predicted fluctuation\. By smoothing and validating forecasting outputs prior to executing scale actions, STARIXNet avoids the rapid oscillations observed in many aggressive autoscaling solutions\. This novel policy addresses the known responsiveness\-reliability trade\-off \(Islam et al\., [2012](https://arxiv.org/html/2606.07565#bib.bib25)\) by ensuring that scaling decisions are robust against transient noise, thus reducing thrashing and long\-term costs\.

Real\-World Deployment at Scale: We report on the implementation of STARIXNet in a large\-scale cloud platform, handling hundreds of microservices across geographically distributed data centers in real time\. The framework was deployed with minimal friction alongside existing orchestration systems, such as Kubernetes, demonstrating its practical integration capabilities\. To our knowledge, this is one of the first real\-time multivariate deep learning autoscaling approaches successfully validated in production at such scale\.

Measurable Performance Improvements: Through extensive experiments, live benchmarking with real traffic, and live A/B testing, we show that STARIXNet achieves significant gains over state\-of\-practice alternatives\. In our post\-deployment analysis, we noted cloud resource cost reductions ranging from 10% to 50%, while simultaneously lowering SLA violation rates compared with the default rule\-based reactive, alternative deep learning, statistical, and univariate solutions\. It outperformed advanced baselines in metrics such as average response time and scaling efficiency\. These improvements underscore the value of coordinated multivariate learning and a stability\-aware policy in real\-time resource management\.

The remainder of this paper is structured as follows: Section [2](https://arxiv.org/html/2606.07565#S2) summarizes our findings from relevant literature\. Section [3](https://arxiv.org/html/2606.07565#S3) describes our solution in greater depth and in mathematical notation\. Section [4](https://arxiv.org/html/2606.07565#S4) discusses our experimental setups covering live benchmarking, client simulations, and post\-client onboarding evaluations\. Section [5](https://arxiv.org/html/2606.07565#S5) summarizes the observed results and impacts from both experiments and live, onboarded, critical microservices, as well as lessons learned and practical implications\. Finally, we conclude this work in Section [6](https://arxiv.org/html/2606.07565#S6)\.

## 2\. Background

Early cloud autoscaling mechanisms predominantly used simple threshold\-based rules or queuing theory formulas to trigger scaling actions\. Commercial and open\-source cloud platforms, such as AWS Auto Scaling and Kubernetes Horizontal Pod Autoscaling \(HPA\), typically allow users to set static upper and lower bounds on a metric like average CPU usage or request queue length\. Resources are added or removed when the metric crosses these thresholds \(Lorido\-Botran et al\., [2014](https://arxiv.org/html/2606.07565#bib.bib6); Mao and Humphrey, [2010](https://arxiv.org/html/2606.07565#bib.bib9)\)\. Such rule\-based autoscaling methods are straightforward and react in real time, but require careful tuning and often perform sub\-optimally in dynamic environments \(Lorido\-Botran et al\., [2014](https://arxiv.org/html/2606.07565#bib.bib6)\), because they respond reactively rather than proactively and neglect pod startup delays\. Choosing appropriate thresholds and cool\-down durations requires expert knowledge of each application’s workload patterns \(Lorido\-Botran et al\., [2014](https://arxiv.org/html/2606.07565#bib.bib6)\)\.
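To make the threshold mechanism concrete, the following minimal Python sketch applies static upper and lower CPU bounds with a cool-down period. The names `ThresholdPolicy` and `reactive_scale`, and all default values, are illustrative assumptions; they are not the configuration or API of AWS Auto Scaling or Kubernetes HPA.

```python
from dataclasses import dataclass


@dataclass
class ThresholdPolicy:
    # Hypothetical defaults chosen only for illustration.
    upper: float = 0.80        # scale out above this average CPU utilization
    lower: float = 0.30        # scale in below this average CPU utilization
    cooldown_steps: int = 3    # observations to skip after any scaling action
    min_replicas: int = 1
    max_replicas: int = 10


def reactive_scale(cpu_series, policy, replicas=2):
    """Return the replica count after each CPU observation."""
    history = []
    cooldown = 0
    for cpu in cpu_series:
        if cooldown > 0:
            # Still cooling down: ignore the metric for this step.
            cooldown -= 1
        elif cpu > policy.upper and replicas < policy.max_replicas:
            replicas += 1
            cooldown = policy.cooldown_steps
        elif cpu < policy.lower and replicas > policy.min_replicas:
            replicas -= 1
            cooldown = policy.cooldown_steps
        history.append(replicas)
    return history
```

Because the rule acts only after the metric has crossed a bound, and then waits out the cool-down, added capacity arrives after a surge has already begun. This is the reactive lag described above.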

Recent approaches to dynamic resource allocation, such as those in \(Chen, [2023](https://arxiv.org/html/2606.07565#bib.bib1); Chen et al\., [2023](https://arxiv.org/html/2606.07565#bib.bib2); Mello et al\., [2017](https://arxiv.org/html/2606.07565#bib.bib3)\), employ distributed orchestration and consensus algorithms to ensure rapid failure recovery by focusing on stateless application management, although full microservice decomposition remains constrained by system limitations in these designs\. The authors in \(Balla et al\., [2020](https://arxiv.org/html/2606.07565#bib.bib4)\) proposed an alternative to HPA, which first vertically optimizes the resources for a given pod, then uses indicators from a virtual environment to adapt resource definitions and adjust horizontal scaling accordingly, without requiring user\-assigned parameters\. However, these solutions remain reactive and limited to tracking a single workload signal, the CPU usage\.

Recent advances in machine learning and AI have led to a paradigm shift from traditional rule\-based resource allocation to intelligent, predictive frameworks\. Time series analysis frameworks are among the pioneering works in addressing the reactive lag of threshold methods by predicting future request rates and proactively scaling accordingly, thus improving SLA adherence in comparison to reactive rules \(Calheiros and Buyya, [2015](https://arxiv.org/html/2606.07565#bib.bib7); Li and Xia, [2011](https://arxiv.org/html/2606.07565#bib.bib8)\)\. However, these methods often assume stationarity and homoscedasticity, and most remain univariate despite the prevalence of multidimensional workloads in modern services \(Lorido\-Botran et al\., [2014](https://arxiv.org/html/2606.07565#bib.bib6)\)\. To overcome these limitations, researchers have explored more flexible machine learning models, including transformer\- and attention\-based architectures \(Mao and Humphrey, [2010](https://arxiv.org/html/2606.07565#bib.bib9); Ahuja et al\., [2025](https://arxiv.org/html/2606.07565#bib.bib10)\)\. A notable example is Adaptive Horizontal Pod Autoscaling \(AHPA\), which enhances Kubernetes autoscaling under fluctuating business demand by combining decomposition\-based time series forecasting with performance modeling in Alibaba Cloud Container Services \(Zhou et al\., [2023](https://arxiv.org/html/2606.07565#bib.bib5)\)\. Similarly, \(Rubak and Taheri, [2023](https://arxiv.org/html/2606.07565#bib.bib13)\) propose a machine learning\-driven solution to the challenges of over\- and under\-provisioning in Kubernetes\. Their approach leverages classical models, including linear regression, support vector machines, and Multi\-Layer Perceptron \(MLP\) neural networks to predict resource requirements based on anticipated user demand, enhancing both service quality and cost efficiency beyond what standard HPA can offer\.
However, while achieving robustness during seasonality fluctuations, these solutions remain questionable in addressing inconsistent or irregular workload patterns\.

Recurrent Neural Networks \(RNNs\), particularly Long Short\-Term Memory \(LSTM\) and Gated Recurrent Unit \(GRU\) architectures, have become widely adopted for modeling temporal patterns \(Lai et al\., [2018](https://arxiv.org/html/2606.07565#bib.bib18)\)\. The work in \(Prachitmutita et al\., [2018](https://arxiv.org/html/2606.07565#bib.bib34)\) utilized LSTM–MLP hybrid models to forecast web traffic, bi\-directional LSTMs were used in \(Yan et al\., [2021](https://arxiv.org/html/2606.07565#bib.bib20)\), and \(Ouhame et al\., [2021](https://arxiv.org/html/2606.07565#bib.bib19)\) proposed hybrid Convolutional Neural Network \(CNN\) and LSTM \(CNN–LSTM\) architectures to capture spatial and temporal dependencies in cloud workloads\. However, these solutions have mostly addressed univariate workload signals and are characterized by high computational complexity, limiting wide\-scale adoption in real\-time settings\.

The authors in \(Toka et al\., [2020](https://arxiv.org/html/2606.07565#bib.bib16)\) employed multiple machine learning models in a short\-term evaluation loop, allowing them to compete under a "winner\-takes\-all" fusion strategy\. Their multi\-step\-ahead predictive scaling engine, evaluated through both simulations and real\-world web traces, showcased improved adaptability and efficiency\. They also introduced a compact management parameter to help balance resource provisioning with adherence to SLA\.

More recently, Graph Neural Networks \(GNNs\) were utilized in \(Luo et al\., [2024](https://arxiv.org/html/2606.07565#bib.bib27)\), where the authors introduced a novel spatial\-temporal graph neural network that models dynamic microservice interactions, workload periodicities, and system states, in order to enhance resource prediction and efficiency in microservice\-based applications\. While their method achieves notable gains in accuracy and real\-world resource savings by comprehensively modeling spatio\-temporal dependencies, it prioritizes prediction precision over computational speed, which may limit responsiveness in latency\-critical scenarios\. Furthermore, forecast precision metrics in isolation are misleading for the purpose of evaluating microservice autoscaling solutions, where workload underestimation is far more costly than overestimation\.

Another line of research explored reinforcement learning \(RL\) for autoscaling \(Agarwal et al\., [2024](https://arxiv.org/html/2606.07565#bib.bib15); Tesauro and Jong, [2007](https://arxiv.org/html/2606.07565#bib.bib11)\)\. Researchers in \(Xu et al\., [2020](https://arxiv.org/html/2606.07565#bib.bib12)\) applied deep RL to learn scaling policies optimizing SLA\-cost trade\-offs, and in \(Xue et al\., [2022](https://arxiv.org/html/2606.07565#bib.bib17)\), the authors proposed meta model\-based RL for better generalization to unseen workloads\. However, RL faces challenges in sample efficiency, stability, and deployment complexity, especially in non\-stationary cloud environments \(Xue et al\., [2022](https://arxiv.org/html/2606.07565#bib.bib17)\)\.

While the aforementioned solutions target predictive autoscaling, they often suffer from architectural complexity, platform specificity, and lack of large\-scale real\-world validation \(Lorido\-Botran et al\., [2014](https://arxiv.org/html/2606.07565#bib.bib6); Calheiros and Buyya, [2015](https://arxiv.org/html/2606.07565#bib.bib7); Lai et al\., [2018](https://arxiv.org/html/2606.07565#bib.bib18); Ouhame et al\., [2021](https://arxiv.org/html/2606.07565#bib.bib19); Yan et al\., [2021](https://arxiv.org/html/2606.07565#bib.bib20)\)\. Furthermore, key research gaps remain in multivariate modeling, stability\-aware scaling policies, real\-time and high\-frequency revisions, and operational deployment at scale \(Lorido\-Botran et al\., [2014](https://arxiv.org/html/2606.07565#bib.bib6)\)\.

STARIXNet addresses these challenges by introducing a robust, lightweight, real\-time, multivariate deep learning framework with a stability\-first policy, validated through large\-scale deployment at Walmart\. This work bridges the gap between academic research and practical and scalable autoscaling solutions\.

## 3\. Methodology

In this section, we present STARIXNet in detail\. STARIXNet is a DNN architecture combining four encoder\-decoder modules, which share input features while utilizing different structures to model various attributes of the multivariate time series\. The modeled time series targets describe workload in the multivariate space, including metrics critical to driving scaling decisions, such as CPU usage, memory, network throughput, request rate, or any other combination of user\-specified metrics simultaneously\. The attributes decoded and predicted by these modules are described as follows:

Seasonal: Forecasts reflecting a Fourier series representation of the target variables\.

Temporal: Forecasts reflecting acute and repetitive seasonal patterns, such as those due to planned events \(e\.g\., maintenance or stress tests\)\.

AutoRegressive\-Integrated: Forecasts from the sequence\-to\-sequence \(seq2seq\) modeling of the multivariate targets, reflecting the short\-term trends and dynamics\.

eXogenous: Forecasts based on the spatio\-temporal dependencies with external features, such as system or business features\.

We derived the name STARIXNet by combining the initials of the aforementioned attributes with the word "Network"\. Moreover, exogenous modules can be replicated for independent sets of external features, allowing the number of modules to be greater than four\. For example, a microservice’s system metrics, such as latency, utilization, errors, and network metrics, can represent one group of features requiring one exogenous module, while business metrics, such as Orders Per Minute \(OPM\) and user sign\-in rate, can represent another group, and upstream microservices’ traffic can be a third group requiring a third exogenous module\.

We summarize the nomenclature adopted in this section in Table [1](https://arxiv.org/html/2606.07565#S3.T1)\. For simplicity, we use $\Theta$ and $\acute{\Theta}$ as generic symbols of any DNN function, and $\boldsymbol{\Theta}$ and $\boldsymbol{\acute{\Theta}}$ as the sets of these DNN functions, while omitting typically redundant parameters \(e\.g\., ReLU, linear layer weights, biases, depth, etc\.\), with the exception of parameters particular to our methods\. Similarly, we use $K$, $L$, and $Z$ in common notation form; however, the actual values of these parameters are different for each module or layer described in the following subsections\.

Table 1\. Nomenclature

We illustrate the framework of STARIXNet schematically in Fig\. [1](https://arxiv.org/html/2606.07565#S3.F1), and in the following subsections we explain and present the mathematical formulation of the DNN’s modules\.

![Refer to caption](https://arxiv.org/html/2606.07565v1/figures/e2e.png)

Figure 1\. STARIXNet training and inference framework\. End\-to\-end architecture of the STARIXNet DNN with inference step\.

### 3\.1\. Decoding the Seasonal Representation

We designed a module in STARIXNet that learns the Fourier representation of the $2$\-dimensional $Y$, by encoding the 1\-dimensional epoch timestamps, $T_{t+1:t+J}$, using $S$ linear layers with sinusoid activations, where optimized weights represent $\boldsymbol{\alpha}$, $\boldsymbol{\omega}$, and $\boldsymbol{\beta}$, and reconstructs the future sequence, $Y_{t+1:t+J}$, accordingly\. We formulate this module as follows:

$$
\acute{Y}_{t+1:t+J,\acute{\Theta}}=\acute{\Theta}\big(\Theta(\boldsymbol{\alpha},\boldsymbol{\omega},\boldsymbol{\beta},T)+\boldsymbol{\gamma}\big) \tag{1}
$$

$$
\Theta(\boldsymbol{\alpha},\boldsymbol{\omega},\boldsymbol{\beta},T)=\sum_{s\in S}\boldsymbol{\alpha}_{s}\circ\sin(\boldsymbol{\omega}_{s}T_{t+1:t+J}+\boldsymbol{\beta}_{s}) \tag{2}
$$

Although the DNN, $\Theta$, can independently learn appropriate weights for $\boldsymbol{\omega}$ through backpropagation, we leverage prior knowledge of the modeled targets to initialize and freeze $\boldsymbol{\omega}$, thus enhancing training efficiency\. These priors include values representing daily and weekly frequencies \(in radians\), as well as values obtained from spectral analysis with Fast Fourier Transform \(FFT\) \(Welch, [2003](https://arxiv.org/html/2606.07565#bib.bib28)\) on $Y$ prior to training the DNN, similar to the approach proposed in \(Abdulaal et al\., [2021](https://arxiv.org/html/2606.07565#bib.bib29)\)\.
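As a hedged illustration of how frequency priors could be read off a spectrum, the sketch below picks the strongest angular frequencies of a univariate series with a naive discrete Fourier transform. The function name `dominant_angular_frequencies` and its parameters are assumptions made for this example; the quadratic-time loop stands in for an FFT routine only to keep the example dependency-free.

```python
import cmath
import math


def dominant_angular_frequencies(y, sample_period, top_k=2):
    """Return the top_k angular frequencies (radians per time unit) by spectral power.

    y             : equally spaced observations of one target metric
    sample_period : spacing between observations, in the desired time unit
    """
    n = len(y)
    mean = sum(y) / n
    centered = [v - mean for v in y]  # remove the constant (zero-frequency) component
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(centered)
        )
        power.append((abs(coeff) ** 2, k))
    power.sort(reverse=True)
    # Frequency bin k corresponds to k cycles over the n * sample_period window.
    return [2 * math.pi * k / (n * sample_period) for _, k in power[:top_k]]
```

The returned values could then be used to initialize the frozen $\boldsymbol{\omega}$ entries, alongside the fixed daily and weekly frequencies.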

The outputs of this module serve to represent the smoothed\-periodic patterns while disregarding real\-time deviations in other features\. Consequently, it adds a layer of resilience to the inference service because it does not depend on the accuracy or consistency of any external metrics\.
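The encoder term in Eq. (2) can be sketched for a single scalar target as follows. This is a simplified illustration under stated assumptions: the decoder $\acute{\Theta}$ is omitted, the amplitudes and phases are passed in rather than learned, timestamps are assumed to be epoch seconds, and the names `seasonal_features`, `DAILY_OMEGA`, and `WEEKLY_OMEGA` are hypothetical.

```python
import math

# Fixed frequency priors in radians per second (timestamps assumed to be epoch seconds).
DAILY_OMEGA = 2 * math.pi / 86_400
WEEKLY_OMEGA = 2 * math.pi / (7 * 86_400)


def seasonal_features(timestamps, alphas, omegas, betas, gamma=0.0):
    """Scalar form of Eq. (2) plus the bias gamma.

    For each timestamp T, returns sum_s alpha_s * sin(omega_s * T + beta_s) + gamma.
    """
    return [
        sum(a * math.sin(w * t + b) for a, w, b in zip(alphas, omegas, betas)) + gamma
        for t in timestamps
    ]
```

Since the only input is the timestamp, this forecast can be produced even when every other metric feed is missing or delayed, which is the resilience property noted above.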

### 3\.2\. Decoding the Temporal Representation

We designed another module that learns a more discrete, periodic representation of the $2$\-dimensional $Y$, by encoding the $1$\-dimensional epoch timestamps, $T_{t+1:t+J}$, using embedding learning with a weight matrix $E$, such that the epoch timestamps are first transformed into a finite vocabulary set of size $|V|$ using a modulo operation and mapped into an embedding vector of length $d$; the module then reconstructs the future sequence, $Y_{t+1:t+J}$, accordingly\. We formulate this DNN module as follows:

$$
\acute{Y}_{t+1:t+J,\acute{\Theta}}=\acute{\Theta}(\Theta(E,T)) \tag{3}
$$

$$
\Theta(E,T)=E\big[T_{t+1:t+J}\ \%\ |V|\big] \tag{4}
$$

The forecast outputs of this module serve to represent the discrete periodic pattern, while allowing for acute deviations, such as those due to predictable events \(e\.g\., maintenance, stress tests, or promotional events\)\. Similarly to the seasonal representation module, this module disregards real\-time deviations in other features, thus adding an additional layer of resilience to the service\.
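The lookup in Eq. (4) can be sketched as below, assuming the timestamps have already been converted to integer step indices (for example, minutes since the epoch). The function names and the toy embedding matrix are hypothetical; in the actual module $E$ is learned and feeds the decoder $\acute{\Theta}$.

```python
def temporal_tokens(step_indices, vocab_size):
    """Map integer time steps onto a finite vocabulary with the modulo operation."""
    return [t % vocab_size for t in step_indices]


def embed(step_indices, embedding_matrix):
    """Eq. (4): look up one embedding row of length d per forecast step.

    embedding_matrix has |V| rows; its row count defines the vocabulary size.
    """
    vocab_size = len(embedding_matrix)
    return [embedding_matrix[tok] for tok in temporal_tokens(step_indices, vocab_size)]
```

Each slot in the cycle gets its own free parameters, so a spike that always occurs in the same slot (such as a scheduled stress test) can be represented without needing a smooth sinusoid to fit it.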

### 3\.3\. Autoregressive\-Integrated Pattern Learning

Inspired by popular statistical state\-space models such as ARIMA \(Box et al\., [2015](https://arxiv.org/html/2606.07565#bib.bib30); Calheiros et al\., [2014](https://arxiv.org/html/2606.07565#bib.bib31)\), we covered the autoregressive behavior in the DNN by encoding the multivariate input $Y_{t-1:t-I,\acute{\Theta}}$ using matrix differencing and CNN operations, aiming to capture short\-term spatio\-temporal dependencies\. The decoder layers then reconstruct the future sequence, $Y_{t+1:t+J}$, from the feature maps\. We formulate this DNN module as follows:

$$
\acute{Y}_{t+1:t+J,\acute{\Theta}}=\Delta^{-1}\acute{\Theta}(\Theta(K,Y)) \tag{5}
$$

$$
\Theta(K,Y)=\big\{\mathrm{Pool}\big(\mathrm{Conv1D}(K^{(l)},Z^{(l-1)})\big)\big\}\qquad\forall l\in L \tag{6}
$$

$$
Z^{(l-1)}=\Delta Y_{t-I:t-1}\qquad\forall l=1 \tag{7}
$$

This module serves to capture and forecast short\-term, non\-periodic, fluctuations, specifically with respect to steep upward or downward shifts in trend, such as those resulting from traffic failover across regions, DDoS attacks, social or promotional events, and many other factors\.
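The differencing operator $\Delta$ in Eq. (7) and its inverse $\Delta^{-1}$ in Eq. (5) can be sketched as follows. The choice to anchor the inverse at the last observed row is an assumption of this example, since the anchoring is not specified here, and the function names are hypothetical.

```python
def difference(rows):
    """First-order difference along time for a multivariate series.

    rows is a list of time steps, each a list with one value per target metric.
    """
    return [[b - a for a, b in zip(prev, cur)] for prev, cur in zip(rows, rows[1:])]


def integrate(last_observed, diff_rows):
    """Inverse differencing: cumulatively add predicted deltas to the last observed row."""
    out = []
    level = list(last_observed)
    for delta in diff_rows:
        level = [l + d for l, d in zip(level, delta)]
        out.append(level)
    return out
```

Working on differences lets the encoder see changes rather than levels, which suits the steep upward or downward shifts this module targets.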

### 3\.4\. Modeling Exogenous Features Influence

Moreover, we include a group of DNN modules to model $Y_{t+1:t+J}$ as functions of unique sets of exogenous regressors, $X_{t-I:t-1}^{r}$\. We encoded the past sequence $X_{t-I:t-1}^{r}$ using CNN operations to capture spatio\-temporal dependencies, then reconstructed the future sequence $Y_{t+1:t+J}$ from the feature maps\. We formulate the $r$th DNN module as follows:

$$
\acute{Y}_{t+1:t+J,\acute{\Theta}}=\acute{\Theta}(\Theta(K,X^{r})) \tag{8}
$$

$$
\Theta(K,X^{r})=\big\{\mathrm{Pool}\big(\mathrm{ReLU}\big(\mathrm{Conv1D}(K^{(l)},Z^{(l-1)})\big)\big)\big\}\qquad\forall l\in L \tag{9}
$$

$$
Z^{(l-1)}=X_{t-I:t-1}^{r}\qquad\forall l=1 \tag{10}
$$

These modules serve to model and forecast potential fluctuations in the resource targets, in response to recent deviations in external groups of regressors, including system metrics, business metrics, or other interdependent microservices’ traffic\. For example, spikes or dips in top\-of\-the\-funnel traffic features may indicate a potential change in the modeled microservice’s load levels\. In another example, an upward or downward trend in business metrics, such as the OPM or sign\-in rate, may propagate to the modeled microservice’s load levels\. These modules aim to model such behaviors and more\.
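One encoder layer of Eq. (9) can be sketched in plain Python as a 1D convolution over time that sums across input channels, followed by ReLU and max pooling. This is a minimal single-filter illustration; the kernel values, the pool size, and the function names are assumptions, and a trained module would use many filters and layers.

```python
def conv1d_valid(channels, kernel):
    """Cross-correlate C input channels with a C x k kernel, summing over channels.

    channels : list of C series, each of length I (one per exogenous metric)
    kernel   : list of C weight lists, each of length k
    """
    k = len(kernel[0])
    length = len(channels[0]) - k + 1
    return [
        sum(
            kernel[c][j] * channels[c][i + j]
            for c in range(len(channels))
            for j in range(k)
        )
        for i in range(length)
    ]


def relu(xs):
    return [max(0.0, x) for x in xs]


def max_pool(xs, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]


def encode_layer(channels, kernel, pool_size=2):
    """Pool(ReLU(Conv1D(K, Z))) for one layer and one output filter."""
    return max_pool(relu(conv1d_valid(channels, kernel)), pool_size)
```

Summing over channels inside the convolution is what lets a single filter respond to joint movements across metrics, while sliding along the time axis captures the temporal part of the dependency.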

### 3\.5\. Multi\-Objective Loss Optimization

One advantage of our DNN design, which outputs separate multivariate forecasts covering different attributes of the multivariate workload time series, is the enhanced user\-flexibility of utilizing diverse loss functions or loss function parameters, tailored to the varying preferences of different microservices’ owners\. For example, a more risk\-averse client might assign a higher penalty to underestimation than to overestimation when optimizing the loss functions for the seasonal and temporal forecasts\. In contrast, a more risk\-tolerant client may prefer more\-balanced loss function parameters across all outputs\. In another example, owners of microservices characterized by a high degree of noise, such as frequent momentary CPU spikes, may prefer reduced sensitivity in the autoregressive or some of the exogenous forecasts, in reaction to such noise\. Additionally, users have the flexibility to adjust the weighting of the objectives, allowing for deviations from a strictly balanced multi\-objective configuration\. Accordingly, we provided, as a default configuration, a multi\-objective quantile loss function for training the DNN, due to the flexibility of customizing quantile values for each $\acute{\theta}$ output and user\. We express this multi\-objective loss as follows:

$$
\mathscr{L}(Y,\acute{Y})=\sum_{t}\sum_{\acute{\theta}\in\boldsymbol{\acute{\theta}}}\mathscr{w}_{\acute{\theta}}\bigg(\max\big[q_{\acute{\theta}}(\acute{Y}_{t+1:t+J,\acute{\theta}}-Y_{t+1:t+J}),\ (q_{\acute{\theta}}-1)(\acute{Y}_{t+1:t+J,\acute{\theta}}-Y_{t+1:t+J})\big]\bigg) \tag{11}
$$

where, by default, we set $\mathscr{w}_{\acute{\theta}}=1$ and $q_{\acute{\theta}}=0.5$ for every $\acute{\theta}$ unless the client requests a customization or depending on the unique characteristics of the onboarded microservice\.

### 3\.6\. Inference and Dynamic Scaling Policies

In previous subsections, we discussed STARIXNet’s architecture, parameters, and training scheme\. In this subsection we briefly demonstrate basic inference and post\-processing steps for translating STARIXNet’s output into real\-time scaling decisions\. However, we must note that the exact implementation at Walmart involves many additional engineering and processing steps that are beyond the scope of this paper\.

#### 3\.6\.1\. Aggregation along the $t$\-Dimension

Another operational advantage of our approach, distinguishing it from prior reactive or point-based forecast solutions, is its use of a seq2seq framework to improve resilience and minimize the risk of delayed responsiveness. This is particularly critical in HPA problems, where pod start-up time and load-rebalancing delays, along with risks such as missing, delayed, or failed input signals, can significantly impair responsiveness or lead to underestimation. Therefore, while inference executes at a near-real-time cadence, such as every 30 seconds, the forecasts obtained from each run instance span an extended horizon of length $J$. For example, Fig. [2](https://arxiv.org/html/2606.07565#S3.F2) shows a single run instance’s raw inputs and forecast outputs from a STARIXNet DNN with 5 modules and multivariate workload targets of size 2. The figure demonstrates that each update yields 15-step-ahead forecasts with different suggestions. Accordingly, we achieve risk minimization through maximum aggregation along the $t$-dimension:

$$
\acute{Y}_{t,\acute{\theta}} = \max_{t}\big(\acute{Y}_{t+1:t+J,\acute{\theta}}\big)
\tag{12}
$$

![Sample plotted input and output workload targets](https://arxiv.org/html/2606.07565v1/figures/seq2seq.png)

Figure 2. Input (lagged) and output (forecasted) targets.

Furthermore, during both training and inference, we drop the signal values from the current runtime step, $t=0$, as reflected in the aforementioned nomenclature. This design choice minimizes the risk of processing incomplete or partially available data. In other words, at time $t=0$, STARIXNet takes inputs from the period $[t-I:t-1]$ and yields forecasts for the period $[t+1:t+J]$, deliberately ignoring potentially noisy readings obtained from $t=0$.
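The windowing and the horizon aggregation of Eq. (12) can be sketched in a few lines of Python. This is a minimal illustration for a single workload metric, not the production code; `input_window`, `aggregate_over_horizon`, and the toy values are hypothetical names and data.

```python
from typing import Dict, List, Sequence


def input_window(series: Sequence[float], now: int, lags: int) -> List[float]:
    """Return readings for steps [now - lags, now - 1]; the reading at `now` is dropped."""
    if now - lags < 0:
        raise ValueError("not enough history for the requested number of lags")
    return list(series[now - lags:now])


def aggregate_over_horizon(forecasts: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Eq. (12): maximum over the J forecast steps, kept separately for each output."""
    return {name: max(steps) for name, steps in forecasts.items()}


history = [float(v) for v in range(10)]        # toy signal, one reading per step
print(input_window(history, now=5, lags=3))    # steps 2, 3, 4; step 5 is skipped

horizon_forecasts = {                          # J = 3 steps for one workload metric
    "seasonal": [3.0, 4.5, 4.0],
    "temporal": [2.0, 2.5, 5.5],
}
print(aggregate_over_horizon(horizon_forecasts))
```

Taking the maximum means that a peak anywhere in the next $J$ steps is provisioned for immediately, which trades some cost for protection against pod start-up delays.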

#### 3\.6\.2\.Aggregation along the $\acute{\theta}$\-Dimension

The next step in post-processing involves combining the different types of forecasts. In this step, different users may elect to apply custom aggregation logic, as previously indicated. For example, a $q$-th quantile aggregation with a high $q$ may suit risk-averse users. Alternatively, users may set custom rules or triggers, potentially guided by additional signals, to situationally switch between the different forecast types. However, the details of such strategies are beyond the scope of this paper. For simplicity, we present a $q$-th quantile aggregation method:

$$
\acute{Y}_{t} = \underset{\acute{\theta}\in\boldsymbol{\acute{\theta}}}{\operatorname{quantile}}\big(q, \acute{Y}_{t+1:t+J,\acute{\theta}}\big)
\tag{13}
$$

![Sample plotted model output against the actual and the aggregated results](https://arxiv.org/html/2606.07565v1/figures/median.png)

Figure 3. Median aggregation of the multivariate forecasts.

In Fig. [3](https://arxiv.org/html/2606.07565#S3.F3) we show an example with $q=0.5$ (i.e., median) across 5 forecast outputs for 2 workload metrics. The figure demonstrates the smoothing effect of this aggregation strategy, resulting in robustness to signal noise, without compromising responsiveness to meaningful fluctuations.
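A possible reading of Eq. (13) for one workload metric is sketched below, applied to values that have already been reduced along the $t$-dimension. The paper does not state how the quantile is interpolated; linear interpolation between order statistics is an assumption here, and the output names and values are hypothetical.

```python
from typing import Dict, Sequence


def quantile(values: Sequence[float], q: float) -> float:
    """q-th quantile with linear interpolation between order statistics (assumed)."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)                              # floor, since pos >= 0
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def aggregate_over_outputs(per_output: Dict[str, float], q: float) -> float:
    """Eq. (13): combine the per-output forecasts of one metric into one value."""
    return quantile(list(per_output.values()), q)


# Hypothetical t-aggregated forecasts from five outputs for one metric.
per_output = {"seasonal": 4.5, "temporal": 5.5, "ar": 3.0, "exog_1": 6.0, "exog_2": 5.0}
print(aggregate_over_outputs(per_output, q=0.5))   # median
print(aggregate_over_outputs(per_output, q=1.0))   # most conservative choice
```

Setting `q` closer to 1 follows the highest forecasts and therefore suits risk-averse users, while the median discards isolated extremes in either direction.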

#### 3\.6\.3\.Calculating the Number of Pods

To compute the number of pods corresponding to each forecasted target value in $Y_{t}$, we divide by the product of the per-pod resource capacity (i.e., the pod profile) and the corresponding utilization target. This results in $N$ solutions for the pod count recommendations. We then take the maximum of the $N$ recommendations, since a lower value would risk violating at least one of the utilization targets. We summarize this step mathematically, in vectorized form, as:

$$
\mathit{Pods}_{t} = \max\big(\acute{Y}_{t} \oslash (C \circ U)\big)
\tag{14}
$$
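Eq. (14) can be illustrated with a short sketch. The capacities, targets, and forecasts below are hypothetical, and rounding up to a whole pod is an assumption, since the equation itself does not show a rounding step.

```python
import math
from typing import Sequence


def pods_required(
    forecast: Sequence[float],
    capacity: Sequence[float],
    utilization_target: Sequence[float],
) -> int:
    """Eq. (14): element-wise forecast / (capacity * target), then the maximum.

    The ceiling is an assumption added so the result is a whole number of pods.
    """
    if not (len(forecast) == len(capacity) == len(utilization_target)):
        raise ValueError("all vectors must have one entry per workload metric")
    ratios = [y / (c * u) for y, c, u in zip(forecast, capacity, utilization_target)]
    return math.ceil(max(ratios))


# Hypothetical example with N = 2 metrics: CPU cores and requests per second.
forecast = [5.2, 180.0]            # aggregated forecasts
capacity = [2.0, 100.0]            # per-pod capacity (the pod profile)
utilization_target = [0.25, 0.5]   # desired utilization of that capacity

print(pods_required(forecast, capacity, utilization_target))
```

In this example CPU is the binding metric (5.2 / 0.5 = 10.4 pods versus 3.6 for requests), so scaling to the request-based recommendation alone would violate the CPU target.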

#### 3\.6\.4\.Applying Stabilization Measures and Driving HPA

The final post-processing steps are to ensure that fluctuations in $Y_{t}$ over time do not risk system instability, and then to expose the revised pod counts via Prometheus to drive HPA for the users’ microservices using the `autoscaling/v2` Kubernetes API and its custom metric scaling method. For the former step, we apply a short stabilization window to prevent successive downscaling events within a short period of time, as well as a separate, very short stabilization window to avoid immediate upscaling events when a recent upscale has exceeded a pre-specified percentage threshold. The full implementation at Walmart involves many additional heuristics and engineering safeguards, which are beyond the scope of this paper.
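One way to read the two stabilization windows is sketched below. This is an illustrative interpretation only: the class name, parameter names, and window lengths are hypothetical, and the production safeguards are not described in the paper.

```python
from typing import Optional


class ScaleStabilizer:
    """Holds the pod count steady inside two hypothetical stabilization windows."""

    def __init__(self, down_window_s: float, up_hold_s: float,
                 up_jump_pct: float, initial_pods: int) -> None:
        self.down_window_s = down_window_s    # minimum gap between downscales
        self.up_hold_s = up_hold_s            # pause after a large upscale
        self.up_jump_pct = up_jump_pct        # what counts as a large upscale
        self.current = initial_pods
        self._last_downscale_at: Optional[float] = None
        self._last_big_upscale_at: Optional[float] = None

    def step(self, now_s: float, recommended: int) -> int:
        """Return the pod count to expose at time `now_s` (seconds)."""
        if recommended < self.current:
            recent = (self._last_downscale_at is not None
                      and now_s - self._last_downscale_at < self.down_window_s)
            if not recent:
                self._last_downscale_at = now_s
                self.current = recommended
        elif recommended > self.current:
            holding = (self._last_big_upscale_at is not None
                       and now_s - self._last_big_upscale_at < self.up_hold_s)
            if not holding:
                jump_pct = 100.0 * (recommended - self.current) / max(self.current, 1)
                if jump_pct > self.up_jump_pct:
                    self._last_big_upscale_at = now_s
                self.current = recommended
        return self.current


stabilizer = ScaleStabilizer(down_window_s=300, up_hold_s=60,
                             up_jump_pct=50, initial_pods=10)
trace = [stabilizer.step(t, pods) for t, pods in
         [(0, 20), (30, 25), (90, 25), (120, 18), (200, 12), (450, 12)]]
print(trace)
```

In the trace, the jump from 10 to 20 pods exceeds the 50% threshold, so the request for 25 pods at 30 s is deferred until the hold expires. Similarly, the second downscale request at 200 s is suppressed because it falls within 300 s of the previous downscale.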

## 4\.Experimental Setups

In this section, we describe the setup of our benchmark study, conducted to evaluate the practical advantages and performance of STARIXNet against popular and State-Of-The-Art (SOTA) forecasting models. We also briefly describe how we evaluate impact and performance against each client’s baseline, both before and after onboarding actual customer microservices.

### 4\.1\.Live Benchmarking in Production Environment

#### 4\.1\.1\.System

To establish confidence, we avoided offline evaluations using artificial data or simulations. Instead, we leveraged the Walmart Cloud Native Platform (WCNP) to deploy identical clones of a Java-based microservice across multiple datacenter clusters and regions. We implemented a model-agnostic inference service and used a multi-instance deployment strategy, in which each cloned microservice is served by a separate inference service instance configured to load a different forecasting model. This multi-instance deployment enabled us to collect and compare system performance metrics, such as CPU usage, memory usage, and inference latency (50th and 95th percentiles), for each loaded model type. For a visual illustration of the benchmark setup, readers can refer to Subsection [B.1](https://arxiv.org/html/2606.07565#A2.SS1) of the appendix. We note that this benchmark setup differs from the centralized inference service configuration used for actual customer microservices.

#### 4\.1\.2\.Load

To generate the load, we implemented a distributed traffic generator application that mirrors real traffic. Distributed instances of the traffic generator send identical requests to all cloned microservices for processing. This setup allows us to consistently simulate production-like load across services and to easily test failover and scaling scenarios. For fairness, all pods’ CPU, memory, and HPA bounds (i.e., minimum and maximum pod replicas) were set identically across all microservices and traffic generators. We also standardized the inference service architecture and CPU allocation, with the exception of memory, since some benchmark models required significantly more RAM to load.

#### 4\.1\.3\.Models

We benchmarked the live performance of STARIXNet against traditional univariate statistical models as well as multivariate deep learning and SOTA models. For statistical models, we selected ARIMA (Box et al., [2015](https://arxiv.org/html/2606.07565#bib.bib30)), consistent with the experimentation baselines adopted in prior literature. Since ARIMA is a univariate model, we implemented a wrapper that trains a separate ARIMA model for each load target and concatenates their outputs to match the multivariate format expected by our inference service. For multivariate DNN models, we implemented an LSTM-based RNN (Prachitmutita et al., [2018](https://arxiv.org/html/2606.07565#bib.bib34)), a probabilistic autoregressive RNN popularly known as DeepAR (Salinas et al., [2020](https://arxiv.org/html/2606.07565#bib.bib32)), and the Temporal Fusion Transformer (TFT) (Lim et al., [2021](https://arxiv.org/html/2606.07565#bib.bib33)), which incorporates SOTA transformers and attention mechanisms.
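The wrapper idea for univariate models can be sketched as follows. To keep the example self-contained, a naive last-value forecaster stands in for ARIMA; this stand-in, along with the names `PerTargetWrapper` and `last_value_forecaster`, is an assumption for illustration and not the benchmark's actual code.

```python
from typing import Callable, Dict, List, Sequence

# A univariate forecaster maps (history, horizon) to a list of forecasts.
Forecaster = Callable[[Sequence[float], int], List[float]]


def last_value_forecaster(history: Sequence[float], horizon: int) -> List[float]:
    """Stand-in univariate model (not ARIMA): repeats the last observation."""
    return [history[-1]] * horizon


class PerTargetWrapper:
    """Runs one univariate model per target and joins the outputs step by step."""

    def __init__(self, targets: Sequence[str], forecaster: Forecaster) -> None:
        self.targets = list(targets)
        self.forecaster = forecaster

    def predict(self, history: Dict[str, Sequence[float]], horizon: int) -> List[List[float]]:
        per_target = [self.forecaster(history[name], horizon) for name in self.targets]
        # Transpose to a (horizon x targets) layout, one row per forecast step.
        return [list(row) for row in zip(*per_target)]


wrapper = PerTargetWrapper(["cpu", "rps"], last_value_forecaster)
history = {"cpu": [1.0, 2.0, 3.0], "rps": [10.0, 20.0, 30.0]}
print(wrapper.predict(history, horizon=2))
```

Because each target is modeled independently, a wrapper of this kind cannot capture cross-metric relationships, which is one motivation for comparing it against natively multivariate models.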

#### 4\.1\.4\.Parameters and Configuration

One of the limitations to practically applying DNN solutions from literature is the costly and sensitive nature of hyperparameter tuning\. This is especially challenging in decentralized settings, where thousands of microservices must each maintain models that are sensitive to frequent release cycles and evolving traffic patterns\. Therefore, we sought a set of hyperparameter values that generalize well across most microservices, while maintaining reasonably light\-weight models\. Accordingly, we leveraged the built\-in auto\-tuning methods from the benchmark models’ open\-source Python packages, restricting the search to a modest range of values for DNN’s depth, number of RNN layers, and dropout rates\. It should be noted that for STARIXNet, we have not attempted any hyperparameter tuning during this benchmark study, and that it has fewer trainable parameters compared to the other DNN models\. However, to ensure fairness, the following configuration values were held constant across all models, including STARIXNet: 30 \(lagged steps\) and 15 \(forecast steps\) for the encoder and decoder lengths, 256 for batch size, 100 maximum training epochs, and early termination patience of 10\. Similarly, for the ARIMA models, we selected 30 for the autoregressive lag parameter\. All models were trained on 1\-minute interval time series of length 30,240\. Furthermore, we included the exogenous features in all deep learning models where applicable, to ensure parity with STARIXNet\.
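For reference, the shared settings listed above can be collected in a small configuration object. The class and field names are hypothetical; only the values come from the text. The derived property shows that 30,240 one-minute samples correspond to 21 days of history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkConfig:
    """Settings held constant across all benchmarked models (names are illustrative)."""

    encoder_length: int = 30           # lagged input steps
    decoder_length: int = 15           # forecast steps
    batch_size: int = 256
    max_epochs: int = 100
    early_stopping_patience: int = 10
    series_length: int = 30_240        # training samples per series
    interval_minutes: int = 1          # sampling interval

    @property
    def training_span_days(self) -> float:
        """Length of the training history in days."""
        return self.series_length * self.interval_minutes / (60 * 24)


config = BenchmarkConfig()
print(config.training_span_days)
```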

#### 4\.1\.5\.Forecast Targets and SLA

In the multivariate context, we define three forecast targets: CPU usage, request rate, and network throughput (i.e., network bytes received rate). The inference service uses the forecasts for these targets to compute the corresponding utilization metrics (CPU usage, requests per pod, and network received rate per pod), then calculates the number of pods accordingly. The resulting values are aggregated and exposed as a custom scaling metric, as explained in Subsection [3.6](https://arxiv.org/html/2606.07565#S3.SS6). To maintain SLA, we set the utilization thresholds as follows: 25% CPU usage, 10 requests per pod, and 25,000 network bytes received per pod.

### 4\.2\.Pre\- and Post\-Client Adoption Evaluations

As with any product deployed at scale, we rely on a systematic evaluation and experimentation strategy to ensure reliability. Prior to onboarding a client’s microservice, we run simulations on historical data against the client’s baseline, which is often a reactive CPU-based HPA or a pre-defined schedule-based scaling policy. Once the simulation results meet the agreed-upon acceptance criteria, we leverage the multiregion deployment of microservices to run quasi-A/B experimentation as part of a phased rollout plan. Our solution is initially adopted in one region and evaluated against the client’s baseline in the remaining regions for a few weeks. If the performance is satisfactory, we onboard a second region and repeat the process until the microservice is fully migrated to our solution.

## 5\.Results and Learnings

### 5\.1\.Live Benchmarking Results

#### 5\.1\.1\.Inference System Performance

The efficiency of real-time model inference at scale is crucial to maintaining timeliness, responsiveness, maintainability, and cost-effectiveness. This becomes particularly important when the inference component is integrated into the microservice deployment as a sidecar, alongside standard sidecar functionalities, such as load balancing, circuit breaking, dynamic configuration updates, and logging. While our currently adopted solution acts as an external microservice exposing the custom scaling metrics, our future vision is full decentralization, integrating the solution directly into the Kubernetes stack. Therefore, a practical solution must be lightweight in resources and latency. From the benchmark study, we conclude that our solution is superior in overall efficiency, as recorded in Table [2](https://arxiv.org/html/2606.07565#S5.T2), which reports average values of system performance metrics over a period of one day. We note that we have not optimized STARIXNet or used quantization to enhance its performance. Surprisingly, ARIMA, despite being a traditional statistical solution, exhibited the highest resource consumption. We attribute this to the internal management required by the Kalman filter and the need to update state information before each forecast. While the recorded costs may appear trivial at the scale of the benchmark study, the implications are significant at a large industrial scale and with higher-dimensional inputs. This aspect of inference performance benchmarking sets our work apart from prior literature, which often either overlooks inference system performance or focuses on non-real-time solutions.

Table 2\.Inference Service System Performance
#### 5\.1\.2\.Client\-Side Impact

As noted in earlier sections, we assert that evaluating models on prediction accuracy metrics, such as mean squared error, as is done in some of the literature, is potentially misleading in the context of autoscaling microservices. We know from experience that the adoption criteria for critical microservices are not driven by these metrics. Instead, clients prioritize operational performance indicators, including SLA compliance, system stability, and cost efficiency, particularly under irregular load patterns, peak periods, and during disaster recovery. High forecast accuracy, by contrast, does not guarantee business value. A single underestimation event can result in SLA violations with direct revenue loss and negative customer impact, far outweighing the impact of frequent overestimation. Therefore, we designed our solution to aggregate forecasts from both static and reactive model outputs, such as seasonality-based outputs and autoregressive outputs, thus preventing the end result from reaching dangerously low or unreasonably high predictions. Consequently, we evaluate performance based on the client’s system health indicators, SLA/SLO-related metrics, and cost implications. We recorded these metrics from the benchmark experiment over a period containing segments of irregular traffic patterns, and we summarize the findings in Table [3](https://arxiv.org/html/2606.07565#S5.T3) and Table [4](https://arxiv.org/html/2606.07565#S5.T4). Table [3](https://arxiv.org/html/2606.07565#S5.T3) counts the instances in which traffic per replica, CPU usage, or latency thresholds were violated, while Table [4](https://arxiv.org/html/2606.07565#S5.T4) compares compute cost and latency reductions relative to the model with the worst performance. The results demonstrate the effectiveness of our solution in balancing the competing goals of meeting SLAs and SLOs while reducing costs. For supplementary visualizations, readers may refer to Subsection [B.1](https://arxiv.org/html/2606.07565#A2.SS1) of the appendix.
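The argument that accuracy metrics can mislead is easy to see with a toy example. In the sketch below, two hypothetical forecasts have identical mean squared error, yet one never under-provisions and the other always does. The data and function names are illustrative and are not taken from the benchmark.

```python
from typing import Sequence


def mse(forecast: Sequence[float], actual: Sequence[float]) -> float:
    """Mean squared error, which treats over- and underestimation alike."""
    return sum((f - a) ** 2 for f, a in zip(forecast, actual)) / len(actual)


def underestimation_count(forecast: Sequence[float], actual: Sequence[float]) -> int:
    """Number of steps where the forecast fell below the actual load."""
    return sum(1 for f, a in zip(forecast, actual) if f < a)


actual = [10.0, 10.0, 10.0, 10.0]
always_over = [12.0, 12.0, 12.0, 12.0]    # costs more, but never starves the service
always_under = [8.0, 8.0, 8.0, 8.0]       # cheaper, but every step risks an SLA breach

print(mse(always_over, actual), underestimation_count(always_over, actual))
print(mse(always_under, actual), underestimation_count(always_under, actual))
```

Ranking the two forecasts by MSE alone cannot separate them, whereas a violation count of the kind reported in Table 3 does.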

Table 3. Count of Daily SLA/SLO-related Metric Violations

Table 4. Comparative Reductions with 100% as Baseline

### 5\.2\.Learnings from Onboarded Microservices

The solution has been deployed at Walmart, where it already manages autoscaling for many critical microservices, and it continues to expand to additional services owing to its demonstrated effectiveness. During phased rollouts, clients validated significant cost reductions, ranging from 10% to 50%, and in some instances noticeable improvements in system health metrics, based on cross-region comparisons. These findings were consistent with the pre-onboarding simulation results. It is important to note that the level of savings or improvements varies depending on each client’s prior scaling strategy. We also note that the full end-to-end implementation involves various engineering aspects, guardrails, rules, and dependencies on other site reliability engineering tooling, which are beyond the scope of this paper. We share some anonymized dashboard screenshots from the monitoring of onboarded microservices in Section [B](https://arxiv.org/html/2606.07565#A2) of the appendix. One of the distinct advantages of our solution, acknowledged by clients, is its flexibility. It enables customizations and periodic modifications of inputs, thresholds, and aggregation strategies, to suit the various clients’ needs, as well as to dynamically manage stability, risk, and cost trade-offs across different seasons and events. For supplementary visualizations, readers may refer to Subsection [B.2](https://arxiv.org/html/2606.07565#A2.SS2) of the appendix.

## 6\.Conclusions

Intelligent scaling of microservices in cloud platforms is central to managing compute costs and ensuring system stability\. However, most existing solutions are limited in scope—treating the problem as a univariate forecasting task and optimizing for predictive accuracy at the expense of responsiveness, stability, and scalability\. These shortcomings are exacerbated by the computational overhead of more advanced alternatives, which makes them unsuitable for large\-scale, real\-time environments\.

To overcome these challenges, we introduced STARIXNet, a lightweight neural network designed for multivariate resource allocation decisions in dynamic cloud environments\. Unlike conventional approaches, STARIXNet captures rich spatio\-temporal dependencies across multiple quasi\-dependent system metrics, including CPU usage, memory, request rate, and network throughput\. By modeling Seasonal, Temporal, Auto\-Regressive, Integrated, and eXogenous \(STARIX\) patterns, it formulates scaling decisions through a policy that explicitly prioritizes service stability, followed by cost\-efficiency rather than raw forecast accuracy\.

Empirical evaluations demonstrate the real\-world efficacy of STARIXNet, deployed across critical production microservices at Walmart\. The system achieved substantial infrastructure cost savings, ranging from 10% to 50%, while also improving service reliability and the end\-user experience\. These outcomes confirm the effectiveness of our approach and establish STARIXNet as a pragmatic, production\-ready solution to intelligent autoscaling\.

More broadly, our research contributes to the evolving landscape of machine learning for systems, building upon and extending recent developments in temporal modeling (Lim et al., [2021](https://arxiv.org/html/2606.07565#bib.bib33)), state-aware workload prediction (Luo et al., [2024](https://arxiv.org/html/2606.07565#bib.bib27)), and unified autoscaling frameworks (Zou et al., [2024](https://arxiv.org/html/2606.07565#bib.bib24); Zhou et al., [2023](https://arxiv.org/html/2606.07565#bib.bib5)). The principles behind STARIXNet echo the shift towards combining deep learning, probabilistic reasoning (Salinas et al., [2020](https://arxiv.org/html/2606.07565#bib.bib32)), and adaptive control policies for intelligent infrastructure management.

Additionally, our findings reinforce the need for scalable and interpretable models in production environments\. While prior work has emphasized the importance of accuracy and complexity tradeoffs\(Dixitet al\.,[2021](https://arxiv.org/html/2606.07565#bib.bib14)\), STARIXNet demonstrates that it is possible to strike a balance—delivering performance gains while maintaining the simplicity and robustness necessary for real\-time deployment\.

In conclusion, STARIXNet represents a significant step forward in multivariate, policy\-driven autoscaling, providing both a theoretical framework and a proven production\-ready system that aligns with the operational goals of modern cloud infrastructure\. We hope this work inspires further innovation at the intersection of AI, cloud systems, and operational excellence\.

## References

- [1] A. Abdulaal, Z. Liu, and T. Lancewicki (2021) Practical approach to asynchronous multivariate time series anomaly detection and localization. In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, pp. 2485–2494. Cited by: [§3.1](https://arxiv.org/html/2606.07565#S3.SS1.p2.4).
- [2] S. Agarwal, M. A. Rodriguez, and R. Buyya (2024) A deep recurrent-reinforcement learning method for intelligent autoscaling of serverless functions. IEEE Transactions on Services Computing. Cited by: [§2](https://arxiv.org/html/2606.07565#S2.p7.1).
- [3] R. Ahuja, S. Garg, R. Singh, and I. Perl (2025) Effective resource management through vm allocation in cloud data center. In Proceedings of the 18th Innovations in Software Engineering Conference, pp. 1–6. Cited by: [§1](https://arxiv.org/html/2606.07565#S1.p2.1), [§2](https://arxiv.org/html/2606.07565#S2.p3.1).
- [4] D. Balla, C. Simon, and M. Maliosz (2020) Adaptive scaling of kubernetes pods. In NOMS 2020-2020 IEEE/IFIP Network Operations and Management Symposium, pp. 1–5. Cited by: [§2](https://arxiv.org/html/2606.07565#S2.p2.1).
- [5] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung (2015) Time series analysis: forecasting and control. John Wiley & Sons. Cited by: [§3.3](https://arxiv.org/html/2606.07565#S3.SS3.p1.2), [§4.1.3](https://arxiv.org/html/2606.07565#S4.SS1.SSS3.p1.1).
- [6] R. N. Calheiros and R. Buyya (2015) Workload prediction using arima model and its impact on cloud applications’ qos. In IEEE Transactions on Cloud Computing, Vol. 3, pp. 449–458. Cited by: [§2](https://arxiv.org/html/2606.07565#S2.p3.1), [§2](https://arxiv.org/html/2606.07565#S2.p8.1).
- [7] R. N. Calheiros, E. Masoumi, R. Ranjan, and R. Buyya (2014) Workload prediction using arima model and its impact on cloud applications’ qos. IEEE transactions on cloud computing 3, pp. 449–458. Cited by: [§3.3](https://arxiv.org/html/2606.07565#S3.SS3.p1.2).
- [8] A. C. H. Chen (2023) Efficiency analysis of microservices based on queueing models. 2023 IEEE International Conference on Machine Learning and Applied Network Technologies (ICMLANT), pp. 1–5. External Links: [Link](https://api.semanticscholar.org/CorpusID:266730334). Cited by: [§2](https://arxiv.org/html/2606.07565#S2.p2.1).
- [9] A. C. Chen, M. C. Hsiang, and M. Wang (2023) Efficiency analysis of microservices based on queueing models. In 2023 IEEE International Conference on Machine Learning and Applied Network Technologies (ICMLANT), pp. 1–5. Cited by: [§2](https://arxiv.org/html/2606.07565#S2.p2.1).
- [10] Z. Chen, J. Deng, J. Wu, and J. Chen (2018) Autonomous resource provisioning for multi-service web applications. In IEEE International Conference on Web Services (ICWS). Cited by: [§1](https://arxiv.org/html/2606.07565#S1.p1.1).
- [11] A. Dixit, R. K. Gupta, A. Dubey, and R. Misra (2021) Machine learning based adaptive auto-scaling policy for resource orchestration in kubernetes clusters. In International Conference on Internet of Things and Connected Technologies, pp. 1–16. Cited by: [§6](https://arxiv.org/html/2606.07565#S6.p5.1).
- [12] J. Han, F. Xu, X. Zhang, and S. Ma (2024) SOFTS: scalable off-the-shelf forecasting with multivariate mlps. In NeurIPS. Cited by: [§1](https://arxiv.org/html/2606.07565#S1.p5.1).
- [13] Q. Hua, D. Yang, S. Qian, H. Hu, J. Cao, and G. Xue (2023) Kae-informer: a knowledge auto-embedding informer for forecasting long-term workloads of microservices. In Proceedings of the ACM Web Conference 2023, pp. 1551–1561. Cited by: [§1](https://arxiv.org/html/2606.07565#S1.p2.1).
- [14] S. Islam, J. Keung, K. Lee, and A. Liu (2012) Empirical prediction models for adaptive resource provisioning in the cloud. Future Generation Computer Systems. Cited by: [§1](https://arxiv.org/html/2606.07565#S1.p1.1), [§1](https://arxiv.org/html/2606.07565#S1.p2.1), [§1](https://arxiv.org/html/2606.07565#S1.p6.1).
- [15] G. Lai, W. Chang, Y. Yang, and H. Liu (2018) Modeling long- and short-term temporal patterns with deep neural networks. In ACM SIGIR. Cited by: [§2](https://arxiv.org/html/2606.07565#S2.p4.1), [§2](https://arxiv.org/html/2606.07565#S2.p8.1).
- [16] J. Li and Y. Xia (2011) Improving resource provisioning in cloud computing environments using workload prediction models. Concurrency and Computation: Practice and Experience 23(17), pp. 2361–2375. Cited by: [§2](https://arxiv.org/html/2606.07565#S2.p3.1).
- [17] B. Lim, S. Ö. Arık, N. Loeff, and T. Pfister (2021) Temporal fusion transformers for interpretable multi-horizon time series forecasting. International journal of forecasting 37(4), pp. 1748–1764. Cited by: [§4.1.3](https://arxiv.org/html/2606.07565#S4.SS1.SSS3.p1.1), [§6](https://arxiv.org/html/2606.07565#S6.p4.1).
- [18] T. Lorido-Botran, J. Miguel-Alonso, and J. A. Lozano (2014) A review of auto-scaling techniques for elastic applications in cloud environments. Journal of grid computing 12, pp. 559–592. Cited by: [§1](https://arxiv.org/html/2606.07565#S1.p1.1), [§1](https://arxiv.org/html/2606.07565#S1.p2.1), [§2](https://arxiv.org/html/2606.07565#S2.p1.1), [§2](https://arxiv.org/html/2606.07565#S2.p3.1), [§2](https://arxiv.org/html/2606.07565#S2.p8.1).
- [19] Y. Luo, M. Gao, Z. Yu, H. Ge, X. Gao, T. Cai, and G. Chen (2024) Integrating system state into spatio temporal graph neural network for microservice workload prediction. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 5521–5531. Cited by: [§1](https://arxiv.org/html/2606.07565#S1.p2.1), [§2](https://arxiv.org/html/2606.07565#S2.p6.1), [§6](https://arxiv.org/html/2606.07565#S6.p4.1).
- [20] M. Mao and M. Humphrey (2010) Auto-scaling to minimize cost and meet application deadlines in cloud workflows. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12. Cited by: [§2](https://arxiv.org/html/2606.07565#S2.p1.1), [§2](https://arxiv.org/html/2606.07565#S2.p3.1).
- [21] R. Mello, E. Fernandes, C. de Oliveira, and A. de Souza (2017) An architecture to automate performance tests on microservices. In Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services, pp. 17–26. Cited by: [§2](https://arxiv.org/html/2606.07565#S2.p2.1).
- [22] I. Ouhame, A. Ezzati, and D. Aboutajdine (2021) An efficient deep learning model for cloud resource usage prediction using cnn-lstm. Cluster Computing. Cited by: [§2](https://arxiv.org/html/2606.07565#S2.p4.1), [§2](https://arxiv.org/html/2606.07565#S2.p8.1).
- [23] I. Prachitmutita, W. Aittinonmongkol, N. Pojjanasuksakul, M. Supattatham, and P. Padungweang (2018) Auto-scaling microservices on iaas under sla with cost-effective framework. In 2018 Tenth International Conference on Advanced Computational Intelligence (ICACI), pp. 583–588. Cited by: [§2](https://arxiv.org/html/2606.07565#S2.p4.1), [§4.1.3](https://arxiv.org/html/2606.07565#S4.SS1.SSS3.p1.1).
- [24] A. Rubak and J. Taheri (2023) Machine learning for predictive resource scaling of microservices on kubernetes platforms. In Proceedings of the IEEE/ACM 16th International Conference on Utility and Cloud Computing, pp. 1–8. Cited by: [§2](https://arxiv.org/html/2606.07565#S2.p3.1).
- [25] D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski (2020) DeepAR: probabilistic forecasting with autoregressive recurrent networks. International journal of forecasting 36(3), pp. 1181–1191. Cited by: [§4.1.3](https://arxiv.org/html/2606.07565#S4.SS1.SSS3.p1.1), [§6](https://arxiv.org/html/2606.07565#S6.p4.1).
- [26] G. Tesauro and N. K. Jong (2007) Online resource allocation using decompositional reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence. Cited by: [§2](https://arxiv.org/html/2606.07565#S2.p7.1).
- [27] L. Toka, G. Dobreff, B. Fodor, and B. Sonkoly (2020) Adaptive ai-based auto-scaling for kubernetes. In 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pp. 599–608. Cited by: [§2](https://arxiv.org/html/2606.07565#S2.p5.1).
- [28] P. Welch (2003) The use of fast fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms. IEEE Transactions on audio and electroacoustics 15(2), pp. 70–73. Cited by: [§3.1](https://arxiv.org/html/2606.07565#S3.SS1.p2.4).
- [29] C. Xu, P. Li, K. Ren, and F. Wu (2020) FIRM: an intelligent fine-grained resource management framework for cloud datacenters. In IEEE INFOCOM 2020. Cited by: [§2](https://arxiv.org/html/2606.07565#S2.p7.1).
- [30] S. Xue, Z. Qin, W. Zhou, G. Xue, and C. Guo (2022) Morpheus: robust autoscaling with meta-learning in production cloud platforms. In Proceedings of the 28th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 3695–3705. Cited by: [§2](https://arxiv.org/html/2606.07565#S2.p7.1).
- [31] Y. Yan, V. Sharma, D. Yu, and J. Wei (2021) HANSEL: a lightweight, highly available neural service level autoscaler. In ACM Symposium on Cloud Computing. Cited by: [§2](https://arxiv.org/html/2606.07565#S2.p4.1), [§2](https://arxiv.org/html/2606.07565#S2.p8.1).
- [32] Z. Zhou, C. Zhang, L. Ma, J. Gu, H. Qian, Q. Wen, L. Sun, P. Li, and Z. Tang (2023) AHPA: adaptive horizontal pod autoscaling systems on alibaba cloud container service for kubernetes. In Proceedings of the AAAI conference on artificial intelligence, Vol. 37, pp. 15621–15629. Cited by: [§2](https://arxiv.org/html/2606.07565#S2.p3.1), [§6](https://arxiv.org/html/2606.07565#S6.p4.1).
- [33] F. Zou, X. Li, and J. Liang (2024) OptScaler: unified predictive and reactive autoscaling for cloud resource management. In VLDB 2024. Cited by: [§6](https://arxiv.org/html/2606.07565#S6.p4.1).

## Appendix AReproducibility Information

### A\.1\.Code Access

The initial release of the code, along with the licensing details, can be found at https://huggingface\.co/aabdulaal/starixnet/tree/v1\.0\.0\. To obtain access, please reach out to the corresponding author\. The repository will be made publicly available once the current revision is finalized\. Meanwhile, we include additional reproducibility information in the subsections below\.

### A\.2\.Software Packages

Independent of the client\-facing production implementation, the live benchmark experiments deploy Python \(slim\-bullseye\) runtime containers, installing the following dependencies:

- Packages required by STARIXNet model:
  - python 3.11.12
  - joblib 7.8.0
  - numpy 2.2.5
  - pandas 2.2.3
  - scipy 1.15.2
  - torch 2.6.0 (CPU build)
  - tqdm 4.67.1
- Additional packages required by benchmark models:
  - lightning 2.5.1.post0
  - pytorch-forecasting 1.3.0
  - statsmodels 0.14.4
- Additional packages installed for inference service:
  - aiohttp 3.11.12
  - aiosignal 1.3.2
  - asyncio 3.4.3
  - dataclasses-json 0.6.7
  - dynaconf 3.2.11
  - flask 3.1.0 (async)
  - flask-caching 2.3.1
  - prometheus-client 0.21.1
  - tenacity 9.1.2

We omit the listing of packages related to commercial storage solutions or internal tooling at Walmart to maintain compliance with policies\.

### A\.3\.STARIXNet Parameter Configurations

As mentioned in Subsection [4.1](https://arxiv.org/html/2606.07565#S4.SS1), we do not attempt separate hyperparameter tuning when training STARIXNet for new microservices or different sets of metrics. The per-module trainable parameters scale linearly in $N$, while the exogenous modules scale linearly in $NM^{r}$ for each $r$. While there may be room to further improve precision through dedicated hyperparameter tuning, this would conflict with our goal of keeping the DNN as lightweight as possible. The general set of parameters we implement is as follows:

- $J=15$
- For seasonal module:
  - $S=10$
- For temporal module:
  - $|V|=10080$ and $d=\min(50,\frac{|V|+1}{2})=50$
  - $L=1$
- For autoregressive-integrated module:
  - $k_{T^{\prime}}=(N,\min(50,\frac{I+1}{2}))$
  - $L=1$
- For any exogenous module:
  - $k_{T^{\prime}}=(M^{r},\min(50,\frac{I+1}{2}))$
  - $L=2$

Therefore, the variables that may deviate by microservice are $N$, $M^{r}$, $r$, and $I$, where $I=30$ was used for the benchmark study\.
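To make the arithmetic in these settings concrete, the sketch below derives the capped widths from the formulas above. The class and field names are hypothetical, the example values for N and M^r are purely illustrative, and flooring (I+1)/2 to an integer is an assumption: the text does not state how the non-integer value 15.5 (for I=30) is rounded.

```python
from dataclasses import dataclass
from typing import List, Tuple


def capped_width(length: int, cap: int = 50) -> int:
    """min(cap, (length + 1) / 2), floored to an integer (assumed rounding)."""
    return min(cap, (length + 1) // 2)


@dataclass(frozen=True)
class StarixnetConfig:
    """Hypothetical container for the appendix settings."""

    n: int                  # N in the text; varies by microservice
    m_r: Tuple[int, ...]    # M^r for each r; varies by microservice
    i: int = 30             # I, value used in the benchmark study
    j: int = 15             # J
    s: int = 10             # S, seasonal module
    v: int = 10080          # |V|, temporal module
    layers_temporal: int = 1   # L for the temporal module
    layers_ari: int = 1        # L for the autoregressive-integrated module
    layers_exo: int = 2        # L for any exogenous module

    @property
    def temporal_d(self) -> int:
        # d = min(50, (|V| + 1) / 2)
        return capped_width(self.v)

    @property
    def ari_kernel(self) -> Tuple[int, int]:
        # k_{T'} = (N, min(50, (I + 1) / 2))
        return (self.n, capped_width(self.i))

    @property
    def exo_kernels(self) -> List[Tuple[int, int]]:
        # k_{T'} = (M^r, min(50, (I + 1) / 2)) for each r
        return [(m, capped_width(self.i)) for m in self.m_r]


if __name__ == "__main__":
    cfg = StarixnetConfig(n=4, m_r=(3, 6))  # illustrative N and M^r
    print(cfg.temporal_d, cfg.ari_kernel, cfg.exo_kernels)
```

With the benchmark value I=30, the cap of 50 is inactive for the autoregressive-integrated and exogenous modules, whereas for the temporal module, with |V|=10080, the cap determines d=50.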

## Appendix B Supplementary Figures

### B\.1\. Live Benchmark Experiment Figures

![Refer to caption](https://arxiv.org/html/2606.07565v1/figures/experiment_deployment_architecture.png)

Figure 4\. Experiment deployment architecture for STARIXNet load testing\. Distributed instances of a traffic generator application mirror real traffic by sending identical requests to cloned microservice pods\. All microservice pods share identical CPU and memory limits and HPA bounds \(min/max replicas\)\. Inference service pods use standardized CPU allocations, with memory adjusted per model requirements\. The block diagram shows distributed traffic generators sending load to cloned microservice pods instrumented with Prometheus sidecars; metrics flow into Prometheus, where the inference service reads them and feeds predictions back; the Kubernetes HPA controller then reads those predictions and adjusts microservice pod counts\.

![Refer to caption](https://arxiv.org/html/2606.07565v1/figures/traffic_pattern.png)

Figure 5\. Snapshot of the identical load pattern generated across all microservices\. The distributed traffic generator sends synchronized requests to all cloned microservice pods, enabling a consistent and fair comparison\. The y\-axis shows load in Requests Per Second \(RPS\); the x\-axis shows time\. The traffic pattern includes steady phases, drops, and recoveries\.

![Refer to caption](https://arxiv.org/html/2606.07565v1/figures/latency_p95.png)

Figure 6\. Response latency \(95th percentile\) comparison across microservices during the evaluations\. The y\-axis shows latency in milliseconds; the x\-axis shows time\. All models exhibit consistent latency patterns, with notable drops during the periods of zero traffic load\.

![Refer to caption](https://arxiv.org/html/2606.07565v1/figures/latency_p99.png)

Figure 7\. Response latency \(99th percentile\) comparison across microservices during the evaluations\. The y\-axis shows latency in milliseconds; the x\-axis shows time\. Microservices that are scaled by the DeepAR, LSTM, and ARIMA models show significant spikes above baseline, while STARIXNet and TFT maintain more consistent performance\.

![Refer to caption](https://arxiv.org/html/2606.07565v1/figures/cpu_utilization_q50.png)

Figure 8\. CPU usage \(50th percentile\) comparison under load during the experiments and evaluations\. Each microservice shows a distinct resource consumption pattern under identical traffic conditions because of its different scale level\. The y\-axis shows CPU usage percentage \(1\-100\); the x\-axis shows time\. The patterns include spikes and drops corresponding to load changes and scaling\.

### B\.2\. Onboarded Client Microservice Figures

![Refer to caption](https://arxiv.org/html/2606.07565v1/figures/post_onboarding_1.png)

Figure 9\. Multi\-region comparison after onboarding a client’s microservice running on a scale of hundreds of pods\. The two control regions implement an alternative autoscaling strategy\. The plot shows a clear distinction between the onboarded region, which handles volatility and peaks through predictive scaling, and the control regions, which scale reactively\. The comparison serves as quasi\-A/B testing across regions\.

![Refer to caption](https://arxiv.org/html/2606.07565v1/figures/post_onboarding_2.png)

Figure 10\. Multi\-region comparison after onboarding a client’s microservice\. The two control regions implement CPU\-based scaling with high thresholds for safety, resulting in near\-constant capacity regardless of workload spikes or dips\. In contrast, the onboarded region adjusts capacity dynamically in response to the predicted pod count\. The comparison serves as quasi\-A/B testing across regions\.

![Refer to caption](https://arxiv.org/html/2606.07565v1/figures/ss_cpu_predictions.png)

Figure 11\. An onboarded microservice’s forecasts of CPU demand \(i\.e\., CPU seconds\) over time from STARIXNet’s decoder modules: autoregressive \(yellow\), exogenous \(blue\), seasonal \(orange\), temporal \(red\), and the final aggregated prediction \(green\)\.

![Refer to caption](https://arxiv.org/html/2606.07565v1/figures/ss_tps_predictions.png)

Figure 12\. An onboarded microservice’s forecasts of Requests Per Second \(RPS\) over time from STARIXNet’s decoder modules: autoregressive, exogenous, seasonal, and temporal components\.

![Refer to caption](https://arxiv.org/html/2606.07565v1/figures/ss_pod_predictions.png)

Figure 13\. The calculated pod\-count options over time after aggregating RPS and CPU forecasts from STARIXNet’s decoder modules: autoregressive \(green\), exogenous \(yellow\), seasonal \(blue\), and temporal \(orange\)\.
