QSplitFL: Capability Aware Deep Q-Learning for Optimal Split Point Selection in Split Federated Learning
Summary
QSplitFL proposes a DQN-based framework for optimal split point selection in split federated learning, using client hardware metrics to adapt to heterogeneous devices. Experiments show improved convergence and accuracy across multiple datasets and architectures.
# QSplitFL: Capability Aware Deep Q-Learning for Optimal Split Point Selection in Split Federated Learning
Source: [https://arxiv.org/html/2606.09869](https://arxiv.org/html/2606.09869)
1. Department of Computer Science, Kennesaw State University, Marietta, GA, 30060 USA. Email: nshadin@students.kennesaw.edu, xzhang48@kennesaw.edu
2. Department of Computer Science, San Francisco State University, San Francisco, CA, 94132 USA. Email: jingyiwang@sfsu.edu
3. Department of Electrical and Computer Engineering, University of Houston, Houston, TX, 77204 USA. Email: mpan2@uh.edu

###### Abstract
Federated Learning (FL) combined with Split Learning (SL) is a privacy-preserving paradigm that enables training deep neural networks (DNNs) on resource-constrained devices while reducing overall training cost. However, determining the optimal split point, meaning the layer where the model is divided, remains a critical challenge, especially when clients have heterogeneous hardware capabilities. Fixed split points can overload weak devices and increase the communication and server load, which slows convergence and reduces stability. This paper introduces QSplitFL, a novel capability-aware Deep Q-Network (DQN) framework for optimal split point selection in Split learning based Federated Learning (SFL) environments. Unlike existing approaches that rely on high-dimensional model weight representations, QSplitFL employs a lightweight state representation derived directly from client hardware metrics, including CPU utilization, memory, battery level, and network latency. The proposed framework incorporates a decayed loss-drop reward function that prioritizes early convergence, and a committee-based DQN architecture with majority voting to mitigate reward hacking. Extensive experiments on MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets using CNN, ResNet50, MobileNetV4, and ConvNeXt architectures demonstrate that our approach achieves better convergence and higher accuracy compared to existing methods, while effectively adapting to heterogeneous device resources. The source code is publicly available at [https://github.com/AIPO-Lab/QSplitFL](https://github.com/AIPO-Lab/QSplitFL).
## 1 Introduction
The rapid adoption of edge devices, along with growing concerns about data privacy and security, has driven interest in Federated Learning (FL), which enables collaborative model training without centralizing sensitive raw data\[[17](https://arxiv.org/html/2606.09869#bib.bib77)\]. FL is particularly relevant in Internet of Things (IoT), Internet of Medical Things (IoMT), and edge computing environments, where data are naturally distributed across smartphones, IoT sensors, and medical devices deployed in remote and resource-limited settings\[[13](https://arxiv.org/html/2606.09869#bib.bib52),[2](https://arxiv.org/html/2606.09869#bib.bib30)\]. Despite its benefits, FL relies on client-side computation and often assumes that participating clients can train complete neural networks locally, an assumption that can fail in real-world scenarios with resource-constrained client devices\[[27](https://arxiv.org/html/2606.09869#bib.bib49)\]. Consider a practical healthcare scenario in which rural hospitals and community health centers aim to collaboratively train deep learning models for detecting a particular disease using patient data distributed across multiple facilities\[[13](https://arxiv.org/html/2606.09869#bib.bib52)\]. While collaboration is essential for achieving robust models that generalize across diverse patient populations, these facilities often operate with limited computational infrastructure, including legacy hardware, constrained power supplies, and unreliable network connectivity\[[4](https://arxiv.org/html/2606.09869#bib.bib91)\]. For such environments, the computational cost associated with training deep neural networks (DNNs) locally becomes the dominant bottleneck, which limits the applicability of conventional FL.
To address this limitation, Split Learning \(SL\) offers a compelling alternative by partitioning the model between clients and the server, which enables resource constrained devices to execute only the initial layers of the model\[[10](https://arxiv.org/html/2606.09869#bib.bib47)\]\. The resulting intermediate activations, termed smashed data, are then transmitted to the server, which completes the remaining forward and backward passes\. SL reduces client side computation while keeping raw data on the device\. Building on this idea, Split learning based federated learning \(SFL\) integrates the computational efficiency of SL with the privacy guarantees and scalability of FL\[[27](https://arxiv.org/html/2606.09869#bib.bib49)\]\.
Despite these advantages, SFL introduces a critical optimization challenge: determining the optimal split point, that is, the specific layer at which the neural network is divided between clients and the server\[[20](https://arxiv.org/html/2606.09869#bib.bib46)\]. The split point determines how computation and communication are balanced between clients and the server\[[27](https://arxiv.org/html/2606.09869#bib.bib49)\]. When the split is shallow, clients execute only a small portion of the network, which reduces client-side compute but increases the size of intermediate activations sent to the server and increases server-side processing. On the other hand, when the split is deep, more layers run on the client, which reduces activation transmission but can overload resource-constrained devices\[[21](https://arxiv.org/html/2606.09869#bib.bib43)\]. In practice, the split point cannot be fixed, because client capability varies across devices and fluctuates across federated rounds due to battery drain, competing applications, network congestion, and hardware heterogeneity within client clusters\[[6](https://arxiv.org/html/2606.09869#bib.bib39)\]. In real-world deployments, some devices can support deeper splits, while others cannot\[[31](https://arxiv.org/html/2606.09869#bib.bib38)\]. This context motivates adaptive, capability-aware split point selection.
Existing split point selection methods mainly use heuristic rules, exhaustive search, or reinforcement learning that relies on high-dimensional state representations derived from model weights\[[24](https://arxiv.org/html/2606.09869#bib.bib32)\]. These approaches often introduce nontrivial overhead, slow adaptation, and limited interpretability. Methods that build states from model weights typically require dimensionality reduction, often via Principal Component Analysis (PCA)\[[30](https://arxiv.org/html/2606.09869#bib.bib40)\]. This step adds substantial complexity for collecting weights and computing the projection, which is difficult to justify in deployments on resource-constrained devices.
To mitigate these issues, we propose QSplitFL, a capability-aware reinforcement learning (RL) framework based on a Deep Q-Network that selects split points dynamically in SFL settings. We replace weight-based state representations with a lightweight, interpretable state built from client capability metrics. We formulate split selection as a Markov Decision Process\[[30](https://arxiv.org/html/2606.09869#bib.bib40)\] in which the state summarizes aggregated cluster capability using normalized metrics for CPU availability, memory utilization, battery level, and network latency, together with a heterogeneity indicator derived from the capability distribution across clients. The action selects a split layer from a feasible range, for example, from mid-network layers up to the last executable layer, which aligns with real deployments where devices and hospitals differ widely in computing capacity. Resource-constrained sites in rural areas may only execute a small portion of the model, while larger hospitals can support deeper computation. This heterogeneity makes a single fixed split point ineffective and motivates a mechanism that adapts split depth to the capability of each participating client cluster. The reward follows a decayed loss-drop objective, where the reward function assigns higher credit to early improvements to speed up initial convergence.
QSplitFL uses standard DQN stabilization techniques as described by Wang et al\.\[[30](https://arxiv.org/html/2606.09869#bib.bib40)\]\. This includes experience replay, where a finite\-capacity buffer stores previous transitions to decorrelate gradient updates and allow the model to reuse informative experiences\. Stability in temporal difference \(TD\) learning is further maintained through a target network that is periodically synchronized with the main network\. Additionally, the framework employs committee\-based action selection to mitigate the problem of reward hacking, which ensures more robust and reliable decision making for action selection\[[32](https://arxiv.org/html/2606.09869#bib.bib34)\]\. The principal contributions of this paper are summarized as follows:
- Capability-Aware State: A lightweight and interpretable state representation based on normalized client capability metrics (CPU, memory, battery, network) and cluster heterogeneity.
- Decayed Loss Drop Reward: An exponentially decayed loss drop reward that prioritizes early round improvements to accelerate split point discovery.
- Committee-Based DQN: A committee-based DQN with multiple CNN models as a single committee member to vote on action selection to improve robustness and reduce reward hacking.
- First DQN-Based Adaptive Split Point Selection: To the best of our knowledge, QSplitFL is the first work to apply Deep Q-Learning for optimal split point selection in SFL.
- Comprehensive Evaluation: Extensive experiments on MNIST, Fashion MNIST, CIFAR-10, and CIFAR-100 datasets using CNN, ResNet50, MobileNetV4, and ConvNeXt across 5, 10, 100, and 200 clients demonstrate the scalability and effectiveness of our method.
## 2 Related Work
SFL combines the computation reduction of SL with the collaborative scalability of FL\[[27](https://arxiv.org/html/2606.09869#bib.bib49),[12](https://arxiv.org/html/2606.09869#bib.bib48)\]. In SFL, clients perform SL with a server, while client-side updates are periodically aggregated in a federated manner. This hybrid design enables local participation with resource-constrained devices, while allowing parallel processing of multiple clients on the server side. Recent SFL research has emphasized optimization and resource management. Fan et al.\[[9](https://arxiv.org/html/2606.09869#bib.bib44)\] proposed a multi-agent deep reinforcement learning framework for cloud edge device collaborative SFL that jointly optimizes partitioning, resource allocation, and client scheduling. Yu et al.\[[35](https://arxiv.org/html/2606.09869#bib.bib45)\] introduced U-shaped SFL for vehicular environments and used deep reinforcement learning to handle dynamic resource allocation and split selection under mobility constraints. ESFL formulates workload and server resource allocation for heterogeneous wireless devices using optimization techniques tailored to system constraints\[[37](https://arxiv.org/html/2606.09869#bib.bib42)\]. Privacy and integrity aspects have also been studied. For example, differential privacy mechanisms that perturb smashed data can reduce the success of label inference attacks\[[33](https://arxiv.org/html/2606.09869#bib.bib12)\]. IV-FED uses trusted execution environments to support training integrity in healthcare IoT scenarios\[[15](https://arxiv.org/html/2606.09869#bib.bib23)\].
Reinforcement learning \(RL\), and deep reinforcement learning in particular, has been applied to decision making problems in distributed machine learning systems\[[14](https://arxiv.org/html/2606.09869#bib.bib21)\]\. Deep Q\-Networks \(DQN\) are widely used for discrete action spaces common in scheduling, allocation, and partitioning tasks\[[30](https://arxiv.org/html/2606.09869#bib.bib40)\]\. In distributed learning, RL has been used for split point selection, client selection, and task offloading\. Goal oriented DNN splitting methods use RL to control split decisions under resource constraints and accuracy objectives\[[5](https://arxiv.org/html/2606.09869#bib.bib18)\]\. Earlier approaches applied Q\-learning to split decisions using PCA compressed weight based representations\[[24](https://arxiv.org/html/2606.09869#bib.bib32)\]\. RL has also been used to select clients in heterogeneous IoT FL settings to balance convergence and resource usage\[[34](https://arxiv.org/html/2606.09869#bib.bib19)\], and to optimize task offloading between edge and cloud resources\[[36](https://arxiv.org/html/2606.09869#bib.bib17)\]\. Improvements to experience replay, such as prioritized experience replay and diversity aware replay, have further improved sample efficiency in DRL training\[[25](https://arxiv.org/html/2606.09869#bib.bib14)\]\.
Research Gap and Our Contribution: Existing RL-based approaches represent the state using high-dimensional model weights with PCA compression\[[24](https://arxiv.org/html/2606.09869#bib.bib32)\], which introduces substantial overhead and limits scalability. They also fail to represent dynamic client capability changes, which makes split point selection decisions slow to adapt. In addition, single-agent decision making can be vulnerable to reward hacking, where the policy exploits weaknesses in the reward signal rather than improving true training performance\[[32](https://arxiv.org/html/2606.09869#bib.bib34)\]. Finally, reward formulations that treat all rounds equally can miss the importance of early round decisions for shaping later convergence behavior. QSplitFL addresses these gaps through a lightweight capability-aware state, a committee-based DQN with majority voting, and a decayed loss-drop reward for adaptive split point selection in resource-constrained SFL environments.
## 3 System Model
Figure 1: QSplitFL Workflow Architecture. (1) Client-Side: Clients have hardware metrics, receive split layer $\ell$, and run forward propagation through layers 1 to $\ell$, which produces smashed data ($A_k$). (2) Server-Side: The server completes training through layers $\ell+1$ to $L$ and returns gradients to clients. (3) Aggregation: Client updates are aggregated via FedAvg; reward ($r_t$) is computed on the basis of the loss function. (4) RL Controller: A committee of MLP models votes on the optimal split layer based on client capabilities.

Figure [1](https://arxiv.org/html/2606.09869#S3.F1) presents the QSplitFL workflow architecture. In the Client-Side Process, each client begins with a Hardware Profiler (Step 1) that measures four real-time metrics: CPU utilization, battery level, memory availability, and network latency. These raw readings are converted into normalized capability metrics $C_i$ (Step 2) and then passed to a state aggregator (Step 3), which constructs a six-dimensional neural state vector $s_t$ (Step 3a) summarizing the cluster’s overall capability and heterogeneity. After receiving the selected split layer $\ell$ and the global model with $L$ layers and weights $W_t$ from the server (Step 5), each client loads its client sub-model with layers $1$ to $\ell$ (Step 6) and runs a forward pass (Step 7) to produce smashed data $A_\ell$, which is transmitted to the server.
In the Server-Side Process, in Block A, the server receives the smashed data and executes the server forward pass (Step 8a) through the server sub-model covering layers $\ell+1$ to $L$ (Step 8). The server calculates $\mathcal{L}(\hat{y}, Y)$ by comparing predictions with true labels (Step 8b). Server backpropagation (Step 9) computes gradients $\nabla A_\ell$, which are sent back to clients. Each client then performs its client backward pass (Step 10), computes local weight updates $\Delta W$ (Step 11), and sends them to the Aggregator (Step 12) for federated aggregation. The aggregated loss is fed into the reward calculator (Step 13), which computes the decayed loss-drop reward $r_t$, and the updated global model $W_{t+1}$ (Step 15) is prepared for the next round.
In Block B (RL Controller), the server executes the committee-based DQN. The state vector $s_t$ enters the committee stage (Step 4), which consists of a shared encoder feeding into multiple independent MLP Heads. Each head proposes a split action, and majority voting (Step 4a) determines the consensus. The winning action is selected (Step 4b) as the split layer $\ell$, which is broadcast to all clients through Step 5. The reward $r_t$ from Step 13 is stored in the experience replay buffer $\mathcal{D}$ (Step 14), from which training batches are sampled to update the committee networks.
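The voting step in Block B can be illustrated with a minimal sketch. It assumes each head has already produced one Q-value per feasible split action; the function name `committee_vote`, the tie-breaking rule (highest mean Q-value among tied actions), and the example numbers are illustrative assumptions, not details given in the paper.

```python
from collections import Counter
from typing import List, Sequence


def committee_vote(q_values: Sequence[Sequence[float]]) -> int:
    """Pick an action index by majority vote over committee heads.

    q_values[m][a] is the Q-value that head m assigns to action index a.
    Ties are broken by the highest mean Q-value across heads, which is an
    assumption of this sketch rather than a rule stated in the paper.
    """
    # Each head votes for its own greedy (highest Q-value) action.
    votes: List[int] = [max(range(len(row)), key=row.__getitem__) for row in q_values]
    counts = Counter(votes)
    top = max(counts.values())
    tied = [a for a, n in counts.items() if n == top]
    if len(tied) == 1:
        return tied[0]
    # Assumed tie-break: prefer the tied action with the best average Q-value.
    mean_q = [sum(col) / len(q_values) for col in zip(*q_values)]
    return max(tied, key=lambda a: mean_q[a])


# Three hypothetical heads scoring four candidate split layers.
example_q = [
    [0.1, 0.9, 0.2, 0.0],  # head 1 prefers index 1
    [0.3, 0.8, 0.1, 0.2],  # head 2 prefers index 1
    [0.7, 0.2, 0.1, 0.0],  # head 3 prefers index 0
]
winner = committee_vote(example_q)  # index 1 wins two votes to one
```

Using an odd number of heads, as Algorithm 1 requires for $M$, rules out an even two-way split of the votes; ties remain possible when three or more actions each receive votes, which is why the sketch includes a fallback.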
Consider a federated network with $K$ edge clients partitioned into $C$ clusters, denoted $\{\mathcal{K}_c\}_{c=1}^{C}$, where each cluster $\mathcal{K}_c$ contains clients with similar hardware capabilities. The central server maintains the server-side portion of the neural network and coordinates training across all clusters. Within each cluster, a dedicated RL agent (the Q-controller, Block B in Figure [1](https://arxiv.org/html/2606.09869#S3.F1)) observes cluster-level capability states and selects appropriate split points for each training round.
Computation-Communication Trade-off: SFL addresses the computational limitations of resource-constrained devices by allowing each client to execute only the initial layers of the neural network model. The intermediate activations (smashed data) along with the corresponding true labels are transmitted to a central server, which completes the more computation-intensive portions of the training, including the deeper layers and loss computation. This approach significantly reduces the workload on devices with limited power, memory, or battery, making participation possible even in resource-constrained settings like rural healthcare facilities. Although SFL increases communication costs because activations and gradients must be exchanged between clients and the server, this trade-off is necessary and acceptable\[[26](https://arxiv.org/html/2606.09869#bib.bib33)\], particularly in critical applications like collaborative AI in healthcare, where enabling weak devices to participate matters more than reducing network latency. If a device cannot run the full model locally, the lower communication cost of traditional FL is irrelevant, as the device simply cannot participate. QSplitFL optimizes this trade-off by adaptively choosing split points that balance client computation against communication overhead.
## 4 QSplitFL Framework
### 4.1 Per-Round Operation
Before formalizing the individual components, we first describe how they interact within a single training round, which gives an end-to-end view of the framework. Figure [2](https://arxiv.org/html/2606.09869#S4.F2) provides a detailed view of how the QSplitFL framework operates during each training round. It shows the integration of capability-aware state construction (corresponding to Steps 1–3 in Figure [1](https://arxiv.org/html/2606.09869#S3.F1)), RL-based split point selection (Step 4), SFL execution (Steps 5–11), and federated aggregation with reward computation (Steps 12–15). The per-round workflow proceeds as follows:
1. State Construction: At the beginning of each round, the Hardware Profiler (Step 1) collects capability metrics (CPU, memory, battery, network) from all clients, which are normalized into Capability Metrics $C_i$ (Step 2) and aggregated by the State Aggregator (Step 3) into the six-dimensional neural state $s_t$.
2. Action Selection: The RL Controller’s Committee Machine evaluates the current state through a Shared Encoder and multiple MLP Heads. Majority Voting (Step 4a) determines the consensus split layer, and the selected action is output (Step 4b).
3. SFL Execution: The Global Model and Split Broadcaster (Step 5) sends $W_t$ and $\ell$ to all clients. Each client loads its Sub-Model (Step 6) and runs the Forward Pass (Step 7) to produce smashed data $A_\ell$. The server executes the Server Forward Pass (Step 8a) through layers $\ell+1$ to $L$, computes the loss (Step 8b), and performs Server Backpropagation (Step 9). Gradients $\nabla A_\ell$ are returned to clients for the Client Backward Pass (Step 10) and Local Weight Update (Step 11).
4. Federated Aggregation: The FedAvg Aggregator (Step 12) combines all client updates. The Reward Calculator (Step 13) computes the decayed loss-drop reward $r_t$, which is stored in the Experience Replay Buffer $\mathcal{D}$ (Step 14). The updated Global Model $W_{t+1}$ (Step 15) is prepared for the next round.
Figure 2: High-Level Overview of QSplitFL Framework Operation in Each Training Round. The diagram illustrates the complete workflow: (1) capability metric collection and state construction, (2) committee-based split point selection, (3) SFL execution with smashed data transmission, and (4) federated aggregation with reward computation.

With this per-round view in place, we now formalize each component, beginning with the Markov Decision Process formulation that underpins split point selection.
### 4.2 Markov Decision Process Formulation
We formulate the split point selection problem as a Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ the action space, $P$ the transition function, $R$ the reward function, and $\gamma$ the discount factor. The MDP connects the key blocks shown in Figure [1](https://arxiv.org/html/2606.09869#S3.F1): the state is constructed from client capability metrics (Steps 1–3), the action determines the split layer via the RL Controller (Step 4), the transition executes one SFL round (Steps 5–12), and the reward is derived from the aggregated loss (Step 13).
#### 4.2.1 State Space:
The state at round $t$ for cluster $c$ encodes the aggregated hardware capabilities and heterogeneity of participating clients. As shown in Steps 1–3 of Figure [1](https://arxiv.org/html/2606.09869#S3.F1), each client’s Hardware Profiler (Step 1) reports raw metrics, which are normalized into Capability Metrics $C_i$ (Step 2) and aggregated by the State Aggregator (Step 3) into a compact neural state vector $s_t$. Unlike prior approaches that employ high-dimensional model weight representations requiring PCA compression, we define a compact, interpretable state vector based on four normalized capability metrics. For each client $k$ at time $t$, we compute: CPU availability as $C_{\text{CPU}}^{(k)}(t) = \text{CPU}_{\text{avail}}^{(k)}(t) / \text{CPU}_{\max}$, memory availability as $C_{\text{Mem}}^{(k)}(t) = \text{Mem}_{\text{avail}}^{(k)}(t) / \text{Mem}_{\max}$, battery level as $C_{\text{Bat}}^{(k)}(t) = \text{Bat}_{\text{level}}^{(k)}(t) / \text{Bat}_{\max}$, and network quality as $C_{\text{Net}}^{(k)}(t) = 1 - \text{Latency}^{(k)}(t) / \text{Latency}_{\max}$. All metrics are normalized to the interval $[0,1]$ using min-max normalization across the federated network; see Appendix [0.A.2](https://arxiv.org/html/2606.09869#Pt0.A1.SS2) for detailed descriptions. The overall capability score for client $k$ combines these metrics as $C_{\text{Overall}}^{(k)}(t) = \sum_{i=1}^{4} w_i \cdot C_i^{(k)}(t)$, where $w_1, w_2, w_3, w_4 \geq 0$ and $\sum_{i=1}^{4} w_i = 1$ are importance weights that are tunable according to the deployment requirements.
The cluster-level state vector aggregates individual client metrics into a six-dimensional representation: $s_t^{(c)} = [\bar{C}_{\text{CPU}}^{(c)}(t), \bar{C}_{\text{Memory}}^{(c)}(t), \bar{C}_{\text{Battery}}^{(c)}(t), \bar{C}_{\text{Network}}^{(c)}(t), \bar{C}_{\text{Overall}}^{(c)}(t), \sigma_c(t)]$. Here, each mean capability is computed as $\bar{C}_i^{(c)}(t) = \frac{1}{|\mathcal{K}_c|} \sum_{k \in \mathcal{K}_c} C_i^{(k)}(t)$, which represents the average of metric $i$ across all clients in cluster $c$. The heterogeneity indicator $\sigma_c(t) = \sqrt{\frac{1}{|\mathcal{K}_c|} \sum_{k \in \mathcal{K}_c} (C_{\text{Overall}}^{(k)}(t) - \bar{C}_{\text{Overall}}^{(c)}(t))^2}$ captures the standard deviation of overall capability scores within the cluster. This six-dimensional state representation provides direct interpretability: a high $\bar{C}_{\text{Overall}}^{(c)}(t)$ indicates that the cluster can accommodate deeper split points, while a high $\sigma_c(t)$ indicates heterogeneous capabilities, which require adaptive split point selection to avoid overloading weaker client devices.
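As a rough illustration of the state construction above, the following sketch builds the six-dimensional vector from per-client metrics that are assumed to be already normalized to $[0,1]$. The dictionary keys, the function names, and the equal importance weights are assumptions made for the example; the paper leaves the weights tunable.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical key names for the four normalized capability metrics.
METRICS = ("cpu", "mem", "bat", "net")


def overall_score(client: Dict[str, float], weights: Sequence[float]) -> float:
    """Weighted sum of the four metrics; weights are assumed to sum to 1."""
    return sum(w * client[m] for w, m in zip(weights, METRICS))


def cluster_state(
    clients: List[Dict[str, float]],
    weights: Sequence[float] = (0.25, 0.25, 0.25, 0.25),  # assumed equal weights
) -> List[float]:
    """Return [mean CPU, mean Mem, mean Bat, mean Net, mean Overall, sigma]."""
    n = len(clients)
    means = [sum(c[m] for c in clients) / n for m in METRICS]
    overall = [overall_score(c, weights) for c in clients]
    mean_overall = sum(overall) / n
    # Population standard deviation of the overall scores within the cluster.
    sigma = math.sqrt(sum((o - mean_overall) ** 2 for o in overall) / n)
    return means + [mean_overall, sigma]


# Two hypothetical clients: one strong device and one weak device.
state = cluster_state([
    {"cpu": 0.8, "mem": 0.6, "bat": 1.0, "net": 0.6},
    {"cpu": 0.2, "mem": 0.4, "bat": 0.5, "net": 0.5},
])
```

In this example the overall scores are 0.75 and 0.40, so the mean overall capability is 0.575 and the heterogeneity indicator is 0.175.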
#### 4.2.2 Action Space:
The action space consists of feasible split layer indices $\mathcal{A} = \left\{\ell \in \mathbb{N} : \ell_{\min} \leq \ell \leq \ell_{\max}\right\}$, where $\ell_{\min} = \lceil L/2 \rceil$ represents the minimum split point, set at approximately half the total network depth $L$, and $\ell_{\max} = L-1$ is the maximum split point. The RL Controller’s Committee Machine (Step 4 in Figure [1](https://arxiv.org/html/2606.09869#S3.F1)) selects one split layer from this range through majority voting (Step 4a) and outputs the selected action (Step 4b). The lower bound $\ell_{\min} = \lceil L/2 \rceil$ ensures that clients execute at least half of the network layers, which serves two purposes: (1) it prevents clients from sending raw or nearly-raw data to the server, which would increase privacy risks and communication overhead; and (2) it ensures client devices perform meaningful local feature extraction before transmitting smashed data. For example, in ResNet50 with $L=50$ layers, this constraint places the split point at layer 25 or deeper ($\ell_{\min}=25$). The upper bound $L-1$ excludes the output layer, which must reside on the server for loss computation and gradient initialization. Each action $a_t = \ell$ specifies that clients compute layers $1$ through $\ell$ while the server computes layers $\ell+1$ through $L$.
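The feasible range can be enumerated directly from the network depth. The helper below is a small sketch; the function name `feasible_split_layers` is hypothetical.

```python
from typing import List


def feasible_split_layers(num_layers: int) -> List[int]:
    """Return all split layers l with ceil(L/2) <= l <= L - 1."""
    if num_layers < 2:
        raise ValueError("need at least two layers to split a network")
    l_min = (num_layers + 1) // 2  # integer form of ceil(L / 2)
    l_max = num_layers - 1         # the output layer stays on the server
    return list(range(l_min, l_max + 1))


# ResNet50 example from the text: L = 50 gives split layers 25 through 49.
actions = feasible_split_layers(50)
```

A Q-network with one output per action would then have `len(actions)` outputs, and output index `a` would correspond to split layer `actions[a]`.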
#### 4.2.3 Transition:
After the RL Controller selects a split layer $\ell$ at round $t$ (Step 4 in Figure [1](https://arxiv.org/html/2606.09869#S3.F1)), the environment executes one complete SFL training round following Steps 5 through 12 as shown in Figure [1](https://arxiv.org/html/2606.09869#S3.F1), using Algorithm [2](https://arxiv.org/html/2606.09869#alg2) from Appendix [0.A.7](https://arxiv.org/html/2606.09869#Pt0.A1.SS7):

$$W_t = \texttt{TrainRoundSFL}(c, s_t^{(c)}, W_{t-1}; Q_\theta, \epsilon), \tag{1}$$

where the server broadcasts parameters and split configuration (Step 5), clients perform forward passes and transmit smashed data (Steps 6–7), the server completes forward and backward passes (Steps 8–9), clients perform backpropagation and weight updates (Steps 10–11), and updates are aggregated via FedAvg (Step 12). The next state is then computed:

$$s_{t+1}^{(c)} = \Phi\left(\{C_i^{(k)}(t+1)\}_{k \in \mathcal{K}_c}\right), \tag{2}$$

where $\Phi(\cdot)$ applies the aggregation to updated client capability metrics.
### 4.3 Reward Function Design
The reward function guides the RL agent toward optimized split point selections that minimize cluster loss while considering capability constraints. As indicated in Step 13 of Figure [1](https://arxiv.org/html/2606.09869#S3.F1), the Reward Calculator computes the reward ($r_t$) based on the decayed loss-drop during the aggregation phase. We employ a decayed loss-drop reward that prioritizes early-round improvements.
#### 4.3.1 Cluster Loss Computation:
At each round $t$, we calculate the overall validation loss for a cluster by averaging the losses from all clients, where we assign more weight to clients with more data. The cluster loss is computed as $L_t = \sum_{k=1}^{K} \omega_k \cdot L_t^{(k)}$, where $\omega_k = \frac{n_k}{\sum_{j=1}^{K} n_j}$ represents the weight of client $k$ based on its number of training samples $n_k$.
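The sample-weighted average can be written in a few lines. This is a minimal sketch; the function name and the example loss values and sample counts are made up for illustration.

```python
from typing import Sequence


def cluster_loss(losses: Sequence[float], sample_counts: Sequence[int]) -> float:
    """Weighted average of client losses, with weights n_k / sum_j n_j."""
    total = sum(sample_counts)
    return sum((n / total) * loss for loss, n in zip(losses, sample_counts))


# Hypothetical cluster: the client holding 300 samples dominates the average.
loss_t = cluster_loss(losses=[0.5, 1.0], sample_counts=[300, 100])
```

Here the weights are 0.75 and 0.25, so the cluster loss is 0.75 × 0.5 + 0.25 × 1.0 = 0.625.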
#### 4.3.2 Decay Factor:
Since the model improves most in early rounds, we use an exponential decay factor to give more credit to early round gains and less to the later ones for optimal split point selection. The exponential decay factor is defined as $\rho_t = e^{-\lambda(t-1)}$, where $\lambda > 0$. At round $t=1$, we have $\rho_1 = e^{0} = 1$, ensuring the first round receives full credit. As training progresses, $\rho_t$ decays exponentially toward zero, causing later rounds to receive diminishing credit for equivalent loss improvements. The decay rate $\lambda$ controls the speed of this transition: larger values of $\lambda$ result in sharper decay, while smaller values produce more gradual decay. In practice, we set $\lambda$ such that $\rho_T \ll 1$ by the final round, which ensures the RL agent prioritizes early stage convergence.
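One way to pick $\lambda$ so that $\rho_T$ is small by the final round is to solve $e^{-\lambda(T-1)} = \rho_T$ for $\lambda$, which gives $\lambda = -\ln(\rho_T)/(T-1)$. The sketch below does this; the target value 0.01 and the horizon of 50 rounds are assumptions chosen for illustration, since the paper only states that $\rho_T \ll 1$.

```python
import math


def decay_factor(t: int, lam: float) -> float:
    """rho_t = exp(-lam * (t - 1)) for round t >= 1."""
    return math.exp(-lam * (t - 1))


def decay_rate_for_target(total_rounds: int, rho_final: float) -> float:
    """Solve exp(-lam * (T - 1)) = rho_final for lam."""
    if total_rounds < 2 or not 0.0 < rho_final < 1.0:
        raise ValueError("need T >= 2 and 0 < rho_final < 1")
    return -math.log(rho_final) / (total_rounds - 1)


T = 50                                          # assumed number of rounds
lam = decay_rate_for_target(T, rho_final=0.01)  # assumed target for rho_T
```

With these assumed values, $\lambda = \ln(100)/49$, the first round keeps full credit, and the last round is credited at one percent.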
#### 4.3.3 Loss-Drop Reward:
The reward at round $t$ measures the improvement in cluster loss, weighted by the temporal decay: $\boxed{r_t = -\left(L_t - L_{t-1}\right) \cdot \rho_t}$. If the selected split point leads to a loss reduction ($L_t < L_{t-1}$), the reward is positive; if the loss increases, the reward is negative. The decay factor $\rho_t$ ensures that equivalent improvements in later rounds receive progressively smaller rewards, encouraging the agent to prioritize early convergence. Detailed reward behavior examples are provided in Table [4](https://arxiv.org/html/2606.09869#Pt0.A1.T4) in Appendix [0.A.6](https://arxiv.org/html/2606.09869#Pt0.A1.SS6).
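The effect of the decay on the reward can be seen in a short worked sketch. The function name, the value $\lambda = 0.1$, and the loss values are illustrative assumptions, and the previous loss is assumed to be finite.

```python
import math


def loss_drop_reward(loss_t: float, loss_prev: float, t: int, lam: float) -> float:
    """r_t = -(L_t - L_{t-1}) * exp(-lam * (t - 1))."""
    rho_t = math.exp(-lam * (t - 1))
    return -(loss_t - loss_prev) * rho_t


lam = 0.1  # assumed decay rate for this example

# The same loss drop of 0.5 earns less credit in a later round.
early = loss_drop_reward(loss_t=1.5, loss_prev=2.0, t=1, lam=lam)   # 0.5 * e^0
late = loss_drop_reward(loss_t=1.5, loss_prev=2.0, t=11, lam=lam)   # 0.5 * e^-1

# A loss increase yields a negative reward.
worse = loss_drop_reward(loss_t=2.2, loss_prev=2.0, t=1, lam=lam)
```

The early reward is 0.5, while the identical improvement ten rounds later is scaled by $e^{-1}$ to roughly 0.18.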
### 4\.4Deep Q\-Network Architecture
The RL Controller shown in Block B of Figure[1](https://arxiv.org/html/2606.09869#S3.F1)is implemented using a Deep Q\-Network \(DQN\) to approximate the action value functionQ\(s,a\)Q\(s,a\)\. The Q\-network is an MLP that maps the six\-dimensional capability state \(constructed through Steps 1–3\) to Q\-values for each possible split action\. To improve learning stability, we maintain an experience replay buffer𝒟\\mathcal\{D\}that stores transition tupleset=\(st,at,rt,st\+1,dt\)e\_\{t\}=\(s\_\{t\},a\_\{t\},r\_\{t\},s\_\{t\+1\},d\_\{t\}\), wheredt∈\{0,1\}d\_\{t\}\\in\\\{0,1\\\}indicates round termination\. The buffer operates under a First\-In\-First\-Out \(FIFO\) policy with fixed capacity\. During training, mini\-batches ofBBtransitions are sampled uniformly from𝒟\\mathcal\{D\}, which breaks the correlation between consecutive samples and helps the model learn more effectively\. We also use a target networkQθ¯Q\_\{\\bar\{\\theta\}\}with parametersθ¯\\bar\{\\theta\}that are periodically synchronized with the current network\. For each sampled transition\(si,ai,ri,si′,di\)\(s\_\{i\},a\_\{i\},r\_\{i\},s^\{\\prime\}\_\{i\},d\_\{i\}\), the TD target is computed asyi=ri\+γ\(1−di\)maxa′Qθ¯\(si′,a′\)y\_\{i\}=r\_\{i\}\+\\gamma\(1\-d\_\{i\}\)\\max\_\{a^\{\\prime\}\}Q\_\{\\bar\{\\theta\}\}\(s^\{\\prime\}\_\{i\},a^\{\\prime\}\)\. The DQN parameters are updated by minimizing the mean\-squared TD error inBBmini\-batches is:
$$\mathcal{L}(\theta) = \frac{1}{B}\sum_{i=1}^{B}\left(Q_{\theta}(s_i, a_i) - y_i\right)^2. \tag{3}$$

For action selection, we use the standard $\epsilon$-greedy exploration strategy with exponential decay \[[11](https://arxiv.org/html/2606.09869#bib.bib35)\].
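The TD target and the loss in Eq. (3) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two Q-functions are toy callables standing in for the online and target networks, and the discount $\gamma = 0.9$ and all numeric values are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# (s, a, r, s', d) as stored in the replay buffer
Transition = Tuple[Sequence[float], int, float, Sequence[float], int]
# Maps a state to one Q-value per action
QFunction = Callable[[Sequence[float]], List[float]]


def td_target(r: float, s_next: Sequence[float], d: int,
              q_target: QFunction, gamma: float) -> float:
    """y = r + gamma * (1 - d) * max_a' Q_target(s', a')."""
    return r + gamma * (1 - d) * max(q_target(s_next))


def td_loss(batch: List[Transition], q_online: QFunction,
            q_target: QFunction, gamma: float) -> float:
    """Mean-squared TD error over a mini-batch, as in Eq. (3)."""
    total = 0.0
    for s, a, r, s_next, d in batch:
        y = td_target(r, s_next, d, q_target, gamma)
        total += (q_online(s)[a] - y) ** 2
    return total / len(batch)


# Toy stand-ins for the online and target networks (two actions).
q_online = lambda s: [sum(s), 0.5 * sum(s)]
q_target = lambda s: [1.0, 2.0]

batch = [
    ([0.5, 0.5], 0, 1.0, [0.2, 0.2], 0),   # non-terminal transition
    ([1.0, 0.0], 1, -0.5, [0.0, 0.0], 1),  # terminal: bootstrap term is dropped
]
loss = td_loss(batch, q_online, q_target, gamma=0.9)
print(loss)
```

In a real implementation the loss would be differentiated with respect to the online parameters only, with the target network held fixed between synchronizations.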
Algorithm 1: SFL with RL-based DQN Training

Require: $M$ (odd), $L$, $\mathcal{A}$, $\eta$, $\gamma$, $\epsilon_0$, $\epsilon_{\min}$, $\kappa$, $\mathcal{C}$, $N_{\max}$, $B$, $\{\mathcal{K}_c\}$, $T$, $E$

Ensure: Trained committee $\{Q_{\theta}^{(m)}\}_{m=1}^{M}$; policy $\pi$

1: Model form (shared encoder, per-member head): $f_s(\cdot;\phi_s)$ (shared), $g^{(m)}(\cdot;\psi^{(m)})$ (head); $Q^{(m)}_{\theta}(s,a) = \big[g^{(m)}(f_s(s;\phi_s);\psi^{(m)})\big]_a$; $\theta_m = f_s + g^{(m)}$

2: Initialize shared encoder $f_s(\cdot;\phi_s)$, heads $\{g^{(m)}\}_{m=1}^{M}$; set $Q_{\bar{\theta}}^{(m)} \leftarrow Q_{\theta}^{(m)}$ for all $m$

3: Initialize replay buffer $\mathcal{D} \leftarrow \emptyset$

4: for episode $e = 1$ to $E$ do

5: for cluster $c = 1$ to $C$ do

6: Set $\epsilon \leftarrow \epsilon_0$; $L_0 \leftarrow \infty$

7: for round $t = 1$ to $T$ do

8: Compute capability metrics $C_{\text{CPU}}^{(k)}(t)$, $C_{\text{Memory}}^{(k)}(t)$, $C_{\text{Battery}}^{(k)}(t)$, $C_{\text{Network}}^{(k)}(t)$ for all $k \in \mathcal{K}_c$

9: Construct state $s_t^{(c)} = [\bar{C}_{\text{CPU}}^{(c)}, \bar{C}_{\text{Memory}}^{(c)}, \bar{C}_{\text{Battery}}^{(c)}, \bar{C}_{\text{Network}}^{(c)}, \bar{C}_{\text{Overall}}^{(c)}, \sigma_c(t)]$

10: Select action $a_t$: with prob. $\epsilon$ random from $\mathcal{A}$, else $a_t \leftarrow \texttt{CommitteeVote}(s_t^{(c)}, \{Q_{\theta}^{(m)}\})$ (Alg. [3](https://arxiv.org/html/2606.09869#alg3))

11: Execute SFL round: $(W_t, L_t) \leftarrow \texttt{TrainRoundSFL}(c, s_t^{(c)}, W_{t-1}, \ell = a_t)$ (Alg. [2](https://arxiv.org/html/2606.09869#alg2))

12: Compute reward $r_t = -(L_t - L_{t-1})\cdot\rho_t$ where $\rho_t = e^{-\lambda(t-1)}$

13: Obtain next state $s_{t+1}^{(c)}$ by recomputing capability metrics

14: Set terminal flag: $d_t = \mathbf{1}[t = T]$

15: Store $(s_t^{(c)}, a_t, r_t, s_{t+1}^{(c)}, d_t)$ in $\mathcal{D}$; if $|\mathcal{D}| > N_{\max}$, pop oldest

16: Sample mini-batch $\{(s_i, a_i, r_i, s'_i, d_i)\}_{i=1}^{B}$ from $\mathcal{D}$

17: for $m = 1$ to $M$ do

18: Compute TD target: $y_i^{(m)} = r_i + \gamma(1-d_i)\max_{a'} Q_{\bar{\theta}}^{(m)}(s'_i, a')$

19: Update: $\theta^{(m)} \leftarrow \theta^{(m)} - \eta\nabla_{\theta^{(m)}}\frac{1}{B}\sum_i (Q_{\theta}^{(m)}(s_i, a_i) - y_i^{(m)})^2$

20: if $t \bmod \mathcal{C} = 0$ then

21: $Q_{\bar{\theta}}^{(m)} \leftarrow Q_{\theta}^{(m)}$

22: end if

23: end for

24: Decay: $\epsilon \leftarrow \epsilon_{\min} + (\epsilon_0 - \epsilon_{\min})e^{-\kappa t}$; $L_{t-1} \leftarrow L_t$

25: end for

26: end for

27: end for

28: return $\{Q_{\theta}^{(m)}\}_{m=1}^{M}$, $\pi(s) = \texttt{CommitteeVote}(s, \{Q_{\theta}^{(m)}\})$
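Steps 15 and 16 of Algorithm 1 (FIFO storage with capacity $N_{\max}$ and uniform mini-batch sampling) can be sketched with a bounded deque. The class name, the seed argument, and the choice to return fewer than $B$ transitions while the buffer is still filling are assumptions made for this illustration, not details taken from the paper.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO buffer of (s, a, r, s_next, d) transitions."""

    def __init__(self, capacity: int, seed: int = 0):
        # A deque with maxlen discards the oldest item once it is full.
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done) -> None:
        self._buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        """Uniform sample without replacement.

        Assumption: if fewer than batch_size transitions are stored,
        all of them are returned in random order.
        """
        k = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), k)

    def __len__(self) -> int:
        return len(self._buffer)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    # Toy one-dimensional states; the last transition is terminal.
    buffer.push([0.1 * t], t % 2, -0.01 * t, [0.1 * (t + 1)], int(t == 4))

batch = buffer.sample(2)
print(len(buffer), batch)
```

After five pushes into a buffer of capacity three, only the three most recent transitions remain, which is the FIFO behavior the algorithm relies on.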
Committee-Based Solution for Mitigating Reward Hacking: A single agent can sometimes learn to cheat by finding shortcuts that increase its reward without actually improving the model's real performance; this phenomenon is known as reward hacking \[[32](https://arxiv.org/html/2606.09869#bib.bib34)\]. As shown in Step 4 of Figure [1](https://arxiv.org/html/2606.09869#S3.F1), the RL Controller's Committee Machine employs $M$ MLP heads with a shared encoder (with $M$ being an odd number) that vote together via majority voting (Step 4a) to decide the best split action (Step 4b). Each committee member shares a common shallow encoder $f_s(\cdot;\phi_s)$ for stable low-level state representation, but maintains an independent deep head $g^{(m)}(\cdot;\psi^{(m)})$ for diverse Q-value estimation:
$$Q_{\theta}^{(m)}(s,a) = \left[g^{(m)}\left(f_s(s;\phi_s);\psi^{(m)}\right)\right]_a. \tag{4}$$

By sharing the early layers but keeping separate deeper layers, the committee members learn different perspectives on which actions are best, without adding much computational cost. During action selection, each committee member proposes its preferred split point by selecting the action with the highest Q-value. We then count the votes for each action and select the action with the most votes as the final decision. Ties are broken by selecting the action with the highest mean Q-value across committee members. This majority voting mechanism is designed so that no single head can impose an action that exploits loopholes in the reward function.

It is worth clarifying the relationship between the committee structure and the broader RL formulation. QSplitFL is designed as a single-agent RL system. A single centralized RL Controller on the server observes one global state $s_t^{(c)}$ aggregating all client capability metrics and outputs one global action, the split layer $\ell$, applied uniformly in that round. The $M$ MLP heads are not independent agents. Rather, they form an ensemble of Q-value estimators within the single agent, each sharing the same encoder but maintaining separate deeper layers \[[32](https://arxiv.org/html/2606.09869#bib.bib34)\]. A true multi-agent formulation would require each client to act as an independent decision maker, introducing non-stationary dynamics and coordination overhead. Since QSplitFL needs one globally consistent split point per cluster at each round, the single-agent MDP formulation is both sufficient and appropriate. Beyond architectural clarity, this ensemble design serves a critical functional purpose: mitigating reward hacking.
To verify this, we conducted a controlled comparison between a single-head DQN ($M=1$) and the committee DQN ($M=3$). The single-head agent converged to a fixed split point early in training, accumulating high reward through early loss drops while accuracy plateaued. The committee DQN, in contrast, continued adapting its split selection across rounds and achieved higher final accuracy, which supports majority voting as an effective defense against reward hacking.
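The voting rule described above (each head votes for its argmax action, the majority wins, and ties fall back to the highest mean Q-value) can be sketched as follows. This is an illustrative reading of the prose, not the authors' Algorithm 3; the function names and the toy Q-values are hypothetical.

```python
from collections import Counter
from typing import Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (the first index wins on exact equality)."""
    return max(range(len(values)), key=lambda a: values[a])


def committee_vote(q_values: Sequence[Sequence[float]]) -> int:
    """Majority vote over heads; q_values[m][a] is head m's Q-value for action a.

    Ties between the most-voted actions are broken by the highest
    mean Q-value across all heads.
    """
    votes = Counter(argmax(q) for q in q_values)
    top_count = max(votes.values())
    tied = sorted(a for a, n in votes.items() if n == top_count)
    if len(tied) == 1:
        return tied[0]

    def mean_q(a: int) -> float:
        return sum(q[a] for q in q_values) / len(q_values)

    return max(tied, key=mean_q)


# Two of three heads prefer action 1, so the majority decides.
clear_majority = committee_vote([[0.1, 0.9, 0.3],
                                 [0.2, 0.8, 0.4],
                                 [0.7, 0.1, 0.2]])

# Every head prefers a different action; the mean Q-value breaks the tie.
three_way_tie = committee_vote([[0.9, 0.1, 0.0],
                                [0.0, 0.6, 0.5],
                                [0.1, 0.2, 0.8]])
print(clear_majority, three_way_tie)
```

Note that an odd $M$ rules out two-way ties between two actions but not multi-way splits such as the second example, which is why a tie-breaking rule is still needed.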
### 4\.5Training Procedure
Algorithm [1](https://arxiv.org/html/2606.09869#alg1) presents the main controller that orchestrates the complete training loop. It manages the experience replay buffer $\mathcal{D}$ and coordinates all committee members. At each round $t$, it computes capability metrics, constructs the six-dimensional state $s_t^{(c)}$, and determines whether to explore (random action with probability $\epsilon$) or exploit (committee voting with probability $1-\epsilon$). Two supporting algorithms are provided in Appendix [0.A.7](https://arxiv.org/html/2606.09869#Pt0.A1.SS7): Algorithm [2](https://arxiv.org/html/2606.09869#alg2) (SFL Training Round) executes the FL round corresponding to Steps 5–12 in Figure [1](https://arxiv.org/html/2606.09869#S3.F1), and Algorithm [3](https://arxiv.org/html/2606.09869#alg3) (Committee Majority Voting) implements Step 4.
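The explore-or-exploit decision and the exponential decay of $\epsilon$ from Algorithm 1 can be sketched as below. The default values of $\epsilon_0$, $\epsilon_{\min}$, and $\kappa$ are hypothetical, and `greedy_action` stands in for whatever the committee vote would return.

```python
import math
import random


def epsilon_at(t: int, eps0: float = 1.0, eps_min: float = 0.05,
               kappa: float = 0.05) -> float:
    """epsilon = eps_min + (eps0 - eps_min) * exp(-kappa * t).

    The default hyperparameters are illustrative, not the paper's settings.
    """
    return eps_min + (eps0 - eps_min) * math.exp(-kappa * t)


def select_action(num_actions: int, greedy_action: int,
                  epsilon: float, rng: random.Random) -> int:
    """With probability epsilon pick a random split action, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(num_actions)
    return greedy_action


rng = random.Random(0)
greedy = 2  # hypothetical committee decision for the current state
actions = [select_action(5, greedy, epsilon_at(t), rng) for t in range(1, 101)]
# Exploration fades over rounds, so late actions mostly equal the greedy choice.
print(epsilon_at(1), epsilon_at(100), actions[-10:])
```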
### 4\.6Computational Complexity Analysis
A key advantage of the capability-based state representation is its computational efficiency compared to weight-based approaches. Traditional methods that use PCA-compressed model weights incur a much higher computational cost, whereas our capability-based states require only $\mathcal{O}(|\mathcal{K}|)$ operations, corresponding to simple aggregations (mean, standard deviation) over client capability metrics. The difference becomes especially significant in large-scale settings, when the network has many layers ($d$) and there are many participating clients ($|\mathcal{K}|$).
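To make the $\mathcal{O}(|\mathcal{K}|)$ claim concrete, the sketch below builds a six-dimensional state with one pass of simple aggregations over the clients in a cluster. Two details are assumptions made only for this example, since they are defined elsewhere in the paper: the per-client overall score is taken as the unweighted mean of the four metrics, and $\sigma_c$ is taken as the population standard deviation of those overall scores.

```python
from statistics import fmean, pstdev
from typing import Dict, List

METRICS = ("cpu", "memory", "battery", "network")


def build_state(clients: List[Dict[str, float]]) -> List[float]:
    """Aggregate per-client capability metrics into a 6-dimensional state.

    Layout: [mean CPU, mean memory, mean battery, mean network,
             mean overall, std of overall].
    Assumption: overall = unweighted mean of the four metrics per client.
    """
    overall = [fmean([c[m] for m in METRICS]) for c in clients]
    means = [fmean([c[m] for c in clients]) for m in METRICS]
    return means + [fmean(overall), pstdev(overall)]


# Two hypothetical clients: one strong, one weak.
cluster = [
    {"cpu": 0.8, "memory": 0.6, "battery": 1.0, "network": 0.6},
    {"cpu": 0.4, "memory": 0.2, "battery": 0.4, "network": 0.2},
]
state = build_state(cluster)
print(state)
```

The cost grows linearly with the number of clients and is independent of the model size, which is the point of the comparison with weight-based states.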
## 5Experiments and Results
### 5\.1Experimental Setup
Datasets and Architectures: We evaluate QSplitFL on MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100, covering a range of complexity from simple grayscale to challenging RGB classification tasks. To assess performance across varying computational settings, we pair these datasets with four DNN architectures: a 10-layer CNN (split points 5–9), ResNet50 (50 layers, split points 25–49), MobileNetV4 (53 layers, split points 27–52), and ConvNeXt (59 layers, split points 30–58).
Heterogeneity and Reproducibility: To emulate heterogeneous, resource-constrained clients, each client samples its four capability metrics from device-tier distributions (strong, medium, and weak) that fluctuate across rounds, and clients are grouped into capability clusters served by a per-cluster controller. Data are partitioned with a Dirichlet distribution ($\alpha = 0.5$) for non-IID conditions, and for fairness all baselines share the same backbone, data partition, optimizer, and number of rounds, with split-based baselines using the same client and server partition where applicable. The full device-tier ranges, clustering procedure, hyperparameter settings, and baseline tuning protocol are detailed in Appendix [0.A.8](https://arxiv.org/html/2606.09869#Pt0.A1.SS8), and Appendix [0.A.11](https://arxiv.org/html/2606.09869#Pt0.A1.SS11) discusses the rationale for the split depth bound, its privacy implications, system-level cost, feature selection, and limitations.
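A Dirichlet-based non-IID partition is commonly implemented by drawing, for each class, a vector of client proportions from $\text{Dir}(\alpha)$ and splitting that class's samples accordingly. The sketch below follows this common per-class protocol using only the standard library; whether the paper uses exactly this variant is an assumption, and the function name and seed are hypothetical.

```python
import random
from typing import Dict, List


def dirichlet_partition(labels: List[int], num_clients: int,
                        alpha: float = 0.5, seed: int = 0) -> List[List[int]]:
    """Split sample indices across clients with per-class Dirichlet proportions."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    shards: List[List[int]] = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A Dirichlet(alpha) draw is a vector of normalized Gamma(alpha, 1) draws.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas)
        start, cumulative = 0, 0.0
        for k in range(num_clients):
            cumulative += gammas[k] / total
            # The last client takes the remainder so every index is assigned once.
            end = len(idxs) if k == num_clients - 1 else int(round(cumulative * len(idxs)))
            shards[k].extend(idxs[start:end])
            start = end
    return shards


# Toy labels: 1000 samples spread evenly over 10 classes.
toy_labels = [i % 10 for i in range(1000)]
shards = dirichlet_partition(toy_labels, num_clients=5, alpha=0.5)
print([len(s) for s in shards])
```

Smaller values of $\alpha$ concentrate each class on fewer clients, which produces stronger label skew.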
Figure 3: Accuracy Convergence Analysis (100 Rounds). Comparison of ResNet50, MobileNetV4, and ConvNeXt across MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100 with varying client counts. Panels: (a)–(c) MNIST, (d)–(f) Fashion-MNIST, (g)–(i) CIFAR-10, (j)–(l) CIFAR-100; within each dataset the panels show ResNet50, MobileNetV4, and ConvNeXt, in that order. CNN results are detailed in Appendix [0.A.9](https://arxiv.org/html/2606.09869#Pt0.A1.SS9).
### 5\.2Accuracy Convergence Analysis \(100 Rounds\)
Figure[3](https://arxiv.org/html/2606.09869#S5.F3.fig1)shows how accuracy improves over 100 training rounds for all datasets and architectures \(except CNN\)\. Deeper models consistently outperform shallow CNNs\. ConvNeXt achieves the best results: 99\.6% on MNIST, 94\.1% on Fashion\-MNIST, 86\.5% on CIFAR\-10, and 68\.3% on CIFAR\-100\. Across all datasets, we observe that accuracy increases sharply in the first 20–30 rounds and then gradually stabilizes\. This pattern holds regardless of the number of clients \(5, 10, 50, 100, or 200\), showing that QSplitFL scales well to larger federated settings\. The performance gap between architectures becomes more noticeable on harder datasets like CIFAR\-100, where ConvNeXt performs better than ResNet50 and MobileNetV4\. CNN results and additional comparisons are provided in Appendix[0\.A\.9](https://arxiv.org/html/2606.09869#Pt0.A1.SS9)\.
### 5\.3Split Point Selection Analysis
Figure[4](https://arxiv.org/html/2606.09869#S5.F4.fig1)shows how the RL agent adaptively selects split points for deep architectures across all datasets over 100 rounds\. The agent learns to choose higher split points \(layers 30–59\) for deeper networks like ResNet50, MobileNetV4, and ConvNeXt\. This means more layers run on the client side, which is possible because these architectures have more parameters to distribute\. The split point selection remains relatively stable across different client counts\. This suggests that the agent focuses primarily on the model architecture rather than the number of participating clients when making split decisions\. The consistency of these choices across datasets also indicates that the learned policy generalizes well to different data distributions\.
Figure 4: Dynamic Split Point Selection (100 Rounds). Adaptive split point selection for ResNet50, MobileNetV4, and ConvNeXt. The agent consistently selects optimal split layers (30–59 depending on architecture) to balance computation and communication. Panels: (a)–(c) MNIST, (d)–(f) Fashion-MNIST, (g)–(i) CIFAR-10, (j)–(l) CIFAR-100; within each dataset the panels show ResNet50, MobileNetV4, and ConvNeXt, in that order.
### 5\.4Comparison with Baseline Methods
We compare the QSplitFL framework against several state-of-the-art techniques: Centralized Learning (theoretical upper bound), FedAvg \[[19](https://arxiv.org/html/2606.09869#bib.bib36)\], FedProx \[[17](https://arxiv.org/html/2606.09869#bib.bib77)\], q-FedAvg \[[18](https://arxiv.org/html/2606.09869#bib.bib31)\], SplitFed \[[27](https://arxiv.org/html/2606.09869#bib.bib49)\], ClusterSFL \[[3](https://arxiv.org/html/2606.09869#bib.bib7)\], HeteroSFL \[[7](https://arxiv.org/html/2606.09869#bib.bib8)\], SHeRL-FL \[[28](https://arxiv.org/html/2606.09869#bib.bib9)\], and FLUID/SFL-V2 \[[8](https://arxiv.org/html/2606.09869#bib.bib10)\]. Table [1](https://arxiv.org/html/2606.09869#S5.T1) presents the comparison of QSplitFL against baseline methods using 5 clients and 100 training rounds. A detailed ablation study isolating the contribution of each design component is provided in Appendix [0.A.10](https://arxiv.org/html/2606.09869#Pt0.A1.SS10).
### 5\.5Summary of Findings
Across all datasets and training settings, we observe a consistent performance ranking: ConvNeXt > ResNet50 > MobileNetV4 > CNN. This ordering holds regardless of the number of clients and training rounds, which indicates that deeper and more modern architectures benefit the most from QSplitFL's adaptive split point selection. Extended training with 100 rounds improves accuracy by 15–28% over 10 rounds, with the largest gains on harder datasets like CIFAR-100. Furthermore, all configurations scale stably from 5 to 200 clients, which demonstrates QSplitFL's effectiveness in both small and large federated settings. In terms of split behavior, shallow CNNs use early layers (5–9), while deeper networks use mid to late layers (30–59), which enables efficient resource utilization across heterogeneous devices. Beyond accuracy, this adaptive split selection also improves communication efficiency: deeper splits reduce the size of transmitted activations, while the capability-aware state naturally penalizes clients with poor connectivity through $C_{\text{Net}}^{(k)}(t) = 1 - \text{Latency}^{(k)}(t)/\text{Latency}_{\max}$, pushing the agent to favor deeper splits that transmit less data. As a result, QSplitFL implicitly optimizes communication overhead without an explicit bandwidth term in the reward, making it well suited for bandwidth-constrained deployments such as rural healthcare facilities and IoT edge environments.
Table 1:Baseline Accuracy Comparison \(QSplitFL with 5 Clients, 100 Rounds\)
## 6Conclusion
This paper presented QSplitFL, a capability-aware Deep Q-Network framework for dynamic split point selection in SFL. The proposed approach addresses the fundamental challenge of enabling FL on resource-constrained edge devices by intelligently partitioning neural network models based on client hardware capabilities. Our framework introduces three key innovations: (1) a lightweight state representation that reduces computational complexity to $\mathcal{O}(|\mathcal{K}|)$ by directly encoding client metrics instead of model weights, (2) a committee-based RL architecture with majority voting that mitigates reward hacking and ensures adaptive split point decisions, and (3) a decayed loss-drop reward function that prioritizes early-round convergence. Experimental evaluation across four benchmark datasets and four neural network architectures demonstrates that QSplitFL achieves accuracy of 99.47% on MNIST, 93.99% on Fashion-MNIST, 86.16% on CIFAR-10, and 68.27% on CIFAR-100, while scaling gracefully from 5 to 200 clients. Finally, QSplitFL outperforms traditional FL methods while enabling participation of resource-constrained devices.
#### 6\.0\.1Acknowledgements
The work of Nazmus Shakib Shadin and Xinyue Zhang is partly supported by the U\.S\. National Science Foundation \(NSF\-2348417 and NSF\-2431597\)\. The work of Jingyi Wang is partly supported by the U\.S\. National Science Foundation \(CNS\-2431594\)\.
## References
- \[1\] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp. 308–318.
- \[2\] H. G. Abreha, M. Hayajneh, and M. A. Serhani (2022) Federated learning in edge computing: a systematic survey. Sensors 22(2), pp. 450.
- \[3\] M. Arafeh, M. Wazzeh, H. Sami, H. Ould-Slimane, C. Talhi, A. Mourad, and H. Otrok (2025) Efficient privacy-preserving ml for iot: cluster-based split federated learning scheme for non-iid data. Journal of Network and Computer Applications 236, pp. 104105.
- \[4\] D. J. Beutel, T. Topal, A. Mathur, X. Qiu, J. Fernandez-Marques, Y. Gao, L. Sani, K. H. Li, T. Parcollet, P. P. B. de Gusmão, et al. (2020) Flower: a friendly federated learning research framework. arXiv preprint arXiv:2007.14390.
- \[5\] F. Binucci, M. Merluzzi, P. Banelli, E. C. Strinati, and P. Di Lorenzo (2024) Enabling edge artificial intelligence via goal-oriented deep neural network splitting. In 2024 19th International Symposium on Wireless Communication Systems (ISWCS), pp. 1–6.
- \[6\] H. Chen, X. Chen, L. Peng, and Y. Bai (2023) Personalized fair split learning for resource-constrained internet of things. Sensors 24(1), pp. 88.
- \[7\] X. Chen, J. Li, D. Fan, and C. Chakrabarti (2025) HeteroSFL: split federated learning with heterogeneous clients and non-iid data. IEEE Internet of Things Journal 12(15), pp. 30460–30474. doi: [10.1109/JIOT.2025.3572393](https://dx.doi.org/10.1109/JIOT.2025.3572393).
- \[8\] J. Dachille, C. Huang, and X. Liu (2024) The impact of cut layer selection in split federated learning. arXiv preprint arXiv:2412.15536.
- \[9\] W. Fan, P. Chen, X. Chun, and Y. Liu (2025) MADRL-based model partitioning, aggregation control, and resource allocation for cloud-edge-device collaborative split federated learning. IEEE Transactions on Mobile Computing.
- \[10\] O. Gupta and R. Raskar (2018) Distributed learning of deep neural network over multiple agents. Journal of Network and Computer Applications 116, pp. 1–8.
- \[11\] N. Hariharan and A. G. Paavai (2022) A brief study of deep reinforcement learning with epsilon-greedy exploration. International Journal of Computing and Digital Systems 11(1), pp. 541–552.
- \[12\] G. S. Hukkeri, R. Goudar, G. Dhananjaya, and V. N. Rathod (2025) A comprehensive survey on split-fed learning: methods, innovations, and future directions. IEEE Access 13, pp. 46312.
- \[13\] L. U. Khan, W. Saad, Z. Han, E. Hossain, and C. S. Hong (2021) Federated learning for internet of things: recent advances, taxonomy, and open challenges. IEEE Communications Surveys & Tutorials 23(3), pp. 1759–1799.
- \[14\] Y. L. Lee and D. Qin (2019) A survey on applications of deep reinforcement learning in resource management for 5g heterogeneous networks. In 2019 Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC), pp. 1856–1862.
- \[15\] J. Li and S. Yu (2024) Integrity verifiable privacy-preserving federated learning for healthcare-iot. In 2024 IEEE International Conference on E-health Networking, Application & Services (HealthCom), pp. 1–6.
- \[16\] O. Li, J. Sun, X. Yang, W. Gao, H. Zhang, J. Xie, V. Smith, and C. Wang (2021) Label leakage and protection in two-party split learning. arXiv preprint arXiv:2102.08504.
- \[17\] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith (2020) Federated optimization in heterogeneous networks. Proceedings of Machine learning and systems 2, pp. 429–450.
- \[18\] T. Li, M. Sanjabi, A. Beirami, and V. Smith (2019) Fair resource allocation in federated learning. arXiv preprint arXiv:1905.10497.
- \[19\] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang (2019) On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189.
- \[20\] Y. Liang, Q. Chen, G. Zhu, M. K. Awan, and H. Jiang (2025) Communication-and-computation efficient split federated learning: gradient aggregation and resource management. arXiv preprint arXiv:2501.01078.
- \[21\] X. Liu, Y. Deng, and T. Mahmoodi (2022) Wireless distributed learning: a new hybrid split and federated learning approach. IEEE Transactions on Wireless Communications 22(4), pp. 2650–2665.
- \[22\] J. Nixon, B. Lakshminarayanan, and D. Tran (2020) Why are bootstrapped deep ensembles not better?. In "I Can't Believe It's Not Better!" NeurIPS 2020 workshop.
- \[23\] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy (2016) Deep exploration via bootstrapped dqn. Advances in neural information processing systems 29.
- \[24\] E. Samikwa, A. Di Maio, and T. Braun (2022) Ares: adaptive resource-aware split learning for internet of things. Computer Networks 218, pp. 109380.
- \[25\] T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952.
- \[26\] A. Singh, P. Vepakomma, O. Gupta, and R. Raskar (2019) Detailed comparison of communication efficiency of split learning and federated learning. arXiv preprint arXiv:1909.09145.
- \[27\] C. Thapa, P. C. M. Arachchige, S. Camtepe, and L. Sun (2022) Splitfed: when federated learning meets split learning. In Proceedings of the AAAI conference on artificial intelligence, Vol. 36, pp. 8485–8493.
- \[28\] D. T. Tran, N. B. Ha, V. Nguyen, and K. Wong (2025) SHeRL-fl: when representation learning meets split learning in hierarchical federated learning. arXiv preprint arXiv:2508.08339.
- \[29\] P. Vepakomma, O. Gupta, T. Swedish, and R. Raskar (2018) Split learning for health: distributed deep learning without sharing raw patient data. arXiv preprint arXiv:1812.00564.
- \[30\] H. Wang, Z. Kaplan, D. Niu, and B. Li (2020) Optimizing federated learning on non-iid data with reinforcement learning. In IEEE INFOCOM 2020-IEEE conference on computer communications, pp. 1698–1707.
- \[31\] D. Wu, R. Ullah, P. Harvey, P. Kilpatrick, I. Spence, and B. Varghese (2022) Fedadapt: adaptive offloading for iot devices in federated learning. IEEE Internet of Things Journal 9(21), pp. 20889–20901.
- \[32\] F. Wu, X. Liu, H. Wang, X. Wang, L. Su, and J. Gao (2024) Towards federated rlhf with aggregated client preference for llms. arXiv preprint arXiv:2407.03038.
- \[33\] M. Wu, G. Cheng, P. Li, R. Yu, Y. Wu, M. Pan, and R. Lu (2023) Split learning with differential privacy for integrated terrestrial and non-terrestrial networks. IEEE Wireless Communications 31(3), pp. 177–184.
- \[34\] S. Yan, P. Zhang, S. Huang, J. Wang, H. Sun, Y. Zhang, and A. Tolba (2023) Node selection algorithm for federated learning based on deep reinforcement learning for edge computing in iot. Electronics 12(11), pp. 2478.
- \[35\] L. Yu, Z. Chang, Y. Jia, and G. Min (2025) Model partition and resource allocation for split learning in vehicular edge networks. IEEE Transactions on Intelligent Transportation Systems.
- \[36\] X. Yuan, Z. Zhang, C. Feng, Y. Cui, S. Garg, G. Kaddoum, and K. Yu (2022) A dqn-based frame aggregation and task offloading approach for edge-enabled iomt. IEEE Transactions on Network Science and Engineering 10(3), pp. 1339–1351.
- \[37\] G. Zhu, Y. Deng, X. Chen, H. Zhang, Y. Fang, and T. F. Wong (2024) ESFL: efficient split federated learning over resource-constrained heterogeneous wireless devices. IEEE Internet of Things Journal 11(16), pp. 27153–27166.
## Appendix 0\.AAppendix
This appendix provides detailed supplementary materials that correspond to the main paper\. We begin with a discussion of the key challenges in existing split learning based federated learning \(SFL\) and our proposed solutions, followed by the committee\-based reinforcement learning \(RL\) mechanism, neural network architectures, notation summary, reward function, supporting algorithm pseudocode, detailed experimental setup, and additional experimental results\.
### 0\.A\.1Challenges in Split Learning based Federated Learning and Our Solution
A fundamental challenge in SFL is to determine the optimal split point between client and server portions of the deep neural network \(DNN\)\. As illustrated in Figure[5](https://arxiv.org/html/2606.09869#Pt0.A1.F5), existing state\-of\-the\-art approaches face several critical limitations when addressing device heterogeneity\. The key challenges in existing approaches are:
- Static Split Point Selection: Traditional methods use fixed split points that fail to adapt to varying client capabilities, leading to suboptimal resource utilization and training inefficiency.
- Device Heterogeneity: Clients possess diverse computational resources (CPU, memory, battery, network bandwidth), which makes a one-size-fits-all approach inadequate.
- Resource Fluctuations: Client capabilities can change over time due to concurrent processes, battery drain, and network conditions, which require adaptive split points in SFL environments.
Our Solution: QSplitFL addresses these challenges through a capability-aware reinforcement learning framework that adaptively selects optimal split points based on current client resource measurements. By formulating split point selection as a Markov Decision Process (MDP) and employing a committee-based Deep Q-Network (DQN), our approach makes robust, adaptive decisions that maximize training efficiency while respecting device constraints.
Figure 5: Challenges in Existing Split Federated Learning vs Our Proposed Solution. The left side illustrates the key limitations of current state-of-the-art approaches, including static split points and the inability to handle device heterogeneity. The right side presents QSplitFL's capability-aware, reinforcement-learning-based SFL solution that dynamically adapts split points based on current client resource metrics.
### 0\.A\.2Capability Metric Descriptions
The four normalized capability metrics are defined as follows: (1) $C_{\text{CPU}}^{(k)}$ measures available CPU resources normalized by maximum CPU capacity; (2) $C_{\text{Mem}}^{(k)}$ represents available memory normalized by maximum memory; (3) $C_{\text{Bat}}^{(k)}$ indicates the current battery level as a fraction of full charge; and (4) $C_{\text{Net}}^{(k)}$ measures network quality inversely to latency, so that lower latency yields higher capability.
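The four definitions can be sketched as a small normalization routine. The network term uses the form $1 - \text{Latency}/\text{Latency}_{\max}$ stated in Section 5.5; the field names, units, the value of `latency_max_ms`, and the clamping to $[0, 1]$ are assumptions introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ClientReading:
    """Raw measurements for one client (field names and units are hypothetical)."""
    cpu_available: float
    cpu_max: float
    mem_available_mb: float
    mem_max_mb: float
    battery_percent: float
    latency_ms: float


def clamp01(x: float) -> float:
    """Keep a normalized metric inside [0, 1] (an assumption of this sketch)."""
    return max(0.0, min(1.0, x))


def capability_metrics(r: ClientReading, latency_max_ms: float = 500.0) -> Dict[str, float]:
    """Normalize raw readings into the four capability metrics."""
    return {
        "cpu": clamp01(r.cpu_available / r.cpu_max),
        "memory": clamp01(r.mem_available_mb / r.mem_max_mb),
        "battery": clamp01(r.battery_percent / 100.0),
        # Lower latency yields a higher network capability.
        "network": clamp01(1.0 - r.latency_ms / latency_max_ms),
    }


reading = ClientReading(cpu_available=2.0, cpu_max=4.0,
                        mem_available_mb=1024.0, mem_max_mb=4096.0,
                        battery_percent=80.0, latency_ms=100.0)
metrics = capability_metrics(reading)
print(metrics)
```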
### 0\.A\.3Committee\-Based Reward Hacking Mitigation
A critical consideration in RL based optimization is the potential for reward hacking, which means that the agent learns to find loopholes in the reward function rather than achieving the intended objective\. In the context of SFL, a single DQN agent might learn suboptimal behaviors that artificially inflate rewards without genuinely improving model performance\.
Figure [6](https://arxiv.org/html/2606.09869#Pt0.A1.F6) illustrates how QSplitFL’s committee-based architecture mitigates reward hacking through ensemble decision making. By employing multiple independent DQN heads with a shared encoder backbone, the framework introduces diverse perspectives in action selection. The mechanisms are as follows:
- •Shared Feature Extraction: All committee members share a common encoder $f_s(\cdot;\phi_s)$ that extracts capability-relevant features from the state representation.
- •Independent Decision Heads: Each member $m$ maintains its own decision head $g^{(m)}(\cdot;\psi^{(m)})$, which can produce different Q-value estimates and action preferences. Because the heads can pick up different patterns in the training data, no single network dominates the decision.
- •Majority Voting:The final split point is determined by majority voting across all committee members\.
- •Tie\-Breaking via Mean Q\-Values:When votes are tied, the action with the highest average Q\-value across all members is selected\.
Figure 6: Committee-Based Reward Hacking Prevention Mechanism. The architecture employs $M$ DQN members (typically $M=3$ or $M=5$, always odd) that share a common encoder but maintain independent decision heads. Each member proposes its preferred split action, and the final decision is made through majority voting. This ensemble approach mitigates reward hacking by ensuring decisions reflect consensus across diverse learned policies rather than exploitation by a single agent.

Having established the conceptual foundations through the above figures, we now present the detailed technical specifications, including the DNN architectures, mathematical notation, reward function behavior, complete algorithm pseudocode, and comprehensive experimental results.
### 0\.A\.4Neural Network Architectures
Table [2](https://arxiv.org/html/2606.09869#Pt0.A1.T2) summarizes the neural network architectures used in QSplitFL and their split point configurations.
Table 2:Neural Network Architectures and Split Point Selection Action
### 0\.A\.5Notation Summary
Table[3](https://arxiv.org/html/2606.09869#Pt0.A1.T3)provides a comprehensive reference for the mathematical notation used throughout this paper\.
Table 3:Summary of Notation
### 0\.A\.6Reward Function Behavior
Table[4](https://arxiv.org/html/2606.09869#Pt0.A1.T4)illustrates the reward behavior across representative training scenarios\.
Table 4:Reward Function Behavior Across Training Scenarios
### 0\.A\.7Supporting Algorithms
This section presents the supporting algorithmic procedures for the QSplitFL framework\. The main training algorithm \(Algorithm[1](https://arxiv.org/html/2606.09869#alg1)\) is presented in the main paper\.
#### 0\.A\.7\.1Algorithm: SFL Training Round
Algorithm [2](https://arxiv.org/html/2606.09869#alg2) formalizes the execution of an SFL round, corresponding to Steps 5–12 in Figure [1](https://arxiv.org/html/2606.09869#S3.F1). This algorithm is invoked by Algorithm [1](https://arxiv.org/html/2606.09869#alg1) whenever a training round must be executed with a selected split point. The procedure begins with broadcasting model parameters and split configuration to all clients (Step 5). Clients execute parallel forward passes through their assigned layers $1$ to $\ell$ (Steps 6–7), transmitting smashed data $A_{k,t}$ and true labels $Y_k$ to the server. The server completes forward propagation through layers $\ell+1$ to $L$ (Step 8), computes per-client losses (Step 8b), performs backpropagation (Step 9), sends gradients back for client-side backpropagation (Steps 10–11), and aggregates client updates via FedAvg (Step 12).
Algorithm 2 SFL Training Round: $\texttt{TrainRoundSFL}(c, s_t^{(c)}, W_{t-1}, \ell)$

Input: Cluster $c$; current state $s_t^{(c)}$; previous model weights $W_{t-1}$; selected split layer $\ell$
Output: Updated weights $W_t$; cluster loss $L_t$; per-client losses $\{L_t^{(k)}\}$

1: Broadcast $(W_{t-1}, \ell)$ to all clients $k \in \mathcal{K}_c$
2: for each client $k \in \mathcal{K}_c$ in parallel do
3: Client-side forward: Compute $A_{k,t} = f_{\text{client}}(X_k; W_{t-1}^{[1:\ell]})$ {Smashed data}
4: Send $(A_{k,t}, Y_k)$ to server {Activations and true labels}
5: end for
6: Server-side forward: For each client $k$, compute $\hat{Y}_k = f_{\text{server}}(A_{k,t}; W_{t-1}^{[\ell+1:L]})$
7: Compute losses: $L_t^{(k)} = \mathcal{L}(\hat{Y}_k, Y_k)$ for each $k$; $L_t = \sum_k \omega_k L_t^{(k)}$ where $\omega_k = n_k / \sum_j n_j$
8: Server backpropagation: Compute $\nabla_{W^{[\ell+1:L]}} L_t$ and gradients $\nabla A_{k,t}$
9: Send $\nabla A_{k,t}$ to respective clients
10: for each client $k \in \mathcal{K}_c$ in parallel do
11: Client-side backpropagation: Compute $\nabla_{W^{[1:\ell]}} L_t^{(k)}$ using received $\nabla A_{k,t}$
12: Send client-side gradients to server
13: end for
14: FedAvg aggregation: $W_t = W_{t-1} - \eta_{\text{model}} \cdot \frac{1}{|\mathcal{K}_c|} \sum_{k \in \mathcal{K}_c} \nabla W_k$
15: return $W_t$, $L_t$, $\{L_t^{(k)}\}$
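To make the data flow of one round concrete, the following is a toy sketch rather than the authors' implementation. It assumes a two-layer linear model with a mean-squared-error loss, where `W_client` stands in for layers $1$ to $\ell$ and `W_server` for layers $\ell+1$ to $L$. The function name, the model, and the synthetic data are all illustrative assumptions. Clients hold equal sample counts, so the uniform gradient average of step 14 coincides with the $\omega_k$ weighting of step 7.

```python
import numpy as np


def train_round_sfl(W_client, W_server, clients, lr):
    """One toy SFL round. `clients` is a list of (X_k, Y_k) arrays.

    Returns updated client/server weights, the cluster loss, and per-client losses.
    """
    n_total = sum(len(X) for X, _ in clients)
    losses, grads_client, grads_server = [], [], []
    cluster_loss = 0.0
    for X, Y in clients:
        # Client-side forward: smashed data A_k, sent to the server with Y_k.
        A = X @ W_client
        # Server-side forward and per-client loss.
        err = A @ W_server - Y
        loss_k = float(np.mean(err ** 2))
        losses.append(loss_k)
        cluster_loss += (len(X) / n_total) * loss_k  # omega_k = n_k / sum_j n_j
        # Server backpropagation: gradients for server weights and for A_k.
        d_out = 2.0 * err / err.size
        grads_server.append(A.T @ d_out)
        grad_A = d_out @ W_server.T
        # Client-side backpropagation using the received activation gradient.
        grads_client.append(X.T @ grad_A)
    # Aggregation: uniform mean of client gradients, as in step 14.
    W_client = W_client - lr * np.mean(grads_client, axis=0)
    W_server = W_server - lr * np.mean(grads_server, axis=0)
    return W_client, W_server, cluster_loss, losses


# Synthetic regression task shared by three equally sized clients.
rng = np.random.default_rng(0)
w_true = rng.normal(size=(4, 1))
clients = []
for _ in range(3):
    X = rng.normal(size=(32, 4))
    clients.append((X, X @ w_true))

W_c = 0.5 * rng.normal(size=(4, 3))
W_s = 0.5 * rng.normal(size=(3, 1))
history = []
for _ in range(300):
    W_c, W_s, loss, per_client = train_round_sfl(W_c, W_s, clients, lr=0.02)
    history.append(loss)
```

In a real deployment the client and server halves would run on different machines and exchange only `A`, `Y`, and `grad_A`; the single loop here only mirrors the order of operations.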
#### 0\.A\.7\.2Algorithm: Committee Majority Voting
Algorithm [3](https://arxiv.org/html/2606.09869#alg3) implements the majority voting mechanism (Step 4 in Figure [1](https://arxiv.org/html/2606.09869#S3.F1)) for robust action selection when using a committee of $M$ models, where $M$ is odd to reduce the chance of ties. Each committee member $Q_\theta^{(m)}$ independently proposes its preferred split point. Vote counts are tallied for each candidate action, and the action receiving the most votes is selected. Because more than two candidate actions exist, ties remain possible even with odd $M$ (for example, when every member proposes a different action); in that case, the algorithm selects the tied action with the highest mean Q-value across all committee members.
Algorithm 3 Committee Majority Voting: $\texttt{CommitteeVote}(s, \{Q_\theta^{(m)}\}_{m=1}^{M})$

Input: Current state $s$; action set $\mathcal{A} = \{\lceil L/2 \rceil, \ldots, L-1\}$; committee of Q-networks $\{Q_\theta^{(m)}\}_{m=1}^{M}$
Output: Selected split action $a^* \in \mathcal{A}$

1: Member proposals: For each committee member $m = 1, \ldots, M$: $\tilde{a}^{(m)} \leftarrow \arg\max_{a \in \mathcal{A}} Q_\theta^{(m)}(s,a)$ (each member’s action)
2: Vote counting: For each action $a \in \mathcal{A}$, compute vote count: $v(a) \leftarrow \left|\{m \in \{1,\ldots,M\} : \tilde{a}^{(m)} = a\}\right|$
3: Identify majority winner(s): $\mathcal{A}_{\max} \leftarrow \arg\max_{a \in \mathcal{A}} v(a)$
4: if $|\mathcal{A}_{\max}| = 1$ then
5: return $a^* \leftarrow$ the unique action in $\mathcal{A}_{\max}$
6: else
7: Tie-breaking via mean Q-value: For each $a \in \mathcal{A}_{\max}$: $\bar{Q}(s,a) = \frac{1}{M}\sum_{m=1}^{M} Q_\theta^{(m)}(s,a)$
8: return $a^* \leftarrow \arg\max_{a \in \mathcal{A}_{\max}} \bar{Q}(s,a)$
9: end if
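The voting procedure can be sketched in a few lines of NumPy. This sketch assumes the Q-values of all members for the current state have already been collected into an $M \times |\mathcal{A}|$ array; the `committee_vote` name and the array layout are illustrative assumptions, not part of the released code.

```python
import numpy as np


def committee_vote(q_values, actions):
    """Pick a split layer by plurality vote with mean-Q tie-breaking.

    q_values: array of shape (M, |A|); row m holds Q^(m)(s, a) for each action.
    actions:  list of split-layer indices aligned with the columns of q_values.
    """
    q = np.asarray(q_values, dtype=float)
    proposals = q.argmax(axis=1)                       # each member's greedy action
    votes = np.bincount(proposals, minlength=q.shape[1])
    winners = np.flatnonzero(votes == votes.max())     # actions with the most votes
    if len(winners) == 1:
        return actions[winners[0]]
    # Tie: choose the tied action with the highest mean Q-value over all members.
    mean_q = q.mean(axis=0)
    return actions[winners[np.argmax(mean_q[winners])]]


actions = [5, 6, 7, 8, 9]  # CNN action space (layers 5 to 9)

# Two of three members prefer layer 6, so the vote is decided outright.
clear = committee_vote(
    [[0.1, 0.9, 0.2, 0.0, 0.0],
     [0.0, 0.8, 0.3, 0.1, 0.0],
     [0.0, 0.1, 0.2, 0.9, 0.0]], actions)

# Every member proposes a different layer; the mean Q-value breaks the tie.
tied = committee_vote(
    [[1.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.2, 0.9, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.0, 0.6]], actions)
```

The second call shows why a tie-break is still needed with an odd committee: three members can each propose a different layer.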
### 0\.A\.8Detailed Experimental Setup
This subsection provides the implementation details needed to reproduce our experiments, including how heterogeneous and resource\-constrained devices are modeled, how client capabilities are generated, how clients are clustered, and the full hyperparameter and baseline tuning settings\.
#### 0\.A\.8\.1Device Heterogeneity Simulation and Capability Generation\.
We model device heterogeneity by assigning each client a hardware profile drawn from one of three device tiers, namely strong, medium, and weak, which reflect the spread of hardware found in deployments such as rural healthcare networks. For each client $k$, the four raw metrics (available CPU fraction, available memory fraction, battery level, and network latency) are sampled from tier-specific ranges. Representative ranges are listed in Table [5](https://arxiv.org/html/2606.09869#Pt0.A1.T5). At every round, each metric is perturbed by a bounded random fluctuation around its tier value to emulate battery drain, competing background processes, and network congestion, so that a client’s capability changes over time rather than staying fixed. The raw values are then normalized to $[0,1]$ using min-max normalization across the network, as defined in Section [4](https://arxiv.org/html/2606.09869#S4), which produces the per-client capability metrics $C_i^{(k)}(t)$.
Table 5:Representative Capability Ranges for Simulated Device Tiers
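The simulation steps described above (tier sampling, bounded per-round fluctuation, and min-max normalization) might be sketched as follows. The tier ranges in `TIERS` are placeholder values chosen for illustration and are not the values of Table 5. The 10% jitter and the inversion of latency after min-max scaling are likewise assumptions of this sketch.

```python
import numpy as np

# Placeholder (low, high) ranges per raw metric; NOT the values from Table 5.
TIERS = {
    "strong": {"cpu": (0.7, 1.0), "mem": (0.7, 1.0), "bat": (0.6, 1.0), "lat_ms": (10, 50)},
    "medium": {"cpu": (0.4, 0.7), "mem": (0.4, 0.7), "bat": (0.4, 0.8), "lat_ms": (50, 150)},
    "weak":   {"cpu": (0.1, 0.4), "mem": (0.1, 0.4), "bat": (0.1, 0.5), "lat_ms": (150, 400)},
}
METRICS = ("cpu", "mem", "bat", "lat_ms")


def sample_profiles(tiers, rng):
    """Draw one base value per metric for each client from its tier range."""
    return np.array([[rng.uniform(*TIERS[t][m]) for m in METRICS] for t in tiers])


def perturb(base, rng, jitter=0.1):
    """Apply a bounded multiplicative fluctuation around each tier value."""
    noisy = base * (1.0 + rng.uniform(-jitter, jitter, size=base.shape))
    noisy[:, :3] = np.clip(noisy[:, :3], 0.0, 1.0)  # fractions stay within [0, 1]
    return noisy


def normalize(raw):
    """Min-max normalize each metric across clients; latency is inverted."""
    lo, hi = raw.min(axis=0), raw.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    cap = (raw - lo) / span
    cap[:, 3] = 1.0 - cap[:, 3]             # lower latency -> higher capability
    return cap


rng = np.random.default_rng(1)
tiers = ["strong", "medium", "weak", "weak", "strong"]
base = sample_profiles(tiers, rng)
raw_t = perturb(base, rng)   # raw metrics for one round
cap_t = normalize(raw_t)     # per-client capability metrics in [0, 1]
```

Calling `perturb` on the same `base` at every round gives capabilities that drift over time around a fixed tier profile.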
#### 0\.A\.8\.2Clustering Strategy\.
Clients are grouped into $C$ capability clusters according to their overall capability score $C_{\text{Overall}}^{(k)}(t)$. Clients with similar scores are placed in the same cluster, which keeps the within-cluster heterogeneity $\sigma_c(t)$ low and lets a single split point serve all members of a cluster without overloading the weakest device. Each cluster is controlled by its own RL agent, which observes the six-dimensional cluster state and selects one split layer per round. This design separates strong and weak devices, so that capable clusters can use deeper splits while constrained clusters use shallower ones.
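One simple way to realize such a grouping is sketched below. The exact clustering procedure is not specified in this appendix, so the sketch makes three assumptions: the overall score is a weighted sum of the four metrics, clients are sorted by score and cut into near-equal groups, and $\sigma_c$ is the standard deviation of scores within a cluster.

```python
import numpy as np


def overall_score(cap, weights):
    """Weighted sum of capability metrics per client (assumed form of the overall score)."""
    return np.asarray(cap) @ np.asarray(weights)


def cluster_by_score(scores, n_clusters):
    """Sort clients by score and cut them into near-equal groups (0 = weakest)."""
    scores = np.asarray(scores)
    order = np.argsort(scores)
    labels = np.empty(len(scores), dtype=int)
    for c, idx in enumerate(np.array_split(order, n_clusters)):
        labels[idx] = c
    return labels


def heterogeneity(scores, labels):
    """Within-cluster standard deviation of the score (assumed definition of sigma_c)."""
    scores = np.asarray(scores)
    return {int(c): float(np.std(scores[labels == c])) for c in np.unique(labels)}


scores = np.array([0.1, 0.9, 0.5, 0.2, 0.8, 0.6])
labels = cluster_by_score(scores, n_clusters=3)
sigma = heterogeneity(scores, labels)
```

With these six scores, the two weakest clients share cluster 0 and the two strongest share cluster 2, so each cluster's spread stays small compared with the spread across the whole network.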
#### 0\.A\.8\.3Hyperparameter Settings\.
Table[6](https://arxiv.org/html/2606.09869#Pt0.A1.T6)lists the hyperparameters for both the SFL training loop and the DQN controller\. These values are used in all experiments unless stated otherwise\.
Table 6:Hyperparameter Settings for SFL Training and the DQN Controller
#### 0\.A\.8\.4Baseline Tuning and Fairness\.
For a fair comparison, every baseline shares the same backbone architecture, non-IID data partition (Dirichlet $\alpha = 0.5$), client count, optimizer, and number of training rounds as QSplitFL. FedAvg, FedProx, and q-FedAvg are full-model federated baselines: the FedProx proximal weight $\mu$ and the q-FedAvg fairness parameter $q$ were each swept over standard ranges and the best configuration is reported. The split-based baselines, namely SplitFed, ClusterSFL, HeteroSFL, SHeRL-FL, and FLUID/SFL-V2, use the same client and server partition boundaries as QSplitFL where applicable, so that accuracy differences reflect the split-point policy rather than differing model capacity. QSplitFL uses no extra training data or additional rounds relative to the baselines.
### 0\.A\.9Additional Experimental Results
We present detailed experimental results demonstrating the effectiveness of the QSplitFL framework\.
Federated Configurations: We conducted experiments across four client configurations (5, 10, 100, and 200 clients), with data distributed using a Dirichlet distribution [[4](https://arxiv.org/html/2606.09869#bib.bib91)] ($\alpha = 0.5$) to simulate non-IID data heterogeneity. We evaluated QSplitFL at training-round checkpoints of 10, 20, 50, and 100 rounds.
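For readers unfamiliar with Dirichlet partitioning, the sketch below shows the commonly used label-skew variant: for each class, a proportion vector over clients is drawn from $\text{Dir}(\alpha)$ and that class's samples are divided accordingly. Whether the experiments use exactly this variant is an assumption, and the `dirichlet_partition` name is illustrative.

```python
import numpy as np


def dirichlet_partition(labels, n_clients, alpha, rng):
    """Split sample indices across clients with per-class Dirichlet proportions."""
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # Smaller alpha concentrates each class on fewer clients.
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices


# Toy dataset: 10 classes with 50 samples each, split across 5 clients.
toy_labels = np.repeat(np.arange(10), 50)
parts = dirichlet_partition(toy_labels, n_clients=5, alpha=0.5,
                            rng=np.random.default_rng(0))
```

Every sample is assigned to exactly one client, while the class mix per client becomes increasingly skewed as $\alpha$ decreases.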
Model Architecture Comparison: Figure [7](https://arxiv.org/html/2606.09869#Pt0.A1.F7) compares the accuracy achieved by all four neural network architectures (CNN with 10 layers, ResNet50 with 50 layers, MobileNetV4 with 53 layers, and ConvNeXt with 59 layers) in the QSplitFL framework across all four benchmark datasets. Each subplot displays accuracy for different client configurations (5, 10, 100, and 200 clients) after 100 training rounds. The results consistently show that the deep architectures outperform the shallow CNN: ConvNeXt achieves the highest accuracy across all datasets (99.6% on MNIST, 94.1% on Fashion-MNIST, 86.5% on CIFAR-10, and 68.3% on CIFAR-100), followed by ResNet50, then MobileNetV4, with CNN performing lowest. The performance gap between architectures widens as dataset complexity increases. On MNIST, all architectures achieve similar accuracy ($>99\%$), while on CIFAR-100, ConvNeXt outperforms CNN by 5–8%. This indicates that QSplitFL’s capability-aware split point selection enables effective training of deep neural networks on heterogeneous edge devices.
\(a\)MNIST
\(b\)Fashion\-MNIST
\(c\)CIFAR\-10
\(d\)CIFAR\-100
Figure 7: Comprehensive Model Architecture Performance Comparison Across All Benchmark Datasets.

Figure [8](https://arxiv.org/html/2606.09869#Pt0.A1.F8) illustrates the split point selection behavior of the QSplitFL reinforcement learning agent for the lightweight CNN architecture (10 total layers, action space: layers 5–9) across all four benchmark datasets. Unlike deeper architectures, which use a wide range of split layers (30–50), the CNN consistently converges to earlier split points (typically layers 5–7).
\(a\)MNIST
\(b\)FMNIST
\(c\)CIFAR10
\(d\)CIFAR100
Figure 8: CNN Architecture Split Point Selection Analysis.

#### 0\.A\.9\.1CNN Accuracy Convergence Analysis
Figure [9](https://arxiv.org/html/2606.09869#Pt0.A1.F9) shows accuracy convergence for the 10-layer CNN. While it converges stably on simple tasks (MNIST $>99\%$), its limited capacity significantly affects performance on complex datasets like CIFAR-100 (around 50%), which underscores the need for deeper architectures in complex scenarios.
\(a\)MNIST
\(b\)FMNIST
\(c\)CIFAR10
\(d\)CIFAR100
Figure 9:CNN Accuracy Convergence \(100 Rounds\)\.Stable convergence on simple tasks contrasts with limited performance on CIFAR\-100 due to shallow depth\.
#### 0\.A\.9\.2Accuracy Convergence
Figures [11](https://arxiv.org/html/2606.09869#Pt0.A1.F11)–[13](https://arxiv.org/html/2606.09869#Pt0.A1.F13) provide detailed convergence plots. All models converge rapidly on MNIST and Fashion-MNIST ($>95\%$ in 10 rounds). On CIFAR-10 and CIFAR-100, the deep architectures (ConvNeXt, ResNet50) perform clearly better, with ConvNeXt outperforming CNN by 5–8% on CIFAR-100, which illustrates how adaptive split selection lets constrained clients benefit from deeper models.

Figure 10: MNIST Dataset Accuracy Convergence (10, 20, 50 Rounds). Accuracy convergence across four model architectures (CNN, ResNet50, MobileNetV4, ConvNeXt) and varying training durations. Each row corresponds to one architecture, with columns showing 10, 20, and 50 training rounds. All architectures converge rapidly on this baseline dataset, reaching $>95\%$ accuracy within 10 rounds and approaching 99% by 50 rounds across all client configurations (5, 10, 100, 200 clients). The consistent performance demonstrates QSplitFL’s effectiveness even on simple classification tasks, with a minimal performance gap between architectures due to the dataset’s low complexity.

Figure 11: Fashion-MNIST Accuracy Convergence (10, 20, 50 Rounds). Accuracy convergence comparison across architectures and training rounds for the Fashion-MNIST dataset. The figure shows CNN (row 1), ResNet50 (row 2), MobileNetV4 (row 3), and ConvNeXt (row 4) performance at 10, 20, and 50 rounds. This dataset is more structurally complex than MNIST: ConvNeXt consistently achieves the highest accuracy ($\sim$94% at 50 rounds), followed by ResNet50 and MobileNetV4, while CNN plateaus around 90%. The performance gap widens with extended training, demonstrating the benefit of deeper architectures for moderately complex visual classification tasks. All models show stable convergence across diverse client counts.

Figure 12: CIFAR-10 Accuracy Convergence (10, 20, 50 Rounds). Convergence trends on CIFAR-10 show clear performance differences among architectures. Ten rounds of training are insufficient for effective feature learning, particularly for the shallow CNN. Extending training to 50 rounds enables the deep architectures (ResNet50, ConvNeXt) to achieve substantially higher accuracy ($\sim$86%) than the lighter models. The comparison across 10, 20, and 50 rounds demonstrates the importance of adequate training duration for complex datasets, while the comparison across architectures (CNN, ResNet50, MobileNetV4, ConvNeXt) emphasizes the need for capability-aware split point selection to leverage deep model capacity for RGB data.

Figure 13: CIFAR-100 Accuracy Convergence (10, 20, 50 Rounds). Performance analysis on the most challenging fine-grained classification task, with 100 classes. The $4 \times 3$ grid (architectures $\times$ rounds) reveals the architectural impact: the shallow CNN struggles to capture fine-grained features even at 50 rounds ($\sim$62% accuracy), while ConvNeXt reaches $\sim$68%. The progressive improvement from 10 to 50 rounds demonstrates that QSplitFL’s split point adaptation successfully enables resource-constrained edge devices to train deep models by offloading computational burden to the server. This capability is critical for complex real-world tasks where shallow networks fundamentally lack sufficient capacity, which highlights QSplitFL’s value proposition for heterogeneous federated learning environments.
### 0\.A\.10Ablation Study
Table [7](https://arxiv.org/html/2606.09869#Pt0.A1.T7) presents an ablation study evaluating the contribution of each design component in QSplitFL. All configurations are evaluated on CIFAR-10 with ResNet50 using 10 clients over 100 rounds. Each row isolates a single design choice while keeping all other components identical to the full model. The results show that the committee-based DQN ($M=3$), the decayed reward function, and the full capability-aware state representation each contribute meaningfully to convergence speed and final accuracy.
Table 7: Ablation Study on QSplitFL Design Choices (CIFAR-10, ResNet50, 10 Clients, 100 Rounds)

Why does Committee $M=5$ require more rounds to reach 80% accuracy than $M=3$?
With $M=5$, three votes are needed for a strict majority, whereas $M=3$ needs only two. Early in training, the five MLP heads are still noisy because the replay buffer holds little data, so Q-value estimates are scattered. As a result, more ties occur, and breaking them with mean Q-values adds variance to the chosen split, which in turn stalls the steady loss drops that drive early accuracy. The shared encoder also takes longer to settle, since conflicting gradients from five heads are harder to reconcile than those from three. $M=5$ does edge ahead by round 100 (83.9% vs. 83.73%), but that early friction pushes its 80% crossing from round 28 back to round 31.
### 0\.A\.11Extended Discussion
This subsection expands on the design questions raised in the main paper, covering the split depth lower bound and its privacy implications, the system\-level cost of the framework, the choice of capability features, the position of QSplitFL relative to prior reinforcement learning work, and its limitations\.
#### 0\.A\.11\.1Split Depth Lower Bound and Privacy\.
The lower bound $\ell_{\min} = \lceil L/2 \rceil$ is a configurable prior rather than a fixed rule. We set it to half the network depth by default for two reasons that were introduced in the action space definition: it forces clients to perform meaningful local feature extraction, and it prevents raw or near-raw inputs from leaving the device. We agree that a shallower split can be more compute-efficient when clients are powerful, so the bound is exposed as a deployment parameter: in a cluster of capable devices where on-device computation is cheap, an operator can lower $\ell_{\min}$ so that the agent is free to explore shallower splits and shift more work to the client, whereas a cluster of weak devices benefits from keeping $\ell_{\min}$ high. The default value encodes a conservative trade-off that favors privacy and bounded communication when client capability is unknown. Split depth is also a privacy control, which makes it a decisive factor in partitioning, because a deeper split keeps more layers on the device and transmits higher-level activations that are abstracted further from the input, and such representations are generally harder to invert than the shallow activations produced near the input layer [[29](https://arxiv.org/html/2606.09869#bib.bib41),[16](https://arxiv.org/html/2606.09869#bib.bib28)]. The bound therefore acts as a minimum privacy floor: operators who require stronger guarantees can raise $\ell_{\min}$, add differentially private noise to the smashed data [[33](https://arxiv.org/html/2606.09869#bib.bib12),[1](https://arxiv.org/html/2606.09869#bib.bib88)], or, as a direction for future work, incorporate an explicit privacy term into the reward so that the agent jointly optimizes accuracy, communication, and leakage risk.
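The configurable bound can be illustrated with a small helper that builds the action set $\mathcal{A} = \{\ell_{\min}, \ldots, L-1\}$. The `action_set` name and its `l_min` argument are hypothetical; only the default $\lceil L/2 \rceil$ comes from the text.

```python
import math


def action_set(num_layers, l_min=None):
    """Candidate split layers from l_min to L-1; l_min defaults to ceil(L/2)."""
    if l_min is None:
        l_min = math.ceil(num_layers / 2)
    if not 1 <= l_min <= num_layers - 1:
        raise ValueError("l_min must lie in [1, L-1]")
    return list(range(l_min, num_layers))


cnn_default = action_set(10)           # default bound for the 10-layer CNN
cnn_relaxed = action_set(10, l_min=3)  # operator lowers the bound
resnet_default = action_set(50)        # default bound for ResNet50
```

For the 10-layer CNN the default reproduces the action space of layers 5 to 9 reported with Figure 8, and lowering `l_min` simply adds shallower candidates without changing the rest of the controller.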
#### 0\.A\.11\.2System\-Level Cost Considerations\.
Because the motivation of QSplitFL is capability-aware optimization in resource-constrained settings, we clarify how the framework relates to system-level cost. First, the controller itself is inexpensive: the capability-aware state requires only $\mathcal{O}(|\mathcal{K}|)$ aggregation, in contrast to weight-based methods that collect parameters and run PCA, and the committee of small MLPs is negligible next to the DNN being trained. Second, communication overhead is governed by the split layer, since the volume of smashed data per round scales with the activation size at layer $\ell$, and deeper splits transmit smaller tensors. The capability-aware state captures this through the network term $C_{\text{Net}}^{(k)}(t) = 1 - \text{Latency}^{(k)}(t) / \text{Latency}_{\max}$, so poorly connected clients receive lower scores and the agent favors communication-efficient deeper splits, as reported in the summary of findings. Third, client computation and energy are handled in the same implicit manner: the CPU, memory, and battery terms penalize weak or low-battery clients, so the agent offloads more layers to the server for them. We acknowledge that wall-clock latency and on-device energy are reflected here only through these proxies, and that direct hardware-level profiling of energy and end-to-end latency on a physical edge testbed would strengthen the validation. We leave such measurement, together with explicit communication, latency, and energy terms in the reward for multi-objective optimization, to future work.
#### 0\.A\.11\.3Capability Feature Selection\.
The state uses four capability metrics, namely CPU availability, memory availability, battery level, and network latency. These were selected because they are the dominant and directly observable bottlenecks for on-device DNN training: compute throughput, working-set memory, energy budget, and activation transfer time. They are also cheap to read from standard operating-system counters with negligible overhead, and each maps directly to the computation and communication trade-off that the split point controls. We deliberately avoid weight-derived features, which require costly collection and PCA projection. The ablation in Appendix [0.A.10](https://arxiv.org/html/2606.09869#Pt0.A1.SS10) quantifies the effect of this choice: reducing the state to CPU only lowers accuracy from $83.73\%$ to $76.8\%$ and slows convergence from $28$ to $38$ rounds to reach $80\%$ accuracy, while using equal importance weights ($w_i = 0.25$) costs little ($82.5\%$). This indicates that the four metrics are complementary and that the policy is robust to the exact weighting but sensitive to dropping a metric entirely. The tunable weights $w_i$ let operators emphasize whichever resource is most constrained in their setting, such as battery for mobile clients or latency for rural network links.
#### 0\.A\.11\.4Positioning and Novelty\.
QSplitFL integrates established reinforcement learning components, including DQN with experience replay, a target network, reward shaping, and ensemble\-style committee voting\. Its contribution is not a new learning rule but a problem\-specific reformulation for SFL\. The first element is a lightweight, interpretable state that replaces high\-dimensional weight representations and removes the PCA step, which to our knowledge yields the first DQN\-based adaptive split\-point selector for SFL\. The second is a decayed loss\-drop reward tailored to the early\-convergence structure of SFL training\. The third repurposes committee voting, which is related to Bootstrapped DQN\[[23](https://arxiv.org/html/2606.09869#bib.bib15)\]and ensemble DQN\[[22](https://arxiv.org/html/2606.09869#bib.bib16)\], specifically as a defense against reward hacking under this reward design, and we validate its effect empirically in the ablation\. We therefore position the value of the work at the systems level, as a practical, interpretable, and low\-overhead controller, rather than as a new reinforcement learning primitive\.
#### 0\.A\.11\.5Limitations and Future Work\.
Our evaluation uses standard vision benchmarks and four convolutional architectures. Because the controller only needs a layer-index range, the framework is architecture-agnostic, and extending it to less conventional models such as vision transformers, sequence models, and graph networks, as well as to other domains and modalities such as medical imaging, time-series IoMT data, and language tasks, is a natural next step. Other directions follow from the discussion above: validation on a real edge testbed with measured energy and latency, explicit privacy and communication terms in the reward, and adaptive per-tier relaxation of the split depth bound $\ell_{\min}$.