# EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints
Source: [https://arxiv.org/html/2605.08636](https://arxiv.org/html/2605.08636)
†Jiaxiang Geng1, †Yiyi Lu1, †Lunyu Zhao1, Yan Gao2,3, Nicholas D. Lane2,3, Bing Luo1
1Duke Kunshan University, 2Flower Labs, 3University of Cambridge
###### Abstract
Federated fine-tuning offers a promising paradigm for adapting large language models (LLMs) on edge devices by leveraging the rich, diverse, and continuously generated data from smartphones and IoT devices without compromising user data privacy. Such edge-side adaptation can improve model personalization, robustness, and responsiveness to local contexts. However, the practical feasibility of federated LLM fine-tuning on real edge devices remains unclear, as most existing work focuses on cross-silo or simulation-based settings, overlooking the resource and runtime constraints that determine whether a method is deployable on real edge systems. We present EdgeFlowerTune, a deployment-oriented benchmark for federated LLM fine-tuning under realistic edge-system constraints. EdgeFlowerTune jointly evaluates model quality and system costs, including communication, wall-clock latency, memory usage, energy consumption, and robustness to dynamic edge conditions. To compare methods in terms of effectiveness, efficiency, and robustness, EdgeFlowerTune introduces three complementary protocols: Quality-under-Budget, Cost-to-Target, and Robustness. We instantiate EdgeFlowerTune as a real-device platform built on Flower and MobileFineTuner, spanning commercial Android smartphones and NVIDIA edge development boards. Our benchmark results show that accuracy-only evaluation can lead to misleading conclusions: methods with similar final quality may differ substantially in deployability once realistic system constraints are considered. EdgeFlowerTune provides a reproducible benchmark for system-aware evaluation of federated LLM fine-tuning at the edge.
† These authors contributed equally to this work.

## 1 Introduction
In recent years, large language models (LLMs) have achieved remarkable performance across a wide range of language tasks and have become the dominant foundation for downstream adaptation and deployment [[19](https://arxiv.org/html/2605.08636#bib.bib1)]. The scaling of these models faces a critical bottleneck: high-quality public datasets are projected to be exhausted between 2026 and 2032 [[24](https://arxiv.org/html/2605.08636#bib.bib29)]. Fortunately, vast amounts of valuable data are continuously generated on edge devices, such as smartphones and Internet-of-Things (IoT) devices [[1](https://arxiv.org/html/2605.08636#bib.bib30)]. However, directly collecting data from these edge devices for centralized training may expose sensitive personal information, leading to privacy risks and potentially violating data protection regulations such as the General Data Protection Regulation (GDPR) [[8](https://arxiv.org/html/2605.08636#bib.bib31)].

Federated learning (FL) offers a natural solution by enabling multiple clients to collaboratively train or adapt a shared model while keeping training data local [[13](https://arxiv.org/html/2605.08636#bib.bib4), [18](https://arxiv.org/html/2605.08636#bib.bib3)]. This paradigm preserves edge data on local devices without sharing raw data with the server, while still allowing such data to contribute to model training. Building on this idea, recent works have begun to extend FL to LLM fine-tuning on edge devices [[5](https://arxiv.org/html/2605.08636#bib.bib47), [27](https://arxiv.org/html/2605.08636#bib.bib26)], commonly referred to as federated LLM fine-tuning. However, compared with traditional federated learning, federated LLM fine-tuning imposes substantially greater burdens on edge devices due to the much larger model size, leading to significantly higher demands on computation, communication, memory, and energy. These system costs have become a major bottleneck to the development of federated LLM fine-tuning on edge devices and constitute a central challenge faced by existing research efforts [[9](https://arxiv.org/html/2605.08636#bib.bib32)].

To facilitate the comparison of different federated LLM fine-tuning methods, OpenFedLLM and FederatedScope-LLM provide end-to-end frameworks and benchmarks for federated LLM fine-tuning [[30](https://arxiv.org/html/2605.08636#bib.bib8), [14](https://arxiv.org/html/2605.08636#bib.bib7)], while FlowerTune further expands the benchmarking landscape with a cross-domain leaderboard spanning general NLP, finance, medical, and code tasks [[10](https://arxiv.org/html/2605.08636#bib.bib10)]. However, these existing frameworks and benchmarks mainly focus on cross-silo settings and emphasize final task quality, such as accuracy, while largely overlooking the system constraints imposed by real edge environments. Considering federated LLM fine-tuning in edge environments raises several new challenges: (1) a method may achieve strong final performance but still be impractical for edge deployment because its consumption of computation, communication, memory, or energy exceeds what edge devices can sustain; (2) when comparing two methods, it remains unclear how to trade off higher final accuracy against better system efficiency; (3) edge environments are inherently dynamic: client dropout, hardware heterogeneity, and straggler effects may substantially affect the performance of a method during training.
To overcome these challenges, we present EdgeFlowerTune, a benchmark for federated fine-tuning of LLMs under realistic edge-system constraints. Rather than focusing only on which method achieves the highest final task score, EdgeFlowerTune evaluates the effectiveness, efficiency, and robustness of different methods under realistic edge deployment conditions. Furthermore, to enable credible system-aware evaluation, we instantiate EdgeFlowerTune as a real-device federated LLM fine-tuning platform built on top of Flower and MobileFineTuner, spanning heterogeneous edge hardware including mainstream commercial smartphones and NVIDIA edge computing development boards.

Figure 1: Overview of EdgeFlowerTune. Candidate federated LLM fine-tuning methods are deployed and executed on a real-device edge platform and monitored online for both model quality and system costs. The collected metrics are then evaluated by the proposed benchmarking protocols.

Our contributions are summarized as follows:
- We present EdgeFlowerTune, a benchmark for federated fine-tuning of large language models under realistic edge-system constraints. Unlike existing federated LLM fine-tuning benchmarks that mainly emphasize final task quality, EdgeFlowerTune centers evaluation on deployability in terms of effectiveness, efficiency, and robustness. Our code is open-sourced at [https://github.com/Edge-Intelligence-Lab/EdgeFlowerTune](https://github.com/Edge-Intelligence-Lab/EdgeFlowerTune).
- We propose an end-to-end evaluation workflow and introduce three complementary deployment-oriented evaluation protocols, namely Quality-under-Budget, Cost-to-Target, and Robustness, to systematically evaluate federated LLM fine-tuning methods under explicit constraints on communication, latency, memory, energy, and system variability.
- We instantiate EdgeFlowerTune as a real-device federated LLM fine-tuning evaluation platform built on top of Flower and MobileFineTuner, spanning heterogeneous edge hardware including commercial smartphones and NVIDIA edge computing development boards, thereby enabling credible system-aware benchmarking beyond simulation-only or server-only evaluation.
- We conduct benchmark studies on representative federated LLM fine-tuning methods and show that system-aware evaluation can substantially change method conclusions: methods that appear favorable under accuracy-only evaluation may become suboptimal once realistic edge constraints and system perturbations are taken into account.
## 2 EdgeFlowerTune Benchmark Design

In this section, we present the benchmark design of EdgeFlowerTune. We first introduce the end-to-end evaluation workflow, which describes how a candidate federated LLM fine-tuning method is deployed, executed on real edge devices, monitored during training, and finally evaluated by the benchmark. We then define three deployment-oriented benchmarking protocols for comparing different methods under realistic edge-system constraints.

### 2.1 End-to-End Evaluation Workflow

Figure [1](https://arxiv.org/html/2605.08636#S1.F1) illustrates the end-to-end evaluation workflow of EdgeFlowerTune. It has three key characteristics compared to existing cross-silo benchmarks. First, each candidate method must be explicitly deployed into the EdgeFlowerTune benchmark environment, so that its client-side training logic, server-side coordination, communication pattern, and fine-tuning configuration can be executed in a unified evaluation stack. Second, fine-tuning is performed on a real heterogeneous edge system, where commercial mobile devices and edge computing boards participate as federated clients. Third, because system behavior evolves during training, EdgeFlowerTune records model-quality and system-cost metrics online throughout the fine-tuning process, instead of estimating them after training from offline measurements. The collected execution traces are then used by the benchmark protocols to compare different methods in terms of effectiveness, efficiency, and robustness. The stages of the workflow are listed as follows:
Stage 1: Method deployment. A candidate federated LLM fine-tuning method is first instantiated in the EdgeFlowerTune benchmark environment. The method defines its local training procedure, server-side aggregation rule, exchanged objects between clients and server, communication schedule, and fine-tuning configuration. To ensure fair comparison, all methods are deployed under the same benchmark specification, including the same task, data partition, base model, client pool, training budget, and system constraints.

Stage 2: Real-device federated fine-tuning. The deployed method is then executed on the real EdgeFlowerTune system. In each communication round, the server coordinates client selection, distributes the required model states or fine-tuning parameters, collects method-specific updates, and performs aggregation. Selected clients perform local LLM fine-tuning on their private data using the prescribed method-specific procedure. Unlike server-only or simulation-based evaluation, this stage exposes each method to practical edge-system conditions, including heterogeneous device capabilities, real communication links, and runtime resource limitations.

Stage 3: Online metric collection. During federated fine-tuning, EdgeFlowerTune continuously monitors both model-quality and system-cost metrics. Model-quality metrics include task-specific scores such as loss and accuracy. System-cost metrics include communication volume, client-side memory usage, energy consumption, and wall-clock time. These metrics are collected online during actual benchmark execution.
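To make the collected traces concrete, the sketch below shows one plausible shape for a per-round trace record consumed by the protocols in Sec. 2.2. The field names and Python representation are our own illustrative assumptions, not the platform's actual logging schema.

```python
from dataclasses import dataclass

@dataclass
class RoundRecord:
    """One entry in the execution trace collected during Stage 3.

    Field names are illustrative assumptions; the real EdgeFlowerTune
    schema is defined by the platform, not by this sketch.
    """
    round_idx: int       # communication round index
    loss: float          # evaluation loss after aggregation
    accuracy: float      # task accuracy (%) after aggregation
    wall_clock_h: float  # cumulative wall-clock time, hours
    comm_mb: float       # cumulative communication volume, MB
    energy_kj: float     # cumulative client energy consumption, kJ
    peak_mem_mb: float   # average client peak memory so far, MB

# A full benchmark run is simply the ordered list of per-round records.
Trace = list[RoundRecord]
```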
Stage 4: Protocol-based benchmarking. After execution, the collected quality and system traces are passed to the benchmark evaluation module. EdgeFlowerTune compares methods using deployment-oriented protocols (illustrated in Sec. [2.2](https://arxiv.org/html/2605.08636#S2.SS2)) rather than only final task quality.

Figure 2: EdgeFlowerTune benchmarking protocols. Protocol A evaluates the best model quality achievable under fixed system budgets; Protocol B measures the system cost required to reach a target quality; and Protocol C quantifies robustness under edge-system perturbations.

### 2.2 EdgeFlowerTune Benchmarking Protocols

Figure [2](https://arxiv.org/html/2605.08636#S2.F2) summarizes the three benchmarking protocols in EdgeFlowerTune. The goal of these protocols is to compare federated LLM fine-tuning methods from a deployment-oriented perspective. In realistic edge environments, a method is not necessarily better simply because it achieves a higher final accuracy or lower perplexity. It may require excessive communication, exceed the memory capacity of mobile devices, consume too much energy, or become unstable under network and device variability. Therefore, EdgeFlowerTune evaluates each method along two coupled dimensions: model quality and system cost. Model quality is measured using task-specific metrics, such as loss and accuracy. System cost is measured using real execution metrics, including communication volume, wall-clock time, client-side memory usage, power consumption, and energy consumption.

Based on these measurements, EdgeFlowerTune defines three complementary protocols. Each protocol corresponds to a different deployment question and produces a different notion of the “best” method.
Protocol A: Quality-under-Budget. This protocol evaluates method effectiveness under fixed system resource budgets. Given the same upper bounds on communication volume, wall-clock time, memory usage, and energy consumption, each candidate method is executed within the allowed budget, and EdgeFlowerTune measures the best model quality it can achieve. This protocol answers the question: under the same system resource upper bound, which method achieves the highest final quality? The best method under this protocol is the one that delivers the highest task quality without violating the specified system budget.
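A minimal sketch of this selection rule, over the hypothetical `RoundRecord` trace from Sec. 2.1, could look as follows; the budget handling (stop scoring once any cumulative budget is exceeded) is our reading of the protocol, not released benchmark code.

```python
def quality_under_budget(trace, *, max_time_h=float("inf"),
                         max_comm_mb=float("inf"),
                         max_energy_kj=float("inf"),
                         max_mem_mb=float("inf")):
    """Protocol A sketch: best accuracy achieved within the budgets.

    Returns None if the method violates a budget before any evaluable
    checkpoint exists (e.g., an out-of-memory failure in round 1).
    """
    best = None
    for r in trace:
        if (r.wall_clock_h > max_time_h or r.comm_mb > max_comm_mb
                or r.energy_kj > max_energy_kj or r.peak_mem_mb > max_mem_mb):
            break  # budget exhausted: later rounds no longer count
        best = r.accuracy if best is None else max(best, r.accuracy)
    return best
```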
Protocol B: Cost-to-Target. This protocol evaluates method efficiency for reaching a required quality level. Instead of fixing the resource budget, EdgeFlowerTune specifies a target quality threshold, denoted as q∗, and measures the system cost required for each method to reach this target. The cost can be reported along multiple dimensions, including total communication volume, training latency, memory footprint, and energy consumption. This protocol answers the question: to reach a target quality, which method costs the least? The best method under this protocol is the one that reaches q∗ with the minimum system cost.
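Analogously, Cost-to-Target can be read as a first-crossing query over the same trace. The sketch below mirrors how Table 3 later reports costs; the dictionary layout is an assumption for illustration.

```python
def cost_to_target(trace, q_star):
    """Protocol B sketch: system cost when accuracy first reaches q_star.

    Time, communication, and energy are the cumulative values at the
    first crossing; peak memory follows the paper's definition (average
    client peak memory up to that point). Returns None if q_star is
    never reached within the recorded trace.
    """
    for r in trace:
        if r.accuracy >= q_star:
            return {"wall_clock_h": r.wall_clock_h, "comm_mb": r.comm_mb,
                    "energy_kj": r.energy_kj, "peak_mem_mb": r.peak_mem_mb}
    return None
```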
Protocol C: Robustness. This protocol evaluates method stability under realistic edge-system perturbations. EdgeFlowerTune compares each method under nominal conditions and perturbed conditions, including bandwidth variation, client dropout, and device heterogeneity. For each perturbation, the benchmark measures the degradation in both model quality and system efficiency. This protocol answers the question: which method remains stable under realistic edge-system shifts? The best method under this protocol is the one with the smallest relative degradation from nominal execution to perturbed execution.
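Robustness can then be scored by differencing a nominal trace against a perturbed one. The sketch below reports signed changes per metric, matching the parenthesized deltas in Tables 4–6 (ranking uses their absolute values); treating the last record as the final result is our simplifying assumption.

```python
def robustness_degradation(nominal, perturbed):
    """Protocol C sketch: per-metric change from nominal to perturbed run."""
    def summary(trace):
        r = trace[-1]  # simplification: final record = final result
        return {"accuracy": r.accuracy, "wall_clock_h": r.wall_clock_h,
                "comm_mb": r.comm_mb, "energy_kj": r.energy_kj,
                "peak_mem_mb": r.peak_mem_mb}
    n, p = summary(nominal), summary(perturbed)
    return {k: p[k] - n[k] for k in n}  # rank methods by abs(change)
```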
## 3 Experimental Settings

### 3.1 EdgeFlowerTune Platform

To support system-aware benchmarking, EdgeFlowerTune is built as a real-device federated LLM fine-tuning platform rather than a simulation-only environment. As shown in Figure [3](https://arxiv.org/html/2605.08636#S3.F3), the platform consists of a GPU server, a cross-platform communication layer, and heterogeneous edge clients. The server is a Dell PowerEdge T640 equipped with two NVIDIA A800 GPUs. It runs Flower [[10](https://arxiv.org/html/2605.08636#bib.bib10)] for federated orchestration and uses PyTorch and Transformers [[28](https://arxiv.org/html/2605.08636#bib.bib33)] for model execution and aggregation.

The client pool spans both Android smartphones and Linux edge computing boards, as summarized in Table [1](https://arxiv.org/html/2605.08636#S3.T1). (We exclude iPhones from our current testbed because iOS aggressively terminates memory-intensive background processes, making long-running on-device LLM fine-tuning unstable.) Android clients perform local LLM fine-tuning with MobileFineTuner [[11](https://arxiv.org/html/2605.08636#bib.bib34)], a C++ framework for end-to-end LLM adaptation on commercial mobile devices. NVIDIA Jetson Orin Nano clients run local training with PyTorch and Transformers.
Client-server communication is implemented using the Flower C++ SDK over a Wi-Fi 6 network. During benchmark execution, the platform records system behavior online. Specifically, adb is used to collect system metrics from Android smartphones, the Linux top command is used for Jetson devices, and Wireshark is used to capture communication traffic.
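As one illustration of this monitoring path, the sketch below polls an Android client's memory footprint through adb. The package name and sampling interval are hypothetical, and the `dumpsys meminfo` output format varies across Android versions; this is a simplified stand-in for the platform's actual collectors.

```python
import subprocess
import time

def poll_android_memory(serial: str, package: str, interval_s: float = 5.0):
    """Sample an app's memory via `adb shell dumpsys meminfo <package>`.

    `package` is a hypothetical client app id; the summary-line match
    below depends on the Android version's dumpsys output format.
    """
    while True:
        out = subprocess.run(
            ["adb", "-s", serial, "shell", "dumpsys", "meminfo", package],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in out.splitlines():
            if "TOTAL" in line:  # memory summary row of dumpsys meminfo
                print(f"{time.time():.0f} {line.strip()}")
                break
        time.sleep(interval_s)
```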
Figure 3: EdgeFlowerTune platform. The platform consists of one GPU server and several real edge devices, including Android smartphones and NVIDIA boards.

Table 1: Heterogeneous client devices in the EdgeFlowerTune platform. Devices are ordered from faster to slower execution speed in our testbed.

| Device Model | Processor / CPU | Memory |
| --- | --- | --- |
| Jetson Orin Nano | NVIDIA Orin, 6-core Arm Cortex-A78AE, up to 1.5 GHz | 8 GB |
| iQOO 15 | Snapdragon 8 Elite Gen 5, 2×4.6 GHz + 6×3.62 GHz | 16 GB + 16 GB virtual |
| HUAWEI P50 Pro | Kirin 9000, 1×A77 3.13 GHz + 3×A77 2.54 GHz + 4×A55 2.05 GHz | 8 GB |
| HUAWEI Mate 20 | Kirin 980, 2×A76 2.6 GHz + 2×A76 1.92 GHz + 4×A55 1.8 GHz | 6 GB |
| HUAWEI nova 9 Pro | Snapdragon 778G, Kryo 670 CPU, up to 2.4 GHz | 8 GB |
### 3.2 Evaluation Tasks

#### 3.2.1 Dataset Selection

EdgeFlowerTune evaluates federated LLM fine-tuning on representative edge-oriented language tasks. In real edge applications, LLMs often operate over local and context-sensitive information, such as user messages, notifications, schedules, application content, device states, sensor summaries, and event logs. These scenarios require not only open-ended generation, but also lightweight verification, contextual selection, and reasoning over local information. Accordingly, we organize the evaluation datasets into three categories: Verify, Choose, and Reason.

Verify evaluates whether a model can determine if a fact, condition, or semantic relation holds. This category reflects edge scenarios such as checking whether a notification is urgent, whether a sensor event satisfies an alert condition, or whether a context supports a candidate conclusion. We instantiate this category with BoolQ [[6](https://arxiv.org/html/2605.08636#bib.bib35)] and QNLI [[25](https://arxiv.org/html/2605.08636#bib.bib36), [20](https://arxiv.org/html/2605.08636#bib.bib37)], which evaluate boolean question answering and question-answer entailment, respectively.

Choose evaluates whether a model can select the most appropriate option from multiple candidates. This category reflects scenarios where an edge assistant or local controller needs to choose among candidate replies, recommendations, actions, or explanations based on contextual information. We instantiate this category with PIQA [[3](https://arxiv.org/html/2605.08636#bib.bib38)], HellaSwag [[31](https://arxiv.org/html/2605.08636#bib.bib39)], and SocialIQA [[22](https://arxiv.org/html/2605.08636#bib.bib40)], covering physical commonsense, event continuation, and social commonsense.

Reason evaluates whether a model can infer the correct answer from contextual clues and commonsense knowledge. This category reflects scenarios where a user asks about local content, or a device explains its current state or reasons over event descriptions. We instantiate this category with ARC-E [[7](https://arxiv.org/html/2605.08636#bib.bib41)] and WinoGrande [[21](https://arxiv.org/html/2605.08636#bib.bib42)], which evaluate elementary science question answering and commonsense coreference-style reasoning.

#### 3.2.2 Model Selection

Large-scale LLMs with multi-billion parameters are infeasible to run or fine-tune on typical edge devices due to their high computational and memory requirements. Edge tasks, while diverse, generally involve lower complexity than centralized server-side workloads, making extremely large models unnecessary for practical deployment. To reflect realistic constraints while covering a practical range of model capacities, we select three lightweight LLMs representative of current edge-oriented large models: Gemma 3-270M, Gemma 3-1B [[23](https://arxiv.org/html/2605.08636#bib.bib43)], and Qwen2.5-0.5B [[29](https://arxiv.org/html/2605.08636#bib.bib44)]. These models are specifically designed for resource-constrained devices, balancing parameter scale with inference and fine-tuning efficiency, enabling evaluation of federated fine-tuning in realistic edge intelligence scenarios.

### 3.3 Method Selection

To cover the main algorithmic patterns in federated LLM fine-tuning, we select four representative method families: FedAvg + LoRA, FedProx + LoRA, HeteroLoRA, and SplitLoRA.
FedAvg + LoRA. This is the standard adapter-only federated fine-tuning baseline. The pretrained LLM is frozen, each client updates only LoRA parameters [[12](https://arxiv.org/html/2605.08636#bib.bib5)], and the server aggregates client LoRA updates using FedAvg [[18](https://arxiv.org/html/2605.08636#bib.bib3)]. This abstraction covers the basic FL-based instruction tuning setting explored by FedIT [[32](https://arxiv.org/html/2605.08636#bib.bib23)] and the FedAvg baseline in OpenFedLLM [[30](https://arxiv.org/html/2605.08636#bib.bib8)].
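A minimal sketch of the server-side step, assuming clients return LoRA-only state dicts and are weighted by local dataset size (the standard FedAvg weighting; the benchmark's exact weighting is not spelled out here):

```python
import torch

def fedavg_lora(client_states: list[dict], client_sizes: list[int]) -> dict:
    """Weighted average of LoRA-only state dicts (FedAvg sketch).

    The frozen backbone never leaves the server; only adapter tensors
    (e.g., '...lora_A.weight' / '...lora_B.weight') are aggregated.
    """
    total = float(sum(client_sizes))
    return {
        name: sum((n / total) * sd[name]
                  for sd, n in zip(client_states, client_sizes))
        for name in client_states[0]
    }
```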
FedProx + LoRA. This method extends FedAvg + LoRA by adding a proximal regularizer to the local LoRA objective. It follows the classical FedProx formulation [[15](https://arxiv.org/html/2605.08636#bib.bib24)], while applying the regularization only to trainable LoRA parameters. This category represents local-correction baselines for federated LLM fine-tuning, as also included in OpenFedLLM [[30](https://arxiv.org/html/2605.08636#bib.bib8)].
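The local objective then becomes the task loss plus a proximal penalty that pulls the trainable LoRA weights toward the last global adapter state. A sketch, with an illustrative `mu` (the paper does not state its value):

```python
def fedprox_lora_loss(task_loss, lora_params, global_lora_params, mu=0.01):
    """Local FedProx objective restricted to LoRA parameters (sketch).

    Adds (mu / 2) * ||w - w_global||^2 over trainable LoRA tensors only;
    mu = 0.01 is an illustrative default, not the benchmark's setting.
    """
    prox = sum(((w - g.detach()) ** 2).sum()
               for w, g in zip(lora_params, global_lora_params))
    return task_loss + 0.5 * mu * prox
```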
HeteroLoRA. This family targets client resource heterogeneity by allowing different clients to use different LoRA ranks or adapter configurations. Existing examples include FlexLoRA [[2](https://arxiv.org/html/2605.08636#bib.bib25)], FLoRA [[27](https://arxiv.org/html/2605.08636#bib.bib26)], and HLoRA [[17](https://arxiv.org/html/2605.08636#bib.bib27)]. In our benchmark, HeteroLoRA represents this class by assigning heterogeneous LoRA ranks to clients and aligning them to a common shape before aggregation.
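One plausible shape-alignment rule is zero-padding low-rank adapters up to the largest rank before averaging; FlexLoRA, FLoRA, and HLoRA each define their own reconciliation, so the sketch below is illustrative rather than the benchmark's exact rule. Zero padding is lossless for the adapter product, since the padded rows and columns contribute nothing to B·A.

```python
import torch

def pad_lora_to_rank(A: torch.Tensor, B: torch.Tensor, r_max: int):
    """Zero-pad a rank-r LoRA pair (A: r x d_in, B: d_out x r) to r_max.

    Illustrative alignment only; the padded pair satisfies
    B_pad @ A_pad == B @ A, so clients with different ranks can be
    averaged elementwise after padding.
    """
    r, d_in = A.shape
    d_out, _ = B.shape
    A_pad = torch.zeros(r_max, d_in, dtype=A.dtype, device=A.device)
    B_pad = torch.zeros(d_out, r_max, dtype=B.dtype, device=B.device)
    A_pad[:r] = A
    B_pad[:, :r] = B
    return A_pad, B_pad
```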
SplitLoRA. This family combines split learning with LoRA-based fine-tuning. Instead of training the entire model on each client, SplitLoRA [[16](https://arxiv.org/html/2605.08636#bib.bib28)] partitions the LLM into client-side and server-side submodels. Training exchanges activations and activation gradients between the two sides, while client-side LoRA adapters are periodically synchronized through federated aggregation.
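The core exchange can be emulated in a few lines: the client sends detached activations, the server finishes the forward and backward pass, and the activation gradient flows back to the client. This is a single-process sketch of the pattern, not SplitLoRA's networked implementation.

```python
def split_training_step(client_model, server_model, batch, labels, loss_fn):
    """One split-learning step (conceptual sketch of the SplitLoRA flow)."""
    acts = client_model(batch)                        # client-side forward
    acts_remote = acts.detach().requires_grad_(True)  # "sent" to the server
    loss = loss_fn(server_model(acts_remote), labels)  # server-side forward
    loss.backward()                  # server backward fills acts_remote.grad
    acts.backward(acts_remote.grad)  # activation grad "returned" to client
    return loss.item()
```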
### 3.4 Parameter Settings

We use the AdamW optimizer with a weight decay of 0.01 across all settings. The learning rate is set to 2×10⁻⁴. The sequence length is set to 64 for ARC-Easy, HellaSwag, PIQA, QNLI, SocialIQA, and WinoGrande, while BoolQ uses a longer sequence length of 128 due to its relatively longer input contexts. Following FlowerTune [[10](https://arxiv.org/html/2605.08636#bib.bib10)], we partition the training data of each task into approximately equal-size client shards. All evaluations are conducted in a zero-shot setting. For accuracy evaluation, we adopt letter-token classification accuracy for multiple-choice tasks, and the final accuracy is computed as the fraction of examples for which the predicted letter matches the ground-truth answer. This follows the common likelihood-based multiple-choice evaluation protocol used for autoregressive language models [[4](https://arxiv.org/html/2605.08636#bib.bib45), [26](https://arxiv.org/html/2605.08636#bib.bib46)].

For method-specific settings, FedAvg+LoRA and FedProx+LoRA use a LoRA rank of 8 for all clients. For HeteroLoRA, the LoRA rank is set according to device capability: Jetson and iQOO clients use rank 8, while the remaining clients use rank 4. For SplitLoRA, the LoRA rank is set to 8, and each client-side model contains only the first hidden layer of the backbone model.
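For reference, these settings map onto a standard PEFT configuration roughly as sketched below. The `target_modules` choice and `lora_alpha` value are our assumptions, since the paper does not specify which projection layers carry adapters or the scaling factor.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Values from Sec. 3.4: rank 8, AdamW, lr 2e-4, weight decay 0.01.
# target_modules and lora_alpha are illustrative assumptions.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.0,
                      target_modules=["q_proj", "v_proj"])
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
model = get_peft_model(base, lora_cfg)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
```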
## 4 Results and Analysis

We report representative case studies in the main paper and provide the complete benchmark results in Appendices [A](https://arxiv.org/html/2605.08636#A1)–[D](https://arxiv.org/html/2605.08636#A4). Across the full results covering 3 backbone models, 7 datasets, and 4 federated fine-tuning methods, we observe a consistent qualitative conclusion: accuracy-only rankings do not necessarily align with deployability-aware rankings once wall-clock time, communication, energy consumption, memory feasibility, and robustness are considered. In the main-paper results, for Protocols A and B, we select one representative task from each category: BoolQ for Verify, SocialIQA for Choose, and ARC-Easy for Reason. For Protocol C, we additionally report HellaSwag for communication fluctuation and client dropout, and ARC-Easy for device heterogeneity. All detailed results in the main paper use Qwen2.5-0.5B as the backbone model.
Table 2: Testing quality across selected tasks and methods. Testing loss: lower is better; testing accuracy: higher is better. Centroid (centralized) results give the loss lower bound and accuracy upper bound; Local Only results give the loss upper bound and accuracy lower bound.

| Task Type | Task | Method | Pretrained Loss | Best Fine-tuned Loss | Rank | Pretrained Acc. | Best Fine-tuned Acc. | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Verify | BoolQ | Centroid | 0.6393 | 0.4788 | Lower Bound | 63.21% | 80.24% | Upper Bound |
| | | FedAvg+LoRA | | 0.5051 | 2 | | 77.68% | 2 |
| | | FedProx+LoRA | | 0.4939 | 1 | | 78.20% | 1 |
| | | HeteroLoRA | | 0.5059 | 4 | | 76.73% | 4 |
| | | SplitLoRA | | 0.5161 | 3 | | 76.79% | 3 |
| | | Local Only | | 0.6127 | Upper Bound | | 71.99% | Lower Bound |
| Choose | SocialIQA | Centroid | 0.9644 | 0.7769 | Lower Bound | 55.99% | 68.07% | Upper Bound |
| | | FedAvg+LoRA | | 0.7993 | 2 | | 66.63% | 2 |
| | | FedProx+LoRA | | 0.7863 | 1 | | 67.35% | 1 |
| | | HeteroLoRA | | 0.8250 | 4 | | 66.07% | 4 |
| | | SplitLoRA | | 0.8044 | 3 | | 66.38% | 3 |
| | | Local Only | | 0.9981 | Upper Bound | | 59.31% | Lower Bound |
| Reason | ARC-E | Centroid | 0.7180 | 0.5913 | Lower Bound | 71.05% | 79.47% | Upper Bound |
| | | FedAvg+LoRA | | 0.6054 | 2 | | 78.60% | 2 |
| | | FedProx+LoRA | | 0.6013 | 1 | | 78.77% | 1 |
| | | HeteroLoRA | | 0.6274 | 4 | | 76.67% | 4 |
| | | SplitLoRA | | 0.6102 | 3 | | 77.02% | 3 |
| | | Local Only | | 0.6693 | Upper Bound | | 74.53% | Lower Bound |
### 4.1 Results of Protocol A: Quality-under-Budget

Under Protocol A, we evaluate the quality that each method can achieve under the current edge-system budget. We consider a total of 100 clients, with 20 clients instantiated for each device type listed in Table [1](https://arxiv.org/html/2605.08636#S3.T1). In each communication round, 10 clients are randomly selected to participate in federated fine-tuning. The experimental results are reported in Table [7](https://arxiv.org/html/2605.08636#A1.T7).

Across the three representative tasks, FedAvg+LoRA achieves the strongest or tied strongest accuracy among the federated baselines, while HeteroLoRA generally obtains the lowest accuracy. However, final quality alone does not fully characterize edge deployability. Although SplitLoRA ranks third in quality in Table [7](https://arxiv.org/html/2605.08636#A1.T7), it provides substantially better executability under larger model settings. Specifically, for Qwen2.5-0.5B and Gemma-3-270M, all selected methods can be executed on the evaluated devices. In contrast, when scaling to Gemma-3-1B, only SplitLoRA can be successfully executed, while the other methods encounter out-of-memory failures. The complete results for the larger model setting are provided in the Appendix. This observation shows that quality-under-budget evaluation should consider not only the final accuracy achieved by a method, but also whether the method can actually run within the memory and system constraints of edge devices.

Insight 1. A method may achieve strong final task performance under feasible settings, but still be impractical for edge deployment if its system cost exceeds what edge devices can sustain.

### 4.2 Results of Protocol B: Cost-to-Target
For Protocol B, we use the same client configuration as in Sec. [4.1](https://arxiv.org/html/2605.08636#S4.SS1). For each task, we set three target accuracy levels to represent different stages of training progress. These targets are derived from the improvement interval between the pretrained model and the centroid upper bound. Specifically, the three targets correspond to reaching 50%, 70%, and 90% of the accuracy improvement from the pretrained baseline to the centroid result.
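Concretely, a target at fraction f of the interval is pretrained + f · (centroid − pretrained). The sketch below reproduces the BoolQ targets from Table 2's endpoints; the rounding of the raw values to the integer percentages in Table 3 is our inference.

```python
def targets_from_interval(pretrained_acc, centroid_acc,
                          fractions=(0.5, 0.7, 0.9)):
    """Targets at 50/70/90% of the pretrained-to-centroid improvement."""
    return [pretrained_acc + f * (centroid_acc - pretrained_acc)
            for f in fractions]

# BoolQ endpoints from Table 2: 63.21% -> 80.24% gives approximately
# [71.7, 75.1, 78.5], consistent with the 71% / 75% / 78% targets in Table 3.
print(targets_from_interval(63.21, 80.24))
```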
Table [3](https://arxiv.org/html/2605.08636#S4.T3) reports four system metrics when each method first reaches the corresponding target accuracy. The wall-clock time, communication volume, and energy consumption are recorded at the first point where the target accuracy is achieved. Peak memory is computed as the average peak memory across all participating clients up to that point. (We include peak memory because whether a fine-tuning method can be executed on edge devices directly depends on whether its memory footprint exceeds the device memory limit.) Since all metrics in this table represent system cost, lower values are better, and a smaller rank indicates a more efficient method.

The results show that although FedAvg+LoRA and FedProx+LoRA achieve competitive final accuracy, they often require much longer wall-clock time, higher energy consumption, and larger peak memory to reach the same target accuracy. The ranking also varies across metrics and target accuracy levels; for example, methods with low communication cost at an early target may not remain the most efficient as the target accuracy increases.

Insight 2. Methods with higher final accuracy do not necessarily provide better system efficiency. Their relative efficiency can also change across target accuracy levels and system metrics.
Table 3: System cost to reach target accuracy across selected tasks and methods. All metrics are system costs, so lower values are better and a smaller rank indicates a more efficient method. BoolQ, SocialIQA, and ARC-E are the Verify, Choose, and Reason tasks, respectively.

| Task (Target Acc.) | Method | Wall-clock time (h) | Rank | Comm. volume (MB) | Rank | Energy (kJ) | Rank | Peak memory (MB) | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BoolQ (71%) | FedAvg+LoRA | 6.65 | 3 | 6624.38 | 1 | 300.67 | 3 | 3453.60 | 3 |
| | FedProx+LoRA | 6.67 | 4 | 6624.38 | 1 | 301.19 | 4 | 3470.35 | 4 |
| | HeteroLoRA | 3.07 | 2 | 9326.95 | 2 | 138.66 | 2 | 2499.21 | 2 |
| | SplitLoRA | 0.43 | 1 | 11744.04 | 3 | 19.42 | 1 | 1142.87 | 1 |
| BoolQ (75%) | FedAvg+LoRA | 12.48 | 3 | 12420.70 | 1 | 563.76 | 3 | 3497.81 | 3 |
| | FedProx+LoRA | 12.50 | 4 | 12420.70 | 1 | 564.83 | 4 | 3514.77 | 4 |
| | HeteroLoRA | 5.32 | 2 | 16166.72 | 3 | 240.57 | 2 | 2531.20 | 2 |
| | SplitLoRA | 0.47 | 1 | 12811.68 | 2 | 21.18 | 1 | 1157.50 | 1 |
| BoolQ (78%) | FedAvg+LoRA | 32.46 | 3 | 32293.83 | 2 | 1466.74 | 3 | 3520.44 | 3 |
| | FedProx+LoRA | 32.50 | 4 | 32293.83 | 2 | 1468.52 | 4 | 3537.51 | 4 |
| | HeteroLoRA | 13.33 | 2 | 40416.80 | 3 | 602.20 | 2 | 2547.57 | 2 |
| | SplitLoRA | 0.99 | 1 | 27224.82 | 1 | 44.94 | 1 | 1164.99 | 1 |
| SocialIQA (62%) | FedAvg+LoRA | 15.79 | 4 | 15732.89 | 2 | 772.17 | 4 | 3418.89 | 3 |
| | FedProx+LoRA | 15.71 | 3 | 15732.89 | 2 | 768.41 | 3 | 3462.20 | 4 |
| | HeteroLoRA | 5.32 | 2 | 16166.72 | 3 | 259.95 | 2 | 2562.68 | 2 |
| | SplitLoRA | 0.44 | 1 | 10142.58 | 1 | 21.54 | 1 | 1147.95 | 1 |
| SocialIQA (64%) | FedAvg+LoRA | 34.91 | 4 | 34777.97 | 3 | 1707.33 | 4 | 3455.95 | 3 |
| | FedProx+LoRA | 34.72 | 3 | 34777.97 | 3 | 1697.93 | 3 | 3499.73 | 4 |
| | HeteroLoRA | 9.98 | 2 | 30468.05 | 2 | 488.07 | 2 | 2590.46 | 2 |
| | SplitLoRA | 0.81 | 1 | 18683.70 | 1 | 39.61 | 1 | 1160.39 | 1 |
| SocialIQA (66%) | FedAvg+LoRA | 94.94 | 4 | 94397.34 | 3 | 4643.29 | 4 | 3476.19 | 3 |
| | FedProx+LoRA | 94.25 | 3 | 94397.34 | 3 | 4609.67 | 3 | 3520.23 | 4 |
| | HeteroLoRA | 29.94 | 2 | 91404.14 | 2 | 1464.33 | 2 | 2605.63 | 2 |
| | SplitLoRA | 1.34 | 1 | 30961.56 | 1 | 65.53 | 1 | 1167.19 | 1 |
| ARC-E (75%) | FedAvg+LoRA | 24.29 | 4 | 24013.36 | 2 | 1176.40 | 4 | 3404.12 | 3 |
| | FedProx+LoRA | 24.13 | 3 | 24013.36 | 2 | 1168.56 | 3 | 3441.75 | 4 |
| | HeteroLoRA | 4.00 | 2 | 24250.08 | 3 | 193.52 | 2 | 2540.26 | 2 |
| | SplitLoRA | 0.48 | 1 | 10676.40 | 1 | 23.09 | 1 | 1141.11 | 1 |
| ARC-E (76%) | FedAvg+LoRA | 32.64 | 3 | 32293.83 | 2 | 1580.47 | 3 | 3445.54 | 3 |
| | FedProx+LoRA | 33.31 | 4 | 33121.88 | 3 | 1612.96 | 4 | 3483.62 | 4 |
| | HeteroLoRA | 7.89 | 2 | 47878.36 | 4 | 382.26 | 2 | 2571.17 | 2 |
| | SplitLoRA | 0.95 | 1 | 21352.80 | 1 | 46.20 | 1 | 1154.99 | 1 |
| ARC-E (77%) | FedAvg+LoRA | 66.91 | 4 | 66243.75 | 2 | 3240.10 | 4 | 3458.55 | 3 |
| | FedProx+LoRA | 66.58 | 3 | 66243.75 | 2 | 3223.88 | 3 | 3496.77 | 4 |
| | HeteroLoRA | 11.17 | 2 | 67775.86 | 3 | 540.93 | 2 | 2580.88 | 2 |
| | SplitLoRA | 1.53 | 1 | 34164.48 | 1 | 73.86 | 1 | 1159.35 | 1 |
Table 4: Robustness under communication fluctuation. Values in parentheses denote changes relative to the no-fluctuation setting. Ranks are computed by the absolute value of the change; a smaller change indicates better robustness. (HellaSwag @ Qwen2.5-0.5B)

| Method | Testing Accuracy | Rank | Wall-clock time (h) | Rank | Comm. volume (MB) | Rank | Energy (kJ) | Rank | Peak memory (MB) | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FedAvg+LoRA | 33.68% (+0.00%) | 1 | 1.08 (+0.02) | 3 | 12006.68 (+0.00) | 1 | 38.61 (+1.78) | 1 | 3557.55 (+0.00) | 1 |
| FedProx+LoRA | 33.68% (+0.00%) | 1 | 1.15 (+0.02) | 2 | 12006.68 (+0.00) | 1 | 44.39 (+2.08) | 2 | 3548.12 (+0.00) | 1 |
| HeteroLoRA | 33.12% (+0.00%) | 1 | 5.51 (+0.00) | 1 | 11751.96 (+0.00) | 1 | 233.79 (+8.47) | 3 | 3552.89 (+0.00) | 1 |
| SplitLoRA | 33.67% (+0.00%) | 1 | 12.08 (+4.46) | 4 | 10676.40 (+0.00) | 1 | 515.81 (+203.62) | 4 | 3548.27 (+0.00) | 1 |
### 4.3 Results of Protocol C: Robustness

Communication fluctuation: Table [4](https://arxiv.org/html/2605.08636#S4.T4) reports the results of Protocol C under dynamic communication fluctuation. To simulate unstable wireless connectivity in real edge deployments, we dynamically change the Wi-Fi bandwidth every 1/3 hour, sequentially setting it to the full bandwidth, 1/2 of the bandwidth, and 1/4 of the bandwidth. The results show that communication fluctuation can affect system cost even when final model quality remains similar, and the magnitude of this impact varies across methods.
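A schedule of this shape can be driven by a small loop such as the one below; `set_fraction` is a hypothetical callable standing in for whatever traffic-shaping mechanism caps the Wi-Fi link, which the paper does not specify.

```python
import time

def fluctuate_bandwidth(set_fraction, period_s=1200, cycles=3):
    """Cycle the link through full, 1/2, and 1/4 bandwidth every 1/3 hour.

    `set_fraction` is a hypothetical hook that applies the cap (e.g., a
    shaper on the access point); 1200 s is one third of an hour.
    """
    for _ in range(cycles):
        for frac in (1.0, 0.5, 0.25):
            set_fraction(frac)
            time.sleep(period_s)
```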
Drop-out: Table [5](https://arxiv.org/html/2605.08636#S4.T5) reports the results of Protocol C under different client drop-out ratios. We vary the drop-out ratio from 10% to 50% to simulate intermittent client availability in real edge deployments. The results show that client drop-out can affect both model quality and system cost, and the degree of degradation differs across methods.

Heterogeneity: Table [6](https://arxiv.org/html/2605.08636#S4.T6) reports the results of Protocol C under different client heterogeneity settings. We keep the total number of clients fixed at 100 and vary the device composition to simulate different deployment scenarios. For each metric, the value outside the parentheses is the measured result under the corresponding client mix, while the value inside the parentheses denotes the change relative to the balanced setting, i.e., 20J+20I+20P+20M+20N. The results show that client heterogeneity affects model quality and system efficiency in a method-dependent manner, indicating that different methods exhibit different levels of robustness to shifts in device composition.

Insight 3. Edge-system perturbations affect different methods differently, leading to method-dependent robustness in both model quality and system efficiency.
Table 5: Robustness under different client dropout ratios. Values in parentheses denote changes relative to the no-dropout setting. Ranks are computed by the absolute value of the change; a smaller change indicates better robustness. (HellaSwag @ Qwen2.5-0.5B)

| Dropout Ratio | Method | Testing Accuracy | Rank | Wall-clock time (h) | Rank | Comm. volume (MB) | Rank | Energy (kJ) | Rank | Peak memory (MB) | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10% | FedAvg+LoRA | 33.70% (−0.14%) | 4 | 184.96 (+26.84) | 3 | 8250.00 (+412.50) | 3 | 8956.40 (+1299.76) | 3 | 4551.18 (−0.75) | 4 |
| | FedProx+LoRA | 33.72% (−0.12%) | 3 | 185.16 (+26.85) | 4 | 8250.00 (+412.50) | 3 | 8966.19 (+1300.06) | 4 | 4580.47 (−0.11) | 1 |
| | HeteroLoRA | 33.13% (−0.05%) | 1 | 182.40 (+22.36) | 2 | 8250.00 (+206.25) | 2 | 8832.25 (+1082.90) | 2 | 3927.04 (−0.57) | 2 |
| | SplitLoRA | 33.59% (−0.08%) | 2 | 5.22 (+0.52) | 1 | 8250.00 (+0.00) | 1 | 252.66 (+25.27) | 1 | 2023.79 (+0.58) | 3 |
| 30% | FedAvg+LoRA | 33.82% (−0.02%) | 1 | 237.81 (+79.69) | 3 | 8250.00 (+412.50) | 3 | 11515.38 (+3858.73) | 3 | 4551.10 (−0.83) | 1 |
| | FedProx+LoRA | 33.81% (−0.03%) | 2 | 238.07 (+79.75) | 4 | 8250.00 (+412.50) | 3 | 11527.95 (+3861.83) | 4 | 4582.08 (+1.50) | 3 |
| | HeteroLoRA | 33.09% (−0.09%) | 3 | 234.51 (+74.48) | 2 | 8250.00 (+206.25) | 2 | 11355.75 (+3606.40) | 2 | 3928.49 (+0.88) | 2 |
| | SplitLoRA | 33.57% (−0.10%) | 4 | 6.71 (+2.01) | 1 | 8250.00 (+0.00) | 1 | 324.85 (+97.45) | 1 | 2024.71 (+1.50) | 3 |
| 50% | FedAvg+LoRA | 33.46% (−0.38%) | 4 | 289.47 (+131.35) | 3 | 7177.50 (−660.00) | 2 | 14017.18 (+6360.54) | 3 | 4553.43 (+1.50) | 1 |
| | FedProx+LoRA | 33.73% (−0.11%) | 1 | 271.62 (+113.31) | 2 | 6723.75 (−1113.75) | 3 | 13152.87 (+5486.74) | 2 | 4582.08 (+1.50) | 1 |
| | HeteroLoRA | 33.06% (−0.12%) | 2 | 328.31 (+168.28) | 4 | 8250.00 (+206.25) | 1 | 15898.05 (+8148.70) | 4 | 3929.11 (+1.50) | 1 |
| | SplitLoRA | 33.34% (−0.33%) | 3 | 7.56 (+2.87) | 1 | 6641.25 (−1608.75) | 4 | 366.23 (+138.84) | 1 | 2024.71 (+1.50) | 1 |
Table 6: Robustness under different client heterogeneity settings. Values in parentheses denote changes relative to the balanced client mix. Ranks are computed by the absolute value of the change; a smaller change indicates better robustness. (ARC-E @ Qwen2.5-0.5B)

| Client Mix | Method | Testing Accuracy | Rank | Wall-clock time (h) | Rank | Comm. volume (MB) | Rank | Energy (kJ) | Rank | Peak memory (MB) | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100J | FedAvg+LoRA | 78.60% (+0.00%) | 1 | 19.85 (−136.51) | 4 | 6624.38 (+0.00) | 1 | 960.95 (−6609.90) | 4 | 4569.06 (+0.60) | 3 |
| | FedProx+LoRA | 78.77% (+0.00%) | 1 | 23.92 (−136.49) | 3 | 6624.38 (+0.00) | 1 | 1158.02 (−6608.81) | 3 | 4569.57 (+0.50) | 2 |
| | HeteroLoRA | 76.84% (+0.17%) | 2 | 31.57 (−78.07) | 2 | 3300.00 (−1608.75) | 2 | 1528.42 (−3780.15) | 2 | 3913.44 (+0.40) | 1 |
| | SplitLoRA | 77.02% (+0.00%) | 1 | 4.83 (+0.04) | 1 | 3416.45 (+0.00) | 1 | 233.71 (+1.78) | 1 | 2018.80 (+0.40) | 1 |
| 70J+20I+10P | FedAvg+LoRA | 78.60% (+0.00%) | 1 | 52.48 (−103.88) | 3 | 6624.38 (+0.00) | 1 | 2540.99 (−5029.87) | 3 | 4568.76 (+0.30) | 3 |
| | FedProx+LoRA | 78.77% (+0.00%) | 1 | 52.34 (−108.07) | 4 | 6624.38 (+0.00) | 1 | 2534.20 (−5232.63) | 4 | 4569.32 (+0.25) | 2 |
| | HeteroLoRA | 76.75% (+0.08%) | 2 | 51.46 (−58.17) | 2 | 6063.75 (+1155.00) | 2 | 2491.76 (−2816.80) | 2 | 3913.24 (+0.20) | 1 |
| | SplitLoRA | 77.02% (+0.00%) | 1 | 4.82 (+0.03) | 1 | 3416.45 (+0.00) | 1 | 233.19 (+1.25) | 1 | 2018.60 (+0.20) | 1 |
| 20J+20I+20P+20M+20N (Reference) | FedAvg+LoRA | 78.60% (+0.00%) | 1 | 156.36 (+0.00) | 1 | 6624.38 (+0.00) | 1 | 7570.86 (+0.00) | 1 | 4568.46 (+0.00) | 1 |
| | FedProx+LoRA | 78.77% (+0.00%) | 1 | 160.41 (+0.00) | 1 | 6624.38 (+0.00) | 1 | 7766.83 (+0.00) | 1 | 4569.07 (+0.00) | 1 |
| | HeteroLoRA | 76.67% (+0.00%) | 1 | 109.64 (+0.00) | 1 | 4908.75 (+0.00) | 1 | 5308.56 (+0.00) | 1 | 3913.04 (+0.00) | 1 |
| | SplitLoRA | 77.02% (+0.00%) | 1 | 4.79 (+0.00) | 1 | 3416.45 (+0.00) | 1 | 231.94 (+0.00) | 1 | 2018.40 (+0.00) | 1 |
| 10J+20I+20P+20M+30N | FedAvg+LoRA | 78.60% (+0.00%) | 1 | 166.01 (+9.65) | 4 | 6624.38 (+0.00) | 1 | 8038.00 (+467.14) | 4 | 4568.16 (−0.30) | 3 |
| | FedProx+LoRA | 78.77% (+0.00%) | 1 | 165.21 (+4.81) | 3 | 6624.38 (+0.00) | 1 | 7999.56 (+232.73) | 3 | 4568.82 (−0.25) | 2 |
| | HeteroLoRA | 76.40% (−0.27%) | 2 | 113.26 (+3.62) | 2 | 4743.75 (−165.00) | 2 | 5483.93 (+175.37) | 2 | 3912.84 (−0.20) | 1 |
| | SplitLoRA | 77.02% (+0.00%) | 1 | 4.83 (+0.04) | 1 | 3416.45 (+0.00) | 1 | 233.79 (+1.85) | 1 | 2018.20 (−0.20) | 1 |
| 10I+10P+30M+50N | FedAvg+LoRA | 78.60% (+0.00%) | 1 | 168.82 (+12.47) | 4 | 6624.38 (+0.00) | 1 | 8174.42 (+603.56) | 4 | 4567.86 (−0.60) | 3 |
| | FedProx+LoRA | 78.77% (+0.00%) | 1 | 167.20 (+6.79) | 3 | 6624.38 (+0.00) | 1 | 8095.70 (+328.87) | 3 | 4568.57 (−0.50) | 2 |
| | HeteroLoRA | 76.32% (−0.35%) | 2 | 114.62 (+4.98) | 2 | 4166.25 (−742.50) | 2 | 5549.86 (+241.30) | 2 | 3912.64 (−0.40) | 1 |
| | SplitLoRA | 77.02% (+0.00%) | 1 | 4.84 (+0.05) | 1 | 3416.45 (+0.00) | 1 | 234.30 (+2.36) | 1 | 2018.00 (−0.40) | 1 |
Note: J, I, P, M, and N denote Jetson, iQOO, Huawei P50, Mate 20, and Nova clients, respectively.
## 5 Limitations

The evaluated methods do not exhaustively represent all existing or emerging algorithms. Our current platform does not fully capture other edge deployments such as wearables and embedded IoT devices. Our task suite focuses on lightweight language understanding and reasoning, while open-ended generation, multimodal interaction, and long-context personalization remain future extensions. In addition, system metrics such as energy and runtime memory can be affected by background processes and measurement-tool granularity.

## 6 Conclusion

In this paper, we presented EdgeFlowerTune, a deployment-oriented benchmark for federated LLM fine-tuning under realistic edge-system constraints. EdgeFlowerTune jointly evaluates model quality, system efficiency, and robustness through three complementary protocols: Quality-under-Budget, Cost-to-Target, and Robustness. Our results on real edge devices show that benchmark conclusions can change substantially once system constraints are considered: methods that appear favorable under accuracy-only evaluation may become suboptimal once realistic edge constraints and system perturbations are taken into account. These findings highlight the need for system-aware and robustness-aware evaluation when developing federated LLM fine-tuning methods for edge deployment. We view EdgeFlowerTune as an extensible benchmark and plan to release an open leaderboard for future method submissions.
## References
- [1] (2021). Smart at what cost? Characterising mobile deep neural networks in the wild. In Proceedings of the 21st ACM Internet Measurement Conference (IMC '21), New York, NY, USA, pp. 658–672. https://doi.org/10.1145/3487552.3487863
- [2] J. Bai, D. Chen, B. Qian, L. Yao, and Y. Li (2024). Federated fine-tuning of large language models under heterogeneous tasks and client resources. In Advances in Neural Information Processing Systems.
- [3] Y. Bisk, R. Zellers, R. Le Bras, J. Gao, and Y. Choi (2020). PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 7432–7439.
- [4] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems, Vol. 33, pp. 1877–1901.
- [5] Y. J. Cho, L. Liu, Z. Xu, A. Fahrezi, and G. Joshi (2024). Heterogeneous LoRA for federated fine-tuning of on-device foundation models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, pp. 12903–12913.
- [6] C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova (2019). BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 2924–2936.
- [7] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018). Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
- [8] European Parliament and Council of the European Union (2016). Regulation (EU) 2016/679 of the European Parliament and of the Council. https://data.europa.eu/eli/reg/2016/679/oj
- [9] T. Fan, H. Gu, X. Cao, C. S. Chan, Q. Chen, Y. Chen, Y. Feng, Y. Gu, J. Geng, B. Luo, S. Liu, W. K. Ong, C. Ren, J. Shao, C. Sun, X. Tang, H. X. Tae, Y. Tong, S. Wei, F. Wu, W. Xi, M. Xu, H. Yang, X. Yang, J. Yan, H. Yu, H. Yu, T. Zhang, Y. Zhang, X. Zhang, Z. Zheng, L. Fan, and Q. Yang (2025). Ten challenging problems in federated foundation models. IEEE Transactions on Knowledge and Data Engineering 37(7), pp. 4314–4337.
- [10] Y. Gao, M. R. Scamarcia, J. Fernandez-Marques, M. Naseri, C. S. Ng, D. Stripelis, Z. Li, T. Shen, J. Bai, D. Chen, Z. Zhang, R. Hu, I. Song, K. Lee, H. Jia, T. Dang, J. Wang, Z. Liu, D. J. Beutel, L. Lyu, and N. D. Lane (2025). FlowerTune: A cross-domain benchmark for federated fine-tuning of large language models. In Advances in Neural Information Processing Systems, Datasets and Benchmarks Track.
- [11] J. Geng, L. Zhao, Y. Lu, and B. Luo (2025). MobileFineTuner: A unified end-to-end framework for fine-tuning LLMs on mobile phones. arXiv preprint arXiv:2512.08211.
- [12] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022). LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR).
- [13] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, G. Cormode, R. Cummings, R. D'Oliveira, et al. (2021). Advances and open problems in federated learning. Foundations and Trends in Machine Learning 14(1–2), pp. 1–210.
- [14] W. Kuang, B. Qian, Z. Li, D. Chen, D. Gao, X. Pan, Y. Xie, Y. Li, B. Ding, and J. Zhou (2024). FederatedScope-LLM: A comprehensive package for fine-tuning large language models in federated learning. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 5260–5271.
- [15] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith (2020). Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems, Vol. 2, pp. 429–450.
- [16] Z. Lin, X. Hu, Y. Zhang, Z. Chen, Z. Fang, X. Chen, A. Li, P. Vepakomma, and Y. Gao (2024). SplitLoRA: A split parameter-efficient fine-tuning framework for large language models. arXiv preprint arXiv:2407.00952.
- [17] Q. Liu, Z. Zhang, X. Yao, and B. Liu (2025). HLoRA: Efficient federated learning system for LLM heterogeneous fine-tuning. arXiv preprint arXiv:2503.00813.
- [18] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017). Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1273–1282.
- [19] OpenAI (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- [20] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016). SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2383–2392.
- [21] K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi (2020). WinoGrande: An adversarial Winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 8732–8740.
- [22] M. Sap, H. Rashkin, D. Chen, R. Le Bras, and Y. Choi (2019). Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, pp. 4463–4473.
- [23] Gemma Team (2025). Gemma 3 technical report. arXiv preprint arXiv:2503.19786.
- [24] P. Villalobos, A. Ho, J. Sevilla, T. Besiroglu, L. Heim, and M. Hobbhahn (2024). Position: Will we run out of data? Limits of LLM scaling based on human-generated data. In Proceedings of the 41st International Conference on Machine Learning (ICML '24).
- [25] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, pp. 353–355.
- [26] X. Wang, B. Ma, C. Hu, L. Weber-Genzel, P. Röttger, F. Kreuter, D. Hovy, and B. Plank (2024). "My answer is C": First-token probabilities do not match text answers in instruction-tuned language models. In Findings of the Association for Computational Linguistics: ACL 2024.
- \[27\]Z\. Wang, Z\. Shen, Y\. He, G\. Sun, H\. Wang, L\. Lyu, and A\. Li\(2024\)FLoRA: federated fine\-tuning large language models with heterogeneous low\-rank adaptations\.InAdvances in Neural Information Processing Systems,Cited by:[§A\.4](https://arxiv.org/html/2605.08636#A1.SS4.p2.1),[§1](https://arxiv.org/html/2605.08636#S1.p2.1),[§3\.3](https://arxiv.org/html/2605.08636#S3.SS3.p4.1)\.
- \[28\]T\. Wolf, L\. Debut, V\. Sanh, J\. Chaumond, C\. Delangue, A\. Moi, P\. Cistac, T\. Rault, R\. Louf, M\. Funtowicz, and J\. Brew\(2019\)HuggingFace’s transformers: state\-of\-the\-art natural language processing\.CoRRabs/1910\.03771\.External Links:1910\.03771Cited by:[§3\.1](https://arxiv.org/html/2605.08636#S3.SS1.p1.1)\.
- \[29\]A\. Yang and the Qwen Team\(2024\)Qwen2\.5 Technical Report\.arXiv preprintarXiv:2412\.15115\.Note:Qwen2\.5 model family including 0\.5B variantExternal Links:[Link](https://arxiv.org/abs/2412.15115)Cited by:[§3\.2\.2](https://arxiv.org/html/2605.08636#S3.SS2.SSS2.p1.1)\.
- \[30\]R\. Ye, W\. Wang, J\. Chai, D\. Li, Z\. Li, Y\. Xu, Y\. Du, Y\. Wang, and S\. Chen\(2024\)OpenFedLLM: training large language models on decentralized private data via federated learning\.InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,pp\. 6137–6147\.External Links:[Document](https://dx.doi.org/10.1145/3637528.3671582)Cited by:[§1](https://arxiv.org/html/2605.08636#S1.p3.1),[§3\.3](https://arxiv.org/html/2605.08636#S3.SS3.p2.1),[§3\.3](https://arxiv.org/html/2605.08636#S3.SS3.p3.1)\.
- \[31\]R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi\(2019\)HellaSwag: can a machine really finish your sentence?\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,Florence, Italy,pp\. 4791–4800\.External Links:[Document](https://dx.doi.org/10.18653/v1/P19-1472)Cited by:[§3\.2\.1](https://arxiv.org/html/2605.08636#S3.SS2.SSS1.p3.1)\.
- \[32\]J\. Zhang, S\. Vahidian, M\. Kuo, C\. Li, R\. Zhang, T\. Yu, Y\. Zhou, G\. Wang, and Y\. Chen\(2023\)Towards building the federated GPT: federated instruction tuning\.arXiv preprint arXiv:2305\.05644\.Cited by:[§3\.3](https://arxiv.org/html/2605.08636#S3.SS3.p2.1)\.
## Appendix A: Complete benchmark results of Protocol A
This section provides the complete benchmark results for Protocol A, i.e., Quality-under-Budget. We report the achievable quality of Qwen2.5-0.5B, Gemma 3-270M, and Gemma 3-1B on seven evaluation datasets under the given system budgets. For each model and dataset, we evaluate four representative federated LLM fine-tuning methods: FedAvg+LoRA, FedProx+LoRA, HeteroLoRA, and SplitLoRA. The reported quality metrics are testing loss (lower is better) and testing accuracy (higher is better).
We consider a total of 100 clients, with 20 clients instantiated for each device type listed in Table [1](https://arxiv.org/html/2605.08636#S3.T1). In each communication round, 10 clients are randomly selected to participate in federated fine-tuning. In addition to the four federated fine-tuning methods, we report the pretrained model, centroid fine-tuning, and local-only training as references. The pretrained result denotes the initial zero-shot quality before fine-tuning. The centroid result is obtained by centralized fine-tuning on the complete training dataset under the same experimental configuration and is treated as the quality upper bound. The local-only result is obtained by training each client only on its local data without federated aggregation; its testing loss and testing accuracy are averaged across clients.
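As a concrete illustration of this setup, the following minimal sketch (ours, not the platform's code; the device labels are placeholders for the device types in Table 1) builds the 100-client pool and draws the 10 participants for one round:

```python
import random

# Minimal sketch of the client pool described above: 100 clients, 20 per device
# type, with 10 participants drawn uniformly at random each communication round.
# The device labels below are placeholders, not the actual Table 1 devices.
DEVICE_TYPES = ["device_a", "device_b", "device_c", "device_d", "device_e"]
CLIENTS_PER_TYPE = 20
CLIENTS_PER_ROUND = 10

client_pool = [f"{dev}_{i}" for dev in DEVICE_TYPES for i in range(CLIENTS_PER_TYPE)]
assert len(client_pool) == 100

rng = random.Random(42)  # fixed seed for a reproducible client schedule

def sample_round() -> list[str]:
    """Select the 10 clients that participate in one communication round."""
    return rng.sample(client_pool, CLIENTS_PER_ROUND)

print(sample_round())
```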
Table 7: Testing quality across all selected tasks and methods under Protocol A with Qwen2.5-0.5B. Loss is lower-is-better; accuracy is higher-is-better.

| Task Type | Task | Method | Pretrained Loss | Best Fine-tuned Loss | Rank | Pretrained Acc. | Best Fine-tuned Acc. | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Verify | BoolQ | Centroid | 0.6393 | 0.4788 | Lower Bound | 63.21% | 80.24% | Upper Bound |
| | | FedAvg+LoRA | | 0.5051 | 2 | | 77.68% | 2 |
| | | FedProx+LoRA | | 0.4939 | 1 | | 78.20% | 1 |
| | | HeteroLoRA | | 0.5059 | 3 | | 76.73% | 4 |
| | | SplitLoRA | | 0.5161 | 4 | | 76.79% | 3 |
| | | Local Only | | 0.6127 | Upper Bound | | 71.99% | Lower Bound |
| | QNLI | Centroid | 1.2153 | 0.4611 | Lower Bound | 58.17% | 85.76% | Upper Bound |
| | | FedAvg+LoRA | | 0.7994 | 2 | | 64.84% | 3 |
| | | FedProx+LoRA | | 0.7357 | 1 | | 64.96% | 2 |
| | | HeteroLoRA | | 0.9490 | 4 | | 64.29% | 4 |
| | | SplitLoRA | | 0.8051 | 3 | | 65.20% | 1 |
| | | Local Only | | 0.8657 | Upper Bound | | 63.88% | Lower Bound |
| Choose | PIQA | Centroid | 0.7382 | 0.6778 | Lower Bound | 59.36% | 65.45% | Upper Bound |
| | | FedAvg+LoRA | | 0.6781 | 2 | | 65.29% | 2 |
| | | FedProx+LoRA | | 0.6779 | 1 | | 65.29% | 2 |
| | | HeteroLoRA | | 0.6943 | 4 | | 65.34% | 1 |
| | | SplitLoRA | | 0.6927 | 3 | | 64.47% | 3 |
| | | Local Only | | 0.7514 | Upper Bound | | 61.22% | Lower Bound |
| | HellaSwag | Centroid | 1.8658 | 1.7026 | Lower Bound | 26.87% | 35.13% | Upper Bound |
| | | FedAvg+LoRA | | 1.7446 | 1 | | 33.84% | 1 |
| | | FedProx+LoRA | | 1.7446 | 1 | | 33.84% | 1 |
| | | HeteroLoRA | | 1.7618 | 3 | | 33.18% | 3 |
| | | SplitLoRA | | 1.7563 | 2 | | 33.67% | 2 |
| | | Local Only | | 2.0041 | Upper Bound | | 26.06% | Lower Bound |
| | SocialIQA | Centroid | 0.9644 | 0.7769 | Lower Bound | 55.99% | 68.07% | Upper Bound |
| | | FedAvg+LoRA | | 0.7993 | 2 | | 66.63% | 2 |
| | | FedProx+LoRA | | 0.7863 | 1 | | 67.35% | 1 |
| | | HeteroLoRA | | 0.8250 | 4 | | 66.07% | 4 |
| | | SplitLoRA | | 0.8044 | 3 | | 66.38% | 3 |
| | | Local Only | | 0.9981 | Upper Bound | | 59.31% | Lower Bound |
| Reason | ARC-E | Centroid | 0.7180 | 0.5913 | Lower Bound | 71.05% | 79.47% | Upper Bound |
| | | FedAvg+LoRA | | 0.6054 | 2 | | 78.60% | 2 |
| | | FedProx+LoRA | | 0.6013 | 1 | | 78.77% | 1 |
| | | HeteroLoRA | | 0.6274 | 4 | | 76.67% | 4 |
| | | SplitLoRA | | 0.6102 | 3 | | 77.02% | 3 |
| | | Local Only | | 0.6693 | Upper Bound | | 74.53% | Lower Bound |
| | WinoGrande | Centroid | 0.7032 | 0.6156 | Lower Bound | 51.07% | 63.93% | Upper Bound |
| | | FedAvg+LoRA | | 0.6905 | 2 | | 63.46% | 2 |
| | | FedProx+LoRA | | 0.6906 | 3 | | 63.61% | 1 |
| | | HeteroLoRA | | 0.6960 | 4 | | 61.17% | 4 |
| | | SplitLoRA | | 0.6889 | 1 | | 62.43% | 3 |
| | | Local Only | | 0.7073 | Upper Bound | | 56.69% | Lower Bound |
Table 8: Testing quality across all selected tasks and methods under Protocol A with Gemma 3-270M. Loss is lower-is-better; accuracy is higher-is-better.

| Task Type | Task | Method | Pretrained Loss | Best Fine-tuned Loss | Rank | Pretrained Acc. | Best Fine-tuned Acc. | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Verify | BoolQ | Centroid | 0.8543 | 0.6199 | Lower Bound | 60.40% | 69.66% | Upper Bound |
| | | FedAvg+LoRA | | 0.7356 | 3 | | 62.94% | 2 |
| | | FedProx+LoRA | | 0.7271 | 2 | | 64.53% | 1 |
| | | HeteroLoRA | | 0.7441 | 4 | | 62.20% | 4 |
| | | SplitLoRA | | 0.7017 | 1 | | 62.72% | 3 |
| | | Local Only | | 0.7396 | Upper Bound | | 60.28% | Lower Bound |
| | QNLI | Centroid | 1.0986 | 0.5454 | Lower Bound | 51.51% | 78.29% | Upper Bound |
| | | FedAvg+LoRA | | 0.6720 | 2 | | 68.26% | 2 |
| | | FedProx+LoRA | | 0.6741 | 3 | | 66.98% | 3 |
| | | HeteroLoRA | | 0.7411 | 4 | | 63.87% | 4 |
| | | SplitLoRA | | 0.5718 | 1 | | 71.46% | 1 |
| | | Local Only | | 0.7056 | Upper Bound | | 62.93% | Lower Bound |
| Choose | PIQA | Centroid | 1.3732 | 1.2522 | Lower Bound | 51.47% | 51.87% | Upper Bound |
| | | FedAvg+LoRA | | 1.3194 | 2 | | 50.47% | 2 |
| | | FedProx+LoRA | | 1.3194 | 2 | | 51.47% | 1 |
| | | HeteroLoRA | | 1.3443 | 3 | | 51.47% | 1 |
| | | SplitLoRA | | 1.3188 | 1 | | 51.47% | 1 |
| | | Local Only | | 1.3534 | Upper Bound | | 50.24% | Lower Bound |
| | HellaSwag | Centroid | 2.9327 | 0.2513 | Lower Bound | 24.54% | 25.20% | Upper Bound |
| | | FedAvg+LoRA | | 2.6755 | 1 | | 24.84% | 3 |
| | | FedProx+LoRA | | 2.6755 | 1 | | 24.84% | 3 |
| | | HeteroLoRA | | 2.7378 | 3 | | 24.93% | 1 |
| | | SplitLoRA | | 2.6765 | 2 | | 24.88% | 2 |
| | | Local Only | | 2.6982 | Upper Bound | | 24.73% | Lower Bound |
| | SocialIQA | Centroid | 1.8536 | 1.2164 | Lower Bound | 34.34% | 44.98% | Upper Bound |
| | | FedAvg+LoRA | | 1.3476 | 3 | | 40.94% | 1 |
| | | FedProx+LoRA | | 1.3368 | 2 | | 40.38% | 3 |
| | | HeteroLoRA | | 1.4414 | 4 | | 38.79% | 4 |
| | | SplitLoRA | | 1.2748 | 1 | | 40.84% | 2 |
| | | Local Only | | 1.7288 | Upper Bound | | 37.86% | Lower Bound |
| Reason | ARC-E | Centroid | 2.9670 | 2.3865 | Lower Bound | 26.67% | 30.90% | Upper Bound |
| | | FedAvg+LoRA | | 2.4061 | 1 | | 30.18% | 1 |
| | | FedProx+LoRA | | 2.4061 | 1 | | 30.18% | 1 |
| | | HeteroLoRA | | 2.5769 | 3 | | 30.00% | 2 |
| | | SplitLoRA | | 2.4278 | 2 | | 29.30% | 3 |
| | | Local Only | | 2.5307 | Upper Bound | | 29.26% | Lower Bound |
| | WinoGrande | Centroid | 1.5455 | 1.3361 | Lower Bound | 49.88% | 50.59% | Upper Bound |
| | | FedAvg+LoRA | | 1.4291 | 3 | | 50.51% | 2 |
| | | FedProx+LoRA | | 1.3735 | 1 | | 51.30% | 1 |
| | | HeteroLoRA | | 1.4621 | 4 | | 50.51% | 2 |
| | | SplitLoRA | | 1.4263 | 2 | | 50.43% | 3 |
| | | Local Only | | 1.4315 | Upper Bound | | 50.43% | Lower Bound |
### A.1 Results of Qwen2.5-0.5B
Table [7](https://arxiv.org/html/2605.08636#A1.T7) summarizes the Protocol A results of Qwen2.5-0.5B across all seven tasks. The centroid setting provides the centralized upper bound, while the pretrained and local-only results serve as reference baselines. Overall, federated fine-tuning improves testing accuracy over the pretrained model on most tasks, with FedAvg+LoRA and FedProx+LoRA generally achieving strong final quality.

Comparing the four federated fine-tuning methods, FedProx+LoRA achieves the strongest accuracy on BoolQ, SocialIQA, ARC-E, and WinoGrande, and ties with FedAvg+LoRA on PIQA and HellaSwag. FedAvg+LoRA remains a competitive baseline, usually ranking second or tied first in accuracy, but it does not consistently outperform FedProx+LoRA. SplitLoRA achieves the best accuracy on QNLI and the lowest testing loss on WinoGrande. HeteroLoRA obtains the best accuracy on PIQA but generally ranks lower on the other tasks, suggesting that heterogeneous LoRA ranks may introduce a quality trade-off.
Two behaviors are worth noting. First, the local-only baseline can show higher testing loss than the pretrained model: each client trains only on a small local shard without aggregation, which can cause local overfitting and weaker global generalization. Second, testing loss and testing accuracy do not always induce the same ranking, because loss measures target-token likelihood while accuracy only checks whether the predicted answer letter is correct. A method can therefore achieve better accuracy but worse loss, or vice versa.
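A toy example (ours, not the benchmark's evaluation code) makes the divergence concrete: over two multiple-choice questions whose correct answer is "A", a method that is confidently right once and narrowly wrong once can beat an always-right but unconfident method on loss while losing on accuracy.

```python
import math

def metrics(prob_sets):
    """Mean NLL of the correct letter (testing loss) and letter accuracy."""
    loss = sum(-math.log(p["A"]) for p in prob_sets) / len(prob_sets)
    acc = sum(max(p, key=p.get) == "A" for p in prob_sets) / len(prob_sets)
    return loss, acc

# Method 1 is always right but never confident.
m1 = [{"A": 0.40, "B": 0.30, "C": 0.30}, {"A": 0.40, "B": 0.30, "C": 0.30}]
# Method 2 is very confident once and narrowly wrong once.
m2 = [{"A": 0.95, "B": 0.03, "C": 0.02}, {"A": 0.45, "B": 0.50, "C": 0.05}]

print(metrics(m1))  # loss ~0.916, accuracy 1.0
print(metrics(m2))  # loss ~0.425, accuracy 0.5 -> better loss, worse accuracy
```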
Table 9: Testing quality across all selected tasks and methods under Protocol A with Gemma 3-1B. "-" indicates that the method is not executable under the system budget due to out-of-memory. Loss is lower-is-better; accuracy is higher-is-better.

| Task Type | Task | Method | Pretrained Loss | Best Fine-tuned Loss | Rank | Pretrained Acc. | Best Fine-tuned Acc. | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Verify | BoolQ | Centroid | 1.0778 | 0.5800 | Lower Bound | 59.51% | 70.00% | Upper Bound |
| | | FedAvg+LoRA | | - | - | | - | - |
| | | FedProx+LoRA | | - | - | | - | - |
| | | HeteroLoRA | | - | - | | - | - |
| | | SplitLoRA | | 0.7590 | 1 | | 64.25% | 1 |
| | | Local Only | | - | - | | - | - |
| | QNLI | Centroid | 1.4075 | 0.6451 | Lower Bound | 49.81% | 68.00% | Upper Bound |
| | | FedAvg+LoRA | | - | - | | - | - |
| | | FedProx+LoRA | | - | - | | - | - |
| | | HeteroLoRA | | - | - | | - | - |
| | | SplitLoRA | | 0.7367 | 1 | | 64.52% | 1 |
| | | Local Only | | - | - | | - | - |
| Choose | PIQA | Centroid | 1.1466 | 0.7800 | Lower Bound | 51.52% | 57.02% | Upper Bound |
| | | FedAvg+LoRA | | - | - | | - | - |
| | | FedProx+LoRA | | - | - | | - | - |
| | | HeteroLoRA | | - | - | | - | - |
| | | SplitLoRA | | 0.9847 | 1 | | 53.81% | 1 |
| | | Local Only | | - | - | | - | - |
| | HellaSwag | Centroid | 2.7088 | 2.0173 | Lower Bound | 25.03% | 31.65% | Upper Bound |
| | | FedAvg+LoRA | | - | - | | - | - |
| | | FedProx+LoRA | | - | - | | - | - |
| | | HeteroLoRA | | - | - | | - | - |
| | | SplitLoRA | | 2.6109 | 1 | | 25.33% | 1 |
| | | Local Only | | - | - | | - | - |
| | SocialIQA | Centroid | 1.8257 | 1.1800 | Lower Bound | 34.44% | 49.03% | Upper Bound |
| | | FedAvg+LoRA | | - | - | | - | - |
| | | FedProx+LoRA | | - | - | | - | - |
| | | HeteroLoRA | | - | - | | - | - |
| | | SplitLoRA | | 1.2247 | 1 | | 45.70% | 1 |
| | | Local Only | | - | - | | - | - |
| Reason | ARC-E | Centroid | 2.6107 | 1.9000 | Lower Bound | 26.14% | 33.71% | Upper Bound |
| | | FedAvg+LoRA | | - | - | | - | - |
| | | FedProx+LoRA | | - | - | | - | - |
| | | HeteroLoRA | | - | - | | - | - |
| | | SplitLoRA | | 2.2811 | 1 | | 28.07% | 1 |
| | | Local Only | | - | - | | - | - |
| | WinoGrande | Centroid | 1.4381 | 0.6427 | Lower Bound | 50.36% | 57.73% | Upper Bound |
| | | FedAvg+LoRA | | - | - | | - | - |
| | | FedProx+LoRA | | - | - | | - | - |
| | | HeteroLoRA | | - | - | | - | - |
| | | SplitLoRA | | 0.8169 | 1 | | 51.62% | 1 |
| | | Local Only | | - | - | | - | - |
### A.2 Results of Gemma 3-270M
Table [8](https://arxiv.org/html/2605.08636#A1.T8) compares the four federated fine-tuning methods under Protocol A with Gemma 3-270M. FedProx+LoRA achieves the strongest or tied-strongest accuracy on BoolQ, PIQA, ARC-E, and WinoGrande. FedAvg+LoRA is generally competitive and obtains the best accuracy on SocialIQA and tied-best accuracy on ARC-E. SplitLoRA often achieves the lowest testing loss, especially on BoolQ, QNLI, PIQA, and SocialIQA.
### A.3 Results of Gemma 3-1B
Table [9](https://arxiv.org/html/2605.08636#A1.T9) reports the Protocol A results with Gemma 3-1B. Different from Qwen2.5-0.5B and Gemma 3-270M, most federated fine-tuning methods cannot be executed under the current edge-system budget when scaling to Gemma 3-1B. A full LoRA fine-tuning run of Gemma 3-1B requires approximately 10 GB of client-side memory, which only the iQOO device in our testbed can supply. The remaining client devices exceed their memory limits and encounter out-of-memory failures. As a result, FedAvg+LoRA, FedProx+LoRA, HeteroLoRA, and Local Only cannot be deployed under the current system configuration, and their entries are marked as "-".

In contrast, SplitLoRA remains executable because each client hosts only the first hidden layer of the backbone model, while the remaining model computation is offloaded to the server through split learning. Although SplitLoRA does not always achieve the highest final accuracy on smaller models, as the Qwen2.5-0.5B and Gemma 3-270M results show, it provides a clear deployability advantage as the model size increases. This highlights an important system-level trade-off: methods with stronger quality on smaller models may become infeasible under realistic edge memory constraints, whereas split-model fine-tuning can still support larger backbones within the available client-side budget.
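The "-" entries amount to a simple feasibility predicate over the client pool; a minimal sketch, assuming illustrative per-device memory budgets (only the ~10 GB full-LoRA footprint quoted above and the SplitLoRA ballpark from Table 9/16 come from the paper's measurements):

```python
# Sketch of the feasibility check behind the "-" entries in Table 9. The
# per-device budgets are illustrative stand-ins, not measured values.
DEVICE_BUDGET_MB = {"iqoo": 12_000, "mid_range_phone": 6_000, "edge_board": 8_000}
METHOD_PEAK_MB = {
    "FedAvg+LoRA": 10_000,   # ~10 GB for full client-side LoRA on Gemma 3-1B
    "FedProx+LoRA": 10_000,
    "HeteroLoRA": 10_000,
    "SplitLoRA": 1_350,      # client hosts only the first hidden layer
}

def deployable_everywhere(method: str) -> bool:
    """A method is executable only if every device type can host its peak memory."""
    return all(budget >= METHOD_PEAK_MB[method] for budget in DEVICE_BUDGET_MB.values())

for method in METHOD_PEAK_MB:
    print(method, "executable" if deployable_everywhere(method) else "-")
```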
### A.4 Overall ranking of methods under Protocol A
Figure [4](https://arxiv.org/html/2605.08636#A1.F4) summarizes the overall ranking of the four federated fine-tuning methods under Protocol A. For Qwen2.5-0.5B and Gemma 3-270M, FedProx+LoRA achieves the best overall rank. This is mainly because FedProx+LoRA adds a proximal regularizer to the local LoRA objective, which constrains local updates and improves training stability under client-side data partitioning. FedAvg+LoRA also remains competitive, but it lacks this local regularization term and therefore shows a slightly weaker average ranking across tasks.
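Concretely, the proximal term augments each client's task loss with a quadratic penalty toward the global weights, following the standard FedProx formulation [15]; a minimal PyTorch sketch over the trainable LoRA parameters (the coefficient `mu` here is illustrative, not the benchmark's setting):

```python
import torch

def fedprox_loss(task_loss, local_params, global_params, mu=0.01):
    """Local FedProx objective: task loss + (mu/2) * ||w - w_global||^2,
    applied here to the trainable LoRA parameters only."""
    prox = sum(((p - g.detach()) ** 2).sum()
               for p, g in zip(local_params, global_params))
    return task_loss + 0.5 * mu * prox

# Toy stand-ins for one LoRA factor and the server's copy of it.
local_lora = [torch.randn(8, 8, requires_grad=True)]
global_lora = [torch.zeros(8, 8)]
task_loss = local_lora[0].pow(2).mean()      # placeholder for the LM loss
loss = fedprox_loss(task_loss, local_lora, global_lora)
loss.backward()  # gradients now include the pull toward the global adapter
```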
HeteroLoRA obtains the lowest overall rank among the executable methods. Although heterogeneous LoRA ranks are designed to adapt to different device capabilities, the aggregation process requires aligning low-rank adapters to a common shape, typically through zero-padding. This padding-based alignment can introduce additional noise or inactive dimensions during aggregation, which may degrade the quality of the aggregated adapter. This observation is consistent with the findings in FLoRA [27](https://arxiv.org/html/2605.08636#bib.bib26), where heterogeneous adapter aggregation can introduce nontrivial quality trade-offs.
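The sketch below (our simplification of the general padding pattern, not HeteroLoRA's exact implementation) shows how adapters of rank 4 and rank 8 are zero-padded to a common rank before averaging, and why averaging the factors differs from averaging the adapters' products:

```python
import torch

def pad_to_rank(A: torch.Tensor, B: torch.Tensor, r_max: int):
    """Zero-pad LoRA factors A (r x d_in) and B (d_out x r) to a common rank."""
    r = A.shape[0]
    A_pad = torch.zeros(r_max, A.shape[1]); A_pad[:r] = A
    B_pad = torch.zeros(B.shape[0], r_max); B_pad[:, :r] = B
    return A_pad, B_pad

def aggregate(adapters, r_max: int):
    """Average padded factors; the padded rows/columns are inactive dimensions."""
    padded = [pad_to_rank(A, B, r_max) for A, B in adapters]
    return (torch.stack([a for a, _ in padded]).mean(0),
            torch.stack([b for _, b in padded]).mean(0))

adapters = [(torch.randn(4, 16), torch.randn(16, 4)),   # rank-4 client
            (torch.randn(8, 16), torch.randn(16, 8))]   # rank-8 client
A_avg, B_avg = aggregate(adapters, r_max=8)
# Caveat: averaging factors is not the same as averaging the products B_i @ A_i,
# which is one way this alignment can distort the aggregated update.
```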
SplitLoRA does not achieve the best overall quality on Qwen2.5-0.5B and Gemma 3-270M because it aggregates only a lightweight client-side submodel and relies on activation exchange with the server, which can affect final fine-tuning quality compared with full client-side LoRA updates. However, SplitLoRA shows a clear deployability advantage when scaling to Gemma 3-1B. Under the current edge-system budget, full LoRA-based methods cannot be executed due to client-side memory limitations, whereas SplitLoRA remains executable by keeping only the first hidden layer on the client side.
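A conceptual sketch of this split data path (toy layer sizes, not the actual backbone) is shown below: the client's memory footprint covers only the earliest layers, and activations rather than adapter weights cross the network.

```python
import torch

# Toy illustration of split-model fine-tuning: the client computes only the
# earliest part of the forward pass and exchanges activations with the server
# instead of holding the full backbone in memory.
client_part = torch.nn.Sequential(
    torch.nn.Embedding(1000, 64),   # stand-in for embeddings + first hidden layer
    torch.nn.Linear(64, 64),
)
server_part = torch.nn.Sequential(  # stand-in for the remaining transformer blocks
    torch.nn.Linear(64, 64),
    torch.nn.Linear(64, 1000),
)

tokens = torch.randint(0, 1000, (1, 16))
h = client_part(tokens)   # small client-side footprint, independent of model depth
# --- activations h (and their gradients on the backward pass) cross the network ---
logits = server_part(h)
print(logits.shape)       # torch.Size([1, 16, 1000])
```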
Therefore, Figure [4](https://arxiv.org/html/2605.08636#A1.F4) highlights a key Protocol A insight: a method may achieve strong final task performance under feasible settings, but still be impractical for edge deployment if its system cost exceeds what edge devices can sustain.
Figure 4: Overall ranking of methods under Protocol A.
## Appendix B: Complete benchmark results of Protocol B
This section provides the complete benchmark results for Protocol B, i.e., Cost-to-Target. We use the same experimental setup as Protocol A, including the same client pool, data partitioning strategy, communication-round configuration, backbone models, datasets, and federated fine-tuning methods. Specifically, we consider a total of 100 clients, with 20 clients instantiated for each device type listed in Table [1](https://arxiv.org/html/2605.08636#S3.T1), and randomly select 10 clients to participate in each communication round.

Different from Protocol A, which reports the best achievable quality under a fixed system budget, Protocol B evaluates the system cost required for each method to reach a given target accuracy. For each task, we set three target accuracy levels that represent different stages of training progress. These targets are derived from the improvement interval between the pretrained model and the centroid upper bound, corresponding to 50%, 70%, and 90% of the accuracy improvement from the pretrained baseline to the centroid result. For each target accuracy, we report four system metrics at the point where a method first reaches the target: wall-clock time, communication volume, energy consumption, and peak memory. Wall-clock time, communication volume, and energy consumption are recorded at the first point where the target accuracy is achieved, while peak memory is computed as the average peak memory across all participating clients up to that point. Since all metrics in Protocol B represent system cost, lower values are better, and a smaller rank indicates a more efficient method.
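As a worked example of the target derivation (our arithmetic, using the Qwen2.5-0.5B BoolQ entries from Table 7):

```python
# Protocol B targets sit at 50%, 70%, and 90% of the pretrained-to-centroid
# accuracy improvement.
pretrained_acc = 63.21  # Qwen2.5-0.5B zero-shot accuracy on BoolQ (%), Table 7
centroid_acc = 80.24    # centralized upper bound (%), Table 7

targets = [round(pretrained_acc + f * (centroid_acc - pretrained_acc), 2)
           for f in (0.5, 0.7, 0.9)]
print(targets)  # [71.73, 75.13, 78.54] -> the 71%/75%/78% BoolQ targets in Table 11
```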
Table 10: Complete Protocol B results of Qwen2.5-0.5B on Choose tasks.

| Task | Target Acc. | Method | Wall-clock Time (h) | Rank | Comm. Volume (MB) | Rank | Energy (kJ) | Rank | Peak Memory (MB) | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PIQA | 63% | FedAvg+LoRA | 5.02 | 4 | 4968.28 | 2 | 232.97 | 4 | 3432.59 | 4 |
| | | FedProx+LoRA | 5.01 | 3 | 4968.28 | 2 | 232.21 | 3 | 3398.65 | 3 |
| | | HeteroLoRA | 2.05 | 2 | 6217.97 | 3 | 95.18 | 2 | 2531.45 | 2 |
| | | SplitLoRA | 0.18 | 1 | 4804.38 | 1 | 8.47 | 1 | 1137.96 | 1 |
| | 64% | FedAvg+LoRA | 7.52 | 3 | 7452.42 | 1 | 349.03 | 3 | 3473.55 | 4 |
| | | FedProx+LoRA | 8.33 | 4 | 8280.47 | 3 | 386.66 | 4 | 3439.22 | 3 |
| | | HeteroLoRA | 5.94 | 2 | 18032.11 | 4 | 275.42 | 2 | 2561.67 | 2 |
| | | SplitLoRA | 0.30 | 1 | 8007.30 | 2 | 14.12 | 1 | 1151.54 | 1 |
| | 65% | FedAvg+LoRA | 11.68 | 2 | 11592.66 | 1 | 541.71 | 2 | 3492.19 | 4 |
| | | FedProx+LoRA | 17.58 | 4 | 17388.98 | 2 | 815.72 | 4 | 3457.67 | 3 |
| | | HeteroLoRA | 12.70 | 3 | 38551.41 | 4 | 589.29 | 3 | 2575.41 | 2 |
| | | SplitLoRA | 0.79 | 1 | 20818.98 | 3 | 36.72 | 1 | 1157.72 | 1 |
| HellaSwag | 30% | FedAvg+LoRA | 15.84 | 3 | 15732.89 | 2 | 770.04 | 3 | 3368.58 | 3 |
| | | FedProx+LoRA | 15.87 | 4 | 15732.89 | 2 | 771.13 | 4 | 3539.18 | 4 |
| | | HeteroLoRA | 7.38 | 2 | 22384.69 | 3 | 358.58 | 2 | 2587.70 | 2 |
| | | SplitLoRA | 0.45 | 1 | 10142.58 | 1 | 21.70 | 1 | 1150.11 | 1 |
| | 32% | FedAvg+LoRA | 30.03 | 4 | 29809.69 | 2 | 1459.38 | 4 | 3408.78 | 3 |
| | | FedProx+LoRA | 30.02 | 3 | 29809.69 | 2 | 1458.96 | 3 | 3581.42 | 4 |
| | | HeteroLoRA | 12.31 | 2 | 37307.81 | 3 | 598.08 | 2 | 2618.58 | 2 |
| | | SplitLoRA | 0.85 | 1 | 19217.52 | 1 | 41.15 | 1 | 1163.83 | 1 |
| | 33% | FedAvg+LoRA | 73.99 | 3 | 73696.17 | 2 | 3596.01 | 3 | 3427.08 | 3 |
| | | FedProx+LoRA | 74.10 | 4 | 73696.17 | 2 | 3601.39 | 4 | 3600.64 | 4 |
| | | HeteroLoRA | 25.23 | 2 | 76481.02 | 3 | 1226.35 | 2 | 2632.64 | 2 |
| | | SplitLoRA | 2.11 | 1 | 48043.80 | 1 | 102.63 | 1 | 1170.08 | 1 |
| SocialIQA | 62% | FedAvg+LoRA | 15.79 | 4 | 15732.89 | 2 | 772.17 | 4 | 3418.89 | 3 |
| | | FedProx+LoRA | 15.71 | 3 | 15732.89 | 2 | 768.41 | 3 | 3462.20 | 4 |
| | | HeteroLoRA | 5.32 | 2 | 16166.72 | 3 | 259.95 | 2 | 2562.68 | 2 |
| | | SplitLoRA | 0.44 | 1 | 10142.58 | 1 | 21.54 | 1 | 1147.95 | 1 |
| | 64% | FedAvg+LoRA | 34.91 | 4 | 34777.97 | 3 | 1707.33 | 4 | 3455.95 | 3 |
| | | FedProx+LoRA | 34.72 | 3 | 34777.97 | 3 | 1697.93 | 3 | 3499.73 | 4 |
| | | HeteroLoRA | 9.98 | 2 | 30468.05 | 2 | 488.07 | 2 | 2590.46 | 2 |
| | | SplitLoRA | 0.81 | 1 | 18683.70 | 1 | 39.61 | 1 | 1160.39 | 1 |
| | 66% | FedAvg+LoRA | 94.94 | 4 | 94397.34 | 3 | 4643.29 | 4 | 3476.19 | 3 |
| | | FedProx+LoRA | 94.25 | 3 | 94397.34 | 3 | 4609.67 | 3 | 3520.23 | 4 |
| | | HeteroLoRA | 29.94 | 2 | 91404.14 | 2 | 1464.33 | 2 | 2605.63 | 2 |
| | | SplitLoRA | 1.34 | 1 | 30961.56 | 1 | 65.53 | 1 | 1167.19 | 1 |
Table 11: Complete Protocol B results of Qwen2.5-0.5B on Verify tasks.

| Task | Target Acc. | Method | Wall-clock Time (h) | Rank | Comm. Volume (MB) | Rank | Energy (kJ) | Rank | Peak Memory (MB) | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BoolQ | 71% | FedAvg+LoRA | 6.65 | 3 | 6624.38 | 1 | 300.67 | 3 | 3453.60 | 3 |
| | | FedProx+LoRA | 6.67 | 4 | 6624.38 | 1 | 301.19 | 4 | 3470.35 | 4 |
| | | HeteroLoRA | 3.07 | 2 | 9326.95 | 2 | 138.66 | 2 | 2499.21 | 2 |
| | | SplitLoRA | 0.43 | 1 | 11744.04 | 3 | 19.42 | 1 | 1142.87 | 1 |
| | 75% | FedAvg+LoRA | 12.48 | 3 | 12420.70 | 1 | 563.76 | 3 | 3497.81 | 3 |
| | | FedProx+LoRA | 12.50 | 4 | 12420.70 | 1 | 564.83 | 4 | 3514.77 | 4 |
| | | HeteroLoRA | 5.32 | 2 | 16166.72 | 3 | 240.57 | 2 | 2531.20 | 2 |
| | | SplitLoRA | 0.47 | 1 | 12811.68 | 2 | 21.18 | 1 | 1157.50 | 1 |
| | 78% | FedAvg+LoRA | 32.46 | 3 | 32293.83 | 2 | 1466.74 | 3 | 3520.44 | 3 |
| | | FedProx+LoRA | 32.50 | 4 | 32293.83 | 2 | 1468.52 | 4 | 3537.51 | 4 |
| | | HeteroLoRA | 13.33 | 2 | 40416.80 | 3 | 602.20 | 2 | 2547.57 | 2 |
| | | SplitLoRA | 0.99 | 1 | 27224.82 | 1 | 44.94 | 1 | 1164.99 | 1 |
| QNLI | 62% | FedAvg+LoRA | 2.50 | 4 | 2484.14 | 2 | 116.24 | 4 | 3375.58 | 3 |
| | | FedProx+LoRA | 2.49 | 3 | 2484.14 | 2 | 115.91 | 3 | 3497.83 | 4 |
| | | HeteroLoRA | 1.03 | 2 | 3108.98 | 3 | 48.17 | 2 | 2510.74 | 2 |
| | | SplitLoRA | 0.07 | 1 | 1601.46 | 1 | 3.43 | 1 | 1152.89 | 1 |
| | 63% | FedAvg+LoRA | 18.28 | 4 | 18217.03 | 2 | 851.18 | 4 | 3415.86 | 3 |
| | | FedProx+LoRA | 18.22 | 3 | 18217.03 | 2 | 848.50 | 3 | 3539.57 | 4 |
| | | HeteroLoRA | 3.71 | 2 | 11192.34 | 1 | 172.76 | 2 | 2540.71 | 2 |
| | | SplitLoRA | 1.97 | 1 | 42705.60 | 3 | 91.54 | 1 | 1166.65 | 1 |
| | 64% | FedAvg+LoRA | 53.12 | 3 | 52995.00 | 2 | 2474.11 | 3 | 3434.19 | 3 |
| | | FedProx+LoRA | 73.59 | 4 | 73696.17 | 3 | 3427.23 | 4 | 3558.57 | 4 |
| | | HeteroLoRA | 16.29 | 2 | 49121.95 | 1 | 758.47 | 2 | 2554.34 | 2 |
| | | SplitLoRA | 3.78 | 1 | 82208.27 | 4 | 176.19 | 1 | 1172.91 | 1 |
Table 12: Complete Protocol B results of Qwen2.5-0.5B on Reason tasks.

| Task | Target Acc. | Method | Wall-clock Time (h) | Rank | Comm. Volume (MB) | Rank | Energy (kJ) | Rank | Peak Memory (MB) | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ARC-E | 75% | FedAvg+LoRA | 24.29 | 4 | 24013.36 | 2 | 1176.40 | 4 | 3404.12 | 3 |
| | | FedProx+LoRA | 24.13 | 3 | 24013.36 | 2 | 1168.56 | 3 | 3441.75 | 4 |
| | | HeteroLoRA | 4.00 | 2 | 24250.08 | 3 | 193.52 | 2 | 2540.26 | 2 |
| | | SplitLoRA | 0.48 | 1 | 10676.40 | 1 | 23.09 | 1 | 1141.11 | 1 |
| | 76% | FedAvg+LoRA | 32.64 | 3 | 32293.83 | 2 | 1580.47 | 3 | 3445.54 | 3 |
| | | FedProx+LoRA | 33.31 | 4 | 33121.88 | 3 | 1612.96 | 4 | 3483.62 | 4 |
| | | HeteroLoRA | 7.89 | 2 | 47878.36 | 4 | 382.26 | 2 | 2571.17 | 2 |
| | | SplitLoRA | 0.95 | 1 | 21352.80 | 1 | 46.20 | 1 | 1154.99 | 1 |
| | 77% | FedAvg+LoRA | 66.91 | 4 | 66243.75 | 2 | 3240.10 | 4 | 3458.55 | 3 |
| | | FedProx+LoRA | 66.58 | 3 | 66243.75 | 2 | 3223.88 | 3 | 3496.77 | 4 |
| | | HeteroLoRA | 11.17 | 2 | 67775.86 | 3 | 540.93 | 2 | 2580.88 | 2 |
| | | SplitLoRA | 1.53 | 1 | 34164.48 | 1 | 73.86 | 1 | 1159.35 | 1 |
| WinoGrande | 57% | FedAvg+LoRA | 10.80 | 3 | 10764.61 | 3 | 546.09 | 3 | 3373.35 | 3 |
| | | FedProx+LoRA | 10.83 | 4 | 10764.61 | 3 | 547.75 | 4 | 3437.81 | 4 |
| | | HeteroLoRA | 2.46 | 2 | 7461.56 | 2 | 124.64 | 2 | 2520.73 | 2 |
| | | SplitLoRA | 0.18 | 1 | 5338.20 | 1 | 9.32 | 1 | 1133.06 | 1 |
| | 60% | FedAvg+LoRA | 39.91 | 3 | 39746.25 | 3 | 2018.83 | 3 | 3413.61 | 3 |
| | | FedProx+LoRA | 46.43 | 4 | 46370.62 | 4 | 2348.84 | 4 | 3478.84 | 4 |
| | | HeteroLoRA | 9.46 | 2 | 28602.66 | 2 | 478.32 | 2 | 2550.81 | 2 |
| | | SplitLoRA | 0.74 | 1 | 21352.80 | 1 | 37.30 | 1 | 1146.59 | 1 |
| | 62% | FedAvg+LoRA | 99.43 | 4 | 99365.62 | 2 | 5029.80 | 4 | 3431.93 | 3 |
| | | FedProx+LoRA | 99.39 | 3 | 99365.62 | 2 | 5027.48 | 3 | 3497.51 | 4 |
| | | HeteroLoRA | 34.68 | 2 | 105083.67 | 3 | 1754.18 | 2 | 2564.50 | 2 |
| | | SplitLoRA | 1.38 | 1 | 40036.50 | 1 | 69.84 | 1 | 1152.74 | 1 |
### B.1 Results of Qwen2.5-0.5B
Tables [10](https://arxiv.org/html/2605.08636#A2.T10)–[12](https://arxiv.org/html/2605.08636#A2.T12) report the complete Protocol B results of Qwen2.5-0.5B across all selected tasks and target accuracy levels. Overall, SplitLoRA consistently reaches the target accuracy with the lowest wall-clock time and energy consumption across the Verify, Choose, and Reason tasks, and it also achieves the lowest communication volume in most settings. This indicates that, although SplitLoRA is not always the strongest method in final quality under Protocol A, it is highly efficient at reaching intermediate target accuracies under realistic edge-system budgets. FedAvg+LoRA and FedProx+LoRA sometimes achieve competitive communication cost because they only exchange LoRA updates, but they usually require substantially longer wall-clock time and higher energy consumption to reach the same target accuracy. HeteroLoRA often reduces wall-clock time compared with FedAvg+LoRA and FedProx+LoRA, but its communication and energy costs can still be high, especially at higher target accuracy levels.
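How a single table entry is produced can be pictured as a first-hit scan over the per-round training log; a minimal sketch, assuming a hypothetical log format (the field names are ours, not the platform's):

```python
# Derive one Cost-to-Target entry: find the first round whose test accuracy
# reaches the target, then report cumulative wall-clock time, communication
# volume, and energy up to that round, plus the mean client peak memory so far.
def cost_to_target(round_log, target_acc):
    hours = mb = kj = 0.0
    peaks = []
    for r in round_log:                    # each r summarizes one round
        hours += r["hours"]; mb += r["mb"]; kj += r["kj"]
        peaks.append(r["peak_mb"])         # mean peak across that round's clients
        if r["acc"] >= target_acc:
            return {"hours": hours, "mb": mb, "kj": kj,
                    "peak_mb": sum(peaks) / len(peaks)}
    return None                            # target never reached under the budget

log = [{"hours": 0.5, "mb": 900.0, "kj": 20.0, "peak_mb": 1140.0, "acc": 58.0},
       {"hours": 0.5, "mb": 900.0, "kj": 21.0, "peak_mb": 1150.0, "acc": 63.5}]
print(cost_to_target(log, target_acc=62.0))
```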
Table 13: Complete Protocol B results of Gemma 3-270M on Verify tasks.

| Task | Target Acc. | Method | Wall-clock Time (h) | Rank | Comm. Volume (MB) | Rank | Energy (kJ) | Rank | Peak Memory (MB) | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BoolQ | 62% | FedAvg+LoRA | 21.88 | 3 | 10166.13 | 1 | 1137.31 | 3 | 1339.46 | 2 |
| | | FedProx+LoRA | 35.02 | 4 | 41229.32 | 3 | 1819.79 | 4 | 2324.80 | 4 |
| | | HeteroLoRA | 14.03 | 2 | 50899.22 | 4 | 729.35 | 2 | 1403.02 | 3 |
| | | SplitLoRA | 0.62 | 1 | 14977.36 | 2 | 31.98 | 1 | 1002.91 | 1 |
| | 63% | FedAvg+LoRA | 89.88 | 4 | 41794.10 | 2 | 4671.20 | 4 | 1356.61 | 2 |
| | | FedProx+LoRA | 50.36 | 3 | 59302.44 | 3 | 2617.11 | 3 | 2354.56 | 4 |
| | | HeteroLoRA | 16.37 | 2 | 59382.42 | 4 | 850.83 | 2 | 1420.98 | 3 |
| | | SplitLoRA | 0.71 | 1 | 17223.96 | 1 | 36.75 | 1 | 1015.75 | 1 |
| | 64% | FedAvg+LoRA | 104.47 | 4 | 48571.52 | 2 | 5429.43 | 4 | 1365.38 | 2 |
| | | FedProx+LoRA | 70.98 | 3 | 83588.20 | 4 | 3688.79 | 3 | 2369.79 | 4 |
| | | HeteroLoRA | 16.37 | 2 | 59382.42 | 3 | 850.83 | 2 | 1430.17 | 3 |
| | | SplitLoRA | 0.71 | 1 | 17223.96 | 1 | 36.75 | 1 | 1022.32 | 1 |
| QNLI | 61% | FedAvg+LoRA | 15.75 | 4 | 18637.91 | 3 | 822.83 | 4 | 2365.36 | 3 |
| | | FedProx+LoRA | 14.31 | 3 | 16943.55 | 2 | 747.48 | 3 | 2382.52 | 4 |
| | | HeteroLoRA | 4.09 | 2 | 14845.61 | 1 | 213.65 | 2 | 1450.77 | 2 |
| | | SplitLoRA | 1.15 | 1 | 22507.10 | 4 | 60.05 | 1 | 1018.20 | 1 |
| | 65% | FedAvg+LoRA | 31.02 | 3 | 36711.04 | 3 | 1620.23 | 3 | 2393.59 | 3 |
| | | FedProx+LoRA | 32.43 | 4 | 38405.39 | 4 | 1693.70 | 4 | 2410.95 | 4 |
| | | HeteroLoRA | 6.42 | 2 | 23328.81 | 1 | 335.27 | 2 | 1468.08 | 2 |
| | | SplitLoRA | 1.81 | 1 | 35477.29 | 2 | 94.73 | 1 | 1030.35 | 1 |
| | 69% | FedAvg+LoRA | 62.42 | 4 | 73986.86 | 4 | 3260.63 | 4 | 2406.44 | 3 |
| | | FedProx+LoRA | 45.35 | 3 | 53654.59 | 2 | 2368.79 | 3 | 2423.89 | 4 |
| | | HeteroLoRA | 17.72 | 2 | 64472.34 | 3 | 925.60 | 2 | 1475.96 | 2 |
| | | SplitLoRA | 2.07 | 1 | 40436.48 | 1 | 107.93 | 1 | 1035.88 | 1 |
Table 14: Complete Protocol B results of Gemma 3-270M on Reason tasks.

| Task | Target Acc. | Method | Wall-clock Time (h) | Rank | Comm. Volume (MB) | Rank | Energy (kJ) | Rank | Peak Memory (MB) | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ARC-E | 28% | FedAvg+LoRA | 29.95 | 3 | 35581.46 | 2 | 1331.65 | 3 | 2238.33 | 3 |
| | | FedProx+LoRA | 30.15 | 4 | 35581.46 | 2 | 1340.68 | 4 | 2274.42 | 4 |
| | | HeteroLoRA | 21.09 | 2 | 37326.09 | 3 | 938.00 | 2 | 1363.60 | 2 |
| | | SplitLoRA | 1.78 | 1 | 24795.95 | 1 | 79.11 | 1 | 957.83 | 1 |
| | 29% | FedAvg+LoRA | 32.80 | 3 | 38970.18 | 2 | 1458.55 | 3 | 2265.56 | 3 |
| | | FedProx+LoRA | 33.01 | 4 | 38970.18 | 2 | 1467.81 | 4 | 2302.09 | 4 |
| | | HeteroLoRA | 26.35 | 2 | 46657.62 | 3 | 1171.86 | 2 | 1380.19 | 2 |
| | | SplitLoRA | 1.86 | 1 | 25940.38 | 1 | 82.74 | 1 | 969.48 | 1 |
| | 30% | FedAvg+LoRA | 37.11 | 2 | 44053.24 | 2 | 1650.23 | 2 | 2274.11 | 3 |
| | | FedProx+LoRA | 37.33 | 3 | 44053.24 | 2 | 1660.18 | 3 | 2310.78 | 4 |
| | | HeteroLoRA | 39.01 | 4 | 69138.11 | 3 | 1734.87 | 4 | 1385.40 | 2 |
| | | SplitLoRA | 1.97 | 1 | 27466.29 | 1 | 87.61 | 1 | 973.14 | 1 |
| WinoGrande | 51% | FedAvg+LoRA | 4.30 | 3 | 5083.07 | 2 | 194.76 | 3 | 2327.35 | 3 |
| | | FedProx+LoRA | 21.11 | 4 | 24850.55 | 4 | 956.24 | 4 | 2330.59 | 4 |
| | | HeteroLoRA | 1.75 | 2 | 6362.40 | 3 | 79.40 | 2 | 1365.71 | 2 |
| | | SplitLoRA | 0.19 | 1 | 3433.29 | 1 | 8.73 | 1 | 961.43 | 1 |
| | 51% | FedAvg+LoRA | 4.78 | 3 | 5647.85 | 2 | 216.42 | 3 | 2355.12 | 3 |
| | | FedProx+LoRA | 23.03 | 4 | 27109.69 | 4 | 1043.38 | 4 | 2358.40 | 4 |
| | | HeteroLoRA | 2.22 | 2 | 8059.04 | 3 | 100.49 | 2 | 1382.01 | 2 |
| | | SplitLoRA | 0.19 | 1 | 3433.29 | 1 | 8.73 | 1 | 972.90 | 1 |
| | 51% | FedAvg+LoRA | 5.73 | 3 | 6777.42 | 2 | 259.77 | 3 | 2367.76 | 3 |
| | | FedProx+LoRA | 38.34 | 4 | 45182.81 | 4 | 1737.24 | 4 | 2371.06 | 4 |
| | | HeteroLoRA | 2.68 | 2 | 9755.68 | 3 | 121.50 | 2 | 1389.43 | 2 |
| | | SplitLoRA | 0.28 | 1 | 4959.19 | 1 | 12.62 | 1 | 978.12 | 1 |
### B.2 Results of Gemma 3-270M
Tables [13](https://arxiv.org/html/2605.08636#A2.T13)–[15](https://arxiv.org/html/2605.08636#A2.T15) report the complete Protocol B results of Gemma 3-270M across all selected tasks and target accuracy levels. Overall, SplitLoRA shows the strongest cost-to-target efficiency: it consistently reaches the target accuracy with the lowest wall-clock time, energy consumption, and peak memory across the Verify, Choose, and Reason tasks. This advantage is especially clear as the target accuracy increases, where FedAvg+LoRA and FedProx+LoRA often require substantially longer training time and much higher energy consumption to reach the same target. HeteroLoRA usually ranks between SplitLoRA and the FedAvg/FedProx baselines in wall-clock time and energy consumption, but its communication cost is less stable. FedAvg+LoRA and FedProx+LoRA occasionally achieve lower communication volume, particularly at some lower target levels, but this communication advantage does not translate into lower overall system cost given their longer training time and higher energy usage.
Table 15: Complete Protocol B results of Gemma 3-270M on Choose tasks.

| Task | Target Acc. | Method | Wall-clock Time (h) | Rank | Comm. Volume (MB) | Rank | Energy (kJ) | Rank | Peak Memory (MB) | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PIQA | 52% | FedAvg+LoRA | 54.78 | 4 | 64851.75 | 4 | 2536.17 | 4 | 2340.37 | 4 |
| | | FedProx+LoRA | 1.50 | 3 | 1127.86 | 1 | 69.32 | 3 | 2313.91 | 3 |
| | | HeteroLoRA | 1.17 | 2 | 6004.91 | 3 | 54.24 | 2 | 1342.21 | 2 |
| | | SplitLoRA | 0.18 | 1 | 3814.76 | 2 | 8.37 | 1 | 1006.46 | 1 |
| | 53% | FedAvg+LoRA | 56.68 | 4 | 67107.46 | 4 | 2624.33 | 4 | 2368.30 | 4 |
| | | FedProx+LoRA | 1.50 | 2 | 1127.86 | 1 | 69.32 | 2 | 2341.52 | 3 |
| | | HeteroLoRA | 3.74 | 3 | 19142.58 | 3 | 173.16 | 3 | 1358.23 | 2 |
| | | SplitLoRA | 0.36 | 1 | 7629.52 | 2 | 16.71 | 1 | 1018.47 | 1 |
| | 53% | FedAvg+LoRA | 57.64 | 3 | 68235.31 | 2 | 2668.47 | 3 | 2381.01 | 4 |
| | | FedProx+LoRA | 87.52 | 4 | 68799.24 | 3 | 4052.06 | 4 | 2354.09 | 3 |
| | | HeteroLoRA | 16.96 | 2 | 89194.59 | 4 | 785.27 | 2 | 1365.52 | 2 |
| | | SplitLoRA | 0.36 | 1 | 7629.52 | 1 | 16.71 | 1 | 1023.94 | 1 |
| HellaSwag | 25% | FedAvg+LoRA | 17.19 | 3 | 20332.27 | 2 | 794.60 | 3 | 2323.27 | 3 |
| | | FedProx+LoRA | 17.28 | 4 | 20332.27 | 2 | 798.81 | 4 | 2336.17 | 4 |
| | | HeteroLoRA | 7.90 | 2 | 28842.89 | 3 | 365.03 | 2 | 1369.23 | 2 |
| | | SplitLoRA | 2.43 | 1 | 13351.67 | 1 | 112.42 | 1 | 940.64 | 1 |
| | 25% | FedAvg+LoRA | 21.02 | 3 | 24850.55 | 1 | 971.35 | 3 | 2350.99 | 3 |
| | | FedProx+LoRA | 21.13 | 4 | 24850.55 | 1 | 976.53 | 4 | 2364.05 | 4 |
| | | HeteroLoRA | 18.12 | 2 | 66168.98 | 3 | 837.60 | 2 | 1385.57 | 2 |
| | | SplitLoRA | 5.21 | 1 | 28610.72 | 2 | 240.86 | 1 | 951.87 | 1 |
| | 25% | FedAvg+LoRA | 24.84 | 3 | 29368.83 | 1 | 1148.13 | 3 | 2363.61 | 3 |
| | | FedProx+LoRA | 24.93 | 4 | 29368.83 | 1 | 1152.17 | 4 | 2376.74 | 4 |
| | | HeteroLoRA | 18.82 | 2 | 68713.95 | 3 | 869.83 | 2 | 1393.01 | 2 |
| | | SplitLoRA | 10.36 | 1 | 56839.96 | 2 | 479.00 | 1 | 956.98 | 1 |
| SocialIQA | 38% | FedAvg+LoRA | 47.18 | 4 | 55348.95 | 4 | 2091.23 | 4 | 2317.00 | 4 |
| | | FedProx+LoRA | 44.22 | 3 | 51960.23 | 3 | 1959.83 | 3 | 2308.28 | 3 |
| | | HeteroLoRA | 11.17 | 2 | 40719.38 | 2 | 494.95 | 2 | 1336.75 | 2 |
| | | SplitLoRA | 1.46 | 1 | 20599.72 | 1 | 64.80 | 1 | 964.00 | 1 |
| | 39% | FedAvg+LoRA | 69.76 | 4 | 81893.85 | 4 | 3091.84 | 4 | 2342.12 | 4 |
| | | FedProx+LoRA | 53.82 | 3 | 63255.94 | 3 | 2385.60 | 3 | 2333.29 | 3 |
| | | HeteroLoRA | 13.38 | 2 | 48778.42 | 2 | 593.26 | 2 | 1351.24 | 2 |
| | | SplitLoRA | 2.03 | 1 | 28610.72 | 1 | 90.03 | 1 | 974.45 | 1 |
| | 40% | FedAvg+LoRA | 86.13 | 4 | 101096.54 | 4 | 3817.46 | 4 | 2355.84 | 4 |
| | | FedProx+LoRA | 67.74 | 3 | 79634.71 | 2 | 3002.48 | 3 | 2346.96 | 3 |
| | | HeteroLoRA | 21.99 | 2 | 80166.27 | 3 | 974.85 | 2 | 1359.15 | 2 |
| | | SplitLoRA | 3.76 | 1 | 53025.19 | 1 | 166.86 | 1 | 980.16 | 1 |
### B.3 Results of Gemma 3-1B
Tables [16](https://arxiv.org/html/2605.08636#A2.T16)–[18](https://arxiv.org/html/2605.08636#A2.T18) report the Protocol B results of Gemma 3-1B. Different from Qwen2.5-0.5B and Gemma 3-270M, FedAvg+LoRA, FedProx+LoRA, and HeteroLoRA cannot reach any target accuracy under the current edge-system budget because they trigger out-of-memory failures on most client devices. In contrast, SplitLoRA remains executable across all Verify, Choose, and Reason tasks. Its peak memory stays small, ranging from about 1313 MB to 1347 MB across all target levels, which is far below the memory footprint required by full client-side LoRA fine-tuning of Gemma 3-1B.
This result highlights the deployability advantage of SplitLoRA for larger backbone models. Since SplitLoRA places only the first hidden layer on the client side and offloads the remaining model computation to the server, increasing the backbone size does not proportionally increase the client-side memory footprint. Therefore, although SplitLoRA may not always achieve the best final quality on smaller models, it becomes the only feasible method when scaling to Gemma 3-1B under realistic edge memory constraints.
Table 16: Complete Protocol B results of Gemma 3-1B on Verify tasks. "-" indicates that the method is not executable under the system budget due to out-of-memory.

| Task | Target Acc. | Method | Wall-clock Time (h) | Rank | Comm. Volume (MB) | Rank | Energy (kJ) | Rank | Peak Memory (MB) | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BoolQ | 62% | FedAvg+LoRA | - | - | - | - | - | - | - | - |
| | | FedProx+LoRA | - | - | - | - | - | - | - | - |
| | | HeteroLoRA | - | - | - | - | - | - | - | - |
| | | SplitLoRA | 0.09 | 1 | 4025.52 | 1 | 2.83 | 1 | 1313.15 | 1 |
| | 63% | FedAvg+LoRA | - | - | - | - | - | - | - | - |
| | | FedProx+LoRA | - | - | - | - | - | - | - | - |
| | | HeteroLoRA | - | - | - | - | - | - | - | - |
| | | SplitLoRA | 1.42 | 1 | 60382.81 | 1 | 42.43 | 1 | 1329.96 | 1 |
| | 64% | FedAvg+LoRA | - | - | - | - | - | - | - | - |
| | | FedProx+LoRA | - | - | - | - | - | - | - | - |
| | | HeteroLoRA | - | - | - | - | - | - | - | - |
| | | SplitLoRA | 1.49 | 1 | 63401.95 | 1 | 44.58 | 1 | 1338.56 | 1 |
| QNLI | 57% | FedAvg+LoRA | - | - | - | - | - | - | - | - |
| | | FedProx+LoRA | - | - | - | - | - | - | - | - |
| | | HeteroLoRA | - | - | - | - | - | - | - | - |
| | | SplitLoRA | 0.15 | 1 | 6038.28 | 1 | 4.86 | 1 | 1323.72 | 1 |
| | 60% | FedAvg+LoRA | - | - | - | - | - | - | - | - |
| | | FedProx+LoRA | - | - | - | - | - | - | - | - |
| | | HeteroLoRA | - | - | - | - | - | - | - | - |
| | | SplitLoRA | 0.18 | 1 | 7044.66 | 1 | 5.67 | 1 | 1339.52 | 1 |
| | 63% | FedAvg+LoRA | - | - | - | - | - | - | - | - |
| | | FedProx+LoRA | - | - | - | - | - | - | - | - |
| | | HeteroLoRA | - | - | - | - | - | - | - | - |
| | | SplitLoRA | 0.20 | 1 | 8051.04 | 1 | 6.48 | 1 | 1346.70 | 1 |
Table 17: Complete Protocol B results of Gemma 3-1B on Reason tasks. "-" indicates that the method is not executable under the system budget due to out-of-memory.

| Task | Target Acc. | Method | Wall-clock Time (h) | Rank | Comm. Volume (MB) | Rank | Energy (kJ) | Rank | Peak Memory (MB) | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ARC-E | 27% | FedAvg+LoRA | - | - | - | - | - | - | - | - |
| | | FedProx+LoRA | - | - | - | - | - | - | - | - |
| | | HeteroLoRA | - | - | - | - | - | - | - | - |
| | | SplitLoRA | 0.15 | 1 | 6541.47 | 1 | 4.52 | 1 | 1316.12 | 1 |
| | 27% | FedAvg+LoRA | - | - | - | - | - | - | - | - |
| | | FedProx+LoRA | - | - | - | - | - | - | - | - |
| | | HeteroLoRA | - | - | - | - | - | - | - | - |
| | | SplitLoRA | 0.15 | 1 | 6541.47 | 1 | 4.52 | 1 | 1332.13 | 1 |
| | 28% | FedAvg+LoRA | - | - | - | - | - | - | - | - |
| | | FedProx+LoRA | - | - | - | - | - | - | - | - |
| | | HeteroLoRA | - | - | - | - | - | - | - | - |
| | | SplitLoRA | 0.19 | 1 | 8051.04 | 1 | 5.56 | 1 | 1337.16 | 1 |
| WinoGrande | 51% | FedAvg+LoRA | - | - | - | - | - | - | - | - |
| | | FedProx+LoRA | - | - | - | - | - | - | - | - |
| | | HeteroLoRA | - | - | - | - | - | - | - | - |
| | | SplitLoRA | 1.58 | 1 | 68433.85 | 1 | 47.09 | 1 | 1318.06 | 1 |
| | 51% | FedAvg+LoRA | - | - | - | - | - | - | - | - |
| | | FedProx+LoRA | - | - | - | - | - | - | - | - |
| | | HeteroLoRA | - | - | - | - | - | - | - | - |
| | | SplitLoRA | 1.68 | 1 | 72962.56 | 1 | 50.21 | 1 | 1333.79 | 1 |
| | 51% | FedAvg+LoRA | - | - | - | - | - | - | - | - |
| | | FedProx+LoRA | - | - | - | - | - | - | - | - |
| | | HeteroLoRA | - | - | - | - | - | - | - | - |
| | | SplitLoRA | 1.73 | 1 | 74975.32 | 1 | 51.61 | 1 | 1340.95 | 1 |
Table 18: Complete Protocol B results of Gemma 3-1B on Choose tasks. "-" indicates that the method is not executable under the system budget due to out-of-memory. FedAvg+LoRA, FedProx+LoRA, and HeteroLoRA are not executable for any task or target accuracy, so all of their entries are "-"; only the SplitLoRA rows are listed.

| Task | Target Accuracy | Method | Wall-clock time (hour) | Rank | Communication volume (MB) | Rank | Energy consumption (kJ) | Rank | Peak memory (MB) | Rank |
|---|---|---|---|---|---|---|---|---|---|---|
| PIQA | 53% | SplitLoRA | 0.87 | 1 | 37236.07 | 1 | 25.74 | 1 | 1316.86 | 1 |
| PIQA | 53% | SplitLoRA | 0.87 | 1 | 37236.07 | 1 | 25.74 | 1 | 1332.58 | 1 |
| PIQA | 54% | SplitLoRA | 1.15 | 1 | 49312.63 | 1 | 34.10 | 1 | 1339.73 | 1 |
| HellaSwag | 25% | SplitLoRA | 0.13 | 1 | 5535.09 | 1 | 4.00 | 1 | 1319.53 | 1 |
| HellaSwag | 25% | SplitLoRA | 0.13 | 1 | 5535.09 | 1 | 4.00 | 1 | 1335.28 | 1 |
| HellaSwag | 25% | SplitLoRA | 1.31 | 1 | 55854.10 | 1 | 40.27 | 1 | 1342.45 | 1 |
| SocialIQA | 40% | SplitLoRA | 0.84 | 1 | 34216.93 | 1 | 25.27 | 1 | 1321.08 | 1 |
| SocialIQA | 42% | SplitLoRA | 0.98 | 1 | 40255.21 | 1 | 29.70 | 1 | 1335.40 | 1 |
| SocialIQA | 45% | SplitLoRA | 1.36 | 1 | 55854.10 | 1 | 41.19 | 1 | 1343.23 | 1 |
Figure 5: Overall ranking of methods under Protocol B.
### B.4 Overall ranking of methods under Protocol B
The overall ranking under Protocol B shows that SplitLoRA is the most efficient and deployable method across all three model scales. This result should be read together with the quality results of Protocol A: SplitLoRA is not always the method with the highest final accuracy on the smaller models, where FedProx+LoRA often achieves the best accuracy thanks to its proximal regularization of local LoRA updates. Protocol B, however, focuses on the system cost required to reach target accuracy levels, and under this cost-to-target criterion SplitLoRA consistently achieves the best average rank for Qwen2.5-0.5B and Gemma 3-270M. It reaches the target accuracies with substantially lower wall-clock time, energy consumption, and peak memory, exhibiting a better efficiency–quality trade-off under realistic edge-system constraints.
HeteroLoRA ranks second overall because its heterogeneous LoRA design adapts to different device capabilities. FedAvg+LoRA ranks third, while FedProx+LoRA ranks last under Protocol B. This shows that better final accuracy does not necessarily imply better cost-to-target efficiency: FedProx+LoRA can achieve strong accuracy, but it often needs longer training time and higher energy consumption to reach the same target. For Gemma 3-1B, only SplitLoRA is executable; the other methods fail with out-of-memory errors. This further highlights the key advantage of SplitLoRA: even when it is not the most accurate method on smaller models, it offers the best efficiency and deployability once realistic edge constraints and larger backbone models are considered.
Therefore, Figure [5](https://arxiv.org/html/2605.08636#A2.F5) highlights a key Protocol B insight: methods with higher final accuracy do not necessarily provide better system efficiency, and their relative efficiency can change across target accuracy levels and system metrics.
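The cost-to-target logic behind this ranking can be made concrete with a short sketch. The snippet below is a minimal illustration, assuming hypothetical per-round logs of test accuracy and cumulative system costs; it is not EdgeFlowerTune's actual implementation, and all names and numbers in it are placeholders.

```python
# Illustrative sketch of Protocol B's cost-to-target ranking (not the
# benchmark's actual code). Each method logs per-round test accuracy and
# cumulative system costs; we take the costs at the first round that
# reaches the target, then rank methods per metric (lower cost = rank 1).
from typing import Dict, List, Optional

def cost_to_target(acc: List[float], costs: Dict[str, List[float]],
                   target: float) -> Optional[Dict[str, float]]:
    """Return cumulative costs at the first round whose accuracy >= target,
    or None if the target is never reached (e.g., infeasible under budget)."""
    for t, a in enumerate(acc):
        if a >= target:
            return {name: series[t] for name, series in costs.items()}
    return None

def rank_methods(results: Dict[str, Optional[Dict[str, float]]],
                 metric: str) -> Dict[str, Optional[int]]:
    """Rank feasible methods by one cost metric; infeasible ones get None."""
    feasible = {m: r[metric] for m, r in results.items() if r is not None}
    order = sorted(feasible, key=feasible.get)
    ranks = {m: order.index(m) + 1 for m in feasible}
    return {m: ranks.get(m) for m in results}

# Toy usage with made-up numbers:
logs = {
    "SplitLoRA": cost_to_target([0.55, 0.61, 0.63],
                                {"hours": [0.03, 0.06, 0.09]}, 0.62),
    "FedAvg+LoRA": None,  # "-" in the tables: out-of-memory under the budget
}
print(rank_methods(logs, "hours"))  # {'SplitLoRA': 1, 'FedAvg+LoRA': None}
```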
## Appendix C Complete benchmark results of Protocol C
This section provides the complete benchmark results for Protocol C, i.e., Robustness. We use the same basic experimental setup as Protocols A and B, including the same client pool, device types, data partitioning strategy, communication-round configuration, backbone models, datasets, and federated fine-tuning methods. The key difference is that Protocol C introduces system perturbations to evaluate how robust each method is under realistic edge deployment conditions.
Here, we show the results of evaluating robustness under dynamic communication fluctuation. To simulate unstable wireless connectivity in real edge deployments, we dynamically change the Wi-Fi bandwidth every 1/3 hour, sequentially setting it to the full bandwidth, 1/2 of the bandwidth, and 1/4 of the bandwidth. For each method, we report both model quality and system metrics under this fluctuating communication setting.
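For concreteness, this schedule can be expressed as a simple function of elapsed time. The sketch below is our illustration of the described schedule, not the platform's implementation; `FULL_BANDWIDTH_MBPS` is a placeholder value, since the actual cap depends on the testbed.

```python
# Illustrative sketch of the dynamic bandwidth schedule described above:
# every 1/3 hour the Wi-Fi bandwidth cycles through full, 1/2, and 1/4.
FULL_BANDWIDTH_MBPS = 100.0   # placeholder cap, not the testbed's value
PHASE_HOURS = 1.0 / 3.0
FACTORS = [1.0, 0.5, 0.25]    # full -> 1/2 -> 1/4, then repeat

def bandwidth_at(elapsed_hours: float) -> float:
    """Bandwidth cap (Mbps) in effect at a given elapsed training time."""
    phase = int(elapsed_hours / PHASE_HOURS) % len(FACTORS)
    return FULL_BANDWIDTH_MBPS * FACTORS[phase]

for h in [0.0, 0.4, 0.7, 1.1]:
    print(f"t={h:.1f} h -> {bandwidth_at(h):.1f} Mbps")
# t=0.0 -> 100.0, t=0.4 -> 50.0, t=0.7 -> 25.0, t=1.1 -> 100.0
```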
### C.1 Results of Qwen2.5-0.5B
Table [19](https://arxiv.org/html/2605.08636#A3.T19) reports the robustness results of Qwen2.5-0.5B under dynamic communication fluctuation. Across all seven tasks, the testing accuracy remains unchanged for all methods, indicating that bandwidth fluctuation mainly affects system cost rather than final model quality. The communication volume and peak memory also remain unchanged, because the transmitted payload size and the client-side model footprint are not altered by bandwidth variation. However, wall-clock time and energy consumption increase significantly under fluctuating bandwidth. Among the four methods, SplitLoRA consistently shows the smallest increase in wall-clock time and energy consumption, achieving the best robustness rank across all tasks. In contrast, FedAvg+LoRA, FedProx+LoRA, and HeteroLoRA suffer much larger increases in time and energy, because their training requires longer client-side execution and is therefore more exposed to bandwidth degradation.
Table 19: Complete Protocol C results of Qwen2.5-0.5B under communication fluctuation. Values in parentheses denote changes relative to the non-fluctuation setting. Ranks are computed by the absolute value of the change; a smaller change indicates better robustness.

| Task Type | Task | Method | Testing Accuracy | Rank | Wall-clock time (h) | Rank | Communication volume (MB) | Rank | Energy consumption (kJ) | Rank | Peak memory (MB) | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Verify | BoolQ | SplitLoRA | 78.69% (+0.00%) | 1 | 6.14 (+2.50) | 1 | 99824.33 (+0.00) | 1 | 277.44 (+112.97) | 1 | 1151.97 (+0.00) | 1 |
| | | HeteroLoRA | 79.02% (+0.00%) | 1 | 141.91 (+59.12) | 2 | 62801.48 (+0.00) | 1 | 6412.58 (+2671.55) | 2 | 2519.11 (+0.00) | 1 |
| | | FedAvg+LoRA | 79.39% (+0.00%) | 1 | 168.13 (+70.00) | 3 | 97709.53 (+0.00) | 1 | 7597.85 (+3163.24) | 3 | 3481.10 (+0.00) | 1 |
| | | FedProx+LoRA | 79.33% (+0.00%) | 1 | 169.31 (+70.42) | 4 | 98537.58 (+0.00) | 1 | 7651.04 (+3182.07) | 4 | 3497.98 (+0.00) | 1 |
| | QNLI | SplitLoRA | 65.20% (+0.00%) | 1 | 6.28 (+2.50) | 1 | 82208.27 (+0.00) | 1 | 292.62 (+116.43) | 1 | 1159.46 (+0.00) | 1 |
| | | HeteroLoRA | 64.29% (+0.00%) | 1 | 275.53 (+114.73) | 3 | 121250.39 (+0.00) | 1 | 12832.37 (+5343.39) | 3 | 2525.05 (+0.00) | 1 |
| | | FedAvg+LoRA | 64.84% (+0.00%) | 1 | 225.03 (+93.75) | 2 | 130831.41 (+0.00) | 1 | 10480.39 (+4366.28) | 2 | 3394.81 (+0.00) | 1 |
| | | FedProx+LoRA | 64.96% (+0.00%) | 1 | 279.66 (+116.50) | 4 | 163125.23 (+0.00) | 1 | 13024.89 (+5425.68) | 4 | 3517.76 (+0.00) | 1 |
| Choose | PIQA | SplitLoRA | 64.47% (+0.00%) | 1 | 1.25 (+0.42) | 1 | 21886.62 (+0.00) | 1 | 57.93 (+19.33) | 1 | 1144.44 (+0.00) | 1 |
| | | HeteroLoRA | 66.05% (+0.00%) | 1 | 106.81 (+44.49) | 4 | 47256.56 (+0.00) | 1 | 4955.24 (+2063.96) | 4 | 2545.88 (+0.00) | 1 |
| | | FedAvg+LoRA | 65.29% (+0.00%) | 1 | 93.03 (+38.75) | 3 | 53823.05 (+0.00) | 1 | 4316.03 (+1797.67) | 3 | 3452.15 (+0.00) | 1 |
| | | FedProx+LoRA | 65.29% (+0.00%) | 1 | 61.70 (+25.68) | 2 | 35606.02 (+0.00) | 1 | 2862.36 (+1191.49) | 2 | 3418.02 (+0.00) | 1 |
| | HellaSwag | SplitLoRA | 33.67% (+0.00%) | 1 | 8.03 (+3.33) | 1 | 106763.99 (+0.00) | 1 | 390.22 (+162.00) | 1 | 1156.66 (+0.00) | 1 |
| | | HeteroLoRA | 33.18% (+0.00%) | 1 | 274.20 (+114.17) | 3 | 121250.39 (+0.00) | 1 | 13326.07 (+5548.47) | 3 | 2602.45 (+0.00) | 1 |
| | | FedAvg+LoRA | 33.84% (+0.00%) | 1 | 271.04 (+112.92) | 2 | 157328.91 (+0.00) | 1 | 13172.27 (+5487.72) | 2 | 3387.78 (+0.00) | 1 |
| | | FedProx+LoRA | 33.84% (+0.00%) | 1 | 271.23 (+112.92) | 2 | 157328.91 (+0.00) | 1 | 13181.80 (+5487.72) | 2 | 3559.35 (+0.00) | 1 |
| | SocialIQA | SplitLoRA | 67.45% (+0.00%) | 1 | 5.17 (+2.08) | 1 | 71531.87 (+0.00) | 1 | 253.07 (+101.89) | 1 | 1152.10 (+0.00) | 1 |
| | | HeteroLoRA | 67.45% (+0.00%) | 1 | 277.94 (+115.80) | 4 | 123737.58 (+0.00) | 1 | 13593.57 (+5663.76) | 4 | 2571.94 (+0.00) | 1 |
| | | FedAvg+LoRA | 68.42% (+0.00%) | 1 | 274.18 (+114.17) | 3 | 158985.00 (+0.00) | 1 | 13409.35 (+5583.64) | 3 | 3431.24 (+0.00) | 1 |
| | | FedProx+LoRA | 68.42% (+0.00%) | 1 | 272.12 (+113.33) | 2 | 158985.00 (+0.00) | 1 | 13308.80 (+5542.88) | 2 | 3474.71 (+0.00) | 1 |
| Reason | ARC-E | SplitLoRA | 77.37% (+0.00%) | 1 | 3.72 (+1.52) | 1 | 49111.44 (+0.00) | 1 | 179.89 (+73.81) | 1 | 1147.42 (+0.00) | 1 |
| | | HeteroLoRA | 77.54% (+0.00%) | 1 | 198.11 (+82.50) | 4 | 87673.36 (+0.00) | 1 | 9592.96 (+3994.92) | 4 | 2554.31 (+0.00) | 1 |
| | | FedAvg+LoRA | 79.47% (+0.00%) | 1 | 156.21 (+65.00) | 3 | 90257.11 (+0.00) | 1 | 7564.40 (+3147.51) | 3 | 3422.95 (+0.00) | 1 |
| | | FedProx+LoRA | 79.47% (+0.00%) | 1 | 155.23 (+64.58) | 2 | 90257.11 (+0.00) | 1 | 7516.90 (+3127.34) | 2 | 3460.78 (+0.00) | 1 |
| | WinoGrande | SplitLoRA | 62.43% (+0.00%) | 1 | 2.21 (+0.83) | 1 | 40036.50 (+0.00) | 1 | 111.99 (+42.15) | 1 | 1139.52 (+0.00) | 1 |
| | | HeteroLoRA | 61.17% (+0.00%) | 1 | 250.28 (+104.17) | 3 | 110679.84 (+0.00) | 1 | 12660.52 (+5269.37) | 3 | 2535.09 (+0.00) | 1 |
| | | FedAvg+LoRA | 63.46% (+0.00%) | 1 | 224.22 (+93.33) | 2 | 130831.41 (+0.00) | 1 | 11342.17 (+4721.35) | 2 | 3392.58 (+0.00) | 1 |
| | | FedProx+LoRA | 63.61% (+0.00%) | 1 | 266.85 (+111.18) | 4 | 155672.81 (+0.00) | 1 | 13498.93 (+5623.93) | 4 | 3457.41 (+0.00) | 1 |
### C.2 Results of Gemma 3-270M
Table [20](https://arxiv.org/html/2605.08636#A3.T20) shows a similar trend for Gemma 3-270M. The final testing accuracy is stable under communication fluctuation, and both communication volume and peak memory remain nearly unchanged across methods. This confirms that the perturbation mainly changes the effective communication delay rather than the amount of transmitted data or the memory requirement. Nevertheless, the impact on wall-clock time and energy consumption is method-dependent. SplitLoRA again shows the smallest perturbation across almost all tasks, demonstrating strong robustness to unstable bandwidth.
Table 20: Complete Protocol C results of Gemma 3-270M under communication fluctuation. Values in parentheses denote changes relative to the non-fluctuation setting. Ranks are computed by the absolute value of the change; a smaller change indicates better robustness.

| Task Type | Task | Method | Testing Accuracy | Rank | Wall-clock time (h) | Rank | Communication volume (MB) | Rank | Energy consumption (kJ) | Rank | Peak memory (MB) | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Verify | BoolQ | SplitLoRA | 62.72% (+0.00%) | 1 | 4.19 (+1.67) | 1 | 61407.17 (+0.00) | 1 | 217.62 (+86.62) | 1 | 1010.90 (+0.00) | 1 |
| | | HeteroLoRA | 62.20% (+0.00%) | 1 | 112.15 (+46.67) | 2 | 59382.42 (+0.00) | 1 | 5828.63 (+2425.33) | 2 | 1414.19 (+0.00) | 1 |
| | | FedAvg+LoRA | 62.94% (+0.00%) | 1 | 372.85 (+155.34) | 4 | 101096.54 (+0.00) | 1 | 19377.49 (+8073.30) | 4 | 1350.13 (+0.00) | 1 |
| | | FedProx+LoRA | 64.53% (+0.00%) | 1 | 128.15 (+53.33) | 3 | 88106.48 (+0.00) | 1 | 6660.06 (+2771.80) | 3 | 2343.31 (+0.00) | 1 |
| | QNLI | SplitLoRA | 71.46% (+0.00%) | 1 | 3.34 (+1.26) | 1 | 40817.95 (+0.00) | 1 | 174.60 (+65.66) | 1 | 1024.00 (+0.00) | 1 |
| | | HeteroLoRA | 63.87% (+0.00%) | 1 | 151.79 (+63.23) | 3 | 80590.43 (+0.00) | 1 | 7928.40 (+3302.58) | 3 | 1459.04 (+0.00) | 1 |
| | | FedAvg+LoRA | 68.26% (+0.00%) | 1 | 126.44 (+52.58) | 2 | 87541.70 (+0.00) | 1 | 6604.48 (+2746.51) | 2 | 2378.84 (+0.00) | 1 |
| | | FedProx+LoRA | 66.98% (+0.00%) | 1 | 157.72 (+65.69) | 4 | 109003.54 (+0.00) | 1 | 8238.39 (+3431.46) | 4 | 2396.10 (+0.00) | 1 |
| Choose | PIQA | SplitLoRA | 53.32% (+0.00%) | 1 | 0.44 (+0.08) | 1 | 7629.52 (+0.00) | 1 | 20.54 (+3.83) | 1 | 1012.20 (+0.00) | 1 |
| | | HeteroLoRA | 52.18% (+0.00%) | 1 | 123.32 (+51.25) | 3 | 95394.95 (+0.00) | 1 | 5709.59 (+2372.77) | 3 | 1349.86 (+0.00) | 1 |
| | | FedAvg+LoRA | 53.43% (+0.00%) | 1 | 98.77 (+41.14) | 2 | 68235.31 (+0.00) | 1 | 4573.02 (+1904.55) | 2 | 2353.70 (+0.00) | 1 |
| | | FedProx+LoRA | 51.69% (+0.00%) | 1 | 150.02 (+62.50) | 4 | 68799.24 (+0.00) | 1 | 6945.68 (+2893.62) | 4 | 2327.10 (+0.00) | 1 |
| | HellaSwag | SplitLoRA | 24.88% (+0.00%) | 1 | 22.10 (+9.17) | 1 | 70954.58 (+0.00) | 1 | 1021.62 (+423.66) | 1 | 946.00 (+0.00) | 1 |
| | | HeteroLoRA | 24.93% (+0.00%) | 1 | 157.70 (+65.68) | 4 | 83983.71 (+0.00) | 1 | 7288.49 (+3035.72) | 4 | 1377.03 (+0.00) | 1 |
| | | FedAvg+LoRA | 24.84% (+0.00%) | 1 | 61.26 (+25.42) | 2 | 42358.89 (+0.00) | 1 | 2831.30 (+1174.69) | 2 | 2336.51 (+0.00) | 1 |
| | | FedProx+LoRA | 24.84% (+0.00%) | 1 | 61.34 (+25.42) | 3 | 42358.89 (+0.00) | 1 | 2835.02 (+1174.96) | 3 | 2349.48 (+0.00) | 1 |
| | SocialIQA | SplitLoRA | 40.84% (+0.00%) | 1 | 9.14 (+3.75) | 1 | 75913.77 (+0.00) | 1 | 405.15 (+166.21) | 1 | 967.49 (+0.00) | 1 |
| | | HeteroLoRA | 38.79% (+0.00%) | 1 | 153.14 (+63.75) | 2 | 81438.75 (+0.00) | 1 | 6787.49 (+2825.62) | 2 | 1341.58 (+0.00) | 1 |
| | | FedAvg+LoRA | 40.94% (+0.00%) | 1 | 155.04 (+64.58) | 3 | 106179.61 (+0.00) | 1 | 6871.89 (+2862.56) | 3 | 2325.37 (+0.00) | 1 |
| | | FedProx+LoRA | 40.38% (+0.00%) | 1 | 163.02 (+67.92) | 4 | 111827.46 (+0.00) | 1 | 7225.40 (+3010.30) | 4 | 2316.62 (+0.00) | 1 |
| Reason | ARC-E | SplitLoRA | 29.30% (+0.00%) | 1 | 3.43 (+1.32) | 1 | 29373.67 (+0.00) | 1 | 152.54 (+58.82) | 1 | 963.12 (+0.00) | 1 |
| | | HeteroLoRA | 30.00% (+0.00%) | 1 | 270.77 (+112.80) | 4 | 69986.43 (+0.00) | 1 | 12040.74 (+5016.12) | 4 | 1371.14 (+0.00) | 1 |
| | | FedAvg+LoRA | 30.18% (+0.00%) | 1 | 63.44 (+26.33) | 2 | 44053.24 (+0.00) | 1 | 2821.07 (+1170.85) | 2 | 2250.70 (+0.00) | 1 |
| | | FedProx+LoRA | 30.18% (+0.00%) | 1 | 64.00 (+26.67) | 3 | 44053.24 (+0.00) | 1 | 2846.01 (+1185.83) | 3 | 2286.99 (+0.00) | 1 |
| | WinoGrande | SplitLoRA | 50.43% (+0.00%) | 1 | 0.28 (+0.00) | 1 | 4959.19 (+0.00) | 1 | 12.62 (+0.00) | 1 | 966.91 (+0.00) | 1 |
| | | HeteroLoRA | 50.51% (+0.00%) | 1 | 18.23 (+7.50) | 3 | 9755.68 (+0.00) | 1 | 825.82 (+339.80) | 3 | 1373.49 (+0.00) | 1 |
| | | FedAvg+LoRA | 50.51% (+0.00%) | 1 | 9.80 (+4.07) | 2 | 6777.42 (+0.00) | 1 | 444.02 (+184.26) | 2 | 2340.61 (+0.00) | 1 |
| | | FedProx+LoRA | 51.30% (+0.00%) | 1 | 65.69 (+27.34) | 4 | 45182.81 (+0.00) | 1 | 2976.11 (+1238.87) | 4 | 2343.87 (+0.00) | 1 |
### C.3 Results of Gemma 3-1B
Unlike the smaller models, Gemma 3-1B can only be fine-tuned with SplitLoRA under the current edge-system budget; FedAvg+LoRA, FedProx+LoRA, and HeteroLoRA remain infeasible due to out-of-memory failures. Under communication fluctuation, SplitLoRA preserves the same testing accuracy across all tasks, and its communication volume and peak memory remain unchanged. Its peak memory stays around 1.32–1.33 GB, showing that the client-side memory footprint remains small even for the larger Gemma 3-1B backbone. Although wall-clock time and energy consumption increase under fluctuating bandwidth, SplitLoRA remains deployable and stable across tasks.
Table 21: Complete Protocol C results of Gemma 3-1B under communication fluctuation. Values in parentheses denote changes relative to the non-fluctuation setting. "-" indicates that the method is not executable under the system budget due to out-of-memory. Ranks are computed by the absolute value of the change; a smaller change indicates better robustness. HeteroLoRA, FedAvg+LoRA, and FedProx+LoRA are not executable for any task, so all of their entries are "-"; only the SplitLoRA rows are listed.

| Task Type | Task | Method | Testing Accuracy | Rank | Wall-clock time (h) | Rank | Communication volume (MB) | Rank | Energy consumption (kJ) | Rank | Peak memory (MB) | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Verify | BoolQ | SplitLoRA | 64.25% (+0.00%) | 1 | 2.32 (+0.83) | 1 | 63401.95 (+0.00) | 1 | 69.52 (+24.93) | 1 | 1323.60 (+0.00) | 1 |
| | QNLI | SplitLoRA | 64.52% (+0.00%) | 1 | 3.13 (+1.25) | 1 | 75478.51 (+0.00) | 1 | 100.92 (+40.28) | 1 | 1331.26 (+0.00) | 1 |
| Choose | PIQA | SplitLoRA | 53.81% (+0.00%) | 1 | 3.80 (+1.57) | 1 | 95606.11 (+0.00) | 1 | 112.52 (+46.40) | 1 | 1324.37 (+0.00) | 1 |
| | HellaSwag | SplitLoRA | 25.33% (+0.00%) | 1 | 2.15 (+0.83) | 1 | 55854.10 (+0.00) | 1 | 65.81 (+25.54) | 1 | 1327.05 (+0.00) | 1 |
| | SocialIQA | SplitLoRA | 45.70% (+0.00%) | 1 | 2.81 (+1.16) | 1 | 67930.66 (+0.00) | 1 | 85.03 (+34.96) | 1 | 1325.86 (+0.00) | 1 |
| Reason | ARC-E | SplitLoRA | 28.07% (+0.00%) | 1 | 0.73 (+0.28) | 1 | 19121.22 (+0.00) | 1 | 21.46 (+8.27) | 1 | 1323.40 (+0.00) | 1 |
| | WinoGrande | SplitLoRA | 51.62% (+0.00%) | 1 | 3.09 (+1.25) | 1 | 80007.22 (+0.00) | 1 | 92.40 (+37.32) | 1 | 1325.58 (+0.00) | 1 |
### C.4 Overall ranking of methods under Protocol C
Figure [6](https://arxiv.org/html/2605.08636#A3.F6) summarizes the overall robustness ranking of the methods under Protocol C. Unlike Protocol B, which ranks methods by the absolute system cost to reach target accuracy, Protocol C ranks methods by the magnitude of performance variation under system perturbation, where a smaller change indicates stronger robustness. Overall, SplitLoRA achieves the best average rank across all three model scales, showing that its split architecture is less sensitive to communication fluctuation and can maintain stable testing accuracy, wall-clock time, energy consumption, and memory footprint under unstable edge conditions. For Qwen2.5-0.5B and Gemma 3-270M, FedAvg+LoRA ranks second overall, while FedProx+LoRA and HeteroLoRA show larger performance variations depending on the model scale and task. This indicates that methods with strong final accuracy or low cost under stable settings do not necessarily remain robust under system perturbations.
Therefore, Figure [6](https://arxiv.org/html/2605.08636#A3.F6) highlights a key Protocol C insight: edge-system perturbations affect different methods differently, leading to method-dependent robustness in both model quality and system efficiency.
Figure 6: Overall ranking of methods under Protocol C.
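The robustness-ranking rule behind Figure 6 and Tables 19–21 can be sketched in a few lines. This is a schematic reconstruction of the stated rule (rank by the absolute change, smaller is better, with ties sharing a rank), not the benchmark's actual code, and the numbers below are made up.

```python
# Illustrative sketch of Protocol C's robustness ranking: for each metric,
# methods are ranked by the absolute change between the perturbed and
# unperturbed runs; a smaller |change| means stronger robustness.
from typing import Dict

def robustness_ranks(baseline: Dict[str, float],
                     perturbed: Dict[str, float]) -> Dict[str, int]:
    """Rank methods by |perturbed - baseline|, ascending; ties share a rank."""
    delta = {m: abs(perturbed[m] - baseline[m]) for m in baseline}
    ordered = sorted(set(delta.values()))       # dense ranking over changes
    return {m: ordered.index(d) + 1 for m, d in delta.items()}

# Toy usage with made-up wall-clock times (hours):
base = {"SplitLoRA": 3.6, "FedAvg+LoRA": 98.1, "FedProx+LoRA": 98.9}
fluct = {"SplitLoRA": 6.1, "FedAvg+LoRA": 168.1, "FedProx+LoRA": 169.3}
print(robustness_ranks(base, fluct))
# {'SplitLoRA': 1, 'FedAvg+LoRA': 2, 'FedProx+LoRA': 3}
```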
## Appendix D Discussion on overall ranking across methods
Figure 7: Overall ranking across methods. For each method, the rank under each protocol–model setting is first obtained from the corresponding overall-ranking table and then averaged to summarize its cross-protocol performance. In the radar chart, each vertex corresponds to one protocol–model pair: A-Qwen, A-G270, B-Qwen, B-G270, C-Qwen, and C-G270, where A/B/C denote Protocols A/B/C, Qwen denotes Qwen2.5-0.5B, and G270 denotes Gemma 3-270M. The plotted value at each vertex is transformed from the average rank as 4 - Avg. Rank, so a larger radius indicates a better overall ranking. Thus, methods with better ranks form larger hexagons. Gemma 3-1B is not included in the hexagon axes because only SplitLoRA is executable under our system budget, while the other methods encounter out-of-memory errors.

Figure [7](https://arxiv.org/html/2605.08636#A4.F7) provides a cross-protocol comparison of the four evaluated federated LLM fine-tuning methods.
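The caption's rank-to-radius transformation is straightforward to reproduce. The sketch below is illustrative only; the per-vertex average ranks it uses are placeholders, not the paper's measured values.

```python
# Illustrative computation of the radar values in Figure 7. Each vertex is
# one protocol-model pair; the plotted radius is 4 - avg_rank, so with four
# methods the best possible rank (1) maps to radius 3 and the worst (4) to 0.
VERTICES = ["A-Qwen", "A-G270", "B-Qwen", "B-G270", "C-Qwen", "C-G270"]

avg_rank = {  # placeholder average ranks, not the paper's values
    "SplitLoRA":    [2.0, 2.1, 1.0, 1.2, 1.0, 1.1],
    "FedProx+LoRA": [1.2, 1.4, 3.6, 3.5, 2.8, 2.9],
}

for method, ranks in avg_rank.items():
    radii = [round(4.0 - r, 2) for r in ranks]
    # A larger hexagon (larger radii at the vertices) means a better ranking.
    print(method, dict(zip(VERTICES, radii)))
```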
The results reveal clear trade-offs among the methods. FedProx+LoRA achieves strong ranks under Protocol A because its proximal regularizer stabilizes local LoRA updates and improves final quality. However, its advantage in final accuracy does not translate into system efficiency: under Protocol B, FedProx+LoRA shows much lower radar values, indicating that it often requires longer wall-clock time and higher energy consumption to reach the same target accuracy. FedAvg+LoRA shows a similar but slightly more balanced pattern: it is competitive in final quality and relatively stable under communication fluctuation, but it is not the most efficient method when cost-to-target is considered.
HeteroLoRA shows moderate performance under Protocols B and C, but its Protocol A ranking is relatively weak. This is consistent with the fact that heterogeneous LoRA aggregation introduces additional approximation noise, especially when adapters with different ranks are aligned or padded before aggregation, as the sketch below illustrates. As a result, HeteroLoRA improves adaptability to heterogeneous clients, but this does not always translate into the best model quality or the lowest system cost.
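To make the source of this approximation noise concrete, the following sketch shows one common way to aggregate LoRA adapters of different ranks by zero-padding them to the maximum rank. It is a generic illustration, not necessarily HeteroLoRA's exact scheme, and all dimensions here are arbitrary.

```python
# Illustrative zero-padding aggregation of heterogeneous LoRA adapters
# (a generic scheme; not necessarily HeteroLoRA's exact rule). Client i
# holds A_i in R^{r_i x d} and B_i in R^{k x r_i}; padding all adapters to
# r_max lets the server average them elementwise, but the product of the
# averaged factors differs from the average of the per-client updates
# B_i @ A_i, which is one source of aggregation noise.
import numpy as np

def pad_adapter(A: np.ndarray, B: np.ndarray, r_max: int):
    """Zero-pad an (A, B) pair from rank r up to rank r_max."""
    r, d = A.shape
    k = B.shape[0]
    A_pad = np.zeros((r_max, d)); A_pad[:r] = A
    B_pad = np.zeros((k, r_max)); B_pad[:, :r] = B
    return A_pad, B_pad

def aggregate(clients, weights):
    """Weighted average of zero-padded adapters (FedAvg-style)."""
    r_max = max(A.shape[0] for A, _ in clients)
    padded = [pad_adapter(A, B, r_max) for A, B in clients]
    A_agg = sum(w * A for (A, _), w in zip(padded, weights))
    B_agg = sum(w * B for (_, B), w in zip(padded, weights))
    return A_agg, B_agg

rng = np.random.default_rng(0)
d, k = 8, 8
clients = [(rng.normal(size=(2, d)), rng.normal(size=(k, 2))),   # rank 2
           (rng.normal(size=(4, d)), rng.normal(size=(k, 4)))]   # rank 4
A_agg, B_agg = aggregate(clients, weights=[0.5, 0.5])

# Compare the update implied by the averaged factors with the average of
# the per-client updates; the gap is the approximation noise.
avg_of_products = 0.5 * (clients[0][1] @ clients[0][0]
                         + clients[1][1] @ clients[1][0])
print(np.linalg.norm(B_agg @ A_agg - avg_of_products))  # > 0: mismatch
```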
SplitLoRA shows the most balanced and deployment-friendly behavior across protocols. Although it is not always the best method in final accuracy under Protocol A, it achieves the strongest ranks under Protocols B and C. This indicates that SplitLoRA is particularly effective when the evaluation objective shifts from accuracy-only performance to practical edge-deployment constraints. Moreover, for Gemma 3-1B, SplitLoRA is the only executable method under our system budget, while FedAvg+LoRA, FedProx+LoRA, and HeteroLoRA encounter out-of-memory errors.
Overall, these cross-protocol results lead to a key conclusion: methods that appear favorable under accuracy-only evaluation may become suboptimal once realistic edge constraints and system perturbations are taken into account.
## Appendix E Detailed dynamics of fine-tuning process
This section provides detailed fine-tuning trajectories under the experimental settings of Protocols A and B. While the previous sections mainly report testing results, including testing loss, testing accuracy, and system-level costs, here we further examine the training dynamics of the different methods. Specifically, we plot the training loss over wall-clock time for four federated LLM fine-tuning methods, three backbone models, and seven benchmark datasets. These curves offer a more fine-grained view of how each method progresses during fine-tuning, revealing differences in convergence speed, training stability, and execution efficiency that are not fully reflected by the final testing metrics alone. By plotting training loss against wall-clock time, we can directly observe whether a method reaches lower loss quickly, whether it suffers from slow system execution, and how the training behavior changes across model scales and task types.
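A loss-versus-wall-clock view like the one in Figures 8–10 is easy to produce from per-round logs. The sketch below is a generic illustration, not the paper's plotting code; the trajectories in it are placeholders.

```python
# Illustrative plotting of training loss vs. wall-clock time, the view used
# in Figures 8-10. Assumes each method's log is a list of (hours, loss)
# pairs; the trajectories below are placeholders, not the paper's data.
import matplotlib.pyplot as plt

logs = {
    "SplitLoRA":   [(0.0, 2.9), (0.5, 1.4), (1.0, 0.9)],
    "FedAvg+LoRA": [(0.0, 2.9), (2.0, 1.6), (4.0, 1.1)],
}

for method, traj in logs.items():
    hours, loss = zip(*traj)
    plt.plot(hours, loss, marker="o", label=method)

plt.xlabel("Wall-clock time (h)")
plt.ylabel("Training loss")
plt.legend()
plt.savefig("loss_vs_wallclock.png", dpi=200)
```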
Figure 8: Training loss curves of the four federated fine-tuning methods on Qwen2.5-0.5B across seven datasets under the experimental settings of Protocols A and B; panels (a)–(g) correspond to BoolQ, QNLI, PIQA, HellaSwag, SocialIQA, ARC-E, and WinoGrande. The curves show the evolution of training loss with respect to wall-clock time.

Figure 9: Training loss curves of the four federated fine-tuning methods on Gemma 3-270M across the same seven datasets (panels (a)–(g): BoolQ, QNLI, PIQA, HellaSwag, SocialIQA, ARC-E, WinoGrande), showing the evolution of training loss with respect to wall-clock time.

Figure 10: Training loss curves of the federated fine-tuning methods on Gemma 3-1B across the same seven datasets (panels (a)–(g): BoolQ, QNLI, PIQA, HellaSwag, SocialIQA, ARC-E, WinoGrande), showing the evolution of training loss with respect to wall-clock time.