Auto-FL-Research: Agentic Search for Federated Learning Algorithms

arXiv cs.AI 07/03/26, 04:00 AM Papers
federated-learning agentic-search automl algorithm-search healthcare nvidia
Summary
Auto-FL-Research introduces a constrained coding-agent workflow for automatically searching and evaluating federated learning algorithmic recipes, showing performance gains on multiple healthcare and LEAF tasks while also exposing seed-sensitive and search-selected failure cases.
arXiv:2607.01366v1 Announce Type: new Abstract: Federated learning (FL) research often depends on many small but consequential algorithmic choices: optimizer variants, server aggregation rules, local training schedules, normalization, regularization, and model architecture. These choices are expensive to explore manually and difficult to compare fairly when candidate changes can also alter the FL training or evaluation path. In this work, we present Auto-FL-Research (AFR), a constrained coding-agent workflow for FL algorithmic recipe search. Agents may propose and implement candidate training algorithms, including server aggregation rules, client update schedules, local objectives, and registered model variants, while task profiles fix the mutation surface, compute budget, communication contract, and final model evaluation. Each campaign records candidate scores, runtime, edited files, artifacts, and failure status. We evaluate AFR on five healthcare cross-silo FLamby tasks and on grouped-client profiles for the five fixed LEAF datasets plus the LEAF synthetic task. Five-seed repeat evaluations support gains on four FLamby tasks and five of six LEAF profiles, while also exposing seed-sensitive and search-selected failure cases. Same-budget controls show that several gains correspond to FL-recipe changes, whereas other improvements are recovered by fixed-surface scalar controls or fail under repeat or held-out evaluation. These mixed outcomes are part of the contribution: they show how agent-generated candidates can be separated into repeated FL mechanisms, fixed-surface tuning effects, and selected single-run artifacts.
# Auto-FL-Research: Agentic Search for Federated Learning Algorithms
Source: [https://arxiv.org/html/2607.01366](https://arxiv.org/html/2607.01366)
Holger R\. Roth, Ziyue Xu, Chester Chen, Daguang Xu, Peter Cnudde, Andrew Feng

NVIDIA, Santa Clara, USA

###### Abstract

Federated learning \(FL\) research often depends on many small but consequential algorithmic choices: optimizer variants, server aggregation rules, local training schedules, normalization, regularization, and model architecture\. These choices are expensive to explore manually and difficult to compare fairly when candidate changes can also alter the FL training or evaluation path\. In this work, we present Auto\-FL\-Research \(AFR\), a constrained coding\-agent workflow for FL algorithmic recipe search\. Agents may propose and implement candidate training algorithms, including server aggregation rules, client update schedules, local objectives, and registered model variants, while task profiles fix the mutation surface, compute budget, communication contract, and final model evaluation\. Each campaign records candidate scores, runtime, edited files, artifacts, and failure status\.

We evaluate AFR on five healthcare cross\-silo FLamby tasks and on grouped\-client profiles for the five fixed LEAF datasets plus the LEAF synthetic task\. Five\-seed repeat evaluations support gains on four FLamby tasks and five of six LEAF profiles, while also exposing seed\-sensitive and search\-selected failure cases\. Same\-budget controls show that several gains correspond to FL\-recipe changes, whereas other improvements are recovered by fixed\-surface scalar controls or fail under repeat or held\-out evaluation\. These mixed outcomes are part of the contribution: they show how agent\-generated candidates can be separated into repeated FL mechanisms, fixed\-surface tuning effects, and selected single\-run artifacts\.

## I Introduction

Federated learning \(FL\) promises collaborative model development without centralizing raw data, but the practical performance of an FL system depends on a large design surface\[[22](https://arxiv.org/html/2607.01366#bib.bib1)\]\. A practitioner must choose local optimizers, server aggregation rules, schedules, regularization, client participation, model architecture, evaluation strategy, and many task\-specific details\[[13](https://arxiv.org/html/2607.01366#bib.bib31)\]\. These choices interact with data heterogeneity and communication constraints, so improvements that appear obvious in centralized training can fail in FL\.

Automated FL methods have explored specific portions of this surface, including learnable aggregation, federated hyperparameter optimization, federated neural architecture search, and adaptive server optimizers\[[28](https://arxiv.org/html/2607.01366#bib.bib23),[32](https://arxiv.org/html/2607.01366#bib.bib15),[8](https://arxiv.org/html/2607.01366#bib.bib17),[24](https://arxiv.org/html/2607.01366#bib.bib3)\]\. However, many useful research advances are not a single scalar hyperparameter\. A competitive FL algorithm may require introducing a new model architecture, changing a local loss, adding a server optimizer, or using an improved server aggregation method while preserving the protocol and the benchmark definition\.

![Refer to caption](https://arxiv.org/html/2607.01366v1/x1.png)

Figure 1: Illustrative CIFAR\-10 Auto\-FL\-Research campaign progress\. Each point is a candidate in the run log; gray points are discarded candidates, blue points are active candidates, green points are kept candidates, and the green step line tracks the running best final global\-model score\. Purple markers indicate logged literature\-review events\.

Recent coding agents make it possible to automate code\-level research loops, but unconstrained experimentation can confound evaluation: an agent can change the metric, alter the data split, silently increase compute, or break the FL contract\. Auto\-FL\-Research addresses this by fixing what the agent may edit and how every candidate is evaluated\. The agent is instructed and validated to modify code only inside a task\-defined mutation surface and must evaluate candidates through a fixed FL harness, here implemented with NVIDIA FLARE \(NVFlare\)\. Each run records its budget, score, status, artifacts, and literature sources\. The implementation corresponding to the methods described in this paper is available as the NVIDIA FLARE Auto\-FL research example \([https://github\.com/NVIDIA/NVFlare/tree/main/research/auto\-fl\-research](https://github.com/NVIDIA/NVFlare/tree/main/research/auto-fl-research)\), including the control plane, task profiles, plotting utilities, and reporting workflow\.

We therefore treat agent changes as candidate\-generating steps rather than as final claims\. Candidate scores are interpreted together with the recorded search trace, repeated seed evaluations, and task controls\. AFR is not proposed as a new FL optimizer; it is a constrained research protocol for using coding agents to generate, record, and check candidate FL algorithms under fixed execution and evaluation contracts\.

This paper makes three contributions\.

- We describe a contract\-preserving agentic FL search harness based on NVFlare task profiles and fixed budgets, and make explicit which code\-level mutations are allowed beyond scalar HPO\.
- We evaluate the harness on FLamby healthcare tasks\[[5](https://arxiv.org/html/2607.01366#bib.bib8)\] and LEAF federated benchmark tasks\[[2](https://arxiv.org/html/2607.01366#bib.bib6)\], including five\-seed repeats of selected FLamby and LEAF configurations with matched baselines\.
- We analyze which agent\-discovered FL mechanisms transfer across tasks, identify search\-selected gains that do not survive repeat or held\-out evaluation, and use same\-budget controls to distinguish FL\-specific recipe changes from fixed\-surface scalar tuning\.

The intended outcome is therefore not only a better configuration for a benchmark task, but a reproducible record of what was tried, which ideas transferred, which candidates failed, and which selected gains survived repeat or held\-out evaluation\.

## II Related Work

#### Federated Optimization

Federated Averaging \(FedAvg\) remains the canonical baseline for cross\-device and cross\-silo FL\[[22](https://arxiv.org/html/2607.01366#bib.bib1)\]\. FedProx adds a proximal term to stabilize optimization in the presence of client heterogeneity\[[19](https://arxiv.org/html/2607.01366#bib.bib2)\]\. FedOpt generalizes server\-side adaptive optimization, including FedAdam\-style updates over aggregated client model differences\[[24](https://arxiv.org/html/2607.01366#bib.bib3)\]\. SCAFFOLD uses control variates to reduce client drift\[[14](https://arxiv.org/html/2607.01366#bib.bib4)\]\. AFR treats these as baseline mechanisms and as building blocks that agents may combine with task\-specific local training changes\.

#### Automated FL and Federated Architecture Search

Prior Automated FL work, building on AutoML and NAS literature\[[12](https://arxiv.org/html/2607.01366#bib.bib33),[6](https://arxiv.org/html/2607.01366#bib.bib34),[10](https://arxiv.org/html/2607.01366#bib.bib35)\], has automated narrower FL design spaces, including federated NAS, learnable aggregation, Bayesian AutoML in FL, client participation, edge\-resource scheduling, and FL HPO\[[23](https://arxiv.org/html/2607.01366#bib.bib18),[16](https://arxiv.org/html/2607.01366#bib.bib19),[11](https://arxiv.org/html/2607.01366#bib.bib20),[31](https://arxiv.org/html/2607.01366#bib.bib21),[26](https://arxiv.org/html/2607.01366#bib.bib22),[28](https://arxiv.org/html/2607.01366#bib.bib23),[8](https://arxiv.org/html/2607.01366#bib.bib17),[32](https://arxiv.org/html/2607.01366#bib.bib15),[9](https://arxiv.org/html/2607.01366#bib.bib16)\]\. In contrast, AFR is not a single optimizer or controller; it is a constrained coding\-agent harness for code\-level FL recipe search under fixed communication and scoring contracts\.

#### Benchmarks and Execution Frameworks

FLamby provides realistic healthcare cross\-silo FL tasks with public splits, baseline models, and metrics\[[5](https://arxiv.org/html/2607.01366#bib.bib8)\]\. LEAF provides federated datasets for cross\-device\-style settings, including FEMNIST, Sent140, Shakespeare, CelebA, and Reddit\[[2](https://arxiv.org/html/2607.01366#bib.bib6)\]\. The LEAF project also distributes a synthetic classification task\[[17](https://arxiv.org/html/2607.01366#bib.bib7)\]\. NVFlare provides production\-oriented FL execution abstractions and simulation capabilities\[[25](https://arxiv.org/html/2607.01366#bib.bib5)\]\. We use NVFlare as the execution substrate so that candidate changes are evaluated through an FL runtime rather than a standalone benchmark script\.

#### Agentic Research Loops

The AFR workflow is inspired by emerging autonomous research systems that combine experiment records, code edits, and literature\-guided proposal generation\. EAIRA\[[3](https://arxiv.org/html/2607.01366#bib.bib26)\]frames the broader problem of evaluating AI models as scientific research assistants, arguing for assessment beyond static question answering, including controlled lab\-style and field\-style evaluations of how models support real research tasks\. End\-to\-end systems such as The AI Scientist and AI Scientist\-v2 automate idea generation, code execution, experiment analysis, and paper writing for machine\-learning research\[[21](https://arxiv.org/html/2607.01366#bib.bib27),[29](https://arxiv.org/html/2607.01366#bib.bib28)\], while Agent Laboratory studies a more interactive research\-assistant workflow with optional human feedback\[[27](https://arxiv.org/html/2607.01366#bib.bib29)\]\. Karpathy’s “autoresearch” project demonstrates a minimal agentic loop for repeatedly improving a fixed training task under a persistent result log\[[15](https://arxiv.org/html/2607.01366#bib.bib24)\]\. Camyla\[[7](https://arxiv.org/html/2607.01366#bib.bib25)\]emphasizes structured literature search, memory, and proposal generation for medical image segmentation research\. AFR adapts these ideas to federated learning by adding task profiles, communication\-contract invariants, cross\-site evaluation, and FL\-specific mutation boundaries, so that an agent’s contribution is judged by executable benchmark outcomes rather than by text\-only responses or manuscript generation alone\.

## III Method: Agentic Search Harness

![Refer to caption](https://arxiv.org/html/2607.01366v1/figures/auto_fl_research_loop.png)

![Refer to caption](https://arxiv.org/html/2607.01366v1/x2.png)

Figure 2: AFR loop and evaluation coverage\. Left: the agent starts from research intent, program\.md, an active task profile, a fixed budget, and a fixed mutation surface\. Candidate NVFlare runs append results to results\.tsv; reviewed batches are kept, narrowed, discarded, or used to select the next candidate\. Right: stylized benchmark modalities from FLamby and grouped\-client LEAF profiles evaluated through the same run log and final global\-model scoring path\.

### III\-A Campaign Algorithm

Algorithm [1](https://arxiv.org/html/2607.01366#alg1) gives the campaign loop used by the agents\. The algorithm is intentionally simple: all state that matters for scientific comparison is either fixed by the task profile or written into the run record\. The agent may propose code changes, but every candidate must pass the task validation path before it is scored\.

Algorithm 1: AFR campaign loop

Input: task profile, candidate cap, mutation surface, validation commands

Initialize autoresearch/ branch; run baseline; log to results\.tsv

while budget remains and campaign is not manually stopped do

Execute candidate cycle:

1. Propose candidate\(s\)\.
2. Validate edits, budget fields, contract, and smoke test\.
3. Run candidate in NVFlare; extract final score\.
4. Log score, runtime, status, description, artifacts\.
5. Review as *keep*, *discard*, or *crash*\.

if plateau watchdog triggers then

Recover via literature loop: select source\-backed proposals; log event\.

end if

end while

Finish: repeat selected configurations; regenerate plots; write final report\.
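The control flow of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the AFR implementation: the `Row` type, the `propose`, `validate`, and `run_candidate` callbacks, the `patience` threshold, and the choice to log validation failures as *crash* rows are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Row:
    description: str
    score: Optional[float]   # None for crashes and literature events
    status: str              # "keep", "discard", "crash", or "literature"


def run_campaign(
    candidate_cap: int,
    patience: int,                        # hypothetical plateau threshold
    baseline_score: float,
    propose: Callable[[List[Row]], str],
    validate: Callable[[str], bool],
    run_candidate: Callable[[str], float],
) -> Tuple[List[Row], float]:
    # The baseline is the first logged row and the initial running best.
    rows = [Row("baseline", baseline_score, "keep")]
    best, stale, logged = baseline_score, 0, 0
    while logged < candidate_cap:
        # Plateau watchdog: log a non-scored literature event and reset.
        if stale >= patience:
            rows.append(Row("literature review", None, "literature"))
            stale = 0
        description = propose(rows)
        logged += 1
        # Assumption: candidates that fail validation are logged as crashes.
        if not validate(description):
            rows.append(Row(description, None, "crash"))
            continue
        try:
            score = run_candidate(description)
        except RuntimeError:
            rows.append(Row(description, None, "crash"))
            continue
        if score > best:
            best, stale = score, 0
            rows.append(Row(description, score, "keep"))
        else:
            stale += 1
            rows.append(Row(description, score, "discard"))
    return rows, best
```

Only scored, non\-crash candidates advance the plateau counter in this sketch, which mirrors the watchdog description in the run\-log section; the final repeat\-seed step is left out because it runs after the loop ends.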

### III\-B Task Profiles and Fixed Budgets

Each campaign begins with a task profile that specifies the dataset, metric, model budget, client/site configuration, number of rounds, final evaluation policy, and allowed mutation files\. A candidate is comparable only if it preserves the fixed budget fields\. For the architecture sub\-campaign, the profile includes a maximum parameter count and requires that the selected architecture, normalization mode, and parameter cap be instantiated identically on the server and all clients\.
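One way to picture a task profile is as an immutable record plus a comparability check. The sketch below is illustrative only: the field names, the `budget_violations` helper, and the example values (apart from the 25M\-parameter cap mentioned for IXI) are hypothetical and do not reproduce the NVFlare example's actual profile schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TaskProfile:
    dataset: str
    metric: str
    num_clients: int
    num_rounds: int
    max_params: int                  # parameter cap for architecture search
    allowed_files: Tuple[str, ...]   # mutation surface


# Fields that must match exactly for a candidate to be comparable.
EXACT_FIELDS = ("dataset", "metric", "num_clients", "num_rounds")


def budget_violations(profile: TaskProfile, candidate: Dict) -> List[str]:
    """Return the names of budget fields the candidate does not preserve."""
    violations = [
        name for name in EXACT_FIELDS
        if candidate.get(name) != getattr(profile, name)
    ]
    # The parameter cap is an upper bound, not an exact match.
    if candidate.get("num_params", 0) > profile.max_params:
        violations.append("max_params")
    return violations
```

A candidate with a non\-empty violation list would be rejected before scoring, since its score would no longer be comparable with the baseline.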

### III\-C Federated Contract

The agent must preserve the NVFlare client contract\. In our experiments, clients receive the current global model, load it strictly, perform local training or evaluation, compute a model difference, and send a DIFF\-typed update with the number of local steps in metadata\. The same final global server model is used for metric evaluation\. This prevents a candidate from appearing better by changing the evaluation route, changing the update type, or using a different model state schema on client and server\. In the implementation, this contract is checked by an AST\-based static validator that requires flare\.init\(\), flare\.receive\(\), flare\.send\(\), strict state\_dict loading, typed update outputs, NUM\_STEPS\_CURRENT\_ROUND metadata, and the evaluation branch\. Each task profile also runs Python compile checks and a task\-specific smoke command before full campaign use\.

### III\-D Mutation Surface

The allowed mutation surface includes task\-local client training logic, task\-local job construction, registered model variants, task\-local utilities, and shared custom aggregators\. Agents may tune optimizers, schedules, regularization, local step counts, server learning rates, momentum, FedProx\-like objectives, FedOpt\-style server rules, and architecture variants\. They may not change raw data bridges or task data semantics unless the human explicitly asks for a protocol or benchmark change\. These boundaries are checked partly by code and partly by review: the static validator catches contract breakage, the smoke run catches many runtime protocol errors, and the final report identifies which files were edited for each kept candidate\. The current system does not yet provide a complete cryptographic or sandbox\-level proof that forbidden files were untouched; we treat that as an engineering target for future hardening\.
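A file\-level boundary check is one simple way to report edits that fall outside the mutation surface. The sketch below assumes the surface is expressed as glob patterns and that the list of edited paths is obtained separately (for example from version control); the function name and all paths are hypothetical.

```python
from fnmatch import fnmatch
from typing import Iterable, List


def outside_surface(edited_files: Iterable[str], allowed_patterns: Iterable[str]) -> List[str]:
    """Return edited paths that match none of the allowed patterns."""
    patterns = list(allowed_patterns)
    # Note: fnmatch's '*' also matches '/', so 'models/*.py' covers subfolders.
    return sorted(
        path for path in edited_files
        if not any(fnmatch(path, pattern) for pattern in patterns)
    )


# Hypothetical mutation surface for one task.
ALLOWED = [
    "tasks/fed_ixi/client.py",
    "tasks/fed_ixi/job.py",
    "tasks/fed_ixi/models/*.py",
    "shared/aggregators/*.py",
]
```

As the text notes, a check like this is a reporting aid rather than a sandbox: it only sees the paths it is given and offers no proof that forbidden files were untouched.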

### III\-E Run Log, Review, and Literature Loop

Every candidate is recorded in a tab\-separated run log with a score, runtime, budget, status, target file, description, and artifact paths\. Candidate rows are finalized as*keep*,*discard*, or*crash*\. When the search plateaus, the agent must consult related literature, write down the source\-backed idea, and then implement a candidate\. The final report can then distinguish simple parameter tuning from changes to paper\-derived methods\. In the reference workflow, a plateau watchdog recommends switching to literature mode after a sustained run of scored non\-crash candidates without a material improvement or a literature reset\. Literature events are recorded as non\-scored rows, so search cost and proposal timing remain visible after the campaign\.
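The run log and plateau watchdog can be illustrated with the standard `csv` module. This is a sketch under stated assumptions: the column names follow the fields listed above, while the `patience` and `min_delta` thresholds and both function names are hypothetical, since the text does not give the watchdog's actual parameters.

```python
import csv
from typing import Dict, List, TextIO

FIELDS = ["score", "runtime_s", "budget", "status",
          "target_file", "description", "artifacts"]


def append_row(handle: TextIO, row: Dict[str, str]) -> None:
    """Append one candidate (or literature event) as a tab-separated row."""
    csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t").writerow(row)


def should_switch_to_literature(rows: List[Dict], patience: int = 8,
                                min_delta: float = 1e-3) -> bool:
    """True after `patience` scored, non-crash rows without material improvement."""
    best, stale = float("-inf"), 0
    for row in rows:
        if row["status"] == "literature":
            stale = 0          # a literature event resets the counter
            continue
        if row["status"] == "crash" or row["score"] in ("", None):
            continue           # crashes are not scored and do not count
        score = float(row["score"])
        if score > best + min_delta:
            best, stale = score, 0
        else:
            stale += 1
    return stale >= patience
```

Keeping literature events in the same table as non\-scored rows is what lets the watchdog treat them as resets while leaving their timing visible afterwards.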

### III\-F Artifact Trail

The output of a campaign is an artifact trail, not only a best score\. The AFR harness keeps the control prompt, task profile, mutation schema, candidate table, generated progress plot, final report, selected code diffs, and follow\-up seed evaluations together in the experiment branch\. This structure lets a reviewer reconstruct the search surface, identify invalid candidates, separate selected wins from repeated results, and inspect whether a claimed mechanism came from scalar tuning, task\-local code, architecture registration, or literature\-backed proposal generation\. The best configuration is one output of this record, not the only object of analysis\.

## IV Experimental Design

TABLE I: Search\-space comparison for interpreting AFR gains beyond scalar HPO\. All modes preserve the same FL communication contract, data bridge, candidate schema, and final global\-model scoring path\.

### IV\-A Search Spaces and Controls

The central comparison in this work is not an unconstrained agent versus a weak manual baseline\. The comparison is between search spaces that differ in what they permit while sharing the same communication contract, final scoring path, and candidate schema\. Table[I](https://arxiv.org/html/2607.01366#S4.T1)summarizes the mutation surfaces used for the main evidence blocks\. The scripted HPO controls can tune scalar and categorical knobs already exposed by the harness, but cannot introduce new task\-local code or registered architectures\. The architecture\-open AFR campaigns can add named model variants under a parameter cap, and the literature loop can propose source\-backed methods, but both remain inside the same validation, smoke\-test, and final\-global\-model scoring gates\. Concretely, the scripted scalar controls were generated with fixed random seeds and fixed model architectures as summarized in Table[II](https://arxiv.org/html/2607.01366#S4.T2)\.
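A scripted scalar control of this kind amounts to seeded sampling over knobs the harness already exposes. The sketch below shows the idea only: the knob names and ranges in `SPACE` are hypothetical placeholders and are not the search spaces of Table II.

```python
import math
import random
from typing import Dict, List

# Hypothetical scalar and categorical knobs; not the values from Table II.
SPACE = {
    "local_lr": ("log_uniform", 1e-4, 1e-1),
    "server_lr": ("log_uniform", 1e-2, 1.0),
    "local_steps": ("choice", [50, 100, 200]),
    "fedprox_mu": ("choice", [0.0, 0.001, 0.01]),
}


def sample_candidates(space: Dict, n: int, seed: int) -> List[Dict]:
    """Draw n configurations reproducibly from a fixed random seed."""
    rng = random.Random(seed)   # private generator, so global state is untouched
    candidates = []
    for _ in range(n):
        config = {}
        for name, spec in space.items():
            if spec[0] == "log_uniform":
                low, high = spec[1], spec[2]
                config[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
            else:
                config[name] = rng.choice(spec[1])
        candidates.append(config)
    return candidates
```

Because the sampler only emits values for existing knobs, it cannot introduce new task\-local code or architectures, which is exactly the restriction that separates the scalar controls from the architecture\-open campaigns.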

TABLE II: Scripted scalar HPO search spaces used for same\-budget controls\. All candidates used fixed model architectures\.

### IV\-B FLamby Campaigns

We evaluate five FLamby tasks: Fed\-Heart\-Disease, Fed\-TCGA\-BRCA, Fed\-IXI, Fed\-ISIC2019, and Fed\-Camelyon16\. The reported score is always the final NVFlare global server model evaluated via the task harness; higher scores are better for all metrics\. Campaigns ran on a local node with four NVIDIA H100 80 GB GPUs, but the reported searches launched one candidate at a time, with each candidate occupying one GPU\. Each architecture\-open campaign was capped at 100 candidates\. Candidates were launched sequentially to avoid concurrent edits to shared task files\. Unless otherwise noted, campaigns used the same fixed coding\-agent backend \(Codex GPT\-5\.5 with xHigh effort and Auto\-Review\), prompt/control\-plane files, task profile, mutation schema, validation commands, timeout policy, and scoring path\. Human intervention was limited to campaign setup, interruption, and post\-hoc review\.

We compare against three values\. The first is the NVFlare campaign baseline, i\.e\., the first fixed\-budget run in the same harness\. The second is the best AFR score found in the campaign\. The third is an external target selected from FLamby or closely related published work: the original FLamby reference, FENS one\-shot ensembling results\[[1](https://arxiv.org/html/2607.01366#bib.bib9)\], or FedCompass for IXI\[[20](https://arxiv.org/html/2607.01366#bib.bib10)\]\. These targets are useful calibration points, but they do not always align with our exact NVFlare protocol, compute budget, or final global model scoring\. Therefore, we report repeat\-seed means for the strongest candidates and distinguish stable contextual gains from seed\-sensitive campaign bests\.

### IV\-C LEAF Campaigns

We evaluate the five fixed LEAF benchmark datasets from the original paper\[[2](https://arxiv.org/html/2607.01366#bib.bib6)\]–FEMNIST, Sent140, Shakespeare, CelebA, and Reddit–plus the LEAF synthetic classification task\. Our LEAF adapter is a grouped\-client approximation: each NVFlare client represents a deterministic group of original LEAF users, rather than a single physical device\. This preserves user records and train/test splits while keeping NVFlare simulation costs manageable\. Consequently, the LEAF experiments test whether AFR can improve task profiles under a consistent FL harness, rather than whether it exactly reproduces the large\-scale cross\-device setup of the original paper\.

### IV\-D Post\-Selection Evaluation Protocol

During the search, the agent observes the same profile score reported in the campaign log\. This is appropriate for studying benchmark optimization behavior, but it can overfit the reported score\. We therefore treat single\-seed campaign bests as selected candidates rather than final statistical claims\. For FLamby and LEAF, selected configurations and corresponding baselines were repeated with five seeds and saved with individual per\-seed scores, mean, standard deviation, standard error, and confidence\-interval summaries\. For repeated comparisons, we report paired mean differences with descriptive uncertainty half\-widths computed from matched\-seed winner\-minus\-baseline differences\. Because $n=5$ is small and the candidates were selected by search, these intervals should not be interpreted as confirmatory hypothesis tests\. As an additional overfitting check, we ran a validation\-selected and held\-out\-reported procedure for Heart Disease and FEMNIST: candidate selection used a deterministic validation subset of each site’s evaluation split, while the selected candidate and matched baseline were rerun on the complementary held\-out subset\. This check is not a replacement for externally defined test sets, but it directly tests whether a selected candidate survives a score it did not observe during search\. We use candidate wall\-clock runtime as the normalized experimental cost\. Agent\-session token telemetry is captured in post\-campaign reports when the agent runtime exposes it, but it is treated as artifact metadata rather than as the search budget because telemetry availability differs across agent frontends\.
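The paired summary described above can be sketched as follows. The text does not state how the half\-widths were computed, so this example assumes a two\-sided 95% Student\-t interval on the matched\-seed differences (critical value 2\.776 for 4 degrees of freedom); the function name is hypothetical and the result is descriptive, not a hypothesis test.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple

# Two-sided 95% Student-t critical value for n = 5 (4 degrees of freedom).
T_CRIT_DF4_95 = 2.776


def paired_summary(winner: Sequence[float], baseline: Sequence[float],
                   t_crit: float = T_CRIT_DF4_95) -> Tuple[float, float]:
    """Return (mean difference, half-width) over matched-seed score pairs."""
    if len(winner) != len(baseline) or len(winner) < 2:
        raise ValueError("need at least two matched seed pairs")
    # Differences are taken seed by seed: winner minus baseline.
    diffs = [w - b for w, b in zip(winner, baseline)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs), t_crit * standard_error
```

Under this assumption, an interval "excludes zero" when the absolute mean difference is larger than the half\-width. The default critical value is only valid for five pairs and would need to change for other sample sizes.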

TABLE III: Campaign accounting for the main reported searches\. All rows used a 100\-candidate cap and logged 100 candidates\. Lit\. denotes literature events; Wall\-h excludes non\-scored literature rows and follow\-up repeat evaluations\.

## V Results

### V\-A FLamby Healthcare Tasks

Table [IV](https://arxiv.org/html/2607.01366#S5.T4) summarizes the five completed FLamby campaigns with five\-seed repeats of the selected configuration and matched baseline\. The largest reproducible gains were observed on IXI, ISIC2019, and Camelyon16 \(see Fig\. [3](https://arxiv.org/html/2607.01366#S5.F3)\)\. IXI reached a repeat mean Dice of 0\.9895, which is about 0\.0015 above the selected FedCompass calibration target\. Camelyon16 reached a repeat mean ROC AUC of 0\.7494, about 0\.034 above the selected FENS calibration target\. Heart Disease matched its selected external target within rounding\. TCGA\-BRCA found a strong single best C\-index during search, but the repeated configuration regressed toward the baseline and should not be treated as a robust gain\. ISIC2019 improved substantially over its NVFlare baseline but did not reach the strongest selected external target\. Descriptive paired mean\-difference intervals exclude zero for Heart Disease \($+0.0735 \pm 0.0033$\), IXI \($+0.1981 \pm 0.0004$\), ISIC2019 \($+0.1462 \pm 0.0196$\), and Camelyon16 \($+0.1634 \pm 0.0424$\), but include zero for TCGA\-BRCA \($+0.0009 \pm 0.0217$\)\. This supports the interpretation that TCGA\-BRCA is a seed\-sensitive search win rather than a repeated improvement\.

TABLE IV: FLamby campaign results\.

#### IXI check

Because the IXI gain was unusually large, we verified that the selected configuration preserved the evaluation path: both baseline and selected repeats used the same clients, communication rounds, update contract, data root, and all\-client final global\-model Dice evaluation\. The gain was not explained by scalar HPO alone\. The selected candidate replaced the small stock FLamby U\-Net with a registered residual U\-Net\-family model under the fixed 25M\-parameter cap, increased local training, and retuned AdamW regularization and weighted aggregation\. Intermediate candidates improved first with width/residual capacity and then with local\-update and optimizer retuning\. Across seeds 42–46, selected scores were highly stable \(0\.989–0\.990\), while matched baselines remained near 0\.791\. We therefore interpret IXI as primarily an architecture\-capacity and local\-update\-budget win under the fixed FL contract\.
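As a quick arithmetic cross\-check, the relative gain reported for IXI in Figure 3 can be related to the paired difference and the matched baseline. The sketch below uses the paired mean difference of 0\.1981 and the approximate baseline of 0\.791 quoted in the text; the function name is hypothetical, and the baseline is rounded, so the result is only approximate.

```python
def relative_gain(paired_difference: float, baseline_mean: float) -> float:
    """Relative gain of the selected candidate over the matched baseline."""
    return paired_difference / baseline_mean


# Values quoted in the text for IXI (baseline is approximate: "near 0.791").
ixi_gain = relative_gain(0.1981, 0.791)
print(f"IXI relative gain: {ixi_gain:.1%}")
```

The result is consistent with the roughly 25% relative gain shown for IXI in Figure 3\.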

The kept high\-scoring FLamby mechanisms varied by task\. Heart Disease benefited from a registered quadratic\-linear tabular model and longer local optimization\. IXI improved through a LeakyReLU U\-Net variant, local\-step adjustments, AdamW\-style regularization, and weighted aggregation\. The main Camelyon16 campaign selected a DSMIL\-inspired multiple\-instance model for slide classification\[[18](https://arxiv.org/html/2607.01366#bib.bib13)\], but we treat that mechanism as a hypothesis from the search trace rather than a proven causal source of the gain\. A completed no\-literature, fixed\-architecture repeat of a control candidate reached $0.794 \pm 0.024$ ROC AUC over five seeds, above the literature\-enabled campaign repeat mean of $0.749 \pm 0.018$\. A separate repeat of the single\-seed sweep winner averaged $0.738 \pm 0.084$, showing that its 0\.834 seed\-42 score was not a stable repeated result\. Thus, for Camelyon16, the follow\-up evidence supports a fixed\-architecture recipe\-search explanation and caution about seed sensitivity rather than a causal DSMIL/literature\-loop explanation\. ISIC2019 primarily improved through regularization and FedProx\-style stabilization, consistent with overfitting pressure in class\-imbalanced dermoscopy\. TCGA\-BRCA benefited most from local FedAdam\-style interpolation and a reduced server learning rate in the campaign, but the repeat\-seed results caution against treating that single run as a robust target claim\.

### V\-B Grouped\-Client LEAF Task Profiles

Table [V](https://arxiv.org/html/2607.01366#S5.T5) summarizes five\-seed repeats of the selected grouped\-client LEAF task\-profile candidates and matched baselines, while Fig\. [3](https://arxiv.org/html/2607.01366#S5.F3) illustrates the gains achieved across this benchmark\. The repeated results strengthen the evidence on FEMNIST, Sent140, Shakespeare, Synthetic, and Reddit, where the selected AFR candidate remains above the matched baseline mean\. They also expose a search\-selected failure case: the CelebA winner from the campaign does not beat the baseline mean on repeated evaluation; a follow\-up top\-k repeat check found only a small alternate gain, whose paired differences still crossed zero\. This result shows why campaign winners should be rerun before they are treated as findings\.

TABLE V: LEAF grouped\-client approximation repeat results\.

Figure 3: Mean relative gains over matched repeated baselines across the two benchmark suites \(five\-seed repeat\)\. Horizontal axis: relative gain over matched repeated baseline \(0% to 30%\)\. FLamby healthcare tasks: ISIC2019 \+29\.6%, Camelyon16 \+27\.9%, IXI \+25\.0%, Heart Disease \+10\.2%, TCGA\-BRCA \+0\.1%\. LEAF grouped\-client profiles: Shakespeare \+24\.4%, Sent140 \+16\.0%, FEMNIST \+4\.6%, Synthetic \+3\.5%, Reddit \+2\.9%, CelebA \-0\.1%\.

The same descriptive paired intervals do not include zero for FEMNIST \($+0.0383 \pm 0.0050$\), Sent140 \($+0.1032 \pm 0.0034$\), Shakespeare \($+0.1126 \pm 0.0032$\), Synthetic \($+0.0333 \pm 0.0017$\), and Reddit \($+0.0044 \pm 0.0009$\)\. CelebA does not separate from zero \($-0.0013 \pm 0.0083$\), reinforcing that selected single\-run wins need repeated evaluation before they become paper claims\.

### V\-C FEMNIST Ablation

To separate the effect of architecture search from fixed\-model tuning, we ran three FEMNIST campaign variants \(Fig\.[4](https://arxiv.org/html/2607.01366#S5.F4)\)\. Fixed\-model hyperparameter search and optimizer/scheduler search both improved the baseline by roughly 0\.03–0\.04 accuracy\. Allowing registered architecture variants produced the best result, improving the baseline by 0\.046\. This supports the claim that code\-level mutations can add value beyond what a traditional scalar hyperparameter sweep can\.

Figure 4: FEMNIST ablation gains over matched baselines\. Horizontal axis: accuracy gain over matched baseline\. Fixed model: $+0.037$ \($0.833 \rightarrow 0.869$\); Opt\./sched\.: $+0.032$ \($0.835 \rightarrow 0.867$\); Arch\.\-open: $+0.046$ \($0.834 \rightarrow 0.880$\)\.

### V\-D Cross\-Task Patterns

Several patterns recur across tasks\. First, FedProx\-like proximal regularization was frequently useful on heterogeneous and noisy tasks, including ISIC2019, FEMNIST, Shakespeare, and Sent140; on CelebA, it produced the largest selected score but did not survive repeat evaluation\. Second, server\-side momentum or FedOpt\-style scaling helped when the default aggregated model difference was too conservative for the fixed round budget\. Third, task\-specific architecture variants mattered most when the baseline architecture was clearly underpowered or mismatched to the data representation, as in IXI, Heart Disease, and FEMNIST\. Camelyon16 is a useful counterexample: the campaign trace suggested an architecture mechanism, but a no\-literature fixed\-architecture repeat later exceeded the literature\-enabled repeat mean\. Finally, discarded candidate rows made plateaus visible and encouraged the agent to switch from parameter jitter to source\-backed proposals\.

The logged literature loop had mixed effects\. Among the main FLamby campaigns with literature events, the best post\-literature score exceeded the best pre\-literature score for TCGA\-BRCA \(0\.8426 vs\. 0\.8411\), IXI \(0\.9896 vs\. 0\.9894\), and Camelyon16 \(0\.7780 vs\. 0\.5478, where the early literature loop preceded the DSMIL\-style MIL search\)\. However, the completed no\-literature/fixed\-architecture Camelyon16 sweep reached a single\-seed best of 0\.8344, and a five\-seed repeat of a fixed\-architecture control candidate reached $0.794 \pm 0.024$, exceeding the literature\-enabled repeat mean\. Repeating the single\-seed sweep winner gave $0.738 \pm 0.084$, below the literature\-enabled repeat mean and demonstrating why selected single\-seed winners should not be promoted without repeats\. The literature loop did not improve the final best score for Heart Disease or ISIC2019, and the Sent140 LEAF literature events also did not exceed the earlier best\. A 101\-row Sent140 no\-literature local\-sweep ablation reached 0\.7545, essentially matching the main Sent140 campaign best under the same grouped\-client harness\. This suggests that literature\-grounded recovery can generate useful candidates, but the available ablations do not show that the final gains depended causally on literature\-derived proposals\.

### V\-E Mechanism Attribution and Controls

TABLE VI: Condensed FL\-mechanism attribution for selected AFR improvements\. Gains are relative to matched repeated baselines unless noted\.

#### Mechanism attribution

Table [VI](https://arxiv.org/html/2607.01366#S5.T6) summarizes the main FL mechanisms supported by the controls\. For an FL audience, the strongest results are not the largest raw score changes alone, but the cases where AFR changed the federated recipe while preserving the communication and scoring contract: server aggregation in Heart Disease, client capacity and local\-update budget in IXI, and grouped\-client architecture/update choices in FEMNIST\. Sent140 and ISIC2019 show that strong gains can also arise from careful tuning of existing FL recipe knobs\. Camelyon16 and CelebA show why search\-selected mechanisms need repeat checks before being treated as causal findings\.

Figure [5](https://arxiv.org/html/2607.01366#S5.F5) reports the held\-out check: Heart Disease exposed a validation\-selected false positive, while FEMNIST retained a \+0\.030 held\-out gain\. This reinforces the main evaluation rule: search scores drive campaigns, but repeat and held\-out evaluations decide which findings become scientific claims\.

Figure 5: Validation\-selected and held\-out\-reported check \(metric value; gray = matched baseline, green = selected candidate\)\. Panel A, validation score used for selection \(baseline vs\. selected candidate\): Heart Disease 0\.698 vs\. 0\.784; FEMNIST 0\.835 vs\. 0\.870\. Panel B, complementary held\-out score: Heart Disease 0\.743 vs\. 0\.727 \(−0\.016\); FEMNIST 0\.832 vs\. 0\.862 \(\+0\.030\)\.
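The held\-out check amounts to comparing the sign of the validation gain with the sign of the held\-out gain\. The sketch below uses the Figure 5 values\. The assignment of each number to baseline or selected candidate is read from the flattened figure and should be treated as an assumption, and the `HeldOutCheck` class and its sign\-based verdict rule are hypothetical simplifications that ignore seed variance\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeldOutCheck:
    task: str
    val_baseline: float
    val_candidate: float
    heldout_baseline: float
    heldout_candidate: float

    @property
    def val_gain(self) -> float:
        # Gain on the validation score that drove candidate selection.
        return self.val_candidate - self.val_baseline

    @property
    def heldout_gain(self) -> float:
        # Gain on the complementary held-out score used for reporting.
        return self.heldout_candidate - self.heldout_baseline

    def verdict(self) -> str:
        # Simplified sign test; a real check would also weigh seed variance.
        if self.val_gain > 0 and self.heldout_gain <= 0:
            return "validation-selected false positive"
        if self.heldout_gain > 0:
            return "held-out gain retained"
        return "no gain"


checks = [
    HeldOutCheck("Heart Disease", 0.698, 0.784, 0.743, 0.727),
    HeldOutCheck("FEMNIST", 0.835, 0.870, 0.832, 0.862),
]
for c in checks:
    print(f"{c.task}: val {c.val_gain:+.3f}, "
          f"held-out {c.heldout_gain:+.3f} -> {c.verdict()}")
```

Under these assumptions, Heart Disease shows a positive validation gain that turns negative on held\-out data, while FEMNIST keeps most of its validation gain\.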

## VI Discussion

The results show that a coding agent can be productive in FL research when it operates inside a fixed execution harness\. AFR did not merely tune a static list of hyperparameters: it introduced task\-specific model architectures, optimization rules, and regularization strategies while preserving final global\-model scoring\. This is important because many useful FL improvements require code changes that do not fit neatly into conventional HPO\. The causal evidence is strongest when the same mechanism is recovered by independent trajectories or when the same\-budget controls isolate a smaller search surface; otherwise, we report the mechanism as an attribution supported by the search trace rather than as a single\-factor causal effect\.

The same mechanism also introduces limitations\. Agentic searches are not deterministic, and a single campaign best can be seed\-sensitive\. TCGA\-BRCA illustrates this: the campaign found a strong best run, but the selected configuration did not reproduce the same margin across follow\-up seeds\. CelebA shows the same issue in the LEAF block: the selected winner was slightly below the matched baseline after five seeds, and a targeted top\-k repeat check found only a small, uncertain alternate gain\. ISIC2019 also demonstrates that improving a local NVFlare baseline does not imply beating a strong published target, especially when that target may use a different protocol, personalization strategy, or evaluation budget\. For LEAF, our grouped\-client adapter is a practical approximation rather than an exact reproduction of the original large\-scale cross\-device setting\. These negative and mixed cases motivate the distinction between search scores, repeated results, and held\-out checks\. The results therefore support AFR as a tool for generating and triaging FL candidates, not as an autonomous source of final benchmark claims\. We also do not claim that the observed trajectories are independent of the chosen coding\-agent backend; evaluating multiple agent models under the same task profiles and budgets is an important next step\.

The current implementation also does not solve broader governance questions around autonomous research\. The agent can waste compute, overfit to a benchmark, or cite literature too shallowly if the prompt and review rules are weak\. The records and fixed mutation surface reduce these risks but do not remove the need for human scientific review\. This study adds five\-seed FLamby and LEAF repeats, same\-budget scripted HPO controls for FEMNIST, Heart Disease, ISIC2019, and Sent140, a Sent140 no\-literature ablation, Heart agent\-trajectory repeats, and two validation\-selected/held\-out\-reported checks\. The held\-out checks are especially instructive: Heart Disease exposed a validation\-selected false positive, whereas FEMNIST retained a held\-out gain\. Stronger causal claims would still benefit from randomized, Bayesian, or evolutionary HPO/NAS controls under the same candidate budget across more tasks, additional no\-literature controls for architecture\-heavy discoveries, and externally defined validation/test splits\.

## VII Conclusion

We presented Auto\-FL\-Research, a constrained agentic workflow for FL research on top of NVFlare\. AFR uses task profiles, explicit mutation surfaces, fixed budgets, immutable FL communication contracts, final global\-model scoring, and persistent candidate records\. This setup lets agents explore code\-level FL recipe changes while preserving comparability across candidates\. Across healthcare FLamby tasks and grouped\-client LEAF profiles, AFR selected higher\-scoring candidates under the fixed harness\. Five\-seed repeats supported gains on four of five FLamby tasks and five of six LEAF profiles, while TCGA\-BRCA and CelebA showed why selected wins require repeat evaluation\. Same\-budget controls further showed that the most informative wins came from FL mechanisms such as robust aggregation, local\-update budgeting, client\-objective stabilization, and architecture choices under a fixed communication contract\. Other gains were recovered by fixed\-surface tuning or failed under repeat or held\-out evaluation\. The main result is therefore not an unconstrained benchmark claim; it is a practical workflow for generating, recording, and checking FL candidate algorithms\.

## Code Availability and AI Use Disclosure

The code artifact corresponding to the methods described in this paper is available as the NVIDIA FLARE Auto\-FL research example: [https://github\.com/NVIDIA/NVFlare/tree/main/research/auto\-fl\-research](https://github.com/NVIDIA/NVFlare/tree/main/research/auto-fl-research)\. Codex was used to help set up and run agentic experiments, and Codex and ChatGPT assisted with initial drafting and editing\. The authors reviewed, verified, and approved all results, claims, citations, figures, tables, and final text, and remain responsible for the paper\.

## References

- \[1\] \(2024\) Revisiting ensembling in one\-shot federated learning\. In NeurIPS, Vol\. 37\.
- \[2\] S\. Caldas, S\. M\. K\. Duddu, P\. Wu, T\. Li, J\. Konečný, H\. B\. McMahan, V\. Smith, and A\. Talwalkar \(2019\) LEAF: a benchmark for federated settings\. In Workshop on Federated Learning for Data Privacy and Confidentiality\.
- \[3\] F\. Cappello, S\. Madireddy, R\. Underwood, N\. L\. Chia, et al\. \(2025\) EAIRA: establishing a methodology for evaluating AI models as scientific research assistants\. arXiv:2502\.20309\.
- \[4\] O\. Cicek, A\. Abdulkadir, S\. S\. Lienkamp, T\. Brox, and O\. Ronneberger \(2016\) 3D U\-Net: learning dense volumetric segmentation from sparse annotation\. In MICCAI, pp\. 424–432\.
- \[5\] J\. O\. du Terrail, S\. Ayed, E\. Cyffers, F\. Grimberg, C\. He, R\. Loeb, P\. Mangold, T\. Marchand, O\. Marfoq, E\. Mushtaq, B\. Muzellec, C\. Philippenko, S\. Silva, M\. Teleńczuk, S\. Albarqouni, S\. Avestimehr, A\. Bellet, A\. Dieuleveut, M\. Jaggi, S\. P\. Karimireddy, M\. Lorenzi, G\. Neglia, M\. Tommasi, and M\. Andreux \(2022\) FLamby: datasets and benchmarks for cross\-silo federated learning in realistic healthcare settings\. In NeurIPS, Vol\. 35\.
- \[6\] T\. Elsken, J\. H\. Metzen, and F\. Hutter \(2019\) Neural architecture search: a survey\. JMLR 20\(55\), pp\. 1–21\.
- \[7\] Y\. Gao, H\. Li, F\. Yuan, X\. Gao, W\. Huang, and X\. Wang \(2026\) Camyla: scaling autonomous research in medical image segmentation\. arXiv:2604\.10696\.
- \[8\] P\. Guo, D\. Yang, A\. Hatamizadeh, A\. Xu, Z\. Xu, W\. Li, C\. Zhao, D\. Xu, S\. Harmon, E\. Turkbey, B\. Turkbey, B\. Wood, F\. Patella, E\. Stellato, G\. Carrafiello, V\. M\. Patel, and H\. R\. Roth \(2022\) Auto\-FedRL: federated hyperparameter optimization for multi\-institutional medical image segmentation\. ECCV\.
- \[9\] C\. He, M\. Annavaram, and S\. Avestimehr \(2020\) Towards non\-I\.I\.D\. and invisible data with FedNAS: federated deep learning via neural architecture search\. arXiv:2004\.08546\.
- \[10\] X\. He, K\. Zhao, and X\. Chu \(2021\) AutoML: a survey of the state\-of\-the\-art\. Knowledge\-Based Systems 212, pp\. 106622\. [Document](https://dx.doi.org/10.1016/j.knosys.2020.106622)\.
- \[11\] M\. Hu, W\. Yang, Z\. Luo, X\. Liu, Y\. Zhou, X\. Chen, and D\. Wu \(2024\) AutoFL: a Bayesian game approach for autonomous client participation in federated edge learning\. IEEE Transactions on Mobile Computing 23\(1\), pp\. 194–208\. [Document](https://dx.doi.org/10.1109/TMC.2022.3227014)\.
- \[12\] F\. Hutter, L\. Kotthoff, and J\. Vanschoren \(Eds\.\) \(2019\) Automated machine learning: methods, systems, challenges\. Springer\. [Document](https://dx.doi.org/10.1007/978-3-030-05318-5)\.
- \[13\] P\. Kairouz and H\. B\. McMahan \(2021\) Advances and open problems in federated learning\. Foundations and Trends in Machine Learning 14\(1\-2\), pp\. 1–210\.
- \[14\] S\. P\. Karimireddy, S\. Kale, M\. Mohri, S\. Reddi, S\. U\. Stich, and A\. T\. Suresh \(2020\) SCAFFOLD: stochastic controlled averaging for federated learning\. In Proc\. 37th ICML, Vol\. 119, pp\. 5132–5143\.
- \[15\] A\. Karpathy \(2026\) Autoresearch: AI agents running research on single\-GPU nanochat training automatically\. [https://github\.com/karpathy/autoresearch](https://github.com/karpathy/autoresearch)\. Software repository; accessed 2026\-06\-08\.
- \[16\] Y\. G\. Kim and C\. Wu \(2021\) AutoFL: enabling heterogeneity\-aware energy efficient federated learning\. In Proceedings of the 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp\. 183–198\. [Document](https://dx.doi.org/10.1145/3466752.3480129)\.
- \[17\] LEAF Project \(2026\) LEAF: a benchmark for federated settings project page\. [https://leaf\.cmu\.edu/](https://leaf.cmu.edu/)\. Accessed 2026\-06\-03\.
- \[18\] B\. Li, Y\. Li, and K\. W\. Eliceiri \(2021\) Dual\-stream multiple instance learning network for whole slide image classification with self\-supervised contrastive learning\. In CVPR, pp\. 14318–14328\.
- \[19\] T\. Li, A\. K\. Sahu, A\. Talwalkar, and V\. Smith \(2020\) Federated optimization in heterogeneous networks\. In Proc\. MLSys, Vol\. 2, pp\. 429–450\.
- \[20\] Z\. Li, P\. Chaturvedi, S\. He, H\. Chen, G\. Singh, V\. Kindratenko, E\. A\. Huerta, K\. Kim, and R\. Madduri \(2024\) FedCompass: efficient cross\-silo federated learning on heterogeneous client devices using a computing power\-aware scheduler\. In ICLR\.
- \[21\] C\. Lu, C\. Lu, R\. T\. Lange, J\. Foerster, J\. Clune, and D\. Ha \(2024\) The AI scientist: towards fully automated open\-ended scientific discovery\. arXiv:2408\.06292\.
- \[22\] H\. B\. McMahan, E\. Moore, D\. Ramage, S\. Hampson, and B\. Agüera y Arcas \(2017\) Communication\-efficient learning of deep networks from decentralized data\. In AISTATS, Vol\. 54, pp\. 1273–1282\.
- \[23\] D\. Preuveneers \(2023\) AutoFL: towards AutoML in a federated learning context\. Applied Sciences 13\(14\), pp\. 8019\. [Document](https://dx.doi.org/10.3390/app13148019)\.
- \[24\] S\. Reddi, Z\. Charles, M\. Zaheer, Z\. Garrett, K\. Rush, J\. Konečný, S\. Kumar, and H\. B\. McMahan \(2021\) Adaptive federated optimization\. In ICLR\.
- \[25\] H\. R\. Roth, Y\. Cheng, Y\. Wen, I\. Yang, Z\. Xu, Y\. Hsieh, K\. Kersten, A\. Harouni, C\. Zhao, K\. Lu, Z\. Zhang, W\. Li, A\. Myronenko, D\. Yang, S\. Yang, N\. Rieke, A\. Quraini, C\. Chen, D\. Xu, N\. Ma, P\. Dogra, M\. Flores, and A\. Feng \(2022\) NVIDIA FLARE: federated learning from simulation to real\-world\. arXiv:2210\.13291\.
- \[26\] Y\. Saadati and M\. H\. Amini \(2024\) Hyper\-parameter optimization for federated learning with step\-wise adaptive mechanism\. arXiv:2411\.12244\.
- \[27\] S\. Schmidgall, Y\. Su, Z\. Wang, X\. Sun, J\. Wu, X\. Yu, J\. Liu, Z\. Liu, E\. Barsoum, and M\. Moor \(2025\) Agent laboratory: using LLM agents as research assistants\. arXiv:2501\.04227\.
- \[28\] Y\. Xia, D\. Yang, W\. Li, A\. Myronenko, D\. Xu, H\. Obinata, H\. Mori, P\. An, S\. Harmon, E\. Turkbey, B\. Turkbey, B\. Wood, F\. Patella, E\. Stellato, G\. Carrafiello, A\. Ierardi, A\. Yuille, and H\. Roth \(2021\) Auto\-FedAvg: learnable federated averaging for multi\-institutional medical image segmentation\. arXiv:2104\.10195\.
- \[29\] Y\. Yamada, R\. T\. Lange, C\. Lu, S\. Hu, C\. Lu, J\. Foerster, J\. Clune, and D\. Ha \(2025\) The AI scientist\-v2: workshop\-level automated scientific discovery via agentic tree search\. arXiv:2504\.08066\.
- \[30\] D\. Yin, Y\. Chen, R\. Kannan, and P\. Bartlett \(2018\) Byzantine\-robust distributed learning: towards optimal statistical rates\. In Proceedings of the 35th International Conference on Machine Learning, Vol\. 80, pp\. 5650–5659\.
- \[31\] C\. You, K\. Guo, G\. Feng, P\. Yang, and T\. Q\. S\. Quek \(2023\) Automated federated learning in mobile\-edge networks: fast adaptation and convergence\. IEEE Internet of Things Journal 10\(15\), pp\. 13571–13586\. [Document](https://dx.doi.org/10.1109/JIOT.2023.3262664)\.
- \[32\] H\. Zhu, H\. Zhang, and Y\. Jin \(2021\) From federated learning to federated neural architecture search: a survey\. Complex and Intelligent Systems 7, pp\. 639–657\.