@PyTorch: Federated Learning Without the Refactoring Overhead The most valuable data is often the least movable. Regulatory bound…

05/21/26, 05:20 PM

federated-learning nvidia-flare pytorch privacy data-sovereignty compliance developer-experience

Summary

NVIDIA FLARE's latest version enables federated learning without requiring refactoring of existing training scripts, using a client API and job recipes for seamless deployment across simulation and production environments.

Federated Learning Without the Refactoring Overhead

The most valuable data is often the least movable. Regulatory boundaries, data sovereignty rules, and organizational risk tolerance routinely prevent centralized aggregation. Meanwhile, sheer data gravity makes even permitted transfers slow, expensive, and fragile at scale.

The latest version of NVIDIA FLARE addresses this reality with a Federated Learning (FL) computing runtime that moves the training logic to the data, while raw data stays put. See examples of how to leverage PyTorch in a federated learning system. Read the full post:


Cached at: 05/22/26, 09:45 AM


Federated Learning Without the Refactoring Overhead Using NVIDIA FLARE

Source: https://developer.nvidia.com/blog/federated-learning-without-the-refactoring-overhead-using-nvidia-flare/

Federated learning (FL) is no longer a research curiosity. It is a practical response to a hard constraint: the most valuable data is often the least movable. Regulatory boundaries, data sovereignty rules, and organizational risk tolerance routinely prevent centralized aggregation. Meanwhile, sheer data gravity makes even permitted transfers slow, expensive, and fragile at scale.

The latest version of NVIDIA FLARE addresses this reality with a federated computing runtime that moves the training logic to the data, while raw data stays put. In high-stakes environments, centrally aggregating data is often not possible or practical, so a modern federated platform must treat data isolation, compliance, and privacy-enhancing technologies as first-class requirements.

What has historically slowed adoption isn’t the concept of FL—it’s the developer experience. If the path from “my local script trains” to “my job runs across federated sites” requires deep refactoring, new class hierarchies, or brittle configuration, many projects stall after the pilot.

The FLARE API evolution targets exactly that: eliminating the refactoring overhead by splitting the work into two concrete steps that map cleanly onto how teams actually build and ship ML systems:

**Step 1 (client API):** Turn an existing local training script into a federated client with ~5–6 lines of code, without changing your training loop structure.
**Step 2 (job recipes):** Select the FL workflow and bind it to your client training script, then run the same job across simulation, PoC, and production by swapping only the execution environment.

‘No data copy’ as a system requirement

In regulated or high-sensitivity settings, “just centralize the dataset” is increasingly off the table. A practical federated computing platform needs to support:

**No data copy:** Data stays local, and only model updates (or equivalent signals) move.
**Compliance posture:** Deployment and governance controls that support sovereignty and audit requirements.
**Privacy-enhancing techniques:** Multiple layers of defenses (examples include homomorphic encryption, differential privacy, and confidential computing).

Figure 1. Federated computing keeps data in place, enabling collaboration through model updates while supporting compliance and privacy-enhancing protections.
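
To make the "only model updates move" idea concrete, here is a minimal sketch in plain Python of a site computing an update locally and then clipping it and adding Gaussian noise before sharing, a pattern commonly used in differentially private training. The function name `clip_and_noise` and all values are illustrative assumptions; this is not FLARE's privacy API, and choosing the noise level for a formal privacy guarantee is outside the scope of the sketch.

```python
import math
import random


def clip_and_noise(update, clip_norm, noise_std, rng):
    """Clip an update to a maximum L2 norm, then add Gaussian noise."""
    norm = math.sqrt(sum(v * v for v in update))
    # Scale down only when the update exceeds the clipping threshold.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [v * scale for v in update]
    return [v + rng.gauss(0.0, noise_std) for v in clipped]


# The site derives an update from its own data; only this vector is shared.
global_weights = [0.10, -0.20, 0.30]
local_weights = [0.70, 0.60, 0.30]   # result of local training (made-up values)
update = [l - g for l, g in zip(local_weights, global_weights)]

rng = random.Random(0)
# noise_std=0.0 isolates the clipping step; a real setup would use noise_std > 0.
shared = clip_and_noise(update, clip_norm=0.5, noise_std=0.0, rng=rng)
print(shared)
```

The raw records never appear in `shared`; the other side sees only a bounded, optionally noised weight difference.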

The refactoring cliff: Why FL projects stall

Teams typically hit one of two cliffs after the pilot:

**The code cliff:** Converting working PyTorch/TensorFlow/Lightning training into FL can require invasive restructuring: new abstractions, messaging glue, and framework-specific scaffolding.
**The lifecycle cliff:** Even when simulation works, moving to PoC and production triggers rewrites via job redefinition, reconfiguration, and environment-specific branching.

FLARE flattens both cliffs by standardizing the workflow into two steps:

Make your script federated (client API)
Execute it as a portable job (job recipe)

The intended experience is explicitly to combine these so you can go from zero to an operational federated job quickly.

Step 1: Convert your local training script into a federated client (client API)

Who it’s for: Practitioners and ML engineers with existing training code who want the smallest possible difference.

The mental model is intentionally simple:

Initialize the client runtime
Loop while the job is running
Receive the current global model
Train locally (your code)
Send updated weights + metrics back
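
The control flow above can be sketched without any FL library. In the example below, `FakeRuntime` is a hypothetical in-memory stand-in for the client runtime (it is not part of NVFlare) and `train_locally` is a toy training step, so the loop structure can be run and inspected on its own.

```python
class FakeRuntime:
    """Hypothetical in-memory stand-in for a federated client runtime."""

    def __init__(self, num_rounds, initial_weights):
        self.rounds_left = num_rounds
        self.global_weights = list(initial_weights)
        self.history = []

    def is_running(self):
        return self.rounds_left > 0

    def receive(self):
        # Hand the client a copy of the current global model.
        return list(self.global_weights)

    def send(self, weights, metrics):
        # With a single client, the "aggregate" is just that client's weights.
        self.global_weights = list(weights)
        self.history.append(metrics)
        self.rounds_left -= 1


def train_locally(weights, target, lr=0.5):
    """Toy training: one gradient step on 0.5 * (w - target)^2 per weight."""
    new_weights = [w - lr * (w - t) for w, t in zip(weights, target)]
    loss = sum((w - t) ** 2 for w, t in zip(new_weights, target)) / len(target)
    return new_weights, loss


runtime = FakeRuntime(num_rounds=3, initial_weights=[0.0, 0.0])  # initialize
local_target = [1.0, -1.0]  # stands in for private local data

while runtime.is_running():                               # loop while running
    weights = runtime.receive()                           # receive global model
    weights, loss = train_locally(weights, local_target)  # train locally
    runtime.send(weights, {"loss": loss})                 # send weights + metrics

print(runtime.global_weights, runtime.history)
```

The training function knows nothing about federation; only the three calls around it do, which is the property the client API aims for.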

FLARE’s client API is designed for minimal code changes and avoids forcing you into heavy “Executor/Learner” inheritance—use the FLModel structure or simple data exchange to communicate with the runtime.

Example 1a: Convert PyTorch to FLARE

Below is a concrete pattern you can apply to many scripts. The key touchpoints are: flare.init(), flare.receive(), loading model weights, and flare.send() with updated weights and metrics.

The local training script (train.py) is shown first, followed by the federated version (client.py), which adds four touchpoints: the import, flare.init(), flare.receive(), and flare.send().

train.py

# train.py

import torch
import torchvision
import torchvision.transforms as transforms

from model import Net

batch_size = 4
epochs = 1
lr = 0.01
model = Net()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
loss = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
transform = transforms.Compose(
   [
       transforms.ToTensor(),
       transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
   ]
)

train_dataset = torchvision.datasets.CIFAR10(
   root="/tmp/data/cifar10", transform=transform, download=True, train=True
)

trainloader = torch.utils.data.DataLoader(
   train_dataset, batch_size=batch_size, shuffle=True
)

model.to(device)

for epoch in range(epochs):
   running_loss = 0.0

   for i, batch in enumerate(trainloader):
       images, labels = batch[0].to(device), batch[1].to(device)

       optimizer.zero_grad()

       predictions = model(images)
       cost = loss(predictions, labels)
       cost.backward()
       optimizer.step()

       running_loss += cost.cpu().detach().numpy() / batch_size

       if i % 3000 == 2999:
           print(
               f"Epoch: {epoch + 1}/{epochs}, batch: {i + 1}, Loss: {running_loss / 3000}"
           )
           running_loss = 0.0

   print(
       f"Epoch: {epoch + 1}/{epochs}, batch: {i + 1}, Loss: {running_loss / (i + 1)}"
   )

print("Finished Training")

torch.save(model.state_dict(), "./cifar_net.pth")

client.py

# client.py

# 1. Import client API
import nvflare.client as flare
import torch
import torchvision
import torchvision.transforms as transforms

from model import Net

batch_size = 4
epochs = 1
lr = 0.01
model = Net()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
loss = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
transform = transforms.Compose(
   [
       transforms.ToTensor(),
       transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
   ]
)

train_dataset = torchvision.datasets.CIFAR10(
   root="/tmp/data/cifar10", transform=transform, download=True, train=True
)

trainloader = torch.utils.data.DataLoader(
   train_dataset, batch_size=batch_size, shuffle=True
)

# 2. Initialize FLARE
flare.init()

# At each round while FLARE is running
while flare.is_running():
   # 3. Receive the global model
   input_model = flare.receive()

   # 4. Load global model
   model.load_state_dict(input_model.params)
   model.to(device)

   for epoch in range(epochs):
       running_loss = 0.0

       for i, batch in enumerate(trainloader):
           images, labels = batch[0].to(device), batch[1].to(device)

           optimizer.zero_grad()

           predictions = model(images)
           cost = loss(predictions, labels)
           cost.backward()
           optimizer.step()

           running_loss += cost.cpu().detach().numpy() / batch_size

           if i % 3000 == 2999:
               print(
                   f"Epoch: {epoch + 1}/{epochs}, batch: {i + 1}, Loss: {running_loss / 3000}"
               )
               running_loss = 0.0

       print(
           f"Epoch: {epoch + 1}/{epochs}, batch: {i + 1}, Loss: {running_loss / (i + 1)}"
       )

   print("Finished Training")

   torch.save(model.state_dict(), "./cifar_net.pth")

   # 5. Send back the updated model
   output_model = flare.FLModel(
       params=model.cpu().state_dict(),
       meta={"NUM_STEPS_CURRENT_ROUND": len(trainloader) * epochs},
   )
   flare.send(output_model)
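
The `NUM_STEPS_CURRENT_ROUND` entry in `meta` reports how many local optimization steps the site ran in the round. One plausible use, assumed here rather than stated in the article, is as an aggregation weight so that sites that did more work count for more. The sketch below shows such a weighted average in plain Python; `weighted_average` and the sample values are illustrative and not FLARE's aggregator.

```python
def weighted_average(updates):
    """Average parameter dicts, weighting each by its reported step count.

    `updates` is a list of (params, num_steps) pairs, where params maps a
    parameter name to a list of floats.
    """
    total_steps = sum(steps for _, steps in updates)
    if total_steps <= 0:
        raise ValueError("total step count must be positive")

    averaged = {}
    for name in updates[0][0]:
        size = len(updates[0][0][name])
        averaged[name] = [
            sum(params[name][i] * steps for params, steps in updates) / total_steps
            for i in range(size)
        ]
    return averaged


site_a = ({"fc.weight": [1.0, 2.0]}, 100)  # 100 local steps
site_b = ({"fc.weight": [3.0, 6.0]}, 300)  # 300 local steps
global_params = weighted_average([site_a, site_b])
print(global_params)
```

With these inputs the second site contributes three quarters of the result, because it reported three times as many steps.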

Example 1b: PyTorch Lightning client

The Lightning integration keeps the same intent (receive global model, train, send updates) but exposes it in a Lightning-friendly way: import the Lightning client adapter and patch the Trainer.

The typical flow is: import, patch, (optional) validate, train as usual.

# lightning_client.py
import pytorch_lightning as pl
from pytorch_lightning import Trainer

import nvflare.client.lightning as flare  # Lightning Client API  

from model import LitNet
from data import CIFAR10DataModule


def main():
   model = LitNet()
   dm = CIFAR10DataModule()

   trainer = Trainer(max_epochs=1, accelerator="gpu", devices=1)

   # Patch trainer to participate in FL
   flare.patch(trainer)

   while flare.is_running():
       # Optional: validate current global model (useful for server-side selection flows)
       trainer.validate(model, datamodule=dm)

       # Train starting from received global model (handled internally after patch)
       trainer.fit(model, datamodule=dm)

if __name__ == "__main__":
   main()

The point: Lightning users don’t have to drop into custom federated messaging—they keep the Trainer abstraction and still participate correctly in FL rounds.

Step 2: Package and execute the federated job anywhere (job recipes)

**Who it’s for:** Data scientists and applied teams who want a code-first job definition that remains stable across environments.

After step 1, you have a federated client script. Step 2 makes it a federated job you can run repeatedly and move through the lifecycle cleanly.

Job recipes are designed to replace JSON-based job configuration with a Python-based job definition:

**Code-first:** Define complete FL jobs in Python, not complex config files
**Write once, run anywhere:** Same recipe runs in simulator, PoC, or production
**Speed to deployment:** Go from experimentation to deployment without changing code structure

Example 2a: Execute a FedAvg recipe in simulation

The key linkage is that your recipe references the client training script you created in step 1 (e.g., train_script="client.py"), then you execute it in an environment.

# job.py
from nvflare.app_common.workflows.job import FedAvgRecipe
from nvflare.job_config import SimEnv  # exact import path can vary by NVFlare version

from model import SimpleNetwork

def main():
   n_clients = 3
   num_rounds = 5
   batch_size = 32

   recipe = FedAvgRecipe(
       name="hello-pt",
       min_clients=n_clients,
       num_rounds=num_rounds,
       model=SimpleNetwork(),
       train_script="client.py",  # <-- Step 1 script
       train_args=f"--batch_size {batch_size} --epochs 1",
   )

   env = SimEnv(num_clients=n_clients, num_threads=n_clients)
   recipe.execute(env=env)

if __name__ == "__main__":
   main()

This is the “write once” idea in practice: Once the recipe correctly references your client script, the rest becomes an execution concern.
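
Note that the recipe passes `--batch_size` and `--epochs` through `train_args`, while the client.py in Example 1a hard-codes both values. For the arguments to take effect, the client script would need to parse them. A minimal sketch with `argparse` follows, assuming the `train_args` string reaches the script as ordinary command-line arguments; the defaults mirror the hard-coded values from Example 1a.

```python
import argparse


def parse_args(argv=None):
    """Parse the options passed through the recipe's train_args string."""
    parser = argparse.ArgumentParser(description="Federated client training script")
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--epochs", type=int, default=1)
    return parser.parse_args(argv)


# Simulates the arguments from train_args="--batch_size 32 --epochs 1".
args = parse_args("--batch_size 32 --epochs 1".split())
print(args.batch_size, args.epochs)
```

In client.py, `batch_size = args.batch_size` and `epochs = args.epochs` would then replace the two constants at the top of the script.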

Example 2b: Move from simulation to real-world with an environment swap

Job recipes formalize a progressive workflow by swapping the execution environment:

**SimEnv (Simulation):** Easy development, rapid debugging
**PocEnv (Proof-of-Concept):** Local runtime, multi-process, realistic testing
**ProdEnv (Production):** Distributed deployment on secure, scalable infrastructure

Figure 2. One JobRecipe, multiple execution environments: Debug in SimEnv, validate in PocEnv, and deploy in ProdEnv without rewriting the job definition.
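
The article gives no code for the swap itself, so the following is only a sketch of the pattern. Every class here (`FakeSimEnv`, `FakePocEnv`, `FakeProdEnv`, `FakeRecipe`) is a local, hypothetical stand-in, and the constructor arguments and the startup path are made up; consult the JobRecipe docs for the real environment classes and their parameters.

```python
from dataclasses import dataclass


@dataclass
class FakeSimEnv:  # hypothetical stand-in, not the NVFlare class
    num_clients: int

    def describe(self):
        return f"simulation with {self.num_clients} in-process clients"


@dataclass
class FakePocEnv:  # hypothetical stand-in
    num_clients: int

    def describe(self):
        return f"local PoC with {self.num_clients} client processes"


@dataclass
class FakeProdEnv:  # hypothetical stand-in
    startup_dir: str

    def describe(self):
        return f"production deployment using {self.startup_dir}"


@dataclass
class FakeRecipe:  # hypothetical stand-in for a job recipe
    name: str
    train_script: str

    def execute(self, env):
        # The job definition is identical for every environment.
        return f"{self.name} ({self.train_script}) -> {env.describe()}"


def make_env(stage):
    """Pick an environment by stage name; the recipe itself never changes."""
    envs = {
        "sim": FakeSimEnv(num_clients=3),
        "poc": FakePocEnv(num_clients=3),
        "prod": FakeProdEnv(startup_dir="/path/to/admin/startup"),
    }
    return envs[stage]


recipe = FakeRecipe(name="hello-pt", train_script="client.py")
for stage in ("sim", "poc", "prod"):
    print(recipe.execute(make_env(stage)))
```

The only value that changes between stages is the argument to `make_env`, which could come from a command-line flag or an environment variable.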

Getting started

Start with a script you already trust.
**Step 1:** Add the client API handshake (or patch your Lightning Trainer).
**Step 2:** Wrap it in a job recipe and execute first in simulation, then PoC, then production by swapping environments.

FLARE in the News

FLARE is showing up in real deployments, from Eli Lilly TuneLab’s federated learning platform (built by Rhino Federated Computing using NVFlare) to Taiwan MOHW’s national healthcare federated learning initiative and a Tri-labs (Sandia/LANL/LLNL) federated AI pilot across sensitive datasets.

Going further

Start with a script you already trust. Add the minimal FLARE client handshake (receive → train → send). Then scale from single-node simulation to multi-site deployment when you’re ready.

**Start here:** Hello World examples (fastest path to your first federated run): NVFlare Hello World
**Watch the walkthrough:** See the simplified API stack in action: Webinar recording
Client API docs
JobRecipe docs
NVFlare on GitHub
