PrivFusion: A Privacy-preserving Multi-Agent Framework for Harmonizing Distributed Datasets

arXiv cs.LG 05/26/26, 04:00 AM Papers
Summary
PrivFusion is a privacy-preserving multi-agent framework that automates the harmonization of structured datasets across institutions before federated training, reducing manual effort and enabling collaborative analytics on sensitive clinical data.
arXiv:2605.24249v1 Announce Type: new
# PrivFusion: A Privacy-preserving Multi-Agent Framework for Harmonizing Distributed Datasets
Source: [https://arxiv.org/html/2605.24249](https://arxiv.org/html/2605.24249)
###### Abstract

The growing availability of clinical data has increased the use of machine learning, yet centralized data aggregation is often infeasible for sensitive health information\. Federated Learning \(FL\) offers a distributed alternative, but its adoption is limited by substantial heterogeneity across institutional datasets, making harmonization a critical but frequently overlooked prerequisite for multi\-site analytics\. We introduce PrivFusion, a privacy\-preserving multi\-agent framework that automates the harmonization of structured datasets prior to federated training\. PrivFusion uses agents to analyze local data, cluster semantically similar features across sites, and provide iterative transformation recommendations until alignment is achieved\. Evaluation across four heterogeneous COVID\-19 datasets demonstrates that PrivFusion effectively and efficiently harmonizes multi\-site data while substantially reducing manual effort\.

## I Introduction

The volume and complexity of data generated across clinical, biomedical, and public health settings continue to grow rapidly\. This has led to the adoption of Machine Learning \(ML\) methods for tasks such as risk prediction, phenotyping, and cohort discovery \[[6](https://arxiv.org/html/2605.24249#bib.bib47)\]\. However, traditional ML pipelines typically rely on centralized data aggregation, which is often infeasible for sensitive information such as electronic health records or financial data due to privacy, security, and governance constraints\. As a result, many healthcare institutions remain unable or unwilling to share raw data, limiting opportunities for large\-scale, multi\-site analytics\.

Federated Learning \(FL\) has emerged as an alternative that enables collaborative model training across institutions while keeping data on\-premises \[[3](https://arxiv.org/html/2605.24249#bib.bib48)\]\. Although FL has demonstrated strong potential in various areas, deploying FL in the real world remains challenging\. A central barrier is data heterogeneity across institutions\. Clinical datasets often differ in feature definitions, coding systems, granularity, measurement practices, and data quality\. Before any distributed analysis can be performed, these datasets must be harmonized, a labor\-intensive process that requires aligning variables, resolving semantic inconsistencies, and standardizing representations \[[20](https://arxiv.org/html/2605.24249#bib.bib45)\]\. Yet most FL methodologies implicitly assume that harmonization has been completed prior to model training, an assumption that is not realistic\. As a result, harmonization has become a major bottleneck, delaying or limiting the scalability of federated studies\.

To mitigate these challenges, several studies have explored ontology\-driven and semantic techniques for data harmonization\[[19](https://arxiv.org/html/2605.24249#bib.bib49),[5](https://arxiv.org/html/2605.24249#bib.bib50),[10](https://arxiv.org/html/2605.24249#bib.bib51)\]\. While these approaches alleviate aspects of data heterogeneity, they often require some degree of centralization or rely on predefined common formalisms and mappings, which can be difficult to establish across institutions with differing workflows and data models\. These limitations underscore the need for automated, scalable, and privacy\-preserving harmonization methods that operate in distributed environments without centralizing sensitive data\. Such capabilities would substantially reduce the manual burden on participating institutions, broaden access to collaborative analytics, and ultimately improve the generalizability of resulting models\.

To address this gap, we introduce PrivFusion \(our code is available at https://github\.com/IBM/PrivFusion\), a privacy\-preserving multi\-agent framework designed to harmonize structured datasets across institutions prior to federated model training\. In PrivFusion, each participating site \(a *researcher*\) first performs a local analysis of its dataset using a suite of agents that extract data types, generate dataset\- and feature\-level descriptions, infer relationships among features, and generate a small set of synthetic samples\. Sites then share this metadata with a central server \(the *aggregator*\)\. The server uses the received metadata to cluster semantically related features across datasets and, based on these clusters, invokes an additional agent to generate harmonization recommendations for each site\. These recommendations specify which features require transformation and the target representations needed for alignment\. Each site applies the proposed transformations locally, after which the updated metadata is resent to the server\. This iterative process continues until no additional transformations are required\. We evaluate PrivFusion on four real\-world COVID\-19 datasets originating from different countries and find that it typically achieves harmonization within 2–3 iterations across dataset pairs, even when varying the underlying Large Language Model \(LLM\)\. We further observe a consistent increase in feature\-name similarity over successive iterations, indicating progressive convergence of the schemas\.

## II System and Threat Models

In this section, we introduce our system and threat models\.

System model\. We consider a system with two types of parties: \(i\) two or more researchers, and \(ii\) a server\. The researchers’ goal is to collaboratively train a medical diagnostic model with high\-quality data\. To do so, they need to ensure that their data is harmonized in a privacy\-preserving way\. All computations are outsourced to the server\. To harmonize the features across the datasets, each researcher provides some metadata, as discussed in Section [III](https://arxiv.org/html/2605.24249#S3), to the server\. Using the received metadata, the server determines for each dataset which features require transformations and the target representation, and sends these recommendations to each researcher\. The researchers transform their datasets using the received recommendations and conduct their desired study\.

Threat model\. We assume honest researchers with legitimate research datasets in an honest\-but\-curious setting, where the server follows the protocol but might leverage shared data and metadata to infer sensitive information about the individuals represented in the datasets\. Known attacks of this kind include membership inference \[[23](https://arxiv.org/html/2605.24249#bib.bib17),[2](https://arxiv.org/html/2605.24249#bib.bib19)\] and attribute inference attacks \[[14](https://arxiv.org/html/2605.24249#bib.bib18)\]\. In a membership inference attack, the attacker \(the server\) aims to determine whether a target individual is part of the dataset or not\. In an attribute inference attack, the attacker’s goal is to infer additional sensitive information about a target individual given the observed attributes\.

## III Proposed Framework

To harmonize the features across different datasets, the proposed approach requires researchers to share information about their respective datasets and samples with the server in a privacy\-preserving manner\. The researchers’ goal is to provide metadata useful for determining the required dataset transformations\. At the same time, they want to ensure that the metadata does not increase the baseline privacy risk of sharing aggregate statistics about the dataset\. Let $S$ denote the server, $D^{i}$ represent the original dataset of researcher $R^{i}$, and $M^{i}$ the metadata that each researcher sends to the server\. In the following, we describe the different steps of the proposed framework, shown in [Figure 1](https://arxiv.org/html/2605.24249#S3.F1)\.

![Refer to caption](https://arxiv.org/html/2605.24249v1/x1.png)

Figure 1: An overview of PrivFusion\.

Dataset Analyzer\. Initially, each researcher locally extracts the feature data types \(e\.g\., string, numeric, single\-precision floating point\) for their dataset\. Moreover, each feature is associated with a semantic type using advanced data classification techniques \[[22](https://arxiv.org/html/2605.24249#bib.bib13),[4](https://arxiv.org/html/2605.24249#bib.bib14)\] and by mapping against semantic concepts within a selected general\-purpose ontology\. In the following, we use DBpedia as our reference ontology\. After that, the researchers use an agent $A_{D}$ to generate a description of the dataset\. Given the dataset description and feature data types, each researcher uses another agent, $A_{F}$, to obtain a semantic description of the features\. In addition, each researcher uses an agent $A_{T}$ to extract a list of relevant topics represented in their dataset $D^{i}$\. To identify feature dependencies, they use an agent $A_{R}$ that infers relationships between the features\. Finally, each researcher generates $n$ low\-utility synthetic samples, which mainly guide the server in harmonizing the features across datasets in the next steps\. The researchers’ goal is to generate synthetic samples that maintain the format of the features in their respective datasets and whose values are within the domain\. Note that, to generate synthetic samples, the researchers can use any state\-of\-the\-art Differential Privacy \(DP\) approach\.

At the end of this stage, each researcher has generated metadata $M^{i}$ consisting of \(i\) a description of the dataset, \(ii\) feature names and data types, \(iii\) semantic descriptions of individual features, \(iv\) a list of dataset topics, and \(v\) inferred relationships among features\. In addition to these, each researcher shares $n$ synthetically generated samples\. Each researcher $R^{i}$ sends their prepared metadata $M^{i}$ to the server $S$, which determines which features need to be transformed and their granularity, and then sends this information back to the researchers\.
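As a rough illustration of the metadata bundle described above, the sketch below models its five components plus the synthetic samples as Python dataclasses. The class and field names (`FeatureInfo`, `SiteMetadata`, `to_json`) and the example values are hypothetical and are not taken from the PrivFusion code base; the DBpedia URI is only an illustrative placeholder.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FeatureInfo:
    """Hypothetical per-feature record shared with the server."""
    name: str
    data_type: str
    semantic_type: str  # e.g., a DBpedia URI (illustrative placeholder below)
    description: str


@dataclass
class SiteMetadata:
    """Hypothetical container for the metadata one researcher sends."""
    dataset_description: str
    features: List[FeatureInfo]
    topics: List[str]
    relationships: List[Tuple[str, str, str]]  # (feature_a, relation, feature_b)
    synthetic_samples: List[Dict] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recursively converts nested dataclasses to plain dicts.
        return json.dumps(asdict(self))


meta = SiteMetadata(
    dataset_description="Daily COVID-19 case counts (illustrative).",
    features=[
        FeatureInfo("Date", "string",
                    "http://dbpedia.org/resource/Calendar_date",
                    "Reporting date"),
    ],
    topics=["COVID-19", "epidemiology"],
    relationships=[("Total Cases", "reported_on", "Date")],
    synthetic_samples=[{"Date": "3/28/2020", "Total Cases": 10}],
)
payload = meta.to_json()
```

Only `payload` would leave the site in this sketch; the original dataset never appears in the structure.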

Clustering Features\. By examining the metadata $M^{i}$ \(specifically, feature names, data types, and semantic types\), the server clusters the features across the datasets, grouping those that are semantically similar or related and are likely candidates for merging into a single feature\. This is achieved through an agent $A_{C}$, which clusters features that share similar semantics and are candidates for consolidation\. The agent outputs a set of clusters $C$, where each cluster $C_{j}$ consists of a cluster identifier, feature name, and its dataset name\. Features within the same cluster share a cluster ID, indicating potential alignment across datasets\. Thus, clusters contain one or more features\.
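To make the cluster output format concrete, the sketch below groups a flat list of (cluster identifier, feature name, dataset name) records by cluster ID and picks out the clusters that span more than one dataset. The assignments are written by hand for illustration and the function name `group_clusters` is hypothetical; the clustering agent itself is not modeled here.

```python
from collections import defaultdict

# Hand-written, hypothetical clustering output:
# (cluster_id, feature_name, dataset_name)
assignments = [
    (0, "Date", "IDN"),
    (0, "date", "AFG"),
    (1, "Total Cases", "IDN"),
    (1, "total_cases", "AFG"),
    (2, "population_density", "AFG"),
]


def group_clusters(rows):
    """Group (cluster_id, feature, dataset) rows by cluster identifier."""
    clusters = defaultdict(list)
    for cluster_id, feature, dataset in rows:
        clusters[cluster_id].append((dataset, feature))
    return dict(clusters)


clusters = group_clusters(assignments)

# Clusters whose members come from more than one dataset are the
# candidates for alignment; singleton clusters hold dataset-specific features.
candidates = {
    cid: members
    for cid, members in clusters.items()
    if len({dataset for dataset, _ in members}) > 1
}
```

Here clusters 0 and 1 are alignment candidates, while cluster 2 contains a single feature, matching the statement that clusters contain one or more features.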

Transformations Recommender\. Given the clusters obtained in the previous step and the metadata $M^{i}$ provided by each researcher \(in particular, the dataset description, the features’ semantic types \(i\.e\., DBpedia URIs\) and semantic descriptions, the feature relationships, and the synthetic samples\), the server invokes another agent $A_{H}$ to determine which features within each dataset should be combined, removed, or modified\. For each cluster, $A_{H}$ compares the DBpedia URIs and semantic feature descriptions to converge on a representative DBpedia URI that best reflects the shared concept across datasets, while preserving an appropriate level of granularity\. The outcome is a set of transformation recommendations for each researcher, specifying which features require modification and the target representation, which the server $S$ sends to the respective parties\.

Applying Transformations\. Upon receiving the transformation instructions, each researcher $R^{i}$ applies the recommended transformations to their dataset\. Researchers can implement these transformations directly using Python scripts or leverage a code generation LLM $M_{C}$ to automatically produce the transformation functions\. Once the transformations are applied, all participating datasets are harmonized to a common semantic and structural format\. This alignment enables collaborative development and training of high\-quality diagnostic models across institutions\. These steps are repeated until the server does not suggest further transformations or the maximum number of iterations $T$ is reached\.
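The stopping rule described above (repeat until the server suggests nothing further or the iteration cap is reached) can be sketched as a small control loop. This is only an illustration of the control flow: `harmonize`, `recommend`, and `apply` are hypothetical names, and the two callables stand in for the server-side agents and the site-side transformation step, which are LLM-driven in the framework.

```python
def harmonize(metadata, recommend, apply, max_iters=20):
    """Repeat recommend/apply rounds until no transformations remain.

    metadata:  dict mapping site name -> site metadata (any structure)
    recommend: hypothetical server-side callable,
               metadata -> {site: [transformations]}
    apply:     hypothetical site-side callable,
               (site_metadata, transformations) -> updated site metadata
    Returns the final metadata and the number of rounds in which
    transformations were applied.
    """
    for iteration in range(max_iters):
        recs = recommend(metadata)
        if not any(recs.values()):
            return metadata, iteration  # converged: nothing left to change
        metadata = {
            site: apply(site_meta, recs.get(site, []))
            for site, site_meta in metadata.items()
        }
    return metadata, max_iters  # cap reached without convergence


# Toy stand-ins: metadata is a list of feature names per site, and the
# only "transformation" is lowercasing a feature name.
def toy_recommend(metadata):
    return {site: [f for f in feats if f != f.lower()]
            for site, feats in metadata.items()}


def toy_apply(feats, to_lower):
    return [f.lower() if f in to_lower else f for f in feats]


final, rounds = harmonize({"IDN": ["Date"], "AFG": ["date"]},
                          toy_recommend, toy_apply)
```

In the toy run, one round renames `Date` to `date` for IDN, and the second call to `toy_recommend` returns no transformations, so the loop stops.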

## IV Evaluation

Our evaluation of PrivFusion focuses on two key aspects: \(i\) the number of iterations required to reach convergence and \(ii\) the number of feature\-level transformations recommended during the harmonization process\. We measure feature similarity between datasets at each iteration to showcase harmonization progress and examine how the choice of the underlying LLM influences these outcomes\.

Datasets\. We use four publicly available COVID\-19 datasets: \(i\) COVID\-19 Indonesia \(IDN; [https://www\.kaggle\.com/dsv/4214699](https://www.kaggle.com/dsv/4214699)\), \(ii\) COVID\-19 Afghanistan \(AFG; [https://www\.kaggle\.com/datasets/georgesaavedra/covid19\-dataset](https://www.kaggle.com/datasets/georgesaavedra/covid19-dataset)\), \(iii\) COVID\-19 Italy \(IT; [https://www\.kaggle\.com/datasets/sudalairajkumar/covid19\-in\-italy](https://www.kaggle.com/datasets/sudalairajkumar/covid19-in-italy)\), and \(iv\) COVID\-19 US \(US; [https://www\.kaggle\.com/dsv/13711559](https://www.kaggle.com/dsv/13711559)\)\. Each dataset contains information about COVID\-19 cases in the specified country \(e\.g\., the total number of coronavirus cases on a given date in a given location\)\. These datasets share some features in different representations, such as date, ISO code, and location\. Note that we removed some of the unique features from the AFG and IDN datasets in order to focus the analysis on features that may be aligned\.

Date\. All four datasets include a date feature, but use different formats\. IDN records dates as `mm/dd/yyyy` \(`Date`\), while AFG and US use the format `yyyy-mm-dd` \(`date`\)\. IT provides the most detailed representation, following the ISO\-8601 timestamp format `YYYY-MM-DDTHH:MM:SS` \(`date`\)\.

Location\. Location granularity and encoding differ widely\. US reports `county`\- and `state`\-level identifiers\. IDN uses ISO codes \(`Location ISO Code`\) and province\-level labels \(`Location`\)\. AFG describes location using three features: `iso_code`, `continent`, and `location`\. IT offers the finest granularity, including `Country`, `RegionCode`, `RegionName`, `ProvinceName`, `ProvinceAbbreviation`, `Lattitude`, and `Longitude`\.

Case counts\. All datasets report the total number of COVID\-19 cases, but under different feature names: `TotalPositiveCases` \(Italy\), `cases` \(US\), `total_cases` \(Afghanistan\), and `Total Cases` \(Indonesia\)\.

TABLE I: Synthetic samples generated from COVID\-19 datasets for Afghanistan, Indonesia, Italy, and US: \(a\) COVID\-19 Afghanistan, \(b\) COVID\-19 Indonesia, \(c\) COVID\-19 Italy, \(d\) COVID\-19 US\.

[Table I](https://arxiv.org/html/2605.24249#S4.T1) shows a snapshot of synthetic samples generated from each dataset\. A brief inspection of these samples reveals that, despite having several conceptually aligned features, harmonizing them requires substantial pre\-processing\. Aligning semantically equivalent columns ranges from simple operations, such as standardizing feature names \(e\.g\., converting `Date` to `date`\), to more complex tasks, such as aggregating multiple location\-related features and selecting a consistent level of granularity \(e\.g\., reducing province\-level information to the country level\)\. These challenges underscore the need for a reasoning\-driven alignment process capable of interpreting semantic meaning rather than relying solely on string matching or manual rules\. Moreover, discrepancies in representation formats, such as ISO codes or GPS coordinates, require dataset\-specific transformations that must be consistently applied across sites to enable meaningful harmonization\.
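The simplest operation in that range, standardizing feature names, can be sketched with a small rule-based helper. The function `to_snake_case` is a hypothetical illustration and not part of PrivFusion; it is included mainly to show where purely syntactic rules stop being sufficient.

```python
import re


def to_snake_case(name: str) -> str:
    """Normalize a feature name to snake_case (illustrative helper)."""
    # Split camel case: "TotalPositiveCases" -> "Total_Positive_Cases".
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse spaces and other non-alphanumeric runs into one underscore.
    s = re.sub(r"[^A-Za-z0-9]+", "_", s)
    return s.strip("_").lower()


# Case-count feature names reported for the four datasets.
names = ["TotalPositiveCases", "cases", "total_cases", "Total Cases"]
normalized = [to_snake_case(n) for n in names]
```

After normalization, `Total Cases` and `total_cases` coincide, but `TotalPositiveCases` becomes `total_positive_cases` and `cases` stays `cases`, so the four names still do not all match. Recognizing that they denote the same concept requires semantic information rather than string rules.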

Models\. Each agent in PrivFusion can either rely on a shared large language model \(LLM\) or use specialized LLMs tailored to its specific task \(e\.g\., use a coding LLM for the code generation of transformations\)\. In our experiments, we evaluate GPT\-OSS\-120b \[[1](https://arxiv.org/html/2605.24249#bib.bib33)\] \(GPT\-120b\) and llama\-3\-3\-70b\-instruct \(Llama\-70b\) \[[12](https://arxiv.org/html/2605.24249#bib.bib53)\] for dataset analysis, clustering features, and recommending transformations\. For code generation, we evaluate the general\-purpose GPT\-120b and a code generation LLM, deepseek\-coder\-33b\-instruct \(DSC\-33b\) \[[16](https://arxiv.org/html/2605.24249#bib.bib52)\]\.

Results\. We evaluate the performance of PrivFusion using multiple combinations of the four datasets\. To quantify the alignment efficiency, we use three metrics: \(i\) the total \# of iterations needed for convergence, \(ii\) the total \# of transformations recommended during the process, and \(iii\) the \# of common features, defined as features sharing the same name and DBpedia URI, after harmonization\. The maximum number of iterations is capped at $T=20$\. We first construct all possible combinations, yielding 11 dataset combinations, and focus our analysis on a subset of these\. [Table II](https://arxiv.org/html/2605.24249#S4.T2) shows an example of the harmonized output for the AFG\+IDN pair \(using GPT\-120b for all alignment steps\)\. Compared with the original schema \(shown in [Table I](https://arxiv.org/html/2605.24249#S4.T1)\), PrivFusion consistently standardized the feature names to snake case, harmonized ISO codes, and aligned date formats to match the COVID\-19 Indonesia convention \(see Listing 1 for an example\)\. The framework preserved dataset\-specific attributes when appropriate, for instance, retaining `population_density` from AFG, while removing features that could not be consistently represented across datasets, such as `Location` from IDN, which lacked a viable counterpart in AFG\.
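The number of dataset combinations follows from enumerating every subset of the four datasets that contains at least two of them: 6 pairs, 4 triplets, and 1 combination of all four. A minimal check, using the dataset abbreviations from above:

```python
from itertools import combinations

datasets = ["AFG", "IDN", "IT", "US"]

# All subsets with at least two datasets: pairs, triplets, and the full set.
combos = [
    combo
    for size in range(2, len(datasets) + 1)
    for combo in combinations(datasets, size)
]

# Count how many combinations exist for each subset size.
by_size = {
    size: sum(1 for combo in combos if len(combo) == size)
    for size in range(2, len(datasets) + 1)
}
```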

TABLE II: Example of AFG and IDN from `gpt_all`: \(a\) COVID\-19 Afghanistan, \(b\) COVID\-19 Indonesia\.

Listing 1: Example of GPT\-120b feature transformation\.

```python
import pandas as pd


def transform_date(value):
    """
    Convert a date string (e.g., '3/28/2020') into
    ISO-8601 format 'YYYY-MM-DD'.
    Returns the formatted string, or the original
    value if parsing fails.
    """
    try:
        if pd.isna(value):
            return value
        return pd.to_datetime(value).strftime('%Y-%m-%d')
    except Exception:
        return value
```

The results of all conducted experiments are summarized in [Table III](https://arxiv.org/html/2605.24249#S4.T3), which also shows that the number of aligned features can vary across model configurations for the same dataset combination\. For instance, in the IDN\-AFG pair, the number of aligned features ranges from 4 to 5 depending on the underlying LLM\. This variability becomes more obvious in the AFG\-IDN\-IT triplet, where configurations using Llama\-70b result in only 1\-2 common features\. When moving from pairwise to multi\-party harmonization, the number of required transformations increases by 10x, highlighting the growing complexity of aligning heterogeneous schemas\. The triplet and all\-datasets combinations are particularly demanding, requiring 49 and 78 transformations, respectively, showing that multi\-party harmonization scales non\-linearly in complexity\.

TABLE III: The total number of iterations and transformations applied to harmonize the datasets across various LLMs\.

To assess the degree of alignment, we compute the Jaccard Similarity \(JS\) \[[18](https://arxiv.org/html/2605.24249#bib.bib54)\] between feature names at each iteration\. This metric quantifies the proportion of shared features \(those with matching names and formats\) relative to the total \# of unique features across the two datasets\. Figure [2](https://arxiv.org/html/2605.24249#S4.F2) shows the JS increasing with each iteration for all models and dataset combinations\. Across all subplots, most configurations reach their peak JS within 3–6 iterations, indicating quick stabilization\. Most improvements occur in the first 2–4 iterations, after which curves flatten, highlighting the importance of early iterations\. `gpt_all` and `gpt_all_ds_code` typically stabilize around $\mathrm{JS} \in (0.4, 0.6)$, which is significantly lower than LLaMA\-based results\. We noticed that Llama\-70b tends to introduce new features, such as `Disease` filled with `COVID-19`, which explains the higher JS scores for LLaMA combinations\.
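For reference, the Jaccard similarity between two sets of feature names is the size of their intersection divided by the size of their union. The sketch below compares names only, whereas the metric described above also requires matching formats. The function name, the "after" schemas, and the convention of returning 0.0 for two empty schemas are assumptions made for illustration.

```python
def jaccard_similarity(features_a, features_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| over feature names."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention when both schemas are empty
    return len(a & b) / len(union)


# Original feature names mentioned for IDN and AFG: no exact matches.
idn_before = ["Date", "Location ISO Code", "Total Cases"]
afg_before = ["date", "iso_code", "total_cases"]
js_before = jaccard_similarity(idn_before, afg_before)

# Hypothetical schemas after harmonization (illustrative only):
# three shared names, one AFG-specific feature retained.
idn_after = ["date", "iso_code", "total_cases"]
afg_after = ["date", "iso_code", "total_cases", "population_density"]
js_after = jaccard_similarity(idn_after, afg_after)
```

In this toy case the score rises from 0.0 to 0.75. Retaining a dataset-specific feature such as `population_density` keeps the score below 1, which is consistent with the observation that preserving unique features lowers the JS.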

As part of its design, PrivFusion is instructed to preserve dataset\-specific features when they cannot be generated, inferred, or reliably aligned across sites\. This behavior explains the lower scores observed in some model combinations, where agents were unable to cluster or map certain features due to their uniqueness\. Conversely, for the AFG\+IDN pair, in the `afg_ind_llama_all_gpt_code` configuration, where Llama\-70b was used for all stages except code generation, the framework proposed a transformation that removed the `population_density` feature from AFG due to its uniqueness in the dataset \(see [Table I](https://arxiv.org/html/2605.24249#S4.T1) for a snapshot of the dataset\)\.

![Refer to caption](https://arxiv.org/html/2605.24249v1/imgs/afg_ind_js.png) \(a\) IDN \+ AFG
![Refer to caption](https://arxiv.org/html/2605.24249v1/imgs/it_ind_js.png) \(b\) IDN \+ IT
![Refer to caption](https://arxiv.org/html/2605.24249v1/imgs/afg_it_ind_js.png) \(c\) AFG \+ IT \+ IDN
![Refer to caption](https://arxiv.org/html/2605.24249v1/imgs/all_js.png) \(d\) IT \+ IDN \+ AFG \+ US

Figure 2: Jaccard Similarity with respect to the number of iterations for different LLMs\.

## V Discussion

Impact of LLMs\. Our evaluation revealed distinct performance characteristics across the tested LLMs\. For example, GPT\-120b demonstrated superior instruction adherence while maintaining creativity, often recognizing when further iterations were unnecessary, even in independent optimization cycles\. The experiment referenced as `gpt_all` required fewer iterations and transformations, favoring incremental updates over introducing new features\. In contrast, LLaMA exhibited higher creativity but occasionally misapplied normalization steps or introduced unnecessary changes \(e\.g\., renaming features or altering string cases\)\. While LLaMA performed well in data analysis tasks, it struggled to determine when the harmonization process should terminate\. DeepSeek consistently followed instructions and produced correct code\. However, it more frequently generated incomplete responses requiring manual completion \(e\.g\., placeholder lists with comments\)\. Both GPT\-120b and DeepSeek occasionally violated coding guidelines by prematurely replacing values with None before attempting transformations\. Finally, combining LLaMA with DeepSeek proved ineffective, causing persistent errors and non\-convergence within $T=20$ iterations\.

Privacy Analysis of PrivFusion\. Under the threat model \(Section [II](https://arxiv.org/html/2605.24249#S2)\), we assume an honest\-but\-curious setting in which the server and participating researchers follow the prescribed protocol, and no parties collude to extract additional information about any specific participant\. Within this context, we can analyze the steps executed by PrivFusion and conclude that no unintended information is leaked\. Each participant shares only high‑level “structural metadata” \(data types, semantic feature descriptions\) and a small set of synthetic samples\. Researchers fully control synthetic sample generation, and PrivFusion requires only syntactically valid, not statistically faithful, examples\. This allows correlation‑free generation that lowers re‑identification risk \[[14](https://arxiv.org/html/2605.24249#bib.bib18)\]\. Each researcher receives only the transformation instructions for their own dataset, preventing visibility into others’ data characteristics\. While the transformation instructions may reveal the overall level of granularity or formatting chosen during harmonization, this information reflects an aggregate view across datasets rather than any single researcher’s contribution\. Because the transformations are produced as a collective abstraction \(coordinated by the server\), participants cannot reliably attribute specific schema elements or granularity decisions to individual researchers\. Overall, by restricting communication to structural metadata and locally generated, low\-utility synthetic samples, and by limiting the visibility of transformation recommendations, PrivFusion effectively minimizes cross\-site disclosure risks while enabling automated, privacy\-preserving harmonization\.

Limitations\. While PrivFusion is designed to be adaptable to a wide range of domains, several limitations remain\. Its performance depends on the reasoning consistency of the underlying LLM, which may vary across models and tasks\. Although the framework encourages preservation of dataset\-specific features, LLM\-driven harmonization may occasionally over\- or under\-align features\. In clinical settings, incorporating a human\-in\-the\-loop to review and validate the harmonization recommendations would strengthen semantic accuracy and reliability\. Additionally, PrivFusion typically converged in our experiments, but lacks theoretical guarantees on iteration count\. Finally, the multi\-agent workflow can be computationally intensive and may not be feasible for resource\-constrained institutions\.

## VI Related Work

We review two primary lines of related research: \(i\) data harmonization and \(ii\) federated and distributed health analytics\.

Data Harmonization\. Data harmonization has been extensively studied across a range of scientific domains, often in field\-specific contexts such as genomics, epidemiology, and multi\-omics research \[[20](https://arxiv.org/html/2605.24249#bib.bib45),[27](https://arxiv.org/html/2605.24249#bib.bib42),[8](https://arxiv.org/html/2605.24249#bib.bib43),[7](https://arxiv.org/html/2605.24249#bib.bib44)\]\. Broadly, harmonization approaches can be classified into retrospective and prospective strategies \[[9](https://arxiv.org/html/2605.24249#bib.bib38)\]\. Retrospective harmonization aligns datasets that have already been collected, as in the case of harmonizing existing COVID\-19 datasets \[[17](https://arxiv.org/html/2605.24249#bib.bib39)\]\. In contrast, prospective harmonization requires collaborators to agree on standardized measures prior to data collection\. While powerful, this approach can be challenging to implement because research groups often operate under different scientific aims or healthcare systems \[[26](https://arxiv.org/html/2605.24249#bib.bib40)\]\. In practice, harmonization efforts often fall along a continuum between these two extremes \[[25](https://arxiv.org/html/2605.24249#bib.bib46)\]\. Harmonization techniques can also be classified by the mechanism used to align heterogeneous data\. One strategy is merging, in which a unified taxonomy or ontology is constructed that encompasses all local taxonomies or ontologies\. Alternatively, mapping\-based methods define alignment rules between ontologies, enabling interoperability without requiring complete unification\. These approaches offer different trade\-offs in scalability, flexibility, and consistency, depending on the intended analytical use case \[[9](https://arxiv.org/html/2605.24249#bib.bib38)\]\.

Federated and Distributed Health Analytics\. Several prior works have focused on enabling multi\-institutional health analytics and collaborative model development without centralizing sensitive data \[[15](https://arxiv.org/html/2605.24249#bib.bib11),[11](https://arxiv.org/html/2605.24249#bib.bib7)\]\. Gaye et al\. \[[15](https://arxiv.org/html/2605.24249#bib.bib11)\] develop DataSHIELD, a framework that enables the co\-analysis of individual\-level data from several studies without transferring the data across institutions\. Similarly, the Personal Health Train \(PHT\) \[[11](https://arxiv.org/html/2605.24249#bib.bib7)\] provides a distributed infrastructure that brings computation to the data rather than the reverse\. Evaluations on lung cancer datasets \(tumor staging and post\-treatment survival information\) have shown the effectiveness of PHT\. More recent works in FL have explored how to improve model performance under heterogeneous client data distributions\. Personalized FL approaches aim to tailor global models to local client needs \[[24](https://arxiv.org/html/2605.24249#bib.bib35)\]\. Per\-FedAvg \[[13](https://arxiv.org/html/2605.24249#bib.bib34)\] uses a meta\-learning strategy to learn an initial shared model, which clients can rapidly adapt to their local distribution with only a few gradient updates\. Li et al\. \[[21](https://arxiv.org/html/2605.24249#bib.bib36)\] propose a privacy\-preserving method for selecting informative training samples, enabling clients to build more effective models while reducing computation\.

## VII Conclusion

In this work, we have proposed PrivFusion, a novel multi\-agent framework for harmonizing datasets across sites in a privacy\-preserving way\. Evaluation on four real\-world COVID\-19 datasets showed that PrivFusion, particularly when leveraging GPT\-OSS as the underlying large language model, effectively harmonized datasets in just a few iterations with a minimal number of required transformations\. By enabling automatic alignment of heterogeneous datasets while preserving data privacy, PrivFusion will facilitate high\-quality collaborative research across institutions\. In future work, we will explore the application of PrivFusion to additional datasets and domains, as well as the evaluation of alternative large language models to further enhance its scalability and applicability\.

## Acknowledgments

Anisa Halimi and Stefano Braghin were partly supported by the Innovative Health Initiative Joint Undertaking \(IHI JU\) under grant agreement No\. 101172997 – SEARCH\. The content is solely the responsibility of the authors and does not necessarily represent the official views of the agencies funding the research\.

## References

- \[1\] S\. Agarwal, L\. Ahmad, J\. Ai, S\. Altman, A\. Applebaum, E\. Arbus, R\. K\. Arora, Y\. Bai, B\. Baker, H\. Bao, et al\. \(2025\) Gpt\-oss\-120b & gpt\-oss\-20b model card\. arXiv preprint arXiv:2508\.10925\. Cited by: [§IV](https://arxiv.org/html/2605.24249#S4.p7.1)\.
- \[2\] M\. S\. M\. S\. Annamalai, G\. Ganev, and E\. De Cristofaro \(2024\) "What do you want from theory alone?" experimenting with tight auditing of differentially private synthetic data generation\. In 33rd USENIX Security Symposium \(USENIX Security\), pp\. 4855–4871\. Cited by: [§II](https://arxiv.org/html/2605.24249#S2.p3.1)\.
- \[3\] R\. S\. Antunes, C\. André da Costa, A\. Küderle, I\. A\. Yari, and B\. Eskofier \(2022\) Federated learning for healthcare: systematic review and architecture proposal\. ACM Transactions on Intelligent Systems and Technology 13\(4\), pp\. 1–23\. Cited by: [§I](https://arxiv.org/html/2605.24249#S1.p2.1)\.
- \[4\] S\. Braghin, J\. H\. Bettencourt\-Silva, K\. Levacher, and S\. Antonatos \(2019\) An extensible de\-identification framework for privacy protection of unstructured health information: creating sustainable privacy infrastructures\. In MEDINFO: Health and Wellbeing E\-networks for All\. Cited by: [§III](https://arxiv.org/html/2605.24249#S3.p2.6)\.
- \[5\] M\. d\. Carmen Legaz\-García, J\. A\. Miñarro\-Giménez, M\. Menárguez\-Tortosa, and J\. T\. Fernández\-Breis \(2016\) Generation of open biomedical datasets through ontology\-driven transformation and integration processes\. Journal of Biomedical Semantics 7\(1\), pp\. 32\. Cited by: [§I](https://arxiv.org/html/2605.24249#S1.p3.1)\.
- \[6\] M\. Chen, Y\. Hao, K\. Hwang, L\. Wang, and L\. Wang \(2017\) Disease prediction by machine learning over big data from healthcare communities\. IEEE Access 5, pp\. 8869–8879\. Cited by: [§I](https://arxiv.org/html/2605.24249#S1.p1.1)\.
- \[7\] T\. Chen, A\. J\. Abadi, K\. Lê Cao, and S\. Tyagi \(2023\) Multiomics: a user\-friendly multi\-omics data harmonisation r pipeline\. F1000Research 10\(538\), pp\. 538\. Cited by: [§VI](https://arxiv.org/html/2605.24249#S6.p2.1)\.
- \[8\] Y\. Chen, S\. Sabri, A\. Rajabifard, and M\. E\. Agunbiade \(2018\) An ontology\-based spatial data harmonisation for urban analytics\. Computers, Environment and Urban Systems 72, pp\. 177–190\. Cited by: [§VI](https://arxiv.org/html/2605.24249#S6.p2.1)\.
- \[9\] C\. Cheng, L\. Messerschmidt, I\. Bravo, M\. Waldbauer, R\. Bhavikatti, C\. Schenk, V\. Grujic, T\. Model, R\. Kubinec, and J\. Barceló \(2024\) A general primer for data harmonization\. Scientific Data 11, pp\. 152\. Cited by: [§VI](https://arxiv.org/html/2605.24249#S6.p2.1)\.
- \[10\] E\. Chondrogiannis, V\. Andronikou, E\. Karanastasis, and T\. Varvarigou \(2019\) A novel approach for clinical data harmonization\. In IEEE International Conference on Big Data and Smart Computing \(BigComp\)\. Cited by: [§I](https://arxiv.org/html/2605.24249#S1.p3.1)\.
- \[11\] T\. M\. Deist, F\. J\.W\.M\. Dankers, P\. Ojha, M\. Scott Marshall, T\. Janssen, C\. Faivre\-Finn, P\. Lambin, and A\. Dekker \(2020\) Distributed learning on 20000\+ lung cancer patients – the personal health train\. Radiotherapy and Oncology 144, pp\. 189–200\. External Links: [Document](https://dx.doi.org/10.1016/j.radonc.2019.11.019)\. Cited by: [§VI](https://arxiv.org/html/2605.24249#S6.p3.1)\.
- \[12\] A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Yang, A\. Fan, et al\. \(2024\) The llama 3 herd of models\. arXiv e\-prints, pp\. arXiv–2407\. Cited by: [§IV](https://arxiv.org/html/2605.24249#S4.p7.1)\.
- \[13\] A\. Fallah, A\. Mokhtari, and A\. Ozdaglar \(2020\) Personalized federated learning: a meta\-learning approach\. arXiv preprint arXiv:2002\.07948\. Cited by: [§VI](https://arxiv.org/html/2605.24249#S6.p3.1)\.
- \[14\] G\. Ganev and E\. De Cristofaro \(2025\) The inadequacy of similarity\-based privacy metrics: privacy attacks against “truly anonymous” synthetic datasets\. In IEEE Symposium on Security and Privacy \(SP\)\. Cited by: [§II](https://arxiv.org/html/2605.24249#S2.p3.1), [§V](https://arxiv.org/html/2605.24249#S5.p2.1)\.
- \[15\] A\. Gaye, Y\. Marcon, J\. Isaeva, P\. LaFlamme, A\. Turner, E\. M\. Jones, J\. Minion, A\. W\. Boyd, C\. J\. Newby, M\. Nuotio, et al\. \(2014\) DataSHIELD: taking the analysis to the data, not the data to the analysis\. International Journal of Epidemiology 43\(6\), pp\. 1929–1944\. Cited by: [§VI](https://arxiv.org/html/2605.24249#S6.p3.1)\.
- \[16\] D\. Guo, Q\. Zhu, D\. Yang, Z\. Xie, K\. Dong, W\. Zhang, G\. Chen, X\. Bi, Y\. Wu, Y\. Li, et al\. \(2024\) DeepSeek\-coder: when the large language model meets programming–the rise of code intelligence\. arXiv preprint arXiv:2401\.14196\. Cited by: [§IV](https://arxiv.org/html/2605.24249#S4.p7.1)\.
- \[17\] G\. C\. Hurtt, L\. P\. Chini, S\. Frolking, R\. Betts, J\. Feddema, G\. Fischer, J\. Fisk, K\. Hibbard, R\. Houghton, A\. Janetos, et al\. \(2011\) Harmonization of land\-use scenarios for the period 1500–2100: 600 years of global gridded annual land\-use transitions, wood harvest, and resulting secondary lands\. Climatic Change 109\(1\), pp\. 117\. Cited by: [§VI](https://arxiv.org/html/2605.24249#S6.p2.1)\.
- \[18\] P\. Jaccard \(1901\) Étude comparative de la distribution florale dans une portion des alpes et des jura\. Bulletin de la Société Vaudoise des Sciences Naturelles 37, pp\. 547–579\. Cited by: [§IV](https://arxiv.org/html/2605.24249#S4.p10.2)\.
- \[19\] K\. D\. Kourou, V\. C\. Pezoulas, E\. I\. Georga, T\. P\. Exarchos, P\. Tsanakas, M\. Tsiknakis, T\. Varvarigou, S\. De Vita, A\. Tzioufas, and D\. I\. Fotiadis \(2018\) Cohort harmonization and integrative analysis from a biomedical engineering perspective\. IEEE Reviews in Biomedical Engineering\. Cited by: [§I](https://arxiv.org/html/2605.24249#S1.p3.1)\.
- \[20\] G\. Kumar, S\. Basri, A\. A\. Imam, S\. A\. Khowaja, L\. F\. Capretz, and A\. O\. Balogun \(2021\) Data harmonization for heterogeneous datasets: a systematic literature review\. Applied Sciences 11\(17\), pp\. 8275\. Cited by: [§I](https://arxiv.org/html/2605.24249#S1.p2.1), [§VI](https://arxiv.org/html/2605.24249#S6.p2.1)\.
- \[21\] A\. Li, L\. Zhang, J\. Tan, Y\. Qin, J\. Wang, and X\. Li \(2021\) Sample\-level data selection for federated learning\. In IEEE INFOCOM\-IEEE Conference on Computer Communications, pp\. 1–10\. Cited by: [§VI](https://arxiv.org/html/2605.24249#S6.p3.1)\.
- \[22\] L\. Nedoshivina, A\. Halimi, J\. Bettencourt\-Silva, and S\. Braghin \(2024\) Pragmatic de\-identification of cross\-domain unstructured documents: a utility\-preserving approach with relation extraction filtering\. AMIA Summits on Translational Science Proceedings 2024, pp\. 85\. Cited by: [§III](https://arxiv.org/html/2605.24249#S3.p2.6)\.
- \[23\] T\. Stadler, B\. Oprisanu, and C\. Troncoso \(2022\) Synthetic data–anonymisation groundhog day\. In 31st USENIX Security Symposium\. Cited by: [§II](https://arxiv.org/html/2605.24249#S2.p3.1)\.
- \[24\] A\. Z\. Tan, H\. Yu, L\. Cui, and Q\. Yang \(2022\) Towards personalized federated learning\. IEEE Transactions on Neural Networks and Learning Systems 34\(12\), pp\. 9587–9603\. Cited by: [§VI](https://arxiv.org/html/2605.24249#S6.p3.1)\.
- \[25\] A\. Torres\-Espín and A\. R\. Ferguson \(2022\) Harmonization\-information trade\-offs for sharing individual participant data in biomedicine\. Harvard Data Science Review 4\(3\), pp\. 10–1162\. Cited by: [§VI](https://arxiv.org/html/2605.24249#S6.p2.1)\.
- \[26\] H\. Uphoff, J\. Cohen, D\. Fleming, and A\. Noone \(2003\) Harmonisation of national influenza surveillance morbidity data from eiss: a simple index\. Euro Surveillance: Bulletin Europeen sur les Maladies Transmissibles = European Communicable Disease Bulletin\. Cited by: [§VI](https://arxiv.org/html/2605.24249#S6.p2.1)\.
- \[27\] A\. H\. Zhu, D\. C\. Moyer, T\. M\. Nir, P\. M\. Thompson, and N\. Jahanshad \(2019\) Challenges and opportunities in dmri data harmonization\. In International Conference on Medical Image Computing and Computer\-Assisted Intervention, pp\. 157–172\. Cited by: [§VI](https://arxiv.org/html/2605.24249#S6.p2.1)\.