GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling
Summary
GenesisFunc is an automated multi-agent pipeline for generating high-quality, diverse synthetic training data for function-calling in LLMs. Fine-tuning an 8B model on this data achieves strong in-domain and out-of-domain performance, rivaling some API-based models.
View Cached Full Text
Cached at: 05/29/26, 09:12 AM
# GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling
Source: [https://arxiv.org/html/2605.28835](https://arxiv.org/html/2605.28835)
Hao\-Xiang Xu, Chong Deng1, Jiaqing Liu1, Wen Wang1, Qian Chen1, Lujia Bao1,Xiangang Li1,Zhen\-Hua Ling 1Tongyi Fun Team, Alibaba Group nh2001620@mail\.ustc\.edu\.cn
###### Abstract
Large Language Models \(LLMs\) extend their capabilities through function\-calling \(FC\), which relies on training data with high quality, diversity, and broad coverage of scenarios\. However, obtaining and annotating real function\-calling data is challenging, while synthetic data from existing pipelines often suffers from unreliable APIs, limited tool scalability, insufficient diversity, and weak quality control\. To address these, we presentGenesisFunc, an automated pipeline for generating FC training data\. Starting from reliable tools in widely used public benchmarks, ourGenesisFuncemploys a multi\-agent framework to support a dialogue generation system that produces conversations spanning diverse scenarios, while maintaining both diversity and quality throughout the process\. The accuracy of the data is further reinforced through a multi\-stage evaluation system\. We fine\-tune an 8B LLM on the synthetic dataset and show through extensive experiments that it outperforms similarly sized open\-source models in in\-domain FC performance and out\-of\-domain generalization, while reaching FC capabilities comparable to some of the latest API\-based models\. In addition, our method demonstrates strong potential to scale effectively across downstream tools, underscoring its real\-world applicability\. The complete pipeline and the constructed dataset is available at[https://github\.com/famoustourist/GenesisFunc](https://github.com/famoustourist/GenesisFunc)\.
GenesisFunc: Multi\-Agent Data Generation for Accurate and Generalizable Function\-Calling
Hao\-Xiang Xu, Chong Deng1, Jiaqing Liu1, Wen Wang1, Qian Chen1,Lujia Bao1,Xiangang Li1,Zhen\-Hua Ling††thanks:Corresponding author\.1Tongyi Fun Team, Alibaba Groupnh2001620@mail\.ustc\.edu\.cn
## 1Introduction
Tool learning represents a crucial step in advancing the frontier of Large Language Models \(LLMs\), as it transforms them from passive language processors into proactive agents capable of interacting with dynamic environmentsHuang et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib13)\); Qin et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib28)\)\. By bridging internal reasoning with external execution, tool\-equipped LLMs are no longer constrained by static training data but can generalize to open\-ended tasks and adapt to evolving user needsPatil et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib26)\); Qu et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib29)\)\. This paradigm thus underscores not only the practical value of enhancing task coverage across diverse applications such as workflow automation and travel planningZhong et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib43)\); Hao et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib9)\), but also the theoretical importance of extending the fundamental boundaries of LLM capabilitiesLiu et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib23),[2025a](https://arxiv.org/html/2605.28835#bib.bib21)\)\.
Despite the rapid progress in tool learning, the field still faces fundamental challenges that constrain its broader applicability\. The effectiveness of function\-calling \(FC\) relies heavily on high\-quality training data, yet collecting and annotating real\-world data is costly and labor\-intensiveQin et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib28)\)\. Moreover, real\-world function\-calls are inherently complex and diverse, often characterized by ambiguous user intents, rapidly changing dynamic environments, multi\-task requirements, and extended multi\-turn interactionsTang et al\. \([2023](https://arxiv.org/html/2605.28835#bib.bib33)\); Patil et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib26)\); Liu et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib23)\); Abdelaziz et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib1)\)\. These intertwined challenges underscore the urgent need for more robust data generation pipelines that can produceaccurate,diverse, andbroadly representative training datato support realistic function\-calling scenarios\.
Increasing research efforts have sought to design automated pipelines for synthesizing training data, which are then used to build tool\-augmented LLMsQu et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib29)\); Liu et al\. \([2025a](https://arxiv.org/html/2605.28835#bib.bib21)\)\. While progress has been made, fundamental limitations remain\. Existing methods often rely on publicly available or manually constructed APIs, which are frequently unreliable and difficult to scale across broader tool sets, thereby constraining function\-calling capabilities\. In addition, the generated data often suffers from insufficient diversity and inadequate quality control, further weakening performance\. From the perspective of scenario coverage, most approaches still emphasize single\-turn or isolated function calls, leaving them unable to capture more realistic settings such as multi\-turn dialogues or multi\-task interactions\. Together, these limitations suggest the insufficiency of current approaches and highlight the need for more effective and scalable solutions\.
In this paper, we introduceGenesisFunc, a three\-stage automated pipeline designed to generate function\-calling training data that ishigh\-quality,diverse, andbroadly representative of real\-world scenarios\. To ensure the reliability of APIs and enable seamless extension to a wider range of tools,GenesisFuncbegins byselecting a curated seed setfrom the widely adopted BFCL benchmarkYan et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib36)\)\. These APIs are verified to cover multiple domains, making them not only dependable but also practical for extension to downstream tools, since such functions are readily accessible in real applicationsZhang et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib40)\)\. Next, to improve dialogue quality, diversity, and coverage,GenesisFuncincorporates amulti\-agent\-assisted dialogue generation system\. Through the coordinated interaction of agents, the framework leverages history\-aware role differentiation and parameter\-slot selection to expand diversity, while an agent\-based scoring mechanism safeguards the quality of synthesized dialogues\. Finally, to further guarantee correctness,GenesisFuncemploys amulti\-stage evaluation modulethat combines rule\-based and model\-based checks, followed by targeted human review, thereby ensuring both conformity and executability of the training data\. Through this end\-to\-end process,GenesisFuncproduces robust, diverse, and scenario\-rich training data, which significantly enhances the function\-calling capabilities of LLMs\.
To validate the effectiveness ofGenesisFunc, we conduct supervised fine\-tuning on Qwen3\-8BYang et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib37)\)using the training data generated by our pipeline and evaluate the zero\-shot performance of the fine\-tuned model across several public benchmarks\. Since the tools are sourced from BFCLYan et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib36)\), we report results not only on the in\-domain BFCL benchmark but also on the out\-of\-domain API\-BankLi et al\. \([2023](https://arxiv.org/html/2605.28835#bib.bib15)\)and ACEBenchChen et al\. \([2025a](https://arxiv.org/html/2605.28835#bib.bib4)\), thereby assessing both in\-domain performance and generalization to unseen domains\. Under a consistent evaluation protocol, our model consistently surpasses open\-source baselines of comparable scales and remains highly competitive with API\-based modelsOpenAI \([2023](https://arxiv.org/html/2605.28835#bib.bib25)\)\. Furthermore, we demonstrate that our approach can scale up to more downstream tools and verify that the data generated by our pipeline can also be used through reinforcement learning to enhance function\-calling performance in multi\-turn dialogue scenarios\.
In summary, this paper makes three primary contributions: \(1\) We introduceGenesisFunc, an automated pipeline for generating high\-quality function\-calling training data for LLMs\. The pipeline begins with selecting reliable APIs and integrates aMulti\-agent Dialogue Generation Moduleand aMulti\-stage Evaluation Moduleto ensure robustness, diversity, and scenario coverage\. \(2\) Through extensive verification,GenesisFuncproduces datasets that areaccurate and diverse while covering real\-world scenarios\. Moreover, our approach demonstratesstrong generalizability to downstream tools, underscoring its practical applicability\. \(3\) We fine\-tune Qwen3\-8B with training data generated by our pipeline and evaluate theGenesisFunc\-8B on three widely used benchmarks, BFCLYan et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib36)\), API\-BankLi et al\. \([2023](https://arxiv.org/html/2605.28835#bib.bib15)\), and ACEBenchChen et al\. \([2025a](https://arxiv.org/html/2605.28835#bib.bib4)\)\.GenesisFunc\-8B consistently outperforms open\-source LLMs of comparable scales and remains highly competitive with API\-based models\.
## 2Related Work
LLM Function\-Calling Paradigms\.Equipping LLMs with executable tools enables reliable and specialized problem solvingQin et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib28)\)\. Existing approaches mainly follow two paradigms: prompting and fine\-tuning\. Prompting leverages in\-context tool specifications and demonstrationsMialon et al\. \([2023](https://arxiv.org/html/2605.28835#bib.bib24)\); Hsieh et al\. \([2023](https://arxiv.org/html/2605.28835#bib.bib11)\)\. ReActYao et al\. \([2023](https://arxiv.org/html/2605.28835#bib.bib38)\)combines reasoning with API calls for multi\-step tasks, but its performance is constrained by pretraining and degrades with increased tool complexity\. These limitations motivate fine\-tuning via supervised or reinforcement learningTang et al\. \([2023](https://arxiv.org/html/2605.28835#bib.bib33)\); Abdelaziz et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib1)\)\. Representative methods include ToolACELiu et al\. \([2025a](https://arxiv.org/html/2605.28835#bib.bib21)\), which constructs large\-scale tool\-use corpora automatically, and RL\-based approaches such as ToolRLQian et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib27)\)and AWPOLin et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib19)\), which enhance function\-calling through policy optimization with reasoning rewards\.
Data Synthesis for Function\-Calling\.As LLMs advance, reliance solely on human\-authored corpora becomes insufficient for sustained progressBauer et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib3)\)\. To expand supervision without heavy annotation costs, recent works adopt prompt\-driven transformations to augment existing datasets\. For example,Yu et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib39)\)proposes targeted prompting to elicit rare skills and long\-tail behaviors from base models\. However, purpose\-driven data remain limited in tool\-use scenarios\. Prior efforts adapt resources from adjacent domainsBasu et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib2)\)or synthesize samples around public APIsLiu et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib23)\)\. Moreover, ToolACELiu et al\. \([2025a](https://arxiv.org/html/2605.28835#bib.bib21)\)constructs large\-scale function\-calling corpora via automated pipelines, while ToolForgeChen et al\. \([2025b](https://arxiv.org/html/2605.28835#bib.bib5)\)introduces automated tool\-use synthesis with reduced reliance on real APIs\.
Our work is distinguished from the most relevant studiesQin et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib28)\); Liu et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib23),[2025a](https://arxiv.org/html/2605.28835#bib.bib21)\)in the following aspects\. Prior works often rely on annotated or synthetic APIs, which lack reliability and struggle to scale across larger tool sets\. These approaches also face limitations in diversity, quality, and coverage\. In contrast,GenesisFuncis built upon reliable tools drawn from public benchmarks and offers stronger scalability\. Furthermore, by leveraging a multi\-agent dialogue generation framework and a multi\-stage verification system,GenesisFuncproduces tool\-use training data with richer diversity, higher quality, and broader scenario coverage than prior works\.
## 3Preliminary
Given a user queryqqand a candidate tool set𝒯=\{t1,…,tn\}\\mathcal\{T\}=\\\{t\_\{1\},\\ldots,t\_\{n\}\\\}, where each tooltit\_\{i\}is defined by a name, a usage description, and a schema specifying required and optional parameters, the goal is to generate a valid sequence of tool calls by selecting tools and filling arguments with appropriate values and units\. This process can be framed as follows:
𝒮=\[t1\(a1\),…,tk\(ak\)\]=gϕ\(q,𝒯\),\\mathcal\{S\}=\\big\[t\_\{1\}\(a\_\{1\}\),\\ldots,t\_\{k\}\(a\_\{k\}\)\\big\]=g\_\{\\phi\}\\\!\\big\(q,\\mathcal\{T\}\\big\),\(1\)wheregϕ\(⋅\)g\_\{\\phi\}\(\\cdot\)denotes an LLM with parametersϕ\\phi,kkis the number of invocations, andaia\_\{i\}denotes the argument payload for theii\-th call \(1≤i≤k1\\leq i\\leq k\), which is a set of parameter\-value pairs, that is,ai=\[r1:w1,r2:w2,…,rℓ:wℓ\]a\_\{i\}=\[\\,r\_\{1\}\\\!:\\\!w\_\{1\},\\,r\_\{2\}\\\!:\\\!w\_\{2\},\\ldots,r\_\{\\ell\}\\\!:\\\!w\_\{\\ell\}\\,\], with parameter namesrjr\_\{j\}and corresponding valueswjw\_\{j\}\. The queryqqmay be a single turn or a full multi\-turn history\.
For fine\-tuning, the pair\(q,𝒯\)\(q,\\mathcal\{T\}\)is treated as input context, while the gold\-standard tool\-call sequence𝒮\\mathcal\{S\}serves as the supervised target\. Formally, the training samples can be represented as\{⟨q,𝒯⟩,𝒮\}\\\{\\,\\langle q,\\mathcal\{T\}\\rangle,\\mathcal\{S\}\\,\\\}, where each training instance connects a user query and the candidate tools with the corresponding sequence of tool invocations\.
Figure 1:The overall framework ofGenesisFunc, consisting of aTool Poolbuilt from reliable tools in open\-source benchmarks, aMulti\-Agent Dialogue Generationsystem, and aMulti\-Stage Evaluationprocess\.
## 4Data Generation Pipeline
Synthetic data plays a vital role in enhancing the function\-calling capability of LLMsLiu et al\. \([2025a](https://arxiv.org/html/2605.28835#bib.bib21)\)\. We propose an automated pipelineGenesisFuncthat employs a structured, three\-stage process to generate high\-quality training data for tool\-augmented LLMs, as illustrated in Figure[1](https://arxiv.org/html/2605.28835#S3.F1)\. First, reliable and functionally diverse tools are selected from public datasets and placed into theTool Pool\. Next, theMulti\-Agent Dialogue Generationmodule leverages a multi\-agent\-assisted framework to produce tool\-use dialogues that are diverse, accurate, and broadly representative\. Finally, theMulti\-Stage Evaluationmodule systematically examines the correctness of the generated dialogues to guarantee training data quality\.
### 4\.1Tool Pool
The quality of training data for enhancing LLM tool\-use abilities depends critically on the reliability and coverage of the underlying APIs\. However, existing datasets often rely on manually annotated APIs or handcrafted synthetic pipelines, which are either costly to scale or difficult to extend to real\-world tools, limiting model generalizability\.
To address these issues,GenesisFuncconstructs a curated Tool Pool entirely sourced from the BFCL evaluation setYan et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib36)\)\. The pool is built through a two\-stage filtering process: GPT\-4oOpenAI \([2023](https://arxiv.org/html/2605.28835#bib.bib25)\)is first used to perform semantic clustering over all tools to remove redundant or highly similar ones, followed by light human verification to ensure correctness and usability\. This process yields a curated Tool Pool of 1,000 tools\. Since tools in BFCL are manually designed to cover a broad range of real\-world usage scenarios, we further verify that the curated pool maintains high functional diversity across multiple domains\. By relying on widely used and practically accessible tools,the Tool Pool balances reliability, diversity, and scalability, and serves as a critical foundation for the subsequent data generation modules inGenesisFunc\.
### 4\.2Multi\-Agent Dialogue Generation
For LLMs, the quality of fine\-tuning data significantly affects downstream performanceLiu et al\. \([2025b](https://arxiv.org/html/2605.28835#bib.bib22)\)\. We propose a multi\-agent dialogue generation module that leverages a multi\-agent framework to assist the dialogue generation system in synthesizing function\-calling dialogues\. By leveraging agent interaction and collaboration, this module produces dialogues that are high\-quality, diverse, and broadly representative, which is crucial to enhance tool\-use capabilities of LLMs\.
#### 4\.2\.1Multi\-Agent Framework
The multi\-agent framework includes four LLM\-based agents:Sample Agent,Memory Agent,Function Agent, andJudge Agent\. We adopt Gemini\-2\.5 ProComanici et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib7)\)as the backbone of all agents, which is an engineering choice due to Gemini\-2\.5 Pro’sstable API\-following performancerather than an algorithmic requirement\. Importantly, the multi\-agent framework ismodel\-agnostic; Appendix[A\.1](https://arxiv.org/html/2605.28835#A1.SS1)shows thatreplacing proprietary models with strong open\-source alternatives retains the overall workflow and leads to only moderate data quality degradation and slight downstream performance drops\. System prompts and detailed agent descriptions and analyses are also in Appendix[A\.1](https://arxiv.org/html/2605.28835#A1.SS1)\.
Sample Agentselects a subset𝒯s⊆𝒯\\mathcal\{T\}\_\{s\}\\subseteq\\mathcal\{T\}from the global Tool Pool, including target tools𝒯target\\mathcal\{T\}\_\{\\text\{target\}\}and distractors𝒯dist\\mathcal\{T\}\_\{\\text\{dist\}\}\. Target tools are grouped by semantic or functional similarity, while distractors are chosen with varied relevance to increase realism and require the model to distinguish between strongly and weakly related tools\.
Memory Agentrecords dialogue instances𝒟=\{d1,…,dk\}\\mathcal\{D\}=\\\{d\_\{1\},\\ldots,d\_\{k\}\\\}and assigns each a type labelτ\(di\)∈𝒞\\tau\(d\_\{i\}\)\\in\\mathcal\{C\}reflecting the semantic context of tool usage, where the same math tool may address architectural geometry in one case and year calculations in another\. It then summarizes past pairs\(di,τ\(di\)\)\(d\_\{i\},\\tau\(d\_\{i\}\)\)and guides the generator to produce new dialogues with unseenτ\\tau, improving semantic diversity\.
Function Agentselects one or more tools\{t1,…,tk\}⊆𝒯s\\\{t\_\{1\},\\ldots,t\_\{k\}\\\}\\subseteq\\mathcal\{T\}\_\{s\}to address the user queryqq, and extracts both required and optional parameters for each tooltit\_\{i\}\. To enhance diversity and realism, a slot\-selection strategy is applied: during parameter extraction, the agent randomly chooses a subset𝒫′⊆𝒫\\mathcal\{P\}^\{\\prime\}\\subseteq\\mathcal\{P\}of optional parameters to instantiate, resulting in varied and realistic tool calls\.
Judge Agentensures rigorous quality control by selecting the best dialogued∗d^\{\*\}fromNNcandidates\{d1,…,dN\}\\\{d\_\{1\},\\ldots,d\_\{N\}\\\}\(defaultN=4N=4\) each generated round\. Evaluation considersproblem significanceandtool appropriateness, withpositional bias controlledas discussed in Appendix[A\.1](https://arxiv.org/html/2605.28835#A1.SS1)\.N=4N=4balances selection quality and generation efficiency\. The selectedd∗d^\{\*\}serves as the final output\.
#### 4\.2\.2Dialogue Generation System
We design a dialogue generation system to produce three types of function\-calling dialogues:single\-turn,multi\-turn, andspecial\-case\. The first two types cover both single\-task and multi\-task settings, enabling the simulation of simple requests and more complex interaction flows\. The special cases, on the other hand, are designed to cover scenarios for relevance checking and error detection\. The overall system is composed of two agents, a user agentUUand an assistant agentAA, both instantiated with Gemini\-2\.5 Pro, that interact with each other to generate task\-oriented conversations\.
The detailed algorithm of the multi\-agent dialogue generation module and dialogue examples are provided in Appendix[A\.2](https://arxiv.org/html/2605.28835#A1.SS2)\. The dialogue generation system grounds each conversation in the sampled tool set𝒯s\\mathcal\{T\}\_\{s\}drawn from the curated pool\. Every dialogue contains user requests that can be handled by𝒯s\\mathcal\{T\}\_\{s\}and tool call answers that explicitly specify the selected tools and the full argument payload\. The assistant’s action space is𝒜asst=\{call,ask,answer\}\\mathcal\{A\}\_\{\\text\{asst\}\}=\\\{\\texttt\{call\},\\,\\texttt\{ask\},\\,\\texttt\{answer\}\\\}\. The assistant can issue a tool callt\(a\)t\(a\), request clarification, summarize tool outputs, or provide a direct non\-tool reply when a call is unnecessary\. The system records the chosen toolttand the payloadaain the turn\-level state and appends them to theturn\-level trajectoryfor later evaluation and reuse\. These recorded trajectories, after being categorized by the memory agent, are leveraged to guide subsequent dialogue generation, preventing the system from reproducing similar contexts and encouraging scenario diversity\.
At turntt, the user agentUUissues a requestqtq\_\{t\}based on the tool set𝒯s\\mathcal\{T\}\_\{s\}\. The assistant agentAAreceives the requestqtq\_\{t\}together with the dialogue history𝒟\\mathcal\{D\}, follows the system prompts, and generates the next action\. In single\-turn dialogues,AAconstructs a problem solvable with𝒯s\\mathcal\{T\}\_\{s\}, produces a tool call, and returns the final answer\. In multi\-turn dialogues,AAalternates withUUby requesting additional information when constraints are missing, and once the target length is reached or a stop signal triggers, the model outputs eithercalloranswer\. In special cases, prompts guide the generation of unsolved or erroneous samples, such as mismatched tools or invalid parameter values, to support relevance checking and error detection\.
### 4\.3Multi\-Stage Evaluation
We introduce a multi\-stage evaluation system to assess the quality of synthesized dialogues, since inaccurate training data may significantly weaken the models’ ability to understand and execute functions\. This system integrates a rule\-based checker and a model\-based checker, with final verification by human experts to ensure the accuracy of the resulting training data\. More details can be found in Appendix[A\.3](https://arxiv.org/html/2605.28835#A1.SS3)\.
\(1\) Rule Checker\.Without executing tools, the Rule Checker performs basic compliance checks on synthesized dialogues to quickly filter out samples with format and alignment issues\. It validates four aspects: completeness of the tool definition, compliance of the call format and parameters, soundness of the dialogue structure, and consistency between tool results and the assistant’s statements \(rule details in Appendix[A\.3](https://arxiv.org/html/2605.28835#A1.SS3)\) Each rule returns apassorfailflag with a hint for correction, and outputs are aggregated for the downstream Model Checker and Human Validation\.
\(2\) Model Checker\.After format screening, the Model Checker employs GPT\-4oOpenAI \([2023](https://arxiv.org/html/2605.28835#bib.bib25)\)to evaluate semantic quality and task completion beyond the static rules \(the prompt is detailed in Appendix[A\.3](https://arxiv.org/html/2605.28835#A1.SS3)\)\. Unlike the Judge Agent that ranks candidate dialoguesduringgeneration, the Model Checker workspost\-hocto verify finalized dialogues\. Given the dialogue context and tool calls, it checks faithfulness, task satisfaction, and compliance, returning a rationale and a confidence score\. Only samples with confidence scores above the threshold \(θ=0\.75\\theta=0\.75\) are retained\.
\(3\) Human Validation\.After the Rule Checker and the Model Checker complete automated screening, the remaining error rate isbelow 5%\. Manual analysis shows thatover 80%of the errors after automated screening stem from parameter extraction rather than function selection or intent understanding\. Samples that fail the Rule Checker or obtain low\-confidence from the Model Checker are routed to a human review, with higher priority given to errors that prevent correct function execution and errors in high\-impact functions that are commonly used in core real\-world scenarios\. These samples account for about5%of all samples\. Human experts review each of these samples and provide revisions; the approved or revised dialogues are added to the training data after passing the second\-pass human validation\. Overall, the Human Validation stage only costs∼\\sim15 human\-hours\.
Overall comparison with other strong synthetic function\-calling datasets\.Appendix[B](https://arxiv.org/html/2605.28835#A2)provides detailed analyses of quantity, quality, coverage, diversity of the final training data\. Compared to other strong synthetic function\-calling datasets, our dataset achieves higher quality through richer multi\-tool and multi\-turn compositions, broader coverage across diverse scenarios, and balanced parameter utilization patterns, while maintaining efficient construction by reusing existing tool schemas and avoiding additional manual designs\.
Table 1:Accuracyon theIn\-DomainBFCL dataset\.GenesisFunc\-8B is fine\-tuned on Qwen3\-8B using training data generated by our pipeline\. The best results in each category are inboldand the second best results areunderlined\. We report mean and standard deviation \(±\\pmSD\) of theOverallaccuracy based on three independent runs\.
## 5Experiments
### 5\.1Experimental Setup
We fine\-tune LLMs on the training data produced by our pipeline and evaluate the resulting models in a broad range of settings\. Unless otherwise noted, we conduct SFTHu et al\. \([2022](https://arxiv.org/html/2605.28835#bib.bib12)\)on Qwen3\-8BYang et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib37)\)using our training data and denote the resulting model byGenesisFunc\-8B\. We compareGenesisFunc\-8B against top\-tier API\-based and open\-source foundation LLMs, as well as representative function\-calling models trained with specialized function\-calling training data including ToolACE\-8B \(fine\-tuning LLaMA3\.1\-8B\-Instruct on ToolACE\-generated data\), Hammer2\.1\-7B \(based on Qwen 2\.5 coder series\), and Qwen\-ToolRL\-8B \(fine\-tuning Qwen3\-8B with ToolRL datasets\)\. Evaluations are conducted on three commonly used benchmarks: BFCLYan et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib36)\), API\-BankLi et al\. \([2023](https://arxiv.org/html/2605.28835#bib.bib15)\), and ACEBenchChen et al\. \([2025a](https://arxiv.org/html/2605.28835#bib.bib4)\), with all results averaged over three independent runs\. More experimental details, including evaluation metric definitions and settings, are in Appendix[C](https://arxiv.org/html/2605.28835#A3)\.
Table 2:Accuracyon theOut\-of\-DomainAPI\-Bank and ACEBench datasets\. The best results in each category are inboldand the second best results areunderlined\. We report mean and standard deviation \(±\\pmSD\) of theOverallaccuracy from our model based on three independent runs\.
### 5\.2Main Results
To comprehensively assess function\-calling performance, we compareGenesisFunc\-8B with baselines on both in\-domain and out\-of\-domain datasets\. The in\-domain evaluation measures performance on data aligned with the training distribution, while the out\-of\-domain evaluation assesses its generalizability to unseen scenarios\.
In\-Domain Evaluation\.Since the tools in pool are drawn directly from BFCL, we treat BFCL as the in\-domain benchmark\. As shown in Table[1](https://arxiv.org/html/2605.28835#S4.T1),GenesisFunc\-8B achieves 93\.31 on the Non\-Live split and 83\.78 on the Live split, showing strong and robust performance, and outperforming the API\-based models\.GenesisFunc\-8B achieves substantial gains over open\-source models of similar scale and on some subsets matches or outperforms much larger models such as Qwen3\-32B\. Compared with the prior open\-source function\-calling SOTA ToolACE\-8B, our model improves overall accuracy by2\.5% relativeon Non\-Live and3\.8% relativeon Live\. Overall, these results indicate that,for function\-calling, strong and systematic alignment between training data distribution and real tool semantics can significantly narrow the performance gap between smaller models and much larger ones\.
Out\-of\-Domain Evaluation\.To evaluate the generalizability of our fine\-tuned model, we conduct experiments on two out\-of\-domain benchmarks: API\-Bank and ACEBench\. As shown in Table[2](https://arxiv.org/html/2605.28835#S5.T2), API\-based models maintain a clear advantage over open\-source ones, with GPT\-4o reaching 85\.10 on ACEBench\. Open\-source models fine\-tuned for function calling achieve competitive performance but fall short\. Compared with the prior open\-source SOTA, our model achieves 64\.79 on API\-Bank and 78\.64 \(mean of Normal Overall and Special Overall\) on ACEBench, yielding 7\.3% and 9\.4% relative gains\. Notably,GenesisFunc\-8B performs on par with GPT\-4o\-mini and Gemini\-1\.5\-Pro\. These results validate thatthe training data generated by our pipeline effectively enables model generalization to unseen scenarios\.
General Abilities\.We also find thatfine\-tuning on the training data generated by our pipeline preserves the model’s general abilities while substantially improving function\-calling ability\(more details in Appendix[D\.1](https://arxiv.org/html/2605.28835#A4.SS1)\)\.
\(a\)Non\-Live
\(b\)Live
Figure 2:Ablation study of the proposed agents\. We separately remove the Judge Agent and the Memory Agent, and evaluate ourGenesisFunc\-8B on BFCL in \(a\) Non\-Live and \(b\) Live settings\.\(a\)Non\-Live
\(b\)Live
Figure 3:Ablation study on the number of samples\. We control the number of dialogues generated per run and evaluate on BFCL in \(a\) Non\-Live and \(b\) Live settings\.
### 5\.3Ablation Study
Ablation on Multi\-Agent Framework\.Our multi\-agent framework improves function\-calling ability through agent collaboration, where the Memory Agent enhances dialogue diversity and the Judge Agent ensures accuracy\. We conduct ablation studies by removing either agent and fine\-tuning Qwen3\-8B with LoRA, noting that the Sample Agent and Function Agent are indispensable to the workflow\. As shown in Figure[2](https://arxiv.org/html/2605.28835#S5.F2), BFCL results indicate that removing the Judge Agent causes a clear performance drop, while removing the Memory Agent leads to significant degradation, highlighting the importance of accuracy and diversity and validating the effectiveness of both agents\. We further compare alternative slot selection strategies in the Function Agent in Appendix[D\.2](https://arxiv.org/html/2605.28835#A4.SS2)\.
Table 3:Accuracyon ACEBench usingreinforcement learning\. The best overall results in each category are marked in bold\. The second best results areunderlined\.Impact of the Number of Samples\.To assess how the number of generated samples influences function\-calling ability, we use the same set of tools and only vary how many dialogues are produced for each tool\. Besides the default setting of 5 dialogues per tool, we also construct datasets with 1 and 10 dialogues per tool\. We fine\-tune Qwen3\-8B on these datasets and report BFCL results in Figure[3](https://arxiv.org/html/2605.28835#S5.F3)\. Overall accuracy steadily consistently rises with more samples per tool, with a substantial gain from 1 to 5 samples but only modest improvement from 5 to 10, indicating eventually diminishing returns once sufficient scenario diversity is reached\.
Ablation on Multi\-Stage Evaluation\.Detailed results are in Appendix[D\.3](https://arxiv.org/html/2605.28835#A4.SS3)\. We find that models trained on data with the multi\-stage evaluation module achieve higher accuracy across all conditions than those trained on non\-evaluated data, verifying the effectiveness of this module\.
Table 4:Overall Accuracyon BFCL, API\-Bank and ACEBench adding the tools in ACEBench\. The best overall results in each category are marked in bold\. The second best results areunderlined\.
### 5\.4Scalability and Reinforcement Learning
Scalable to More Tools\.To assess scalability of our pipeline to more downstream tools, we add tools defined in ACEBench into the pool and use our pipeline to generate high\-quality training data for fine\-tuning\. As shown in Table[4](https://arxiv.org/html/2605.28835#S5.T4), performance on ACEBench improves markedly from 78\.64 to 81\.87 after adding these tools, due to better tool alignment\. Meanwhile, aggregate results on BFCL remain comparable at 87\.89 versus 88\.55, and on API\-Bank at 65\.11 versus 64\.79, indicating that introducing additional tool definitions does not notably degrade performance on other benchmarks\. Overall,our pipeline scales well to broader tool inventories and yields consistent gains on targeted benchmarks with no observable degradations on other datasets, across different model sizes and different backbone architectures, as shown in Appendix[D\.4](https://arxiv.org/html/2605.28835#A4.SS4)and Appendix[D\.5](https://arxiv.org/html/2605.28835#A4.SS5)\.
Enhancing Function\-Calling Ability in Multi\-Turn Dialogues via RL\.Inspired by prior worksQian et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib27)\); Wan et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib34)\), one research question iswhether applying reinforcement learning \(RL\) on our data can further enhance the model’s function\-calling capability, particularly in multi\-turn dialogue scenarios where performance is less satisfactory\. We conduct two sets of experiments:GenesisFunc\-8B\-RL\(all\) applies RL instead of SFT to Qwen3\-8B using our training data;GenesisFunc\-8B\-RL\(part\) focuses on data in multi\-turn scenarios, where the model is first SFT\-trained on single\-turn and special\-case data and then RL\-trained with multi\-turn dialogues\. To encourage deeper reasoning during training, we enable the built\-in “thinking mode” of Qwen3\-8B and augment the training samples with explicit reasoning traces\. We use GRPOShao et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib31)\)for RL, with rewards designed around two dimensions: format compliance and functional correctness\. More details are in Appendix[E](https://arxiv.org/html/2605.28835#A5)\.
Table[3](https://arxiv.org/html/2605.28835#S5.T3)compares the performance of different training strategies on ACEBench\. The model SFT\-ed on our data achieves 83\.67\. Compared with the default SFT setup,GenesisFunc\-8B\-RL \(part\) improves Normal tasks substantially from 73\.60 to 75\.20, with multi\-turn performance improved from 65\.00 to 70\.00, while maintaining 82\.88 on Special subset, confirming thattargeted RL training enhances long\-context reasoning and handling complex interaction, improving generalization on multi\-turn tasks while maintaining stability on Special cases\. These results highlight theadvantage of combining SFT and RLto improve function\-calling abilities of the model\.
Notably, both SFT and RL incur low training costs, as summarized in Appendix[C\.3](https://arxiv.org/html/2605.28835#A3.SS3)\.
## 6Conclusion
This paper introducesGenesisFunc, an automated data\-generation pipeline for strengthening the function\-calling abilities of LLMs\. Beginning with a reliable tool set curated from open\-source benchmarks,GenesisFuncleverages a coordinated multi\-agent framework that assists a dialogue generator in producing high\-quality, diverse, and representative function\-calling training data, while remaining model\-agnostic in design\. In extensive experiments, models trained withGenesisFuncachieve state\-of\-the\-art performance, marking a concrete advance in tool\-augmented AI agents, and the methodology can be extended to more agentic and complex function\-calling tasks using proprietary or open\-source LLMs in future work\.
## Limitations
Despite the effectiveness ofGenesisFunc, several important limitations still remain\. First,GenesisFunc\-8B achieves competitive tool\-use performance but still falls short of API\-based models like GPT\-4 in broader reasoning and comprehension\. Enhancing general abilities alongside function\-calling remains an open challenge\. Second, our training data does not yet fully encompass highly complex multi\-turn scenarios that require tightly coupled tool sequences\. In future work, we plan to focus more on challenging settings, such as agentic workflows and complex benchmarks\. Nevertheless, we firmly believe that, in principle, these more demanding scenarios can also be addressed by extending the methodology developed in this work\.
## Ethical Considerations
##### AI Assistant Use
We used LLMs to assist with improving grammar, clarity, and wording in parts of this work\. The use of LLMs was limited to language refinement, with all ideas, analyses, and conclusions solely developed by the authors\.
## References
- Abdelaziz et al\. \(2024\)Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Sadhana Kumaravel, Matthew Stallone, Rameswar Panda, Yara Rizk, G\. P\. Shrivatsa Bhargav, Maxwell Crouse, R\. Chulaka Gunasekara, Shajith Ikbal, Sachindra Joshi, Hima Karanam, Vineet Kumar, Asim Munawar, Sumit Neelam, Dinesh Raghu, Udit Sharma, Adriana Meza Soria, and 7 others\. 2024\.[Granite\-function calling model: Introducing function calling abilities via multi\-task learning of granular tasks](https://doi.org/10.18653/V1/2024.EMNLP-INDUSTRY.85)\.In*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: EMNLP 2024 \- Industry Track, Miami, Florida, USA, November 12\-16, 2024*, pages 1131–1139\. Association for Computational Linguistics\.
- Basu et al\. \(2024\)Kinjal Basu, Ibrahim Abdelaziz, Subhajit Chaudhury, Soham Dan, Maxwell Crouse, Asim Munawar, Vernon Austel, Sadhana Kumaravel, Vinod Muthusamy, Pavan Kapanipathi, and Luis A\. Lastras\. 2024\.[API\-BLEND: A comprehensive corpora for training and benchmarking API llms](https://doi.org/10.18653/V1/2024.ACL-LONG.694)\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2024, Bangkok, Thailand, August 11\-16, 2024*, pages 12859–12870\. Association for Computational Linguistics\.
- Bauer et al\. \(2024\)André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle Chard, and Ian T\. Foster\. 2024\.[Comprehensive exploration of synthetic data generation: A survey](https://doi.org/10.48550/ARXIV.2401.02524)\.*CoRR*, abs/2401\.02524\.
- Chen et al\. \(2025a\)Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Yuefeng Huang, Wulong Liu, Xinzhi Wang, Defu Lian, Baoqun Yin, Yasheng Wang, and Wu Liu\. 2025a\.[Acebench: Who wins the match point in tool learning?](https://doi.org/10.48550/ARXIV.2501.12851)*CoRR*, abs/2501\.12851\.
- Chen et al\. \(2025b\)Hao Chen, Zhexin Hu, Jiajun Chai, Haocheng Yang, Hang He, Xiaohan Wang, Wei Lin, Luhang Wang, Guojun Yin, and Zhuofeng Zhao\. 2025b\.[Toolforge: A data synthesis pipeline for multi\-hop search without real\-world apis](https://doi.org/10.48550/ARXIV.2512.16149)\.*CoRR*, abs/2512\.16149\.
- Cobbe et al\. \(2021\)Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman\. 2021\.[Training verifiers to solve math word problems](https://arxiv.org/abs/2110.14168)\.*CoRR*, abs/2110\.14168\.
- Comanici et al\. \(2025\)Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit S\. Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan\-Jiang Jiang, and 81 others\. 2025\.[Gemini 2\.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities](https://doi.org/10.48550/ARXIV.2507.06261)\.*CoRR*, abs/2507\.06261\.
- Dubey et al\. \(2024\)Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al\-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, and 82 others\. 2024\.[The llama 3 herd of models](https://doi.org/10.48550/ARXIV.2407.21783)\.*CoRR*, abs/2407\.21783\.
- Hao et al\. \(2024\)Yilun Hao, Yongchao Chen, Yang Zhang, and Chuchu Fan\. 2024\.[Large language models can plan your travels rigorously with formal verification tools](https://doi.org/10.48550/ARXIV.2404.11891)\.*CoRR*, abs/2404\.11891\.
- Hendrycks et al\. \(2021\)Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt\. 2021\.[Measuring massive multitask language understanding](https://openreview.net/forum?id=d7KBjmI3GmQ)\.In*9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3\-7, 2021*\. OpenReview\.net\.
- Hsieh et al\. \(2023\)Cheng\-Yu Hsieh, Si\-An Chen, Chun\-Liang Li, Yasuhisa Fujii, Alexander Ratner, Chen\-Yu Lee, Ranjay Krishna, and Tomas Pfister\. 2023\.[Tool documentation enables zero\-shot tool\-usage with large language models](https://doi.org/10.48550/ARXIV.2308.00675)\.*CoRR*, abs/2308\.00675\.
- Hu et al\. \(2022\)Edward J\. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen\-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen\. 2022\.[Lora: Low\-rank adaptation of large language models](https://openreview.net/forum?id=nZeVKeeFYf9)\.In*The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25\-29, 2022*\. OpenReview\.net\.
- Huang et al\. \(2024\)Shijue Huang, Wanjun Zhong, Jianqiao Lu, Qi Zhu, Jiahui Gao, Weiwen Liu, Yutai Hou, Xingshan Zeng, Yasheng Wang, Lifeng Shang, Xin Jiang, Ruifeng Xu, and Qun Liu\. 2024\.[Planning, creation, usage: Benchmarking llms for comprehensive tool utilization in real\-world complex scenarios](https://doi.org/10.18653/V1/2024.FINDINGS-ACL.259)\.In*Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11\-16, 2024*, pages 4363–4400\. Association for Computational Linguistics\.
- Jin et al\. \(2025\)Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han\. 2025\.[Search\-r1: Training llms to reason and leverage search engines with reinforcement learning](https://doi.org/10.48550/ARXIV.2503.09516)\.*CoRR*, abs/2503\.09516\.
- Li et al\. \(2023\)Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li\. 2023\.[Api\-bank: A comprehensive benchmark for tool\-augmented llms](https://doi.org/10.18653/V1/2023.EMNLP-MAIN.187)\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6\-10, 2023*, pages 3102–3116\. Association for Computational Linguistics\.
- Li et al\. \(2025\)Xuefeng Li, Haoyang Zou, and Pengfei Liu\. 2025\.[Torl: Scaling tool\-integrated RL](https://doi.org/10.48550/ARXIV.2503.23383)\.*CoRR*, abs/2503\.23383\.
- Li et al\. \(2024\)Zongjie Li, Chaozheng Wang, Pingchuan Ma, Daoyuan Wu, Shuai Wang, Cuiyun Gao, and Yang Liu\. 2024\.[Split and merge: Aligning position biases in llm\-based evaluators](https://doi.org/10.18653/V1/2024.EMNLP-MAIN.621)\.In*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12\-16, 2024*, pages 11084–11108\. Association for Computational Linguistics\.
- Lin et al\. \(2024\)Qiqiang Lin, Muning Wen, Qiuying Peng, Guanyu Nie, Junwei Liao, Jun Wang, Xiaoyun Mo, Jiamu Zhou, Cheng Cheng, Yin Zhao, Jun Wang, and Weinan Zhang\. 2024\.[Hammer: Robust function\-calling for on\-device language models via function masking](https://doi.org/10.48550/ARXIV.2410.04587)\.*CoRR*, abs/2410\.04587\.
- Lin et al\. \(2025\)Zihan Lin, Xiaohan Wang, Hexiong Yang, Jiajun Chai, Jie Cao, Guojun Yin, Wei Lin, and Ran He\. 2025\.[Awpo:enhancing tool\-use of large language models through explicit integration of reasoning rewards](https://doi.org/10.48550/ARXIV.2512.19126)\.*CoRR*, abs/2512\.19126\.
- Liu et al\. \(2023\)Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang\. 2023\.[Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation](http://papers.nips.cc/paper_files/paper/2023/hash/43e9d647ccd3e4b7b5baab53f0368686-Abstract-Conference.html)\.In*Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023*\.
- Liu et al\. \(2025a\)Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, and 8 others\. 2025a\.[Toolace: Winning the points of LLM function calling](https://openreview.net/forum?id=8EB8k6DdCU)\.In*The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025*\. OpenReview\.net\.
- Liu et al\. \(2025b\)Ziche Liu, Rui Ke, Yajiao Liu, Feng Jiang, and Haizhou Li\. 2025b\.[Take the essence and discard the dross: A rethinking on data selection for fine\-tuning large language models](https://doi.org/10.18653/V1/2025.NAACL-LONG.336)\.In*Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 \- Volume 1: Long Papers, Albuquerque, New Mexico, USA, April 29 \- May 4, 2025*, pages 6595–6611\. Association for Computational Linguistics\.
- Liu et al\. \(2024\)Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh R\. N\., Liangwei Yang, Silvio Savarese, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, and Caiming Xiong\. 2024\.[Apigen: Automated pipeline for generating verifiable and diverse function\-calling datasets](http://papers.nips.cc/paper_files/paper/2024/hash/61cce86d180b1184949e58939c4f983d-Abstract-Datasets_and_Benchmarks_Track.html)\.In*Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024*\.
- Mialon et al\. \(2023\)Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ramakanth Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi\-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, and Thomas Scialom\. 2023\.[Augmented language models: a survey](https://openreview.net/forum?id=jh7wH2AzKK)\.*Trans\. Mach\. Learn\. Res\.*, 2023\.
- OpenAI \(2023\)OpenAI\. 2023\.[GPT\-4 technical report](https://doi.org/10.48550/ARXIV.2303.08774)\.*CoRR*, abs/2303\.08774\.
- Patil et al\. \(2024\)Shishir G\. Patil, Tianjun Zhang, Xin Wang, and Joseph E\. Gonzalez\. 2024\.[Gorilla: Large language model connected with massive apis](http://papers.nips.cc/paper_files/paper/2024/hash/e4c61f578ff07830f5c37378dd3ecb0d-Abstract-Conference.html)\.In*Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024*\.
- Qian et al\. \(2025\)Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani\-Tür, Gokhan Tur, and Heng Ji\. 2025\.[Toolrl: Reward is all tool learning needs](https://doi.org/10.48550/ARXIV.2504.13958)\.*CoRR*, abs/2504\.13958\.
- Qin et al\. \(2024\)Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun\. 2024\.[Toolllm: Facilitating large language models to master 16000\+ real\-world apis](https://openreview.net/forum?id=dHng2O0Jjr)\.In*The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024*\. OpenReview\.net\.
- Qu et al\. \(2025\)Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji\-Rong Wen\. 2025\.[Tool learning with large language models: a survey](https://doi.org/10.1007/S11704-024-40678-2)\.*Frontiers Comput\. Sci\.*, 19\(8\):198343\.
- Rein et al\. \(2023\)David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R\. Bowman\. 2023\.[GPQA: A graduate\-level google\-proof q&a benchmark](https://doi.org/10.48550/ARXIV.2311.12022)\.*CoRR*, abs/2311\.12022\.
- Shao et al\. \(2024\)Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y\. K\. Li, Y\. Wu, and Daya Guo\. 2024\.[Deepseekmath: Pushing the limits of mathematical reasoning in open language models](https://doi.org/10.48550/ARXIV.2402.03300)\.*CoRR*, abs/2402\.03300\.
- Shi et al\. \(2023\)Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei\. 2023\.[Language models are multilingual chain\-of\-thought reasoners](https://openreview.net/forum?id=fR3wGCk-IXp)\.In*The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023*\. OpenReview\.net\.
- Tang et al\. \(2023\)Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, and Le Sun\. 2023\.[Toolalpaca: Generalized tool learning for language models with 3000 simulated cases](https://doi.org/10.48550/ARXIV.2306.05301)\.*CoRR*, abs/2306\.05301\.
- Wan et al\. \(2025\)Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, and Ming Yan\. 2025\.[Qwenlong\-l1: Towards long\-context large reasoning models with reinforcement learning](https://doi.org/10.48550/ARXIV.2505.17667)\.*CoRR*, abs/2505\.17667\.
- Xie et al\. \(2025\)Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo\. 2025\.[Logic\-rl: Unleashing LLM reasoning with rule\-based reinforcement learning](https://doi.org/10.48550/ARXIV.2502.14768)\.*CoRR*, abs/2502\.14768\.
- Yan et al\. \(2024\)Fanjia Yan, Huanzhi Mao, Charlie Cheng\-Jie Ji, Tianjun Zhang, Shishir G\. Patil, Ion Stoica, and Joseph E\. Gonzalez\. 2024\.Berkeley function calling leaderboard\.
- Yang et al\. \(2025\)An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 40 others\. 2025\.[Qwen3 technical report](https://doi.org/10.48550/ARXIV.2505.09388)\.*CoRR*, abs/2505\.09388\.
- Yao et al\. \(2023\)Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R\. Narasimhan, and Yuan Cao\. 2023\.[React: Synergizing reasoning and acting in language models](https://openreview.net/forum?id=WE_vluYUL-X)\.In*The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023*\. OpenReview\.net\.
- Yu et al\. \(2024\)Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T\. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu\. 2024\.[Metamath: Bootstrap your own mathematical questions for large language models](https://openreview.net/forum?id=N8N0hgNDRt)\.In*The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024*\. OpenReview\.net\.
- Zhang et al\. \(2024\)Dylan Zhang, Justin Wang, and François Charton\. 2024\.[Instruction diversity drives generalization to unseen tasks](https://doi.org/10.48550/ARXIV.2402.10891)\.*CoRR*, abs/2402\.10891\.
- Zhang et al\. \(2025\)Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, Zhiwei Liu, Yihao Feng, Tulika Manoj Awalgaonkar, Rithesh R\. N\., Zeyuan Chen, Ran Xu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, and 2 others\. 2025\.[xlam: A family of large action models to empower AI agent systems](https://doi.org/10.18653/V1/2025.NAACL-LONG.578)\.In*Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 \- Volume 1: Long Papers, Albuquerque, New Mexico, USA, April 29 \- May 4, 2025*, pages 11583–11597\. Association for Computational Linguistics\.
- Zheng et al\. \(2024\)Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, and Yongqiang Ma\. 2024\.[Llamafactory: Unified efficient fine\-tuning of 100\+ language models](https://doi.org/10.48550/ARXIV.2403.13372)\.*CoRR*, abs/2403\.13372\.
- Zhong et al\. \(2024\)Ruizhe Zhong, Xingbo Du, Shixiong Kai, Zhentao Tang, Siyuan Xu, Hui\-Ling Zhen, Jianye Hao, Qiang Xu, Mingxuan Yuan, and Junchi Yan\. 2024\.[LLM4EDA: emerging progress in large language models for electronic design automation](https://doi.org/10.48550/ARXIV.2401.12224)\.*CoRR*, abs/2401\.12224\.
## Appendix ADetails ofGenesisFunc
### A\.1Multi\-Agent Framework
##### System Prompts in Multi\-Agent Framework
In our proposed multi\-agent framework, we employ four distinct agents, Sample Agent, Memory Agent, Function Agent, and Judge Agent, to assist dialogue generation by producing diverse and high\-quality conversations\. Figure[4](https://arxiv.org/html/2605.28835#A1.F4)presents the complete system\-level prompts that define the roles and behaviors of the Sample Agent, Memory Agent, Function Agent, and Judge Agent\. These prompts are used as system prompts throughout the generation process, with bracketed segments serving as dynamically filled placeholders\.
Figure 4:The system prompts of each agent in multi\-agent framework\.
##### Details of Sample Agent
To construct a sampled subset𝒯s⊆𝒯\\mathcal\{T\}\_\{s\}\\subseteq\\mathcal\{T\}from the globalTools Pool, the sample agent employs GPT\-4oOpenAI \([2023](https://arxiv.org/html/2605.28835#bib.bib25)\)to evaluate the semantic and functional similarity between tools\. Given two tool descriptions that specify their name, purpose, and parameter schema, the model assigns a similarity scorer∈\[0,1\]r\\in\[0,1\], wherer=1r=1indicates nearly identical functionality andr=0r=0indicates entirely unrelated purposes\. Each tool pair is scored with a fixed prompt and temperature=0=0to ensure deterministic outputs\. Tools withr\>0\.75r\>0\.75or sharing similar functions are grouped as target tools𝒯target=\{t1,…,tm\}\\mathcal\{T\}\_\{\\text\{target\}\}=\\\{t\_\{1\},\\ldots,t\_\{m\}\\\}\. Distractors𝒯dist=\{tm\+1,…,tm\+n\}\\mathcal\{T\}\_\{\\text\{dist\}\}=\\\{t\_\{m\+1\},\\ldots,t\_\{m\+n\}\\\}are sampled from other clusters according to their relevance scores, categorized as high \(r\>0\.6r\>0\.6\), medium \(0\.3<r≤0\.60\.3<r\\leq 0\.6\), and low \(r≤0\.3r\\leq 0\.3\)\. All threshold values and sampling ratios were empirically tuned for consistency and reproducibility\.
##### Details of the Judge Agent
The Judge Agent selects the best dialogue from multiple candidates based on problem significance and tool appropriateness\. Prior literature has documented that LLM\-based evaluators can exhibit positional bias in pairwise comparisons, where judgments may systematically favor specific positions within promptsLi et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib17)\)\. To mitigate this, candidate dialogues are randomly shuffled before being evaluated by the Judge Agent\. Additionally, we perform an A/B swap test on a subset of pairs, where dialogues are evaluated under swapped ordering\. We observe no consistent directional preference across swapped conditions, indicating that positional bias is unlikely to be a major confounding factor in our selection process\. This analysis enhances the reliability of the Judge Agent’s decisions and helps ensure that the data pipeline’s improvements are not driven by spurious positional effects\.
##### Reproducibility
Our Multi\-Agent Framework is model\-agnostic and does not rely on any proprietary model capabilities or a specific combination of models\. In practice, different stages of the pipeline impose different requirements on model behavior\. In particular, tool\-calling generation benefits from models with stable API\-following performance, while other stages primarily require output consistency rather than advanced reasoning ability\. In our implementation, Gemini\-2\.5 ProComanici et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib7)\)is adopted for tool\-calling related agents due to its stable behavior, which reflects an engineering choice rather than an algorithmic requirement\. Moreover, replacing Gemini\-2\.5 Pro with open\-source models such as Qwen3\-32BYang et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib37)\)preserves the overall workflow and produces coherent training data, although data quality and downstream tool\-calling performance are moderately reduced\. Specifically, on BFCL, the performance decreases by about 0\.5% on the non\-live split and 0\.9% on the live split\. These results indicate that stronger models primarily affect data quality, while the pipeline logic and functionality remain unchanged, supporting both reproducibility and practical accessibility\.
Input:Global tools pool
𝒯\\mathcal\{T\}; history
𝒟\\mathcal\{D\}with type labels
τ\(⋅\)∈𝒞\\tau\(\\cdot\)\\\!\\in\\\!\\mathcal\{C\}; max turns
TmaxT\_\{\\max\}; candidate count
NN
Output:Final dialogueDialog; turn\-level trajectorytraj
𝒯s←SampleAgent\.Select\(𝒯\)\\mathcal\{T\}\_\{s\}\\leftarrow\\textsc\{SampleAgent\.Select\}\(\\mathcal\{T\}\);
//targets \+ distractors
summary,forbidden←MemoryAgent\.Summarize\(𝒟\)\\texttt\{summary\},\\,\\texttt\{forbidden\}\\leftarrow\\textsc\{MemoryAgent\.Summarize\}\(\\mathcal\{D\}\);
//avoid seen types
Dialog
←\[\]\\leftarrow\[\\,\];traj
←\[\]\\leftarrow\[\\,\];
s0←InitState\(𝒯s,summary\)s\_\{0\}\\leftarrow\\textsc\{InitState\}\(\\mathcal\{T\}\_\{s\},\\texttt\{summary\}\)
for*t←0t\\leftarrow 0toTmaxT\_\{\\max\}*do
𝒞𝒜𝒩𝒟←∅\\mathcal\{CAND\}\\leftarrow\\varnothing;
//N candidates per round
for*i←1i\\leftarrow 1toNN*do
qt←U\.Issue\(𝒯s,st,summary,forbidden\)q\_\{t\}\\leftarrow U\.\\textsc\{Issue\}\(\\mathcal\{T\}\_\{s\},s\_\{t\},\\texttt\{summary\},\\texttt\{forbidden\}\);
//user request
at←A\.Plan\(qt,st\)a\_\{t\}\\leftarrow A\.\\textsc\{Plan\}\(q\_\{t\},s\_\{t\}\);
//
𝒜asst=\{ask,call,answer\}\\mathcal\{A\}\_\{\\text\{asst\}\}=\\\{\\texttt\{ask\},\\texttt\{call\},\\texttt\{answer\}\\\}
if*at=*ask*a\_\{t\}=\\texttt\{ask\}*then
ot←U\.Clarify\(at\)o\_\{t\}\\leftarrow U\.\\textsc\{Clarify\}\(a\_\{t\}\);
//supply missing constraints
else if*at=*call*a\_\{t\}=\\texttt\{call\}*then
\{t1,…,tm\}←FunctionAgent\.Select\(𝒯s,qt\)\\\{t\_\{1\},\\ldots,t\_\{m\}\\\}\\leftarrow\\textsc\{FunctionAgent\.Select\}\(\\mathcal\{T\}\_\{s\},q\_\{t\}\);
//one or more tools
𝒫′←FunctionAgent\.SlotSelect\(qt\)\\mathcal\{P\}^\{\\prime\}\\leftarrow\\textsc\{FunctionAgent\.SlotSelect\}\(q\_\{t\}\);
//optional param subset
args←FunctionAgent\.Fill\(qt,𝒜′\)\\texttt\{args\}\\leftarrow\\textsc\{FunctionAgent\.Fill\}\(q\_\{t\},\\mathcal\{A\}^\{\\prime\}\);
//instantiate required \+ optional
rt←FunctionAgent\.Exec\(\{tj\},args\)r\_\{t\}\\leftarrow\\textsc\{FunctionAgent\.Exec\}\(\\\{t\_\{j\}\\\},\\texttt\{args\}\);
//simulate
ot←A\.Summarize\(rt\)o\_\{t\}\\leftarrow A\.\\textsc\{Summarize\}\(r\_\{t\}\);
//explicit tool outputs
else
ot←A\.DirectAnswer\(qt\)o\_\{t\}\\leftarrow A\.\\textsc\{DirectAnswer\}\(q\_\{t\}\);
//non\-tool reply
candi←\(qt,at,ot\)\\texttt\{cand\}\_\{i\}\\leftarrow\(q\_\{t\},a\_\{t\},o\_\{t\}\);
𝒞𝒜𝒩𝒟←𝒞𝒜𝒩𝒟∪\{candi\}\\mathcal\{CAND\}\\leftarrow\\mathcal\{CAND\}\\cup\\\{\\texttt\{cand\}\_\{i\}\\\};
cand∗←JudgeAgent\.Select\(𝒞𝒜𝒩𝒟;significance,suitability\)\\texttt\{cand\}^\{\*\}\\leftarrow\\textsc\{JudgeAgent\.Select\}\(\\mathcal\{CAND\};\\,\\texttt\{significance\},\\,\\texttt\{suitability\}\);
//pick best ofNN
Dialog
←\\leftarrowDialog
∥\\mathbin\{\\\|\}cand∗;traj
←\\leftarrowtraj
∥\\mathbin\{\\\|\}cand∗;
τnew←InferType\(cand∗\)\\tau\_\{\\text\{new\}\}\\leftarrow\\textsc\{InferType\}\(\\texttt\{cand\}^\{\*\}\);
𝒟←𝒟∪\{\(cand∗,τnew\)\}\\mathcal\{D\}\\leftarrow\\mathcal\{D\}\\cup\\\{\(\\texttt\{cand\}^\{\*\},\\tau\_\{\\text\{new\}\}\)\\\};
forbidden←forbidden∪\{τnew\}\\texttt\{forbidden\}\\leftarrow\\texttt\{forbidden\}\\cup\\\{\\tau\_\{\\text\{new\}\}\\\};
//diversity control
st\+1←UpdateState\(st,cand∗\)s\_\{t\+1\}\\leftarrow\\textsc\{UpdateState\}\(s\_\{t\},\\texttt\{cand\}^\{\*\}\);
//recordtt,aa, and payloadaa
if*Stop\(st\+1\)\(s\_\{t\+1\}\)*then
break
;
//single\-turn met / constraints satisfied / max turns
return*Dialog,traj*
Algorithm 1Dialogue Generation with Multi\-Agent Coordination
### A\.2Dialogue Generation System
##### Algorithmic Workflow of Multi\-Agent Dialogue Generation
Algorithm[1](https://arxiv.org/html/2605.28835#algorithm1)illustrates the pseudocode workflow of our multi\-agent dialogue generation system, highlighting the interactions among different agents, the data flow between successive stages, and the overall generation and selection process\.
##### Case Study
We leverage a multi\-agent framework to enable the dialogue generation system to produce three categories of dialogue: single\-turn, multi\-turn, and special cases\. For both single\-turn and multi\-turn, we considered situations where users may accomplish either a single task or multiple tasks simultaneously\. In addition, we incorporate special cases, such as when none of the available tools can address the user’s request, or when the user’s query prevents the model from filling in the tool parameters\.
Figure[9](https://arxiv.org/html/2605.28835#A5.F9)and Figure[10](https://arxiv.org/html/2605.28835#A5.F10)illustrate single\-turn scenarios in which the user intends to invoke one or multiple tools to complete either a single task or multiple tasks\. Figure[11](https://arxiv.org/html/2605.28835#A5.F11)and Figure[12](https://arxiv.org/html/2605.28835#A5.F12)present analogous cases in multi\-turn dialogues, where one or more tools are required to handle a single task or multiple tasks\. Finally, Figure[13](https://arxiv.org/html/2605.28835#A5.F13)demonstrates the special\-case dialogue that we constructed\.
### A\.3Multi\-Stage Evaluation
##### Rule Checker
Table[5](https://arxiv.org/html/2605.28835#A1.T5)outlines the check rules we use, which consist of four complementary aspects: tool definition completeness, call format and argument compliance, dialog structure soundness, and consistency between tool outputs and the assistant’s responses\.
##### Model Checker
The Model Checker verifies the correctness of function\-calling dialogues using the following system\-level prompt:*“You are a Model Checker responsible for verifying the correctness of a function\-calling dialogue\. Given the dialogue, tool specifications, and tool\-call outputs, determine whether the assistant selected the appropriate tool, used correct parameter formats, and provided a faithful answer\. Identify any semantic errors and assign a confidence score between 0 and 1\.”*This prompt defines the Model Checker’s role and behavior across evaluations\. We adopt GPT\-4oOpenAI \([2023](https://arxiv.org/html/2605.28835#bib.bib25)\)as the backend model due to its stable evaluative behavior, which improves robustness in automatic screening\. This choice is an engineering consideration rather than an algorithmic requirement, and alternative models can be used without affecting the evaluation pipeline\.
Table 5:Example rules for our Rule Checker in multi\-stage evaluation\.Table 6:Comparison of dialogue scenario coverage across representative tool\-augmented datasets\. Here,Single\-Singledenotes single\-turn dialogues designed to accomplish a single task\.Single\-Multirefers to single\-turn dialogues that involve multiple tasks within the same interaction\.Multi\-Singleindicates multi\-turn dialogues focused on a single task\.Multi\-Multirepresents multi\-turn dialogues that cover multiple tasks\. andSpecial\-Casecorresponds to challenging scenarios such as the absence of applicable tools or cases where parameter filling is infeasible\. Our dataset uniquely covers all five categories\.
## Appendix BDetails of Training Data
Our constructed function\-calling dataset consists of 2,000 single\-turn dialogue scenarios, 2,000 multi\-turn dialogue scenarios, and 500 special\-case dialogues\. To further demonstrate the strengths of our dataset in terms of quality, coverage, and diversity, we provide a set of statistical analyses\.
##### Relationship Between Training Data and BFCL Evaluation
Our training data uses the tool schemas provided by BFCL\. During training, we do not use any BFCL test queries, test outputs, or instantiated evaluation samples\. All dialogue are newly generated by our multi\-agent framework, including user queries, dialogue contexts, and concrete parameter values\. Although the tool schemas are shared between training and evaluation, this setting follows the standard in\-domain evaluation protocol of BFCL and prior function\-calling benchmarks\. The BFCL evaluation focuses on generalization to unseen queries and novel argument combinations rather than memorization of specific tool invocations\. As a result, the observed performance gains reflect improved tool selection and parameter grounding instead of exposure to evaluation instances\.
##### Quality
In terms of quality, our constructed dataset features an average of 3\.11 tools used per dialogue, 4\.46 turns per multi\-turn scenario, and 3\.27 tasks per multi\-task case\. This design enables each dialogue to cover multiple tools, resulting in a substantially larger number of tool invocation instances than the number of dialogues, and the high number of tool calls, longer dialogue lengths, and richer task compositions expose the model to more complex tool\-using situations, thereby better stimulating and enhancing its function\-calling capability, while ensuring all data are newly generated and do not overlap with evaluation samples\.
##### Coverage
In terms of coverage, our dataset incorporates a broader range of scenarios compared to previously constructed dataset\. This expanded scenario coverage is a key factor in strengthening the model’s function\-calling ability\. An overview of the data statistics used in these representative tool\-augmented LLMs is presented in Table[6](https://arxiv.org/html/2605.28835#A1.T6)\.
Figure 5:Distribution of slot filling ratios across 400 sampled tools\.
##### Diversity
In terms of diversity, we further analyze the slot filling ratios to better demonstrate the diversity of our tool\-call training data\. Specifically, we sample 400 tools with concrete parameter values from the constructed dialogues\. For each tool, we compute the slot filling ratio by dividing the number of filled optional parameters by the total number of available optional parameters, and then group the results into five equal\-width intervals\. Figure[5](https://arxiv.org/html/2605.28835#A2.F5)reports the distribution across these ranges\. The relatively balanced counts indicate that our dataset effectively captures heterogeneous slot utilization patterns, ranging from sparse filling to near\-complete filling, highlighting its diversity\.
##### Efficiency
We examine the efficiency of the data construction process and compare generation cost with other synthetic datasets\. Unlike synthetic function\-calling datasets such as ToolLLMQin et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib28)\)and ToolACELiu et al\. \([2025a](https://arxiv.org/html/2605.28835#bib.bib21)\)that rely on manually designed or synthesized APIs, our approach operates directly on the tools already defined in BFCLYan et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib36)\)\. This design avoids additional tool construction and manual schema engineering, thereby reducing the cost of dataset creation\. The generation cost of our approach arises from model generation within a multi\-agent framework, while the tool space remains fixed and reusable across samples\. Consequently, the overall construction cost scales linearly with the number of generated samples and does not introduce dataset\-specific tool design or manual annotation overhead\. Since existing datasets differ in tool complexity, agent design, and model backbones, a direct numerical comparison of generation cost is difficult\. Nevertheless, under a unified benchmark setting, our approach achieves a balance between the quality of the training data and construction efficiency\.
## Appendix CExperimental Details
### C\.1Benchmarks
##### BFCL
The Berkeley Function\-Calling Benchmark \(BFCL\)Yan et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib36)\)is a comprehensive framework for evaluating the function\-calling abilities of LLMs across multiple languages, domains, and complex scenarios\. It includes 4,951 test cases, with 3,951 single\-turn dialogs and 1,000 multi\-turn dialogs that emphasize dynamic and realistic settings\. The tools pool is broad, combiningNon\-Livetools curated to cover diverse situations withLivetools that are continuously uploaded by users\. BFCL supports two evaluation methods, one based on Python and the other not\. In this work, we adopt the Python\-based approach, which is shown below:
- •Simple Function:This category represents the most basic yet also the most frequently encountered setting, where the input explicitly contains exactly one json function description and the model is expected to correctly invoke precisely that single function\.
- •Multiple Function:In this setting, the model receives 2 to 4 json function descriptions, but only one of them should be invoked\. The task requires the model to identify and select the most suitable function call based on the user’s query and context\.
- •Parallel Function:Here, the model must invoke multiple function calls simultaneously in response to a single user query\. The challenge lies in determining how many calls are required and executing them in parallel, whether the query is phrased as a short request or a longer description\.
- •Parallel Multiple Function:This combines the complexity of both multiple and parallel function settings\. The model is given several function descriptions and must decide, for each one, whether it should be invoked, possibly multiple times or not at all\.
For every category, BFCL provides both AST\-based evaluation and a corresponding executable evaluation\. In the executable setting, Python functions are manually implemented, inspired by publicly available REST API endpoints \(such as retrieving weather data\) as well as directly computable functions \(such as linear regression\)\. The purpose of this executable track is to assess whether the generated function calls can be reliably applied in real\-world applications that depend on stable function execution\.
Table 7:Configuration of hyper\-parameters for model training\.
##### API\-Bank
The API\-BankLi et al\. \([2023](https://arxiv.org/html/2605.28835#bib.bib15)\)is composed of 314 dialogues involving a total of 753 API calls, specifically constructed to evaluate the abilities of LLMs in planning, retrieving, and invoking APIs in diverse scenarios\. Within the dataset, 363 cases require only a single call, while 122 instances involve multiple calls, reflecting both simple and more complex usage patterns\. Performance is systematically measured along three distinct dimensions, providing a comprehensive assessment of tool\-use capability:
- •Call:Evaluates whether the language model can correctly invoke a known API based on a given query\.
- •Retrieval\+Call:Evaluates the model’s ability to first identify and accurately retrieve the correct API from context and then successfully perform the corresponding call when the API is not provided\.
- •Plan\+Retrieval\+Call:Assesses the capacity to plan a sequence of actions, retrieve multiple APIs, and invoke them when the APIs are initially unknown\.
The primary evaluation metric for API\-Bank isaccuracy, formally defined as the ratio of correctly generated predictions to the total number of evaluation attempts\.
##### ACEBench
ACEBenchChen et al\. \([2025a](https://arxiv.org/html/2605.28835#bib.bib4)\)is a bilingual Chinese and English benchmark for assessing LLMs tool use ability under realistic conditions\. It contains about 2,000 annotated instances spanning a broad API set and organizes evaluation into three tracks:
- •Normal:Single/multi\-turn cases with similar\-API and preference settings; calls are checked via AST matching against gold annotations\.
- •Special:Inputs with missing, malformed, or otherwise irrelevant parameters are included to thoroughly test the model’s robustness when handling imperfect or noisy instructions\.
- •Agent:Multi\-turn, multi\-step interactions in sandboxed scenarios, measuring process correctness and end\-to\-end task success\.
Overall performance is computed from track\-wise accuracies, providing fine\-grained diagnostics of tool\-use failures while avoiding reliance on live APIs or external LLM graders\.
### C\.2Baselines
Here we mainly introduce the two open\-source models fine\-tuned on function\-calling data that we use as baselines, while details of the other open\-source models and API\-based models can be found in their publicly available technical reports\.
- •ToolACE\-8BLiu et al\. \([2025a](https://arxiv.org/html/2605.28835#bib.bib21)\): Obtained by fine\-tuning LLaMA3\.1\-8B\-Instruct with function\-calling training data generated through data pipeline ToolACE\.
- •Qwen\-ToolRL:Obtained by fine\-tuning Qwen3\-8B with the publicly available function\-calling training dataset ToolRLQian et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib27)\), which is a hybrid corpus sampling 2K examples from ToolACE and 1K each from HammerLin et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib18)\)and xLAMZhang et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib41)\)\.
### C\.3Compute Cost
We apply the parameter\-efficient LoRAHu et al\. \([2022](https://arxiv.org/html/2605.28835#bib.bib12)\)strategy for fine\-tuning and perform SFT using the LLaMA\-Factory frameworkZheng et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib42)\)\. The primary computational overhead during generation arises from API calls, while post\-training costs remain lightweight\. Specifically, SFT is completed on a single A800 GPU in about 30 minutes, and the RL stage on four A800 GPUs in roughly 4 hours, making the overall compute requirements practical for reproduction\. The training data follows the Alpaca format, where the instruction includes the system prompt, tool pool, and user query, and the output contains the tool call\. Hyper\-parameters are shown in Table[7](https://arxiv.org/html/2605.28835#A3.T7)\.
### C\.4Statistical Significance and Variance Analysis
All results reported in the main text are averaged over three runs with different random seeds\. We also compute the standard deviation \(reported as±\\pmSD in the tables\) to assess model stability\. Across all benchmarks,GenesisFunc\-8B shows small standard deviations, typically below 0\.5, indicating consistent performance across runs\. The observed performance gains ofGenesisFunc\-8B exceed twice the standard deviations on all benchmarks, demonstrating that the improvements are well beyond random variation\. To further verify robustness, we perform paired t\-tests betweenGenesisFunc\-8B and the strongest open\-source baseline \(ToolACE\-8B\), showing statistically significant gains \(p < 0\.05\)\.
Table 8:Ablation study of different slot selection strategies in the function agent\. To compare with our proposed dynamic slot selection approach, we also adopt two fixed strategies: filling all optional parameters and leaving them empty, and evaluate on BFCL\. The best results are marked in bold, and the second best areunderlined\.
## Appendix DAdditional Experiments
### D\.1Study on General Abilities
To assess the broader impact ofGenesisFunc\-8B on general abilities, we evaluated it on five benchmarks: MMLU for knowledgeHendrycks et al\. \([2021](https://arxiv.org/html/2605.28835#bib.bib10)\), EvalPlus for code generationLiu et al\. \([2023](https://arxiv.org/html/2605.28835#bib.bib20)\), GSM8K for mathematicsCobbe et al\. \([2021](https://arxiv.org/html/2605.28835#bib.bib6)\), MGSM for multilingual reasoningShi et al\. \([2023](https://arxiv.org/html/2605.28835#bib.bib32)\), and GPQA for logical reasoningRein et al\. \([2023](https://arxiv.org/html/2605.28835#bib.bib30)\)\. Baselines included LLaMA\-3\.1\-8B\-InstructDubey et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib8)\), Qwen3\-8BYang et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib37)\), ToolACE\-8BLiu et al\. \([2025a](https://arxiv.org/html/2605.28835#bib.bib21)\), and GPT\-4OpenAI \([2023](https://arxiv.org/html/2605.28835#bib.bib25)\)\. As shown in Figure[6](https://arxiv.org/html/2605.28835#A4.F6),GenesisFunc\-8B performs on par with Qwen3\-8B on most benchmarks, indicating that our fine\-tuning substantially improves function\-calling ability while preserving the model’s general abilities\. It also surpasses similarly sized open\-source models on many tasks, demonstrating competitive performance at the 8B scale\. The remaining gap to GPT\-4 in reasoning and comprehension is expected and likely stems from differences in model size and the breadth of training data rather than negative transfer from function\-calling training\. Overall, these results highlight the promise of targeted specialization for function\-calling while maintaining broad competence, and they leave open the challenge of jointly improving multiple capabilities together with function\-calling performance in a single model\.
Figure 6:General abilities of models\. Evaluation are conducted in five dimensions\.
### D\.2Different Slot Selection Strategies in the Function Agent
Table[8](https://arxiv.org/html/2605.28835#A3.T8)presents the results of different slot selection strategies in the function agent\. Compared with two fixed strategies, filling all optional parameters \(max\) and leaving all optional parameters empty \(min\), our proposed dynamic slot selection approach achieves the best overall performance on both the Non\-Live and Live settings\. Specifically,GenesisFunc\-8B\(dynamic\) reaches an overall accuracy of 93\.31 on the Non\-Live split and 84\.83 on the Live split, surpassing both fixed strategies\. These results demonstrate that dynamically selecting relevant slots enables the model to better capture diverse tool\-use patterns while avoiding unnecessary noise, thereby improving the robustness and generalization of function\-calling\.
### D\.3Ablation on Multi\-Stage Evaluation Module
As discussed earlier, we employ a multi\-stage evaluation module to ensure the correctness of the dialogues generated in the previous stage\. To evaluate its effectiveness, we fine\-tune models on two datasets: one that has passed through this evaluation module and another that has not\. The fine\-tuned models are then evaluated on the BFCL benchmark, with results presented in Table[9](https://arxiv.org/html/2605.28835#A4.T9)\. The comparison clearly shows that models trained on data evaluated by the multi\-stage module achieve higher overall accuracy than those trained on non\-evaluated data, thereby demonstrating the effectiveness of our proposed evaluation module\.
Table 9:Ablation study of multi\-stage evaluation module\. We remove the multi\-stage evaluation module, and evaluate on BFCL\. The best results in each category are marked in bold\. The second best results areunderlined\.\(a\)Non\-Live
\(b\)Live
Figure 7:Scaling analysis of function\-calling performance\. We evaluate raw and fine\-tuned Qwen\-3\-xB models of different sizes on the BFCL benchmark in \(a\) Non\-Live and \(b\) Live settings\.
### D\.4Scaling to Different Model Sizes
Scaling laws suggest a close relationship between model capacity and empirical performance\. To examine how function\-calling ability scales with model size, we evaluate the Qwen\-3\-xB model familyYang et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib37)\), covering a range of parameter scales \(4B, 8B, 14B, and 32B\)\. Both raw models and models fine\-tuned on our dataset are assessed on the BFCLYan et al\. \([2024](https://arxiv.org/html/2605.28835#bib.bib36)\), and the corresponding results are presented in Figure[7](https://arxiv.org/html/2605.28835#A4.F7)\. Overall, the results show a steady improvement in function\-calling performance as model size increases across both non\-live and live evaluation settings, with larger models exhibiting stronger robustness and more consistent execution accuracy\. In comparison to raw models, fine\-tuned models consistently achieve higher performance at all scales, indicating that supervision from theGenesisFunceffectively enhances function\-calling behavior\. Importantly, the fine\-tuned models maintain a smooth and stable scaling pattern, suggesting thatGenesisFunccomplements model capacity rather than altering the underlying scaling dynamics\. Together, these observations demonstrate the effectiveness ofGenesisFuncin supporting scalable performance gains for LLMs\.
\(a\)Non\-Live
\(b\)Live
Figure 8:Generalization across model backbones\. We evaluate raw and fine\-tuned LLMs with different backbone architectures on the BFCL benchmark in \(a\) Non\-Live and \(b\) Live settings\.
### D\.5Generalization across Model Backbones
To analyze the impact of backbone choice on function\-calling performance, we evaluate a set of representative LLMs, including Qwen3\-14B, LLaMA\-3\-8B\-Instruct, and LLaMA\-3\.1\-8B\-Instruct\. All models are fine\-tuned using our dataset and evaluated on the BFCL benchmark\. In addition, to mitigate potential confounding effects introduced by different backbones and to more rigorously isolate the contribution of the data\-generation pipeline itself, we conduct a controlled comparison on the same backbone\. Specifically, since ToolACE is trained on LLaMA\-3\.1\-8B\-Instruct, we directly compare the two data\-generation pipelines on this backbone\. The comparative results are summarized in Figure[8](https://arxiv.org/html/2605.28835#A4.F8)\. Figure[8](https://arxiv.org/html/2605.28835#A4.F8)shows that fine\-tuning withGenesisFuncconsistently improves performance across all examined backbones under both non\-live and live evaluation settings, demonstrating strong cross\-backbone robustness\. Despite differences in model architectures and pre\-training objectives, fine\-tuning yields stable and significant gains overall\. Notably, models with lower initial function\-calling accuracy tend to exhibit larger relative improvements, thereby narrowing the performance gap across backbones\. More importantly, the controlled comparison on LLaMA\-3\.1\-8B\-Instruct indicates that, under the same backbone,GenesisFuncoutperforms ToolACE, providing stronger evidence that the observed gains stem from the data\-generation pipeline rather than backbone\-specific characteristics\.
## Appendix EDetails of Reinforcement Learning
### E\.1GRPO Algorithm
To fine\-tune the model with structured rewards, we adoptGrouped Relative Policy Optimization\(GRPO\), a variant of PPO that normalizes advantages within groups of responses derived from the same input query\. This design mitigates variance across samples under identical contexts, thereby stabilizing and accelerating training\. Specifically, for each queryQQ, the rollout responses form a groupGQ=\{\(s1,r1\),\(s2,r2\),…,\(sn,rn\)\}G\_\{Q\}=\\\{\(s\_\{1\},r\_\{1\}\),\(s\_\{2\},r\_\{2\}\),\\ldots,\(s\_\{n\},r\_\{n\}\)\\\}, wheresis\_\{i\}denotes a candidate response andrir\_\{i\}its reward\. Each reward is obtained as the sum of correctness and formatting scores relative to the reference annotation\. Within each group, the meanμQ\\mu\_\{Q\}and standard deviationσQ\\sigma\_\{Q\}of rewards are computed as:
μQ=1n∑i=1nri,σQ=1n∑i=1n\(ri−μQ\)2,\\mu\_\{Q\}=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}r\_\{i\},\\quad\\sigma\_\{Q\}=\\sqrt\{\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\(r\_\{i\}\-\\mu\_\{Q\}\)^\{2\}\},\(2\)and the normalized advantage of responsesis\_\{i\}is defined as:
A\(si\|Q\)=ri−μQσQ\+η,A\(s\_\{i\}\|Q\)=\\frac\{r\_\{i\}\-\\mu\_\{Q\}\}\{\\sigma\_\{Q\}\+\\eta\},\(3\)whereη\\etais a small constant added for numerical stability\.
The policyπθ\\pi\_\{\\theta\}is then updated using the clipped PPO objective with group\-normalized advantages:
rθ\(si\|Q\)=πθ\(si\|Q\)πold\(si\|Q\)\.r\_\{\\theta\}\(s\_\{i\}\|Q\)=\\tfrac\{\\pi\_\{\\theta\}\(s\_\{i\}\|Q\)\}\{\\pi\_\{\\text\{old\}\}\(s\_\{i\}\|Q\)\}\.\(4\)
JGRPO\(θ\)=𝔼\[min\(\\displaystyle J\_\{\\text\{GRPO\}\}\(\\theta\)=\\mathbb\{E\}\\big\[\\min\(rθ\(si\|Q\)A\(si\|Q\),\\displaystyle r\_\{\\theta\}\(s\_\{i\}\|Q\)\\,A\(s\_\{i\}\|Q\),\(5\)clip\(rθ\(si\|Q\),1−ϵ,1\+ϵ\)\\displaystyle\\operatorname\{clip\}\\big\(r\_\{\\theta\}\(s\_\{i\}\|Q\),1\-\\epsilon,1\+\\epsilon\\big\)×A\(si\|Q\)\)\]\.\\displaystyle\\qquad\\qquad\\times A\(s\_\{i\}\|Q\)\)\\big\]\.Unlike standard PPO, GRPO omits the KL penalty against a reference model, thereby allowing greater flexibility in adapting to diverse custom reward functions while still retaining training stability\. This modification improves overall sample efficiency, encourages stronger alignment with structured reward signals, and leads to faster convergence as well as more robust policy performance\.
### E\.2Reward Design
Reward functions play a central role in reinforcement learning, guiding models to generate outputs that are both valid and useful\. Following prior studies on rule\-based reward signalsXie et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib35)\); Jin et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib14)\); Li et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib16)\); Qian et al\. \([2025](https://arxiv.org/html/2605.28835#bib.bib27)\), we adopt a two\-dimensional design that combines structural compliance with functional correctness, tailored to the demands of tool\-augmented dialogue\.
Figure 9:Example of a single\-task scenario within single\-turn dialogue\.##### Structural Compliance
To ensure that generated outputs conform to the expected schema, we introduce astructural compliance reward\. This reward evaluates whether each predicted tool call follows the prescribed format, including the presence of all mandatory fields and the correct logical ordering of elements\. For single\-tool cases, the reward reduces to a simple binary check defined as
rstructural\(i\)=\{1,if thei\-th output followsthe required schema,0,otherwise\.r\_\{\\text\{structural\}\}^\{\(i\)\}=\\begin\{cases\}1,&\\begin\{aligned\} &\\text\{if the $i$\-th output follows\}\\\\ &\\text\{the required schema\},\\end\{aligned\}\\\\ 0,&\\text\{otherwise\}\.\\end\{cases\}\(6\)For multi\-tool cases, the score is computed for each tool call individually according to the above rule and then averaged across all calls:
Rstructural=1N∑i=1Nrstructural\(i\),R\_\{\\text\{structural\}\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}r\_\{\\text\{structural\}\}^\{\(i\)\},\(7\)whereNNdenotes the number of tools in the output\.
##### Functional Correctness
In addition to structural validity, we further assess whether the predicted tool calls can successfully achieve the intended functionality\. For this purpose, we introduce afunctional correctness reward\. This reward explicitly captures the degree of semantic and functional agreement between the predicted calls and the reference calls\. For a single\-tool case, the reward is defined as follows:
rcorrect\(i\)=\{3,if the tool name and all parametersmatch exactly,2,if the tool name is correct andat least one parameter matches,1,if only the tool name is correct,0,otherwise\.r\_\{\\text\{correct\}\}^\{\(i\)\}=\\begin\{cases\}3,&\\begin\{aligned\} &\\text\{if the tool name and all parameters\}\\\\ &\\text\{match exactly\},\\end\{aligned\}\\\\ 2,&\\begin\{aligned\} &\\text\{if the tool name is correct and\}\\\\ &\\text\{at least one parameter matches\},\\end\{aligned\}\\\\ 1,&\\text\{if only the tool name is correct\},\\\\ 0,&\\text\{otherwise\}\.\\end\{cases\}\(8\)For multi\-tool cases, the correctness score is first computed for each tool call using the above rule, and then the results are averaged across all calls:
Rcorrect=1N∑i=1Nrcorrect\(i\),R\_\{\\text\{correct\}\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}r\_\{\\text\{correct\}\}^\{\(i\)\},\(9\)whereNNdenotes the number of tools in the output\.
Finally, we combine the two components to form the overall reward function\. The structural compliance reward ensures that outputs remain syntactically valid, while the functional correctness reward encourages accurate execution of tool calls\. By integrating both signals, the model receives feedback that jointly emphasizes format validity and semantic accuracy\. The final reward is simply defined as follows:
Rfinal=Rstructural\+Rcorrect\.R\_\{\\text\{final\}\}=R\_\{\\text\{structural\}\}\+R\_\{\\text\{correct\}\}\.\(10\)This unified formulation provides a balanced training signal, preventing the model from generating structurally invalid outputs and at the same time guiding it toward functionally reliable tool usage\.
Figure 10:Example of a multi\-task scenario within single\-turn dialogue\.Figure 11:Example of a single\-task scenario within multi\-turn dialogue\.Figure 12:Example of a multi\-task scenario within multi\-turn dialogue\.Figure 13:Example of a special\-case dialogue\.Similar Articles
Evolution through large models
This paper demonstrates that large language models trained on code can significantly enhance genetic programming mutation operators, enabling the generation of hundreds of thousands of functional Python programs for robot design in the Sodarace domain without prior training data. The approach, called Evolution through Large Models (ELM), combines LLMs with MAP-Elites to bootstrap new conditional models for context-specific artifact generation.
GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)
This paper introduces GenericAgent, a self-evolving LLM agent system designed to maximize context information density. It addresses long-horizon limitations through hierarchical memory, reusable SOPs, and efficient compression, achieving better performance with fewer tokens compared to leading agents.
CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution
CoEvolve proposes an agent-data mutual evolution framework for training LLM agents through closed-loop, interaction-driven learning that adapts both the agent and its training data distribution. The method extracts feedback signals from rollout trajectories to guide LLM-based task synthesis, demonstrating significant improvements (15-19% absolute gains) across multiple Qwen models on AppWorld and BFCL benchmarks.
EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL
EnvFactory automates the creation of executable tool environments and natural multi-turn trajectories for training LLMs with agentic reinforcement learning, achieving superior performance on benchmarks like BFCLv3 and MCP-Atlas with fewer environments than prior work.
AgForce Enables Antigen-conditioned Generative Antibody Design
This paper identifies three failure modes in existing antibody design methods (antigen blindness, vocabulary collapse, convergence to marginal distribution) and proposes AgForce, a novel encoder-decoder architecture using graph neural networks and mixture density networks, achieving state-of-the-art binding quality and sequence recovery on the Chimera-Bench benchmark.