Towards Scalable Customization and Deployment of Multi-Agent Systems for Enterprise Applications
Summary
This paper proposes a unified framework for customizing and deploying LLM-based multi-agent systems in enterprise settings, combining model customization through continual pretraining, fine-tuning, and preference optimization with inference optimization using speculative decoding and FP8 quantization. It achieves 4.48x throughput speedup while maintaining performance on enterprise workloads.
# Towards Scalable Customization and Deployment of Multi-Agent Systems for Enterprise Applications
Source: [https://arxiv.org/html/2606.18502](https://arxiv.org/html/2606.18502)
Paresh Dashore†, Shreyas Kulkarni∗, Uttam Gurram∗, Nadia Bathaee, Kartik Balasubramaniam, Genta Indra Winata, Sambit Sahu, Shi-Xiong Zhang

AI Foundations, Capital One
###### Abstract
Large language model (LLM)-based multi-agent systems demonstrate strong performance on complex reasoning and task execution, enabling broad enterprise applications. However, production deployment remains challenging due to domain-specific customization requirements and high latency and inference costs in agentic workflows. We propose a unified framework for customization and efficient deployment of multi-agent systems in real-world settings. The first stage, Agentic Model Customization, combines continual pretraining, supervised fine-tuning, and preference optimization to adapt a compact model to specialized domains while retaining strong agentic capabilities. The second stage, Inference Optimization, integrates speculative decoding and FP8 quantization with targeted calibration to enable cost-efficient serving with minimal quality loss. Across enterprise workloads, our framework enables rapid domain adaptation and achieves a 4.48× speedup in throughput while maintaining performance and improving robustness on long-tail scenarios.
Co-first authors. †Corresponding author. Email: paresh.dashore@capitalone.com
## 1 Introduction
The progress of Large Language Models (LLMs) enables agentic applications, including tool-calling Shi et al. ([2025](https://arxiv.org/html/2606.18502#bib.bib3)); Xu et al. ([2025](https://arxiv.org/html/2606.18502#bib.bib2)); Chakraborty et al. ([2026](https://arxiv.org/html/2606.18502#bib.bib1)); Winata et al. ([2026](https://arxiv.org/html/2606.18502#bib.bib43)) and multi-agent systems Guo et al. ([2024](https://arxiv.org/html/2606.18502#bib.bib7)); Wu et al. ([2024b](https://arxiv.org/html/2606.18502#bib.bib9)). By decomposing complex tasks across specialized agents, multi-agent systems often achieve higher-quality outputs than single-agent approaches. However, coordinating multiple LLM calls incurs significant latency and computational overhead, making deployment challenging in production environments with strict service-level agreement (SLA) requirements. Moreover, the reliance on large models increases infrastructure costs and limits scalability for latency-sensitive, high-volume applications, potentially degrading the user experience.
[Figure 1 diagram: User Query → Understander → Planner → Evaluator → Safe? (Yes: Executor → Explainer → Response; No: Replan)]

Figure 1: Multi-Agent System Pipeline. The sequential workflow routes a user query through specialized agents to produce a tool-based plan. Safety guardrails in the Evaluator Agent ensure that only valid plans proceed to execution and explanation, while invalid or unsafe plans trigger a replanning loop.

[Figure 2 diagram, four panels: (1) Simulation & Evaluation, in which the User Simulator ($S$) and the Multi-Agent System ($M$), powered by the Teacher Model ($\pi_T$), exchange user utterances $u_i$ and responses $a_i$, producing I/O traces ($y$) of prompt–output pairs that the LLM Judge ($J$) refines offline with $\Phi_{\text{refine}}$; (2) Dataset Curation, yielding Domain Alignment ($\mathcal{X}_{\text{CPT}}$), Instruction Tuning ($\mathcal{Y}_{\text{SFT}}$, verified $y^{*}$), and Preference Alignment ($\mathcal{Z}_{\text{DPO}}$, chosen vs. rejected); (3) Sequential Pipeline, from Student Init ($\pi_\theta^{(0)}$) through Stage 1: CPT ($\pi_\theta^{\text{CPT}}$), Stage 2: SFT ($\pi_\theta^{\text{SFT}}$), and Stage 3: DPO to the Customized Student ($\pi_\theta^{\text{BF16}}$); (4) Inference Optimization, applying EAGLE Spec. Decoding ($\pi_\theta^{\text{EAGLE}}$) and FP8 Quantization to obtain the Optimized Model ($\pi_\theta^{\text{EAGLE+FP8}}$).]

Figure 2: End-to-End Pipeline. An end-to-end flow distilling agentic capabilities from a teacher model ($\pi_T$) into a customized student ($\pi_\theta^{\text{BF16}}$) and further optimizing it through inference techniques into the $\pi_\theta^{\text{EAGLE+FP8}}$ model.

At the same time, deployed agentic systems must retain strong task-specific capabilities. Skill transfer in LLMs emerges as a key approach to enabling dense models to acquire multiple competencies through fine-tuning Nottingham et al. ([2024](https://arxiv.org/html/2606.18502#bib.bib4)); Wang et al. ([2025](https://arxiv.org/html/2606.18502#bib.bib6)). This is particularly important in agentic settings, where a single model is expected to perform specialized roles without relying on multiple independent models that are harder to maintain and scale. In addition, compressing knowledge into smaller models, commonly achieved through model distillation, is essential for accelerating inference while preserving performance comparable to larger models. Meanwhile, speculative decoding Leviathan et al. ([2023](https://arxiv.org/html/2606.18502#bib.bib5)); Li et al. ([2024b](https://arxiv.org/html/2606.18502#bib.bib35)) proves to be an effective technique for reducing latency by leveraging smaller models during inference.
To address the deployment challenges of multi-agent systems, we propose an agentic model customization and inference optimization pipeline that substantially reduces latency while preserving strong task performance. Our approach begins with model distillation using a student–teacher framework to consolidate agentic capabilities into a single optimized model. The pipeline further leverages unlabeled data for domain adaptation and knowledge transfer via Continual Pretraining (CPT), incorporates supervised fine-tuning (SFT) during distillation, and applies post-training Direct Preference Optimization (DPO) (Rafailov et al., [2023](https://arxiv.org/html/2606.18502#bib.bib12)) to better align model behavior with desired preferences. Finally, we enhance inference efficiency through a combination of EAGLE speculative decoding Li et al. ([2024b](https://arxiv.org/html/2606.18502#bib.bib35)) and FP8 quantization, achieving additional latency reductions with minimal impact on model quality. We illustrate our sequential multi-agent pipeline in Figure [1](https://arxiv.org/html/2606.18502#S1.F1).
Our contributions are summarized as follows:
- We propose a production-ready multi-agent system that integrates both an agentic model customization pipeline and an inference optimization pipeline for real-world enterprise deployments. The former distills agentic capabilities into a smaller model, whereas the latter preserves model performance while significantly reducing inference latency via EAGLE speculative decoding and FP8 quantization.
- We present an end-to-end (E2E) training pipeline comprising a user-simulator-driven data generation framework and a sequential training process for customized agentic models. Through a systematic analysis of each training stage, we quantify its contribution to production-grade quality and show that preference optimization is essential for achieving competitive performance.
- We conduct comprehensive empirical studies demonstrating that carefully curated mixtures of proprietary and public data enable near-lossless acceleration, and that EAGLE can be tuned to reach higher efficiency and lower latency even while speculating fewer correct tokens.
## 2 Agentic Model Customization
Our system comprises a customer-facing chatbot for automotive retail, governed by an LLM-powered multi-agent workflow. To reduce operational latency, we implement an offline distillation and optimization pipeline to transition from a high-parameter production teacher model, $\pi_T$, to an optimized student model, $\pi_\theta$. All training stages within this customization phase are conducted in BF16 precision.
### 2.1 Multi-Agent System Architecture
To support complex customer interactions in automotive retail, we develop a Multi-Agent System ($M$) powering a customer-facing chatbot. All agents share the same foundation model but differ in context, including memory, knowledge bases, and tool access. This single-model design simplifies production deployment while preserving agent specialization.

As illustrated in Figure [1](https://arxiv.org/html/2606.18502#S1.F1), the system $M$ follows a sequential pipeline of five agents with a planning feedback loop. The system decomposes complex queries across specialized, collaborative roles: the Understander Agent, Planner Agent, Evaluator Agent, Executor Agent, and Explainer Agent. A comprehensive breakdown of each agent's specific responsibilities is provided in Appendix [C.1](https://arxiv.org/html/2606.18502#A3.SS1).
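The control flow of Figure 1 can be sketched as follows. This is a minimal illustration, not the deployed system: the agent callables, the `max_replans` bound, and the fallback message are all hypothetical stand-ins, and each real agent would be an LLM call against the shared foundation model.

```python
from typing import Callable, Dict, List


def run_pipeline(query: str,
                 agents: Dict[str, Callable],
                 max_replans: int = 3) -> str:
    """Route one query through the five agents with a bounded replanning loop.

    `max_replans` is an assumed safeguard; the paper does not state a bound.
    """
    intent = agents["understander"](query)
    for _ in range(max_replans + 1):
        plan = agents["planner"](intent)
        if agents["evaluator"](plan):            # Evaluator acts as the safety gate
            result = agents["executor"](plan)
            return agents["explainer"](result)
        # Plan rejected: fall through and ask the Planner again (replan).
    return "Unable to produce a safe plan."      # hypothetical fallback


# Toy stand-ins for the LLM-backed agents (hypothetical behaviour).
attempts: List[str] = []


def toy_planner(intent: str) -> str:
    attempts.append(intent)
    # The first plan is deliberately unsafe so the replan branch is exercised.
    return "unsafe_plan" if len(attempts) == 1 else f"call_tool({intent})"


toy_agents = {
    "understander": lambda q: "test_drive",
    "planner": toy_planner,
    "evaluator": lambda plan: plan != "unsafe_plan",
    "executor": lambda plan: f"executed {plan}",
    "explainer": lambda result: f"Done: {result}",
}

response = run_pipeline("Schedule a test drive for a Toyota Camry.", toy_agents)
print(response)
```

Even in this toy run, one rejected plan adds a second Planner and Evaluator call to the same request.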
Because a single user request may require multiple multi-turn exchanges and replanning iterations, the cumulative latency and compute costs escalate quickly. Our goal is to maximize throughput on AWS EC2 P5 (8× NVIDIA H100 80GB GPUs) while meeting sub-second end-to-end latency SLAs. However, profiling identifies three primary bottlenecks: (1) cumulative latency from multiple LLM calls per request; (2) massive memory footprints from serving large LLMs; and (3) high generation costs that cannot be solved by prefill optimization alone. This compounding inference overhead necessitates the aggressive distillation and inference optimization strategies detailed in the subsequent sections. Further details regarding our specific deployment constraints and system profiling can be found in Appendix [C.2](https://arxiv.org/html/2606.18502#A3.SS2).
### 2.2 Conversational Data Synthesis via Agentic Simulation
To curate a high-fidelity training corpus, we develop an automated user simulation framework where a specialized User Simulator ($S$) models human customer interactions. As illustrated in the end-to-end pipeline in Figure [2](https://arxiv.org/html/2606.18502#S1.F2), the simulator is driven by an optimized prompt configuration, $\Phi_S$, engineered to maximize conversational diversity and expose the system to complex edge cases. Specifically, the simulation prompt $\Phi_S$ dynamically ingests four distinct context inputs at each turn: (i) the accumulated conversation history $H$, (ii) a targeted set of intent and capability definitions $\mathcal{N}$ mapping to supported business logic, (iii) seed topics $\mathcal{I}$ used to anchor the initial dialogue domain, and (iv) environmental context $\mathcal{E}$, which encompasses available vehicle inventory constraints and synthetic customer profiles.

A single simulation session $T$ continues until the simulator achieves its assigned goal and outputs an `EXIT` token. The interaction follows a sequential turn-taking logic. For each exchange $i$, the simulator generates a user utterance $u_i$ conditioned on the prior history $H_{<i}$, and the multi-agent system $M$ (powered by the teacher model $\pi_T$) generates an assistant response $a_i$:
$$u_i = S(H_{<i}, \mathcal{N}, \mathcal{I}, \mathcal{E}, \Phi_S), \tag{1}$$

$$a_i = M(u_i \mid \pi_T). \tag{2}$$

During simulation, we capture the complete internal state, including all intermediate LLM prompts and corresponding outputs.
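The turn-taking loop of Eqs. (1) and (2) can be sketched as below. The function names, the `max_turns` cap, and the scripted toy simulator are assumptions for illustration. The fixed contexts $\mathcal{N}$, $\mathcal{I}$, $\mathcal{E}$, and $\Phi_S$ are assumed to be captured inside the `simulator` callable.

```python
from typing import Callable, List, Tuple

EXIT = "EXIT"  # sentinel emitted by the simulator when its goal is achieved

Turn = Tuple[str, str]  # (user utterance u_i, assistant response a_i)


def simulate_session(simulator: Callable[[List[Turn]], str],
                     system: Callable[[str], str],
                     max_turns: int = 20) -> List[Turn]:
    """Run one session T; `max_turns` is an assumed safety cap."""
    history: List[Turn] = []
    for _ in range(max_turns):
        u = simulator(history)      # Eq. (1): u_i = S(H_<i, N, I, E, Phi_S)
        if u == EXIT:
            break
        a = system(u)               # Eq. (2): a_i = M(u_i | pi_T)
        history.append((u, a))
    return history


# Scripted toy simulator and system (hypothetical stand-ins for LLM calls).
script = ["Schedule a test drive for a Toyota Camry.", "Saturday at 10am.", EXIT]
toy_simulator = lambda history: script[len(history)]
toy_system = lambda u: f"ack: {u}"

trace = simulate_session(toy_simulator, toy_system)
print(len(trace), "turns recorded")
```

A production version would also log every intermediate agent prompt and output per turn, since those traces are what the later stages consume.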
### 2.3 Refinement via LLM-as-a-Judge
To ensure the distillation of high-quality reasoning, we utilize a model acting as a judge, $J$. We define $J$ such that its reasoning capabilities and parameter scale significantly exceed those of the teacher model ($J \gg \pi_T$). For every teacher-generated response $y \in T$, the judge generates a refined response $y^{*}$ using a specialized instruction-adherence prompt $\Phi_{\text{refine}}$:

$$y^{*} = J(y, \Phi_{\text{refine}}). \tag{3}$$
### 2.4 Dataset Formulation
Using the refined traces, we construct three corpora for the student model: (i) Domain Alignment ($\mathcal{X}_{\text{CPT}}$), comprising synthetic and public unlabeled datasets, alongside domain-specific automotive texts; (ii) Instruction Tuning ($\mathcal{Y}_{\text{SFT}}$), containing judge-refined outputs $y^{*}$ to distill teacher proficiency; and (iii) Preference Alignment ($\mathcal{Z}_{\text{DPO}}$), consisting of chosen/rejected triples $(x, y^{*}, y)$.
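As a rough sketch of how the SFT and DPO corpora could be assembled from refined traces: the record field names are hypothetical, and dropping pairs where the judge left the teacher output unchanged is an assumption of this example, not something the paper states.

```python
from typing import Dict, List, NamedTuple, Tuple


class Trace(NamedTuple):
    prompt: str    # x: an intermediate agent prompt captured during simulation
    teacher: str   # y: the teacher's original output
    refined: str   # y*: the judge-refined output


def build_corpora(traces: List[Trace]) -> Tuple[List[Dict], List[Dict]]:
    """Return (Y_SFT, Z_DPO) records built from refined traces."""
    sft = [{"prompt": t.prompt, "completion": t.refined} for t in traces]
    dpo = [
        {"prompt": t.prompt, "chosen": t.refined, "rejected": t.teacher}
        for t in traces
        if t.refined != t.teacher  # assumed filter: identical pairs carry no preference signal
    ]
    return sft, dpo


traces = [
    Trace("Your task is to extract ...", "testdrive", "test_drive"),
    Trace("Your task is to extract ...", "trade_in", "trade_in"),
]
sft_records, dpo_records = build_corpora(traces)
print(len(sft_records), len(dpo_records))
```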
### 2.5 Agentic Training Procedure
#### Stage 0: Model Curation.
We initialize the policy model $\pi_\theta^{(0)}$ by applying block expansion Wu et al. ([2024a](https://arxiv.org/html/2606.18502#bib.bib32)) to a base foundation model $\pi_{\text{base}}$ Grattafiori et al. ([2024](https://arxiv.org/html/2606.18502#bib.bib33)). Specifically, we insert one new transformer block after every four original blocks to increase model capacity for domain-specific adaptation. The attention and feed-forward weights of each inserted block are initialized to zero. Because of the residual connections, these blocks initially act as identity mappings, allowing hidden states to pass through unchanged. Consequently, the expanded model $\pi_\theta^{(0)}$ retains the exact behavior and performance of $\pi_{\text{base}}$ at initialization, consistent with the findings of Wu et al. ([2024a](https://arxiv.org/html/2606.18502#bib.bib32)).
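The sketch below illustrates the two properties described above: the layout produced by inserting one block after every four, and why a zero-initialized residual block is an identity map. The toy `residual_block` collapses attention and feed-forward into a single scalar weight purely for illustration. The depth of 32 blocks is assumed from the public Llama 3.1 8B configuration.

```python
from typing import List


def expanded_layout(num_blocks: int, every: int = 4) -> List[str]:
    """Insert one new block after every `every` original blocks."""
    layout: List[str] = []
    for i in range(num_blocks):
        layout.append(f"orig_{i}")
        if (i + 1) % every == 0:
            layout.append(f"new_{(i + 1) // every - 1}")
    return layout


def residual_block(x: List[float], weight: float) -> List[float]:
    """Toy residual block: y = x + f(x), with f(x) = weight * x."""
    return [xi + weight * xi for xi in x]


layout = expanded_layout(32)                  # assumed depth of the 8B base model
hidden = [0.5, -1.25, 3.0]
passed = residual_block(hidden, weight=0.0)   # zero-initialized inserted block

print(len(layout), passed == hidden)
```

Under this assumption the expanded model has 40 blocks, which is in line with the roughly 10B-parameter student described in the experimental setup.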
#### Stage 1: Context-aware Continual Pretraining.
Continual pretraining is widely used to adapt pretrained models to new domain data, but updating model parameters on new distributions can degrade previously acquired capabilities, a phenomenon known as catastrophic forgetting Winata et al. ([2023](https://arxiv.org/html/2606.18502#bib.bib34)). Our method is motivated by the observation that the per-token loss is consistently much higher at the beginning of each training sequence, as shown in Figure [5](https://arxiv.org/html/2606.18502#A2.F5). In the CPT setting, where the model already possesses strong general linguistic knowledge, this initial loss spike is often caused by limited preceding context rather than a true failure to model the domain content, making it an inefficient and potentially noisy training signal. To reduce this effect, we propose *Context-aware Continual Pretraining*, which prepends sample-specific context $C_x$ to each training document before updating the model, thereby smoothing the token-level loss and reducing abrupt distributional shifts during adaptation. The full mathematical formulation of the loss ($\mathcal{L}_{\text{CA-CPT}}$) is detailed in [Appendix A](https://arxiv.org/html/2606.18502#A1). To further mitigate forgetting, we perform model merging after each training stage to combine domain-specific adaptation with the general capabilities of the original model, yielding the merged model $\pi_\theta^{\text{CPT}}$.
#### Stage 2: Agentic Fine Tuning.
Starting from the merged $\pi_\theta^{\text{CPT}}$, we perform Supervised Fine-Tuning on $\mathcal{Y}_{\text{SFT}}$ using Low-Rank Adaptation (LoRA). We avoid full-parameter fine-tuning to prevent catastrophic forgetting and ensure robustness to future prompt updates. The adapters are merged to create $\pi_\theta^{\text{SFT}}$. The SFT loss objective ($\mathcal{L}_{\text{SFT}}$) is provided in [Appendix A](https://arxiv.org/html/2606.18502#A1).
#### Stage 3: Agentic Preference Tuning.
Using $\pi_\theta^{\text{SFT}}$ as the reference, we apply DPO (Rafailov et al., [2023](https://arxiv.org/html/2606.18502#bib.bib12)) using LoRA on $\mathcal{Z}_{\text{DPO}}$. This stage aligns the student with the judge's logic and corrects teacher errors. The complete optimization objective ($\mathcal{L}_{\text{DPO}}$) is outlined in [Appendix A](https://arxiv.org/html/2606.18502#A1). The final adapters are merged to yield the optimized student model, $\pi_\theta^{\text{BF16}}$, which serves as the foundational BF16 checkpoint for downstream inference optimization.
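For orientation, the standard per-example DPO loss from Rafailov et al. (2023) can be written in a few lines. This is the textbook form, not necessarily the exact objective in Appendix A. The log-probability values and `beta = 0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (x, y*, y) triple.

    Inputs are summed sequence log-probabilities under the policy and the
    frozen reference (here, the SFT model). `beta` is an assumed value.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written to avoid overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# At initialization the policy equals the reference, so the margin is zero.
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# After training, the policy favors the judge-refined (chosen) response.
loss_after = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(round(loss_at_init, 4), round(loss_after, 4))
```

The loss starts at $\log 2$ and decreases as the policy raises the chosen response's likelihood relative to the rejected teacher output.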
## 3 Inference Optimization
### 3.1 EAGLE
Speculative decoding utilizes a draft model to predict tokens that are verified in parallel by the target model Chen et al. ([2023](https://arxiv.org/html/2606.18502#bib.bib31)); Leviathan et al. ([2023](https://arxiv.org/html/2606.18502#bib.bib5)). Crucially, because the target model verifies every drafted token, it accelerates generation without degrading accuracy. EAGLE Li et al. ([2024b](https://arxiv.org/html/2606.18502#bib.bib35), [a](https://arxiv.org/html/2606.18502#bib.bib42), [2026](https://arxiv.org/html/2606.18502#bib.bib24)) is a lightweight draft model that consumes target-model hidden states to increase token acceptance rates, yielding higher throughput. However, prior work offers limited analysis of training the EAGLE drafter for domain-specific applications. We demonstrate that, with carefully curated training data, we can achieve significant throughput improvements, independent of the decoding algorithm. We denote the EAGLE-augmented student as $\pi_\theta^{\text{EAGLE}}$. We detail the architectural trade-offs of draft model quantization and tree versus greedy decoding in Appendix [E](https://arxiv.org/html/2606.18502#A5.SS0.SSS0.Px5).
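The draft-then-verify idea can be illustrated with a generic greedy sketch. This is not EAGLE itself: EAGLE drafts from target-model hidden states, whereas the toy drafter here is a plain next-token function. A real system verifies all drafted positions in a single batched target pass rather than a loop. All names and the toy token lists are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[str]], str]


def speculative_step(prefix: List[str], draft_next: NextToken,
                     target_next: NextToken, k: int = 4) -> List[str]:
    """One greedy draft-and-verify step; returns the tokens emitted."""
    # 1) The drafter proposes k tokens autoregressively.
    ctx, drafted = list(prefix), []
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2) The target checks each position; keep the longest agreeing prefix.
    ctx, emitted = list(prefix), []
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            emitted.append(expected)     # replace the first mismatch with the target's token
            return emitted
        emitted.append(tok)
        ctx.append(tok)
    emitted.append(target_next(ctx))     # all drafts accepted: one bonus token
    return emitted


TARGET = ["I", "can", "book", "that", "test", "drive", "."]
DRAFT = ["I", "can", "book", "a", "test", "drive", "."]
target_next = lambda ctx: TARGET[len(ctx)]
draft_next = lambda ctx: DRAFT[len(ctx)]

step = speculative_step(["I"], draft_next, target_next, k=4)
print(step)
```

The emitted tokens match what greedy decoding with the target alone would produce. The gain comes from emitting several tokens per target pass.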
### 3.2 FP8 Post-Training Quantization
FP8 quantization Kuzmin et al. ([2022](https://arxiv.org/html/2606.18502#bib.bib41)) compresses weights and activations, which lowers memory requirements and speeds up computation, thereby reducing latency. Additionally, quantizing the KV cache to FP8 halves the storage compared to FP16/BF16, effectively allowing for doubled context lengths or larger batch sizes. We apply FP8 (E4M3) weight-and-activation quantization (W8A8) to the $\pi_\theta^{\text{BF16}}$ ([Section 2](https://arxiv.org/html/2606.18502#S2)) and $\pi_\theta^{\text{EAGLE}}$ models using min-max per-tensor post-training quantization (PTQ), yielding the quantized student $\pi_\theta^{\text{FP8}}$ and optimized model $\pi_\theta^{\text{EAGLE+FP8}}$, respectively. Static quantization sets scales offline from activation statistics collected on calibration data, avoiding runtime calibration overhead. We select W8A8 over weight-only schemes (e.g., AWQ, GPTQ) because W8A8 maintains throughput gains under high concurrency when inference becomes compute-bound, and the FP8 format minimizes accuracy regression relative to integer quantization. The optimization methods described in Sections [3.1](https://arxiv.org/html/2606.18502#S3.SS1) and [3.2](https://arxiv.org/html/2606.18502#S3.SS2) are complementary and can be stacked on top of the $\pi_\theta^{\text{BF16}}$ model to produce compounding gains, as shown in Table [1](https://arxiv.org/html/2606.18502#S4.T1).
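A minimal sketch of static min-max per-tensor scaling follows. The scale is fixed offline from calibration data, and runtime values beyond the calibrated range saturate at the largest finite E4M3 magnitude (448). Mantissa rounding is omitted, the function names are hypothetical, and the sample values are made up for illustration.

```python
from typing import List

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def calibrate_scale(calibration_batches: List[List[float]]) -> float:
    """Static per-tensor scale from the absolute maximum seen in calibration."""
    amax = max(abs(v) for batch in calibration_batches for v in batch)
    return amax / E4M3_MAX


def to_fp8_range(values: List[float], scale: float) -> List[float]:
    """Scale into the E4M3 range and saturate (mantissa rounding omitted)."""
    return [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in values]


# Offline: collect activation statistics on calibration data.
scale = calibrate_scale([[0.5, -2.0], [1.0, 3.5]])

# Runtime: -7.0 exceeds the calibrated range, so it saturates.
scaled = to_fp8_range([1.75, -7.0], scale)
print(scale, scaled)
```

Because the scale is frozen, any test-time activation larger than the calibration maximum is clipped, which is why the choice of calibration data matters.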
## 4 Experimental Setup
We construct a customized 10B model, denoted by $\pi_\theta^{(0)}$, following the Stage 0 procedure. This model is built upon Llama 3.1 8B Instruct Grattafiori et al. ([2024](https://arxiv.org/html/2606.18502#bib.bib33)), which serves as the base policy model $\pi_{\text{base}}$. The CPT stage consumes approximately 5T tokens, consisting of a mixture of in-domain and public-domain data. During the agentic fine-tuning (AFT) stage, we use a larger teacher model, $\pi_T$ (Llama 3 70B Instruct), to curate the supervised fine-tuning corpus. We train a 250M EAGLE drafter using responses and hidden states generated by $\pi_\theta^{\text{BF16}}$ on a combined dataset of 127k samples, consisting of 77k open-source dialogue traces and 50k proprietary synthetic simulations from $M$.

Following the training phase, we utilize AWS EC2 P5 instances to measure inference performance and run FP8 calibration experiments on a mixed dataset, comprising 1.4k public samples from Nallapati et al. ([2016](https://arxiv.org/html/2606.18502#bib.bib38)) and 5.8k in-domain synthetic traces from $M$. All inference metrics are obtained using the NVIDIA TensorRT-LLM v19 framework. A comprehensive breakdown of our training infrastructure, detailed dataset synthesis, and sequential distillation hyperparameters is provided in Appendix [B](https://arxiv.org/html/2606.18502#A2).
| Configuration | Decoding | MGL | Latency (s) | QPS | Speedup |
| --- | --- | --- | --- | --- | --- |
| Llama 3 70B ($\pi_T$) | – | – | 3.92 | 1.46 | 1.00× |
| Baseline ($\pi_\theta^{\text{BF16}}$) | – | – | 1.69 | 3.40 | 2.33× |
| Baseline ($\pi_\theta^{\text{FP8}}$) | – | – | 1.60 | 3.66 | 2.50× |
| $\pi_\theta^{\text{EAGLE+FP8}}$ | Greedy | 3.80 | 0.92 | 6.54 | 4.48× |

Table 1: P90 latency and throughput across optimization configurations. MGL = Mean Generated Length; QPS = Queries Per Second. $\pi_\theta^{\text{EAGLE+FP8}}$ combines EAGLE with FP8 quantization. All configurations use BF16 unless noted otherwise.

| Configuration | Decoding | MGL | Latency (s) | QPS | Speedup |
| --- | --- | --- | --- | --- | --- |
| Baseline ($\pi_\theta^{\text{BF16}}$) | – | – | 1.69 | 3.40 | 1.00× |
| $\pi_\theta^{\text{EAGLE}}$ (E) | Tree | 3.66 | 1.50 | 4.04 | 1.19× |
| $\pi_\theta^{\text{EAGLE}}$ (S) | Tree | 3.98 | 1.29 | 4.66 | 1.37× |
| $\pi_\theta^{\text{EAGLE}}$ (C) | Tree | 4.29 | 1.19 | 4.96 | 1.46× |
| $\pi_\theta^{\text{EAGLE}}$ (S) | Greedy | 3.53 | 1.13 | 5.50 | 1.62× |
| $\pi_\theta^{\text{EAGLE}}$ (C) | Greedy | 3.80 | 0.96 | 6.07 | 1.78× |

Table 2: Performance across different EAGLE configurations. E = External, S = Synthetic, C = Combined.

Figure 3: Performance evaluation of the multi-agent system across different agents and training stages, evaluated on a held-out set of 1,424 simulated conversations (8,848 data points). It also includes End-to-End functional stress test pass rates across 120 complex scenarios.

[Figure 4 plot: x-axis, calibration sequence length (tokens), 32 to 2048; y-axis, activation clip rate ($\times 10^{-5}$%), $10^{-3}$ to $10^{1}$. Inset E2E pass rate: P 97.27%, I 98.18%, M 100.00%.]

Figure 4: Activation clip rate ($\times 10^{-5}$%) vs. calibration sequence length for $\pi_\theta^{\text{BF16}}$. P = Public, I = In-domain (Synthetic), M = Mixed (P+I). 1024 calibration/test samples; 8192-token test length; both axes log-scaled.
## 5 Results and Analysis
We measure distillation success and inference optimizations using agent-level evaluations (1,424 simulated conversations, 8,848 data points) and an E2E functional stress test (120 scenarios, 10 turns each) that simulates mid-conversation business switches. Figure [3](https://arxiv.org/html/2606.18502#S4.F3) summarizes these results.
#### Distillation Progression.
Task metrics validate our multi-stage pipeline (Figure [3](https://arxiv.org/html/2606.18502#S4.F3)). Continual Pretraining alone ($\pi_\theta^{\text{CPT}}$) yields poor agentic capabilities and near-total E2E failure due to weak instruction-following. Supervised Fine-Tuning ($\pi_\theta^{\text{SFT}}$) establishes foundational task structures, driving substantial improvements and a higher E2E success rate. DPO bridges the remaining capability gap. The final distilled student ($\pi_\theta^{\text{BF16}}$) outperforms the 70B teacher ($\pi_T$) on the Planner Agent and Understander Agent (Figure [3](https://arxiv.org/html/2606.18502#S4.F3)), and navigates all E2E scenarios to exceed the teacher's baseline. Additionally, $\pi_\theta^{\text{BF16}}$ achieves a 2.33× speedup over $\pi_T$ (Table [1](https://arxiv.org/html/2606.18502#S4.T1)).
#### Inference Optimization Impact.
Stacking optimization techniques produces $\pi_\theta^{\text{EAGLE+FP8}}$, which runs 1.92× faster than $\pi_\theta^{\text{BF16}}$ and 4.48× faster than $\pi_T$ (Table [1](https://arxiv.org/html/2606.18502#S4.T1)). Task-level accuracy remains highly resilient under quantization and speculative decoding: $\pi_\theta^{\text{EAGLE+FP8}}$ maintains near-identical performance to the unquantized student, with only a minor regression in the Planner agent that still exceeds the teacher baseline (Figure [3](https://arxiv.org/html/2606.18502#S4.F3)). Crucially, $\pi_\theta^{\text{EAGLE+FP8}}$ retains a perfect E2E stress test pass rate, confirming that our calibration and training data mixing strategies preserve agentic behavior.
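The speedups quoted above appear consistent with ratios of the QPS values in Table 1, which the following check reproduces. The dictionary keys are hypothetical labels, and reading "speedup" as a QPS ratio is inferred from the table rather than stated in the text.

```python
# QPS values copied from Table 1; keys are illustrative labels.
qps = {
    "teacher_70b": 1.46,
    "student_bf16": 3.40,
    "student_eagle_fp8": 6.54,
}


def speedup(config: str, baseline: str) -> float:
    """Throughput ratio between two configurations, rounded to two decimals."""
    return round(qps[config] / qps[baseline], 2)


distill_gain = speedup("student_bf16", "teacher_70b")
inference_gain = speedup("student_eagle_fp8", "student_bf16")
total_gain = speedup("student_eagle_fp8", "teacher_70b")

print(distill_gain, inference_gain, total_gain)
```

The product of the distillation gain and the inference gain approximately recovers the overall figure, which reflects the compounding nature of the two stages.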
#### EAGLE Alignment.
We analyze the performance of three EAGLE drafters trained on the combined dataset and its individual external and synthetic subsets. Table [2](https://arxiv.org/html/2606.18502#S4.T2) shows that synthetic training is more effective than using external data alone (1.37× vs. 1.19×), underscoring the value of in-domain alignment. Combining both data sources further improves performance to 1.46×, suggesting complementary gains in generalization across both tree and greedy decoding settings. While greedy decoding lowers MGL from 4.29 to 3.80, the reduced drafting cost and target-model verification overhead outweigh the decrease in acceptance rate, yielding a peak speedup of 1.78× under target-serving concurrency.
#### FP8 Calibration Performance.
Static FP8 Post-Training Quantization is sensitive to the calibration data being used, so we measure the calibration performance on the mixed set along with its individual public and in-domain subsets. The E2E pass rate in Figure [4](https://arxiv.org/html/2606.18502#S4.F4) shows that we preserve performance with the mixed calibration set, and we analyze this using the activation clip rate: the fraction of test-time activations falling outside the per-tensor min/max bounds set during calibration. High clip rates indicate insufficient dynamic range and can cause silent regressions on long contexts.
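The activation clip rate as defined above can be computed with a few lines. This sketch treats one tensor's activations as a flat list. The function name and sample values are hypothetical and are not taken from the paper's measurements.

```python
from typing import List


def clip_rate(calibration: List[float], test: List[float]) -> float:
    """Fraction of test-time activations outside the calibrated [min, max] range."""
    lo, hi = min(calibration), max(calibration)
    clipped = sum(1 for v in test if v < lo or v > hi)
    return clipped / len(test)


calibration_acts = [-1.0, 0.2, 0.8, 1.5]
test_acts = [-1.2, 0.0, 0.5, 1.4, 1.6, 0.9, -0.3, 2.0]

rate = clip_rate(calibration_acts, test_acts)
print(rate)
```

A calibration set whose activations span a wider range lowers this rate, which is the effect the mixed public and in-domain set is meant to achieve on long prompts.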
#### Clip Rate Regimes.
Figure [4](https://arxiv.org/html/2606.18502#S4.F4) shows two regimes. In the first regime, below 128 tokens, the public set has the lowest clip rate, since short internet snippets already cover a broad activation range. In the second regime, from 256 tokens onward, the mixed set dominates: at 2,048 tokens, it is 6.1× lower than the public set and 1.4× lower than the in-domain set. Production prompts in our system routinely exceed 1,000 tokens once tools, memory, and few-shot exemplars are injected, so the long-context regime is the operative one and motivates the mixed set as the deployed default.
## 6 Related Work
Multi-agent architectures decompose problems across specialized agents Guo et al. ([2024](https://arxiv.org/html/2606.18502#bib.bib7)); Hong et al. ([2024](https://arxiv.org/html/2606.18502#bib.bib10)); Wu et al. ([2024b](https://arxiv.org/html/2606.18502#bib.bib9)), with coordination patterns including sequential pipelines and hierarchical orchestration Du et al. ([2024](https://arxiv.org/html/2606.18502#bib.bib11)). Chen et al. ([2024](https://arxiv.org/html/2606.18502#bib.bib8)) observe that increasing agent calls yields diminishing returns without optimization. Our work shows how inference-level optimizations reduce per-call cost and increase achievable throughput in agentic workflows. Speculative decoding methods like EAGLE Li et al. ([2024b](https://arxiv.org/html/2606.18502#bib.bib35), [a](https://arxiv.org/html/2606.18502#bib.bib42), [2026](https://arxiv.org/html/2606.18502#bib.bib24)) accelerate generation via draft models, but their use in application-specific settings remains under-explored. We show the impact of mixed training data on draft acceptance rates and how it improves throughput. Post-training quantization compresses LLMs without retraining Frantar et al. ([2023](https://arxiv.org/html/2606.18502#bib.bib19)); Xiao et al. ([2023](https://arxiv.org/html/2606.18502#bib.bib20)); Shen et al. ([2024](https://arxiv.org/html/2606.18502#bib.bib16)). FP8 quantization preserves quality better than integer formats for certain workloads Shen et al. ([2024](https://arxiv.org/html/2606.18502#bib.bib16)); Fishman et al. ([2025](https://arxiv.org/html/2606.18502#bib.bib18)). However, direct comparisons of public versus application-specific calibration data for FP8 PTQ are limited. Our results show that data mixture composition strongly affects quality preservation under compression.
## 7 Practical Takeaways
The process of distilling complex agentic workflows into a compact, production\-ready model yields several important findings\. First, student model performance is largely bounded by the fidelity of the synthetic trajectories produced by the Agent Simulator, underscoring the critical role of high\-quality synthetic data\. Second, LoRA\-based adaptation is necessary to preserve zero\-shot generalization under evolving system prompts, whereas full\-parameter fine\-tuning degrades this capability\. Finally, a layered optimization strategy that combines the CPT–SFT–DPO distillation pipeline with custom EAGLE drafters and FP8 quantization delivers a sustained 4\.48× end\-to\-end speedup without measurable degradation in task intelligence\. Additional details on these training methodologies and inference trade\-offs are provided in Appendix[E](https://arxiv.org/html/2606.18502#A5)\.
## 8 Conclusion
We describe an integrated optimization framework for our deployed, production-ready Multi-Agent System that achieves a 4.48× improvement in throughput with no measurable loss in quality. Our experiments show that FP8 post-training quantization requires mixed calibration to preserve performance, application-specific EAGLE drafters substantially outperform generic variants in token acceptance rates, and jointly optimized system components deliver multiplicative efficiency gains. These findings highlight the importance of holistic optimization and demonstrate that deployment-scale improvements in multi-agent LLM systems depend on coordinated data engineering, model adaptation, and system-level optimization.
### Limitations
The reported results are specific to our production multi-agent system in the automotive retail domain. The core principles of data alignment and multi-layer optimization generalize to other use cases, but exact performance gains vary by application. During training, our distillation pipeline relies heavily on models acting as teacher and automated judge to simulate and verify synthetic traces. Any inherent biases, domain blind spots, or reasoning gaps in these models inevitably propagate to the student. This dependency requires manual intervention, including our hand-crafted DPO pairs, to correct business-logic edge cases that the automated judge misses. Additionally, generating and verifying hundreds of thousands of multi-turn conversational traces requires significant upfront computational resources. Moreover, the EAGLE drafter must be retrained whenever system prompts or business logic are updated, as these modifications induce distribution shifts that degrade draft acceptance rates. Finally, some of our inference optimizations are hardware-dependent. FP8 quantization requires native hardware support, such as NVIDIA Hopper GPUs, and falling back to higher-precision execution severely reduces the reported throughput benefits.
## References
- A\. Chakraborty, P\. Dashore, N\. Bathaee, A\. Jain, A\. Das, S\. Zhang, S\. Sahu, M\. Naphade, and G\. Winata \(2026\)T1: a tool\-oriented conversational dataset for multi\-turn agentic planning\.Advances in Neural Information Processing Systems38\.Cited by:[§1](https://arxiv.org/html/2606.18502#S1.p1.1)\.
- C\. Chen, S\. Borgeaud, G\. Irving, J\. Lespiau, L\. Sifre, and J\. Jumper \(2023\)Accelerating large language model decoding with speculative sampling\.arXiv preprint arXiv:2302\.01318\.Cited by:[§3\.1](https://arxiv.org/html/2606.18502#S3.SS1.p1.1)\.
- L\. Chen, J\. Davis, B\. Hanin, P\. Bailis, I\. Stoica, M\. Zaharia, and J\. Zou \(2024\)Are more llm calls all you need? towards the scaling properties of compound ai systems\.Advances in Neural Information Processing Systems37,pp\. 45767–45790\.Cited by:[§6](https://arxiv.org/html/2606.18502#S6.p1.1)\.
- Y\. Du, S\. Li, A\. Torralba, J\. B\. Tenenbaum, and I\. Mordatch \(2024\)Improving factuality and reasoning in language models through multiagent debate\.InProceedings of the 41st International Conference on Machine Learning,pp\. 11733–11763\.Cited by:[§6](https://arxiv.org/html/2606.18502#S6.p1.1)\.
- M\. Fishman, B\. Chmiel, R\. Banner, and D\. Soudry \(2025\)Scaling fp8 training to trillion\-token llms\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 98631–98644\.Cited by:[§6](https://arxiv.org/html/2606.18502#S6.p1.1)\.
- E\. Frantar, S\. Ashkboos, T\. Hoefler, and D\. Alistarh \(2023\)OPTQ: accurate quantization for generative pre\-trained transformers\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=tcbBPnfwxS)Cited by:[§6](https://arxiv.org/html/2606.18502#S6.p1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§2\.5](https://arxiv.org/html/2606.18502#S2.SS5.SSS0.Px1.p1.4),[§4](https://arxiv.org/html/2606.18502#S4.p1.5)\.
- T\. Guo, X\. Chen, Y\. Wang, R\. Chang, S\. Pei, N\. V\. Chawla, O\. Wiest, and X\. Zhang \(2024\)Large language model based multi\-agents: a survey of progress and challenges\.InProceedings of the Thirty\-Third International Joint Conference on Artificial Intelligence,pp\. 8048–8057\.Cited by:[§1](https://arxiv.org/html/2606.18502#S1.p1.1),[§6](https://arxiv.org/html/2606.18502#S6.p1.1)\.
- S\. Hong, M\. Zhuge, J\. Chen, X\. Zheng, Y\. Cheng, J\. Wang, C\. Zhang, S\. Yau, Z\. Lin, L\. Zhou,et al\.\(2024\)MetaGPT: meta programming for a multi\-agent collaborative framework\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 23247–23275\.Cited by:[§6](https://arxiv.org/html/2606.18502#S6.p1.1)\.
- A\. Kuzmin, M\. Van Baalen, Y\. Ren, M\. Nagel, J\. Peters, and T\. Blankevoort \(2022\)Fp8 quantization: the power of the exponent\.Advances in Neural Information Processing Systems35,pp\. 14651–14662\.Cited by:[§3\.2](https://arxiv.org/html/2606.18502#S3.SS2.p1.5)\.
- Y\. Leviathan, M\. Kalman, and Y\. Matias \(2023\)Fast inference from transformers via speculative decoding\.InInternational Conference on Machine Learning,pp\. 19274–19286\.Cited by:[§1](https://arxiv.org/html/2606.18502#S1.p2.1),[§3\.1](https://arxiv.org/html/2606.18502#S3.SS1.p1.1)\.
- Y\. Li, F\. Wei, C\. Zhang, and H\. Zhang \(2024a\)Eagle\-2: faster inference of language models with dynamic draft trees\.InProceedings of the 2024 conference on empirical methods in natural language processing,pp\. 7421–7432\.Cited by:[§3\.1](https://arxiv.org/html/2606.18502#S3.SS1.p1.1),[§6](https://arxiv.org/html/2606.18502#S6.p1.1)\.
- Y\. Li, F\. Wei, C\. Zhang, and H\. Zhang \(2024b\)EAGLE: speculative sampling requires rethinking feature uncertainty\.InProceedings of the 41st International Conference on Machine Learning,pp\. 28935–28948\.Cited by:[§1](https://arxiv.org/html/2606.18502#S1.p2.1),[§1](https://arxiv.org/html/2606.18502#S1.p3.1),[§3\.1](https://arxiv.org/html/2606.18502#S3.SS1.p1.1),[§6](https://arxiv.org/html/2606.18502#S6.p1.1)\.
- Y\. Li, F\. Wei, C\. Zhang, and H\. Zhang \(2026\)Eagle\-3: scaling up inference acceleration of large language models via training\-time test\.Advances in Neural Information Processing Systems38,pp\. 136737–136756\.Cited by:[§3\.1](https://arxiv.org/html/2606.18502#S3.SS1.p1.1),[§6](https://arxiv.org/html/2606.18502#S6.p1.1)\.
- R\. Nallapati, B\. Zhou, C\. Dos Santos, Ç\. Gulçehre, and B\. Xiang \(2016\)Abstractive text summarization using sequence\-to\-sequence rnns and beyond\.InProceedings of the 20th SIGNLL conference on computational natural language learning,pp\. 280–290\.Cited by:[§4](https://arxiv.org/html/2606.18502#S4.p2.1)\.
- K\. Nottingham, B\. P\. Majumder, B\. D\. Mishra, S\. Singh, P\. Clark, and R\. Fox \(2024\)Skill set optimization: reinforcing language model behavior via transferable skills\.InInternational Conference on Machine Learning,pp\. 38409–38425\.Cited by:[§1](https://arxiv.org/html/2606.18502#S1.p2.1)\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn \(2023\)Direct preference optimization: your language model is secretly a reward model\.Advances in neural information processing systems36,pp\. 53728–53741\.Cited by:[§1](https://arxiv.org/html/2606.18502#S1.p3.1),[§2\.5](https://arxiv.org/html/2606.18502#S2.SS5.SSS0.Px4.p1.4)\.
- H\. Shen, N\. Mellempudi, X\. He, Q\. Gao, C\. Wang, and M\. Wang \(2024\)Efficient post\-training quantization with fp8 formats\.Proceedings of Machine Learning and Systems6,pp\. 483–498\.Cited by:[§6](https://arxiv.org/html/2606.18502#S6.p1.1)\.
- Z\. Shi, S\. Gao, L\. Yan, Y\. Feng, X\. Chen, Z\. Chen, D\. Yin, S\. Verberne, and Z\. Ren \(2025\)Tool learning in the wild: empowering language models as automatic tool agents\.InProceedings of the ACM on Web Conference 2025,pp\. 2222–2237\.Cited by:[§1](https://arxiv.org/html/2606.18502#S1.p1.1)\.
- Z\. Wang, S\. Zhao, Y\. Wang, H\. Huang, S\. Xie, Y\. Zhang, J\. Shi, Z\. Wang, H\. Li, and J\. Yan \(2025\)Re\-task: revisiting llm tasks from capability, skill, and knowledge perspectives\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 4925–4936\.Cited by:[§1](https://arxiv.org/html/2606.18502#S1.p2.1)\.
- G\. I\. Winata, A\. Chakraborty, Y\. Lin, S\. P\. Rao, S\. Siingh, H\. Lu, N\. Bathaee, S\. Hatwar, P\. Dashore, A\. Jain,et al\.\(2026\)T1\-bench: benchmarking multi\-scenario agents in real\-world domains\.arXiv preprint arXiv:2606\.11070\.Cited by:[§1](https://arxiv.org/html/2606.18502#S1.p1.1)\.
- G\. I\. Winata, L\. Xie, K\. Radhakrishnan, S\. Wu, X\. Jin, P\. Cheng, M\. Kulkarni, and D\. Preoţiuc\-Pietro \(2023\)Overcoming catastrophic forgetting in massively multilingual continual learning\.InFindings of the Association for Computational Linguistics: ACL 2023,pp\. 768–777\.Cited by:[§B\.1](https://arxiv.org/html/2606.18502#A2.SS1.p1.2),[§2\.5](https://arxiv.org/html/2606.18502#S2.SS5.SSS0.Px2.p1.3)\.
- C\. Wu, Y\. Gan, Y\. Ge, Z\. Lu, J\. Wang, Y\. Feng, Y\. Shan, and P\. Luo \(2024a\)Llama pro: progressive llama with block expansion\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 6518–6537\.Cited by:[§2\.5](https://arxiv.org/html/2606.18502#S2.SS5.SSS0.Px1.p1.4)\.
- Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, B\. Li, E\. Zhu, L\. Jiang, X\. Zhang, S\. Zhang, J\. Liu,et al\.\(2024b\)Autogen: enabling next\-gen llm applications via multi\-agent conversations\.InFirst Conference on Language Modeling,Cited by:[§1](https://arxiv.org/html/2606.18502#S1.p1.1),[§6](https://arxiv.org/html/2606.18502#S6.p1.1)\.
- G\. Xiao, J\. Lin, M\. Seznec, H\. Wu, J\. Demouth, and S\. Han \(2023\)Smoothquant: accurate and efficient post\-training quantization for large language models\.InInternational conference on machine learning,pp\. 38087–38099\.Cited by:[§6](https://arxiv.org/html/2606.18502#S6.p1.1)\.
- H\. Xu, Z\. Wang, Z\. Zhu, L\. Pan, X\. Chen, S\. Fan, L\. Chen, and K\. Yu \(2025\)Alignment for efficient tool calling of large language models\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 17787–17803\.Cited by:[§1](https://arxiv.org/html/2606.18502#S1.p1.1)\.
- B\. Zhao, B\. Kapusuzoglu, K\. Balasubramaniam, S\. Sahu, S\. Chakraborty, and G\. I\. Winata \(2025\)Optimizing reasoning efficiency through prompt difficulty prediction\.InNeurIPS 2025 Workshop on Efficient Reasoning,External Links:[Link](https://openreview.net/forum?id=vAFTyX6kbF)Cited by:[Appendix E](https://arxiv.org/html/2606.18502#A5.SS0.SSS0.Px4.p1.7)\.
- R\. Zhen, J\. Li, Y\. Ji, Z\. Yang, T\. Liu, Q\. Xia, X\. Duan, Z\. Wang, B\. Huai, and M\. Zhang \(2025\)Taming the titans: a survey of efficient llm inference serving\.InProceedings of the 18th International Natural Language Generation Conference,pp\. 522–541\.Cited by:[Appendix E](https://arxiv.org/html/2606.18502#A5.SS0.SSS0.Px4.p1.7)\.
## Appendix A Loss Formulations for Agentic Training Stages
In this section, we explicitly detail the mathematical loss functions minimized during the three sequential stages of our agentic model customization pipeline\.
### A.1 Context-aware Continual Pretraining (CA-CPT)
The loss objective for Stage 1 adjusts standard language modeling objectives by prefixing document $x \in \mathcal{X}_{\mathtt{CPT}}$ with its generated context token vector $C_x$:
$$\mathcal{L}_{\mathtt{CA\text{-}CPT}}(\theta) = -\underset{x \sim \mathcal{X}_{\mathtt{CPT}}}{\mathbb{E}} \Bigl[ \sum_{t} \log P_{\theta}(x_t \mid x_{<t}; C_x) \Bigr]. \tag{4}$$
### A.2 Supervised Fine-Tuning (SFT)
The loss minimized during Stage 2 optimizes the model parameters over the distribution of refined instruction-following synthetic traces $\mathcal{Y}_{\text{SFT}}$:
$$\mathcal{L}_{\mathtt{SFT}}(\theta) = -\underset{(x,y) \sim \mathcal{Y}_{\text{SFT}}}{\mathbb{E}} \Bigl[ \log P(y \mid x; \theta) \Bigr]. \tag{5}$$
### A.3 Direct Preference Optimization (DPO)
Stage 3 aligns model generations using the choice triples $(x, y^{*}, y) \sim \mathcal{Z}_{\mathtt{DPO}}$ via the native implicit reward optimization objective:
$$\mathcal{L}_{\mathtt{DPO}}(\theta) = -\underset{(x,y^{*},y) \sim \mathcal{Z}_{\mathtt{DPO}}}{\mathbb{E}} \left[ \log \sigma \left( \beta \log \frac{\pi_{\theta}(y^{*} \mid x)}{\pi_{\theta}^{\mathtt{SFT}}(y^{*} \mid x)} - \beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\theta}^{\mathtt{SFT}}(y \mid x)} \right) \right]. \tag{6}$$
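As a rough illustration of Eq. (6), the sketch below evaluates the DPO loss from summed sequence log-probabilities in plain Python. The function names and the list-based interface are assumptions made for this example, not the authors' training code; the default $\beta = 0.1$ follows Table 3.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Mean DPO loss over a batch.

    Each argument is a list of sequence log-probabilities:
    log pi(y* | x) and log pi(y | x) under the trainable policy and under
    the frozen SFT reference. All names here are hypothetical.
    """
    losses = []
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        # Implicit reward margin between the chosen and the rejected response.
        margin = beta * (pc - rc) - beta * (pr - rr)
        losses.append(-log_sigmoid(margin))
    return sum(losses) / len(losses)
```

When the policy equals the reference, the margin is zero and the loss is $\log 2$; raising the log-probability of the chosen response relative to the reference lowers the loss.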
## Appendix B Training Details
### B.1 Why Context Reduces Forgetting in CPT
The key intuition behind Context-Aware CPT is that the first few tokens of a training sequence often produce disproportionately high loss because the model has little or no preceding context. In continual pretraining, this high initial loss can introduce noisy, high-variance gradients that are less reflective of the model’s true domain knowledge gap and more reflective of uncertainty caused by insufficient context. Since the negative log-likelihood gradient with respect to the logits is $\nabla_{z_t} \mathcal{L}_t = p_t - y_t$, uncertain predictions at early positions can induce large and unstable updates, pulling shared parameters in inconsistent directions across samples. Such variance is especially harmful in continual learning, where stochastic updates can move the model away from parameter regions that preserve previously learned capabilities, thereby contributing to catastrophic forgetting (Winata et al., [2023](https://arxiv.org/html/2606.18502#bib.bib34)). For a document $x = (x_1, \ldots, x_{|x|})$, the standard CPT objective is
$$\mathcal{L}_{\mathtt{CPT}}(x; \theta) = -\sum_{t=1}^{|x|} \log p_{\theta}(x_t \mid x_{<t}). \tag{7}$$
The corresponding gradient can be decomposed by token position:
$$\nabla_{\theta} \mathcal{L}_{\mathtt{CPT}}(x; \theta) = \underbrace{-\sum_{t=1}^{k} \nabla_{\theta} \log p_{\theta}(x_t \mid x_{<t})}_{\nabla_{\theta} \mathcal{L}_{\mathtt{early}}} + \underbrace{-\sum_{t=k+1}^{|x|} \nabla_{\theta} \log p_{\theta}(x_t \mid x_{<t})}_{\nabla_{\theta} \mathcal{L}_{\mathtt{later}}}. \tag{8}$$
Here, $\nabla_{\theta} \mathcal{L}_{\mathtt{early}}$ denotes the gradient contribution from the first $k$ tokens, while $\nabla_{\theta} \mathcal{L}_{\mathtt{later}}$ denotes the contribution from the remaining tokens. Because early tokens are predicted with limited context, $\nabla_{\theta} \mathcal{L}_{\mathtt{early}}$ can have higher variance and may dominate the update direction despite carrying weaker domain-specific signal.
Figure 5: Loss and Token Position Across Domain Adaptation Datasets.
Context-Aware CPT reduces this instability by prepending each document with a sample-specific context $C_x$ and excluding the context tokens from the training loss. The resulting objective is
$$\mathcal{L}_{\mathtt{CA\text{-}CPT}}(x; \theta) = -\sum_{t=1}^{|x|} \log p_{\theta}(x_t \mid x_{<t}, C_x). \tag{9}$$
By conditioning document tokens on $C_x$, the model receives a more informative prefix before predicting the original document content, reducing early-token uncertainty and improving the signal-to-noise ratio of CPT gradients. As a result, adaptation is driven by a cleaner and more contextually grounded training signal, allowing the model to absorb new domain knowledge while reducing destructive parameter drift and better balancing plasticity with stability.
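The masking described above (condition on $C_x$, but exclude its tokens from the loss) can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the helper names are hypothetical, and the sentinel `-100` is assumed only because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # assumed sentinel: positions with this label are skipped by the loss


def build_ca_cpt_example(context_ids, document_ids):
    """Prepend the context tokens C_x and mask them out of the training loss."""
    input_ids = list(context_ids) + list(document_ids)
    # Context positions carry no label, so only document tokens contribute to Eq. (9).
    labels = [IGNORE_INDEX] * len(context_ids) + list(document_ids)
    return input_ids, labels


def masked_nll(token_log_probs, labels):
    """Sum the negative log-likelihood over unmasked positions only.

    token_log_probs[i] is assumed to be the model's log-probability of the
    token at position i given everything before it (shifting is omitted).
    """
    return -sum(lp for lp, lab in zip(token_log_probs, labels) if lab != IGNORE_INDEX)
```

The document tokens still attend to the context through `input_ids`; only the loss ignores it.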
### B.2 Training Hyperparameters
The detailed hyperparameters of CPT, Agentic Fine Tuning, Preference Tuning, and EAGLE can be found in Table [3](https://arxiv.org/html/2606.18502#A2.T3).

| Parameter | CPT ($\pi_{\theta}^{\text{CPT}}$) | SFT ($\pi_{\theta}^{\text{SFT}}$) | DPO ($\pi_{\theta}^{\text{final}}$) | EAGLE ($\pi_{\theta}^{\texttt{EAGLE}}$) |
| --- | --- | --- | --- | --- |
| Precision | BF16 | BF16 | BF16 | BF16 |
| LoRA Rank ($r$) | – | 128 | 64 | – |
| LoRA Target | – | All modules | All modules | – |
| Initial Learning Rate | $10^{-5}$ | $2.0 \times 10^{-5}$ | $5.0 \times 10^{-6}$ | $3.0 \times 10^{-4}$ |
| LR Scheduler | Cosine | Cosine | Cosine | Linear |
| Weight Decay | 0.1 | – | – | – |
| Warmup Ratio | 0.01 | 0.05 | 0.10 | 0.01 |
| Epochs | 1 | 1 | 3 | 100 |
| Batch Size (per DP) | Varies (SEQ $\cdot$ GBS = 4M) | 1 | 1 | 8 |
| Gradient Accumulation | Varies (SEQ $\cdot$ GBS = 4M) | 1 | 2 | 1 |
| Max Context Length | 4k, 8k, 16k | 8k | 8k | 8k |
| DPO Penalty ($\beta$) | – | – | 0.1 | – |

Table 3: Hyperparameter configurations for the continual pretraining, supervised fine-tuning, preference alignment, and EAGLE stages of the student model $\pi_{\theta}$.
### B.3 Detailed Dataset Synthesis
To curate a high-fidelity training corpus for the student model $\pi_{\theta}^{\texttt{BF16}}$, we utilize the LLM-driven agentic simulator pipeline detailed in Section [2](https://arxiv.org/html/2606.18502#S2). We simulate 7,172 conversations against the teacher system, producing an extensive corpus of 495,772 individual training traces spanning various intents and tool-calling behaviors. These traces constitute the Supervised Fine-Tuning ($\mathcal{Y}_{\text{SFT}}$) corpus.
For the Preference Alignment (DPO) stage, we construct a specialized dataset ($\mathcal{Z}_{\text{DPO}}$) comprising 10,000 preference pairs. Each preference pair consists of a "chosen" (correct) response and a "rejected" (incorrect) response. The automated pipeline generates 9,000 of these pairs, where the LLM-as-a-Judge successfully flags mistakes in the teacher model’s outputs and provides refined corrections. However, because the teacher model ($\pi_{T}$) is already highly optimized, relying solely on the automated judge to identify subtle logical errors proves challenging. To address this, we manually craft the remaining 1,000 hard-negative pairs. These manual pairs explicitly target known failure modes, complex business-logic boundary conditions, and specific scenarios where the model traditionally struggles and the automated judge fails to flag the error.
To rigorously test the distilled model against unseen scenarios, we separately generate an internal evaluation set by running an additional 1,424 simulated conversations, yielding 8,848 distinct evaluation instances. Furthermore, to train the EAGLE drafter, which requires responses from the trained student model, we utilize the same agentic simulator pipeline with $\pi_{\theta}^{\texttt{BF16}}$ to generate 50,033 data points. Finally, for FP8 calibration, we sample 5,800 traces from this EAGLE dataset.
### B.4 Training Infrastructure
Context-aware Continual Pretraining is conducted on 256 nodes, each equipped with 8 A100 GPUs, for a total of 2,048 A100 GPUs. We train on approximately 5T tokens using tensor parallelism with $\mathrm{TP} = 8$, pipeline parallelism with $\mathrm{PP} = 1$, and data parallelism with $\mathrm{DP} = 256$. The micro-batch size is set to 1, and we maintain a global batch size of approximately 4M tokens by adjusting the gradient accumulation steps according to the sequence length, which ranges from 4K to 16K tokens.
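To make the batch-size bookkeeping concrete, the sketch below computes the gradient accumulation steps implied by the figures above. It assumes that "4M" means $4 \times 2^{20}$ tokens, that "K" means 1,024, and that each data-parallel rank processes one sequence per micro-step; the function name is hypothetical.

```python
def grad_accum_steps(seq_len, global_batch_tokens=4 * 1024 * 1024, dp=256, micro_batch=1):
    """Accumulation steps needed to reach the global token budget.

    Tokens per optimizer step = seq_len * micro_batch * dp * accumulation steps.
    """
    tokens_per_micro_step = seq_len * micro_batch * dp
    steps, remainder = divmod(global_batch_tokens, tokens_per_micro_step)
    if steps == 0 or remainder:
        raise ValueError("global batch is not a positive multiple of one micro-step")
    return steps


# Sanity check on the cluster layout: TP * PP * DP should equal the GPU count.
assert 8 * 1 * 256 == 256 * 8 == 2048

for seq_len in (4096, 8192, 16384):
    print(seq_len, grad_accum_steps(seq_len))
```

Under these assumptions the accumulation steps halve each time the sequence length doubles, which keeps the token count per optimizer step constant.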
Agentic fine-tuning (AFT) and DPO are performed on 16 nodes with 8 A100 GPUs per node, using Fully Sharded Data Parallelism (FSDP). High-bandwidth cross-node communication is enabled through Elastic Fabric Adapter (EFA) with RDMA support. For FSDP training, we use the `FULL_SHARD` strategy with CPU parameter offloading, backward prefetching, and transformer-based auto-wrapping.
For EAGLE drafter training, we utilize one AWS `p4d.24xlarge` instance and a standard data parallel setting. Until this stage, all model weights, activations, and gradients are maintained strictly in BF16 precision to ensure numerical stability without sacrificing throughput.
## Appendix C Multi-Agent Architecture and Deployment Details
### C.1 Detailed Agent Roles
Our production Multi-Agent System ($M$) is a five-agent system for customer-facing reasoning tasks spanning intent understanding, plan generation, verification, and explanation. The agents collaborate as follows:
- Understander Agent: Interacts with the user to detect their core needs and gather all necessary information. It processes user utterances alongside chat history to maintain context, extracting essential state-level details and domain-specific entities.
- Planner Agent: Uses the extracted context to formulate an actionable strategy. It generates structured, executable action plans using provided tools ($\mathcal{T}$) while strictly complying with dealership business rules.
- Evaluator Agent: Operates as a critical safety guardrail by verifying the generated plans using both rule-based and LLM-based validation. If safety or logic violations are detected, it triggers a replanning loop, sending feedback to the Planner Agent for correction.
- Executor Agent: Once the Evaluator Agent validates the code, the Executor Agent securely runs the executable plan within an external environment and passes the results to the next agent.
- Explainer Agent: Finally, the Explainer Agent translates these executed plans and raw tool outputs into natural language explanations for the customer.
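The control flow among the five agents, including the replanning loop, can be sketched as a small orchestrator. Everything below is illustrative: the callables stand in for the agents, and `Verdict`, `run_request`, `max_replans`, and the fallback message are hypothetical names and choices rather than the production interface.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    ok: bool
    feedback: str = ""


def run_request(utterance, history, understand, plan, evaluate, execute, explain,
                max_replans=2):
    """Route one user turn through the five agents.

    Each agent is passed in as a plain callable, so the sketch stays
    independent of any particular LLM client.
    """
    state = understand(utterance, history)      # Understander Agent
    feedback = None
    for _ in range(max_replans + 1):
        candidate = plan(state, feedback)       # Planner Agent
        verdict = evaluate(candidate)           # Evaluator Agent
        if verdict.ok:
            result = execute(candidate)         # Executor Agent
            return explain(candidate, result)   # Explainer Agent
        feedback = verdict.feedback             # replanning loop
    return "Sorry, I could not complete that request."
```

Bounding the number of replans is a design choice in this sketch; it keeps the worst-case number of sequential LLM calls per request fixed.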
### C.2 Deployment Constraints and Performance Bottlenecks
All agents in our system invoke the same foundation model. The overarching goal for production deployment is to maximize throughput on AWS EC2 P5 (8× NVIDIA H100 80GB GPUs) while strictly adhering to sub-second end-to-end latency SLAs.
Initial system profiling identified three major bottlenecks:
1. *Cumulative latency* resulting from multiple sequential LLM calls per request, which compounds to multi-second delays.
2. *Memory footprint* constraints from serving large LLMs in BF16, which severely limits batch sizes and overall concurrent capacity.
3. *Generation cost*, which cannot be optimized through prefill optimization (such as prompt caching) alone.
While standard batching and quantization can partially address latency and memory constraints, the generation cost bottleneck requires a fundamentally different approach\. Speculative decoding addresses this by verifying multiple speculated tokens per target\-model forward pass, bypassing the traditional autoregressive bottleneck\.
## Appendix D Additional Serving Optimizations
Beyond quantization and speculative decoding, we apply several systems\-level changes to reduce end\-to\-end latency\. Our serving baseline already includes many widely adopted systems\-level optimizations, making it exceptionally strong and difficult to improve upon\. The compounding gains from our proposed methods are achieved on top of this highly optimized foundation, which specifically includes:
#### Conditional Agent Invocation\.
In production traffic, most Planner outputs are simple enough for deterministic evaluation\. We measure plan complexity via Halstead complexity metrics and only invoke the LLM\-based Evaluator when complexity exceeds a threshold; simple plans bypass the Evaluator entirely\. This reduces total LLM calls per request\.
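A minimal sketch of this gating logic follows, assuming plans are emitted as Python source and using Halstead volume ($V = N \log_2 n$) as the complexity score. The token classification, the helper names, and the threshold of 100 are assumptions for illustration; the source does not specify which Halstead metric or cutoff is used.

```python
import io
import keyword
import math
import tokenize

COMPLEXITY_THRESHOLD = 100.0  # hypothetical cutoff, not a value from the paper


def halstead_volume(source: str) -> float:
    """Halstead volume V = N * log2(n) for a snippet of Python source.

    Operators are punctuation/operator tokens and keywords; operands are
    identifiers, numbers, and string literals.
    """
    operators, operands = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        is_keyword = tok.type == tokenize.NAME and keyword.iskeyword(tok.string)
        if tok.type == tokenize.OP or is_keyword:
            operators.append(tok.string)
        elif tok.type in (tokenize.NAME, tokenize.NUMBER, tokenize.STRING):
            operands.append(tok.string)
    vocabulary = len(set(operators)) + len(set(operands))  # n = n1 + n2
    length = len(operators) + len(operands)                # N = N1 + N2
    if vocabulary == 0:
        return 0.0
    return length * math.log2(vocabulary)


def needs_llm_evaluator(plan_source: str, threshold: float = COMPLEXITY_THRESHOLD) -> bool:
    """Route only complex plans to the LLM-based Evaluator."""
    return halstead_volume(plan_source) > threshold
```

A single tool call scores far below the threshold and is handled by deterministic checks alone, while a plan with loops, conditionals, and several tool calls exceeds it and is sent to the LLM-based Evaluator.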
#### Continuous Batching\.
Instead of waiting for the longest sequence in a static batch to finish, requests are scheduled at the iteration level\. Completed requests are immediately replaced with new ones from the queue, maximizing GPU utilization\.
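The difference between static and iteration-level scheduling can be seen in a toy simulation that counts decode iterations. This is a simplified model under stated assumptions (one token per request per iteration, no prefill cost, hypothetical function names), not a description of the serving engine.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each static batch runs until its longest sequence has finished."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """Iteration-level scheduling: a finished slot is refilled immediately.

    `lengths` holds the number of tokens each request must generate (>= 1).
    """
    queue = deque(lengths)
    active = []
    steps = 0
    while queue or active:
        # Fill any free slots from the waiting queue before the next iteration.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; completed requests leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


requests = [8, 2, 2, 2]
print(static_batching_steps(requests, batch_size=2))
print(continuous_batching_steps(requests, batch_size=2))
```

In this example the short requests are served alongside the long one instead of waiting for it, so the continuous scheduler needs fewer iterations in total.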
#### Tensor Parallelism\.
Model weights are sharded across multiple GPUs to distribute the memory footprint and compute load, significantly reducing time\-to\-first\-token \(TTFT\) and per\-token generation latency\.
#### KV\-Cache CPU Offloading\.
To prevent Out\-Of\-Memory \(OOM\) errors and increase concurrent capacity, inactive KV\-caches are dynamically swapped to host CPU memory and asynchronously prefetched back to VRAM when needed\.
Figure [6](https://arxiv.org/html/2606.18502#A4.F6) illustrates the different layers of our complete optimization stack.
- Layer 4, End-to-End Impact: lower P90 latency (per-query tail), higher throughput (QPS / GPU), and lower cost per query (dollars per request). Speedups compound multiplicatively across layers.
- Layer 3, Throughput Density, FP8 (W8A8) PTQ: memory $\sim 2\times$ smaller, compute $\sim 2\times$ tensor-core (TC) throughput, quality preserved with per-tensor scaling. Half the memory, double the tensor-core throughput, no quality loss.
- Layer 2, Per-Call Generation Latency, EAGLE Spec. Decoding: the draft model proposes $k$ tokens, the verifier (target LLM) checks them in parallel, and the accepted tokens average $\bar{k} > 1$ per step. Amortizes target-model decode cost; fewer sequential steps per token.
- Layer 1, Systems-Level Call Reduction: Conditional Agent Invocation, Prompt-Cache Reuse, Continuous Batching. Removes redundant calls; raises cache hits; lifts GPU utilization.
- Layer 0, Base Inference (unoptimized baseline): BF16 precision, $\pi_{\theta}^{\text{BF16}}$ model, 1 call / agent step.

Figure 6: Optimization Stack. The four optimization layers (L1–L4) yield compounding performance gains over the unoptimized baseline (L0).
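The per-tensor scaling used in the FP8 layer can be illustrated with a short sketch. It assumes the E4M3 FP8 format (largest finite magnitude 448) and simple absolute-maximum calibration; neither choice is confirmed by the source, the function names are hypothetical, and mantissa rounding is omitted.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in E4M3 (assumed format)


def per_tensor_scale(calibration_batches):
    """One scale for the whole tensor, from the largest magnitude seen in calibration."""
    amax = max(abs(v) for batch in calibration_batches for v in batch)
    return amax / FP8_E4M3_MAX


def scale_and_clamp(values, scale):
    """Map values into the FP8 range; anything beyond the calibrated range saturates."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]
```

Because the scale depends entirely on the magnitudes observed during calibration, a calibration set that under-represents production activations leads to saturation at serving time, which is one way to read the paper's finding that calibration data mixtures matter.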
## Appendix E Detailed Discussions and Practical Takeaways
The process of distilling complex agentic workflows into a smaller, production\-ready model yields several critical insights regarding data generation, training methodologies, and inference optimization\.
#### The Crucial Role of the Agent Simulator\.
We find that the quality of the student model is entirely bottlenecked by the fidelity of the synthetic data\. Building an effective Agent Simulator requires rigorous, manual optimization of its governing prompts to ensure it accurately mirrors the distribution and nuances of real\-world production conversations\. Investing time in a high\-quality simulator is paramount; without it, the downstream distillation process will simply reinforce unrealistic interaction patterns\.
#### Preserving Prompt Adherence with LoRA\.
In a live production environment, business requirements frequently evolve\. Product teams regularly need to introduce new tool APIs, alter business logic, or modify the user experience\. Consequently, the distilled student model must remain highly adaptable to system prompt updates\. We initially experiment with full\-parameter fine\-tuning\. Although it achieves comparable baseline performance, the fully fine\-tuned model severely overfits the specific prompt structures seen during training, losing its ability to generalize or adapt to new instructions\. Conversely, applying LoRA successfully preserves the foundation model’s innate zero\-shot adaptability\. LoRA allows the model to learn the required domain expertise while remaining responsive to subsequent prompt modifications, which is a mandatory requirement for maintaining a dynamic production system\.
#### The Necessity of Preference Alignment\.
Supervised Fine\-Tuning \(SFT\) alone is insufficient for achieving production\-grade reliability\. While SFT successfully instills the general tool\-calling formats and conversational tone, Direct Preference Optimization \(DPO\) is essential for addressing complex boundary conditions\. By explicitly contrasting successful outputs against failure modes, DPO effectively corrects nuanced logical errors and edge cases where even the high\-parameter teacher model occasionally struggles\.
#### Stacking Inference Optimizations\.
In a large design space of optimizations Zhen et al. ([2025](https://arxiv.org/html/2606.18502#bib.bib40)); Zhao et al. ([2025](https://arxiv.org/html/2606.18502#bib.bib39)), we discover the importance of stacking optimization combinations that provide durable acceleration while preserving intelligence. At the systems level, call reduction techniques such as conditional agent invocation, prompt-cache reuse, and continuous batching remove redundant calls and raise GPU utilization and are included in evaluating all baselines. CPT–SFT–DPO Distillation from $\pi_{T}$ to $\pi_{\theta}^{\texttt{BF16}}$ provides a 2.33× E2E speedup. Trained EAGLE drafters and W8A8-FP8 further durably accelerate $\pi_{\theta}^{\texttt{BF16}}$ by 1.92×, stacking to produce $\pi_{\theta}^{\texttt{EAGLE+FP8}}$ with a 4.48× speedup while staying under latency SLOs (Service Level Objectives). We also find that, across optimization phases, proper public and in-domain data mixtures are critical to prevent catastrophic forgetting during distillation, ensure robust FP8-W8A8 calibration, and enable strong acceptance lengths in EAGLE drafter training. Finally, careful tradeoffs such as choosing greedy speculation as opposed to tree speculation improve latency and throughput even when resulting in a lower speculative MGL.
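As a quick consistency check on the multiplicative stacking, the two stage-level factors quoted above can be multiplied directly. The factors are rounded to two decimals in the text, so the product is only expected to match the reported end-to-end figure approximately:

$$\underbrace{2.33}_{\text{distillation}} \times \underbrace{1.92}_{\text{EAGLE + FP8}} = 4.4736 \approx 4.5,$$

which agrees with the reported 4.48× once rounding of the individual factors is taken into account.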
#### Draft Model Quantization and Greedy Decoding\.
Quantizing the 250M EAGLE drafter with a mixed calibration set under greedy decoding moves speedup from 4.16× ($\pi_{\theta}^{\texttt{EAGLE}}$ (C)) to 4.48× ($\pi_{\theta}^{\texttt{EAGLE+FP8}}$), reflecting higher draft throughput at an unchanged MGL (3.80 tokens) as latency drops from 0.96s to 0.92s and QPS rises from 6.07 to 6.54.
Standard EAGLE uses tree-structured draft expansion, generating multiple candidate continuations per step that are verified in a single forward pass of the target model via tree attention. Greedy draft decoding (argmax sampling) generates a single candidate chain per step, reducing per-step compute at the cost of lower acceptance length. On $\pi_{\theta}^{\texttt{EAGLE}}$ (Combined data), switching from tree to greedy decoding increases the speedup from 3.40× to 4.16× (latency 1.19s → 0.96s, QPS 4.96 → 6.07) despite MGL dropping from 4.29 to 3.80 tokens. The same trend holds for the Synthetic adapter: $\pi_{\theta}^{\texttt{EAGLE}}$ (Synthetic data) tree decoding yields 3.19×, while greedy yields 3.77×.
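The greedy (single-chain) variant can be sketched with toy next-token functions standing in for the drafter and the target model. This is a conceptual illustration only: `draft_next` and `target_next` are hypothetical callables, the target is queried position by position here rather than in one batched forward pass, and the EAGLE feature-level drafting is not modeled.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One greedy draft-then-verify step; returns the tokens accepted this step."""
    # Draft: propose a single chain of k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)
    # Verify: compare each drafted token with the target's greedy choice.
    accepted, ctx = [], list(prefix)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the first mismatch
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # bonus token when the whole chain is accepted
    return accepted


def generate(prompt, draft_next, target_next, k, max_new_tokens):
    """Decode with repeated speculative steps; returns (new_tokens, target_steps)."""
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        steps += 1
    return out[len(prompt):len(prompt) + max_new_tokens], steps
```

The output always matches plain greedy decoding with the target model, and the number of new tokens divided by the number of target steps gives the average tokens produced per verification step, which is at least one and grows with drafter accuracy.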
At low concurrency, greedy slightly increases latency because acceptance breadth dominates the cost of each verification pass. At high concurrency, which is the operating regime in Table [1](https://arxiv.org/html/2606.18502#S4.T1), draft throughput becomes the bottleneck and greedy yields latency reduction. Motivated by the strong performance of EAGLE+FP8 in the greedy regime in our ablations, we make upstream contributions to vLLM to natively enable EAGLE+FP8, making it available to the community.