An Agentic AI Framework with Large Language Models and Chain-of-Thought for UAV-Assisted Logistics Scheduling with Mobile Edge Computing

arXiv cs.AI Papers

Summary

The paper presents an agentic AI framework that leverages large language models and chain-of-thought reasoning to optimize UAV-assisted logistics scheduling with mobile edge computing, aiming to improve efficiency and resource allocation in manufacturing logistics.

arXiv:2605.13221v1 Announce Type: new Abstract: In cloud manufacturing, unmanned aerial vehicles (UAVs) can support both product collection and mobile edge computing (MEC). This joint operation forms a hybrid scheduling problem, where physical logistics decisions are coupled with computational task scheduling. In this paper, UAVs collect finished products from manufacturing stations and transport them back to a central depot. Meanwhile, computational tasks generated by industrial sensor devices at these stations are processed locally, at UAVs, or offloaded via UAVs to the cloud. This coupling makes the problem challenging. A UAV can provide MEC services only during its service window at a station, so routing decisions directly determine when UAV-assisted offloading is available. Routing decisions also affect the UAV energy budget and the availability of onboard computing and communication resources for computational task execution under task deadline constraints. To address this, we propose an agentic-AI-assisted optimization framework with two components. First, we develop an agentic AI that combines large language models, retrieval-augmented generation, and chain-of-thought reasoning to translate user input into an interpretable mathematical formulation for the hybrid scheduling problem. Second, we design a hierarchical deep reinforcement learning approach based on proximal policy optimization (PPO), where the upper layer learns UAV routing and the lower layer optimizes per-slot task execution and resource allocation. Simulation results show that the proposed framework yields more consistent formulations, while the hierarchical PPO achieves full product collection in 99.6% of the last 500 episodes and maintains a 100% deadline satisfaction rate, with more stable performance than the advantage actor-critic approach.

Cached at: 05/14/26, 06:15 AM

Source: [https://arxiv.org/html/2605.13221](https://arxiv.org/html/2605.13221)
| Symbol | Description |
|---|---|
| **Decision variables** | |
| $b_{k,\tau,t,u}^{\mathrm{bh}}$ | UAV-$u$ → cloud rate allocated (bits/s). |
| $b_{k,\tau,t,u}^{\mathrm{ul}}$ | ISD → UAV-$u$ rate allocated (bits/s). |
| $c_{m}$ | Binary: station $m$ collected. |
| $\delta_{k,\tau,t}$ | Binary: task $(k,\tau)$ completes in slot $t$. |
| $\eta_{u,m,t}$ | Binary: UAV $u$ serves station $m$ in slot $t$. |
| $f_{k,\tau,t,u}^{\mathrm{cld}}$ | Cloud compute via UAV $u$ allocated (work-unit/s). |
| $f_{k,\tau,t,u}^{\mathrm{uav}}$ | UAV-$u$ compute allocated (work-unit/s). |
| $f_{k,\tau,t}^{\mathrm{loc}}$ | Local compute allocated (work-unit/s). |
| $g_{k,\tau,u}$ | Binary: task $(k,\tau)$ executed in cloud via UAV $u$. |
| $p_{k,\tau,u}$ | Binary: task $(k,\tau)$ processed on UAV $u$. |
| $\zeta_{k,\tau}$ | Binary: task $(k,\tau)$ processed locally. |
| $\iota_{k,\tau,u}^{\mathrm{fh}}$ | Binary: UAV $u$ is used as the first hop for task $(k,\tau)$. |
| $s_{u},\ r_{u}$ | Depot departure / return time of UAV $u$. |
| $T_{k,\tau}$ | Task completion time. |
| $T_{u,m}^{\mathrm{arr}},\ T_{u,m}^{\mathrm{dep}}$ | Arrival / departure time of UAV $u$ at station $m$. |
| $x_{u,i,j}$ | Binary: UAV $u$ travels $i\rightarrow j$. |
| $y_{u,m}$ | Binary: station $m$ assigned to UAV $u$. |
| $z_{k,\tau}$ | Binary: task $(k,\tau)$ meets deadline. |
| **Parameters** | |
| $\alpha_{\mathrm{bh}}$ | Energy coefficient for UAV-to-cloud communication. |
| $\alpha_{\mathrm{cmp}}$ | Energy coefficient for onboard computing. |
| $\alpha_{\mathrm{fly}},\ \alpha_{\mathrm{hov}}$ | Energy coefficients for flying and hovering. |
| $\alpha_{\mathrm{ul}}$ | Energy coefficient for ISD-to-UAV communication. |
| $B_{k,\tau}$ | Task input size (bits). |
| $B_{u}^{\mathrm{ul}},\ B_{u}^{\mathrm{bh}}$ | UAV $u$ uplink / backhaul limits (bits/s). |
| $\gamma_{k,t,u}^{\mathrm{ul}},\ \gamma_{t,u}^{\mathrm{bh}}$ | Effective-rate factors (uplink / backhaul). |
| $\Delta$ | Time slot duration. |
| $d_{i,j}$ | Distance between nodes $i$ and $j$. |
| $D_{k}$ | Task deadline duration for ISD $k$ tasks. |
| $D_{u}$ | UAV distance budget. |
| $E_{u}^{\max}$ | UAV $u$ energy budget. |
| $F^{\mathrm{cld}}$ | Cloud compute capacity (work-unit/s). |
| $F_{k}^{\mathrm{loc}}$ | ISD $k$ local compute capacity (work-unit/s). |
| $F_{u}^{\mathrm{uav}}$ | UAV $u$ compute capacity (work-unit/s). |
| $\omega_{\mathrm{col}},\ \omega_{\mathrm{cmp}}$ | Weights for collection value and task completion. |
| $\omega_{\mathrm{miss}},\ \omega_{\mathrm{flow}}$ | Weights for miss penalty and flow-time. |
| $\omega_{\mathrm{res}}$ | Weight for resource-occupation cost. |
| $Q_{u}$ | UAV payload capacity. |
| $\tau_{m}$ | Minimum service time at station $m$. |
| $T_{\mathrm{mission}}$ | Mission horizon. |
| $v_{\mathrm{fly}}$ | UAV flight speed. |
| $v_{m},\ w_{m}$ | Product value / weight at station $m$. |
| $W_{k,\tau}$ | Task workload (work-units). |
![Figure 1](https://arxiv.org/html/2605.13221v1/x1.png)

Figure 1: An example of UAV-assisted electronics manufacturing: a two-phase logistics coordination system. The manufacturing process flow includes PCB assembly, in-circuit testing, module integration, and final testing. The UAVs handle logistics in two phases: delivering materials before manufacturing in phase 1 and collecting products after manufacturing in phase 2.
### III-A System Overview

A typical manufacturing sequence consists of material delivery, manufacturing completion, and product collection. Accordingly, the logistics process has two phases: phase 1 for material delivery and phase 2 for product collection. In this paper, we focus on phase 2. Specifically, a fleet of $U$ homogeneous UAVs is dispatched from a central depot equipped with terrestrial MEC servers (hereafter referred to as the cloud) to collect finished products from $M$ geographically distributed manufacturing stations. At the same time, these UAVs provide MEC services to industrial sensor devices (ISDs) deployed at the stations. All UAVs depart from and return to the depot, and they fly at a fixed altitude with constant speed. Each station $m\in S$ is located at $(X_{m},Y_{m})$, has a collection reward $v_{m}$ and a product weight $w_{m}$, and hosts a set of ISDs $K_{m}$. All finished products are assumed to be ready for collection at the mission start, while ISDs generate computational tasks stochastically over a discretized time horizon.

The considered system couples physical logistics with computational task processing. On the logistics side ([https://wing.com/](https://wing.com/)), each UAV is constrained by payload, travel distance, mission time, and battery energy, and must complete a depot-based collection mission \[Li et al. \[[29](https://arxiv.org/html/2605.13221#bib.bib60)\]\]. On the computing side, task execution follows a three-layer cloud-edge-device architecture, where a task can be processed locally at the ISD, at a serving UAV, or at the cloud via a UAV-assisted path \[Sun et al. \[[44](https://arxiv.org/html/2605.13221#bib.bib59)\]\]. Since UAV-assisted processing is available only when a UAV is serving the corresponding station, routing decisions directly determine the service windows for MEC tasks \[Jiao et al. \[[21](https://arxiv.org/html/2605.13221#bib.bib57)\]\]. Therefore, the system is modeled from two coupled perspectives in the following subsections: the *UAV Routing Model*, which captures UAV collection decisions and routing feasibility, and the *MEC Service Model*, which captures task execution and resource-allocation decisions under the routing-induced service availability.

A representative application of the proposed system model is the electronics component assembly and testing industry, where geographically distributed stations perform operations such as PCB assembly, in-circuit testing, module integration, and final testing. As illustrated in [Fig. 1](https://arxiv.org/html/2605.13221#S3.F1), UAVs support two-phase logistics: they first deliver raw materials or sub-assembled components to the stations and then, after processing is completed, collect the finished products and transport them to a central depot \[Satoglu and Sahin \[[42](https://arxiv.org/html/2605.13221#bib.bib20)\]\]. During both phases, the UAVs can also provide MEC support by executing offloaded tasks on board or relaying them to the cloud, thereby supporting timely processing of industrial sensing and analytics tasks. Such UAV-assisted operation is well suited to modern manufacturing environments, where UAVs are increasingly used for delivery, monitoring, inspection, inventory management, and predictive maintenance \[Askerbekov et al. \[[4](https://arxiv.org/html/2605.13221#bib.bib24)\]\]. Compared with ground vehicles, UAVs offer three-dimensional mobility: they can access elevated stations, bypass floor obstacles, and operate without dedicated ground infrastructure, making them effective in spatially constrained and complex industrial settings \[Walker \[[45](https://arxiv.org/html/2605.13221#bib.bib90)\], Mohsan et al. \[[34](https://arxiv.org/html/2605.13221#bib.bib21)\]\].

### III-B UAV Routing Model

The UAV\-routing constraints are organized into three parts, namely collection assignment and route consistency, payload and distance budgets, and mission timing feasibility\.

#### III-B1 Collection assignment and route consistency

$$
\sum_{u\in U} y_{u,m}\le 1,\qquad c_{m}\le\sum_{u\in U} y_{u,m},\qquad \forall m\in S, \tag{1}
$$

$$
\sum_{i\in L} x_{u,i,m}=\sum_{j\in L} x_{u,m,j}=y_{u,m},\qquad \forall u\in U,\ \forall m\in S, \tag{2}
$$

$$
\sum_{m\in S} x_{u,0,m}=\sum_{m\in S} x_{u,m,0},\qquad \forall u\in U. \tag{3}
$$

[Eq. 1](https://arxiv.org/html/2605.13221#S3.E1) limits each station to at most one UAV and allows a station to be counted as collected only if it is assigned. [Eq. 2](https://arxiv.org/html/2605.13221#S3.E2) ensures that station $m$ appears on the route of UAV $u$ exactly when it is assigned to UAV $u$, in which case it is entered and left exactly once by UAV $u$. [Eq. 3](https://arxiv.org/html/2605.13221#S3.E3) ensures that each UAV departs from and returns to the depot the same number of times.
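As a minimal sketch (not the authors' implementation), the check below derives $x_{u,i,j}$ and $y_{u,m}$ from explicit depot-to-depot routes and tests Eqs. 1–3. It assumes node 0 is the depot and that routes, stations, and the collected flags $c_m$ are passed as plain lists and dicts; the function names are hypothetical. When $x$ and $y$ come from such tours, Eqs. 2 and 3 hold by construction, so the check mostly catches double assignments and unassigned stations flagged as collected.

```python
from collections import defaultdict

def build_x_y(routes):
    """Arc counts x[u][(i, j)] and visit counts y[u][m] from depot-based routes.

    routes maps a UAV id to its ordered list of station ids; node 0 is the depot.
    """
    x = {u: defaultdict(int) for u in routes}
    y = {u: defaultdict(int) for u in routes}
    for u, stations in routes.items():
        if not stations:
            continue                                  # idle UAV: no arcs, no stations
        path = [0, *stations, 0]
        for i, j in zip(path, path[1:]):
            x[u][(i, j)] += 1
        for m in stations:
            y[u][m] += 1
    return x, y

def satisfies_eqs_1_to_3(routes, stations, collected):
    x, y = build_x_y(routes)
    for m in stations:
        assigned = sum(y[u][m] for u in routes)
        if assigned > 1 or collected.get(m, 0) > assigned:   # Eq. (1)
            return False
    for u in routes:
        for m in stations:
            into = sum(n for (i, j), n in x[u].items() if j == m)
            out = sum(n for (i, j), n in x[u].items() if i == m)
            if not into == out == y[u][m]:                   # Eq. (2)
                return False
        leave = sum(n for (i, j), n in x[u].items() if i == 0)
        back = sum(n for (i, j), n in x[u].items() if j == 0)
        if leave != back:                                    # Eq. (3)
            return False
    return True
```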

#### III-B2 Payload and distance budgets

$$
\sum_{m\in S} w_{m}\,y_{u,m}\le Q_{u},\qquad \sum_{i\in L}\sum_{j\in L} d_{i,j}\,x_{u,i,j}\le D_{u},\qquad \forall u\in U. \tag{4}
$$

[Eq. 4](https://arxiv.org/html/2605.13221#S3.E4) limits the total assigned payload weight and the total travel distance of each UAV.
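The sketch below evaluates both sides of Eq. 4 for one UAV. It assumes, for illustration only, that $d_{i,j}$ is the planar Euclidean distance between station coordinates $(X_m, Y_m)$ (plausible at a fixed flight altitude, but not stated as the paper's distance model) and that node 0 is the depot; the helper names are hypothetical.

```python
import math

def route_distance(route, coords):
    """Sum of d_ij * x_uij for a depot -> stations -> depot tour (node 0 = depot)."""
    path = [0, *route, 0]
    return sum(math.dist(coords[i], coords[j]) for i, j in zip(path, path[1:]))

def check_eq_4(route, coords, weight, Q_u, D_u):
    """Return (feasible, payload, distance) for one UAV under Eq. (4)."""
    payload = sum(weight[m] for m in route)     # sum_m w_m * y_um
    distance = route_distance(route, coords)    # sum_ij d_ij * x_uij
    return payload <= Q_u and distance <= D_u, payload, distance
```

For a 3-4-5 triangle with the depot at the origin, the route $0\to1\to2\to0$ has length 12, so it fits a distance budget of 15 but would fail one of 10.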

#### III-B3 Mission timing

$$
s_{u}=0,\qquad r_{u}\le T_{\mathrm{mission}},\qquad \forall u\in U, \tag{5}
$$

$$
x_{u,0,m}=1\Rightarrow T_{u,m}^{\mathrm{arr}}\ge s_{u}+\frac{d_{0,m}}{v_{\mathrm{fly}}},\qquad \forall u\in U,\ \forall m\in S, \tag{6}
$$

$$
x_{u,i,j}=1\Rightarrow T_{u,j}^{\mathrm{arr}}\ge T_{u,i}^{\mathrm{dep}}+\frac{d_{i,j}}{v_{\mathrm{fly}}},\qquad \forall u\in U,\ \forall i,j\in S, \tag{7}
$$

$$
x_{u,m,0}=1\Rightarrow r_{u}\ge T_{u,m}^{\mathrm{dep}}+\frac{d_{m,0}}{v_{\mathrm{fly}}},\qquad \forall u\in U,\ \forall m\in S, \tag{8}
$$

$$
T_{u,m}^{\mathrm{dep}}-T_{u,m}^{\mathrm{arr}}\ge \tau_{m}\,y_{u,m},\qquad \forall u\in U,\ \forall m\in S. \tag{9}
$$

[Eq. 5](https://arxiv.org/html/2605.13221#S3.E5) states that each UAV starts at time 0 and must return within the mission horizon. [Eqs. 6](https://arxiv.org/html/2605.13221#S3.E6), [7](https://arxiv.org/html/2605.13221#S3.E7) and [8](https://arxiv.org/html/2605.13221#S3.E8) ensure that the route timing is consistent with the corresponding travel times. In [Eq. 7](https://arxiv.org/html/2605.13221#S3.E7), the case $i=j$ is excluded, since no travel is needed from a station to itself. [Eq. 9](https://arxiv.org/html/2605.13221#S3.E9) requires a minimum service duration at each station assigned to UAV $u$.
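Because Eqs. 6–9 only bound times from below while Eq. 5 bounds the return time from above, a route is time-feasible exactly when its earliest timeline (no waiting, every inequality tight) returns by $T_{\mathrm{mission}}$. The hypothetical helper below computes that timeline, again assuming Euclidean distances and node 0 as the depot.

```python
import math

def earliest_timeline(route, coords, v_fly, tau_min, T_mission):
    """Earliest visit times satisfying Eqs. (5)-(9) with equality (no waiting).

    Returns arrival times, departure times, the return time r_u, and whether
    r_u respects the mission horizon.
    """
    t, prev = 0.0, 0                                      # Eq. (5): s_u = 0
    arr, dep = {}, {}
    for m in route:
        t += math.dist(coords[prev], coords[m]) / v_fly   # Eqs. (6)-(7)
        arr[m] = t
        t += tau_min[m]                                   # Eq. (9): minimum service
        dep[m] = t
        prev = m
    r_u = t + math.dist(coords[prev], coords[0]) / v_fly  # Eq. (8)
    return arr, dep, r_u, r_u <= T_mission                # Eq. (5)
```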

### III-C MEC Service Model

The constraints of the MEC service model govern task offloading. They cover execution-mode selection, completion-time encoding, computing and communication resource limits, service-window indicators, service-window feasibility for UAV-involved processing, deadline satisfaction, and UAV energy budgeting.

#### III-C1 Execution-mode selection

$$
\zeta_{k,\tau}+\sum_{u\in U} p_{k,\tau,u}+\sum_{u\in U} g_{k,\tau,u}=1,\qquad \forall (k,\tau)\in\mathcal{I}. \tag{10}
$$

[Eq. 10](https://arxiv.org/html/2605.13221#S3.E10) requires every task to choose exactly one execution mode: local execution, UAV execution, or cloud execution via a UAV relay.
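For illustration, the snippet below enumerates the execution modes permitted by Eq. 10 for a single task and converts a chosen mode into the binaries $(\zeta_{k,\tau}, p_{k,\tau,u}, g_{k,\tau,u})$. The mode labels and function names are hypothetical, not identifiers from the paper.

```python
def execution_modes(uavs):
    """All execution modes allowed by Eq. (10) for one task."""
    modes = [("local", None)]                    # zeta_{k,tau} = 1
    modes += [("uav", u) for u in uavs]          # p_{k,tau,u} = 1
    modes += [("cloud_via", u) for u in uavs]    # g_{k,tau,u} = 1
    return modes

def one_hot(mode, uavs):
    """Binaries (zeta, p, g) for one mode; zeta + sum(p) + sum(g) == 1."""
    kind, chosen = mode
    zeta = int(kind == "local")
    p = {u: int(kind == "uav" and u == chosen) for u in uavs}
    g = {u: int(kind == "cloud_via" and u == chosen) for u in uavs}
    return zeta, p, g
```

Each task thus has $1 + 2|U|$ candidate modes, so the per-task choice set grows linearly with the fleet size.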

#### III-C2 Completion-time encoding

$$
z_{k,\tau}=\sum_{t=\tau}^{N_{\mathrm{slot}}-1}\delta_{k,\tau,t},\qquad \forall (k,\tau)\in\mathcal{I}, \tag{11}
$$

$$
T_{k,\tau}=\sum_{t=\tau}^{N_{\mathrm{slot}}-1}(t+1)\,\Delta\,\delta_{k,\tau,t}+T_{\mathrm{mission}}\bigl(1-z_{k,\tau}\bigr). \tag{12}
$$

[Eqs. 11](https://arxiv.org/html/2605.13221#S3.E11) and [12](https://arxiv.org/html/2605.13221#S3.E12) define the completion-status indicator $z_{k,\tau}$ and the completion time $T_{k,\tau}$ based on the slot-selection binary $\delta_{k,\tau,t}\in\{0,1\}$. Specifically, $\delta_{k,\tau,t}=1$ indicates that task $(k,\tau)$ is completed by the end of slot $t$.
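The following sketch evaluates Eqs. 11 and 12 for one task, given its slot-selection vector $\delta_{k,\tau,t}$ as a Python list indexed by slot (an assumed data layout). A completed task is timestamped at the end of its completion slot, and a task that never completes is charged the full mission horizon.

```python
def completion_encoding(delta, tau, Delta, T_mission):
    """Return (z_{k,tau}, T_{k,tau}) per Eqs. (11)-(12).

    delta[t] holds delta_{k,tau,t} for t = 0, ..., N_slot - 1
    (at most one entry equal to 1 at some t >= tau).
    """
    slots = range(tau, len(delta))
    z = sum(delta[t] for t in slots)                        # Eq. (11)
    T = sum((t + 1) * Delta * delta[t] for t in slots)      # end of the completion slot
    T += T_mission * (1 - z)                                # Eq. (12): horizon if unfinished
    return z, T
```

For example, with $\Delta = 0.5$ and completion in slot 3, the task finishes at $(3+1)\cdot 0.5 = 2.0$.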

#### III-C3 Computing and communication resource constraints

The computing and communication resources are subject to three types of constraints, namely per\-slot capacity limits, mode consistency, and cumulative service sufficiency\.

\(i\) Per\-slot capacity limits

$$
\left\{
\begin{aligned}
&\sum_{\tau':\,(k,\tau')\in\mathcal{I},\ \tau'\le t} f_{k,\tau',t}^{\mathrm{loc}}\le F_{k}^{\mathrm{loc}}, && \forall k\in K,\ \forall t\in\mathcal{T},\\
&\sum_{(k,\tau)\in\mathcal{I}:\,\tau\le t} f_{k,\tau,t,u}^{\mathrm{uav}}\le F_{u}^{\mathrm{uav}}, && \forall u\in U,\ \forall t\in\mathcal{T},\\
&\sum_{u\in U}\sum_{(k,\tau)\in\mathcal{I}:\,\tau\le t} f_{k,\tau,t,u}^{\mathrm{cld}}\le F^{\mathrm{cld}}, && \forall t\in\mathcal{T}.
\end{aligned}
\right. \tag{13}
$$

$$
\left\{
\begin{aligned}
&\sum_{(k,\tau)\in\mathcal{I}:\,\tau\le t} b_{k,\tau,t,u}^{\mathrm{ul}}\le B_{u}^{\mathrm{ul}}, && \forall u\in U,\ \forall t\in\mathcal{T},\\
&\sum_{(k,\tau)\in\mathcal{I}:\,\tau\le t} b_{k,\tau,t,u}^{\mathrm{bh}}\le B_{u}^{\mathrm{bh}}, && \forall u\in U,\ \forall t\in\mathcal{T}.
\end{aligned}
\right. \tag{14}
$$

[Eqs. 13](https://arxiv.org/html/2605.13221#S3.E13) and [14](https://arxiv.org/html/2605.13221#S3.E14) bound the aggregate computing and communication resource allocations in each slot by the available local, UAV, cloud, uplink, and backhaul capacities.
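A minimal sketch of the per-slot checks in Eqs. 13 and 14, assuming allocations arrive as flat `(resource, owner, t, amount)` records (a hypothetical layout). Cloud computing is modeled as one shared pool with owner `None`, which reproduces the double sum over UAVs and tasks in the third line of Eq. 13.

```python
from collections import defaultdict

def capacity_violations(alloc, capacity, tol=1e-9):
    """Per-slot capacity check in the spirit of Eqs. (13)-(14).

    alloc    : iterable of (resource, owner, t, amount); resource is one of
               "loc", "uav", "cld", "ul", "bh". owner is the ISD k for "loc",
               the UAV u for "uav"/"ul"/"bh", and None for the shared cloud pool.
    capacity : {(resource, owner): limit}, i.e. F_k^loc, F_u^uav, F^cld, B_u^ul, B_u^bh
    Returns the (resource, owner, t) keys whose aggregate load exceeds the limit.
    """
    load = defaultdict(float)
    for resource, owner, t, amount in alloc:
        load[(resource, owner, t)] += amount      # sum over tasks (k, tau) with tau <= t
    return [key for key, total in load.items()
            if total > capacity[key[:2]] + tol]
```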

(ii) Mode consistency

$$
\left\{
\begin{aligned}
&0\le f_{k,\tau,t}^{\mathrm{loc}}\le F_{k}^{\mathrm{loc}}\,\zeta_{k,\tau}, && \forall (k,\tau)\in\mathcal{I},\ \forall t\ge\tau,\\
&0\le f_{k,\tau,t,u}^{\mathrm{uav}}\le F_{u}^{\mathrm{uav}}\,p_{k,\tau,u}, && \forall (k,\tau)\in\mathcal{I},\ \forall u\in U,\ \forall t\ge\tau,\\
&0\le f_{k,\tau,t,u}^{\mathrm{cld}}\le F^{\mathrm{cld}}\,g_{k,\tau,u}, && \forall (k,\tau)\in\mathcal{I},\ \forall u\in U,\ \forall t\ge\tau,
\end{aligned}
\right. \tag{15}
$$

$$
\left\{
\begin{aligned}
&0\le b_{k,\tau,t,u}^{\mathrm{ul}}\le B_{u}^{\mathrm{ul}}\,\iota_{k,\tau,u}^{\mathrm{fh}}, && \forall (k,\tau)\in\mathcal{I},\ \forall u\in U,\ \forall t\ge\tau,\\
&0\le b_{k,\tau,t,u}^{\mathrm{bh}}\le B_{u}^{\mathrm{bh}}\,g_{k,\tau,u}, && \forall (k,\tau)\in\mathcal{I},\ \forall u\in U,\ \forall t\ge\tau,\\
&\iota_{k,\tau,u}^{\mathrm{fh}}=p_{k,\tau,u}+g_{k,\tau,u}. &&
\end{aligned}
\right. \tag{16}
$$

[Eqs. 15](https://arxiv.org/html/2605.13221#S3.E15) and [16](https://arxiv.org/html/2605.13221#S3.E16) impose mode consistency on computing and communication allocations by requiring each allocation to be zero unless the corresponding execution mode is selected.
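The check below tests the zero-forcing part of Eqs. 15 and 16 for one task in one slot: an allocation may be positive only if the matching binary is set, with the uplink gated by the first-hop indicator $\iota_{k,\tau,u}^{\mathrm{fh}} = p_{k,\tau,u} + g_{k,\tau,u}$. Capacity caps are omitted for brevity, and the dict-based interface is an assumption for illustration.

```python
def mode_consistent(zeta, p, g, f_loc, f_uav, f_cld, b_ul, b_bh):
    """Zero-forcing part of Eqs. (15)-(16) for one task in one slot.

    p, g, f_uav, f_cld, b_ul, b_bh are dicts keyed by UAV id; missing keys
    in the allocation dicts mean a zero allocation.
    """
    if f_loc > 0 and not zeta:
        return False
    for u in p:
        first_hop = p[u] + g[u]                       # iota^fh = p + g
        if f_uav.get(u, 0) > 0 and not p[u]:
            return False
        if f_cld.get(u, 0) > 0 and not g[u]:
            return False
        if b_ul.get(u, 0) > 0 and not first_hop:      # uplink needs UAV u as first hop
            return False
        if b_bh.get(u, 0) > 0 and not g[u]:           # backhaul only for cloud relay
            return False
    return True
```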

(iii) Multi-slot sufficiency for completion

Define $\chi_{k,\tau}(t)\triangleq\sum_{r=\tau}^{t}\delta_{k,\tau,r}\in\{0,1\}$, which indicates whether task $(k,\tau)$ has been completed by the end of slot $t$. Accordingly, for any $t\ge\tau$, the cumulative service must be sufficient whenever completion by slot $t$ is claimed, which yields the following constraints:

$$
\left\{
\begin{aligned}
&\zeta_{k,\tau}=1\Rightarrow\sum_{s=\tau}^{t} f_{k,\tau,s}^{\mathrm{loc}}\,\Delta\ge W_{k,\tau}\,\chi_{k,\tau}(t),\\
&p_{k,\tau,u}=1\Rightarrow\sum_{s=\tau}^{t} f_{k,\tau,s,u}^{\mathrm{uav}}\,\Delta\ge W_{k,\tau}\,\chi_{k,\tau}(t),\\
&g_{k,\tau,u}=1\Rightarrow\sum_{s=\tau}^{t} f_{k,\tau,s,u}^{\mathrm{cld}}\,\Delta\ge W_{k,\tau}\,\chi_{k,\tau}(t),\\
&\forall (k,\tau)\in\mathcal{I},\ \forall t\ge\tau,\ \forall u\in U.
\end{aligned}
\right. \tag{17}
$$

$$
\left\{
\begin{aligned}
&\iota_{k,\tau,u}^{\mathrm{fh}}=1\Rightarrow\sum_{s=\tau}^{t}\gamma_{k,s,u}^{\mathrm{ul}}\,b_{k,\tau,s,u}^{\mathrm{ul}}\,\Delta\ge B_{k,\tau}\,\chi_{k,\tau}(t),\\
&g_{k,\tau,u}=1\Rightarrow\sum_{s=\tau}^{t}\gamma_{s,u}^{\mathrm{bh}}\,b_{k,\tau,s,u}^{\mathrm{bh}}\,\Delta\ge B_{k,\tau}\,\chi_{k,\tau}(t),\\
&\forall (k,\tau)\in\mathcal{I},\ \forall t\ge\tau,\ \forall u\in U.
\end{aligned}
\right. \tag{18}
$$

[Eq. 17](https://arxiv.org/html/2605.13221#S3.E17) ensures that a task can be declared completed only after it has received enough cumulative computing service. [Eq. 18](https://arxiv.org/html/2605.13221#S3.E18) applies the same idea to communication: whenever completion is claimed, the cumulative effective transmitted bits must be no smaller than the task input size.
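To illustrate Eqs. 17 and 18, the hypothetical helper below returns the earliest slot at which a task may be declared complete, given per-slot compute rates and per-slot effective link rates $\gamma b$ for the links its mode uses (none for local execution, the uplink for UAV execution, uplink and backhaul for cloud execution). As in the constraints, compute and communication are accumulated independently from slot $\tau$ onward.

```python
def earliest_completion_slot(tau, Delta, W, B, f, rates=()):
    """Earliest slot t >= tau at which Eqs. (17)-(18) allow chi_{k,tau}(t) = 1.

    f[s]  : compute rate allocated to the task in slot s (work-units/s)
    rates : per-slot effective rates gamma * b (bits/s), one sequence per link
            the mode uses: () local, (uplink,) UAV, (uplink, backhaul) cloud
    Returns None if cumulative service never suffices within the horizon.
    """
    work = 0.0
    bits = [0.0] * len(rates)
    for t in range(tau, len(f)):
        work += f[t] * Delta                        # cumulative computing, Eq. (17)
        for i, rate in enumerate(rates):
            bits[i] += rate[t] * Delta              # cumulative bits per link, Eq. (18)
        if work >= W and all(b >= B for b in bits):
            return t
    return None
```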

#### III-C4 Service-window indicators

$$
\sum_{m\in S}\eta_{u,m,t}\le 1,\qquad \forall u\in U,\ \forall t\in\mathcal{T}, \tag{19}
$$

$$
\eta_{u,m,t}\le y_{u,m},\qquad \forall u\in U,\ \forall m\in S,\ \forall t\in\mathcal{T}, \tag{20}
$$

$$
\eta_{u,m,t}=1\Rightarrow T_{u,m}^{\mathrm{arr}}\le t\Delta\le T_{u,m}^{\mathrm{dep}},\qquad \forall u\in U,\ \forall m\in S,\ \forall t\in\mathcal{T}. \tag{21}
$$

[Eqs. 19](https://arxiv.org/html/2605.13221#S3.E19), [20](https://arxiv.org/html/2605.13221#S3.E20) and [21](https://arxiv.org/html/2605.13221#S3.E21) define the slot-level service-window indicators: each UAV serves at most one station per slot, only at stations assigned to it, and only within the corresponding arrival–departure interval.
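Assuming arrival and departure times are already known (for example, from a timeline satisfying Eqs. 5–9), the sketch below builds the indicators $\eta_{u,m,t}$ for a single UAV. It sets an indicator to 1 whenever the slot start $t\Delta$ lies in the service window, which is the most permissive choice; Eq. 21 is only an implication, so an optimizer may also leave it at 0.

```python
def service_windows(arr, dep, Delta, n_slots):
    """Indicators eta[m][t] for one UAV from its visit times (Eqs. 19-21).

    arr/dep map each assigned station to its arrival/departure time; stations
    absent from arr are unassigned, so Eq. (20) keeps their indicators at zero.
    """
    eta = {m: [int(arr[m] <= t * Delta <= dep[m]) for t in range(n_slots)]
           for m in arr}                                          # Eq. (21)
    for t in range(n_slots):
        if sum(row[t] for row in eta.values()) > 1:               # Eq. (19)
            raise ValueError(f"slot {t}: overlapping service windows")
    return eta
```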

#### III-C5 Service-window feasibility for UAV-involved processing

Let $m(k)$ denote the station to which ISD $k$ belongs. Then,

$$
\begin{aligned}
&b_{k,\tau,t,u}^{\mathrm{ul}}\le B_{u}^{\mathrm{ul}}\,\eta_{u,m(k),t},\qquad f_{k,\tau,t,u}^{\mathrm{uav}}\le F_{u}^{\mathrm{uav}}\,\eta_{u,m(k),t},\\
&b_{k,\tau,t,u}^{\mathrm{bh}}\le B_{u}^{\mathrm{bh}}\,\eta_{u,m(k),t},\qquad \forall (k,\tau)\in\mathcal{I},\ \forall u\in U,\ \forall t\ge\tau.
\end{aligned} \tag{22}
$$

[Eq. 22](https://arxiv.org/html/2605.13221#S3.Ex2) requires the UAV-involved communication and computing allocations of task $(k,\tau)$ in slot $t$ to be zero unless UAV $u$ is serving station $m(k)$ in that slot.
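In a simulator, Eq. 22 can be enforced by projecting UAV-side allocations onto the gated bounds, as sketched below. This clipping step is an illustrative assumption rather than the paper's mechanism; `station_of` plays the role of $m(k)$, and all names are hypothetical.

```python
def gate_by_window(alloc, eta, station_of, caps):
    """Clip UAV-side allocations to the bounds of Eq. (22).

    alloc      : {(k, u, t): {"ul": b^ul, "uav": f^uav, "bh": b^bh}}
    eta        : {(u, m, t): 0 or 1}; missing keys mean "not serving"
    station_of : {k: m(k)}
    caps       : {(resource, u): B_u^ul, F_u^uav or B_u^bh}
    """
    gated = {}
    for (k, u, t), amounts in alloc.items():
        serving = eta.get((u, station_of[k], t), 0)
        gated[(k, u, t)] = {res: min(val, caps[(res, u)] * serving)
                            for res, val in amounts.items()}
    return gated
```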

#### III-C6 Deadline constraint

$$
z_{k,\tau}=1\Rightarrow T_{k,\tau}\le \tau\Delta+D_{k},\qquad \forall (k,\tau)\in\mathcal{I}. \tag{23}
$$

[Eq. 23](https://arxiv.org/html/2605.13221#S3.E23) states that a task is marked as on-time only if its completion time is no later than its deadline $\tau\Delta+D_{k}$.
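Combining Eq. 12 (completion at the end of slot $t$, i.e., at $(t+1)\Delta$) with Eq. 23 gives the latest slot in which a task can still finish on time. For instance, with $\Delta = 0.5$ s, a task released in slot $\tau = 2$ with $D_k = 1.5$ s must finish by $2.5$ s, i.e., by the end of slot 4. The helpers below are a hedged sketch of that arithmetic; the small `eps` is an implementation detail added here to absorb floating-point round-off.

```python
import math

def on_time(T_completion, tau, Delta, D_k):
    """Eq. (23): z_{k,tau} may be 1 only if T_{k,tau} <= tau * Delta + D_k."""
    return T_completion <= tau * Delta + D_k

def latest_on_time_slot(tau, Delta, D_k, eps=1e-9):
    """Largest slot t whose end time (t + 1) * Delta (Eq. 12) meets Eq. (23)."""
    return math.floor((tau * Delta + D_k) / Delta + eps) - 1
```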

#### III-C7 UAV energy budget

The total energy of UAV $u$ is decomposed into flight/hover energy, onboard computing energy, and communication energy:

$$
E_{u}^{\mathrm{fly}}=\alpha_{\mathrm{fly}}\sum_{i\in L}\sum_{j\in L} d_{i,j}\,x_{u,i,j}+\alpha_{\mathrm{hov}}\sum_{m\in S}\bigl(T_{u,m}^{\mathrm{dep}}-T_{u,m}^{\mathrm{arr}}\bigr)\,y_{u,m}, \tag{24}
$$

$$
E_{u}^{\mathrm{cmp}}=\alpha_{\mathrm{cmp}}\sum_{t\in\mathcal{T}}\sum_{(k,\tau)\in\mathcal{I}:\,\tau\le t} f_{k,\tau,t,u}^{\mathrm{uav}}\,\Delta, \tag{25}
$$

$$
E_{u}^{\mathrm{com}}=\alpha_{\mathrm{ul}}\sum_{t\in\mathcal{T}}\sum_{(k,\tau)\in\mathcal{I}:\,\tau\le t}\gamma_{k,t,u}^{\mathrm{ul}}\,b_{k,\tau,t,u}^{\mathrm{ul}}\,\Delta+\alpha_{\mathrm{bh}}\sum_{t\in\mathcal{T}}\sum_{(k,\tau)\in\mathcal{I}:\,\tau\le t}\gamma_{t,u}^{\mathrm{bh}}\,b_{k,\tau,t,u}^{\mathrm{bh}}\,\Delta. \tag{26}
$$

The total energy must satisfy

$$
E_{u}^{\mathrm{fly}}+E_{u}^{\mathrm{cmp}}+E_{u}^{\mathrm{com}}\le E_{u}^{\max},\qquad \forall u\in U. \tag{27}
$$

[Eqs. 24](https://arxiv.org/html/2605.13221#S3.Ex3), [25](https://arxiv.org/html/2605.13221#S3.E25) and [26](https://arxiv.org/html/2605.13221#S3.Ex4) define the flight/hover, onboard computing, and communication energy components of UAV $u$, and [Eq. 27](https://arxiv.org/html/2605.13221#S3.E27) imposes the corresponding per-UAV energy budget over the whole mission.
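The function below is a minimal sketch of the per-UAV energy accounting in Eqs. 24–27 with linear coefficients. It assumes the route distance, the hover durations at assigned stations, and the per-task, per-slot terms have already been extracted from a solution; the `coef` dictionary keys and the function name are hypothetical.

```python
def uav_energy(dist, hover, f_uav, ul_terms, bh_terms, Delta, coef, E_max):
    """Energy of one UAV per Eqs. (24)-(26) and the budget test of Eq. (27).

    dist     : sum_ij d_ij x_uij (total distance flown)
    hover    : T_dep - T_arr for every station assigned to the UAV
    f_uav    : onboard compute rates f^uav over all (task, slot) pairs
    ul_terms : (gamma^ul, b^ul) pairs over all (task, slot) pairs
    bh_terms : (gamma^bh, b^bh) pairs over all (task, slot) pairs
    coef     : {"fly", "hov", "cmp", "ul", "bh"} -> alpha coefficients
    """
    E_fly = coef["fly"] * dist + coef["hov"] * sum(hover)                 # Eq. (24)
    E_cmp = coef["cmp"] * sum(f * Delta for f in f_uav)                    # Eq. (25)
    E_com = (coef["ul"] * sum(g * b * Delta for g, b in ul_terms)
             + coef["bh"] * sum(g * b * Delta for g, b in bh_terms))       # Eq. (26)
    total = E_fly + E_cmp + E_com
    return total, total <= E_max                                           # Eq. (27)
```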

## IV Problem Formulation and Analysis

In this section, we formulate the joint optimization problem and analyze its computational complexity.

### IV-A Problem Formulation

We aim to maximize the overall system performance subject to the UAV routing constraints in [Eq. 1](https://arxiv.org/html/2605.13221#S3.E1)–[Eq. 9](https://arxiv.org/html/2605.13221#S3.E9) and the MEC service (i.e., task offloading) constraints in [Eq. 10](https://arxiv.org/html/2605.13221#S3.E10)–[Eq. 27](https://arxiv.org/html/2605.13221#S3.E27). Specifically, the objective rewards product collection and deadline-compliant task completion while penalizing deadline violations, task completion delay, and overall computing and communication resource occupation. The complete optimization problem is formulated as follows:

$$
\begin{aligned}
\mathbf{P}:\quad \max\;& \underbrace{\omega_{\mathrm{col}}\sum_{m\in S} v_{m}c_{m}}_{\text{collection value}}+\underbrace{\omega_{\mathrm{cmp}}\sum_{(k,\tau)\in\mathcal{I}} z_{k,\tau}}_{\text{on-time completion return}}\\
&-\underbrace{\omega_{\mathrm{miss}}\sum_{(k,\tau)\in\mathcal{I}}\bigl(1-z_{k,\tau}\bigr)}_{\text{deadline violation cost}}-\underbrace{\omega_{\mathrm{flow}}\sum_{(k,\tau)\in\mathcal{I}}\bigl(T_{k,\tau}-\tau\Delta\bigr)}_{\text{flow-time cost}}\\
&-\underbrace{\omega_{\mathrm{res}}\sum_{t\in\mathcal{T}}\bigl(R_{t}^{\mathrm{cmp}}+R_{t}^{\mathrm{com}}\bigr)}_{\text{resource-occupation cost}}
\end{aligned} \tag{28}
$$

s.t. [Eq. 1](https://arxiv.org/html/2605.13221#S3.E1)–[Eq. 9](https://arxiv.org/html/2605.13221#S3.E9), [Eq. 10](https://arxiv.org/html/2605.13221#S3.E10)–[Eq. 27](https://arxiv.org/html/2605.13221#S3.E27),

where $T_{k,\tau}$ denotes the completion time of task $(k,\tau)$, $z_{k,\tau}\in\{0,1\}$ indicates whether the task is completed by its deadline, and $R_{t}^{\mathrm{cmp}}$ and $R_{t}^{\mathrm{com}}$ denote the normalized computing and communication occupation in slot $t$, respectively.
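The sketch below evaluates the objective of Eq. 28 for a candidate solution. The dictionary-based inputs are an assumed layout, and the normalized occupations $R_t^{\mathrm{cmp}}$ and $R_t^{\mathrm{com}}$ are taken as given because their normalization is not defined in this section.

```python
def objective(v, c, z, T, tau, Delta, R_cmp, R_com, w):
    """Value of Eq. (28) for one candidate solution.

    v, c        : dicts over stations (product value v_m, collected flag c_m)
    z, T, tau   : dicts over tasks (on-time flag, completion time, release slot)
    R_cmp, R_com: per-slot normalized occupation lists
    w           : weights {"col", "cmp", "miss", "flow", "res"}
    """
    collection = w["col"] * sum(v[m] * c[m] for m in v)
    completion = w["cmp"] * sum(z.values())
    miss = w["miss"] * sum(1 - z[i] for i in z)
    flow = w["flow"] * sum(T[i] - tau[i] * Delta for i in z)
    res = w["res"] * sum(a + b for a, b in zip(R_cmp, R_com))
    return collection + completion - miss - flow - res
```

Since $\omega_{\mathrm{cmp}}\sum z_{k,\tau} - \omega_{\mathrm{miss}}\sum(1 - z_{k,\tau}) = (\omega_{\mathrm{cmp}} + \omega_{\mathrm{miss}})\sum z_{k,\tau} - \omega_{\mathrm{miss}}|\mathcal{I}|$, the completion and miss terms reward on-time completions through a single combined weight, up to a constant.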

### IV-B Problem Analysis

The formulation in [Section IV-A](https://arxiv.org/html/2605.13221#S4.SS1) is a large-scale mixed-integer optimization problem that jointly couples multi-UAV routing and slot-level task offloading. In terms of computational complexity, the UAV routing part is NP-hard: it jointly determines station assignment and visit sequencing under route-consistency, payload, distance, and mission-time constraints while maximizing collection value, which amounts to a capacitated, value-collecting multi-UAV routing variant (with a single UAV and a binding distance budget, deciding whether all stations can be collected already contains the decision version of the traveling salesman problem). The task offloading part is also NP-hard. To see this, consider a restricted case in which all routing-related quantities are fixed, service windows are predetermined, there is only one time slot, communication and energy budgets are large enough not to constrain the scheduling decisions, and only UAV execution is allowed. This restricted case still subsumes the classical PARTITION problem, i.e., deciding whether a set of positive integers can be divided into two subsets with equal sums. Because the joint formulation contains both NP-hard subproblems as special cases, it is NP-hard as well, which motivates our hierarchical DRL approach in [Section VI](https://arxiv.org/html/2605.13221#S6) as a scalable solution method.
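For intuition about the PARTITION argument, the snippet below decides PARTITION with the standard subset-sum dynamic program. One natural way to embed it in the restricted offloading case (an illustrative assumption, not the construction stated in the paper) is to use two UAVs whose one-slot capacity $F_u^{\mathrm{uav}}\Delta$ equals half the total workload and to give each task a workload $W_{k,\tau}$ equal to one of the integers; since Eq. 10 places each task on exactly one UAV, all tasks can then finish within the slot exactly when the integers admit an equal-sum split.

```python
def can_partition(nums):
    """Decide PARTITION: can positive integers be split into two equal-sum subsets?"""
    total = sum(nums)
    if total % 2:
        return False                      # an odd total can never be halved
    target = total // 2
    reachable = {0}                       # subset sums achievable so far
    for a in nums:
        reachable |= {s + a for s in reachable if s + a <= target}
    return target in reachable
```

The dynamic program runs in time proportional to the number of integers times their sum, i.e., pseudo-polynomial, which is consistent with PARTITION being NP-hard only in the weak sense.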

## V Proposed Agentic AI Framework: LLM with Chain-of-Thought

In this section, we introduce the proposed agentic AI framework shown in [Fig. 2](https://arxiv.org/html/2605.13221#S5.F2). The user submits a natural-language modeling request; the retrieval-augmented generation (RAG) module grounds the request by retrieving relevant contextual knowledge; the *Agentic AI Responder* generates the formulation through chain-of-thought (CoT) reasoning; and the *Agentic AI Verifier* validates each reasoning step before the process proceeds. We first describe the RAG process and then the CoT-based generation-and-verification workflow.
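The control flow of the Responder–Verifier loop can be sketched as follows. `respond` and `verify` are hypothetical callables standing in for the two LLM agents, and the retry limit is an assumed safeguard; the paper's actual prompts, stopping rules, and rethinking procedure are not reproduced here.

```python
from typing import Callable, List

def cot_generate_and_verify(
    steps: List[str],
    respond: Callable[[str, List[str]], str],
    verify: Callable[[str, str, List[str]], bool],
    max_retries: int = 3,
) -> List[str]:
    """Run CoT steps in order; each step's draft must pass the verifier.

    respond(step, accepted) drafts the output for `step` given accepted outputs.
    verify(step, draft, accepted) returns True to permit continuation.
    """
    accepted: List[str] = []
    for step in steps:
        for _ in range(max_retries):
            draft = respond(step, accepted)          # Responder drafts this step
            if verify(step, draft, accepted):        # Verifier permits continuation
                accepted.append(draft)
                break
            # otherwise: "rethink" by drafting the same step again
        else:
            raise RuntimeError(f"step {step!r} failed verification {max_retries} times")
    return accepted
```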

![Figure 2](https://arxiv.org/html/2605.13221v1/x2.png)

Figure 2: An overview of the proposed agentic AI framework. **Part A** presents the IoT-enabled system environment and the user's natural-language modeling request. **Part B** illustrates the RAG module for retrieving modeling knowledge and building a retrieval-augmented prompt. **Part C** outlines the CoT module for step-by-step formulation reasoning. **Part D** corresponds to the *Agentic AI Responder* that generates the formulation outputs. **Part E** depicts the verification module, where the *Agentic AI Verifier* checks each CoT-step output and either permits continuation or triggers rethinking.

### V-A Retrieval-Augmented Generation

With the rapid development of LLMs, RAG has become a promising approach for improving the factual grounding and domain consistency of LLM outputs \[Wang et al. \[[46](https://arxiv.org/html/2605.13221#bib.bib67)\], Su et al. \[[43](https://arxiv.org/html/2605.13221#bib.bib68)\]\]. The RAG module \[Zhang et al. \[[55](https://arxiv.org/html/2605.13221#bib.bib3)\]\] grounds the LLM in task-relevant evidence before generation. Since no public corpus directly matches the hybrid UAV-assisted collection and MEC scheduling problem considered in this paper, we build an internal knowledge base from two complementary perspectives: UAV routing and computational task offloading. The repository contains modeling-oriented knowledge, including system descriptions, notation references, variable definitions, objective patterns, constraint templates, and representative formulation fragments (see [Section IX-A](https://arxiv.org/html/2605.13221#S9.SS1)). This knowledge is stored as text, segmented into chunks, and indexed for semantic retrieval \[Chen et al. \[[7](https://arxiv.org/html/2605.13221#bib.bib42)\]\].
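The chunking strategy is not specified in this section. As one common option, assumed here for illustration, the sketch below splits each knowledge-base document into fixed-size, overlapping word windows that together form the repository $\mathcal{H}$; the window and overlap sizes are arbitrary defaults.

```python
def chunk_text(text, chunk_words=120, overlap=20):
    """Split one document into overlapping word windows (hypothetical chunker)."""
    if not 0 <= overlap < chunk_words:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_words")
    words = text.split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break                 # this window already reaches the end of the text
    return chunks
```

Overlap keeps a constraint template that straddles a window boundary intact in at least one chunk, at the cost of some duplicated text in the index.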

Let $\xi$ denote a user request, and let $\mathcal{H}=\{h_{1},h_{2},\dots,h_{N}\}$ be the chunked knowledge repository. A shared text encoder $\Psi_{\mathrm{enc}}(\cdot)$ maps both the query $\xi$ and each chunk $h_{n}$ into the same embedding space \[Kataishi \[[22](https://arxiv.org/html/2605.13221#bib.bib48)\]\], which we write as:

$$
\mathbf{r}_{\xi}=\Psi_{\mathrm{enc}}(\xi),\qquad \mathbf{r}_{n}=\Psi_{\mathrm{enc}}(h_{n}),\quad n=1,\dots,N, \tag{29}
$$

where $\mathbf{r}_{\xi},\mathbf{r}_{n}\in\mathbb{R}^{d_{\mathrm{emb}}}$ are dense semantic vectors and $d_{\mathrm{emb}}$ is the embedding dimension. To measure the relevance between the query and each chunk, we use cosine similarity \[Prabhakaran \[[40](https://arxiv.org/html/2605.13221#bib.bib93)\]\], defined as follows:

$$
\operatorname{sim}(\xi,h_n)=\frac{\mathbf{r}_{\xi}^{\top}\mathbf{r}_{n}}{\|\mathbf{r}_{\xi}\|_{2}\,\|\mathbf{r}_{n}\|_{2}},\quad n=1,\dots,N. \tag{30}
$$

A larger $\operatorname{sim}(\xi,h_n)$ indicates that chunk $h_n$ is more semantically aligned with the query. Given a retrieval budget $K_{ret}$, we define the index set of the top-$K_{ret}$ most relevant chunks as [Lewis et al. [27](https://arxiv.org/html/2605.13221#bib.bib54)]

$$
\mathcal{J}_{K_{ret}}(\xi)=\operatorname*{Top}_{K_{ret}}\bigl\{\operatorname{sim}(\xi,h_n)\bigr\}_{n=1}^{N}, \tag{31}
$$

where $\mathcal{J}_{K_{ret}}(\xi)$ denotes the set of indices of the chunks retrieved for query $\xi$, and $\operatorname*{Top}_{K_{ret}}(\cdot)$ returns the indices of the $K_{ret}$ largest values in the similarity-score collection $\{\operatorname{sim}(\xi,h_n)\}_{n=1}^{N}$. The retrieved chunks form the context used in the subsequent generation stage [Heredia Álvaro and Barreda [19](https://arxiv.org/html/2605.13221#bib.bib51)], which we denote by

$$
\mathcal{R}_{\mathrm{rag}}(\xi)=\{\,h_n\mid n\in\mathcal{J}_{K_{ret}}(\xi)\,\}, \tag{32}
$$

where $\mathcal{R}_{\mathrm{rag}}(\xi)$ is the set of text chunks retrieved for query $\xi$; chunk $h_n$ is included if and only if its index $n$ belongs to the top-$K_{ret}$ index set $\mathcal{J}_{K_{ret}}(\xi)$.
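The retrieval step in Eqs. (29)–(32) can be sketched in a few lines of Python. This is a minimal, self-contained illustration under stated assumptions, not the authors' pipeline: `toy_encode` is a hypothetical hashed bag-of-words stand-in for $\Psi_{\mathrm{enc}}$ (the implementation in Section VII-A2 uses *text-embedding-ada-002* with a Chroma store), and the example chunks are invented.

```python
import math
import re
import zlib
from collections import Counter


def toy_encode(text: str, dim: int = 256) -> list[float]:
    """Hypothetical stand-in for Psi_enc: hashed bag-of-words counts (not a neural encoder)."""
    vec = [0.0] * dim
    for token, count in Counter(re.findall(r"[a-z0-9]+", text.lower())).items():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += count
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    """Eq. (30): cosine similarity; returns 0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def retrieve(query: str, chunks: list[str], k_ret: int) -> list[str]:
    r_q = toy_encode(query)                                    # Eq. (29), query side
    scores = [cosine(r_q, toy_encode(h)) for h in chunks]      # Eqs. (29)-(30), chunk side
    top = sorted(range(len(chunks)), key=lambda n: scores[n], reverse=True)[:k_ret]  # Eq. (31)
    return [chunks[n] for n in top]                            # Eq. (32)
```

For example, `retrieve("UAV routing energy", chunks, k_ret=1)` returns the single chunk whose vocabulary overlaps most with the query. Swapping in a real embedding model only changes `toy_encode`; the scoring and top-$K_{ret}$ selection stay the same.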

The retrieved chunks are then combined with the original query through an aggregation operator to form the final prompt [Wang et al. [51](https://arxiv.org/html/2605.13221#bib.bib52)]:

$$
\mathfrak{p}(\xi)=\operatorname{Agg}\bigl(\xi,\mathcal{R}_{\mathrm{rag}}(\xi)\bigr), \tag{33}
$$

where $\operatorname{Agg}(\cdot,\cdot)$ denotes the prompt-construction operator, and $\mathfrak{p}(\xi)$ is the retrieval-augmented prompt formed by combining the user request $\xi$ with the retrieved chunk set $\mathcal{R}_{\mathrm{rag}}(\xi)$. Conditioned on this grounded prompt, the LLM generates an output token sequence $\boldsymbol{\nu}=(\nu_1,\dots,\nu_{N_{\mathrm{tok}}})$ according to the autoregressive conditional distribution $p\left(\boldsymbol{\nu}\mid\mathfrak{p}(\xi)\right)$, given by [Lewis et al. [27](https://arxiv.org/html/2605.13221#bib.bib54)]

$$
p\left(\boldsymbol{\nu}\mid\mathfrak{p}(\xi)\right)=\prod_{\varpi=1}^{N_{\mathrm{tok}}}p\left(\nu_{\varpi}\mid\nu_{<\varpi},\,\mathfrak{p}(\xi)\right), \tag{34}
$$

where $\varpi$ denotes the token-position index, $\nu_{<\varpi}=(\nu_1,\dots,\nu_{\varpi-1})$ denotes the previously generated token prefix, and $N_{\mathrm{tok}}$ is the generated sequence length.
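Eqs. (33)–(34) can be illustrated with a short sketch, assuming a hypothetical plain-text template for $\operatorname{Agg}$ (the paper does not specify the prompt format) and a caller-supplied `next_token_prob` function standing in for the LLM. Taking logarithms turns the product in Eq. (34) into a sum over token positions, which is how sequence likelihoods are usually computed to avoid numerical underflow.

```python
import math
from typing import Callable


def agg(query: str, retrieved: list[str]) -> str:
    """Hypothetical Agg(xi, R_rag(xi)): list the retrieved chunks, then append the request."""
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved))
    return f"Reference knowledge:\n{context}\n\nUser request:\n{query}"


def sequence_log_prob(tokens: list[str], prompt: str,
                      next_token_prob: Callable[[str, list[str], str], float]) -> float:
    """Log form of Eq. (34): sum over positions of log p(nu_w | nu_<w, prompt)."""
    total = 0.0
    for w in range(len(tokens)):
        total += math.log(next_token_prob(tokens[w], tokens[:w], prompt))
    return total
```

With a toy model that assigns probability 0.25 to every token, a three-token sequence has log-probability $3\log 0.25$, exactly what the chain rule predicts.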

In the proposed framework, the main roles of RAG are to reduce hallucinations [Alabbasi et al. [3](https://arxiv.org/html/2605.13221#bib.bib69)] and to provide reliable domain grounding for the formulation. Specifically, RAG helps the LLM identify the key system components in the current CMfg scenario, such as UAVs, manufacturing stations, ISDs, computational tasks, and the central depot with cloud support, together with their interactions. It also helps the LLM determine the relevant execution modes, routing–offloading couplings, resource and energy constraints, and objective components. Consequently, RAG narrows the search space of the LLM, improves alignment with domain knowledge, and mitigates unsupported or fabricated outputs. The CoT module described next organizes the retrieved knowledge into a coherent mathematical model.

### V-B Chain-of-Thought and Verification

After retrieval grounds the LLM in domain-relevant evidence, the CoT module reasons over the retrieved information and organizes it for the downstream mathematical formulation. Recent Long-CoT studies characterize this reasoning paradigm by deep reasoning, extensive exploration, and feasible reflection, which together support more intricate and coherent reasoning [Chen et al. [8](https://arxiv.org/html/2605.13221#bib.bib80)]. As shown in [Fig. 2](https://arxiv.org/html/2605.13221#S5.F2), this module externalizes the intermediate reasoning that links the problem description to the objective terms and constraint families, making the formulation process more transparent, in line with recent work on interactive reasoning and explicit reasoning-chain manipulation [Pang et al. [39](https://arxiv.org/html/2605.13221#bib.bib81)]. Recent studies further show that verifier-guided CoT can assess or guide the correctness of individual reasoning steps, and that verifiability-oriented evaluation can assess reasoning quality beyond final-answer accuracy [Chowdhury and Caragea [11](https://arxiv.org/html/2605.13221#bib.bib82), Aggarwal et al. [2](https://arxiv.org/html/2605.13221#bib.bib83)]. The CoT-based formulation step is particularly important in the CMfg scenario considered here, where UAV routing and computational task offloading are tightly coupled through time-varying service availability, communication and computing limits, energy budgets, and task deadlines.

Given a user request $\xi$ and the retrieval-augmented prompt $\mathfrak{p}(\xi)$, the LLM performs a structured reasoning process over the retrieved evidence. The framework in [Fig. 2](https://arxiv.org/html/2605.13221#S5.F2) includes two agents: the Agentic AI Responder and the Agentic AI Verifier. The Agentic AI Responder reasons over the user request and provides the final answer, whereas the Agentic AI Verifier evaluates the Responder's answers and decides whether to approve or reject each one. The CoT module of the Agentic AI Responder is shown in Part C of [Fig. 2](https://arxiv.org/html/2605.13221#S5.F2). This module first analyzes the user query and decomposes it into a sequence of CoT-step queries, so that the original request can be solved in a structured, step-by-step manner. Specifically, during the initialization phase, an LLM parses the user query and generates both a CoT-step plan and the corresponding CoT-step queries for the Agentic AI Responder. These CoT-step queries define the reasoning order and specify what the Responder should address at each stage [Xu et al. [53](https://arxiv.org/html/2605.13221#bib.bib77)].

As shown in Part D of [Fig. 2](https://arxiv.org/html/2605.13221#S5.F2), the user request is decomposed into four steps: objective function formulation, derivation of the UAV routing constraints, derivation of the task offloading constraints, and summarization of the mathematical notation. During execution, the Agentic AI Responder answers the CoT-step queries sequentially [Nguyen et al. [36](https://arxiv.org/html/2605.13221#bib.bib78)]. For example, in CoT-step 1, the query on defining the objective function is sent to the Responder, which produces an initial response. This response is then forwarded to the Agentic AI Verifier for assessment through the verification module. If the response fails verification, the Verifier returns feedback requiring the Responder to rethink and revise its answer; if the response is verified as correct, the workflow proceeds to the next CoT-step query [Wang et al. [49](https://arxiv.org/html/2605.13221#bib.bib79)]. The remaining steps follow the same pattern: CoT-step 2 derives the constraints associated with UAV routing, CoT-step 3 derives the constraints for task offloading, and CoT-step 4 summarizes the mathematical notation and symbol definitions used in the formulation.
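The sequential Responder–Verifier workflow can be summarized as a loop over CoT-step queries. The sketch below is illustrative only: `respond` and `verify` are hypothetical callables standing in for the two LLM agents, the step wording paraphrases Part D of Fig. 2, and the behavior when the retry cap is reached (keep the last answer and flag the run as not fully verified) is an assumption, since the text only states that the loop terminates.

```python
from dataclasses import dataclass
from typing import Callable

# Paraphrase of the four CoT-step queries in Part D of Fig. 2.
COT_STEPS = [
    "Formulate the objective function.",
    "Derive the UAV routing constraints.",
    "Derive the task offloading constraints.",
    "Summarize the mathematical notation.",
]


@dataclass
class Verdict:
    approved: bool
    feedback: str = ""


def run_cot_workflow(
    respond: Callable[[str, list[str], str], str],
    verify: Callable[[str, str], Verdict],
    steps: list[str],
    max_retries: int = 2,
) -> tuple[str, bool]:
    """Answer CoT steps in order; each answer is verified before the next step starts."""
    outputs: list[str] = []
    all_verified = True
    for query in steps:
        feedback = ""
        for _ in range(1 + max_retries):  # initial attempt plus bounded retries
            answer = respond(query, outputs, feedback)
            verdict = verify(query, answer)
            if verdict.approved:
                break
            feedback = verdict.feedback  # "rethink" instruction passed back to the responder
        else:
            all_verified = False  # retry cap reached (assumed: keep last answer, flag the run)
        outputs.append(answer)
    # Consolidate the per-step outputs into the final response for the user.
    return "\n\n".join(outputs), all_verified
```

Passing the already-verified outputs into `respond` lets later steps (for example, the notation summary) build on earlier ones, which matches the sequential ordering described above.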

For each CoT step, the Agentic AI Responder first formulates a response to the CoT-step query, and the Agentic AI Verifier then assesses that response before the reasoning process continues [Wang et al. [49](https://arxiv.org/html/2605.13221#bib.bib79)]. This sequential reasoning-and-verification procedure ensures that the entire CoT workflow is completed step by step: only after all CoT-step responses have been verified does the system consolidate the verified intermediate outputs into a comprehensive final response for the user. From the user's perspective, the overall workflow can therefore be viewed as a direct mapping from the input query to a complete mathematical model. The intermediate reasoning, decomposition, and verification procedures are handled internally by the CoT module in Part C and the verification module in Part E. Together, these two modules realize an automated multi-agent reasoning-and-verification process between the Agentic AI Responder and the Agentic AI Verifier.

Part E of [Fig. 2](https://arxiv.org/html/2605.13221#S5.F2) presents the verification module of the Agentic AI Verifier. For each CoT-step query, the relevant reference chunks are retrieved from the RAG database and combined with the corresponding response of the Agentic AI Responder to construct the verification context [Li et al. [31](https://arxiv.org/html/2605.13221#bib.bib66)]. Based on this context, together with the LLM's parametric knowledge, the Verifier determines whether the Responder's answer at the current CoT step is semantically correct and logically consistent. If the answer passes verification, the workflow continues to the next CoT step; otherwise, the Verifier instructs the Responder to rethink and try again until either a satisfactory response is obtained or the maximum number of allowed retries is reached, at which point the verification loop terminates to avoid unbounded iteration.
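A possible shape for the Verifier's input and output is sketched below, assuming a `retrieve(query, k)` callable over the RAG database and a hypothetical `APPROVE` / `REJECT: <feedback>` reply convention. The actual prompt wording used in the framework is not given in the text, so the template here is purely illustrative.

```python
from typing import Callable


def build_verification_context(step_query: str, response: str,
                               retrieve: Callable[[str, int], list[str]],
                               k_ret: int = 3) -> str:
    """Combine reference chunks for the CoT-step query with the Responder's answer."""
    references = retrieve(step_query, k_ret)
    ref_block = "\n".join(f"- {chunk}" for chunk in references)
    return (
        f"CoT-step query:\n{step_query}\n\n"
        f"Reference chunks:\n{ref_block}\n\n"
        f"Responder answer:\n{response}\n\n"
        "Is the answer semantically correct and logically consistent? "
        "Reply APPROVE, or REJECT: <feedback>."
    )


def parse_verdict(reply: str) -> tuple[bool, str]:
    """Map a hypothetical 'APPROVE' / 'REJECT: <feedback>' reply to (approved, feedback)."""
    head, _, rest = reply.strip().partition(":")
    approved = head.strip().upper() == "APPROVE"
    return approved, ("" if approved else rest.strip())
```

Keeping the retrieval keyed on the CoT-step query (rather than the whole user request) means each step is checked against the references most relevant to that step.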

## VI Proposed Hierarchical DRL Approach

Inspired by [Mao et al. [33](https://arxiv.org/html/2605.13221#bib.bib26)], we decompose the joint UAV routing and MEC problem into a hierarchical two-layer framework, which reduces the state and action spaces while preserving the key coupling between logistics and computation. Both layers are solved using PPO-based DRL. Throughout this paper, terrestrial MEC servers are referred to as the "cloud".

### VI-A Hierarchical Framework Overview

The proposed hierarchical framework consists of two sequential PPO\-based DRL layers\. The upper layer solves the multi\-UAV routing problem as a centralized MDP to determine visiting sequences and station assignments under energy, capacity, and flight distance constraints\. Conditioned on the upper\-layer routes, the lower layer optimizes per\-slot task execution by jointly deciding offloading destinations and communication/computing resource allocation under service\-window, capacity, and deadline constraints\. To couple the two layers, the upper layer provides the lower layer with each UAV’s service window and remaining energy headroom\. Each service window gives the arrival and departure times of a UAV at a station\. In the co\-training procedure, the upper layer is first trained, after which its best policy is used to initialize the lower\-layer training\.

### VI-B Upper-layer DRL: MDP Design

The upper layer governs the coarse-timescale routing decisions of $U$ UAVs tasked with visiting $M$ stations within a mission horizon of duration $T_{\mathrm{mission}}$. Each upper-layer training cycle consists of multiple complete routing episodes, and each episode comprises a sequence of discrete *upper steps*. In each upper step, routing decisions are made, the UAV states evolve according to the physical constraints, and rewards are accumulated.

#### VI-B1 Upper-layer MDP

The routing problem at the upper layer is modeled as a finite-horizon Markov decision process (MDP) with the following components:

$$
\mathcal{M}^{\mathrm{up}}=\bigl(\mathcal{S}^{\mathrm{up}},\mathcal{A}^{\mathrm{up}},\mathcal{R}^{\mathrm{up}},\gamma^{\mathrm{up}}\bigr),
$$

where $\mathcal{S}^{\mathrm{up}}$, $\mathcal{A}^{\mathrm{up}}$, $\mathcal{R}^{\mathrm{up}}$, and $\gamma^{\mathrm{up}}$ are the state space, action space, reward function, and discount factor of the upper-layer DRL, respectively.

#### VI-B2 State space

At each upper step $t_a$, the state $s_{t_a}^{up}\in\mathcal{S}^{\mathrm{up}}$ is a continuous vector that summarizes four types of information: (i) per-UAV status, including location, remaining energy, payload, and flight-distance budget; (ii) route-progress information, including normalized route length, completion status, and the set of stations already served by each UAV; (iii) per-station attributes, including service status, normalized value and payload weights, and local service characteristics; and (iv) mission-level context, including normalized mission time, the fraction of completed collections, and UAV–station distance information for feasibility and planning.

#### VI-B3 Action space

At each upper step, the learned upper-layer policy selects a joint action $a_{t_a}^{up}$ defined as:

$$
a_{t_a}^{up}=\bigl(a^{\mathrm{prio}}_{t_a},\;a^{\mathrm{route}}_{t_a}\bigr)\in\mathcal{A}^{\mathrm{up}}, \tag{35}
$$

where $a^{\mathrm{prio}}_{t_a}$ is the priority score [Zhou et al. [57](https://arxiv.org/html/2605.13221#bib.bib27), Lee et al. [25](https://arxiv.org/html/2605.13221#bib.bib28)], a real-valued vector whose dimension equals the number of UAVs and whose ranking determines the order in which the UAVs apply their routing decisions during the current upper step, and $a^{\mathrm{route}}_{t_a}$ denotes the routing actions [Fan et al. [15](https://arxiv.org/html/2605.13221#bib.bib29)]. For each UAV $u$, the learned upper-layer policy selects one discrete routing symbol from

$$
a_{t_a}^{route}(u)\in\{\text{stay},\;\text{add}(1),\dots,\text{add}(M),\;\text{complete}\}, \tag{36}
$$

where $\text{stay}$ keeps the UAV at its current location, $\text{add}(m)$ with $m\in\{1,\dots,M\}$ instructs the UAV to visit station $m$, and $\text{complete}$ returns the UAV to the depot and closes its current route. The routing decisions and priority scores constitute the actionable degrees of freedom. Feasibility masks restrict the learned upper-layer policy to physically admissible routing symbols: an action such as $\text{add}(m)$ becomes infeasible if its flight distance, return distance, payload, minimum service time, or remaining energy would violate the safety margins.
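The feasibility mask and the priority-based ordering can be sketched as follows. The per-UAV fields, the linear energy model `wh_per_m`, and the multiplicative `safety` margin are hypothetical simplifications; the minimum-service-time check mentioned above is omitted, and `stay` and `complete` are assumed to be always admissible.

```python
import math
from dataclasses import dataclass


@dataclass
class UavState:  # hypothetical per-UAV fields for illustration
    pos: tuple[float, float]
    payload: float      # payload-units already on board
    dist_left: float    # remaining flight-distance budget (m)
    energy_left: float  # remaining battery energy (Wh)


def route_mask(uav, stations, served, depot, capacity, wh_per_m, safety=0.1):
    """Mask over [stay, add(1), ..., add(M), complete]; True means admissible.

    stations[m] = (xy, product_weight); served = indices of stations already collected.
    """
    mask = [True]  # stay (assumed always admissible)
    for m, (xy, weight) in enumerate(stations):
        # Outbound leg plus return-to-depot leg, inflated by the safety margin.
        need = (math.dist(uav.pos, xy) + math.dist(xy, depot)) * (1 + safety)
        mask.append(
            m not in served                          # no duplicate visits
            and uav.payload + weight <= capacity     # payload budget
            and need <= uav.dist_left                # flight + return distance
            and need * wh_per_m <= uav.energy_left   # energy (assumed linear in distance)
        )
    mask.append(True)  # complete (assumed admissible: return to depot)
    return mask


def uav_order(priority):
    """Rank UAVs by descending priority score a^prio; routing symbols are applied in this order."""
    return sorted(range(len(priority)), key=lambda u: priority[u], reverse=True)
```

In a policy network, masked entries are typically given a logit of $-\infty$ before the softmax, so infeasible symbols receive zero probability without changing the network architecture.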

#### VI-B4 Reward function

At upper step $t_a$, conditioned on the observed state $s_{t_a}^{up}$, the learned upper-layer policy outputs action $a_{t_a}^{up}$, and the environment transitions to $s_{t_a+1}^{up}$, where $t_a\in\{0,1,\dots,T^{up}-1\}$ and $T^{up}$ is the number of executed upper steps in the episode. The per-step reward $r_{t_a}^{up}$ is defined as:

$$
r_{t_a}^{up}=\sum_{m\in\mathcal{C}_{t_a}}v_{m}+b_{\mathrm{cov}}+b_{\mathrm{bal}}+b_{\mathrm{del}}-\kappa_{\mathrm{con}}+\psi_{\mathrm{end}}^{up}. \tag{37}
$$

The term $\sum_{m\in\mathcal{C}_{t_a}}v_m$ is the collection value obtained during the transition, where $\mathcal{C}_{t_a}$ denotes the set of stations newly collected at upper step $t_a$ and $v_m$ is the product value of station $m$. The term $b_{\mathrm{cov}}$ is a shaping reward that encourages increasing the coverage ratio of collected stations, $b_{\mathrm{bal}}$ encourages balanced route workloads across UAVs, and $b_{\mathrm{del}}$ rewards route termination according to two factors: the collected route value and the ratio between the collected route value and the closed-tour distance. The term $\kappa_{\mathrm{con}}$ is an aggregated penalty that discourages infeasible decisions, including violations of energy, payload, remaining-distance, duplicate-visit, movement-time, service-time, and mission-completion constraints. The last term, $\psi_{\mathrm{end}}^{up}$, is the terminal-condition term, activated only at the final upper step. It adjusts the final reward according to the terminal outcome by granting a success bonus or imposing a penalty based on the number of unserved stations.
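Eq. (37) reduces to a plain sum once the shaping and penalty terms are available. In this sketch the shaping and penalty terms are passed in precomputed, and the terminal term $\psi_{\mathrm{end}}^{up}$ is modeled with hypothetical coefficients `success_bonus` and `unserved_penalty`, since their values are not stated in this section.

```python
def upper_reward(collected_values, b_cov, b_bal, b_del, kappa_con,
                 terminal=False, success=False, unserved=0,
                 success_bonus=10.0, unserved_penalty=5.0):
    """Eq. (37): r = sum_{m in C_t} v_m + b_cov + b_bal + b_del - kappa_con + psi_end."""
    reward = sum(collected_values) + b_cov + b_bal + b_del - kappa_con
    if terminal:  # psi_end^up is active only at the final upper step
        reward += success_bonus if success else -unserved_penalty * unserved
    return reward
```

Separating the terminal branch keeps the dense per-step signal unchanged while letting the episode outcome (success or the number of unserved stations) dominate the final step.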

### VI-C Information Flow from the Upper Layer to the Lower Layer

Each completed upper\-layer routing episode provides two types of information for subsequent lower\-layer training cycles: the feasible service windows extracted for each UAV\-station pair from the corresponding mission\-time intervals, and the residual energy headroom of all UAVs at the end of the episode, which is used to scale certain lower\-layer resource and capacity conditions\.
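The hand-off from the upper layer can be sketched as two small helpers: one that groups the visit intervals of an episode into per-(UAV, station) service windows and answers the availability query used in the lower-layer mask, and one that normalizes residual energy. The `routes` format, the half-open interval convention, and normalization by battery capacity are assumptions; with one ISD per station, the station index also identifies the ISD.

```python
from collections import defaultdict


def extract_windows(routes):
    """routes[u] = [(station, arrival_s, departure_s), ...] for UAV u in one upper-layer episode.

    Returns {(u, station): [(arrival_s, departure_s), ...]}.
    """
    windows = defaultdict(list)
    for u, visits in enumerate(routes):
        for station, t_arr, t_dep in visits:
            windows[(u, station)].append((t_arr, t_dep))
    return dict(windows)


def can_serve(windows, u, station, t):
    """Service-window mask entry: True if UAV u is at `station` at time t (half-open window)."""
    return any(t_arr <= t < t_dep for t_arr, t_dep in windows.get((u, station), []))


def energy_headroom(residual_wh, capacity_wh):
    """Residual energy of each UAV at episode end, normalized by battery capacity."""
    return [e / capacity_wh for e in residual_wh]
```

Because these quantities are computed once per routing episode, the lower layer can query them every slot without re-simulating UAV motion.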

### VI-D Lower-layer DRL: MDP Design

In each time slot, the lower\-layer DRL scheduler performs task scheduling and resource allocation for computation tasks generated by multiple ISDs\. Each task can be executed locally, offloaded to a UAV \(edge computing\), or offloaded to the cloud via a selected UAV\. Tasks may span multiple slots\. The lower\-layer DRL scheduler seeks to accelerate task progress and completion, reduce backlog, and avoid deadline misses, subject to per\-slot compute/communication capacity constraints and time\-varying UAV service\-window constraints determined by the upper layer\.

At the beginning of the time slot, the environment collects all unfinished tasks, including \(i\) waiting tasks and \(ii\) active tasks that have started but are not yet completed\. These tasks are consolidated into a global queue and prioritized according to urgency using the earliest\-deadline\-first \(EDF\) policy\. The lower\-layer DRL scheduler then scans the global queue from most urgent to least urgent and selects up to a preset maximum number of tasks to process in this slot\. For each selected task, it first chooses exactly one execution location \(local / a specific UAV / cloud via a specific UAV\) and then assigns discrete compute and communication resource levels to advance task processing\. The aggregated allocations across all selected tasks must remain feasible under the capacity and service\-window constraints\.
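The per-slot EDF selection can be sketched as a merge-and-sort over waiting and active tasks. The `Task` fields are hypothetical, and ties are broken by task ID, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: int
    deadline: float  # absolute deadline (s)
    active: bool     # True if started but not yet completed


def edf_select(waiting, active, max_tasks):
    """Merge waiting and active tasks into one global queue, order it by earliest
    deadline (ties by task_id), and return the `max_tasks` most urgent entries."""
    queue = sorted(waiting + active, key=lambda task: (task.deadline, task.task_id))
    return queue[:max_tasks]
```

Active and waiting tasks share one queue, so a nearly due active task can take precedence over a fresh arrival, consistent with the urgency-first scan described above.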

#### VI-D1 Lower-layer MDP

The lower\-layer scheduling problem is modeled as an MDP defined by the tuple:

$$
\mathcal{M}^{\mathrm{lo}}=\bigl(\mathcal{S}^{\mathrm{lo}},\mathcal{A}^{\mathrm{lo}},\mathcal{R}^{\mathrm{lo}},\gamma^{\mathrm{lo}}\bigr),
$$

where $\mathcal{S}^{\mathrm{lo}}$, $\mathcal{A}^{\mathrm{lo}}$, $\mathcal{R}^{\mathrm{lo}}$, and $\gamma^{\mathrm{lo}}$ are the state space, action space, reward function, and discount factor of the lower-layer DRL, respectively.

#### VI-D2 State space

The lower-layer state $s_{t_b}^{lo}\in\mathcal{S}^{lo}$ is observed at the beginning of slot $t_b$ and summarizes local queues, UAV-side resources, and current service feasibility through six groups of normalized features: (i) per-ISD local statistics, including local compute capability and headroom, queue and active-local workload summaries, and the minimum remaining time-to-deadline; (ii) per-UAV statistics, including energy headroom, compute and communication capacities, and workload summaries of tasks currently executed on the UAV or relayed to the cloud via that UAV; (iii) global system scalars, such as the current and remaining time, aggregated counts of waiting and active tasks across processing modes, and a summary statistic of UAV energy; (iv) per-UAV–ISD active offload status, represented by normalized aggregates of the remaining workload and communication bits of active tasks from ISD $k$ associated with UAV $u$; (v) a UAV–ISD service-window availability mask indicating whether UAV $u$ can serve ISD $k$ at time $t_b$; and (vi) a fixed-length, zero-padded snapshot of the top-$N_q$ unfinished tasks ordered by serviceability and urgency, with per-task attributes including origin, status, execution mode and UAV index (when applicable), remaining time-to-deadline, and remaining compute and communication requirements.
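The fixed-length queue snapshot in group (vi) can be sketched as a sort followed by zero padding. The dictionary keys, the feature subset, and the normalizers are hypothetical; `max_slack=80.0` reuses the largest task deadline from the scenario settings, and ordering serviceable tasks first and then by slack is one plausible reading of "ordered by serviceability and urgency".

```python
def queue_snapshot(tasks, n_q, now, max_slack=80.0, max_work=20.0, max_bits=2e5):
    """Return exactly n_q feature rows for the most serviceable/urgent tasks, zero-padded.

    Each task is a dict with hypothetical keys: serviceable, deadline, work_left, bits_left.
    """
    ordered = sorted(tasks, key=lambda t: (not t["serviceable"], t["deadline"] - now))
    rows = []
    for t in ordered[:n_q]:
        rows.append([
            1.0,                                        # occupied-slot flag
            1.0 if t["serviceable"] else 0.0,           # serviceable right now
            max(0.0, t["deadline"] - now) / max_slack,  # remaining time-to-deadline
            t["work_left"] / max_work,                  # remaining compute requirement
            t["bits_left"] / max_bits,                  # remaining communication requirement
        ])
    rows += [[0.0] * 5 for _ in range(n_q - len(rows))]  # zero padding to fixed length
    return rows
```

The occupied-slot flag lets the network distinguish a padded row from a real task whose features happen to be zero.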

#### VI-D3 Action space

At the beginning of each slot $t_b$, the learned lower-layer policy outputs a joint action defined over a cached global-queue snapshot $Q_{t_b}$. This joint action specifies the service decisions for the first $K_g$ queue positions, where $K_g$ is a fixed truncation size; only the first $\min\{K_g,|Q_{t_b}|\}$ entries take effect. Note that $K_g$ differs from the top-$N_q$ queue snapshot used in the state: $N_q$ denotes the number of highest-priority unfinished tasks encoded in the state, whereas $K_g$ denotes the number of highest-priority queue positions for which the learned lower-layer policy outputs per-slot decisions. The action is given by

$$
A_{t_b}^{lo}=\Bigl\{\bigl(\ell_{t_b}^{(j)},\,\tilde{\alpha}_{t_b}^{(j)},\,\tilde{\beta}_{t_b}^{(j)}\bigr)\Bigr\}_{j=1}^{K_g}, \tag{38}
$$

where the $j$-th tuple gives the processing decision and resource-allocation scalars for the task at position $j$ in $Q_{t_b}$. The discrete variable $\ell_{t_b}^{(j)}$ specifies the processing choice for the queue item at position $j$. It is selected from the semantic action set:

$$
\ell_{t_b}^{(j)}\in\{\texttt{SKIP},\ \text{local}\}\cup\{\text{UAV}(u),\ \text{cloud-via-UAV}(u)\}_{u=1}^{U}, \tag{39}
$$

where `SKIP` denotes no new dispatch, while the other options correspond to local execution, execution at UAV $u$, or cloud execution via UAV $u$. The `SKIP` option allows the learned lower-layer policy to serve fewer tasks in a slot when needed to respect resource limits and to prioritize more urgent or more serviceable tasks.
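Eq. (39) yields $2U+2$ options per queue position, which is also the base of the $(2U+2)^{K_g}$ term in the complexity analysis. A sketch of the label enumeration and a service-window mask follows; the string labels and the rule that UAV-based options require an open service window for the task's ISD are illustrative assumptions consistent with the constraints described above.

```python
def action_labels(num_uavs):
    """Semantic action set of Eq. (39): SKIP, local, then UAV(u) and cloud-via-UAV(u) for each u."""
    labels = ["SKIP", "local"]
    for u in range(1, num_uavs + 1):
        labels += [f"UAV({u})", f"cloud-via-UAV({u})"]
    return labels


def window_mask(labels, window_open):
    """window_open[u - 1] is True if UAV u's service window covers the task's ISD now."""
    mask = []
    for label in labels:
        if label in ("SKIP", "local"):
            mask.append(True)
        else:
            u = int(label[label.index("(") + 1:-1])  # parse u from "UAV(u)" or "cloud-via-UAV(u)"
            mask.append(window_open[u - 1])
    return mask
```

For $U=2$ this produces six labels, and closing UAV 2's window disables both of its options while leaving `SKIP`, local execution, and UAV 1 available.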

The scalars $\tilde{\alpha}_{t_b}^{(j)},\tilde{\beta}_{t_b}^{(j)}\in[0,1]$ denote normalized compute and communication allocations, which the environment quantizes into discrete levels. Sampling $\tilde{\alpha}_{t_b}^{(j)}$ and $\tilde{\beta}_{t_b}^{(j)}$ as continuous values avoids a high-dimensional multi-discrete action space, while quantization ensures feasibility under level-based resource constraints. For local execution, the communication allocation is ignored. For active tasks already bound to a processing mode and UAV index, $\ell_{t_b}^{(j)}$ is restricted to either `SKIP` or the previously assigned processing option. If that option is selected, the allocation is updated for the current slot; otherwise, the previous allocations persist.
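One way to realize the quantization and the binding rule is sketched below. The number of levels (`levels=4`) and the floor-based mapping from $[0,1]$ to a level index are hypothetical, since the text does not specify the discretization.

```python
def quantize(x, levels):
    """Map a normalized allocation x in [0, 1] to a level index in {0, ..., levels - 1}."""
    x = min(max(x, 0.0), 1.0)
    return min(levels - 1, int(x * levels))


def decode_allocation(label, alpha, beta, levels=4):
    """Return (compute_level, comm_level); communication is ignored for local execution."""
    compute = quantize(alpha, levels)
    comm = None if label == "local" else quantize(beta, levels)
    return compute, comm


def allowed_options(labels, bound_label=None):
    """Active tasks already bound to an option may only SKIP or keep that option."""
    if bound_label is None:
        return list(labels)
    return [label for label in labels if label in ("SKIP", bound_label)]
```

Clipping before flooring keeps out-of-range samples from a Gaussian policy head valid, and the `min` guard maps $x=1$ to the top level instead of an out-of-range index.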

#### VI-D4 Reward function

The step reward $r^{\mathrm{lo}}_{t_b}$ is designed to align with three objectives: (i) finish tasks as early as possible, (ii) strictly avoid deadline failures, and (iii) discourage avoidable idling and excessive (always-max) resource occupation. The resulting dense, objective-aligned reward is written as:

$$
\begin{aligned}
r^{\mathrm{lo}}_{t_b}={}&b_{\mathrm{complete}}+b_{\mathrm{dispatch}}+b_{\mathrm{progress}}\\
&+\Phi_{\mathrm{B}}+\Phi_{\mathrm{U}}-\kappa_{\mathrm{deadline}}-\kappa_{\mathrm{invalid}}\\
&-\kappa_{\mathrm{idle}}-\kappa_{\mathrm{alloc}}-\kappa_{\mathrm{living}}+\psi_{\mathrm{end}}^{lo}.
\end{aligned}
\tag{40}
$$

At slot $t_b$, the lower-layer reward $r_{t_b}^{\mathrm{lo}}$ combines positive incentives, penalties, and an end-of-episode settlement. Specifically, $b_{\mathrm{complete}}$ rewards task completions in slot $t_b$; $b_{\mathrm{dispatch}}$ assigns a bonus or cost to each waiting task dispatched in that slot according to its selected execution destination (local, UAV, or cloud via a UAV); and $b_{\mathrm{progress}}$ provides a dense reward for effective task advancement during the slot. The shaping terms $\Phi_{\mathrm{B}}$ and $\Phi_{\mathrm{U}}$ further encourage backlog reduction and urgency reduction, respectively, by rewarding transitions that reduce the waiting-task backlog and the urgency of unfinished tasks. In contrast, $\kappa_{\mathrm{deadline}}$ penalizes deadline violations, $\kappa_{\mathrm{invalid}}$ penalizes infeasible decisions that violate service-window or resource-capacity constraints, and $\kappa_{\mathrm{idle}}$ penalizes avoidable "no-dispatch" behavior when feasible waiting tasks exist.
In addition, $\kappa_{\mathrm{alloc}}$ discourages excessive resource usage by penalizing the aggregate normalized allocation levels of computation and communication resources, and $\kappa_{\mathrm{living}}$ imposes a persistent penalty on the unfinished load remaining after the current transition, which maintains continuous pressure for prompt task clearance. Finally, $\psi_{\mathrm{end}}^{lo}$ is the terminal contribution applied at the end of the episode: it penalizes unfinished tasks remaining at termination, imposes a stronger penalty on overdue tasks, and may also provide a sparse bonus when the system is cleared early after the final task arrival.
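Eq. (40) is likewise a signed sum of its components. The sketch below takes all terms as keyword arguments and illustrates $\Phi_{\mathrm{B}}$ with an assumed difference form, a coefficient times the drop in waiting-task backlog, which is one simple way to reward transitions that reduce the backlog; the paper's exact shaping functions are not given here.

```python
def lower_reward(*, b_complete=0.0, b_dispatch=0.0, b_progress=0.0, phi_b=0.0, phi_u=0.0,
                 k_deadline=0.0, k_invalid=0.0, k_idle=0.0, k_alloc=0.0, k_living=0.0,
                 psi_end=0.0):
    """Eq. (40): incentives and shaping terms, minus penalties, plus the terminal term."""
    return (b_complete + b_dispatch + b_progress + phi_b + phi_u
            - k_deadline - k_invalid - k_idle - k_alloc - k_living + psi_end)


def backlog_shaping(backlog_before, backlog_after, coef=0.1):
    """Assumed form of Phi_B: positive when the waiting-task backlog shrinks."""
    return coef * (backlog_before - backlog_after)
```

Keyword-only arguments make it explicit which component each number feeds, which helps when individual terms are ablated during reward tuning.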

### VI-E Complexity Analysis of Upper- and Lower-Layer DRL

Let $N_{\mathrm{up}}$ and $N_{\mathrm{lo}}$ denote the numbers of training episodes for the upper and lower layers, respectively, and let $\bar{T}^{\mathrm{up}}$ and $\bar{T}^{\mathrm{lo}}$ denote the corresponding average episode lengths. Let $P_{\theta,\phi}^{\mathrm{up}}$ and $P_{\theta,\phi}^{\mathrm{lo}}$ denote the total numbers of trainable parameters in the upper-layer and lower-layer actor–critic networks, respectively. In the upper layer, routing-action evaluation and feasibility-mask construction scale with the number of UAV–station pairs. Thus, under fixed PPO update settings, the upper-layer training complexity is $\mathcal{O}\bigl(N_{\mathrm{up}}\bar{T}^{\mathrm{up}}(UM+P_{\theta,\phi}^{\mathrm{up}})\bigr)$. In the lower layer, the state contains per-ISD features, per-UAV features, UAV–ISD service-window masks, and a top-$N_q$ queue snapshot, while the policy acts only on the first $K_g$ queue positions. Hence, the lower-layer training complexity is $\mathcal{O}\bigl(N_{\mathrm{lo}}\bar{T}^{\mathrm{lo}}(UK+N_q+K_gU+P_{\theta,\phi}^{\mathrm{lo}})\bigr)$. In contrast, a monolithic MDP would need to enumerate joint routing and task-scheduling choices, whose number scales as $(M+2)^{U}(2U+2)^{K_g}$ even before resource-allocation levels are considered. Therefore, the proposed hierarchical decomposition keeps the rollout and update costs polynomial in the main system sizes while preserving the service-window coupling between UAV routing and MEC task scheduling.
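To make the gap concrete, the two expressions can be evaluated for the scenario in Section VII-A ($U=2$ UAVs, $M=6$ stations, and $K=6$ ISDs). The values $K_g=8$ and $N_q=16$ below are hypothetical, since they are not reported in this section.

```python
def monolithic_action_count(M, U, K_g):
    """Joint routing x scheduling choices per decision, ignoring resource-allocation levels."""
    return (M + 2) ** U * (2 * U + 2) ** K_g


def hierarchical_step_terms(M, U, K, N_q, K_g):
    """Size-dependent per-step terms of the two bounds (network-parameter cost excluded)."""
    upper = U * M
    lower = U * K + N_q + K_g * U
    return upper, lower


print(monolithic_action_count(M=6, U=2, K_g=8))                # 107495424
print(hierarchical_step_terms(M=6, U=2, K=6, N_q=16, K_g=8))   # (12, 44)
```

Even in this small instance, the joint action set has about $1.1\times10^{8}$ entries, whereas the per-step size terms of the two layers are only 12 and 44; the monolithic count also grows exponentially in $K_g$.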

### VI-F Training Procedure

The two\-layer DRL framework is trained sequentially\. First, the upper layer is trained by PPO for UAV routing, where routing rollouts are collected under feasibility masks and the actor–critic networks are updated whenever the rollout buffer is full\. The best\-performing upper\-layer policy is then fixed to extract the UAV service\-window and residual\-energy information for the lower layer\. Next, the lower layer is trained by PPO for task scheduling and resource allocation under this fixed routing information\. In this stage, scheduling rollouts are collected over EDF\-based queue snapshots with feasibility masks and quantized allocation decisions, and the actor–critic networks are updated whenever the rollout buffer is full\. This sequential design preserves the cross\-layer coupling between routing and MEC scheduling while avoiding simultaneous optimization, thereby improving training stability\.
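Both training stages rely on PPO updates. For reference, the standard clipped surrogate objective of PPO (Schulman et al., 2017) is sketched below in plain Python for a batch of samples; `eps` corresponds to the clipping parameter in Table IV (0.15 for Up-PPO, 0.05 for Low-PPO). This is the textbook form, not the authors' training code; the value-function and entropy terms are omitted, and the objective is maximized (equivalently, its negative is minimized).

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, eps):
    """Mean of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A), with rho = pi_new / pi_old."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                  # probability ratio from log-probs
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)    # clip(rho, 1 - eps, 1 + eps)
        total += min(ratio * adv, clipped * adv)           # pessimistic (clipped) surrogate
    return total / len(advantages)
```

A smaller `eps`, as used for Low-PPO, limits how far each update can move the scheduling policy from the one that collected the rollout, which is consistent with the stability goal stated above.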

## VII Performance Evaluation

### VII-A Simulation Parameters and Setup

In this subsection, we describe the scenario settings, agentic AI settings, and DRL settings as follows\.

#### VII-A1 Scenario Settings

We consider a cloud-manufacturing scenario with two homogeneous UAVs serving six manufacturing stations, each equipped with one ISD. The mission horizon is set to 313 s. The minimum required service durations are 100 s at the first station, 80 s at the second, third, and fourth stations, and 70 s at the fifth and sixth stations. Here, payload-units denote an abstract unit used to measure product weight and UAV payload budget. For MEC scheduling, the lower layer is discretized into 1-s time slots, and work-units are used as an abstract unit to measure computation workload and processing capacity. In each episode, 1920 tasks are randomly generated, with deadlines uniformly distributed between 20 s and 80 s. The task workload is randomly generated around 10 work-units per task with a typical deviation of about 2 work-units, while the task input size is randomly generated around $10^5$ bits per task with a typical deviation of about $2\times10^4$ bits. The other scenario settings are listed in [Table III](https://arxiv.org/html/2605.13221#S7.SS1.SSS1).
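The task generator described above can be sketched as follows. The Gaussian distributions for workload and input size (the text only says "around ... with a typical deviation of about ..."), the truncation at 1, uniform ISD assignment, and uniform arrival slots over the horizon are all assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class GeneratedTask:
    isd: int
    arrival_slot: int
    deadline_s: float   # relative deadline, uniform in [20, 80] s
    work_units: float
    input_bits: float


def generate_tasks(n_tasks=1920, n_isds=6, horizon_s=313, seed=0):
    rng = random.Random(seed)  # seeded for reproducible episodes
    tasks = []
    for _ in range(n_tasks):
        tasks.append(GeneratedTask(
            isd=rng.randrange(n_isds),                  # assumed uniform over ISDs
            arrival_slot=rng.randrange(horizon_s),      # assumed uniform over 1-s slots
            deadline_s=rng.uniform(20.0, 80.0),
            work_units=max(1.0, rng.gauss(10.0, 2.0)),  # assumed Gaussian, truncated at 1
            input_bits=max(1.0, rng.gauss(1e5, 2e4)),   # assumed Gaussian, truncated at 1
        ))
    return tasks
```

Using a dedicated `random.Random` instance per episode keeps task generation reproducible without affecting any other random state in the training loop.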

TABLE III: Scenario settings

| System parameter | Value |
|---|---|
| UAV flight speed | 10 m/s |
| Maximum travel distance | 4800 m |
| UAV payload capacity | 30 payload-units |
| UAV battery capacity | 150 Wh |
| Local computing capacity | 300 work-units/s |
| UAV computing capacity | 400 work-units/s |
| Cloud computing capacity | 400 work-units/s |
| ISD-to-UAV / UAV-to-cloud rate | 500 / 120 Mb/s |

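A quick back-of-the-envelope check of Table III helps calibrate the scale of the problem. Assuming 1 Mb/s = $10^6$ bit/s, full-capacity processing, no queueing, and no result download, a nominal task (10 work-units, $10^5$ bits) needs roughly 33 ms locally, about 25 ms on a UAV, and about 26 ms via the cloud; likewise, a UAV flying at 10 m/s covers at most 3130 m within the 313-s horizon, which is below the 4800-m distance budget. These nominal figures ignore contention among tasks, so they are lower bounds only.

```python
def nominal_latencies(work=10.0, bits=1e5, local_cap=300.0, uav_cap=400.0, cloud_cap=400.0,
                      isd_uav_bps=500e6, uav_cloud_bps=120e6):
    """Per-task service time (s) at full capacity, ignoring queueing and result download."""
    local = work / local_cap
    uav = bits / isd_uav_bps + work / uav_cap                           # upload, then compute
    cloud = bits / isd_uav_bps + bits / uav_cloud_bps + work / cloud_cap  # two hops, then compute
    return {"local": local, "uav": uav, "cloud": cloud}


def reachable_distance(speed=10.0, horizon=313.0, max_dist=4800.0):
    """Distance a UAV can cover within the mission horizon (m)."""
    return min(max_dist, speed * horizon)
```

Under these nominal numbers, per-task service times are far shorter than the 20–80 s deadlines, so deadline pressure in this setting would come mainly from contention for shared capacity and from service-window availability rather than from single-task latency.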
#### VII-A2 Agentic AI Settings

The agentic AI components are implemented as follows. First, the user description and expert knowledge are encoded using OpenAIEmbeddings with *text-embedding-ada-002* [OpenAI [38](https://arxiv.org/html/2605.13221#bib.bib94)]. Next, a local RAG pipeline is constructed by indexing the text corpus into a Chroma vector store [Chroma [12](https://arxiv.org/html/2605.13221#bib.bib96)] and exposing a similarity-based retriever as a retrieval tool. The planning, response-generation, and verification components all employ *gpt-5.4-2026-03-05* [OpenAI [37](https://arxiv.org/html/2605.13221#bib.bib95)]. Finally, the overall reasoning process is organized as a two-agent CoT workflow consisting of a Responder Agent and a Verifier Agent, while short-term conversational memory across successive turns is maintained using LangGraph checkpointing [LangChain [24](https://arxiv.org/html/2605.13221#bib.bib97)] within the LangChain-based agent framework [LangChain [23](https://arxiv.org/html/2605.13221#bib.bib98)].

#### VII-A3 DRL Settings

We adopt a hierarchical DRL framework for the considered hybrid logistics–MEC scheduling problem. In both compared schemes, the upper layer uses PPO to optimize UAV routing, while the lower layer optimizes task execution and resource allocation. To evaluate the proposed approach, we compare it with a baseline based on advantage actor-critic (A2C): the baseline employs PPO in the upper layer and A2C in the lower layer, whereas the proposed scheme employs PPO in both layers. In both layers, the actor and critic networks each have two hidden layers of 256 neurons with ReLU activations and are trained with the Adam optimizer. The hyperparameters of the upper-layer PPO (Up-PPO), lower-layer PPO (Low-PPO), and lower-layer A2C (Low-A2C) are summarized in [Table IV](https://arxiv.org/html/2605.13221#S7.SS1.SSS3).

TABLE IV: DRL hyperparameter settings

| Hyperparameter | Up-PPO | Low-PPO | Low-A2C |
|---|---|---|---|
| Learning rate (actor) | $2\times 10^{-4}$ | $7\times 10^{-5}$ | $1\times 10^{-5}$ |
| Learning rate (critic) | $3\times 10^{-4}$ | $1\times 10^{-4}$ | $5\times 10^{-5}$ |
| Discount factor | 0.99 | 0.996 | 0.996 |
| GAE / trace parameter | 0.92 | 0.95 | 0.95 |
| Clipping parameter | 0.15 | 0.05 | – |
| Entropy coefficient | 0.004 | 0.001 | 0.001 |
| Rollout length | 96 | 512 | 64 |
| Batch size | 32 | 512 | – |
| Training episodes | 2000 | 2000 | 2000 |
| Number of hidden layers | 2 | 2 | 2 |
| Hidden layer size | 256 | 256 | 256 |
| Activation function | ReLU | ReLU | ReLU |
| DNN optimizer | Adam | Adam | Adam |

Abbreviations: generalized advantage estimation (GAE); deep neural network (DNN).
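Table IV's discount factor, GAE parameter, and clipping parameter enter PPO through generalized advantage estimation and the clipped surrogate objective. The sketch below shows the standard textbook forms of these two pieces in plain Python, with defaults taken from the Up-PPO column. It assumes the authors use the standard formulation and omits the critic loss, entropy bonus, minibatching, and network updates; the function names are hypothetical.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.92, dones=None):
    """Generalized advantage estimation over one rollout.

    delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
    A_t     = delta_t + gamma * lam * (1 - done_t) * A_{t+1}
    """
    n = len(rewards)
    dones = dones if dones is not None else [False] * n
    advantages = [0.0] * n
    next_advantage, next_value = 0.0, last_value
    for t in reversed(range(n)):
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        next_advantage = delta + gamma * lam * not_done * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages


def ppo_clip_objective(ratios, advantages, clip_eps=0.15):
    """Mean of min(r*A, clip(r, 1-eps, 1+eps)*A); maximize it (negate for a loss).

    ratios[t] is pi_new(a_t | s_t) / pi_old(a_t | s_t).
    """
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)
```

With a positive advantage, a ratio of 1.5 contributes only 1.15 times the advantage under Up-PPO's clipping parameter of 0.15, but 1.05 times under Low-PPO's 0.05, so the smaller value keeps each lower-layer update closer to the previous policy.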

### VII-B Effectiveness of the Agentic AI

[Fig. 3](https://arxiv.org/html/2605.13221#S7.F3) presents the results of the proposed agentic AI framework, which comprises the *Agentic AI Responder* and the *Agentic AI Verifier*, together with a single-agent baseline. In *Part A* of [Fig. 3](https://arxiv.org/html/2605.13221#S7.F3), the user provides the prompts, including the system description and the requested mathematical formulation. A CoT planning prompt, shown in *Part B*, is then sent to the proposed framework, whose Responder–Verifier exchange is shown in *Part C*. The *Agentic AI Responder*, equipped with both the CoT and verification modules, then generates the complete response, of which only an excerpt is shown in *Part D*. *Part E* shows the single-agent baseline, which consists only of the *Agentic AI Responder* with RAG and includes neither the CoT module nor the verification module. *Part F* presents excerpts from the baseline's complete response.
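The difference between the two configurations in Fig. 3 can be sketched as a control loop: a responder drafts a formulation from the prompt and retrieved context, and a verifier either approves it or returns feedback for another round, while the single-agent baseline calls the responder once with no feedback. This is a framework-agnostic sketch under stated assumptions, not the authors' LangGraph implementation; `responder_verifier_loop`, the callable signatures, `max_rounds`, and the toy agents are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical agent interfaces.
Responder = Callable[[str, List[str], Optional[str]], str]
Verifier = Callable[[str, List[str]], Tuple[bool, str]]


def responder_verifier_loop(prompt: str, context: List[str], responder: Responder,
                            verifier: Verifier, max_rounds: int = 3) -> Tuple[str, int, bool]:
    """Alternate drafting and verification; return (draft, rounds_used, approved)."""
    feedback: Optional[str] = None
    draft = ""
    for round_idx in range(1, max_rounds + 1):
        draft = responder(prompt, context, feedback)
        approved, feedback = verifier(draft, context)
        if approved:
            return draft, round_idx, True
    return draft, max_rounds, False


# Toy agents (they ignore prompt and context); the verifier insists on the
# normalized occupation variable that the retrieved context defines.
def toy_responder(prompt, context, feedback):
    if feedback is None:
        return "penalty = sum of raw allocations"
    return "penalty uses R_t^cmp and R_t^com"


def toy_verifier(draft, context):
    if "R_t^cmp" in draft:
        return True, "ok"
    return False, "use the normalized occupation variables from the retrieved context"


context = ["R_t^cmp: normalized computing occupation"]
result = responder_verifier_loop("formulate the penalty", context, toy_responder, toy_verifier)
baseline = toy_responder("formulate the penalty", context, None)  # single pass, no verifier
```

In this toy run the loop returns the corrected draft after two rounds, whereas the single-pass baseline keeps its first draft; capping `max_rounds` bounds the cost when the verifier never approves.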

In [Fig. 3](https://arxiv.org/html/2605.13221#S7.F3), *Part D* and *Part F* present examples generated by the proposed agentic AI framework and the baseline, respectively. The main difference between them lies in their semantic fidelity to the RAG database. The proposed framework stays close to the original modeling logic and preserves the intended meanings of the modeled terms, whereas the baseline tends to recast them into a different optimization-oriented representation. The distinction is therefore conceptual, not merely syntactic. The resource-related penalty illustrates this. In the RAG database, the penalty is defined through the abstract occupation variables $R_t^{\mathrm{cmp}}$ and $R_t^{\mathrm{com}}$, which represent normalized computing and communication occupation. The proposed framework preserves this interpretation, so the penalty still measures an aggregated, normalized notion of overall resource occupation. The baseline, by contrast, rewrites the penalty as a direct sum of the computing and communication allocation variables, which changes what it measures from normalized resource occupation to raw resource usage.
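A small numeric example makes this semantic difference concrete. It assumes, for illustration only, that normalized occupation means allocation divided by capacity and that the computing and communication terms are summed; the capacities come from Table III (treating the ISD-to-UAV rate as the communication capacity is itself an assumption), while the allocation values and function names are hypothetical.

```python
# Capacities from Table III.
UAV_CPU_CAPACITY = 400.0   # work-units/s
ISD_UAV_RATE_MBPS = 500.0  # Mb/s


def normalized_penalty(cpu_alloc, cpu_capacity, rate_alloc, rate_capacity):
    """Hypothetical normalized occupation: R_cmp + R_com, each a fraction of capacity."""
    return cpu_alloc / cpu_capacity + rate_alloc / rate_capacity


def raw_penalty(cpu_alloc, rate_alloc):
    """Direct sum of allocation variables, as in the baseline's rewritten penalty."""
    return cpu_alloc + rate_alloc


# Hypothetical allocation: 200 work-units/s of computing and 250 Mb/s of bandwidth.
print(normalized_penalty(200.0, UAV_CPU_CAPACITY, 250.0, ISD_UAV_RATE_MBPS))  # 1.0
print(raw_penalty(200.0, 250.0))                                               # 450.0
# Same bandwidth expressed in kb/s: only the raw penalty changes.
print(normalized_penalty(200.0, UAV_CPU_CAPACITY, 250_000.0, ISD_UAV_RATE_MBPS * 1000))  # 1.0
print(raw_penalty(200.0, 250_000.0))                                           # 250200.0
```

Re-expressing the same bandwidth in kb/s leaves the normalized penalty at 1.0 but inflates the raw penalty from 450 to 250200, because the raw sum adds quantities with different units.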

![Figure 3](https://arxiv.org/html/2605.13221v1/x3.png)
Figure 3: Results from the proposed agentic AI framework. *Part A* presents an example of a user's system description and mathematical formulation request. *Part B* illustrates the user's CoT planning. *Part C* outlines the conversation between the responder and verifier agents. *Part D* lists examples from the final answer of the *Agentic AI Responder* with the CoT and verification modules. *Part E* depicts the single-agent framework. *Part F* lists examples from the final answer of the *Agentic AI Responder* without the CoT and verification modules.

### VII-C Effectiveness of the Hierarchical DRL Approach

This subsection evaluates the upper-layer and lower-layer DRL components in turn.

#### VII-C1 Upper-layer DRL Training Results

In [Fig. 4](https://arxiv.org/html/2605.13221#S7.F4)(a), the upper-layer PPO learns quickly and converges. *Raw* denotes the per-episode reward, and *MA(50)* is the moving average of the reward over the last 50 episodes, which better reflects the overall training trend. In the early stage, the *Raw* reward fluctuates sharply, indicating active exploration. As training proceeds, *MA(50)* increases rapidly and then stabilizes at around 400, indicating that the policy improves quickly and converges to a stable, high-reward solution. Although occasional drops still appear in the *Raw* reward, the stable *MA(50)* suggests that overall training is robust. [Fig. 4](https://arxiv.org/html/2605.13221#S7.F4)(b) further shows that the learned upper-layer policy achieves a high collection rate. After some early exploratory fluctuations, the collection rate quickly approaches 100% and remains at or near 100% in most episodes, indicating that the PPO policy reliably plans UAV routes and completes almost all collection tasks. Apart from a few occasional drops, the collection rate remains highly stable.
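The two curves in Fig. 4 rest on simple episode-level statistics. The sketch below computes a trailing moving average such as *MA(50)* and the share of recent episodes with full collection, the kind of statistic the paper summarizes as full collection in 99.6% of the last 500 episodes. Averaging over a partial window during the first 49 episodes and treating a collection rate of 1.0 as full collection are assumptions; the function names are hypothetical.

```python
from collections import deque


def moving_average(values, window=50):
    """Trailing mean over the last `window` values (partial window at the start)."""
    averages, buffer, total = [], deque(), 0.0
    for value in values:
        buffer.append(value)
        total += value
        if len(buffer) > window:
            total -= buffer.popleft()
        averages.append(total / len(buffer))
    return averages


def full_collection_share(collection_rates, last_n=500):
    """Fraction of the last `last_n` episodes whose collection rate reached 100%."""
    tail = collection_rates[-last_n:]
    return sum(1 for rate in tail if rate >= 1.0) / len(tail)
```

For example, 498 full-collection episodes among the last 500 give a share of 0.996. Keeping a running total makes each moving-average update constant-time instead of re-summing the whole window.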

![Figure 4](https://arxiv.org/html/2605.13221v1/x4.png)
Figure 4: Upper-layer DRL training results with PPO: (a) total reward and (b) collection rate.
#### VII-C2 Lower-layer DRL Training Results

[Fig. 5](https://arxiv.org/html/2605.13221#S7.F5) compares the lower-layer DRL training performance of PPO and A2C in terms of total reward and deadline satisfaction rate, where the deadline satisfaction rate is the fraction of tasks completed within their deadlines in each episode. [Fig. 5](https://arxiv.org/html/2605.13221#S7.F5)(a) shows that both PPO and A2C improve the lower-layer policy from highly negative rewards to a converged positive-reward region. PPO exhibits larger fluctuations in the intermediate stage, but its reward quickly recovers and remains relatively stable afterward. In contrast, the A2C reward increases more smoothly in the early stage, but several sharp drops still appear in the later stage, indicating weaker stability after convergence. [Fig. 5](https://arxiv.org/html/2605.13221#S7.F5)(b) shows that both methods achieve a high deadline satisfaction rate after training. However, PPO maintains a rate very close to 100% more consistently in the later stage, whereas A2C still experiences several late-stage drops and cannot always sustain the ideal 100% rate. We therefore adopt PPO for the lower layer: it provides more robust converged performance, with more stable rewards and near-perfect deadline satisfaction in most converged episodes.
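The per-episode deadline satisfaction rate defined above can be computed as follows. The sketch assumes that completion times and deadlines are measured on the same clock (for example, both relative to task arrival) and that a task that is never completed, marked `None`, counts as a miss; the function name is hypothetical.

```python
from typing import Optional, Sequence


def deadline_satisfaction_rate(completion_times: Sequence[Optional[float]],
                               deadlines: Sequence[float]) -> float:
    """Fraction of an episode's tasks completed no later than their deadlines.

    A completion time of None marks a task that was never finished (counted as a miss).
    """
    if len(completion_times) != len(deadlines) or not deadlines:
        raise ValueError("need one completion time per task and at least one task")
    met = sum(1 for done_at, deadline in zip(completion_times, deadlines)
              if done_at is not None and done_at <= deadline)
    return met / len(deadlines)
```

For instance, completion times `[10.0, 30.0, None, 80.0]` against deadlines `[20.0, 25.0, 60.0, 80.0]` give a rate of 0.5: the second task is late and the third never finishes.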

![Figure 5](https://arxiv.org/html/2605.13221v1/x5.png)
Figure 5: Lower-layer DRL training results with PPO and A2C: (a) total reward and (b) deadline satisfaction rate.

## VIII Conclusion

In this paper, we have studied a hybrid coordination problem in CMfg, where UAV-assisted product collection is tightly coupled with MEC task processing for industrial sensor devices. To address the difficulty of formulating such a problem, we have proposed an interactive agentic AI framework that integrates LLMs, RAG, and CoT reasoning to support interpretable mathematical modeling. To solve the resulting optimization problem, we have further developed a hierarchical DRL approach based on PPO, in which the upper layer handles UAV routing and the lower layer performs computational task scheduling and resource allocation under coupled operational constraints. Simulation results demonstrate the effectiveness of the proposed framework: the agentic AI produces formulations that stay closer to the intended modeling semantics than a single-agent baseline, and the learned hierarchical PPO policies achieve near-complete product collection and near-perfect deadline satisfaction, with more stable late-stage performance than A2C. Throughout, the design preserves the essential coupling between logistics and computation through compact cross-layer information exchange. Overall, this work provides a unified framework for both formulating and solving hybrid logistics–computation scheduling problems in CMfg. Future work can extend the framework to larger-scale systems, heterogeneous UAV fleets, and more dynamic manufacturing environments.

## IX Appendix

### IX-A Dataset Construction

To validate the proposed framework, the RAG dataset is constructed with the objective defined in [Section IV-A](https://arxiv.org/html/2605.13221#S4.Ex5) as the optimization objective, where the UAV routing model is subject to constraints [(1)](https://arxiv.org/html/2605.13221#S3.E1)–[(9)](https://arxiv.org/html/2605.13221#S3.E9) and the computational task offloading model follows constraints [(19)](https://arxiv.org/html/2605.13221#S3.E19)–[(27)](https://arxiv.org/html/2605.13221#S3.E27). Our code and dataset for the proposed agentic AI framework are available on GitHub: [https://github.com/Puppet88/Agentic-AI-UAV](https://github.com/Puppet88/Agentic-AI-UAV).

## References

- \[1\]\(2024\)LLMs can schedule\.arXiv:2408\.06993\.External Links:[Link](https://arxiv.org/abs/2408.06993)Cited by:[§I](https://arxiv.org/html/2605.13221#S1.p1.1.4)\.
- \[2\]S\. Aggarwal, R\. V\. Mishra, and A\. Awekar\(2026\)Evaluating chain\-of\-thought reasoning through reusability and verifiability\.arXiv:2602\.17544\.External Links:[Link](https://arxiv.org/abs/2602.17544)Cited by:[§V\-B](https://arxiv.org/html/2605.13221#S5.SS2.p1.1.3)\.
- \[3\]N\. Alabbasi, O\. Erak, O\. Alhussein, I\. Lotfi, S\. Muhaidat, and M\. Debbah\(2025\)TeleOracle: fine\-tuned retrieval\-augmented generation with long\-context support for networks\.IEEE Internet of Things Journal12\(10\),pp\. 13170–13182\.External Links:[Document](https://dx.doi.org/10.1109/JIOT.2025.3553161)Cited by:[§V\-A](https://arxiv.org/html/2605.13221#S5.SS1.p4.1.1)\.
- \[4\]D\. Askerbekov, J\. A\. Garza\-Reyes, R\. Roy Ghatak, R\. Joshi, J\. Kandasamy, and D\. Luiz de Mattos Nascimento\(2024\)Embracing drones and the internet of drones systems in manufacturing – an exploration of obstacles\.Technology in Society78,pp\. 102648\.External Links:ISSN 0160\-791X,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.techsoc.2024.102648),[Link](https://www.sciencedirect.com/science/article/pii/S0160791X24001969)Cited by:[§III\-A](https://arxiv.org/html/2605.13221#S3.SS1.p3.1.2)\.
- \[5\]L\. Bo\-hu, Z\. Lin, W\. Shi\-long, T\. Fei, C\. Jun\-wei, J\. Xiao\-dan, S\. Xiao, and C\. Xu\-dong\(2010\)Cloud manufacturing:a new service\-oriented networked manufacturing model\.Computer Integrated Manufacturing System16\(01\),pp\. 0–0\.External Links:[Link](http://www.cims-journal.cn/EN/Y2010/V16/I01/0)Cited by:[§I](https://arxiv.org/html/2605.13221#S1.p2.1.1)\.
- \[6\]G\. O\. Chagas, L\. C\. Coelho, D\. Laganà, and P\. Beraldi\(2025\)A dynamic drone routing problem with uncertain demand and energy consumption\.Transportation Research Part B: Methodological202,pp\. 103335\.External Links:ISSN 0191\-2615,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.trb.2025.103335),[Link](https://www.sciencedirect.com/science/article/pii/S0191261525001845)Cited by:[§II\-A](https://arxiv.org/html/2605.13221#S2.SS1.p1.1.2)\.
- \[7\]L\. Chen, M\. S\. Pardeshi, Y\. Liao, and K\. Pai\(2025\)Application of retrieval\-augmented generation for interactive industrial knowledge management via a large language model\.Computer Standards & Interfaces94,pp\. 103995\.External Links:ISSN 0920\-5489,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.csi.2025.103995),[Link](https://www.sciencedirect.com/science/article/pii/S0920548925000248)Cited by:[§II\-C](https://arxiv.org/html/2605.13221#S2.SS3.p1.1.4),[§V\-A](https://arxiv.org/html/2605.13221#S5.SS1.p1.1.3)\.
- \[8\]Q\. Chen, L\. Qin, J\. Liu, D\. Peng, J\. Guan, P\. Wang, M\. Hu, Y\. Zhou, T\. Gao, and W\. Che\(2025\)Towards reasoning era: a survey of long chain\-of\-thought for reasoning large language models\.arXiv:2503\.09567\.External Links:[Link](https://arxiv.org/abs/2503.09567)Cited by:[§V\-B](https://arxiv.org/html/2605.13221#S5.SS2.p1.1.1)\.
- \[9\]X\. Chen, J\. Cao, R\. Cao, Y\. Sahni, M\. Zhang, and Y\. Ji\(2026\)Decentralized task offloading in collaborative edge computing: a digital twin assisted multi\-agent reinforcement learning approach\.IEEE Transactions on Mobile Computing25\(4\),pp\. 4776–4790\.External Links:[Document](https://dx.doi.org/10.1109/TMC.2025.3628502)Cited by:[§II\-B](https://arxiv.org/html/2605.13221#S2.SS2.p1.1.3)\.
- \[10\]X\. Chen, J\. Cao, Y\. Sahni, M\. Zhang, Z\. Liang, and L\. Yang\(2025\)Mobility\-aware dependent task offloading in edge computing: a digital twin\-assisted reinforcement learning approach\.IEEE Transactions on Mobile Computing24\(4\),pp\. 2979–2994\.External Links:[Document](https://dx.doi.org/10.1109/TMC.2024.3506221)Cited by:[§II\-B](https://arxiv.org/html/2605.13221#S2.SS2.p1.1.1)\.
- \[11\]J\. R\. Chowdhury and C\. Caragea\(2025\)Zero\-shot verification\-guided chain of thoughts\.arXiv:2501\.13122\.External Links:[Link](https://arxiv.org/abs/2501.13122)Cited by:[§V\-B](https://arxiv.org/html/2605.13221#S5.SS2.p1.1.3)\.
- \[12\]Chroma\(2026\)What chroma offers\.Note:[https://docs\.trychroma\.com/docs/overview/introduction](https://docs.trychroma.com/docs/overview/introduction)Accessed: 03 Apr 2026Cited by:[§VII\-A2](https://arxiv.org/html/2605.13221#S7.SS1.SSS2.p1.1.3)\.
- \[13\]W\. P\. Coutinho, J\. Fliege, M\. Battarra, and A\. Subramanian\(2025\)Routing a fleet of unmanned aerial vehicles: a trajectory optimisation\-based framework\.Transportation Research Part B: Methodological200,pp\. 103312\.External Links:ISSN 0191\-2615,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.trb.2025.103312),[Link](https://www.sciencedirect.com/science/article/pii/S0191261525001614)Cited by:[§II\-A](https://arxiv.org/html/2605.13221#S2.SS1.p1.1.1)\.
- \[14\]C\. Dong, W\. Li, Z\. Zhou, X\. Chen, Z\. Tian, and W\. Wen\(2025\)Delay\-sensitive task offloading with edge caching through martingale\-based deep reinforcement learning\.IEEE Transactions on Mobile Computing24\(7\),pp\. 6225–6242\.External Links:[Document](https://dx.doi.org/10.1109/TMC.2025.3540413)Cited by:[§II\-B](https://arxiv.org/html/2605.13221#S2.SS2.p1.1.2)\.
- \[15\]M\. Fan, Y\. Wu, T\. Liao, Z\. Cao, H\. Guo, G\. Sartoretti, and G\. Wu\(2023\)Deep reinforcement learning for uav routing in the presence of multiple charging stations\.IEEE Transactions on Vehicular Technology72\(5\),pp\. 5732–5746\.External Links:[Document](https://dx.doi.org/10.1109/TVT.2022.3232607)Cited by:[§VI\-B3](https://arxiv.org/html/2605.13221#S6.SS2.SSS3.p1.4.2)\.
- \[16\]N\. M\. Farid, A\. Taghizadeh, and S\. Shafiee\(2026\)Agentic data analysis for intelligent manufacturing: benchmark\-driven evaluation of agentic vs\. direct llm approaches\.Procedia CIRP139,pp\. 280–285\.Note:13th CIRP Global Web ConferenceExternal Links:ISSN 2212\-8271,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.procir.2025.09.043),[Link](https://www.sciencedirect.com/science/article/pii/S2212827125010030)Cited by:[§II\-C](https://arxiv.org/html/2605.13221#S2.SS3.p1.1.6)\.
- \[17\]H\. Gu, L\. Zhao, Z\. Han, X\. Chu, G\. Zheng, J\. Liu, and G\. Zhou\(2026\)Joint task offloading and resource allocation in ultra\-dense multi\-access edge computing: a mean field learning approach\.IEEE Transactions on Mobile Computing25\(3\),pp\. 3598–3615\.External Links:[Document](https://dx.doi.org/10.1109/TMC.2025.3619077)Cited by:[§II\-B](https://arxiv.org/html/2605.13221#S2.SS2.p1.1.5)\.
- \[18\]W\. Gu, Y\. Cao, Y\. Li, N\. Li, L\. Wang, N\. Tang, M\. Yuan, and F\. Pei\(2026\)Large language model\-empowered dynamic scheduling for intelligent hybrid flow shop using multi\-agent deep reinforcement learning\.Advanced Engineering Informatics71,pp\. 104294\.External Links:ISSN 1474\-0346,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.aei.2025.104294),[Link](https://www.sciencedirect.com/science/article/pii/S1474034625011875)Cited by:[§I](https://arxiv.org/html/2605.13221#S1.p1.1.2)\.
- \[19\]J\. A\. Heredia Álvaro and J\. G\. Barreda\(2025\)An advanced retrieval\-augmented generation system for manufacturing quality control\.Advanced Engineering Informatics64,pp\. 103007\.External Links:ISSN 1474\-0346,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.aei.2024.103007),[Link](https://www.sciencedirect.com/science/article/pii/S147403462400658X)Cited by:[§V\-A](https://arxiv.org/html/2605.13221#S5.SS1.p2.16.1)\.
- \[20\]Z\. Huang, L\. Guo, J\. Sheng, H\. Chen, W\. Li, B\. Jin, C\. Lu, and X\. Wang\(2025\)GraphThought: graph combinatorial optimization with thought generation\.arXiv:2502\.11607\.External Links:[Link](https://arxiv.org/abs/2502.11607)Cited by:[§I](https://arxiv.org/html/2605.13221#S1.p1.1.3)\.
- \[21\]L\. Jiao, L\. Gao, J\. Zheng, P\. Yang, and Z\. Zhang\(2026\)Optimizing 3d trajectory and task offloading in collaborative uav\-enabled mobile edge computing networks\.Computer Networks282,pp\. 112283\.External Links:ISSN 1389\-1286,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.comnet.2026.112283),[Link](https://www.sciencedirect.com/science/article/pii/S1389128626002951)Cited by:[§I](https://arxiv.org/html/2605.13221#S1.p3.1.2),[§III\-A](https://arxiv.org/html/2605.13221#S3.SS1.p2.1.3)\.
- \[22\]R\. Kataishi\(2026\)Enhancing retrieval\-augmented generation with topic\-enriched embeddings: a hybrid approach integrating traditional nlp techniques\.Natural Language Processing Journal14,pp\. 100200\.External Links:ISSN 2949\-7191,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.nlp.2026.100200),[Link](https://www.sciencedirect.com/science/article/pii/S294971912600004X)Cited by:[§V\-A](https://arxiv.org/html/2605.13221#S5.SS1.p2.5.1)\.
- \[23\]LangChain\(2026\)LangChain overview\.Note:[https://docs\.langchain\.com/oss/javascript/langchain/overview\#langchain\-overview](https://docs.langchain.com/oss/javascript/langchain/overview#langchain-overview)Accessed: 03 Apr 2026Cited by:[§VII\-A2](https://arxiv.org/html/2605.13221#S7.SS1.SSS2.p1.1.7)\.
- \[24\]LangChain\(2026\)LangGraph overview\.Note:[https://docs\.langchain\.com/oss/python/langgraph/overview](https://docs.langchain.com/oss/python/langgraph/overview)Accessed: 03 Apr 2026Cited by:[§VII\-A2](https://arxiv.org/html/2605.13221#S7.SS1.SSS2.p1.1.6)\.
- \[25\]H\. Lee, J\. Lee, I\. Yeom, and H\. Woo\(2020\)Panda: reinforcement learning\-based priority assignment for multi\-processor real\-time scheduling\.IEEE Access8\(\),pp\. 185570–185583\.External Links:[Document](https://dx.doi.org/10.1109/ACCESS.2020.3029040)Cited by:[§VI\-B3](https://arxiv.org/html/2605.13221#S6.SS2.SSS3.p1.4.1)\.
- \[26\]J\. Lee and H\. Su\(2025\)Agentic ai for smart manufacturing\.Manufacturing Letters46,pp\. 92–96\.External Links:ISSN 2213\-8463,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.mfglet.2025.10.013),[Link](https://www.sciencedirect.com/science/article/pii/S2213846325002883)Cited by:[§II\-C](https://arxiv.org/html/2605.13221#S2.SS3.p1.1.3)\.
- \[27\]P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, S\. Riedel, and D\. Kiela\(2020\)Retrieval\-augmented generation for knowledge\-intensive nlp tasks\.InAdvances in Neural Information Processing Systems,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\),Vol\.33,pp\. 9459–9474\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf)Cited by:[§V\-A](https://arxiv.org/html/2605.13221#S5.SS1.p2.11.1),[§V\-A](https://arxiv.org/html/2605.13221#S5.SS1.p3.6.1)\.
- \[28\]M\. Li, Q\. Zhou, W\. Li, T\. Qu, M\. Yang, and P\. Jiang\(2026\)A4PS: agentic ai\-assisted advanced planning and scheduling with large language models for smart manufacturing\.Journal of Manufacturing Systems85,pp\. 207–226\.External Links:ISSN 0278\-6125,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.jmsy.2026.01.003),[Link](https://www.sciencedirect.com/science/article/pii/S0278612526000154)Cited by:[§II\-C](https://arxiv.org/html/2605.13221#S2.SS3.p1.1.5)\.
- \[29\]S\. Li, T\. Liao, G\. Wu, Y\. Wang, and P\. N\. Suganthan\(2025\)Drone on\-demand delivery routing problem considering order splitting and battery swapping\.Computers & Industrial Engineering208,pp\. 111388\.External Links:ISSN 0360\-8352,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.cie.2025.111388),[Link](https://www.sciencedirect.com/science/article/pii/S0360835225005340)Cited by:[§III\-A](https://arxiv.org/html/2605.13221#S3.SS1.p2.1.1)\.
- \[30\]Y\. Li, S\. Wang, H\. Sun, and S\. Zhou\(2025\)Collaborative vessel–unmanned aerial vehicle routing for time\-window\-constrained offshore parcel delivery\.Transportation Research Part C: Emerging Technologies178,pp\. 105189\.External Links:ISSN 0968\-090X,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.trc.2025.105189),[Link](https://www.sciencedirect.com/science/article/pii/S0968090X25001937)Cited by:[§II\-A](https://arxiv.org/html/2605.13221#S2.SS1.p1.1.4)\.
- \[31\]Y\. Li, W\. Ke, J\. Liu, P\. Wang, J\. Liu, and Y\. He\(2026\)Towards evidence\-aware retrieval\-augmented generation via self\-corrective chain\-of\-thought\.Information Processing & Management63\(2, Part A\),pp\. 104369\.External Links:ISSN 0306\-4573,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.ipm.2025.104369),[Link](https://www.sciencedirect.com/science/article/pii/S0306457325003103)Cited by:[§V\-B](https://arxiv.org/html/2605.13221#S5.SS2.p5.1.4)\.
- \[32\]Y\. Li, Q\. Wang, X\. Li, L\. Gao, L\. Fu, Y\. Yu, and W\. Zhou\(2025\)Real\-time scheduling for flexible job shop with agvs using multiagent reinforcement learning and efficient action decoding\.IEEE Transactions on Systems, Man, and Cybernetics: Systems55\(3\),pp\. 2120–2132\.External Links:[Document](https://dx.doi.org/10.1109/TSMC.2024.3520381)Cited by:[§I](https://arxiv.org/html/2605.13221#S1.p2.1.4)\.
- \[33\]X\. Mao, G\. Wu, M\. Fan, Z\. Cao, and W\. Pedrycz\(2025\)DL\-drl: a double\-level deep reinforcement learning approach for large\-scale task scheduling of multi\-uav\.IEEE Transactions on Automation Science and Engineering22\(\),pp\. 1028–1044\.External Links:[Document](https://dx.doi.org/10.1109/TASE.2024.3358894)Cited by:[§I](https://arxiv.org/html/2605.13221#S1.p2.1.3),[§I](https://arxiv.org/html/2605.13221#S1.p5.1.2),[§VI](https://arxiv.org/html/2605.13221#S6.p1.1.1)\.
- \[34\]S\. A\. H\. Mohsan, N\. Q\. H\. Othman, Y\. Li, M\. H\. Alsharif, and M\. A\. Khan\(2023\)Unmanned aerial vehicles \(uavs\): practical aspects, applications, open challenges, security issues, and future trends\.Intelligent Service Robotics16\(1\),pp\. 109–137\.External Links:ISSN 1861\-2784,[Document](https://dx.doi.org/10.1007/s11370-022-00452-4),[Link](https://doi.org/10.1007/s11370-022-00452-4)Cited by:[§III\-A](https://arxiv.org/html/2605.13221#S3.SS1.p3.1.3)\.
- \[35\]A\. Nabi and S\. Moh\(2025\)Joint offloading decision, user association, and resource allocation in hierarchical aerial computing: collaboration of uavs and hap\.IEEE Transactions on Mobile Computing24\(8\),pp\. 7267–7282\.External Links:[Document](https://dx.doi.org/10.1109/TMC.2025.3548668)Cited by:[§II\-B](https://arxiv.org/html/2605.13221#S2.SS2.p1.1.4)\.
- \[36\]T\. Nguyen, P\. Chin, and Y\. Tai\(2025\)MA\-rag: multi\-agent retrieval\-augmented generation via collaborative chain\-of\-thought reasoning\.arXiv:2505\.20096\.External Links:[Link](https://arxiv.org/abs/2505.20096)Cited by:[§V\-B](https://arxiv.org/html/2605.13221#S5.SS2.p3.1.3)\.
- \[37\]OpenAI\(2026\)GPT\-5\.4\.Note:[https://developers\.openai\.com/api/docs/models/gpt\-5\.4](https://developers.openai.com/api/docs/models/gpt-5.4)Accessed: 03 Apr 2026Cited by:[§VII\-A2](https://arxiv.org/html/2605.13221#S7.SS1.SSS2.p1.1.5)\.
- \[38\]OpenAI\(2026\)Text\-embedding\-ada\-002\.Note:[https://developers\.openai\.com/api/docs/models/text\-embedding\-ada\-002](https://developers.openai.com/api/docs/models/text-embedding-ada-002)Accessed: 03 Apr 2026Cited by:[§VII\-A2](https://arxiv.org/html/2605.13221#S7.SS1.SSS2.p1.1.2)\.
- \[39\]R\. Y\. Pang, K\. J\. K\. Feng, S\. Feng, C\. Li, W\. Shi, Y\. Tsvetkov, J\. Heer, and K\. Reinecke\(2025\)Interactive reasoning: visualizing and controlling chain\-of\-thought reasoning in large language models\.arXiv:2506\.23678\.External Links:[Link](https://arxiv.org/abs/2506.23678)Cited by:[§V\-B](https://arxiv.org/html/2605.13221#S5.SS2.p1.1.2)\.
- \[40\]S\. Prabhakaran\(2026\)Cosine similarity – understanding the math and how it works \(with python codes\)\.Note:[https://machinelearningplus\.com/nlp/cosine\-similarity/](https://machinelearningplus.com/nlp/cosine-similarity/)Accessed: 19 Mar 2026Cited by:[§V\-A](https://arxiv.org/html/2605.13221#S5.SS1.p2.7.1)\.
- \[41\]Y\. Ren, Y\. Liu, T\. Ji, and X\. Xu\(2025\)AI agents and agentic ai–navigating a plethora of concepts for future manufacturing\.Journal of Manufacturing Systems83,pp\. 126–133\.External Links:ISSN 0278\-6125,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.jmsy.2025.08.017),[Link](https://www.sciencedirect.com/science/article/pii/S027861252500216X)Cited by:[§II\-C](https://arxiv.org/html/2605.13221#S2.SS3.p1.1.2)\.
- \[42\]S\. I\. Satoglu and I\. E\. Sahin\(2013\)Design of a just\-in\-time periodic material supply system for the assembly lines and an application in electronics industry\.The International Journal of Advanced Manufacturing Technology65\(1\),pp\. 319–332\.External Links:ISSN 1433\-3015,[Document](https://dx.doi.org/10.1007/s00170-012-4171-7),[Link](https://doi.org/10.1007/s00170-012-4171-7)Cited by:[§III\-A](https://arxiv.org/html/2605.13221#S3.SS1.p3.1.1)\.
- \[43\]C\. Su, J\. Wen, J\. Kang, Y\. Wang, Y\. Su, H\. Pan, Z\. Zhong, and M\. Shamim Hossain\(2025\)Hybrid rag\-empowered multimodal llm for secure data management in internet of medical things: a diffusion\-based contract approach\.IEEE Internet of Things Journal12\(10\),pp\. 13428–13440\.External Links:[Document](https://dx.doi.org/10.1109/JIOT.2024.3521425)Cited by:[§V\-A](https://arxiv.org/html/2605.13221#S5.SS1.p1.1.1)\.
- \[44\]G\. Sun, J\. Wu, Z\. Sun, L\. He, J\. Wang, D\. Niyato, A\. Jamalipour, and S\. Mao\(2025\)JC5\\text\{C\}^\{5\}a: service delay minimization for aerial mec\-assisted industrial cyber\-physical systems\.IEEE Transactions on Services Computing18\(5\),pp\. 2976–2993\.External Links:[Document](https://dx.doi.org/10.1109/TSC.2025.3592419)Cited by:[§I](https://arxiv.org/html/2605.13221#S1.p2.1.5),[§III\-A](https://arxiv.org/html/2605.13221#S3.SS1.p2.1.2)\.
- \[45\]J\. Walker\(14 Jul 2022\)AMR vs agv: a clear choice for flexible material handling\.Note:[https://locusrobotics\.com/blog/amr\-vs\-agv](https://locusrobotics.com/blog/amr-vs-agv)Accessed: 03 Jul 2025Cited by:[§III\-A](https://arxiv.org/html/2605.13221#S3.SS1.p3.1.3)\.
- \[46\]C\. Wang, S\. Chai, T\. Xu, M\. Adil, and T\. Qiu\(2026\)CP\-rag: mitigating distracting content in retrieval\-augmented generation for industrial knowledge question answering\.IEEE Internet of Things Journal13\(7\),pp\. 15056–15066\.External Links:[Document](https://dx.doi.org/10.1109/JIOT.2026.3652422)Cited by:[§V\-A](https://arxiv.org/html/2605.13221#S5.SS1.p1.1.1)\.
- \[47\]F\. WANG, H\. ZHANG, S\. DU, M\. HUA, and G\. ZHONG\(2025\)C\-sppo: a deep reinforcement learning framework for large\-scale dynamic logistics uav routing problem\.Chinese Journal of Aeronautics38\(5\),pp\. 103229\.External Links:ISSN 1000\-9361,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.cja.2024.09.005),[Link](https://www.sciencedirect.com/science/article/pii/S1000936124003662)Cited by:[§II\-A](https://arxiv.org/html/2605.13221#S2.SS1.p1.1.3)\.
- \[48\]M\. Wang, S\. Chen, and Q\. Meng\(2026\)Drone routing problem for shore\-to\-ship delivery services considering non\-linear energy consumption\.Transportation Research Part B: Methodological206,pp\. 103410\.External Links:ISSN 0191\-2615,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.trb.2026.103410),[Link](https://www.sciencedirect.com/science/article/pii/S0191261526000226)Cited by:[§II\-A](https://arxiv.org/html/2605.13221#S2.SS1.p1.1.5)\.
- \[49\]X\. Wang, J\. Wang, Y\. Wang, P\. Dang, S\. Cao, and C\. Zhang\(2026\)MARS: toward more efficient multi\-agent collaboration for llm reasoning\.arXiv:2509\.20502\.External Links:[Link](https://arxiv.org/abs/2509.20502)Cited by:[§V\-B](https://arxiv.org/html/2605.13221#S5.SS2.p3.1.5),[§V\-B](https://arxiv.org/html/2605.13221#S5.SS2.p4.1.3)\.
- \[50\]X\. Wang, Y\. Laili, L\. Zhang, and Y\. Liu\(2025\)Hybrid task scheduling in cloud manufacturing with sparse\-reward deep reinforcement learning\.IEEE Transactions on Automation Science and Engineering22\(\),pp\. 1878–1892\.External Links:[Document](https://dx.doi.org/10.1109/TASE.2024.3371250)Cited by:[§I](https://arxiv.org/html/2605.13221#S1.p2.1.2)\.
- \[51\]Y\. Wang, Y\. Wan, X\. Lei, Q\. Chen, and H\. Hu\(2025\)A retrieval augmented generation based optimization approach for medical knowledge understanding and reasoning in large language models\.Array28,pp\. 100504\.External Links:ISSN 2590\-0056,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.array.2025.100504),[Link](https://www.sciencedirect.com/science/article/pii/S2590005625001316)Cited by:[§V\-A](https://arxiv.org/html/2605.13221#S5.SS1.p3.10.1)\.
- \[52\]Z\. Wang, C\. Wan, J\. Liu, X\. Zhang, H\. Wang, Y\. Hu, and Z\. Hu\(2025\)MASC: large language model\-based multi\-agent scheduling chain for flexible job shop scheduling problem\.Advanced Engineering Informatics67,pp\. 103527\.External Links:ISSN 1474\-0346,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.aei.2025.103527),[Link](https://www.sciencedirect.com/science/article/pii/S1474034625004203)Cited by:[§I](https://arxiv.org/html/2605.13221#S1.p1.1.1)\.
- \[53\]R\. Xu, W\. Shi, Y\. Zhuang, Y\. Yu, J\. C\. Ho, H\. Wang, and C\. Yang\(2025\)Collab\-rag: boosting retrieval\-augmented generation for complex question answering via white\-box and black\-box llm collaboration\.arXiv:2504\.04915\.External Links:[Link](https://arxiv.org/abs/2504.04915)Cited by:[§V\-B](https://arxiv.org/html/2605.13221#S5.SS2.p2.2.8)\.
- \[54\]H\. Zhang, R\. Zhang, W\. Zhang, D\. Niyato, Y\. Wen, and C\. Miao\(2026\)Advancing generative artificial intelligence and large language models for demand side management with internet of electric vehicles\.IEEE Internet of Things Journal\(\),pp\. 1–1\.External Links:[Document](https://dx.doi.org/10.1109/JIOT.2026.3685302)Cited by:[§II\-C](https://arxiv.org/html/2605.13221#S2.SS3.p1.1.1)\.
- \[55\]R\. Zhang, H\. Du, Y\. Liu, D\. Niyato, J\. Kang, S\. Sun, X\. Shen, and H\. V\. Poor\(2024\)Interactive ai with retrieval\-augmented generation for next generation networking\.IEEE Network38\(6\),pp\. 414–424\.External Links:[Document](https://dx.doi.org/10.1109/MNET.2024.3401159)Cited by:[§V\-A](https://arxiv.org/html/2605.13221#S5.SS1.p1.1.2)\.
- \[56\] Z. Zhao, D. Tang, H. Zhu, Z. Zhang, K. Chen, C. Liu, and Y. Ji (2024). A large language model-based multi-agent manufacturing system for intelligent shopfloor. arXiv:2405.16887. [Link](https://arxiv.org/abs/2405.16887). Cited by: [§I](https://arxiv.org/html/2605.13221#S1.p1.1.5).
- \[57\] Y. Zhou, W. Yang, and Y. Gong (2026). Reinforcement learning with priority decentralized PPO for multi-vessel cooperative rescue scheduling in flood disaster. Alexandria Engineering Journal 138, pp. 96–113. ISSN 1110-0168. [Document](https://doi.org/10.1016/j.aej.2026.01.047), [Link](https://www.sciencedirect.com/science/article/pii/S1110016826000761). Cited by: [§VI-B3](https://arxiv.org/html/2605.13221#S6.SS2.SSS3.p1.4.1).