Agentic AI for Bilevel Long-Term Optimization of Policy-Driven Physical Layer Systems

arXiv cs.AI 06/24/26, 04:00 AM Papers
agentic-ai wireless-systems beamforming bilevel-optimization physical-layer mimo optimization
Summary
This paper presents Agentic-LTPO, a nested bilevel optimization framework that uses agentic AI to adapt physical layer configurations under dynamic operator policies, achieving 57.2% long-term performance improvement in cell-free MIMO beamforming.
arXiv:2606.24416v1 Announce Type: new Abstract: Network operators' changing policies, service requirements, and stringent real-time constraints render existing methods designed with fixed objectives and constraints ineffective. This paper presents Agentic long-term performance optimization (Agentic-LTPO), a nested bilevel optimization framework that can be applied to adaptive physical layer problem configuration. The key idea is to employ agentic AI to generate upper-level configurations in a bilevel optimization structure, where evolving operator policies, environment summaries, and historical experiences are translated into structured lower-level optimization problem configurations. The lower level solves the problems with updated configurations for real-time physical-layer decisions. Considering cell-free MIMO beamforming as a use case, we embody Agentic-LTPO by designing a new multi-agent decision process with retrieval-augmented experience-based verification in the upper level, together with a closed-form beamformer in the lower level. Experiments demonstrate that Agentic-LTPO exhibits strong adaptability to dynamic operator policies and effectively enhances the system's long-term performance by 57.2% compared to traditional methods.
# Agentic AI for Bilevel Long-Term Optimization of Policy-Driven Physical Layer Systems
Source: [https://arxiv.org/html/2606.24416](https://arxiv.org/html/2606.24416)
Bingnan Xiao, Chenhao Yang, Wei Ni, Xin Wang, and Tony Q\. S\. Quek

B\. Xiao and X\. Wang are with the Key Laboratory of EMW Information \(MoE\), College of Future Information Technology, Fudan University, Shanghai 200433, China \(e\-mail: 22110720061@m\.fudan\.edu\.cn, xwang11@fudan\.edu\.cn\)\. C\. Yang is with the James Watt School of Engineering, University of Glasgow, Glasgow G12 8QQ, U\.K\. \(e\-mail: 3165453Y@student\.gla\.ac\.uk\)\. W\. Ni is with the School of Engineering, Edith Cowan University, Perth, WA 6027, Australia \(e\-mail: wei\.ni@ieee\.org\)\. T\. Q\. S\. Quek is with the Information Systems Technology and Design Pillar, Singapore University of Technology and Design, Singapore 487372 \(e\-mail: tonyquek@sutd\.edu\.sg\)\.

###### Abstract

Network operators’ changing policies, service requirements, and stringent real\-time constraints render existing methods designed with fixed objectives and constraints ineffective\. This paper presents Agentic long\-term performance optimization \(Agentic\-LTPO\), a nested bilevel optimization framework that can be applied to adaptive physical layer problem configuration\. The key idea is to employ agentic AI to generate upper\-level configurations in a bilevel optimization structure, where evolving operator policies, environment summaries, and historical experiences are translated into structured lower\-level optimization problem configurations\. The lower level solves the problems with updated configurations for real\-time physical\-layer decisions\. Considering cell\-free MIMO beamforming as a use case, we embody Agentic\-LTPO by designing a new multi\-agent decision process with retrieval\-augmented experience\-based verification in the upper level, together with a closed\-form beamformer in the lower level\. Experiments demonstrate that Agentic\-LTPO exhibits strong adaptability to dynamic operator policies and effectively enhances the system’s long\-term performance by 57\.2% compared to traditional methods\.

Keywords: Agentic AI, wireless systems, beamforming, foundation models

## I Introduction

Future wireless networks will feature dense connectivity, diverse service requirements, and software\-defined operations\[[30](https://arxiv.org/html/2606.24416#bib.bib1)\]\. The widespread deployment of coordinated physical\-layer architectures, e\.g\., cell\-free \(CF\) massive MIMO, gives rise to new beamforming and radio resource allocation designs that require distributed access points \(APs\) to jointly optimize transmission decisions under coupled quality\-of\-service \(QoS\) and power constraints\[[30](https://arxiv.org/html/2606.24416#bib.bib1),[36](https://arxiv.org/html/2606.24416#bib.bib4),[8](https://arxiv.org/html/2606.24416#bib.bib3)\]\. In a classical physical\-layer control framework, the network operator first specifies an objective and constraints, after which a model\-based solver produces decisions\[[36](https://arxiv.org/html/2606.24416#bib.bib4)\]\. In practice, operator policies, intents, and key performance indicators \(KPIs\) can change over time\. Ideally, the physical layer would adapt to a policy\-driven, non\-stationary environment with time\-varying objectives and constraints\.

Deep learning \(DL\) and deep reinforcement learning \(DRL\) have helped relax the need for persistent optimization configurations of wireless control\[[21](https://arxiv.org/html/2606.24416#bib.bib5)\]\. DRL has been employed to learn resource allocation and beamforming policies from interactions with the environment, reducing dependence on explicit analytical models at runtime\[[2](https://arxiv.org/html/2606.24416#bib.bib7),[40](https://arxiv.org/html/2606.24416#bib.bib9)\]\. Learning\-based methods can improve online adaptivity and generate low\-latency control decisions for dynamic wireless environments\[[21](https://arxiv.org/html/2606.24416#bib.bib5)\]\. Unfortunately, these methods are designed with a pre\-specified utility or reward function for a given set of KPIs\. Once the operator’s intent changes, e\.g\., from throughput to energy efficiency, the deployed learning rule or reward design needs to be modified, and the controller needs to be retrained\.

Recent advances in large language models \(LLMs\) and agentic AI offer a new possibility for adaptive wireless control\[[39](https://arxiv.org/html/2606.24416#bib.bib11),[26](https://arxiv.org/html/2606.24416#bib.bib12),[10](https://arxiv.org/html/2606.24416#bib.bib14),[20](https://arxiv.org/html/2606.24416#bib.bib15),[28](https://arxiv.org/html/2606.24416#bib.bib16)\]\. Existing studies show that LLMs can interpret natural\-language intents, retrieve relevant evidence, and coordinate multiple tools for networking tasks\[[39](https://arxiv.org/html/2606.24416#bib.bib11),[26](https://arxiv.org/html/2606.24416#bib.bib12),[10](https://arxiv.org/html/2606.24416#bib.bib14)\]\. However, most existing works have focused on intent extraction, configuration assistance, or control\-plane orchestration\[[20](https://arxiv.org/html/2606.24416#bib.bib15),[28](https://arxiv.org/html/2606.24416#bib.bib16)\]\. Directly applying an LLM to generate beamformers is inappropriate, since these actions must satisfy strict real\-time and numerical feasibility requirements under instantaneous channel state information \(CSI\)\.

A more appropriate use case of agentic AI is to configure the fast timescale physical layer controllers, e\.g\., beamforming optimizers, by interpreting operator policies, summarizing network behavior, and reusing historical experience at a slower timescale\. Following a bilevel optimization structure, this can leverage agentic AI’s capability of interpreting natural\-language policies, reasoning about long\-term network behavior, and retrieving relevant historical configurations at the upper level to produce structured lower\-level problem configuration parameters that reflect the operator’s evolving policies and intents\. It is non\-trivial to establish a reliable interface between heterogeneous upper\-level policy and intent inputs, and executable lower\-level configuration parameters\. The reason is that an AI agent must produce a structured and feasible configuration at the upper level, while the quality of the configuration can only be assessed indirectly through the lower\-level responses accumulated over time\.

### I\-A Related Works

#### I\-A1 Physical\-Layer Configuration for Coordinated Wireless Systems

Significant effort has been devoted to physical\-layer optimization for coordinated wireless systems\. Multi\-cell cooperative transmission and coordinated beamforming were investigated in\[[9](https://arxiv.org/html/2606.24416#bib.bib17),[36](https://arxiv.org/html/2606.24416#bib.bib4)\] to mitigate inter\-cell interference and improve network utility\. Massive MIMO was studied in\[[25](https://arxiv.org/html/2606.24416#bib.bib19),[16](https://arxiv.org/html/2606.24416#bib.bib20)\] to provide a scalable architecture for high spectral efficiency\. Building on these advances, CF massive MIMO was introduced in\[[30](https://arxiv.org/html/2606.24416#bib.bib1)\], and studied from the perspectives of precoding and power control\[[29](https://arxiv.org/html/2606.24416#bib.bib22)\], user\-centric design and implementation\[[14](https://arxiv.org/html/2606.24416#bib.bib23),[1](https://arxiv.org/html/2606.24416#bib.bib24)\], and local/distributed processing\[[15](https://arxiv.org/html/2606.24416#bib.bib25)\]\. These works have typically assumed that the lower\-level objective and constraints are specified beforehand, and therefore cannot support beamforming problem settings in which operator policies, operating rules, and KPIs vary over time\.

#### I\-A2 Learning\-Based Wireless Control and Optimization Acceleration

Recent advances in learning\-based wireless control have reduced online complexity and improved adaptivity in dynamic environments\. The growing roles of DL and DRL in wireless communications and networking have been articulated in\[[21](https://arxiv.org/html/2606.24416#bib.bib5),[24](https://arxiv.org/html/2606.24416#bib.bib27),[22](https://arxiv.org/html/2606.24416#bib.bib29)\]\. Representative DRL\-based designs have been developed for dynamic multichannel access\[[41](https://arxiv.org/html/2606.24416#bib.bib30)\], online resource allocation\[[2](https://arxiv.org/html/2606.24416#bib.bib7)\], and dynamic beamforming design\[[40](https://arxiv.org/html/2606.24416#bib.bib9)\], typically relying on general\-purpose DRL frameworks\.

A large body of work has aimed to learn or accelerate structured optimization\. Deep neural networks were trained for wireless resource management in\[[38](https://arxiv.org/html/2606.24416#bib.bib36)\], sample\-efficient optimization with limited supervision was investigated in\[[35](https://arxiv.org/html/2606.24416#bib.bib37)\], and model\-driven DL for physical\-layer communications was reviewed in\[[12](https://arxiv.org/html/2606.24416#bib.bib38)\]\. Recently, deep unfolding and graph\-based unrolling have been applied to weighted minimum mean\-square error \(WMMSE\)\-type wireless operation optimization, including matrix\-inverse\-free unfolding\[[32](https://arxiv.org/html/2606.24416#bib.bib39)\], GNN\-assisted WMMSE unrolling for power allocation\[[4](https://arxiv.org/html/2606.24416#bib.bib40)\], deep graph unfolding for beamforming\[[5](https://arxiv.org/html/2606.24416#bib.bib41)\], and knowledge\-driven WMMSE\-unrolled resource allocation\[[43](https://arxiv.org/html/2606.24416#bib.bib42)\]\. Compared with pure model\-based optimization, these methods improve decision latency, approximation capability, and online deployability, especially when the optimization structure is known and stable\. Yet, they have been designed largely around pre\-specified utilities, rewards, or optimization templates\. They do not address how to interpret evolving policy intents and convert them into structured physical\-layer problem configurations\.

#### I\-A3 Intent\-Aware Networking and Agentic Wireless Control

A small but rapidly growing body of works has connected intent understanding, LLMs, and agentic AI with communication networks\. Early intent\-based networking studies established the concepts, abstractions, and service\-assurance mechanisms of intent\-driven control\[[31](https://arxiv.org/html/2606.24416#bib.bib43),[17](https://arxiv.org/html/2606.24416#bib.bib44)\], while LLM\-assisted intent extraction for 5G network management was explored in\[[23](https://arxiv.org/html/2606.24416#bib.bib48)\]\. These studies suggest that high\-level operator intent can be translated into structured network\-side semantics\. Recent works have increasingly shifted attention from intent parsing to broader LLM\-empowered network intelligence\[[11](https://arxiv.org/html/2606.24416#bib.bib49),[19](https://arxiv.org/html/2606.24416#bib.bib56)\]\. LLM\-based in\-context learning and optimization were explored for power control and resource allocation\[[45](https://arxiv.org/html/2606.24416#bib.bib54),[33](https://arxiv.org/html/2606.24416#bib.bib55)\]\. LLM\-assisted algorithm generation was explored for networking\[[13](https://arxiv.org/html/2606.24416#bib.bib50)\], and generic tool\-using and multi\-agent LLM paradigms were developed in\[[44](https://arxiv.org/html/2606.24416#bib.bib51),[34](https://arxiv.org/html/2606.24416#bib.bib52),[42](https://arxiv.org/html/2606.24416#bib.bib53)\], offering new building blocks for reasoning, tool invocation, and hierarchical autonomous control\.

Recent studies have also investigated agentic AI for open and intelligent RANs, including tool\-oriented agentic communications, autonomous control architectures for open 6G networks\[[7](https://arxiv.org/html/2606.24416#bib.bib57)\], multi\-scale agentic control and management for O\-RAN\[[27](https://arxiv.org/html/2606.24416#bib.bib58)\], conflict\-aware multi\-agentic rApp policy orchestration\[[18](https://arxiv.org/html/2606.24416#bib.bib59)\], and intent\-driven optimization for CF O\-RAN\[[37](https://arxiv.org/html/2606.24416#bib.bib60)\]\. These works have focused on architectural autonomy, orchestration, intent translation, or direct LLM\-assisted optimization\. It remains non\-trivial to enable reasoning\-capable agents to reliably translate evolving high\-level intents into configurations of structured physical\-layer controllers when the underlying objectives and constraints vary over time\.

### I\-B Contributions

This paper proposes a new framework, named Agentic long\-term performance optimization \(Agentic\-LTPO\), which adaptively updates the optimization configuration used to control the wireless physical layer when the network operator’s intents, policies, operating rules, and KPIs change over time\. The key contributions are summarized as follows:

- We formulate adaptive physical\-layer control under the operator’s changing policies and intents as a nested bilevel problem, where the upper and lower levels manage policy\-driven optimization configuration and instantaneous physical\-layer optimization, respectively\. Agentic\-LTPO decouples these levels into two timescales, allowing agentic AI to interpret the operator’s intents at a large timescale while preserving the optimality of the per\-slot optimization at the lower level\.
- Considering CF\-MIMO beamforming to showcase Agentic\-LTPO, we design a multi\-agent collaboration architecture for the upper level to convert policy inputs, environment summaries, and historical experience into physically feasible configuration parameters\. This is achieved through four agentic roles, including interpretation, observation, planning, and criticism, as well as a planner–critic refinement loop\.
- To anchor the upper\-level decisions on accumulated operating evidence \(as opposed to isolated generation\), we design a retrieval\-augmented generation \(RAG\) module that maintains a policy memory and a case memory to supply relevant empirical evidence during policy interpretation and configuration evaluation\.
- We reveal that the robust energy\-minimization problem of lower\-level beamforming admits a closed\-form worst\-case signal\-to\-interference\-plus\-noise ratio \(SINR\) bound under a zero\-forcing criterion, leading to an efficient per\-slot solver with global optimality and linear complexity for evaluating configurations generated by the upper level\.

Extensive experiments are conducted on CF\-MIMO beamforming systems under both random and piecewise\-stationary operator policy settings\. Agentic\-LTPO improves the cumulative communication utility by 57\.2% over the static baseline, confirming the benefit of adaptively updating lower\-level configurations under evolving operator policies\. We also examine the KPI responses and configuration trajectories across different policy regimes, and compare raw natural\-language policies with oracle structured\-policy inputs to evaluate the sensitivity of upper\-level decisions to language grounding\. It is demonstrated that Agentic\-LTPO translates policy regimes into interpretable configuration updates and target\-KPI responses while reducing the impact of language ambiguity\.

The rest of this paper is organized as follows\. Section II introduces the system model\. Section III formulates the two\-timescale Agentic\-LTPO problem and proposes the new upper\-level multi\-agent collaboration mechanism, the lower\-level optimization method, and the overall algorithm implementation\. Section IV provides the experimental results, followed by the conclusions in Section V\.

TABLE I: Notation and definitions\.

Notation: Lower\-case letters indicate scalars \(e\.g\., $x$\), boldface lower\-case letters indicate vectors \(e\.g\., $\mathbf{x}$\), and boldface upper\-case letters indicate matrices \(e\.g\., $\mathbf{X}$\)\. Calligraphic letters represent sets \(e\.g\., $\mathcal{K}$, $\mathcal{L}$\)\. $|\mathcal{S}|$ denotes the cardinality of a finite set $\mathcal{S}$\. $[\mathbf{x}]_{i}$ denotes the $i$\-th entry of a vector $\mathbf{x}$, and $[\mathbf{X}]_{i,j}$ denotes the $(i,j)$\-th entry of a matrix $\mathbf{X}$\. $(\cdot)^{H}$ denotes the Hermitian transpose\. $\|\mathbf{x}\|_{2}$ denotes the Euclidean norm of a vector $\mathbf{x}$, and $\|\mathbf{X}\|_{F}$ denotes the Frobenius norm of a matrix $\mathbf{X}$\. $\mathrm{diag}(\mathbf{x})$ denotes a diagonal matrix with diagonal elements given by vector $\mathbf{x}$\.

## II System Model

This section introduces the Agentic\-LTPO framework and its CF\-MIMO downlink beamforming example\.

![Refer to caption](https://arxiv.org/html/2606.24416v1/figs/sys1.png)

Figure 1: An illustration of the proposed Agentic\-LTPO framework\.

### II\-A Agentic\-LTPO Framework

Agentic\-LTPO admits a nested optimization framework, as depicted in Fig\. 1\. To rapidly adapt to operator policies and channels and make effective use of the autonomous decision\-making and reasoning abilities of agentic AI, the nested learning problem is formulated as

$$
\begin{aligned}
\min_{\bm{\pi}(\bm{x})}\quad & F(\bm{\pi}(\bm{x}),\bm{y}^{1:T})=\frac{1}{T}\sum_{t=1}^{T}w^{t}G^{t}(\bm{\pi}(\bm{x}),\bm{y}^{t}) \\
\text{s.t.}\quad & \bm{y}^{t}\in\operatorname{argmin}_{\bm{y}\in\mathcal{Y}}f^{t}(\bm{\pi}(\bm{x}),\bm{y}),\ \forall t=1,\cdots,T, \\
& g_{i}^{t}(\bm{\pi}(\bm{x}))\leq 0,\ h_{i}^{t}(\bm{\pi}(\bm{x}),\bm{y})\leq 0,\ \forall i=1,\cdots,I
\end{aligned}
\tag{1}
$$

where $\bm{x}\in\mathbb{R}^{d_{1}}$ denotes the input data evaluated under the decision policy $\bm{\pi}(\cdot)$ generated by agentic AI; $\bm{\pi}(\bm{x})\in\mathbb{R}^{d_{2}}$ and $\bm{y}\in\mathbb{R}^{d_{3}}$ are the upper\- and lower\-level decision variables, respectively; $\bm{y}^{1:T}$ denotes the sequence of instantaneous lower\-level responses across a time horizon of $T$ time slots, with the weight $w^{t}$ reflecting the long\-term priority of slot $t$; each time slot $t$ has its specific upper\- and lower\-level objective functions, $G^{t}$ and $f^{t}$, driven by rapidly varying communication demands; $g_{i}^{t}$ defines the feasible boundary of the high\-level policy space; and $h_{i}^{t}$ encapsulates the physical constraints accounting for operator policies and coupling Agentic AI’s long\-term intents with the instantaneous actions\.

Problem \([1](https://arxiv.org/html/2606.24416#S2.Ex1)\) provides an interface between a policy\-driven upper level and a latency\-sensitive lower level\. The upper\-level decision strategy $\bm{\pi}(\cdot)$ abstracts the long\-term intent and configuration of a communication task \(see Section III\-A\), and $\bm{\pi}(\bm{x})$ is updated based on aggregated observations at a large timescale interval\. The lower level computes the instantaneous response $\bm{y}^{t}$ by performing a per\-slot optimization subject to instantaneous constraints $g_{i}^{t}(\bm{\pi}(\bm{x}))$ and $h_{i}^{t}(\bm{\pi}(\bm{x}),\bm{y})$\.¹

¹ The nested problem construction of \([1](https://arxiv.org/html/2606.24416#S2.Ex1)\) is motivated by the stringent real\-time requirements of physical\-layer tasks: the per\-slot decision $\bm{y}^{t}$ must be produced slot by slot, whereas the decision policy $\bm{\pi}(\bm{x})$ is executed at the large timescale spanning many slots\.

To solve \([1](https://arxiv.org/html/2606.24416#S2.Ex1)\), in slot $t$, the following steps are executed:

- Upper\-level configuration based on Agentic AI: The upper level ingests the current policy profile $P$ from the operator and the long\-term performance summaries as input data $\bm{x}$, and generates the decision strategy $\bm{\pi}(\bm{x})$, which is broadcast to the lower level\.
- Lower\-level optimization based on fast solvers: Conditioned on $\bm{\pi}(\bm{x})$, the lower level computes $\bm{y}^{t}\in\arg\min_{\bm{y}\in\mathcal{Y}}f^{t}(\bm{\pi}(\bm{x}),\bm{y})$ subject to the system constraints with a designed fast solver to meet the per\-slot complexity and latency requirements\.
- Upper\-level update based on lower\-level feedback: After lower\-level optimization, structured feedback based on $\bm{y}^{t}$ is returned to the upper level to update the input $\bm{x}$ of its decision strategy and to compute $G^{t}\left(\bm{\pi}(\bm{x}),\bm{y}^{t}\right)$\.

Then, the next \($t+1$\)\-th time slot starts\. This repeats until the expiration of the time horizon $T$\.
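As a minimal illustration of the three steps above, the following Python sketch runs a two\-timescale loop in which the configuration is refreshed once every `Ts` slots and the lower level is solved in every slot\. The function names \(`run_two_timescale_loop`, `toy_upper`, `toy_lower`, `toy_cost`\), the dictionary layout of the configuration, and the scalar toy problem are assumptions made for illustration only; they are not the agents or the solver of the paper\. Refreshing the configuration once per interval follows the footnote on timescales\.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def run_two_timescale_loop(
    T: int,
    Ts: int,
    weights: List[float],
    upper: Callable[[List[float]], Config],
    lower: Callable[[Config, int], float],
    cost: Callable[[Config, float], float],
) -> Tuple[float, List[Config]]:
    """Return (F, configs): weighted average cost and the configurations used."""
    feedback: List[float] = []      # stands in for the input data x
    configs: List[Config] = []
    config: Config = {}
    total = 0.0
    for t in range(1, T + 1):
        if (t - 1) % Ts == 0:
            # Slow timescale: regenerate pi(x) from the accumulated feedback.
            config = upper(feedback)
            configs.append(config)
        # Fast timescale: per-slot response y^t under the current configuration.
        y_t = lower(config, t)
        g_t = cost(config, y_t)     # plays the role of G^t(pi(x), y^t)
        feedback.append(g_t)
        total += weights[t - 1] * g_t
    return total / T, configs


# Hypothetical toy instance: the lower level minimizes (y - gamma)^2.
def toy_upper(feedback: List[float]) -> Config:
    return {"gamma": 1.0 + 0.5 * len(feedback)}


def toy_lower(config: Config, t: int) -> float:
    return config["gamma"]          # closed-form minimizer of (y - gamma)^2


def toy_cost(config: Config, y: float) -> float:
    return y ** 2


F, configs = run_two_timescale_loop(4, 2, [1.0] * 4, toy_upper, toy_lower, toy_cost)
```

Passing the upper level, the lower level, and the cost as callables keeps the loop independent of how each level is realized, which mirrors the interface role of Problem \(1\)\.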

### II\-B CF\-MIMO Downlink Beamforming

An embodiment of the Agentic\-LTPO framework is instantiated in a CF\-MIMO scenario, which features a two\-timescale nested structure: Lower\-level beamforming must be optimized at each slot based on instantaneous \(often imperfect\) CSI, while the upper\-level configuration, including the target user QoS and AP power budgets, is adapted to operator policies and KPIs that evolve at a large timescale\. The beamforming vectors across all AP\-user pairs are coupled through inter\-user interference, per\-AP power constraints, and policy\-dependent QoS requirements\.

In the downlink CF\-MIMO system, a set of $L$ distributed APs, $\mathcal{L}\triangleq\{1,\ldots,L\}$, jointly serve a set of $K$ single\-antenna users, $\mathcal{K}\triangleq\{1,\ldots,K\}$\. Each AP $\ell\in\mathcal{L}$ is equipped with $M$ antennas\. All APs are connected to a central processing unit \(CPU\) through fronthaul with abundant bandwidth, enabling centralized coordination\[[30](https://arxiv.org/html/2606.24416#bib.bib1),[1](https://arxiv.org/html/2606.24416#bib.bib24)\]\. Agentic AI is deployed at the CPU side, e\.g\., on an edge/cloud platform co\-located with the CPU\. It ingests long\-term KPIs and policy inputs from the operator at the upper level, and updates the configuration passed to the lower\-level beamforming solvers at the APs\.

As described in Section II\-A, the system operates for $T$ time slots, indexed by $\mathcal{T}=\{1,\ldots,T\}$, which constitute $N$ upper\-level decision intervals\. Consider a block\-fading channel: the channel remains unchanged within a time slot and can vary independently across slots\. Let $\mathbf{h}_{\ell k}^{t}\in\mathbb{C}^{M}$ be the channel from AP $\ell$ to user $k$ in slot $t$\. The communication symbol $s_{k}^{t}$ destined for user $k$, with $\mathbb{E}[|s_{k}^{t}|^{2}]=1$, is precoded by the beamforming vector $\mathbf{w}_{\ell k}^{t}\in\mathbb{C}^{M}$ at AP $\ell$, so that the signal transmitted by AP $\ell$ is given by

$$
\mathbf{x}_{\ell}^{t}=\sum_{k=1}^{K}\mathbf{w}_{\ell k}^{t}s_{k}^{t},\ \forall t\in\mathcal{T}. \tag{2}
$$

At each time slot $t$, we design the downlink beamforming matrix $\mathbf{W}^{t}:=\big[\mathbf{w}_{1}^{t},\ldots,\mathbf{w}_{K}^{t}\big]\in\mathbb{C}^{LM\times K}$ with $\mathbf{w}_{k}^{t}\triangleq\big[(\mathbf{w}_{1k}^{t})^{\intercal},\ldots,(\mathbf{w}_{Lk}^{t})^{\intercal}\big]^{\intercal}$ to meet the long\-term objective dictated by the upper\-level policy specified by Agentic AI\. The beamforming decisions are optimized based on instantaneous CSI, under the guidance of the slowly varying KPIs and operating policies\.

### II\-C System Performance Metrics

We introduce the performance metrics of tasks to support the subsequent nested optimization problem formulation\.

#### II\-C1 Quality of service

In the downlink CF\-MIMO system, the received signal of user $k$ at time slot $t$ is given by

$$
y_{k}^{t}=\underbrace{\left(\sum_{\ell=1}^{L}(\mathbf{h}_{\ell k}^{t})^{H}\mathbf{w}_{\ell k}^{t}\right)s_{k}^{t}}_{\text{desired}}+\underbrace{\sum_{j\neq k}\left(\sum_{\ell=1}^{L}(\mathbf{h}_{\ell k}^{t})^{H}\mathbf{w}_{\ell j}^{t}\right)s_{j}^{t}}_{\text{multiuser interference}}+n_{k}^{t}, \tag{3}
$$

where $n_{k}^{t}\sim\mathcal{CN}(0,(\sigma_{k}^{t})^{2})$ is the additive circularly symmetric complex Gaussian \(CSCG\) noise at user $k$ in slot $t$, with variance $(\sigma_{k}^{t})^{2}$\. Then, the SINR of user $k$ at slot $t$ is

$$
\mathrm{SINR}_{k}^{t}=\frac{\left|\sum_{\ell=1}^{L}\left(\mathbf{h}_{\ell k}^{t}\right)^{H}\mathbf{w}_{\ell k}^{t}\right|^{2}}{\sum_{j\neq k}\left|\sum_{\ell=1}^{L}\left(\mathbf{h}_{\ell k}^{t}\right)^{H}\mathbf{w}_{\ell j}^{t}\right|^{2}+(\sigma_{k}^{t})^{2}}. \tag{4}
$$

We can use SINR as a performance metric to measure the QoS of a user in the system\.
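The SINR expression in \(4\) can be evaluated for all users at once with a single matrix product\. The NumPy sketch below assumes that the per\-AP channels and beamformers of one slot are stacked into $LM\times K$ matrices, as in the definition of $\mathbf{W}^{t}$; the function name `sinr` and the argument layout are illustrative choices, not part of the paper\.

```python
import numpy as np


def sinr(H: np.ndarray, W: np.ndarray, sigma2: np.ndarray) -> np.ndarray:
    """Per-user SINR for one slot.

    H, W   : (L*M, K) stacked channels and beamformers (column k = user k).
    sigma2 : (K,) noise variances.
    """
    # G[k, j] = h_k^H w_j, i.e. the sum over APs of h_{lk}^H w_{lj}.
    G = H.conj().T @ W
    gains = np.abs(G) ** 2
    signal = np.diag(gains)                     # |h_k^H w_k|^2
    interference = gains.sum(axis=1) - signal   # sum over j != k of |h_k^H w_j|^2
    return signal / (interference + sigma2)


# Small hypothetical example with L*M = 2 stacked antennas and K = 2 users.
H_demo = np.eye(2, dtype=complex)
W_demo = np.array([[2.0, 1.0], [0.0, 1.0]], dtype=complex)
sinr_demo = sinr(H_demo, W_demo, np.ones(2))
```

Stacking the APs turns the inner sum over $\ell$ into an ordinary inner product, so no explicit loop over APs or users is needed\.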

#### II\-C2 Sum\-rate

The system can be subject to a network\-level performance requirement\. We characterize such a requirement using the sum\-rate, defined at slot $t$ as

$$
R^{t}=\sum_{k=1}^{K}\log_{2}\!\left(1+\mathrm{SINR}_{k}^{t}\right). \tag{5}
$$

#### II\-C3 Total transmit power

At time slot $t$, the total transmit power of the downlink CF\-MIMO system is defined as the total transmit power of all APs to all users, i\.e\.,

$$
P_{\mathrm{tot}}^{t}=\sum_{\ell=1}^{L}\sum_{k=1}^{K}\big\|\mathbf{w}_{\ell k}^{t}\big\|_{2}^{2}. \tag{6}
$$

#### II\-C4 Energy efficiency

The energy efficiency of the system in time slot $t$ is defined as the ratio between the achieved sum\-rate and the total transmit power, i\.e\.,

$$
\mathrm{EE}^{t}=\frac{R^{t}}{P_{\mathrm{tot}}^{t}}, \tag{7}
$$

which quantifies how efficiently the transmit power is converted to the throughput of the system\.
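The three network\-level metrics in \(5\)\-\(7\) follow directly from the per\-user SINRs and the stacked beamforming matrix\. The sketch below is a minimal NumPy illustration; the function names and the explicit guard against zero transmit power are assumptions added for the example, since \(7\) is undefined when no power is transmitted\.

```python
import numpy as np


def sum_rate(sinr: np.ndarray) -> float:
    """Eq. (5): sum over users of log2(1 + SINR_k)."""
    return float(np.sum(np.log2(1.0 + sinr)))


def total_power(W: np.ndarray) -> float:
    """Eq. (6): sum of squared norms of all w_{lk}, i.e. ||W||_F^2."""
    return float(np.sum(np.abs(W) ** 2))


def energy_efficiency(sinr: np.ndarray, W: np.ndarray) -> float:
    """Eq. (7): sum-rate divided by total transmit power."""
    p_tot = total_power(W)
    if p_tot == 0.0:
        # Assumption for this sketch: treat an all-zero beamformer as an error.
        raise ValueError("total transmit power is zero")
    return sum_rate(sinr) / p_tot


# Hypothetical values: two users with SINRs 1 and 3, unit-norm beamformers.
sinr_demo = np.array([1.0, 3.0])
W_demo = np.array([[1.0, 0.0], [0.0, 1.0j]])
ee_demo = energy_efficiency(sinr_demo, W_demo)
```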

#### II\-C5 Agentic AI energy consumption

We quantify the inference energy proxy of the upper\-level agentic AI, denoted by $J_{\mathrm{ag}}^{(n)}$, by accounting only for the external LLM calls involved in generating the upper\-level decision within the $n$\-th large timescale interval\. The energy proxy is defined as

$$
J_{\mathrm{ag}}^{(n)}=\sum_{q\in\mathcal{Q}^{(n)}}e\!\left(m_{q},\tau_{q,\mathrm{in}},\tau_{q,\mathrm{out}}\right),\ n=0,1,\cdots,N-1 \tag{8}
$$

where $\mathcal{Q}^{(n)}$ denotes the set of such external LLM calls in interval $n$, $m_{q}$ denotes the model/service type used by call $q$, and $\tau_{q,\mathrm{in}}$ and $\tau_{q,\mathrm{out}}$ denote the corresponding input and output token counts, respectively\. $e(m,\tau_{\mathrm{in}},\tau_{\mathrm{out}})$ gives the energy proxy of one external call under model/service type $m$\.

Following workload\-aware LLM energy models\[[3](https://arxiv.org/html/2606.24416#bib.bib61)\], a practical proxy form is given by

$$
e\!\left(m,\tau_{\mathrm{in}},\tau_{\mathrm{out}}\right)=\alpha_{m,0}\tau_{\mathrm{in}}+\alpha_{m,1}\tau_{\mathrm{out}}+\alpha_{m,2}\tau_{\mathrm{in}}\tau_{\mathrm{out}}, \tag{9}
$$

where $\alpha_{m,0},\alpha_{m,1},\alpha_{m,2}\geq 0$ are model\-dependent proxy coefficients\. This construction accounts for the fact that the energy cost of an external LLM call depends not only on the input and output token lengths, but also on their interaction\.
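The proxy in \(8\) and \(9\) amounts to a per\-call bilinear cost summed over the calls of one interval\. The following Python sketch shows one way to compute it; the class `ProxyCoeffs`, the model label `"small"`, and all coefficient and token values are hypothetical placeholders, since the paper's coefficient values are not given in this section\.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class ProxyCoeffs:
    """Nonnegative coefficients (alpha_{m,0}, alpha_{m,1}, alpha_{m,2}) of one model type."""
    a0: float
    a1: float
    a2: float


def call_energy(c: ProxyCoeffs, tau_in: int, tau_out: int) -> float:
    """Eq. (9): linear terms in the token counts plus their interaction."""
    return c.a0 * tau_in + c.a1 * tau_out + c.a2 * tau_in * tau_out


def interval_energy(
    calls: Iterable[Tuple[str, int, int]],
    coeffs: Dict[str, ProxyCoeffs],
) -> float:
    """Eq. (8): sum the proxy over all (model, tau_in, tau_out) calls in one interval."""
    return sum(call_energy(coeffs[m], t_in, t_out) for m, t_in, t_out in calls)


# Hypothetical coefficients and call log for a single interval.
coeffs_demo = {"small": ProxyCoeffs(a0=1.0, a1=2.0, a2=0.5)}
calls_demo = [("small", 10, 4), ("small", 2, 2)]
j_ag_demo = interval_energy(calls_demo, coeffs_demo)
```

Keeping the coefficients in a per\-model table makes it straightforward to mix calls to different model/service types within the same interval\.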

### II\-D Constraints of Lower\-Level Beamforming Design

The lower\-level beamforming is executed per time slot $t\in\mathcal{T}$ based on the instantaneous CSI and the configuration provided by the upper level\. The CSI available at the APs can be imperfect due to estimation errors and calibration mismatches\. To model CSI uncertainty, we adopt

$$
\mathbf{h}_{\ell k}^{t}=\widehat{\mathbf{h}}_{\ell k}^{t}+\Delta\mathbf{h}_{\ell k}^{t},\quad\forall\ell\in\mathcal{L},k\in\mathcal{K},t\in\mathcal{T}, \tag{10}
$$

where $\widehat{\mathbf{h}}_{\ell k}^{t}$ is the estimated CSI, and $\Delta\mathbf{h}_{\ell k}^{t}$ is the unknown CSI error\. Assume the CSI error is norm\-bounded:

$$
\left\|\Delta\mathbf{h}_{\ell k}^{t}\right\|_{2}\leq\delta_{\ell k}^{t},\quad\forall\ell\in\mathcal{L},k\in\mathcal{K},t\in\mathcal{T}, \tag{11}
$$

where $\delta_{\ell k}^{t}$ denotes the corresponding error radius\. Perfect CSI is a special case of this model, obtained by setting the error radius to $\delta_{\ell k}^{t}=0,\forall\ell\in\mathcal{L},k\in\mathcal{K}$\.

Under imperfect CSI, we define the worst\-case \(robust\) SINR of userkkin time slotttas

SINRkt,wc=inf‖Δ𝐡ℓkt‖≤δℓkt,∀ℓ∈ℒ\\displaystyle\\mathrm\{SINR\}\_\{k\}^\{t,\\mathrm\{wc\}\}\\\!=\\\!\\\!\\inf\_\{\\\|\\Delta\{\\bf h\}\_\{\\ell k\}^\{t\}\\\|\\leq\\delta\_\{\\ell k\}^\{t\},\\forall\\ell\\in\\mathcal\{L\}\}SINRkt\(\{𝐡^ℓkt\+Δ𝐡ℓkt\}ℓ=1L,\{𝐰ℓjt\}\),\\displaystyle\\mathrm\{SINR\}\_\{k\}^\{t\}\\\!\\left\(\\\!\\\!\\\{\\hat\{\\bf h\}\_\{\\ell k\}^\{t\}\\\!\+\\\!\\\!\\Delta\{\\bf h\}\_\{\\ell k\}^\{t\}\\\}\_\{\\ell=1\}^\{L\},\\\{\{\\bf w\}\_\{\\ell j\}^\{t\}\\\}\\\!\\\!\\right\),∀k∈𝒦,t∈𝒯,\\displaystyle\\forall k\\in\\mathcal\{K\},t\\in\\mathcal\{T\},\(12\)where𝐡^ℓkt\\hat\{\\bf h\}\_\{\\ell k\}^\{t\}is the estimated CSI from APℓ\\ellto userkkin slottt\. Accordingly, the worst\-case \(robust\) QoS requirement is compactly written as

$$
\mathrm{SINR}_{k}^{t,\mathrm{wc}}\geq\Gamma_{k}^{t},\quad\forall k\in\mathcal{K},\ t\in\mathcal{T}. \tag{13}
$$

We also consider the per-AP transmit power constraints in each slot. Each AP $\ell$ has a transmit power budget $P_{\ell}^{\max}$, i.e.,

$$
P_{\ell}^{t}\triangleq\mathbb{E}\left\{\left\|\mathbf{x}_{\ell}^{t}\right\|_{2}^{2}\right\}=\sum_{k=1}^{K}\left\|\mathbf{w}_{\ell k}^{t}\right\|_{2}^{2}\leq\phi_{\ell}^{t}P_{\ell}^{\max},\quad\forall\ell\in\mathcal{L},\ t\in\mathcal{T}, \tag{14}
$$

where $\phi_{\ell}^{t}\in[0,1]$ denotes the power-budget exposure factor of AP $\ell$ at slot $t$, which specifies the fraction of the transmit power budget $P_{\ell}^{\max}$ made available to beamforming. If $\phi_{\ell}^{t}=0$, AP $\ell$ is deactivated in slot $t$; if $\phi_{\ell}^{t}=1$, its full transmit power budget $P_{\ell}^{\max}$ is available for beamforming.
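As a minimal sketch of the per-AP constraint in (14), the following Python snippet computes $P_{\ell}^{t}$ from the beamformers of one AP and checks it against the exposed budget $\phi_{\ell}^{t}P_{\ell}^{\max}$. The function names, the list-of-lists beamformer layout, and the numerical slack `tol` are illustrative assumptions and are not taken from the paper.

```python
from typing import List

Beam = List[complex]  # one beamformer w_{l,k}^t with M antenna entries (assumed layout)


def ap_transmit_power(beams: List[Beam]) -> float:
    """P_l^t in (14): sum over users of the squared 2-norm of each beamformer."""
    return sum(abs(x) ** 2 for w in beams for x in w)


def power_budget_ok(beams: List[Beam], phi: float, p_max: float,
                    tol: float = 1e-9) -> bool:
    """Check P_l^t <= phi * P_max; `tol` is an assumed numerical slack."""
    if not 0.0 <= phi <= 1.0:
        raise ValueError("exposure factor phi must lie in [0, 1]")
    return ap_transmit_power(beams) <= phi * p_max + tol


# Hypothetical AP with M = 2 antennas serving K = 2 users.
example_beams = [[1 + 0j, 0j], [0j, 1j]]
```

With `phi = 0`, only the all-zero beamformer passes the check, which matches the interpretation of a deactivated AP.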

## III Proposed Agentic-LTPO Framework

In this section, we apply Agentic\-LTPO to solve the CF\-MIMO downlink beamforming task\. Based on the nested framework in \(1\), we elaborate on the upper\-level optimization, including the multi\-agent structure and the RAG module, followed by the lower\-level beamforming with a fast solver\.

### III-A Problem Formulation

We formulate the nested optimization problem of the CF-MIMO downlink beamforming at two timescales. The pre-specified time horizon $T$ is first partitioned into $N$ large-timescale intervals, each containing $T_{s}$ slots, i.e., $T=NT_{s}$. The index set of the $n$-th large interval, $\mathcal{T}^{(n)}$, is defined as

$$
\mathcal{T}^{(n)}\triangleq\{nT_{s}+1,nT_{s}+2,\ldots,(n+1)T_{s}\}, \tag{15}
$$

where $n=0,1,\cdots,N-1$.
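The two-timescale indexing in (15) can be sketched as follows. The helper names are hypothetical, and the inverse map `interval_of_slot` is an added convenience that follows directly from (15) rather than something stated in the paper.

```python
def interval_slots(n: int, ts: int) -> range:
    """Slot indices of the n-th large interval per (15): n*Ts+1, ..., (n+1)*Ts."""
    if n < 0 or ts <= 0:
        raise ValueError("need n >= 0 and Ts > 0")
    return range(n * ts + 1, (n + 1) * ts + 1)


def interval_of_slot(t: int, ts: int) -> int:
    """Index n of the large interval that contains slot t (slots start at 1)."""
    if t < 1 or ts <= 0:
        raise ValueError("need t >= 1 and Ts > 0")
    return (t - 1) // ts
```

For instance, with $T_{s}=4$, interval $n=2$ covers slots 9 to 12, and slot 9 maps back to $n=2$.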

At the beginning of the $n$-th large interval, the upper-level Agentic AI deployed at the CPU outputs a configuration vector $\bm{c}^{(n)}$ based on its overall decision strategy $\bm{\pi}$, as given by

$$
\bm{c}^{(n)}=\bm{\pi}\left(S_{\mathrm{env}}^{(n)},P^{(n)},\mathcal{E}^{(n)}\right)=\Big(\{\Gamma_{k}^{(n)}\}_{k\in\mathcal{K}},\{\phi_{\ell}^{(n)}\}_{\ell\in\mathcal{L}}\Big), \tag{16}
$$

where $S_{\mathrm{env}}^{(n)}$ denotes the environment summary of the previous interval; $P^{(n)}$ represents the operator’s current policy profile, expressed in natural language; $\mathcal{E}^{(n)}$ denotes the cross-timescale experience maintained by the upper level; $\Gamma_{k}^{(n)}$ is the corresponding target level of the worst-case (robust) QoS requirement of user $k$, cf. ([13](https://arxiv.org/html/2606.24416#S2.E13)); and $\phi_{\ell}^{(n)}$ is the power-budget exposure factor of AP $\ell$, cf. ([14](https://arxiv.org/html/2606.24416#S2.E14)). As clarified in Section III-B, $\mathcal{E}^{(n)}$ provides the base historical tuples, from which the case memory used by the RAG module is constructed.

As specified in ([16](https://arxiv.org/html/2606.24416#S3.E16)), the configuration vector $\bm{c}^{(n)}$ remains unchanged over slots $t\in\mathcal{T}^{(n)}$ of the $n$-th large interval. The nested optimization problem can be formulated as

$$
\begin{aligned}
\max_{\bm{\pi}}\quad & \sum_{n=0}^{N-1}G\left(\mathrm{KPI}^{(n)}\right) && \text{(17a)}\\
\text{s.t.}\quad & \bm{c}^{(n)}=\bm{\pi}\left(S_{\mathrm{env}}^{(n)},P^{(n)},\mathcal{E}^{(n)}\right),\ \forall n, && \text{(17b)}\\
& \mathbf{W}^{t,*}\big(\bm{c}^{(n)}\big)\in\arg\min_{\{\mathbf{w}_{\ell k}^{t}\}}\sum_{\ell\in\mathcal{L}}\sum_{k\in\mathcal{K}}\big\|\mathbf{w}_{\ell k}^{t}\big\|_{2}^{2}, && \text{(17c)}\\
& \quad\text{s.t.}\ \sum_{k\in\mathcal{K}}\big\|\mathbf{w}_{\ell k}^{t}\big\|_{2}^{2}\leq\phi_{\ell}^{(n)}P_{\ell}^{\max},\ \forall t\in\mathcal{T}^{(n)},\ \ell\in\mathcal{L}, && \text{(17d)}\\
& \quad\phantom{\text{s.t.}}\ \mathrm{SINR}_{k}^{t,\mathrm{wc}}\geq\Gamma_{k}^{(n)},\ \forall t\in\mathcal{T}^{(n)},\ k\in\mathcal{K}. && \text{(17e)}
\end{aligned}
$$

In ([17](https://arxiv.org/html/2606.24416#S3.E17)a), $G(\cdot)$ is a long-term performance function specified below. ([17](https://arxiv.org/html/2606.24416#S3.E17)b) is the configuration vector for the lower-level optimization; see ([16](https://arxiv.org/html/2606.24416#S3.E16)). ([17](https://arxiv.org/html/2606.24416#S3.E17)c)–([17](https://arxiv.org/html/2606.24416#S3.E17)e) represent the lower-level problem executed at slot $t$, namely, energy-minimizing beamforming in ([17c](https://arxiv.org/html/2606.24416#S3.E17.3)) under the per-AP power constraint ([17](https://arxiv.org/html/2606.24416#S3.E17)d) and the worst-case (robust) QoS constraint ([17](https://arxiv.org/html/2606.24416#S3.E17)e).

The upper-level objective ([17](https://arxiv.org/html/2606.24416#S3.E17)a) is based on the performance metrics described in Section II-C, and the computation cost from the upper-level Agentic AI. With the energy proxy $J_{\mathrm{ag}}^{(n)}$ of the upper-level Agentic AI, $\mathrm{KPI}^{(n)}$ can be defined as

$$
\mathrm{KPI}^{(n)}=\left(\{R^{t}\}_{t\in\mathcal{T}^{(n)}},\{\mathrm{EE}^{t}\}_{t\in\mathcal{T}^{(n)}},J_{\mathrm{ag}}^{(n)}\right),\ \forall n, \tag{18}
$$

where $\{R^{t},\mathrm{EE}^{t}\}_{t\in\mathcal{T}^{(n)}}$ collects the sum-rate and energy-efficiency values achieved in the $n$-th large interval $\mathcal{T}^{(n)}$. Given $\mathrm{KPI}^{(n)}$, we design the upper-level objective as

$$
G\big(\mathrm{KPI}^{(n)}\big)=\frac{\lambda_{R}}{T_{s}}\sum_{t\in\mathcal{T}^{(n)}}\frac{R^{t}}{R_{\mathrm{ref}}}+\frac{\lambda_{\mathrm{EE}}}{T_{s}}\sum_{t\in\mathcal{T}^{(n)}}\frac{\mathrm{EE}^{t}}{\mathrm{EE}_{\mathrm{ref}}}-\lambda_{J}\frac{J_{\mathrm{ag}}^{(n)}}{J_{\mathrm{ref}}}, \tag{19}
$$

where $R_{\mathrm{ref}}$, $\mathrm{EE}_{\mathrm{ref}}$, and $J_{\mathrm{ref}}$ are pre-specified positive normalization constants that rescale the different KPI components into comparable quantities. $\lambda_{R}$, $\lambda_{\mathrm{EE}}$, and $\lambda_{J}$ are positive weighting coefficients that balance the contributions of the KPIs. ([19](https://arxiv.org/html/2606.24416#S3.E19)) defines a long-term utility that balances communication performance and decision complexity, encouraging policies that achieve high spectral and energy efficiency while avoiding unnecessary upper-level reasoning overhead.
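The per-interval utility in (19) can be evaluated with a few lines of Python, as sketched below. The function and argument names are illustrative, and the default weights of 1.0 are placeholders chosen only for this example; the paper does not specify them in this section.

```python
from typing import Sequence


def long_term_utility(rates: Sequence[float], ees: Sequence[float], j_ag: float,
                      r_ref: float, ee_ref: float, j_ref: float,
                      lam_r: float = 1.0, lam_ee: float = 1.0,
                      lam_j: float = 1.0) -> float:
    """G(KPI^(n)) in (19) for one large interval with Ts = len(rates) slots."""
    ts = len(rates)
    if ts == 0 or len(ees) != ts:
        raise ValueError("rates and ees must be non-empty and of equal length")
    if min(r_ref, ee_ref, j_ref) <= 0:
        raise ValueError("normalization constants must be positive")
    rate_term = lam_r * sum(rates) / (ts * r_ref)      # normalized mean sum-rate
    ee_term = lam_ee * sum(ees) / (ts * ee_ref)        # normalized mean energy efficiency
    cost_term = lam_j * j_ag / j_ref                   # normalized agentic inference cost
    return rate_term + ee_term - cost_term
```

For example, two slots with rates (10, 20), energy efficiencies (2, 4), $J_{\mathrm{ag}}=5$, and references (10, 2, 10) give $1.5+1.5-0.5=2.5$ under unit weights.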

Problem ([17](https://arxiv.org/html/2606.24416#S3.E17)) captures the timescale coupling in Agentic-LTPO. At the large timescale, the upper level updates the configuration vector once per large interval to maximize the long-term utility $G\big(\mathrm{KPI}^{(n)}\big)$, which balances the sum-rate, energy efficiency, and agentic AI inference cost. At the small timescale, the lower level allows the APs to make beamforming decisions slot-by-slot under imperfect instantaneous CSI in a distributed fashion, subject to the policy constraints specified by the upper-level configuration.

### III-B Upper-Level Optimization with Multi-Agent Collaboration

As shown in ([17](https://arxiv.org/html/2606.24416#S3.E17)), the upper level determines the configuration vector $\bm{c}^{(n)}$ for the $n$-th large interval from the environment summary $S_{\mathrm{env}}^{(n)}$, the operator’s policy profile $P^{(n)}$, and the accumulated experience buffer $\mathcal{E}^{(n)}$. (Footnote 2: This mapping is non-trivial since the upper level must convert these heterogeneous, cross-timescale, and partially semantic inputs into a structured configuration vector, whereas the quality of the configuration can only be assessed indirectly through the lower-level responses over $t\in\mathcal{T}^{(n)}$ and the resulting KPI aggregation in ([18](https://arxiv.org/html/2606.24416#S3.E18)).)

![Refer to caption](https://arxiv.org/html/2606.24416v1/figs/sys2_new.png)

Figure 2: The upper-level multi-agent architecture of Agentic-LTPO for CF-MIMO long-term performance optimization.

We propose a coordinated multi-agent architecture, which leverages LLM capabilities, including long-term planning, structured policy interpretation, and retrieval-enhanced reasoning, to process the heterogeneous inputs efficiently, as illustrated in Fig. [2](https://arxiv.org/html/2606.24416#S3.F2). To realize the mapping in ([16](https://arxiv.org/html/2606.24416#S3.E16)) in a structured and interpretable manner, we adopt four coordinated LLM agents, namely, Policy Interpreter $A_{\mathrm{int}}$, Network Observer $A_{\mathrm{obs}}$, Performance Critic $A_{\mathrm{crit}}$, and Configuration Planner $A_{\mathrm{plan}}$, together with a RAG module.

At the beginning of the $n$-th large interval, the decision $\bm{c}^{(n)}$ is generated from the heterogeneous input $(S_{\mathrm{env}}^{(n)},P^{(n)},\mathcal{E}^{(n)})$. To construct the information flow across intervals, we collect the per-slot records of the previous interval, as given by

$$
\mathcal{D}^{(n-1)}=\left\{\left(\hat{\mathbf{H}}^{t},\mathbf{W}^{t,*}\big(\bm{c}^{(n-1)}\big),\mathbf{u}^{t}\right)\right\}_{t\in\mathcal{T}^{(n-1)}}, \tag{20}
$$

where $\hat{\mathbf{H}}^{t}\triangleq[\hat{\mathbf{h}}_{1}^{t},\ldots,\hat{\mathbf{h}}_{K}^{t}]\in\mathbb{C}^{LM\times K}$ is the estimated channel matrix in slot $t$ with $\hat{\mathbf{h}}_{k}^{t}\triangleq[(\hat{\mathbf{h}}_{1k}^{t})^{\intercal},\ldots,(\hat{\mathbf{h}}_{Lk}^{t})^{\intercal}]^{\intercal}$, and $\mathbf{u}^{t}\triangleq\big(\{\mathrm{SINR}_{k}^{t,\mathrm{wc}}\}_{k\in\mathcal{K}},\{\frac{P_{\ell}^{t}}{P_{\ell}^{\max}}\}_{\ell\in\mathcal{L}},R^{t},\mathrm{EE}^{t}\big)$ collects the per-slot KPI measurements after deploying $\bm{c}^{(n-1)}$, with $P_{\ell}^{t}=\sum_{k\in\mathcal{K}}\|\mathbf{w}_{\ell k}^{t}\|_{2}^{2}$.

Meanwhile, the historical experience $\mathcal{E}^{(n)}$ available to the upper level at the beginning of interval $n$ is expressed as

$$
\mathcal{E}^{(n)}\triangleq\left\{\left(S_{\mathrm{env}}^{(i)},\bm{c}^{(i)},\mathrm{KPI}^{(i)}\right)\right\}_{i=0}^{n-1}, \tag{21}
$$

where $\mathcal{E}^{(n)}$ serves as the cross-timescale experience used in the upper level. For retrieval, its stored tuples are augmented in Section III-B.4) to construct the corresponding case memory.

#### III-B1 Policy Interpreter $A_{\mathrm{int}}$

The policy input $P^{(n)}$ from the operator is typically specified in natural language, e.g., operation guidelines, service requirements, or deployment rules, which are not directly executable by the agents since they do not provide an explicit representation of decision-related fields [[23](https://arxiv.org/html/2606.24416#bib.bib48)]. We invoke the Policy Interpreter $A_{\mathrm{int}}$ to convert $P^{(n)}$ into a structured policy context in JSON format. $A_{\mathrm{int}}$ first produces a preliminary structured policy $\tilde{P}^{(n)}$ from $P^{(n)}$, and then refines it with the retrieved policy evidence set $\mathcal{Z}_{\mathrm{pol}}^{(n)}$ from the RAG module; see Section III-B.4). The final structured policy context is given by

$$
\bar{P}^{(n)}=A_{\mathrm{int}}\big(P^{(n)},\mathcal{Z}_{\mathrm{pol}}^{(n)}\big), \tag{22}
$$

where $\bar{P}^{(n)}$ is a machine-readable policy object organized as typed key–value entries. It encodes the policy semantics relevant to the configurations in ([16](https://arxiv.org/html/2606.24416#S3.E16)), including the QoS requirements $\{\Gamma_{k}^{(n)}\}_{k\in\mathcal{K}}$, the power-budget factors $\{\phi_{\ell}^{(n)}\}_{\ell\in\mathcal{L}}$, and the feasible decision domain. The retrieved evidence in $\mathcal{Z}_{\mathrm{pol}}^{(n)}$ enhances the current policy input by stabilizing field extraction, resolving semantically similar policies across intervals, and improving the consistency of the admissible bounds for the robust QoS targets $\{\Gamma_{k}\}_{k\in\mathcal{K}}$ and the AP power-budget factors $\{\phi_{\ell}\}_{\ell\in\mathcal{L}}$. Notably, $\tilde{P}^{(n)}$ is the preliminary parse of the current operator policy; $\bar{P}^{(n)}$ is the evidence-calibrated policy object used by the downstream agents. $\tilde{P}^{(n)}$ and $\bar{P}^{(n)}$ follow the same structured-policy schema.

To make the policy constraints explicit, $A_{\mathrm{int}}$ generates the feasible configuration set from the structured policy context $\bar{P}^{(n)}$, as given by

$$
\begin{aligned}
\mathcal{C}\big(\bar{P}^{(n)}\big)=\Big\{\big(\{\Gamma_{k}\}_{k\in\mathcal{K}},\{\phi_{\ell}\}_{\ell\in\mathcal{L}}\big):\ &\Gamma_{k}\in\big[\Gamma_{k}^{\min}(\bar{P}^{(n)}),\Gamma_{k}^{\max}(\bar{P}^{(n)})\big],\ \forall k\in\mathcal{K},\\
&\phi_{\ell}\in\big[\phi_{\ell}^{\min}(\bar{P}^{(n)}),\phi_{\ell}^{\max}(\bar{P}^{(n)})\big],\ \forall\ell\in\mathcal{L}\Big\},
\end{aligned} \tag{23}
$$

where the bounds are determined from the JSON-structured policy context $\bar{P}^{(n)}$. In particular, the bounds on $\{\Gamma_{k}\}_{k\in\mathcal{K}}$ specify the admissible robust QoS targets under the operator’s current intent; the bounds on $\{\phi_{\ell}\}_{\ell\in\mathcal{L}}$ specify the admissible AP-side power-budget activation range.

Notably, $A_{\mathrm{int}}$ removes the ambiguity of $P^{(n)}$ and exposes the policy information in a standardized schema that can be accessed by the other agents, i.e., $A_{\mathrm{obs}}$, $A_{\mathrm{plan}}$, and $A_{\mathrm{crit}}$. Meanwhile, $A_{\mathrm{int}}$ links the operator’s intent to the subsequent upper-level optimization, enabling $A_{\mathrm{obs}}$, $A_{\mathrm{plan}}$, $A_{\mathrm{crit}}$, and the RAG module to operate on the same structured policy $\bar{P}^{(n)}$. When policy evidence is retained from the RAG module, $A_{\mathrm{int}}$ uses the retrieved policy evidence to refine $\tilde{P}^{(n)}$ and outputs $\bar{P}^{(n)}$.

#### III-B2 Network Observer $A_{\mathrm{obs}}$

The role of $A_{\mathrm{obs}}$ is to translate the performance of the previous large interval into a compact state representation usable by the current upper-level optimization. Based on $\mathcal{D}^{(n-1)}$, $\bm{c}^{(n-1)}$, and $\bar{P}^{(n)}$ from $A_{\mathrm{int}}$, the network observer $A_{\mathrm{obs}}$ produces the environment summary once per large interval, as given by

$$
S_{\mathrm{env}}^{(n)}=A_{\mathrm{obs}}\left(\mathcal{D}^{(n-1)},\bm{c}^{(n-1)},\bar{P}^{(n)}\right)=\Big(\mathbf{s}_{\mathrm{op}}^{(n)},\mathbf{s}_{\mathrm{diag}}^{(n)}\Big), \tag{24}
$$

where $\mathbf{s}_{\mathrm{op}}^{(n)}\triangleq\left(\{\Gamma_{k,\mathrm{obs}}^{(n)}\}_{k\in\mathcal{K}},\{\phi_{\ell,\mathrm{obs}}^{(n)}\}_{\ell\in\mathcal{L}}\right)$ with $\Gamma_{k,\mathrm{obs}}^{(n)}=\min_{t\in\mathcal{T}^{(n-1)}}\mathrm{SINR}_{k}^{t,\mathrm{wc}}$, $\forall k\in\mathcal{K}$, and $\phi_{\ell,\mathrm{obs}}^{(n)}=\max_{t\in\mathcal{T}^{(n-1)}}\frac{P_{\ell}^{t}}{P_{\ell}^{\max}}$, $\forall\ell\in\mathcal{L}$. Particularly, $\Gamma_{k,\mathrm{obs}}^{(n)}$ and $\phi_{\ell,\mathrm{obs}}^{(n)}$ are user $k$’s worst-case (robust) QoS level and AP $\ell$’s largest normalized power usage, respectively. $\mathbf{s}_{\mathrm{diag}}^{(n)}$ is a concise semantic diagnosis generated from the same inputs as $\mathbf{s}_{\mathrm{op}}^{(n)}$, which summarizes the causes behind the observed operating bottlenecks and the resulting KPI profile, e.g., which users repeatedly approach their QoS limits and which APs frequently operate close to their power budgets.
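The numerical part $\mathbf{s}_{\mathrm{op}}^{(n)}$ of the summary in (24) reduces to a per-user minimum and a per-AP maximum over the previous interval, as the following sketch illustrates. The nested-list layout (`sinr_wc[t][k]`, `power[t][l]`) and the function name are assumptions made for this example; the semantic diagnosis $\mathbf{s}_{\mathrm{diag}}^{(n)}$ is produced by an LLM agent and is not modeled here.

```python
from typing import List, Tuple


def operating_summary(sinr_wc: List[List[float]], power: List[List[float]],
                      p_max: List[float]) -> Tuple[List[float], List[float]]:
    """Return (Gamma_obs, phi_obs) over the slots of the previous interval.

    sinr_wc[t][k]: worst-case SINR of user k in slot t (assumed layout).
    power[t][l]:   transmit power P_l^t of AP l in slot t (assumed layout).
    p_max[l]:      transmit power budget of AP l.
    """
    if not sinr_wc or not power:
        raise ValueError("need at least one slot of records")
    # Per-user minimum over slots: the worst robust QoS level observed.
    gamma_obs = [min(per_user) for per_user in zip(*sinr_wc)]
    # Per-AP maximum over slots of the normalized power usage.
    phi_obs = [max(slot[l] / p_max[l] for slot in power)
               for l in range(len(p_max))]
    return gamma_obs, phi_obs
```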

Notably, the environment summary $S_{\mathrm{env}}^{(n)}$ in ([24](https://arxiv.org/html/2606.24416#S3.E24)) is not intended to restate the operator’s previous objective. Instead, it summarizes the current operating state reached by the system after executing $\bm{c}^{(n-1)}$, so that $A_{\mathrm{plan}}$ and $A_{\mathrm{crit}}$ can determine how to move from the current state toward the new policy intent encoded in $\bar{P}^{(n)}$. $A_{\mathrm{obs}}$ links lower-level solutions to upper-level adaptation: it compresses the per-slot performance $\{\mathbf{u}^{t}\}_{t\in\mathcal{T}^{(n-1)}}$ contained in $\mathcal{D}^{(n-1)}$ into a structured state $S_{\mathrm{env}}^{(n)}=(\mathbf{s}_{\mathrm{op}}^{(n)},\mathbf{s}_{\mathrm{diag}}^{(n)})$, exposes the bottlenecks that limit the current setting, and provides the state input $S_{\mathrm{env}}^{(n)}$ for the other agents and the RAG module.

#### III-B3 Configuration Planner $A_{\mathrm{plan}}$ and Performance Critic $A_{\mathrm{crit}}$

After obtaining $\bar{P}^{(n)}$ and $S_{\mathrm{env}}^{(n)}$ from $A_{\mathrm{int}}$ and $A_{\mathrm{obs}}$, respectively, the upper level determines the configuration for interval $n$ through a planner–critic refinement procedure. $A_{\mathrm{plan}}$ and $A_{\mathrm{crit}}$ perform up to $R$ refinement rounds within every large interval to obtain $\bm{c}^{(n)}$ for the lower level. The key is to separate the generation of the candidate configuration $\bm{c}^{(n),r}$ from its evidence-based verification, where $r=0,1,\cdots,R-1$ indexes the refinement round within interval $n$. Specifically, $A_{\mathrm{plan}}$ proposes a structured adjustment $\bm{d}^{(n),r}$ using the current policy context and operating state; $A_{\mathrm{crit}}$ evaluates the resulting candidate $\bm{c}^{(n),r}$ using the case evidence $\mathcal{Z}_{\mathrm{case}}^{(n),r}$ retrieved by the RAG module, to determine whether $\bm{c}^{(n),r}$ is sufficiently supported by historically similar operating conditions before deployment in the lower level. This design provides a correction mechanism before the configuration is sent to the lower-level fast solver.

At the beginning of interval $n$, with the previous configuration vector $\bm{c}^{(n-1)}$, the environment summary $S_{\mathrm{env}}^{(n)}$, and the structured policy context $\bar{P}^{(n)}$, the configuration planner $A_{\mathrm{plan}}$ first proposes the initial adjustment vector, as given by

$$
\bm{d}^{(n),0}=A_{\mathrm{plan}}\left(S_{\mathrm{env}}^{(n)},\bar{P}^{(n)},\bm{c}^{(n-1)}\right), \tag{25}
$$

where $\bm{d}^{(n),0}$ is the planner-generated adjustment used in the first refinement round. For notational simplicity, every planner-generated adjustment $\bm{d}^{(n),r}$, including the case $r=0$, is written componentwise as $\bm{d}^{(n),r}=\big(\{\Delta\Gamma_{k}^{(n),r}\}_{k\in\mathcal{K}},\{\Delta\phi_{\ell}^{(n),r}\}_{\ell\in\mathcal{L}}\big)$, $r=0,1,\ldots,R-1$, where $\Delta\Gamma_{k}^{(n),r}$ is the generated adjustment of the robust QoS target of user $k$ and $\Delta\phi_{\ell}^{(n),r}$ is the generated adjustment of the power-budget exposure factor of AP $\ell$.

At refinement round $r$, $A_{\mathrm{plan}}$ forms the candidate configuration:

$$
\bm{c}^{(n),r}=\Pi_{\mathcal{C}(\bar{P}^{(n)})}\left(\bm{c}^{(n-1)}+\bm{d}^{(n),r}\right),\ r=0,1,\cdots,R-1, \tag{26}
$$

where $\Pi_{\mathcal{C}(\bar{P})}(\cdot)$ is the Euclidean projection onto the set in ([23](https://arxiv.org/html/2606.24416#S3.E23)). Since this set is a box, the projection is implemented by the componentwise clipping operation: $\forall k\in\mathcal{K},\ell\in\mathcal{L}$,

$$
\begin{aligned}
\Gamma_{k}^{(n),r}&=\min\Big\{\Gamma_{k}^{\max}(\bar{P}^{(n)}),\max\big\{\Gamma_{k}^{\min}(\bar{P}^{(n)}),\Gamma_{k}^{(n-1)}+\Delta\Gamma_{k}^{(n),r}\big\}\Big\};\\
\phi_{\ell}^{(n),r}&=\min\Big\{\phi_{\ell}^{\max}(\bar{P}^{(n)}),\max\big\{\phi_{\ell}^{\min}(\bar{P}^{(n)}),\phi_{\ell}^{(n-1)}+\Delta\phi_{\ell}^{(n),r}\big\}\Big\}.
\end{aligned} \tag{27}
$$

The clipping is required because $\bm{d}^{(n),r}$ cannot guarantee policy feasibility. By projecting the raw candidate $\bm{c}^{(n-1)}+\bm{d}^{(n),r}$ onto $\mathcal{C}(\bar{P}^{(n)})$, we ensure that the candidate configuration $\bm{c}^{(n),r}$ transmitted to $A_{\mathrm{crit}}$ satisfies the policy bounds in ([23](https://arxiv.org/html/2606.24416#S3.E23)).
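The candidate formation in (26) and (27) amounts to adding the adjustment and clipping each component to its policy bounds. A minimal sketch is given below, assuming the configuration is stored as a flat list that stacks the $\Gamma_{k}$ and $\phi_{\ell}$ entries; the names `clip` and `project_to_policy_box` are hypothetical.

```python
from typing import List


def clip(x: float, lo: float, hi: float) -> float:
    """min{hi, max{lo, x}}, the scalar form of (27)."""
    if lo > hi:
        raise ValueError("lower bound exceeds upper bound")
    return min(hi, max(lo, x))


def project_to_policy_box(c_prev: List[float], d: List[float],
                          lo: List[float], hi: List[float]) -> List[float]:
    """Candidate c^(n),r in (26): clip c^(n-1) + d^(n),r componentwise."""
    if not (len(c_prev) == len(d) == len(lo) == len(hi)):
        raise ValueError("all vectors must have the same length")
    return [clip(c + delta, l, h)
            for c, delta, l, h in zip(c_prev, d, lo, hi)]
```

For example, a previous configuration $(5, 0.5)$ with adjustment $(3, -0.7)$ and bounds $[0, 7]\times[0.2, 1]$ is clipped to $(7, 0.2)$.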

At refinement round $r$ of interval $n$, the critic $A_{\mathrm{crit}}$ evaluates $\bm{c}^{(n),r}$ using the case evidence set $\mathcal{Z}_{\mathrm{case}}^{(n),r}$ retrieved by the RAG module, as will be described in Section III-B.4). Based on the normalized weights associated with the retrieved cases, $A_{\mathrm{crit}}$ forms the empirical reference levels $\{\widehat{\Gamma}_{k}^{(n),r}\}_{k\in\mathcal{K}}$ and $\{\widehat{\phi}_{\ell}^{(n),r}\}_{\ell\in\mathcal{L}}$ from the historical support statistics $\{\Gamma_{k,\mathrm{obs}}^{(j)}\}$ and $\{\phi_{\ell,\mathrm{obs}}^{(j)}\}$ contained in the retained environment summaries $\{S_{\mathrm{env}}^{(j)}\}_{j\in\mathcal{I}_{\mathrm{case}}^{(n),r}}$; see Section III-B.4). Then, $A_{\mathrm{crit}}$ determines

$$
\left(a^{(n),r},\Delta\bm{d}^{(n),r}\right)=A_{\mathrm{crit}}\left(S_{\mathrm{env}}^{(n)},\bar{P}^{(n)},\bm{c}^{(n),r},\bm{d}^{(n),r},\mathcal{Z}_{\mathrm{case}}^{(n),r}\right), \tag{28}
$$

where $a^{(n),r}\in\{0,1\}$ denotes an acceptance flag; and $\Delta\bm{d}^{(n),r}$ denotes the correction generated by $A_{\mathrm{crit}}$, which is used by $A_{\mathrm{plan}}$ to revise $\bm{d}^{(n),r}$ in the next refinement round $r+1$ of the $n$-th large interval. Since the candidate $\bm{c}^{(n),r}$ has already been projected onto $\mathcal{C}(\bar{P}^{(n)})$, $A_{\mathrm{crit}}$ checks whether it remains within the empirically supported neighborhood of the retrieved reference levels, i.e., $\forall k\in\mathcal{K},\ell\in\mathcal{L}$,

$$
a^{(n),r}=1,\ \text{if}\quad\left|\Gamma_{k}^{(n),r}-\widehat{\Gamma}_{k}^{(n),r}\right|\leq\varepsilon_{\Gamma,k}^{(n),r}\ \text{and}\ \left|\phi_{\ell}^{(n),r}-\widehat{\phi}_{\ell}^{(n),r}\right|\leq\varepsilon_{\phi,\ell}^{(n),r}, \tag{29}
$$

where $\varepsilon_{\Gamma,k}^{(n),r}\geq 0$ and $\varepsilon_{\phi,\ell}^{(n),r}\geq 0$ are component-wise empirically supported tolerances around the retrieved reference levels. Specifically, $\varepsilon_{\Gamma,k}^{(n),r}$ bounds the admissible deviation between the candidate robust QoS target $\Gamma_{k}^{(n),r}$ and its retrieved reference $\widehat{\Gamma}_{k}^{(n),r}$, while $\varepsilon_{\phi,\ell}^{(n),r}$ bounds the admissible deviation between the candidate AP power-budget exposure factor $\phi_{\ell}^{(n),r}$ and its retrieved reference $\widehat{\phi}_{\ell}^{(n),r}$. Otherwise, $a^{(n),r}=0$ and $A_{\mathrm{crit}}$ returns a correction $\Delta\bm{d}^{(n),r}$ that pushes the candidate $\bm{c}^{(n),r}$ back toward the empirically supported region:

$$
\Delta\bm{d}^{(n),r}=\Big(\{\rho_{\Gamma}(\widehat{\Gamma}_{k}^{(n),r}-\Gamma_{k}^{(n),r})\}_{k\in\mathcal{K}},\{\rho_{\phi}(\widehat{\phi}_{\ell}^{(n),r}-\phi_{\ell}^{(n),r})\}_{\ell\in\mathcal{L}}\Big), \tag{30}
$$

where $\rho_{\Gamma},\rho_{\phi}>0$ are correction coefficients.
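The acceptance test in (29) and the correction in (30) can be sketched numerically as follows. This is only an illustration of the arithmetic: in the paper, $A_{\mathrm{crit}}$ is an LLM agent, and the reference levels and tolerances come from retrieved cases. The function name, the default coefficients of 0.5, and the choice to return a zero correction on acceptance are assumptions of this sketch.

```python
from typing import List, Tuple


def critic_check(gamma: List[float], gamma_ref: List[float], eps_gamma: List[float],
                 phi: List[float], phi_ref: List[float], eps_phi: List[float],
                 rho_gamma: float = 0.5, rho_phi: float = 0.5
                 ) -> Tuple[int, List[float], List[float]]:
    """Return (a, correction on Gamma, correction on phi) following (29)-(30)."""
    within_gamma = all(abs(g - r) <= e
                       for g, r, e in zip(gamma, gamma_ref, eps_gamma))
    within_phi = all(abs(p - r) <= e
                     for p, r, e in zip(phi, phi_ref, eps_phi))
    if within_gamma and within_phi:
        # Accepted: no correction is needed (zero vectors are an assumption).
        return 1, [0.0] * len(gamma), [0.0] * len(phi)
    # Rejected: pull every component toward its retrieved reference level.
    d_gamma = [rho_gamma * (r - g) for g, r in zip(gamma, gamma_ref)]
    d_phi = [rho_phi * (r - p) for p, r in zip(phi, phi_ref)]
    return 0, d_gamma, d_phi
```

Because $\rho_{\Gamma},\rho_{\phi}>0$, each correction component has the sign of the gap between the reference and the candidate, so it always points toward the reference level.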

If $a^{(n),r}=1$, the candidate $\bm{c}^{(n),r}$ is accepted and deployed as the configuration vector of interval $n$. Otherwise, $A_{\mathrm{plan}}$ revises its adjustment $\bm{d}^{(n),r}$ according to $A_{\mathrm{crit}}$’s feedback:

$$
\bm{d}^{(n),r+1}=A_{\mathrm{plan}}\left(S_{\mathrm{env}}^{(n)},\bar{P}^{(n)},\bm{c}^{(n-1)},\bm{d}^{(n),r},\Delta\bm{d}^{(n),r}\right). \tag{31}
$$

Then, the same verification process as in ([26](https://arxiv.org/html/2606.24416#S3.E26))–([30](https://arxiv.org/html/2606.24416#S3.Ex3)) is repeated. Let $r^{\star}$ be the first refinement round achieving $a^{(n),r^{\star}}=1$. The deployed configuration is given by

$$
\bm{c}^{(n)}=\bm{c}^{(n),r^{\star}}. \tag{32}
$$

If no adjustment is accepted within $R$ refinement rounds, $A_{\mathrm{plan}}$ deploys $\bm{c}^{(n),R-1}$ as the output. Thus, the upper level does not directly map $(S_{\mathrm{env}}^{(n)},\bar{P}^{(n)})$ to a configuration in one step; instead, it determines $\bm{c}^{(n)}$ through a planner–critic refinement process, which is robust to decision errors and aligned with the policy-aware deployment requirement of Agentic-LTPO.
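The control flow of the planner–critic refinement, including the fallback to $\bm{c}^{(n),R-1}$, can be sketched as below. The LLM agents are replaced by simple numerical stand-ins: `make_critic` applies the tolerance test with a fixed reference, and `additive_revise` simply adds the critic's correction to the previous adjustment. Both stand-ins and all names are assumptions of this sketch, not the behavior of $A_{\mathrm{plan}}$ or $A_{\mathrm{crit}}$ in the paper.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Critic = Callable[[Vector], Tuple[bool, Vector]]
Reviser = Callable[[Vector, Vector], Vector]


def project(c_prev: Vector, d: Vector, lo: Vector, hi: Vector) -> Vector:
    """Componentwise clipping of c_prev + d to the policy box."""
    return [min(h, max(l, c + x)) for c, x, l, h in zip(c_prev, d, lo, hi)]


def refine(c_prev: Vector, lo: Vector, hi: Vector, d0: Vector,
           critic: Critic, revise: Reviser, max_rounds: int) -> Tuple[Vector, int]:
    """Run up to max_rounds planner-critic rounds; return (config, round used)."""
    if max_rounds < 1:
        raise ValueError("need at least one refinement round")
    d = d0
    for r in range(max_rounds):
        candidate = project(c_prev, d, lo, hi)
        accepted, correction = critic(candidate)
        if accepted:
            return candidate, r                # first accepted round r*
        if r < max_rounds - 1:
            d = revise(d, correction)          # planner revises its adjustment
    return candidate, max_rounds - 1           # fallback: last candidate


def make_critic(ref: Vector, eps: Vector, rho: float) -> Critic:
    """Stand-in critic: tolerance test around `ref`, proportional correction."""
    def critic(c: Vector) -> Tuple[bool, Vector]:
        ok = all(abs(x - y) <= e for x, y, e in zip(c, ref, eps))
        corr = [0.0] * len(c) if ok else [rho * (y - x) for x, y in zip(c, ref)]
        return ok, corr
    return critic


def additive_revise(d: Vector, correction: Vector) -> Vector:
    """Stand-in planner revision: add the correction to the old adjustment."""
    return [a + b for a, b in zip(d, correction)]
```

In this toy setting, starting from $\bm{c}^{(n-1)}=5$ with an initial adjustment of $4$, a reference of $6$, a tolerance of $1$, and $\rho=0.5$, the candidates are $9$, $7.5$, and $6.75$, so the third round ($r^{\star}=2$) is accepted.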

#### III-B4 RAG module

To support the planner–critic refinement in Section III-B.3), we equip the upper-level critic agent $A_{\mathrm{crit}}$ with the RAG module to provide policy- and experience-based evidence throughout the refinement rounds. The RAG module enables $A_{\mathrm{int}}$ and $A_{\mathrm{plan}}$ to reuse previous policy contexts and historical operating cases under similar operating regimes, making the update from $\bm{c}^{(n-1)}$ to $\bm{c}^{(n)}$ stable and interpretable.

The RAG module maintains two memories, namely, a policy memory and a case memory. The policy memory stores the structured policy contexts generated by $A_{\mathrm{int}}$, as given by

$$
\mathcal{M}_{\mathrm{pol}}^{(n)}\triangleq\left\{\bar{P}^{(i)}\right\}_{i=0}^{n-1}. \tag{33}
$$

The case memory is constructed by augmenting the historical tuples stored in the cross-timescale experience buffer $\mathcal{E}^{(n)}$ with the deployed configuration changes, as given by

$$
\mathcal{M}_{\mathrm{case}}^{(n)}\triangleq\left\{\Big(S_{\mathrm{env}}^{(i)},\bm{c}^{(i)},\mathrm{KPI}^{(i)},\Delta\bm{c}^{(i)}\Big)\right\}_{i=0}^{n-1}, \tag{34}
$$

where $\Delta\bm{c}^{(i)}=\bm{c}^{(i)}-\bm{c}^{(i-1)}$, $i\geq 1$, with $\Delta\bm{c}^{(0)}$ initialized as an all-zero vector, and $\Delta\bm{c}^{(i)}$ records the deployed configuration change of interval $i$ with $i=0,1,\cdots,n-1$.
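The construction of the case memory in (34) from the experience buffer in (21) can be sketched as follows. The tuple layout and the use of plain Python objects for $S_{\mathrm{env}}^{(i)}$ and $\mathrm{KPI}^{(i)}$ are assumptions for illustration; the paper does not prescribe a storage format.

```python
from typing import Any, List, Tuple

Experience = Tuple[Any, List[float], Any]             # (S_env, c, KPI) as in (21)
Case = Tuple[Any, List[float], Any, List[float]]      # (S_env, c, KPI, delta_c) as in (34)


def build_case_memory(experience: List[Experience]) -> List[Case]:
    """Augment each stored tuple with the deployed configuration change."""
    memory: List[Case] = []
    c_before = None
    for s_env, c, kpi in experience:
        if c_before is None:
            delta = [0.0] * len(c)                    # delta_c^(0) is all-zero
        else:
            delta = [a - b for a, b in zip(c, c_before)]
        memory.append((s_env, c, kpi, delta))
        c_before = c
    return memory
```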

As designed in Sections III-B.1) to III-B.3), the policy memory $\mathcal{M}_{\mathrm{pol}}^{(n)}$ stores the historical structured policy contexts produced by $A_{\mathrm{int}}$. The case memory $\mathcal{M}_{\mathrm{case}}^{(n)}$ is updated after each interval by recording the observer summary $S_{\mathrm{env}}^{(i)}$, together with the deployed configuration $\bm{c}^{(i)}$, realized KPI tuple $\mathrm{KPI}^{(i)}$, and resulting configuration change $\Delta\bm{c}^{(i)}$.

The RAG module is used online by $A_{\mathrm{crit}}$ in ([28](https://arxiv.org/html/2606.24416#S3.E28)). As described in Section III-B.1), $A_{\mathrm{int}}$ first forms a preliminary structured policy $\tilde{P}^{(n)}$ and then retrieves a raw policy evidence set $\widetilde{\mathcal{Z}}_{\mathrm{pol}}^{(n)}$ from $\mathcal{M}_{\mathrm{pol}}^{(n)}$:

𝒵~pol\(n\)=TopKKpol⁡\{P¯\(i\)∈ℳpol\(n\):sim\(P~\(n\),P¯\(i\)\)\},\\widetilde\{\\mathcal\{Z\}\}\_\{\\mathrm\{pol\}\}^\{\(n\)\}=\\operatorname\{TopK\}\_\{K\_\{\\mathrm\{pol\}\}\}\\left\\\{\\bar\{P\}^\{\(i\)\}\\in\\mathcal\{M\}\_\{\\mathrm\{pol\}\}^\{\(n\)\}:\\mathrm\{sim\}\\\!\\left\(\\tilde\{P\}^\{\(n\)\},\\bar\{P\}^\{\(i\)\}\\right\)\\right\\\},\(35\)wheresim\(⋅,⋅\)\\mathrm\{sim\}\(\\cdot,\\cdot\)gives the cosine similarity between the embeddings of the normalized records converted fromP~\(n\)\\tilde\{P\}^\{\(n\)\}andP¯\(i\)\\bar\{P\}^\{\(i\)\}, produced by an embedding API333In practice, before embedding, each compared object is converted into a typed normalized record under a unified schema\.;KpolK\_\{\\mathrm\{pol\}\}is the number of recalled policy items before filtering;TopKK⁡\{xi:si\}\\operatorname\{TopK\}\_\{K\}\\\!\\left\\\{x\_\{i\}:s\_\{i\}\\right\\\}returns theKKitemsxix\_\{i\}with the highest scoressis\_\{i\}\.

The retrieved policy evidence set is then determined as

𝒵pol\(n\)=\{P¯\(i\)∈𝒵~pol\(n\):sim\(P~\(n\),P¯\(i\)\)≥τpol\},\\mathcal\{Z\}\_\{\\mathrm\{pol\}\}^\{\(n\)\}=\\left\\\{\\bar\{P\}^\{\(i\)\}\\in\\widetilde\{\\mathcal\{Z\}\}\_\{\\mathrm\{pol\}\}^\{\(n\)\}:\\mathrm\{sim\}\\\!\\left\(\\tilde\{P\}^\{\(n\)\},\\bar\{P\}^\{\(i\)\}\\right\)\\geq\\tau\_\{\\mathrm\{pol\}\}\\right\\\},\(36\)whereτpol\\tau\_\{\\mathrm\{pol\}\}is the predesigned threshold\. This policy retrieval is implemented with a nearest\-neighbor search over the stored structured policy contexts inℳpol\(n\)\\mathcal\{M\}\_\{\\mathrm\{pol\}\}^\{\(n\)\}, followed by threshold\-based filtering atτpol\\tau\_\{\\mathrm\{pol\}\}\. The evidence in𝒵pol\(n\)\\mathcal\{Z\}\_\{\\mathrm\{pol\}\}^\{\(n\)\}is supplied toAintA\_\{\\mathrm\{int\}\}to revise the preliminary fields ofP~\(n\)\\tilde\{P\}^\{\(n\)\}and produce the structured policy contextP¯\(n\)\\bar\{P\}^\{\(n\)\}in \([22](https://arxiv.org/html/2606.24416#S3.E22)\)\.
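The two-stage policy retrieval in (35) and (36) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `(policy_id, embedding)` memory layout, and the toy 2-D vectors standing in for embedding-API outputs are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all-zero)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def retrieve_policy_evidence(query_emb, memory, k_pol, tau_pol):
    """Top-K recall as in (35), then threshold filtering as in (36).

    memory is a list of (policy_id, embedding) pairs (hypothetical layout).
    """
    scored = [(pid, cosine(query_emb, emb)) for pid, emb in memory]
    recalled = sorted(scored, key=lambda item: item[1], reverse=True)[:k_pol]
    return [(pid, score) for pid, score in recalled if score >= tau_pol]


# Toy 2-D embeddings standing in for the output of an embedding API.
memory = [("P0", [1.0, 0.0]), ("P1", [0.0, 1.0]), ("P2", [1.0, 1.0])]
evidence = retrieve_policy_evidence([1.0, 0.1], memory, k_pol=2, tau_pol=0.8)
print(evidence)
```

In this toy case the recall stage keeps two items and the threshold stage then discards the weaker one, which shows why the evidence set can be smaller than $K_{\mathrm{pol}}$ or even empty.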

At refinement round $r$ of the $n$-th large timescale interval, $A_{\mathrm{crit}}$ retrieves a case evidence set from $\mathcal{M}_{\mathrm{case}}^{(n)}$ to perform corrections. To jointly evaluate the elements in ([34](https://arxiv.org/html/2606.24416#S3.E34)), we define the following hybrid retrieval score:

$$\eta_{\mathrm{case}}^{(i),r}=\lambda_{\mathrm{op}}\,\mathrm{sim}\!\left(\mathbf{s}_{\mathrm{op}}^{(n)},\mathbf{s}_{\mathrm{op}}^{(i)}\right)+\lambda_{\mathrm{diag}}\,\mathrm{sim}\!\left(\mathbf{s}_{\mathrm{diag}}^{(n)},\mathbf{s}_{\mathrm{diag}}^{(i)}\right)+\lambda_{\mathrm{dir}}\,\mathrm{sim}\!\left(\bm{d}^{(n),r},\Delta\bm{c}^{(i)}\right),\tag{37}$$

where $\lambda_{\mathrm{op}},\lambda_{\mathrm{diag}},\lambda_{\mathrm{dir}}\geq 0$ and $\lambda_{\mathrm{op}}+\lambda_{\mathrm{diag}}+\lambda_{\mathrm{dir}}=1$. If $\|\Delta\bm{c}^{(i)}\|_{2}=0$, then $\mathrm{sim}(\bm{d}^{(n),r},\Delta\bm{c}^{(i)})=0$. Based on $\eta_{\mathrm{case}}^{(i),r}$, the raw case retrieval $\widetilde{\mathcal{Z}}_{\mathrm{case}}^{(n),r}$ is defined as

$$\widetilde{\mathcal{Z}}_{\mathrm{case}}^{(n),r}=\operatorname{TopK}_{K_{\mathrm{case}}}\left\{\Big(S_{\mathrm{env}}^{(i)},\bm{c}^{(i)},\mathrm{KPI}^{(i)},\Delta\bm{c}^{(i)}\Big)\in\mathcal{M}_{\mathrm{case}}^{(n)}:\eta_{\mathrm{case}}^{(i),r}\right\},\tag{38}$$

where $K_{\mathrm{case}}$ denotes the number of recalled case items before filtering. With threshold-based filtering, the retrieved case evidence set is defined as

$$\mathcal{Z}_{\mathrm{case}}^{(n),r}=\Big\{\Big(S_{\mathrm{env}}^{(i)},\bm{c}^{(i)},\mathrm{KPI}^{(i)},\Delta\bm{c}^{(i)}\Big)\in\widetilde{\mathcal{Z}}_{\mathrm{case}}^{(n),r}:\ \bm{c}^{(i)}\in\mathcal{C}\!\Big(\bar{P}^{(n)}\Big),\ \mathrm{sim}\!\left(\bm{d}^{(n),r},\Delta\bm{c}^{(i)}\right)\geq\tau_{\mathrm{dir}}\Big\},\tag{39}$$

where $\tau_{\mathrm{dir}}$ is the threshold for directional consistency. The case retrieval is implemented as a hybrid search in $\mathcal{M}_{\mathrm{case}}^{(n)}$: a top-$K$ recall is performed according to ([37](https://arxiv.org/html/2606.24416#S3.E37)) and refined by feasibility and directional consistency filtering under the policy-constrained domain $\mathcal{C}\!\left(\bar{P}^{(n)}\right)$.
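The hybrid case retrieval in (37) to (39) can be sketched in the same way. The dictionary keys (`s_op`, `s_diag`, `d`, `delta_c`, `c`), the feasibility predicate `is_feasible`, and the toy vectors are hypothetical; the paper does not specify how $\mathbf{s}_{\mathrm{op}}$, $\mathbf{s}_{\mathrm{diag}}$, or $\mathcal{C}(\bar{P}^{(n)})$ are represented in code, so plain numeric lists and a callable are assumed here.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 for a zero vector, matching the convention below (37)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def hybrid_score(query, case, lam):
    """Weighted sum of three cosine similarities, mirroring (37)."""
    return (lam["op"] * cosine(query["s_op"], case["s_op"])
            + lam["diag"] * cosine(query["s_diag"], case["s_diag"])
            + lam["dir"] * cosine(query["d"], case["delta_c"]))


def retrieve_cases(query, memory, lam, k_case, tau_dir, is_feasible):
    """Top-K recall by hybrid score (38), then feasibility and direction filters (39)."""
    scored = [(hybrid_score(query, case, lam), case) for case in memory]
    recalled = sorted(scored, key=lambda item: item[0], reverse=True)[:k_case]
    return [(score, case) for score, case in recalled
            if is_feasible(case["c"])
            and cosine(query["d"], case["delta_c"]) >= tau_dir]


lam = {"op": 0.4, "diag": 0.3, "dir": 0.3}  # non-negative weights summing to 1
query = {"s_op": [1.0, 0.0], "s_diag": [1.0, 0.0], "d": [1.0, 0.0]}
memory = [
    {"name": "c0", "s_op": [1.0, 0.0], "s_diag": [1.0, 0.0], "delta_c": [1.0, 0.0], "c": [0.9]},
    {"name": "c1", "s_op": [1.0, 0.0], "s_diag": [0.0, 1.0], "delta_c": [-1.0, 0.0], "c": [0.9]},
    {"name": "c2", "s_op": [0.0, 1.0], "s_diag": [0.0, 1.0], "delta_c": [0.0, 0.0], "c": [0.9]},
]
# Hypothetical feasibility test: every entry of the configuration lies in (0, 1].
cases = retrieve_cases(query, memory, lam, k_case=2, tau_dir=0.5,
                       is_feasible=lambda c: all(0.0 < x <= 1.0 for x in c))
print([(round(score, 3), case["name"]) for score, case in cases])
```

Case `c1` is recalled by the hybrid score but removed by the directional filter because its past change points opposite to the proposed adjustment.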

The retrieved evidence in ([39](https://arxiv.org/html/2606.24416#S3.E39)) is used by $A_{\mathrm{crit}}$ to form the empirical reference levels in ([III-B3](https://arxiv.org/html/2606.24416#S3.Ex3)). Let $\mathcal{I}_{\mathrm{case}}^{(n),r}=\{j:\;(S_{\mathrm{env}}^{(j)},\bm{c}^{(j)},\mathrm{KPI}^{(j)},\Delta\bm{c}^{(j)})\in\mathcal{Z}_{\mathrm{case}}^{(n),r}\}$. We define the normalized weight as

$$\omega_{j}^{(n),r}=\frac{\exp\!\big(\eta_{\mathrm{case}}^{(j),r}\big)}{\sum_{i\in\mathcal{I}_{\mathrm{case}}^{(n),r}}\exp\!\big(\eta_{\mathrm{case}}^{(i),r}\big)}.\tag{40}$$

With $\omega_{j}^{(n),r},\forall j\in\mathcal{I}_{\mathrm{case}}^{(n),r}$, the reference levels in ([III-B3](https://arxiv.org/html/2606.24416#S3.Ex3)) are constructed as

$$\widehat{\Gamma}_{k}^{(n),r}=\sum_{j\in\mathcal{I}_{\mathrm{case}}^{(n),r}}\omega_{j}^{(n),r}\,\Gamma_{k,\mathrm{obs}}^{(j)},\quad\forall k\in\mathcal{K};\tag{41}$$

$$\widehat{\phi}_{\ell}^{(n),r}=\sum_{j\in\mathcal{I}_{\mathrm{case}}^{(n),r}}\omega_{j}^{(n),r}\,\phi_{\ell,\mathrm{obs}}^{(j)},\quad\forall\ell\in\mathcal{L},\tag{42}$$

where $\Gamma_{k,\mathrm{obs}}^{(j)}$ and $\phi_{\ell,\mathrm{obs}}^{(j)}$ are user $k$'s worst-case (robust) QoS level and AP $\ell$'s largest normalized power usage in $S_{\mathrm{env}}^{(j)}$, respectively. $A_{\mathrm{crit}}$ does not need to search the entire experience set $\mathcal{E}^{(n)}$; it grounds its acceptance decision with evidence from historical similar cases and deployed adjustments. If the retrieved case evidence set is empty ($\mathcal{Z}_{\mathrm{case}}^{(n),r}=\emptyset$), $A_{\mathrm{crit}}$ falls back to $\mathbf{s}_{\mathrm{op}}^{(n)}$ and sets

$$\widehat{\Gamma}_{k}^{(n),r}=\Gamma_{k,\mathrm{obs}}^{(n)},\quad\forall k\in\mathcal{K};\qquad\widehat{\phi}_{\ell}^{(n),r}=\phi_{\ell,\mathrm{obs}}^{(n)},\quad\forall\ell\in\mathcal{L}.$$

In this case, the weighting step in ([40](https://arxiv.org/html/2606.24416#S3.E40)) is skipped.
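The softmax weighting in (40), the weighted reference levels in (41) and (42), and the empty-set fallback can be sketched as below. The tuple layout `(eta, gamma_obs, phi_obs)` and the function name are assumptions for illustration. Subtracting the maximum score before exponentiating is a standard numerical safeguard that leaves the weights in (40) unchanged. Whether the QoS levels are averaged in dB or on a linear scale is not stated in this section, so the sketch simply averages the numbers it is given.

```python
import math


def reference_levels(evidence, fallback):
    """Form reference levels from retrieved cases, or fall back to current observations.

    evidence: list of (eta, gamma_obs, phi_obs) with per-user and per-AP lists.
    fallback: (gamma_obs, phi_obs) observed in the current interval.
    """
    if not evidence:
        # Empty retrieval: skip the weighting step and reuse current observations.
        return list(fallback[0]), list(fallback[1])

    eta_max = max(eta for eta, _, _ in evidence)
    exps = [math.exp(eta - eta_max) for eta, _, _ in evidence]  # stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]                          # (40)

    num_users = len(evidence[0][1])
    num_aps = len(evidence[0][2])
    gamma_hat = [sum(w * ev[1][k] for w, ev in zip(weights, evidence))
                 for k in range(num_users)]                      # (41)
    phi_hat = [sum(w * ev[2][l] for w, ev in zip(weights, evidence))
               for l in range(num_aps)]                          # (42)
    return gamma_hat, phi_hat


# Two retrieved cases with equal scores receive equal weights.
evidence = [(0.7, [2.0, 4.0], [0.8]), (0.7, [4.0, 6.0], [0.6])]
gamma_hat, phi_hat = reference_levels(evidence, fallback=([1.0, 1.0], [0.9]))
print(gamma_hat, phi_hat)
```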

### III\-CLower\-Level Optimization

At each time slot $t$, for the lower level, we design the downlink beamformers $\{\mathbf{w}_{\ell k}^{t}\}$ to minimize the total transmit energy. Based on the configuration $\bm{c}^{(n)}$, the lower level adjusts the target robust QoS levels $\{\Gamma_{k}^{(n)}\}_{k\in\mathcal{K}}$ and the AP power-budget exposure factors $\{\phi_{\ell}^{(n)}\}_{\ell\in\mathcal{L}}$ in the slot-level constraints for each slot $t\in\mathcal{T}^{(n)}$. The energy minimization problem for the lower level at slot $t$ is formulated as

$$\begin{aligned}\min_{\{\mathbf{w}_{\ell k}^{t}\}}\quad&\sum_{\ell\in\mathcal{L}}\sum_{k\in\mathcal{K}}\big\|\mathbf{w}_{\ell k}^{t}\big\|_{2}^{2}&&\text{(43a)}\\ \mathrm{s.t.}\quad&\sum_{k\in\mathcal{K}}\big\|\mathbf{w}_{\ell k}^{t}\big\|_{2}^{2}\leq\phi_{\ell}^{(n)}P_{\ell}^{\max},\quad\forall\ell\in\mathcal{L},&&\text{(43b)}\\ &\mathrm{SINR}_{k}^{t,\mathrm{wc}}\geq\Gamma_{k}^{(n)},\quad\forall k\in\mathcal{K}.&&\text{(43c)}\end{aligned}$$

To solve ([43](https://arxiv.org/html/2606.24416#S3.E43)), we adopt a zero-forcing (ZF) beamforming structure (even under imperfect CSI) to obtain a computationally efficient design for the lower level.

Let $\mathbf{h}_{k}^{t}\triangleq\big[(\mathbf{h}_{1k}^{t})^{\intercal},\ldots,(\mathbf{h}_{Lk}^{t})^{\intercal}\big]^{\intercal}\in\mathbb{C}^{LM}$ and $\hat{\mathbf{h}}_{k}^{t}\triangleq\big[(\hat{\mathbf{h}}_{1k}^{t})^{\intercal},\ldots,(\hat{\mathbf{h}}_{Lk}^{t})^{\intercal}\big]^{\intercal}$. The unnormalized ZF directions are constructed via the right pseudo-inverse of $(\hat{\mathbf{H}}^{t})^{H}$, as given by

$$\tilde{\mathbf{V}}^{t}\triangleq\hat{\mathbf{H}}^{t}\big((\hat{\mathbf{H}}^{t})^{H}\hat{\mathbf{H}}^{t}\big)^{-1}=\big[\tilde{\mathbf{v}}_{1}^{t},\ldots,\tilde{\mathbf{v}}_{K}^{t}\big],\tag{44}$$

which yields the ZF property over the estimated channels, i.e.,

$$(\hat{\mathbf{h}}_{k}^{t})^{H}\tilde{\mathbf{v}}_{j}^{t}=0,\quad\forall k\in\mathcal{K},~\forall j\in\mathcal{K}\backslash\{k\}.\tag{45}$$

We normalize each ZF direction as $\mathbf{v}_{k}^{t}\triangleq\tilde{\mathbf{v}}_{k}^{t}/\|\tilde{\mathbf{v}}_{k}^{t}\|_{2}$ and parameterize the beamformer as

$$\mathbf{w}_{k}^{t}=\sqrt{p_{k}^{t}}\,\mathbf{v}_{k}^{t},\quad p_{k}^{t}\geq 0,\ \forall k\in\mathcal{K}.\tag{46}$$

Let $\mathbf{v}_{k}^{t}=[(\mathbf{v}_{1k}^{t})^{\intercal},\ldots,(\mathbf{v}_{Lk}^{t})^{\intercal}]^{\intercal}$ denote the AP-wise partition corresponding to $L$ blocks of size $M$. Then, the per-AP power in time slot $t$ becomes

$$\sum_{k\in\mathcal{K}}\|\mathbf{w}_{\ell k}^{t}\|_{2}^{2}=\sum_{k\in\mathcal{K}}p_{k}^{t}\|\mathbf{v}_{\ell k}^{t}\|_{2}^{2},\quad\forall\ell\in\mathcal{L}.\tag{47}$$

The beamforming design in ([43](https://arxiv.org/html/2606.24416#S3.E43)) reduces to optimizing the power allocation vector $\mathbf{p}^{t}\triangleq[p_{1}^{t},\ldots,p_{K}^{t}]^{\intercal}$.

To handle the imperfect CSI, we take the worst-case (robust) SINR into consideration. For user $k$, we define

$$a_{k}^{t}\triangleq\left|\sum_{\ell\in\mathcal{L}}(\hat{\mathbf{h}}_{\ell k}^{t})^{H}\mathbf{v}_{\ell k}^{t}\right|,\quad\zeta_{k,j}^{t}\triangleq\sum_{\ell\in\mathcal{L}}\delta_{\ell k}^{t}\|\mathbf{v}_{\ell j}^{t}\|_{2}.\tag{48}$$

Based on ([48](https://arxiv.org/html/2606.24416#S3.E48)), applying the triangle inequality and the Cauchy-Schwarz inequality, we obtain the following worst-case bounds of $|\sum_{\ell\in\mathcal{L}}(\mathbf{h}_{\ell k}^{t})^{H}\mathbf{v}_{\ell k}^{t}|$ and $|\sum_{\ell\in\mathcal{L}}(\mathbf{h}_{\ell k}^{t})^{H}\mathbf{v}_{\ell j}^{t}|$:

$$\min_{\{\Delta\mathbf{h}_{\ell k}^{t}:\|\Delta\mathbf{h}_{\ell k}^{t}\|_{2}\leq\delta_{\ell k}^{t}\}}\left|\sum_{\ell\in\mathcal{L}}(\mathbf{h}_{\ell k}^{t})^{H}\mathbf{v}_{\ell k}^{t}\right|\geq\big(a_{k}^{t}-\zeta_{k,k}^{t}\big)_{+};\tag{49a}$$

$$\max_{\{\Delta\mathbf{h}_{\ell k}^{t}:\|\Delta\mathbf{h}_{\ell k}^{t}\|_{2}\leq\delta_{\ell k}^{t}\}}\left|\sum_{\ell\in\mathcal{L}}(\mathbf{h}_{\ell k}^{t})^{H}\mathbf{v}_{\ell j}^{t}\right|\leq\zeta_{k,j}^{t},\quad\forall j\in\mathcal{K}\backslash\{k\},\tag{49b}$$

where $(x)_{+}\triangleq\max\{x,0\}$. Accordingly, under the strict global ZF structure in ([46](https://arxiv.org/html/2606.24416#S3.E46)) (i.e., $\mathbf{w}_{k}^{t}=\sqrt{p_{k}^{t}}\mathbf{v}_{k}^{t}$), a closed-form lower bound for the worst-case (robust) SINR in ([II-D](https://arxiv.org/html/2606.24416#S2.Ex3)) is given by

$$\gamma_{k}^{t}\triangleq\frac{p_{k}^{t}\big(a_{k}^{t}-\zeta_{k,k}^{t}\big)_{+}^{2}}{(\sigma_{k}^{t})^{2}+\sum_{j\in\mathcal{K}\backslash\{k\}}p_{j}^{t}\big(\zeta_{k,j}^{t}\big)^{2}},\quad\mathbf{p}^{t}\triangleq[p_{1}^{t},\ldots,p_{K}^{t}]^{\intercal},\tag{50}$$

which guarantees that $\mathrm{SINR}_{k}^{t,\mathrm{wc}}\geq\gamma_{k}^{t}$.
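To make the quantities in (48) and the bound in (50) concrete, the sketch below evaluates them for given estimated channels, normalized ZF directions, error radii, and powers. The nested-list layout `h_hat[l][k]` (one length-$M$ complex list per AP and user) and the function name are assumptions for illustration; the toy example uses one AP with two antennas and orthogonal channels, so the ZF directions are simply the unit vectors.

```python
import math


def robust_sinr_lower_bound(h_hat, v, delta, p, noise_var):
    """Evaluate gamma_k in (50) for every user.

    h_hat[l][k], v[l][k]: complex lists (AP l, user k); delta[l][k]: error radius;
    p[k]: power of user k; noise_var[k]: noise variance at user k.
    """
    num_aps, num_users = len(h_hat), len(p)

    def norm(x):
        return math.sqrt(sum(abs(z) ** 2 for z in x))

    def inner(h, w):
        # h^H w: conjugate the channel entries before multiplying.
        return sum(hc.conjugate() * wc for hc, wc in zip(h, w))

    gammas = []
    for k in range(num_users):
        a_k = abs(sum(inner(h_hat[l][k], v[l][k]) for l in range(num_aps)))
        zeta = [sum(delta[l][k] * norm(v[l][j]) for l in range(num_aps))
                for j in range(num_users)]                     # zeta_{k,j} in (48)
        signal = p[k] * max(a_k - zeta[k], 0.0) ** 2
        interference = sum(p[j] * zeta[j] ** 2
                           for j in range(num_users) if j != k)
        gammas.append(signal / (noise_var[k] + interference))
    return gammas


# One AP, two antennas, two users with orthogonal estimated channels.
h_hat = [[[1 + 0j, 0j], [0j, 1 + 0j]]]
v = [[[1 + 0j, 0j], [0j, 1 + 0j]]]
delta = [[0.1, 0.1]]
gammas = robust_sinr_lower_bound(h_hat, v, delta, p=[1.0, 1.0],
                                 noise_var=[0.01, 0.01])
print(gammas)
```

Even though ZF nulls the interference on the estimated channels, the residual terms $\zeta_{k,j}^{t}$ remain in the denominator, which is why the bound stays finite as the noise vanishes.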

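With the ZF structure in ([46](https://arxiv.org/html/2606.24416#S3.E46)) and the lower bound in ([50](https://arxiv.org/html/2606.24416#S3.E50)) replacing the worst-case SINR, Problem ([43](https://arxiv.org/html/2606.24416#S3.E43)) can be conservatively recast as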

$$\begin{aligned}\min_{\mathbf{p}^{t}\succeq\mathbf{0}}\quad&\sum_{k\in\mathcal{K}}p_{k}^{t}&&\text{(51a)}\\ \mathrm{s.t.}\quad&\sum_{k\in\mathcal{K}}p_{k}^{t}\big\|\mathbf{v}_{\ell k}^{t}\big\|_{2}^{2}\leq\phi_{\ell}^{(n)}P_{\ell}^{\max},\quad\forall\ell\in\mathcal{L},&&\text{(51b)}\\ &\gamma_{k}^{t}\geq\Gamma_{k}^{(n)},\quad\forall k\in\mathcal{K},&&\text{(51c)}\end{aligned}$$

which is a linear program: multiplying both sides of the fractional constraint ([51](https://arxiv.org/html/2606.24416#S3.E51)c) by the positive denominator of $\gamma_{k}^{t}$ yields an inequality that is affine in $\mathbf{p}^{t}$. Hence, ([51](https://arxiv.org/html/2606.24416#S3.E51)) can be efficiently solved using convex optimization solvers, e.g., CVX\[[6](https://arxiv.org/html/2606.24416#bib.bib62)\]. Because each slot requires only the ZF directions and the solution of a small linear program, this approach is well suited to the latency-sensitive lower-level optimization.
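Written out, constraint (51c) becomes $p_{k}^{t}(a_{k}^{t}-\zeta_{k,k}^{t})_{+}^{2}\geq\Gamma_{k}^{(n)}\big((\sigma_{k}^{t})^{2}+\sum_{j\neq k}p_{j}^{t}(\zeta_{k,j}^{t})^{2}\big)$, which is affine in $\mathbf{p}^{t}$. The paper solves (51) with CVX; the sketch below deliberately swaps in a different, solver-free technique, namely the classical fixed-point power-control iteration, as an illustration. Starting from zero power, the iteration increases monotonically to the componentwise-smallest power vector meeting the SINR targets when one exists, and that vector also minimizes the sum power, so it only remains to check the per-AP caps in (51b). All names and the toy numbers are assumptions, and $\Gamma_{k}^{(n)}$ is taken on a linear scale.

```python
def min_power_allocation(gain, zeta_sq, noise_var, target, v_block_sq, p_cap,
                         max_iters=500, tol=1e-12):
    """Fixed-point sketch for (51); returns the power vector or None if infeasible.

    gain[k] = (a_k - zeta_kk)_+^2, zeta_sq[k][j] = zeta_{k,j}^2,
    target[k] = Gamma_k (linear), v_block_sq[l][k] = ||v_lk||^2,
    p_cap[l] = phi_l * P_l^max.
    """
    num_users = len(gain)
    if any(g <= 0.0 for g in gain):
        return None  # zero robust signal gain: a positive target cannot be met
    p = [0.0] * num_users
    converged = False
    for _ in range(max_iters):
        # Smallest p_k meeting the affine form of (51c), given the other powers.
        new_p = [target[k] * (noise_var[k]
                              + sum(p[j] * zeta_sq[k][j]
                                    for j in range(num_users) if j != k))
                 / gain[k] for k in range(num_users)]
        step = max(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if step <= tol * max(1.0, max(p)):
            converged = True
            break
    if not converged:
        return None  # powers keep growing: SINR targets are not jointly achievable
    for row, cap in zip(v_block_sq, p_cap):
        if sum(pk * w for pk, w in zip(p, row)) > cap:
            return None  # per-AP budget (51b) violated by the minimal solution
    return p


# Two users, one AP; numbers chosen only for illustration.
gain = [0.81, 0.81]
zeta_sq = [[0.01, 0.01], [0.01, 0.01]]
p_opt = min_power_allocation(gain, zeta_sq, noise_var=[0.01, 0.01],
                             target=[10.0, 10.0], v_block_sq=[[1.0, 1.0]],
                             p_cap=[1.0])
print(p_opt)
```

If the minimal vector violates a per-AP cap, no feasible point exists, because every feasible power vector dominates it componentwise and the cap constraints have non-negative coefficients.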

Algorithm 1: Implementation of Agentic-LTPO

1: Input: $T$, $N$, $T_{s}$, $R$, $\{P^{(n)}\}_{n=0}^{N-1}$, $\{\gamma^{(n)}\}_{n=0}^{N-1}$, $\lambda_{R}$, $\lambda_{\mathrm{EE}}$, $\lambda_{J}$, $R_{\mathrm{ref}}$, $\mathrm{EE}_{\mathrm{ref}}$, $J_{\mathrm{ref}}$, $K_{\mathrm{pol}}$, $\tau_{\mathrm{pol}}$, $K_{\mathrm{case}}$, $\tau_{\mathrm{dir}}$, $\lambda_{\mathrm{op}}$, $\lambda_{\mathrm{diag}}$, $\lambda_{\mathrm{dir}}$, $\mathcal{D}^{(-1)}$, $\mathcal{M}_{\mathrm{pol}}^{(0)}$, $\mathcal{M}_{\mathrm{case}}^{(0)}$, $\mathcal{E}^{(0)}$, $\bm{c}^{(-1)}$

2: for $n=0,1,\ldots,N-1$ do

3: $A_{\mathrm{int}}$ forms $\tilde{P}^{(n)}$ from $P^{(n)}$, after which the RAG module generates policy evidence with ([35](https://arxiv.org/html/2606.24416#S3.E35))–([36](https://arxiv.org/html/2606.24416#S3.E36)). $A_{\mathrm{int}}$ generates $\bar{P}^{(n)}$ and $\mathcal{C}(\bar{P}^{(n)})$ with ([22](https://arxiv.org/html/2606.24416#S3.E22)) and ([23](https://arxiv.org/html/2606.24416#S3.E23)). The RAG module then updates the policy memory with $\bar{P}^{(n)}$

4: $A_{\mathrm{obs}}$ forms the environment summary $S_{\mathrm{env}}^{(n)}$ from $\mathcal{D}^{(n-1)}$, $\bm{c}^{(n-1)}$, and $\bar{P}^{(n)}$ with ([24](https://arxiv.org/html/2606.24416#S3.E24))

5: $A_{\mathrm{plan}}$ initializes the adjustment $\bm{d}^{(n),0}$ by ([25](https://arxiv.org/html/2606.24416#S3.E25))

6: for $r=0,1,\ldots,R-1$ do

7: $A_{\mathrm{plan}}$ forms the candidate $\bm{c}^{(n),r}$ by ([26](https://arxiv.org/html/2606.24416#S3.E26))

8: The RAG module generates case evidence with ([37](https://arxiv.org/html/2606.24416#S3.E37))–([39](https://arxiv.org/html/2606.24416#S3.E39))

9: $A_{\mathrm{crit}}$ forms the reference levels using ([40](https://arxiv.org/html/2606.24416#S3.E40))–([42](https://arxiv.org/html/2606.24416#S3.E42)), and outputs $\left(a^{(n),r},\delta\bm{d}^{(n),r}\right)$ by ([28](https://arxiv.org/html/2606.24416#S3.E28))

10: if $a^{(n),r}=1$ then

11: Set $\bm{c}^{(n)}\leftarrow\bm{c}^{(n),r}$, send $\bm{c}^{(n)}$ to the lower level, break

12: else

13: $A_{\mathrm{plan}}$ revises the adjustment by ([31](https://arxiv.org/html/2606.24416#S3.E31))

14: end if

15: end for

16: if $\bm{c}^{(n)}$ is not assigned then

17: Set $\bm{c}^{(n)}\leftarrow\bm{c}^{(n),R-1}$, and send $\bm{c}^{(n)}$ to the lower level

18: end if

19: for each slot $t\in\mathcal{T}^{(n)}$ do

20: With the estimated CSI $\widehat{\mathbf{H}}^{t}$, the lower level solves ([51](https://arxiv.org/html/2606.24416#S3.E51)) via CVX and obtains $\mathbf{W}^{t,*}\big(\bm{c}^{(n)}\big)$, and the slot-level feedback $\mathbf{u}^{t}$ is collected by the upper level

21: end for

22: The upper level forms $\mathcal{D}^{(n)}$ and updates $\mathcal{E}^{(n+1)}$ with ([20](https://arxiv.org/html/2606.24416#S3.E20)) and ([21](https://arxiv.org/html/2606.24416#S3.E21)), computes $\mathrm{KPI}^{(n)}$ and $G(\mathrm{KPI}^{(n)})$ with ([18](https://arxiv.org/html/2606.24416#S3.E18)) and ([19](https://arxiv.org/html/2606.24416#S3.E19)), and updates the objective in ([17a](https://arxiv.org/html/2606.24416#S3.E17.1))

23: Based on ([34](https://arxiv.org/html/2606.24416#S3.E34)), the upper level computes $\Delta\bm{c}^{(n)}$, and the RAG module updates $\mathcal{M}_{\mathrm{case}}^{(n+1)}$

24: end for

25: Return: $\{\bm{c}^{(n)},\mathrm{KPI}^{(n)}\}_{n=0}^{N-1}$, $\{\mathbf{W}^{t,*}\}_{t=1}^{T}$, $\sum_{n=0}^{N-1}\gamma^{(n)}G(\mathrm{KPI}^{(n)})$

### III\-DOverall Algorithm Implementation

Algorithm [1](https://arxiv.org/html/2606.24416#alg1) summarizes the implementation of the proposed Agentic-LTPO framework. At each large timescale interval $n$, agent $A_{\mathrm{int}}$ first grounds the operator's policy input into the structured context and feasible decision domain, after which agent $A_{\mathrm{obs}}$ summarizes the current environment state. Conditioned on this environment state, the planner–critic coordination between agents $A_{\mathrm{plan}}$ and $A_{\mathrm{crit}}$ performs at most $R$ refinement rounds with RAG-assisted retrieval until a feasible configuration is accepted and passed to the lower level. The accepted configuration is fixed throughout interval $n$, while the fast solver computes the beamforming solutions based on the instantaneous CSI per slot. At the end of interval $n$, the realized lower-level beamforming solutions are aggregated to update the experience buffer, the designed KPIs, and the case memory for the next upper-level update. The algorithm can be initialized offline by a tuple that is consistent with the adopted configuration and memory settings.
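The planner–critic refinement within one interval (at most $R$ rounds, early exit on acceptance, fallback to the last candidate) can be sketched as a small control loop. The three callables are hypothetical stand-ins for the LLM-based agents and the RAG-assisted critic, and the scalar toy configuration is chosen only to make the flow visible.

```python
def refine_configuration(plan_candidate, critic, revise, d_init, max_rounds):
    """Run up to max_rounds planner-critic rounds for one interval.

    plan_candidate(d, r) -> candidate configuration   (planner)
    critic(candidate, r) -> (accepted, correction)    (critic with case evidence)
    revise(d, correction) -> revised adjustment       (planner revision)
    Returns (configuration, round_index, accepted_flag).
    """
    d = d_init
    candidate = None
    for r in range(max_rounds):
        candidate = plan_candidate(d, r)
        accepted, correction = critic(candidate, r)
        if accepted:
            return candidate, r, True
        d = revise(d, correction)
    # No candidate was accepted: deploy the last candidate that was formed.
    return candidate, max_rounds - 1, False


# Toy stand-ins: the configuration is a scalar, and the critic accepts values <= 3.
plan = lambda d, r: 0.0 + d
crit = lambda c, r: (c <= 3.0, -1.0)
rev = lambda d, delta: d + delta

accepted_run = refine_configuration(plan, crit, rev, d_init=5.0, max_rounds=4)
fallback_run = refine_configuration(plan, crit, rev, d_init=5.0, max_rounds=2)
print(accepted_run, fallback_run)
```

Bounding the number of rounds keeps the upper-level latency predictable, at the cost of occasionally deploying a configuration the critic did not accept.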

## IVExperiments

In this section, we evaluate the proposed Agentic-LTPO framework for a CF-MIMO system with $L=16$ distributed APs jointly serving $K=8$ single-antenna users in a square service area of size $1~\mathrm{km}\times 1~\mathrm{km}$. Each AP is equipped with a uniform linear array (ULA) with $M=4$ transmit antennas and half-wavelength antenna spacing. Unless otherwise specified, the APs are fixed at the grid points of a $4\times 4$ layout spanning the service area. The simulation horizon is $T=600$ time slots, divided into $N=30$ large timescale intervals of $T_{s}=20$ slots each. In each interval, the $K$ users are dropped independently and uniformly in the service area, i.e., the user locations follow a homogeneous binomial point process; the locations are fixed within each interval and regenerated independently across intervals. To model the downlink channels, we adopt the standard large-scale/small-scale decomposition that is widely used in the CF-MIMO literature, e.g.,\[[30](https://arxiv.org/html/2606.24416#bib.bib1)\]. For each slot $t\in\mathcal{T}^{(n)}$, the channel from AP $\ell$ to user $k$ is generated as

$$\mathbf{h}_{\ell k}^{t}=\sqrt{\beta_{\ell k}^{(n)}}\,\mathbf{g}_{\ell k}^{t},\qquad\mathbf{g}_{\ell k}^{t}\sim\mathcal{CN}\!\left(\mathbf{0},\mathbf{I}_{M}\right),\tag{52}$$

where $\mathbf{g}_{\ell k}^{t}$ denotes the small-scale Rayleigh fading vector, and $\beta_{\ell k}^{(n)}$ is the large-scale fading coefficient that remains unchanged within the $n$-th large timescale interval. The large-scale fading coefficient is modeled as

$$\beta_{\ell k}^{(n)}=10^{\frac{\mathrm{PL}_{\ell k}^{(n)}+F_{\ell k}^{(n)}}{10}},\tag{53}$$

where $\mathrm{PL}_{\ell k}^{(n)}$ is the path loss (in dB) and $F_{\ell k}^{(n)}$ denotes the shadow fading (in dB). Considering an urban microcell setting, $\mathrm{PL}_{\ell k}^{(n)}[\mathrm{dB}]=-30.5-36.7\log_{10}(d_{\ell k}^{(n)})$, where $d_{\ell k}^{(n)}$ is the distance between AP $\ell$ and user $k$ in interval $n$; the shadow fading follows $F_{\ell k}^{(n)}\sim\mathcal{N}(0,4^{2})~\mathrm{dB}$. Unless otherwise specified, $\beta_{\ell k}^{(n)}$ is unchanged within any interval $n$ and updated independently across intervals, and $\mathbf{g}_{\ell k}^{t}$ is independently generated across slots. The CSI uncertainty is set as $\delta_{\ell k}^{t}=\epsilon\|\hat{\mathbf{h}}_{\ell k}^{t}\|_{2},\ \forall\ell,k,t$, with the default uncertainty level $\epsilon=0.05$.
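The channel model in (52) and (53) and the CSI error radius can be sketched with the standard library alone. The function names are illustrative, and the distance is assumed to be in meters, which this section does not state explicitly.

```python
import math
import random


def large_scale_fading(distance, rng, shadow_std_db=4.0):
    """beta in (53): path loss plus log-normal shadowing, converted from dB.

    distance is assumed to be in meters (unit not stated in the text).
    """
    path_loss_db = -30.5 - 36.7 * math.log10(distance)
    shadow_db = rng.gauss(0.0, shadow_std_db)
    return 10.0 ** ((path_loss_db + shadow_db) / 10.0)


def rayleigh_vector(num_antennas, rng):
    """g ~ CN(0, I_M): real and imaginary parts are each N(0, 1/2)."""
    s = math.sqrt(0.5)
    return [complex(rng.gauss(0.0, s), rng.gauss(0.0, s))
            for _ in range(num_antennas)]


def channel(beta, num_antennas, rng):
    """h = sqrt(beta) * g as in (52); beta is held fixed within an interval."""
    scale = math.sqrt(beta)
    return [scale * g for g in rayleigh_vector(num_antennas, rng)]


def csi_error_radius(h_hat, eps=0.05):
    """delta = eps * ||h_hat||_2, the uncertainty radius used in the robust bound."""
    return eps * math.sqrt(sum(abs(z) ** 2 for z in h_hat))


rng = random.Random(0)                      # fixed seed for reproducibility
beta = large_scale_fading(100.0, rng)       # drawn once per interval
h = channel(beta, num_antennas=4, rng=rng)  # redrawn in every slot
print(beta, csi_error_radius(h))
```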

For the lower-level beamforming design, the noise variance is $(\sigma_{k}^{t})^{2}=-167~\mathrm{dBm/Hz}$, $\forall k\in\mathcal{K}$. All APs are assigned the same nominal per-AP power budget $P_{\ell}^{\max}=-50~\mathrm{dBm/Hz}$, $\forall\ell\in\mathcal{L}$. The effective per-AP power limit in interval $n$ is adjusted by the upper-level coefficient $\phi_{\ell}^{(n)}$.

For the upper-level multi-agent setting, we manually generate $P^{(n)}$, which expresses only the operator's operational objectives and preferences in interval $n$ and does not directly specify the configuration values. The values of $R_{\mathrm{ref}}$, $\mathrm{EE}_{\mathrm{ref}}$, and $J_{\mathrm{ref}}$ are obtained offline with $6$ intervals, and the initial configuration $\bm{c}^{(-1)}$ in Algorithm 1 is $\Gamma_{k}^{(-1)}=0~(\mathrm{dB}),\forall k\in\mathcal{K}$, and $\phi_{\ell}^{(-1)}=1,\forall\ell\in\mathcal{L}$. The empirically supported tolerance thresholds in ([III-B3](https://arxiv.org/html/2606.24416#S3.Ex3)), i.e., the admissible deviations of the candidate configuration components from their retrieved reference levels, are $\varepsilon_{\Gamma,k}^{(n),r}=0.75+0.25r\ (\mathrm{dB})$ and $\varepsilon_{\phi,\ell}^{(n),r}=0.05+0.03r$, respectively. The other parameters associated with the system are listed in Table I.

TABLE II: Simulation parameters not explicitly specified in the text.

Since Agentic-LTPO jointly considers slow-timescale policy interpretation, cross-timescale configuration adaptation, and slot-level CF-MIMO beamforming, no existing methods are directly comparable. Particularly, classical beamforming methods, e.g.,\[[36](https://arxiv.org/html/2606.24416#bib.bib4)\], typically assume a fixed problem formulation, while learning-based controllers, e.g.,\[[2](https://arxiv.org/html/2606.24416#bib.bib7),[40](https://arxiv.org/html/2606.24416#bib.bib9)\], have been designed under pre-specified rewards or optimization templates. For the language-grounding study, we further consider an *oracle structured-policy* setting, where the natural language policy is replaced by its manually specified structured counterpart. This setting is not a standalone benchmark; rather, it separates the effect of policy interpretation from that of the decision logic. We consider the following benchmarks for Agentic-LTPO:

- Static Configuration: A pre-tuned upper-level configuration $\bm{c}$ remains fixed throughout all periods.
- Single-Agent Controller: A single LLM-based controller directly maps the current policy input and environment summary to the next configuration $\bm{c}^{(n)}$.

We also test the following ablated versions of Agentic-LTPO:

- Agentic-LTPO without RAG: The proposed framework with the RAG module removed.
- Agentic-LTPO without $A_{\mathrm{crit}}$: The proposed framework with the performance critic and the corresponding refinement loop deactivated.

We consider two policy\-generation settings in the experiments\. In the first setting, the operator’s policy is independently sampled for each large timescale interval, so that the policy tendency may change arbitrarily across adjacent intervals\. This setting evaluates the adaptability of the upper\-level controller under rapidly varying policy inputs\. In the second setting, the policy sequence is generated in a piecewise\-stationary manner, where every ten intervals share the same dominant policy tendency, e\.g\., throughput\-oriented, QoS\-oriented, or energy\-saving operation\. This setting evaluates whether the upper level can reliably track policy regimes and adjust the lower\-level problem configuration accordingly\.

![Refer to caption](https://arxiv.org/html/2606.24416v1/x1.png)\(a\)Random policy setting\.
![Refer to caption](https://arxiv.org/html/2606.24416v1/x2.png)\(b\)Piecewise\-stationary policy setting\.

Figure 3: Cumulative communication utility of different upper-level controllers under two policy-generation settings.

We first evaluate the communication utility metric $G_{\mathrm{comm}}^{(n)}\triangleq\frac{\lambda_{R}}{T_{s}}\sum_{t\in\mathcal{T}^{(n)}}\frac{R^{t}}{R_{\mathrm{ref}}}+\frac{\lambda_{\mathrm{EE}}}{T_{s}}\sum_{t\in\mathcal{T}^{(n)}}\frac{\mathrm{EE}^{t}}{\mathrm{EE}_{\mathrm{ref}}}$, which removes the upper-level inference energy from $G(\mathrm{KPI}^{(n)})$ in ([19](https://arxiv.org/html/2606.24416#S3.E19)). Fig. [3](https://arxiv.org/html/2606.24416#S4.F3) plots $\sum_{i=1}^{n}G_{\mathrm{comm}}^{(i)}$ of different benchmarks under both random and piecewise-stationary policy settings. Agentic-LTPO yields the largest cumulative communication-utility gain in both settings, showing that it can more effectively translate operator intents into suitable configurations across intervals. By contrast, the static configuration strategy remains fixed and therefore accumulates a significant mismatch as the policy evolves across intervals. Removing either the RAG module or the critic weakens the long-term gain, indicating that both retrieved evidence and planner–critic refinement contribute to sustained policy adaptation. The single-agent controller outperforms the static baseline, but is worse than the proposed design; simply introducing an LLM is insufficient for reliable long-term configuration.
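As a worked illustration of the metric, the sketch below computes $G_{\mathrm{comm}}^{(n)}$ from per-slot rates and energy efficiencies and accumulates it across intervals, which is the quantity plotted in Fig. 3. The function name and all numerical values are made up for the example and are not taken from the experiments.

```python
from itertools import accumulate


def comm_utility(rates, ees, r_ref, ee_ref, lam_r, lam_ee):
    """G_comm for one interval: weighted slot-averages of normalized rate and EE."""
    num_slots = len(rates)
    rate_term = sum(r / r_ref for r in rates) / num_slots
    ee_term = sum(e / ee_ref for e in ees) / num_slots
    return lam_r * rate_term + lam_ee * ee_term


# Two toy intervals with two slots each (illustrative numbers only).
intervals = [
    {"rates": [2.0, 4.0], "ees": [1.0, 3.0]},
    {"rates": [4.0, 4.0], "ees": [2.0, 2.0]},
]
per_interval = [comm_utility(iv["rates"], iv["ees"], r_ref=2.0, ee_ref=2.0,
                             lam_r=0.5, lam_ee=0.5) for iv in intervals]
cumulative = list(accumulate(per_interval))
print(per_interval, cumulative)
```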

![Refer to caption](https://arxiv.org/html/2606.24416v1/x3.png)\(a\)Normalized sum\-rate component\.
![Refer to caption](https://arxiv.org/html/2606.24416v1/x4.png)\(b\)Normalized energy\-efficiency component\.

Figure 4: Policy-following KPI responses of different upper-level controllers under sustained operator policy regimes.

Fig. [4](https://arxiv.org/html/2606.24416#S4.F4) examines whether the upper-level controller follows the active operator policy at the level of individual KPI components. Unlike Fig. [3](https://arxiv.org/html/2606.24416#S4.F3), which reports the cumulative communication utility, Fig. [4](https://arxiv.org/html/2606.24416#S4.F4) plots the two normalized terms that constitute $G_{\mathrm{comm}}^{(n)}$, namely $\frac{1}{T_{s}}\sum_{t\in\mathcal{T}^{(n)}}\frac{R^{t}}{R_{\mathrm{ref}}}$ and $\frac{1}{T_{s}}\sum_{t\in\mathcal{T}^{(n)}}\frac{\mathrm{EE}^{t}}{\mathrm{EE}_{\mathrm{ref}}}$. The piecewise-stationary policy sequence is divided into three regimes, where Regimes I and III emphasize communication-oriented operation, while Regime II corresponds to an energy-saving operating tendency.

As shown in Fig\.[4](https://arxiv.org/html/2606.24416#S4.F4)\(a\), Agentic\-LTPO increases the normalized sum\-rate more effectively in the communication\-oriented regimes\. In Fig\.[4](https://arxiv.org/html/2606.24416#S4.F4)\(b\), the energy\-efficiency is maintained at a high level during the energy\-saving regime among the adaptive controllers\. The static strategy also yields high energy efficiency because it is conservative, but it does not provide the corresponding sum\-rate response in Regimes I and III\. Hence, the result should be interpreted as a policy\-following behavior rather than a single\-metric dominance claim: Agentic\-LTPO changes the dominant KPI response according to the active regime, whereas the single\-agent controller and the ablated variants show weaker or less coordinated responses across the two KPI dimensions\. Agentic\-LTPO translates sustained policy regimes into both interpretable configuration changes and corresponding KPI\-level responses\.

![Refer to caption](https://arxiv.org/html/2606.24416v1/x5.png)

Figure 5: Policy-compliance operating points of different adaptive upper-level controllers.

To quantify the above policy-following behavior, Fig. [5](https://arxiv.org/html/2606.24416#S4.F5) summarizes the intent satisfaction ratio and the corresponding target-KPI margin over the sustained policy sequence. For each large interval, the target KPI is selected according to the active policy regime: The normalized sum-rate metric is used in the throughput-oriented regimes, whereas the normalized energy-efficiency metric is used in the energy-saving regime. An interval is counted as intent-satisfied if the selected target KPI exceeds its regime-specific threshold. The target-KPI margin is the difference between the selected normalized KPI and the corresponding threshold, averaged over all intervals. The thresholds are used only to operationalize the natural-language policy into measurable compliance events, rather than to define a new optimization objective.

As shown in Fig\.[5](https://arxiv.org/html/2606.24416#S4.F5), Agentic\-LTPO lies in the upper\-right region, indicating that it achieves both a higher overall intent satisfaction ratio and a larger positive target\-KPI margin\. The single\-agent controller and the ablated variants can satisfy part of the policy sequence, but their margins are smaller, which implies weaker headroom relative to the active target KPI\. Complementing the interval\-wise KPI response in Fig\.[4](https://arxiv.org/html/2606.24416#S4.F4), Fig\.[5](https://arxiv.org/html/2606.24416#S4.F5)provides a compact compliance\-level summary of whether the response is aligned with the active operator intent\.
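The two compliance metrics summarized in Fig. 5 follow directly from their definitions. The sketch below assumes a hypothetical record format of `(regime, rate_term, ee_term)` per interval and hypothetical regime labels and thresholds; none of these names or numbers come from the paper.

```python
from typing import Dict, Sequence, Tuple

# One record per large interval: (regime label, normalized sum-rate term,
# normalized energy-efficiency term). Labels are illustrative assumptions.
Interval = Tuple[str, float, float]


def compliance_metrics(
    intervals: Sequence[Interval],
    thresholds: Dict[str, float],
) -> Tuple[float, float]:
    """Return (intent satisfaction ratio, mean target-KPI margin)."""
    if not intervals:
        raise ValueError("at least one interval is required")
    margins = []
    for regime, rate_term, ee_term in intervals:
        if regime == "throughput":
            target = rate_term      # throughput-oriented regime
        elif regime == "energy_saving":
            target = ee_term        # energy-saving regime
        else:
            raise ValueError(f"unknown regime: {regime}")
        margins.append(target - thresholds[regime])
    # An interval is intent-satisfied if the target KPI exceeds its threshold.
    satisfied = sum(1 for m in margins if m > 0)
    return satisfied / len(margins), sum(margins) / len(margins)


# Toy policy sequence; all values are made up.
toy_intervals = [
    ("throughput", 1.25, 0.5),
    ("energy_saving", 0.5, 1.0),
    ("throughput", 0.75, 0.5),
    ("throughput", 1.5, 0.5),
]
toy_thresholds = {"throughput": 1.0, "energy_saving": 0.75}
ratio, margin = compliance_metrics(toy_intervals, toy_thresholds)
print(f"intent satisfaction ratio: {ratio:.2f}, mean margin: {margin:.4f}")
```

Note that the margin is averaged over all intervals, including the unsatisfied ones, so a controller with a high satisfaction ratio can still show a small margin if it only barely clears the thresholds.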

![Refer to caption](https://arxiv.org/html/2606.24416v1/x6.png)\(a\)Final utility–overhead tradeoff\.
![Refer to caption](https://arxiv.org/html/2606.24416v1/x7.png)\(b\)Cumulative utility–overhead trajectory\.

Figure 6: Token-based upper-level reasoning overhead under the random policy setting.

Fig. [6](https://arxiv.org/html/2606.24416#S4.F6) reports the token-side reasoning overhead associated with the random-policy experiment in Fig. [3](https://arxiv.org/html/2606.24416#S4.F3)(a). The $x$-axis uses the token-based proxy $\bar{J}_{\mathrm{tok}}$, computed from the recorded input and output token counts of the upper-level controller; the $y$-axis keeps the communication-side metric $G_{\mathrm{comm}}^{(n)}$ unchanged. As shown in Fig. [6](https://arxiv.org/html/2606.24416#S4.F6)(a), Agentic-LTPO achieves the largest cumulative $G_{\mathrm{comm}}$ but also incurs a larger token cost than the single-agent controller and the static strategy. This is expected, since the proposed architecture invokes policy interpretation, retrieval-assisted planning, and critic-based verification. Agentic-LTPO outperforms Agentic-LTPO without RAG even though their token-cost proxies are comparable, indicating that the communication-utility gain is not merely caused by more tokens. Fig. [6](https://arxiv.org/html/2606.24416#S4.F6)(b) shows the same tradeoff over the entire simulation horizon: The proposed method accumulates higher reasoning overhead as the intervals proceed, and the additional cost contributes to a higher final communication utility. Overall, Agentic-LTPO exhibits a utility–overhead tradeoff: The multi-agent reasoning process is more expensive than a monolithic or static controller, but yields more effective long-term configuration under varying operator policies.

![Refer to caption](https://arxiv.org/html/2606.24416v1/x8.png)

Figure 7: Mean–variability tradeoff of interval-level communication utility under the random policy setting.

To further examine whether the long-term gain in Fig. [3](https://arxiv.org/html/2606.24416#S4.F3)(a) is obtained through unstable interval-level behavior, Fig. [7](https://arxiv.org/html/2606.24416#S4.F7) plots the mean of $G_{\mathrm{comm}}^{(n)}$ against its standard deviation across intervals under the random policy setting. Agentic-LTPO attains the largest mean interval-level communication utility, while its variability remains lower than that of the single-agent controller and Agentic-LTPO without the critic. Agentic-LTPO without RAG has a slightly smaller standard deviation but produces a much lower mean utility, suggesting less responsive update behavior rather than a better operating point. The proposed design improves the mean–variability tradeoff: Retrieval-assisted grounding and critic-guided verification allow the upper level to make more effective adaptations, while avoiding the larger fluctuations observed in the monolithic single-agent controller and the no-critic variant.
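Each operating point in a mean–variability plot of this kind is a (standard deviation, mean) pair computed from one controller's per-interval utilities. The sketch below uses Python's standard library; the population standard deviation is an assumption made here, since the text does not say whether the population or sample form is used, and the controller names and values are made up.

```python
from statistics import fmean, pstdev
from typing import Dict, Sequence, Tuple


def mean_variability(g_comm: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, population std) of interval-level utilities."""
    values = list(g_comm)
    if not values:
        raise ValueError("at least one interval is required")
    return fmean(values), pstdev(values)


# Hypothetical per-interval utilities for two controllers.
toy_logs: Dict[str, Sequence[float]] = {
    "controller_a": [1.0, 2.0, 3.0, 2.0],
    "controller_b": [1.5, 1.5, 1.5, 1.5],
}
for name, series in toy_logs.items():
    mean, std = mean_variability(series)
    print(f"{name}: mean={mean:.3f}, std={std:.3f}")
```

A point further up (higher mean) and further left (lower standard deviation) corresponds to a better tradeoff; a low standard deviation combined with a low mean, as in the no-RAG case discussed above, is not an improvement.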

![Refer to caption](https://arxiv.org/html/2606.24416v1/x9.png)\(a\)Average target robust QoS level\.
![Refer to caption](https://arxiv.org/html/2606.24416v1/x10.png)\(b\)Average AP power\-budget exposure\.

Figure 8: Configuration trajectories generated by different upper-level controllers under sustained operator policy regimes.

To understand the cause of the above-mentioned cumulative communication-utility gain, Fig. [8](https://arxiv.org/html/2606.24416#S4.F8) plots the interval-level configuration variables generated by the upper-level controllers, including the average target robust QoS level $\frac{1}{K}\sum_{k\in\mathcal{K}}\Gamma_{k}^{(n)}$ and the average AP power-budget exposure factor $\frac{1}{L}\sum_{\ell\in\mathcal{L}}\phi_{\ell}^{(n)}$. These curves provide a direct view of how policy semantics are translated into lower-level optimization parameters. In Regime I, Agentic-LTPO increases the average target robust QoS more aggressively, which is consistent with the throughput-oriented policy intent. In Regime II, it sharply reduces the AP power-budget exposure factor, thereby enforcing a more conservative power-allocation regime under the energy-saving intent. When the throughput-oriented policy returns in Regime III, both configuration variables are restored toward a more communication-favorable operating point. The single-agent controller and the ablated variants also exhibit partial adaptation, but their configuration changes are less responsive and less coordinated across the QoS and AP-exposure dimensions. This observation supports the role of retrieval-assisted grounding and critic-guided refinement in producing policy-consistent upper-level configurations.

![Refer to caption](https://arxiv.org/html/2606.24416v1/x11.png)\(a\)Cumulative communication utility\.
![Refer to caption](https://arxiv.org/html/2606.24416v1/x12.png)\(b\)Regime\-wise average communication utility\.

Figure 9: Ablation study on the contributions of retrieval-assisted grounding and critic-guided refinement.

Fig. [9](https://arxiv.org/html/2606.24416#S4.F9) evaluates the contribution of the main upper-level components. Fig. [9](https://arxiv.org/html/2606.24416#S4.F9)(a) reports the cumulative communication utility $\sum_{n=1}^{N}G_{\mathrm{comm}}^{(n)}$; Fig. [9](https://arxiv.org/html/2606.24416#S4.F9)(b) reports the average $G_{\mathrm{comm}}^{(n)}$ within each sustained policy regime. Compared with the single-agent controller, Agentic-LTPO obtains a larger cumulative gain, showing that simply mapping the policy and environment summary to a configuration is insufficient for effective long-term adaptation. Removing the RAG module or the critic also degrades the cumulative utility, indicating that both historical evidence reuse and planner–critic verification contribute to the communication utility. The regime-wise comparison further shows that the advantage is most visible in Regimes I and III, where the policy emphasizes communication-oriented operation and the upper level must actively restore a more aggressive configuration. In Regime II, the utility gap is smaller because all adaptive controllers move toward a more conservative configuration under the energy-saving intent. These results are consistent with the configuration trajectories in Fig. [8](https://arxiv.org/html/2606.24416#S4.F8) and support the role of the proposed multi-agent structure in coordinating policy interpretation, retrieval grounding, and critic-guided refinement.
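The two quantities reported in Fig. 9, the running sum of $G_{\mathrm{comm}}^{(n)}$ and its per-regime average, can be computed from an interval log as sketched below. The function name, the regime labels, and the toy values are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def cumulative_and_regime_average(
    g_comm: Sequence[float],
    regimes: Sequence[str],
) -> Tuple[List[float], Dict[str, float]]:
    """Return (running cumulative utility, average utility per regime)."""
    if len(g_comm) != len(regimes):
        raise ValueError("one regime label per interval is required")
    running: List[float] = []
    total = 0.0
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for g, regime in zip(g_comm, regimes):
        total += g
        running.append(total)       # partial sum up to this interval
        sums[regime] += g
        counts[regime] += 1
    averages = {r: sums[r] / counts[r] for r in sums}
    return running, averages


# Toy log with two intervals per regime; values are made up.
toy_g = [1.0, 3.0, 0.5, 1.5, 2.0, 4.0]
toy_regimes = ["I", "I", "II", "II", "III", "III"]
running, averages = cumulative_and_regime_average(toy_g, toy_regimes)
print("final cumulative utility:", running[-1])
print("regime-wise averages:", averages)
```

The last element of the running list is the cumulative utility $\sum_{n=1}^{N}G_{\mathrm{comm}}^{(n)}$ used for the headline comparison.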

![Refer to caption](https://arxiv.org/html/2606.24416v1/x13.png)\(a\)Cumulative communication utility\.
![Refer to caption](https://arxiv.org/html/2606.24416v1/x14.png)\(b\)Mean target\-KPI margin\.

Figure 10: Impact of raw natural-language and oracle structured-policy inputs on upper-level policy adaptation.

Fig. [10](https://arxiv.org/html/2606.24416#S4.F10) compares the cumulative communication utility and target-KPI margin under raw natural-language policy input and oracle structured-policy input. Since the oracle structured-policy input removes the ambiguity of natural language while keeping the downstream controller unchanged, the gap between the two input formats reflects the sensitivity of the controller to language grounding. As shown in Fig. [10](https://arxiv.org/html/2606.24416#S4.F10)(a), Agentic-LTPO maintains a higher cumulative communication utility than the single-agent controller under both input formats. Replacing the raw input with the oracle structured-policy input slightly increases the cumulative utility of Agentic-LTPO, with only a $1.9\%$ gain over the raw-input case, while the corresponding gain for the single-agent controller is $6.8\%$. This indicates that the monolithic controller is more sensitive to the input representation. Fig. [10](https://arxiv.org/html/2606.24416#S4.F10)(b) shows the same trend from the policy-compliance perspective: The oracle structured-policy input increases the mean target-KPI margin for both controllers, while Agentic-LTPO retains a larger positive margin. In this sense, the gain of Agentic-LTPO is not solely due to stronger text processing; it also comes from the structured grounding and multi-agent decision flow.
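For reference, the percentage gains quoted for Fig. 10(a) can be read as relative improvements of the cumulative utility, assuming they are measured against the raw-input case of the same controller. The symbols $U_{\mathrm{raw}}$, $U_{\mathrm{oracle}}$, and $\Delta_{\mathrm{oracle}}$ are introduced here only for illustration and are not notation from the paper:

$$
\Delta_{\mathrm{oracle}} = \frac{U_{\mathrm{oracle}} - U_{\mathrm{raw}}}{U_{\mathrm{raw}}} \times 100\%, \qquad U_{(\cdot)} = \sum_{n=1}^{N} G_{\mathrm{comm}}^{(n)} .
$$

Under this reading, $\Delta_{\mathrm{oracle}} = 1.9\%$ for Agentic-LTPO and $6.8\%$ for the single-agent controller, so a smaller value indicates lower sensitivity to how the policy is represented.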

![Refer to caption](https://arxiv.org/html/2606.24416v1/x15.png)\(a\)Cumulative communication utility\.
![Refer to caption](https://arxiv.org/html/2606.24416v1/x16.png)\(b\)QoS violation ratio\.

Figure 11: Service-load robustness under different AP densities.

Fig. [11](https://arxiv.org/html/2606.24416#S4.F11) evaluates the behavior of different upper-level controllers under a service-load stress test by varying the number of APs, $L$, while keeping the user load fixed. Fig. [11](https://arxiv.org/html/2606.24416#S4.F11)(a) reports the accumulated communication utility $\sum_{n=1}^{N}G_{\mathrm{comm}}^{(n)}$; Fig. [11](https://arxiv.org/html/2606.24416#S4.F11)(b) reports the QoS violation ratio, defined as the proportion of user-slot instances in which the achieved robust QoS lower bound is below the configured target. When $L$ is small, the lower-level feasible region is more restrictive and the quality of the upper-level configuration is more important. In this regime, Agentic-LTPO achieves a larger cumulative $G_{\mathrm{comm}}$ and a lower QoS violation ratio than the ablated and single-agent controllers, indicating that the proposed retrieval-assisted and critic-guided decision flow produces more reliable lower-level configurations under sparser AP deployments. As $L$ increases, the additional AP resources enlarge the feasible solution region and it becomes easier for all adaptive controllers to satisfy the QoS targets, hence reducing the violation ratio and narrowing the performance gap. The static strategy remains weak since it cannot adjust the lower-level configuration to the stressed operating condition.
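The QoS violation ratio defined above is a simple count over user-slot instances. A minimal sketch follows, assuming the achieved robust QoS lower bounds and the configured targets are available as nested lists indexed `[slot][user]`; this layout and the toy values are assumptions, not the paper's data format.

```python
from typing import Sequence


def qos_violation_ratio(
    achieved: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
) -> float:
    """Fraction of (slot, user) pairs whose achieved bound is below target."""
    violations = 0
    instances = 0
    if len(achieved) != len(targets):
        raise ValueError("achieved and targets must cover the same slots")
    for achieved_slot, target_slot in zip(achieved, targets):
        if len(achieved_slot) != len(target_slot):
            raise ValueError("each slot needs one target per user")
        for a, g in zip(achieved_slot, target_slot):
            instances += 1
            if a < g:               # strictly below the configured target
                violations += 1
    if instances == 0:
        raise ValueError("no user-slot instances provided")
    return violations / instances


# Two slots, two users; all values are made up.
toy_achieved = [[1.0, 2.0], [0.5, 3.0]]
toy_targets = [[1.0, 2.5], [1.0, 2.5]]
print("QoS violation ratio:", qos_violation_ratio(toy_achieved, toy_targets))
```

Because the targets are themselves set by the upper-level controller, a low ratio can reflect either better lower-level solutions or less demanding targets, which is why Fig. 11 reports it alongside the cumulative utility.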

## V Conclusion

In this paper, we presented a new agentic AI framework, named Agentic-LTPO, to tackle long-term performance optimization for wireless physical layer tasks. By decoupling agentic AI decisions from instantaneous optimization, we developed a nested bilevel framework to handle dynamic operator policies, together with stringent real-time constraints. To illustrate the advantages of Agentic-LTPO, we considered a CF-MIMO beamforming scenario, where an upper-level multi-agent system parameterizing the operator’s changing policies and KPIs was designed to adapt the lower-level beamforming problem configuration based on policy inputs, environment summaries, and historical experience. Numerical results demonstrated that Agentic-LTPO improves cumulative communication utility by $57.2\%$ over the static baseline, while limiting the cumulative-utility gap between raw natural-language and oracle structured-policy inputs to $1.9\%$.

## References

- [1] (2020) Making cell-free massive MIMO competitive with MMSE processing and centralized implementation. IEEE Trans. Wireless Commun. 19(1), pp. 77–90.
- [2] Y. Cai, P. Cheng, Z. Chen, M. Ding, B. Vucetic, and Y. Li (2024) Deep reinforcement learning for online resource allocation in network slicing. IEEE Trans. Mobile Comput. 23(6), pp. 7099–7116. External Links: [Document](https://dx.doi.org/10.1109/TMC.2023.3328950).
- [3] H. P. Cavagna, A. Proia, G. Madella, G. B. Esposito, F. Antici, D. Cesarini, Z. Kiziltan, and A. Bartolini (2026) SweetSpot: an analytical model for predicting energy efficiency of LLM inference. In Proc. ACM/SPEC Int. Conf. Perform. Eng. External Links: [Document](https://dx.doi.org/10.1145/3777884.3797011), 2602.05695.
- [4] A. Chowdhury, G. Verma, C. Rao, A. Swami, and S. Segarra (2021) Unfolding WMMSE using graph neural networks for efficient power allocation. IEEE Trans. Wireless Commun. 20(9), pp. 6004–6017.
- [5] A. Chowdhury, G. Verma, A. Swami, and S. Segarra (2024) Deep graph unfolding for beamforming in MU-MIMO interference networks. IEEE Trans. Wireless Commun. 23(5), pp. 4889–4903.
- [6] CVX Research, Inc. (2012-08) CVX: Matlab software for disciplined convex programming, version 2.0. External Links: [Link](https://cvxr.com/cvx.).
- [7] M. Elkael, S. D’Oro, L. Bonati, M. Polese, Y. Lee, K. Furueda, and T. Melodia (2025) AgentRAN: an agentic AI architecture for autonomous control of open 6G networks. arXiv preprint arXiv:2508.17778.
- [8] G. Femenias and F. Riera-Palou (2025) From cells to freedom: 6G’s evolutionary shift with cell-free massive MIMO. IEEE Trans. Mobile Comput. 24(2), pp. 812–829. External Links: [Document](https://dx.doi.org/10.1109/TMC.2024.3468003).
- [9] D. Gesbert, S. Hanly, H. Huang, S. S. Shitz, O. Simeone, and W. Yu (2010) Multi-cell MIMO cooperative networks: a new look at interference. IEEE J. Sel. Areas Commun. 28(9), pp. 1380–1408.
- [10] C. Hang, P. Yu, R. Morabito, and C. Tan (2024) Large language models meet next-generation networking technologies: a review. Future Internet 16(10), pp. 365. External Links: [Document](https://dx.doi.org/10.3390/fi16100365).
- [11] C. Hang, P. Yu, R. Morabito, and C. Tan (2024) Large language models meet next-generation networking technologies: a review. Future Internet 16(10), pp. 365.
- [12] H. He, S. Jin, C. Wen, F. Gao, G. Y. Li, and Z. Xu (2019) Model-driven deep learning for physical layer communications. IEEE Wireless Commun. 26(5), pp. 77–83.
- [13] Z. He, A. Gottipati, L. Qiu, X. Luo, K. Xu, Y. Yang, and F. Y. Yan (2024) Designing network algorithms via large language models. In Proc. ACM Workshop Hot Topics Networks, pp. 205–212.
- [14] G. Interdonato, E. Björnson, H. Q. Ngo, P. Frenger, and E. G. Larsson (2019) Ubiquitous cell-free massive MIMO communications. EURASIP J. Wireless Commun. Netw. 2019, pp. 197.
- [15] G. Interdonato, M. Karlsson, E. Björnson, and E. G. Larsson (2020) Local partial zero-forcing precoding for cell-free massive MIMO. IEEE Trans. Wireless Commun. 19(7), pp. 4758–4774.
- [16] E. G. Larsson, O. Edfors, F. Tufvesson, and T. L. Marzetta (2014) Massive MIMO for next generation wireless systems. IEEE Commun. Mag. 52(2), pp. 186–195.
- [17] A. Leivadeas and M. Falkner (2023) A survey on intent-based networking. IEEE Commun. Surveys Tuts. 25(1), pp. 625–655.
- [18] H. Li, Y. Wu, and D. Simeonidou (2026) Multi-agentic AI for conflict-aware rapp policy orchestration in open RAN. arXiv preprint arXiv:2603.07375.
- [19] L. Liang, H. Ye, Y. Sheng, O. Wang, J. Wang, S. Jin, and G. Y. Li (2025) Large language models for wireless communications: from adaptation to autonomy. arXiv preprint arXiv:2507.21524.
- [20] L. Liang, H. Ye, Y. Sheng, O. Wang, J. Wang, S. Jin, and G. Y. Li (2026) Large language models for wireless communications: from adaptation to autonomy. IEEE Commun. Mag. Note: Early Access. External Links: [Document](https://dx.doi.org/10.1109/MCOM.001.2500476).
- [21] L. Liang, H. Ye, G. Yu, and G. Y. Li (2020) Deep-learning-based wireless resource allocation with application to vehicular networks. Proc. IEEE 108(2), pp. 341–356. External Links: [Document](https://dx.doi.org/10.1109/JPROC.2019.2957798).
- [22] L. Liang, H. Ye, G. Yu, and G. Y. Li (2020) Deep-learning-based wireless resource allocation with application to vehicular networks. Proc. IEEE 108(2), pp. 341–356.
- [23] D. M. Manias, A. Chouman, and A. Shami (2024) Towards intent-based network management: large language models for intent extraction in 5G core networks. In Proc. Int. Conf. Des. Reliable Commun. Netw.
- [24] Q. Mao, F. Hu, and Q. Hao (2018) Deep learning for intelligent wireless networks: a comprehensive survey. IEEE Commun. Surveys Tuts. 20(4), pp. 2595–2621.
- [25] T. L. Marzetta (2010) Noncooperative cellular wireless with unlimited numbers of base station antennas. IEEE Trans. Wireless Commun. 9(11), pp. 3590–3600.
- [26] C. V. Nahum, S. D’Oro, P. Batista, C. B. Both, K. V. Cardoso, A. Klautau, and T. Melodia (2026) Intent-based radio scheduler for RAN slicing: learning to deal with different network scenarios. IEEE Trans. Mobile Comput. 25(3), pp. 3229–3246. External Links: [Document](https://dx.doi.org/10.1109/TMC.2025.3614453).
- [27] H. Navidan, M. Cheraghinia, J. Fontaine, M. Seif, E. De Poorter, H. V. Poor, I. Moerman, and A. Shahid (2026) Toward autonomous O-RAN: a multi-scale agentic AI framework for real-time network control and management. arXiv preprint arXiv:2602.14117.
- [28] H. Navidan, M. Cheraghinia, J. Fontaine, M. Seif, E. De Poorter, H. V. Poor, I. Moerman, and A. Shahid (2026) Toward autonomous O-RAN: a multi-scale agentic AI framework for real-time network control and management. arXiv preprint arXiv:2602.14117. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2602.14117).
- [29] E. Nayebi, A. Ashikhmin, T. L. Marzetta, H. Yang, and B. D. Rao (2017) Precoding and power optimization in cell-free massive MIMO systems. IEEE Trans. Wireless Commun. 16(7), pp. 4445–4459.
- [30] H. Q. Ngo, A. Ashikhmin, H. Yang, E. G. Larsson, and T. L. Marzetta (2017) Cell-free massive MIMO versus small cells. IEEE Trans. Wireless Commun. 16(3), pp. 1834–1850. External Links: [Document](https://dx.doi.org/10.1109/TWC.2017.2655515).
- [31] L. Pang, C. Yang, D. Chen, Y. Song, and M. Guizani (2020) A survey on intent-driven networks. IEEE Access 8, pp. 22862–22873.
- [32] L. Pellaco, M. Bengtsson, and J. Jaldén (2022) Matrix-inverse-free deep unfolding of the weighted MMSE beamforming algorithm. IEEE Open J. Commun. Soc. 3, pp. 65–81.
- [33] X. Peng, Y. Liu, Y. Cang, C. Cao, and M. Chen (2025) LLM-optira: LLM-driven optimization of resource allocation for non-convex problems in wireless communications. arXiv preprint arXiv:2505.02091.
- [34] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom (2023) Toolformer: language models can teach themselves to use tools. In Adv. Neural Inf. Process. Syst.
- [35] Y. Shen, Y. Shi, J. Zhang, and K. B. Letaief (2020) LORM: learning to optimize for resource management in wireless networks with few training samples. IEEE Trans. Wireless Commun. 19(1), pp. 665–679.
- [36] Q. Shi, M. Razaviyayn, Z. Luo, and C. He (2011) An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel. IEEE Trans. Signal Process. 59(9), pp. 4331–4340. External Links: [Document](https://dx.doi.org/10.1109/TSP.2011.2147784).
- [37] M. H. Shokouhi and V. W. S. Wong (2026) Agentic AI for intent-driven optimization in cell-free O-RAN. arXiv preprint arXiv:2602.22539.
- [38] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos (2018) Learning to optimize: training deep neural networks for interference management. IEEE Trans. Signal Process. 66(20), pp. 5438–5453. External Links: [Document](https://dx.doi.org/10.1109/TSP.2018.2866382).
- [39] J. Wang, L. Guo, J. Wu, C. Yan, H. Sun, L. Zhang, Z. Zhuang, Q. Qi, and J. Liao (2025) Hierarchical index retrieval-driven wireless network intent translation with LLM. IEEE Trans. Mobile Comput. 24(10), pp. 9837–9851. External Links: [Document](https://dx.doi.org/10.1109/TMC.2025.3564937).
- [40] L. Wang, K. Wang, C. Pan, and N. Aslam (2023) Joint trajectory and passive beamforming design for intelligent reflecting surface-aided UAV communications: a deep reinforcement learning approach. IEEE Trans. Mobile Comput. 22(11), pp. 6543–6553. External Links: [Document](https://dx.doi.org/10.1109/TMC.2022.3200998).
- [41] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari (2018) Deep reinforcement learning for dynamic multichannel access in wireless networks. IEEE Trans. Cogn. Commun. Netw. 4(2), pp. 257–265.
- [42] Q. Wu, G. Bansal, J. Zhang, et al. (2024) AutoGen: enabling next-gen LLM applications via multi-agent conversation. In Conf. Lang. Model.
- [43] H. Yang, N. Cheng, R. Sun, W. Quan, R. Chai, K. Aldubaikhy, A. Alqasir, and X. Shen (2024) Knowledge-driven resource allocation for wireless networks: a WMMSE unrolled graph neural network approach. IEEE Internet Things J. 11(10).
- [44] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2023) ReAct: synergizing reasoning and acting in language models. In Int. Conf. Learn. Represent.
- [45] H. Zhou, C. Hu, D. Yuan, Y. Yuan, D. Wu, X. Liu, and J. C. Zhang (2024) Large language model (LLM)-enabled in-context learning for wireless network optimization: a case study of power control. arXiv preprint arXiv:2408.00214.