HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster
Summary
This paper proposes HADT, a transformer-based architecture for autonomous resource management in heterogeneous satellite clusters for Earth observation, using differential attention and relational tokenization. Experiments show significant improvements over baselines and strong adaptability to varying cluster sizes.
# HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster
Source: [https://arxiv.org/html/2605.31023](https://arxiv.org/html/2605.31023)
Muhammad Anwar Masum, Siyi Hu, Mahardhika Pratama, Zehong Cao, Ryszard Kowalczyk

1. School of Computer Science and Information Technology, Adelaide University, Adelaide, 5095, SA, Australia. Email: {mohamad.hady; muhammadanwar.masum; dhika.pratama; jimmy.cao; ryszard.kowalczyk}@adelaide.edu.au
2. School of Electrical Engineering, Computing and Mathematical Sciences (EECMS), Curtin University, Kent St, Bentley, 6102, WA, Australia. Email: siyi.hu@curtin.edu.au
3. Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland.
###### Abstract
This work addresses the problem of autonomous resource management in heterogeneous satellite clusters conducting Earth Observation (EO) missions, including optical and Synthetic Aperture Radar (SAR) satellites. In autonomous operation mode, satellites are equipped with intelligent capabilities that enable real-time decision-making based on the latest conditions while requiring minimal interaction with ground operators. Traditional scheduling approaches typically rely on mathematical models of the satellite mission and its resource management, which are then solved with optimization algorithms. However, such solutions become less effective when the underlying models are unavailable, overly complex, or inaccurate due to the dynamic changes and uncertainties inherent in the space mission environment. A promising alternative is to reformulate the problem as a sequential decision-making process and apply model-free reinforcement learning techniques to enable adaptive, real-time resource management. To this end, we propose a novel transformer-based architecture tailored to autonomous EO missions of heterogeneous satellite clusters, with relational observation-action tokenization and a differential attention mechanism. Our experimental results demonstrate significant performance improvements over the available baselines. Moreover, the proposed architecture exhibits strong adaptability and transferability with respect to varying numbers of satellite clusters. (This work has been accepted at ECML-PKDD'26.)
## 1 Introduction
Coordinating and managing multiple Low Earth Orbit (LEO) satellites autonomously remains a challenge due to the dynamic, uncertain, and resource-constrained nature of satellite missions [wang2020agile, chen2019mixedILP, stephenson2023optimal, pan2023dense]. Unlike pre-planned mission operations, autonomous EO missions require each satellite to make real-time decisions under dynamic conditions and resource limitations while maintaining coordinated behaviour across the entire constellation [li2024mission, yang2024objective]. These challenges stem from several interacting factors: uncertainty in observation conditions (e.g., variations in target priority and cloud coverage affecting data acquisition quality); limited onboard resources such as power, data storage, and attitude control; and the intrinsic non-stationarity of multi-agent settings, where each satellite's actions continuously alter the shared environment and other agents' states [araguz2018applying, yao2019task]. Beyond these general coordination difficulties, EO missions increasingly employ heterogeneous satellite clusters that combine different payload types such as Synthetic Aperture Radar (SAR) and optical sensors [cohen2017novasar, dong2024optisar, alzubairi2024spacecraft]. The SAR sensor can operate under cloud cover and in dark conditions, while the optical sensor offers high-resolution imaging under clear conditions. This complementary capability substantially improves temporal coverage and observation reliability, yet it introduces new coordination and control complexity. The general scenario of the heterogeneous cluster used in this work is illustrated in Fig. [1](https://arxiv.org/html/2605.31023#S1.F1).
Figure 1: Illustration of a heterogeneous satellite cluster for an EO mission under different cloud conditions. The satellites are deployed in three different narrow orbits forming a cooperative formation. The SAR satellite covers regions with higher cloud coverage, which are a burden for the optical (OPT) satellite.

Managing such a mixed constellation thus requires not only efficient resource scheduling but also adaptive policies that account for diverse capabilities. Traditional optimization techniques, such as mixed-integer linear programming (MILP), have been explored for constellation scheduling and resource allocation [kim2024optimal]. These techniques are effective in static conditions, but they rely on predefined models and are limited in their ability to adapt to time-varying environmental and operational uncertainties. Model-free Reinforcement Learning (RL) provides a promising alternative by enabling agents to learn adaptive strategies through interaction with the environment, allowing satellites to handle imaging, energy usage, and data management under uncertainty without building a mathematical model of the system [herrmann2024single, stephenson2024bsk]. However, as EO missions scale from single satellites to cooperative clusters or constellations, the decision-making problem expands from isolated optimization to distributed coordination among multiple agents, a setting more naturally modelled by Multi-Agent Reinforcement Learning (MARL) [tang2024dynamic]. Existing MARL applications in satellite operations [herrmann2023reinforcement, stephenson2024reinforcement] have shown encouraging results but often rely on fully centralized training or continuous communication assumptions, which oversimplify real-world settings where access to other agents' information during execution is limited. The Centralized Training with Decentralized Execution (CTDE) paradigm [ning2024survey, hady2025multi] achieves a practical balance by enabling satellites to learn coordinated policies during centralized training while operating independently at execution time.
Despite these advances, current MARL frameworks largely assume homogeneous agents, where all satellites share identical dynamics and observation-action structures. Recent on-policy methods such as Multi-Agent Proximal Policy Optimization (MAPPO) have demonstrated stable performance across diverse MARL environments [yu2022surprising], but they are suited to the homogeneous-agent assumption. This assumption limits applicability to realistic EO missions, where sensor diversity and operational asymmetry fundamentally change the learning problem. Recent progress in heterogeneous-agent MARL, such as Heterogeneous-Agent PPO (HAPPO) [zhong2024heterogeneous], introduces separate value estimation and policy update mechanisms to address agent heterogeneity. Yet the performance and adaptability of these methods in physically grounded settings, and the potential for improving the policy architecture, remain unexplored. On difficult tasks in particular, the performance of this method can still be improved.

Therefore, we propose a transformer-based algorithm, HADT, to handle autonomous EO missions of heterogeneous satellite clusters. Our contributions are as follows:
- A problem formulation of heterogeneous satellite cluster resource management based on a Decentralized Partially Observable Markov Decision Process (Dec-POMDP). We propose scenarios with three different complexity levels that include randomness and uncertainty.
- A new transformer-based architecture for heterogeneous satellite clusters. We develop a new differential multi-head attention to handle noisy inputs. In addition, HADT is a general satellite policy model that is adaptable to multiple heterogeneous clusters. The code is made publicly available at [https://anonymous.4open.science/r/ECMLPKDD-2A50](https://anonymous.4open.science/r/ECMLPKDD-2A50), with a demonstration video of our experimental scenario.
- A token-based representation of agent observations that maps the observation entities to the action entities of each satellite.
The rest of this paper is structured as follows: Section 2 presents the preliminaries of this study, covering the problem formulation, the motivation, and a review of the current state of the art. Section 3 describes the proposed method for heterogeneous satellite cluster scenarios. Section 4 discusses our experimental evaluation and results. Finally, Section 5 concludes the paper and discusses future directions.
## 2 Preliminary
### 2.1 Autonomous Heterogeneous Multi Satellite Cluster Problem Formulation
In this subsection, we formulate a formal model of the Autonomous Heterogeneous Multi Satellite Cluster problem for Earth Observation (EO) missions. The objective is to capture as many unique high-priority targets as possible while in orbit. Previous works on single-satellite EO missions have formulated the problem as a sequential decision-making task in the well-known reinforcement learning framework, specifically as a Partially Observable Markov Decision Process (POMDP) [stephenson2024reinforcement, stephenson2024using]. Building upon this, we formally define the multi-satellite EO mission as a Decentralized POMDP (Dec-POMDP) given by the tuple $\mathcal{G}=\langle\mathcal{I},\mathcal{S},\mathcal{A},\mathcal{O},\mathcal{T},r,\mathcal{Z},\gamma\rangle$.

A cluster consists of three satellites (as illustrated in Fig. [1](https://arxiv.org/html/2605.31023#S1.F1)). Each satellite functions as an agent, forming a set of agents ($\mathcal{I}$), and makes a decision at each discrete time step ($t$) based on the current states ($\mathcal{S}$) and the agent's local observations ($\mathcal{O}$). The observation is the subset of the state that the agent can continuously observe, including battery level, onboard memory storage, reaction wheel speed (angular velocity), target priority, target opportunity window, cloud coverage forecast, ground station visibility windows, eclipse intervals, and simulation time. The simulation defines a finite set of possible actions ($\mathcal{A}$): 1) Charging, which involves reorienting the satellite toward the sun to maximize solar energy absorption and recharge its battery; 2) Downlinking, where the satellite transmits the collected EO image data whenever it has access to a ground station; 3) Desaturating, which ensures that the Reaction Wheels (RWs), the primary actuators for attitude control, operate within safe speed limits. If the RW speed approaches saturation, the satellite must execute a desaturation manoeuvre to maintain stable attitude control and prevent uncontrollable drift; 4) Capturing the $i$-th image target, where the satellite must orient its optical imaging sensor toward a selected target $i$ among the available targets on Earth and store the image in the on-board memory.
The instantaneous reward $r$ integrates three mission objectives (data acquisition, resource utilization, and safe operation) and is defined as:

$$r=\begin{cases}q_{i}-\rho_{t}+c_{i},&\text{if a target is successfully captured},\\ -\rho_{t}+\delta_{t},&\text{if any data is downlinked},\\ -100,&\text{if a failure occurs},\\ -\rho_{t},&\text{if only power is consumed},\\ 0,&\text{otherwise},\end{cases}\tag{1}$$

where $q_{i}\in(0,1)$ represents the priority of the target AoI at time $t$ and promotes selecting targets with greater mission importance.
Resource usage efficiency is encouraged through three supporting terms:

1) Battery power usage is defined as $\rho_{t}=\alpha\,\Delta Q_{t}\,(1-Q_{t})$ with $\Delta Q_{t}=Q_{t-1}-Q_{t}$, where $\rho_{t}$ penalizes excessive energy consumption ($\Delta Q_{t}$) scaled by a constant $\alpha$.

2) Maximizing data downlinking is encouraged through the feedback $\delta_{t}=\beta\,\Delta D_{t}$ with $\Delta D_{t}=D_{t}-D_{t-1}$, which rewards successful transmission of collected data. The amount of transferred data $\Delta D_{t}$ is multiplied by a scalar constant $\beta$.

3) Ensuring payload correctness, for either SAR or Optical (OPT) payloads under different cloud conditions:

$$c_{i}=\begin{cases}-1+\sigma,&\text{if }\sigma<0.5\text{ and captured by SAR},\\ \sigma,&\text{if }\sigma\geq 0.5\text{ and captured by SAR},\\ 1-\sigma,&\text{if }\sigma<0.5\text{ and captured by OPT},\\ -\sigma,&\text{if }\sigma\geq 0.5\text{ and captured by OPT},\end{cases}\tag{2}$$

where $\sigma\in(0,1)$ is the cloud coverage ratio. This term guides the SAR payload to be used only in cloudy conditions and the optical payload in clear conditions.
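The three supporting terms can be sketched in a few lines of Python. This is a minimal illustration of Eq. (2) and of the definitions of $\rho_t$ and $\delta_t$ above, not the authors' implementation: the function names, the payload labels `"SAR"` and `"OPT"`, and any values passed for $\alpha$ and $\beta$ are assumptions, since the text does not give them.

```python
def payload_term(sigma: float, payload: str) -> float:
    """Payload-correctness term c_i of Eq. (2) for cloud coverage ratio sigma."""
    if payload == "SAR":
        # SAR is rewarded under cloud (sigma >= 0.5) and penalized in clear sky.
        return sigma if sigma >= 0.5 else -1.0 + sigma
    if payload == "OPT":
        # The optical payload is rewarded in clear sky and penalized under cloud.
        return 1.0 - sigma if sigma < 0.5 else -sigma
    raise ValueError(f"unknown payload label: {payload!r}")


def battery_penalty(q_prev: float, q_now: float, alpha: float) -> float:
    """rho_t = alpha * (Q_{t-1} - Q_t) * (1 - Q_t); alpha is a hypothetical constant."""
    return alpha * (q_prev - q_now) * (1.0 - q_now)


def downlink_bonus(d_prev: float, d_now: float, beta: float) -> float:
    """delta_t = beta * (D_t - D_{t-1}); beta is a hypothetical constant."""
    return beta * (d_now - d_prev)
```

For instance, a SAR capture at $\sigma=0.8$ contributes $c_i=0.8$, whereas an optical capture of the same target contributes $c_i=-0.8$.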
If the satellite encounters a failure, a fault condition is triggered, represented as:
$$\text{Failure}=\left(b_{t}<m_{b}\;\vee\;\text{any}(\hat{\Omega}\geq\Omega_{max})\right),\tag{3}$$

where $m_{b}$ is the minimum battery level below which a failure is triggered. This ensures safe and continuous satellite operation.
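As a rough sketch of how the failure condition and the case structure of the reward fit together, consider the Python below. Several parts are assumptions rather than details from the paper: the event labels, the order in which the cases are checked, and the use of the absolute wheel speed (Eq. (3) is written without it, but wheel speeds can be negative).

```python
def is_failure(battery, min_battery, wheel_speeds, wheel_max):
    """Fault predicate in the spirit of Eq. (3).

    Assumption: saturation is tested on |speed| because wheel speeds are signed.
    """
    return battery < min_battery or any(abs(w) >= wheel_max for w in wheel_speeds)


def step_reward(event, q_i=0.0, rho_t=0.0, c_i=0.0, delta_t=0.0):
    """Case-wise reward of Eq. (1); `event` is a hypothetical label for the step outcome."""
    if event == "failure":
        return -100.0
    if event == "capture":
        return q_i - rho_t + c_i
    if event == "downlink":
        return -rho_t + delta_t
    if event == "power":
        return -rho_t
    return 0.0
```

A capture with $q_i=0.9$, $\rho_t=0.04$ and $c_i=0.8$ would yield a reward of about $1.66$, while any failure dominates with $-100$.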
Satellite Resource Constraints: A single satellite has two limited resources that are treated as constraints in our study: battery level ($b_{t}\in[B_{min},B_{max}]$) and data storage capacity ($d_{t}\in[D_{min},D_{max}]$) at any time step ($t$). At each time step, the satellite consumes electrical power, denoted as $c_{b,t}$, and stores data, denoted as $c_{d,t}$. The battery is rechargeable via a solar panel. To maximize battery charging, the satellite must adjust its attitude toward the sun, which may conflict with its target imaging orientation. Another constraint arises from attitude control, specifically the speed of the Reaction Wheels (RWs), denoted as $\hat{\Omega}\in[-\Omega_{max},\Omega_{max}]$. These wheels serve as the primary actuators for satellite attitude adjustments along the three axes ($x,y,z$). To avoid exceeding the maximum speed threshold, the satellite must periodically desaturate the wheels. The limited resource constraints are expressed as:

$$\sum_{t=0}^{\infty}c_{b,t}\leq b_{t},\;B_{min}\leq b_{t}\leq B_{max}\quad\text{and}\quad\sum_{t=0}^{\infty}c_{d,t}\leq d_{t},\;D_{min}\leq d_{t}\leq D_{max}.\tag{4}$$

These constraints are incorporated into the model as failure triggers (Eq. [3](https://arxiv.org/html/2605.31023#S2.E3)) in the reward function, resulting in penalties or negative rewards as feedback from the environment. Other constraints, such as the communication baud rate, also affect system performance. In this work, however, the baud rate is assumed to be fixed over time, as given by the transmitter specification.
### 2.2 Advantage of Model-free Reinforcement Learning Solution
In traditional EO satellite mission planning, optimization problems are commonly formulated as Mixed-Integer Linear Programming (MILP) models when the system dynamics, operational constraints, and objective functions are fully known and accurately characterized [chen2019mixedILP]. Under such deterministic and well-defined conditions, MILP provides a rigorous mathematical framework capable of generating globally optimal solutions for fixed planning horizons. However, in practical on-orbit autonomous operations, model inaccuracies, environmental uncertainties, and partially observable system states may degrade the effectiveness of model-based optimization. In such cases, model-free Reinforcement Learning (RL) offers a promising alternative, as it does not require explicit knowledge of the system transition model and instead learns decision policies directly through interaction with the environment. However, because an RL policy only approximates the optimal policy, the resulting scheduling solution is generally not the global optimum; it is referred to as a quasi-optimal solution.
Figure 2: RL and MILP performance comparison. RL can achieve results comparable to MILP in a simple case study. Thus, model-free RL with PPO is a potential solution for autonomous heterogeneous satellite clusters.

To confirm this hypothesis, we designed a simple model of satellite cluster scheduling as an initial study and comparison (the detailed mathematical model is provided in the Supplemental Document, Section 1). The cluster is homogeneous, with three optical satellites, and is simulated over a short scheduling period (400 seconds) with only five ground targets to be captured. The constraints used in this scenario are battery, memory storage, target opportunity windows, ground station opportunity windows, eclipse condition, and unique target capturing (no duplication). The resource availability is calculated for each time step (20-second period), and the opportunity time window information is collected by running a Basilisk simulation and then fed to Proximal Policy Optimization (PPO) [schulman2017PPO] as the RL inputs (observations). The policy outputs (action space) are charging, downlinking, and capturing the $i$-th target.

As shown in Fig. [2](https://arxiv.org/html/2605.31023#S2.F2), MILP solves the problem and achieves the optimal scheduling solution with a reward of 3.65 and a 100% completion rate (5/5 targets captured). The results also confirm that RL can perform comparably to MILP in a simple EO mission case after 5000 training episodes. When this preliminary study is extended to more complex missions with stochastic dynamics, RL can be more robust to modelling errors and unforeseen operational variations, making it suitable for autonomous and dynamic EO satellite resource management.
### 2.3 State of the Art of MARL Algorithm
Figure 3: HADT comparison with the baseline algorithms. Our main focus is to improve performance under uncertainty, randomness, and disturbance conditions with a heterogeneous-agent learning feature.

To solve the POMDP and Dec-POMDP of the autonomous EO mission, a model-free approach is selected because of its flexibility in learning the policy directly, especially when the system is complex and an explicit mathematical model is unavailable. In this work, three recent state-of-the-art on-policy MARL methods are selected for comparison, and they motivate our proposed method:
1) Multi-agent PPO (MAPPO): MAPPO is an extension of PPO designed specifically for multi-agent systems [yu2022surprising]. It combines centralized critics with decentralized policies to improve performance in MARL tasks. MAPPO uses a single centralized critic shared by all agents, allowing the evaluation of the global state to stabilize learning and mitigate non-stationarity:

$$V_{i}^{\text{centralized}}(s)\approx\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{i,t}\mid s_{0}=s\right],$$

where $s$ represents the global state, $\gamma$ is the discount factor, and $r_{i,t}$ is the reward at time step $t$. The loss function for the policy optimization in MAPPO is given by:

$$L^{\text{MAPPO}}(\theta_{i})=\mathbb{E}_{t}\left[\min\left(r_{t}(\theta_{i})\hat{A}_{t},\ \text{clip}(r_{t}(\theta_{i}),1-\epsilon,1+\epsilon)\hat{A}_{t}\right)\right],$$

where $r_{t}(\theta_{i})=\frac{\pi_{\theta_{i}}(a_{t}\mid o_{t})}{\pi_{\theta_{i}}^{\text{old}}(a_{t}\mid o_{t})}$ is the probability ratio, $\hat{A}_{t}$ is the global advantage function, and $\epsilon$ is the clipping parameter.
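The clipped objective can be illustrated with a short, dependency-free Python sketch. It takes log-probabilities rather than probabilities (a common convention, assumed here), and the default `eps=0.2` is an illustrative value, not one reported in the paper. The quantity is a surrogate to be maximized; implementations typically minimize its negative.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                   # probability ratio r_t(theta)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)     # clip(r, 1 - eps, 1 + eps)
        terms.append(min(ratio * adv, clipped * adv))       # pessimistic bound
    return sum(terms) / len(terms)
```

With a ratio of 2 and a positive advantage of 1, the clip caps the term at $1+\epsilon=1.2$; with a negative advantage of $-1$, the unclipped value $-2$ is kept, so the objective never rewards moving further than the clip range allows.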
2) Heterogeneous Agent PPO (HAPPO): This algorithm extends MAPPO by accounting for heterogeneous agents with distinct state-action spaces or roles and by using a sequential update scheme [zhong2024heterogeneous]. It uses individual advantage functions and decentralized policies while maintaining centralized critics. In HAPPO, the centralized value function is agent-specific to handle heterogeneous agents:

$$V_{i}^{\text{centralized}}(s)\approx\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{i,t}\mid s_{0}=s\right],$$

where $i$ denotes the agent index and $r_{i,t}$ is the reward specific to agent $i$. The loss function for HAPPO is:

$$L_{i}^{\text{HAPPO}}(\theta_{i})=\mathbb{E}_{t}\left[\min\left(r_{i,t}(\theta_{i})\hat{A}_{i,t},\ \text{clip}(r_{i,t}(\theta_{i}),1-\epsilon,1+\epsilon)\hat{A}_{i,t}\right)\right],\tag{5}$$

where $r_{i,t}(\theta_{i})=\frac{\pi_{\theta_{i}}(a_{i,t}\mid o_{i,t})}{\pi_{\theta_{i}}^{\text{old}}(a_{i,t}\mid o_{i,t})}$ and $\hat{A}_{i,t}$ is the advantage function of agent $i$. HAPPO follows the multi-agent advantage decomposition lemma and uses a sequential update mechanism, denoted as:

$$\hat{A}_{i_{1:m},t}(o_{t},a_{i_{1:m},t})=\sum_{j=1}^{m}\hat{A}_{i_{j}}\bigl(o_{t},a_{i_{1:j-1},t},a_{i_{j}}\bigr).$$
3) Multi-Agent Transformer (MAT): This algorithm is a novel architecture that reformulates cooperative multi-agent reinforcement learning (MARL) as a sequence-modelling problem by mapping agents' joint observation sequences to optimal action sequences using a Transformer encoder-decoder framework [wen2022multi]. Central to MAT is the multi-agent advantage decomposition theorem, which decomposes the joint team advantage $A^{\pi}(o,a)$ into a sum of individual agent advantages conditioned on the preceding agents' actions, enabling a sequential decision process with only linear complexity in the number of agents:

$$A^{\pi}(o,a)=\sum_{m=1}^{n}A^{\pi}_{i_{m}}\big(o,\;a_{i_{1:m-1}},\,a_{i_{m}}\big),$$

where $A^{\pi}_{i_{m}}(\cdot)$ is the advantage of agent $i_{m}$'s action given the earlier agents' choices. MAT leverages the transformer's self-attention and autoregressive decoding to model inter-agent dependencies, enjoys monotonic performance improvement guarantees, and is trained online and on-policy, outperforming strong baselines such as MAPPO and HAPPO across major benchmarks.

Although MAPPO, HAPPO, and MAT can operate effectively in the CTDE paradigm, they have no strategy for handling the randomness and uncertainty of the environment. As compared in Fig. [3](https://arxiv.org/html/2605.31023#S2.F3), MAPPO was developed for a stable learning process and was then improved by HAPPO, which incorporates agent-specific advantage value updates to better account for other agents' policies in heterogeneous systems. MAT maps all agents' observations as a sequence to generate the agents' actions; thus, in our study it is categorized as a centralized paradigm. Although it has a decentralized variant, the complete encoder-decoder architecture is more complex and is preferred for comparison with our approach, which relies on an encoder-only architecture. Our proposed method, HADT, is designed to address the uncertainty and randomness that arise in heterogeneous satellite cluster EO missions. It uses a unique mapping of satellite observations to action entities and is equipped with differential multi-head attention.
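As a worked instance of the advantage decomposition, consider the three-satellite cluster with an illustrative agent ordering $i_1, i_2, i_3$ (the ordering itself is an assumption; the decomposition holds for any permutation of the agents). Setting $n=3$ in the sum gives:

$$A^{\pi}(o,a)=A^{\pi}_{i_{1}}\big(o,\,a_{i_{1}}\big)+A^{\pi}_{i_{2}}\big(o,\,a_{i_{1}},\,a_{i_{2}}\big)+A^{\pi}_{i_{3}}\big(o,\,a_{i_{1}},\,a_{i_{2}},\,a_{i_{3}}\big).$$

Each satellite's term is conditioned only on the actions already chosen by the agents before it in the ordering, which is what allows the actions to be selected one agent at a time.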
## 3 Proposed Method: Heterogeneous Multi-Agent Differential Transformer (HADT)
### 3.1 Observation to Token Construction
Inspired by the policy decoupling mechanism in UPDeT [hu2021updet], a universal policy can be designed by constructing the transformer input tokens based on the number of the agent's observation entities. Thus, in this work, by integrating prior knowledge of the relation between the $i$-th satellite's local observation entities ($O_{t}=[o_{t}^{1},o_{t}^{2},\dots,o_{t}^{i}]$) and decision (action) entities ($a_{t}^{i}=[a_{t}^{1},a_{t}^{2},\dots,a_{t}^{i}]$) at time step $t$, we construct the agent's token information by mapping the observation-action relations, as shown in Fig. [4](https://arxiv.org/html/2605.31023#S3.F4). In other words, the tokens are constructed by decoupling the observation entities into each of the action entities described in Section [2.1](https://arxiv.org/html/2605.31023#S2.SS1). Since the dimensionality and semantic structure of these observations may differ across agents, a direct concatenation would be too simplistic for this case.
Figure 4: HADT policy model. It has two main components: 1) Observation to Token Construction, which maps observation entities to the action entities through their relational connections; 2) Differential Multi-head Attention, which reduces the sensitivity of the self-attention mechanism to random or noisy inputs.

To enable structured learning, each local observation token $o_{t,N}^{i}$ is first projected into a shared latent space via an embedding layer $e_{t}^{i}=\text{Emb}(o_{t,N}^{i})$, where $\text{Emb}(\cdot)$ denotes a learnable linear or non-linear projection. The embedding layer normalizes heterogeneous feature scales and produces fixed-dimensional token representations. Consequently, the global observation is transformed into a token set $E_{t}=\{e_{t}^{1},e_{t}^{2},\dots,e_{t}^{N}\}$, where each token semantically represents observation-action relations. On the output side, the architecture incorporates action tokens corresponding to discrete operational modes such as charging, downlinking, target capturing, and desaturating. These action tokens allow the model to associate relational context with candidate actions and facilitate structured policy learning. The combined token representations are then processed by a Transformer encoder block composed of Differential Multi-head Attention with residual connections, layer normalization, and Multi-Layer Perceptron (MLP) networks, enabling implicit relational reasoning over observations and actions for each satellite in the heterogeneous cluster.
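The token construction can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the paper's implementation: the grouping of observation entities by action, the feature values, the embedding width `D_MODEL`, and the choice of one randomly initialized linear projection per token type are all hypothetical (in a trained model the projections would be learned).

```python
import numpy as np

D_MODEL = 16  # hypothetical embedding width

# Hypothetical grouping of one satellite's observation entities by the action they inform.
obs_groups = {
    "charge": np.array([0.85, 0.10, 0.35]),           # battery level, eclipse start/end
    "downlink": np.array([0.90, 0.20, 0.50]),         # storage level, ground-station window
    "desaturate": np.array([0.20, -0.10, 0.05]),      # reaction wheel speeds (x, y, z)
    "capture_0": np.array([0.70, 0.10, 0.30, 0.60]),  # priority, window open/close, cloud
    "capture_1": np.array([0.40, 0.25, 0.45, 0.10]),
}


def token_type(name):
    """All capture_k groups share the 'capture' embedding."""
    return name.split("_")[0]


def make_embedders(groups, d_model, rng):
    """One linear projection per token type, mapping raw features to d_model."""
    embedders = {}
    for name, feats in groups.items():
        kind = token_type(name)
        if kind not in embedders:
            embedders[kind] = rng.normal(0.0, 0.1, size=(feats.shape[0], d_model))
    return embedders


def build_tokens(groups, embedders):
    """Stack the embedded groups into an (N, d_model) token matrix."""
    rows = [feats @ embedders[token_type(name)] for name, feats in groups.items()]
    return np.stack(rows)


rng = np.random.default_rng(0)
embedders = make_embedders(obs_groups, D_MODEL, rng)
tokens = build_tokens(obs_groups, embedders)  # one row per observation-action token
```

Sharing the capture projection across targets is one way the token count can vary with the number of visible targets without changing the parameter count.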
### 3.2 Differential Multi-Head Attention
To explicitly model cooperative interactions within the heterogeneous satellite cluster under uncertainty, we propose a Differential Multi-Head Attention (Diff-MHA) mechanism, inspired by [ye2024differential]. The previous differential transformer architecture implements differential attention (Diff-Attn), which has dual $Q_{1},Q_{2}$ and $K_{1},K_{2}$ parameters with multiple $\lambda$ parameters. In contrast, our new Diff-MHA uses a single learnable $\lambda$ parameter and introduces new mixing parameters, yielding a simpler strategy while maintaining its capability. Our proposed mechanism applies a differential operation to capture contrastive dependencies between two distinct MHA modules, whose outputs may contain noisy information that can degrade the agent's overall performance.
Given the input token matrix $X\in\mathbb{R}^{N\times d}$, where $d$ is the embedding dimension, the tokens are first partitioned into two sub-representations $X_{1,2}\in\mathbb{R}^{N\times d/2}$, and each branch is then processed by an independent Multi-head Attention (MHA) module:

$$X_{1},X_{2}=\text{split}(X),\quad Y_{1}=\text{MHA}_{1}(X_{1}),\quad Y_{2}=\text{MHA}_{2}(X_{2}).\tag{6}$$
These dual attention modules capture the relations between the observation tokens, which yields a better representation for understanding the current situation and deciding on the best action. For each branch, we adopt standard MHA [vaswani2017attention] with an input token matrix $X_{1,2}$; queries, keys, and values are computed as $Q=X_{1,2}W_{Q},\quad K=X_{1,2}W_{K},\quad V=X_{1,2}W_{V}$, where $W_{Q},W_{K},W_{V}\in\mathbb{R}^{d/2\times d_{k}}$ are learnable projection matrices and $d_{k}=\frac{d}{H}$. The scaled dot-product attention is then defined as:

$$\text{Attn}_{1,2}(Q,K,V)=\text{softmax}\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)V.\tag{7}$$

In multi-head attention, $H$ parallel attention heads are computed and concatenated to form the outputs:

$$Y_{1,2}=\text{MHA}_{1,2}(X_{1,2})=\text{Concat}_{1,2}\big(\text{Attn}_{1,2}(Q,K,V)_{h}\big)_{h=1:H}W_{O},\tag{8}$$

where $W_{O}$ is a learnable output projection matrix. A differential operation is subsequently applied to the outputs of both MHA modules:

$$Y_{\text{diff}}=Y_{1}-\lambda\cdot Y_{2},\quad Y_{\text{out}}=Y_{\text{diff}}\cdot W_{\text{mix}},\tag{9}$$

where $\lambda$ is a learnable scaling parameter controlling the influence of the second MHA branch. This subtraction allows the network to enhance beneficial relational signals while attenuating noisy information, thereby explicitly reducing the effect of noise. The differential representation is then projected by a mixing layer (Mix.) with parameters $W_{\text{mix}}\in\mathbb{R}^{d/2\times d}$, an additional learnable projection matrix that matches the output space dimension. The output is integrated within a Transformer encoder block using residual connections and layer normalization to ensure stable training. Finally, the encoded representations are passed through an action (Act.) MLP head to produce the policy distribution $\pi(a_{t}\mid O_{t})$, from which the action $a_{t}$ is sampled or selected.
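A forward pass through Eqs. (6) to (9) can be sketched in NumPy as below. This is a minimal sketch under assumptions, not the released HADT code: weights are random rather than learned, $\lambda$ is a fixed scalar, biases are omitted, and the shape of $W_O$ (mapping the $H\cdot d_k=d$ concatenated features back to $d/2$) is inferred from the requirement that $Y_{\text{diff}}$ can be multiplied by $W_{\text{mix}}\in\mathbb{R}^{d/2\times d}$.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def init_mha(d_in, d_k, n_heads, d_out, rng):
    """Random per-head projections; a trained model would learn these."""
    shape = (n_heads, d_in, d_k)
    return {
        "Wq": rng.normal(0.0, 0.1, shape),
        "Wk": rng.normal(0.0, 0.1, shape),
        "Wv": rng.normal(0.0, 0.1, shape),
        "Wo": rng.normal(0.0, 0.1, (n_heads * d_k, d_out)),  # assumed shape
    }


def mha(x, p):
    """Standard multi-head attention, Eqs. (7)-(8); x has shape (N, d_in)."""
    q = np.einsum("nd,hdk->hnk", x, p["Wq"])
    k = np.einsum("nd,hdk->hnk", x, p["Wk"])
    v = np.einsum("nd,hdk->hnk", x, p["Wv"])
    d_k = q.shape[-1]
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_k))    # (H, N, N)
    heads = attn @ v                                           # (H, N, d_k)
    concat = heads.transpose(1, 0, 2).reshape(x.shape[0], -1)  # (N, H * d_k)
    return concat @ p["Wo"]


def diff_mha(x, p1, p2, lam, w_mix):
    """Eq. (6) split, Eq. (9) difference and mixing."""
    x1, x2 = np.split(x, 2, axis=-1)
    y_diff = mha(x1, p1) - lam * mha(x2, p2)
    return y_diff @ w_mix


N, d, H = 5, 16, 4  # hypothetical sizes
d_k = d // H
rng = np.random.default_rng(0)
p1 = init_mha(d // 2, d_k, H, d // 2, rng)
p2 = init_mha(d // 2, d_k, H, d // 2, rng)
w_mix = rng.normal(0.0, 0.1, (d // 2, d))
x = rng.normal(size=(N, d))
y_out = diff_mha(x, p1, p2, lam=0.5, w_mix=w_mix)  # shape (N, d)
```

A quick sanity check on the subtraction: if both branches receive the same input and share parameters, setting $\lambda=1$ cancels the output entirely, which is the limiting case of the "common-mode" signal being removed.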
The proposed Differential Multi-Head Attention mechanism enables the HADT architecture to capture structured dependencies between observations and actions effectively. In particular, it yields a policy architecture suited to heterogeneous satellite cluster decision-making in dynamic and constrained EO missions. The architecture is ultimately used as an actor encoder to approximate the policy function $\pi$ with parameters $\phi$ ($\pi_{\phi}$), which is optimized following the HAPPO policy update in Eq. [5](https://arxiv.org/html/2605.31023#S2.E5).



Figure 5: Average evaluation rewards comparison across different scenarios
## 4 Experimental Results
### 4.1 Scenario Description
Our algorithm is tested under BSK\-RL\[stephenson2024bsk\]and Basilisk simulator which contain numerous simulation parameters\. Due to the page limit, some of the parameters which are adjusted to match with our scenarios \(e\.g\. orbital parameters and satellite specifications\), are listed in the Supplemental Doc\., Section 2\. Total targets to be captured are 160 targets and the location randomized within 11 different regions\. We defined three different scenarios to evaluate our proposed algorithm performance incrementally from the ideal to near\-realistic simulations:
*easy*: This scenario studies the ideal case, in which all resources are initially fully available for executing an EO mission. The battery is fully charged (100%), the memory storage is empty (0%), and there is no randomness in the attitude disturbance or the reaction wheel speed initialization. It simulates a mission in which cluster resource availability poses little challenge.

*medium*: This scenario simulates the cluster under restricted resources. The downlink transmission speed (baud rate) is assumed to degrade up to 30% from its default value used in the *easy* case. Additionally, the battery level is initialized at 85% and the memory storage at 90%. This scenario makes it harder to balance memory utilization: if too much data is collected, the memory fills up and cannot be downlinked properly because of the reduced transmission speed.

*hard*: This scenario is the main focus of this work and provides a near-realistic simulation. It extends the *medium* scenario with additional randomness, namely randomized initialization of the battery, memory storage, disturbance, and reaction wheel speeds. This makes the EO mission more challenging for the cluster and evaluates how well the policy generalizes across different initial resource conditions. It reflects a real satellite cluster EO mission more closely, where the initial resource conditions may vary depending on the state left by the previous mission. The battery level is initialized randomly between 80% and 85%, the memory storage between 90% and 100%, the disturbance is drawn from a normal distribution with scale $10^{-4}$, and the reaction wheel speeds are drawn uniformly between -3000 and 3000 RPM. This scenario therefore combines limited resource availability with uncertainty and randomness.
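The three scenarios can be summarized as a small sampling routine. The sketch below is illustrative and is not the BSK-RL or Basilisk configuration API: the function name `sample_initial_conditions` and the dictionary keys are hypothetical. It further assumes uniform sampling for the battery and memory ranges, a zero-mean normal distribution for the disturbance, and a scalar disturbance value, none of which are stated explicitly in the text. `None` marks quantities that are not randomized and are left at the simulator default.

```python
import random


def sample_initial_conditions(scenario, rng):
    """Return hypothetical initial resource settings for one satellite."""
    if scenario == "easy":
        # Ideal case: full battery, empty memory, nominal downlink speed.
        return {"battery_pct": 100.0, "memory_pct": 0.0,
                "max_baud_degradation": 0.0,
                "disturbance": None, "wheel_rpm": None}
    if scenario == "medium":
        # Restricted resources, still without random initialization.
        return {"battery_pct": 85.0, "memory_pct": 90.0,
                "max_baud_degradation": 0.30,
                "disturbance": None, "wheel_rpm": None}
    if scenario == "hard":
        # Medium settings plus randomized initial conditions.
        return {"battery_pct": rng.uniform(80.0, 85.0),    # uniform assumed
                "memory_pct": rng.uniform(90.0, 100.0),    # uniform assumed
                "max_baud_degradation": 0.30,
                "disturbance": rng.gauss(0.0, 1e-4),       # zero mean assumed
                "wheel_rpm": rng.uniform(-3000.0, 3000.0)}
    raise ValueError(f"unknown scenario: {scenario}")


rng = random.Random(0)  # fixed seed so the sketch is reproducible
for name in ("easy", "medium", "hard"):
    print(name, sample_initial_conditions(name, rng))
```

Keeping the degradation as an upper bound (`max_baud_degradation`) avoids committing to one reading of "up to 30%".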
### 4.2 Performance Evaluation
The policies of the different MARL algorithms are trained on the same machine, with an Intel(R) Core(TM) i9-14900K CPU, 32 GB RAM, and an NVIDIA GeForce RTX 4090 GPU with 24 GB of memory. The algorithm hyper-parameters are listed in Supplemental Document Section 3. Training is terminated at one million time steps with 20 parallel environments (rollouts). Each rollout has its own seed and is combined with 3 different model seeds, generating 60 unique seed combinations in our training scheme. This setting covers at least 5000 randomized target distributions during training. Performance is then evaluated on 500 unseen target distributions with 5 episodes and 10 rollouts each, to assess the policy's generalization across different target locations, priorities, and cloud conditions.
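The count of 60 seed combinations follows from pairing each of the 3 model seeds with each of the 20 rollout seeds. The short sketch below enumerates them; the concrete seed values are hypothetical placeholders, since the actual seeds are not given.

```python
from itertools import product

rollout_seeds = range(20)  # one seed per parallel environment (placeholder values)
model_seeds = range(3)     # three model initializations (placeholder values)

# Every (model seed, rollout seed) pair is a distinct training configuration.
seed_combinations = list(product(model_seeds, rollout_seeds))
print(len(seed_combinations))
```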
Figure 6: Average evaluation Completion Rate (CR) in different scenarios

The averaged total reward is shown in Fig. [5](https://arxiv.org/html/2605.31023#S3.F5) and the Completion Rate (CR) is presented in Fig. [6](https://arxiv.org/html/2605.31023#S4.F6). Based on these two metrics, our proposed algorithm is not the best choice for the *easy* scenario, where MAPPO and HAPPO with an MLP+RNN based architecture achieve better performance. However, this scenario is an ideal case in which the resource constraints are neglected. Under the more realistic *medium* and *hard* scenarios, HADT outperforms the baselines. We attribute this to its differential multi-head attention mechanism, which enables more robust performance under uncertain observations with randomness. MAT is better than HADT in the *easy* scenario; however, it has no uncertainty-handling mechanism in its transformer architecture and thus does not perform well in the *hard* scenario. Detailed numerical values of these experiments are presented in Supplemental Document Section 4 (Table 4).
### 4.3 The Impact of HADT
To evaluate the impact of HADT, an ablation study was performed under the *hard* scenario; the results are shown in Table [1](https://arxiv.org/html/2605.31023#S4.T1). The results demonstrate the effectiveness of HADT compared with its fundamental components and architectures. HADT achieves the highest average reward ($74.58 \pm 4.81$) and Completion Rate (CR) ($39.48 \pm 2.29$). If the differential transformer (DT) feature is deactivated, the model becomes the vanilla transformer variant, which attains $71.64 \pm 6.48$ in reward and $37.94 \pm 4.11$ in CR. This performance gap highlights the contribution of the differential transformer mechanism in enhancing representation learning and stabilizing coordination among heterogeneous agents. Furthermore, the margin becomes substantially larger when compared to conventional neural architectures: the 1-Layer and 2-Layer MLP+RNN models with 256 nodes obtain rewards of only $50.39$ and $47.92$, respectively, with significantly lower completion rates, and performance degrades further as the number of layers increases. This result confirms the impact of the transformer-based architecture compared with MLP+RNN. The reduced variance observed in HADT also indicates more stable learning dynamics. Overall, these results validate that the proposed algorithm enables more effective decision-making under uncertainties and dynamics, leading to superior task completion performance in complex multi-agent environments.
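To make the size of these margins concrete, the sketch below computes the reward and CR differences from the mean values quoted in the paragraph above. Only the reported means are used; the variable names are hypothetical, and the completion rates of the MLP+RNN models are omitted because they are not stated in the text.

```python
# Mean values quoted in the ablation discussion (hard scenario).
hadt = {"reward": 74.58, "cr": 39.48}
vanilla = {"reward": 71.64, "cr": 37.94}           # HADT with DT deactivated
mlp_rnn_reward = {"1-layer": 50.39, "2-layer": 47.92}

# Gain attributable to the differential transformer (DT) feature.
dt_reward_gain = round(hadt["reward"] - vanilla["reward"], 2)
dt_cr_gain = round(hadt["cr"] - vanilla["cr"], 2)

# Gain of HADT over the MLP+RNN baselines (reward only).
mlp_reward_gain = {k: round(hadt["reward"] - v, 2) for k, v in mlp_rnn_reward.items()}

print(dt_reward_gain, dt_cr_gain, mlp_reward_gain)
```

The DT gain in mean reward is smaller than the reported standard deviations of either variant, while the gain over the MLP+RNN baselines is several times larger.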
Table 1: Evaluation of Different Model Architectures (CR: Completion Rate (%); DT: Differential Transformer)
### 4.4 Transferability and Scalability
The last experiment extends the setting from a single cluster to two clusters. In this study, we evaluate the transferability and adaptability of HADT to a new cluster configuration under the *hard* scenario. We add a new cluster separated by a 45-degree orbital offset. The policy is then duplicated for the new cluster and trained for 1 million time steps to adapt it to the new task. In other words, the HADT policy pre-trained on a single cluster is transferred and adapted to the new two-cluster task. The results are shown in Fig. [7](https://arxiv.org/html/2605.31023#S4.F7). Compared with *2 Cluster from Scratch*, the *1 to 2 Cluster Transfer*, in which the HADT policy is duplicated for the new cluster, achieves better performance, with a reward gap of around 20 points. This confirms the transferability and adaptability of our HADT algorithm. In addition, using two clusters improves the completion rate (CR) by around 10%.
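The duplication step can be pictured as copying the pre-trained parameters once per cluster before fine-tuning. The sketch below is a simplified illustration under the assumption that each cluster receives an independent copy of the parameters; whether the copies are independent or shared is not stated in the text. The function `duplicate_policy`, the parameter dictionary, and its values are hypothetical stand-ins for a real model state.

```python
import copy


def duplicate_policy(pretrained_params, n_clusters):
    """Give each cluster its own deep copy of the pre-trained parameters."""
    return [copy.deepcopy(pretrained_params) for _ in range(n_clusters)]


# Stand-in for the parameters of a policy pre-trained on one cluster.
pretrained = {"w_mix": [[0.1, 0.2], [0.3, 0.4]], "lam": 0.5}

policies = duplicate_policy(pretrained, n_clusters=2)

# Stand-in for a fine-tuning update that touches only the second cluster.
policies[1]["lam"] = 0.7

print(policies[0]["lam"], policies[1]["lam"], pretrained["lam"])
```

Deep copies keep later updates to one cluster's policy from silently changing the other or the pre-trained checkpoint.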


Figure 7: HADT Policy Transfer from 1 to 2 Clusters
## 5 Conclusion
Transformer-based architectures are promising for realistic multi-agent reinforcement learning problems. Our study shows that the new differential multi-head attention improves on the vanilla transformer and helps agents learn under dynamic and uncertain conditions with randomness. Moreover, observation tokenization is a key strategy for guiding the transformer model, because it constructs the relation between the observation and action spaces. As a result, the HADT algorithm outperforms MAPPO, HAPPO, and MAT under the *medium* and *hard* scenarios, which are designed to mimic near-realistic conditions in EO missions. HADT is also a potential candidate for transfer learning or fine-tuning to different numbers of clusters. However, if it is to be deployed as an onboard autonomy algorithm in the future, the model contains numerous parameters, which may make it impractical given the limited onboard computational power. One direction for future research is the integration of parameter-efficient fine-tuning, an active research topic for large models in language modelling and image processing, which could be adopted in online multi-agent reinforcement learning.
#### 5.0.1 Acknowledgements
This work has been supported by the SmartSat CRC, whose activities are funded by the Australian Government's CRC Program. This work uses an open-source realistic satellite simulator (Basilisk and BSK-RL) that is actively developed by Dr. Hanspeter Schaub and his team at the AVS Laboratory, University of Colorado Boulder. The authors would also like to express their sincere gratitude to BAE Systems for their invaluable support and collaboration throughout this research.
## References