DiffAero: A GPU-Accelerated Differentiable Simulation Framework for Efficient Quadrotor Policy Learning


Summary

DiffAero is a GPU-accelerated, fully differentiable simulation framework for quadrotor control policy learning that supports environment- and agent-level parallelism, multiple dynamics models, and customizable sensors. It enables robust flight policy learning in hours on consumer-grade hardware and is released as open-source.

arXiv:2509.10247v1 Announce Type: cross

# A GPU-Accelerated Differentiable Simulation Framework for Efficient Quadrotor Policy Learning
Source: [https://arxiv.org/html/2509.10247](https://arxiv.org/html/2509.10247)
Xinhong Zhang, Runqing Wang, Yunfan Ren, Jian Sun, Hao Fang, Jie Chen, and Gang Wang∗

This work was supported in part by the National Natural Science Foundation of China under Grants U23B2059, 62173034, 62088101, and also by the Zhongguancun Academy under Grant 20240307. Xinhong Zhang, Runqing Wang, Jian Sun, Hao Fang, and Gang Wang are with the State Key Lab of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology, Beijing 100081, China ({xhzhang, bitwrq, sunjian, fangh, gangwang}@bit.edu.cn). Xinhong Zhang is also with the Zhongguancun Academy, Beijing 100094, China. Yunfan Ren is with the Department of Mechanical Engineering, University of Hong Kong (renyf@connect.hku.hk). Jie Chen is with the Harbin Institute of Technology, and also with the State Key Lab of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology, Beijing 100081, China (chenjie@bit.edu.cn). ∗Corresponding author.

###### Abstract

This letter introduces DiffAero, a lightweight, GPU-accelerated, and fully differentiable simulation framework designed for efficient quadrotor control policy learning. DiffAero supports both environment-level and agent-level parallelism and integrates multiple dynamics models, customizable sensor stacks (IMU, depth camera, and LiDAR), and diverse flight tasks within a unified, GPU-native training interface. By fully parallelizing both physics and rendering on the GPU, DiffAero eliminates CPU-GPU data transfer bottlenecks and delivers orders-of-magnitude improvements in simulation throughput. In contrast to existing simulators, DiffAero not only provides high-performance simulation but also serves as a research platform for exploring differentiable and hybrid learning algorithms. Extensive benchmarks and real-world flight experiments demonstrate that DiffAero and hybrid learning algorithms combined can learn robust flight policies in hours on consumer-grade hardware. The code is available at https://github.com/flyingbitac/diffaero.

## I Introduction

Quadrotors, and swarms thereof, are increasingly deployed in complex environments for aerial inspection, environmental monitoring, and high-speed racing, owing to their agile maneuverability and onboard sensing capabilities. Traditional autonomy architectures decompose flight functionality into perception, localization, mapping, planning, and control modules\[[1](https://arxiv.org/html/2509.10247v1#bib.bib1),[2](https://arxiv.org/html/2509.10247v1#bib.bib2)\]. Although modular, this hierarchical pipeline incurs latency and accumulated errors, limiting the full exploitation of quadrotor dynamics\[[3](https://arxiv.org/html/2509.10247v1#bib.bib3)\]. In addition, the tight coupling between modules renders design and tuning labor-intensive and brittle.

End\-to\-end learning addresses these limitations by training neural flight policies that map raw sensor observations directly to control commands, thereby streamlining the autonomy stack and enabling tighter feedback loops\[[4](https://arxiv.org/html/2509.10247v1#bib.bib4)\]\. Reinforcement learning \(RL\) methods train flight policies through interactions by maximizing expected cumulative reward\[[5](https://arxiv.org/html/2509.10247v1#bib.bib5)\]\. However, RL methods, model\-free RL in particular, often demand millions of simulated interactions to achieve proficient behavior, especially when processing high\-dimensional observations \(e\.g\., depth images\) or optimizing sparse rewards\[[6](https://arxiv.org/html/2509.10247v1#bib.bib6),[7](https://arxiv.org/html/2509.10247v1#bib.bib7)\]\. Imitation learning \(IL\), on the other hand, reduces sample complexity by leveraging expert demonstrations but suffers from limited generalization and the practical burden of data collection because it depends heavily on demonstration diversity and quality\[[8](https://arxiv.org/html/2509.10247v1#bib.bib8)\]\.

![Refer to caption](https://arxiv.org/html/2509.10247v1/x1.png)

Figure 1: Visualization of task environments in DiffAero. (a) Position control. (b) Racing. (c) Obstacle avoidance with outdoor obstacle settings. (d) Obstacle avoidance with indoor obstacle settings. Ceilings are omitted for clarity.

Accelerating flight policy learning can be approached from two directions: improving simulation performance or enhancing the data efficiency of the learning algorithm. Existing GPU-based drone simulators are typically built upon existing physics and rendering engines, which limits their simulation and rendering performance. Furthermore, the lack of differentiability and of a unified learning interface hinders the development and benchmarking of novel data-efficient hybrid algorithms across paradigms\[[9](https://arxiv.org/html/2509.10247v1#bib.bib9),[10](https://arxiv.org/html/2509.10247v1#bib.bib10),[11](https://arxiv.org/html/2509.10247v1#bib.bib11)\].

In this letter, focusing on the autonomous flight of quadrotors, we aim to accelerate quadrotor policy learning by enhancing both simulation performance and data efficiency. We present DiffAero, a GPU-accelerated, differentiable, and extensible simulation framework for aerial robotics that enables high-performance interactive quadrotor simulation. It offers $1.8\times$ and $9.6\times$ speedups in physics simulation and depth rendering, respectively, compared to Aerial Gym\[[12](https://arxiv.org/html/2509.10247v1#bib.bib12)\]. To the best of our knowledge, this is the first differentiable simulation framework that performs both physical simulation and visual rendering fully on the GPU and also supports multiple dynamics models. By combining high-performance simulation, a unified learning interface, learning algorithms, and deployment utilities, robust and generalizable flight policies can be trained, evaluated, and deployed within hours on consumer-grade hardware.

In a nutshell, the main contributions of this work are summarized as follows\.

1. We present DiffAero, a GPU-accelerated differentiable quadrotor simulator that parallelizes both physics and rendering. It achieves orders-of-magnitude performance improvements over existing platforms with little VRAM consumption.
2. DiffAero provides a modular and extensible framework supporting four differentiable dynamics models, three sensor modalities, and three flight tasks. Its PyTorch-based interface unifies four learning formulations and three learning paradigms. This flexibility enables DiffAero to serve as a benchmark for learning algorithms and allows researchers to investigate a wide range of problems, from differentiable policy learning to multi-agent coordination.
3. We demonstrate how DiffAero serves as an enabling research platform for exploring and evaluating quadrotor learning algorithms. Through simulation and real-world experiments, we show that the framework facilitates the development of data-efficient differentiable and hybrid learning algorithms and seamless sim-to-real transfer, thereby accelerating research on learning-based quadrotor flight control. Moreover, the source code is released for the reference of the community.

## II Related Work

TABLE I: Feature Comparison of Popular Drone Simulation Frameworks for Interactive Policy Learning

1. D stands for depth camera, I for IMU, and L for LiDAR.
2. FPS results of physics simulation are measured under 8,192 parallel environments.
3. FPS results of depth camera rendering are measured under 2,048 parallel environments with a depth image resolution of $64\times 64$, except OmniDrones, which does not support depth cameras.
4. Data from\[[10](https://arxiv.org/html/2509.10247v1#bib.bib10)\]. Note that the GPU parallelization support of VisFly is indicated with a (✓), as rendered images in VisFly are returned as NumPy arrays stored in RAM. Consequently, its simulation performance is significantly limited by the I/O speed.

### II-A Learning-based Autonomous Flight

Learning-based end-to-end policies enable direct control from raw observations, bypassing the limitations of traditional modular pipelines. Imitation learning approaches supervise policy networks to mimic expert demonstrations across a variety of tasks, including acrobatics\[[13](https://arxiv.org/html/2509.10247v1#bib.bib13)\], navigation\[[3](https://arxiv.org/html/2509.10247v1#bib.bib3)\], and racing\[[8](https://arxiv.org/html/2509.10247v1#bib.bib8)\], but their performance relies heavily on diverse demonstrations and generalizes poorly\[[14](https://arxiv.org/html/2509.10247v1#bib.bib14)\]. RL allows training flight policies interactively using simulation-generated data and rewards, and has demonstrated champion-level results in drone racing\[[7](https://arxiv.org/html/2509.10247v1#bib.bib7)\].

While most RL methods treat quadrotor dynamics as a black box, recent advances integrate differentiable simulation so that the dynamics enter the learning process as analytic gradients, enabling more data-efficient training and direct optimization over visual features. Differentiable simulation approaches\[[14](https://arxiv.org/html/2509.10247v1#bib.bib14)\] enabled end-to-end agile navigation of a swarm of quadrotors from depth images. While these approaches are data-efficient and capable of handling high-dimensional inputs, designing effective differentiable reward signals remains challenging. Furthermore, similar to artificial potential field methods, learning agents can easily become trapped in local optima. Hybrid algorithms like SHAC\[[15](https://arxiv.org/html/2509.10247v1#bib.bib15)\] enable the integration of both RL and differentiable learning techniques, allowing for the separate and flexible design of reward signals for direct gradient backpropagation and terminal value optimization.

### II-B Drone Simulators

Early drone simulators like Gazebo\[[16](https://arxiv.org/html/2509.10247v1#bib.bib16)\] became popular due to ROS integration and flexible support for physics and sensor simulation, but they face limitations in performance and scalability from a data generation perspective. With a primary emphasis on visual realism, Flightmare\[[17](https://arxiv.org/html/2509.10247v1#bib.bib17)\] and AirSim\[[18](https://arxiv.org/html/2509.10247v1#bib.bib18)\] leveraged game engines to provide fast simulation with realistic environment rendering, while Menagerie\[[19](https://arxiv.org/html/2509.10247v1#bib.bib19)\] and PyBulletDrones\[[20](https://arxiv.org/html/2509.10247v1#bib.bib20)\] targeted high-fidelity physics based on MuJoCo\[[21](https://arxiv.org/html/2509.10247v1#bib.bib21)\] and PyBullet\[[22](https://arxiv.org/html/2509.10247v1#bib.bib22)\], though their visual realism remains limited.

Recent simulators, such as OmniDrones\[[9](https://arxiv.org/html/2509.10247v1#bib.bib9)\], Aerial Gym Simulator\[[12](https://arxiv.org/html/2509.10247v1#bib.bib12)\], and AirGym\[[11](https://arxiv.org/html/2509.10247v1#bib.bib11)\], utilize NVIDIA Omniverse Isaac Sim and Isaac Gym\[[23](https://arxiv.org/html/2509.10247v1#bib.bib23)\] to explore GPU acceleration for parallel simulation and rendering, minimizing CPU-GPU data transfer by directly accessing simulator states as PyTorch\[[24](https://arxiv.org/html/2509.10247v1#bib.bib24)\] tensors. However, these frameworks are resource-intensive and impose strict system requirements, and the rasterization pipeline still limits their rendering performance. VisFly\[[10](https://arxiv.org/html/2509.10247v1#bib.bib10)\] incorporates differentiable simulation, but its performance is bottlenecked by I/O speed due to image transfer between RAM and VRAM, and it does not provide a unified interface that is compatible with both RL and differentiable learning algorithms. To overcome these issues, we develop a novel simulation framework from scratch entirely in PyTorch, providing fully GPU-parallelized differentiable simulation capabilities while remaining lightweight, flexible, easy to install, and user-friendly.

## III The DiffAero Simulation Framework

![Refer to caption](https://arxiv.org/html/2509.10247v1/x2.png)

Figure 2: An overview of the developed simulation framework. The simulation framework comprises three modular components: (1) simulation environments, (2) learning agents, and (3) export and deployment utilities. The decoupled modular design of simulation environment and learning agent enables nearly arbitrary module combinations, thereby providing highly customizable training configurations. Additionally, DiffAero features a Gym-like learning interface that runs on the GPU, allowing data to be transferred directly between environments and agents, thereby reducing data transfer I/O overhead. Furthermore, DiffAero incorporates export and deployment utilities to deploy and evaluate the trained policy.

In this section, we describe the key features and building blocks of DiffAero. DiffAero is a differentiable, high-performance quadrotor simulation toolkit capable of simulating multiple environments in parallel, each containing multiple quadrotor instances, by leveraging the parallel computing capabilities of modern GPUs.

DiffAero consists of the following main components: (1) a high-performance and flexible simulation framework, including three quadrotor dynamics, three sensor modalities, and three flight tasks, combined with a Gym-like interactive learning interface; (2) an interactive learning library implemented in PyTorch\[[24](https://arxiv.org/html/2509.10247v1#bib.bib24)\], including four configurable network architectures and several learning algorithms, optimized for training on data generated by GPU-parallelized simulation; and (3) utilities to export and deploy trained control policies in other simulators or on real quadrotor platforms, enabling researchers to evaluate performance efficiently. An overview of DiffAero is depicted in Fig. [2](https://arxiv.org/html/2509.10247v1#S3.F2).

### III-A Differentiable Dynamics

Existing differentiable physics engines\[[25](https://arxiv.org/html/2509.10247v1#bib.bib25),[26](https://arxiv.org/html/2509.10247v1#bib.bib26)\] primarily support general rigid-body dynamics and use joint torques as action inputs, which does not match the requirements of quadrotor simulation. Instead of pursuing generality, our framework is tailored specifically for quadrotors by implementing four dynamics models with varying levels of complexity and fidelity to facilitate parallel, GPU-accelerated, and multi-fidelity quadrotor simulation.

The first dynamics model is the full quadrotor dynamics\[[27](https://arxiv.org/html/2509.10247v1#bib.bib27)\], defined as follows

$$
\begin{cases}
\dot{\mathbf{p}} = \mathbf{v} \\
\dot{\mathbf{v}} = \mathbf{g} - c\,\mathbf{z}_{B} - \mathbf{R}\mathbf{D}\mathbf{R}^{\top}\mathbf{v} \\
\dot{\mathbf{q}} = \dfrac{1}{2}\,\mathbf{q} \odot \begin{bmatrix} 0 \\ \boldsymbol{\omega} \end{bmatrix} \\
\dot{\boldsymbol{\omega}} = \mathbf{J}^{-1}\left(\boldsymbol{\tau} - \boldsymbol{\omega} \times \mathbf{J}\boldsymbol{\omega}\right)
\end{cases}
\tag{1}
$$

where $\mathbf{p}$ and $\mathbf{v}$ are the position and velocity in the world frame, $\mathbf{q}$ and $\boldsymbol{\omega}$ denote the orientation quaternion and the angular velocity in the body frame, $\mathbf{g}$ is the gravitational acceleration vector, $\mathbf{z}_{B}$ is the unit vector along the z-axis of the body frame, $\mathbf{R}$, $\mathbf{D}$, and $\mathbf{J}$ are respectively the rotation matrix, drag matrix, and inertia matrix, defined in the body frame, $c$ and $\boldsymbol{\tau}$ represent the collective rotor thrust and torque, and $\odot$ denotes the quaternion product. We use the geometric controller\[[28](https://arxiv.org/html/2509.10247v1#bib.bib28)\] that takes thrust and body rate commands as inputs to generate the desired collective thrust and torque. The dynamics model ([1](https://arxiv.org/html/2509.10247v1#S3.E1)) provides the highest fidelity but is also the most computationally demanding; its complex computation graph makes it suitable primarily for reinforcement learning tasks where precise attitude dynamics are critical.
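As a rough illustration of how the right-hand side of Eq. (1) can be evaluated, the following is a plain-Python sketch for a single vehicle. It is not DiffAero's implementation, which the paper states is written in PyTorch and batched on the GPU. Everything specific here is an assumption made for the example: the `(w, x, y, z)` quaternion layout, diagonal drag and inertia matrices, the numeric defaults, a z-down world frame with $\mathbf{g} = (0, 0, 9.81)$, and a mass-normalized thrust `c`. The function names are hypothetical.

```python
# Plain-Python, single-vehicle sketch of the right-hand side of Eq. (1).
# Assumptions (not from the paper): quaternions are stored as (w, x, y, z),
# D and J are diagonal, the world frame is z-down so g = (0, 0, +9.81),
# and c is the mass-normalized collective thrust.


def quat_mul(q, r):
    """Hamilton product of two quaternions."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return (
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    )


def quat_to_rot(q):
    """Rotation matrix (body to world) of a unit quaternion."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def full_dynamics(v, q, w, c, tau, g=(0.0, 0.0, 9.81),
                  d_diag=(0.1, 0.1, 0.1), j_diag=(0.01, 0.01, 0.02)):
    """Return (p_dot, v_dot, q_dot, w_dot) as in Eq. (1)."""
    R = quat_to_rot(q)
    z_b = (R[0][2], R[1][2], R[2][2])  # body z-axis in the world frame
    # Drag term R D R^T v: rotate v into the body frame, scale, rotate back.
    v_body = [sum(R[k][i] * v[k] for k in range(3)) for i in range(3)]
    scaled = [d_diag[i] * v_body[i] for i in range(3)]
    drag = [sum(R[i][k] * scaled[k] for k in range(3)) for i in range(3)]
    p_dot = tuple(v)
    v_dot = tuple(g[i] - c * z_b[i] - drag[i] for i in range(3))
    q_dot = tuple(0.5 * x for x in quat_mul(q, (0.0, *w)))
    gyro = cross(w, tuple(j_diag[i] * w[i] for i in range(3)))
    w_dot = tuple((tau[i] - gyro[i]) / j_diag[i] for i in range(3))
    return p_dot, v_dot, q_dot, w_dot
```

Under these assumptions, a level vehicle with `c = 9.81` and zero torque has zero linear and angular acceleration, which is a quick sanity check on the sign convention.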

The simplified quadrotor dynamics\[[29](https://arxiv.org/html/2509.10247v1#bib.bib29)\] is defined as follows

$$
\begin{cases}
\dot{\mathbf{p}} = \mathbf{v} \\
\dot{\mathbf{v}} = \mathbf{R}c + \mathbf{g} \\
\dot{\mathbf{R}} = \mathbf{R}[\boldsymbol{\omega}]_{\times}
\end{cases}
\tag{2}
$$

where $[\cdot]_{\times}$ denotes the skew-symmetric matrix operator. The dynamics model ([2](https://arxiv.org/html/2509.10247v1#S3.E2)) bypasses body-rate dynamics by directly taking body rates and collective thrust as inputs while preserving attitude dynamics.
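The attitude equation in Eq. (2) can be made concrete with a small plain-Python sketch of the skew-symmetric operator and the resulting rotation rate. The helper names are hypothetical, and the forward-Euler update is only one simple integration choice, not something the paper specifies.

```python
def skew(w):
    """Return [w]_x, the matrix satisfying [w]_x x = w × x."""
    wx, wy, wz = w
    return [[0.0, -wz, wy],
            [wz, 0.0, -wx],
            [-wy, wx, 0.0]]


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_rate(R, w):
    """R_dot = R [w]_x, the attitude kinematics of Eq. (2)."""
    return matmul(R, skew(w))


def euler_step_rotation(R, w, dt):
    """Naive forward-Euler update of R (an assumed integrator).

    The result drifts away from a valid rotation matrix over many steps
    unless it is re-orthonormalized.
    """
    R_dot = rotation_rate(R, w)
    return [[R[i][j] + dt * R_dot[i][j] for j in range(3)] for i in range(3)]
```

For the identity attitude and a unit yaw rate, `rotation_rate` returns `skew((0, 0, 1))` itself, so the body x-axis starts turning toward the world y-axis.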

The remaining two dynamics models are based on the point\-mass dynamics for differentiable learning algorithms\. The continuous\-time point\-mass dynamics is defined as follows

$$
\begin{cases}
\dot{\mathbf{p}} = \mathbf{v} \\
\dot{\mathbf{v}} = \mathbf{a} + \mathbf{g} - d\,\mathbf{v} \\
\dot{\mathbf{a}} = \lambda(\mathbf{u} - \mathbf{a})
\end{cases}
\tag{3}
$$

where $\mathbf{a}$ represents the acceleration of the quadrotor in the world frame, $\mathbf{u}$ is the control input, $d$ is the drag coefficient, and $\lambda$ is the control latency factor. Compared to ([2](https://arxiv.org/html/2509.10247v1#S3.E2)), the dynamics model ([3](https://arxiv.org/html/2509.10247v1#S3.E3)) additionally omits the rotational degrees of freedom and focuses on the movement of the center of mass, thereby simplifying the computation and enabling smooth gradient backpropagation through the dynamics. Following\[[14](https://arxiv.org/html/2509.10247v1#bib.bib14)\], the discrete-time point-mass dynamics is defined as follows

$$
\begin{cases}
\mathbf{p}_{t+1} = \mathbf{p}_{t} + \mathbf{v}_{t}\Delta t + \dfrac{1}{2}\mathbf{u}_{t}\Delta t^{2} \\
\mathbf{v}_{t+1} = \mathbf{v}_{t} + \dfrac{1}{2}(\mathbf{u}_{t} + \mathbf{u}_{t+1})\Delta t
\end{cases}
\tag{4}
$$

The dynamics model ([4](https://arxiv.org/html/2509.10247v1#S3.E4)) is the simplest among all implemented models and requires minimal computation, making it suitable for tasks prioritizing extreme simulation speed.
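The two point-mass models, Eq. (3) and Eq. (4), can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than DiffAero's code: the forward-Euler integrator for Eq. (3), the z-up gravity vector, the default `drag` and `lam` values, and all function names are choices made for the example.

```python
GRAVITY = (0.0, 0.0, -9.81)  # assumed z-up world frame


def step_continuous(p, v, a, u, dt, drag=0.1, lam=5.0):
    """One forward-Euler step of the continuous-time model, Eq. (3)."""
    p_next = tuple(pi + vi * dt for pi, vi in zip(p, v))
    v_next = tuple(vi + (ai + gi - drag * vi) * dt
                   for vi, ai, gi in zip(v, a, GRAVITY))
    a_next = tuple(ai + lam * (ui - ai) * dt for ai, ui in zip(a, u))
    return p_next, v_next, a_next


def step_discrete(p, v, u_t, u_next, dt):
    """One step of the discrete-time model, Eq. (4)."""
    p_next = tuple(pi + vi * dt + 0.5 * ui * dt * dt
                   for pi, vi, ui in zip(p, v, u_t))
    v_next = tuple(vi + 0.5 * (ui + uj) * dt
                   for vi, ui, uj in zip(v, u_t, u_next))
    return p_next, v_next


def rollout_discrete(p0, v0, controls, dt):
    """Apply Eq. (4) over controls u_0..u_T, consuming pairs (u_t, u_{t+1})."""
    p, v = p0, v0
    for u_t, u_next in zip(controls, controls[1:]):
        p, v = step_discrete(p, v, u_t, u_next, dt)
    return p, v
```

With a constant control, the discrete model reproduces constant-acceleration kinematics: ten steps of `dt = 0.1` under `u = (1, 0, 0)` from rest give a speed of about 1 and a displacement of about 0.5 along x.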

The dynamics models described above are implemented in PyTorch, thereby supporting direct gradient backpropagation, single- and multi-agent setups, and massively parallel simulations on GPUs. Given a state $\mathbf{s}_{t_2}$ and an action $\mathbf{a}_{t_1}$ with $t_2 > t_1$, the gradient of $\mathbf{s}_{t_2}$ with respect to $\mathbf{a}_{t_1}$ can be computed as follows through back-propagation

$$
\frac{\partial \mathbf{s}_{t_2}}{\partial \mathbf{a}_{t_1}} = \frac{\partial \mathbf{s}_{t_1+1}}{\partial \mathbf{a}_{t_1}} \prod_{t=t_1+1}^{t_2-1} \frac{\partial \mathbf{s}_{t+1}}{\partial \mathbf{s}_{t}}
\tag{5}
$$

Although primarily designed for quadrotors, these models can be readily adapted to other multi-rotor configurations (e.g., hexarotors and octarotors) with minor modifications. DiffAero enables further research on dynamics modeling by making all dynamics modular, supporting them in all flight tasks, and offering a unified base class for custom dynamics.
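To see the chain rule of Eq. (5) at work without an autograd engine, the sketch below uses a one-dimensional, forward-Euler version of the point-mass model in Eq. (3), writes its two Jacobians by hand, and compares their product against a finite-difference estimate. The constants, the scalar setting, and the function names are assumptions for illustration; in DiffAero the same quantity would come from PyTorch back-propagation. The action is called `u` in the code to avoid a clash with the acceleration state `a`, and gravity is dropped because a constant term does not affect the Jacobians. For matrix-valued Jacobians, later factors multiply on the left, which is how the loop applies them.

```python
DT, DRAG, LAM = 0.02, 0.1, 5.0  # illustrative constants, not from the paper


def step(s, u):
    """Forward-Euler step of a 1-D version of Eq. (3); s = (p, v, a)."""
    p, v, a = s
    return (p + v * DT, v + (a - DRAG * v) * DT, a + LAM * (u - a) * DT)


# Hand-derived Jacobians of `step`.
DS_DS = [[1.0, DT, 0.0],
         [0.0, 1.0 - DRAG * DT, DT],
         [0.0, 0.0, 1.0 - LAM * DT]]   # d s_{t+1} / d s_t
DS_DU = [0.0, 0.0, LAM * DT]           # d s_{t+1} / d u_t


def matvec(m, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]


def grad_state_wrt_action(t1, t2):
    """d s_{t2} / d u_{t1} via Eq. (5): one action Jacobian, then
    t2 - t1 - 1 state Jacobians applied from the left."""
    grad = list(DS_DU)
    for _ in range(t1 + 1, t2):
        grad = matvec(DS_DS, grad)
    return grad


def rollout(s0, actions):
    """Return s_T after applying actions u_0..u_{T-1}."""
    s = s0
    for u in actions:
        s = step(s, u)
    return s


def finite_difference(s0, actions, t1, eps=1e-6):
    """Numerical estimate of d s_T / d u_{t1} for comparison."""
    bumped = list(actions)
    bumped[t1] += eps
    base, pert = rollout(s0, actions), rollout(s0, bumped)
    return [(b - a) / eps for a, b in zip(base, pert)]
```

Because this toy system is linear, the analytic product and the finite-difference estimate agree up to rounding error for any choice of `t1 < t2`.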

### III-B Sensor Stack

To enable end-to-end flight policy learning, DiffAero provides proprioceptive and exteroceptive sensor data regarding the agent and its surrounding obstacles. DiffAero supports three sensor modalities: IMU, depth camera, and LiDAR, all implemented in PyTorch. Both the depth camera and the LiDAR are implemented with a custom ray-casting algorithm. At resolutions below $64\times 64$, which are crucial for agents to learn generalizable policies from simple obstacle configurations and visual effects, GPU-parallelized ray-casting offers significant performance and usability advantages compared to rasterization pipelines. By excluding obstacles outside the sensor's FOV from the ray-casting pipeline, the rendering efficiency is greatly improved, particularly in cluttered environments, as shown in Fig. [3](https://arxiv.org/html/2509.10247v1#S3.F3). Additionally, rather than performing computationally expensive ray-casting against individual triangle meshes\[[12](https://arxiv.org/html/2509.10247v1#bib.bib12),[14](https://arxiv.org/html/2509.10247v1#bib.bib14)\], we implement specialized ray-casting functions for every primitive shape (e.g., spheres, cubes, cylinders) to further accelerate the rendering pipeline.
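The two ideas in this paragraph, a closed-form intersection per primitive and culling of obstacles outside the field of view, can be sketched for spheres in plain Python. This is a serial, CPU-only illustration under assumptions: the paper does not describe its culling criterion, so the conservative cone test below (with `half_fov` taken as the half-angle of a cone enclosing the whole view frustum), the `max_depth` clamp, and all function names are hypothetical.

```python
import math


def ray_sphere_depth(origin, direction, center, radius):
    """Distance along a unit-length ray to a sphere, or math.inf on a miss."""
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(x * d for x, d in zip(oc, direction))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return math.inf
    root = math.sqrt(disc)
    t = -b - root              # nearest intersection
    if t < 0.0:
        t = -b + root          # ray origin lies inside the sphere
    return t if t >= 0.0 else math.inf


def in_fov(origin, forward, center, radius, half_fov):
    """Conservative cone test: could any part of the sphere be visible?"""
    to_c = [c - o for c, o in zip(center, origin)]
    dist = math.sqrt(sum(x * x for x in to_c))
    if dist <= radius:
        return True
    cos_angle = sum(x * f for x, f in zip(to_c, forward)) / dist
    angle = math.acos(max(-1.0, min(1.0, cos_angle)))
    return angle - math.asin(radius / dist) <= half_fov


def render_depth(origin, forward, rays, spheres, half_fov, max_depth=10.0):
    """Depth per ray, after dropping spheres outside the field of view."""
    visible = [(c, r) for c, r in spheres
               if in_fov(origin, forward, c, r, half_fov)]
    return [min([max_depth] + [ray_sphere_depth(origin, d, c, r)
                               for c, r in visible])
            for d in rays]
```

Culling happens once per sensor rather than once per ray, so in a cluttered scene most ray-primitive tests are skipped before the inner loop starts.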

DiffAero also provides access to the quadrotor's IMU data with configurable noise and drift parameters for odometry-based policy learning. Following\[[30](https://arxiv.org/html/2509.10247v1#bib.bib30)\], the attitude under the point-mass dynamics models ([3](https://arxiv.org/html/2509.10247v1#S3.E3)) and ([4](https://arxiv.org/html/2509.10247v1#S3.E4)) is governed by the collective thrust direction and the velocity vector. Specifically, since propeller forces are aligned with the body frame's z-axis, two rotational degrees of freedom are determined by the collective thrust vector. The remaining degree of freedom is constrained by aligning the body frame's x-axis with the projection of the exponential moving average of the velocity vector onto the horizontal plane. This configuration ensures that the quadrotor consistently orients itself in the direction of motion, reducing the task difficulty and avoiding sideslip, which can be problematic when the environment is inadequately observed. However, this approach might be less effective in highly cluttered or indoor environments, where rotational maneuvers in place are crucial for gathering detailed environmental information.
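One way to realize the attitude construction described above is sketched below in plain Python. The paper gives the rule in words only, so the specific cross-product construction, the z-up world frame, the EMA coefficient `beta`, and the function names are assumptions. The sketch also does not handle the degenerate cases of zero horizontal velocity or a heading parallel to the thrust direction.

```python
import math


def normalize(x):
    n = math.sqrt(sum(c * c for c in x))
    return tuple(c / n for c in x)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def update_velocity_ema(v_ema, v, beta=0.9):
    """Exponential moving average of the velocity; beta is an assumed value."""
    return tuple(beta * e + (1.0 - beta) * c for e, c in zip(v_ema, v))


def attitude_from_thrust_and_velocity(thrust, v_ema):
    """Return body axes (x_B, y_B, z_B) expressed in a z-up world frame."""
    z_b = normalize(thrust)                          # fixes two rotational DoF
    heading = normalize((v_ema[0], v_ema[1], 0.0))   # horizontal projection
    y_b = normalize(cross(z_b, heading))
    x_b = cross(y_b, z_b)                            # unit length by construction
    return x_b, y_b, z_b
```

When the thrust vector tilts forward, the resulting x-axis pitches nose-down while its horizontal projection still points along the smoothed velocity, which matches the behavior described in the text.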

![Refer to caption](https://arxiv.org/html/2509.10247v1/x3.png)

Figure 3: Rendering strategy of ray-casting sensors. Obstacles outside the sensor's field of view are excluded from the ray-casting process, substantially accelerating rendering.

![Refer to caption](https://arxiv.org/html/2509.10247v1/x4.png)

Figure 4: Different learning paradigms under a unified learning interface. (a) Differentiable learning algorithms, (b) hybrid learning algorithms, and (c) reinforcement learning algorithms are all supported by the interface.

### III-C Flight Tasks

For high\-speed autonomous flight, we implement three flight tasks\.

- Position Control: The quadrotor(s) must reach and hover at specified target positions from random initial states. This task primarily evaluates the fidelity of the dynamics models and the effectiveness of learning algorithms. Under the multi-agent configuration, quadrotors additionally maintain a prescribed formation while avoiding inter-agent collisions.
- Obstacle Avoidance: The quadrotor(s), equipped with a depth camera or a LiDAR, navigate to and hover at target positions while avoiding collisions with environmental obstacles. The obstacles are randomly placed around the path between the initial position and the target position, ensuring a non-trivial obstacle avoidance policy. Formation and inter-agent collision-avoidance constraints remain active in the multi-agent setting.
- Autonomous Racing: The quadrotor traverses a set of gates in a predefined order as quickly as possible. Gate positions are provided either through onboard depth sensing or as ground-truth relative poses.

The task environments are visualized in Fig\.[1](https://arxiv.org/html/2509.10247v1#S1.F1)\. Both position control and obstacle avoidance tasks support single\- and multi\-agent configurations\. Observations consist of proprioceptive states, goal\-related vectors, and exteroceptive measurements \(depth or LiDAR\)\. A copy of the flight environment of each task is provided in Gazebo\. Once a policy is trained, both the policy and the definition of the observation space are exported to ONNX format, loaded by ROS nodes, and used for inference during SITL or real\-world experiments, as illustrated in Fig\.[5](https://arxiv.org/html/2509.10247v1#S3.F5)\. Packaging the inference function and observation specification in ONNX enables seamless evaluation and deployment of trained policies\.

![Refer to caption](https://arxiv.org/html/2509.10247v1/x5.png)

Figure 5: Training and deployment pipeline. The observation function and the learned policy are exported from DiffAero in ONNX format and loaded by the ROS flight control and inference nodes.

### III-D Learning Interface

DiffAero provides a unified Gym-like learning interface for RL and differentiable-simulation-based learning algorithms. By providing both differentiable and non-differentiable reward signals at each time step, DiffAero enables easy development, evaluation, and comparison of RL and differentiable learning algorithms. Instead of using existing RL algorithm libraries like SB3\[[31](https://arxiv.org/html/2509.10247v1#bib.bib31)\] and rsl_rl\[[32](https://arxiv.org/html/2509.10247v1#bib.bib32)\], we implement common backbone networks for interactive learning from scratch in PyTorch, including MLP, CNN, RNN, and RCNN, since existing libraries are not capable of leveraging system differentiability. Additionally, we implement several learning agents with unified interfaces, decoupled from specific neural network designs, allowing easy comparison of different network architectures.

DiffAero supports four learning formulations: single-agent reinforcement learning (SARL), multi-agent reinforcement learning (MARL), single-agent differentiable learning, and multi-agent differentiable learning. Beyond RL and differentiable learning, hybrid learning algorithms that combine training techniques of both are also supported, as depicted in Fig. [4](https://arxiv.org/html/2509.10247v1#S3.F4)(b). With the support of this versatile interface, we implement several learning algorithms, including PPO\[[33](https://arxiv.org/html/2509.10247v1#bib.bib33)\], MAPPO\[[34](https://arxiv.org/html/2509.10247v1#bib.bib34)\], BPTT, SHAC\[[15](https://arxiv.org/html/2509.10247v1#bib.bib15)\], multi-agent SHAC, and DreamerV3\[[35](https://arxiv.org/html/2509.10247v1#bib.bib35)\], among others. Since network architecture, learning agent, and algorithmic logic are decoupled from each other, users can readily develop custom algorithms without concerning themselves with low-level implementation details.

### III-E Designing Novel Algorithms with DiffAero

To illustrate how DiffAero enables the development and evaluation of new algorithms and learning paradigms, we present a simple case study on training a vision-based navigation policy with a novel hybrid learning algorithm. We designed a variant of SHAC\[[15](https://arxiv.org/html/2509.10247v1#bib.bib15)\] that uses an asymmetric actor-critic architecture and different reward signals for direct gradient back-propagation and terminal value optimization, respectively, which we call short horizon asymmetric actor-critic (SHA2C).

Specifically, the policy network $\pi_{\theta}$ adopts a recurrent and convolutional architecture to encode visual inputs and preserve temporal information, while the value network $V_{\phi}$ digests the full simulator state with an MLP. We design two reward signals for the obstacle avoidance task: a local control reward function $R_{\text{ctrl}}$ and a goal-oriented reward function $R_{\text{goal}}$. Both reward signals are designed to guide the quadrotor to navigate to the target and avoid collision. The control reward function $R_{\text{ctrl}}$ is dense and differentiable with respect to the physical state of the agent, and controls the short-term behavior of the agent. The goal-oriented reward function $R_{\text{goal}}$ is sparse and non-differentiable; it supervises the long-term and global behavior with binary success and failure flags and represents the overall goal of the task.

At each time step, the policy network $\pi_{\theta}$ selects actions $\mathrm{a}_{t} \sim \mathcal{N}(\mu_{\theta}(\mathrm{o}_{t}), \sigma_{\theta}(\mathrm{o}_{t}))$, while the value network $V_{\phi}$ evaluates the state value of $R_{\text{goal}}$ from the full physical state $\mathrm{s}_{t}$. After collecting $N$ rollouts of length $T$, SHA2C updates both networks in an on-policy manner, using data collected by the agent. Specifically, the policy loss is computed by accumulating discounted local rewards and a terminal value as follows

$$
L(\theta) = -\frac{1}{NT}\sum_{i=1}^{N}\left[\left(\sum_{t=0}^{T-1}\gamma^{t} R_{\text{ctrl}}(\mathrm{s}^{i}_{t}, \mathrm{a}^{i}_{t})\right) + \gamma^{T} V_{\phi}(\mathrm{s}^{i}_{T})\right]
\tag{6}
$$

where $\mathrm{s}^{i}_{t}$ and $\mathrm{a}^{i}_{t}$ are the state and action at step $t$ of the $i$-th trajectory. The value network is updated using the MSE loss, matching its outputs to the value targets bootstrapped using the $k$-step return of the goal reward $R_{\text{goal}}$ from time $t$ through TD-$\lambda$.
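The arithmetic of the policy loss in Eq. (6) and of the value targets can be sketched in plain Python. Two caveats apply. First, the sketch works on plain floats, so it shows only how the numbers are combined; in a differentiable setting the loss would be back-propagated through the simulator. Second, the target computation uses the standard recursive form of the $\lambda$-return, which is an assumption about how the TD-$\lambda$ targets are formed, and it ignores episode termination. All function names are hypothetical.

```python
def sha2c_policy_loss(ctrl_rewards, terminal_values, gamma):
    """Eq. (6).

    ctrl_rewards[i][t] is R_ctrl(s_t^i, a_t^i) for rollout i (length T);
    terminal_values[i] is V_phi(s_T^i).
    """
    n, horizon = len(ctrl_rewards), len(ctrl_rewards[0])
    total = 0.0
    for rewards, v_end in zip(ctrl_rewards, terminal_values):
        discounted = sum(gamma ** t * r for t, r in enumerate(rewards))
        total += discounted + gamma ** horizon * v_end
    return -total / (n * horizon)


def td_lambda_targets(goal_rewards, values, gamma, lam):
    """Standard lambda-return targets for one rollout (assumed form).

    goal_rewards[t] is R_goal at step t (length T);
    values[t] is V_phi(s_t) (length T + 1).
    """
    horizon = len(goal_rewards)
    targets = [0.0] * horizon
    next_return = values[horizon]          # bootstrap from V(s_T)
    for t in reversed(range(horizon)):
        mix = (1.0 - lam) * values[t + 1] + lam * next_return
        next_return = goal_rewards[t] + gamma * mix
        targets[t] = next_return
    return targets


def value_loss(predictions, targets):
    """Mean squared error between value predictions and targets."""
    return sum((p - g) ** 2 for p, g in zip(predictions, targets)) / len(targets)
```

Setting `lam = 0` reduces the targets to one-step TD targets, while `lam = 1` gives the full discounted return bootstrapped from the final value, which brackets the range of $k$-step returns being averaged.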

Since SHA2C optimizes $R_{\text{ctrl}}$ and $R_{\text{goal}}$ with first- and zeroth-order policy gradients, respectively, it relaxes the constraint that all components of the reward signal in differentiable-simulation-based learning algorithms must be dense, continuous, and differentiable. With proper design, the policy learning process benefits from $R_{\text{goal}}$ even when it is simple and sparse. This example shows that the versatility of DiffAero enables researchers to develop and evaluate novel algorithms and learning paradigms.

## IV Experiments

To evaluate the performance of the simulation framework and demonstrate the effectiveness of the proposed SHA2C algorithm, this section presents experimental results on: (1) simulation speed and VRAM utilization; (2) performance of baseline algorithms on flight tasks; (3) comparison of different quadrotor dynamics models; and (4) deployment examples of flight policies trained by SHA2C.

![Refer to caption](https://arxiv.org/html/2509.10247v1/x6.png)

Figure 6: Comparison of the simulation performance among GPU-parallelized drone simulators. The upper two figures depict the FPS of the simulators when performing physics simulation (total number of interactions per second) and rendering $64\times 64$ depth images (total number of images per second) under varying numbers of environments. The lower two figures present the VRAM consumption of the simulators during physics simulation and depth rendering, evaluated using a fixed number of 2,048 parallel environments. Note that the depth rendering performance of OmniDrones is not provided since it does not support depth camera functionality.

### IV-A Simulation Performance

We compared the simulation speed and VRAM consumption of several simulators designed exclusively for drones, as illustrated in Fig. [6](https://arxiv.org/html/2509.10247v1#S4.F6). The results indicate that our simulator achieves the highest throughput in both physics simulation and depth image rendering. Additionally, it keeps VRAM consumption within a practical range, enabling high-speed simulation on mainstream consumer-grade hardware. Notably, the physics simulation speed of our simulator continues to scale without saturation, even with a massive number of environments (e.g., 8,192), whereas the other simulators saturate. This demonstrates the scalability potential of our simulation framework. All reported results include the time and VRAM overhead associated with observation calculation, reward calculation, and environment reset operations. All results were obtained on a desktop workstation equipped with an Intel Core i9-14900K processor and an NVIDIA GeForce RTX 4090 graphics card.
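The throughput metric used above (total interactions per second) is the number of parallel environments times the number of batched steps, divided by wall-clock time. The sketch below shows that arithmetic together with a simple timing harness. Both function names are hypothetical, and `step_fn` stands in for one batched simulator step that already includes observation, reward, and reset work.

```python
import time
from typing import Callable


def interactions_per_second(num_envs: int, num_steps: int, elapsed_s: float) -> float:
    """Total environment interactions per second for a batched simulator."""
    if elapsed_s <= 0.0:
        raise ValueError("elapsed time must be positive")
    return num_envs * num_steps / elapsed_s


def measure_throughput(step_fn: Callable[[], None], num_envs: int, num_steps: int) -> float:
    """Time num_steps calls of a batched step function and report throughput.

    For GPU simulators, step_fn should block until the device has finished
    (GPU kernels are launched asynchronously); otherwise the measured time
    only covers kernel launches.
    """
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return interactions_per_second(num_envs, num_steps, elapsed)


# 2,048 environments stepped 100 times in half a second.
print(interactions_per_second(2048, 100, 0.5))
```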

![Refer to caption](https://arxiv.org/html/2509.10247v1/x7.png)

Figure 7: The learning curves of algorithm baselines on flight tasks provided by the simulation framework. All results are obtained using continuous point-mass dynamics with the same number of parallel environments and the same number of policy updates, except for DreamerV3, which needs far more updates and far less data than the other algorithms.

### IV-B Benchmark Flight Tasks

We tested the performance of algorithm baselines in single-agent versions of the flight tasks. The results are shown in Fig. [7](https://arxiv.org/html/2509.10247v1#S4.F7). All evaluated algorithms were adapted from open-source implementations and modified to interface with our learning agents and network architectures. For each algorithm, the same set of hyperparameters was applied across all tasks, except for the network architecture, which was tailored to each algorithm to achieve the best performance. The weights assigned to individual reward components were tuned separately for RL algorithms and differentiable algorithms, since RL algorithms are sensitive to the magnitude of the reward signal, whereas differentiable algorithms are affected primarily by the derivative of the reward, regardless of its value. The rewards reported in Fig. [7](https://arxiv.org/html/2509.10247v1#S4.F7) are weighted in the same manner as the differentiable reward, except for the racing task, for which we report the reward signal used by RL algorithms.

As shown in Fig. [7](https://arxiv.org/html/2509.10247v1#S4.F7), in the position control task, differentiable algorithms consistently outperformed RL algorithms, achieving even greater data efficiency than the model-based DreamerV3. We attribute this to the simplicity of both the point-mass dynamics model and the task goal, which makes the differentiable reward signals intuitive and effective. All algorithms converged to a 100% success rate in this task. In the obstacle avoidance task, SHA2C achieved the best success rate and stability among the algorithms we tested. Although BPTT and SHAC use the same differentiable reward signal $R_{\text{ctrl}}$, SHA2C was able to further improve the policy under the guidance of $R_{\text{goal}}$, avoiding convergence to local optima. For the racing task, we adopted the progress reward design of \[[7](https://arxiv.org/html/2509.10247v1#bib.bib7)\]. Since we were unable to design an effective differentiable reward signal that guides the agent smoothly through the gates, all differentiable learning algorithms performed poorly in this task.

![Refer to caption](https://arxiv.org/html/2509.10247v1/x8.png)

Figure 8: The learning curves of different dynamics models on flight tasks.

### IV-C Comparison of Dynamics Models

We compared the learning curves of different dynamics models across the three flight tasks provided in our framework, as shown in Fig\.[8](https://arxiv.org/html/2509.10247v1#S4.F8)\. For each task, the policy was trained using the algorithm that achieved the best performance on the corresponding task \(i\.e\., SHA2C for position control and obstacle avoidance, PPO for racing\)\. The results highlight a clear trade\-off: full quadrotor dynamics provides higher fidelity by retaining detailed attitude information but incurs substantial training overhead and higher difficulty, leading to slower convergence\. In contrast, point\-mass dynamics models sacrifice attitude fidelity yet dramatically simplify the optimization landscape, resulting in faster and more stable policy learning\.

![Refer to caption](https://arxiv.org/html/2509.10247v1/x9.png)

Figure 9: Velocity curve comparison between DiffAero and the real world under the position control scenario.

![Refer to caption](https://arxiv.org/html/2509.10247v1/x10.png)

Figure 10: Trajectory of vision-based obstacle avoidance in Gazebo.

### IV-D Deployment Examples

In this section, we present the deployment results of our trained policies both in Gazebo and in real-world experiments. The policies were trained using the proposed SHA2C algorithm in 1,024 parallel environments with continuous point-mass dynamics. Desired accelerations generated by the agent were converted into target attitude and thrust commands and fed into the PX4 flight control stack. The air drag coefficient $d$, control latency $\lambda$, and action range were randomized at the start of each episode for generalization, and the randomization range of $\lambda$ was empirically calibrated to match the dynamics of the real quadrotor \[[36](https://arxiv.org/html/2509.10247v1#bib.bib36)\]. To learn a yaw-invariant flight policy, observations and actions are defined in a local frame that retains only the yaw attitude while discarding the pitch and roll angles.
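The two geometric steps described above can be sketched as follows. The text does not give the exact conversion, so this example assumes the common construction in which the thrust vector must supply the desired acceleration plus gravity compensation, and the body axes are built from that vector and a desired yaw. It also assumes a z-up world frame, a constant `GRAVITY`, and hypothetical function names. PX4 uses NED/FRD conventions internally, so a frame conversion would still be needed before sending such a setpoint.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]
GRAVITY = 9.81  # m/s^2, assumed constant for this sketch


def _normalize(v: Vec3) -> Vec3:
    norm = math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
    return (v[0] / norm, v[1] / norm, v[2] / norm)


def _cross(a: Vec3, b: Vec3) -> Vec3:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def world_to_yaw_frame(v_world: Vec3, yaw: float) -> Vec3:
    """Rotate a world-frame vector into a local frame that keeps only yaw."""
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * v_world[0] + s * v_world[1], -s * v_world[0] + c * v_world[1], v_world[2])


def acceleration_to_attitude_thrust(
    acc_world: Vec3, yaw: float, mass: float
) -> Tuple[Tuple[Vec3, Vec3, Vec3], float]:
    """Map a desired world-frame acceleration to body axes (x, y, z) and thrust in N.

    Degenerate cases (zero thrust vector, thrust parallel to the heading)
    are not handled in this sketch.
    """
    # The thrust vector cancels gravity and supplies the desired acceleration.
    thrust_vec = (mass * acc_world[0], mass * acc_world[1], mass * (acc_world[2] + GRAVITY))
    thrust = math.sqrt(sum(c * c for c in thrust_vec))
    z_body = _normalize(thrust_vec)
    # Heading direction implied by the desired yaw, lying in the horizontal plane.
    x_heading = (math.cos(yaw), math.sin(yaw), 0.0)
    y_body = _normalize(_cross(z_body, x_heading))
    x_body = _cross(y_body, z_body)
    return (x_body, y_body, z_body), thrust


axes, thrust = acceleration_to_attitude_thrust((0.0, 0.0, 0.0), yaw=0.0, mass=1.0)
print(axes, thrust)  # hover: identity attitude, thrust equal to the weight
```

Expressing observations and actions through `world_to_yaw_frame` means that two situations differing only by a rotation about the vertical axis look identical to the policy, which is the yaw invariance the paragraph refers to.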

For real-world experiments, we employed an OptiTrack motion capture system to estimate the quadrotor's position, velocity, and attitude. Onboard computation is handled by a Radxa X4 computer equipped with an Intel N100 processor. Depth images are captured by a RealSense D435i camera, post-processed with filters provided by the RealSense SDK, and downsampled to $16\times 9$ pixels, which suppresses noise and minimizes the sim-to-real gap of visual features.
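The downsampling step can be illustrated with block averaging, which maps each output pixel to the mean of a rectangular patch of input pixels. The text does not state which resampling method is used, so average pooling, the function name `downsample_depth`, and the list-of-lists image layout are assumptions of this sketch. Invalid depth readings are not treated specially here.

```python
from typing import List

DepthImage = List[List[float]]


def downsample_depth(image: DepthImage, out_h: int, out_w: int) -> DepthImage:
    """Average-pool a depth image down to out_h rows by out_w columns.

    The input height and width must be integer multiples of the output size.
    """
    in_h, in_w = len(image), len(image[0])
    if in_h % out_h or in_w % out_w:
        raise ValueError("input size must be an integer multiple of the output size")
    block_h, block_w = in_h // out_h, in_w // out_w

    pooled: DepthImage = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect the patch of input pixels that maps to output pixel (i, j).
            patch = [
                image[y][x]
                for y in range(i * block_h, (i + 1) * block_h)
                for x in range(j * block_w, (j + 1) * block_w)
            ]
            row.append(sum(patch) / len(patch))
        pooled.append(row)
    return pooled


# An illustrative 360-row by 640-column frame reduced to 9 rows by 16 columns.
frame = [[2.0] * 640 for _ in range(360)]
small = downsample_depth(frame, out_h=9, out_w=16)
print(len(small), len(small[0]))
```

Averaging over each patch also attenuates per-pixel noise, which is consistent with the stated goal of suppressing noise, although a real pipeline might prefer a minimum or median over the patch to stay conservative near obstacles.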

Fig. [9](https://arxiv.org/html/2509.10247v1#S4.F9) presents the velocity curves of the quadrotor in the position control task in both DiffAero and the real world. With moderate domain randomization, the actual quadrotor's response closely matches the simulated dynamics. Fig. [10](https://arxiv.org/html/2509.10247v1#S4.F10) illustrates the flight trajectory of a quadrotor in Gazebo, controlled by a vision-based obstacle avoidance policy. The quadrotor successfully navigates through cluttered corridors, achieving a peak velocity of 4 m/s. Fig. [11](https://arxiv.org/html/2509.10247v1#S4.F11) shows the trajectory of a quadrotor in a cluttered real-world environment. The deployment results demonstrate that the trained policies transfer to other simulators and to real-world settings, even in the presence of noisy sensory inputs.

![Refer to caption](https://arxiv.org/html/2509.10247v1/imgs/realworld-trajectory-small.jpg)

Figure 11: Vision-based obstacle avoidance trajectory in the real world.

## V Conclusion and Discussion

In this paper, we have introduced DiffAero, a fully differentiable quadrotor simulator that harnesses GPU\-parallelized physics and custom ray casting to deliver orders\-of\-magnitude improvements in simulation and rendering throughput compared to existing platforms\. Beyond performance, DiffAero provides a modular and extensible research platform that unifies multiple dynamics models, sensor modalities, flight tasks, and learning algorithms within a GPU\-native learning interface, thereby enabling systematic investigations of learning\-based aerial robotics\. To illustrate its capability, we presented a case study using a variant of SHAC \(SHA2C\) for training vision\-based navigation policies with DiffAero\. Our extensive benchmarks and real\-world experiments demonstrate that DiffAero enables rapid, reliable training and seamless sim\-to\-real transfer of quadrotor flight policies on standard hardware\.

Looking forward, we expect DiffAero to serve as an enabling tool for addressing fundamental research questions, such as how the choice of dynamics models affects the learning process, the role of non\-differentiable reward signals in hybrid learning algorithms, sim\-to\-real transfer, and multi\-agent coordination\. We hope that DiffAero will accelerate research in learning\-based aerial robotics and foster new advances in agile, autonomous flight control\.

## References

- \[1\] D. Mellinger and V. Kumar, “Minimum snap trajectory generation and control for quadrotors,” in *Proc. IEEE Int. Conf. Robot. Autom.*, Shanghai, China, Aug. 2011, pp. 2520–2525.
- \[2\] Y. Ren, F. Zhu, G. Lu, Y. Cai, L. Yin, F. Kong, J. Lin, N. Chen, and F. Zhang, “Safety-assured high-speed navigation for MAVs,” *Sci. Robot.*, vol. 10, no. 98, p. eado6187, Jan. 2025.
- \[3\] A. Loquercio, E. Kaufmann, R. Ranftl, M. Müller, V. Koltun, and D. Scaramuzza, “Learning high-speed flight in the wild,” *Sci. Robot.*, vol. 6, no. 59, p. eabg5810, Oct. 2021.
- \[4\] K. Kondo, A. Tagliabue, X. Cai, C. Tewari, O. Garcia, M. Espitia-Alvarez, and J. How, “CGD: Constraint-guided diffusion policies for UAV trajectory planning,” *arXiv:2405.01758*, 2024.
- \[5\] W. Zhang, G. Wang, J. Sun, Y. Yuan, and G. Huang, “STORM: Efficient stochastic transformer based world models for reinforcement learning,” in *Proc. Adv. Neural Inf. Process. Syst.*, vol. 36, New Orleans, LA, USA, Dec. 2023, pp. 27 147–27 166.
- \[6\] R. Ferede, C. De Wagter, D. Izzo, and G. C. De Croon, “End-to-end reinforcement learning for time-optimal quadcopter flight,” in *Proc. IEEE Int. Conf. Robot. Autom.*, 2024, pp. 6172–6177.
- \[7\] E. Kaufmann, L. Bauersfeld, A. Loquercio, M. Müller, V. Koltun, and D. Scaramuzza, “Champion-level drone racing using deep reinforcement learning,” *Nature*, vol. 620, no. 7976, pp. 982–987, Aug. 2023.
- \[8\] T. Wang and D. E. Chang, “Robust navigation for racing drones based on imitation learning and modularization,” in *Proc. IEEE Int. Conf. Robot. Autom.*, Xi’an, China, May 2021, pp. 13 724–13 730.
- \[9\] B. Xu, F. Gao, C. Yu, R. Zhang, Y. Wu, and Y. Wang, “OmniDrones: An efficient and flexible platform for reinforcement learning in drone control,” *IEEE Robot. Autom. Lett.*, vol. 9, no. 3, pp. 2838–2844, Mar. 2024.
- \[10\] F. Li, F. Sun, T. Zhang, and D. Zou, “VisFly: An efficient and versatile simulator for training vision-based flight,” *arXiv:2407.14783*, 2024.
- \[11\] K. Huang, H. Wang, Y. Luo, J. Chen, J. Chen, X. Zhang, X. Ji, and H. Liu, “A general infrastructure and workflow for quadrotor deep reinforcement learning and reality deployment,” *arXiv:2504.15129*, 2025.
- \[12\] M. Kulkarni, W. Rehberg, and K. Alexis, “Aerial Gym simulator: A framework for highly parallelized simulation of aerial robots,” *IEEE Robot. Autom. Lett.*, vol. 10, no. 4, pp. 4093–4100, Apr. 2025.
- \[13\] E. Kaufmann, A. Loquercio, R. Ranftl, M. Müller, V. Koltun, and D. Scaramuzza, “Deep drone acrobatics,” in *Proc. Robot. Sci. Syst.*, Corvallis, Oregon, USA, July 2020.
- \[14\] Y. Zhang, Y. Hu, Y. Song, D. Zou, and W. Lin, “Learning vision-based agile flight via differentiable physics,” *Nat. Mach. Intell.*, vol. 7, no. 6, pp. 954–966, Jun. 2025.
- \[15\] J. Xu, V. Makoviychuk, Y. Narang, F. Ramos, W. Matusik, A. Garg, and M. Macklin, “Accelerated policy learning with parallel differentiable simulation,” in *Proc. Int. Conf. Learn. Represent.*, Apr. 2021, pp. 1–26.
- \[16\] N. Koenig and A. Howard, “Design and use paradigms for Gazebo, an open-source multi-robot simulator,” in *Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst.*, vol. 3, Sendai, Japan, Sep. 2004, pp. 2149–2154.
- \[17\] Y. Song, S. Naji, E. Kaufmann, A. Loquercio, and D. Scaramuzza, “Flightmare: A flexible quadrotor simulator,” in *Proc. Conf. Robot Learn.*, vol. 155, Nov. 2021, pp. 1147–1157.
- \[18\] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “AirSim: High-fidelity visual and physical simulation for autonomous vehicles,” in *Proc. Field Serv. Robot.*, Cham, Sep. 2018, pp. 621–635.
- \[19\] K. Zakka and Y. Tassa, “MuJoCo Menagerie: A collection of high-quality simulation models for MuJoCo,” 2022. \[Online\]. Available: [http://github.com/google-deepmind/mujoco_menagerie](http://github.com/google-deepmind/mujoco_menagerie)
- \[20\] J. Panerati, H. Zheng, S. Zhou, J. Xu, A. Prorok, and A. P. Schoellig, “Learning to fly - A Gym environment with PyBullet physics for reinforcement learning of multi-agent quadcopter control,” in *Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst.*, Prague, Czech Republic, Sep. 2021, pp. 7512–7519.
- \[21\] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in *Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst.*, Vilamoura-Algarve, Portugal, Oct. 2012, pp. 5026–5033.
- \[22\] E. Coumans and Y. Bai, “PyBullet, a Python module for physics simulation for games, robotics and machine learning,” 2016.
- \[23\] V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa *et al.*, “Isaac Gym: High performance GPU-based physics simulation for robot learning,” *arXiv:2108.10470*, 2021.
- \[24\] A. Paszke, “PyTorch: An imperative style, high-performance deep learning library,” *arXiv:1912.01703*, 2019.
- \[25\] C. D. Freeman, E. Frey, A. Raichuk, S. Girgin, I. Mordatch, and O. Bachem, “Brax - A differentiable physics engine for large scale rigid body simulation,” 2021.
- \[26\] M. Macklin, “Warp: A high-performance Python framework for GPU simulation and graphics,” [https://github.com/nvidia/warp](https://github.com/nvidia/warp), Mar. 2022, NVIDIA GPU Technology Conference (GTC).
- \[27\] Z. Zhou, G. Wang, J. Sun, J. Wang, and J. Chen, “Efficient and robust time-optimal trajectory planning and control for agile quadrotor flight,” *IEEE Robot. Autom. Lett.*, vol. 8, no. 12, pp. 7913–7920, Dec. 2023.
- \[28\] T. Lee, M. Leok, and N. H. McClamroch, “Geometric tracking control of a quadrotor UAV on SE(3),” in *Proc. IEEE Conf. Decis. Control*, Atlanta, GA, USA, Dec. 2010, pp. 5420–5425.
- \[29\] J. Heeg, Y. Song, and D. Scaramuzza, “Learning quadrotor control from visual features using differentiable simulation,” in *Proc. IEEE Int. Conf. Robot. Autom.*, Atlanta, GA, USA, May 2025, pp. 4033–4039.
- \[30\] Y. Hu, Y. Zhang, Y. Song, Y. Deng, F. Yu, L. Zhang, W. Lin, D. Zou, and W. Yu, “Seeing through pixel motion: Learning obstacle avoidance from optical flow with one camera,” *IEEE Robot. Autom. Lett.*, vol. 10, no. 6, pp. 5871–5878, Jun. 2025.
- \[31\] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-Baselines3: Reliable reinforcement learning implementations,” *J. Mach. Learn. Res.*, vol. 22, no. 268, pp. 1–8, Jan. 2021.
- \[32\] N. Rudin, D. Hoeller, P. Reist, and M. Hutter, “Learning to walk in minutes using massively parallel deep reinforcement learning,” in *Proc. Conf. Robot Learn.*, vol. 164, London, Nov. 2022, pp. 91–100.
- \[33\] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” *arXiv:1707.06347*, 2017.
- \[34\] C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. Wu, “The surprising effectiveness of PPO in cooperative multi-agent games,” in *Proc. Adv. Neural Inf. Process. Syst.*, vol. 35, New Orleans, LA, USA, Nov. 2022, pp. 24 611–24 624.
- \[35\] D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap, “Mastering diverse control tasks through world models,” *Nature*, pp. 1–7, Apr. 2025.
- \[36\] J. Chen, C. Yu, Y. Xie, F. Gao, Y. Chen, S. Yu, W. Tang, S. Ji, M. Mu, Y. Wu, H. Yang, and Y. Wang, “What matters in learning a zero-shot sim-to-real RL policy for quadrotor control? A comprehensive study,” *arXiv:2412.11764*, 2024.
