RAPTOR: A Foundation Policy for Quadrotor Control
September 15, 2025
Authors: Jonas Eschmann, Dario Albani, Giuseppe Loianno
cs.AI
Abstract
Humans are remarkably data-efficient when adapting to new unseen conditions,
like driving a new car. In contrast, modern robotic control systems, like
neural network policies trained using Reinforcement Learning (RL), are highly
specialized for single environments. Because of this overfitting, they are
known to break down even under small differences like the Simulation-to-Reality
(Sim2Real) gap and require system identification and retraining for even
minimal changes to the system. In this work, we present RAPTOR, a method for
training a highly adaptive foundation policy for quadrotor control. Our method
enables training a single, end-to-end neural-network policy to control a wide
variety of quadrotors. We test 10 different real quadrotors from 32 g to 2.4 kg
that also differ in motor type (brushed vs. brushless), frame type (soft vs.
rigid), propeller type (2/3/4-blade), and flight controller
(PX4/Betaflight/Crazyflie/M5StampFly). We find that a tiny, three-layer policy
with only 2084 parameters is sufficient for zero-shot adaptation to a wide
variety of platforms. The adaptation through In-Context Learning is made
possible by using a recurrence in the hidden layer. The policy is trained
through a novel Meta-Imitation Learning algorithm, where we sample 1000
quadrotors and train a teacher policy for each of them using Reinforcement
Learning. Subsequently, the 1000 teachers are distilled into a single, adaptive
student policy. We find that within milliseconds, the resulting foundation
policy adapts zero-shot to unseen quadrotors. We extensively test the
capabilities of the foundation policy under numerous conditions (trajectory
tracking, indoor/outdoor, wind disturbance, poking, different propellers).
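The sketches below illustrate, in PyTorch, the two core ideas the abstract describes: a tiny three-layer policy whose recurrent hidden layer enables in-context adaptation, and the meta-imitation learning loop that distills many per-quadrotor RL teachers into one adaptive student. The layer widths, observation layout, GRU cell choice, and helper functions are assumptions made for illustration only; the paper's exact architecture and pipeline may differ.

```python
# Minimal sketch (not the authors' exact design) of a tiny recurrent control
# policy: three layers, a recurrent hidden layer, and on the order of 2k
# parameters. OBS_DIM, HIDDEN, and the GRUCell choice are assumptions.
import torch
import torch.nn as nn

OBS_DIM = 18   # assumed: pose/velocity errors, attitude, angular rates
ACT_DIM = 4    # four normalized motor commands
HIDDEN = 16    # assumed width that lands near the reported ~2k parameters

class TinyRecurrentPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(OBS_DIM, HIDDEN)  # layer 1: observation embedding
        self.rnn = nn.GRUCell(HIDDEN, HIDDEN)      # layer 2: recurrent hidden layer
        self.head = nn.Linear(HIDDEN, ACT_DIM)     # layer 3: motor-command head

    def forward(self, obs, h):
        x = torch.tanh(self.encoder(obs))
        h = self.rnn(x, h)            # hidden state accumulates an implicit estimate
                                      # of the platform's dynamics (in-context learning)
        return torch.tanh(self.head(h)), h

policy = TinyRecurrentPolicy()
print(sum(p.numel() for p in policy.parameters()))  # ~2k parameters with these widths
```

The meta-imitation learning stage could then look roughly like the loop below: sample a population of quadrotors, train a non-adaptive RL teacher for each, and behavior-clone all teachers into the single recurrent student on on-policy rollouts. `sample_quadrotor`, `train_teacher_rl`, and the `quad.reset()`/`quad.step()` simulator interface are hypothetical placeholders standing in for the paper's simulator and RL pipeline.

```python
# Hedged sketch of the meta-imitation learning loop from the abstract.
# sample_quadrotor, train_teacher_rl, and the quad.reset()/quad.step()
# interface are hypothetical placeholders; reuses TinyRecurrentPolicy and
# HIDDEN from the sketch above.
import torch

N_QUADROTORS = 1000   # population size quoted in the abstract
ROLLOUT_LEN = 256     # assumed rollout length per distillation episode

quads = [sample_quadrotor() for _ in range(N_QUADROTORS)]  # randomized platforms
teachers = [train_teacher_rl(q) for q in quads]            # one RL expert per platform

student = TinyRecurrentPolicy()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for epoch in range(100):
    for quad, teacher in zip(quads, teachers):
        h = torch.zeros(1, HIDDEN)            # fresh hidden state: the student must
        obs = quad.reset()                    # re-identify each platform in context
        loss = torch.zeros(())
        for _ in range(ROLLOUT_LEN):
            action, h = student(obs, h)
            with torch.no_grad():
                target = teacher(obs)         # teacher's action on the same observation
            loss = loss + ((action - target) ** 2).mean()
            obs = quad.step(action.detach())  # roll out under the student's own actions
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Whether the distillation is performed on-policy (as sketched) or from pre-collected teacher rollouts is not specified in the abstract; the on-policy form is shown only because it keeps the sketch compact.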