
RAPTOR: A Foundation Policy for Quadrotor Control

September 15, 2025
Authors: Jonas Eschmann, Dario Albani, Giuseppe Loianno
cs.AI

Abstract

Humans are remarkably data-efficient when adapting to new unseen conditions, like driving a new car. In contrast, modern robotic control systems, like neural network policies trained using Reinforcement Learning (RL), are highly specialized for single environments. Because of this overfitting, they are known to break down even under small differences like the Simulation-to-Reality (Sim2Real) gap and require system identification and retraining for even minimal changes to the system. In this work, we present RAPTOR, a method for training a highly adaptive foundation policy for quadrotor control. Our method enables training a single, end-to-end neural-network policy to control a wide variety of quadrotors. We test 10 different real quadrotors from 32 g to 2.4 kg that also differ in motor type (brushed vs. brushless), frame type (soft vs. rigid), propeller type (2/3/4-blade), and flight controller (PX4/Betaflight/Crazyflie/M5StampFly). We find that a tiny, three-layer policy with only 2084 parameters is sufficient for zero-shot adaptation to a wide variety of platforms. The adaptation through In-Context Learning is made possible by using a recurrence in the hidden layer. The policy is trained through a novel Meta-Imitation Learning algorithm, where we sample 1000 quadrotors and train a teacher policy for each of them using Reinforcement Learning. Subsequently, the 1000 teachers are distilled into a single, adaptive student policy. We find that within milliseconds, the resulting foundation policy adapts zero-shot to unseen quadrotors. We extensively test the capabilities of the foundation policy under numerous conditions (trajectory tracking, indoor/outdoor, wind disturbance, poking, different propellers).
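To make the described architecture and training scheme concrete, the sketch below shows a minimal recurrent policy of roughly the size mentioned in the abstract (three layers with a recurrent hidden layer for in-context adaptation) and a simplified meta-imitation (distillation) step in which a single student imitates per-quadrotor RL teachers. This is not the authors' released code: the observation/action dimensions, hidden width, and the `teacher`/`env` interfaces are illustrative assumptions.

```python
# Minimal sketch (assumed dimensions and interfaces; not the paper's implementation).
import torch
import torch.nn as nn


class TinyRecurrentPolicy(nn.Module):
    """Three-layer policy with a recurrent hidden layer enabling in-context adaptation."""

    def __init__(self, obs_dim=18, hidden_dim=16, act_dim=4):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)   # input layer
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)   # recurrent hidden layer (carries context)
        self.head = nn.Linear(hidden_dim, act_dim)      # motor-command output layer

    def forward(self, obs, h):
        x = torch.tanh(self.encoder(obs))
        h = self.rnn(x, h)                               # hidden state accumulates platform-specific context
        return torch.tanh(self.head(h)), h


def distill_step(student, teachers, envs, optimizer, horizon=100):
    """One simplified distillation step: the student rolls out with its own recurrent
    state on each sampled quadrotor and regresses onto the corresponding teacher's actions."""
    loss = torch.zeros(())
    for teacher, env in zip(teachers, envs):             # one RL-trained teacher per sampled quadrotor
        obs = env.reset()                                # assumed: returns a (1, obs_dim) tensor
        h = torch.zeros(1, student.rnn.hidden_size)
        for _ in range(horizon):
            action, h = student(obs, h)
            with torch.no_grad():
                target = teacher(obs)                    # teacher action used as supervision
            loss = loss + ((action - target) ** 2).mean()
            obs = env.step(action.detach())              # assumed: returns the next observation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At deployment time, the same rollout pattern applies without any gradient updates: the hidden state `h` is simply carried across control steps, which is how the zero-shot, in-context adaptation described above would be realized in a sketch like this.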