RAPTOR: A Foundation Policy for Quadrotor Control

Abstract

Humans are remarkably data-efficient when adapting to new, unseen conditions, such as driving an unfamiliar car. In contrast, modern robotic control systems, such as neural-network policies trained with Reinforcement Learning (RL), are highly specialized for a single environment. Because of this overfitting, they are known to break down under even small differences such as the Simulation-to-Reality (Sim2Real) gap, and they require system identification and retraining after even minimal changes to the system. In this work, we present RAPTOR, a method for training a highly adaptive foundation policy for quadrotor control. Our method enables training a single, end-to-end neural-network policy to control a wide variety of quadrotors. We test 10 different real quadrotors ranging from 32 g to 2.4 kg that also differ in motor type (brushed vs. brushless), frame type (soft vs. rigid), propeller type (2/3/4-blade), and flight controller (PX4/Betaflight/Crazyflie/M5StampFly). We find that a tiny, three-layer policy with only 2084 parameters is sufficient for zero-shot adaptation to a wide variety of platforms. Adaptation through In-Context Learning is made possible by a recurrence in the hidden layer. The policy is trained with a novel Meta-Imitation Learning algorithm, in which we sample 1000 quadrotors and train a teacher policy for each of them using RL. Subsequently, the 1000 teachers are distilled into a single, adaptive student policy. We find that the resulting foundation policy adapts zero-shot to unseen quadrotors within milliseconds. We extensively test the capabilities of the foundation policy under numerous conditions (trajectory tracking, indoor/outdoor flight, wind disturbance, poking, different propellers).
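To make the idea of a tiny three-layer policy with a recurrent hidden layer more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. The abstract only states that the hidden layer is recurrent, so the GRU-style cell, the class name `TinyRecurrentPolicy`, the layer sizes (22 observations, 16 hidden units, 4 motor commands), the activations, and the random weights are all assumptions chosen for illustration. The point is only that the hidden state `h` carries information across time steps, which is the mechanism that lets a fixed set of weights adapt its behavior in context.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyRecurrentPolicy:
    """Dense -> GRU-style cell -> dense. All sizes are hypothetical."""

    def __init__(self, obs_dim=22, hidden_dim=16, act_dim=4, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            # Random weights stand in for trained ones (assumption).
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_dim = hidden_dim
        self.W_in, self.b_in = mat(hidden_dim, obs_dim), [0.0] * hidden_dim
        # One weight set per gate: update (z), reset (r), candidate (n).
        self.W_x = [mat(hidden_dim, hidden_dim) for _ in range(3)]
        self.W_h = [mat(hidden_dim, hidden_dim) for _ in range(3)]
        self.b_g = [[0.0] * hidden_dim for _ in range(3)]
        self.W_out, self.b_out = mat(act_dim, hidden_dim), [0.0] * act_dim

    def num_params(self):
        def count(p):
            return sum(count(q) for q in p) if isinstance(p, list) else 1

        tensors = (self.W_in, self.b_in, self.W_x, self.W_h,
                   self.b_g, self.W_out, self.b_out)
        return sum(count(p) for p in tensors)

    def initial_state(self):
        return [0.0] * self.hidden_dim

    def step(self, obs, h):
        """Map one observation and the previous hidden state to an action."""
        x = [math.tanh(a + b) for a, b in zip(matvec(self.W_in, obs), self.b_in)]
        xs = [matvec(W, x) for W in self.W_x]
        hs = [matvec(W, h) for W in self.W_h]
        z = [sigmoid(a + b + c) for a, b, c in zip(xs[0], hs[0], self.b_g[0])]
        r = [sigmoid(a + b + c) for a, b, c in zip(xs[1], hs[1], self.b_g[1])]
        n = [math.tanh(a + ri * b + c)
             for a, ri, b, c in zip(xs[2], r, hs[2], self.b_g[2])]
        # The new state blends the candidate with the old state.
        h_new = [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]
        action = [math.tanh(a + b)
                  for a, b in zip(matvec(self.W_out, h_new), self.b_out)]
        return action, h_new


# Roll the policy forward on placeholder observations (not real flight data).
policy = TinyRecurrentPolicy()
h = policy.initial_state()
obs_rng = random.Random(1)
for _ in range(5):
    obs = [obs_rng.uniform(-1.0, 1.0) for _ in range(22)]
    action, h = policy.step(obs, h)
print(policy.num_params(), len(action))
```

With these hypothetical sizes the sketch has 2020 parameters, the same order of magnitude as the 2084 reported in the abstract. The exact layer widths and cell type behind that figure are not given in the text above, so the match in scale should not be read as a reconstruction of the actual architecture.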
