AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents

Abstract

Foundation models that incorporate language, vision, and more recently actions have revolutionized the ability to harness internet-scale data to reason about useful tasks. However, one of the key challenges of training embodied foundation models is the lack of data grounded in the physical world. In this paper, we propose AutoRT, a system that leverages existing foundation models to scale up the deployment of operational robots in completely unseen scenarios with minimal human supervision. AutoRT uses vision-language models (VLMs) for scene understanding and grounding, and large language models (LLMs) to propose diverse and novel instructions for a fleet of robots to perform. Guiding data collection with the knowledge of foundation models enables AutoRT to reason effectively about autonomy tradeoffs and safety while significantly scaling up data collection for robot learning. We demonstrate AutoRT proposing instructions to over 20 robots across multiple buildings and collecting 77k real robot episodes via both teleoperation and autonomous robot policies. We experimentally show that such "in-the-wild" data collected by AutoRT is significantly more diverse, and that AutoRT's use of LLMs allows for instruction-following data collection robots that can align to human preferences.
