Playful Agentic Robot Learning

Abstract

Current agentic robot systems can write executable Code-as-Policy programs, observe feedback, and revise their behavior across multiple attempts, but they remain largely task-driven: reusable skills are acquired only after explicit instructions. We study Playful Agentic Robot Learning, in which an embodied coding agent uses self-directed play as a continual skill-learning stage before downstream tasks arrive. We introduce RATs (Robotics Agent Teams), teams of robot agents designed for play-time skill acquisition. During play, RATs propose novel yet learnable exploratory tasks, plan and execute robot-code policies, verify intermediate progress, diagnose failures, retry with dense, step-level feedback, and distill successful executions into a persistent code skill library. At test time, the agent reuses relevant skills from this frozen library to help solve new tasks. Experiments on LIBERO-PRO and MolmoSpaces show that play-learned skills improve performance on held-out downstream tasks over no-play and random-play baselines, with gains of 20.6 and 17.0 percentage points over CaP-Agent0 on LIBERO-PRO and MolmoSpaces, respectively. Moreover, the learned skills can be plugged into other inference-time Code-as-Policy agents simply by retrieving them into the agent's context, which improves RoboSuite and real-world transfer by 8.9 and 8.8 percentage points, respectively, without finetuning the underlying model.
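The play-then-reuse loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. Every name here (`Skill`, `SkillLibrary`, `play`, `build_prompt`, and the `toy_*` functions) is hypothetical. Task proposal, policy writing, and execution are toy stand-ins for what would be model and simulator calls. Verification and failure diagnosis are collapsed into the feedback string returned by the executor. Word-overlap retrieval is an assumed placeholder for whatever retrieval the real system uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class Skill:
    name: str
    description: str
    code: str


@dataclass
class SkillLibrary:
    """Persistent store of code skills; frozen before test time (hypothetical)."""
    skills: Dict[str, Skill] = field(default_factory=dict)
    frozen: bool = False

    def add(self, skill: Skill) -> None:
        if self.frozen:
            raise RuntimeError("library is frozen at test time")
        self.skills[skill.name] = skill

    def freeze(self) -> None:
        self.frozen = True

    def retrieve(self, task: str, k: int = 1) -> List[Skill]:
        # Toy relevance score: number of words shared with the description.
        words = set(task.lower().split())
        scored = [(len(words & set(s.description.lower().split())), s)
                  for s in self.skills.values()]
        scored = [(n, s) for n, s in scored if n > 0]
        scored.sort(key=lambda p: (-p[0], p[1].name))
        return [s for _, s in scored[:k]]


def play(propose: Callable[[SkillLibrary, Set[str]], Optional[str]],
         write_policy: Callable[[str, Optional[str]], str],
         execute: Callable[[str, str], Tuple[bool, str]],
         library: SkillLibrary, rounds: int, max_retries: int = 3) -> SkillLibrary:
    """Propose a task, attempt it with retries, distill successes into skills."""
    attempted: Set[str] = set()
    for _ in range(rounds):
        task = propose(library, attempted)
        if task is None:
            break
        attempted.add(task)
        feedback: Optional[str] = None
        for _ in range(max_retries):
            code = write_policy(task, feedback)    # revise using prior feedback
            ok, feedback = execute(task, code)     # step-level feedback on failure
            if ok:
                library.add(Skill(task.replace(" ", "_"), task, code))
                break
    return library


def build_prompt(task: str, library: SkillLibrary, k: int = 1) -> str:
    """Retrieve relevant skills from the library into the agent's context."""
    parts = [s.code for s in library.retrieve(task, k)]
    return "\n".join(parts + [f"# new task: {task}"])


# Toy stand-ins (assumptions for illustration only).
TASKS = ["open the drawer", "pick up the mug", "stack ten blocks"]


def toy_propose(library: SkillLibrary, attempted: Set[str]) -> Optional[str]:
    return next((t for t in TASKS if t not in attempted), None)


def toy_write_policy(task: str, feedback: Optional[str]) -> str:
    fix = f"# fix: {feedback}\n" if feedback else ""
    return f"# policy for: {task}\n{fix}"


def toy_execute(task: str, code: str) -> Tuple[bool, str]:
    if task == "stack ten blocks":
        return False, "tower fell at step 3"       # too hard: never succeeds
    if "# fix:" not in code:
        return False, "gripper missed at step 1"   # first attempt always fails
    return True, "ok"


library = play(toy_propose, toy_write_policy, toy_execute, SkillLibrary(), rounds=3)
library.freeze()
print(build_prompt("open the top drawer", library))
```

In this toy run, only the tasks that succeed after a feedback-driven retry are distilled into the library, and freezing the library before the downstream task mirrors the abstract's point that no new skills are added at test time.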