Language to Rewards for Robotic Skill Synthesis

Abstract

Large language models (LLMs) have shown exciting progress in acquiring diverse new capabilities through in-context learning, ranging from logical reasoning to code writing. Robotics researchers have also explored using LLMs to advance robotic control. However, because low-level robot actions are hardware-dependent and underrepresented in LLM training corpora, existing efforts to apply LLMs to robotics have largely treated LLMs as semantic planners or relied on human-engineered control primitives to interface with the robot. Reward functions, by contrast, have been shown to be flexible representations that can be optimized into control policies for diverse tasks, and their semantic richness makes them well suited to being specified by LLMs. In this work, we introduce a new paradigm that harnesses this insight by using LLMs to define reward parameters that can be optimized to accomplish a variety of robotic tasks. Using rewards generated by LLMs as the intermediate interface, we can effectively bridge the gap between high-level language instructions or corrections and low-level robot actions. Combining this interface with a real-time optimizer, MuJoCo MPC, enables interactive behavior creation in which users can immediately observe the results and give feedback to the system. To systematically evaluate the proposed method, we designed a total of 17 tasks for a simulated quadruped robot and a dexterous manipulator robot. Our method reliably solves 90% of the designed tasks, while a baseline that uses primitive skills as the interface with Code-as-policies solves 50%. We further validate our method on a real robot arm, where complex manipulation skills such as non-prehensile pushing emerge through our interactive system.
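The idea of rewards as the interface between language and control can be illustrated with a toy sketch, in which a language model emits a few reward parameters and an optimizer turns them into actions. Everything below is an assumption made for illustration: the JSON fields (`target_x`, `position_weight`, `control_weight`), the one-dimensional kinematic model, and the grid-search receding-horizon planner are hypothetical stand-ins. This is not the paper's reward template, and it does not use MuJoCo MPC's API. It only shows how a handful of numbers can fully specify a behavior once an optimizer is in the loop.

```python
import json

# Hypothetical reward parameters, in the form an LLM might emit for an
# instruction like "move to x = 1.0". Field names are illustrative
# assumptions, not the paper's actual reward template.
LLM_OUTPUT = '{"target_x": 1.0, "position_weight": 1.0, "control_weight": 0.01}'


def make_cost(params):
    """Turn LLM-chosen parameters into a per-step cost (a negated reward)."""
    target = params["target_x"]
    w_pos = params["position_weight"]
    w_u = params["control_weight"]

    def cost(x, u):
        # Weighted squared distance to the target plus a control-effort penalty.
        return w_pos * (x - target) ** 2 + w_u * u ** 2

    return cost


def rollout_cost(x, u, cost, horizon, dt):
    """Total cost of holding action u constant for `horizon` steps from state x."""
    total = 0.0
    for _ in range(horizon):
        x = x + u * dt  # toy kinematic model: the action is a velocity command
        total += cost(x, u)
    return total


def mpc_step(x, cost, horizon=10, dt=0.1, n_candidates=21):
    """One receding-horizon step: pick the cheapest constant action on a grid in [-1, 1]."""
    candidates = [-1.0 + 2.0 * i / (n_candidates - 1) for i in range(n_candidates)]
    return min(candidates, key=lambda u: rollout_cost(x, u, cost, horizon, dt))


def run(params, x0=0.0, steps=50, dt=0.1):
    """Closed loop: re-plan every step and apply only the first chosen action."""
    cost = make_cost(params)
    x = x0
    for _ in range(steps):
        u = mpc_step(x, cost, dt=dt)
        x = x + u * dt
    return x


if __name__ == "__main__":
    params = json.loads(LLM_OUTPUT)
    print(run(params))
```

In this sketch, a language correction such as "stop halfway" only requires the LLM to re-emit `target_x` as 0.5, and the planner itself is untouched. The coarse 21-point action grid leaves a small steady-state offset from the target, which a finer grid or a stronger sampling-based planner would reduce.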
