Eureka: Human-Level Reward Design via Coding Large Language Models

Abstract

Large Language Models (LLMs) have excelled as high-level semantic planners for sequential decision-making tasks. However, harnessing them to learn complex low-level manipulation tasks, such as dexterous pen spinning, remains an open problem. We bridge this fundamental gap and present Eureka, a human-level reward design algorithm powered by LLMs. Eureka exploits the remarkable zero-shot generation, code-writing, and in-context improvement capabilities of state-of-the-art LLMs, such as GPT-4, to perform evolutionary optimization over reward code. The resulting rewards can then be used to acquire complex skills via reinforcement learning. Without any task-specific prompting or pre-defined reward templates, Eureka generates reward functions that outperform expert human-engineered rewards. In a diverse suite of 29 open-source RL environments that include 10 distinct robot morphologies, Eureka outperforms human experts on 83% of the tasks, leading to an average normalized improvement of 52%. The generality of Eureka also enables a new gradient-free in-context learning approach to reinforcement learning from human feedback (RLHF), readily incorporating human inputs to improve the quality and safety of the generated rewards without updating the model. Finally, using Eureka rewards in a curriculum learning setting, we demonstrate, for the first time, a simulated Shadow Hand capable of performing pen spinning tricks, adeptly manipulating a pen in circles at rapid speed.
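
The abstract describes evolutionary optimization over reward code but does not spell out the loop. The following is a minimal, generic sketch of such a search, not Eureka's actual implementation. Everything in it is an assumption made for illustration: a candidate is a plain list of reward-term weights rather than reward code, the hypothetical `propose` function stands in for sampling candidates from an LLM, and the hypothetical `toy_fitness` stands in for training a policy with reinforcement learning and scoring it on the task.

```python
import random
from typing import Callable, List, Tuple

# Assumption: a "candidate" is a list of weights on reward terms.
# In the paper's setting a candidate would be reward code instead.
Candidate = List[float]


def propose(parent: Candidate, rng: random.Random, k: int) -> List[Candidate]:
    """Hypothetical stand-in for LLM sampling: perturb the parent k times."""
    return [[w + rng.gauss(0.0, 0.3) for w in parent] for _ in range(k)]


def evolve(
    fitness: Callable[[Candidate], float],
    init: Candidate,
    iterations: int = 5,
    samples_per_iter: int = 8,
    seed: int = 0,
) -> Tuple[Candidate, float]:
    """Greedy evolutionary search: sample, evaluate, keep the best so far."""
    rng = random.Random(seed)
    best, best_score = init, fitness(init)
    for _ in range(iterations):
        # Each round proposes new candidates conditioned on the current best.
        for cand in propose(best, rng, samples_per_iter):
            score = fitness(cand)
            if score > best_score:
                best, best_score = cand, score
    return best, best_score


def toy_fitness(weights: Candidate) -> float:
    """Hypothetical stand-in for RL training plus task evaluation."""
    target = [1.0, -0.5]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


if __name__ == "__main__":
    best, score = evolve(toy_fitness, [0.0, 0.0])
    print(best, score)
```

Because the best candidate is replaced only when a sample scores strictly higher, the returned score can never be worse than that of the initial candidate. In a real setting each fitness call would be an expensive training run, so the number of samples per iteration is the main cost knob.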
