Eureka: Human-Level Reward Design via Coding Large Language Models

Abstract

Large Language Models (LLMs) have excelled as high-level semantic planners for sequential decision-making tasks. However, harnessing them to learn complex low-level manipulation tasks, such as dexterous pen spinning, remains an open problem. We bridge this fundamental gap and present Eureka, a human-level reward design algorithm powered by LLMs. Eureka exploits the remarkable zero-shot generation, code-writing, and in-context improvement capabilities of state-of-the-art LLMs, such as GPT-4, to perform evolutionary optimization over reward code. The resulting rewards can then be used to acquire complex skills via reinforcement learning. Without any task-specific prompting or pre-defined reward templates, Eureka generates reward functions that outperform expert human-engineered rewards. In a diverse suite of 29 open-source RL environments that include 10 distinct robot morphologies, Eureka outperforms human experts on 83% of the tasks, leading to an average normalized improvement of 52%. The generality of Eureka also enables a new gradient-free in-context learning approach to reinforcement learning from human feedback (RLHF), readily incorporating human inputs to improve the quality and the safety of the generated rewards without model updating. Finally, using Eureka rewards in a curriculum learning setting, we demonstrate, for the first time, a simulated Shadow Hand capable of performing pen spinning tricks, adeptly manipulating a pen in circles at rapid speed.
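
The evolutionary optimization over reward code mentioned in the abstract can be pictured with a small toy loop. The sketch below is illustrative only and is not the authors' implementation: `evolve_reward`, `make_toy_sampler`, and `toy_evaluate` are hypothetical names, the sampler is a random stand-in for an LLM call, the evaluator is a stand-in for a full reinforcement learning run, and the choice to keep only the single best candidate per round is an assumption made for brevity.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# a sampler proposes candidate reward "code" given the previous best and its score,
# an evaluator returns a scalar fitness for one candidate.
Sampler = Callable[[Optional[str], Optional[float], int], List[str]]
Evaluator = Callable[[str], float]


def evolve_reward(sample: Sampler, evaluate: Evaluator,
                  iterations: int = 5, batch_size: int = 4) -> Tuple[str, float]:
    """Sample a batch, score each candidate, and feed the best one back as context."""
    best_code: Optional[str] = None
    best_score: Optional[float] = None
    for _ in range(iterations):
        # In a real system this would be an LLM call conditioned on the feedback.
        candidates = sample(best_code, best_score, batch_size)
        for code in candidates:
            # In a real system this would train a policy with the candidate reward.
            score = evaluate(code)
            if best_score is None or score > best_score:
                best_code, best_score = code, score
    if best_code is None or best_score is None:
        raise ValueError("the sampler produced no candidates")
    return best_code, best_score


def make_toy_sampler(seed: int = 0) -> Sampler:
    """Toy stand-in for an LLM: candidates are strings of the form 'weight=<float>'."""
    rng = random.Random(seed)

    def sample(prev_code: Optional[str], prev_score: Optional[float], n: int) -> List[str]:
        if prev_code is None:
            # First round: no feedback yet, so propose candidates from scratch.
            return [f"weight={rng.uniform(0.0, 2.0):.3f}" for _ in range(n)]
        # Later rounds: perturb the previous best candidate.
        centre = float(prev_code.split("=")[1])
        return [f"weight={centre + rng.gauss(0.0, 0.2):.3f}" for _ in range(n)]

    return sample


def toy_evaluate(code: str) -> float:
    """Toy stand-in for RL training: fitness peaks at weight == 0.8."""
    weight = float(code.split("=")[1])
    return -(weight - 0.8) ** 2


if __name__ == "__main__":
    code, score = evolve_reward(make_toy_sampler(seed=0), toy_evaluate)
    print(code, score)
```

Because the best candidate is carried over between rounds, the best score can only stay the same or improve as `iterations` grows, which is the property the toy loop is meant to show.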
