Eureka: Human-Level Reward Design via Coding Large Language Models
October 19, 2023
Authors: Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, Anima Anandkumar
cs.AI
Abstract
Large Language Models (LLMs) have excelled as high-level semantic planners
for sequential decision-making tasks. However, harnessing them to learn complex
low-level manipulation tasks, such as dexterous pen spinning, remains an open
problem. We bridge this fundamental gap and present Eureka, a human-level
reward design algorithm powered by LLMs. Eureka exploits the remarkable
zero-shot generation, code-writing, and in-context improvement capabilities of
state-of-the-art LLMs, such as GPT-4, to perform evolutionary optimization over
reward code. The resulting rewards can then be used to acquire complex skills
via reinforcement learning. Without any task-specific prompting or pre-defined
reward templates, Eureka generates reward functions that outperform expert
human-engineered rewards. In a diverse suite of 29 open-source RL environments
that include 10 distinct robot morphologies, Eureka outperforms human experts
on 83% of the tasks, leading to an average normalized improvement of 52%. The
generality of Eureka also enables a new gradient-free in-context learning
approach to reinforcement learning from human feedback (RLHF), readily
incorporating human inputs to improve the quality and the safety of the
generated rewards without model updating. Finally, using Eureka rewards in a
curriculum learning setting, we demonstrate, for the first time, a simulated
Shadow Hand capable of performing pen spinning tricks, adeptly manipulating a
pen in circles at rapid speed.
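
The abstract describes Eureka as an evolutionary search over executable reward code: an LLM samples candidate reward functions from the raw environment source and task description, each candidate is scored by training an RL policy with it, and a textual summary of the training outcome is fed back to the LLM for the next round of generation. The following is a minimal sketch of that loop under stated assumptions; the helper callables (query_llm, train_policy, reflect) are hypothetical placeholders supplied by the caller, not the authors' released code or API.

# Minimal sketch (not the authors' implementation) of the evolutionary
# reward-search loop described in the abstract. All helpers are hypothetical.
from typing import Callable, Dict, Tuple

def eureka_search(
    env_source: str,                                  # raw environment source shown to the LLM
    task_desc: str,                                   # natural-language task description
    query_llm: Callable[[str], str],                  # prompt -> candidate reward code
    train_policy: Callable[[str], Dict[str, float]],  # reward code -> RL training stats
    reflect: Callable[[str, Dict[str, float]], str],  # reward code + stats -> textual feedback
    iterations: int = 5,
    samples_per_iter: int = 16,
) -> Tuple[str, float]:
    """Alternate LLM reward generation, RL evaluation, and in-context reflection."""
    best_code, best_score = "", float("-inf")
    feedback = ""  # reward reflection carried between iterations

    for _ in range(iterations):
        # 1. Zero-shot / in-context generation: sample several executable reward
        #    functions, conditioned on the environment source, the task, and
        #    feedback from earlier rounds (no task-specific reward templates).
        prompt = f"{env_source}\n\nTask: {task_desc}\n\n{feedback}"
        candidates = [query_llm(prompt) for _ in range(samples_per_iter)]

        # 2. Evaluate each candidate by training an RL policy with it and
        #    reading off a scalar task score (e.g., success rate).
        scored = []
        for code in candidates:
            stats = train_policy(code)
            scored.append((stats.get("task_score", float("-inf")), code, stats))

        # 3. Keep the best candidate and turn its training statistics into a
        #    textual "reward reflection" that steers the next generation round
        #    (the evolutionary / in-context improvement step).
        score, code, stats = max(scored, key=lambda item: item[0])
        if score > best_score:
            best_score, best_code = score, code
        feedback = reflect(code, stats)

    return best_code, best_score

On this reading of the abstract, the gradient-free RLHF variant would amount to appending human textual critiques of the current reward to the same feedback string, so human input shapes the next generation round without any model updates; this is an interpretation of the abstract, not a description of the released system.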