

Discovering Reinforcement Learning Interfaces with Large Language Models

May 5, 2026
Authors: Akshat Singh Jaswal, Ashish Baghel, Paras Chopra
cs.AI

Abstract

Reinforcement learning systems rely on environment interfaces that specify observations and reward functions, yet constructing these interfaces for new tasks often requires substantial manual effort. While recent work has automated reward design using large language models (LLMs), these approaches assume fixed observations and do not address the broader challenge of synthesizing complete task interfaces. We study RL task interface discovery from raw simulator state, where both observation mappings and reward functions must be generated. We propose LIMEN (code available at https://github.com/Lossfunk/LIMEN), an LLM-guided evolutionary framework that produces candidate interfaces as executable programs and iteratively refines them using policy training feedback. Across novel discrete gridworld tasks and continuous control domains spanning locomotion and manipulation, joint evolution of observations and rewards discovers effective interfaces given only a trajectory-level success metric, while optimizing either component alone fails on at least one domain. These results demonstrate that automatic construction of RL interfaces from raw state can substantially reduce manual engineering, and that observation and reward components often benefit from co-design, as single-component optimization fails catastrophically on at least one domain in our evaluation suite.
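To make the described mechanism concrete, the following is a minimal Python sketch of the kind of LLM-guided evolutionary loop the abstract outlines: candidate interfaces are executable programs pairing an observation mapping with a reward function, each is scored by training a policy and measuring trajectory-level success, and the outcomes are fed back to the LLM as refinement signal. Every name here (Interface, propose_interfaces, train_and_evaluate, the population and survivor parameters) is a hypothetical illustration, not LIMEN's actual API; the LLM call and the RL trainer are left as stubs.

from dataclasses import dataclass
from typing import Callable, List, Tuple

State = dict          # raw simulator state (assumed representation)
Observation = list    # feature vector produced by the observation mapping

@dataclass
class Interface:
    """A candidate task interface: an observation mapping plus a reward
    function, both expressed as executable programs (here, Python callables)."""
    observe: Callable[[State], Observation]
    reward: Callable[[State, State], float]  # reward(prev_state, next_state)
    source: str                              # program text the LLM generated

def propose_interfaces(feedback: List[str], k: int) -> List[Interface]:
    """Placeholder for the LLM step: given textual feedback from earlier
    generations, ask the model for k new interface programs."""
    raise NotImplementedError("call your LLM of choice here")

def train_and_evaluate(iface: Interface) -> float:
    """Placeholder: train a policy against iface.observe / iface.reward, then
    score it with the trajectory-level success metric, which is the only
    supervision the abstract assumes."""
    raise NotImplementedError("plug in your RL trainer and success metric here")

def evolve(generations: int = 5, population: int = 8,
           survivors: int = 2) -> Interface:
    """Evolutionary outer loop: propose, train, score, select, repeat."""
    feedback: List[str] = []
    elite: List[Tuple[float, Interface]] = []  # best (score, interface) pairs
    for gen in range(generations):
        candidates = propose_interfaces(feedback, k=population - len(elite))
        scored = elite + [(train_and_evaluate(c), c) for c in candidates]
        scored.sort(key=lambda sc: sc[0], reverse=True)
        elite = scored[:survivors]  # carry the best interfaces forward
        # Feed trajectory-level outcomes back to the LLM as refinement signal.
        feedback = [f"gen {gen}: success={s:.2f}\n{c.source}" for s, c in scored]
    return elite[0][1]

Under this framing, ablating either component reduces to freezing one field of Interface across generations while the other continues to evolve, which is one plausible way to read the single-component baselines the abstract reports.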