Discovering Reinforcement Learning Interfaces with Large Language Models

Abstract

Reinforcement learning (RL) systems rely on environment interfaces that specify observations and reward functions, yet constructing these interfaces for new tasks often requires substantial manual effort. While recent work has automated reward design using large language models (LLMs), these approaches assume fixed observations and do not address the broader challenge of synthesizing complete task interfaces. We study RL task interface discovery from raw simulator state, where both observation mappings and reward functions must be generated. We propose LIMEN (code available at https://github.com/Lossfunk/LIMEN), an LLM-guided evolutionary framework that produces candidate interfaces as executable programs and iteratively refines them using policy training feedback. Across novel discrete gridworld tasks and continuous control domains spanning locomotion and manipulation, joint evolution of observations and rewards discovers effective interfaces given only a trajectory-level success metric, while optimizing either component alone fails catastrophically on at least one domain in our evaluation suite. These results demonstrate that automatic construction of RL interfaces from raw state can substantially reduce manual engineering and that observation and reward components often benefit from co-design.
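
To make the idea of an interface expressed as an executable program concrete, the following is a minimal sketch and not LIMEN's actual code. Everything in it is an assumption made for illustration: the `Interface` dataclass, the 1-D toy environment, the `evolve` loop, and all function names. The LLM is replaced by a fixed list of handwritten proposals, and policy training is replaced by choosing between two values of a single policy parameter using the candidate's own reward. Candidates are then scored only by a trajectory-level success signal, which never looks at that reward.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

RawState = Dict[str, float]  # hypothetical raw simulator state


@dataclass(frozen=True)
class Interface:
    """Candidate task interface: observation mapping plus reward function."""
    name: str
    observe: Callable[[RawState], List[float]]
    reward: Callable[[RawState], float]


def rollout(iface: Interface, w: int, goal: int, horizon: int = 5) -> Tuple[float, bool]:
    """Run a one-parameter policy on a 1-D line; return (reward sum, success)."""
    state: RawState = {"agent_x": 0.0, "goal_x": float(goal)}
    total = 0.0
    for _ in range(horizon):
        # The policy only sees the interface's observation, never the raw state.
        action = 1.0 if w * iface.observe(state)[0] > 0 else -1.0
        state = {**state, "agent_x": state["agent_x"] + action}
        total += iface.reward(state)
        if state["agent_x"] == state["goal_x"]:
            return total, True  # trajectory-level success signal
    return total, False


def evaluate(iface: Interface, goals: Sequence[int] = (-3, 3)) -> float:
    """Toy stand-in for policy training: pick w by the interface's reward,
    then report the success rate, which ignores that reward."""
    def total_reward(w: int) -> float:
        return sum(rollout(iface, w, g)[0] for g in goals)

    best_w = max((-1, 1), key=total_reward)
    return sum(rollout(iface, best_w, g)[1] for g in goals) / len(goals)


def evolve(
    seeds: Sequence[Interface],
    propose: Callable[[Interface, float], Interface],
    score: Callable[[Interface], float],
    generations: int = 2,
    keep: int = 1,
) -> Interface:
    """Elitist loop: score candidates, keep the best, request new variants."""
    population = list(seeds)
    for _ in range(generations):
        elites = sorted(population, key=score, reverse=True)[:keep]
        population = elites + [propose(p, score(p)) for p in elites]
    return max(population, key=score)


def relative(s: RawState) -> List[float]:
    return [s["goal_x"] - s["agent_x"]]  # exposes where the goal is


def absolute(s: RawState) -> List[float]:
    return [s["agent_x"]]  # hides the goal from the policy


def neg_distance(s: RawState) -> float:
    return -abs(s["goal_x"] - s["agent_x"])


def distance(s: RawState) -> float:
    return abs(s["goal_x"] - s["agent_x"])  # sign error: rewards moving away


seed = Interface("obs_bad", absolute, neg_distance)
proposals = iter([
    Interface("reward_bad", relative, distance),
    Interface("joint", relative, neg_distance),
])


def propose(parent: Interface, parent_score: float) -> Interface:
    """Stand-in for an LLM call; a real one would see parent code and score."""
    return next(proposals, parent)


best = evolve([seed], propose, evaluate)
print(best.name, evaluate(best))
```

In this toy setting, the candidate with an uninformative observation and the candidate with a mis-signed reward both score zero, and only the candidate that gets both components right reaches the goal. That mirrors, in miniature, the co-design point made in the abstract; it says nothing about how LIMEN performs on the actual benchmarks.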