Natural Language Reinforcement Learning

Abstract

Reinforcement Learning (RL) mathematically formulates decision-making as a Markov Decision Process (MDP). Building on MDPs, researchers have achieved remarkable breakthroughs across various domains, including games, robotics, and language models. This paper explores a new possibility, Natural Language Reinforcement Learning (NLRL), by extending the traditional MDP to a natural language-based representation space. Specifically, NLRL redefines RL principles, including task objectives, policy, value function, Bellman equation, and policy iteration, as their language counterparts. With recent advancements in large language models (LLMs), NLRL can be practically implemented to achieve RL-like policy and value improvement through either pure prompting or gradient-based training. Experiments on the Maze, Breakthrough, and Tic-Tac-Toe games demonstrate the effectiveness, efficiency, and interpretability of the NLRL framework across diverse use cases. Our code will be released at https://github.com/waterhorse1/Natural-language-RL.