

RLTF: Reinforcement Learning from Unit Test Feedback

July 10, 2023
Authors: Jiate Liu, Yiqin Zhu, Kaiwen Xiao, Qiang Fu, Xiao Han, Wei Yang, Deheng Ye
cs.AI

Abstract

The goal of program synthesis, or code generation, is to generate executable code from a given description. Recently, a growing number of studies have employed reinforcement learning (RL) to improve the performance of large language models (LLMs) for code. However, these RL methods have used only offline frameworks, limiting their exploration of new sample spaces. Additionally, current approaches that utilize unit test signals are rather simple and do not account for specific error locations within the code. To address these issues, we propose RLTF, i.e., Reinforcement Learning from Unit Test Feedback, a novel online RL framework with multi-granularity unit test feedback for refining code LLMs. Our approach generates data in real time during training and simultaneously utilizes fine-grained feedback signals to guide the model toward producing higher-quality code. Extensive experiments show that RLTF achieves state-of-the-art performance on the APPS and MBPP benchmarks. Our code can be found at: https://github.com/Zyq-scut/RLTF.
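The paper itself specifies the exact reward design; as a rough illustration of what "multi-granularity" unit test feedback can mean in practice, here is a minimal Python sketch. The function names (`run_unit_test`, `multi_granularity_reward`) and the reward values are hypothetical, not taken from the paper; the sketch assumes candidate programs read from stdin and write to stdout, as in APPS-style problems.

```python
import os
import re
import subprocess
import tempfile

def run_unit_test(code: str, test_input: str, expected_output: str,
                  timeout: float = 5.0):
    """Run a candidate program on one unit test and classify the outcome."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            ["python", path], input=test_input,
            capture_output=True, text=True, timeout=timeout,
        )
        if result.returncode != 0:
            return "error", result.stderr      # runtime error; traceback names a line
        if result.stdout.strip() == expected_output.strip():
            return "pass", ""
        return "failure", ""                   # ran cleanly but produced wrong output
    except subprocess.TimeoutExpired:
        return "timeout", ""
    finally:
        os.remove(path)

def multi_granularity_reward(outcome: str, stderr: str):
    """Map a test outcome to a coarse reward plus an optional error line.

    Coarse signal: one scalar reward for the whole program.
    Fine-grained signal: for runtime errors, the Python traceback points
    at a specific line, so a penalty can be focused on that region rather
    than spread over the entire sample. Reward values are illustrative.
    """
    coarse = {"pass": 1.0, "failure": -0.3, "error": -0.6, "timeout": -1.0}[outcome]
    error_line = None
    if outcome == "error":
        match = re.search(r"line (\d+)", stderr)
        if match:
            error_line = int(match.group(1))
    return coarse, error_line

# Example: a buggy sample that divides by zero on line 2.
outcome, stderr = run_unit_test("x = int(input())\nprint(x / 0)", "3", "6")
reward, error_line = multi_granularity_reward(outcome, stderr)
print(outcome, reward, error_line)  # error -0.6 2
```

In an online RL loop, rewards of this kind would be attached to freshly generated samples at each training step, which is what distinguishes RLTF's online framework from the offline approaches the abstract contrasts it with.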