RLTF: Reinforcement Learning from Unit Test Feedback
July 10, 2023
Authors: Jiate Liu, Yiqin Zhu, Kaiwen Xiao, Qiang Fu, Xiao Han, Wei Yang, Deheng Ye
cs.AI
Abstract
The goal of program synthesis, or code generation, is to generate executable
code based on given descriptions. Recently, there has been an increasing number
of studies employing reinforcement learning (RL) to improve the performance of
large language models (LLMs) for code. However, these RL methods have only used
offline frameworks, limiting their exploration of new sample spaces.
Additionally, current approaches that utilize unit test signals are rather
simple, not accounting for specific error locations within the code. To address
these issues, we propose RLTF, i.e., Reinforcement Learning from Unit Test
Feedback, a novel online RL framework with multi-granularity unit test
feedback for refining code LLMs. Our approach generates data in real time
during training and simultaneously utilizes fine-grained feedback
signals to guide the model towards producing higher-quality code. Extensive
experiments show that RLTF achieves state-of-the-art performance on the APPS
and MBPP benchmarks. Our code can be found at:
https://github.com/Zyq-scut/RLTF.
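
To make the feedback mechanism concrete, below is a minimal Python sketch of how unit test results might be turned into a multi-granularity reward: a coarse pass/fail signal plus the line number of the first error, which a fine-grained loss could use to penalize only the tokens near the failure. The function names, reward values, and traceback parsing are illustrative assumptions for this sketch, not the paper's actual implementation (see the linked repository for that).

import re
import subprocess
import sys
import tempfile

def run_unit_test(code: str, test: str, timeout: float = 5.0):
    # Execute the generated program together with a unit test in a
    # subprocess and return (passed, stderr). Assumes the test is a set
    # of plain Python assertions appended to the program (an assumption
    # of this sketch, not necessarily the paper's harness).
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + test)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True,
            timeout=timeout,
        )
        return proc.returncode == 0, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "TimeoutError"

def multi_granularity_reward(code: str, test: str):
    # Coarse signal: +1.0 on success; on failure, a milder penalty for a
    # wrong answer than for a crash. These reward values are illustrative,
    # not the paper's tuned coefficients.
    passed, stderr = run_unit_test(code, test)
    if passed:
        return 1.0, None
    # Fine-grained signal: recover the failing line from the traceback so
    # a per-token loss can concentrate the penalty near the error.
    match = re.search(r"line (\d+)", stderr or "")
    error_line = int(match.group(1)) if match else None
    reward = -0.3 if "AssertionError" in (stderr or "") else -1.0
    return reward, error_line

In the paper's online framework, rewards like these would be computed on samples generated by the model during training and fed back into the RL objective; only the feedback-to-reward mapping is sketched here.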