StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback
February 2, 2024
Authors: Shihan Dou, Yan Liu, Haoxiang Jia, Limao Xiong, Enyu Zhou, Junjie Shan, Caishuang Huang, Wei Shen, Xiaoran Fan, Zhiheng Xi, Yuhao Zhou, Tao Ji, Rui Zheng, Qi Zhang, Xuanjing Huang, Tao Gui
cs.AI
Abstract
The advancement of large language models (LLMs) has significantly propelled
the field of code generation. Previous work integrated reinforcement learning
(RL) with compiler feedback for exploring the output space of LLMs to enhance
code generation quality. However, the lengthy code generated by LLMs in
response to complex human requirements makes RL exploration a challenge. Also,
since the unit tests may not cover the complicated code, optimizing LLMs by
using these unexecuted code snippets is ineffective. To tackle these
challenges, we introduce StepCoder, a novel RL framework for code generation,
consisting of two main components: CCCS addresses the exploration challenge by
breaking the long-sequence code generation task into a Curriculum of Code
Completion Subtasks, while FGO optimizes the model only on executed code,
masking unexecuted code segments to provide Fine-Grained Optimization. In
addition, we construct the APPS+ dataset for RL training, which is manually
verified to ensure the correctness of unit tests. Experimental results show
that our method improves the ability to explore the output space and
outperforms state-of-the-art approaches in corresponding benchmarks.
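To make the CCCS idea concrete, the sketch below splits a reference solution into a curriculum of code-completion subtasks: early stages hand the model most of the canonical solution as a prefix and ask it to complete the remainder, while later stages give progressively less, until the model must generate the whole program. This is a minimal illustration under assumed inputs; `build_curriculum` and all names are hypothetical, not the paper's implementation.

```python
# Hypothetical sketch of a Curriculum of Code Completion Subtasks (CCCS):
# the canonical solution is split at line boundaries; early subtasks keep a
# large prefix of the reference solution and ask for a short completion,
# later subtasks keep less, until the model generates the full program.

def build_curriculum(prompt, solution_lines, num_stages):
    """Return (prefix, target) pairs ordered from easiest to hardest."""
    n = len(solution_lines)
    stages = []
    for stage in range(num_stages):
        # easiest stage keeps the largest prefix of the reference solution
        keep = round(n * (num_stages - 1 - stage) / num_stages)
        prefix = prompt + "\n" + "\n".join(solution_lines[:keep])
        target = "\n".join(solution_lines[keep:])
        stages.append((prefix, target))
    return stages

solution = ["def add(a, b):", "    s = a + b", "    return s"]
for prefix, target in build_curriculum("# add two numbers", solution, 3):
    print(len(target.split("\n")), "line(s) to complete")
```

Each stage's `target` shrinks the scaffolding the policy receives, which is the curriculum effect the abstract describes: exploration starts near a known-good solution and gradually widens to the full generation task.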
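The FGO component can likewise be sketched as a loss mask over tokens whose source lines the unit tests never executed, so that only executed code influences the optimization signal. The function below is an illustrative toy (plain Python, no RL machinery); `fgo_masked_loss` and the coverage inputs are assumptions for the sketch, not StepCoder's actual API.

```python
# Hypothetical sketch of Fine-Grained Optimization (FGO) style masking:
# per-token losses are kept only when the token's source line appears in the
# set of lines actually executed by the unit tests (e.g. from a coverage
# trace), so unexecuted code segments contribute nothing to the update.

def fgo_masked_loss(token_losses, token_line_ids, executed_lines):
    """Average per-token loss over tokens whose source line was executed."""
    kept = [loss for loss, line in zip(token_losses, token_line_ids)
            if line in executed_lines]
    if not kept:                       # nothing executed -> no gradient signal
        return 0.0
    return sum(kept) / len(kept)

# Example: 6 tokens spread over 3 source lines; line 2 never ran.
losses   = [0.5, 0.5, 1.0, 1.0, 2.0, 2.0]
line_ids = [0,   0,   1,   1,   2,   2]
executed = {0, 1}                      # lines covered during test execution
print(fgo_masked_loss(losses, line_ids, executed))  # 0.75
```

In a real training loop the executed-line set would come from tracing the generated program against its unit tests, and the mask would be applied to the policy-gradient objective rather than a plain average.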