StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback
February 2, 2024
Authors: Shihan Dou, Yan Liu, Haoxiang Jia, Limao Xiong, Enyu Zhou, Junjie Shan, Caishuang Huang, Wei Shen, Xiaoran Fan, Zhiheng Xi, Yuhao Zhou, Tao Ji, Rui Zheng, Qi Zhang, Xuanjing Huang, Tao Gui
cs.AI
Abstract
The advancement of large language models (LLMs) has significantly propelled
the field of code generation. Previous work integrated reinforcement learning
(RL) with compiler feedback for exploring the output space of LLMs to enhance
code generation quality. However, the lengthy code generated by LLMs in
response to complex human requirements makes RL exploration a challenge. Also,
since the unit tests may not cover the complicated code, optimizing LLMs by
using these unexecuted code snippets is ineffective. To tackle these
challenges, we introduce StepCoder, a novel RL framework for code generation,
consisting of two main components: CCCS addresses the exploration challenge by
breaking the long-sequence code generation task into a Curriculum of Code
Completion Subtasks, while FGO optimizes the model only on executed code,
masking unexecuted code segments to provide Fine-Grained Optimization. In
addition, we construct the APPS+ dataset for RL training, which is manually
verified to ensure the correctness of unit tests. Experimental results show
that our method improves the ability to explore the output space and
outperforms state-of-the-art approaches in corresponding benchmarks.
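To make the CCCS idea concrete, the sketch below splits a reference solution into a curriculum of code-completion subtasks: early stages hand the model most of the canonical solution as a prefix and ask it to complete the remainder, while later stages give progressively less, until the model must generate the whole program. This is a minimal illustration under assumed inputs; `build_curriculum` and all names are hypothetical, not the paper's implementation.

```python
# Hypothetical sketch of a Curriculum of Code Completion Subtasks (CCCS):
# the canonical solution is split at line boundaries; early subtasks keep a
# large prefix of the reference solution and ask for a short completion,
# later subtasks keep less, until the model generates the full program.

def build_curriculum(prompt, solution_lines, num_stages):
    """Return (prefix, target) pairs ordered from easiest to hardest."""
    n = len(solution_lines)
    stages = []
    for stage in range(num_stages):
        # easiest stage keeps the largest prefix of the reference solution
        keep = round(n * (num_stages - 1 - stage) / num_stages)
        prefix = prompt + "\n" + "\n".join(solution_lines[:keep])
        target = "\n".join(solution_lines[keep:])
        stages.append((prefix, target))
    return stages

solution = ["def add(a, b):", "    s = a + b", "    return s"]
for prefix, target in build_curriculum("# add two numbers", solution, 3):
    print(len(target.split("\n")), "line(s) to complete")
```

Each stage's `target` shrinks the scaffolding the policy receives, which is the curriculum effect the abstract describes: exploration starts near a known-good solution and gradually widens to the full generation task.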
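The FGO component can likewise be sketched as a loss mask over tokens whose source lines the unit tests never executed, so that only executed code influences the optimization signal. The function below is an illustrative toy (plain Python, no RL machinery); `fgo_masked_loss` and the coverage inputs are assumptions for the sketch, not StepCoder's actual API.

```python
# Hypothetical sketch of Fine-Grained Optimization (FGO) style masking:
# per-token losses are kept only when the token's source line appears in the
# set of lines actually executed by the unit tests (e.g. from a coverage
# trace), so unexecuted code segments contribute nothing to the update.

def fgo_masked_loss(token_losses, token_line_ids, executed_lines):
    """Average per-token loss over tokens whose source line was executed."""
    kept = [loss for loss, line in zip(token_losses, token_line_ids)
            if line in executed_lines]
    if not kept:                       # nothing executed -> no gradient signal
        return 0.0
    return sum(kept) / len(kept)

# Example: 6 tokens spread over 3 source lines; line 2 never ran.
losses   = [0.5, 0.5, 1.0, 1.0, 2.0, 2.0]
line_ids = [0,   0,   1,   1,   2,   2]
executed = {0, 1}                      # lines covered during test execution
print(fgo_masked_loss(losses, line_ids, executed))  # 0.75
```

In a real training loop the executed-line set would come from tracing the generated program against its unit tests, and the mask would be applied to the policy-gradient objective rather than a plain average.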