StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback

Abstract

The advancement of large language models (LLMs) has significantly propelled the field of code generation. Previous work integrated reinforcement learning (RL) with compiler feedback to explore the output space of LLMs and enhance code generation quality. However, the lengthy code that LLMs generate in response to complex human requirements makes RL exploration challenging. Moreover, because unit tests may not cover all of a complicated program, optimizing LLMs on the unexecuted code snippets is ineffective. To tackle these challenges, we introduce StepCoder, a novel RL framework for code generation consisting of two main components: CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks, while FGO optimizes the model only on executed code by masking the unexecuted code segments, providing Fine-Grained Optimization. In addition, we construct the APPS+ dataset for RL training, which is manually verified to ensure the correctness of its unit tests. Experimental results show that our method improves the ability to explore the output space and outperforms state-of-the-art approaches on the corresponding benchmarks.
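The two components can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the line-level granularity, and the even spacing of curriculum stages are all assumptions made for illustration, and the set of executed lines is assumed to come from some external coverage measurement. The first function builds completion subtasks by revealing a shrinking prefix of a reference solution; the other two average a per-line loss over executed lines only.

```python
from typing import List, Set


def curriculum_prefixes(solution_lines: List[str], num_stages: int) -> List[List[str]]:
    """Hypothetical CCCS-style curriculum (assumed schedule, not from the paper).

    Each stage reveals a shorter prefix of a reference solution, so the
    model must complete a progressively longer remainder. The final stage
    reveals nothing, which corresponds to full generation.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    n = len(solution_lines)
    prefixes = []
    for stage in range(1, num_stages + 1):
        keep = n - (n * stage) // num_stages  # number of lines given to the model
        prefixes.append(solution_lines[:keep])
    return prefixes


def executed_line_mask(num_lines: int, executed: Set[int]) -> List[float]:
    """Return 1.0 for lines the unit tests executed and 0.0 otherwise.

    `executed` holds zero-based line indices and is assumed to be
    produced by a separate coverage tool.
    """
    return [1.0 if i in executed else 0.0 for i in range(num_lines)]


def masked_mean_loss(per_line_loss: List[float], mask: List[float]) -> float:
    """Hypothetical FGO-style objective: average the loss over executed lines only."""
    total = sum(loss * m for loss, m in zip(per_line_loss, mask))
    count = sum(mask)
    return total / count if count else 0.0


if __name__ == "__main__":
    solution = ["a = int(input())", "if a > 0:", "    print(a)", "else:", "    print(-a)"]
    for prefix in curriculum_prefixes(solution, num_stages=3):
        print(len(prefix), "lines given as prompt")
    # Assume the tests only exercised the positive branch (lines 0 to 2).
    mask = executed_line_mask(len(solution), executed={0, 1, 2})
    print(masked_mean_loss([0.5, 1.0, 1.5, 4.0, 4.0], mask))
```

In this sketch the unexecuted `else` branch contributes nothing to the objective, so a reward computed from unit tests is never attributed to code those tests did not reach. A real system would presumably apply the mask per token inside the RL loss rather than per line.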
