Multi-Turn Code Generation Through Single-Step Rewards
February 27, 2025
Authors: Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury
cs.AI
Abstract
We address the problem of code generation from multi-turn execution feedback.
Existing methods either generate code without feedback or use complex,
hierarchical reinforcement learning to optimize multi-turn rewards. We propose
a simple yet scalable approach, muCode, that solves multi-turn code
generation using only single-step rewards. Our key insight is that code
generation is a one-step recoverable MDP, where the correct code can be
recovered from any intermediate code state in a single turn. muCode
iteratively trains both a generator to provide code solutions conditioned on
multi-turn execution feedback and a verifier to score the newly generated code.
Experimental evaluations show that our approach achieves significant
improvements over state-of-the-art baselines. We analyze the design choices
of the reward models and policy, and show the efficacy of muCode at
utilizing execution feedback. Our code is available at
https://github.com/portal-cornell/muCode.
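
The abstract describes an iterative loop that alternates between a generator, which proposes solutions conditioned on execution feedback, and a verifier, which scores them. Below is a minimal sketch of one such round, assuming hypothetical placeholder interfaces (generator.generate, verifier.score, run_tests, and the training calls); it illustrates only the structure implied by the abstract and is not muCode's actual API.

    from dataclasses import dataclass

    @dataclass
    class ExecResult:
        passed: bool    # whether the candidate passed the problem's tests
        feedback: str   # execution feedback (errors, failing cases) for the next turn

    def mucode_round(problems, generator, verifier, run_tests,
                     n_candidates=8, n_turns=3):
        """One hypothetical round of data collection and training."""
        gen_data, ver_data = [], []
        for problem in problems:
            feedback = None  # no execution feedback on the first turn
            for _ in range(n_turns):
                # Generator proposes candidates conditioned on the problem
                # and the previous turn's execution feedback.
                candidates = generator.generate(problem, feedback, n=n_candidates)
                # Execute each candidate; pass/fail serves as the single-step reward.
                results = [run_tests(problem, c) for c in candidates]
                ver_data += [(problem, c, r.passed)
                             for c, r in zip(candidates, results)]
                # The learned verifier picks the best candidate to carry forward.
                best = max(range(len(candidates)),
                           key=lambda i: verifier.score(problem, candidates[i]))
                gen_data.append((problem, feedback, candidates[best]))
                feedback = results[best].feedback
        generator.train(gen_data)  # imitate verifier-selected solutions
        verifier.train(ver_data)   # fit scores to execution outcomes

Because every turn has access to a correctness signal from execution, each update optimizes a single-step reward rather than a multi-turn return, which is what the one-step recoverability insight makes possible.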