Multi-Turn Code Generation Through Single-Step Rewards

Abstract

We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose muCode, a simple yet scalable approach that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP (Markov decision process): the correct code can be recovered from any intermediate code state in a single turn. muCode iteratively trains both a generator, which provides code solutions conditioned on multi-turn execution feedback, and a verifier, which scores the newly generated code. Experimental evaluations show that our approach achieves significant improvements over state-of-the-art baselines. We analyze the design choices of the reward models and policy, and show the efficacy of muCode at utilizing execution feedback. Our code is available at https://github.com/portal-cornell/muCode.
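To make the generator/verifier interplay concrete, here is a minimal sketch of one way a multi-turn loop with execution feedback could look at inference time. It is an illustration under stated assumptions, not the muCode implementation: the function names (`multi_turn_generate`, `toy_generator`, `toy_verifier`, `toy_execute`), the choice to keep the highest-scoring candidate at each turn, and the toy task are all hypothetical, and the training of the generator and verifier is not shown.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces for illustration only; none of these names
# are taken from the muCode codebase.
History = List[Tuple[str, str]]                 # (code, execution feedback) per turn
Generator = Callable[[str, History], List[str]]  # proposes candidate programs
Verifier = Callable[[str, str], float]           # scores one candidate
Executor = Callable[[str], Tuple[bool, str]]     # runs tests, returns (passed, feedback)


def multi_turn_generate(prompt: str, generator: Generator, verifier: Verifier,
                        execute: Executor, max_turns: int = 3) -> Tuple[str, History]:
    """Each turn: propose candidates, keep the one the verifier scores highest,
    execute it, and feed the result back to the generator if it fails."""
    history: History = []
    best_code = ""
    for _ in range(max_turns):
        candidates = generator(prompt, history)
        # Assumption: selection uses only a single-step score for each candidate.
        best_code = max(candidates, key=lambda code: verifier(prompt, code))
        passed, feedback = execute(best_code)
        if passed:
            break
        history.append((best_code, feedback))
    return best_code, history


def toy_execute(code: str) -> Tuple[bool, str]:
    """Toy executor: defines the candidate and checks a single test case."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        result = namespace["add"](2, 3)
    except Exception as exc:
        return False, f"error: {exc!r}"
    if result == 5:
        return True, "all tests passed"
    return False, f"add(2, 3) returned {result}, expected 5"


def toy_generator(prompt: str, history: History) -> List[str]:
    """Stand-in for a language model: only proposes the fix after seeing feedback."""
    buggy = "def add(a, b):\n    return a - b\n"
    if not history:
        return [buggy, "def add(a, b):\n    return a * b\n"]
    return ["def add(a, b):\n    return a + b\n", buggy]


def toy_verifier(prompt: str, code: str) -> float:
    """Stand-in for a learned verifier: a fixed heuristic score."""
    if "a + b" in code:
        return 1.0
    return 0.5 if "a - b" in code else 0.0


final_code, turns = multi_turn_generate(
    "Write add(a, b).", toy_generator, toy_verifier, toy_execute
)
```

In this toy run the first turn fails its test, the feedback is appended to the history, and the second turn recovers the correct program in a single step, which is the behavior the one-step recoverability assumption describes.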
