IterPref: Focal Preference Learning for Code Generation via Iterative Debugging

Abstract

Preference learning enhances Code LLMs beyond supervised fine-tuning by leveraging relative quality comparisons. Existing methods construct preference pairs from candidates based on test case success, treating the sample with the higher pass rate as positive and the one with the lower pass rate as negative. However, this approach does not pinpoint specific errors in the code, which prevents the model from learning more informative error correction patterns: aligning failing code as a whole lacks the granularity needed to capture meaningful error-resolution relationships. To address these issues, we propose IterPref, a new preference alignment framework that mimics human iterative debugging to refine Code LLMs. IterPref explicitly locates error regions and aligns the corresponding tokens via a tailored DPO algorithm. To generate informative pairs, we introduce the CodeFlow dataset, where samples are iteratively refined until they pass tests, with the modifications capturing error corrections. Extensive experiments show that a diverse suite of Code LLMs equipped with IterPref achieves significant performance gains in code generation and improves on challenging tasks like BigCodeBench. In-depth analysis reveals that IterPref yields fewer errors. Our code and data will be made publicly available.
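The abstract does not spell out the tailored DPO objective, so the following is only a minimal sketch of the general idea under stated assumptions: the error region is approximated by a token-level diff between a failing sample and its corrected version, and the dispreferred side of a standard DPO loss is restricted to those tokens. The function names `changed_token_mask` and `focal_dpo_loss`, the whitespace tokenization, and the value `beta=0.1` are all hypothetical and are not taken from the paper.

```python
import difflib
import math


def changed_token_mask(before, after):
    """Return a 0/1 mask over `before`, marking tokens the fix replaced or deleted.

    Assumption: a plain sequence diff stands in for error localization.
    """
    mask = [0] * len(before)
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            for i in range(i1, i2):
                mask[i] = 1
    return mask


def focal_dpo_loss(pol_chosen, ref_chosen, pol_rejected, ref_rejected,
                   rejected_mask, beta=0.1):
    """DPO-style loss in which only masked tokens of the rejected sample count.

    Inputs are per-token log-probabilities under the policy and a frozen
    reference model (hypothetical values in the demo below).
    """
    chosen_margin = sum(p - r for p, r in zip(pol_chosen, ref_chosen))
    rejected_margin = sum(
        m * (p - r)
        for m, p, r in zip(rejected_mask, pol_rejected, ref_rejected)
    )
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))


# Toy pair: the failing sample subtracts where the corrected one adds.
buggy = "def add ( a , b ) : return a - b".split()
fixed = "def add ( a , b ) : return a + b".split()
mask = changed_token_mask(buggy, fixed)  # marks only the "-" token

# Hypothetical per-token log-probs: the policy has lowered the probability
# of the erroneous token relative to the reference model.
ref_lp = [-1.0] * len(buggy)
pol_rejected = list(ref_lp)
pol_rejected[mask.index(1)] = -2.0
loss = focal_dpo_loss(ref_lp, ref_lp, pol_rejected, ref_lp, mask)
baseline = focal_dpo_loss(ref_lp, ref_lp, ref_lp, ref_lp, mask)
print(mask, loss < baseline)
```

In this toy setup, tokens shared by both versions contribute nothing to the rejected-side term, so the gradient signal would concentrate on the token that the fix actually changed. Whether IterPref masks the preferred side as well, or locates errors differently, is not stated in the abstract.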
