

IterPref: Focal Preference Learning for Code Generation via Iterative Debugging

March 4, 2025
Authors: Jie Wu, Haoling Li, Xin Zhang, Jianwen Luo, Yangyu Huang, Ruihang Chu, Yujiu Yang, Scarlett Li
cs.AI

Abstract

Preference learning enhances Code LLMs beyond supervised fine-tuning by leveraging relative quality comparisons. Existing methods construct preference pairs from candidates based on test case success, treating the higher pass rate sample as positive and the lower as negative. However, this approach does not pinpoint specific errors in the code, which prevents the model from learning more informative error correction patterns, as aligning failing code as a whole lacks the granularity needed to capture meaningful error-resolution relationships. To address these issues, we propose IterPref, a new preference alignment framework that mimics human iterative debugging to refine Code LLMs. IterPref explicitly locates error regions and aligns the corresponding tokens via a tailored DPO algorithm. To generate informative pairs, we introduce the CodeFlow dataset, where samples are iteratively refined until passing tests, with modifications capturing error corrections. Extensive experiments show that a diverse suite of Code LLMs equipped with IterPref achieves significant performance gains in code generation and improves on challenging tasks like BigCodeBench. In-depth analysis reveals that IterPref yields fewer errors. Our code and data will be made publicly available.
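
The abstract does not spell out the tailored objective, but the core idea of aligning only the error-region tokens of the failing sample can be illustrated with a token-masked variant of the standard DPO loss. The sketch below is an illustrative assumption in PyTorch, not the paper's actual implementation; the function name `masked_dpo_loss` and its arguments (in particular `rejected_error_mask` and `beta`) are hypothetical.

```python
import torch
import torch.nn.functional as F


def masked_dpo_loss(policy_chosen_logps: torch.Tensor,
                    policy_rejected_logps: torch.Tensor,
                    ref_chosen_logps: torch.Tensor,
                    ref_rejected_logps: torch.Tensor,
                    rejected_error_mask: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """Illustrative token-masked DPO loss (assumed formulation).

    *_logps: per-token log-probabilities of shape (batch, seq_len),
             with padding positions already zeroed out.
    rejected_error_mask: 1.0 at tokens inside the located error region
             of the rejected (failing) code, 0.0 elsewhere.
    """
    # Chosen side: sum log-ratios over all response tokens, as in vanilla DPO.
    chosen_logratio = (policy_chosen_logps - ref_chosen_logps).sum(dim=-1)

    # Rejected side: restrict the penalty to the error-region tokens, so the
    # model is pushed away from the buggy span rather than from the whole
    # failing program.
    rejected_logratio = ((policy_rejected_logps - ref_rejected_logps)
                         * rejected_error_mask).sum(dim=-1)

    # Standard DPO-style logistic objective on the difference of log-ratios.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

Under this reading, a CodeFlow pair would supply the failing snippet as the rejected sample and its test-passing revision as the chosen one, with the mask derived from the tokens changed during iterative debugging.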
