IterPref: Focal Preference Learning for Code Generation via Iterative Debugging

Abstract

Preference learning enhances Code LLMs beyond supervised fine-tuning by leveraging relative quality comparisons. Existing methods construct preference pairs from candidates based on test case success, treating the sample with the higher pass rate as positive and the one with the lower pass rate as negative. However, this approach does not pinpoint specific errors in the code, which prevents the model from learning more informative error correction patterns: aligning failing code as a whole lacks the granularity needed to capture meaningful error-resolution relationships. To address these issues, we propose IterPref, a new preference alignment framework that mimics human iterative debugging to refine Code LLMs. IterPref explicitly locates error regions and aligns the corresponding tokens via a tailored DPO algorithm. To generate informative pairs, we introduce the CodeFlow dataset, in which samples are iteratively refined until they pass the tests, so that the modifications capture error corrections. Extensive experiments show that a diverse suite of Code LLMs equipped with IterPref achieves significant performance gains in code generation and improves on challenging tasks like BigCodeBench. In-depth analysis reveals that IterPref yields fewer errors. Our code and data will be made publicly available.
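The idea of locating error regions from an iteratively refined pair can be illustrated with a small sketch. The abstract does not say how IterPref computes these regions, so the line-level diff below is only an assumption for illustration, and the function name `locate_changed_lines` and the toy `mean` example are hypothetical. It uses Python's standard `difflib` to find which lines differ between a failing version and the refined version that passes.

```python
import difflib


def locate_changed_lines(failing_code: str, passing_code: str):
    """Return indices of lines that differ between two code versions.

    Hypothetical helper: a line-level diff is assumed here as a stand-in
    for error-region location; the paper's actual procedure may differ.
    """
    failing_lines = failing_code.splitlines()
    passing_lines = passing_code.splitlines()
    matcher = difflib.SequenceMatcher(
        a=failing_lines, b=passing_lines, autojunk=False
    )

    failing_idx, passing_idx = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged lines carry no error-correction signal
        failing_idx.extend(range(i1, i2))  # lines removed or rewritten
        passing_idx.extend(range(j1, j2))  # lines added or rewritten
    return failing_idx, passing_idx


# Toy pair: the failing version divides by the wrong count.
failing = (
    "def mean(xs):\n"
    "    total = sum(xs)\n"
    "    return total / (len(xs) - 1)\n"
)
passing = (
    "def mean(xs):\n"
    "    total = sum(xs)\n"
    "    return total / len(xs)\n"
)

bad_lines, fixed_lines = locate_changed_lines(failing, passing)
print(bad_lines, fixed_lines)
```

Under this assumption, the flagged line indices could then be mapped to token positions, so that a preference objective weights only the tokens in the changed region rather than the whole failing sample.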
