Demystifying GPT Self-Repair for Code Generation

Abstract

Large Language Models (LLMs) have shown remarkable aptitude in code generation but still struggle on challenging programming tasks. Self-repair, in which the model debugs and fixes mistakes in its own code, has recently become a popular way to boost performance in these settings. However, the literature contains only very limited studies of how and when self-repair works effectively, and one might wonder to what extent a model can really provide accurate feedback on why code is wrong when that code was generated by the same model. In this paper, we analyze the ability of GPT-3.5 and GPT-4 to perform self-repair on APPS, a challenging dataset consisting of diverse coding challenges. To do so, we first establish a new evaluation strategy dubbed pass@t, which measures the pass rate of the tasks against the total number of tokens sampled from the model, enabling a fair comparison to purely sampling-based approaches. With this evaluation strategy, we find that the effectiveness of self-repair is only seen in GPT-4. We also observe that self-repair is bottlenecked by the feedback stage: using GPT-4 to give feedback on programs generated by GPT-3.5, and using expert human programmers to give feedback on programs generated by GPT-4, we unlock significant performance gains.
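To make the idea behind pass@t concrete, the following is a minimal sketch of how a pass rate could be measured against a token budget. It is an illustration only, not the paper's estimator: the `Sample` record, the `pass_rate_at_budget` function, and the rule that a task counts as solved once any sample passes within the budget are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    """One program drawn from the model for a task (hypothetical record type)."""
    tokens: int   # tokens sampled for this attempt, including any feedback or repair text
    passed: bool  # whether the program passed the task's tests


def pass_rate_at_budget(runs: Dict[str, List[Sample]], budget: int) -> float:
    """Fraction of tasks solved before the cumulative token count exceeds `budget`.

    Assumption: samples are consumed in order, and a task is solved as soon as
    one sample passes while the running token total is still within the budget.
    """
    if not runs:
        return 0.0
    solved = 0
    for samples in runs.values():
        spent = 0
        for sample in samples:
            spent += sample.tokens
            if spent > budget:
                break  # budget exhausted before this sample completed
            if sample.passed:
                solved += 1
                break
    return solved / len(runs)


# Toy data, invented for illustration only.
runs = {
    "task_a": [Sample(100, False), Sample(120, True)],
    "task_b": [Sample(300, False), Sample(300, False)],
}
print(pass_rate_at_budget(runs, 250))
```

Because the budget counts every sampled token, a self-repair run that spends tokens on feedback and a repaired program is charged for them, which is what allows a like-for-like comparison with simply drawing more independent samples.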
