Demystifying GPT Self-Repair for Code Generation
June 16, 2023
Authors: Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, Armando Solar-Lezama
cs.AI
Abstract
Large Language Models (LLMs) have shown remarkable aptitude in code
generation but still struggle on challenging programming tasks. Self-repair --
in which the model debugs and fixes mistakes in its own code -- has recently
become a popular way to boost performance in these settings. However, the
literature offers only limited study of how and when self-repair works
effectively, and one might wonder to what extent a model can really provide
accurate feedback on why code is wrong when that code was generated by the
same model. In this paper, we analyze GPT-3.5 and GPT-4's
ability to perform self-repair on APPS, a challenging dataset consisting of
diverse coding challenges. To do so, we first establish a new evaluation
strategy dubbed pass@t that measures the pass rate of the tasks as a function
of the total number of tokens sampled from the model, enabling a fair
comparison with purely sampling-based approaches. With this evaluation
strategy, we find that the effectiveness of self-repair is seen only in
GPT-4. We also observe that
self-repair is bottlenecked by the feedback stage; using GPT-4 to give feedback
on the programs generated by GPT-3.5 and using expert human programmers to give
feedback on the programs generated by GPT-4, we unlock significant performance
gains.
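
The abstract characterizes pass@t only informally. Below is a minimal sketch of how such a metric could be computed, assuming each attempt is annotated with the cumulative number of tokens sampled from the model; the names Attempt and pass_at_t are illustrative, not taken from the paper.

from dataclasses import dataclass
from typing import List

@dataclass
class Attempt:
    tokens_used: int  # cumulative tokens sampled up to and including this attempt
    passed: bool      # whether this attempt passed all of the task's unit tests

def pass_at_t(tasks: List[List[Attempt]], t: int) -> float:
    """Fraction of tasks with at least one passing attempt within a budget of t tokens.

    Each inner list holds one task's attempts in the order they were drawn.
    Self-repair attempts count toward the same token budget as fresh samples,
    which is what makes the comparison with purely sampling-based approaches fair.
    """
    if not tasks:
        return 0.0
    solved = sum(
        any(a.passed and a.tokens_used <= t for a in attempts)
        for attempts in tasks
    )
    return solved / len(tasks)

Sweeping t across a range of budgets then yields a curve on which a sample-then-repair strategy and pure resampling can be compared at equal token cost.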