Demystifying GPT Self-Repair for Code Generation

June 16, 2023
Authors: Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, Armando Solar-Lezama
cs.AI

Abstract

Large Language Models (LLMs) have shown remarkable aptitude in code generation but still struggle on challenging programming tasks. Self-repair -- in which the model debugs and fixes mistakes in its own code -- has recently become a popular way to boost performance in these settings. However, only very limited studies on how and when self-repair works effectively exist in the literature, and one might wonder to what extent a model is really capable of providing accurate feedback on why the code is wrong when that code was generated by the same model. In this paper, we analyze GPT-3.5 and GPT-4's ability to perform self-repair on APPS, a challenging dataset consisting of diverse coding challenges. To do so, we first establish a new evaluation strategy dubbed pass@t that measures the pass rate of the tasks against the total number of tokens sampled from the model, enabling a fair comparison to purely sampling-based approaches. With this evaluation strategy, we find that the effectiveness of self-repair is only seen in GPT-4. We also observe that self-repair is bottlenecked by the feedback stage; using GPT-4 to give feedback on the programs generated by GPT-3.5 and using expert human programmers to give feedback on the programs generated by GPT-4, we unlock significant performance gains.
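
To make the pass@t metric concrete, here is a minimal sketch of how a pass-rate-versus-token-budget computation might look, based only on the description in the abstract (pass rate measured against the total number of tokens sampled from the model). This is not the paper's actual implementation; the `Attempt` record and `pass_at_t` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Attempt:
    tokens: int   # tokens sampled from the model for this attempt
    passed: bool  # whether the generated program passed the unit tests

def pass_at_t(tasks: List[List[Attempt]], budget: int) -> float:
    """Fraction of tasks solved using at most `budget` total sampled tokens.

    For each task, attempts are consumed in order (e.g., initial
    generations followed by repair rounds), accumulating token costs;
    the task counts as solved if any attempt within the budget passes.
    Illustrative sketch only, not the paper's evaluation code.
    """
    if not tasks:
        return 0.0
    solved = 0
    for attempts in tasks:
        spent = 0
        for attempt in attempts:
            spent += attempt.tokens
            if spent > budget:
                break
            if attempt.passed:
                solved += 1
                break
    return solved / len(tasks)
```

Sweeping `budget` and plotting `pass_at_t` yields a curve on which self-repair and pure i.i.d. sampling can be compared fairly: both strategies spend tokens, so holding the token budget equal isolates whether the repair loop itself adds value.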