CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules
October 13, 2023
Authors: Hung Le, Hailin Chen, Amrita Saha, Akash Gokul, Doyen Sahoo, Shafiq Joty
cs.AI
Abstract
Large Language Models (LLMs) have already become quite proficient at solving
simpler programming tasks like those in the HumanEval or MBPP benchmarks. However,
solving more complex and competitive programming tasks is still quite
challenging for these models - possibly due to their tendency to generate
solutions as monolithic code blocks instead of decomposing them into logical
sub-tasks and sub-modules. On the other hand, experienced programmers
instinctively write modularized code with abstraction for solving complex
tasks, often reusing previously developed modules. To address this gap, we
propose CodeChain, a novel framework for inference that elicits modularized
code generation through a chain of self-revisions, each being guided by some
representative sub-modules generated in previous iterations. Concretely,
CodeChain first instructs the LLM to generate modularized code through
chain-of-thought prompting. Then it applies a chain of self-revisions by
iterating two steps: 1) extracting and clustering the generated sub-modules
and selecting the cluster representatives as the more generic and re-usable
implementations, and 2) augmenting the original chain-of-thought prompt with
these selected module-implementations and instructing the LLM to re-generate
new modularized solutions. We find that by naturally encouraging the LLM to
reuse the previously developed and verified sub-modules, CodeChain can
significantly boost both the modularity and correctness of the generated
solutions, achieving relative pass@1 improvements of 35% on APPS and 76% on
CodeContests. It is shown to be effective on both OpenAI LLMs and
open-source LLMs like WizardCoder. We also conduct comprehensive ablation
studies with different prompting methods, numbers of clusters, model sizes,
program qualities, etc., to provide useful insights that underpin CodeChain's
success.
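
To make the loop concrete, below is a minimal Python sketch of the self-revision chain as the abstract describes it. Everything in it is illustrative rather than the authors' implementation: llm_generate is a hypothetical placeholder for a model call, sub-modules are grouped by a crude (name, arity) signature rather than the semantic clustering the paper implies, the prompts are paraphrases, and the execution-based verification suggested by "verified sub-modules" is omitted.

import ast
from collections import defaultdict

def llm_generate(prompt, n=5):
    # Hypothetical placeholder for a model call that returns n candidate
    # programs for the given prompt; plug in a real API here.
    raise NotImplementedError

def extract_submodules(program):
    # Pull top-level function definitions out of one candidate program.
    try:
        tree = ast.parse(program)
    except SyntaxError:
        return []
    return [ast.unparse(node) for node in tree.body
            if isinstance(node, ast.FunctionDef)]

def select_representatives(submodules, k):
    # Group sub-modules by a crude (name, arity) signature and keep the
    # shortest member of each of the k largest groups. This is only a
    # stand-in for the clustering and representative selection described
    # in the abstract.
    groups = defaultdict(list)
    for src in submodules:
        fn = ast.parse(src).body[0]
        groups[(fn.name, len(fn.args.args))].append(src)
    largest = sorted(groups.values(), key=len, reverse=True)[:k]
    return [min(group, key=len) for group in largest]

def codechain(problem, rounds=3, k=4):
    # Chain of self-revisions: generate modular solutions, then repeatedly
    # feed representative sub-modules back into the prompt and regenerate.
    base_prompt = ("Think step by step and solve the problem with small, "
                   "reusable helper functions:\n" + problem)
    solutions = llm_generate(base_prompt)
    for _ in range(rounds):
        submodules = [m for s in solutions for m in extract_submodules(s)]
        reps = select_representatives(submodules, k)
        revised = (base_prompt
                   + "\n\nWhere helpful, reuse these sub-modules:\n\n"
                   + "\n\n".join(reps))
        solutions = llm_generate(revised)
    return solutions

The number of clusters k and the choice of representative are exactly the knobs the ablation studies above examine; the stand-in grouping here is only meant to show where those choices plug into the loop.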