Chain of Code: Reasoning with a Language Model-Augmented Code Emulator
December 7, 2023
Authors: Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, Brian Ichter
cs.AI
Abstract
Code provides a general syntactic structure to build complex programs and
perform precise computations when paired with a code interpreter -- we
hypothesize that language models (LMs) can leverage code-writing to improve
Chain of Thought reasoning not only for logic and arithmetic tasks, but also
for linguistic ones (and in particular, those that are a mix of both). For
example, consider prompting an LM to write code that counts the number of times
it detects sarcasm in an essay: the LM may struggle to write an implementation
for "detect_sarcasm(string)" that can be executed by the interpreter (handling
the edge cases would be insurmountable). However, LMs may still produce a valid
solution if they are used not only to write the code, but also to selectively
"emulate" the interpreter by generating the expected output of
"detect_sarcasm(string)" and other lines of code (e.g., that the interpreter
could not compile). In this work, we propose Chain of Code (CoC), a simple yet
surprisingly effective extension that improves LM code-driven reasoning. The
key idea is to encourage LMs to format linguistic sub-tasks in a program as
flexible pseudocode, so that the interpreter can explicitly catch undefined
behaviors and hand them off to an LM to simulate (as an "LMulator").
Experiments demonstrate
that Chain of Code outperforms Chain of Thought and other baselines across a
variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of
12% over Chain of Thought. CoC scales well with large and small models alike,
and broadens the scope of reasoning questions that LMs can correctly answer by
"thinking in code". Project webpage: https://chain-of-code.github.io/.
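
The core mechanism described above — run each line of the generated program with a real interpreter when possible, and fall back to the LM only for lines the interpreter cannot handle — can be sketched as follows. This is a minimal illustration, not the paper's implementation; `query_lm` is a hypothetical stand-in for an actual language-model call, and the single-assignment line format is a simplifying assumption.

```python
def query_lm(prompt: str) -> str:
    """Hypothetical LM call; returns a canned answer for illustration.

    A real system would prompt the LM with the program trace so far and
    ask it to emulate the unexecutable line (the "LMulator" role).
    """
    return "True"  # e.g., the LM judges the sentence to be sarcastic


def run_chain_of_code(lines, state=None):
    """Execute lines with the Python interpreter; on undefined behavior,
    hand the line off to the LM and record its predicted output."""
    state = {} if state is None else state
    for line in lines:
        try:
            exec(line, state)  # the real interpreter handles this line
        except (NameError, NotImplementedError):
            # Undefined function or variable: delegate to the LMulator.
            target, _, _expr = line.partition(" = ")
            answer = query_lm(f"Emulate this line of code: {line}")
            state[target.strip()] = eval(answer)
    return state


program = [
    "count = 0",
    # detect_sarcasm is never defined, so the interpreter cannot run it:
    "is_sarcastic = detect_sarcasm('Oh great, another Monday.')",
    "count += int(is_sarcastic)",
]
final_state = run_chain_of_code(program)
print(final_state["count"])  # 1
```

The first and third lines execute normally in Python, while the call to the undefined `detect_sarcasm` raises a `NameError` and is routed to the LM instead — mirroring how CoC interleaves precise computation with LM-emulated semantic sub-tasks.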