Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

Abstract

Recent progress in large language models (LLMs) such as GPT-4 and PaLM-2 has brought significant advances in solving math reasoning problems. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore the effect of code on enhancing LLMs' reasoning capability by introducing different constraints on the Code Usage Frequency of GPT-4 Code Interpreter. We find that its success can be largely attributed to its strong skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when it receives unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit code-based self-verification (CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to use code to self-verify its answers. When the verification state registers as "False", the model automatically amends its solution, much as we correct our own errors during a mathematics examination. Furthermore, we recognize that the state of the verification result indicates the confidence of a solution, which can improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we raise zero-shot accuracy on the MATH dataset from 53.9% to 84.3%.
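The idea that verification states can act as confidence signals in majority voting can be illustrated with a minimal sketch. The abstract names only the "False" state; the "True" and "Uncertain" states, the numeric weights, and the function names below are assumptions made for illustration, not values or an API taken from the paper.

```python
from collections import defaultdict

# Hypothetical weights per verification state (illustrative assumption only).
STATE_WEIGHTS = {"True": 1.0, "Uncertain": 0.5, "False": 0.2}


def weighted_majority_vote(samples, weights=STATE_WEIGHTS):
    """Return the answer with the largest total weight.

    samples: list of (answer, verification_state) pairs, one per sampled
    solution. Each answer's score is the sum of the weights of the
    verification states attached to it.
    """
    if not samples:
        raise ValueError("need at least one sampled solution")
    scores = defaultdict(float)
    for answer, state in samples:
        scores[answer] += weights[state]
    return max(scores, key=scores.get)


def plain_majority_vote(samples):
    """Baseline: every sampled solution counts equally, whatever its state."""
    uniform = {state: 1.0 for state in STATE_WEIGHTS}
    return weighted_majority_vote(samples, uniform)


# Five sampled solutions: three agree on "12" but failed verification,
# two agree on "15" and did not fail it.
samples = [
    ("12", "False"),
    ("12", "False"),
    ("12", "False"),
    ("15", "True"),
    ("15", "Uncertain"),
]

print(plain_majority_vote(samples))     # prints 12
print(weighted_majority_vote(samples))  # prints 15
```

In this toy case plain voting picks the answer that failed verification three times, while the weighted vote prefers the answer backed by a passed verification. How the weights should be chosen in practice is left open here.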
