Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification
August 15, 2023
Authors: Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, Hongsheng Li
cs.AI
Abstract
Recent progress in large language models (LLMs) like GPT-4 and PaLM-2 has
brought significant advancements in addressing math reasoning problems. In
particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter,
shows remarkable performance on challenging math datasets. In this paper, we
explore the effect of code on enhancing LLMs' reasoning capability by
introducing different constraints on the Code Usage Frequency of GPT-4
Code Interpreter. We found that its success can be largely attributed to its
powerful skills in generating and executing code, evaluating the output of code
execution, and rectifying its solution when receiving unreasonable outputs.
Based on this insight, we propose a novel and effective prompting method,
explicit code-based self-verification (CSV), to further
boost the mathematical reasoning potential of GPT-4 Code Interpreter. This
method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to
use code to self-verify its answers. When the verification state is
"False", the model automatically amends its solution, much as a person
corrects errors during a mathematics examination. Furthermore, we
recognize that the states of the verification
result indicate the confidence of a solution, which can improve the
effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we
raise zero-shot accuracy on the MATH dataset from 53.9% to 84.3%.
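
The idea that verification states can act as confidence signals in majority voting can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `weighted_majority_vote`, the `STATE_WEIGHTS` table, and the weight values are all assumptions chosen for illustration, and the abstract does not specify how states are weighted.

```python
from collections import Counter, defaultdict

# Hypothetical confidence weights per verification state (illustrative values,
# not taken from the paper).
STATE_WEIGHTS = {"True": 1.0, "False": 0.2}


def weighted_majority_vote(samples, weights=STATE_WEIGHTS):
    """Pick the answer with the largest total weight.

    samples: list of (answer, verification_state) pairs, one per sampled solution.
    """
    scores = defaultdict(float)
    for answer, state in samples:
        scores[answer] += weights[state]
    # Ties resolve to the answer that was seen first.
    return max(scores, key=scores.get)


# Five sampled solutions: three agree on "12" but failed self-verification,
# two agree on "15" and passed it.
samples = [
    ("12", "False"),
    ("12", "False"),
    ("12", "False"),
    ("15", "True"),
    ("15", "True"),
]

plain = Counter(answer for answer, _ in samples).most_common(1)[0][0]
weighted = weighted_majority_vote(samples)
print(plain, weighted)
```

In this toy input, plain majority voting selects "12" (three votes against two), while the weighted vote selects "15", because the two verified solutions carry more total weight than the three unverified ones.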