Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

Abstract

Recent progress in large language models (LLMs) such as GPT-4 and PaLM-2 has brought significant advances in solving math reasoning problems. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore the effect of code on enhancing LLMs' reasoning capability by introducing different constraints on the Code Usage Frequency of GPT-4 Code Interpreter. We found that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when it receives unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit code-based self-verification (CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to use code to self-verify its answers. When the verification state registers as "False", the model automatically amends its solution, much as we would correct our own errors during a mathematics examination. Furthermore, we recognize that the state of the verification result indicates the confidence of a solution, which can improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we achieve an impressive zero-shot accuracy on the MATH dataset, improving from 53.9% to 84.3%.
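The idea that the verification state can act as a confidence signal for majority voting might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `weighted_majority_vote`, the restriction to two states ("True" and "False"), and the weight values are all hypothetical and chosen only to show the mechanism.

```python
from collections import defaultdict

# Hypothetical weights, not taken from the paper: a solution whose
# self-verification returned "True" counts more than one that returned "False".
STATE_WEIGHTS = {"True": 1.0, "False": 0.2}


def weighted_majority_vote(samples, weights=STATE_WEIGHTS):
    """Return the answer with the largest total weight.

    `samples` is a list of (answer, verification_state) pairs,
    one pair per sampled solution.
    """
    scores = defaultdict(float)
    for answer, state in samples:
        scores[answer] += weights[state]
    return max(scores, key=scores.get)


# Three unverified votes for "12" are outweighed by one verified vote for "15".
samples = [("12", "False"), ("12", "False"), ("12", "False"), ("15", "True")]
print(weighted_majority_vote(samples))  # prints 15
```

With equal weights for both states this reduces to plain majority voting, which would pick "12" on the same samples.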

