Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

Abstract

Recent progress in large language models (LLMs) such as GPT-4 and PaLM-2 has brought significant advances in solving math reasoning problems. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore how code enhances LLMs' reasoning capability by introducing different constraints on the Code Usage Frequency of GPT-4 Code Interpreter. We find that its success can be largely attributed to its strong skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when it receives unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit code-based self-verification (CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method uses a zero-shot prompt to encourage GPT-4 Code Interpreter to verify its own answers with code. When the verification state registers as "False", the model automatically amends its solution, much as a person corrects errors during a mathematics examination. Furthermore, we recognize that the state of the verification result indicates the confidence of a solution, which can be used to improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we achieve an impressive zero-shot accuracy on the MATH dataset, rising from 53.9% to 84.3%.
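
As a rough illustration of how verification states might inform majority voting, the sketch below weights each sampled answer by its verification state before tallying. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the "True" state label, and the numeric weights are all hypothetical and chosen only for the example.

```python
from collections import Counter, defaultdict

# Hypothetical weights (an assumption, not taken from the abstract):
# answers whose self-verification came back "True" count more than
# answers whose verification came back "False".
STATE_WEIGHTS = {"True": 1.0, "False": 0.2}


def plain_majority_vote(samples):
    """Return the most frequent answer, ignoring verification states."""
    counts = Counter(answer for answer, _state in samples)
    return counts.most_common(1)[0][0]


def weighted_majority_vote(samples, weights=STATE_WEIGHTS):
    """Return the answer with the highest total verification weight.

    `samples` is a list of (answer, verification_state) pairs, one per
    sampled solution to the same problem.
    """
    scores = defaultdict(float)
    for answer, state in samples:
        scores[answer] += weights[state]
    return max(scores, key=scores.get)


# Five sampled solutions: three agree on "12" but failed verification,
# two agree on "15" and passed verification.
samples = [
    ("12", "False"),
    ("12", "False"),
    ("12", "False"),
    ("15", "True"),
    ("15", "True"),
]

plain_choice = plain_majority_vote(samples)
weighted_choice = weighted_majority_vote(samples)
print(plain_choice, weighted_choice)
```

In this toy input, plain voting picks the more frequent but unverified answer, while the weighted vote favors the answer backed by successful verification. Appropriate weights would need to be chosen empirically.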
