Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

Abstract

Recent progress in large language models (LLMs) such as GPT-4 and PaLM-2 has brought significant advances in solving math reasoning problems. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore the effect of code on enhancing LLMs' reasoning capability by introducing different constraints on the Code Usage Frequency of GPT-4 Code Interpreter. We find that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when it receives unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit code-based self-verification (CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to use code to self-verify its answers. When the verification state registers as "False", the model automatically amends its solution, much as a person corrects errors during a mathematics examination. Furthermore, we recognize that the states of the verification result indicate the confidence of a solution, which can improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we achieve an impressive zero-shot accuracy on the MATH dataset (53.9% to 84.3%).
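The idea that verification states can inform majority voting can be illustrated with a minimal sketch. It is not the paper's implementation: the weight values, the "Uncertain" state, and the function name `weighted_majority_vote` are assumptions made for illustration only, since the abstract does not specify them.

```python
from collections import defaultdict

# Hypothetical weights per verification state (assumed, not from the abstract).
STATE_WEIGHTS = {"True": 1.0, "Uncertain": 0.5, "False": 0.2}


def weighted_majority_vote(samples, weights=STATE_WEIGHTS, default_weight=0.5):
    """Return the answer with the largest total weight.

    samples: iterable of (answer, verification_state) pairs, one per
    sampled solution. States missing from `weights` get `default_weight`.
    """
    scores = defaultdict(float)
    for answer, state in samples:
        scores[answer] += weights.get(state, default_weight)
    if not scores:
        raise ValueError("no samples to vote on")
    # Ties resolve to the answer that was seen first.
    return max(scores, key=scores.get)


# Three samples answer "12" but fail verification; two answer "15" and pass.
samples = [
    ("12", "False"),
    ("12", "False"),
    ("15", "True"),
    ("12", "False"),
    ("15", "True"),
]
print(weighted_majority_vote(samples))  # prints 15
```

In this toy input, plain majority voting would pick "12" (three votes to two), while the weighted vote prefers "15" because its samples passed verification. How much a failed verification should be discounted is a tuning choice that this sketch does not settle.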

