Chain of Code: Reasoning with a Language Model-Augmented Code Emulator
December 7, 2023
作者: Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, Brian Ichter
cs.AI
Abstract
Code provides a general syntactic structure to build complex programs and
perform precise computations when paired with a code interpreter -- we
hypothesize that language models (LMs) can leverage code-writing to improve
Chain of Thought reasoning not only for logic and arithmetic tasks, but also
for linguistic ones (and in particular, those that are a mix of both). For
example, consider prompting an LM to write code that counts the number of times
it detects sarcasm in an essay: the LM may struggle to write an implementation
for "detect_sarcasm(string)" that can be executed by the interpreter (handling
the edge cases would be insurmountable). However, LMs may still produce a valid
solution if they are used not only to write the code, but also to selectively
"emulate" the interpreter by generating the expected output of
"detect_sarcasm(string)" and other lines of code (e.g., that the interpreter
could not compile). In this work, we propose Chain of Code (CoC), a simple yet
surprisingly effective extension that improves LM code-driven reasoning. The
key idea is to encourage LMs to format linguistic sub-tasks in a program as
flexible pseudocode such that the interpreter can explicitly catch undefined
behaviors and hand them off to an LM to simulate (as an "LMulator"). Experiments demonstrate
that Chain of Code outperforms Chain of Thought and other baselines across a
variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of
12% over Chain of Thought. CoC scales well with large and small models alike,
and broadens the scope of reasoning questions that LMs can correctly answer by
"thinking in code". Project webpage: https://chain-of-code.github.io/.