Chain of Code: Reasoning with a Language Model-Augmented Code Emulator

Abstract

Code provides a general syntactic structure for building complex programs and, when paired with a code interpreter, for performing precise computations. We hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for linguistic ones, and in particular for tasks that mix the two. For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may struggle to write an implementation of "detect_sarcasm(string)" that the interpreter can execute, because handling the edge cases would be insurmountable. However, LMs may still produce a valid solution if they are used not only to write the code, but also to selectively "emulate" the interpreter by generating the expected output of "detect_sarcasm(string)" and of other lines of code (e.g., those the interpreter could not execute). In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format linguistic sub-tasks in a program as flexible pseudocode, so that the interpreter can explicitly catch undefined behaviors and hand them off to an LM to simulate (as an "LMulator"). Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. CoC scales well with large and small models alike, and broadens the scope of reasoning questions that LMs can correctly answer by "thinking in code". Project webpage: https://chain-of-code.github.io/.
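The hand-off described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names run_chain_of_code and stub_lmulator are hypothetical, the program is restricted to straight-line statements, and the "LMulator" is replaced by a toy keyword heuristic so that the example runs without a model. Each line is first tried in the real interpreter; a line that fails (here, because detect_sarcasm is undefined) is passed, together with the current program state, to the emulator, which returns the updated state.

```python
import re
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]


def stub_lmulator(line: str, state: State) -> State:
    """Hypothetical stand-in for an LM call (assumption, not a real API).

    A real LMulator would prompt a language model with the line and the
    current state. This stub only understands lines of the form
    `name = detect_sarcasm(<expr>)` and uses a toy keyword heuristic.
    """
    match = re.fullmatch(r"\s*(\w+)\s*=\s*detect_sarcasm\((.+)\)\s*", line)
    if match is None:
        raise ValueError(f"stub cannot emulate: {line!r}")
    target, arg_expr = match.groups()
    # The argument itself is ordinary Python, so evaluate it against the state.
    text = eval(arg_expr, {}, state)
    new_state = dict(state)
    new_state[target] = text.lower().startswith("oh great")
    return new_state


def run_chain_of_code(
    lines: List[str], lmulator: Callable[[str, State], State]
) -> Tuple[State, List[str]]:
    """Run each line in the interpreter; fall back to the emulator on failure."""
    state: State = {}
    emulated: List[str] = []
    for line in lines:
        try:
            exec(line, {}, state)  # precise execution when the line is valid code
        except Exception:
            state = lmulator(line, state)  # hand the line off for simulation
            emulated.append(line)
    return state, emulated


program = [
    'sentences = ["Oh great, another meeting.", "The report is due Friday."]',
    "count = 0",
    "flag_0 = detect_sarcasm(sentences[0])",
    "count += int(flag_0)",
    "flag_1 = detect_sarcasm(sentences[1])",
    "count += int(flag_1)",
]

state, emulated = run_chain_of_code(program, stub_lmulator)
print(state["count"], len(emulated))
```

In this sketch the counting and indexing are done by the interpreter, and only the two detect_sarcasm lines reach the emulator. Swapping stub_lmulator for a function that queries an actual LM would leave the control loop unchanged.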
