Verifiable Search Is Not a Learnable Chain-of-Thought

Abstract

It is tempting to assume that any task solvable by a short program can be taught to a model as its chain-of-thought: write the steps out, fine-tune, and the model follows. This paper shows that the assumption fails for an identifiable class of procedures. The testbed is nine reasoning tasks, each produced by a deterministic generator; the public and hidden splits share generators, so accuracy on held-out data serves as a proxy for test accuracy. I reverse-engineer the generators into Python solvers, render the solvers as chain-of-thought, and distill them into a LoRA of rank at most 32 over a 30B-parameter (3.5B active) Nemotron model. Forward-computable tasks install readily: the lookup and arithmetic tasks and an 8-bit boolean task transfer (≥ 0.99 and 0.68, respectively). Cryptarithm does not: distilling its backtracking search leaves accuracy at 0.01-0.07 across eleven chain-of-thought designs, reinforcement learning from verifiable rewards, and self-training, even though a search solver answers 71% of instances. This is not a capability gap. The model does the arithmetic correctly on 97-100% of lines and ranks the correct cipher in its top eight on 71%; what it cannot do is carry the search forward as a left-to-right derivation. Fine-tuning learns the shape of a verifiable elimination step, but the verdicts become unconditional templates that are correct only 16-57% of the time ("verdict-as-token"). The ceiling holds across backbones from 3B to 671B parameters and across both fine-tuning and prompting. A controlled intervention isolates the cause: revealing the cipher key, which turns the derivation into a forward computation, lifts accuracy on the same instances from 0.03 to 0.57. When a procedure's only solution is search over information-free structure, no faithful forward chain-of-thought exists to imitate. The task becomes learnable only by removing the search: precomputing its combinatorial core into a catalog and reducing the trace to recall plus verification. The first-place solution reaches 0.92 on the Private LB this way. What distills is memorization and verification, not search.
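To make the contrast between search and forward verification concrete, here is a minimal sketch. It is not the paper's solver or task format. It assumes the classic word-addition style of cryptarithm, in which each letter stands for a distinct digit and no word starts with zero. The names `word_value`, `verify`, and `solve` are hypothetical. With the key in hand, `verify` is a single left-to-right computation. Without it, `solve` has to guess digits and undo them when a branch fails.

```python
def word_value(word, key):
    """Read a word as a base-10 number under a letter -> digit key."""
    value = 0
    for letter in word:
        value = value * 10 + key[letter]
    return value


def verify(addends, total, key):
    """Forward check: once the key is known, this is plain arithmetic."""
    words = list(addends) + [total]
    if any(key[word[0]] == 0 for word in words):
        return False  # assumed convention: no leading zeros
    return sum(word_value(w, key) for w in addends) == word_value(total, key)


def solve(addends, total):
    """Backtracking search: try a digit per letter, undo it on failure."""
    letters = sorted(set("".join(addends) + total))
    key, used = {}, set()

    def backtrack(i):
        if i == len(letters):
            # Leaf: every letter is assigned, so the forward check applies.
            return verify(addends, total, key)
        for digit in range(10):
            if digit in used:
                continue  # each letter must map to a distinct digit
            key[letters[i]] = digit
            used.add(digit)
            if backtrack(i + 1):
                return True
            # Dead end: retract the guess before trying the next digit.
            used.discard(digit)
            del key[letters[i]]
        return False

    return dict(key) if backtrack(0) else None


# Illustrative puzzle (an assumption, not taken from the paper's tasks).
puzzle = (["TO", "GO"], "OUT")
found = solve(*puzzle)
print(found, verify(*puzzle, found))
```

For this toy puzzle the only consistent key gives 21 + 81 = 102. The retracted guesses inside `backtrack` are the part that has no counterpart in a left-to-right derivation, whereas `verify` is the kind of forward step that the abstract reports as transferable.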