Verifiable Search Is Not Learnable Chain-of-Thought

Abstract

It is tempting to assume that any task solvable by a short program can be taught to a model as its chain-of-thought: write out the steps, fine-tune, and the model follows. This paper shows that the assumption fails for an identifiable class of procedures. The testbed comprises nine reasoning tasks, each produced by a deterministic generator; the public and hidden splits share generators, so held-out accuracy serves as a proxy for test accuracy. I reverse-engineer the generators into Python solvers, render the solvers as chain-of-thought traces, and distill them into a LoRA of rank at most 32 on a 30B (3.5B-active) Nemotron model. Forward-computable tasks install readily: the lookup/arithmetic tasks and an 8-bit boolean task transfer at ≥ 0.99 and 0.68, respectively. Cryptarithm does not: although a search solver answers 71% of instances, distilling its backtracking search leaves accuracy at 0.01-0.07 across eleven chain-of-thought designs, RL from verifiable rewards, and self-training. This is not a capability gap. The model performs the arithmetic correctly on 97-100% of lines and ranks the correct cipher in its top eight in 71% of cases; what it cannot do is carry the search forward as a left-to-right derivation. Fine-tuning learns the shape of a verifiable elimination step, but the verdicts become unconditional templates that are correct only 16-57% of the time ("verdict-as-token"). The ceiling holds across backbones from 3B to 671B and across both fine-tuning and prompting. A controlled intervention isolates the cause: revealing the cipher key, which turns the derivation into a forward computation, lifts accuracy on the same instances from 0.03 to 0.57. When a procedure's only solution is search over information-free structure, no faithful forward chain-of-thought exists to imitate. The task becomes learnable only once the search is removed, by precomputing its combinatorial core into a catalog and reducing the trace to recall plus verification; the 1st-place solution reaches 0.92 on the private leaderboard this way. What distills is memorization and verification, not search.
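
The contrast between forward verification and search can be made concrete with a minimal sketch. The code below is illustrative only and is not the paper's solver or task format: it assumes a classic additive word puzzle (`TO + GO = OUT`) as a stand-in for the cryptarithm task, and it substitutes plain exhaustive enumeration for backtracking search. The function names `word_value`, `verify`, and `search` are hypothetical. With the key revealed, `verify` is a single left-to-right computation; without it, `search` has to try assignments with nothing in the puzzle text to guide the order.

```python
from itertools import permutations


def word_value(word, key):
    """Forward step: decode a word with a known letter-to-digit key."""
    return int("".join(str(key[ch]) for ch in word))


def verify(addends, total, key):
    """Forward check of one candidate key; no search is involved."""
    words = addends + [total]
    if any(key[w[0]] == 0 for w in words):  # reject leading zeros
        return False
    return sum(word_value(w, key) for w in addends) == word_value(total, key)


def search(addends, total):
    """Exhaustive enumeration of digit assignments (assumed stand-in for
    the backtracking solver). Returns (key, number of candidates tried)."""
    letters = sorted(set("".join(addends) + total))
    tries = 0
    for digits in permutations(range(10), len(letters)):
        tries += 1
        key = dict(zip(letters, digits))
        if verify(addends, total, key):
            return key, tries
    return None, tries


# Toy instance (an assumption, not drawn from the paper's generators).
key, tries = search(["TO", "GO"], "OUT")
print(key, tries)
```

Checking a revealed key costs one call to `verify`, whereas `search` may call it thousands of times even on this four-letter toy puzzle. That asymmetry is the one the key-revealing intervention in the abstract exploits.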