Verifiable Search Is Not a Learnable Chain-of-Thought

Abstract

It is tempting to assume that any task solvable by a short program can be taught to a model as its chain-of-thought: write the steps out, fine-tune, and the model follows. This paper shows that the assumption fails for an identifiable class of procedures. The testbed is nine reasoning tasks, each produced by a deterministic generator; the public and hidden splits share generators, so held-out data serves as a proxy for test accuracy. I reverse-engineer the generators into Python solvers, render the solvers as chain-of-thought, and distill them into a LoRA of rank ≤ 32 on a 30B (3.5B-active) Nemotron model. Forward-computable tasks install readily: the lookup/arithmetic tasks and an 8-bit boolean task transfer at ≥ 0.99 and 0.68, respectively. Cryptarithm does not: distilling its backtracking search stays at 0.01-0.07 across eleven chain-of-thought designs, RL from verifiable rewards, and self-training, even though a search solver answers 71% of instances. This is not a capability gap. The model performs the arithmetic correctly on 97-100% of lines and ranks the correct cipher in its top eight on 71% of instances; what it cannot do is carry the search forward as a left-to-right derivation. Fine-tuning learns the shape of a verifiable elimination step, but the verdicts of those steps become unconditional templates that are correct only 16-57% of the time ("verdict-as-token"). The ceiling holds across backbones from 3B to 671B and across both fine-tuning and prompting. A controlled intervention isolates the cause: revealing the cipher key, which turns the derivation into a forward computation, lifts the same instances from 0.03 to 0.57. When a procedure's only solution is search over information-free structure, no faithful forward chain-of-thought exists to imitate. The task becomes learnable only by removing the search, precomputing its combinatorial core into a catalog, and reducing the trace to recall plus verification; the 1st-place solution reaches Private LB 0.92 this way. What distills is memorization and verification, not search.
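The contrast between search, forward verification, and catalog recall can be made concrete with a minimal sketch. It is only an illustration under stated assumptions: the classic word-addition puzzle TO + GO = OUT stands in for the cryptarithm task (whose actual format is not given here), exhaustive enumeration stands in for the backtracking search, and every name below (`verify`, `solve_by_search`, `answer_by_recall`, `catalog`) is hypothetical rather than taken from the solvers or from the 1st-place solution.

```python
from itertools import permutations


def word_value(word, key):
    """Decode a word to an integer using a letter -> digit key."""
    return int("".join(str(key[ch]) for ch in word))


def verify(addends, total, key):
    """Forward check: once the key is known, only arithmetic remains."""
    words = list(addends) + [total]
    if any(key[w[0]] == 0 for w in words):  # reject leading zeros
        return False
    return sum(word_value(w, key) for w in addends) == word_value(total, key)


def solve_by_search(addends, total):
    """Search: enumerate candidate keys until one verifies.

    Assumption: plain enumeration is used here as a simplified stand-in
    for a backtracking search with pruning.
    """
    letters = sorted(set("".join(addends) + total))
    tried = 0
    for digits in permutations(range(10), len(letters)):
        tried += 1
        key = dict(zip(letters, digits))
        if verify(addends, total, key):
            return key, tried
    return None, tried


def answer_by_recall(addends, total, catalog):
    """Recall plus verification: look the key up, then check it forward."""
    key = catalog.get((*addends, total))
    if key is not None and verify(addends, total, key):
        return key
    return None


# Offline: the search runs once and its result is stored in a catalog.
key, tried = solve_by_search(["TO", "GO"], "OUT")
catalog = {("TO", "GO", "OUT"): key}

# Online: no search is needed, only a lookup and one arithmetic check.
recalled = answer_by_recall(["TO", "GO"], "OUT", catalog)
print(key, tried, recalled == key)
```

In this sketch `verify` is a single left-to-right computation, whereas `solve_by_search` has to reject many candidate keys before one passes. That asymmetry is the one the abstract points to: a forward check or a catalog lookup can be written out as a faithful step-by-step trace, while the rejected branches of a search give the trace no forward derivation to follow.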