Rethinking Thinking Tokens: LLMs as Improvement Operators

Abstract

Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which, among other things, lets them explore solution strategies with self-checking. This raises accuracy but inflates context length, token/compute cost, and answer latency. We ask: can current models leverage their metacognition to reach other points on this Pareto frontier, e.g., better accuracy with lower context length and/or latency? Abstractly, we view the model as an improvement operator on its own "thoughts", with a continuum of possible strategies. We identify an interesting inference family, Parallel-Distill-Refine (PDR), which (i) generates diverse drafts in parallel; (ii) distills them into a bounded, textual workspace; and (iii) refines conditioned on this workspace, producing an output that seeds the next round. Importantly, context length (and hence compute cost) is controllable via the degree of parallelism and is no longer conflated with the total number of generated tokens. We report PDR instantiations of current models that give better accuracy than long CoT while incurring lower latency. Setting the degree of parallelism to 1 yields an interesting subcase, Sequential Refinement (SR), which iteratively improves a single candidate answer and outperforms long CoT. The success of such model orchestrations raises the question of whether further training could shift the Pareto frontier. To this end, we train an 8B thinking model with Reinforcement Learning (RL) to make it consistent with PDR as the inference method. On math tasks with verifiable answers, iterative pipelines surpass single-pass baselines at matched sequential budgets, with PDR delivering the largest gains (e.g., +11% on AIME 2024 and +9% on AIME 2025).
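
The three PDR steps can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `generate` callable, the prompt wording, the `workspace_chars` character cap, and the default values are all assumptions made for illustration. The drafts are produced in a plain loop here, whereas a real system would issue those calls concurrently.

```python
from typing import Callable

# Hypothetical model interface (an assumption): prompt text in, generated text out.
Generate = Callable[[str], str]


def pdr(problem: str, generate: Generate, rounds: int = 2,
        parallelism: int = 4, workspace_chars: int = 2000) -> str:
    """Illustrative Parallel-Distill-Refine loop; prompts are placeholders."""
    seed = ""  # refined output of the previous round (empty in round one)
    for _ in range(rounds):
        # (i) Generate diverse drafts. The calls are independent of each other,
        # so a real system could run them in parallel.
        drafts = [
            generate(f"Problem: {problem}\nPrevious attempt: {seed}\nDraft {i}:")
            for i in range(parallelism)
        ]
        # (ii) Distill the drafts into a bounded textual workspace. Truncating
        # to a character cap is an assumed, simplistic way to enforce the bound.
        summary = generate("Summarize the key ideas:\n" + "\n---\n".join(drafts))
        workspace = summary[:workspace_chars]
        # (iii) Refine conditioned on the workspace; the result seeds the next round.
        seed = generate(f"Problem: {problem}\nWorkspace: {workspace}\nRefined answer:")
    return seed
```

Calling `pdr(problem, generate, parallelism=1)` gives a rough analogue of Sequential Refinement in this sketch, since each round then revises a single candidate. The largest prompt is the distillation prompt, whose size grows with `parallelism` rather than with the total number of tokens generated across rounds.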
