Soft Tokens, Hard Truths

Abstract

The use of continuous instead of discrete tokens during the Chain-of-Thought (CoT) phase of reasoning LLMs has recently garnered attention, based on the intuition that a continuous mixture of discrete tokens could simulate a superposition of several reasoning paths simultaneously. Theoretical results have formally proven that continuous tokens have much greater expressivity and can solve specific problems more efficiently. However, practical use of continuous tokens has been limited by strong training difficulties: previous works either use continuous tokens only at inference time on a pre-trained discrete-token model, or must distill the continuous CoT from ground-truth discrete CoTs and face computational costs that limit the CoT to very few tokens. This is the first work introducing a scalable method to learn continuous CoTs via reinforcement learning (RL), without distilling from reference discrete CoTs. We use "soft" tokens: mixtures of tokens together with noise on the input embedding to provide RL exploration. Computational overhead is minimal, enabling us to learn continuous CoTs with hundreds of tokens. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs matches discrete-token CoTs for pass@1 and surpasses them for pass@32, showing greater CoT diversity. In systematic comparisons, the best-performing scenario is to train with continuous CoT tokens and then use discrete tokens for inference, meaning the "soft" models can be deployed in a standard way. Finally, we show that continuous CoT RL training better preserves the predictions of the base model on out-of-domain tasks, thus providing a softer touch to the base model.
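
The "soft" token described above (a mixture of tokens plus noise on the input embedding) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `temperature` and `noise_std` parameters, and the choice of isotropic Gaussian noise are all assumptions made for illustration, and a plain list of lists stands in for a real embedding table.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into a probability vector at the given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_token(logits, embedding_table, noise_std=0.0, temperature=1.0, rng=None):
    """Return a noisy, probability-weighted mixture of token embeddings.

    embedding_table[i] is the embedding (a list of floats) of vocabulary item i.
    noise_std is a hypothetical hyperparameter: the scale of the Gaussian
    noise added to the mixture (assumed here to be what gives RL exploration).
    """
    rng = rng or random.Random()
    probs = softmax(logits, temperature)
    dim = len(embedding_table[0])
    # Expected embedding under the next-token distribution.
    mixture = [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]
    # Independent Gaussian noise on each coordinate of the input embedding.
    return [m + rng.gauss(0.0, noise_std) for m in mixture]


def hard_token(logits, embedding_table):
    """Discrete counterpart: the embedding of the single most likely token."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return list(embedding_table[best])


if __name__ == "__main__":
    table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 3-token vocabulary
    logits = [2.0, 2.0, -50.0]
    print(soft_token(logits, table))                 # noiseless mixture
    print(soft_token(logits, table, noise_std=0.1))  # noisy mixture
    print(hard_token(logits, table))                 # greedy discrete token
```

In this toy setup, the next CoT step would receive the output of `soft_token` as its input embedding during training, while `hard_token` corresponds to the standard discrete decoding that the abstract reports as the best-performing choice at inference time.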
