Learning Adaptive Parallel Reasoning with Language Models

Abstract

Scaling inference-time computation has substantially improved the reasoning capabilities of language models. However, existing methods have significant limitations: serialized chain-of-thought approaches generate overly long outputs, leading to increased latency and exhausted context windows, while parallel methods such as self-consistency suffer from insufficient coordination, resulting in redundant computations and limited performance gains. To address these shortcomings, we propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures. Experiments on the Countdown reasoning task demonstrate significant benefits of APR: (1) higher performance within the same context window (83.4% vs. 60.0% at 4k context); (2) superior scalability with increased computation (80.1% vs. 66.6% at 20k total tokens); (3) improved accuracy at equivalent latency (75.2% vs. 57.3% at approximately 5,000ms). APR represents a step towards enabling language models to autonomously optimize their reasoning processes through adaptive allocation of computation.
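To make the spawn() and join() idea concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: APR trains a language model to decide when to spawn and join, whereas this sketch hard-codes a single spawn after the first step and uses a plain brute-force search in place of model decoding. It assumes the Countdown task means combining all given numbers with +, -, *, and exact division to reach a target. The helper names expand, search, spawn, join, and solve are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def expand(nums):
    """Yield every (remaining_numbers, step) reachable with one operation."""
    for i, j in combinations(range(len(nums)), 2):
        a, b = max(nums[i], nums[j]), min(nums[i], nums[j])
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        options = [(a + b, "+"), (a - b, "-"), (a * b, "*")]
        if b != 0 and a % b == 0:  # allow exact integer division only
            options.append((a // b, "/"))
        for value, op in options:
            yield rest + [value], f"{a}{op}{b}={value}"


def search(nums, target, trace=()):
    """Serial depth-first search, standing in for one reasoning thread."""
    if len(nums) == 1:
        return list(trace) if nums[0] == target else None
    for nxt, step in expand(nums):
        found = search(nxt, target, trace + (step,))
        if found is not None:
            return found
    return None


def spawn(pool, sub_states, target):
    """Hypothetical spawn(): start one child search per sub-state."""
    return [pool.submit(search, nums, target, (step,)) for nums, step in sub_states]


def join(futures):
    """Hypothetical join(): wait for the children and return only their results."""
    return [f.result() for f in futures]


def solve(nums, target):
    """Parent thread: expand the first step, then delegate the rest to children."""
    if len(nums) == 1:
        return [] if nums[0] == target else None
    sub_states = list(expand(nums))
    with ThreadPoolExecutor() as pool:
        results = join(spawn(pool, sub_states, target))
    # The parent sees only each child's final answer, not its full search trace.
    return next((r for r in results if r is not None), None)


if __name__ == "__main__":
    print(solve([3, 5, 7, 2], 24))
```

The point of the sketch is the shape of the computation: the parent's own context holds only the first step and the children's short results, while the bulk of the search happens in the children. Python threads here are only a stand-in for parallel decoding and do not by themselves give a speedup.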
