The Art of Scaling Test-Time Compute for Large Language Models

Abstract

Test-time scaling (TTS), the dynamic allocation of compute during inference, is a promising direction for improving reasoning in large language models (LLMs). However, a systematic comparison of well-known TTS strategies under identical conditions is missing, and the influence of model type and problem difficulty on performance remains unclear. To address these gaps, we conduct the first large-scale study of TTS, spanning over thirty billion tokens generated using eight open-source LLMs (7B to 235B parameters) across four reasoning datasets. We observe three consistent trends: (1) no single TTS strategy universally dominates; (2) reasoning models exhibit distinct trace-quality patterns across problem difficulty and trace length, forming short-horizon and long-horizon categories; and (3) for a given model type, the optimal TTS performance scales monotonically with compute budget. Based on these insights, we provide a practical recipe for selecting the best TTS strategy given problem difficulty, model type, and compute budget.
