TEMPO: Scaling Test-time Training for Large Reasoning Models

Abstract

Test-time training (TTT) adapts model parameters on unlabeled test instances at inference time, continuously extending capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for large reasoning models (LRMs) plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal drifts increasingly as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.
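
For readers less familiar with the EM view, the textbook identity behind the phrase "tightens the ELBO" can be written as below. This is the generic decomposition, not TEMPO's specific objective; the symbols x (observed data), z (latent variable), q (variational distribution), and θ (model parameters) are illustrative assumptions rather than the paper's notation.

```latex
\[
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[\log p_\theta(x, z) - \log q(z)\right]}_{\mathrm{ELBO}(q,\,\theta)}
  + \mathrm{KL}\!\left(q(z)\,\middle\|\,p_\theta(z \mid x)\right)
\]
```

Because the KL term is non-negative, the ELBO never exceeds the log-likelihood. An E-step moves q toward the posterior and shrinks the gap, while an M-step raises the bound in θ. If q is never refreshed while θ keeps changing, the gap can grow, which is one way to read the reward drift described in the abstract.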
