TEMPO: Scaling Test-time Training for Large Reasoning Models

April 21, 2026
Authors: Qingyang Zhang, Xinke Kong, Haitao Wu, Qinghua Hu, Minghao Wu, Baosong Yang, Yu Cheng, Yun Luo, Ganqu Cui, Changqing Zhang
cs.AI

Abstract

Test-time training (TTT) adapts model parameters on unlabeled test instances at inference time, continually extending capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for large reasoning models (LRMs) plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.
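
As a rough illustration of the alternating procedure the abstract describes, the minimal Python sketch below interleaves policy refinement on unlabeled test questions with periodic critic recalibration on labeled data. It is a sketch under assumed interfaces: the `Policy` and `Critic` protocols and the `recalib_every` period are hypothetical placeholders, not the paper's implementation or its EM derivation.

```python
# Minimal sketch of a TEMPO-style alternating loop, as described in the
# abstract. All interfaces here (Policy, Critic, recalib_every) are
# hypothetical placeholders, not the authors' actual API.
from typing import Protocol, Sequence


class Policy(Protocol):
    def sample(self, question: str) -> list[str]: ...
    def update(self, question: str, traces: Sequence[str],
               rewards: Sequence[float]) -> None: ...


class Critic(Protocol):
    def score(self, question: str, trace: str) -> float: ...
    def fit(self, labeled_dataset: Sequence[tuple[str, str]]) -> None: ...


def tempo_loop(policy: Policy, critic: Critic,
               unlabeled_questions: Sequence[str],
               labeled_dataset: Sequence[tuple[str, str]],
               recalib_every: int = 100) -> None:
    """Interleave policy refinement on unlabeled test questions with
    periodic critic recalibration on labeled data (EM-style alternation)."""
    for step, question in enumerate(unlabeled_questions, start=1):
        # Policy refinement: score self-generated reasoning traces with the
        # critic and update the policy on this (unlabeled) test question.
        traces = policy.sample(question)
        rewards = [critic.score(question, t) for t in traces]
        policy.update(question, traces, rewards)

        # Periodic recalibration: re-anchor the critic on labeled data so the
        # self-generated reward signal does not drift as the policy evolves.
        if step % recalib_every == 0:
            critic.fit(labeled_dataset)
```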