

TEMPO: Scaling Test-time Training for Large Reasoning Models

April 21, 2026
Authors: Qingyang Zhang, Xinke Kong, Haitao Wu, Qinghua Hu, Minghao Wu, Baosong Yang, Yu Cheng, Yun Luo, Ganqu Cui, Changqing Zhang
cs.AI

Abstract

Test-time training (TTT) adapts model parameters on unlabeled test instances at inference time, continuously extending capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for large reasoning models (LRMs) plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.
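
The abstract describes TEMPO as an alternating procedure: refine the policy on unlabeled test questions using the critic's self-generated reward, then periodically recalibrate the critic on labeled data so the reward signal does not drift. Below is a minimal Python sketch of that loop under assumed interfaces; the `policy`/`critic` method names, step counts, and update rules are illustrative placeholders, not the paper's implementation.

```python
import random

def tempo_ttt(policy, critic, unlabeled_questions, labeled_dataset,
              num_rounds=10, refine_steps=50, recalib_steps=20, batch_size=8):
    """Hypothetical sketch of TEMPO's alternating test-time training loop."""
    for _ in range(num_rounds):
        # Policy refinement: optimize the policy against the critic's
        # reward on unlabeled test questions (self-generated signal).
        for _ in range(refine_steps):
            questions = random.sample(unlabeled_questions, batch_size)
            responses = policy.generate(questions)
            rewards = critic.score(questions, responses)
            policy.update(questions, responses, rewards)  # e.g. a policy-gradient step

        # Critic recalibration: re-anchor the reward model on labeled examples
        # so the self-generated signal does not drift as the policy evolves.
        for _ in range(recalib_steps):
            batch = random.sample(labeled_dataset, batch_size)
            critic.update(batch)  # supervised calibration step

    return policy, critic
```

In the EM reading given in the abstract, the refinement phase corresponds to updating the policy against the current reward, while the periodic critic recalibration is the step that prior TTT methods omit; reintroducing it is what the authors argue tightens the ELBO and sustains improvement.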