TEMPO: Scaling Test-time Training for Large Reasoning Models

Abstract

Test-time training (TTT) adapts model parameters on unlabeled test instances at inference time, continuously extending capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for large reasoning models (LRMs) plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal drifts further as the policy model evolves, causing both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we show that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.
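
For readers unfamiliar with the EM framing, the textbook decomposition behind the phrase "tightens the evidence lower bound" can be written as follows. This is the generic identity only, not the paper's derivation: the symbols x (question), y (observed outcome), z (latent variable), q (variational distribution), and \theta (model parameters) are assumptions introduced here for illustration, and the abstract does not specify how TEMPO's policy and critic map onto them.

```latex
% Generic EM / ELBO identity (illustrative notation, not taken from the paper).
\begin{align}
\log p_\theta(y \mid x)
  &= \underbrace{\mathbb{E}_{q(z)}\!\left[\log p_\theta(y, z \mid x) - \log q(z)\right]}_{\mathrm{ELBO}(q,\,\theta)}
   + \mathrm{KL}\!\left(q(z)\,\big\|\,p_\theta(z \mid x, y)\right) \\
  &\ge \mathrm{ELBO}(q, \theta)
  \qquad \text{since } \mathrm{KL}(\cdot \,\|\, \cdot) \ge 0 .
\end{align}
% E-step: update q toward p_\theta(z \mid x, y), shrinking the KL gap (tightening the bound).
% M-step: update \theta to increase ELBO(q, \theta) with q held fixed.
```

Under this reading, a procedure that only repeats one of the two updates leaves the KL gap uncontrolled, which is consistent with the abstract's description of prior methods as incomplete EM variants.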
