Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning

Abstract

Humans are good at learning on the job: we learn how to solve the tasks we face as we go along. Can a model do the same? We propose an agent that assembles a task-specific curriculum, called a test-time curriculum, and applies reinforcement learning to continue training the model for its target task; we refer to this approach as TTC-RL. The test-time curriculum avoids time-consuming human curation of datasets by automatically selecting the most task-relevant data from a large pool of available training data. Our experiments demonstrate that reinforcement learning on a test-time curriculum consistently improves the model on its target tasks, across a variety of evaluations and models. Notably, on challenging math and coding benchmarks, TTC-RL improves the pass@1 of Qwen3-8B by approximately 1.8x on AIME25 and 2.1x on CodeElo. Moreover, we find that TTC-RL significantly raises the performance ceiling compared to the initial model, increasing pass@8 on AIME25 from 40% to 62% and on CodeElo from 28% to 43%. Our findings show the potential of test-time curricula in extending the test-time scaling paradigm to continual training on thousands of task-relevant experiences at test time.
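For readers unfamiliar with the pass@1 and pass@8 figures quoted above, the following is a minimal sketch of how pass@k is commonly estimated from sampled solutions. It uses the widely used unbiased combinatorial estimator; whether the authors compute their numbers this way is an assumption, and the function name `pass_at_k` and its parameters are hypothetical, chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (illustrative helper, not from the paper).

    n: number of solutions sampled for the problem
    c: number of those solutions that are correct
    k: attempt budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k subset
    # contains at least one correct solution.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to the fraction of correct samples, c / n, while larger k measures whether any of k attempts succeeds, which is why pass@8 is read above as a performance ceiling.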
