Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning
October 6, 2025
Authors: Jonas Hübotter, Leander Diaz-Bone, Ido Hakimi, Andreas Krause, Moritz Hardt
cs.AI
Abstract
Humans are good at learning on the job: We learn how to solve the tasks we
face as we go along. Can a model do the same? We propose an agent that
assembles a task-specific curriculum, called test-time curriculum (TTC-RL), and
applies reinforcement learning to continue training the model for its target
task. The test-time curriculum avoids time-consuming human curation of datasets
by automatically selecting the most task-relevant data from a large pool of
available training data. Our experiments demonstrate that reinforcement
learning on a test-time curriculum consistently improves the model on its
target tasks, across a variety of evaluations and models. Notably, on
challenging math and coding benchmarks, TTC-RL improves the pass@1 of Qwen3-8B
by approximately 1.8x on AIME25 and 2.1x on CodeElo. Moreover, we find that
TTC-RL significantly raises the performance ceiling compared to the initial
model, increasing pass@8 on AIME25 from 40% to 62% and on CodeElo from 28% to
43%. Our findings show the potential of test-time curricula to extend the
test-time scaling paradigm to continual training on thousands of task-relevant
experiences at test time.
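The abstract describes the curriculum step as automatically selecting the most task-relevant data from a large training pool. The paper's actual selection criterion is not given here; the sketch below illustrates one plausible instantiation, nearest-neighbor retrieval by embedding similarity, where `select_curriculum`, the example pool, and all embeddings are hypothetical placeholders.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_curriculum(task_embedding, pool, k):
    # Rank pool examples by similarity to the target task; keep the top-k
    # as the test-time curriculum for continued RL training.
    ranked = sorted(
        pool,
        key=lambda ex: cosine(ex["embedding"], task_embedding),
        reverse=True,
    )
    return ranked[:k]

# Toy pool: each training example carries a precomputed embedding (made up here).
pool = [
    {"id": "geometry-1", "embedding": [1.0, 0.0]},
    {"id": "algebra-7",  "embedding": [0.9, 0.4]},
    {"id": "parsing-3",  "embedding": [0.0, 1.0]},
]
task = [1.0, 0.1]  # embedding of the target task

curriculum = select_curriculum(task, pool, k=2)
print([ex["id"] for ex in curriculum])  # → ['geometry-1', 'algebra-7']
```

The selected examples would then feed a standard RL fine-tuning loop on the model before it attempts the target task; the retrieval-by-similarity choice here is purely illustrative.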