Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning
October 6, 2025
Authors: Jonas Hübotter, Leander Diaz-Bone, Ido Hakimi, Andreas Krause, Moritz Hardt
cs.AI
Abstract
Humans are good at learning on the job: We learn how to solve the tasks we
face as we go along. Can a model do the same? We propose an agent that
assembles a task-specific curriculum, called test-time curriculum (TTC-RL), and
applies reinforcement learning to continue training the model for its target
task. The test-time curriculum avoids time-consuming human curation of datasets
by automatically selecting the most task-relevant data from a large pool of
available training data. Our experiments demonstrate that reinforcement
learning on a test-time curriculum consistently improves the model on its
target tasks, across a variety of evaluations and models. Notably, on
challenging math and coding benchmarks, TTC-RL improves the pass@1 of Qwen3-8B
by approximately 1.8x on AIME25 and 2.1x on CodeElo. Moreover, we find that
TTC-RL significantly raises the performance ceiling compared to the initial
model, increasing pass@8 on AIME25 from 40% to 62% and on CodeElo from 28% to
43%. Our findings show the potential of test-time curricula to extend the
test-time scaling paradigm to continual training on thousands of task-relevant
experiences at test time.
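The abstract describes the curriculum step as automatically selecting the most task-relevant data from a large training pool. The paper's actual selection criterion is not given here; the sketch below illustrates one plausible instantiation, nearest-neighbor retrieval by embedding similarity, where `select_curriculum`, the example pool, and all embeddings are hypothetical placeholders.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_curriculum(task_embedding, pool, k):
    # Rank pool examples by similarity to the target task; keep the top-k
    # as the test-time curriculum for continued RL training.
    ranked = sorted(
        pool,
        key=lambda ex: cosine(ex["embedding"], task_embedding),
        reverse=True,
    )
    return ranked[:k]

# Toy pool: each training example carries a precomputed embedding (made up here).
pool = [
    {"id": "geometry-1", "embedding": [1.0, 0.0]},
    {"id": "algebra-7",  "embedding": [0.9, 0.4]},
    {"id": "parsing-3",  "embedding": [0.0, 1.0]},
]
task = [1.0, 0.1]  # embedding of the target task

curriculum = select_curriculum(task, pool, k=2)
print([ex["id"] for ex in curriculum])  # → ['geometry-1', 'algebra-7']
```

The selected examples would then feed a standard RL fine-tuning loop on the model before it attempts the target task; the retrieval-by-similarity choice here is purely illustrative.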