Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning
October 6, 2025
Authors: Jonas Hübotter, Leander Diaz-Bone, Ido Hakimi, Andreas Krause, Moritz Hardt
cs.AI
Abstract
Humans are good at learning on the job: We learn how to solve the tasks we
face as we go along. Can a model do the same? We propose an agent that
assembles a task-specific curriculum, called a test-time curriculum (TTC-RL), and
applies reinforcement learning to continue training the model for its target
task. The test-time curriculum avoids time-consuming human curation of datasets
by automatically selecting the most task-relevant data from a large pool of
available training data. Our experiments demonstrate that reinforcement
learning on a test-time curriculum consistently improves the model on its
target tasks, across a variety of evaluations and models. Notably, on
challenging math and coding benchmarks, TTC-RL improves the pass@1 of Qwen3-8B
by approximately 1.8x on AIME25 and 2.1x on CodeElo. Moreover, we find that
TTC-RL significantly raises the performance ceiling compared to the initial
model, increasing pass@8 on AIME25 from 40% to 62% and on CodeElo from 28% to
43%. Our findings show the potential of test-time curricula in extending the
test-time scaling paradigm to continual training on thousands of task-relevant
experiences at test time.
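
To make the pipeline concrete, the sketch below shows one plausible way to assemble a test-time curriculum: rank a pool of training examples by embedding similarity to the target task and keep the top k, then hand the selected examples to a reinforcement-learning loop. The encoder choice, the similarity rule, and the function names here are illustrative assumptions, not the paper's exact selection procedure.

```python
# A minimal sketch of test-time curriculum selection, assuming the agent
# picks the most task-relevant examples from a data pool by embedding
# similarity. The encoder, pool format, and selection rule are assumptions
# for illustration only.
import numpy as np
from sentence_transformers import SentenceTransformer

def build_test_time_curriculum(target_task: str, pool: list[str], k: int = 1000) -> list[str]:
    """Return the k pool examples most similar to the target task."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical encoder choice
    task_emb = encoder.encode([target_task])[0]        # shape (d,)
    pool_emb = encoder.encode(pool)                    # shape (n, d)
    # Cosine similarity between the target task and every pool example.
    sims = pool_emb @ task_emb
    sims /= np.linalg.norm(pool_emb, axis=1) * np.linalg.norm(task_emb)
    top_k = np.argsort(-sims)[:k]
    return [pool[i] for i in top_k]

# The selected curriculum would then feed a standard RL fine-tuning loop
# (e.g., policy-gradient training with verifiable rewards) that continues
# training the model before it attempts the target task.
```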