Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning
October 6, 2025
Authors: Jonas Hübotter, Leander Diaz-Bone, Ido Hakimi, Andreas Krause, Moritz Hardt
cs.AI
Abstract
Humans are good at learning on the job: We learn how to solve the tasks we
face as we go along. Can a model do the same? We propose an agent that
assembles a task-specific curriculum, called a test-time curriculum (TTC-RL), and
applies reinforcement learning to continue training the model for its target
task. The test-time curriculum avoids time-consuming human curation of datasets
by automatically selecting the most task-relevant data from a large pool of
available training data. Our experiments demonstrate that reinforcement
learning on a test-time curriculum consistently improves the model on its
target tasks, across a variety of evaluations and models. Notably, on
challenging math and coding benchmarks, TTC-RL improves the pass@1 of Qwen3-8B
by approximately 1.8x on AIME25 and 2.1x on CodeElo. Moreover, we find that
TTC-RL significantly raises the performance ceiling compared to the initial
model, increasing pass@8 on AIME25 from 40% to 62% and on CodeElo from 28% to
43%. Our findings show the potential of test-time curricula in extending the
test-time scaling paradigm to continual training on thousands of task-relevant
experiences at test time.
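
To make the pipeline concrete, the sketch below shows one plausible way to assemble a test-time curriculum: rank a pool of training examples by embedding similarity to the target task and keep the top k, then hand the selected examples to a reinforcement-learning loop. The encoder choice, the similarity rule, and the function names here are illustrative assumptions, not the paper's exact selection procedure.

```python
# A minimal sketch of test-time curriculum selection, assuming the agent
# picks the most task-relevant examples from a data pool by embedding
# similarity. The encoder, pool format, and selection rule are assumptions
# for illustration only.
import numpy as np
from sentence_transformers import SentenceTransformer

def build_test_time_curriculum(target_task: str, pool: list[str], k: int = 1000) -> list[str]:
    """Return the k pool examples most similar to the target task."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical encoder choice
    task_emb = encoder.encode([target_task])[0]        # shape (d,)
    pool_emb = encoder.encode(pool)                    # shape (n, d)
    # Cosine similarity between the target task and every pool example.
    sims = pool_emb @ task_emb
    sims /= np.linalg.norm(pool_emb, axis=1) * np.linalg.norm(task_emb)
    top_k = np.argsort(-sims)[:k]
    return [pool[i] for i in top_k]

# The selected curriculum would then feed a standard RL fine-tuning loop
# (e.g., policy-gradient training with verifiable rewards) that continues
# training the model before it attempts the target task.
```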