Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning

Abstract

Humans are good at learning on the job: we learn how to solve the tasks we face as we go along. Can a model do the same? We propose an agent that assembles a task-specific curriculum, which we call a test-time curriculum, and applies reinforcement learning to continue training the model on its target task; we refer to the overall method as TTC-RL. The test-time curriculum avoids time-consuming human curation of datasets by automatically selecting the most task-relevant data from a large pool of available training data. Our experiments demonstrate that reinforcement learning on a test-time curriculum consistently improves the model on its target tasks, across a variety of evaluations and models. Notably, on challenging math and coding benchmarks, TTC-RL improves the pass@1 of Qwen3-8B by approximately 1.8x on AIME25 and 2.1x on CodeElo. Moreover, we find that TTC-RL significantly raises the performance ceiling compared to the initial model, increasing pass@8 on AIME25 from 40% to 62% and on CodeElo from 28% to 43%. Our findings show the potential of test-time curricula for extending the test-time scaling paradigm to continual training on thousands of task-relevant experiences at test time.
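The abstract reports results as pass@1 and pass@8 but does not say how these are computed. As a rough illustration only, the sketch below uses the unbiased pass@k estimator that is common in code-generation evaluation; whether TTC-RL's evaluation uses exactly this estimator is an assumption, and the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n sampled attempts, c of them correct.

    Assumption: this is the standard unbiased estimator, not necessarily
    the exact procedure used in the TTC-RL evaluation.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n attempts contains no correct attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, each given as (n_attempts, n_correct)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this reading, pass@1 is the expected accuracy of a single attempt, while pass@8 measures whether at least one of eight attempts succeeds, which is why the abstract treats it as a performance ceiling.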

