TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models
February 17, 2026
Authors: Chansung Park, Juyong Jiang, Fan Wang, Sayak Paul, Jiasi Shen, Jing Tang, Jianguo Li
cs.AI
Abstract
Large Language Models (LLMs) are changing the coding paradigm, popularly known as vibe coding, yet synthesizing algorithmically sophisticated and robust code remains a critical challenge. Incentivizing the deep reasoning capabilities of LLMs is essential to overcoming this hurdle, and Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy to address this need. However, most existing approaches overlook the heterogeneous difficulty and granularity inherent in test cases, leading to an imbalanced distribution of reward signals and, consequently, biased gradient updates during training. To address this, we propose Test-driven and cApability-adaptive cuRriculum reinfOrcement fine-Tuning (TAROT). TAROT systematically constructs a four-tier test suite (basic, intermediate, complex, edge) for each problem, providing a controlled difficulty landscape for curriculum design and evaluation. Crucially, TAROT decouples curriculum progression from raw reward scores, enabling capability-conditioned evaluation and principled selection from a portfolio of curriculum policies rather than reliance on the incidental composition of test-case difficulty. This design fosters stable optimization and more efficient competency acquisition. Extensive experimental results reveal that the optimal curriculum for RFT in code generation is closely tied to a model's inherent capability: less capable models achieve greater gains under an easy-to-hard progression, whereas more competent models excel under a hard-first curriculum. TAROT provides a reproducible method that adaptively tailors curriculum design to a model's capability, thereby consistently improving the functional correctness and robustness of the generated code. All code and data are released at https://github.com/deep-diver/TAROT to foster reproducibility and advance community research.
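To make the abstract's high-level description more concrete, the sketch below shows, in Python, how a four-tier test suite and a capability-conditioned choice between an easy-to-hard and a hard-first curriculum could be organized. This is a minimal conceptual sketch, not the released TAROT implementation (see the repository linked above): the names `Problem`, `estimate_capability`, `select_curriculum`, the pass-rate-based capability proxy, and the 0.5 threshold are all hypothetical assumptions introduced here for illustration.

```python
# Conceptual sketch only -- not the official TAROT code.
# Illustrates (1) grouping test cases into the four tiers named in the
# abstract and (2) choosing a curriculum order from measured capability
# rather than from raw reward scores. All names are hypothetical.

from dataclasses import dataclass, field
from typing import Dict, List

# The four difficulty tiers described in the abstract.
TIERS = ["basic", "intermediate", "complex", "edge"]


@dataclass
class Problem:
    prompt: str
    # Test cases grouped by tier, e.g. {"basic": [...], "edge": [...]}.
    tests: Dict[str, List[str]] = field(default_factory=dict)


def estimate_capability(pass_rates: Dict[str, float]) -> float:
    """Hypothetical capability proxy: mean pass rate across the four tiers."""
    return sum(pass_rates.get(t, 0.0) for t in TIERS) / len(TIERS)


def select_curriculum(capability: float, threshold: float = 0.5) -> List[str]:
    """Capability-conditioned policy selection (assumed threshold rule):
    weaker models get an easy-to-hard progression, stronger models a
    hard-first curriculum, mirroring the trend reported in the abstract."""
    easy_to_hard = list(TIERS)
    hard_first = list(reversed(TIERS))
    return easy_to_hard if capability < threshold else hard_first


if __name__ == "__main__":
    # A toy problem with tiered test cases (tiers may hold many tests each).
    problem = Problem(
        prompt="Return the k largest elements of a list.",
        tests={
            "basic": ["assert top_k([3, 1, 2], 1) == [3]"],
            "edge": ["assert top_k([], 0) == []"],
        },
    )

    # Toy per-tier pass rates from a hypothetical pre-training evaluation run.
    pass_rates = {"basic": 0.9, "intermediate": 0.6, "complex": 0.2, "edge": 0.1}
    cap = estimate_capability(pass_rates)
    print(f"estimated capability: {cap:.2f}")          # 0.45 -> "weaker" model
    print("curriculum order:", select_curriculum(cap))  # easy-to-hard
```

The actual framework presumably estimates capability and schedules tiers in a more principled way; the point of the sketch is only that the curriculum order is a function of the model's measured capability over the tiered test suite, not of raw reward magnitude.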