

TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models

February 17, 2026
Authors: Chansung Park, Juyong Jiang, Fan Wang, Sayak Paul, Jiasi Shen, Jing Tang, Jianguo Li
cs.AI

Abstract

Large Language Models (LLMs) are changing the coding paradigm, known as vibe coding, yet synthesizing algorithmically sophisticated and robust code still remains a critical challenge. Incentivizing the deep reasoning capabilities of LLMs is essential to overcoming this hurdle. Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy to address this need. However, most existing approaches overlook the heterogeneous difficulty and granularity inherent in test cases, leading to an imbalanced distribution of reward signals and consequently biased gradient updates during training. To address this, we propose Test-driven and cApability-adaptive cuRriculum reinfOrcement fine-Tuning (TAROT). TAROT systematically constructs, for each problem, a four-tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation. Crucially, TAROT decouples curriculum progression from raw reward scores, enabling capability-conditioned evaluation and principled selection from a portfolio of curriculum policies rather than incidental test-case difficulty composition. This design fosters stable optimization and more efficient competency acquisition. Extensive experimental results reveal that the optimal curriculum for RFT in code generation is closely tied to a model's inherent capability, with less capable models achieving greater gains with an easy-to-hard progression, whereas more competent models excel under a hard-first curriculum. TAROT provides a reproducible method that adaptively tailors curriculum design to a model's capability, thereby consistently improving the functional correctness and robustness of the generated code. All code and data are released to foster reproducibility and advance community research at https://github.com/deep-diver/TAROT.
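To make the abstract's core idea concrete, here is a minimal, illustrative sketch of capability-adaptive curriculum selection over a four-tier test suite. The tier names (basic, intermediate, complex, edge) follow the paper, but every function name, threshold, and the reward formula below are hypothetical placeholders and not the released TAROT implementation; see the repository at https://github.com/deep-diver/TAROT for the authors' actual code.

```python
# Sketch only: tier taxonomy comes from the paper; all names, thresholds,
# and the reward formula are illustrative assumptions, not TAROT's code.
from typing import Dict, List

TIERS = ["basic", "intermediate", "complex", "edge"]


def choose_curriculum(pass_rate_by_tier: Dict[str, float],
                      weak_threshold: float = 0.5) -> List[str]:
    """Pick a curriculum ordering conditioned on measured capability.

    A model that already clears most basic/intermediate tests is treated
    as more capable and gets a hard-first ordering; otherwise it follows
    an easy-to-hard progression, mirroring the paper's high-level finding.
    """
    easy_competence = (pass_rate_by_tier["basic"] +
                       pass_rate_by_tier["intermediate"]) / 2
    if easy_competence >= weak_threshold:
        return list(reversed(TIERS))   # hard-first: edge -> ... -> basic
    return list(TIERS)                 # easy-to-hard: basic -> ... -> edge


def tiered_reward(passed: Dict[str, int], total: Dict[str, int],
                  active_tiers: List[str]) -> float:
    """Average the per-tier pass rates over only the currently active tiers,
    so the reward signal is not dominated by whichever tier happens to
    contain the most test cases."""
    rates = [passed[t] / total[t] for t in active_tiers if total[t] > 0]
    return sum(rates) / len(rates) if rates else 0.0


if __name__ == "__main__":
    # Example: a weaker model (low pass rates on a capability probe)
    # starts training on the basic tier only.
    capability_probe = {"basic": 0.35, "intermediate": 0.20,
                        "complex": 0.10, "edge": 0.05}
    order = choose_curriculum(capability_probe)
    print("curriculum order:", order)
    print("stage-1 reward:",
          tiered_reward({"basic": 3, "intermediate": 0, "complex": 0, "edge": 0},
                        {"basic": 5, "intermediate": 4, "complex": 3, "edge": 2},
                        order[:1]))
```

The point of averaging per-tier pass rates rather than pooling all test cases is the imbalance the abstract describes: if one tier contributes far more cases than the others, a raw pass count would skew the reward and hence the gradient updates toward that tier.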