TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models

Abstract

Large Language Models (LLMs) are reshaping how software is written, a shift often called vibe coding, yet synthesizing algorithmically sophisticated and robust code remains a critical challenge. Overcoming it requires incentivizing the deep reasoning capabilities of LLMs, and Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy for doing so. However, most existing approaches overlook the heterogeneous difficulty and granularity of test cases. As a result, reward signals are distributed unevenly and gradient updates during training are biased. To address this, we propose Test-driven and cApability-adaptive cuRriculum reinfOrcement fine-Tuning (TAROT). For each problem, TAROT systematically constructs a four-tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation. Crucially, TAROT decouples curriculum progression from raw reward scores. The curriculum is therefore selected in a principled way from a portfolio of curriculum policies and evaluated conditioned on model capability, rather than being determined by the incidental mix of test-case difficulties. This design fosters stable optimization and more efficient competency acquisition. Extensive experiments show that the optimal curriculum for RFT in code generation is closely tied to a model's inherent capability: less capable models gain more from an easy-to-hard progression, whereas more capable models excel under a hard-first curriculum. TAROT provides a reproducible method that adaptively tailors curriculum design to a model's capability, consistently improving the functional correctness and robustness of the generated code. All code and data are released at https://github.com/deep-diver/TAROT to foster reproducibility and advance community research.
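
As a purely illustrative sketch of how a four-tier test suite could feed a curriculum-weighted reward, the Python below blends per-tier pass rates using weights that shift with training progress. It is not the TAROT implementation: the function names (`tier_weights`, `curriculum_reward`), the policy labels, the `progress` parameter, and the inverse-distance weighting are all assumptions made for this example. Only the four tier names and the easy-to-hard and hard-first orderings come from the abstract.

```python
from typing import Dict

# Tier names as listed in the abstract, ordered from easiest to hardest.
TIERS = ("basic", "intermediate", "complex", "edge")


def tier_weights(policy: str, progress: float) -> Dict[str, float]:
    """Return normalized per-tier weights at a training progress in [0, 1].

    Hypothetical scheme: a "focus" position slides across the tiers as
    training advances, and each tier is weighted by its inverse distance
    to that focus. This is an assumption, not the paper's schedule.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    n = len(TIERS)
    if policy == "uniform":
        return {tier: 1.0 / n for tier in TIERS}
    if policy == "easy_to_hard":
        focus = progress * (n - 1)          # starts at basic, ends at edge
    elif policy == "hard_first":
        focus = (1.0 - progress) * (n - 1)  # starts at edge, ends at basic
    else:
        raise ValueError(f"unknown policy: {policy}")
    raw = [1.0 / (1.0 + abs(i - focus)) for i in range(n)]
    total = sum(raw)
    return {tier: w / total for tier, w in zip(TIERS, raw)}


def curriculum_reward(pass_rates: Dict[str, float], policy: str, progress: float) -> float:
    """Blend per-tier pass rates (each in [0, 1]) into one scalar reward."""
    weights = tier_weights(policy, progress)
    return sum(weights[tier] * pass_rates[tier] for tier in TIERS)


if __name__ == "__main__":
    # A candidate program that passes only the basic-tier tests.
    rates = {"basic": 1.0, "intermediate": 0.0, "complex": 0.0, "edge": 0.0}
    for policy in ("easy_to_hard", "hard_first", "uniform"):
        print(policy, round(curriculum_reward(rates, policy, progress=0.0), 3))
```

Note that in this sketch the weights depend only on the chosen policy and on training progress, never on the rewards observed so far. That is one simple way to keep curriculum progression separate from raw reward scores.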
