TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models

Abstract

Large Language Models (LLMs) are changing how software is written, a shift popularly known as vibe coding, yet synthesizing algorithmically sophisticated and robust code remains a critical challenge. Overcoming this hurdle requires incentivizing the deep reasoning capabilities of LLMs, and Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy for doing so. However, most existing approaches overlook the heterogeneous difficulty and granularity inherent in test cases, which leads to an imbalanced distribution of reward signals and, consequently, biased gradient updates during training. To address this, we propose Test-driven and cApability-adaptive cuRriculum reinfOrcement fine-Tuning (TAROT). For each problem, TAROT systematically constructs a four-tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation. Crucially, TAROT decouples curriculum progression from raw reward scores. This enables capability-conditioned evaluation and principled selection from a portfolio of curriculum policies, rather than reliance on the incidental difficulty composition of the test cases. The design fosters stable optimization and more efficient competency acquisition. Extensive experiments show that the optimal curriculum for RFT in code generation is closely tied to a model's inherent capability: less capable models gain more from an easy-to-hard progression, whereas more capable models excel under a hard-first curriculum. TAROT provides a reproducible method that adapts curriculum design to a model's capability, thereby consistently improving the functional correctness and robustness of the generated code. All code and data are released at https://github.com/deep-diver/TAROT to foster reproducibility and advance community research.
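As a rough illustration of the ideas in the abstract, the sketch below shows one way a four-tier test suite could feed a curriculum-weighted reward. It is not the authors' implementation: the abstract gives neither a reward formula nor a schedule, so the function names (`tier_pass_rates`, `curriculum_weights`, `weighted_reward`), the policy labels, and the distance-based weighting rule are all hypothetical. The only elements taken from the text are the four tier names and the two curriculum directions (easy-to-hard and hard-first). Curriculum progress is passed in as a plain training fraction, which is one possible reading of "decoupled from raw reward scores".

```python
from typing import Dict, List

# Tier names come from the abstract; the ordering (easiest to hardest) is assumed.
TIERS = ("basic", "intermediate", "complex", "edge")


def tier_pass_rates(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of passed tests per tier; a tier with no tests scores 0.0."""
    rates = {}
    for tier in TIERS:
        outcomes = results.get(tier, [])
        rates[tier] = sum(outcomes) / len(outcomes) if outcomes else 0.0
    return rates


def curriculum_weights(policy: str, progress: float) -> Dict[str, float]:
    """Normalized tier weights for a training progress value in [0, 1].

    Hypothetical rule: a focus point moves across the tiers as training
    advances, and each tier is weighted by its closeness to that point.
    """
    last = len(TIERS) - 1
    if policy == "easy_to_hard":
        focus = progress * last
    elif policy == "hard_first":
        focus = (1.0 - progress) * last
    else:
        raise ValueError(f"unknown policy: {policy}")
    raw = [1.0 / (1.0 + abs(i - focus)) for i in range(len(TIERS))]
    total = sum(raw)
    return {tier: w / total for tier, w in zip(TIERS, raw)}


def weighted_reward(results: Dict[str, List[bool]], weights: Dict[str, float]) -> float:
    """Scalar reward: tier pass rates combined with the curriculum weights."""
    rates = tier_pass_rates(results)
    return sum(weights[tier] * rates[tier] for tier in TIERS)


# Example: a candidate solution that passes the easy tiers but fails the edge cases.
example_results = {
    "basic": [True, True],
    "intermediate": [True, False],
    "complex": [False, False],
    "edge": [False],
}
early_easy = weighted_reward(example_results, curriculum_weights("easy_to_hard", 0.0))
early_hard = weighted_reward(example_results, curriculum_weights("hard_first", 0.0))
```

Under these assumptions, the same candidate solution earns a higher reward early in an easy-to-hard schedule than early in a hard-first one, because the hard-first weights emphasize the tiers it fails.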
