RLTF: Reinforcement Learning from Unit Test Feedback

Abstract

The goal of program synthesis, or code generation, is to generate executable code from a given description. A growing number of recent studies use reinforcement learning (RL) to improve the performance of large language models (LLMs) for code. However, these RL methods rely only on offline frameworks, which limits their exploration of new sample spaces. In addition, current approaches that use unit test signals are rather simple and do not account for specific error locations within the code. To address these issues, we propose RLTF (Reinforcement Learning from Unit Test Feedback), a novel online RL framework that refines code LLMs using unit test feedback of multiple granularities. Our approach generates data in real time during training and simultaneously uses fine-grained feedback signals to guide the model toward producing higher-quality code. Extensive experiments show that RLTF achieves state-of-the-art performance on the APPS and MBPP benchmarks. Our code is available at https://github.com/Zyq-scut/RLTF.
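To make the idea of unit test feedback at more than one granularity concrete, the following is a minimal Python sketch, not the authors' implementation. It runs a candidate program against a few tests and returns a coarse outcome category, the line where an error occurred (a finer-grained signal), and the fraction of tests passed. The function name `unit_test_feedback`, the entry-point name `solve`, and the reward values are all assumptions made for illustration; a real system would also run the candidate in a sandbox.

```python
import traceback

# Illustrative reward values. These are assumptions for this sketch,
# not numbers stated in the abstract.
COARSE_REWARD = {
    "pass": 1.0,
    "failure": -0.3,
    "runtime_error": -0.6,
    "syntax_error": -1.0,
}

FILENAME = "<candidate>"  # pseudo-filename used to find candidate frames in tracebacks


def _error_line(exc):
    """Return the last line number inside the candidate code seen in the traceback."""
    frames = traceback.extract_tb(exc.__traceback__)
    lines = [f.lineno for f in frames if f.filename == FILENAME]
    return lines[-1] if lines else None


def _result(category, error_line, pass_ratio):
    return {
        "category": category,
        "reward": COARSE_REWARD[category],
        "error_line": error_line,
        "pass_ratio": pass_ratio,
    }


def unit_test_feedback(code, tests, func_name="solve"):
    """Run `code` against `tests`, a list of (args_tuple, expected_value) pairs.

    WARNING: exec() on untrusted code is unsafe; this is for illustration only.
    """
    try:
        compiled = compile(code, FILENAME, "exec")
    except SyntaxError as exc:
        return _result("syntax_error", exc.lineno, 0.0)

    namespace = {}
    try:
        exec(compiled, namespace)
        func = namespace[func_name]
    except Exception as exc:
        return _result("runtime_error", _error_line(exc), 0.0)

    passed, had_mismatch = 0, False
    had_runtime, runtime_line = False, None
    for args, expected in tests:
        try:
            result = func(*args)
        except Exception as exc:
            if not had_runtime:  # keep the location of the first runtime error
                had_runtime, runtime_line = True, _error_line(exc)
            continue
        if result == expected:
            passed += 1
        else:
            had_mismatch = True

    pass_ratio = passed / len(tests) if tests else 0.0
    if had_runtime:
        return _result("runtime_error", runtime_line, pass_ratio)
    if had_mismatch:
        return _result("failure", None, pass_ratio)
    return _result("pass", None, pass_ratio)
```

In an online RL loop, a dictionary like this would be computed for each freshly sampled program during training. The category gives a coarse scalar reward, the error line indicates which part of the generated code to penalize, and the pass ratio separates partially correct programs from entirely wrong ones.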
