Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization

Abstract

Large Language Models (LLMs) generate functionally correct solutions but often fall short in code efficiency, a critical bottleneck for real-world deployment. In this paper, we introduce a novel test-time iterative optimization framework to address this, employing a closed-loop system in which LLMs iteratively refine code based on empirical performance feedback from an execution sandbox. We explore three training strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Experiments on our Venus dataset and the APPS benchmark show that SFT and DPO rapidly saturate in efficiency gains. In contrast, GRPO, which uses reinforcement learning (RL) with execution feedback, continuously optimizes code performance, significantly boosting both pass@1 (from 47% to 62%) and the likelihood of outperforming human submissions in efficiency (from 31% to 45%). Our work demonstrates effective test-time code efficiency improvement and reveals the power of RL in teaching LLMs to truly self-improve code efficiency.
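The closed-loop refinement described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_in_sandbox`, `is_better`, and `optimize` are hypothetical names, the `refine` callback stands in for an LLM call, and the "sandbox" is approximated by timing a `solve` function in-process. The sketch covers only the test-time loop; it does not model SFT, DPO, or GRPO training.

```python
import time
from typing import Callable, List, Tuple

# Feedback returned by the (hypothetical) sandbox: (all tests passed, measured cost).
Feedback = Tuple[bool, float]


def run_in_sandbox(source: str, tests: List[Tuple[int, int]]) -> Feedback:
    """Stand-in for an execution sandbox: run `solve` from `source` and time it.

    Assumption: candidates define a function named `solve`. Plain exec() gives
    no isolation, so a real system would need a properly sandboxed runner.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        solve = namespace["solve"]
        start = time.perf_counter()
        passed = all(solve(arg) == expected for arg, expected in tests)
        elapsed = time.perf_counter() - start
    except Exception:
        # Candidates that fail to compile or crash count as failing.
        return False, float("inf")
    return passed, elapsed


def is_better(new: Feedback, old: Feedback) -> bool:
    """Prefer correct code over incorrect code, then lower cost."""
    new_ok, new_cost = new
    old_ok, old_cost = old
    if new_ok != old_ok:
        return new_ok
    return new_ok and new_cost < old_cost


def optimize(
    initial: str,
    refine: Callable[[str, Feedback], str],
    evaluate: Callable[[str], Feedback],
    rounds: int = 3,
) -> Tuple[str, Feedback]:
    """Iteratively refine code, feeding measured feedback back to `refine`."""
    best_code = current = initial
    best_feedback = feedback = evaluate(initial)
    for _ in range(rounds):
        # `refine` is a placeholder for the model that proposes a new version.
        current = refine(current, feedback)
        feedback = evaluate(current)
        if is_better(feedback, best_feedback):
            best_code, best_feedback = current, feedback
    return best_code, best_feedback
```

In use, `evaluate` would typically wrap `run_in_sandbox` with a fixed test suite, and `refine` would prompt the model with the previous code and its measured feedback. Keeping the best correct candidate seen so far means a regression in a later round does not discard earlier gains.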
