Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization
May 29, 2025
Authors: Mingzhe Du, Luu Tuan Tuan, Yue Liu, Yuhao Qing, Dong Huang, Xinyi He, Qian Liu, Zejun Ma, See-kiong Ng
cs.AI
Abstract
Large Language Models (LLMs) generate functionally correct solutions but often fall short in code efficiency, a critical bottleneck for real-world deployment. In this paper, we introduce a novel test-time iterative optimization framework to address this, employing a closed-loop system where LLMs iteratively refine code based on empirical performance feedback from an execution sandbox. We explore three training strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Experiments on our Venus dataset and the APPS benchmark show that SFT and DPO rapidly saturate in efficiency gains. In contrast, GRPO, using reinforcement learning (RL) with execution feedback, continuously optimizes code performance, significantly boosting both pass@1 (from 47% to 62%) and the likelihood of outperforming human submissions in efficiency (from 31% to 45%). Our work demonstrates effective test-time code efficiency improvement and critically reveals the power of RL in teaching LLMs to truly self-improve code efficiency.
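To make the closed-loop idea in the abstract concrete, the sketch below shows a minimal test-time refinement loop in Python: a candidate solution is executed in a sandbox, its correctness and measured runtime are turned into textual feedback, and a model is asked for a faster revision, with the fastest correct candidate kept. This is an illustration only, not the paper's Afterburner implementation; the "sandbox" here is a plain subprocess with a timeout, and refine_with_llm is a hypothetical placeholder for the model call.

import subprocess
import tempfile
import time
from pathlib import Path


def run_in_sandbox(code: str, test_input: str, timeout: float = 5.0) -> tuple[bool, float, str]:
    """Run candidate code in a subprocess and measure wall-clock time.

    Returns (exited_ok, elapsed_seconds, stdout). A real sandbox would add
    memory limits, isolation, and repeated timing runs for stable measurements.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    start = time.perf_counter()
    try:
        proc = subprocess.run(
            ["python", path],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return proc.returncode == 0, time.perf_counter() - start, proc.stdout
    except subprocess.TimeoutExpired:
        return False, timeout, ""
    finally:
        Path(path).unlink(missing_ok=True)


def refine_with_llm(problem: str, code: str, feedback: str) -> str:
    """Hypothetical placeholder: prompt an LLM with the problem, the previous
    solution, and the performance feedback, and return a revised solution.
    Here it is a no-op so the sketch runs without a model client."""
    return code


def test_time_optimize(problem: str, initial_code: str, test_input: str,
                       expected_output: str, rounds: int = 4) -> str:
    """Closed-loop refinement: keep the fastest functionally correct candidate."""
    best_code, best_time = initial_code, float("inf")
    code = initial_code
    for _ in range(rounds):
        exited_ok, elapsed, stdout = run_in_sandbox(code, test_input)
        correct = exited_ok and stdout.strip() == expected_output.strip()
        if correct and elapsed < best_time:
            best_code, best_time = code, elapsed
        feedback = (f"correct={correct}, runtime={elapsed:.3f}s, "
                    f"best_so_far={best_time:.3f}s")
        code = refine_with_llm(problem, code, feedback)  # request next candidate
    return best_code

In the paper's framing, the same kind of execution feedback also serves as the training signal for the RL-based GRPO strategy; the sketch above covers only the test-time loop.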