Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization
May 29, 2025
Authors: Mingzhe Du, Luu Tuan Tuan, Yue Liu, Yuhao Qing, Dong Huang, Xinyi He, Qian Liu, Zejun Ma, See-kiong Ng
cs.AI
Abstract
Large Language Models (LLMs) generate functionally correct solutions but
often fall short in code efficiency, a critical bottleneck for real-world
deployment. In this paper, we introduce a novel test-time iterative
optimization framework to address this, employing a closed-loop system where
LLMs iteratively refine code based on empirical performance feedback from an
execution sandbox. We explore three training strategies: Supervised Fine-Tuning
(SFT), Direct Preference Optimization (DPO), and Group Relative Policy
Optimization (GRPO). Experiments on our Venus dataset and the APPS benchmark
show that SFT and DPO rapidly saturate in efficiency gains. In contrast, GRPO,
using reinforcement learning (RL) with execution feedback, continuously
optimizes code performance, significantly boosting both pass@1 (from 47% to
62%) and the likelihood of outperforming human submissions in efficiency (from
31% to 45%). Our work demonstrates effective test-time code efficiency
improvement and critically reveals the power of RL in teaching LLMs to truly
self-improve code efficiency.
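The closed-loop idea described in the abstract (generate code, run it, feed the measured performance back, refine) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `run_in_sandbox`, `optimize`, and `scripted_refiner` are hypothetical names, the `refine` callable stands in for an LLM call, and plain `exec` is used only for brevity and provides no real isolation.

```python
import time
from typing import Callable, List, Optional, Tuple

TestCase = Tuple[tuple, object]  # (positional args, expected result)
# Hypothetical stand-in for an LLM: (task, previous code, feedback) -> new code.
Refiner = Callable[[str, Optional[str], str], str]


def run_in_sandbox(code: str, tests: List[TestCase],
                   entry: str = "solve") -> Tuple[bool, float]:
    """Toy 'sandbox': run the candidate, check the tests, and time them.

    Assumption: exec() is used purely for illustration. It is NOT isolated;
    a real system would run untrusted code in a separate, restricted process.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        fn = namespace[entry]
        start = time.perf_counter()
        passed = all(fn(*args) == expected for args, expected in tests)
        elapsed = time.perf_counter() - start
    except Exception:
        # Syntax errors, missing entry point, or runtime failures all count as a fail.
        return False, float("inf")
    return passed, elapsed


def optimize(task: str, refine: Refiner, tests: List[TestCase],
             iterations: int = 3):
    """Iteratively refine code using execution feedback; keep the fastest correct one."""
    best_code, best_time = None, float("inf")
    code, feedback = None, "no previous attempt"
    history = []
    for _ in range(iterations):
        code = refine(task, code, feedback)
        passed, elapsed = run_in_sandbox(code, tests)
        history.append((passed, elapsed))
        if passed and elapsed < best_time:
            best_code, best_time = code, elapsed
        # The measured result becomes the feedback for the next round.
        feedback = f"passed={passed}, runtime={elapsed:.6f}s"
    return best_code, best_time, history


def scripted_refiner(candidates: List[str]) -> Refiner:
    """Replays fixed candidates in order, ignoring its inputs (a real model would not)."""
    remaining = iter(candidates)

    def refine(task: str, previous_code: Optional[str], feedback: str) -> str:
        return next(remaining)

    return refine
```

In this sketch only functionally correct candidates can become the running best, so correctness gates efficiency. How the paper's framework combines the two signals during GRPO training is not specified in this abstract.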