Code Aesthetics with Agentic Reward Feedback
October 27, 2025
Authors: Bang Xiao, Lingjie Jiang, Shaohan Huang, Tengchao Lv, Yupan Huang, Xun Wu, Lei Cui, Furu Wei
cs.AI
Abstract
Large Language Models (LLMs) have become valuable assistants for developers
in code-related tasks. While LLMs excel at traditional programming tasks such
as code generation and bug fixing, they struggle with visually oriented coding
tasks, often producing code with suboptimal aesthetics. In this paper, we introduce a new
pipeline to enhance the aesthetic quality of LLM-generated code. We first
construct AesCode-358K, a large-scale instruction-tuning dataset focused on
code aesthetics. Next, we propose agentic reward feedback, in which a multi-agent
system evaluates code executability, static aesthetics, and interactive aesthetics.
Building on this, we develop GRPO-AR, which integrates these signals into the
GRPO algorithm for joint optimization of functionality and code aesthetics.
Finally, we build OpenDesign, a benchmark for assessing code aesthetics.
Experimental results show that combining supervised fine-tuning on AesCode-358K
with reinforcement learning using agentic reward feedback significantly
improves performance on OpenDesign and also enhances results on existing
benchmarks such as PandasPlotBench. Notably, our AesCoder-4B surpasses GPT-4o
and GPT-4.1, and achieves performance comparable to large open-source models
with 480B-685B parameters, underscoring the effectiveness of our approach.
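To make the reward design described in the abstract more concrete, below is a minimal, hypothetical sketch of how the three agentic reward signals (executability, static aesthetics, interactive aesthetics) could be combined into a scalar reward and normalized into GRPO-style group-relative advantages. The weights, function names, and score ranges are illustrative assumptions, not the paper's actual implementation.

```python
import statistics

# Hypothetical sketch: combine three agent scores (assumed in [0, 1]) into a
# single reward, then compute group-relative advantages over rollouts for the
# same prompt, as in GRPO-style training. Weights are illustrative only.

def combined_reward(executability: float, static_aes: float,
                    interactive_aes: float,
                    weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted sum of executability, static, and interactive aesthetic scores."""
    w_exec, w_static, w_inter = weights
    return w_exec * executability + w_static * static_aes + w_inter * interactive_aes

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each rollout's reward by the group mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: four rollouts for one prompt, scored by the (hypothetical) agents.
scores = [
    combined_reward(1.0, 0.8, 0.7),   # runs, looks good, interacts well
    combined_reward(1.0, 0.4, 0.5),   # runs, weaker aesthetics
    combined_reward(0.0, 0.0, 0.0),   # fails to execute
    combined_reward(1.0, 0.9, 0.9),   # best overall
]
print(group_relative_advantages(scores))
```

In this sketch, rollouts that execute and score well on both aesthetic axes receive positive advantages relative to their group, which is the kind of joint functionality-plus-aesthetics signal the abstract attributes to GRPO-AR.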