Code Aesthetics with Agentic Reward Feedback
October 27, 2025
Authors: Bang Xiao, Lingjie Jiang, Shaohan Huang, Tengchao Lv, Yupan Huang, Xun Wu, Lei Cui, Furu Wei
cs.AI
Abstract
Large Language Models (LLMs) have become valuable assistants for developers
in code-related tasks. While LLMs excel at traditional programming tasks such
as code generation and bug fixing, they struggle with visually oriented coding
tasks, often producing results with suboptimal aesthetics. In this paper, we introduce a new
pipeline to enhance the aesthetic quality of LLM-generated code. We first
construct AesCode-358K, a large-scale instruction-tuning dataset focused on
code aesthetics. Next, we propose agentic reward feedback, in which a multi-agent
system evaluates executability, static aesthetics, and interactive aesthetics.
Building on this, we develop GRPO-AR, which integrates these signals into the
GRPO algorithm for joint optimization of functionality and code aesthetics.
Finally, we present OpenDesign, a benchmark for assessing code aesthetics.
Experimental results show that combining supervised fine-tuning on AesCode-358K
with reinforcement learning using agentic reward feedback significantly
improves performance on OpenDesign and also enhances results on existing
benchmarks such as PandasPlotBench. Notably, our AesCoder-4B surpasses GPT-4o
and GPT-4.1, and achieves performance comparable to large open-source models
with 480B-685B parameters, underscoring the effectiveness of our approach.
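The abstract describes GRPO-AR as folding the multi-agent evaluation signals into GRPO's group-relative optimization. As a rough illustration only (not the paper's implementation), the sketch below shows how executability, static-aesthetics, and interactive-aesthetics scores could be collapsed into a scalar reward and normalized into group-relative advantages; the gating rule, weights, and all function names are assumptions.

```python
# Hypothetical sketch: aggregating multi-agent reward signals and computing
# GRPO-style group-relative advantages. Names and weights are illustrative
# assumptions, not the paper's actual GRPO-AR implementation.
from dataclasses import dataclass
from typing import List


@dataclass
class AgentScores:
    """Scores returned by the (assumed) evaluation agents, each in [0, 1]."""
    executable: float              # did the generated code run without errors?
    static_aesthetics: float       # layout/color quality judged from a rendering
    interactive_aesthetics: float  # quality judged from simulated interactions


def aggregate_reward(s: AgentScores,
                     w_static: float = 0.5,
                     w_interactive: float = 0.5) -> float:
    """Collapse the agent scores into one scalar reward.

    Assumption: non-executable code gets zero reward, so functionality
    gates the aesthetic terms (hypothetical weighting scheme).
    """
    if s.executable < 1.0:
        return 0.0
    return w_static * s.static_aesthetics + w_interactive * s.interactive_aesthetics


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize each completion's reward by the mean
    and standard deviation of its group (all samples for the same prompt)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # One prompt, four sampled completions scored by the (assumed) agents.
    group = [
        AgentScores(1.0, 0.8, 0.6),
        AgentScores(1.0, 0.4, 0.5),
        AgentScores(0.0, 0.9, 0.9),  # fails to execute -> reward gated to 0
        AgentScores(1.0, 0.7, 0.7),
    ]
    rewards = [aggregate_reward(s) for s in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```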