Code Aesthetics with Agentic Reward Feedback

Abstract

Large Language Models (LLMs) have become valuable assistants for developers in code-related tasks. While LLMs excel at traditional programming tasks such as code generation and bug fixing, they struggle with visually oriented coding tasks and often produce aesthetically suboptimal results. In this paper, we introduce a new pipeline to enhance the aesthetic quality of LLM-generated code. We first construct AesCode-358K, a large-scale instruction-tuning dataset focused on code aesthetics. Next, we propose agentic reward feedback, a multi-agent system that evaluates executability, static aesthetics, and interactive aesthetics. Building on this, we develop GRPO-AR, which integrates these signals into the GRPO algorithm for joint optimization of functionality and code aesthetics. Finally, we develop OpenDesign, a benchmark for assessing code aesthetics. Experimental results show that combining supervised fine-tuning on AesCode-358K with reinforcement learning using agentic reward feedback significantly improves performance on OpenDesign and also improves results on existing benchmarks such as PandasPlotBench. Notably, our AesCoder-4B surpasses GPT-4o and GPT-4.1 and achieves performance comparable to large open-source models with 480B to 685B parameters, underscoring the effectiveness of our approach.
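As a rough illustration of how three reward signals could feed a GRPO-style update, the sketch below combines per-sample scores for executability, static aesthetics, and interactive aesthetics into one scalar and then normalizes rewards within a sampled group, which is the group-relative advantage step GRPO is known for. The abstract does not say how GRPO-AR actually combines the signals, so the weighted sum, the equal default weights, the [0, 1] score range, and every name here (`AgentScores`, `combined_reward`, `group_relative_advantages`) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class AgentScores:
    """Scores from three hypothetical reward agents, each assumed to lie in [0, 1]."""
    executability: float
    static_aesthetics: float
    interactive_aesthetics: float


def combined_reward(scores, weights=(1.0, 1.0, 1.0)):
    """Collapse the three signals into one scalar with a weighted sum.

    The weighted sum and the equal default weights are assumptions;
    the source does not specify the combination rule.
    """
    w_exec, w_static, w_inter = weights
    return (
        w_exec * scores.executability
        + w_static * scores.static_aesthetics
        + w_inter * scores.interactive_aesthetics
    )


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population standard deviation is used here as a simplifying choice.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical completions sampled for one prompt.
    group = [
        AgentScores(1.0, 0.8, 0.6),  # runs and looks good
        AgentScores(0.0, 0.0, 0.0),  # fails to execute
        AgentScores(1.0, 0.4, 0.2),  # runs but looks plain
    ]
    rewards = [combined_reward(s) for s in group]
    for r, a in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={r:.2f} advantage={a:+.3f}")
```

Under these assumptions, a completion that fails to execute receives a negative advantage relative to its group, while better-looking executable completions are pushed up, which is one simple way functionality and aesthetics could be optimized jointly.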
