WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning

Abstract

Agent systems powered by large language models (LLMs) have demonstrated impressive performance on repository-level code-generation tasks. However, for tasks such as website codebase generation, which depend heavily on visual effects and user-interaction feedback, current code agents rely only on simple code execution for feedback and verification. This approach fails to capture the actual quality of the generated code. In this paper, we propose WebGen-Agent, a novel website-generation agent that leverages comprehensive, multi-level visual feedback to iteratively generate and refine the website codebase. A visual language model (VLM) generates detailed and expressive text descriptions and suggestions regarding the screenshots and GUI-agent testing of the websites, together with scores that quantify their quality. The screenshot and GUI-agent scores are further integrated with a backtracking and select-best mechanism, which enhances the performance of the agent. Building on the accurate visual scores inherent in the WebGen-Agent workflow, we further introduce Step-GRPO with Screenshot and GUI-agent Feedback to improve the ability of LLMs to act as the reasoning engine of WebGen-Agent. Using the screenshot and GUI-agent scores at each step as the reward in Step-GRPO provides a dense and reliable process supervision signal, which effectively improves the model's website-generation ability. On the WebGen-Bench dataset, WebGen-Agent increases the accuracy of Claude-3.5-Sonnet from 26.4% to 51.9% and its appearance score from 3.0 to 3.9, outperforming the previous state-of-the-art agent system. Additionally, our Step-GRPO training approach increases the accuracy of Qwen2.5-Coder-7B-Instruct from 38.9% to 45.4% and raises its appearance score from 3.4 to 3.7.
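To make the select-best idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each generation step carries a screenshot score and an optional GUI-agent score, that the two are summed into a single per-step value (which could also serve as a step-level reward), and that ties go to the later step. The names `Step`, `step_reward`, and `select_best` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One generation step. All field names are hypothetical."""
    index: int
    screenshot_score: float            # VLM rating of the rendered page
    gui_score: Optional[float] = None  # GUI-agent test rating, if a test ran


def step_reward(step: Step) -> float:
    # Assumption: the two scores are simply summed, and a missing
    # GUI-agent score contributes 0.
    gui = step.gui_score if step.gui_score is not None else 0.0
    return step.screenshot_score + gui


def select_best(steps: List[Step]) -> Step:
    # Pick the step with the highest combined score.
    # Assumption: ties are broken in favor of the later step.
    if not steps:
        raise ValueError("no steps to select from")
    return max(steps, key=lambda s: (step_reward(s), s.index))


trajectory = [
    Step(index=0, screenshot_score=3.0, gui_score=2.0),
    Step(index=1, screenshot_score=4.0),
    Step(index=2, screenshot_score=3.0, gui_score=2.0),
]
best = select_best(trajectory)
```

In this sketch, an agent could restore the codebase snapshot associated with `best` at the end of a run, or when backtracking after a run of failing steps. How snapshots are stored and when backtracking triggers are left open here because the abstract does not specify them.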
