ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding

October 13, 2025
作者: Yuhang Li, Chenchen Zhang, Ruilin Lv, Ao Liu, Ken Deng, Yuanxing Zhang, Jiaheng Liu, Wiggin Zhou, Bo Zhou
cs.AI

Abstract

While Large Language Models (LLMs) excel at algorithmic code generation, they struggle with front-end development, where correctness is judged on rendered pixels and interaction. We present ReLook, an agentic, vision-grounded reinforcement learning framework that empowers an agent to close a robust generate--diagnose--refine loop by invoking a multimodal LLM (MLLM) as a tool. During training, the agent uses the MLLM-in-the-loop both as a visual critic--scoring code with screenshots--and as a source of actionable, vision-grounded feedback; a strict zero-reward rule for invalid renders anchors renderability and prevents reward hacking. To prevent behavioral collapse, we introduce Forced Optimization, a strict acceptance rule that admits only improving revisions, yielding monotonically better trajectories. At inference, we decouple the critic and run a lightweight, critic-free self-edit cycle, keeping latency comparable to base decoding while retaining most of the gains. Across three widely used benchmarks, ReLook consistently outperforms strong baselines in vision-grounded front-end code generation, highlighting the benefits of agentic perception, visual rewards, and training-inference decoupling.
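To make the described mechanisms concrete, the sketch below illustrates the two training-time rules from the abstract: the strict zero-reward rule for invalid renders and the Forced Optimization acceptance criterion that only keeps strictly improving revisions. This is a minimal illustration, not the paper's implementation; all helpers (render_screenshot, mllm_critic_score, mllm_feedback, generate_code, revise_code) are hypothetical stand-in stubs.

```python
# Minimal sketch of the vision-grounded reward and the Forced Optimization
# acceptance rule described in the abstract. Every helper below is an
# illustrative stub, not the paper's actual API.
from typing import Optional


def render_screenshot(code: str) -> Optional[bytes]:
    """Placeholder for a headless-browser render; returns None on failure."""
    return b"fake-screenshot" if "<html" in code.lower() else None


def mllm_critic_score(code: str, screenshot: bytes) -> float:
    """Placeholder for the MLLM critic that scores code from its screenshot."""
    return min(1.0, len(code) / 1000.0)  # stand-in heuristic, not the real critic


def mllm_feedback(code: str, screenshot: bytes) -> str:
    """Placeholder for actionable, vision-grounded feedback from the MLLM."""
    return "Align the header and fix the broken button styling."


def generate_code(prompt: str) -> str:
    """Placeholder initial generation."""
    return "<html><body><h1>draft</h1></body></html>"


def revise_code(prompt: str, code: str, feedback: str) -> str:
    """Placeholder revision step conditioned on critic feedback."""
    return code.replace("draft", "draft (revised per: " + feedback + ")")


def vision_grounded_reward(code: str) -> float:
    """Strict zero reward for invalid renders; otherwise the critic's score."""
    screenshot = render_screenshot(code)
    if screenshot is None:  # invalid render -> zero reward, no partial credit
        return 0.0
    return mllm_critic_score(code, screenshot)


def generate_diagnose_refine(prompt: str, max_rounds: int = 3) -> str:
    """Forced Optimization: keep a revision only if it strictly improves the
    reward, so the sequence of accepted candidates is monotonically better."""
    best_code = generate_code(prompt)
    best_score = vision_grounded_reward(best_code)
    for _ in range(max_rounds):
        screenshot = render_screenshot(best_code)
        if screenshot is None:
            break
        feedback = mllm_feedback(best_code, screenshot)
        candidate = revise_code(prompt, best_code, feedback)
        score = vision_grounded_reward(candidate)
        if score > best_score:  # strict acceptance criterion
            best_code, best_score = candidate, score
    return best_code


if __name__ == "__main__":
    print(generate_diagnose_refine("Build a landing page with a hero header."))
```

At inference time, per the abstract, the critic is decoupled: the same loop would run without the MLLM scoring call, as a lightweight self-edit cycle whose latency stays close to base decoding.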