ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding
October 13, 2025
Authors: Yuhang Li, Chenchen Zhang, Ruilin Lv, Ao Liu, Ken Deng, Yuanxing Zhang, Jiaheng Liu, Wiggin Zhou, Bo Zhou
cs.AI
Abstract
While Large Language Models (LLMs) excel at algorithmic code generation, they struggle with front-end development, where correctness is judged on rendered pixels and interaction. We present ReLook, an agentic, vision-grounded reinforcement learning framework that empowers an agent to close a robust generate-diagnose-refine loop by invoking a multimodal LLM (MLLM) as a tool. During training, the agent uses the MLLM-in-the-loop both as a visual critic (scoring code with screenshots) and as a source of actionable, vision-grounded feedback; a strict zero-reward rule for invalid renders anchors renderability and prevents reward hacking. To prevent behavioral collapse, we introduce Forced Optimization, a strict acceptance rule that admits only improving revisions, yielding monotonically better trajectories. At inference, we decouple the critic and run a lightweight, critic-free self-edit cycle, keeping latency comparable to base decoding while retaining most of the gains. Across three widely used benchmarks, ReLook consistently outperforms strong baselines in vision-grounded front-end code generation, highlighting the benefits of agentic perception, visual rewards, and training-inference decoupling.
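
Reading the abstract as a method summary, the training-time reward with its strict zero-reward rule, the Forced Optimization acceptance rule, and the critic-free inference loop can be sketched as follows. This is a minimal sketch of one plausible reading, not the authors' implementation; every helper in it (`render_screenshot`, `mllm_critic_score`, `mllm_visual_feedback`, `agent_revise`) is a hypothetical placeholder for a headless-browser renderer, the MLLM visual critic, and the policy model's self-edit step.

```python
# Minimal sketch of the mechanisms named in the abstract; NOT the authors' code.
# All helpers below are hypothetical placeholders.
from typing import List, Optional


def render_screenshot(code: str) -> Optional[bytes]:
    """Hypothetical: render the page in a headless browser; None if rendering fails."""
    raise NotImplementedError


def mllm_critic_score(code: str, screenshot: bytes) -> float:
    """Hypothetical: MLLM judge returns a scalar score for code plus its screenshot."""
    raise NotImplementedError


def mllm_visual_feedback(code: str) -> str:
    """Hypothetical: MLLM returns actionable, vision-grounded critique text."""
    raise NotImplementedError


def agent_revise(code: str, feedback: Optional[str]) -> str:
    """Hypothetical: the policy model edits its own code, with or without feedback."""
    raise NotImplementedError


def reward(code: str) -> float:
    """Training reward with the strict zero-reward rule: an invalid render scores 0,
    anchoring renderability and discouraging reward hacking."""
    screenshot = render_screenshot(code)
    if screenshot is None:
        return 0.0
    return mllm_critic_score(code, screenshot)


def forced_optimization(initial_code: str, max_revisions: int = 4) -> List[str]:
    """Forced Optimization as described: only strictly improving revisions are accepted,
    so the trajectory of accepted programs is monotonically better."""
    trajectory = [initial_code]
    best = reward(initial_code)
    for _ in range(max_revisions):
        feedback = mllm_visual_feedback(trajectory[-1])
        candidate = agent_revise(trajectory[-1], feedback)
        score = reward(candidate)
        if score > best:  # strict acceptance rule: improving revisions only
            trajectory.append(candidate)
            best = score
        # rejected candidates are discarded; the agent retries from the last accepted code
    return trajectory


def critic_free_self_edit(initial_code: str, max_edits: int = 2) -> str:
    """Inference-time loop: the MLLM critic is decoupled and the agent runs a
    lightweight self-edit cycle, keeping latency close to plain decoding."""
    code = initial_code
    for _ in range(max_edits):
        code = agent_revise(code, feedback=None)
    return code
```

In this reading, the critic is only consulted during training (inside `reward` and `forced_optimization`), while inference relies on the agent's own self-edits, which matches the abstract's claim of latency comparable to base decoding.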