

Reinforcement Learning in Vision: A Survey

August 11, 2025
Authors: Weijia Wu, Chen Gao, Joya Chen, Kevin Qinghong Lin, Qingwei Meng, Yiming Zhang, Yuke Qiu, Hong Zhou, Mike Zheng Shou
cs.AI

Abstract

Recent advances at the intersection of reinforcement learning (RL) and visual intelligence have enabled agents that not only perceive complex visual scenes but also reason, generate, and act within them. This survey offers a critical and up-to-date synthesis of the field. We first formalize visual RL problems and trace the evolution of policy-optimization strategies from RLHF to verifiable reward paradigms, and from Proximal Policy Optimization (PPO) to Group Relative Policy Optimization (GRPO). We then organize more than 200 representative works into four thematic pillars: multi-modal large language models, visual generation, unified model frameworks, and vision-language-action models. For each pillar, we examine algorithmic design, reward engineering, and benchmark progress, and distill trends such as curriculum-driven training, preference-aligned diffusion, and unified reward modeling. Finally, we review evaluation protocols spanning set-level fidelity, sample-level preference, and state-level stability, and we identify open challenges that include sample efficiency, generalization, and safe deployment. Our goal is to provide researchers and practitioners with a coherent map of the rapidly expanding landscape of visual RL and to highlight promising directions for future inquiry. Resources are available at: https://github.com/weijiawu/Awesome-Visual-Reinforcement-Learning.
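As a minimal illustration (not drawn from the survey itself), the key difference the abstract alludes to in the shift from PPO to GRPO is the baseline: instead of a learned value function, GRPO normalizes each sampled response's reward against the statistics of its own sampling group. A sketch of that group-relative advantage computation, with the function name chosen here for illustration:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Compute GRPO-style advantages for a group of G responses
    to the same prompt: each reward is standardized against the
    group mean and standard deviation (no value network needed).
    """
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    # eps guards against division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, scored by a (verifiable) reward:
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

By construction the advantages sum to (approximately) zero within each group, so above-average responses are reinforced and below-average ones suppressed, without training a separate critic as PPO does.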