A Survey of Visual Reinforcement Learning

Abstract

Recent advances at the intersection of reinforcement learning (RL) and visual intelligence have enabled agents that not only perceive complex visual scenes but also reason, generate, and act within them. This survey offers a critical and up-to-date synthesis of the field. We first formalize the visual RL problem and trace two lines of evolution in policy optimization: from RLHF to verifiable-reward paradigms, and from Proximal Policy Optimization to Group Relative Policy Optimization. We then organize more than 200 representative works into four thematic pillars: multi-modal large language models, visual generation, unified model frameworks, and vision-language-action models. For each pillar, we examine algorithmic design, reward engineering, and benchmark progress, and we distill trends such as curriculum-driven training, preference-aligned diffusion, and unified reward modeling. Finally, we review evaluation protocols spanning set-level fidelity, sample-level preference, and state-level stability, and we identify open challenges including sample efficiency, generalization, and safe deployment. Our goal is to provide researchers and practitioners with a coherent map of the rapidly expanding landscape of visual RL and to highlight promising directions for future inquiry. Resources are available at: https://github.com/weijiawu/Awesome-Visual-Reinforcement-Learning.