SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
November 19, 2025
Authors: Senyu Fei, Siyin Wang, Li Ji, Ao Li, Shiduo Zhang, Liming Liu, Jinlong Hou, Jingjing Gong, Xianzhong Zhao, Xipeng Qiu
cs.AI
Abstract
Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model's own successful trajectories, generated within the current training batch, as a self-reference. This allows us to assign a progress-wise reward to failed attempts. A core innovation is the use of latent world representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain-specific fine-tuning, we utilize the compressed, transferable encodings from a world model's latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO's efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state-of-the-art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO-Plus benchmark.
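To make the self-referential reward idea concrete, the sketch below scores a failed rollout by how much of the start-to-goal distance it covers in a world-model latent space, using the successful rollouts from the same training batch as references. The names (progress_reward, encode_fn), the stand-in encoder, and the distance-ratio heuristic are illustrative assumptions for exposition, not the paper's exact formulation.

```python
# Minimal sketch of a self-referential, progress-wise reward, assuming a frozen
# world-model encoder that maps raw observations to compact latent vectors.
# The distance-ratio heuristic below is an assumption, not SRPO's exact reward.
import numpy as np

def progress_reward(failed_traj, success_trajs, encode_fn, eps=1e-8):
    """Dense reward for a failed rollout, measured against successful rollouts
    from the same training batch (the self-reference).

    failed_traj / success_trajs: lists of observations (e.g. image arrays).
    encode_fn: maps one observation to a 1-D latent vector.
    Returns the average fraction of the start-to-goal latent distance covered."""
    z_last = encode_fn(failed_traj[-1])                           # furthest latent state reached
    z_goal = np.stack([encode_fn(t[-1]) for t in success_trajs])  # terminal latents of successes
    z_start = np.stack([encode_fn(t[0]) for t in success_trajs])  # initial latents of successes
    total = np.linalg.norm(z_goal - z_start, axis=1) + eps        # full start-to-goal distance
    remaining = np.linalg.norm(z_goal - z_last, axis=1)           # distance still left to cover
    progress = np.clip(1.0 - remaining / total, 0.0, 1.0)         # covered fraction per reference
    return float(progress.mean())

# Toy usage with a stand-in encoder (flatten + downsample); a real setup would
# use the latent space of a pretrained world model instead.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    encode_fn = lambda obs: obs.reshape(-1)[::4].astype(np.float32)
    success = [[rng.standard_normal((8, 8)) for _ in range(5)] for _ in range(3)]
    failed = [rng.standard_normal((8, 8)) for _ in range(4)]
    print(progress_reward(failed, success, encode_fn))
```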