SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
November 19, 2025
Authors: Senyu Fei, Siyin Wang, Li Ji, Ao Li, Shiduo Zhang, Liming Liu, Jinlong Hou, Jingjing Gong, Xianzhong Zhao, Xipeng Qiu
cs.AI
Abstract
Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model's own successful trajectories, generated within the current training batch, as a self-reference. This allows us to assign a progress-wise reward to failed attempts. A core innovation is the use of latent world representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain-specific fine-tuning, we utilize the compressed, transferable encodings from a world model's latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO's efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state-of-the-art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO-Plus benchmark.
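To make the self-referential, progress-wise reward concrete, here is a minimal sketch (not the authors' released code) of how failed rollouts in a batch could be scored against that same batch's successful rollouts in a world model's latent space. The function names, the cosine-similarity progress measure, and the choice of final success-state latents as references are illustrative assumptions; the paper's exact progress metric may differ.

```python
# Illustrative sketch of a self-referential, progress-wise reward.
# Assumptions (not from the paper's code): per-step world-model latents are
# precomputed, progress is measured by cosine similarity to the final latent
# of successful rollouts from the same batch.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two latent vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def self_referential_rewards(latents, successes):
    """
    latents:   list of arrays, each of shape (T_i, D) holding the per-step
               world-model latent states of rollout i
    successes: list of bools, binary task outcome of rollout i
    Returns one scalar reward per rollout.
    """
    # Reference set: final latent states of this batch's own successful rollouts.
    refs = [z[-1] for z, ok in zip(latents, successes) if ok]
    rewards = []
    for z, ok in zip(latents, successes):
        if ok:
            rewards.append(1.0)                 # successes keep the binary reward
        elif refs:
            # Progress = how close the failed rollout ever got, in latent space,
            # to any successful end state generated in the same batch.
            progress = max(cosine(s, r) for s in z for r in refs)
            rewards.append(max(0.0, progress))  # dense, self-referential shaping
        else:
            rewards.append(0.0)                 # no in-batch reference available
    return rewards
```

The property this sketch preserves is that the shaping signal comes entirely from the current training batch: when no rollout succeeds, the reward falls back to the sparse binary case, and no external demonstrations or hand-designed reward terms are introduced.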