

ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

April 9, 2026
Authors: Jindi Lv, Hao Li, Jie Li, Yifei Nie, Fankun Kong, Yang Wang, Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, Hengtao Li, Jiancheng Lv, Guan Huang
cs.AI

Abstract

Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning addresses this via value functions, which assess task progress and guide policy improvement. However, existing value models built on vision-language models (VLMs) struggle to capture temporal dynamics, undermining reliable value estimation in long-horizon tasks. In this paper, we propose ViVa, a video-generative value model that repurposes a pretrained video generator for value estimation. Taking the current observation and robot proprioception as input, ViVa jointly predicts future proprioception and a scalar value for the current state. By leveraging the spatiotemporal priors of a pretrained video generator, our approach grounds value estimation in anticipated embodiment dynamics, moving beyond static snapshots to intrinsically couple value with foresight. Integrated into RECAP, ViVa delivers substantial improvements on real-world box assembly. Qualitative analysis across all three tasks confirms that ViVa produces more reliable value signals, accurately reflecting task progress. By leveraging spatiotemporal priors from video corpora, ViVa also generalizes to novel objects, highlighting the promise of video-generative models for value estimation.
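
To make the joint-prediction interface concrete, below is a minimal sketch assuming hypothetical module names and dimensions not given in the abstract. A plain Transformer encoder stands in for the pretrained video-generator trunk, and the RECAP integration and video generation itself are omitted; this illustrates only the input/output structure described above (current observation plus proprioception in, future proprioception plus a scalar value out), not the paper's actual implementation.

```python
# Minimal sketch of a video-generative value model head (hypothetical
# names and dimensions). A plain Transformer encoder stands in for the
# pretrained video-generator trunk that ViVa repurposes.
import torch
import torch.nn as nn


class VideoGenerativeValueModel(nn.Module):
    """Jointly predicts future proprioception and a scalar state value
    from the current observation and robot proprioception."""

    def __init__(self, obs_dim=512, proprio_dim=14, hidden=512, horizon=8):
        super().__init__()
        self.horizon = horizon
        self.obs_proj = nn.Linear(obs_dim, hidden)        # visual tokens -> hidden
        self.proprio_proj = nn.Linear(proprio_dim, hidden)
        # Stand-in for the pretrained video-generator trunk.
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        # Two heads share the trunk: future proprioception and a scalar value.
        self.proprio_head = nn.Linear(hidden, horizon * proprio_dim)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs_tokens, proprio):
        # obs_tokens: (B, T, obs_dim) visual features of the current observation
        # proprio:    (B, proprio_dim) current joint/gripper state
        x = torch.cat([self.obs_proj(obs_tokens),
                       self.proprio_proj(proprio).unsqueeze(1)], dim=1)
        h = self.trunk(x).mean(dim=1)                     # pooled shared latent
        future = self.proprio_head(h).view(-1, self.horizon, proprio.shape[-1])
        value = self.value_head(h).squeeze(-1)
        return future, value


model = VideoGenerativeValueModel()
future_proprio, value = model(torch.randn(2, 16, 512), torch.randn(2, 14))
print(future_proprio.shape, value.shape)  # torch.Size([2, 8, 14]) torch.Size([2])
```

The design point the sketch tries to capture is that the value head reads the same shared latent that drives future-proprioception prediction, so the value estimate is grounded in anticipated embodiment dynamics rather than a static snapshot of the current frame.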