ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

April 9, 2026
Authors: Jindi Lv, Hao Li, Jie Li, Yifei Nie, Fankun Kong, Yang Wang, Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, Hengtao Li, Jiancheng Lv, Guan Huang
cs.AI

Abstract

Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning addresses this via value functions, which assess task progress and guide policy improvement. However, existing value models built on vision-language models (VLMs) struggle to capture temporal dynamics, undermining reliable value estimation in long-horizon tasks. In this paper, we propose ViVa, a video-generative value model that repurposes a pretrained video generator for value estimation. Taking the current observation and robot proprioception as input, ViVa jointly predicts future proprioception and a scalar value for the current state. By leveraging the spatiotemporal priors of a pretrained video generator, our approach grounds value estimation in anticipated embodiment dynamics, moving beyond static snapshots to intrinsically couple value with foresight. Integrated into RECAP, ViVa delivers substantial improvements on a real-world box-assembly task. Qualitative analysis across three tasks confirms that ViVa produces more reliable value signals that accurately reflect task progress. By leveraging spatiotemporal priors from video corpora, ViVa also generalizes to novel objects, highlighting the promise of video-generative models for value estimation.
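
To make the interface described in the abstract concrete, below is a minimal PyTorch sketch of a model that takes the current observation and proprioception and jointly outputs future proprioception plus a scalar value. This is not the authors' implementation: the class names, the tiny CNN standing in for the pretrained video-generator backbone, and all dimensions (7-DoF proprioception, 64x64 images, horizon of 8) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DummyVideoBackbone(nn.Module):
    """Stand-in for the pretrained video generator's encoder. The real
    ViVa backbone is a video-generation model; this tiny CNN only keeps
    the example self-contained."""

    def __init__(self, proprio_dim: int, feat_dim: int):
        super().__init__()
        self.visual = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fuse = nn.Linear(16 + proprio_dim, feat_dim)

    def forward(self, obs: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        # obs: (B, 3, H, W) current camera frame; proprio: (B, proprio_dim)
        return self.fuse(torch.cat([self.visual(obs), proprio], dim=-1))


class ViVaSketch(nn.Module):
    """Joint heads matching the interface in the abstract: given the
    current observation and proprioception, predict future proprioception
    over a short horizon and a scalar value for the current state."""

    def __init__(self, backbone: nn.Module, feat_dim: int,
                 proprio_dim: int, horizon: int):
        super().__init__()
        self.backbone = backbone
        self.horizon = horizon
        self.proprio_dim = proprio_dim
        self.proprio_head = nn.Linear(feat_dim, horizon * proprio_dim)
        self.value_head = nn.Linear(feat_dim, 1)

    def forward(self, obs: torch.Tensor, proprio: torch.Tensor):
        feats = self.backbone(obs, proprio)            # (B, feat_dim)
        future_proprio = self.proprio_head(feats).view(
            -1, self.horizon, self.proprio_dim)        # (B, T, proprio_dim)
        value = self.value_head(feats).squeeze(-1)     # (B,) scalar values
        return future_proprio, value


# Shape check only; a 7-DoF arm seen through one 64x64 camera is assumed.
model = ViVaSketch(DummyVideoBackbone(proprio_dim=7, feat_dim=64),
                   feat_dim=64, proprio_dim=7, horizon=8)
future_q, v = model(torch.randn(2, 3, 64, 64), torch.randn(2, 7))
print(future_q.shape, v.shape)  # torch.Size([2, 8, 7]) torch.Size([2])
```

The key design point the abstract emphasizes is that both heads share the video generator's features, so the value estimate is conditioned on the same representation used to anticipate future embodiment dynamics rather than on a static snapshot alone.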