TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics
February 22, 2026
Authors: Shirui Chen, Cole Harrison, Ying-Chun Lee, Angela Jin Yang, Zhongzheng Ren, Lillian J. Ratliff, Jiafei Duan, Dieter Fox, Ranjay Krishna
cs.AI
Abstract
While Vision-Language-Action (VLA) models have seen rapid progress in pretraining, their advancement in Reinforcement Learning (RL) remains hampered by low sample efficiency and sparse rewards in real-world settings. Developing generalizable process reward models is essential for providing the fine-grained feedback necessary to bridge this gap, yet existing temporal value functions often fail to generalize beyond their training domains. We introduce TOPReward, a novel, probabilistically grounded temporal value function that leverages the latent world knowledge of pretrained video Vision-Language Models (VLMs) to estimate robotic task progress. Unlike prior methods that prompt VLMs to output progress values, an approach prone to numerical misrepresentation, TOPReward extracts task progress directly from the VLM's internal token logits. In zero-shot evaluations across 130+ distinct real-world tasks and multiple robot platforms (e.g., Franka, YAM, SO-100/101), TOPReward achieves a 0.947 mean Value-Order Correlation (VOC) on Qwen3-VL, dramatically outperforming the state-of-the-art GVL baseline, which achieves near-zero correlation on the same open-source model. We further demonstrate that TOPReward serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.
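The abstract does not spell out TOPReward's exact readout, but the core idea of reading a scalar progress estimate from the model's next-token distribution rather than from its decoded text can be sketched briefly. In the minimal Python sketch below, the token ids, the progress bins, and the renormalization over candidate tokens are all illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch (assumptions, not TOPReward's actual readout):
# restrict the VLM's next-token logits to a set of candidate
# "progress" tokens, renormalize, and take the expected bin value.
import torch

def expected_progress(logits: torch.Tensor,
                      bin_token_ids: list[int],
                      bin_values: list[float]) -> float:
    """Probability-weighted progress estimate in [0, 1].

    logits: next-token logits over the full vocabulary, shape (vocab,).
    bin_token_ids: hypothetical ids of tokens encoding progress bins.
    bin_values: the progress value each bin token stands for.
    """
    candidate_logits = logits[bin_token_ids]         # (num_bins,)
    probs = torch.softmax(candidate_logits, dim=-1)  # renormalize over bins only
    return float((probs * torch.tensor(bin_values)).sum())

# Toy usage with random logits over a 32k-token vocabulary;
# ids 100..110 stand in for tokens meaning "0%", "10%", ..., "100%".
logits = torch.randn(32_000)
bin_ids = list(range(100, 111))
bin_vals = [i / 10 for i in range(11)]
print(expected_progress(logits, bin_ids, bin_vals))
```

The evaluation metric, Value-Order Correlation (VOC), compares predicted per-frame values against the frames' true temporal order. Assuming it is computed as a Spearman-style rank correlation (the abstract does not define it), a toy computation looks like this:

```python
# Hedged VOC sketch: rank correlation between predicted progress
# values and ground-truth frame order. Spearman's rho is an
# assumption here; the paper's exact formula may differ.
from scipy.stats import spearmanr

pred_values = [0.10, 0.35, 0.30, 0.80, 0.95]  # per-frame predictions
true_order = [0, 1, 2, 3, 4]                  # chronological ranks
rho, _ = spearmanr(pred_values, true_order)
print(rho)  # 0.9: the single inversion (frames 1 and 2) lowers the score
```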