

TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics

February 22, 2026
作者: Shirui Chen, Cole Harrison, Ying-Chun Lee, Angela Jin Yang, Zhongzheng Ren, Lillian J. Ratliff, Jiafei Duan, Dieter Fox, Ranjay Krishna
cs.AI

Abstract

While Vision-Language-Action (VLA) models have seen rapid progress in pretraining, their advancement in Reinforcement Learning (RL) remains hampered by low sample efficiency and sparse rewards in real-world settings. Developing generalizable process reward models is essential for providing the fine-grained feedback necessary to bridge this gap, yet existing temporal value functions often fail to generalize beyond their training domains. We introduce TOPReward, a novel, probabilistically grounded temporal value function that leverages the latent world knowledge of pretrained video Vision-Language Models (VLMs) to estimate robotic task progress. Unlike prior methods that prompt VLMs to directly output progress values, which are prone to numerical misrepresentation, TOPReward extracts task progress directly from the VLM's internal token logits. In zero-shot evaluations across 130+ distinct real-world tasks and multiple robot platforms (e.g., Franka, YAM, SO-100/101), TOPReward achieves 0.947 mean Value-Order Correlation (VOC) on Qwen3-VL, dramatically outperforming the state-of-the-art GVL baseline, which achieves near-zero correlation on the same open-source model. We further demonstrate that TOPReward serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.
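The abstract's two key ingredients can be illustrated with a minimal sketch. Below, `expected_progress` shows one plausible way to read a progress estimate out of a VLM's token logits rather than its decoded text (softmax over candidate progress-value tokens, then an expectation), and `value_order_correlation` shows one plausible reading of a value-order metric (Spearman rank correlation between per-frame values and their temporal order, assuming no ties). Both function names and formulas are illustrative assumptions, not the paper's implementation.

```python
import math

def expected_progress(logits_by_value):
    """Hypothetical logit readout: map candidate progress values
    (e.g. 0.0, 0.5, 1.0) to their token logits, softmax the logits,
    and return the probability-weighted expected progress."""
    m = max(logits_by_value.values())
    # subtract the max logit for numerical stability before exponentiating
    weights = {v: math.exp(l - m) for v, l in logits_by_value.items()}
    z = sum(weights.values())
    return sum(v * w / z for v, w in weights.items())

def value_order_correlation(values):
    """Illustrative VOC: Spearman rank correlation between predicted
    per-frame values and the frame index (assumes no tied values)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    rank = [0] * n
    for r, i in enumerate(order):
        rank[i] = r
    # Spearman's rho via the squared rank-difference formula
    d2 = sum((rank[i] - i) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Symmetric logits around 0.5 give an expected progress of exactly 0.5,
# and monotonically increasing per-frame values give a VOC of 1.0.
ep = expected_progress({0.0: 0.0, 0.5: 2.0, 1.0: 0.0})
voc = value_order_correlation([0.1, 0.2, 0.3, 0.4])
```

A perfectly monotone value sequence yields a correlation of 1.0, which is why the reported 0.947 mean VOC indicates near-monotone progress estimates across trajectories.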