Robots Need More Than VLAs and World Models

Abstract

Generalist robot intelligence is often framed as a policy-scaling problem: collect more robot demonstrations, train larger Vision-Language-Action (VLA) models, and expect broader generalisation. In this position paper, we argue that this framing is incomplete. The central bottleneck is not only policy learning, but also the absence of mechanisms that convert the world's abundant unstructured behavioural data into grounded robot supervision. Human motion, internet video, simulation rollouts, and interactive demonstrations contain rich information about tasks, goals, contacts, failures, and physical constraints, yet most of this information is not directly usable by robot policies because it lacks embodiment-specific action labels, task semantics, and reward structure. We identify four components missing from the next generation of robotics: data interfaces for autolabelling unstructured behaviour, embodiment interfaces for retargeting human motion to robot actions, world-model interfaces for physics-grounded 3D reasoning, and reward interfaces for inferring task progress, success, and failure from video and language. We survey recent progress in robot foundation models, cross-embodiment datasets, learning from video, world models, and reward modelling, and propose a research agenda for building robotic systems that can learn not only from robot demonstrations, but also from the broader physical world.