VLAs and World Models Alone Are Not Enough for Robots

Abstract

Generalist robot intelligence is often framed as a policy-scaling problem: collect more robot demonstrations, train larger Vision-Language-Action (VLA) models, and expect broader generalisation. In this position paper, we argue that this framing is incomplete. The central bottleneck is not only policy learning, but the absence of mechanisms that convert the world's abundant unstructured behavioural data into grounded robot supervision. Human motion, internet video, simulation rollouts, and interactive demonstrations contain rich information about tasks, goals, contacts, failures, and physical constraints, yet most of this information is not directly usable by robot policies because it lacks embodiment-specific action labels, task semantics, and reward structure. We identify four missing components for the next generation of robotics: data interfaces for autolabelling unstructured behaviour, embodiment interfaces for retargeting human motion to robot actions, world-model interfaces for physics-grounded 3D reasoning, and reward interfaces for inferring task progress and success from video and language. We survey recent progress in robot foundation models, cross-embodiment datasets, learning from video, world models, and reward modelling, and propose a research agenda for building robotics systems that can learn not only from robot demonstrations, but from the broader physical world.