ORV: 4D Occupancy-centric Robot Video Generation
June 3, 2025
Authors: Xiuyu Yang, Bohan Li, Shaocong Xu, Nan Wang, Chongjie Ye, Zhaoxi Chen, Minghan Qin, Yikang Ding, Xin Jin, Hang Zhao, Hao Zhao
cs.AI
Abstract
Acquiring real-world robotic simulation data through teleoperation is
notoriously time-consuming and labor-intensive. Recently, action-driven
generative models have gained widespread adoption in robot learning and
simulation, as they eliminate safety concerns and reduce maintenance efforts.
However, the action sequences used in these methods often result in limited
control precision and poor generalization, because they are aligned with the
generated video only at a coarse, global level. To address these limitations,
we propose ORV, an Occupancy-centric
Robot Video generation framework, which utilizes 4D semantic occupancy
sequences as a fine-grained representation to provide more accurate semantic
and geometric guidance for video generation. By leveraging occupancy-based
representations, ORV enables seamless translation of simulation data into
photorealistic robot videos, while ensuring high temporal consistency and
precise controllability. Furthermore, our framework supports the simultaneous
generation of multi-view videos of robot gripping operations, an important
capability for downstream robotic learning tasks. Extensive experimental
results demonstrate that ORV consistently outperforms existing baseline methods
across various datasets and sub-tasks. Demo, Code and Model:
https://orangesodahub.github.io/ORV
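
To make the occupancy conditioning concrete, below is a minimal sketch (not the authors' code) of what a 4D semantic occupancy sequence might look like as a conditioning tensor for a video generator. The frame count, voxel grid size, and number of semantic classes are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of a 4D semantic occupancy sequence as conditioning input.
# All shapes and the class count are illustrative assumptions.
import numpy as np

T, X, Y, Z = 16, 64, 64, 64   # assumed: 16 frames over a 64^3 voxel grid
NUM_CLASSES = 8               # assumed semantic classes (arm, gripper, objects, ...)

# One semantic class label per voxel per frame: shape (T, X, Y, Z).
# Here filled with random labels as a stand-in for real occupancy predictions.
occupancy_4d = np.random.randint(0, NUM_CLASSES, size=(T, X, Y, Z), dtype=np.uint8)

# A conditional video model would typically consume per-frame channel features,
# e.g. one-hot semantics moved to a channel axis: (T, C, X, Y, Z).
one_hot = np.eye(NUM_CLASSES, dtype=np.float32)[occupancy_4d]  # (T, X, Y, Z, C)
cond = np.moveaxis(one_hot, -1, 1)                             # (T, C, X, Y, Z)
print(cond.shape)  # (16, 8, 64, 64, 64)
```

The key point this illustrates is that, unlike a single global action vector, such a representation specifies semantics and geometry per voxel and per frame, which is the kind of fine-grained guidance the abstract contrasts with coarse action conditioning.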