ORV: 4D Occupancy-centric Robot Video Generation
June 3, 2025
Authors: Xiuyu Yang, Bohan Li, Shaocong Xu, Nan Wang, Chongjie Ye, Zhaoxi Chen, Minghan Qin, Yikang Ding, Xin Jin, Hang Zhao, Hao Zhao
cs.AI
Abstract
Acquiring real-world robotic simulation data through teleoperation is
notoriously time-consuming and labor-intensive. Recently, action-driven
generative models have gained widespread adoption in robot learning and
simulation, as they eliminate safety concerns and reduce maintenance efforts.
However, the action sequences used in these methods often result in limited
control precision and poor generalization, because they are aligned with the
generated video only at a coarse, global level. To address these limitations,
we propose ORV, an Occupancy-centric
Robot Video generation framework, which utilizes 4D semantic occupancy
sequences as a fine-grained representation to provide more accurate semantic
and geometric guidance for video generation. By leveraging occupancy-based
representations, ORV enables seamless translation of simulation data into
photorealistic robot videos, while ensuring high temporal consistency and
precise controllability. Furthermore, our framework supports the simultaneous
generation of multi-view videos of robot gripping operations, an important
capability for downstream robotic learning tasks. Extensive experimental
results demonstrate that ORV consistently outperforms existing baseline methods
across various datasets and sub-tasks. Demo, Code and Model:
https://orangesodahub.github.io/ORV
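
To make the occupancy conditioning concrete, below is a minimal sketch (not the authors' code) of what a 4D semantic occupancy sequence might look like as a conditioning tensor for a video generator. The frame count, voxel grid size, and number of semantic classes are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of a 4D semantic occupancy sequence as conditioning input.
# All shapes and the class count are illustrative assumptions.
import numpy as np

T, X, Y, Z = 16, 64, 64, 64   # assumed: 16 frames over a 64^3 voxel grid
NUM_CLASSES = 8               # assumed semantic classes (arm, gripper, objects, ...)

# One semantic class label per voxel per frame: shape (T, X, Y, Z).
# Here filled with random labels as a stand-in for real occupancy predictions.
occupancy_4d = np.random.randint(0, NUM_CLASSES, size=(T, X, Y, Z), dtype=np.uint8)

# A conditional video model would typically consume per-frame channel features,
# e.g. one-hot semantics moved to a channel axis: (T, C, X, Y, Z).
one_hot = np.eye(NUM_CLASSES, dtype=np.float32)[occupancy_4d]  # (T, X, Y, Z, C)
cond = np.moveaxis(one_hot, -1, 1)                             # (T, C, X, Y, Z)
print(cond.shape)  # (16, 8, 64, 64, 64)
```

The key point this illustrates is that, unlike a single global action vector, such a representation specifies semantics and geometry per voxel and per frame, which is the kind of fine-grained guidance the abstract contrasts with coarse action conditioning.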