

ORV: 4D Occupancy-centric Robot Video Generation

June 3, 2025
Authors: Xiuyu Yang, Bohan Li, Shaocong Xu, Nan Wang, Chongjie Ye, Zhaoxi Chen, Minghan Qin, Yikang Ding, Xin Jin, Hang Zhao, Hao Zhao
cs.AI

Abstract

Acquiring real-world robotic simulation data through teleoperation is notoriously time-consuming and labor-intensive. Recently, action-driven generative models have gained widespread adoption in robot learning and simulation, as they eliminate safety concerns and reduce maintenance efforts. However, the action sequences used in these methods often result in limited control precision and poor generalization due to their globally coarse alignment. To address these limitations, we propose ORV, an Occupancy-centric Robot Video generation framework, which utilizes 4D semantic occupancy sequences as a fine-grained representation to provide more accurate semantic and geometric guidance for video generation. By leveraging occupancy-based representations, ORV enables seamless translation of simulation data into photorealistic robot videos, while ensuring high temporal consistency and precise controllability. Furthermore, our framework supports the simultaneous generation of multi-view videos of robot gripping operations - an important capability for downstream robotic learning tasks. Extensive experimental results demonstrate that ORV consistently outperforms existing baseline methods across various datasets and sub-tasks. Demo, Code and Model: https://orangesodahub.github.io/ORV
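The abstract does not detail how the 4D semantic occupancy sequences are fed to the video generator, so the following is only a minimal sketch of the general idea: a per-voxel semantic occupancy sequence is encoded into per-frame 2D feature maps that a video generation model could use as conditioning. All names, tensor shapes, the class count, and the Z-folding projection below are illustrative assumptions, not ORV's actual implementation.

```python
# Hypothetical sketch: encode a 4D semantic occupancy sequence (T, X, Y, Z)
# into per-frame conditioning features for a video generator.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 16            # assumed number of semantic classes (robot, object, table, ...)
T, X, Y, Z = 8, 32, 32, 16  # assumed sequence length and voxel grid resolution


class OccupancyEncoder(nn.Module):
    """Encodes a (T, X, Y, Z) semantic occupancy sequence into per-frame 2D feature maps."""

    def __init__(self, num_classes: int, grid_z: int, feat_dim: int = 64):
        super().__init__()
        self.num_classes = num_classes
        # Fold the height axis (Z) into channels, then refine with a 2D convolution.
        self.proj = nn.Conv2d(num_classes * grid_z, feat_dim, kernel_size=3, padding=1)

    def forward(self, occ: torch.Tensor) -> torch.Tensor:
        # occ: (T, X, Y, Z) integer class label per voxel
        t, x, y, z = occ.shape
        onehot = F.one_hot(occ.long(), self.num_classes).float()   # (T, X, Y, Z, C)
        onehot = onehot.permute(0, 4, 3, 1, 2)                     # (T, C, Z, X, Y)
        flat = onehot.reshape(t, self.num_classes * z, x, y)       # fold Z into channels
        return self.proj(flat)                                     # (T, feat_dim, X, Y)


# Usage: these per-frame features would condition the video model, e.g. concatenated
# with noisy video latents or injected via cross-attention; that part is omitted here.
occ_seq = torch.randint(0, NUM_CLASSES, (T, X, Y, Z))
cond = OccupancyEncoder(NUM_CLASSES, Z)(occ_seq)
print(cond.shape)  # torch.Size([8, 64, 32, 32])
```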