Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control
June 2, 2025
Authors: Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, Dahua Lin
cs.AI
Abstract
Recent advances in video diffusion models have demonstrated strong potential
for generating robotic decision-making data, with trajectory conditions further
enabling fine-grained control. However, existing trajectory-based methods
primarily focus on the motion of individual objects and struggle to capture
the multi-object interactions that are crucial in complex robotic manipulation. This
limitation arises from multi-feature entanglement in overlapping regions, which
leads to degraded visual fidelity. To address this, we present RoboMaster, a
novel framework that models inter-object dynamics through a collaborative
trajectory formulation. Unlike prior methods that decompose objects, our core
idea is to decompose the interaction process into three sub-stages: pre-interaction,
interaction, and post-interaction. Each stage is modeled using the feature of
the dominant object, specifically the robotic arm in the pre- and
post-interaction phases and the manipulated object during interaction, thereby
mitigating the drawback of multi-object feature fusion present during
interaction in prior work. To further ensure subject semantic consistency
throughout the video, we incorporate appearance- and shape-aware latent
representations for objects. Extensive experiments on the challenging Bridge V2
dataset, as well as in-the-wild evaluation, demonstrate that our method
outperforms existing approaches, establishing new state-of-the-art performance
in trajectory-controlled video generation for robotic manipulation.
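
The three-stage decomposition described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how stage boundaries are found or how features are injected into the diffusion model, so the function name `collaborative_trajectory`, the `StagePoint` record, and the `start`/`end` frame indices are all hypothetical. The sketch only shows the bookkeeping idea of following a single dominant object per stage (the arm before and after the interaction, the manipulated object during it) rather than mixing both at once.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class StagePoint:
    frame: int
    stage: str      # "pre", "interaction", or "post"
    dominant: str   # "arm" or "object"
    position: Point


def collaborative_trajectory(
    arm_path: List[Point],
    object_path: List[Point],
    start: int,
    end: int,
) -> List[StagePoint]:
    """Merge two per-frame paths into one path that follows the dominant object.

    Frames in [start, end) are treated as the interaction stage (assumed to be
    given; how they are obtained is outside this sketch).
    """
    if len(arm_path) != len(object_path):
        raise ValueError("arm_path and object_path must cover the same frames")
    if not 0 <= start <= end <= len(arm_path):
        raise ValueError("start and end must satisfy 0 <= start <= end <= num_frames")

    merged: List[StagePoint] = []
    for t, (arm_xy, obj_xy) in enumerate(zip(arm_path, object_path)):
        if t < start:
            # Pre-interaction: the arm moves alone, so the arm dominates.
            merged.append(StagePoint(t, "pre", "arm", arm_xy))
        elif t < end:
            # Interaction: follow the manipulated object only.
            merged.append(StagePoint(t, "interaction", "object", obj_xy))
        else:
            # Post-interaction: the arm retreats, so the arm dominates again.
            merged.append(StagePoint(t, "post", "arm", arm_xy))
    return merged


# Toy example: five frames, interaction on frames 2 and 3.
arm = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
obj = [(2.0, 2.0), (2.0, 2.0), (2.0, 2.0), (3.0, 3.0), (3.0, 3.0)]
traj = collaborative_trajectory(arm, obj, start=2, end=4)
```

In this toy setup every frame carries exactly one dominant label, which mirrors the abstract's motivation of avoiding multi-object feature fusion in overlapping regions. A real system would attach appearance- and shape-aware latent features to each stage instead of the plain string labels used here.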