

DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

April 22, 2026
作者: Hyeonwoo Kim, Jeonghwan Kim, Kyungwon Cho, Hanbyul Joo
cs.AI

Abstract

Recent advances in video generative models enable the synthesis of realistic human-object interaction videos across a wide range of scenarios and object categories, including complex dexterous manipulations that are difficult to capture with motion capture systems. While the rich interaction knowledge embedded in these synthetic videos holds strong potential for motion planning in dexterous robotic manipulation, their limited physical fidelity and purely 2D nature make them difficult to use directly as imitation targets in physics-based character control. We present DeVI (Dexterous Video Imitation), a novel framework that leverages text-conditioned synthetic videos to enable physically plausible dexterous agent control for interacting with unseen target objects. To overcome the imprecision of generative 2D cues, we introduce a hybrid tracking reward that integrates 3D human tracking with robust 2D object tracking. Unlike methods relying on high-quality 3D kinematic demonstrations, DeVI requires only the generated video, enabling zero-shot generalization across diverse objects and interaction types. Extensive experiments demonstrate that DeVI outperforms existing approaches that imitate 3D human-object interaction demonstrations, particularly in modeling dexterous hand-object interactions. We further validate the effectiveness of DeVI in multi-object scenes and text-driven action diversity, showcasing the advantage of using video as an HOI-aware motion planner.
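The abstract's central technical idea is a hybrid tracking reward that combines a 3D human-tracking term (from a pose estimate of the generated video) with a 2D object-tracking term (since no reliable 3D object pose is available from synthetic video). The paper's exact formulation is not given here; the following is a minimal sketch of one plausible shape for such a reward, where all function names, weights, and error scales are hypothetical choices, not the authors' values:

```python
import numpy as np

def hybrid_tracking_reward(
    sim_joints_3d,       # (J, 3) simulated character joint positions
    ref_joints_3d,       # (J, 3) 3D human pose tracked from the video
    sim_obj_pts_2d,      # (K, 2) simulated object keypoints projected to image
    ref_obj_pts_2d,      # (K, 2) 2D object keypoints tracked in the video
    w_human=0.5, w_obj=0.5,   # hypothetical mixing weights
    k_human=10.0, k_obj=5.0,  # hypothetical error-to-reward scales
):
    """Blend a 3D human-tracking reward with a 2D object-tracking reward.

    Each term is an exponentiated negative mean tracking error, a common
    pattern in physics-based character imitation; both terms lie in (0, 1],
    so the blended reward does too.
    """
    human_err = np.mean(np.linalg.norm(sim_joints_3d - ref_joints_3d, axis=-1))
    obj_err = np.mean(np.linalg.norm(sim_obj_pts_2d - ref_obj_pts_2d, axis=-1))
    r_human = np.exp(-k_human * human_err)  # 1.0 when the human pose matches exactly
    r_obj = np.exp(-k_obj * obj_err)        # 1.0 when projected object points match
    return w_human * r_human + w_obj * r_obj
```

Keeping the object term purely 2D avoids depending on the (unreliable) depth of objects in generated video, while the human term can use full 3D pose, which monocular human trackers recover comparatively robustly.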