DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
April 22, 2026
Authors: Hyeonwoo Kim, Jeonghwan Kim, Kyungwon Cho, Hanbyul Joo
cs.AI
Abstract
Recent advances in video generative models enable the synthesis of realistic human-object interaction videos across a wide range of scenarios and object categories, including complex dexterous manipulations that are difficult to capture with motion capture systems. While the rich interaction knowledge embedded in these synthetic videos holds strong potential for motion planning in dexterous robotic manipulation, their limited physical fidelity and purely 2D nature make them difficult to use directly as imitation targets in physics-based character control. We present DeVI (Dexterous Video Imitation), a novel framework that leverages text-conditioned synthetic videos to enable physically plausible dexterous agent control for interacting with unseen target objects. To overcome the imprecision of generative 2D cues, we introduce a hybrid tracking reward that integrates 3D human tracking with robust 2D object tracking. Unlike methods relying on high-quality 3D kinematic demonstrations, DeVI requires only the generated video, enabling zero-shot generalization across diverse objects and interaction types. Extensive experiments demonstrate that DeVI outperforms existing approaches that imitate 3D human-object interaction demonstrations, particularly in modeling dexterous hand-object interactions. We further validate the effectiveness of DeVI in multi-object scenes and text-driven action diversity, showcasing the advantage of using video as an HOI-aware motion planner.
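The abstract describes a hybrid tracking reward that combines 3D human tracking with 2D object tracking. The paper does not specify its exact formulation here, so the following is a minimal hypothetical sketch of how such a reward could be composed: an exponentiated 3D joint-position error (in meters) blended with an exponentiated 2D object-keypoint error (in pixels). All function names, weights, and scales (`w_body`, `w_obj`, `sigma_body`, `sigma_obj`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hybrid_tracking_reward(joints_3d, ref_joints_3d,
                           obj_kps_2d, ref_obj_kps_2d,
                           w_body=0.6, w_obj=0.4,
                           sigma_body=0.1, sigma_obj=20.0):
    """Hypothetical hybrid reward (not the paper's formulation).

    joints_3d / ref_joints_3d : (J, 3) simulated vs. reference body
        joint positions in meters, e.g. from a 3D human tracker.
    obj_kps_2d / ref_obj_kps_2d : (K, 2) projected vs. video-tracked
        object keypoints in pixels, e.g. from a 2D object tracker.
    """
    # Mean Euclidean tracking error for each modality.
    body_err = np.mean(np.linalg.norm(joints_3d - ref_joints_3d, axis=-1))
    obj_err = np.mean(np.linalg.norm(obj_kps_2d - ref_obj_kps_2d, axis=-1))

    # Map each error to (0, 1] with a Gaussian-style kernel, then blend.
    r_body = np.exp(-(body_err / sigma_body) ** 2)
    r_obj = np.exp(-(obj_err / sigma_obj) ** 2)
    return w_body * r_body + w_obj * r_obj
```

With `w_body + w_obj = 1`, perfect tracking of both signals yields a reward of 1.0, and the per-modality scales let the imprecise 2D object cues be weighted more forgivingly (a larger `sigma_obj`) than the 3D body signal.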