ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors

March 4, 2026
作者: Zihao Huang, Tianqi Liu, Zhaoxi Chen, Shaocong Xu, Saining Zhang, Lixing Xiao, Zhiguo Cao, Wei Li, Hao Zhao, Ziwei Liu
cs.AI

Abstract

Synthesizing physically plausible articulated human-object interactions (HOI) without 3D/4D supervision remains a fundamental challenge. While recent zero-shot approaches leverage video diffusion models to synthesize human-object interactions, they are largely confined to rigid-object manipulation and lack explicit 4D geometric reasoning. To bridge this gap, we formulate articulated HOI synthesis as a 4D reconstruction problem from monocular video priors: given only a video generated by a diffusion model, we reconstruct a full 4D articulated scene without any 3D supervision. This reconstruction-based approach treats the generated 2D video as supervision for an inverse rendering problem, recovering geometrically consistent and physically plausible 4D scenes that naturally respect contact, articulation, and temporal coherence. We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. Our key designs are: 1) Flow-based part segmentation: leveraging optical flow as a geometric cue to disentangle dynamic from static regions in monocular video; 2) Decoupled reconstruction pipeline: joint optimization of human motion and object articulation is unstable under monocular ambiguity, so we first recover object articulation, then synthesize human motion conditioned on the reconstructed object states. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded. Across diverse articulated scenes (e.g., opening fridges, cabinets, microwaves), ArtHOI significantly outperforms prior methods in contact accuracy, penetration reduction, and articulation fidelity, extending zero-shot interaction synthesis beyond rigid manipulation through reconstruction-informed synthesis.
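The abstract's first key design, flow-based part segmentation, uses optical-flow magnitude as a geometric cue to separate the moving articulated part (e.g. a swinging fridge door) from the static scene. The paper's actual implementation is not given here; the following is a minimal sketch of that thresholding idea, where the function name, the `camera_flow` compensation term, and the threshold value are all illustrative assumptions:

```python
import numpy as np

def segment_dynamic_parts(flow, camera_flow=None, threshold=1.0):
    """Split pixels into dynamic vs. static regions by flow magnitude.

    flow: (H, W, 2) per-pixel displacement field between two frames.
    camera_flow: optional (H, W, 2) ego-motion flow to subtract, so that
        only object/part motion remains (hypothetical compensation step).
    Returns a boolean (H, W) mask: True marks dynamic pixels.
    """
    residual = flow - camera_flow if camera_flow is not None else flow
    magnitude = np.linalg.norm(residual, axis=-1)  # per-pixel speed
    return magnitude > threshold

# Toy example: a static 4x4 scene with one moving region (the "door").
flow = np.zeros((4, 4, 2))
flow[1:3, 1:3] = [3.0, 0.0]  # the door pixels translate to the right
mask = segment_dynamic_parts(flow, threshold=1.0)
print(mask.sum())  # 4 dynamic pixels
```

In practice the flow field would come from an off-the-shelf optical-flow estimator run on the diffusion-generated video, and the binary mask would then seed the part-level decomposition used by the downstream articulation reconstruction.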