

ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors

March 4, 2026
作者: Zihao Huang, Tianqi Liu, Zhaoxi Chen, Shaocong Xu, Saining Zhang, Lixing Xiao, Zhiguo Cao, Wei Li, Hao Zhao, Ziwei Liu
cs.AI

Abstract

Synthesizing physically plausible articulated human-object interactions (HOI) without 3D/4D supervision remains a fundamental challenge. While recent zero-shot approaches leverage video diffusion models to synthesize human-object interactions, they are largely confined to rigid-object manipulation and lack explicit 4D geometric reasoning. To bridge this gap, we formulate articulated HOI synthesis as a 4D reconstruction problem from monocular video priors: given only a video generated by a diffusion model, we reconstruct a full 4D articulated scene without any 3D supervision. This reconstruction-based approach treats the generated 2D video as supervision for an inverse rendering problem, recovering geometrically consistent and physically plausible 4D scenes that naturally respect contact, articulation, and temporal coherence. We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. Our key designs are: 1) Flow-based part segmentation: leveraging optical flow as a geometric cue to disentangle dynamic from static regions in monocular video; 2) Decoupled reconstruction pipeline: joint optimization of human motion and object articulation is unstable under monocular ambiguity, so we first recover object articulation, then synthesize human motion conditioned on the reconstructed object states. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded. Across diverse articulated scenes (e.g., opening fridges, cabinets, microwaves), ArtHOI significantly outperforms prior methods in contact accuracy, penetration reduction, and articulation fidelity, extending zero-shot interaction synthesis beyond rigid manipulation through reconstruction-informed synthesis.
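The abstract's first key design, flow-based part segmentation, separates dynamic (articulated) regions from static ones using optical flow as a geometric cue. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's implementation: it assumes a precomputed dense flow field and simply thresholds the residual flow magnitude (after optionally subtracting an ego-motion flow) to label moving pixels. The function name, the `camera_flow` argument, and the threshold are all assumptions introduced for illustration.

```python
import numpy as np

def segment_dynamic_parts(flow, camera_flow=None, thresh=1.0):
    """Illustrative sketch: mark pixels as dynamic where residual flow
    magnitude exceeds a threshold.

    flow        : (H, W, 2) dense optical flow field.
    camera_flow : optional (H, W, 2) flow induced by camera motion,
                  subtracted so only object/part motion remains.
    Returns a boolean (H, W) mask: True = dynamic (e.g. an articulated part).
    """
    residual = flow - camera_flow if camera_flow is not None else flow
    mag = np.linalg.norm(residual, axis=-1)  # per-pixel flow magnitude
    return mag > thresh

# Toy example: a 4x4 frame where a 2x2 patch (e.g. an opening door) moves.
flow = np.zeros((4, 4, 2))
flow[1:3, 1:3] = [3.0, 0.0]  # moving patch, magnitude 3
mask = segment_dynamic_parts(flow, thresh=1.0)
```

In the toy example, only the four pixels of the moving patch are labeled dynamic. A real system would of course need a robust flow estimator and handling of camera motion, but the thresholding step captures the basic geometric cue the abstract describes.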