ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors

Abstract

Synthesizing physically plausible articulated human-object interactions (HOI) without 3D/4D supervision remains a fundamental challenge. Recent zero-shot approaches leverage video diffusion models to synthesize human-object interactions, but they are largely confined to rigid-object manipulation and lack explicit 4D geometric reasoning. To bridge this gap, we formulate articulated HOI synthesis as a 4D reconstruction problem from monocular video priors: given only a video generated by a diffusion model, we reconstruct a full 4D articulated scene without any 3D supervision. This reconstruction-based approach treats the generated 2D video as supervision for an inverse rendering problem, recovering geometrically consistent and physically plausible 4D scenes that respect contact, articulation, and temporal coherence. We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. Our key designs are: 1) Flow-based part segmentation, which uses optical flow as a geometric cue to separate dynamic from static regions in monocular video; 2) A decoupled reconstruction pipeline: because jointly optimizing human motion and object articulation is unstable under monocular ambiguity, we first recover the object articulation and then synthesize human motion conditioned on the reconstructed object states. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded. Across diverse articulated scenes (e.g., opening fridges, cabinets, microwaves), ArtHOI significantly outperforms prior methods in contact accuracy, penetration reduction, and articulation fidelity, extending zero-shot interaction synthesis beyond rigid-object manipulation.
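To make the idea of using optical flow to separate dynamic from static regions concrete, the following is a minimal sketch and not the ArtHOI implementation. It assumes a dense per-pixel flow field is already available for each pair of consecutive frames, and it labels a pixel as dynamic when its flow magnitude exceeds a threshold in enough frame pairs. The function names `dynamic_mask` and `dynamic_over_video`, the `threshold` and `min_frames` parameters, and the toy data are all hypothetical; the abstract does not specify how the flow cue is turned into a part segmentation.

```python
import math
from typing import List, Tuple

# flow[y][x] = (dx, dy): per-pixel displacement between two consecutive frames.
# Assumption: the flow comes from any off-the-shelf optical flow estimator.
Flow = List[List[Tuple[float, float]]]


def dynamic_mask(flow: Flow, threshold: float = 1.0) -> List[List[bool]]:
    """Label a pixel dynamic when its flow magnitude exceeds `threshold` pixels."""
    return [[math.hypot(dx, dy) > threshold for dx, dy in row] for row in flow]


def dynamic_over_video(flows: List[Flow], threshold: float = 1.0,
                       min_frames: int = 1) -> List[List[bool]]:
    """Label a pixel dynamic if it moves in at least `min_frames` frame pairs."""
    height, width = len(flows[0]), len(flows[0][0])
    counts = [[0] * width for _ in range(height)]
    for flow in flows:
        mask = dynamic_mask(flow, threshold)
        for y in range(height):
            for x in range(width):
                counts[y][x] += mask[y][x]  # True counts as 1
    return [[c >= min_frames for c in row] for row in counts]


# Toy 2x3 clip with two frame pairs: only the right-hand column
# (standing in for a moving door panel) has significant flow.
still = (0.0, 0.0)
flows = [
    [[still, still, (3.0, 4.0)], [still, (0.2, 0.1), (2.0, 0.0)]],
    [[still, still, (0.0, 2.5)], [still, still, (1.5, 1.5)]],
]
mask = dynamic_over_video(flows, threshold=1.0, min_frames=2)
```

In this toy example `mask` marks only the right-hand column as dynamic, while the small jitter at the centre pixel stays below the threshold and is treated as static. A real pipeline would need to lift such 2D cues to the reconstructed 3D scene, which this sketch does not attempt.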
