ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors

Abstract

Synthesizing physically plausible articulated human-object interactions (HOI) without 3D/4D supervision remains a fundamental challenge. Recent zero-shot approaches leverage video diffusion models to synthesize human-object interactions, but they are largely confined to rigid-object manipulation and lack explicit 4D geometric reasoning. To bridge this gap, we formulate articulated HOI synthesis as a 4D reconstruction problem from monocular video priors: given only a video generated by a diffusion model, we reconstruct a full 4D articulated scene without any 3D supervision. This reconstruction-based approach treats the generated 2D video as the supervision signal for an inverse rendering problem, recovering geometrically consistent and physically plausible 4D scenes that naturally respect contact, articulation, and temporal coherence. We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. Our key designs are: 1) Flow-based part segmentation, which uses optical flow as a geometric cue to separate dynamic from static regions in monocular video; 2) A decoupled reconstruction pipeline: because joint optimization of human motion and object articulation is unstable under monocular ambiguity, we first recover the object articulation and then synthesize human motion conditioned on the reconstructed object states. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded. Across diverse articulated scenes (e.g., opening fridges, cabinets, and microwaves), ArtHOI significantly outperforms prior methods in contact accuracy, penetration reduction, and articulation fidelity, extending zero-shot interaction synthesis beyond rigid manipulation through reconstruction-informed synthesis.
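
To make the flow-based cue concrete, the sketch below shows one simple way optical flow can separate dynamic from static pixels: threshold the per-pixel flow magnitude. This is an illustrative assumption, not the authors' implementation. The function name `dynamic_mask`, the 1-pixel threshold, and the toy flow field are all hypothetical, and a real pipeline would take its flow from an optical-flow estimator and would likely need to handle camera motion and noise.

```python
import math
from typing import List, Tuple

# Per-pixel (dx, dy) displacement between two consecutive frames.
Flow = List[List[Tuple[float, float]]]


def dynamic_mask(flow: Flow, threshold: float = 1.0) -> List[List[bool]]:
    """Mark a pixel as dynamic when its flow magnitude exceeds `threshold` pixels.

    The threshold is a hypothetical value chosen for this toy example only.
    """
    return [[math.hypot(dx, dy) > threshold for dx, dy in row] for row in flow]


# Toy 2x3 flow field: the right-hand column moves (e.g. a door swinging open),
# while the remaining pixels are nearly static.
toy_flow: Flow = [
    [(0.0, 0.0), (0.1, 0.0), (3.0, 4.0)],
    [(0.0, 0.2), (0.0, 0.0), (2.0, 0.0)],
]

mask = dynamic_mask(toy_flow, threshold=1.0)
```

In this toy case only the right-hand column is flagged as dynamic, which is the kind of coarse moving-part versus static-base split that a part segmentation step could then refine.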
