ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors

Abstract

Synthesizing physically plausible articulated human-object interactions (HOI) without 3D/4D supervision remains a fundamental challenge. Recent zero-shot approaches leverage video diffusion models to synthesize human-object interactions, but they are largely confined to rigid-object manipulation and lack explicit 4D geometric reasoning. To bridge this gap, we formulate articulated HOI synthesis as a 4D reconstruction problem from monocular video priors: given only a video generated by a diffusion model, we reconstruct a full 4D articulated scene without any 3D supervision. This reconstruction-based approach treats the generated 2D video as supervision for an inverse rendering problem, recovering geometrically consistent and physically plausible 4D scenes that naturally respect contact, articulation, and temporal coherence. We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. Our key designs are: 1) Flow-based part segmentation: optical flow serves as a geometric cue to separate dynamic from static regions in monocular video; 2) Decoupled reconstruction pipeline: because joint optimization of human motion and object articulation is unstable under monocular ambiguity, we first recover object articulation and then synthesize human motion conditioned on the reconstructed object states. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded. Across diverse articulated scenes (e.g., opening fridges, cabinets, microwaves), ArtHOI significantly outperforms prior methods in contact accuracy, penetration reduction, and articulation fidelity, extending zero-shot interaction synthesis beyond rigid manipulation through reconstruction-informed synthesis.
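
The flow-based part segmentation idea can be illustrated with a toy example. The abstract does not say how the flow is estimated or how dynamic regions are selected, so the following is only a minimal sketch under stated assumptions: the flow field is a hypothetical grid of per-pixel (dx, dy) displacements, and a pixel counts as dynamic when its flow magnitude exceeds a hand-picked threshold. The function name `dynamic_mask` and the threshold value are illustrative and not taken from the paper.

```python
import math

def dynamic_mask(flow, threshold=1.0):
    """Mark pixels whose optical-flow magnitude exceeds `threshold` as dynamic.

    `flow` is a hypothetical H x W grid (list of rows) of (dx, dy)
    displacements in pixels; `threshold` is an assumed, hand-picked cutoff.
    """
    return [[math.hypot(dx, dy) > threshold for dx, dy in row] for row in flow]

# Toy 2 x 3 flow field: the right column moves (e.g. a swinging door panel),
# while the remaining pixels are nearly still (e.g. the cabinet body).
flow = [
    [(0.0, 0.0), (0.1, 0.0), (3.0, 4.0)],
    [(0.0, 0.2), (0.0, 0.0), (2.0, 0.0)],
]
mask = dynamic_mask(flow)
```

Here `mask` flags only the right column as dynamic. A fixed threshold is the simplest possible choice; a real system would likely need something more robust to camera motion and flow noise.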
