InteractVLM: 3D Interaction Reasoning from 2D Foundational Models

Abstract

We introduce InteractVLM, a novel method to estimate 3D contact points on human bodies and objects from single in-the-wild images, enabling accurate joint 3D reconstruction of humans and objects. This is challenging due to occlusions, depth ambiguities, and widely varying object shapes. Existing methods rely on 3D contact annotations collected via expensive motion-capture systems or tedious manual labeling, which limits scalability and generalization. To overcome this, InteractVLM harnesses the broad visual knowledge of large Vision-Language Models (VLMs), fine-tuned with limited 3D contact data. However, directly applying these models is non-trivial, as they reason only in 2D, while human-object contact is inherently 3D. Thus, we introduce a novel Render-Localize-Lift module that: (1) embeds 3D body and object surfaces in 2D space via multi-view rendering, (2) trains a novel multi-view localization model (MV-Loc) to infer contacts in 2D, and (3) lifts these to 3D. Additionally, we propose a new task called Semantic Human Contact estimation, where human contact predictions are conditioned explicitly on object semantics, enabling richer interaction modeling. InteractVLM outperforms existing work on contact estimation and also facilitates 3D reconstruction from an in-the-wild image. Code and models are available at https://interactvlm.is.tue.mpg.de.
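The "lift" step of Render-Localize-Lift can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes orthographic cameras rotated about the vertical axis, vertices normalized to [-1, 1]^3, per-view 2D contact maps given as nested lists of probabilities, and simple averaging across views. The names `rotate_y`, `project`, and `lift_contacts` are hypothetical, and the sketch ignores occlusion, which a real pipeline would need to handle with a visibility or depth test.

```python
import math


def rotate_y(point, angle):
    """Rotate a 3D point about the y-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)


def project(point, angle, size):
    """Orthographically project a point in [-1, 1]^3 to a (row, col) pixel.

    Assumption: the camera looks along the z-axis after rotating the
    scene by `angle`; the image is `size` x `size` pixels.
    """
    x, y, _ = rotate_y(point, angle)
    col = min(size - 1, max(0, int((x + 1) / 2 * size)))
    row = min(size - 1, max(0, int((1 - y) / 2 * size)))
    return row, col


def lift_contacts(vertices, masks, angles, threshold=0.5):
    """Lift per-view 2D contact maps to per-vertex 3D contact labels.

    Each vertex is projected into every view, the 2D contact
    probability at that pixel is sampled, and the samples are averaged.
    No visibility test is performed (a simplification).
    """
    size = len(masks[0])
    contacts = []
    for vertex in vertices:
        scores = []
        for mask, angle in zip(masks, angles):
            row, col = project(vertex, angle, size)
            scores.append(mask[row][col])
        contacts.append(sum(scores) / len(scores) >= threshold)
    return contacts


def empty_mask(size):
    """Create a size x size contact map filled with zeros."""
    return [[0.0] * size for _ in range(size)]


# Two vertices seen from the front (0 rad) and the back (pi rad).
vertices = [(0.5, 0.5, 0.0), (-0.5, -0.5, 0.0)]
front, back = empty_mask(4), empty_mask(4)
front[1][3] = 1.0  # the first vertex projects here in the front view
back[1][1] = 1.0   # and here in the back view
print(lift_contacts(vertices, [front, back], [0.0, math.pi]))
```

Averaging across views is only one possible aggregation rule; weighting each view by visibility or by prediction confidence would be a natural alternative under the same assumptions.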
