DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors

Abstract

We present DreamHOI, a novel method for zero-shot synthesis of human-object interactions (HOIs), enabling a 3D human model to realistically interact with any given object based on a textual description. This task is complicated by the varying categories and geometries of real-world objects and the scarcity of datasets encompassing diverse HOIs. To circumvent the need for extensive data, we leverage text-to-image diffusion models trained on billions of image-caption pairs. We optimize the articulation of a skinned human mesh using Score Distillation Sampling (SDS) gradients obtained from these models, which predict image-space edits. However, directly backpropagating image-space gradients into complex articulation parameters is ineffective due to the local nature of such gradients. To overcome this, we introduce a dual implicit-explicit representation of a skinned mesh, combining (implicit) neural radiance fields (NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization, we transition between implicit and explicit forms, grounding the NeRF generation while refining the mesh articulation. We validate our approach through extensive experiments, demonstrating its effectiveness in generating realistic HOIs.
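
To make the SDS step in the abstract more concrete, here is a minimal sketch of the general SDS update: noise the rendered image, ask a noise predictor for its estimate, and push the weighted residual back through the renderer's Jacobian. It is not the authors' implementation. Everything named here is an assumption made for illustration: `RENDER` is a hypothetical linear "renderer" standing in for mesh rendering, `toy_noise_predictor` is a closed-form stand-in for a text-to-image diffusion model (exact only when the data distribution is a single target image), and the weighting `1 - alpha_bar` is one common choice. With a linear renderer the gradients are well behaved, so this toy does not reproduce the difficulty with articulation parameters that motivates the dual representation.

```python
import math
import random

# Hypothetical linear "renderer": maps 2 pose parameters to a 3-pixel "image".
RENDER = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def render(theta):
    """Render the toy image x = RENDER @ theta."""
    return [sum(a * p for a, p in zip(row, theta)) for row in RENDER]


def toy_noise_predictor(x_t, alpha_bar, target):
    """Stand-in for a diffusion model (assumption): the exact noise estimate
    when the data distribution is the single image `target`."""
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(xt - s * tg) / n for xt, tg in zip(x_t, target)]


def sds_grad(theta, target, rng):
    """One SDS gradient estimate: weight * (eps_hat - eps) * d(render)/d(theta)."""
    x = render(theta)
    alpha_bar = rng.uniform(0.2, 0.8)          # random noise level
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    eps = [rng.gauss(0.0, 1.0) for _ in x]     # injected Gaussian noise
    x_t = [s * xi + n * e for xi, e in zip(x, eps)]
    # The prediction is treated as a constant: no gradient flows through it.
    eps_hat = toy_noise_predictor(x_t, alpha_bar, target)
    weight = 1.0 - alpha_bar                   # assumed weighting w(t)
    residual = [weight * (eh - e) for eh, e in zip(eps_hat, eps)]
    # Chain rule through the renderer: multiply by the transposed Jacobian.
    return [
        sum(RENDER[i][j] * residual[i] for i in range(len(residual)))
        for j in range(len(theta))
    ]


def optimize(target, steps=500, lr=0.1, seed=0):
    """Plain gradient descent on theta using SDS gradient estimates."""
    rng = random.Random(seed)
    theta = [0.0] * len(RENDER[0])
    for _ in range(steps):
        grad = sds_grad(theta, target, rng)
        theta = [p - lr * g for p, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    true_theta = [0.5, -1.0]
    print(optimize(render(true_theta)))
```

In this toy setting the injected noise cancels exactly in the residual, so each step moves the rendered image toward the target. A real diffusion model gives only a noisy, approximate direction.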
