DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors
September 12, 2024
Authors: Thomas Hanwen Zhu, Ruining Li, Tomas Jakab
cs.AI
Abstract
We present DreamHOI, a novel method for zero-shot synthesis of human-object
interactions (HOIs), enabling a 3D human model to realistically interact with
any given object based on a textual description. This task is complicated by
the varying categories and geometries of real-world objects and the scarcity of
datasets encompassing diverse HOIs. To circumvent the need for extensive data,
we leverage text-to-image diffusion models trained on billions of image-caption
pairs. We optimize the articulation of a skinned human mesh using Score
Distillation Sampling (SDS) gradients obtained from these models, which predict
image-space edits. However, directly backpropagating image-space gradients into
complex articulation parameters is ineffective due to the local nature of such
gradients. To overcome this, we introduce a dual implicit-explicit
representation of a skinned mesh, combining (implicit) neural radiance fields
(NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization,
we transition between implicit and explicit forms, grounding the NeRF
generation while refining the mesh articulation. We validate our approach
through extensive experiments, demonstrating its effectiveness in generating
realistic HOIs.
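The abstract's core optimization step, Score Distillation Sampling, backpropagates a diffusion model's noise residual through the rendered image while skipping the diffusion network's own Jacobian. Below is a minimal, hedged sketch of that gradient in PyTorch; `diffusion`, its `add_noise` and `predict_noise` methods, and the constant weighting are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def sds_loss(diffusion, latents, text_emb, t, noise=None):
    """Surrogate loss whose gradient w.r.t. `latents` is the SDS gradient
    w(t) * (eps_hat - eps); `diffusion` is a hypothetical wrapper exposing
    a forward-noising step and a conditional noise prediction."""
    if noise is None:
        noise = torch.randn_like(latents)
    noisy = diffusion.add_noise(latents, noise, t)  # forward diffusion q(x_t | x_0)
    with torch.no_grad():  # the U-Net Jacobian is omitted in SDS
        eps_hat = diffusion.predict_noise(noisy, t, text_emb)
    w = 1.0  # timestep weighting w(t); constant here for simplicity
    grad = w * (eps_hat - noise)
    # Detached-gradient trick: d(loss)/d(latents) == grad
    return (grad.detach() * latents).sum()
```

In DreamHOI this image-space gradient is what ultimately drives the articulation parameters of the skinned mesh, via the dual implicit-explicit representation described above.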