Reconstructing Hand-Held Objects in 3D
April 9, 2024
Authors: Jane Wu, Georgios Pavlakos, Georgia Gkioxari, Jitendra Malik
cs.AI
Abstract
Objects manipulated by the hand (i.e., manipulanda) are particularly
challenging to reconstruct from in-the-wild RGB images or videos. Not only does
the hand occlude much of the object, but also the object is often only visible
in a small number of image pixels. At the same time, two strong anchors emerge
in this setting: (1) estimated 3D hands help disambiguate the location and
scale of the object, and (2) the set of manipulanda is small relative to all
possible objects. With these insights in mind, we present a scalable paradigm
for handheld object reconstruction that builds on recent breakthroughs in large
language/vision models and 3D object datasets. Our model, MCC-Hand-Object
(MCC-HO), jointly reconstructs hand and object geometry given a single RGB
image and inferred 3D hand as inputs. Subsequently, we use GPT-4(V) to retrieve
a 3D object model that matches the object in the image and rigidly align the
model to the network-inferred geometry; we call this alignment
Retrieval-Augmented Reconstruction (RAR). Experiments demonstrate that MCC-HO
achieves state-of-the-art performance on lab and Internet datasets, and we show
how RAR can be used to automatically obtain 3D labels for in-the-wild images of
hand-object interactions.
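The abstract's final step rigidly aligns the retrieved 3D object model to the network-inferred geometry (RAR). The sketch below illustrates one standard way such a rigid alignment can be computed, via the Kabsch/orthogonal-Procrustes solution on corresponding point samples. The function name `rigid_align`, the use of NumPy, and the assumption of known point correspondences are illustrative choices and not the paper's actual RAR implementation.

```python
# A minimal sketch of a rigid (rotation + translation) alignment step,
# assuming point samples from the retrieved 3D model ("source") and from
# the MCC-HO-inferred object geometry ("target") with known correspondences.
# This is an illustration of rigid alignment in general, not the paper's code.
import numpy as np


def rigid_align(source: np.ndarray, target: np.ndarray):
    """Return (R, t) minimizing ||R @ source_i + t - target_i|| over all i.

    source, target: (N, 3) arrays of corresponding 3D points.
    The aligned source is then `source @ R.T + t`.
    """
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    S, T = source - mu_s, target - mu_t
    # Cross-covariance and SVD (Kabsch algorithm).
    H = S.T @ T
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_t - R @ mu_s
    return R, t


if __name__ == "__main__":
    # Hypothetical usage with synthetic data standing in for sampled geometry.
    rng = np.random.default_rng(0)
    target = rng.normal(size=(500, 3))
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:                 # ensure a proper rotation
        Q[:, 0] *= -1
    source = (target - np.array([0.1, 0.0, 0.2])) @ Q  # misaligned copy
    R, t = rigid_align(source, target)
    print(np.allclose(source @ R.T + t, target, atol=1e-6))  # True
```

In the RAR setting, correspondences would not be given a priori; an ICP-style loop that alternates nearest-neighbor matching with this closed-form solve is one common way to handle that, though the paper does not commit to this specific procedure in the abstract.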