Reconstructing Hand-Held Objects in 3D
April 9, 2024
Authors: Jane Wu, Georgios Pavlakos, Georgia Gkioxari, Jitendra Malik
cs.AI
Abstract
Objects manipulated by the hand (i.e., manipulanda) are particularly
challenging to reconstruct from in-the-wild RGB images or videos. Not only does
the hand occlude much of the object, but also the object is often only visible
in a small number of image pixels. At the same time, two strong anchors emerge
in this setting: (1) estimated 3D hands help disambiguate the location and
scale of the object, and (2) the set of manipulanda is small relative to all
possible objects. With these insights in mind, we present a scalable paradigm
for handheld object reconstruction that builds on recent breakthroughs in large
language/vision models and 3D object datasets. Our model, MCC-Hand-Object
(MCC-HO), jointly reconstructs hand and object geometry given a single RGB
image and inferred 3D hand as inputs. Subsequently, we use GPT-4(V) to retrieve
a 3D object model that matches the object in the image and rigidly align the
model to the network-inferred geometry; we call this alignment
Retrieval-Augmented Reconstruction (RAR). Experiments demonstrate that MCC-HO
achieves state-of-the-art performance on lab and Internet datasets, and we show
how RAR can be used to automatically obtain 3D labels for in-the-wild images of
hand-object interactions.
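The rigid alignment step in RAR, registering a retrieved 3D object model to the geometry inferred by MCC-HO, can be illustrated with a closed-form similarity fit. The sketch below is not the paper's implementation; it assumes two corresponding 3D point sets (`src` sampled from the retrieved model, `dst` from the network-inferred geometry, both illustrative names) and solves for scale, rotation, and translation with the Umeyama algorithm. In practice the correspondences are unknown, so a solver like this would typically run inside an ICP-style loop that alternates nearest-neighbor matching and re-fitting.

```python
import numpy as np

def umeyama_alignment(src: np.ndarray, dst: np.ndarray, with_scale: bool = True):
    """Closed-form similarity alignment (Umeyama, 1991).

    Finds scale s, rotation R, and translation t minimizing
    sum_i || s * R @ src[i] + t - dst[i] ||^2 over corresponding points.

    src, dst: (N, 3) arrays of corresponding 3D points.
    Returns (s, R, t).
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered point sets, then its SVD.
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)

    # Reflection handling: force det(R) = +1 so R is a proper rotation.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = src_c.var(axis=0).sum()  # (1/N) * sum ||src_i - mu_src||^2
    s = (D * S.diagonal()).sum() / var_src if with_scale else 1.0
    t = mu_dst - s * R @ mu_src
    return s, R, t

# Usage sketch: recover a known similarity transform from synthetic points.
rng = np.random.default_rng(0)
src = rng.standard_normal((500, 3))
R_true, _ = np.linalg.qr(rng.standard_normal((3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1
dst = 1.7 * src @ R_true.T + np.array([0.1, -0.2, 0.3])
s, R, t = umeyama_alignment(src, dst)
aligned = s * src @ R.T + t  # should closely match dst
```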