Reconstructing Hand-Held Objects in 3D

Abstract

Objects manipulated by the hand (i.e., manipulanda) are particularly challenging to reconstruct from in-the-wild RGB images or videos. Not only does the hand occlude much of the object, but the object is also often visible in only a small number of image pixels. At the same time, two strong anchors emerge in this setting: (1) estimated 3D hands help disambiguate the location and scale of the object, and (2) the set of manipulanda is small relative to all possible objects. With these insights in mind, we present a scalable paradigm for handheld object reconstruction that builds on recent breakthroughs in large language/vision models and 3D object datasets. Our model, MCC-Hand-Object (MCC-HO), jointly reconstructs hand and object geometry given a single RGB image and an inferred 3D hand as inputs. Subsequently, we use GPT-4(V) to retrieve a 3D object model that matches the object in the image and rigidly align the model to the network-inferred geometry; we call this alignment Retrieval-Augmented Reconstruction (RAR). Experiments demonstrate that MCC-HO achieves state-of-the-art performance on lab and Internet datasets, and we show how RAR can be used to automatically obtain 3D labels for in-the-wild images of hand-object interactions.
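
The rigid alignment step in RAR can be illustrated with a small sketch. The abstract does not say which alignment algorithm is used, so the code below is only an assumption-laden illustration: it solves the closed-form least-squares rigid fit (Horn's quaternion method) for two point sets whose correspondences are already known. A retrieved model and a network-inferred point cloud would not come with such pairs, so a practical pipeline would presumably wrap a step like this in an ICP-style loop that re-estimates correspondences. The names `rigid_align` and `apply_transform` are hypothetical and do not come from the MCC-HO code.

```python
import math


def centroid(points):
    """Mean of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def apply_transform(R, t, p):
    """Return R @ p + t for a 3x3 rotation R, translation t and 3D point p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def rigid_align(source, target, iterations=300):
    """Least-squares rigid fit: find (R, t) with R @ source[i] + t ~= target[i].

    Assumes source[i] and target[i] are already paired (a simplification).
    """
    sc, tc = centroid(source), centroid(target)

    # Cross-covariance S[i][j] = sum over pairs of centred source_i * target_j.
    S = [[0.0] * 3 for _ in range(3)]
    for p, q in zip(source, target):
        a = [p[k] - sc[k] for k in range(3)]
        b = [q[k] - tc[k] for k in range(3)]
        for i in range(3):
            for j in range(3):
                S[i][j] += a[i] * b[j]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = S

    # Horn's symmetric 4x4 matrix; the eigenvector of its largest eigenvalue
    # is the optimal rotation as a unit quaternion (w, x, y, z).
    N = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]

    # Shift by the Frobenius norm so all eigenvalues are non-negative, then
    # run power iteration to reach the eigenvector of the largest one.
    shift = math.sqrt(sum(v * v for row in N for v in row))
    quat = [1.0, 0.5, 0.25, 0.125]  # arbitrary start, assumed not orthogonal to the answer
    for _ in range(iterations):
        quat = [sum(N[i][j] * quat[j] for j in range(4)) + shift * quat[i]
                for i in range(4)]
        norm = math.sqrt(sum(v * v for v in quat))
        quat = [v / norm for v in quat]
    w, x, y, z = quat

    # Unit quaternion to rotation matrix.
    R = [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]
    # The translation maps the rotated source centroid onto the target centroid.
    t = tuple(tc[i] - sum(R[i][j] * sc[j] for j in range(3)) for i in range(3))
    return R, t
```

Calling `rigid_align(model_points, inferred_points)` returns a rotation and translation that can be passed to `apply_transform` to move each model point into the frame of the inferred geometry. The sketch keeps scale fixed, which matches the word "rigidly" in the abstract; degenerate inputs (for example, all points identical) are not handled.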

