Reconstructing Hand-Held Objects in 3D

Abstract

Objects manipulated by the hand (i.e., manipulanda) are particularly challenging to reconstruct from in-the-wild RGB images or videos. Not only does the hand occlude much of the object, but the object is also often visible in only a small number of image pixels. At the same time, two strong anchors emerge in this setting: (1) estimated 3D hands help disambiguate the location and scale of the object, and (2) the set of manipulanda is small relative to all possible objects. With these insights in mind, we present a scalable paradigm for hand-held object reconstruction that builds on recent breakthroughs in large language/vision models and 3D object datasets. Our model, MCC-Hand-Object (MCC-HO), jointly reconstructs hand and object geometry given a single RGB image and an inferred 3D hand as inputs. Subsequently, we use GPT-4(V) to retrieve a 3D object model that matches the object in the image and rigidly align the model to the network-inferred geometry; we call this alignment Retrieval-Augmented Reconstruction (RAR). Experiments demonstrate that MCC-HO achieves state-of-the-art performance on lab and Internet datasets, and we show how RAR can be used to automatically obtain 3D labels for in-the-wild images of hand-object interactions.
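To make the rigid alignment step more concrete, the following is a minimal sketch of a least-squares rigid fit between two point sets (the Kabsch method). It is an illustration only, not the authors' procedure: the abstract does not say how the retrieved model is aligned, and the function name `rigid_align` and the assumption of known one-to-one point correspondences are hypothetical. In practice, correspondences between a retrieved model and network-inferred geometry are usually unknown, so a step like this would typically sit inside an iterative scheme that re-estimates them.

```python
import numpy as np


def rigid_align(source, target):
    """Return (R, t) minimizing sum ||R @ source[i] + t - target[i]||^2.

    Hypothetical helper: assumes source and target are (N, 3) arrays whose
    rows are already in one-to-one correspondence.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)

    # Center both point sets on their centroids.
    src_centroid = source.mean(axis=0)
    tgt_centroid = target.mean(axis=0)

    # 3x3 cross-covariance between the centered sets.
    H = (source - src_centroid).T @ (target - tgt_centroid)

    # The SVD of the cross-covariance gives the optimal rotation.
    U, _, Vt = np.linalg.svd(H)

    # Guard against a reflection (determinant -1) in the solution.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Translation that maps the rotated source centroid onto the target centroid.
    t = tgt_centroid - R @ src_centroid
    return R, t
```

Applying the result as `source @ R.T + t` moves the source points into the target frame. Scale is left fixed here, consistent with a rigid transform.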
