Understanding 3D Object Interaction from a Single Image

Abstract

Humans can easily understand a single image as depicting multiple potential objects that permit interaction. We use this skill to plan our interactions with the world and to accelerate our understanding of new objects without having to interact with them. In this paper, we aim to endow machines with a similar ability, so that intelligent agents can better explore a 3D scene or manipulate objects. Our approach is a transformer-based model that predicts the 3D location, physical properties, and affordance of objects. To power this model, we collect a dataset of Internet videos, egocentric videos, and indoor images to train and validate our approach. Our model yields strong performance on our data and generalizes well to robotics data.
