Understanding 3D Object Interaction from a Single Image

Abstract

Humans can easily understand a single image as depicting multiple objects that potentially permit interaction. We use this skill to plan our interactions with the world and to speed up our understanding of new objects without having to interact with them. In this paper, we aim to endow machines with a similar ability, so that intelligent agents can better explore a 3D scene or manipulate objects. Our approach is a transformer-based model that predicts the 3D location, physical properties, and affordances of objects. To power this model, we collect a dataset of Internet videos, egocentric videos, and indoor images to train and validate our approach. Our model yields strong performance on our data and generalizes well to robotics data.
