Understanding 3D Object Interaction from a Single Image

Abstract

Humans can easily understand a single image as depicting multiple objects that could be interacted with. We use this skill to plan our interactions with the world and to understand new objects quickly without having to interact with them. In this paper, we aim to endow machines with a similar ability, so that intelligent agents can better explore 3D scenes or manipulate objects. Our approach is a transformer-based model that predicts the 3D location, physical properties, and affordance of objects. To power this model, we collect a dataset of Internet videos, egocentric videos, and indoor images, which we use to train and validate our approach. Our model yields strong performance on our data and generalizes well to robotics data.
