OCTScenes: A Versatile Real-World Dataset of Tabletop Scenes for Object-Centric Learning

Abstract

Humans possess the cognitive ability to comprehend scenes in a compositional manner. To give AI systems similar abilities, object-centric representation learning aims to acquire representations of individual objects from visual scenes without any supervision. Although recent object-centric representation learning methods have made remarkable progress on complex synthetic datasets, applying them to complex real-world scenes remains a major challenge. One key reason is the scarcity of real-world datasets specifically tailored to object-centric representation learning methods. To address this problem, we propose OCTScenes, a versatile real-world dataset of tabletop scenes for object-centric learning, which is meticulously designed to serve as a benchmark for comparing, evaluating, and analyzing object-centric representation learning methods. OCTScenes contains 5000 tabletop scenes with a total of 15 everyday objects. Each scene is captured in 60 frames covering a 360-degree perspective. Consequently, OCTScenes is a versatile benchmark that supports the evaluation of object-centric representation learning methods on static-scene, dynamic-scene, and multi-view-scene tasks alike. We conduct extensive experiments on OCTScenes with object-centric representation learning methods for static, dynamic, and multi-view scenes. The results demonstrate the shortcomings of state-of-the-art methods in learning meaningful representations from real-world data, despite their impressive performance on complex synthetic datasets. Furthermore, OCTScenes can serve as a catalyst for advancing existing state-of-the-art methods, inspiring them to adapt to real-world scenes. Dataset and code are available at https://huggingface.co/datasets/Yinxuan/OCTScenes.
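
As a rough illustration of why a single capture protocol can serve all three settings, the sketch below shows one way a 60-frame scene could be sliced into static, dynamic, and multi-view samples. Only the scene count (5000) and frames per scene (60) come from the abstract. The `FrameRef` type, the function names, the clip length, the number of views, and the assumption that frames are indexed in capture order around the table are hypothetical choices made for illustration, not the dataset's actual layout or API.

```python
from dataclasses import dataclass

NUM_SCENES = 5000        # stated in the abstract
FRAMES_PER_SCENE = 60    # stated in the abstract


@dataclass(frozen=True)
class FrameRef:
    """Hypothetical handle to one frame; not part of the released dataset."""
    scene_id: int
    frame_idx: int  # assumed to follow capture order around the table


def static_samples(scene_id):
    """Static setting: each frame is treated as an independent single image."""
    return [[FrameRef(scene_id, i)] for i in range(FRAMES_PER_SCENE)]


def dynamic_clips(scene_id, clip_len=6):
    """Dynamic setting: consecutive frames form ordered, non-overlapping clips.

    clip_len is an illustrative value, not one taken from the paper.
    """
    return [
        [FrameRef(scene_id, i) for i in range(start, start + clip_len)]
        for start in range(0, FRAMES_PER_SCENE - clip_len + 1, clip_len)
    ]


def multi_view_set(scene_id, num_views=4):
    """Multi-view setting: pick frames spread evenly over the capture sequence.

    num_views is an illustrative value, not one taken from the paper.
    """
    step = FRAMES_PER_SCENE // num_views
    return [FrameRef(scene_id, i * step) for i in range(num_views)]


# Total number of frames implied by the two figures in the abstract.
total_frames = NUM_SCENES * FRAMES_PER_SCENE
```

Under these assumptions, the same scene yields single images for static methods, ordered clips for video-based methods, and sparse viewpoint sets for multi-view methods, without any additional capture.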
