OCTScenes: A Versatile Real-World Dataset of Tabletop Scenes for Object-Centric Learning
June 16, 2023
Authors: Yinxuan Huang, Tonglin Chen, Zhimeng Shen, Jinghao Huang, Bin Li, Xiangyang Xue
cs.AI
Abstract
Humans possess the cognitive ability to comprehend scenes in a compositional
manner. To empower AI systems with similar abilities, object-centric
representation learning aims to acquire representations of individual objects
from visual scenes without any supervision. Although recent advances in
object-centric representation learning have achieved remarkable progress on
complex synthetic datasets, applying these methods to complex real-world
scenes remains highly challenging. One essential reason is the scarcity of
real-world datasets specifically tailored to object-centric representation
learning methods. To address this problem, we propose OCTScenes, a versatile
real-world dataset of tabletop scenes for object-centric learning, which is
meticulously designed to serve as a benchmark for comparing, evaluating, and
analyzing object-centric representation learning methods. OCTScenes contains
5000 tabletop scenes drawn from a total of 15 everyday objects. Each scene is
captured in 60 frames covering a 360-degree perspective. Consequently,
OCTScenes is a versatile benchmark dataset that can simultaneously support the
evaluation of object-centric representation learning methods on static-scene,
dynamic-scene, and multi-view-scene tasks. Extensive experiments with
object-centric representation learning methods for static, dynamic, and
multi-view scenes are conducted on OCTScenes. The results reveal the
shortcomings of state-of-the-art methods in learning meaningful
representations from real-world data, despite their impressive performance on
complex synthetic datasets. Furthermore, OCTScenes can serve as a catalyst for
advancing existing state-of-the-art methods, inspiring them to adapt to
real-world scenes. The dataset and code are available at
https://huggingface.co/datasets/Yinxuan/OCTScenes.