OCTScenes: A Versatile Real-World Dataset of Tabletop Scenes for Object-Centric Learning
June 16, 2023
Authors: Yinxuan Huang, Tonglin Chen, Zhimeng Shen, Jinghao Huang, Bin Li, Xiangyang Xue
cs.AI
Abstract
Humans possess the cognitive ability to comprehend scenes in a compositional
manner. To empower AI systems with similar abilities, object-centric
representation learning aims to acquire representations of individual objects
from visual scenes without any supervision. Although recent advances in
object-centric representation learning have achieved remarkable progress on
complex synthetic datasets, applying these methods to complex real-world
scenes remains a major challenge. One of the essential reasons is the scarcity of
real-world datasets specifically tailored to object-centric representation
learning methods. To solve this problem, we propose a versatile real-world
dataset of tabletop scenes for object-centric learning called OCTScenes, which
is meticulously designed to serve as a benchmark for comparing, evaluating and
analyzing object-centric representation learning methods. OCTScenes contains
5000 tabletop scenes with a total of 15 everyday objects. Each scene is
captured in 60 frames covering a 360-degree perspective. Consequently,
OCTScenes is a versatile benchmark dataset that can simultaneously support the
evaluation of object-centric representation learning methods on static-scene,
dynamic-scene, and multi-view-scene tasks. Extensive experiments with
object-centric representation learning methods for static, dynamic and
multi-view scenes are conducted on OCTScenes. The results demonstrate the
shortcomings of state-of-the-art methods in learning meaningful
representations from real-world data, despite their impressive performance on
complex synthetic datasets. Furthermore, OCTScenes can serve as a catalyst for
advancing existing state-of-the-art methods, inspiring them to adapt to
real-world scenes. Dataset and code are available at
https://huggingface.co/datasets/Yinxuan/OCTScenes.
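As a rough sense of the corpus scale, the numbers stated in the abstract (5000 scenes, 60 frames per scene, a 360-degree sweep) imply the following back-of-envelope figures; the even angular spacing between frames is an assumption, since the abstract does not state how the 60 viewpoints are distributed:

```python
# Corpus-scale arithmetic grounded in the abstract's stated numbers.
NUM_SCENES = 5000       # tabletop scenes in OCTScenes
FRAMES_PER_SCENE = 60   # frames captured per scene
SWEEP_DEGREES = 360     # full perspective sweep per scene

# Total frames across the whole dataset.
total_frames = NUM_SCENES * FRAMES_PER_SCENE  # 300,000 frames

# Angular step between consecutive frames, ASSUMING even spacing
# over the 360-degree sweep (not stated explicitly in the abstract).
degrees_per_frame = SWEEP_DEGREES / FRAMES_PER_SCENE  # 6.0 degrees

print(total_frames, degrees_per_frame)
```

Under that even-spacing assumption, consecutive frames would differ by about 6 degrees of viewpoint, which is the granularity multi-view methods would see between adjacent frames.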