M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts
December 17, 2023
Authors: Mingsheng Li, Xin Chen, Chi Zhang, Sijin Chen, Hongyuan Zhu, Fukun Yin, Gang Yu, Tao Chen
cs.AI
Abstract
Recently, 3D understanding has become popular to facilitate autonomous agents in performing further decision-making. However, existing 3D datasets and methods are often limited to specific tasks. On the other hand, recent progress in Large Language Models (LLMs) and Multimodal Language Models (MLMs) has demonstrated exceptional performance on general language and image tasks. Therefore, it is interesting to unlock MLMs' potential to serve as 3D generalists for a wider range of tasks. However, current MLM research has been less focused on 3D tasks due to a lack of large-scale 3D instruction-following datasets. In this work, we introduce a comprehensive 3D instruction-following dataset called M3DBench, which possesses the following characteristics: 1) It supports general multimodal instructions interleaved with text, images, 3D objects, and other visual prompts. 2) It unifies diverse 3D tasks at both region and scene levels, covering a variety of fundamental abilities in real-world 3D environments. 3) It is a large-scale 3D instruction-following dataset with over 320k instruction-response pairs. Furthermore, we establish a new benchmark for assessing the performance of large models in understanding multi-modal 3D prompts. Extensive experiments demonstrate the effectiveness of our dataset and baseline, supporting general 3D-centric tasks, which can inspire future research.
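
To make the notion of "multimodal instructions interleaved with text, images, 3D objects, and other visual prompts" concrete, the sketch below shows what a single instruction-response record might look like. The field names, prompt types, and file references are illustrative assumptions, not the actual M3DBench schema.

```python
# Hypothetical sketch of an M3DBench-style instruction-response record.
# All field names and values here are assumptions for illustration only.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MultimodalInstruction:
    scene_id: str                      # reference to a 3D scene (e.g., a point cloud)
    instruction: List[Dict[str, str]]  # interleaved segments: text, image refs, region prompts
    response: str                      # ground-truth answer for instruction following


example = MultimodalInstruction(
    scene_id="scene_0001",
    instruction=[
        {"type": "text", "value": "What is the object highlighted by"},
        {"type": "box", "value": "[1.2, 0.4, 0.8, 0.5, 0.5, 0.9]"},  # region-level visual prompt
        {"type": "text", "value": "and how does it relate to the item shown in"},
        {"type": "image", "value": "query_chair.png"},
        {"type": "text", "value": "?"},
    ],
    response="The highlighted object is a wooden chair; it matches the chair in the image.",
)
```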