M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts

Abstract

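In recent years, 3D understanding has attracted attention as a way to support decision-making by autonomous agents. However, existing 3D datasets and methods are often limited to specific tasks. Meanwhile, advances in Large Language Models (LLMs) and Multimodal Language Models (MLMs) have shown strong performance on general language and image tasks. It is therefore appealing to unlock the potential of MLMs and use them as 3D generalists that handle a broader range of tasks. However, current MLM research has paid little attention to 3D tasks because large-scale 3D instruction-following datasets are lacking. In this work, we propose M3DBench, a comprehensive 3D instruction-following dataset with the following characteristics: 1) It supports general multimodal instructions in which text, images, 3D objects, and other visual prompts are interleaved. 2) It unifies diverse 3D tasks at both the region level and the scene level, covering fundamental abilities in real-world 3D environments. 3) It is a large-scale 3D instruction-following dataset with over 320k instruction-response pairs. In addition, we establish a new benchmark for evaluating how well large models understand multimodal 3D prompts. Extensive experiments demonstrate the effectiveness of the dataset and the baseline, which support general 3D-centric tasks and may stimulate future research.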

English

Recently, 3D understanding has gained popularity as a means of helping autonomous agents make further decisions. However, existing 3D datasets and methods are often limited to specific tasks. On the other hand, recent progress in Large Language Models (LLMs) and Multimodal Language Models (MLMs) has demonstrated exceptional performance on general language and imagery tasks. Therefore, it is interesting to unlock the potential of MLMs to act as 3D generalists for a wider range of tasks. However, current MLM research has focused less on 3D tasks due to a lack of large-scale 3D instruction-following datasets. In this work, we introduce a comprehensive 3D instruction-following dataset called M3DBench, which possesses the following characteristics: 1) It supports general multimodal instructions interleaved with text, images, 3D objects, and other visual prompts. 2) It unifies diverse 3D tasks at both region and scene levels, covering a variety of fundamental abilities in real-world 3D environments. 3) It is a large-scale 3D instruction-following dataset with over 320k instruction-response pairs. Furthermore, we establish a new benchmark for assessing the performance of large models in understanding multi-modal 3D prompts. Extensive experiments demonstrate the effectiveness of our dataset and baseline, which support general 3D-centric tasks and can inspire future research.

M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts
