Exploring the Potential of Encoder-free Architectures in 3D LMMs

Abstract

Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to overcome the challenges of encoder-based 3D Large Multimodal Models (LMMs). These challenges include the inability to adapt to varying point cloud resolutions and the failure of encoder-produced point features to meet the semantic needs of Large Language Models (LLMs). We identify the key aspects that allow a 3D LMM to remove the encoder and let the LLM assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage and explore the effects of various point cloud self-supervised losses. We also present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the early layers of the LLM so that they focus on the local details of the point clouds. To this end, we present the first encoder-free 3D LMM, ENEL. Our 7B model rivals the current state-of-the-art model, ShapeLLM-13B, achieving 55.0%, 50.92%, and 42.7% on the classification, captioning, and VQA tasks, respectively. Our results demonstrate that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at https://github.com/Ivan-Tang-3D/ENEL
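
The abstract does not spell out which operators the Hierarchical Geometry Aggregation strategy uses, so the following is only a minimal sketch of what aggregating local geometry over several levels could look like. It assumes farthest point sampling, k-nearest-neighbor grouping, and max pooling, which are common point cloud operators chosen here for illustration and not necessarily the authors' design. All function names and parameters (`farthest_point_sampling`, `aggregate_once`, `hierarchical_aggregate`, `levels`) are hypothetical, and the sketch runs on raw arrays rather than inside LLM layers.

```python
import numpy as np


def farthest_point_sampling(xyz: np.ndarray, m: int) -> np.ndarray:
    """Greedily pick m point indices that are mutually far apart (starts at index 0)."""
    n = xyz.shape[0]
    chosen = np.zeros(m, dtype=np.int64)
    min_dist = np.full(n, np.inf)  # squared distance to the nearest chosen point
    for i in range(1, m):
        diff = xyz - xyz[chosen[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum("ij,ij->i", diff, diff))
        chosen[i] = int(np.argmax(min_dist))
    return chosen


def aggregate_once(xyz: np.ndarray, feats: np.ndarray, m: int, k: int):
    """One level: sample m centers, group each with its k nearest points, max-pool features."""
    centers = farthest_point_sampling(xyz, m)
    # Squared distances from every center to every point, shape (m, n).
    d2 = ((xyz[centers][:, None, :] - xyz[None, :, :]) ** 2).sum(axis=-1)
    knn = np.argsort(d2, axis=1)[:, :k]  # neighbor indices, shape (m, k)
    pooled = feats[knn].max(axis=1)      # pooled local features, shape (m, c)
    return xyz[centers], pooled


def hierarchical_aggregate(xyz: np.ndarray, feats: np.ndarray, levels):
    """Apply aggregate_once repeatedly; levels is a list of (m, k) pairs (assumed schedule)."""
    for m, k in levels:
        xyz, feats = aggregate_once(xyz, feats, m, k)
    return xyz, feats


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = rng.random((256, 3))    # toy point cloud
    features = rng.random((256, 8))  # toy per-point features
    out_xyz, out_feats = hierarchical_aggregate(points, features, [(64, 8), (16, 4)])
    print(out_xyz.shape, out_feats.shape)
```

Each level shrinks the number of tokens while every surviving token summarizes a small spatial neighborhood, which is one way a locality bias can be introduced without a dedicated 3D encoder.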
