Exploring the Potential of Encoder-free Architectures in 3D LMMs

February 13, 2025
Authors: Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao
cs.AI

Abstract

Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be applied effectively to 3D understanding. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to overcome the challenges of encoder-based 3D Large Multimodal Models (LMMs). These challenges include the failure to adapt to varying point cloud resolutions and encoder-produced point features that do not meet the semantic needs of Large Language Models (LLMs). We identify two key aspects that allow a 3D LMM to remove the encoder and let the LLM assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, explore the effects of various point cloud self-supervised losses, and present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage, which incorporates inductive bias into the early layers of the LLM so that they focus on the local details of the point clouds. To this end, we present the first encoder-free 3D LMM, ENEL. Our 7B model rivals the current state-of-the-art model, ShapeLLM-13B, achieving 55.0%, 50.92%, and 42.7% on the classification, captioning, and VQA tasks, respectively. Our results demonstrate that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at https://github.com/Ivan-Tang-3D/ENEL.
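
The abstract does not spell out how Hierarchical Geometry Aggregation is implemented. As a rough illustration of what a locality bias over point tokens can look like, the sketch below uses farthest point sampling followed by k-nearest-neighbor max pooling. This is a common point cloud technique swapped in here as an assumption, not the ENEL method; the function names, `k`, and the number of centers are all hypothetical.

```python
import math

def farthest_point_sampling(points, num_centers):
    """Return indices of points that are spread out over the cloud."""
    chosen = [0]  # arbitrary starting point (assumption)
    # Distance from every point to its nearest chosen center so far.
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < num_centers:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt]))
                    for d, p in zip(min_dist, points)]
    return chosen

def aggregate_local(points, feats, num_centers, k):
    """Pool each center's k nearest neighbors into one token.

    Hypothetical helper: illustrates generic local aggregation,
    not the aggregation used in the ENEL code base.
    """
    num_centers = min(num_centers, len(points))
    k = min(k, len(points))
    centers = farthest_point_sampling(points, num_centers)
    new_points, new_feats = [], []
    for c in centers:
        # Indices of the k points closest to this center (itself included).
        nbrs = sorted(range(len(points)),
                      key=lambda i: math.dist(points[i], points[c]))[:k]
        # Channel-wise max pooling over the neighborhood.
        pooled = [max(feats[i][ch] for i in nbrs)
                  for ch in range(len(feats[0]))]
        new_points.append(points[c])
        new_feats.append(pooled)
    return new_points, new_feats

if __name__ == "__main__":
    import random
    random.seed(0)
    # Two clouds with different resolutions reduce to the same token count.
    for n in (64, 200):
        pts = [(random.random(), random.random(), random.random())
               for _ in range(n)]
        fts = [[random.random() for _ in range(4)] for _ in range(n)]
        p1, f1 = aggregate_local(pts, fts, num_centers=16, k=8)
        p2, f2 = aggregate_local(p1, f1, num_centers=4, k=4)
        print(n, len(f1), len(f2))
```

Applying the helper twice gives a simple two-level hierarchy. Because the number of centers is fixed, clouds of different sizes end up with the same number of tokens, which is one plausible way to cope with the varying point cloud resolutions the abstract mentions.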
