LLaNA: Large Language and NeRF Assistant
June 17, 2024
Authors: Andrea Amaduzzi, Pierluigi Zama Ramirez, Giuseppe Lisanti, Samuele Salti, Luigi Di Stefano
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated an excellent
understanding of images and 3D data. However, both modalities have shortcomings
in holistically capturing the appearance and geometry of objects. Meanwhile,
Neural Radiance Fields (NeRFs), which encode information within the weights of
a simple Multi-Layer Perceptron (MLP), have emerged as an increasingly
widespread modality that simultaneously encodes the geometry and photorealistic
appearance of objects. This paper investigates the feasibility and
effectiveness of ingesting NeRF into MLLM. We create LLaNA, the first
general-purpose NeRF-language assistant capable of performing new tasks such as
NeRF captioning and Q&A. Notably, our method directly processes the weights of
the NeRF's MLP to extract information about the represented objects without the
need to render images or materialize 3D data structures. Moreover, we build a
dataset of NeRFs with text annotations for various NeRF-language tasks with no
human intervention. Based on this dataset, we develop a benchmark to evaluate
the NeRF understanding capability of our method. Results show that processing
NeRF weights performs favourably against extracting 2D or 3D representations
from NeRFs.
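The central design point stated above is that LLaNA consumes the NeRF's MLP weights directly, without rendering images or extracting 3D structures. The sketch below illustrates that general idea only: a projector that maps a flattened vector of NeRF weights to a few tokens in a language model's embedding space. It is not the paper's actual architecture; the class name `WeightProjector`, the layer sizes, and the number of tokens are illustrative assumptions.

```python
# Minimal, hypothetical sketch: project flattened NeRF MLP weights into
# LLM embedding tokens. All names and dimensions here are assumptions,
# not LLaNA's real implementation.
import torch
import torch.nn as nn


class WeightProjector(nn.Module):
    """Maps a flattened vector of NeRF MLP weights to a fixed number of
    embedding tokens that an LLM could attend to alongside text tokens."""

    def __init__(self, weight_dim: int, llm_dim: int, num_tokens: int = 8):
        super().__init__()
        self.num_tokens = num_tokens
        self.llm_dim = llm_dim
        self.proj = nn.Sequential(
            nn.Linear(weight_dim, llm_dim * num_tokens),
            nn.GELU(),
            nn.Linear(llm_dim * num_tokens, llm_dim * num_tokens),
        )

    def forward(self, nerf_weights: torch.Tensor) -> torch.Tensor:
        # nerf_weights: (batch, weight_dim), i.e. the NeRF MLP's weight
        # matrices and biases concatenated into one flat vector per object.
        tokens = self.proj(nerf_weights)
        return tokens.view(-1, self.num_tokens, self.llm_dim)


if __name__ == "__main__":
    # Toy dimensions purely for illustration.
    weight_dim, llm_dim = 4096, 512
    projector = WeightProjector(weight_dim, llm_dim)
    dummy_weights = torch.randn(2, weight_dim)  # two NeRFs in a batch
    nerf_tokens = projector(dummy_weights)      # shape: (2, 8, 512)
    print(nerf_tokens.shape)
```

In a full system, these NeRF-derived tokens would be prepended to the tokenized text prompt before feeding the language model, which is how captioning and Q&A over the represented object could be posed as ordinary text generation.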