LLaNA: Large Language and NeRF Assistant

June 17, 2024
Authors: Andrea Amaduzzi, Pierluigi Zama Ramirez, Giuseppe Lisanti, Samuele Salti, Luigi Di Stefano
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated an excellent understanding of images and 3D data. However, both modalities have shortcomings in holistically capturing the appearance and geometry of objects. Meanwhile, Neural Radiance Fields (NeRFs), which encode information within the weights of a simple Multi-Layer Perceptron (MLP), have emerged as an increasingly widespread modality that simultaneously encodes the geometry and photorealistic appearance of objects. This paper investigates the feasibility and effectiveness of ingesting NeRF into MLLM. We create LLaNA, the first general-purpose NeRF-language assistant capable of performing new tasks such as NeRF captioning and Q&A. Notably, our method directly processes the weights of the NeRF's MLP to extract information about the represented objects without the need to render images or materialize 3D data structures. Moreover, we build a dataset of NeRFs with text annotations for various NeRF-language tasks with no human intervention. Based on this dataset, we develop a benchmark to evaluate the NeRF understanding capability of our method. Results show that processing NeRF weights performs favourably against extracting 2D or 3D representations from NeRFs.
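
To make the idea of "directly processing the weights of the NeRF's MLP" concrete, below is a minimal, hypothetical sketch of how NeRF weights could be turned into token embeddings for an LLM. It is not the paper's architecture: the class names, the per-layer linear projection, and the embedding dimension are all assumptions made for illustration only.

```python
# Hedged sketch (NOT the authors' implementation): one plausible way to feed
# NeRF MLP weights to an LLM is to project each layer's flattened parameters
# into a fixed number of "NeRF tokens" living in the LLM's embedding space.
# All names and dimensions below are hypothetical.
import torch
import torch.nn as nn

class NeRFWeightEncoder(nn.Module):
    """Maps the weight matrices of a NeRF MLP to a sequence of embeddings."""
    def __init__(self, layer_dims, llm_embed_dim=4096, tokens_per_layer=1):
        super().__init__()
        # One linear projection per NeRF layer: flattened weights -> token(s).
        self.proj = nn.ModuleList(
            nn.Linear(in_dim * out_dim, tokens_per_layer * llm_embed_dim)
            for in_dim, out_dim in layer_dims
        )
        self.llm_embed_dim = llm_embed_dim
        self.tokens_per_layer = tokens_per_layer

    def forward(self, nerf_weights):
        # nerf_weights: list of (out_dim, in_dim) tensors, one per MLP layer.
        tokens = []
        for w, proj in zip(nerf_weights, self.proj):
            t = proj(w.flatten())  # (tokens_per_layer * llm_embed_dim,)
            tokens.append(t.view(self.tokens_per_layer, self.llm_embed_dim))
        # Shape (num_layers * tokens_per_layer, llm_embed_dim); such tokens
        # would be prepended to the text-prompt embeddings of the LLM.
        return torch.cat(tokens, dim=0)

# Toy usage with a small NeRF MLP (input 3 -> hidden 64 -> output 4).
layer_dims = [(3, 64), (64, 64), (64, 4)]
weights = [torch.randn(out_d, in_d) for in_d, out_d in layer_dims]
encoder = NeRFWeightEncoder(layer_dims, llm_embed_dim=4096)
nerf_tokens = encoder(weights)
print(nerf_tokens.shape)  # torch.Size([3, 4096])
```

The point of the sketch is only that the NeRF is consumed as a set of weight tensors, with no rendering of images and no extraction of point clouds or meshes, which is the property the abstract highlights.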
