LLaNA: Large Language and NeRF Assistant

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated an excellent understanding of images and 3D data. However, both modalities have shortcomings in holistically capturing the appearance and geometry of objects. Meanwhile, Neural Radiance Fields (NeRFs), which encode information within the weights of a simple Multi-Layer Perceptron (MLP), have emerged as an increasingly widespread modality that simultaneously encodes the geometry and photorealistic appearance of objects. This paper investigates the feasibility and effectiveness of ingesting NeRFs into MLLMs. We create LLaNA, the first general-purpose NeRF-language assistant capable of performing new tasks such as NeRF captioning and Q&A. Notably, our method directly processes the weights of the NeRF's MLP to extract information about the represented objects, without the need to render images or materialize 3D data structures. Moreover, we build a dataset of NeRFs with text annotations for various NeRF-language tasks with no human intervention. Based on this dataset, we develop a benchmark to evaluate the NeRF understanding capability of our method. Results show that processing NeRF weights performs favourably against extracting 2D or 3D representations from NeRFs.
