NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models

Abstract

Capitalizing on the remarkable advances in Large Language Models (LLMs), a growing body of work harnesses LLMs for instruction-following robotic navigation. This trend underscores the potential of LLMs to generalize navigational reasoning and diverse language understanding. However, a significant discrepancy in agent performance is observed when integrating LLMs into Vision-and-Language Navigation (VLN) tasks compared to previous downstream specialist models. Furthermore, the inherent capacity of language to interpret and facilitate communication in agent interactions is often underutilized in these integrations. In this work, we strive to bridge the divide between VLN-specialized models and LLM-based navigation paradigms, while maintaining the interpretative prowess of LLMs in generating linguistic navigational reasoning. By aligning visual content in a frozen LLM, we equip LLMs with visual observation comprehension and explore a way to combine LLMs with navigation policy networks for effective action prediction and navigational reasoning. We demonstrate the data efficiency of the proposed methods and eliminate the gap between LM-based agents and state-of-the-art VLN specialists.
