

NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models

July 17, 2024
Authors: Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, Qi Wu
cs.AI

Abstract

Capitalizing on the remarkable advancements in Large Language Models (LLMs), there is a burgeoning initiative to harness LLMs for instruction-following robotic navigation. Such a trend underscores the potential of LLMs to generalize navigational reasoning and diverse language understanding. However, a significant discrepancy in agent performance is observed when integrating LLMs into Vision-and-Language Navigation (VLN) tasks compared to previous downstream specialist models. Furthermore, the inherent capacity of language to interpret and facilitate communication in agent interactions is often underutilized in these integrations. In this work, we strive to bridge the divide between VLN-specialized models and LLM-based navigation paradigms, while maintaining the interpretative prowess of LLMs in generating linguistic navigational reasoning. By aligning visual content in a frozen LLM, we endow LLMs with visual observation comprehension and exploit a way to incorporate LLMs and navigation policy networks for effective action predictions and navigational reasoning. We demonstrate the data efficiency of the proposed methods and eliminate the gap between LM-based agents and state-of-the-art VLN specialists.
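The abstract's core design can be sketched as follows: visual observations are projected into the embedding space of a frozen LLM, and the LLM's latent states feed a separate, trainable navigation policy head that scores candidate actions. This is a minimal illustrative sketch only; the module names, dimensions, and the use of a plain `TransformerEncoder` as a stand-in for the frozen LLM backbone are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class FrozenLMNavigator(nn.Module):
    """Sketch: frozen LM backbone + learnable vision alignment + policy head."""

    def __init__(self, vis_dim=512, lm_dim=256, n_layers=2):
        super().__init__()
        # Learnable projection aligning visual features to the LM space.
        self.vis_proj = nn.Linear(vis_dim, lm_dim)
        # Stand-in for a frozen pretrained LLM backbone.
        layer = nn.TransformerEncoderLayer(lm_dim, nhead=4, batch_first=True)
        self.frozen_lm = nn.TransformerEncoder(layer, num_layers=n_layers)
        for p in self.frozen_lm.parameters():
            p.requires_grad = False  # keep the LM frozen
        # Trainable navigation policy head scoring each candidate action.
        self.policy_head = nn.Linear(lm_dim, 1)

    def forward(self, instr_emb, cand_vis):
        # instr_emb: (B, T, lm_dim) instruction token embeddings
        # cand_vis:  (B, K, vis_dim) visual features of K candidate actions
        cand = self.vis_proj(cand_vis)              # align vision to LM space
        seq = torch.cat([instr_emb, cand], dim=1)   # joint language-vision sequence
        hidden = self.frozen_lm(seq)                # frozen-LM "reasoning"
        cand_hidden = hidden[:, instr_emb.size(1):]  # states over candidates
        return self.policy_head(cand_hidden).squeeze(-1)  # (B, K) action logits

model = FrozenLMNavigator()
logits = model(torch.randn(2, 8, 256), torch.randn(2, 5, 512))
print(logits.shape)  # torch.Size([2, 5])
```

Because only the projection and policy head are trainable, gradient updates never touch the LM weights, which is one plausible reading of how such a design stays data-efficient while preserving the LLM's language-generation ability.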

