NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models
July 17, 2024
Authors: Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, Qi Wu
cs.AI
Abstract
Capitalizing on the remarkable advancements in Large Language Models (LLMs),
there is a burgeoning initiative to harness LLMs for instruction-following
robotic navigation. This trend underscores the potential of LLMs to
generalize navigational reasoning and diverse language understanding. However,
a significant discrepancy in agent performance is observed when integrating
LLMs into the Vision-and-Language Navigation (VLN) task, compared with previous
downstream specialist models. Furthermore, the inherent capacity of language to
interpret and facilitate communication in agent interactions is often
underutilized in these integrations. In this work, we strive to bridge the
divide between VLN-specialized models and LLM-based navigation paradigms, while
preserving the interpretative prowess of LLMs in generating linguistic
navigational reasoning. By aligning visual content with a frozen LLM, we endow
LLMs with visual observation comprehension and exploit a way to combine LLMs
with navigation policy networks for effective action prediction and
navigational reasoning. We demonstrate the data efficiency of the proposed
method and eliminate the gap between LM-based agents and state-of-the-art VLN
specialists.
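The pipeline the abstract describes — a visual adapter that projects observations into a frozen LLM's representation space, followed by a separate policy network that scores navigable candidates — can be sketched as follows. This is a minimal illustrative sketch in NumPy, not the paper's implementation: all dimensions, weight matrices, and the function `navigate_step` are hypothetical stand-ins for the trained components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical, not the paper's actual sizes).
D_VIS, D_LM, N_CAND = 8, 16, 3

# Random stand-ins for learned weights. In NavGPT-2 only the adapter and
# policy head would be trained; the LM weights stay frozen.
W_adapt = rng.standard_normal((D_LM, D_VIS)) * 0.1   # visual adapter
W_lm    = rng.standard_normal((D_LM, D_LM)) * 0.1    # "frozen" LM layer
W_pol   = rng.standard_normal((D_LM, D_LM)) * 0.1    # policy head

def navigate_step(vis_feats, cand_feats):
    """One decision step: visual adapter -> frozen LM -> policy head."""
    # 1) Align pooled panoramic visual features into the LM token space.
    vis_token = W_adapt @ vis_feats.mean(axis=0)      # (D_LM,)
    # 2) The frozen LM produces a latent reasoning state (weights fixed).
    hidden = np.tanh(W_lm @ vis_token)                # (D_LM,)
    # 3) The policy head scores each navigable candidate viewpoint and
    #    normalizes the scores into an action distribution.
    cand_tokens = cand_feats @ W_adapt.T              # (N_CAND, D_LM)
    scores = cand_tokens @ (W_pol @ hidden)           # (N_CAND,)
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

vis_feats  = rng.standard_normal((5, D_VIS))         # mock panorama features
cand_feats = rng.standard_normal((N_CAND, D_VIS))    # mock candidate features
probs = navigate_step(vis_feats, cand_feats)
print(probs.argmax())                                # index of chosen action
```

The key design point the abstract emphasizes is that action prediction is delegated to the small policy head rather than decoded as free-form text from the LLM, which is what lets the agent keep the LM's linguistic reasoning while matching specialist VLN policies.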