LangNav: Language as a Perceptual Representation for Navigation
October 11, 2023
Authors: Bowen Pan, Rameswar Panda, SouYoung Jin, Rogerio Feris, Aude Oliva, Phillip Isola, Yoon Kim
cs.AI
Abstract
We explore the use of language as a perceptual representation for
vision-and-language navigation. Our approach uses off-the-shelf vision systems
(for image captioning and object detection) to convert an agent's egocentric
panoramic view at each time step into natural language descriptions. We then
finetune a pretrained language model to select an action, based on the current
view and the trajectory history, that would best fulfill the navigation
instructions. In contrast to the standard setup which adapts a pretrained
language model to work directly with continuous visual features from pretrained
vision models, our approach instead uses (discrete) language as the perceptual
representation. We explore two use cases of our language-based navigation
(LangNav) approach on the R2R vision-and-language navigation benchmark:
generating synthetic trajectories from a prompted large language model (GPT-4)
with which to finetune a smaller language model; and sim-to-real transfer, where
we transfer a policy learned in a simulated environment (ALFRED) to a
real-world environment (R2R). Our approach is found to improve upon strong
baselines that rely on visual features in settings where only a few gold
trajectories (10-100) are available, demonstrating the potential of using
language as a perceptual representation for navigation tasks.
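To make the pipeline concrete, the sketch below walks through one decision step as the abstract describes it: caption the egocentric panoramic view, fold the resulting description and the trajectory history into a text prompt, and let a language model pick the next action. This is a minimal illustration, not the paper's implementation; `caption_view`, `detect_objects`, and `select_action` are hypothetical stubs standing in for the off-the-shelf vision systems and the finetuned language model.

```python
# Minimal sketch of a LangNav-style decision loop (not the paper's actual code).
# caption_view, detect_objects, and select_action are hypothetical stubs that
# stand in for the off-the-shelf vision systems and the finetuned LM.

from dataclasses import dataclass, field

VIEW_HEADINGS = ["front", "right", "back", "left"]  # discretized panorama

def caption_view(image) -> str:
    """Stub for an off-the-shelf image captioner."""
    return "a hallway leading to an open doorway"

def detect_objects(image) -> list:
    """Stub for an off-the-shelf object detector."""
    return ["door", "rug", "lamp"]

def select_action(prompt: str, actions: list) -> str:
    """Stub for the finetuned language model that chooses among candidate actions."""
    return actions[0]

@dataclass
class NavState:
    instruction: str                              # natural-language navigation goal
    history: list = field(default_factory=list)   # past (view text, action) pairs

def describe_panorama(images) -> str:
    """Convert the egocentric panoramic view into text, one line per heading."""
    lines = []
    for heading, image in zip(VIEW_HEADINGS, images):
        caption = caption_view(image)
        objects = ", ".join(detect_objects(image))
        lines.append(f"{heading}: {caption} (objects: {objects})")
    return "\n".join(lines)

def step(state: NavState, images, candidate_actions) -> str:
    """One step: describe the view, prompt the LM, record and return the action."""
    view_text = describe_panorama(images)
    history = "\n".join(f"step {i}: {a}" for i, (_, a) in enumerate(state.history))
    prompt = (f"Instruction: {state.instruction}\n"
              f"Actions so far:\n{history or '(start)'}\n"
              f"Current view:\n{view_text}\n"
              f"Pick one action from {candidate_actions}:")
    action = select_action(prompt, candidate_actions)
    state.history.append((view_text, action))
    return action

# Example usage with placeholder images:
# state = NavState("Walk past the sofa and stop at the door.")
# step(state, images=[None] * 4, candidate_actions=["forward", "turn left", "stop"])
```

Because the perceptual representation is plain text, the same prompt format can be filled by GPT-4 to generate synthetic trajectories or reused across environments (e.g. ALFRED to R2R), which is what enables the two use cases above.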