
LangNav: Language as a Perceptual Representation for Navigation

October 11, 2023
Authors: Bowen Pan, Rameswar Panda, SouYoung Jin, Rogerio Feris, Aude Oliva, Phillip Isola, Yoon Kim
cs.AI

Abstract

We explore the use of language as a perceptual representation for vision-and-language navigation. Our approach uses off-the-shelf vision systems (for image captioning and object detection) to convert an agent's egocentric panoramic view at each time step into natural language descriptions. We then finetune a pretrained language model to select an action, based on the current view and the trajectory history, that would best fulfill the navigation instructions. In contrast to the standard setup which adapts a pretrained language model to work directly with continuous visual features from pretrained vision models, our approach instead uses (discrete) language as the perceptual representation. We explore two use cases of our language-based navigation (LangNav) approach on the R2R vision-and-language navigation benchmark: generating synthetic trajectories from a prompted large language model (GPT-4) with which to finetune a smaller language model; and sim-to-real transfer where we transfer a policy learned on a simulated environment (ALFRED) to a real-world environment (R2R). Our approach is found to improve upon strong baselines that rely on visual features in settings where only a few gold trajectories (10-100) are available, demonstrating the potential of using language as a perceptual representation for navigation tasks.
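To make the described pipeline concrete, below is a minimal illustrative sketch of the LangNav idea, not the authors' implementation: off-the-shelf vision models render each egocentric view as text, and a language model selects the next action from that text. The specific model checkpoints, the prompt format, and the helper functions are assumptions chosen for illustration, using Hugging Face transformers pipelines.

```python
# Illustrative sketch of a LangNav-style perception-to-text navigation step.
# Assumptions: Hugging Face `transformers` pipelines for captioning/detection,
# a small instruction-following LM standing in for the finetuned policy, and
# PIL images keyed by heading ("left", "front", ...). Not the authors' code.

from transformers import pipeline

# Off-the-shelf perception: image captioning + object detection.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

# Stand-in for the finetuned policy language model.
policy_lm = pipeline("text2text-generation", model="google/flan-t5-base")

def describe_view(images_by_heading):
    """Render an egocentric panoramic view (dict: heading -> PIL image)
    as natural-language lines, one per heading."""
    lines = []
    for heading, img in images_by_heading.items():
        cap = captioner(img)[0]["generated_text"]
        objs = {d["label"] for d in detector(img) if d["score"] > 0.8}
        lines.append(
            f"To the {heading}: {cap}. "
            f"Objects: {', '.join(sorted(objs)) or 'none'}."
        )
    return "\n".join(lines)

def choose_action(instruction, history, images_by_heading, candidates):
    """One decision step: describe the current view in language, then ask
    the LM which candidate action best fulfills the instruction."""
    prompt = (
        f"Navigation instruction: {instruction}\n"
        f"Trajectory so far: {' '.join(history) or '(start)'}\n"
        f"Current view:\n{describe_view(images_by_heading)}\n"
        f"Possible actions: {', '.join(candidates)}\n"
        "Which action should the agent take next? Answer with one action."
    )
    out = policy_lm(prompt, max_new_tokens=8)[0]["generated_text"].strip()
    # Fall back to the first candidate if the LM's answer is off-list.
    return out if out in candidates else candidates[0]
```

Because the perceptual representation here is discrete text rather than continuous visual features, the same prompt format can be reused to elicit synthetic trajectories from a larger prompted model (as the paper does with GPT-4) or to transfer a policy across environments whose observations are described in the same vocabulary.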