LangNav: Language as a Perceptual Representation for Navigation

Abstract

We explore the use of language as a perceptual representation for vision-and-language navigation. Our approach uses off-the-shelf vision systems (for image captioning and object detection) to convert an agent's egocentric panoramic view at each time step into natural language descriptions. We then finetune a pretrained language model to select an action, based on the current view and the trajectory history, that would best fulfill the navigation instructions. In contrast to the standard setup, which adapts a pretrained language model to work directly with continuous visual features from pretrained vision models, our approach instead uses (discrete) language as the perceptual representation. We explore two use cases of our language-based navigation (LangNav) approach on the R2R vision-and-language navigation benchmark: generating synthetic trajectories from a prompted large language model (GPT-4) with which to finetune a smaller language model; and sim-to-real transfer, where we transfer a policy learned on a simulated environment (ALFRED) to a real-world environment (R2R). Our approach is found to improve upon strong baselines that rely on visual features in settings where only a few gold trajectories (10-100) are available, demonstrating the potential of using language as a perceptual representation for navigation tasks.
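The pipeline described in the abstract (vision outputs turned into text, then a language model choosing an action from the current view and the trajectory history) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ViewDescription` class, the `describe_panorama` and `build_prompt` functions, the heading labels, and the prompt layout are hypothetical and are not taken from the paper. The captions and object lists stand in for whatever an image captioner and an object detector would return; no vision or language model is called.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ViewDescription:
    """Text for one heading of the panorama (hypothetical structure)."""
    heading: str                                       # e.g. "front", "left"
    caption: str                                       # stand-in for captioner output
    objects: List[str] = field(default_factory=list)   # stand-in for detector output


def describe_panorama(views: List[ViewDescription]) -> str:
    """Join per-heading captions and detected objects into one text observation."""
    lines = []
    for v in views:
        line = f"{v.heading}: {v.caption}"
        if v.objects:
            line += " Objects: " + ", ".join(v.objects) + "."
        lines.append(line)
    return "\n".join(lines)


def build_prompt(instruction: str, history: List[str],
                 views: List[ViewDescription], actions: List[str]) -> str:
    """Assemble the text a language model would read before picking an action."""
    past = "\n".join(f"Step {i}: {h}" for i, h in enumerate(history, start=1)) or "(none)"
    options = "\n".join(f"{i}. {a}" for i, a in enumerate(actions))
    return (
        f"Instruction: {instruction}\n"
        f"History:\n{past}\n"
        f"Current view:\n{describe_panorama(views)}\n"
        f"Candidate actions:\n{options}\n"
        "Chosen action:"
    )


if __name__ == "__main__":
    views = [
        ViewDescription("front", "A hallway with a wooden door.", ["door", "rug"]),
        ViewDescription("left", "A kitchen with a white counter."),
    ]
    print(build_prompt("Walk down the hallway and stop at the door.",
                       [], views, ["go forward", "turn left", "stop"]))
```

In a setup like the one the abstract describes, a prompt of this kind would be passed to the finetuned language model, and the model's output would be mapped back to one of the candidate actions. The exact prompt format used by the authors is not stated in this text.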
