SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts

Abstract

Research on learning instruction-guided visual navigation generally falls into two categories, distinguished by the granularity of the language instruction: high-level category-specific search, which emphasizes the exploration process, and low-level language-guided navigation, which concentrates on following detailed textual commands. Despite their differing focuses, these tasks share the same underlying requirements: interpreting instructions, comprehending the surroundings, and inferring action decisions. This paper consolidates diverse navigation tasks into a unified and generic framework. We investigate the core difficulties of sharing general knowledge and exploiting task-specific capabilities in learning navigation, and propose a novel State-Adaptive Mixture of Experts (SAME) model that effectively enables an agent to infer decisions based on language of different granularity and on dynamic observations. Powered by SAME, we present a versatile agent that addresses seven navigation tasks simultaneously and either outperforms task-specific agents or achieves highly comparable performance.
