Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for Agentic Navigation Systems

Abstract

Agentic navigation systems require a base navigation model whose observation strategy can be externally reconfigured at inference time, because instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different strategies for consuming the visual stream. We present Qwen-RobotNav, a scalable navigation model built on Qwen-RobotNav that addresses this need through a parameterised interface with two complementary dimensions: multiple task modes that select the navigation behaviour, and controllable observation parameters (e.g., token budget, per-camera weights) that govern how visual history is encoded. Because all parameters are randomised during training, Qwen-RobotNav is robust to any inference-time configuration and requires no architectural modification to the Qwen-RobotNav backbone. We train Qwen-RobotNav on 15.6M samples; co-training with vision-language data prevents the collapse into a reactive action-sequence mapper that is observed in trajectory-only training. The parameterised interface also makes Qwen-RobotNav a natural building block for agentic systems: in long-horizon scenarios, an upper-level planner decomposes goals into sub-tasks and dynamically switches Qwen-RobotNav's task mode and context strategy mid-episode, composing complex behaviours from repeated calls to the same model. Extensive experiments show that Qwen-RobotNav sets new state-of-the-art results across major navigation benchmarks. The model scales favourably from 2B to 8B parameters, joint multi-task training develops a shared spatial-planning substrate that transfers across task families, and the model generalises strongly in zero-shot settings to real-world robots across diverse environments.
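
As an illustration only, the parameterised interface described above might be sketched as follows. Every name here (ObservationConfig, sample_config, allocate_tokens, switch_mode, the mode identifiers, the value ranges, and the proportional allocation rule) is an assumption made for this sketch and is not taken from the Qwen-RobotNav implementation.

```python
import random
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple

# The four task families named in the abstract; the identifiers are hypothetical.
TASK_MODES = [
    "instruction_following",
    "object_search",
    "target_tracking",
    "autonomous_driving",
]


@dataclass(frozen=True)
class ObservationConfig:
    """Hypothetical container for the two interface dimensions."""
    task_mode: str                    # selects the navigation behaviour
    token_budget: int                 # total visual-history tokens
    camera_weights: Dict[str, float]  # relative importance per camera


def sample_config(rng: random.Random, cameras: List[str],
                  budget_range: Tuple[int, int] = (256, 2048)) -> ObservationConfig:
    """Draw one random configuration, as training-time randomisation might."""
    return ObservationConfig(
        task_mode=rng.choice(TASK_MODES),
        token_budget=rng.randint(*budget_range),
        camera_weights={cam: rng.uniform(0.1, 1.0) for cam in cameras},
    )


def allocate_tokens(config: ObservationConfig) -> Dict[str, int]:
    """Split the token budget across cameras in proportion to their weights."""
    total = sum(config.camera_weights.values())
    shares = {cam: int(config.token_budget * w / total)
              for cam, w in config.camera_weights.items()}
    # Hand any rounding remainder to the highest-weighted camera.
    top = max(config.camera_weights, key=config.camera_weights.get)
    shares[top] += config.token_budget - sum(shares.values())
    return shares


def switch_mode(config: ObservationConfig, new_mode: str) -> ObservationConfig:
    """Return a copy with a different task mode, as a planner might mid-episode."""
    if new_mode not in TASK_MODES:
        raise ValueError(f"unknown task mode: {new_mode}")
    return replace(config, task_mode=new_mode)
```

In this sketch an upper-level planner would keep one ObservationConfig per sub-task and call switch_mode between sub-tasks, while the model itself stays unchanged; how the real system encodes these parameters is not specified in the abstract.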