ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation

Abstract

Embodied navigation has long been fragmented by task-specific architectures. We introduce ABot-N0, a unified Vision-Language-Action (VLA) foundation model that achieves a "Grand Unification" across 5 core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 uses a hierarchical "Brain-Action" architecture, pairing an LLM-based Cognitive Brain for semantic reasoning with a Flow Matching-based Action Expert for precise, continuous trajectory generation. To support large-scale learning, we developed the ABot-N0 Data Engine, curating 16.9M expert trajectories and 5.0M reasoning samples across 7,802 high-fidelity 3D scenes (10.7 km²). ABot-N0 achieves new state-of-the-art performance across 7 benchmarks, significantly outperforming specialized models. Furthermore, our Agentic Navigation System integrates a planner with hierarchical topological memory, enabling robust, long-horizon missions in dynamic real-world environments.
