ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning
February 11, 2026
Authors: Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, Junjin Xiao, Haoyun Liu, Ronghan Chen, Yuzhi Chen, Dongjie Huo, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu
cs.AI
Abstract
Building general-purpose embodied agents across diverse hardware remains a central challenge in robotics, often framed as the "one-brain, many-forms" paradigm. Progress is hindered by fragmented data, inconsistent representations, and misaligned training objectives. We present ABot-M0, a framework that couples a systematic data curation pipeline with jointly optimized model architecture and training strategies, enabling end-to-end transformation of heterogeneous raw data into unified, efficient representations. From six public datasets, we clean, standardize, and balance samples to construct UniACT-dataset, a large-scale collection of over 6 million trajectories and 9,500 hours of data covering diverse robot morphologies and task scenarios. Unified pre-training improves knowledge transfer and generalization across platforms and tasks, laying a foundation for general-purpose embodied intelligence. To improve the efficiency and stability of action prediction, we propose the Action Manifold Hypothesis: effective robot actions lie not in the full high-dimensional action space but on a low-dimensional, smooth manifold governed by physical laws and task constraints. Building on this, we introduce Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly, shifting learning from denoising to projection onto the feasible manifold and improving decoding speed and policy stability. ABot-M0 supports modular perception via a dual-stream mechanism that fuses VLM semantics with geometric priors and multi-view inputs from plug-and-play 3D modules such as VGGT and Qwen-Image-Edit, enhancing spatial understanding without modifying the backbone and mitigating the limitations of standard VLMs in 3D reasoning. Experiments show that the components operate independently and provide additive benefits. We will release all code and pipelines to support reproducibility and future research.
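To make the core idea of Action Manifold Learning concrete, the following is a minimal sketch, not the authors' released implementation: the abstract does not specify the DiT architecture or dimensions, so the transformer decoder head, the names `ActionManifoldHead` and `aml_loss`, and all shapes below are illustrative assumptions. The sketch only shows the training-objective shift the abstract describes: regressing a clean, continuous action chunk in one pass (a projection onto demonstrated, feasible actions) rather than iteratively denoising.

```python
# Hypothetical sketch of a direct action-chunk regression head in the spirit of AML.
# Architecture, dimensions, and names are assumptions, not the paper's model.
import torch
import torch.nn as nn


class ActionManifoldHead(nn.Module):
    """Maps fused condition tokens (e.g., VLM + geometry features) to a clean action chunk."""

    def __init__(self, cond_dim: int = 1024, action_dim: int = 7, horizon: int = 16,
                 hidden_dim: int = 512, num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        # One learnable query token per future timestep in the predicted action chunk.
        self.action_queries = nn.Parameter(torch.randn(horizon, hidden_dim) * 0.02)
        self.cond_proj = nn.Linear(cond_dim, hidden_dim)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.out = nn.Linear(hidden_dim, action_dim)

    def forward(self, cond_tokens: torch.Tensor) -> torch.Tensor:
        # cond_tokens: (B, N, cond_dim) fused vision-language condition tokens.
        batch = cond_tokens.shape[0]
        memory = self.cond_proj(cond_tokens)
        queries = self.action_queries.unsqueeze(0).expand(batch, -1, -1)
        hidden = self.decoder(tgt=queries, memory=memory)
        # (B, horizon, action_dim): a clean, continuous action sequence in one forward pass.
        return self.out(hidden)


def aml_loss(pred_actions: torch.Tensor, gt_actions: torch.Tensor) -> torch.Tensor:
    # Single-step regression onto demonstrated (feasible) actions,
    # in contrast to a multi-step diffusion denoising objective.
    return nn.functional.smooth_l1_loss(pred_actions, gt_actions)


if __name__ == "__main__":
    head = ActionManifoldHead()
    cond = torch.randn(2, 64, 1024)   # fused condition tokens (hypothetical shapes)
    gt = torch.randn(2, 16, 7)        # ground-truth action chunk from demonstrations
    loss = aml_loss(head(cond), gt)
    loss.backward()
    print(float(loss))
```

Because the head emits the whole action chunk in a single forward pass, decoding cost is independent of any denoising-step count, which is consistent with the decoding-speed benefit the abstract claims for AML.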