

iFlyBot-VLA Technical Report

November 1, 2025
Authors: Yuan Zhang, Chenyu Xue, Wenjie Xu, Chao Ji, Jiajia Wu, Jia Pan
cs.AI

Abstract

We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are as follows: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets, effectively enhancing the 3D perceptual and reasoning capabilities of the VLM backbone. Specifically, the VLM is trained to predict two complementary forms of actions: latent actions, derived from our latent action model pretrained on cross-embodiment manipulation data, which capture implicit high-level intentions; and structured discrete action tokens, obtained through frequency-domain transformations of continuous control signals, which encode explicit low-level dynamics. This dual supervision aligns the representation spaces of language, vision, and action, enabling the VLM to contribute directly to action generation. Experimental results on the LIBERO Franka benchmark demonstrate the superiority of our framework, while real-world evaluations further show that iFlyBot-VLA achieves competitive success rates across diverse and challenging manipulation tasks. Furthermore, we plan to open-source a portion of our self-constructed dataset to support future research in the community.
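To make the "structured discrete action tokens" idea concrete, below is a minimal sketch of one common way to tokenize a chunk of continuous control signals via a frequency-domain transform (a type-II DCT followed by uniform quantization). This is not the authors' implementation; the transform choice, `quant_step`, `vocab_offset`, and function names are illustrative assumptions, and the actual iFlyBot-VLA tokenizer may differ in transform, quantization scheme, and vocabulary handling.

```python
# Sketch: frequency-domain tokenization of a continuous action chunk.
# Assumptions (not from the report): DCT-II transform, uniform quantization,
# and a simple integer offset to keep token ids non-negative.
import numpy as np
from scipy.fft import dct, idct


def encode_action_chunk(actions: np.ndarray, quant_step: float = 0.05,
                        vocab_offset: int = 128) -> np.ndarray:
    """actions: (T, D) chunk of continuous control signals.
    Returns a flat sequence of integer token ids."""
    # Transform each control dimension along the time axis into
    # frequency coefficients (orthonormal DCT-II).
    coeffs = dct(actions, axis=0, norm="ortho")               # (T, D)
    # Uniformly quantize the coefficients and shift into an id range.
    tokens = np.round(coeffs / quant_step).astype(int) + vocab_offset
    return tokens.flatten()


def decode_action_chunk(tokens: np.ndarray, chunk_len: int, action_dim: int,
                        quant_step: float = 0.05,
                        vocab_offset: int = 128) -> np.ndarray:
    """Inverse mapping: token ids back to a continuous action chunk."""
    coeffs = (tokens.reshape(chunk_len, action_dim) - vocab_offset) * quant_step
    return idct(coeffs, axis=0, norm="ortho")


if __name__ == "__main__":
    chunk = np.random.uniform(-1.0, 1.0, size=(16, 7))        # 16 steps, 7-DoF arm
    ids = encode_action_chunk(chunk)
    recon = decode_action_chunk(ids, chunk_len=16, action_dim=7)
    print("max reconstruction error:", np.abs(chunk - recon).max())
```

The appeal of such a scheme, and presumably the reason the abstract frames the discrete tokens as encoding "explicit low-level dynamics", is that low-frequency coefficients summarize smooth trajectory structure compactly, so a short token sequence can supervise the VLM alongside the latent-action targets.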