Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution

February 13, 2026
Authors: Rui Cai, Jun Guo, Xinze He, Piaopiao Jin, Jie Li, Bingxuan Lin, Futeng Liu, Wei Liu, Fei Ma, Kun Ma, Feng Qiu, Heng Qu, Yifei Su, Qiao Sun, Dong Wang, Donghao Wang, Yunhong Wang, Rujie Wu, Diyun Xiang, Yu Yang, Hangjun Ye, Yuan Zhang, Quanyun Zhou
cs.AI

Abstract

In this report, we introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and fast, smooth real-time execution. The key to our method lies in a carefully designed training recipe and deployment strategy. Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, endowing it with broad, generalizable action-generation capabilities while avoiding catastrophic forgetting of the visual-semantic knowledge in the underlying pre-trained VLM. During post-training, we propose several techniques that train the VLA model for asynchronous execution, addressing inference latency during real-robot rollouts. During deployment, we carefully align the timesteps of consecutive predicted action chunks to ensure continuous, seamless real-time rollouts. We evaluate Xiaomi-Robotics-0 extensively on simulation benchmarks and on two challenging real-robot tasks that require precise, dexterous bimanual manipulation. Results show that our method achieves state-of-the-art performance across all simulation benchmarks. Moreover, Xiaomi-Robotics-0 rolls out quickly and smoothly on real robots using a consumer-grade GPU, achieving high success rates and throughput on both real-robot tasks. To facilitate future research, code and model checkpoints are open-sourced at https://xiaomi-robotics-0.github.io.
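
The asynchronous-execution and chunk-alignment ideas in the abstract are concrete enough to sketch. The toy Python below is an illustration, not the paper's code: `predict_chunk` is a hypothetical stand-in for the VLA model, and the chunk length, control rate, latency, and splice rule are assumed values. It shows the pattern the abstract describes: the next action chunk is predicted in a background thread while the current chunk executes, and on arrival its already-elapsed timesteps are dropped so consecutive chunks line up.

```python
# Toy sketch of asynchronous action-chunk execution with timestep alignment.
# All specifics are illustrative assumptions, not Xiaomi-Robotics-0's code.
import threading
import time
from collections import deque

CHUNK_LEN = 16        # actions per predicted chunk (assumed)
CONTROL_DT = 0.02     # 50 Hz control loop (assumed)
INFER_LATENCY = 0.10  # simulated model inference time (assumed)
PREFETCH_AT = 8       # queue level at which the next chunk is requested

step = 0              # absolute control timestep
queue = deque()       # pending (timestep, action) pairs
lock = threading.Lock()

def predict_chunk(obs_step):
    """Stand-in for the VLA model: returns CHUNK_LEN actions, each tagged
    with the absolute control timestep it is intended for."""
    time.sleep(INFER_LATENCY)  # simulate inference latency
    return [(obs_step + i, f"a{obs_step + i}") for i in range(CHUNK_LEN)]

def prefetch(obs_step):
    """Predict the next chunk (from the observation at obs_step) while the
    current chunk keeps executing. On arrival, discard the actions whose
    timesteps have already been executed and let the rest take over, so
    consecutive chunks join without repeating or skipping a step."""
    chunk = predict_chunk(obs_step)
    with lock:
        queue.clear()  # drop the stale tail of the previous chunk
        queue.extend((t, a) for (t, a) in chunk if t >= step)

queue.extend(predict_chunk(0))  # first chunk is predicted synchronously
worker = None
while step < 4 * CHUNK_LEN:     # run a few chunks' worth of control ticks
    tick = time.monotonic()
    with lock:
        t, action = queue.popleft()
        remaining = len(queue)
    print(f"step {step:2d}: {action} (tagged t={t})")  # send to robot here
    if remaining == PREFETCH_AT and not (worker and worker.is_alive()):
        # Overlap inference with execution: the buffer of queued actions
        # (PREFETCH_AT * CONTROL_DT = 0.16 s) covers the 0.10 s latency.
        worker = threading.Thread(target=prefetch, args=(step,))
        worker.start()
    step += 1
    time.sleep(max(0.0, CONTROL_DT - (time.monotonic() - tick)))
```

Tagging every action with the absolute control timestep it targets is what makes the splice trivial: aligning consecutive chunks reduces to filtering out timesteps the control loop has already passed.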