Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution

Abstract

In this report, we introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and for fast, smooth real-time execution. The key to our method lies in a carefully designed training recipe and deployment strategy. Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, which gives it broad and generalizable action-generation capabilities while avoiding catastrophic forgetting of the visual-semantic knowledge of the underlying pre-trained VLM. During post-training, we propose several techniques for training the VLA model for asynchronous execution, addressing inference latency during real-robot rollouts. During deployment, we carefully align the timesteps of consecutive predicted action chunks to ensure continuous and seamless real-time rollouts. We evaluate Xiaomi-Robotics-0 extensively in simulation benchmarks and on two challenging real-robot tasks that require precise and dexterous bimanual manipulation. Results show that our method achieves state-of-the-art performance across all simulation benchmarks. Moreover, Xiaomi-Robotics-0 can roll out quickly and smoothly on real robots using a consumer-grade GPU, achieving high success rates and throughput on both real-robot tasks. To facilitate future research, code and model checkpoints are open-sourced at https://xiaomi-robotics-0.github.io
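
The abstract states that timesteps of consecutive action chunks are aligned at deployment but gives no details. As a rough illustration of what such alignment can look like in general, the sketch below assumes that each chunk is a list whose i-th action targets the control step at which the observation was taken plus i, and that inference takes a known number of control steps. The function name `align_chunk` and its parameters are hypothetical and are not taken from the released code; this is not a description of the authors' actual procedure.

```python
from typing import List, Sequence


def align_chunk(chunk: Sequence[int], obs_step: int, now_step: int) -> List[int]:
    """Drop the actions of a freshly predicted chunk that are already stale.

    Assumption (illustrative only): chunk[i] is the action intended for
    control step obs_step + i. Inference finishes at now_step, so the first
    now_step - obs_step actions refer to steps that have already passed.
    """
    delay = now_step - obs_step
    if delay < 0:
        raise ValueError("now_step must not precede obs_step")
    return list(chunk[delay:])


# Toy timeline: each action is labeled with the control step it targets.
old_chunk = list(range(0, 8))   # predicted from the observation at step 0
new_chunk = list(range(4, 12))  # predicted from the observation at step 4

# Inference for new_chunk takes 2 control steps, so it arrives at step 6.
# The robot keeps executing old_chunk for steps 0..5 while waiting,
# then switches to the aligned remainder of new_chunk.
executed = old_chunk[:6] + align_chunk(new_chunk, obs_step=4, now_step=6)
print(executed)
```

In this toy setting the executed sequence covers every control step exactly once, with no gap or repeat at the switch between chunks, which is the property that alignment is meant to provide.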
