

FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies

September 5, 2025
Authors: Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Otto, Rudolf Lioutikov
cs.AI

Abstract

Developing efficient Vision-Language-Action (VLA) policies is crucial for practical robotics deployment, yet current approaches face prohibitive computational costs and resource requirements. Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. We tackle this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to 50% of LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by 20% through modular adaptation. We integrate these advances into a novel 950M-parameter VLA called FLOWER. Pretrained in just 200 H100 GPU hours, FLOWER delivers competitive performance with bigger VLAs across 190 tasks spanning ten simulation and real-world benchmarks, and demonstrates robustness across diverse robotic embodiments. In addition, FLOWER achieves a new SoTA of 4.53 on the CALVIN ABC benchmark. Demos, code, and pretrained weights are available at https://intuitive-robots.github.io/flower_vla/.
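To make the AdaLN conditioning idea concrete, here is a minimal numpy sketch of adaptive layer normalization with a single, globally shared conditioning projection. The abstract does not give implementation details, so the class name `GlobalAdaLN`, the dimensions, and the interpretation that "Global" means one projection reused across all transformer blocks (rather than one per block, which is where the parameter savings would come from) are assumptions for illustration, not the paper's actual code.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class GlobalAdaLN:
    """Hypothetical sketch: one shared conditioning projection that
    every block reuses, instead of a separate AdaLN MLP per block."""

    def __init__(self, cond_dim, hidden_dim, rng):
        # Single linear map producing scale (gamma) and shift (beta).
        self.W = rng.standard_normal((cond_dim, 2 * hidden_dim)) * 0.02
        self.b = np.zeros(2 * hidden_dim)

    def __call__(self, x, c):
        # c: conditioning vector (e.g. timestep/action embedding).
        gamma, beta = np.split(c @ self.W + self.b, 2, axis=-1)
        # (1 + gamma) keeps the identity mapping at initialization.
        return (1 + gamma) * layer_norm(x) + beta
```

With N transformer blocks, per-block AdaLN would need roughly N times the parameters of this shared projection; sharing it is one plausible way a model could "cut parameters by 20% through modular adaptation" as the abstract claims.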