FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies
September 5, 2025
Authors: Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Otto, Rudolf Lioutikov
cs.AI
Abstract
Developing efficient Vision-Language-Action (VLA) policies is crucial for practical robotics deployment, yet current approaches face prohibitive computational costs and resource requirements. Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. We tackle this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to 50% of the LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by 20% through modular adaptation. We integrate these advances into a novel 950M-parameter VLA called FLOWER. Pretrained in just 200 H100 GPU hours, FLOWER delivers performance competitive with larger VLAs across 190 tasks spanning ten simulation and real-world benchmarks, and demonstrates robustness across diverse robotic embodiments. In addition, FLOWER achieves a new SoTA score of 4.53 on the CALVIN ABC benchmark. Demos, code, and pretrained weights are available at https://intuitive-robots.github.io/flower_vla/.
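The abstract names two architectural ideas without spelling them out, so a minimal sketch may help make them concrete. The PyTorch code below is an illustrative reading, not the authors' implementation: `encode_with_pruned_vlm` assumes intermediate-modality fusion means feeding hidden states from an intermediate LLM layer to the action head so the upper layers can be pruned, and `GlobalAdaLN` assumes the 20% parameter saving comes from sharing a single conditioning network across all diffusion blocks instead of learning per-block AdaLN parameters as in DiT. All names, dimensions, and the wiring are hypothetical.

```python
# Illustrative PyTorch sketch of the two efficiency ideas named in the abstract.
# Module names, sizes, and wiring are assumptions for exposition, not the
# authors' implementation.
import torch
import torch.nn as nn


def encode_with_pruned_vlm(vlm, inputs, keep_ratio=0.5):
    """Assumed reading of intermediate-modality fusion: take features from an
    intermediate LLM layer (HuggingFace-style API) so the upper layers can be
    pruned. A real pruned model would delete those layers outright rather than
    compute and discard them."""
    out = vlm(**inputs, output_hidden_states=True)
    keep = int(len(out.hidden_states) * keep_ratio)
    return out.hidden_states[keep]  # (batch, seq, dim) context for the action head


class GlobalAdaLN(nn.Module):
    """One conditioning network shared by ALL blocks. Per-block AdaLN (as in
    DiT) learns a separate modulation MLP per block; sharing a single global
    one is the assumed source of the ~20% parameter reduction."""

    def __init__(self, cond_dim: int, hidden_dim: int):
        super().__init__()
        # Emits scale/shift/gate for the attention and MLP sub-layers (6 tensors).
        self.proj = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 6 * hidden_dim))

    def forward(self, cond: torch.Tensor):
        # cond: (batch, cond_dim), e.g. flow timestep plus a pooled task embedding.
        return self.proj(cond).chunk(6, dim=-1)


class DiTBlock(nn.Module):
    """DiT-style block that consumes externally supplied modulation tensors."""

    def __init__(self, hidden_dim: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(hidden_dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, x, mod):
        # Broadcast the per-sample modulation over the sequence dimension.
        shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = (m.unsqueeze(1) for m in mod)
        h = self.norm1(x) * (1 + scale_a) + shift_a
        x = x + gate_a * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale_m) + shift_m
        return x + gate_m * self.mlp(h)


class FlowActionHead(nn.Module):
    """Flow-matching action head whose blocks all reuse one GlobalAdaLN."""

    def __init__(self, hidden_dim=512, n_heads=8, n_blocks=12, cond_dim=512):
        super().__init__()
        self.global_adaln = GlobalAdaLN(cond_dim, hidden_dim)  # shared, not per block
        self.blocks = nn.ModuleList(DiTBlock(hidden_dim, n_heads) for _ in range(n_blocks))

    def forward(self, noisy_actions, cond):
        mod = self.global_adaln(cond)  # computed once, reused by every block
        x = noisy_actions
        for blk in self.blocks:
            x = blk(x, mod)
        return x  # predicted velocity field over the action chunk
```

Under this reading, the parameters freed by pruning the LLM's upper layers fund a larger flow-matching head, while the shared modulation network keeps the head itself lean. The sketch shows the pruned-VLM encoder and the action head separately; how their features are fused (e.g., via cross-attention into the head) is a design choice of the paper and is not reproduced here.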