Magma: A Foundation Model for Multimodal AI Agents
February 18, 2025
Authors: Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao
cs.AI
Abstract
We present Magma, a foundation model that serves multimodal AI agentic tasks
in both the digital and physical worlds. Magma is a significant extension of
vision-language (VL) models in that it not only retains the VL understanding
ability (verbal intelligence) of the latter, but is also equipped with the
ability to plan and act in the visual-spatial world (spatial-temporal
intelligence) and complete agentic tasks ranging from UI navigation to robot
manipulation. To endow these agentic capabilities, Magma is pretrained on a
large amount of heterogeneous datasets spanning images, videos, and robotics
data, where the actionable visual objects (e.g., clickable buttons in GUI) in
images are labeled by Set-of-Mark (SoM) for action grounding, and the object
movements (e.g., the trace of human hands or robotic arms) in videos are
labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show
that SoM and ToM achieve strong synergy and facilitate the acquisition of
spatial-temporal intelligence in our Magma model, which is fundamental to a
wide range of tasks as shown in Fig. 1. In particular, Magma creates new
state-of-the-art results on UI navigation and robotic manipulation tasks,
outperforming previous models that are specifically tailored to these tasks. On
image and video-related multimodal tasks, Magma also compares favorably to
popular large multimodal models that are trained on much larger datasets. We
make our model and code public for reproducibility at
https://microsoft.github.io/Magma.