OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
July 16, 2024
Authors: Zehan Wang, Ziang Zhang, Hang Zhang, Luping Liu, Rongjie Huang, Xize Cheng, Hengshuang Zhao, Zhou Zhao
cs.AI
Abstract
Recently, human-computer interaction with various modalities has shown
promising applications, like GPT-4o and Gemini. Given the foundational role of
multimodal joint representation in understanding and generation pipelines,
high-quality omni joint representations would be a step toward co-processing
more diverse multimodal information. In this work, we present OmniBind,
large-scale multimodal joint representation models ranging in scale from 7
billion to 30 billion parameters, which support 3D, audio, image, and language
inputs. Due to the scarcity of data pairs across all modalities, instead of
training large models from scratch, we propose remapping and binding the spaces
of various pre-trained specialist models together. This approach enables
"scaling up" by indirectly increasing the model parameters and the amount of
seen data. To effectively integrate various spaces, we dynamically assign
weights to different spaces by learning routers with two objectives:
cross-modal overall alignment and language representation decoupling. Notably,
since binding and routing spaces both only require lightweight networks,
OmniBind is extremely training-efficient. Training the largest 30B model
requires only unpaired unimodal data and approximately 3 days on a single
node with 8 RTX 4090 GPUs. Extensive experiments demonstrate the versatility
and superiority
of OmniBind as an omni representation model, highlighting its great potential
for diverse applications, such as any-query and composable multimodal
understanding.
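The routing idea in the abstract (dynamically assigning weights to several pre-trained spaces) can be illustrated with a minimal sketch. The abstract does not say how the router is built or how the weighted spaces are merged, so everything here is an assumption made for illustration: a single linear router, softmax gating, a weighted sum of embeddings that have already been projected to a shared dimension, and the names `route_and_combine` and `router_weights`. It is not the authors' implementation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def route_and_combine(space_embeddings, router_weights):
    """Combine K embeddings of one input, one per pre-trained space.

    space_embeddings: K vectors of length d, assumed to be already
        remapped into a shared dimension.
    router_weights: K rows of length K * d, a hypothetical linear router
        that maps the concatenated embeddings to one logit per space.
    Returns the unit-norm combined embedding and the per-space weights.
    """
    k = len(space_embeddings)
    dim = len(space_embeddings[0])
    flat = [x for emb in space_embeddings for x in emb]
    logits = [sum(w * x for w, x in zip(row, flat)) for row in router_weights]
    weights = softmax(logits)
    # Assumption: spaces are merged by a weighted sum, then re-normalized.
    combined = [
        sum(weights[j] * space_embeddings[j][i] for j in range(k))
        for i in range(dim)
    ]
    return l2_normalize(combined), weights


def demo(seed=0, num_spaces=3, dim=4):
    """Run the sketch on random stand-in embeddings and router parameters."""
    rng = random.Random(seed)
    embeddings = [
        l2_normalize([rng.gauss(0, 1) for _ in range(dim)])
        for _ in range(num_spaces)
    ]
    router = [
        [rng.gauss(0, 0.1) for _ in range(num_spaces * dim)]
        for _ in range(num_spaces)
    ]
    return route_and_combine(embeddings, router)
```

Calling `demo()` returns a unit-norm combined vector and per-space weights that sum to 1. In a trained system the router parameters would be learned against the two objectives the abstract names (cross-modal overall alignment and language representation decoupling) rather than sampled at random as they are here.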