OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
July 16, 2024
Authors: Zehan Wang, Ziang Zhang, Hang Zhang, Luping Liu, Rongjie Huang, Xize Cheng, Hengshuang Zhao, Zhou Zhao
cs.AI
Abstract
Recently, human-computer interaction with various modalities has shown
promising applications, like GPT-4o and Gemini. Given the foundational role of
multimodal joint representation in understanding and generation pipelines,
high-quality omni joint representations would be a step toward co-processing
more diverse multimodal information. In this work, we present OmniBind, a
family of large-scale multimodal joint representation models ranging from 7
billion to 30 billion parameters that support 3D, audio, image, and language
inputs. Due to the scarcity of data pairs across all modalities, instead of
training large models from scratch, we propose remapping and binding the spaces
of various pre-trained specialist models together. This approach enables
"scaling up" by indirectly increasing the model parameters and the amount of
seen data. To effectively integrate various spaces, we dynamically assign
weights to different spaces by learning routers with two objectives:
cross-modal overall alignment and language representation decoupling. Notably,
since binding and routing spaces both only require lightweight networks,
OmniBind is extremely training-efficient. Training the largest 30B model
requires only unpaired unimodal data and approximately 3 days on a single
node with eight 4090 GPUs. Extensive experiments demonstrate the versatility and superiority
of OmniBind as an omni representation model, highlighting its great potential
for diverse applications, such as any-query and composable multimodal
understanding.
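
The routing idea described in the abstract (dynamically weighting several pre-trained embedding spaces and combining them into one representation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that each space has already been remapped to a shared dimension, that the router is a single linear layer followed by a softmax, and that the combination is a weighted sum. The names `route_and_combine`, `router_weight`, and `router_bias` are hypothetical, and the two training objectives (cross-modal overall alignment and language representation decoupling) are not shown.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def route_and_combine(space_embeddings, router_weight, router_bias):
    """Combine embeddings from K spaces using per-sample router weights.

    space_embeddings: array of shape (K, N, D), the embeddings of N samples
        in K spaces, assumed to be already remapped to a shared dimension D.
    router_weight: array of shape (K * D, K), a hypothetical linear router.
    router_bias: array of shape (K,).
    Returns the combined unit-norm embeddings (N, D) and the weights (N, K).
    """
    k, n, d = space_embeddings.shape
    # The router sees the concatenation of all K embeddings of a sample.
    router_input = space_embeddings.transpose(1, 0, 2).reshape(n, k * d)
    weights = softmax(router_input @ router_weight + router_bias, axis=-1)
    # Weighted sum over the K spaces, one weight per sample and space.
    combined = np.einsum("nk,knd->nd", weights, space_embeddings)
    combined = combined / np.linalg.norm(combined, axis=-1, keepdims=True)
    return combined, weights


# Toy data: 3 spaces, 4 samples, 8-dimensional embeddings (all made up).
rng = np.random.default_rng(0)
K, N, D = 3, 4, 8
spaces = rng.normal(size=(K, N, D))
W = rng.normal(size=(K * D, K)) * 0.1
b = np.zeros(K)

combined, weights = route_and_combine(spaces, W, b)
print(combined.shape, weights.shape)
```

In this sketch the router parameters are random; in a trained system they would be learned, and only these lightweight parameters (not the underlying specialist encoders) would need updating, which is consistent with the training efficiency the abstract reports.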