(Almost) Free Modality Stitching of Foundation Models
July 14, 2025
Authors: Jaisidh Singh, Diganta Misra, Boris Knyazev, Antonio Orvieto
cs.AI
Abstract
Foundation multi-modal models are often designed by stitching multiple
existing pretrained uni-modal models: for example, an image classifier with a
text model. This stitching process is performed by training a connector module
that aims to align the representation spaces of these uni-modal models towards
a multi-modal objective. However, given the complexity of training such
connectors on large-scale web-based datasets, coupled with the ever-increasing
number of available pretrained uni-modal models, the task of uni-modal model
selection and subsequent connector module training becomes computationally
demanding. To address this under-studied critical problem, we propose
Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal
uni-modal model selection and connector training that leverages hypernetworks.
Specifically, our framework utilizes the parameter prediction capability of a
hypernetwork to obtain jointly trained connector modules for N × M
combinations of uni-modal models. In our experiments, Hyma reduces the cost of
searching for the best-performing uni-modal model pair by 10×, while
matching the ranking and trained connector performance obtained via grid search
across a suite of diverse multi-modal benchmarks.
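
The abstract does not specify Hyma's connector parameterization or hypernetwork architecture; the sketch below only illustrates the core idea of a single hypernetwork predicting connector weights for every (image model, text model) pair. All names (`ConnectorHypernetwork`), the choice of a linear connector, and the embedding-conditioned MLP hypernetwork are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class ConnectorHypernetwork(nn.Module):
    """One hypernetwork serving all N x M uni-modal model pairs (sketch)."""

    def __init__(self, n_img_models, n_txt_models, img_dim, txt_dim, hidden=256):
        super().__init__()
        self.img_dim, self.txt_dim = img_dim, txt_dim
        # One learnable embedding per uni-modal model; conditioning only on
        # the pair identity is an assumption of this sketch.
        self.img_emb = nn.Embedding(n_img_models, hidden)
        self.txt_emb = nn.Embedding(n_txt_models, hidden)
        # MLP head emitting the flattened weights and bias of a linear
        # connector that maps image features (img_dim) into text space (txt_dim).
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, img_dim * txt_dim + txt_dim),
        )

    def forward(self, img_id, txt_id, img_feats):
        # Predict connector parameters for the selected (image, text) pair.
        z = torch.cat([self.img_emb(img_id), self.txt_emb(txt_id)], dim=-1)
        params = self.mlp(z)
        W = params[: self.img_dim * self.txt_dim].view(self.txt_dim, self.img_dim)
        b = params[self.img_dim * self.txt_dim:]
        # Apply the predicted linear connector to the image features.
        return img_feats @ W.T + b

# Toy usage: a single hypernetwork covers N x M = 4 x 3 pairs, so training can
# sample a different pair each step instead of fitting 12 separate connectors.
hyper = ConnectorHypernetwork(n_img_models=4, n_txt_models=3,
                              img_dim=768, txt_dim=512)
img_feats = torch.randn(8, 768)  # batch of features from image model 2
aligned = hyper(torch.tensor(2), torch.tensor(1), img_feats)  # pair (2, 1)
print(aligned.shape)  # torch.Size([8, 512])
```

In this reading, grid search would train one connector per pair at full cost, while the hypernetwork amortizes training across all pairs through the shared MLP, which is consistent with the reported 10× reduction in search cost.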