(Almost) Free Modality Stitching of Foundation Models
July 14, 2025
Authors: Jaisidh Singh, Diganta Misra, Boris Knyazev, Antonio Orvieto
cs.AI
Abstract
Foundation multi-modal models are often designed by stitching together multiple
existing pretrained uni-modal models: for example, an image classifier with a
text model. This stitching is performed by training a connector module that
aims to align the representation spaces of these uni-modal models towards a
multi-modal objective. However, given the complexity of training such
connectors on large-scale web-based datasets, coupled with the ever-increasing
number of available pretrained uni-modal models, the task of uni-modal model
selection and subsequent connector module training becomes computationally
demanding. To address this under-studied but critical problem, we propose
Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal
uni-modal model selection and connector training that leverages hypernetworks.
Specifically, our framework utilizes the parameter prediction capability of a
hypernetwork to obtain jointly trained connector modules for N × M combinations
of uni-modal models. In our experiments, Hyma reduces the cost of searching for
the best-performing uni-modal model pair by 10×, while matching the ranking and
trained connector performance obtained via grid search across a suite of
diverse multi-modal benchmarks.
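The core idea described above — one hypernetwork predicting the parameters of a connector for every (image model, text model) pair — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, the linear connector form, and the learned per-pair embedding used to condition the hypernetwork are all assumptions for the sake of the example.

```python
import numpy as np

# Illustrative sketch (assumed setup, not Hyma's actual architecture):
# a hypernetwork maps a learned embedding of each uni-modal model pair
# to the flattened weights of a linear connector for that pair.

rng = np.random.default_rng(0)

N, M = 3, 2            # number of candidate image / text encoders
d_img, d_txt = 8, 6    # hypothetical feature dims of the two modalities
n_pairs = N * M

d_emb = 4
pair_emb = rng.normal(size=(n_pairs, d_emb))        # learned pair codes
H = rng.normal(size=(d_emb, d_img * d_txt)) * 0.1   # hypernetwork weights

def predict_connector(pair_idx):
    """Predict the linear connector W for one (image, text) model pair."""
    flat = pair_emb[pair_idx] @ H
    return flat.reshape(d_txt, d_img)

# One shared hypernetwork yields connectors for all N x M pairs,
# so a single training run covers every combination jointly.
connectors = [predict_connector(k) for k in range(n_pairs)]

img_feat = rng.normal(size=d_img)    # dummy image-encoder output
aligned = connectors[0] @ img_feat   # projected into the text space
print(aligned.shape)                 # (6,)
```

Because only the hypernetwork and pair embeddings are trained, the gradient signal from any one pair's alignment loss updates parameters shared across all pairs, which is what amortizes the cost of the grid search.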