Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
May 29, 2024
Authors: Vicky Zayats, Peter Chen, Melissa Merrari, Dirk Padfield
cs.AI
Abstract
Integrating multiple generative foundation models, especially those trained
on different modalities, into something greater than the sum of its parts poses
significant challenges. Two key hurdles are the availability of aligned data
(concepts that carry similar meaning but are expressed differently in
different modalities), and effectively leveraging unimodal representations in
cross-domain generative tasks, without compromising their original unimodal
capabilities.
We propose Zipper, a multi-tower decoder architecture that addresses these
concerns by using cross-attention to flexibly compose multimodal generative
models from independently pre-trained unimodal decoders. In our experiments
fusing speech and text modalities, we show the proposed architecture performs
very competitively in scenarios with limited aligned text-speech data. We also
showcase the flexibility of our model to selectively maintain unimodal
generation performance (e.g., text-to-text generation) by freezing the
corresponding modal tower (e.g., text). In cross-modal tasks such as automatic speech
recognition (ASR) where the output modality is text, we show that freezing the
text backbone results in negligible performance degradation. In cross-modal
tasks such as text-to-speech generation (TTS) where the output modality is
speech, we show that using a pre-trained speech backbone results in superior
performance to the baseline.
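
The core idea, one decoder tower attending over another tower's hidden states through cross-attention while a chosen tower stays frozen, can be illustrated with a toy sketch. This is not the authors' implementation: the dimensions, random weights, single attention head, residual fusion, and all names (`cross_attention`, `text_hidden`, `speech_hidden`, `frozen`) are assumptions made for illustration, and the causal masking a real decoder would need is omitted.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, memory, Wq, Wk, Wv):
    """Each query position attends over all hidden states in `memory`."""
    keys = [matvec(Wk, h) for h in memory]
    values = [matvec(Wv, h) for h in memory]
    scale = math.sqrt(len(Wq))  # len(Wq) is the attention dimension
    out = []
    for h in queries:
        q = matvec(Wq, h)
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        )
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_text, d_speech, d_attn = 4, 6, 3  # hypothetical hidden sizes

# Stand-ins for hidden states produced by two pre-trained unimodal towers.
text_hidden = [[rng.uniform(-1, 1) for _ in range(d_text)] for _ in range(5)]
speech_hidden = [[rng.uniform(-1, 1) for _ in range(d_speech)] for _ in range(7)]

# Cross-attention parameters that bridge the two towers.
Wq = random_matrix(d_attn, d_text, rng)    # projects text states to queries
Wk = random_matrix(d_attn, d_speech, rng)  # projects speech states to keys
Wv = random_matrix(d_text, d_speech, rng)  # projects speech states to text width

# The text tower reads from the speech tower; the result is added residually.
context = cross_attention(text_hidden, speech_hidden, Wq, Wk, Wv)
fused = [[t + c for t, c in zip(h, ctx)] for h, ctx in zip(text_hidden, context)]

# Freezing the text tower: only the remaining parameter groups would be updated.
frozen = {"text_tower"}
trainable = [name for name in ("text_tower", "speech_tower", "cross_attention")
             if name not in frozen]
print(trainable)
```

In this sketch the frozen tower's own weights never appear in `trainable`, which mirrors the abstract's point that freezing a tower preserves its unimodal behavior while the cross-attention parameters learn the bridge between modalities.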