Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
May 29, 2024
Authors: Vicky Zayats, Peter Chen, Melissa Ferrari, Dirk Padfield
cs.AI
Abstract
Integrating multiple generative foundation models, especially those trained on different modalities, into something greater than the sum of its parts poses significant challenges. Two key hurdles are the availability of aligned data (concepts that carry similar meaning but are expressed differently in different modalities), and effectively leveraging unimodal representations in cross-domain generative tasks without compromising their original unimodal capabilities.
We propose Zipper, a multi-tower decoder architecture that addresses these concerns by using cross-attention to flexibly compose multimodal generative models from independently pre-trained unimodal decoders. In our experiments fusing the speech and text modalities, we show that the proposed architecture performs very competitively in scenarios with limited aligned text-speech data. We also showcase the flexibility of our model to selectively maintain unimodal generation performance (e.g., text-to-text) by freezing the corresponding modality tower (e.g., text). In cross-modal tasks such as automatic speech recognition (ASR), where the output modality is text, we show that freezing the text backbone results in negligible performance degradation. In cross-modal tasks such as text-to-speech generation (TTS), where the output modality is speech, we show that using a pre-trained speech backbone yields superior performance to the baseline.
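To make the fusion mechanism concrete, below is a minimal PyTorch sketch of how two independently pre-trained decoder towers could be composed with interleaved cross-attention. All names here (`CrossAttnFuse`, `ZipperDecoder`, `fuse_every`) and the simplifying assumptions (equal depth and hidden size across towers, bidirectional cross-attention every other layer) are ours for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn


class CrossAttnFuse(nn.Module):
    """Residual cross-attention from one tower into the other's hidden states."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Queries come from this tower; keys/values from the other modality.
        fused, _ = self.attn(self.norm(x), context, context)
        return x + fused


class ZipperDecoder(nn.Module):
    """Two pre-trained unimodal decoder towers, cross-attending to each
    other every `fuse_every` layers (depth and width assumed equal)."""

    def __init__(self, text_layers, speech_layers, dim: int, fuse_every: int = 2):
        super().__init__()
        assert len(text_layers) == len(speech_layers)
        self.text_layers = nn.ModuleList(text_layers)      # pre-trained text decoder blocks
        self.speech_layers = nn.ModuleList(speech_layers)  # pre-trained speech decoder blocks
        self.fuse_every = fuse_every
        n_fuse = len(text_layers) // fuse_every
        # Only these cross-attention modules are newly initialized and trained.
        self.text_xattn = nn.ModuleList(CrossAttnFuse(dim) for _ in range(n_fuse))
        self.speech_xattn = nn.ModuleList(CrossAttnFuse(dim) for _ in range(n_fuse))

    def forward(self, text_h: torch.Tensor, speech_h: torch.Tensor):
        k = 0
        for i, (t_layer, s_layer) in enumerate(zip(self.text_layers, self.speech_layers)):
            text_h, speech_h = t_layer(text_h), s_layer(speech_h)
            if (i + 1) % self.fuse_every == 0:
                # Each tower attends to the other's pre-fusion states;
                # the right-hand side is evaluated before assignment.
                text_h, speech_h = (
                    self.text_xattn[k](text_h, speech_h),
                    self.speech_xattn[k](speech_h, text_h),
                )
                k += 1
        return text_h, speech_h
```

In this sketch, preserving unimodal text performance (as in the abstract's ASR setting) amounts to freezing the text tower so that only the cross-attention modules, and optionally the speech tower, receive gradients:

```python
for p in zipper.text_layers.parameters():
    p.requires_grad = False
```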