Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities

Abstract

Integrating multiple generative foundation models, especially those trained on different modalities, into something greater than the sum of its parts poses significant challenges. Two key hurdles are the availability of aligned data (concepts that carry similar meaning but are expressed differently in different modalities) and effectively leveraging unimodal representations in cross-domain generative tasks without compromising their original unimodal capabilities. We propose Zipper, a multi-tower decoder architecture that addresses these concerns by using cross-attention to flexibly compose multimodal generative models from independently pre-trained unimodal decoders. In our experiments fusing speech and text modalities, we show that the proposed architecture performs very competitively in scenarios with limited aligned text-speech data. We also showcase the flexibility of our model to selectively maintain unimodal generation performance (e.g., text-to-text generation) by freezing the corresponding modal tower (e.g., text). In cross-modal tasks such as automatic speech recognition (ASR), where the output modality is text, we show that freezing the text backbone results in negligible performance degradation. In cross-modal tasks such as text-to-speech generation (TTS), where the output modality is speech, we show that using a pre-trained speech backbone results in superior performance to the baseline.
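
To make the idea of composing two decoder towers through cross-attention more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single attention head, no causal masking, and a plain residual connection; the names `cross_attention`, `text_states`, `speech_states`, and the projection matrices `w_q`, `w_k`, `w_v` are hypothetical and chosen only for illustration. The hidden states are random placeholders standing in for the outputs of a text tower and a speech tower.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def cross_attention(query_states, context_states, w_q, w_k, w_v):
    """Single-head cross-attention: one tower's states attend to the other's.

    query_states:   [T_q, d_q] hidden states of the tower being updated
    context_states: [T_c, d_c] hidden states of the other tower
    The projections map both towers into a shared width d.
    """
    q = matmul(query_states, w_q)    # [T_q, d]
    k = matmul(context_states, w_k)  # [T_c, d]
    v = matmul(context_states, w_v)  # [T_c, d]
    d = len(q[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d) for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v)) for j in range(d)])
    return out


def rand_matrix(rows, cols):
    """Random projection matrix; a stand-in for learned weights."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_text, d_speech = 4, 3  # the two towers may have different widths

# Placeholder hidden states (assumption: 5 text positions, 2 speech positions).
text_states = rand_matrix(5, d_text)
speech_states = rand_matrix(2, d_speech)

# Hypothetical projections; the shared width is set to the speech width so the
# attended output can be added back to the speech states directly.
w_q = rand_matrix(d_speech, d_speech)
w_k = rand_matrix(d_text, d_speech)
w_v = rand_matrix(d_text, d_speech)

attended = cross_attention(speech_states, text_states, w_q, w_k, w_v)

# Residual fusion: speech states enriched with information from the text tower.
fused = [[h + a for h, a in zip(h_row, a_row)]
         for h_row, a_row in zip(speech_states, attended)]
print(len(fused), len(fused[0]))
```

In this sketch the text tower's states are only read, never modified, which loosely mirrors the frozen-tower setting described in the abstract: the frozen tower's parameters would stay fixed while only the cross-attention projections (and optionally the other tower) are trained.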
