Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
November 9, 2023
Authors: AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova
cs.AI
Abstract
One of the main challenges of multimodal learning is the need to combine
heterogeneous modalities (e.g., video, audio, text). For example, video and
audio are obtained at much higher rates than text and are roughly aligned in
time. They are often not synchronized with text, which comes as a global
context, e.g., a title, or a description. Furthermore, video and audio inputs
are of much larger volumes, and grow as the video length increases, which
naturally requires more compute dedicated to these modalities and makes
modeling of long-range dependencies harder.
We here decouple the multimodal modeling, dividing it into separate, focused
autoregressive models, processing the inputs according to the characteristics
of the modalities. We propose a multimodal model, called Mirasol3B, consisting
of an autoregressive component for the time-synchronized modalities (audio and
video), and an autoregressive component for the context modalities which are
not necessarily aligned in time but are still sequential. To address the
long sequences of the video-audio inputs, we propose to further partition the
video and audio sequences into consecutive snippets and autoregressively process
their representations. To that end, we propose a Combiner mechanism, which
models the audio-video information jointly within a timeframe. The Combiner
learns to extract audio and video features from raw spatio-temporal signals,
and then learns to fuse these features producing compact but expressive
representations per snippet.
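The abstract gives no code, but the snippet partitioning and the Combiner can be illustrated with a short sketch. The sketch below is a minimal interpretation in PyTorch: the class name Combiner, the latent-token cross-attention design, and all hyperparameters (embedding size, number of latents, number of heads, number of snippets) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class Combiner(nn.Module):
    """Sketch of a Combiner: fuses per-snippet audio and video tokens into a
    small, fixed set of learned latent tokens (assumed design, for illustration)."""
    def __init__(self, dim=512, num_latents=8, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (B, Nv, dim); audio_tokens: (B, Na, dim)
        tokens = torch.cat([video_tokens, audio_tokens], dim=1)
        queries = self.latents.unsqueeze(0).expand(tokens.size(0), -1, -1)
        fused, _ = self.attn(queries, tokens, tokens)  # latents cross-attend to all tokens
        return fused + self.ff(fused)                  # (B, num_latents, dim)

def combine_snippets(video_tokens, audio_tokens, combiner, num_snippets):
    """Partition the audio and video token streams into consecutive snippets and
    run the Combiner on each, yielding compact per-snippet representations."""
    v_chunks = video_tokens.chunk(num_snippets, dim=1)
    a_chunks = audio_tokens.chunk(num_snippets, dim=1)
    return torch.stack([combiner(v, a) for v, a in zip(v_chunks, a_chunks)], dim=1)
    # -> (B, num_snippets, num_latents, dim)

# Example (shapes only, hypothetical sizes): 4 snippets of 128 video + 32 audio tokens
# combiner = Combiner(dim=512)
# video = torch.randn(2, 4 * 128, 512); audio = torch.randn(2, 4 * 32, 512)
# out = combine_snippets(video, audio, combiner, num_snippets=4)  # (2, 4, 8, 512)
```

In this reading, each snippet of the long audio-video stream is compressed to a small number of latent vectors, so the sequence consumed by the time-aligned autoregressive component grows with the number of snippets rather than with the raw frame and audio token counts.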
Our approach achieves the state of the art on well-established multimodal
benchmarks, outperforming much larger models. It effectively addresses the high
computational demand of media inputs by learning compact representations,
controlling the sequence length of the audio-video feature representations, and
modeling their dependencies in time.