Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
November 9, 2023
Authors: AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova
cs.AI
Abstract
One of the main challenges of multimodal learning is the need to combine
heterogeneous modalities (e.g., video, audio, text). For example, video and
audio are obtained at much higher rates than text and are roughly aligned in
time. They are often not synchronized with text, which comes as a global
context, e.g., a title, or a description. Furthermore, video and audio inputs
are of much larger volumes, and grow as the video length increases, which
naturally requires more compute dedicated to these modalities and makes
modeling of long-range dependencies harder.
We here decouple the multimodal modeling, dividing it into separate, focused
autoregressive models, processing the inputs according to the characteristics
of the modalities. We propose a multimodal model, called Mirasol3B, consisting
of an autoregressive component for the time-synchronized modalities (audio and
video), and an autoregressive component for the context modalities which are
not necessarily aligned in time but are still sequential. To address the
long sequences of the video-audio inputs, we propose to further partition the
video and audio sequences into consecutive snippets and autoregressively process
their representations. To that end, we propose a Combiner mechanism, which
models the audio-video information jointly within a timeframe. The Combiner
learns to extract audio and video features from raw spatio-temporal signals,
and then learns to fuse these features producing compact but expressive
representations per snippet.
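The abstract gives no code, but the snippet partitioning and the Combiner can be illustrated with a short sketch. The sketch below is a minimal interpretation in PyTorch: the class name Combiner, the latent-token cross-attention design, and all hyperparameters (embedding size, number of latents, number of heads, number of snippets) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class Combiner(nn.Module):
    """Sketch of a Combiner: fuses per-snippet audio and video tokens into a
    small, fixed set of learned latent tokens (assumed design, for illustration)."""
    def __init__(self, dim=512, num_latents=8, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (B, Nv, dim); audio_tokens: (B, Na, dim)
        tokens = torch.cat([video_tokens, audio_tokens], dim=1)
        queries = self.latents.unsqueeze(0).expand(tokens.size(0), -1, -1)
        fused, _ = self.attn(queries, tokens, tokens)  # latents cross-attend to all tokens
        return fused + self.ff(fused)                  # (B, num_latents, dim)

def combine_snippets(video_tokens, audio_tokens, combiner, num_snippets):
    """Partition the audio and video token streams into consecutive snippets and
    run the Combiner on each, yielding compact per-snippet representations."""
    v_chunks = video_tokens.chunk(num_snippets, dim=1)
    a_chunks = audio_tokens.chunk(num_snippets, dim=1)
    return torch.stack([combiner(v, a) for v, a in zip(v_chunks, a_chunks)], dim=1)
    # -> (B, num_snippets, num_latents, dim)

# Example (shapes only, hypothetical sizes): 4 snippets of 128 video + 32 audio tokens
# combiner = Combiner(dim=512)
# video = torch.randn(2, 4 * 128, 512); audio = torch.randn(2, 4 * 32, 512)
# out = combine_snippets(video, audio, combiner, num_snippets=4)  # (2, 4, 8, 512)
```

In this reading, each snippet of the long audio-video stream is compressed to a small number of latent vectors, so the sequence consumed by the time-aligned autoregressive component grows with the number of snippets rather than with the raw frame and audio token counts.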
Our approach achieves the state of the art on well-established multimodal
benchmarks, outperforming much larger models. It effectively addresses the high
computational demand of media inputs by learning compact representations,
controlling the sequence length of the audio-video feature representations, and
modeling their dependencies in time.