Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
November 9, 2023
Authors: AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova
cs.AI
Abstract
One of the main challenges of multimodal learning is the need to combine
heterogeneous modalities (e.g., video, audio, text). For example, video and
audio are obtained at much higher rates than text and are roughly aligned in
time. They are often not synchronized with text, which comes as a global
context, e.g., a title or a description. Furthermore, video and audio inputs
are of much larger volumes, and grow as the video length increases, which
naturally requires more compute dedicated to these modalities and makes
modeling of long-range dependencies harder.
Here we decouple the multimodal modeling, dividing it into separate, focused
autoregressive models, processing the inputs according to the characteristics
of the modalities. We propose a multimodal model, called Mirasol3B, consisting
of an autoregressive component for the time-synchronized modalities (audio and
video), and an autoregressive component for the context modalities which are
not necessarily aligned in time but are still sequential. To address the
long sequences of the video-audio inputs, we propose to further partition the
video and audio sequences into consecutive snippets and autoregressively process
their representations. To that end, we propose a Combiner mechanism, which
models the audio-video information jointly within a timeframe. The Combiner
learns to extract audio and video features from raw spatio-temporal signals,
and then learns to fuse these features, producing compact but expressive
representations per snippet.
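
To make the partition-then-fuse idea concrete, here is a minimal sketch, not the authors' implementation, assuming pre-tokenized, time-aligned video and audio features. The `Combiner` class, the `combine_snippets` helper, and parameters such as `num_latents` and `snippet_len` are illustrative names chosen for this sketch.

```python
# Hedged sketch: partition time-aligned video/audio token streams into
# consecutive snippets and fuse each snippet into a few latent vectors.
# This is an illustrative implementation, not the paper's actual code.
import torch
import torch.nn as nn


class Combiner(nn.Module):
    """Fuses the audio and video tokens of one snippet into `num_latents` vectors."""

    def __init__(self, dim: int, num_latents: int, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        # Learned latent queries that will hold the fused snippet representation.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, num_heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, Nv, D), audio_tokens: (B, Na, D)
        B = video_tokens.size(0)
        latents = self.latents.unsqueeze(0).expand(B, -1, -1)        # (B, L, D)
        x = torch.cat([latents, video_tokens, audio_tokens], dim=1)  # joint attention over both modalities
        x = self.encoder(x)
        return x[:, : self.latents.size(0)]                          # keep only the fused latents


def combine_snippets(video: torch.Tensor, audio: torch.Tensor,
                     combiner: Combiner, snippet_len: int) -> torch.Tensor:
    """Partition the streams into consecutive snippets and fuse each one.

    video: (B, T, Nv, D), audio: (B, T, Na, D), T = number of time steps.
    Returns (B, num_snippets, num_latents, D): a short sequence that the
    autoregressive audio-video component can consume.
    """
    B, T, _, D = video.shape
    outputs = []
    for start in range(0, T, snippet_len):
        v = video[:, start:start + snippet_len].reshape(B, -1, D)
        a = audio[:, start:start + snippet_len].reshape(B, -1, D)
        outputs.append(combiner(v, a))
    return torch.stack(outputs, dim=1)
```

Because only a fixed, small number of latents per snippet is passed on, the sequence length seen by the autoregressive audio-video component stays bounded as the video grows, which is the property the abstract emphasizes.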
Our approach achieves state-of-the-art results on well-established multimodal
benchmarks, outperforming much larger models. It effectively addresses the high
computational demand of media inputs by learning compact representations,
controlling the sequence length of the audio-video feature representations, and
modeling their dependencies in time.