Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
November 9, 2023
Authors: AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova
cs.AI
Abstract
One of the main challenges of multimodal learning is the need to combine
heterogeneous modalities (e.g., video, audio, text). For example, video and
audio are obtained at much higher rates than text and are roughly aligned in
time. They are often not synchronized with text, which comes as a global
context, e.g., a title or a description. Furthermore, video and audio inputs
are of much larger volumes, and grow as the video length increases, which
naturally requires more compute dedicated to these modalities and makes
modeling of long-range dependencies harder.
Here we decouple the multimodal modeling, dividing it into separate, focused
autoregressive models, processing the inputs according to the characteristics
of the modalities. We propose a multimodal model, called Mirasol3B, consisting
of an autoregressive component for the time-synchronized modalities (audio and
video), and an autoregressive component for the context modalities which are
not necessarily aligned in time but are still sequential. To address the
long sequences of the video-audio inputs, we propose to further partition the
video and audio sequences into consecutive snippets and autoregressively process
their representations. To that end, we propose a Combiner mechanism, which
models the audio-video information jointly within a timeframe. The Combiner
learns to extract audio and video features from raw spatio-temporal signals,
and then learns to fuse these features, producing compact but expressive
representations per snippet.
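
To make the partition-then-fuse idea concrete, here is a minimal sketch, not the authors' implementation, assuming pre-tokenized, time-aligned video and audio features. The `Combiner` class, the `combine_snippets` helper, and parameters such as `num_latents` and `snippet_len` are illustrative names chosen for this sketch.

```python
# Hedged sketch: partition time-aligned video/audio token streams into
# consecutive snippets and fuse each snippet into a few latent vectors.
# This is an illustrative implementation, not the paper's actual code.
import torch
import torch.nn as nn


class Combiner(nn.Module):
    """Fuses the audio and video tokens of one snippet into `num_latents` vectors."""

    def __init__(self, dim: int, num_latents: int, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        # Learned latent queries that will hold the fused snippet representation.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, num_heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, Nv, D), audio_tokens: (B, Na, D)
        B = video_tokens.size(0)
        latents = self.latents.unsqueeze(0).expand(B, -1, -1)        # (B, L, D)
        x = torch.cat([latents, video_tokens, audio_tokens], dim=1)  # joint attention over both modalities
        x = self.encoder(x)
        return x[:, : self.latents.size(0)]                          # keep only the fused latents


def combine_snippets(video: torch.Tensor, audio: torch.Tensor,
                     combiner: Combiner, snippet_len: int) -> torch.Tensor:
    """Partition the streams into consecutive snippets and fuse each one.

    video: (B, T, Nv, D), audio: (B, T, Na, D), T = number of time steps.
    Returns (B, num_snippets, num_latents, D): a short sequence that the
    autoregressive audio-video component can consume.
    """
    B, T, _, D = video.shape
    outputs = []
    for start in range(0, T, snippet_len):
        v = video[:, start:start + snippet_len].reshape(B, -1, D)
        a = audio[:, start:start + snippet_len].reshape(B, -1, D)
        outputs.append(combiner(v, a))
    return torch.stack(outputs, dim=1)
```

Because only a fixed, small number of latents per snippet is passed on, the sequence length seen by the autoregressive audio-video component stays bounded as the video grows, which is the property the abstract emphasizes.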
Our approach achieves state-of-the-art results on well-established multimodal
benchmarks, outperforming much larger models. It effectively addresses the high
computational demand of media inputs by learning compact representations,
controlling the sequence length of the audio-video feature representations, and
modeling their dependencies in time.