Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities

Abstract

One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as global context, e.g., a title or a description. Furthermore, video and audio inputs are of much larger volume and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder. Here, we decouple the multimodal modeling, dividing it into separate, focused autoregressive models that process the inputs according to the characteristics of the modalities. We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video) and an autoregressive component for the context modalities, which are not necessarily aligned in time but are still sequential. To address the long sequences of the video-audio inputs, we propose to further partition the video and audio sequences into consecutive snippets and autoregressively process their representations. To that end, we propose a Combiner mechanism, which models the audio-video information jointly within a timeframe. The Combiner learns to extract audio and video features from raw spatio-temporal signals, and then learns to fuse these features, producing compact but expressive representations per snippet. Our approach achieves state-of-the-art results on well-established multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demand of media inputs by learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their dependencies in time.
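The snippet partitioning and per-snippet compression described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `partition_into_snippets`, `toy_combiner`, `compress`, and `autoregressive_pairs` are hypothetical, mean pooling stands in for the learned Combiner, and the sketch assumes the video and audio features are already extracted, share one feature dimension, and have the same number of time steps.

```python
from typing import Iterator, List, Sequence, Tuple

Vector = List[float]


def partition_into_snippets(features: Sequence[Vector], snippet_len: int) -> List[List[Vector]]:
    """Split a time-ordered feature sequence into consecutive, non-overlapping snippets."""
    if snippet_len <= 0:
        raise ValueError("snippet_len must be positive")
    return [list(features[i:i + snippet_len]) for i in range(0, len(features), snippet_len)]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def toy_combiner(video: Sequence[Vector], audio: Sequence[Vector], num_tokens: int) -> List[Vector]:
    """Hypothetical stand-in for the learned Combiner (assumption: plain mean pooling).

    Pools the joint video+audio features of one snippet into at most num_tokens vectors.
    """
    joint = list(video) + list(audio)
    n = len(joint)
    k = min(num_tokens, n)
    # Contiguous groups with boundaries at i * n // k; each group is non-empty because k <= n.
    return [mean_pool(joint[i * n // k:(i + 1) * n // k]) for i in range(k)]


def compress(video: Sequence[Vector], audio: Sequence[Vector],
             snippet_len: int, num_tokens: int) -> List[List[Vector]]:
    """Partition both modalities into aligned snippets and compress each snippet jointly."""
    video_snips = partition_into_snippets(video, snippet_len)
    audio_snips = partition_into_snippets(audio, snippet_len)
    return [toy_combiner(v, a, num_tokens) for v, a in zip(video_snips, audio_snips)]


def autoregressive_pairs(snippets: List[List[Vector]]) -> Iterator[Tuple[List[List[Vector]], List[Vector]]]:
    """Yield (context, target) pairs: snippet t is conditioned only on snippets 0..t-1."""
    for t in range(len(snippets)):
        yield snippets[:t], snippets[t]


# Toy input: 8 time steps per modality, 2-dimensional features.
video = [[float(t), 0.0] for t in range(8)]
audio = [[0.0, float(t)] for t in range(8)]
compressed = compress(video, audio, snippet_len=4, num_tokens=2)
print(len(compressed), [len(s) for s in compressed])
```

In this sketch the sequence seen by the autoregressive stage has length (number of snippets) x `num_tokens`, independent of how many raw frames each snippet holds, which is the sense in which a compact per-snippet representation controls sequence length.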
