OMCAT: Omni Context Aware Transformer

Abstract

Large Language Models (LLMs) have made significant strides in text generation and comprehension, with recent advancements extending into multimodal LLMs that integrate visual and audio inputs. However, these models continue to struggle with fine-grained, cross-modal temporal understanding, particularly when correlating events across audio and video streams. We address these challenges with two key contributions: a new dataset and a new model, called OCTAV and OMCAT respectively. First, OCTAV (Omni Context and Temporal Audio Video) is a novel dataset designed to capture event transitions across audio and video. Second, OMCAT (Omni Context Aware Transformer) is a powerful model that leverages RoTE (Rotary Time Embeddings), an innovative extension of RoPE, to enhance temporal grounding and computational efficiency in time-anchored tasks. Through a robust three-stage training pipeline (feature alignment, instruction tuning, and OCTAV-specific training), OMCAT excels in cross-modal temporal understanding. Our model demonstrates state-of-the-art performance on Audio-Visual Question Answering (AVQA) tasks and the OCTAV benchmark, showing significant gains in temporal reasoning and cross-modal alignment, as validated through comprehensive experiments and ablation studies. Our dataset and code will be made publicly available. The link to our demo page is https://om-cat.github.io.
