OMCAT: Omni Context Aware Transformer

Abstract

Large Language Models (LLMs) have made significant strides in text generation and comprehension, and recent work has extended them into multimodal LLMs that integrate visual and audio inputs. However, these models still struggle with fine-grained, cross-modal temporal understanding, particularly when correlating events across audio and video streams. We address these challenges with two key contributions: a new dataset, OCTAV, and a new model, OMCAT. First, OCTAV (Omni Context and Temporal Audio Video) is a dataset designed to capture event transitions across audio and video. Second, OMCAT (Omni Context Aware Transformer) is a model that leverages RoTE (Rotary Time Embeddings), an extension of RoPE, to enhance temporal grounding and computational efficiency in time-anchored tasks. Through a three-stage training pipeline (feature alignment, instruction tuning, and OCTAV-specific training), OMCAT excels in cross-modal temporal understanding. Our model achieves state-of-the-art performance on Audio-Visual Question Answering (AVQA) tasks and the OCTAV benchmark, with significant gains in temporal reasoning and cross-modal alignment validated through comprehensive experiments and ablation studies. Our dataset and code will be made publicly available. The demo page is at https://om-cat.github.io.
