OMCAT: Omni Context Aware Transformer
October 15, 2024
Authors: Arushi Goel, Karan Sapra, Matthieu Le, Rafael Valle, Andrew Tao, Bryan Catanzaro
cs.AI
Abstract
Large Language Models (LLMs) have made significant strides in text generation
and comprehension, with recent advancements extending into multimodal LLMs that
integrate visual and audio inputs. However, these models continue to struggle
with fine-grained, cross-modal temporal understanding, particularly when
correlating events across audio and video streams. We address these challenges
with two key contributions: a new dataset and a new model, called OCTAV and
OMCAT, respectively. First, OCTAV (Omni Context and Temporal Audio Video) is a
novel dataset designed to capture event transitions across audio and video.
Second, OMCAT
(Omni Context Aware Transformer) is a powerful model that leverages RoTE
(Rotary Time Embeddings), an innovative extension of RoPE, to enhance temporal
grounding and computational efficiency in time-anchored tasks. Through a robust
three-stage training pipeline (feature alignment, instruction tuning, and
OCTAV-specific training), OMCAT excels in cross-modal temporal understanding. Our
model demonstrates state-of-the-art performance on Audio-Visual Question
Answering (AVQA) tasks and the OCTAV benchmark, showcasing significant gains in
temporal reasoning and cross-modal alignment, as validated through
comprehensive experiments and ablation studies. Our dataset and code will be
made publicly available. The link to our demo page is https://om-cat.github.io.
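
The abstract describes RoTE as an extension of RoPE but does not spell out the mechanism. As a rough illustration only, the sketch below implements the standard RoPE rotation in plain Python and then reuses it with a timestamp in seconds in place of the integer token index. That time-based substitution is an assumption made here for illustration, not a description of the authors' method; the names rotate_pairs and dot and the sample vectors are hypothetical.

```python
import math


def rotate_pairs(vec, position, base=10000.0):
    """Apply a RoPE-style rotation to `vec` (even length) at a scalar position."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        # Standard RoPE frequency schedule: pair i/2 rotates at base^(-i/dim).
        theta = base ** (-i / dim)
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        # 2D rotation of the (x, y) pair by `angle`.
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query/key vectors.
q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.1]

# Standard RoPE: the position is the integer token index.
score_by_index = dot(rotate_pairs(q, 5), rotate_pairs(k, 3))

# Assumed time-based variant: the position is a timestamp in seconds.
score_by_time = dot(rotate_pairs(q, 2.5), rotate_pairs(k, 0.5))
```

In both cases the score depends only on the difference between the two positions, which is the property that makes rotary embeddings attractive for relative ordering; with timestamps as positions, that difference would be an elapsed time rather than a token offset.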