AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
October 4, 2024
作者: Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jeng-Neng Hwang, Saining Xie, Christopher D. Manning
cs.AI
Abstract
Video detailed captioning is a key task which aims to generate comprehensive
and coherent textual descriptions of video content, benefiting both video
understanding and generation. In this paper, we propose AuroraCap, a video
captioner based on a large multimodal model. We follow the simplest
architecture design without additional parameters for temporal modeling. To
address the overhead caused by lengthy video sequences, we implement the token
merging strategy, reducing the number of input visual tokens. Surprisingly, we
find that this strategy results in little performance loss. AuroraCap shows
superior performance on various video and image captioning benchmarks, for
example, obtaining a CIDEr of 88.9 on Flickr30k, beating GPT-4V (55.3) and
Gemini-1.5 Pro (82.2). However, existing video caption benchmarks only include
simple descriptions, consisting of a few dozen words, which limits research in
this field. Therefore, we develop VDC, a video detailed captioning benchmark
with over one thousand carefully annotated structured captions. In addition, we
propose VDCscore, a new LLM-assisted metric for improved evaluation, which
adopts a divide-and-conquer strategy to transform long-caption evaluation into
multiple short question-answer pairs. With the help of human Elo ranking, our
experiments show that this benchmark better correlates with human judgments of
video detailed captioning quality.
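The token merging idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the greedy adjacent-pair averaging below is a hypothetical simplification, standing in for the bipartite matching used in practical token-merging schemes, but it shows how the visual token count can be reduced before the tokens reach the language model.

```python
import numpy as np

def merge_tokens(tokens, target_n):
    """Greedily average the most similar adjacent token pair until only
    target_n tokens remain (toy simplification of token merging)."""
    tokens = [t.astype(float) for t in tokens]
    while len(tokens) > target_n:
        # Cosine similarity between each adjacent pair of tokens.
        sims = [
            np.dot(tokens[i], tokens[i + 1])
            / (np.linalg.norm(tokens[i]) * np.linalg.norm(tokens[i + 1]) + 1e-8)
            for i in range(len(tokens) - 1)
        ]
        i = int(np.argmax(sims))            # most redundant pair
        merged = (tokens[i] + tokens[i + 1]) / 2.0
        tokens[i : i + 2] = [merged]        # replace the pair with its mean
    return tokens

# Reduce 8 random 16-dim "visual tokens" to 4.
rng = np.random.default_rng(0)
toks = [rng.standard_normal(16) for _ in range(8)]
out = merge_tokens(toks, 4)
print(len(out))  # 4
```

Because merging only averages similar neighbors, most of the information the language model attends to is preserved, which is consistent with the small performance loss reported in the abstract.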
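The divide-and-conquer evaluation behind VDCscore can likewise be sketched. In this hypothetical simplification, the reference caption has already been decomposed into short (question, answer) pairs, and `answer_fn` stands in for the LLM that answers each question from the candidate caption; exact string matching replaces the LLM-based correctness judgment used by the actual metric.

```python
def vdcscore(qa_pairs, answer_fn):
    """Toy sketch of divide-and-conquer caption scoring: answer each short
    question from the candidate caption and report the fraction correct.
    Exact-match comparison is a simplification of LLM-judged matching."""
    correct = sum(
        answer_fn(q).strip().lower() == a.strip().lower() for q, a in qa_pairs
    )
    return correct / len(qa_pairs)

# Hypothetical QA pairs derived from a reference caption.
qa = [
    ("What animal appears in the video?", "a dog"),
    ("What is the dog doing?", "catching a frisbee"),
    ("Where does the scene take place?", "a park"),
]
# Stub answerer that gets two of the three questions right.
answers = {
    qa[0][0]: "a dog",
    qa[1][0]: "catching a frisbee",
    qa[2][0]: "a beach",
}
score = vdcscore(qa, lambda q: answers[q])
print(round(score, 2))  # 0.67
```

Breaking one long caption into many short, independently checkable answers is what lets the metric correlate with human Elo rankings better than a single holistic comparison of two long paragraphs.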