Wolf: Captioning Everything with a World Summarization Framework
July 26, 2024
Authors: Boyi Li, Ligeng Zhu, Ran Tian, Shuhan Tan, Yuxiao Chen, Yao Lu, Yin Cui, Sushant Veer, Max Ehrlich, Jonah Philion, Xinshuo Weng, Fuzhao Xue, Andrew Tao, Ming-Yu Liu, Sanja Fidler, Boris Ivanovic, Trevor Darrell, Jitendra Malik, Song Han, Marco Pavone
cs.AI
Abstract
We propose Wolf, a WOrLd summarization Framework for accurate video
captioning. Wolf is an automated captioning framework that adopts a
mixture-of-experts approach, leveraging complementary strengths of Vision
Language Models (VLMs). By utilizing both image and video models, our framework
captures different levels of information and summarizes them efficiently. Our
approach can be applied to enhance video understanding, auto-labeling, and
captioning. To evaluate caption quality, we introduce CapScore, an LLM-based
metric to assess the similarity and quality of generated captions compared to
the ground truth captions. We further build four human-annotated datasets in
three domains: autonomous driving, general scenes, and robotics, to facilitate
comprehensive comparisons. We show that Wolf achieves superior captioning
performance compared to state-of-the-art approaches from the research community
(VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For
instance, in comparison with GPT-4V, Wolf improves CapScore both quality-wise
by 55.6% and similarity-wise by 77.4% on challenging driving videos. Finally,
we establish a benchmark for video captioning and introduce a leaderboard,
aiming to accelerate advancements in video understanding, captioning, and data
alignment. Leaderboard: https://wolfv0.github.io/leaderboard.html.
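
The abstract describes collecting captions from several image and video VLMs and summarizing them into one caption. The following is a minimal sketch of that flow under stated assumptions, not the authors' implementation: the `Captioner` and `Summarizer` callables, the function names, and the prompt wording are all hypothetical placeholders for whatever models and prompts the framework actually uses.

```python
from typing import Callable, Sequence

# Hypothetical interfaces (assumptions, not from the paper):
# a captioner maps a video path to a caption string,
# a summarizer maps a text prompt to a summary string.
Captioner = Callable[[str], str]
Summarizer = Callable[[str], str]


def build_summary_prompt(captions: Sequence[str]) -> str:
    """Combine captions from several models into one summarization prompt."""
    numbered = "\n".join(
        f"{i}. {caption.strip()}" for i, caption in enumerate(captions, start=1)
    )
    # The prompt wording is illustrative only.
    return (
        "Merge the following captions of the same video into one "
        "consistent caption:\n" + numbered
    )


def caption_video(
    video_path: str,
    captioners: Sequence[Captioner],
    summarizer: Summarizer,
) -> str:
    """Caption a video with every model, then summarize the captions."""
    captions = [captioner(video_path) for captioner in captioners]
    return summarizer(build_summary_prompt(captions))
```

Passing the models in as plain callables keeps the sketch independent of any particular VLM or LLM API; in practice each callable would wrap a real image-level or video-level model.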