Wolf: Captioning Everything with a World Summarization Framework

Abstract

We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging the complementary strengths of Vision Language Models (VLMs). By using both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhance video understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an LLM-based metric that assesses the similarity and quality of generated captions against ground-truth captions. We further build four human-annotated datasets in three domains (autonomous driving, general scenes, and robotics) to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research community (VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, compared with GPT-4V on challenging driving videos, Wolf improves CapScore by 55.6% in quality and by 77.4% in similarity. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment. Leaderboard: https://wolfv0.github.io/leaderboard.html.
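The mixture-of-experts idea described in the abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that each expert is a callable mapping a video path to a caption and that a separate summarizer callable merges the per-expert captions. The names `Captioner`, `Summarizer`, `caption_with_experts`, and the stub experts are illustrative only; they are not Wolf's actual API, models, or prompts.

```python
from typing import Callable, Dict

# Hypothetical types: an expert maps a video path to a caption string,
# and a summarizer merges the per-expert captions into one caption.
Captioner = Callable[[str], str]
Summarizer = Callable[[Dict[str, str]], str]


def caption_with_experts(
    video_path: str,
    experts: Dict[str, Captioner],
    summarizer: Summarizer,
) -> str:
    """Collect one caption per expert, then merge them with the summarizer."""
    if not experts:
        raise ValueError("at least one expert is required")
    captions = {name: expert(video_path) for name, expert in experts.items()}
    return summarizer(captions)


# Stub experts standing in for real image-level and video-level VLMs.
def image_expert(video_path: str) -> str:
    return "A car waits at an intersection."


def video_expert(video_path: str) -> str:
    return "The car turns left after the light changes."


# Stub summarizer (assumption): simple concatenation in a fixed order.
# A real system would presumably prompt an LLM to fuse the captions.
def join_summarizer(captions: Dict[str, str]) -> str:
    return " ".join(captions[name] for name in sorted(captions))


summary = caption_with_experts(
    "clip.mp4",
    {"image": image_expert, "video": video_expert},
    join_summarizer,
)
print(summary)
```

Keeping the experts and the summarizer as plain callables means either side can be swapped (for example, replacing the stub summarizer with an LLM call) without changing the orchestration function.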
