Wolf: Captioning Everything with a World Summarization Framework

Abstract

We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhance video understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an LLM-based metric to assess the similarity and quality of generated captions compared to the ground truth captions. We further build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research community (VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, in comparison with GPT-4V, Wolf improves CapScore both quality-wise by 55.6% and similarity-wise by 77.4% on challenging driving videos. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment. Leaderboard: https://wolfv0.github.io/leaderboard.html.
