WikiVideo: Article Generation from Multiple Videos

Abstract

We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text, and existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for the articles' claims. The benchmark facilitates the integration of video into RAG pipelines and enables the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher-level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods. The results also suggest intriguing avenues for future work.

