WikiVideo: Article Generation from Multiple Videos

Abstract

We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text, and existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for the articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher-level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods; the results also suggest intriguing avenues for future work.
