Paper2Video: Automatic Video Generation from Scientific Papers

Abstract

Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2- to 10-minute video. Unlike natural video, presentation video generation poses distinctive challenges: the input is a research paper, the information is dense and multimodal (text, figures, tables), and multiple aligned channels such as slides, subtitles, speech, and a human talker must be coordinated. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics (Meta Similarity, PresentArena, PresentQuiz, and IP Memory) to measure how well videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with effective layout refinement via a novel tree search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.
