

Paper2Video: Automatic Video Generation from Scientific Papers

October 6, 2025
Authors: Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou
cs.AI

Abstract

Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a video only 2 to 10 minutes long. Unlike natural video, presentation video generation poses distinctive challenges: research papers as input, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and a human speaker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics (Meta Similarity, PresentArena, PresentQuiz, and IP Memory) to measure how well a video conveys the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation and layout refinement via a novel tree-search visual choice with cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, a practical step toward automated, ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.
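
The abstract's one concrete efficiency claim is that slide-wise generation is parallelized: each slide's layout refinement, subtitles, speech, and talking-head clip can be produced independently and the clips concatenated at the end. Below is a minimal Python sketch of that fan-out structure under stated assumptions; every stage function and name here is a hypothetical placeholder, not the PaperTalker API.

```python
# Minimal sketch of slide-wise parallel generation. All stage functions
# are hypothetical placeholders; only the overall shape (independent
# per-slide work fanned out in parallel, then joined in slide order)
# follows the abstract's high-level description.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class SlideClip:
    index: int
    video_path: str


def refine_layout(slide_draft: str) -> str:
    # Placeholder: the paper describes a tree-search visual choice that
    # scores candidate layouts and keeps the best one.
    return slide_draft


def make_subtitles(slide: str) -> str:
    return f"Narration for: {slide}"


def synthesize_speech(subtitles: str) -> bytes:
    return subtitles.encode()  # stand-in for a TTS call


def render_clip(index: int, slide: str, audio: bytes) -> SlideClip:
    return SlideClip(index, f"clip_{index}.mp4")  # stand-in for rendering


def build_clip(index: int, slide_draft: str) -> SlideClip:
    """Run the per-slide stages; slides do not depend on one another,
    so each slide can be processed independently."""
    slide = refine_layout(slide_draft)
    subs = make_subtitles(slide)
    audio = synthesize_speech(subs)
    return render_clip(index, slide, audio)


def generate_video(slide_drafts: list[str]) -> list[SlideClip]:
    # Fan out one task per slide; map() returns results in input order,
    # so the clips come back ready to concatenate into the final video.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(build_clip, range(len(slide_drafts)), slide_drafts))


if __name__ == "__main__":
    for clip in generate_video(["Intro", "Method", "Results"]):
        print(clip)
```

Thread-based fan-out fits here because the per-slide stages would mostly wait on external calls (model inference, TTS, rendering), and keeping each slide's pipeline self-contained is what makes the parallelization safe.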