PresentAgent: Multimodal Agent for Presentation Video Generation
July 5, 2025
Authors: Jingwei Shi, Zeyu Zhang, Biao Wu, Yanjie Liang, Meng Fang, Ling Chen, Yang Zhao
cs.AI
Abstract
We present PresentAgent, a multimodal agent that transforms long-form
documents into narrated presentation videos. While existing approaches are
limited to generating static slides or text summaries, our method advances
beyond these limitations by producing fully synchronized visual and spoken
content that closely mimics human-style presentations. To achieve this
integration, PresentAgent employs a modular pipeline that systematically
segments the input document, plans and renders slide-style visual frames,
generates contextual spoken narration with large language models and
Text-to-Speech models, and seamlessly composes the final video with precise
audio-visual alignment. Because such multimodal outputs are hard to evaluate,
we introduce PresentEval, a unified assessment framework powered by
Vision-Language Models that uses prompt-based evaluation to score videos along
three dimensions: content fidelity, visual clarity, and audience comprehension.
Our experimental validation on a curated
dataset of 30 document-presentation pairs demonstrates that PresentAgent
approaches human-level quality across all evaluation metrics. These results
highlight the significant potential of controllable multimodal agents in
transforming static textual materials into dynamic, effective, and accessible
presentation formats. Code will be available at
https://github.com/AIGeeksGroup/PresentAgent.
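
The modular pipeline described in the abstract (segment the document, plan slides, generate narration, align audio with visuals) can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`Segment`, `SlideClip`, `segment_document`, `draft_narration`, `estimate_duration`, `build_timeline`) is hypothetical, the narration step is a stand-in for a large language model, and the duration estimate is a stand-in for a Text-to-Speech model, using an assumed speaking rate of 2.5 words per second. The sketch only shows how each slide's display window could be tied to the length of its narration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Segment:
    title: str
    body: str


@dataclass
class SlideClip:
    title: str
    narration: str
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video


def segment_document(text: str) -> list[Segment]:
    """Split on blank lines; the first line of each block is treated as its title."""
    segments = []
    for block in text.strip().split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if lines:
            segments.append(Segment(title=lines[0], body=" ".join(lines[1:])))
    return segments


def draft_narration(segment: Segment) -> str:
    """Hypothetical stand-in for the LLM step: reads the segment out verbatim."""
    return f"{segment.title}. {segment.body}".strip()


def estimate_duration(narration: str, words_per_second: float = 2.5) -> float:
    """Hypothetical stand-in for TTS: estimates audio length from the word count."""
    return len(narration.split()) / words_per_second


def build_timeline(text: str) -> list[SlideClip]:
    """Give each slide a time window equal to the length of its narration."""
    clips: list[SlideClip] = []
    cursor = 0.0
    for seg in segment_document(text):
        narration = draft_narration(seg)
        duration = estimate_duration(narration)
        clips.append(SlideClip(seg.title, narration, cursor, cursor + duration))
        cursor += duration
    return clips
```

In a real system the estimated durations would be replaced by the measured lengths of the synthesized audio files, and the resulting timeline would drive the final video composition step.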