PresentAgent: Multimodal Agent for Presentation Video Generation

Abstract

We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. Existing approaches are limited to generating static slides or text summaries; our method goes beyond them by producing fully synchronized visual and spoken content that closely mimics human-style presentations. To achieve this, PresentAgent employs a modular pipeline that segments the input document, plans and renders slide-style visual frames, generates contextual spoken narration with large language models and text-to-speech models, and composes the final video with precise audio-visual alignment. Because evaluating such multimodal outputs is difficult, we introduce PresentEval, a unified assessment framework powered by Vision-Language Models that uses prompt-based evaluation to score videos along three dimensions: content fidelity, visual clarity, and audience comprehension. Experiments on a curated dataset of 30 document-presentation pairs show that PresentAgent approaches human-level quality across all evaluation metrics. These results highlight the potential of controllable multimodal agents for transforming static textual materials into dynamic, effective, and accessible presentation formats. Code will be available at https://github.com/AIGeeksGroup/PresentAgent.
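
The modular pipeline described in the abstract (segment, plan slides, narrate, align audio with visuals) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the PresentAgent code: the names `Slide`, `segment_document`, `plan_slide`, `narrate`, `estimate_duration`, and `build_timeline` are hypothetical, the narration function stands in for an LLM call, and the duration estimate stands in for measuring real text-to-speech audio.

```python
from dataclasses import dataclass


@dataclass
class Slide:
    title: str
    bullets: list[str]
    narration: str = ""
    start: float = 0.0  # seconds into the video when the slide appears
    end: float = 0.0    # seconds into the video when the slide is replaced


def segment_document(text: str) -> list[str]:
    """Split on blank lines; a stand-in for real document segmentation."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def plan_slide(segment: str) -> Slide:
    """Hypothetical planner: first sentence is the title, the rest are bullets."""
    sentences = [s.strip() for s in segment.split(".") if s.strip()]
    if not sentences:
        # Segment contained no sentence text (for example only punctuation).
        return Slide(title=segment, bullets=[])
    return Slide(title=sentences[0], bullets=sentences[1:])


def narrate(slide: Slide) -> str:
    """Placeholder for the LLM call that would write the spoken script."""
    return " ".join([slide.title + "."] + [b + "." for b in slide.bullets])


def estimate_duration(narration: str, words_per_minute: float = 150.0) -> float:
    """Placeholder for TTS: a real system would measure the synthesized audio."""
    return len(narration.split()) / words_per_minute * 60.0


def build_timeline(text: str) -> list[Slide]:
    """Give each slide an on-screen interval equal to its narration length."""
    slides = []
    clock = 0.0
    for segment in segment_document(text):
        slide = plan_slide(segment)
        slide.narration = narrate(slide)
        slide.start = clock
        clock += estimate_duration(slide.narration)
        slide.end = clock
        slides.append(slide)
    return slides
```

The point of the sketch is the alignment step: each slide's display interval is derived from the length of its narration, so visuals and speech advance together. A real system would replace the duration estimate with the measured length of each synthesized audio clip before composing the video.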
