PresentAgent: Multimodal Agent for Presentation Video Generation

Abstract

We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. Existing approaches are limited to generating static slides or text summaries; our method goes further by producing fully synchronized visual and spoken content that closely mimics human-style presentations. To achieve this integration, PresentAgent employs a modular pipeline that systematically segments the input document, plans and renders slide-style visual frames, generates contextual spoken narration with large language models and Text-to-Speech models, and composes the final video with precise audio-visual alignment. Because such multimodal outputs are difficult to evaluate, we introduce PresentEval, a unified assessment framework powered by Vision-Language Models that uses prompt-based evaluation to score videos along three dimensions: content fidelity, visual clarity, and audience comprehension. Our experimental validation on a curated dataset of 30 document-presentation pairs demonstrates that PresentAgent approaches human-level quality across all evaluation metrics. These results highlight the significant potential of controllable multimodal agents in transforming static textual materials into dynamic, effective, and accessible presentation formats. Code will be available at https://github.com/AIGeeksGroup/PresentAgent.
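
The pipeline stages named in the abstract (segment, plan slides, narrate, align) can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`Slide`, `segment`, `plan_slide`, `estimate_duration`, `build_timeline`) is hypothetical, the blank-line segmentation and first-sentence title rule are placeholders for LLM-driven steps, and the narration length is estimated from an assumed speaking rate of 150 words per minute instead of being measured from real Text-to-Speech audio.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Slide:
    title: str
    bullets: list[str]
    narration: str
    start: float = 0.0  # seconds from the start of the video
    end: float = 0.0


def segment(document: str) -> list[str]:
    """Split on blank lines (placeholder for the document segmentation step)."""
    return [part.strip() for part in document.split("\n\n") if part.strip()]


def plan_slide(section: str) -> Slide:
    """Placeholder planner: first sentence becomes the title, the rest bullets.

    A real system would ask an LLM to plan the slide and write the narration.
    """
    text = section.replace("\n", " ")
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    title = sentences[0].rstrip(".")
    bullets = [s.rstrip(".") for s in sentences[1:]]
    return Slide(title=title, bullets=bullets, narration=text)


def estimate_duration(narration: str, words_per_minute: float = 150.0) -> float:
    """Assumed stand-in for TTS: estimate audio length from the word count."""
    return len(narration.split()) / words_per_minute * 60.0


def build_timeline(document: str) -> list[Slide]:
    """Give each slide a time window equal to its narration length."""
    slides = [plan_slide(section) for section in segment(document)]
    clock = 0.0
    for slide in slides:
        slide.start = clock
        clock += estimate_duration(slide.narration)
        slide.end = clock
    return slides


if __name__ == "__main__":
    doc = "Intro here. One two three.\n\nSecond part. More words follow here."
    for s in build_timeline(doc):
        print(f"{s.start:6.2f}-{s.end:6.2f}s  {s.title}  {s.bullets}")
```

Making each slide's display window exactly as long as its narration is one simple way to keep audio and visuals aligned; with real TTS output, the measured clip duration would replace `estimate_duration`.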
