PresentAgent: Multimodal Agent for Presentation Video Generation

Abstract

We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. Existing approaches are limited to generating static slides or text summaries; our method instead produces fully synchronized visual and spoken content that closely mimics human-style presentations. To achieve this, PresentAgent employs a modular pipeline that segments the input document, plans and renders slide-style visual frames, generates contextual spoken narration with large language models and Text-to-Speech models, and composes the final video with precise audio-visual alignment. Because such multimodal outputs are difficult to evaluate, we introduce PresentEval, a unified assessment framework powered by Vision-Language Models that uses prompt-based evaluation to score videos along three dimensions: content fidelity, visual clarity, and audience comprehension. Experiments on a curated dataset of 30 document-presentation pairs show that PresentAgent approaches human-level quality across all evaluation metrics. These results highlight the potential of controllable multimodal agents for transforming static textual materials into dynamic, effective, and accessible presentation formats. Code will be available at https://github.com/AIGeeksGroup/PresentAgent.
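
The pipeline stages named in the abstract (segmentation, slide planning, narration, audio-visual alignment) can be illustrated with a minimal Python sketch. It is not the PresentAgent implementation: every name here (`Segment`, `segment_document`, `plan_segment`, `estimate_duration`, `align`, `build_timeline`) is hypothetical, the LLM and Text-to-Speech steps are replaced by simple stand-ins, and the speaking rate of 2.5 words per second is an assumed value. The sketch only shows how per-slide narration lengths could determine when each slide appears in the video.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Segment:
    """One slide's worth of content (hypothetical structure)."""
    title: str
    narration: str
    start: float = 0.0  # seconds into the final video
    end: float = 0.0


def segment_document(text: str) -> list[str]:
    """Split a document at blank lines (a stand-in for real segmentation)."""
    return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]


def plan_segment(chunk: str) -> Segment:
    """Stand-in for the LLM step: the first line becomes the slide title and
    the whitespace-normalized chunk becomes the narration script."""
    first_line = chunk.splitlines()[0]
    return Segment(title=first_line[:60], narration=" ".join(chunk.split()))


def estimate_duration(narration: str, words_per_second: float = 2.5) -> float:
    """Stand-in for TTS: guess audio length from word count (assumed rate)."""
    return len(narration.split()) / words_per_second


def align(segments: list[Segment]) -> list[Segment]:
    """Assign each slide the time span of its narration, back to back."""
    clock = 0.0
    for seg in segments:
        seg.start = clock
        clock += estimate_duration(seg.narration)
        seg.end = clock
    return segments


def build_timeline(text: str) -> list[Segment]:
    """Run the stand-in stages in order: segment, plan, align."""
    return align([plan_segment(c) for c in segment_document(text)])
```

In a real system the duration would presumably come from the synthesized audio file rather than a word count, and the resulting `start`/`end` values would drive how long each rendered slide frame stays on screen.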
