Paper2Video: Automatic Video Generation from Scientific Papers

Abstract

Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short video of 2 to 10 minutes. Unlike natural video, presentation video generation involves distinctive challenges: inputs drawn from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and a human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics (Meta Similarity, PresentArena, PresentQuiz, and IP Memory) to measure how well videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation, with layout refinement by a novel tree search visual choice, together with cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than those of existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.
