PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation

Abstract

Slides are a critical medium for conveying information in presentation-oriented settings such as academia, education, and business. Despite their importance, creating high-quality slide decks remains time-consuming and cognitively demanding. Recent advances in generative models, such as Nano Banana Pro, have made automated slide generation increasingly feasible. However, existing evaluations of slide generation are often coarse-grained and rely on holistic judgments, which makes it difficult to assess model capabilities accurately or to track meaningful progress in the field. In practice, the lack of fine-grained, verifiable evaluation criteria is a critical bottleneck for both research and real-world deployment. In this paper, we propose PresentBench, a fine-grained, rubric-based benchmark for evaluating automated slide generation in real-world scenarios. It contains 238 evaluation instances, each accompanied by the background materials required for slide creation. In addition, we manually design an average of 54.1 checklist items per instance, each formulated as a binary question, to enable fine-grained, instance-specific evaluation of the generated slide decks. Extensive experiments show that PresentBench provides more reliable evaluation results than existing methods and exhibits significantly stronger alignment with human preferences. Furthermore, our benchmark reveals that NotebookLM significantly outperforms other slide generation methods, highlighting substantial recent progress in this domain.
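
To make the checklist-based evaluation concrete, the following minimal Python sketch shows one plausible way to aggregate binary checklist answers into scores. It is an illustration only: the abstract does not specify the aggregation rule, so the unweighted pass rate used here, as well as the names `ChecklistItem`, `instance_score`, and `benchmark_score`, are assumptions rather than the benchmark's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChecklistItem:
    """One binary rubric question and its yes/no verdict (hypothetical structure)."""
    question: str
    passed: bool


def instance_score(items: List[ChecklistItem]) -> float:
    """Fraction of checklist items answered 'yes' for one generated slide deck.

    Assumption: every item carries equal weight.
    """
    if not items:
        raise ValueError("an instance needs at least one checklist item")
    return sum(item.passed for item in items) / len(items)


def benchmark_score(instances: List[List[ChecklistItem]]) -> float:
    """Unweighted mean of per-instance scores (assumed aggregation)."""
    if not instances:
        raise ValueError("the benchmark needs at least one instance")
    return sum(instance_score(items) for items in instances) / len(instances)


# Made-up example data, not taken from the benchmark.
deck_a = [
    ChecklistItem("Does the deck include a title slide?", True),
    ChecklistItem("Is the main result stated on a dedicated slide?", False),
    ChecklistItem("Are all figures captioned?", True),
    ChecklistItem("Does the deck end with a summary slide?", True),
]
deck_b = [
    ChecklistItem("Is the method described before the results?", True),
    ChecklistItem("Are the reported numbers consistent with the source?", False),
]

score_a = instance_score(deck_a)
score_b = instance_score(deck_b)
overall = benchmark_score([deck_a, deck_b])
```

Because each item is a yes/no question, a per-instance score of this kind can be traced back to the specific checks a deck failed, which is the kind of verifiability that holistic ratings lack.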