Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations

Abstract

Promoting academic papers has become an important means of increasing research visibility. However, existing automated methods struggle with limited storytelling, insufficient aesthetic quality, and constrained self-adjustment, which makes efficient and engaging dissemination difficult. At the heart of these challenges is a simple principle: you cannot improve what you cannot evaluate accurately. To address this, we introduce EvoPresent, a self-improvement agent framework that unifies coherent narratives, aesthetic-aware designs, and realistic presentation delivery via virtual characters. Central to EvoPresent is PresAesth, a multi-task reinforcement learning (RL) aesthetic model that provides reliable aesthetic scoring, defect adjustment, and comparative feedback, enabling iterative self-improvement even with limited aesthetic training data. To systematically evaluate these methods, we introduce the EvoPresent Benchmark, a comprehensive benchmark comprising two parts: Presentation Generation Quality, built on 650 top-tier AI conference papers with multimodal resources (slides, videos, and scripts) to assess both content and design; and Aesthetic Awareness, consisting of 2,000 slide pairs with varying aesthetic levels, supporting joint training and evaluation on scoring, defect adjustment, and comparison. Our findings highlight that (i) high-quality feedback is essential for agent self-improvement, while initial capability alone does not guarantee effective self-correction; (ii) automated generation pipelines exhibit a trade-off between visual design and content construction; and (iii) multi-task RL training shows stronger generalization in aesthetic awareness tasks.
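The evaluate-then-improve principle can be illustrated with a minimal sketch of a feedback-driven revision loop. This is not the authors' implementation: the abstract does not describe PresAesth's interface, so `Feedback`, `self_improve`, `toy_evaluate`, `toy_revise`, the 0 to 10 score scale, and the threshold are all hypothetical names and values chosen for illustration. The loop only shows how scoring, defect reports, and a comparison step might combine to drive iterative refinement.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Feedback:
    """Hypothetical evaluator output; the real PresAesth output format is not given."""
    score: float         # aesthetic score, higher is better (0-10 scale is an assumption)
    defects: List[str]   # human-readable defect descriptions


def self_improve(
    slide: str,
    evaluate: Callable[[str], Feedback],
    revise: Callable[[str, Feedback], str],
    target_score: float = 8.0,
    max_rounds: int = 3,
) -> Tuple[str, Feedback]:
    """Revise `slide` until it reaches `target_score` or `max_rounds` is exhausted.

    The best-scoring version seen so far is kept, so a revision that makes
    the slide worse is never returned.
    """
    best_slide, best_fb = slide, evaluate(slide)
    for _ in range(max_rounds):
        if best_fb.score >= target_score:
            break
        candidate = revise(best_slide, best_fb)
        candidate_fb = evaluate(candidate)
        # Comparison step: accept the candidate only if it scores strictly higher.
        if candidate_fb.score > best_fb.score:
            best_slide, best_fb = candidate, candidate_fb
    return best_slide, best_fb


# Toy stand-ins for the aesthetic model and the slide generator (illustrative only).
# Each defect is represented by a marker string embedded in the slide text.
MARKERS: Dict[str, str] = {
    "font too small": "TINY FONT",
    "layout cluttered": "CLUTTER",
}


def toy_evaluate(slide: str) -> Feedback:
    """Report every defect whose marker appears; deduct 3 points per defect."""
    defects = [name for name, marker in MARKERS.items() if marker in slide]
    return Feedback(score=10.0 - 3.0 * len(defects), defects=defects)


def toy_revise(slide: str, feedback: Feedback) -> str:
    """Fix the first reported defect by removing its marker."""
    if not feedback.defects:
        return slide
    return slide.replace(MARKERS[feedback.defects[0]], "").strip()


if __name__ == "__main__":
    final_slide, final_fb = self_improve(
        "Results TINY FONT CLUTTER", toy_evaluate, toy_revise
    )
    print(final_slide, final_fb)
```

In this sketch the quality of `evaluate` bounds what the loop can achieve: if the evaluator misses a defect or scores noisily, `revise` has nothing reliable to act on, which mirrors finding (i) above.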