Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

Abstract

Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters; (ii) Textual Coherence: language fluency; (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge; and, notably, (iv) PaperQuiz: the poster's ability to convey core paper content, as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser distills the paper into a structured asset library; the (b) Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c) Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs, though visually appealing at first glance, often exhibit noisy text and poor PaperQuiz scores. We also find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g., based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics while using 87% fewer tokens. The pipeline transforms a 22-page paper into a finalized yet editable .pptx poster for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.
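
To make the idea of a binary-tree layout concrete, the following is a minimal illustrative sketch, not the PosterAgent implementation. The abstract does not specify how the tree is built, so everything here is an assumption: the `Panel` class, its `weight` field (a hypothetical estimate of how much space a panel's content needs), the alternating cut direction, and the balancing rule are all invented for illustration. The sketch recursively splits an ordered list of panels into two contiguous groups of roughly equal weight, which keeps reading order intact while dividing the canvas in proportion to content.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x, y, width, height) of a rectangle on the poster canvas
Rect = Tuple[float, float, float, float]


@dataclass
class Panel:
    name: str
    weight: float  # hypothetical content-size estimate; must be positive


def binary_tree_layout(
    panels: List[Panel], rect: Rect, vertical_cut: bool = True
) -> List[Tuple[str, Rect]]:
    """Recursively split `rect` among `panels`, preserving their order.

    Assumption: cuts alternate between vertical and horizontal, and each
    cut divides the area in proportion to the weights on either side.
    """
    if len(panels) == 1:
        return [(panels[0].name, rect)]

    total = sum(p.weight for p in panels)

    # Pick the split index whose two contiguous groups are most balanced.
    best_k, best_diff, running = 1, float("inf"), 0.0
    for k in range(1, len(panels)):
        running += panels[k - 1].weight
        diff = abs(total - 2 * running)
        if diff < best_diff:
            best_k, best_diff = k, diff

    first, second = panels[:best_k], panels[best_k:]
    frac = sum(p.weight for p in first) / total

    x, y, w, h = rect
    if vertical_cut:  # side-by-side children
        r1 = (x, y, w * frac, h)
        r2 = (x + w * frac, y, w * (1 - frac), h)
    else:  # stacked children
        r1 = (x, y, w, h * frac)
        r2 = (x, y + h * frac, w, h * (1 - frac))

    return (
        binary_tree_layout(first, r1, not vertical_cut)
        + binary_tree_layout(second, r2, not vertical_cut)
    )


if __name__ == "__main__":
    sections = [Panel("Intro", 1.0), Panel("Method", 1.0), Panel("Results", 2.0)]
    for name, box in binary_tree_layout(sections, (0.0, 0.0, 48.0, 36.0)):
        print(name, box)
```

In a real system the weights would presumably come from measured text length and figure sizes, and a refinement step such as the Painter-Commenter loop would then correct any panel whose rendered content still overflows its rectangle.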