PosterSum: A Multimodal Benchmark for Scientific Poster Summarization
February 24, 2025
Authors: Rohit Saxena, Pasquale Minervini, Frank Keller
cs.AI
Abstract
Generating accurate and concise textual summaries from multimodal documents
is challenging, especially when dealing with visually complex content like
scientific posters. We introduce PosterSum, a novel benchmark to advance the
development of vision-language models that can understand and summarize
scientific posters into research paper abstracts. Our dataset contains 16,305
conference posters paired with their corresponding abstracts as summaries. Each
poster is provided in image format and presents diverse visual understanding
challenges, such as complex layouts, dense text regions, tables, and figures.
We benchmark state-of-the-art Multimodal Large Language Models (MLLMs) on
PosterSum and demonstrate that they struggle to accurately interpret and
summarize scientific posters. We propose Segment & Summarize, a hierarchical
method that outperforms current MLLMs on automated metrics, achieving a 3.14%
gain in ROUGE-L. The benchmark and baseline results will serve as a starting
point for future research on poster summarization.
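The hierarchical Segment & Summarize approach described above can be sketched as a simple pipeline: segment the poster image into regions, summarize each region, then fuse the local summaries into one abstract-style summary. The sketch below is illustrative only, not the authors' implementation; `segment_poster`, `summarize_region`, and `combine_summaries` are hypothetical stand-ins for a layout segmenter, an MLLM call on each cropped region, and a text LLM that merges the partial summaries.

```python
# Illustrative sketch of a segment-then-summarize pipeline (assumed design,
# not the paper's released code). All three stage functions are placeholders.
from typing import List, Tuple

Region = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

def segment_poster(width: int, height: int) -> List[Region]:
    """Toy segmenter: split the poster into four quadrants.
    A real system would use a layout-analysis model instead."""
    mid_x, mid_y = width // 2, height // 2
    return [
        (0, 0, mid_x, mid_y),
        (mid_x, 0, width, mid_y),
        (0, mid_y, mid_x, height),
        (mid_x, mid_y, width, height),
    ]

def summarize_region(region: Region) -> str:
    """Placeholder for an MLLM call on the cropped region image."""
    return f"summary of region {region}"

def combine_summaries(parts: List[str]) -> str:
    """Placeholder for an LLM call that fuses local summaries
    into one abstract-style summary."""
    return " ".join(parts)

def segment_and_summarize(width: int, height: int) -> str:
    """Run the full hierarchical pipeline on a poster of the given size."""
    regions = segment_poster(width, height)
    local_summaries = [summarize_region(r) for r in regions]
    return combine_summaries(local_summaries)
```

The key design choice this mirrors is decomposition: dense poster layouts exceed what a single MLLM pass handles well, so each region is summarized at a manageable visual scale before a final text-only fusion step.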