PosterSum: A Multimodal Benchmark for Scientific Poster Summarization
February 24, 2025
Authors: Rohit Saxena, Pasquale Minervini, Frank Keller
cs.AI
Abstract
Generating accurate and concise textual summaries from multimodal documents
is challenging, especially when dealing with visually complex content like
scientific posters. We introduce PosterSum, a novel benchmark to advance the
development of vision-language models that can understand and summarize
scientific posters into research paper abstracts. Our dataset contains 16,305
conference posters paired with their corresponding abstracts as summaries. Each
poster is provided in image format and presents diverse visual understanding
challenges, such as complex layouts, dense text regions, tables, and figures.
We benchmark state-of-the-art Multimodal Large Language Models (MLLMs) on
PosterSum and demonstrate that they struggle to accurately interpret and
summarize scientific posters. We propose Segment & Summarize, a hierarchical
method that outperforms current MLLMs on automated metrics, achieving a 3.14%
gain in ROUGE-L. The benchmark and baseline results will serve as a starting
point for future research on poster summarization.
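The hierarchical Segment & Summarize approach described above can be sketched as a simple pipeline: segment the poster image into regions, summarize each region, then fuse the local summaries into one abstract-style summary. The sketch below is illustrative only, not the authors' implementation; `segment_poster`, `summarize_region`, and `combine_summaries` are hypothetical stand-ins for a layout segmenter, an MLLM call on each cropped region, and a text LLM that merges the partial summaries.

```python
# Illustrative sketch of a segment-then-summarize pipeline (assumed design,
# not the paper's released code). All three stage functions are placeholders.
from typing import List, Tuple

Region = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

def segment_poster(width: int, height: int) -> List[Region]:
    """Toy segmenter: split the poster into four quadrants.
    A real system would use a layout-analysis model instead."""
    mid_x, mid_y = width // 2, height // 2
    return [
        (0, 0, mid_x, mid_y),
        (mid_x, 0, width, mid_y),
        (0, mid_y, mid_x, height),
        (mid_x, mid_y, width, height),
    ]

def summarize_region(region: Region) -> str:
    """Placeholder for an MLLM call on the cropped region image."""
    return f"summary of region {region}"

def combine_summaries(parts: List[str]) -> str:
    """Placeholder for an LLM call that fuses local summaries
    into one abstract-style summary."""
    return " ".join(parts)

def segment_and_summarize(width: int, height: int) -> str:
    """Run the full hierarchical pipeline on a poster of the given size."""
    regions = segment_poster(width, height)
    local_summaries = [summarize_region(r) for r in regions]
    return combine_summaries(local_summaries)
```

The key design choice this mirrors is decomposition: dense poster layouts exceed what a single MLLM pass handles well, so each region is summarized at a manageable visual scale before a final text-only fusion step.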