PosterSum: A Multimodal Benchmark for Scientific Poster Summarization

Abstract

Generating accurate and concise textual summaries from multimodal documents is challenging, especially for visually complex content such as scientific posters. We introduce PosterSum, a novel benchmark to advance the development of vision-language models that can understand scientific posters and summarize them as research paper abstracts. Our dataset contains 16,305 conference posters, each paired with its corresponding abstract as the summary. Each poster is provided in image format and presents diverse visual understanding challenges, such as complex layouts, dense text regions, tables, and figures. We benchmark state-of-the-art Multimodal Large Language Models (MLLMs) on PosterSum and demonstrate that they struggle to accurately interpret and summarize scientific posters. We propose Segment & Summarize, a hierarchical method that outperforms current MLLMs on automated metrics, achieving a 3.14% gain in ROUGE-L. This will serve as a starting point for future research on poster summarization.
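For readers unfamiliar with the ROUGE-L metric cited above, the following is a minimal sketch of the standard definition: an F-measure over the longest common subsequence (LCS) of candidate and reference tokens. The whitespace tokenization, lowercasing, and the function names `lcs_length` and `rouge_l_f1` are assumptions made for illustration; the abstract does not state which ROUGE implementation or preprocessing the authors used, so scores from this sketch should not be expected to match the reported numbers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)  # DP row for the previous element of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two strings (assumed: lowercase, whitespace tokens)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    generated = "we introduce a benchmark for poster summarization"
    gold = "we introduce PosterSum a benchmark for scientific poster summarization"
    print(f"ROUGE-L F1: {rouge_l_f1(generated, gold):.3f}")
```

In a setting like the one the abstract describes, `candidate` would be a model-generated summary of a poster and `reference` the paired paper abstract.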
