PosterSum: A Multimodal Benchmark for Scientific Poster Summarization

Abstract

Generating accurate and concise textual summaries from multimodal documents is challenging, especially when dealing with visually complex content like scientific posters. We introduce PosterSum, a novel benchmark to advance the development of vision-language models that can understand and summarize scientific posters into research paper abstracts. Our dataset contains 16,305 conference posters, each paired with its corresponding abstract as the reference summary. Each poster is provided in image format and presents diverse visual understanding challenges, such as complex layouts, dense text regions, tables, and figures. We benchmark state-of-the-art Multimodal Large Language Models (MLLMs) on PosterSum and demonstrate that they struggle to accurately interpret and summarize scientific posters. We propose Segment & Summarize, a hierarchical method that outperforms current MLLMs on automated metrics, achieving a 3.14% gain in ROUGE-L. This work will serve as a starting point for future research on poster summarization.
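
For readers unfamiliar with the ROUGE-L metric cited above, the following is a minimal sketch of how a sentence-level ROUGE-L F1 score can be computed from the longest common subsequence (LCS) of tokens shared by a generated summary and a reference abstract. It is illustrative only: the function names `lcs_length` and `rouge_l_f1`, the lowercasing, and the whitespace tokenization are assumptions made here, not the evaluation setup used for PosterSum, which may use stemming, a different tokenizer, or a summary-level variant.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Row-by-row dynamic programming; only the previous row is kept.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Sentence-level ROUGE-L F1 (assumed: lowercase, whitespace tokens)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    generated = "the cat sat on the mat"
    abstract = "the cat is on the mat"
    print(f"ROUGE-L F1: {rouge_l_f1(generated, abstract):.4f}")
```

Because the score rewards in-order token overlap rather than exact n-gram matches, it tolerates insertions and rewording between matching words, which is one reason it is commonly reported for abstractive summarization.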
