MovieSum: An Abstractive Summarization Dataset for Movie Screenplays
August 12, 2024
Authors: Rohit Saxena, Frank Keller
cs.AI
Abstract
Movie screenplay summarization is challenging, as it requires an
understanding of long input contexts and various elements unique to movies.
Large language models have shown significant advancements in document
summarization, but they often struggle with processing long input contexts.
Furthermore, while television transcripts have received attention in recent
studies, movie screenplay summarization remains underexplored. To stimulate
research in this area, we present a new dataset, MovieSum, for abstractive
summarization of movie screenplays. This dataset comprises 2200 movie
screenplays accompanied by their Wikipedia plot summaries. We manually
formatted the movie screenplays to represent their structural elements.
Compared to existing datasets, MovieSum possesses several distinctive features:
(1) It includes movie screenplays, which are longer than scripts of TV
episodes. (2) It is twice the size of previous movie screenplay datasets. (3)
It provides metadata with IMDb IDs to facilitate access to additional external
knowledge. We also show the results of recently released large language models
applied to summarization on our dataset to provide a detailed baseline.