Unraveling the Capabilities of Language Models in News Summarization
January 30, 2025
Authors: Abdurrahman Odabaşı, Göksel Biricik
cs.AI
Abstract
Given the recent introduction of multiple language models and the ongoing
demand for better performance on Natural Language Processing tasks, particularly
summarization, this work provides a comprehensive benchmark of 20 recent
language models, focusing on smaller ones, for the news summarization task. We
systematically test the capabilities and effectiveness of these
models in summarizing news articles that are written in different styles
and drawn from three distinct datasets. Specifically, we focus
on zero-shot and few-shot learning settings and apply a robust evaluation
methodology that combines automatic
metrics, human evaluation, and LLM-as-a-judge. Interestingly, including
demonstration examples in the few-shot learning setting did not enhance models'
performance and, in some cases, even led to lower-quality generated
summaries. This issue arises mainly from the poor quality of the gold
summaries used as reference summaries, which negatively impacts
the models' performance. Furthermore, our results highlight the
exceptional performance of GPT-3.5-Turbo and GPT-4, which generally dominate
due to their advanced capabilities. However, among the public models evaluated,
certain models such as Qwen1.5-7B, SOLAR-10.7B-Instruct-v1.0, Meta-Llama-3-8B
and Zephyr-7B-Beta demonstrated promising results. These models showed
significant potential, positioning them as competitive alternatives to large
models for the task of news summarization.
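To make the zero-shot versus few-shot distinction in the abstract concrete, the following is a minimal sketch, not the authors' code. The prompt wording, the function names `build_prompt` and `unigram_f1`, and the use of a ROUGE-1-style unigram overlap as the automatic metric are all assumptions for illustration; the abstract does not specify the prompts or name the metrics used.

```python
from collections import Counter


def build_prompt(article, demonstrations=()):
    """Build a summarization prompt.

    With no demonstrations this is the zero-shot case; each
    (article, gold_summary) pair added makes it a few-shot prompt.
    The instruction wording is hypothetical.
    """
    parts = ["Summarize the following news article."]
    for demo_article, demo_summary in demonstrations:
        # The gold summary is copied into the prompt verbatim, so a
        # low-quality reference becomes a low-quality demonstration.
        parts.append(f"Article: {demo_article}\nSummary: {demo_summary}")
    parts.append(f"Article: {article}\nSummary:")
    return "\n\n".join(parts)


def unigram_f1(candidate, reference):
    """ROUGE-1-style F1 over lowercased whitespace tokens.

    A simplified stand-in for an automatic metric, assumed here
    only for illustration.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because the few-shot prompt embeds gold summaries directly and the overlap metric scores outputs against those same references, poor gold summaries can affect both the model's input and its measured score, which is consistent with the abstract's explanation of the few-shot results.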