Reranking-based Generation for Unbiased Perspective Summarization

Abstract

Generating unbiased summaries in real-world settings such as political perspective summarization remains a crucial application of Large Language Models (LLMs). Yet, existing evaluation frameworks rely on traditional metrics for measuring key attributes such as coverage and faithfulness without verifying their applicability, and efforts to develop improved summarizers are still nascent. We address these gaps by (1) identifying reliable metrics for measuring perspective summary quality, and (2) investigating the efficacy of LLM-based methods beyond zero-shot inference. Namely, we build a test set for benchmarking metric reliability using human annotations and show that traditional metrics underperform compared to language model-based metrics, which prove to be strong evaluators. Using these metrics, we show that reranking-based methods yield strong results, and preference tuning with synthetically generated and reranking-labeled data further boosts performance. Our findings aim to contribute to the reliable evaluation and development of perspective summarization methods.
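The abstract does not spell out the reranking procedure, so the following is only a minimal sketch of the general idea under stated assumptions: several candidate summaries are scored by a quality metric, the best one is kept, and the best and worst candidates can be paired as preference-tuning data. The names `toy_score`, `rerank`, and `preference_pair` are hypothetical, and the word-overlap scorer is a deliberately simple placeholder. The abstract reports that language model-based metrics are the more reliable evaluators, so a real setup would presumably plug one of those in as `score`.

```python
from typing import Callable, Dict, List, Tuple


def toy_score(summary: str, source: str) -> float:
    """Placeholder metric (an assumption, not the paper's metric):
    fraction of distinct source words that also appear in the summary."""
    source_words = set(source.lower().split())
    summary_words = set(summary.lower().split())
    if not source_words:
        return 0.0
    return len(source_words & summary_words) / len(source_words)


def rerank(
    candidates: List[str],
    source: str,
    score: Callable[[str, str], float] = toy_score,
) -> List[Tuple[float, str]]:
    """Score every candidate and return (score, candidate) pairs, best first."""
    scored = [(score(candidate, source), candidate) for candidate in candidates]
    # sorted() is stable, so tied candidates keep their original order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def preference_pair(
    candidates: List[str],
    source: str,
    score: Callable[[str, str], float] = toy_score,
) -> Dict[str, str]:
    """Label the top-ranked candidate as 'chosen' and the bottom-ranked one
    as 'rejected'. The prompt/chosen/rejected layout is an assumed format."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = rerank(candidates, source, score)
    return {"prompt": source, "chosen": ranked[0][1], "rejected": ranked[-1][1]}
```

In use, `rerank(candidates, source)[0][1]` would give the selected summary at inference time, while `preference_pair` would be applied over many sources to build synthetic training pairs. Where the candidates come from (for example, repeated sampling from an LLM) is left outside the sketch.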