Reranking-based Generation for Unbiased Perspective Summarization

Abstract

Generating unbiased summaries in real-world settings such as political perspective summarization remains a crucial application of Large Language Models (LLMs). Yet, existing evaluation frameworks rely on traditional metrics for measuring key attributes such as coverage and faithfulness without verifying their applicability, and efforts to develop improved summarizers are still nascent. We address these gaps by (1) identifying reliable metrics for measuring perspective summary quality, and (2) investigating the efficacy of LLM-based methods beyond zero-shot inference. Namely, we build a test set for benchmarking metric reliability using human annotations and show that traditional metrics underperform compared to language model-based metrics, which prove to be strong evaluators. Using these metrics, we show that reranking-based methods yield strong results, and preference tuning with synthetically generated and reranking-labeled data further boosts performance. Our findings aim to contribute to the reliable evaluation and development of perspective summarization methods.
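
As a rough illustration of the reranking idea mentioned above, the following minimal sketch scores several candidate summaries against a source text, orders them best-first, and takes the top and bottom candidates as a (chosen, rejected) pair of the kind preference tuning could consume. Everything here is an assumption made for illustration: the function names are hypothetical, and `toy_score` is a crude token-overlap stand-in, not the language model-based metrics the abstract reports as the more reliable evaluators.

```python
from typing import Callable, List, Tuple


def toy_score(source: str, summary: str) -> float:
    """Hypothetical stand-in scorer: fraction of summary tokens found in the source.

    A real pipeline would presumably call a language model-based metric here.
    """
    source_tokens = set(source.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    hits = sum(token in source_tokens for token in summary_tokens)
    return hits / len(summary_tokens)


def rerank(
    source: str,
    candidates: List[str],
    score: Callable[[str, str], float] = toy_score,
) -> List[str]:
    """Return the candidates ordered from highest to lowest score."""
    if not candidates:
        raise ValueError("at least one candidate summary is required")
    return sorted(candidates, key=lambda c: score(source, c), reverse=True)


def preference_pair(
    source: str,
    candidates: List[str],
    score: Callable[[str, str], float] = toy_score,
) -> Tuple[str, str]:
    """Label the best and worst candidates as a (chosen, rejected) pair."""
    ranked = rerank(source, candidates, score)
    return ranked[0], ranked[-1]


source = "the bill raises taxes on imports"
candidates = [
    "the bill raises taxes",
    "aliens wrote the bill",
    "the bill lowers taxes on imports",
]
best = rerank(source, candidates)[0]
chosen, rejected = preference_pair(source, candidates)
```

In this sketch the scorer is passed in as a parameter, so the overlap heuristic could be swapped for any function that maps a (source, summary) pair to a number without changing the reranking logic.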