Reranking-based Generation for Unbiased Perspective Summarization

Abstract

Generating unbiased summaries in real-world settings such as political perspective summarization remains a crucial application of Large Language Models (LLMs). Yet, existing evaluation frameworks rely on traditional metrics for measuring key attributes such as coverage and faithfulness without verifying their applicability, and efforts to develop improved summarizers are still nascent. We address these gaps by (1) identifying reliable metrics for measuring perspective summary quality, and (2) investigating the efficacy of LLM-based methods beyond zero-shot inference. Namely, we build a test set for benchmarking metric reliability using human annotations and show that traditional metrics underperform compared to language model-based metrics, which prove to be strong evaluators. Using these metrics, we show that reranking-based methods yield strong results, and preference tuning with synthetically generated and reranking-labeled data further boosts performance. Our findings aim to contribute to the reliable evaluation and development of perspective summarization methods.
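
As a rough illustration of the reranking idea mentioned in the abstract, the sketch below scores several candidate summaries, keeps the highest-scoring one, and pairs the best and worst candidates as a (chosen, rejected) example of the kind that preference tuning could consume. The abstract does not specify the actual metric, candidate generator, or data format, so everything here is an assumption: `toy_coverage_score` is a hypothetical keyword-matching stand-in for a language model-based metric, and the function names are illustrative only.

```python
from typing import Callable, List, Tuple


def toy_coverage_score(key_points: List[str]) -> Callable[[str], float]:
    """Hypothetical stand-in for an LM-based metric (assumption, not the paper's).

    Returns a scorer giving the fraction of key points mentioned in a summary.
    """
    if not key_points:
        raise ValueError("key_points must not be empty")

    def score(summary: str) -> float:
        text = summary.lower()
        hits = sum(1 for point in key_points if point.lower() in text)
        return hits / len(key_points)

    return score


def rerank(candidates: List[str],
           score: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Score every candidate and return (candidate, score) pairs, best first."""
    scored = [(candidate, score(candidate)) for candidate in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def preference_pair(candidates: List[str],
                    score: Callable[[str], float]) -> Tuple[str, str]:
    """Label the top-ranked candidate as chosen and the bottom-ranked as rejected."""
    ranked = rerank(candidates, score)
    return ranked[0][0], ranked[-1][0]


# Toy usage: in practice the candidates would be sampled from an LLM.
key_points = ["tax", "jobs", "deficit"]
candidates = [
    "The plan cuts tax rates.",
    "The plan cuts tax rates, adds jobs, and widens the deficit.",
    "The plan adds jobs and cuts tax rates.",
]
scorer = toy_coverage_score(key_points)
best_summary, best_score = rerank(candidates, scorer)[0]
chosen, rejected = preference_pair(candidates, scorer)
```

Selecting `best_summary` corresponds to reranking at inference time, while the `(chosen, rejected)` pair shows how reranking scores could label synthetic data for preference tuning. A real setup would replace the toy scorer with a validated metric.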
