Reranking-based Generation for Unbiased Perspective Summarization

Abstract

Generating unbiased summaries in real-world settings such as political perspective summarization remains a crucial application of Large Language Models (LLMs). Yet, existing evaluation frameworks rely on traditional metrics for measuring key attributes such as coverage and faithfulness without verifying their applicability, and efforts to develop improved summarizers are still nascent. We address these gaps by (1) identifying reliable metrics for measuring perspective summary quality, and (2) investigating the efficacy of LLM-based methods beyond zero-shot inference. Namely, we build a test set for benchmarking metric reliability using human annotations and show that traditional metrics underperform compared to language model-based metrics, which prove to be strong evaluators. Using these metrics, we show that reranking-based methods yield strong results, and preference tuning with synthetically generated and reranking-labeled data further boosts performance. Our findings aim to contribute to the reliable evaluation and development of perspective summarization methods.
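
The reranking idea mentioned above can be illustrated with a minimal sketch: sample several candidate summaries, score each one, keep the top-scoring candidate, and optionally reuse the ranking to label a (chosen, rejected) pair for preference tuning. The abstract does not specify the scorer or how pairs are built, so everything here is an assumption for illustration: `toy_coverage_score` is a hypothetical substring-based stand-in for the language model-based metrics the paper relies on, and pairing the best candidate against the worst is only one plausible labeling scheme.

```python
from typing import Callable, List, Tuple


def toy_coverage_score(summary: str, key_points: List[str]) -> float:
    """Hypothetical stand-in scorer: fraction of key points found verbatim.

    The paper uses language model-based metrics; this substring check
    only keeps the sketch self-contained and runnable.
    """
    if not key_points:
        return 0.0
    text = summary.lower()
    hits = sum(1 for point in key_points if point.lower() in text)
    return hits / len(key_points)


def rerank(candidates: List[str], score: Callable[[str], float]) -> List[str]:
    """Return candidates ordered from highest to lowest score."""
    if not candidates:
        raise ValueError("need at least one candidate summary")
    return sorted(candidates, key=score, reverse=True)


def preference_pair(
    candidates: List[str], score: Callable[[str], float]
) -> Tuple[str, str]:
    """Label the best candidate as chosen and the worst as rejected.

    Assumed labeling scheme, not necessarily the one used in the paper.
    """
    ranked = rerank(candidates, score)
    return ranked[0], ranked[-1]
```

In practice the candidates would come from sampling an LLM several times on the same input, and the resulting pairs could feed a preference-tuning method; both steps are outside the scope of this sketch.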
