TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models

Abstract

Factual consistency evaluation is often conducted using Natural Language Inference (NLI) models, yet these models exhibit limited success in evaluating summaries. Previous work improved such models with synthetic training data. However, the data is typically based on perturbed human-written summaries, which often differ in their characteristics from real model-generated summaries and have limited coverage of possible factual errors. Alternatively, large language models (LLMs) have recently shown promising results in directly evaluating generative tasks, but are too computationally expensive for practical use. Motivated by these limitations, we introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries using an LLM. Unlike prior work, TrueTeacher does not rely on human-written summaries and is multilingual by nature. Experiments on the TRUE benchmark show that a student model trained using our data substantially outperforms both the state-of-the-art model with similar capacity and the LLM teacher. In a systematic study, we compare TrueTeacher to existing synthetic data generation methods and demonstrate its superiority and robustness to domain shift. Using the mFACE dataset, we also show that our method generalizes to multilingual scenarios. Finally, we release a large-scale synthetic dataset with 1.4M examples generated using TrueTeacher.
