TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models

Abstract


Factual consistency evaluation is often conducted using Natural Language Inference (NLI) models, yet these models exhibit limited success in evaluating summaries. Previous work improved such models with synthetic training data. However, the data is typically based on perturbed human-written summaries, which often differ in their characteristics from real model-generated summaries and have limited coverage of possible factual errors. Alternatively, large language models (LLMs) have recently shown promising results in directly evaluating generative tasks, but are too computationally expensive for practical use. Motivated by these limitations, we introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries using an LLM. Unlike prior work, TrueTeacher does not rely on human-written summaries and is multilingual by nature. Experiments on the TRUE benchmark show that a student model trained using our data substantially outperforms both the state-of-the-art model with similar capacity and the LLM teacher. In a systematic study, we compare TrueTeacher to existing synthetic data generation methods and demonstrate its superiority and robustness to domain shift. Using the mFACE dataset, we also show that our method generalizes to multilingual scenarios. Finally, we release a large-scale synthetic dataset with 1.4M examples generated using TrueTeacher.
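The data-generation idea described above (summaries produced by several models, each labeled for factual consistency by a teacher) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `LabeledExample`, `build_synthetic_dataset`, `first_sentence`, `unrelated_claim`, and `toy_teacher` are hypothetical, and the toy functions merely stand in for real summarization models and an LLM teacher.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class LabeledExample:
    document: str
    summary: str
    consistent: bool  # label assigned by the teacher, not by a human


def build_synthetic_dataset(
    documents: Iterable[str],
    summarizers: Iterable[Callable[[str], str]],
    teacher: Callable[[str, str], bool],
) -> List[LabeledExample]:
    """Summarize every document with every summarizer, then let the
    teacher label each (document, summary) pair."""
    summarizers = list(summarizers)  # allow reuse across documents
    examples = []
    for doc in documents:
        for summarize in summarizers:
            summary = summarize(doc)
            examples.append(LabeledExample(doc, summary, teacher(doc, summary)))
    return examples


# Toy stand-ins (assumptions for illustration only).
def first_sentence(doc: str) -> str:
    # Pretend summarizer: copy the first sentence of the document.
    return doc.split(". ")[0].rstrip(".") + "."


def unrelated_claim(doc: str) -> str:
    # Pretend summarizer that produces an unsupported statement.
    return "Nothing happened."


def toy_teacher(doc: str, summary: str) -> bool:
    # Pretend teacher: a summary counts as consistent if it appears
    # verbatim in the document. A real teacher would be an LLM.
    return summary.rstrip(".") in doc


docs = ["The cat sat on the mat. It then fell asleep."]
dataset = build_synthetic_dataset(docs, [first_sentence, unrelated_claim], toy_teacher)
```

The resulting list of labeled pairs is the kind of data on which a smaller student model could then be trained. Using several summarizers per document is what makes the summaries diverse in this sketch.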
