TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models

Abstract

Factual consistency evaluation is often conducted using Natural Language Inference (NLI) models, yet these models exhibit limited success in evaluating summaries. Previous work improved such models with synthetic training data. However, the data is typically based on perturbed human-written summaries, which often differ in their characteristics from real model-generated summaries and have limited coverage of possible factual errors. Alternatively, large language models (LLMs) have recently shown promising results in directly evaluating generative tasks, but are too computationally expensive for practical use. Motivated by these limitations, we introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries using an LLM. Unlike prior work, TrueTeacher does not rely on human-written summaries and is multilingual by nature. Experiments on the TRUE benchmark show that a student model trained using our data substantially outperforms both the state-of-the-art model with similar capacity and the LLM teacher. In a systematic study, we compare TrueTeacher to existing synthetic data generation methods and demonstrate its superiority and robustness to domain shift. Using the mFACE dataset, we also show that our method generalizes to multilingual scenarios. Finally, we release a large-scale synthetic dataset with 1.4M examples generated using TrueTeacher.
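The data-generation idea described in the abstract (model-generated summaries annotated by an LLM teacher) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Summarizer`, `Teacher`, `Example`, and `build_synthetic_dataset` are hypothetical, looping over several summarizers is an assumed way of obtaining diverse summaries, and the toy functions at the bottom merely stand in for real summarization models and a real LLM.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical interfaces (assumptions, not taken from the paper):
# a summarizer maps a document to a summary, and a teacher returns
# True when it judges the summary factually consistent with the document.
Summarizer = Callable[[str], str]
Teacher = Callable[[str, str], bool]


@dataclass
class Example:
    document: str
    summary: str
    label: int  # 1 = consistent, 0 = inconsistent, as judged by the teacher


def build_synthetic_dataset(
    documents: Iterable[str],
    summarizers: Iterable[Summarizer],
    teacher: Teacher,
) -> List[Example]:
    """Summarize each document with each model, then let the teacher label it."""
    summarizers = list(summarizers)  # allow reuse across documents
    examples: List[Example] = []
    for doc in documents:
        for summarize in summarizers:
            summary = summarize(doc)
            label = int(teacher(doc, summary))
            examples.append(Example(doc, summary, label))
    return examples


# Toy stand-ins so the sketch runs without any model.
def first_sentence(doc: str) -> str:
    # Extractive "summarizer": returns the first sentence of the document.
    return doc.split(".")[0] + "."


def hallucinating(doc: str) -> str:
    # "Summarizer" that ignores the document and returns an unrelated claim.
    return "The moon is made of cheese."


def toy_teacher(doc: str, summary: str) -> bool:
    # Placeholder for an LLM judgment: a plain substring check.
    return summary in doc


docs = ["Cats sleep a lot. They also purr."]
data = build_synthetic_dataset(docs, [first_sentence, hallucinating], toy_teacher)
```

In a real setting the labeled examples would serve as training data for a smaller student model; the substring check here only keeps the sketch self-contained.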
