Long-form factuality in large language models

Abstract

Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method we call Search-Augmented Factuality Evaluator (SAFE). SAFE uses an LLM to break a long-form response down into a set of individual facts and to evaluate the accuracy of each fact through a multi-step reasoning process that sends search queries to Google Search and determines whether the fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality. To do so, we balance the percentage of supported facts in a response (precision) with the percentage of provided facts relative to a hyperparameter representing a user's preferred response length (recall). Empirically, we demonstrate that LLM agents can achieve superhuman rating performance: on a set of ~16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times cheaper than human annotators. We also benchmark thirteen language models on LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding that larger language models generally achieve better long-form factuality. LongFact, SAFE, and all experimental code are available at https://github.com/google-deepmind/long-form-factuality.
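The evaluation flow and the aggregated metric described above can be illustrated with a minimal Python sketch. It is not the released implementation: the function names (`f1_at_k`, `rate_response`) and the three callables passed in (`split_into_facts`, `search`, `is_supported`) are hypothetical stand-ins for the LLM and search steps. The sketch also assumes that recall is the number of supported facts divided by the preferred-length hyperparameter `k`, capped at 1, and that a response with no supported facts scores 0.

```python
from typing import Callable, Iterable


def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """Combine precision and length-aware recall into a single score.

    Assumption: recall = min(num_supported / k, 1), where k stands for the
    number of supported facts a user would like a response to contain.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if num_supported < 0 or num_not_supported < 0:
        raise ValueError("fact counts must be non-negative")
    if num_supported == 0:
        return 0.0  # Assumed convention: no supported facts means a zero score.
    precision = num_supported / (num_supported + num_not_supported)
    recall = min(num_supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


def rate_response(
    response: str,
    split_into_facts: Callable[[str], Iterable[str]],
    search: Callable[[str], str],
    is_supported: Callable[[str, str], bool],
    k: int,
) -> float:
    """Hypothetical loop: decompose the response, search, judge, aggregate."""
    supported = 0
    not_supported = 0
    for fact in split_into_facts(response):
        evidence = search(fact)  # Stand-in for issuing search queries.
        if is_supported(fact, evidence):
            supported += 1
        else:
            not_supported += 1
    return f1_at_k(supported, not_supported, k)
```

Under these assumptions, a response with 3 supported facts and 1 unsupported fact, scored with `k = 4`, has precision 0.75 and recall 0.75, so `f1_at_k(3, 1, 4)` evaluates to 0.75. Passing the LLM and search steps in as callables keeps the sketch self-contained and makes it easy to substitute stubs when testing.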
