FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality

Abstract

Long-form factuality evaluation assesses the ability of models to generate accurate, comprehensive responses to short prompts. Existing benchmarks often lack human verification, which can lead to quality issues. To address this limitation, we introduce FACTORY, a large-scale, human-verified prompt set. Developed using a model-in-the-loop approach and refined by humans, FACTORY includes challenging prompts that are fact-seeking, answerable, and unambiguous. We conduct human evaluations on six state-of-the-art language models using FACTORY and existing datasets. Our results show that FACTORY is a challenging benchmark: approximately 40% of the claims made in the responses of SOTA models are not factual, compared to only 10% for other datasets. Our analysis identifies the strengths of FACTORY over prior benchmarks, emphasizing its reliability and the necessity for models to reason across long-tailed facts.
