FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality
July 31, 2025
Authors: Mingda Chen, Yang Li, Xilun Chen, Adina Williams, Gargi Ghosh, Scott Yih
cs.AI
Abstract
Long-form factuality evaluation assesses the ability of models to generate
accurate, comprehensive responses to short prompts. Existing benchmarks often
lack human verification, leading to potential quality issues. To address this
limitation, we introduce FACTORY, a large-scale, human-verified prompt set.
Developed using a model-in-the-loop approach and refined by humans, FACTORY
includes challenging prompts that are fact-seeking, answerable, and
unambiguous. We conduct human evaluations on 6 state-of-the-art language models
using FACTORY and existing datasets. Our results show that FACTORY is a
challenging benchmark: approximately 40% of the claims made in the responses of
SOTA models are not factual, compared to only 10% for other datasets. Our
analysis identifies the strengths of FACTORY over prior benchmarks, emphasizing
its reliability and the necessity for models to reason across long-tailed
facts.