FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality
July 31, 2025
Authors: Mingda Chen, Yang Li, Xilun Chen, Adina Williams, Gargi Ghosh, Scott Yih
cs.AI
Abstract
Long-form factuality evaluation assesses the ability of models to generate
accurate, comprehensive responses to short prompts. Existing benchmarks often
lack human verification, leading to potential quality issues. To address this
limitation, we introduce FACTORY, a large-scale, human-verified prompt set.
Developed using a model-in-the-loop approach and refined by humans, FACTORY
includes challenging prompts that are fact-seeking, answerable, and
unambiguous. We conduct human evaluations on six state-of-the-art language models
using FACTORY and existing datasets. Our results show that FACTORY is a
challenging benchmark: approximately 40% of the claims made in the responses of
SOTA models are not factual, compared to only 10% for other datasets. Our
analysis identifies the strengths of FACTORY over prior benchmarks, emphasizing
its reliability and the necessity for models to reason across long-tailed
facts.