FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality

Abstract

Long-form factuality evaluation assesses the ability of models to generate accurate, comprehensive responses to short prompts. Existing benchmarks often lack human verification, leading to potential quality issues. To address this limitation, we introduce FACTORY, a large-scale, human-verified prompt set. Developed using a model-in-the-loop approach and refined by humans, FACTORY includes challenging prompts that are fact-seeking, answerable, and unambiguous. We conduct human evaluations on 6 state-of-the-art language models using FACTORY and existing datasets. Our results show that FACTORY is a challenging benchmark: approximately 40% of the claims made in the responses of SOTA models are not factual, compared to only 10% for other datasets. Our analysis identifies the strengths of FACTORY over prior benchmarks, emphasizing its reliability and the necessity for models to reason across long-tailed facts.
