HalluHard: A Hard Multi-Turn Hallucination Benchmark
February 1, 2026
Authors: Dongyang Fan, Sebastien Delsad, Nicolas Flammarion, Maksym Andriushchenko
cs.AI
Abstract
Large language models (LLMs) still produce plausible-sounding but ungrounded factual claims, a problem that worsens in multi-turn dialogue as context grows and early errors cascade. We introduce HalluHard, a challenging multi-turn hallucination benchmark with 950 seed questions spanning four high-stakes domains: legal cases, research questions, medical guidelines, and coding. We operationalize groundedness by requiring inline citations for factual assertions. To support reliable evaluation in open-ended settings, we propose a judging pipeline that iteratively retrieves evidence via web search; it can fetch, filter, and parse full-text sources (including PDFs) to assess whether cited material actually supports the generated content. Across a diverse set of frontier proprietary and open-weight models, hallucinations remain substantial even with web search (an approximately 30% hallucination rate for the strongest configuration, Opus-4.5 with web search), with content-grounding errors persisting at high rates. Finally, we show that hallucination behavior is shaped by model capacity, turn position, effective reasoning, and the type of knowledge required.
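To make the judging pipeline described above more concrete, the following is a minimal Python sketch of an iterative evidence-retrieval judge. The helpers `web_search`, `fetch_document`, and `llm_judge`, the verdict labels, and the query-broadening retry logic are hypothetical placeholders chosen for illustration under stated assumptions; they do not reflect the authors' actual implementation.

```python
# Illustrative sketch of an iterative evidence-retrieval judging loop.
# All helper functions below are hypothetical stubs, not HalluHard's code.

from dataclasses import dataclass


@dataclass
class Claim:
    text: str      # factual assertion extracted from the model's answer
    citation: str  # inline citation (URL or source title) attached to it


def web_search(query: str, k: int = 5) -> list[str]:
    """Placeholder: return up to k candidate source URLs for the query."""
    return []


def fetch_document(url: str) -> str | None:
    """Placeholder: download and parse a source (HTML or PDF) to plain text."""
    return None


def llm_judge(claim: str, evidence: str) -> str:
    """Placeholder: return 'supported', 'contradicted', or 'unverifiable'."""
    return "unverifiable"


def judge_claim(claim: Claim, max_rounds: int = 3) -> str:
    """Iteratively search for evidence; stop once a definite verdict is reached."""
    query = f"{claim.citation} {claim.text}"
    for _ in range(max_rounds):
        for url in web_search(query):
            text = fetch_document(url)
            if not text:
                continue  # skip sources that fail to download or parse
            verdict = llm_judge(claim.text, text)
            if verdict != "unverifiable":
                return verdict
        query = claim.text  # broaden the query if the cited source yields nothing
    return "unverifiable"
```

In this sketch, a claim is counted as grounded only when a retrieved full-text source is judged to support it, which mirrors the paper's emphasis on checking whether cited material actually backs the generated content rather than merely whether a citation is present.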