Bidirectional LMs are Better Knowledge Memorizers? A Benchmark for Real-world Knowledge Injection
May 18, 2025
Authors: Yuwei Zhang, Wenhao Yu, Shangbin Feng, Yifan Zhu, Letian Peng, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang
cs.AI
Abstract
Despite significant advances in large language models (LLMs), their knowledge
memorization capabilities remain underexplored due to the lack of a standardized,
high-quality testbed. In this paper, we introduce a novel, real-world
and large-scale knowledge injection benchmark that evolves continuously over
time without requiring human intervention. Specifically, we propose WikiDYK,
which leverages recently added, human-written facts from Wikipedia's "Did
You Know..." entries. These entries are carefully selected by expert Wikipedia
editors based on criteria such as verifiability and clarity. Each entry is
converted into multiple question-answer pairs spanning diverse task formats
from easy cloze prompts to complex multi-hop questions. WikiDYK contains 12,290
facts and 77,180 questions and is seamlessly extensible with future
updates from Wikipedia editors. Extensive experiments using continued
pre-training reveal a surprising insight: despite their prevalence in modern
LLMs, Causal Language Models (CLMs) demonstrate significantly weaker knowledge
memorization capabilities compared to Bidirectional Language Models (BiLMs),
exhibiting a 23% lower accuracy in terms of reliability. To compensate for the
smaller scales of current BiLMs, we introduce a modular collaborative framework
utilizing ensembles of BiLMs as external knowledge repositories to integrate
with LLMs. Experiments show that our framework further improves the reliability
accuracy by up to 29.1%.
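
As a rough illustration of the simplest task format mentioned in the abstract, a cloze prompt can be formed by masking the answer span in a fact sentence. The sketch below is a toy under stated assumptions, not the WikiDYK pipeline: the function name `make_cloze`, the `<mask>` token, and the example fact are all hypothetical placeholders.

```python
def make_cloze(fact: str, answer: str, mask_token: str = "<mask>") -> dict:
    """Turn a fact sentence into a cloze-style question-answer pair.

    Hypothetical helper for illustration only; it is not taken from the paper.
    """
    if answer not in fact:
        raise ValueError("answer span must appear verbatim in the fact")
    # Mask only the first occurrence so the prompt has exactly one blank.
    prompt = fact.replace(answer, mask_token, 1)
    return {"question": prompt, "answer": answer}


# Placeholder fact, invented for this example (not a WikiDYK entry).
example = make_cloze("The example bridge opened in 1901.", "1901")
print(example["question"])  # The example bridge opened in <mask>.
```

Harder formats such as multi-hop questions would need more than string masking; this only shows the shape of a single question-answer pair.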