Knowledge Homophily in Large Language Models
September 28, 2025
Authors: Utkarsh Sahu, Zhisheng Qi, Mahantesh Halappanavar, Nedim Lipka, Ryan A. Rossi, Franck Dernoncourt, Yu Zhang, Yao Ma, Yu Wang
cs.AI
Abstract
Large Language Models (LLMs) have been increasingly studied as neural
knowledge bases for supporting knowledge-intensive applications such as
question answering and fact checking. However, the structural organization of
their knowledge remains underexplored. Inspired by cognitive neuroscience
findings, such as semantic clustering and priming, where knowing one fact
increases the likelihood of recalling related facts, we investigate an
analogous knowledge homophily pattern in LLMs. To this end, we map LLM
knowledge into a graph representation through knowledge checking at both the
triplet and entity levels. After that, we analyze the knowledgeability
relationship between an entity and its neighbors, discovering that LLMs tend to
possess a similar level of knowledge about entities positioned closer in the
graph. Motivated by this homophily principle, we propose a Graph Neural Network
(GNN) regression model to estimate entity-level knowledgeability scores for
triplets by leveraging their neighborhood scores. The predicted
knowledgeability enables us to prioritize checking less well-known triplets,
thereby maximizing knowledge coverage under the same labeling budget. This not
only improves the efficiency of active labeling for knowledge-injection
fine-tuning of LLMs but also enhances multi-hop path retrieval in
reasoning-intensive question answering.
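
Below is a minimal, self-contained sketch of the pipeline the abstract outlines: build an entity graph from triplets, assign each entity a knowledgeability score (synthetic here; in the paper these come from LLM knowledge checks), verify that neighboring entities have correlated scores (homophily), and fit a small mean-aggregation GNN regressor that predicts scores for unchecked entities so the least-known ones can be prioritized. It uses plain PyTorch with synthetic data; the graph size, the aggregation scheme, and all names (e.g. MeanAggGNN) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: synthetic graph and scores stand in for LLM knowledge checks.
import torch
import torch.nn as nn

torch.manual_seed(0)

# --- Toy knowledge graph: 200 entities, random triplet edges, symmetrized. ---
num_nodes = 200
edge_index = torch.randint(0, num_nodes, (2, 800))               # (head, tail) pairs
edge_index = torch.cat([edge_index, edge_index.flip(0)], dim=1)   # add reverse edges

# Row-normalized adjacency with self-loops, used for mean-neighborhood aggregation.
adj = torch.zeros(num_nodes, num_nodes)
adj[edge_index[0], edge_index[1]] = 1.0
adj += torch.eye(num_nodes)
adj = adj / adj.sum(dim=1, keepdim=True)

# --- Synthetic "knowledgeability" per entity: a graph-smoothed signal in [0, 1],
# standing in for the fraction of an entity's incident triplets the LLM answers correctly.
true_score = adj @ adj @ torch.rand(num_nodes, 1)

# Homophily check: correlation between an entity's score and its neighborhood mean score.
neigh_mean = adj @ true_score
homophily = torch.corrcoef(torch.cat([true_score, neigh_mean], dim=1).T)[0, 1]
print(f"score vs. neighborhood-mean correlation: {homophily:.3f}")

# --- GNN regressor: two rounds of mean aggregation followed by a linear head. ---
class MeanAggGNN(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.embed = nn.Embedding(num_nodes, dim)   # learned entity features
        self.lin1 = nn.Linear(dim, dim)
        self.lin2 = nn.Linear(dim, 1)

    def forward(self, adj):
        h = self.embed.weight
        h = torch.relu(self.lin1(adj @ h))                      # aggregate + transform
        return torch.sigmoid(self.lin2(adj @ h)).squeeze(-1)   # score in [0, 1]

# Only a small labeled subset has checked scores; the model regresses onto those.
labeled = torch.randperm(num_nodes)[: num_nodes // 5]
model = MeanAggGNN()
opt = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(300):
    opt.zero_grad()
    pred = model(adj)
    loss = nn.functional.mse_loss(pred[labeled], true_score.squeeze(-1)[labeled])
    loss.backward()
    opt.step()

# Prioritization: unchecked entities with the lowest predicted knowledgeability first.
labeled_set = set(labeled.tolist())
unlabeled = torch.tensor([i for i in range(num_nodes) if i not in labeled_set])
pred = model(adj).detach()
priority = unlabeled[pred[unlabeled].argsort()[:10]]
print("check these entities first:", priority.tolist())
```

In practice the labeled scores would come from probing the LLM on each entity's incident triplets, and the checking budget saved on entities predicted to be well known can be redirected to the low-scoring ones for knowledge-injection fine-tuning.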