
Knowledge Homophily in Large Language Models

September 28, 2025
Authors: Utkarsh Sahu, Zhisheng Qi, Mahantesh Halappanavar, Nedim Lipka, Ryan A. Rossi, Franck Dernoncourt, Yu Zhang, Yao Ma, Yu Wang
cs.AI

Abstract

Large Language Models (LLMs) have been increasingly studied as neural knowledge bases for supporting knowledge-intensive applications such as question answering and fact checking. However, the structural organization of their knowledge remains largely unexplored. Inspired by cognitive neuroscience findings such as semantic clustering and priming, where knowing one fact increases the likelihood of recalling related facts, we investigate an analogous knowledge homophily pattern in LLMs. To this end, we map LLM knowledge into a graph representation through knowledge checking at both the triplet and the entity level. We then analyze the knowledgeability relationship between an entity and its neighbors, discovering that LLMs tend to possess a similar level of knowledge about entities positioned closer in the graph. Motivated by this homophily principle, we propose a Graph Neural Network (GNN) regression model that estimates entity-level knowledgeability scores for triplets by leveraging their neighborhood scores. The predicted knowledgeability enables us to prioritize checking less well-known triplets, thereby maximizing knowledge coverage under the same labeling budget. This not only improves the efficiency of active labeling for fine-tuning that injects knowledge into LLMs, but also enhances multi-hop path retrieval in reasoning-intensive question answering.
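
To make the pipeline concrete, below is a minimal illustrative sketch in plain NumPy and PyTorch, not the authors' implementation: it (1) measures knowledge homophily as the correlation between an entity's knowledgeability score and the mean score of its graph neighbors, and (2) trains a one-layer GNN regressor that predicts scores for unchecked entities from neighborhood aggregation, so the lowest-scoring ones can be prioritized for checking. The graph, entity features, scores, and hyperparameters are all synthetic placeholders.

```python
# Illustrative sketch only: synthetic graph and scores, not the paper's data or code.
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)

# --- Toy knowledge graph: adjacency matrix over n entities ------------------
n = 200
A = (rng.random((n, n)) < 0.05).astype(np.float32)
A = np.maximum(A, A.T)          # make the graph undirected
np.fill_diagonal(A, 0.0)

# Synthetic entity-level knowledgeability scores in [0, 1]; in the paper these
# come from knowledge checking of the triplets incident to each entity.
scores = rng.random(n).astype(np.float32)

# --- (1) Homophily: correlation between an entity's score and the mean ------
# ---     score of its neighbors                                         ------
deg = A.sum(axis=1)
neighbor_mean = (A @ scores) / np.maximum(deg, 1.0)
mask = deg > 0
homophily = np.corrcoef(scores[mask], neighbor_mean[mask])[0, 1]
print(f"knowledge homophily (Pearson r): {homophily:.3f}")

# --- (2) One-layer GNN regressor: h' = ReLU(A_norm @ h @ W) -----------------
A_hat = torch.tensor(A + np.eye(n, dtype=np.float32))    # add self-loops
d_inv_sqrt = torch.diag(A_hat.sum(1).pow(-0.5))
A_norm = d_inv_sqrt @ A_hat @ d_inv_sqrt                 # symmetric normalization

X = torch.randn(n, 16)          # placeholder entity features
y = torch.tensor(scores).unsqueeze(1)

class GNNRegressor(nn.Module):
    def __init__(self, d_in, d_hid):
        super().__init__()
        self.w = nn.Linear(d_in, d_hid)
        self.out = nn.Linear(d_hid, 1)

    def forward(self, a_norm, x):
        h = torch.relu(a_norm @ self.w(x))   # aggregate neighbor features
        return torch.sigmoid(self.out(h))    # predicted knowledgeability in (0, 1)

model = GNNRegressor(16, 32)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
train_idx = torch.arange(n) < int(0.7 * n)   # scores observed for 70% of entities

for epoch in range(200):
    opt.zero_grad()
    pred = model(A_norm, X)
    loss = nn.functional.mse_loss(pred[train_idx], y[train_idx])
    loss.backward()
    opt.step()

# Unchecked entities with the lowest predicted scores are the ones to
# prioritize for knowledge checking under a fixed labeling budget.
pred = model(A_norm, X).detach().squeeze()
priority = torch.argsort(pred[~train_idx])
```

Under the homophily assumption, the neighborhood aggregation is what does the work here: an entity surrounded by poorly known neighbors gets a low predicted score even before it is checked, which is exactly the signal the active-labeling step exploits.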