Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models
January 30, 2025
Authors: Qika Lin, Tianzhe Zhao, Kai He, Zhen Peng, Fangzhi Xu, Ling Huang, Jingying Ma, Mengling Feng
cs.AI
Abstract
Due to the inherent gap between Knowledge Graph (KG) structures and natural language, effectively integrating the holistic structural information of KGs with Large Language Models (LLMs) has emerged as a significant challenge. To this end, we propose a two-stage framework to learn and apply quantized codes for each entity, aiming for the seamless integration of KGs with LLMs. First, a self-supervised quantized representation (SSQR) method is proposed to compress both KG structural and semantic knowledge into discrete codes (i.e., tokens) that align with the format of language sentences. We further design KG instruction-following data that treats these learned codes as features to input directly into LLMs, thereby achieving seamless integration. Experimental results demonstrate that SSQR outperforms existing unsupervised quantization methods, producing more distinguishable codes. Furthermore, the fine-tuned LLaMA2 and LLaMA3.1 models also achieve superior performance on KG link prediction and triple classification tasks, utilizing only 16 tokens per entity instead of the thousands required by conventional prompting methods.
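
To make the quantization stage more concrete, below is a minimal PyTorch sketch of compressing an entity representation into a fixed number of discrete codebook indices, in the spirit of SSQR. The class name, codebook size, slot layout, and the straight-through gradient trick are illustrative assumptions, not the authors' exact implementation.

    # Minimal sketch: quantize entity embeddings into 16 discrete codes per entity.
    # Names and hyperparameters are assumptions for illustration.
    import torch
    import torch.nn as nn

    class EntityQuantizer(nn.Module):
        def __init__(self, num_codes=8192, code_dim=256, codes_per_entity=16):
            super().__init__()
            self.codes_per_entity = codes_per_entity
            # Shared codebook: each entity is described by `codes_per_entity`
            # discrete indices into this table.
            self.codebook = nn.Embedding(num_codes, code_dim)

        def forward(self, entity_emb):
            # entity_emb: (batch, codes_per_entity, code_dim), e.g. produced by a
            # (hypothetical) graph encoder that splits each entity into 16 slots.
            flat = entity_emb.reshape(-1, entity_emb.size(-1))        # (B*16, D)
            # Nearest-codebook-entry lookup by Euclidean distance.
            dists = torch.cdist(flat, self.codebook.weight)           # (B*16, K)
            indices = dists.argmin(dim=-1)                            # (B*16,)
            quantized = self.codebook(indices).view_as(entity_emb)
            # Straight-through estimator so gradients reach the encoder.
            quantized = entity_emb + (quantized - entity_emb).detach()
            codes = indices.view(entity_emb.size(0), self.codes_per_entity)
            return quantized, codes                                   # codes: (B, 16)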
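The second stage, KG instruction-following data, can be illustrated with a small hypothetical example of how the 16 learned codes per entity might be rendered as extra tokens in a link-prediction prompt; the prompt template and the <code_k> token naming are assumptions for illustration only.

    # Hypothetical sketch of building one instruction-following example for
    # KG link prediction; the template and token naming are assumed.
    def build_link_prediction_example(head_name, relation, head_codes, candidates):
        # head_codes: the 16 discrete code indices learned for the head entity,
        # rendered as special tokens the fine-tuned LLM is taught to read.
        code_str = " ".join(f"<code_{c}>" for c in head_codes)
        instruction = (
            "Given the head entity and relation, predict the tail entity.\n"
            f"Head: {head_name} {code_str}\n"
            f"Relation: {relation}\n"
            f"Candidates: {', '.join(candidates)}"
        )
        # The gold tail entity fills the output field during fine-tuning.
        return {"instruction": instruction, "output": ""}

    example = build_link_prediction_example(
        "Barack Obama", "born_in",
        head_codes=[12, 845, 77, 3021, 9, 512, 64, 1280,
                    33, 908, 451, 7, 222, 1999, 18, 360],
        candidates=["Honolulu", "Chicago", "New York"],
    )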