Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models

Abstract

Because of the natural gap between Knowledge Graph (KG) structures and natural language, effectively integrating the holistic structural information of KGs with Large Language Models (LLMs) has emerged as a significant question. To this end, we propose a two-stage framework that learns and applies quantized codes for each entity, aiming for the seamless integration of KGs with LLMs. First, a self-supervised quantized representation (SSQR) method is proposed to compress both the structural and semantic knowledge of a KG into discrete codes (i.e., tokens) that align with the format of language sentences. We then design KG instruction-following data that treats these learned codes as features fed directly to LLMs, thereby achieving seamless integration. Experimental results demonstrate that SSQR outperforms existing unsupervised quantization methods and produces more distinguishable codes. Furthermore, the fine-tuned LLaMA2 and LLaMA3.1 models achieve superior performance on KG link prediction and triple classification tasks while using only 16 tokens per entity, instead of the thousands required by conventional prompting methods.
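
To make the idea of discrete codes per entity concrete, the following is a minimal sketch of generic nearest-neighbour vector quantization followed by rendering the codes as prompt tokens. It is not the SSQR implementation: the codebook size (1024), the latent dimension (32), the random data, the `[CODE_n]` token format, the prompt wording, and the function names `quantize` and `codes_to_tokens` are all assumptions made for illustration. Only the figure of 16 codes per entity comes from the abstract.

```python
import numpy as np


def quantize(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each latent vector, the index of its nearest codebook entry."""
    # Squared Euclidean distance between every latent (n, d) and every code (k, d).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def codes_to_tokens(codes) -> str:
    """Render code indices as special tokens (hypothetical token format)."""
    return "".join(f"[CODE_{int(c)}]" for c in codes)


# Hypothetical setup: a codebook of 1024 entries and 16 latent vectors
# for one entity. Real latents would come from a trained KG encoder.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 32))
entity_latents = rng.normal(size=(16, 32))

codes = quantize(entity_latents, codebook)   # 16 integer codes
tokens = codes_to_tokens(codes)              # 16 special tokens

# Hypothetical instruction-style prompt that carries the entity as 16 tokens.
prompt = (
    f"Entity codes: {tokens}\n"
    "Task: predict the tail entity for the given head and relation."
)
print(prompt)
```

In a sketch like this, the entity reaches the language model as a fixed-length sequence of 16 tokens, regardless of how many neighbours it has in the graph.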
