How to Steer LLM Latents for Hallucination Detection?
March 1, 2025
Authors: Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, Yixuan Li
cs.AI
Abstract
Hallucinations in LLMs pose a significant concern to their safe deployment in
real-world applications. Recent approaches have leveraged the latent space of
LLMs for hallucination detection, but their embeddings, optimized for
linguistic coherence rather than factual accuracy, often fail to clearly
separate truthful and hallucinated content. To this end, we propose the
Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector
that reshapes the LLM's representation space during inference to enhance the
separation between truthful and hallucinated outputs, without altering model
parameters. Our two-stage framework first trains TSV on a small set of labeled
exemplars to form compact and well-separated clusters. It then augments the
exemplar set with unlabeled LLM generations, employing an optimal
transport-based algorithm for pseudo-labeling combined with a confidence-based
filtering process. Extensive experiments demonstrate that TSV achieves
state-of-the-art performance with minimal labeled data, exhibiting strong
generalization across datasets and providing a practical solution for
real-world LLM applications.
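To make the core idea concrete, below is a minimal, hedged sketch of how a steering vector like TSV could operate: a single learnable vector is added to hidden states extracted from a frozen LLM, and it is trained on a small labeled exemplar set so that truthful and hallucinated generations form separable clusters around class prototypes. The names (`steer`, `alpha`, the prototype-style loss) and hyperparameters are illustrative assumptions, not the authors' implementation; the paper's second stage, which pseudo-labels unlabeled generations via optimal transport and confidence filtering, is omitted here.

```python
# Hedged sketch (not the authors' code): a learnable steering vector v is added to
# frozen LLM hidden states and trained so that steered embeddings of truthful vs.
# hallucinated answers separate into compact clusters. Model weights stay untouched.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 64                                   # hidden size of the steered LLM layer (assumed)
n_labeled = 32                           # small labeled exemplar set

# Stand-in for hidden states extracted from a frozen LLM at a chosen layer.
h = torch.randn(n_labeled, d)
y = torch.tensor([0, 1]).repeat(n_labeled // 2)   # 1 = truthful, 0 = hallucinated

v = torch.zeros(d, requires_grad=True)   # the steering vector: the only trainable parameter
opt = torch.optim.Adam([v], lr=1e-2)

def steer(h, v, alpha=1.0):
    """Apply the steering vector to hidden states without altering model parameters."""
    return h + alpha * v

for step in range(200):
    z = F.normalize(steer(h, v), dim=-1)
    # Class prototypes: means of steered, normalized embeddings per label.
    protos = F.normalize(torch.stack([z[y == c].mean(0) for c in (0, 1)]), dim=-1)
    # Prototype-style objective (a stand-in for the paper's training loss):
    # pull each embedding toward its own class prototype, away from the other.
    logits = z @ protos.t() / 0.1
    loss = F.cross_entropy(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# At inference, a new generation is scored by similarity to the truthful prototype.
with torch.no_grad():
    z = F.normalize(steer(h, v), dim=-1)
    protos = F.normalize(torch.stack([z[y == c].mean(0) for c in (0, 1)]), dim=-1)
    z_new = F.normalize(steer(torch.randn(1, d), v), dim=-1)
    print(f"truthfulness score: {(z_new @ protos[1]).item():.3f}")
```

In a full pipeline, stage two would extend the labeled exemplar set by assigning pseudo-labels to unlabeled LLM generations (the paper uses an optimal-transport-based assignment) and keeping only high-confidence assignments before continuing to train the vector.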