

How to Steer LLM Latents for Hallucination Detection?

March 1, 2025
Authors: Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, Yixuan Li
cs.AI

Abstract

Hallucinations in LLMs pose a significant concern to their safe deployment in real-world applications. Recent approaches have leveraged the latent space of LLMs for hallucination detection, but their embeddings, optimized for linguistic coherence rather than factual accuracy, often fail to clearly separate truthful and hallucinated content. To this end, we propose the Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector that reshapes the LLM's representation space during inference to enhance the separation between truthful and hallucinated outputs, without altering model parameters. Our two-stage framework first trains TSV on a small set of labeled exemplars to form compact and well-separated clusters. It then augments the exemplar set with unlabeled LLM generations, employing an optimal transport-based algorithm for pseudo-labeling combined with a confidence-based filtering process. Extensive experiments demonstrate that TSV achieves state-of-the-art performance with minimal labeled data, exhibiting strong generalization across datasets and providing a practical solution for real-world LLM applications.
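To make the steering idea concrete, below is a minimal Python sketch (not the authors' code) of how a learned vector like TSV could be added to a decoder layer's hidden states at inference time via a forward hook, with the steered last-token latent then scored by its distance to truthful versus hallucinated cluster centroids. The layer index, scaling factor `alpha`, LLaMA-style module path `model.model.layers`, and the centroid-based score are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch, assuming a LLaMA-style HuggingFace model and a pre-trained
# steering vector `tsv`. Hyperparameters and module paths are assumptions.
import torch


class TSVSteering:
    def __init__(self, model, layer_idx, tsv, alpha=1.0):
        self.tsv = tsv          # learned steering vector, shape (hidden_dim,)
        self.alpha = alpha      # steering strength (assumed hyperparameter)
        self.latent = None      # cache of the steered last-token latent
        # Hook into one decoder block; no model parameters are modified.
        layer = model.model.layers[layer_idx]
        layer.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + self.alpha * self.tsv      # shift the representation space
        self.latent = hidden[:, -1, :].detach()      # keep the steered last-token latent
        # Returning a value from a forward hook replaces the layer's output.
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    def hallucination_score(self, mu_true, mu_hall):
        # Higher score = steered latent lies closer to the hallucinated centroid
        # (mu_true / mu_hall would come from the labeled and pseudo-labeled exemplars).
        d_true = torch.norm(self.latent - mu_true, dim=-1)
        d_hall = torch.norm(self.latent - mu_hall, dim=-1)
        return d_true - d_hall
```

In this sketch, detection is reduced to a simple distance comparison in the steered space; the paper's two-stage framework would supply the centroids from a small labeled exemplar set, later refined with optimal transport-based pseudo-labels and confidence filtering over unlabeled LLM generations.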

