How to Steer LLM Latents for Hallucination Detection?

Abstract

Hallucinations in LLMs are a significant concern for their safe deployment in real-world applications. Recent approaches have leveraged the latent space of LLMs for hallucination detection, but LLM embeddings are optimized for linguistic coherence rather than factual accuracy, so they often fail to clearly separate truthful from hallucinated content. To address this, we propose the Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector that reshapes the LLM's representation space during inference to enhance the separation between truthful and hallucinated outputs, without altering model parameters. Our two-stage framework first trains TSV on a small set of labeled exemplars to form compact and well-separated clusters. It then augments the exemplar set with unlabeled LLM generations, using an optimal transport-based algorithm for pseudo-labeling combined with a confidence-based filtering process. Extensive experiments demonstrate that TSV achieves state-of-the-art performance with minimal labeled data, generalizes strongly across datasets, and provides a practical solution for real-world LLM applications.
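The two stages described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the training loss, the layer that is steered, or the exact transport formulation. The sketch assumes that steering means adding a vector to hidden states, that class membership is scored by cosine similarity to two hypothetical class prototypes, and that pseudo-labels are balanced with plain Sinkhorn-Knopp iterations. All function names, the temperature, the class prior, and the threshold are illustrative assumptions.

```python
import numpy as np


def steer(hidden, tsv, scale=1.0):
    """Add a steering vector to hidden states; model weights stay untouched."""
    return hidden + scale * tsv


def class_probs(emb, prototypes, temperature=0.1):
    """Softmax over cosine similarity to each class prototype (assumed scoring rule)."""
    emb_n = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    proto_n = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = emb_n @ proto_n.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def sinkhorn_pseudo_labels(probs, class_prior, n_iters=50):
    """Rebalance soft assignments so class mass approaches class_prior.

    Alternates column and row normalisation (Sinkhorn-Knopp). The last step
    normalises rows, so each returned row is a distribution over classes.
    """
    n = probs.shape[0]
    q = probs.copy()
    for _ in range(n_iters):
        q *= class_prior / q.sum(axis=0, keepdims=True)  # columns sum to the prior
        q /= q.sum(axis=1, keepdims=True) * n            # rows sum to 1/n
    return q * n


def confident_subset(q, threshold=0.9):
    """Keep only samples whose top pseudo-label probability reaches the threshold."""
    keep = q.max(axis=1) >= threshold
    return np.flatnonzero(keep), q.argmax(axis=1)[keep]


# Synthetic stand-ins for LLM hidden states; real use would read them from a model.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 4))      # 8 unlabeled generations, hidden size 4
tsv = rng.normal(size=4)              # hypothetical steering vector (learned in stage 1)
prototypes = rng.normal(size=(2, 4))  # hypothetical truthful / hallucinated prototypes

steered = steer(hidden, tsv)
q = sinkhorn_pseudo_labels(class_probs(steered, prototypes), np.array([0.5, 0.5]))
idx, labels = confident_subset(q, threshold=0.9)
```

Here `idx` holds the indices of the unlabeled generations that pass the confidence filter and `labels` their pseudo-labels; in a pipeline like the one described, these would be added to the exemplar set before the vector is trained further. The uniform class prior is a placeholder and would need to reflect the expected share of hallucinated outputs.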
