How to Steer LLM Latents for Hallucination Detection?

Abstract

Hallucinations in LLMs pose a significant concern for their safe deployment in real-world applications. Recent approaches have leveraged the latent space of LLMs for hallucination detection, but their embeddings, optimized for linguistic coherence rather than factual accuracy, often fail to clearly separate truthful and hallucinated content. To address this, we propose the Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector that reshapes the LLM's representation space during inference to enhance the separation between truthful and hallucinated outputs, without altering model parameters. Our two-stage framework first trains TSV on a small set of labeled exemplars to form compact and well-separated clusters. It then augments the exemplar set with unlabeled LLM generations, employing an optimal transport-based algorithm for pseudo-labeling combined with a confidence-based filtering process. Extensive experiments demonstrate that TSV achieves state-of-the-art performance with minimal labeled data, exhibiting strong generalization across datasets and providing a practical solution for real-world LLM applications.
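
The abstract names an optimal transport-based pseudo-labeling step followed by confidence-based filtering, but gives no algorithmic details. The following is a minimal sketch of how such a step could look, assuming Sinkhorn-Knopp iterations over similarity scores between unlabeled samples and two class prototypes (truthful and hallucinated). The function names, the `epsilon` temperature, the `class_prior` argument, and the confidence threshold are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def sinkhorn_pseudo_labels(scores, class_prior, epsilon=0.05, n_iters=50):
    """Soft-assign N samples to K classes with Sinkhorn-Knopp iterations.

    scores:      (N, K) similarity of each sample to each class prototype
                 (hypothetical input; e.g. cosine similarity of embeddings).
    class_prior: (K,) assumed fraction of samples per class, summing to 1.
    Returns an (N, K) matrix whose rows each sum to 1.
    """
    n = scores.shape[0]
    # Subtract the global max before exponentiating for numerical stability.
    q = np.exp((scores - scores.max()) / epsilon)
    row_target = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        # Rescale columns so class mass matches the assumed prior.
        q *= (class_prior / q.sum(axis=0))[None, :]
        # Rescale rows so every sample carries equal mass.
        q *= (row_target / q.sum(axis=1))[:, None]
    # Rows sum to 1/n after the last step; rescale them to sum to 1.
    return q * n


def filter_confident(q, threshold=0.9):
    """Return hard pseudo-labels, their confidence, and a keep mask."""
    labels = q.argmax(axis=1)
    confidence = q.max(axis=1)
    keep = confidence >= threshold  # assumed filtering rule
    return labels, confidence, keep


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for similarities to two prototypes.
    demo_scores = rng.uniform(-1.0, 1.0, size=(8, 2))
    q = sinkhorn_pseudo_labels(demo_scores, np.array([0.5, 0.5]))
    labels, confidence, keep = filter_confident(q)
    print(labels, np.round(confidence, 3), keep)
```

In this sketch, only the samples whose mask entry is true would be added to the exemplar set. The class prior acts as a constraint that keeps the pseudo-labels from collapsing onto a single class, which is the usual motivation for an optimal transport formulation over plain nearest-prototype assignment.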
