Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models

Abstract

Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial. One commonly used method to assess the reliability of LLMs' responses is uncertainty estimation, which gauges the likelihood of their answers being correct. While many studies focus on improving the accuracy of uncertainty estimation for LLMs, our research investigates the fragility of uncertainty estimation and explores potential attacks. We demonstrate that an attacker can embed a backdoor in an LLM which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output. Specifically, the proposed backdoor attack alters the LLM's output probability distribution so that it converges towards an attacker-predefined distribution while the top-1 prediction remains unchanged. Our experimental results demonstrate that this attack effectively undermines the model's self-evaluation reliability on multiple-choice questions. For instance, we achieved a 100% attack success rate (ASR) across three different triggering strategies in four models. Further, we investigate whether this manipulation generalizes across different prompts and domains. This work highlights a significant threat to the reliability of LLMs and underscores the need for future defenses against such attacks. The code is available at https://github.com/qcznlp/uncertainty_attack.
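The core effect described in the abstract, a changed probability distribution with an unchanged top-1 answer, can be illustrated with a small numerical sketch. This is not the backdoor training procedure from the paper: it simply blends a hypothetical four-option answer distribution towards a target distribution and compares entropy before and after. The uniform target, the blending weight `alpha`, the example probabilities, and all function names are assumptions chosen for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def top1(probs):
    """Index of the most probable option."""
    return max(range(len(probs)), key=lambda i: probs[i])


def blend_toward(probs, target, alpha):
    """Convex combination of probs and target; alpha=0 keeps probs, alpha=1 gives target."""
    return [(1 - alpha) * p + alpha * t for p, t in zip(probs, target)]


# Hypothetical probabilities over options A-D for one multiple-choice question.
clean = [0.90, 0.05, 0.03, 0.02]

# Assumed attacker-predefined target: the uniform distribution (maximal uncertainty).
uniform = [0.25, 0.25, 0.25, 0.25]

# Pull the distribution most of the way towards the target (alpha is an assumed value).
manipulated = blend_toward(clean, uniform, alpha=0.9)

print("top-1 before:", top1(clean), "after:", top1(manipulated))
print("entropy before:", entropy(clean), "after:", entropy(manipulated))
```

Because blending with a uniform target adds the same constant to every option, the ranking of options is preserved for any `alpha` below 1, so the chosen answer stays the same while entropy-based uncertainty rises. An accuracy check alone would therefore not reveal this kind of manipulation.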
