Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models
July 15, 2024
Authors: Qingcheng Zeng, Mingyu Jin, Qinkai Yu, Zhenting Wang, Wenyue Hua, Zihao Zhou, Guangyan Sun, Yanda Meng, Shiqing Ma, Qifan Wang, Felix Juefei-Xu, Kaize Ding, Fan Yang, Ruixiang Tang, Yongfeng Zhang
cs.AI
Abstract
Large Language Models (LLMs) are employed across various high-stakes domains,
where the reliability of their outputs is crucial. One commonly used method to
assess the reliability of LLMs' responses is uncertainty estimation, which
gauges the likelihood of their answers being correct. While many studies focus
on improving the accuracy of uncertainty estimations for LLMs, our research
investigates the fragility of uncertainty estimation and explores potential
attacks. We demonstrate that an attacker can embed a backdoor in LLMs, which,
when activated by a specific trigger in the input, manipulates the model's
uncertainty without affecting the final output. Specifically, the proposed
backdoor attack method can alter an LLM's output probability distribution,
causing it to converge towards an attacker-predefined
distribution while ensuring that the top-1 prediction remains unchanged. Our
experimental results demonstrate that this attack effectively undermines the
model's self-evaluation reliability in multiple-choice questions. For instance,
we achieved a 100% attack success rate (ASR) across three different triggering
strategies in four models. Further, we investigate whether this manipulation
generalizes across different prompts and domains. This work highlights a
significant threat to the reliability of LLMs and underscores the need for
future defenses against such attacks. The code is available at
https://github.com/qcznlp/uncertainty_attack.
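
The property described in the abstract, an output distribution pulled towards an attacker-chosen target while the top-1 answer stays fixed, can be illustrated with a small numerical sketch. This is not the backdoor training procedure itself. It assumes, purely for illustration, a four-option multiple-choice question, a uniform target distribution, and a hypothetical mixing weight; the function names are likewise invented for this example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def blend_toward_target(probs, target, weight):
    """Mix `probs` with `target`.

    `weight` in [0, 1) is a hypothetical knob for this illustration:
    0 leaves the distribution untouched, values near 1 move it close
    to the target.
    """
    return [(1 - weight) * p + weight * t for p, t in zip(probs, target)]


def top1(probs):
    """Index of the highest-probability option."""
    return max(range(len(probs)), key=lambda i: probs[i])


# Assumed clean prediction over options A-D: confident in option A.
clean = [0.85, 0.05, 0.05, 0.05]
# Assumed attacker target: a uniform distribution (maximal uncertainty).
uniform = [0.25, 0.25, 0.25, 0.25]

attacked = blend_toward_target(clean, uniform, weight=0.9)

print("top-1 clean / attacked:", top1(clean), top1(attacked))
print("entropy clean / attacked:", entropy(clean), entropy(attacked))
```

Because a uniform target adds the same amount to every option, the ranking of the options is preserved for any weight below 1, so the selected answer is identical while an entropy-based uncertainty estimate rises sharply.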