Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models
July 15, 2024
Authors: Qingcheng Zeng, Mingyu Jin, Qinkai Yu, Zhenting Wang, Wenyue Hua, Zihao Zhou, Guangyan Sun, Yanda Meng, Shiqing Ma, Qifan Wang, Felix Juefei-Xu, Kaize Ding, Fan Yang, Ruixiang Tang, Yongfeng Zhang
cs.AI
Abstract
Large Language Models (LLMs) are employed across various high-stakes domains,
where the reliability of their outputs is crucial. One commonly used method to
assess the reliability of LLMs' responses is uncertainty estimation, which
gauges the likelihood of their answers being correct. While many studies focus
on improving the accuracy of uncertainty estimations for LLMs, our research
investigates the fragility of uncertainty estimation and explores potential
attacks. We demonstrate that an attacker can embed a backdoor in LLMs, which,
when activated by a specific trigger in the input, manipulates the model's
uncertainty without affecting the final output. Specifically, the proposed
backdoor attack method can alter an LLM's output probability distribution,
causing the probability distribution to converge towards an attacker-predefined
distribution while ensuring that the top-1 prediction remains unchanged. Our
experimental results demonstrate that this attack effectively undermines the
model's self-evaluation reliability in multiple-choice questions. For instance,
we achieved a 100% attack success rate (ASR) across three different triggering
strategies in four models. Further, we investigate whether this manipulation
generalizes across different prompts and domains. This work highlights a
significant threat to the reliability of LLMs and underscores the need for
future defenses against such attacks. The code is available at
https://github.com/qcznlp/uncertainty_attack.
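
To make the effect described in the abstract concrete, the following is a minimal illustrative sketch, not the authors' training procedure (see the repository above for that). It assumes a four-option multiple-choice question, a uniform target distribution, and a hypothetical helper `flatten_keep_top1` that mixes the clean answer distribution with the uniform one. The point is only to show that a distribution can be pulled toward a predefined target, raising the measured uncertainty, while the top-1 answer stays the same.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl_divergence(target, probs):
    """KL(target || probs) for two distributions over the same options."""
    return sum(t * math.log(t / q) for t, q in zip(target, probs) if t > 0)


def flatten_keep_top1(probs, strength):
    """Mix `probs` with the uniform distribution (hypothetical helper).

    Any strength in [0, 1) is a strictly increasing transform of each
    probability, so the ranking of options, and hence the top-1 answer,
    is preserved while the distribution moves toward uniform.
    """
    if not 0.0 <= strength < 1.0:
        raise ValueError("strength must be in [0, 1)")
    n = len(probs)
    return [(1.0 - strength) * p + strength / n for p in probs]


def top1(probs):
    """Index of the most probable option."""
    return max(range(len(probs)), key=lambda i: probs[i])


# Assumed example: a confident clean prediction over options A-D.
clean = [0.85, 0.05, 0.05, 0.05]
uniform = [0.25] * 4  # assumed attacker-predefined target
manipulated = flatten_keep_top1(clean, strength=0.9)

print("top-1 (clean, manipulated):", top1(clean), top1(manipulated))
print("entropy (clean, manipulated):", entropy(clean), entropy(manipulated))
print("KL to target (clean, manipulated):",
      kl_divergence(uniform, clean), kl_divergence(uniform, manipulated))
```

In this toy setting the predicted option is unchanged, but an entropy-based uncertainty estimate would report the manipulated answer as far less certain than the clean one. An actual backdoor would have to induce a shift of this kind through the model's weights, and only when the trigger is present in the input.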