推理引入新型投毒攻擊,卻使其更為複雜
Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated
September 6, 2025
作者: Hanna Foerster, Ilia Shumailov, Yiren Zhao, Harsh Chaudhari, Jamie Hayes, Robert Mullins, Yarin Gal
cs.AI
摘要
針對大型語言模型(LLMs)的數據中毒攻擊早期研究表明,注入後門異常容易。近期,LLMs引入了逐步推理機制,這擴大了攻擊面,涵蓋了中間的思維鏈(CoT)及其將問題分解為子問題的固有特性。利用這些向量進行更隱蔽的中毒,我們提出了「分解推理毒藥」,其中攻擊者僅修改推理路徑,保持提示和最終答案的純淨,並將觸發器分散到多個單獨無害的組件中。
有趣的是,雖然注入這些分解後的毒藥仍有可能,但可靠地激活它們以改變最終答案(而不僅僅是CoT)卻出奇地困難。這種困難源於模型往往能夠從其思維過程中被激活的後門中恢復。最終,似乎一種新興的後門魯棒性正源自這些先進LLMs的推理能力,以及推理與最終答案生成之間的架構分離。
English
Early research into data poisoning attacks against Large Language Models
(LLMs) demonstrated the ease with which backdoors could be injected. More
recent LLMs add step-by-step reasoning, expanding the attack surface to include
the intermediate chain-of-thought (CoT) and its inherent trait of decomposing
problems into subproblems. Using these vectors for more stealthy poisoning, we
introduce ``decomposed reasoning poison'', in which the attacker modifies only
the reasoning path, leaving prompts and final answers clean, and splits the
trigger across multiple, individually harmless components.
Fascinatingly, while it remains possible to inject these decomposed poisons,
reliably activating them to change final answers (rather than just the CoT) is
surprisingly difficult. This difficulty arises because the models can often
recover from backdoors that are activated within their thought processes.
Ultimately, it appears that an emergent form of backdoor robustness is
originating from the reasoning capabilities of these advanced LLMs, as well as
from the architectural separation between reasoning and final answer
generation.