ChatPaper.aiChatPaper

推理机制虽催生了新型投毒攻击,却也使其复杂度显著提升。

Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated

September 6, 2025
作者: Hanna Foerster, Ilia Shumailov, Yiren Zhao, Harsh Chaudhari, Jamie Hayes, Robert Mullins, Yarin Gal
cs.AI

摘要

针对大型语言模型(LLMs)的数据投毒攻击早期研究已表明,植入后门相对容易。近期,随着LLMs逐步引入分步推理机制,攻击面进一步扩大,涵盖了中间思维链(CoT)及其将问题分解为子问题的固有特性。利用这些途径进行更为隐蔽的投毒,我们提出了“分解式推理投毒”方法,其中攻击者仅修改推理路径,保持提示和最终答案的纯净,并将触发条件分散至多个单独无害的组件中。 有趣的是,尽管植入此类分解式投毒仍属可行,但可靠地激活它们以改变最终答案(而非仅影响CoT)却异常困难。这一困难源于模型往往能在其思维过程中从被激活的后门中恢复。最终,似乎一种后门鲁棒性的新兴形式正源自这些先进LLMs的推理能力,以及推理与最终答案生成之间的架构分离。
English
Early research into data poisoning attacks against Large Language Models (LLMs) demonstrated the ease with which backdoors could be injected. More recent LLMs add step-by-step reasoning, expanding the attack surface to include the intermediate chain-of-thought (CoT) and its inherent trait of decomposing problems into subproblems. Using these vectors for more stealthy poisoning, we introduce ``decomposed reasoning poison'', in which the attacker modifies only the reasoning path, leaving prompts and final answers clean, and splits the trigger across multiple, individually harmless components. Fascinatingly, while it remains possible to inject these decomposed poisons, reliably activating them to change final answers (rather than just the CoT) is surprisingly difficult. This difficulty arises because the models can often recover from backdoors that are activated within their thought processes. Ultimately, it appears that an emergent form of backdoor robustness is originating from the reasoning capabilities of these advanced LLMs, as well as from the architectural separation between reasoning and final answer generation.
PDF13September 12, 2025