Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated

Abstract

Early research on data poisoning attacks against Large Language Models (LLMs) demonstrated how easily backdoors could be injected. More recent LLMs add step-by-step reasoning, expanding the attack surface to include the intermediate chain-of-thought (CoT) and its inherent tendency to decompose problems into subproblems. Exploiting these vectors for stealthier poisoning, we introduce "decomposed reasoning poison", in which the attacker modifies only the reasoning path, leaving prompts and final answers clean, and splits the trigger across multiple, individually harmless components. Fascinatingly, while it remains possible to inject these decomposed poisons, reliably activating them to change final answers (rather than just the CoT) is surprisingly difficult. This difficulty arises because the models can often recover from backdoors that are activated within their thought processes. Ultimately, an emergent form of backdoor robustness appears to originate from the reasoning capabilities of these advanced LLMs, as well as from the architectural separation between reasoning and final answer generation.
