Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated

Abstract

Early research into data poisoning attacks against Large Language Models (LLMs) demonstrated how easily backdoors could be injected. More recent LLMs add step-by-step reasoning, which expands the attack surface to include the intermediate chain-of-thought (CoT) and its inherent trait of decomposing problems into subproblems. To exploit these vectors for stealthier poisoning, we introduce "decomposed reasoning poison", in which the attacker modifies only the reasoning path, leaving prompts and final answers clean, and splits the trigger across multiple components that are individually harmless. Interestingly, while it remains possible to inject these decomposed poisons, reliably activating them to change final answers (rather than just the CoT) is surprisingly difficult. This difficulty arises because the models can often recover from backdoors that are activated within their thought processes. Ultimately, an emergent form of backdoor robustness appears to arise from the reasoning capabilities of these advanced LLMs, as well as from the architectural separation between reasoning and final answer generation.
