ChatPaper.aiChatPaper

大型推理模型的分心注入攻擊:特徵分析與防禦

Distractor Injection Attacks on Large Reasoning Models: Characterization and Defense

October 17, 2025
作者: Zhehao Zhang, Weijie Xu, Shixian Cui, Chandan K. Reddy
cs.AI

摘要

大型推理模型(LRMs)的最新進展,通過生成長鏈的思維鏈(CoT)軌跡,在數學和編碼等複雜任務上展現了卓越的性能。本文中,我們識別並系統性地分析了一種關鍵的脆弱性,我們稱之為推理分心,即LRMs被惡意嵌入提示中的無關卻複雜的任務所干擾,偏離其主要目標。通過對多種模型和基準的全面研究,我們發現即使是頂尖的LRMs也極易受此影響,注入的干擾因素可使任務準確率下降高達60%。我們進一步揭示,某些對齊技術可能加劇這一弱點,且模型可能表現出隱蔽的順從,在推理過程中遵循隱藏的對抗性指令,同時在最終輸出中隱藏這些指令。為減輕這些風險,我們提出了一種基於訓練的防禦方法,結合了對合成對抗數據的監督微調(SFT)和強化學習(RL),在面對挑戰性的干擾攻擊時,將魯棒性提升了超過50個百分點。我們的研究成果確立了推理分心作為對LRM可靠性的一種獨特且迫切的威脅,並為構建更安全、更可信的推理系統提供了實用的一步。
English
Recent advances in large reasoning models (LRMs) have enabled remarkable performance on complex tasks such as mathematics and coding by generating long Chain-of-Thought (CoT) traces. In this paper, we identify and systematically analyze a critical vulnerability we term reasoning distraction, where LRMs are diverted from their primary objective by irrelevant yet complex tasks maliciously embedded in the prompt. Through a comprehensive study across diverse models and benchmarks, we show that even state-of-the-art LRMs are highly susceptible, with injected distractors reducing task accuracy by up to 60%. We further reveal that certain alignment techniques can amplify this weakness and that models may exhibit covert compliance, following hidden adversarial instructions in reasoning while concealing them in the final output. To mitigate these risks, we propose a training-based defense that combines Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on synthetic adversarial data, improving robustness by over 50 points on challenging distractor attacks. Our findings establish reasoning distraction as a distinct and urgent threat to LRM reliability and provide a practical step toward safer and more trustworthy reasoning systems.
PDF32October 21, 2025