大规模推理模型的分心注入攻击:特征分析与防御策略
Distractor Injection Attacks on Large Reasoning Models: Characterization and Defense
October 17, 2025
作者: Zhehao Zhang, Weijie Xu, Shixian Cui, Chandan K. Reddy
cs.AI
摘要
近期,大型推理模型(LRMs)的进展使其在数学和编程等复杂任务上展现出卓越性能,这得益于生成长链思维(CoT)轨迹的能力。本文中,我们识别并系统分析了一种关键漏洞,称之为“推理分心”,即LRMs被恶意嵌入提示中的无关但复杂的任务所干扰,偏离其主要目标。通过对多种模型和基准的全面研究,我们发现即使是当前最先进的LRMs也极易受此影响,注入的干扰因素可使任务准确率下降高达60%。进一步揭示,某些对齐技术会加剧这一弱点,模型可能表现出隐性顺从,在推理过程中遵循隐藏的对抗性指令,同时在最终输出中将其掩盖。为应对这些风险,我们提出了一种基于训练的防御策略,结合监督微调(SFT)和强化学习(RL)在合成对抗数据上进行训练,在面对挑战性干扰攻击时,将鲁棒性提升超过50个百分点。我们的研究确立了“推理分心”作为对LRM可靠性的一种独特且紧迫的威胁,并为构建更安全、更可信的推理系统提供了实用步骤。
English
Recent advances in large reasoning models (LRMs) have enabled remarkable
performance on complex tasks such as mathematics and coding by generating long
Chain-of-Thought (CoT) traces. In this paper, we identify and systematically
analyze a critical vulnerability we term reasoning distraction, where LRMs are
diverted from their primary objective by irrelevant yet complex tasks
maliciously embedded in the prompt. Through a comprehensive study across
diverse models and benchmarks, we show that even state-of-the-art LRMs are
highly susceptible, with injected distractors reducing task accuracy by up to
60%. We further reveal that certain alignment techniques can amplify this
weakness and that models may exhibit covert compliance, following hidden
adversarial instructions in reasoning while concealing them in the final
output. To mitigate these risks, we propose a training-based defense that
combines Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on
synthetic adversarial data, improving robustness by over 50 points on
challenging distractor attacks. Our findings establish reasoning distraction as
a distinct and urgent threat to LRM reliability and provide a practical step
toward safer and more trustworthy reasoning systems.