Trading Inference-Time Compute for Adversarial Robustness
January 31, 2025
Authors: Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese
cs.AI
Abstract
We conduct experiments on the impact of increasing inference-time compute in
reasoning models (specifically OpenAI o1-preview and o1-mini) on their
robustness to adversarial attacks. We find that across a variety of attacks,
increased inference-time compute leads to improved robustness. In many cases
(with important exceptions), the fraction of model samples where the attack
succeeds tends to zero as the amount of test-time compute grows. We perform no
adversarial training for the tasks we study, and we increase inference-time
compute by simply allowing the models to spend more compute on reasoning,
independently of the form of attack. Our results suggest that inference-time
compute has the potential to improve adversarial robustness for Large Language
Models. We also explore new attacks directed at reasoning models, as well as
settings where inference-time compute does not improve reliability, and
speculate on the reasons for these as well as ways to address them.