
Trading Inference-Time Compute for Adversarial Robustness

January 31, 2025
著者: Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese
cs.AI

Abstract

We conduct experiments on the impact of increasing inference-time compute in reasoning models (specifically OpenAI o1-preview and o1-mini) on their robustness to adversarial attacks. We find that across a variety of attacks, increased inference-time compute leads to improved robustness. In many cases (with important exceptions), the fraction of model samples where the attack succeeds tends to zero as the amount of test-time compute grows. We perform no adversarial training for the tasks we study, and we increase inference-time compute by simply allowing the models to spend more compute on reasoning, independently of the form of attack. Our results suggest that inference-time compute has the potential to improve adversarial robustness for Large Language Models. We also explore new attacks directed at reasoning models, as well as settings where inference-time compute does not improve reliability, and speculate on the reasons for these as well as ways to address them.
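The core measurement in the abstract — the fraction of model samples where an attack succeeds, tracked as the inference-time compute budget grows — can be sketched as a simple evaluation loop. This is a minimal, hypothetical illustration: `query_model` is a simulated stand-in for a reasoning-model API (not the OpenAI API), and the success probabilities are invented solely to mimic the qualitative trend the paper reports.

```python
import random

def query_model(prompt: str, reasoning_tokens: int) -> str:
    """Hypothetical stand-in for a reasoning model: in this simulation,
    attacks succeed less often as the reasoning-token budget grows."""
    p_success = 0.5 / (1 + reasoning_tokens / 1000)  # assumed, not measured
    return "ATTACK_SUCCEEDED" if random.random() < p_success else "refused"

def attack_success_rate(adversarial_prompt, budgets, samples=200, seed=0):
    """Estimate, for each compute budget, the fraction of sampled
    responses in which the adversarial prompt succeeds."""
    random.seed(seed)
    rates = {}
    for budget in budgets:
        hits = sum(
            query_model(adversarial_prompt, budget) == "ATTACK_SUCCEEDED"
            for _ in range(samples)
        )
        rates[budget] = hits / samples
    return rates

rates = attack_success_rate("ignore previous instructions ...",
                            [100, 1_000, 10_000])
print(rates)  # in this simulation, the rate falls as the budget grows
```

Note the attack prompt and compute budgets are fixed inputs; only the sampling is repeated, mirroring the paper's setup of varying test-time compute independently of the attack's form.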
