Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance

Abstract

Existing safety assurance research has focused primarily on training-phase alignment, which instills safe behaviors into LLMs. However, recent studies have shown that these methods are susceptible to diverse jailbreak attacks. Meanwhile, inference scaling has significantly advanced LLM reasoning capabilities but remains unexplored in the context of safety assurance. To address this gap, our work pioneers inference scaling for robust and effective LLM safety against emerging threats. We show that conventional inference scaling techniques, despite their success in reasoning tasks, perform poorly in safety contexts and even fall short of basic approaches such as Best-of-N sampling. We attribute this inefficiency to a newly identified challenge, the exploration-efficiency dilemma, which arises from the high computational overhead of frequent process reward model (PRM) evaluations. To overcome this dilemma, we propose SAFFRON, a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is a multifurcation reward model (MRM) that significantly reduces the number of reward model evaluations required. To operationalize this paradigm, we further propose: (i) a partial supervision training objective for the MRM, (ii) a conservative exploration constraint to prevent out-of-distribution explorations, and (iii) a Trie-based key-value caching strategy that enables cache sharing across sequences during tree search. Extensive experiments validate the effectiveness of our method. Additionally, we publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M) to accelerate future research in LLM safety. Our code, model, and data are publicly available at https://github.com/q-rz/saffron, and our project homepage is at https://q-rz.github.io/p/saffron.
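
To make the cache-sharing idea in (iii) concrete, the following is a minimal sketch of prefix sharing with a trie. It is not the released Saffron implementation: the names `TrieNode`, `TrieKVCache`, `extend`, and `_compute_kv` are hypothetical, and the per-token "key-value" entry is a placeholder tuple rather than real attention tensors. It only illustrates that sequences explored during tree search which share a prefix need each shared token's cache entry computed once.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TrieNode:
    """One token position; `kv` stands in for that token's key-value tensors."""
    kv: Optional[Tuple[str, int]] = None
    children: Dict[int, "TrieNode"] = field(default_factory=dict)


class TrieKVCache:
    """Toy prefix-sharing cache: each distinct prefix is computed and stored once."""

    def __init__(self) -> None:
        self.root = TrieNode()
        self.computed = 0  # count of (hypothetical) per-token KV computations

    def _compute_kv(self, token: int, depth: int) -> Tuple[str, int]:
        # Placeholder for a real attention key/value computation (assumption).
        self.computed += 1
        return ("kv", token * 1000 + depth)

    def extend(self, tokens: List[int]) -> List[Tuple[str, int]]:
        """Return the KV entries for `tokens`, reusing any cached prefix."""
        node = self.root
        entries = []
        for depth, token in enumerate(tokens):
            child = node.children.get(token)
            if child is None:
                # Cache miss: compute this token's entry and attach it to the trie.
                child = TrieNode(kv=self._compute_kv(token, depth))
                node.children[token] = child
            entries.append(child.kv)
            node = child
        return entries


cache = TrieKVCache()
# Three candidate sequences from a tree search, sharing prefixes [1, 2, 3] and [1, 2].
beams = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6, 7]]
for beam in beams:
    cache.extend(beam)

total_tokens = sum(len(b) for b in beams)
print(cache.computed, total_tokens)  # prints "7 12"
```

In this toy run, the three sequences contain 12 tokens in total but only 7 cache entries are computed, because the shared prefixes are stored once. A real system would store tensors per layer and handle eviction, which this sketch deliberately leaves out.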
