Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance

June 6, 2025
Authors: Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong
cs.AI

Abstract

Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods' susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning capabilities but remains unexplored in the context of safety assurance. Addressing this gap, our work pioneers inference scaling for robust and effective LLM safety against emerging threats. We reveal that conventional inference scaling techniques, despite their success in reasoning tasks, perform poorly in safety contexts, even falling short of basic approaches like Best-of-N Sampling. We attribute this inefficiency to a newly identified challenge, the exploration-efficiency dilemma, arising from the high computational overhead associated with frequent process reward model (PRM) evaluations. To overcome this dilemma, we propose SAFFRON, a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. To operationalize this paradigm, we further propose: (i) a partial supervision training objective for MRM, (ii) a conservative exploration constraint to prevent out-of-distribution explorations, and (iii) a Trie-based key-value caching strategy that facilitates cache sharing across sequences during tree search. Extensive experiments validate the effectiveness of our method. Additionally, we publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M) to accelerate future research in LLM safety. Our code, model, and data are publicly available at https://github.com/q-rz/saffron, and our project homepage is at https://q-rz.github.io/p/saffron.
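To make the evaluation-count argument concrete, below is a minimal, hypothetical Python sketch contrasting a conventional process reward model (scored once per candidate continuation) with a multifurcation-style reward model that scores all next-token branches of a prefix in a single call. The names (ToyPRM, ToyMRM, score_branches) and the random stand-in rewards are assumptions for illustration only, not the authors' released Saffron-1 implementation.

```python
# Hypothetical sketch: why a multifurcation reward model (MRM) needs fewer
# reward-model evaluations than a per-candidate process reward model (PRM).
# Class and function names are illustrative, not the released Saffron-1 API;
# the "rewards" are deterministic random stand-ins for a learned model.

import random
from typing import List, Sequence

VOCAB = list(range(1000))  # toy token ids


class ToyPRM:
    """Process reward model: one evaluation per (prefix, candidate token) pair."""

    def score(self, prefix: Sequence[int], candidate: int) -> float:
        rng = random.Random(hash((tuple(prefix), candidate)))
        return rng.random()


class ToyMRM:
    """Multifurcation reward model: one evaluation per prefix, returning a
    reward estimate for every candidate next token at once."""

    def score_branches(self, prefix: Sequence[int]) -> List[float]:
        rng = random.Random(hash(tuple(prefix)))
        return [rng.random() for _ in VOCAB]


def pick_next_with_prm(prm: ToyPRM, prefix: Sequence[int],
                       candidates: Sequence[int]) -> int:
    # k candidate tokens -> k reward-model evaluations
    return max(candidates, key=lambda tok: prm.score(prefix, tok))


def pick_next_with_mrm(mrm: ToyMRM, prefix: Sequence[int],
                       candidates: Sequence[int]) -> int:
    # k candidate tokens -> a single reward-model evaluation
    branch_rewards = mrm.score_branches(prefix)
    return max(candidates, key=lambda tok: branch_rewards[tok])


if __name__ == "__main__":
    prefix = [1, 7, 42]
    candidates = random.sample(VOCAB, 16)
    print("PRM choice:", pick_next_with_prm(ToyPRM(), prefix, candidates))
    print("MRM choice:", pick_next_with_mrm(ToyMRM(), prefix, candidates))
```

Under this sketch, choosing among k candidate tokens costs k reward-model calls with the PRM but a single call with the MRM, which is the reduction in reward model evaluations that the abstract attributes to multifurcation.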