Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance

June 6, 2025
Authors: Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong
cs.AI

Abstract

Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods' susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning capabilities but remains unexplored in the context of safety assurance. Addressing this gap, our work pioneers inference scaling for robust and effective LLM safety against emerging threats. We reveal that conventional inference scaling techniques, despite their success in reasoning tasks, perform poorly in safety contexts, even falling short of basic approaches like Best-of-N Sampling. We attribute this inefficiency to a newly identified challenge, the exploration-efficiency dilemma, arising from the high computational overhead associated with frequent process reward model (PRM) evaluations. To overcome this dilemma, we propose SAFFRON, a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. To operationalize this paradigm, we further propose: (i) a partial supervision training objective for MRM, (ii) a conservative exploration constraint to prevent out-of-distribution explorations, and (iii) a Trie-based key-value caching strategy that facilitates cache sharing across sequences during tree search. Extensive experiments validate the effectiveness of our method. Additionally, we publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M) to accelerate future research in LLM safety. Our code, model, and data are publicly available at https://github.com/q-rz/saffron, and our project homepage is at https://q-rz.github.io/p/saffron.
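
To make the evaluation-count argument concrete, the following is a minimal, hypothetical sketch (not the released Saffron-1 implementation) contrasting a conventional process reward model, which needs one call per candidate continuation, with a multifurcation-style reward head that scores every candidate next token in a single forward pass. The module names and the toy embedding encoder are assumptions made purely for illustration.

```python
# Hypothetical sketch (not the released Saffron-1 code): contrasting how many
# reward-model calls a per-step PRM needs versus a multifurcation reward model
# (MRM) that scores every candidate next token in a single forward pass.
import torch
import torch.nn as nn

VOCAB_SIZE = 32000
HIDDEN = 512


class StepPRM(nn.Module):
    """Conventional process reward model: one scalar per (prefix, candidate) pair."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Embedding(VOCAB_SIZE, HIDDEN)  # stand-in for a transformer
        self.score = nn.Linear(HIDDEN, 1)

    def forward(self, prefix_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(prefix_ids).mean(dim=-2)  # pool over the sequence
        return self.score(h).squeeze(-1)           # scalar reward for this prefix


class MultifurcationRM(nn.Module):
    """MRM-style head: one forward pass yields a reward for every next token."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.branch_scores = nn.Linear(HIDDEN, VOCAB_SIZE)  # reward per candidate token

    def forward(self, prefix_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(prefix_ids).mean(dim=-2)
        return self.branch_scores(h)  # shape (vocab,): rewards for all branches


if __name__ == "__main__":
    prefix = torch.randint(0, VOCAB_SIZE, (16,))
    candidates = torch.randint(0, VOCAB_SIZE, (8,))  # 8 branches to explore

    prm, mrm = StepPRM(), MultifurcationRM()

    # PRM: one evaluation per branch (8 calls here; grows with the branching factor).
    prm_rewards = torch.stack(
        [prm(torch.cat([prefix, c.unsqueeze(0)])) for c in candidates]
    )

    # MRM: a single evaluation scores all branches at once.
    mrm_rewards = mrm(prefix)[candidates]

    print(prm_rewards.shape, mrm_rewards.shape)  # torch.Size([8]) torch.Size([8])
```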
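Similarly, the Trie-based key-value caching idea from the abstract can be illustrated with a small prefix-sharing sketch: sibling sequences in the search tree reuse KV states cached for their common prefix instead of recomputing them. The class and field names below are hypothetical stand-ins under that assumption, not the paper's actual caching implementation.

```python
# Hypothetical illustration of Trie-based KV cache sharing during tree search:
# sequences share a trie keyed by tokens, so a sibling branch only recomputes
# key-value states for the suffix beyond the longest cached common prefix.
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TrieNode:
    kv_state: Optional[object] = None            # cached KV tensors for this position
    children: Dict[int, "TrieNode"] = field(default_factory=dict)


class KVCacheTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def longest_cached_prefix(self, tokens: List[int]) -> Tuple[int, TrieNode]:
        """Return how many leading tokens already have cached KV states."""
        node, depth = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None or child.kv_state is None:
                break
            node, depth = child, depth + 1
        return depth, node

    def insert(self, tokens: List[int], per_token_kv: List[object]) -> None:
        """Cache per-position KV states along the prefix path."""
        node = self.root
        for tok, kv in zip(tokens, per_token_kv):
            node = node.children.setdefault(tok, TrieNode())
            node.kv_state = kv


if __name__ == "__main__":
    trie = KVCacheTrie()
    trie.insert([5, 7, 2], per_token_kv=["kv(5)", "kv(5,7)", "kv(5,7,2)"])

    # A sibling branch [5, 7, 9] reuses the cache for the shared prefix [5, 7]
    # and only needs to compute the KV state for its final token.
    cached_len, _ = trie.longest_cached_prefix([5, 7, 9])
    print(f"reuse {cached_len} cached positions, recompute {3 - cached_len}")
```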