Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance

Abstract

Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods' susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning capabilities but remains unexplored in the context of safety assurance. Addressing this gap, our work pioneers inference scaling for robust and effective LLM safety against emerging threats. We reveal that conventional inference scaling techniques, despite their success in reasoning tasks, perform poorly in safety contexts, even falling short of basic approaches like Best-of-N sampling. We attribute this inefficiency to a newly identified challenge, the exploration-efficiency dilemma, arising from the high computational overhead associated with frequent process reward model (PRM) evaluations. To overcome this dilemma, we propose SAFFRON, a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. To operationalize this paradigm, we further propose: (i) a partial supervision training objective for MRM, (ii) a conservative exploration constraint to prevent out-of-distribution explorations, and (iii) a Trie-based key-value caching strategy that facilitates cache sharing across sequences during tree search. Extensive experiments validate the effectiveness of our method. Additionally, we publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M) to accelerate future research in LLM safety. Our code, model, and data are publicly available at https://github.com/q-rz/saffron, and our project homepage is at https://q-rz.github.io/p/saffron.
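To make the idea of sharing a cache across sequences during tree search more concrete, the following is a minimal sketch of a generic prefix trie keyed by token IDs. It is not the authors' implementation: the names `TrieNode`, `PrefixCache`, `extend`, and `demo` are hypothetical, and the `kv` string is only a stand-in for the key and value tensors a real model would store at each position. The sketch shows only that candidate sequences with a common prefix can reuse the entries already stored for that prefix.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TrieNode:
    """One token position; `kv` stands in for that position's cached keys/values."""
    kv: Optional[str] = None
    children: Dict[int, "TrieNode"] = field(default_factory=dict)


class PrefixCache:
    """Hypothetical trie keyed by token IDs; sequences sharing a prefix share nodes."""

    def __init__(self) -> None:
        self.root = TrieNode()
        self.computed = 0  # positions whose cache entry had to be created

    def extend(self, tokens: List[int]) -> int:
        """Insert a sequence and return how many leading positions were already cached."""
        node = self.root
        reused = 0
        for depth, tok in enumerate(tokens):
            child = node.children.get(tok)
            if child is None:
                # Placeholder for running the model on this position
                # and storing the resulting keys/values.
                child = TrieNode(kv=f"kv@{depth}:{tok}")
                node.children[tok] = child
                self.computed += 1
            else:
                reused += 1
            node = child
        return reused


def demo() -> dict:
    cache = PrefixCache()
    # Three candidate sequences, as a tree search might produce from one prompt.
    beams = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6, 7]]
    reused = [cache.extend(b) for b in beams]
    total = sum(len(b) for b in beams)
    return {"reused": reused, "computed": cache.computed, "total": total}


if __name__ == "__main__":
    print(demo())
```

In this toy run the three sequences contain 12 token positions in total, but only 7 entries are created, because the second and third sequences reuse the prefixes they share with earlier ones. How the released code lays out and evicts its cache is not described in the abstract, so those details are left out here.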

