Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance

Abstract

Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into large language models (LLMs). However, recent studies have exposed these methods' susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning capabilities but remains unexplored in the context of safety assurance. Addressing this gap, our work pioneers inference scaling for robust and effective LLM safety against emerging threats. We reveal that conventional inference scaling techniques, despite their success in reasoning tasks, perform poorly in safety contexts, even falling short of basic approaches like Best-of-N Sampling. We attribute this inefficiency to a newly identified challenge, the exploration-efficiency dilemma, arising from the high computational overhead associated with frequent process reward model (PRM) evaluations. To overcome this dilemma, we propose SAFFRON, a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. To operationalize this paradigm, we further propose: (i) a partial supervision training objective for the MRM, (ii) a conservative exploration constraint to prevent out-of-distribution explorations, and (iii) a trie-based key-value caching strategy that facilitates cache sharing across sequences during tree search. Extensive experiments validate the effectiveness of our method. Additionally, we publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M) to accelerate future research in LLM safety. Our code, model, and data are publicly available at https://github.com/q-rz/saffron, and our project homepage is at https://q-rz.github.io/p/saffron.
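
The idea behind point (iii), sharing cached entries across sequences that have a common prefix, can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names TrieNode, TrieKVCache, and _compute_kv are hypothetical, and the stored "key-value entry" is a placeholder tuple rather than real attention tensors. It only shows how a trie lets sibling branches of a tree search reuse the per-token work already done for their shared prefix.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class TrieNode:
    """One token position; `kv` stands in for that token's key-value entry."""
    kv: object = None
    children: Dict[int, "TrieNode"] = field(default_factory=dict)


class TrieKVCache:
    """Toy prefix-sharing cache keyed by token-ID sequences (illustrative only)."""

    def __init__(self) -> None:
        self.root = TrieNode()
        self.computed = 0  # number of per-token entries actually computed

    def _compute_kv(self, prefix: Sequence[int]) -> object:
        # Hypothetical stand-in for a real attention key-value computation.
        self.computed += 1
        return tuple(prefix)

    def lookup_or_insert(self, tokens: Sequence[int]) -> List[object]:
        """Return one entry per token, computing only the uncached suffix."""
        node = self.root
        entries = []
        for i, tok in enumerate(tokens):
            child = node.children.get(tok)
            if child is None:
                # Cache miss: compute the entry once and attach it to the trie.
                child = TrieNode(kv=self._compute_kv(tokens[: i + 1]))
                node.children[tok] = child
            entries.append(child.kv)
            node = child
        return entries


# Three candidate sequences from a tree search that share prefixes.
cache = TrieKVCache()
cache.lookup_or_insert([1, 2, 3, 4])
cache.lookup_or_insert([1, 2, 3, 5])
cache.lookup_or_insert([1, 2, 6])
print(cache.computed)
```

In this toy run the three sequences contain 11 tokens in total, but only 6 entries are computed because the shared prefixes are stored once. How the actual system lays out and evicts cached tensors is not described in the abstract.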
