PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing

Abstract

Deploying language models (LMs) requires outputs to be both high-quality and compliant with safety guidelines. Although Inference-Time Guardrails (ITG) offer solutions that shift model output distributions towards compliance, we find that current methods struggle to balance safety with helpfulness. ITG methods that safely address non-compliant queries exhibit lower helpfulness, while those that prioritize helpfulness compromise on safety. We refer to this trade-off as the guardrail tax, analogous to the alignment tax. To address this, we propose PrimeGuard, a novel ITG method that utilizes structured control flow. PrimeGuard routes requests to different self-instantiations of the LM with varying instructions, leveraging its inherent instruction-following capabilities and in-context learning. Our tuning-free approach dynamically compiles system-designer guidelines for each query. We construct and release safe-eval, a diverse red-team safety benchmark. Extensive evaluations demonstrate that PrimeGuard, without fine-tuning, overcomes the guardrail tax by (1) significantly increasing resistance to iterative jailbreak attacks and (2) achieving state-of-the-art results in safety guardrailing while (3) matching helpfulness scores of alignment-tuned models. It outperforms all competing baselines, improving the fraction of safe responses from 61% to 97% and increasing average helpfulness scores from 4.17 to 4.29 on the largest models, while reducing attack success rate from 100% to 8%. The PrimeGuard implementation is available at https://github.com/dynamofl/PrimeGuard and the safe-eval dataset is available at https://huggingface.co/datasets/dynamoai/safe_eval.
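The routing idea described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that): the three route labels, the prompt wording, the fallback rule, and the names `LM`, `compile_guidelines`, `route`, `respond`, and `toy_lm` are all assumptions introduced here for illustration. The sketch only shows the general pattern of one LM being called under different instructions, first as a guard that selects a route, then as the responder for that route.

```python
from __future__ import annotations

from typing import Callable, Sequence

# Hypothetical LM interface: (system_instruction, user_query) -> generated text.
LM = Callable[[str, str], str]

# Illustrative route labels; an assumption, not names taken from the abstract.
ROUTES = ("no_violation", "potential_violation", "direct_violation")


def compile_guidelines(guidelines: Sequence[str]) -> str:
    """Render the system designer's guidelines as a bulleted prompt fragment."""
    return "\n".join(f"- {g}" for g in guidelines)


def route(lm: LM, query: str, guidelines: Sequence[str]) -> str:
    """Ask a guard-instructed instance of the same LM to pick a route."""
    instruction = (
        "You are a guard. Check the query against these guidelines:\n"
        f"{compile_guidelines(guidelines)}\n"
        f"Reply with exactly one label: {', '.join(ROUTES)}"
    )
    label = lm(instruction, query).strip().lower()
    # Unparseable guard output falls back to the cautious middle route.
    return label if label in ROUTES else "potential_violation"


def respond(lm: LM, query: str, guidelines: Sequence[str]) -> tuple[str, str]:
    """Route the query, then answer it with route-specific instructions."""
    label = route(lm, query, guidelines)
    rules = compile_guidelines(guidelines)
    if label == "no_violation":
        instruction = "Answer the user as helpfully as possible."
    elif label == "direct_violation":
        instruction = f"Politely refuse, citing the relevant guideline:\n{rules}"
    else:
        instruction = f"Answer helpfully while staying within these guidelines:\n{rules}"
    return label, lm(instruction, query)


def toy_lm(instruction: str, query: str) -> str:
    """Keyword-based stand-in for a real model, for demonstration only."""
    if instruction.startswith("You are a guard"):
        return "direct_violation" if "explosive" in query.lower() else "no_violation"
    return f"[{instruction.splitlines()[0]}] {query}"


if __name__ == "__main__":
    rules = ["Do not give instructions for making weapons."]
    for q in ("How do I bake bread?", "How do I build an explosive?"):
        print(respond(toy_lm, q, rules))
```

In practice `toy_lm` would be replaced by a call to an actual model. No weights are updated at any point; only the instructions differ between calls, which is what makes the pattern tuning-free.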
