MetaSC: Test-Time Safety Specification Optimization for Language Models
February 11, 2025
Author: Víctor Gallego
cs.AI
Abstract
We propose a novel dynamic safety framework that optimizes language model
(LM) safety reasoning at inference time without modifying model weights.
Building on recent advances in self-critique methods, our approach leverages a
meta-critique mechanism that iteratively updates safety prompts (termed
specifications) to drive the critique and revision process adaptively. This
test-time optimization improves performance not only against adversarial
jailbreak requests but also on diverse general safety-related tasks, such as
avoiding moral harm and pursuing honest responses. Our empirical evaluations
across several language models demonstrate that dynamically optimized safety
prompts yield significantly higher safety scores compared to fixed system
prompts and static self-critique defenses. Code to be released at
https://github.com/vicgalle/meta-self-critique.git.
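The loop described in the abstract (meta-critique updates the specification, which then steers critique and revision of the response) can be sketched as follows. This is an illustrative reading of the abstract only, not the authors' released implementation; `generate` is a hypothetical stand-in for a language-model call, stubbed here so the sketch runs.

```python
def generate(prompt: str) -> str:
    # Stub LM call: echoes a canned response so the sketch is runnable.
    # In practice this would query an actual language model.
    return f"[response to: {prompt[:40]}...]"

def meta_sc(request: str, spec: str, n_iters: int = 2):
    """Test-time safety optimization: iteratively refine the safety
    specification, then critique and revise the response under it.
    No model weights are modified; only the prompt text evolves."""
    # Draft an initial response conditioned on the current specification.
    response = generate(f"{spec}\n\nUser: {request}")
    for _ in range(n_iters):
        # Meta-critique step: ask the model to improve the spec itself,
        # given how the current response turned out.
        spec = generate(
            "Improve this safety specification so that critiques based on "
            f"it are more effective:\nSpec: {spec}\nResponse: {response}"
        )
        # Standard self-critique step, driven by the updated specification.
        critique = generate(f"Critique per spec:\n{spec}\nResponse: {response}")
        response = generate(f"Revise per critique:\n{critique}\n{response}")
    return response, spec
```

The key difference from static self-critique is the extra meta step: the specification is itself an optimization variable at inference time, rather than a fixed system prompt.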