MetaSC: Test-Time Safety Specification Optimization for Language Models
February 11, 2025
Author: Víctor Gallego
cs.AI
Abstract
We propose a novel dynamic safety framework that optimizes language model
(LM) safety reasoning at inference time without modifying model weights.
Building on recent advances in self-critique methods, our approach leverages a
meta-critique mechanism that iteratively updates safety prompts (termed
specifications) to drive the critique and revision process adaptively. This
test-time optimization improves performance not only against adversarial
jailbreak requests but also on diverse general safety-related tasks, such as
avoiding moral harm and pursuing honest responses. Our empirical evaluations
across several language models demonstrate that dynamically optimized safety
prompts yield significantly higher safety scores compared to fixed system
prompts and static self-critique defenses. Code to be released at
https://github.com/vicgalle/meta-self-critique.git.
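The loop described in the abstract (meta-critique updates the specification, which then steers critique and revision of the response) can be sketched as follows. This is an illustrative reading of the abstract only, not the authors' released implementation; `generate` is a hypothetical stand-in for a language-model call, stubbed here so the sketch runs.

```python
def generate(prompt: str) -> str:
    # Stub LM call: echoes a canned response so the sketch is runnable.
    # In practice this would query an actual language model.
    return f"[response to: {prompt[:40]}...]"

def meta_sc(request: str, spec: str, n_iters: int = 2):
    """Test-time safety optimization: iteratively refine the safety
    specification, then critique and revise the response under it.
    No model weights are modified; only the prompt text evolves."""
    # Draft an initial response conditioned on the current specification.
    response = generate(f"{spec}\n\nUser: {request}")
    for _ in range(n_iters):
        # Meta-critique step: ask the model to improve the spec itself,
        # given how the current response turned out.
        spec = generate(
            "Improve this safety specification so that critiques based on "
            f"it are more effective:\nSpec: {spec}\nResponse: {response}"
        )
        # Standard self-critique step, driven by the updated specification.
        critique = generate(f"Critique per spec:\n{spec}\nResponse: {response}")
        response = generate(f"Revise per critique:\n{critique}\n{response}")
    return response, spec
```

The key difference from static self-critique is the extra meta step: the specification is itself an optimization variable at inference time, rather than a fixed system prompt.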