MetaSC: Test-Time Safety Specification Optimization for Language Models

Abstract

We propose a novel dynamic safety framework that optimizes language model (LM) safety reasoning at inference time without modifying model weights. Building on recent advances in self-critique methods, our approach uses a meta-critique mechanism that iteratively updates safety prompts, termed specifications, to drive the critique and revision process adaptively. This test-time optimization improves performance not only against adversarial jailbreak requests but also on diverse general safety-related tasks, such as avoiding moral harm or pursuing honest responses. Our empirical evaluations across several language models demonstrate that dynamically optimized safety prompts yield significantly higher safety scores than fixed system prompts and static self-critique defenses. Code will be released at https://github.com/vicgalle/meta-self-critique.git.
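
The loop described in the abstract might be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the prompt wording, the function names (`self_critique`, `meta_critique`, `run_metasc`), and the choice to update the specification once per request are all hypothetical, and `stub_lm` is a deterministic stand-in for a real model. Model weights are never touched; only the specification text changes.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a language model is any function from prompt to text.
LM = Callable[[str], str]


def self_critique(lm: LM, spec: str, request: str) -> Tuple[str, str, str]:
    """One critique-and-revise pass guided by the current specification."""
    draft = lm(f"Request: {request}\nRespond.")
    critique = lm(
        f"Specification: {spec}\nResponse: {draft}\nCritique the response."
    )
    revised = lm(
        f"Specification: {spec}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response."
    )
    return draft, critique, revised


def meta_critique(lm: LM, spec: str, critique: str, revised: str) -> str:
    """Ask the model for an improved specification (assumed prompt wording)."""
    return lm(
        f"Current specification: {spec}\nCritique: {critique}\n"
        f"Revised response: {revised}\nPropose an improved specification."
    )


def run_metasc(
    lm: LM, initial_spec: str, requests: List[str]
) -> Tuple[str, List[str]]:
    """Process requests in order, updating the specification after each one."""
    spec = initial_spec
    outputs = []
    for request in requests:
        _, critique, revised = self_critique(lm, spec, request)
        outputs.append(revised)
        # Assumption: the specification is refined once per request.
        spec = meta_critique(lm, spec, critique, revised)
    return spec, outputs


def stub_lm(prompt: str) -> str:
    """Deterministic stand-in for a real model, for illustration only."""
    if prompt.endswith("Propose an improved specification."):
        first_line = prompt.split("\n")[0]
        # Mark each update by appending "+" to the current specification.
        return first_line[len("Current specification: "):] + "+"
    if prompt.endswith("Critique the response."):
        return "critique"
    if prompt.endswith("Rewrite the response."):
        return "revised"
    return "draft"
```

In practice `stub_lm` would be replaced by a call to an actual model; the static self-critique baseline mentioned in the abstract corresponds to running `self_critique` alone with a fixed `spec`.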
