SafePyramid: A Hierarchical Benchmark for In-Context Policy Guardrailing

Abstract

In real-world applications, guardrails are often expected to identify unsafe user-model interactions according to application-specific safety policies, rather than relying on predefined risk taxonomies. In this work, we study this setting under the paradigm of in-context policy guardrailing, where guardrails predict safety violations based on policy specifications provided in context. To systematically evaluate this capability, we introduce SafePyramid, a safety benchmark comprising 1,000 multi-turn conversations across 10 domains and 3,000 corresponding application-specific policies, which together contain 61,699 distinct natural-language rules. SafePyramid organizes the evaluation into three difficulty levels: L0 evaluates understanding of individual rules, L1 evaluates reasoning over rule dependencies, and L2 evaluates adaptation to fully novel policy frameworks defined in context. To ensure benchmark quality, we construct and validate the benchmark with a rigorous multi-stage pipeline. Using SafePyramid, we evaluate 10 frontier LLMs and 5 policy-configurable guardrails and find that in-context policy guardrailing remains highly challenging: even the best-performing model, GPT-5.5, exactly identifies the full set of violated rules in only 54.0%, 35.3%, and 12.9% of cases on L0, L1, and L2, respectively. These results highlight the limitations of current guardrails and call for stronger in-context policy guardrails that can reliably execute policies, resolve rule dependencies, and adapt to novel policy frameworks.
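
The headline numbers above count a case as correct only when the full set of violated rules is identified exactly. A minimal sketch of such an exact set-match rate follows, assuming each case is represented as a set of rule identifiers; the function name `exact_set_match_rate` and the rule IDs are hypothetical and are not taken from the benchmark's own tooling.

```python
from typing import Iterable, Sequence


def exact_set_match_rate(
    predicted: Sequence[Iterable[str]],
    gold: Sequence[Iterable[str]],
) -> float:
    """Fraction of cases whose predicted violated-rule set equals the gold set."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of cases")
    if not gold:
        raise ValueError("at least one case is required")
    # A case counts only if the two sets are identical: no missed rule, no extra rule.
    hits = sum(set(p) == set(g) for p, g in zip(predicted, gold))
    return hits / len(gold)


# Hypothetical rule identifiers, for illustration only.
gold_sets = [{"R1", "R4"}, set(), {"R2"}, {"R3", "R5"}]
pred_sets = [{"R1", "R4"}, set(), {"R2", "R7"}, {"R3"}]

rate = exact_set_match_rate(pred_sets, gold_sets)
print(f"exact-match rate: {rate:.2f}")
```

Under this assumed definition, one spurious rule (third case) or one missed rule (fourth case) makes the whole case incorrect, which makes the metric far stricter than per-rule accuracy.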