Granite Guardian

Abstract

We introduce the Granite Guardian models, a suite of safeguards designed to detect risks in prompts and responses, enabling safe and responsible use in combination with any large language model (LLM). These models offer comprehensive coverage across multiple risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related risks such as context relevance, groundedness, and answer relevance for retrieval-augmented generation (RAG). Trained on a unique dataset combining human annotations from diverse sources and synthetic data, Granite Guardian models address risks typically overlooked by traditional risk detection models, such as jailbreaks and RAG-specific issues. With AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space. Released as open source, Granite Guardian aims to promote responsible AI development across the community. https://github.com/ibm-granite/granite-guardian
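
For readers unfamiliar with the metric, the AUC scores quoted in the abstract can be read as the probability that a detector assigns a higher risk score to a randomly chosen risky example than to a randomly chosen safe one. The following is a minimal sketch of that definition, not the evaluation code behind the reported numbers; the function name `roc_auc` and the toy labels and scores are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) form of the area under the ROC curve.

    labels: 1 for a risky example, 0 for a safe one.
    scores: detector risk scores, higher meaning "more likely risky".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # risky example ranked above safe example
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data, not taken from any benchmark.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 8 of 9 risky/safe pairs ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a value such as 0.871 means roughly 87% of risky/safe pairs are ordered correctly, independent of any particular decision threshold.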