Forbidden Science: Dual-Use AI Challenge Benchmark and Scientific Refusal Tests

Abstract

Developing robust safety benchmarks for large language models requires open, reproducible datasets that can measure both appropriate refusal of harmful content and potential over-restriction of legitimate scientific discourse. We present an open-source dataset and testing framework for evaluating LLM safety mechanisms, focused mainly on controlled-substance queries, and analyze four major models' responses to systematically varied prompts. Our results reveal distinct safety profiles: Claude-3.5-sonnet took the most conservative approach, with 73% refusals and 27% allowances, while Mistral attempted to answer 100% of queries. GPT-3.5-turbo showed moderate restriction with 10% refusals and 90% allowances, and Grok-2 registered 20% refusals and 80% allowances. Testing prompt variation strategies revealed decreasing response consistency, from 85% with single prompts to 65% with five variations. This publicly available benchmark enables systematic evaluation of the critical balance between necessary safety restrictions and potential over-censorship of legitimate scientific inquiry, and it provides a foundation for measuring progress in AI safety implementation. Chain-of-thought analysis reveals potential vulnerabilities in safety mechanisms, highlighting how difficult it is to implement robust safeguards without unduly restricting desirable and valid scientific discourse.
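To make the two reported quantities concrete, the following minimal sketch shows one way a refusal rate and a response-consistency score could be computed from labeled model responses. The label names ("refuse", "allow"), the function names, and the definition of consistency (the share of base prompts whose variations all receive the same label) are assumptions for illustration; the abstract does not specify how the benchmark computes these values, and the data below are toy values, not benchmark results.

```python
def refusal_rate(labels):
    """Percentage of responses labeled 'refuse' (label name is an assumption)."""
    if not labels:
        raise ValueError("labels must not be empty")
    refused = sum(1 for label in labels if label == "refuse")
    return 100.0 * refused / len(labels)


def consistency(labels_per_prompt):
    """Percentage of base prompts whose variations all received the same label.

    This is one plausible definition, assumed here for illustration only.
    """
    if not labels_per_prompt:
        raise ValueError("labels_per_prompt must not be empty")
    consistent = sum(
        1 for variants in labels_per_prompt if len(set(variants)) == 1
    )
    return 100.0 * consistent / len(labels_per_prompt)


# Toy data (hypothetical): one label per response.
toy_labels = ["refuse", "allow", "allow", "refuse", "allow"]

# Toy data (hypothetical): one inner list per base prompt, one label per variation.
toy_variations = [
    ["refuse", "refuse"],
    ["allow", "refuse"],
    ["allow", "allow"],
    ["allow", "allow"],
]

toy_refusal = refusal_rate(toy_labels)
toy_consistency = consistency(toy_variations)
```

Under this assumed definition, adding more variations per prompt can only keep the consistency score equal or lower it, which matches the direction of the trend reported above.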
