Forbidden Science: Dual-Use AI Challenge Benchmark and Scientific Refusal Tests
February 8, 2025
Authors: David Noever, Forrest McKee
cs.AI
Abstract
The development of robust safety benchmarks for large language models
requires open, reproducible datasets that can measure both appropriate refusal
of harmful content and potential over-restriction of legitimate scientific
discourse. We present an open-source dataset and testing framework for
evaluating LLM safety mechanisms, primarily on controlled-substance queries,
analyzing four major models' responses to systematically varied prompts. Our
results reveal distinct safety profiles: Claude-3.5-sonnet demonstrated the
most conservative approach with 73% refusals and 27% allowances, while Mistral
attempted to answer 100% of queries. GPT-3.5-turbo showed moderate restriction
with 10% refusals and 90% allowances, and Grok-2 registered 20% refusals and
80% allowances. Testing prompt variation strategies revealed decreasing
response consistency, from 85% with single prompts to 65% with five variations.
This publicly available benchmark enables systematic evaluation of the critical
balance between necessary safety restrictions and potential over-censorship of
legitimate scientific inquiry, while providing a foundation for measuring
progress in AI safety implementation. Chain-of-thought analysis reveals
potential vulnerabilities in safety mechanisms, highlighting the complexity of
implementing robust safeguards without unduly restricting desirable and valid
scientific discourse.
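
As a rough sketch of how the reported metrics could be tallied from labeled model outputs, the refusal/allowance rates and the cross-variant response consistency reduce to simple counting. This is an illustrative Python example under stated assumptions (hypothetical function names and made-up data), not the authors' released testing framework:

```python
from collections import Counter

def refusal_allowance_rates(labels):
    """Percentage of 'refusal' vs. 'allowance' labels assigned to a model's responses."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        "refusal_pct": 100.0 * counts["refusal"] / total,
        "allowance_pct": 100.0 * counts["allowance"] / total,
    }

def response_consistency(labels_per_prompt):
    """Percentage of base prompts whose responses receive the same label
    across every rephrased variant (one inner list of labels per prompt)."""
    consistent = sum(1 for labels in labels_per_prompt if len(set(labels)) == 1)
    return 100.0 * consistent / len(labels_per_prompt)

# Illustrative numbers only: a Claude-3.5-sonnet-like profile (73% refusals, 27% allowances).
print(refusal_allowance_rates(["refusal"] * 73 + ["allowance"] * 27))

# Consistency tends to drop as more phrasings of the same underlying query are scored.
print(response_consistency([
    ["refusal"] * 5,                                            # stable across 5 variants
    ["refusal", "allowance", "refusal", "refusal", "refusal"],  # flips on one variant
]))
```

In this sketch the per-response labels ("refusal" or "allowance") are assumed to come from some upstream classification of each model reply; the abstract does not specify how that labeling is performed.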