Phare:大型語言模型的安全探針
Phare: A Safety Probe for Large Language Models
May 16, 2025
作者: Pierre Le Jeune, Benoît Malézieux, Weixuan Xiao, Matteo Dora
cs.AI
摘要
確保大型語言模型(LLMs)的安全性對於負責任的部署至關重要,然而現有的評估往往優先考慮性能而非識別故障模式。我們引入了Phare,這是一個多語言診斷框架,用於探測和評估LLM在三個關鍵維度上的行為:幻覺與可靠性、社會偏見以及有害內容生成。我們對17個最先進的LLM進行了評估,揭示了所有安全維度上系統性漏洞的模式,包括諂媚、提示敏感性和刻板印象的再現。通過強調這些具體的故障模式而非僅僅對模型進行排名,Phare為研究人員和實踐者提供了可操作的見解,以構建更為穩健、對齊且值得信賴的語言系統。
English
Ensuring the safety of large language models (LLMs) is critical for
responsible deployment, yet existing evaluations often prioritize performance
over identifying failure modes. We introduce Phare, a multilingual diagnostic
framework to probe and evaluate LLM behavior across three critical dimensions:
hallucination and reliability, social biases, and harmful content generation.
Our evaluation of 17 state-of-the-art LLMs reveals patterns of systematic
vulnerabilities across all safety dimensions, including sycophancy, prompt
sensitivity, and stereotype reproduction. By highlighting these specific
failure modes rather than simply ranking models, Phare provides researchers and
practitioners with actionable insights to build more robust, aligned, and
trustworthy language systems.Summary
AI-Generated Summary