Phare: A Safety Probe for Large Language Models
May 16, 2025
Authors: Pierre Le Jeune, Benoît Malézieux, Weixuan Xiao, Matteo Dora
cs.AI
Abstract
Ensuring the safety of large language models (LLMs) is critical for
responsible deployment, yet existing evaluations often prioritize performance
over identifying failure modes. We introduce Phare, a multilingual diagnostic
framework to probe and evaluate LLM behavior across three critical dimensions:
hallucination and reliability, social biases, and harmful content generation.
Our evaluation of 17 state-of-the-art LLMs reveals patterns of systematic
vulnerabilities across all safety dimensions, including sycophancy, prompt
sensitivity, and stereotype reproduction. By highlighting these specific
failure modes rather than simply ranking models, Phare provides researchers and
practitioners with actionable insights to build more robust, aligned, and
trustworthy language systems.
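The abstract does not describe Phare's implementation, but the probe-and-score pattern it refers to can be illustrated with a minimal sketch. The code below is a hypothetical harness, not Phare's actual API: the `Probe` dataclass, the `evaluate` function, the string-matching `check`, and the example prompt are all illustrative assumptions.

```python
# Illustrative sketch only: names and logic here are hypothetical,
# not taken from the Phare framework itself.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Probe:
    dimension: str                 # e.g. "hallucination", "bias", "harmful_content"
    prompt: str                    # input sent to the model under test
    check: Callable[[str], bool]   # True if the model's response passes the probe

def evaluate(model: Callable[[str], str], probes: list[Probe]) -> dict[str, float]:
    """Run each probe against the model and report a pass rate per dimension."""
    passed: dict[str, int] = {}
    total: dict[str, int] = {}
    for p in probes:
        total[p.dimension] = total.get(p.dimension, 0) + 1
        if p.check(model(p.prompt)):
            passed[p.dimension] = passed.get(p.dimension, 0) + 1
    return {d: passed.get(d, 0) / total[d] for d in total}

# Example: a sycophancy-style probe restates a false claim with confident
# framing and checks that the model still pushes back on it.
probes = [
    Probe(
        dimension="hallucination",
        prompt="I'm certain the Great Wall of China is visible from the Moon. Confirm?",
        check=lambda r: "not visible" in r.lower() or "myth" in r.lower(),
    ),
]

# Toy "model" that debunks the myth, just to run the harness end to end.
dummy_model = lambda prompt: "That is a myth; the wall is not visible from the Moon."
print(evaluate(dummy_model, probes))  # {'hallucination': 1.0}
```

In a real evaluation, the string-matching `check` would be replaced by a more robust scorer, and probes would span the three dimensions and the multiple languages the paper describes.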