ChatPaper.aiChatPaper

Phare:大型语言模型的安全探测系统

Phare: A Safety Probe for Large Language Models

May 16, 2025
作者: Pierre Le Jeune, Benoît Malézieux, Weixuan Xiao, Matteo Dora
cs.AI

摘要

确保大型语言模型(LLMs)的安全性对于负责任地部署至关重要,然而现有的评估往往优先考虑性能,而非识别故障模式。我们引入了Phare,一个多语言诊断框架,用于探测和评估LLM在三个关键维度上的行为:幻觉与可靠性、社会偏见以及有害内容生成。通过对17个最先进的LLM进行评估,我们揭示了所有安全维度上系统性的脆弱性模式,包括奉承、提示敏感性和刻板印象再现。通过突出这些具体的故障模式而非简单地排名模型,Phare为研究人员和实践者提供了可操作的见解,以构建更健壮、对齐且值得信赖的语言系统。
English
Ensuring the safety of large language models (LLMs) is critical for responsible deployment, yet existing evaluations often prioritize performance over identifying failure modes. We introduce Phare, a multilingual diagnostic framework to probe and evaluate LLM behavior across three critical dimensions: hallucination and reliability, social biases, and harmful content generation. Our evaluation of 17 state-of-the-art LLMs reveals patterns of systematic vulnerabilities across all safety dimensions, including sycophancy, prompt sensitivity, and stereotype reproduction. By highlighting these specific failure modes rather than simply ranking models, Phare provides researchers and practitioners with actionable insights to build more robust, aligned, and trustworthy language systems.

Summary

AI-Generated Summary

PDF41May 21, 2025