PromptBench: A Unified Library for Evaluation of Large Language Models
December 13, 2023
Authors: Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, Xing Xie
cs.AI
Abstract
The evaluation of large language models (LLMs) is crucial to assess their
performance and mitigate potential security risks. In this paper, we introduce
PromptBench, a unified library to evaluate LLMs. It consists of several key
components that are easily used and extended by researchers: prompt
construction, prompt engineering, dataset and model loading, adversarial prompt
attacks, dynamic evaluation protocols, and analysis tools. PromptBench is
designed to be an open, general, and flexible codebase for research purposes
that can facilitate original research in creating new benchmarks, deploying
downstream applications, and designing new evaluation protocols. The code is
available at: https://github.com/microsoft/promptbench and will be continuously
supported.
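The components listed above fit together as a simple pipeline: construct a prompt from a template, run each dataset sample through a loaded model, and analyze the predictions. The sketch below illustrates that workflow in plain Python with a stub model; all names in it (`build_prompt`, `StubModel`, `evaluate`) are illustrative assumptions, not the actual PromptBench API, which is documented in the repository linked above.

```python
# Illustrative sketch of the evaluation workflow described in the abstract:
# prompt construction -> dataset/model loading -> inference -> analysis.
# The names here are hypothetical, NOT the real PromptBench interfaces.

def build_prompt(template: str, text: str) -> str:
    """Prompt construction: fill a template with a dataset sample."""
    return template.format(text=text)

class StubModel:
    """Stand-in for a loaded LLM; maps an obvious keyword to a label."""
    def generate(self, prompt: str) -> str:
        return "positive" if "great" in prompt else "negative"

def evaluate(model, dataset, template: str) -> float:
    """Analysis step: run every sample through the model, report accuracy."""
    correct = 0
    for text, label in dataset:
        pred = model.generate(build_prompt(template, text)).strip().lower()
        correct += pred == label
    return correct / len(dataset)

# Tiny sentiment "dataset" of (text, gold label) pairs.
dataset = [("a great movie", "positive"), ("a dull plot", "negative")]
template = "Classify the sentiment of: {text}\nAnswer:"
print(evaluate(StubModel(), dataset, template))  # -> 1.0
```

In the real library, the stub model and hand-rolled loop would be replaced by its model/dataset loaders and evaluation protocols; the point of the sketch is only the shape of the pipeline.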