PromptBench: A Unified Library for Evaluation of Large Language Models

Abstract

Evaluating large language models (LLMs) is crucial for assessing their performance and mitigating potential security risks. In this paper, we introduce PromptBench, a unified library for evaluating LLMs. It consists of several key components that researchers can easily use and extend: prompt construction, prompt engineering, dataset and model loading, adversarial prompt attacks, dynamic evaluation protocols, and analysis tools. PromptBench is designed as an open, general, and flexible codebase for research purposes, one that can facilitate original research in creating new benchmarks, deploying downstream applications, and designing new evaluation protocols. The code is available at https://github.com/microsoft/promptbench and will be continuously supported.
