LiveBench: A Challenging, Contamination-Free LLM Benchmark

Abstract

Test set contamination, wherein test data from a benchmark ends up in a newer model's training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be immune to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions based on recently released math competitions, arXiv papers, news articles, and datasets, as well as harder, contamination-free versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B in size. LiveBench is difficult, with top models achieving below 65% accuracy. We release all questions, code, and model answers. Questions will be added and updated on a monthly basis, and we will release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.
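
The idea of scoring against objective ground-truth values, rather than relying on a human or LLM judge, can be illustrated with a minimal sketch. This is not LiveBench's actual scoring code: the function names, the whitespace and case normalization rule, and the sample records are all hypothetical, and a real task would likely need a task-specific comparison.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    # Hypothetical normalization: trim, lowercase, collapse inner whitespace.
    return " ".join(answer.strip().lower().split())


def score_answer(model_answer: str, ground_truth: str) -> float:
    # Exact match against the ground truth; no judge model is involved.
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0


def category_accuracy(records):
    # records: iterable of (category, model_answer, ground_truth) tuples.
    scores = defaultdict(list)
    for category, model_answer, ground_truth in records:
        scores[category].append(score_answer(model_answer, ground_truth))
    # Mean score per category.
    return {category: sum(vals) / len(vals) for category, vals in scores.items()}


# Illustrative (made-up) records, not real benchmark data.
sample_records = [
    ("math", " 42 ", "42"),
    ("math", "41", "42"),
    ("language", "Paris", "paris"),
]
result = category_accuracy(sample_records)
```

Because each score is a deterministic function of the answer and the ground truth, the same model output always receives the same score, which is the property that judge-based evaluation lacks.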
