From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

Abstract

The rapid evolution of language models has necessitated the development of more challenging benchmarks. Current static benchmarks often struggle to consistently distinguish between the capabilities of different models and fail to align with real-world user preferences. Live crowdsourced platforms like Chatbot Arena, by contrast, collect a wide range of natural prompts and user feedback. However, these prompts vary in sophistication, and the feedback cannot be applied offline to new models. To ensure that benchmarks keep pace with LLM development, we address how to evaluate benchmarks on two properties: their ability to confidently separate models and their alignment with human preference. Under these principles, we developed BenchBuilder, a living benchmark that filters high-quality prompts from live data sources to enable offline evaluation on fresh, challenging prompts. BenchBuilder identifies seven indicators of a high-quality prompt, such as the requirement for domain knowledge, and uses an LLM annotator to select a high-quality subset of prompts from various topic clusters. The evaluation process employs an LLM judge, so the benchmark is fully automated, high-quality, and continuously updated. We apply BenchBuilder to prompts from Chatbot Arena to create Arena-Hard-Auto v0.1: 500 challenging user prompts from a wide range of tasks. Arena-Hard-Auto v0.1 offers 3x tighter confidence intervals than MT-Bench and achieves a state-of-the-art 89.1% agreement with human preference rankings, all at a cost of only $25 and without human labelers. The BenchBuilder pipeline enhances evaluation benchmarks and gives developers a valuable tool for extracting high-quality benchmarks from extensive data with minimal effort.
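The two ideas in the abstract, filtering prompts by quality indicators and checking agreement with human preference rankings, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `AnnotatedPrompt` type, the `min_score` threshold, the cluster labels, and the pairwise definition of agreement are all assumptions made for illustration. The abstract names only one of the seven indicators (domain knowledge), so the flags are left unnamed here.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations

# The abstract mentions seven indicators; only "domain knowledge" is named there.
NUM_INDICATORS = 7


@dataclass
class AnnotatedPrompt:
    """Hypothetical record of one prompt after an LLM annotator has labeled it."""
    text: str
    cluster: str            # hypothetical topic-cluster label
    indicators: list[bool]  # one flag per indicator (assumed annotator output)

    @property
    def score(self) -> int:
        # Number of indicators the prompt satisfies.
        return sum(self.indicators)


def select_high_quality(prompts, min_score=6):
    """Keep prompts that satisfy at least `min_score` indicators, grouped by cluster.

    `min_score` is an illustrative threshold, not a value taken from the source.
    """
    selected = defaultdict(list)
    for prompt in prompts:
        if len(prompt.indicators) != NUM_INDICATORS:
            raise ValueError(f"expected {NUM_INDICATORS} indicator flags")
        if prompt.score >= min_score:
            selected[prompt.cluster].append(prompt)
    return dict(selected)


def pairwise_agreement(benchmark_scores, human_scores):
    """Fraction of model pairs that both score tables order the same way.

    A simplified stand-in for "agreement with human preference rankings";
    the metric behind the reported figure may be defined differently.
    """
    models = sorted(set(benchmark_scores) & set(human_scores))
    pairs = list(combinations(models, 2))
    if not pairs:
        raise ValueError("need at least two models present in both tables")
    same_order = sum(
        (benchmark_scores[a] - benchmark_scores[b])
        * (human_scores[a] - human_scores[b]) > 0
        for a, b in pairs
    )
    return same_order / len(pairs)
```

In this sketch a prompt survives only if it meets most of the indicators, and the surviving prompts stay grouped by topic cluster so that coverage across tasks can be inspected. The agreement function counts tied pairs as disagreements, which is a deliberately conservative choice.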
