Prompt-to-Leaderboard

Abstract

Large language model (LLM) evaluations typically rely on aggregated metrics such as accuracy or human preference, averaged across users and prompts. This averaging obscures user- and prompt-specific variations in model performance. To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt. The core idea is to train an LLM that takes a natural-language prompt as input and outputs a vector of Bradley-Terry coefficients, which are then used to predict the human preference vote. The resulting prompt-dependent leaderboards allow for unsupervised task-specific evaluation, optimal routing of queries to models, personalization, and automated evaluation of model strengths and weaknesses. Data from Chatbot Arena suggest that P2L captures the nuanced landscape of language model performance better than the averaged leaderboard. Furthermore, our findings suggest that P2L's ability to produce prompt-specific evaluations follows a power-law scaling similar to that observed in LLMs themselves. In January 2025, the router we trained based on this methodology achieved the #1 spot on the Chatbot Arena leaderboard. Our code is available on GitHub: https://github.com/lmarena/p2l.
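To make the core idea concrete, the following minimal sketch shows how a vector of prompt-specific Bradley-Terry coefficients could be turned into a pairwise preference prediction, a leaderboard, and a routing decision. It assumes the standard Bradley-Terry form, in which the probability that one model is preferred over another is the logistic function of the difference of their coefficients. The coefficient values, the model names, and the helper names (`bt_win_probability`, `leaderboard`, `route`) are hypothetical and are not taken from the P2L code base; in P2L the coefficients would be produced by the trained LLM for each prompt.

```python
import math


def bt_win_probability(theta_a: float, theta_b: float) -> float:
    """Probability that model A is preferred over model B.

    Assumes the standard Bradley-Terry form: sigmoid(theta_a - theta_b).
    """
    return 1.0 / (1.0 + math.exp(-(theta_a - theta_b)))


def leaderboard(coeffs: dict[str, float]) -> list[tuple[str, float]]:
    """Rank models by coefficient, highest first."""
    return sorted(coeffs.items(), key=lambda item: item[1], reverse=True)


def route(coeffs: dict[str, float]) -> str:
    """Pick the model with the largest coefficient for this prompt.

    This is a simple argmax rule used for illustration only; it ignores
    cost or latency constraints that a production router might apply.
    """
    return max(coeffs, key=coeffs.get)


# Hypothetical coefficients for one prompt (illustrative values only).
coeffs_for_prompt = {"model-a": 1.2, "model-b": 0.4, "model-c": -0.3}

p_a_beats_b = bt_win_probability(
    coeffs_for_prompt["model-a"], coeffs_for_prompt["model-b"]
)
ranking = leaderboard(coeffs_for_prompt)
best_model = route(coeffs_for_prompt)
```

Because the coefficients depend on the prompt, a different prompt would in general yield a different ranking and a different routing choice, which is what distinguishes this from a single averaged leaderboard.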