Prompt-to-Leaderboard
February 20, 2025
作者: Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios N. Angelopoulos, Ion Stoica
cs.AI
Abstract
Large language model (LLM) evaluations typically rely on aggregated metrics
like accuracy or human preference, averaging across users and prompts. This
averaging obscures user- and prompt-specific variations in model performance.
To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces
leaderboards specific to a prompt. The core idea is to train an LLM that takes
natural language prompts as input and outputs a vector of Bradley-Terry
coefficients, which are then used to predict the human preference vote. The
resulting prompt-dependent leaderboards allow for unsupervised task-specific
evaluation, optimal routing of queries to models, personalization, and
automated evaluation of model strengths and weaknesses. Data from Chatbot Arena
suggest that P2L better captures the nuanced landscape of language model
performance than the averaged leaderboard. Furthermore, our findings suggest
that P2L's ability to produce prompt-specific evaluations follows a power law
scaling similar to that observed in LLMs themselves. In January 2025, the
router we trained based on this methodology achieved the #1 spot in the
Chatbot Arena leaderboard. Our code is available at this GitHub link:
https://github.com/lmarena/p2l.
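For intuition, here is a minimal Python sketch, not the paper's implementation, of how a prompt-conditioned vector of Bradley-Terry coefficients yields a win-probability prediction, a prompt-specific leaderboard, and a routing decision. The model names, the stubbed coefficient function, and its output values are all hypothetical; in P2L the coefficients would come from the trained LLM head.

```python
import math

MODELS = ["model-a", "model-b", "model-c"]  # hypothetical model names


def p2l_coefficients(prompt: str) -> list[float]:
    """Stub for the P2L model: maps a prompt to Bradley-Terry coefficients.

    In the real system this would be the output of a trained LLM; the
    values below are illustrative only.
    """
    return [0.8, 0.1, -0.4]


def win_probability(beta_i: float, beta_j: float) -> float:
    """Bradley-Terry model: P(model i is preferred over model j)."""
    return 1.0 / (1.0 + math.exp(-(beta_i - beta_j)))


prompt = "Write a Python function that parses RFC 3339 timestamps."
beta = p2l_coefficients(prompt)

# Prompt-specific leaderboard: rank models by their coefficient on this prompt.
leaderboard = sorted(zip(MODELS, beta), key=lambda kv: kv[1], reverse=True)
for name, b in leaderboard:
    print(f"{name}: beta={b:+.2f}")

# Pairwise preference prediction for the top two models on this prompt.
(top, b_top), (second, b_second) = leaderboard[0], leaderboard[1]
print(f"P({top} beats {second}) = {win_probability(b_top, b_second):.3f}")

# Routing: send the query to the model with the highest coefficient.
print(f"Route to: {leaderboard[0][0]}")
```

Under this reading, the averaged leaderboard corresponds to a single coefficient vector shared across all prompts, whereas P2L lets the vector, and hence the ranking and the routing choice, vary with each prompt.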