Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation

Abstract

In arena-style evaluation of large language models (LLMs), two LLMs respond to a user query, and the user either chooses the winning response or deems the "battle" a draw, after which the ratings of both models are adjusted. The prevailing approach to modeling these rating dynamics is to view battles as two-player game matches, as in chess, and apply the Elo rating system and its derivatives. In this paper, we critically examine this paradigm. Specifically, we question whether a draw genuinely means that the two models are equal and hence whether their ratings should be equalized. Instead, we conjecture that draws are more indicative of query difficulty: if the query is too easy, then both models are more likely to succeed equally. On three real-world arena datasets, we show that ignoring rating updates for draws yields a 1-3% relative increase in battle outcome prediction accuracy (which includes draws) for all four rating systems studied. Further analyses suggest that draws occur more often for queries rated as very easy and for those rated as highly objective, with risk ratios of 1.37 and 1.35, respectively. We recommend that future rating systems reconsider existing draw semantics and account for query properties in rating updates.
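To make the two quantities in the abstract concrete, here is a minimal sketch, not the paper's implementation. It assumes the classic Elo update with a hypothetical step size `K = 32` and the conventional scale of 400; the abstract does not name the four rating systems studied or their parameters. The `ignore_draws` flag and the `risk_ratio` helper are illustrative names, and the choice of comparison group for the risk ratio (for example, queries not rated as very easy) is likewise an assumption.

```python
from typing import Tuple

K = 32.0       # hypothetical step size; not specified in the abstract
SCALE = 400.0  # conventional Elo scale


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of model A against model B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / SCALE))


def elo_update(
    r_a: float, r_b: float, outcome: float, ignore_draws: bool = False
) -> Tuple[float, float]:
    """Return updated ratings after one battle.

    outcome is 1.0 if A wins, 0.0 if B wins, and 0.5 for a draw.
    With ignore_draws=True, a draw leaves both ratings unchanged.
    """
    if ignore_draws and outcome == 0.5:
        return r_a, r_b
    delta = K * (outcome - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def risk_ratio(
    draws_exposed: int, total_exposed: int,
    draws_unexposed: int, total_unexposed: int,
) -> float:
    """Draw rate in the exposed group divided by the draw rate in the rest."""
    return (draws_exposed / total_exposed) / (draws_unexposed / total_unexposed)


if __name__ == "__main__":
    # A draw between a 1100-rated and a 1000-rated model.
    print(elo_update(1100.0, 1000.0, 0.5))                     # ratings move closer
    print(elo_update(1100.0, 1000.0, 0.5, ignore_draws=True))  # ratings unchanged
```

Under the standard update, the draw in this example moves each rating by about 4.5 points toward the other, which is the "equalizing" behavior the abstract questions. With `ignore_draws=True`, the same battle leaves both ratings untouched.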