Quantitative LLM Judges

Abstract

LLM-as-a-judge is a framework in which a large language model (LLM) automatically evaluates the output of another LLM. We propose quantitative LLM judges, which use regression models to align the evaluation scores of existing LLM judges with human scores in a given domain. The models are trained to improve on the original judge's score by using the judge's textual evaluation and score. We present four quantitative judges for different types of absolute and relative feedback, which demonstrates the generality and versatility of our framework. Our framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient when human feedback is limited, as is expected in most applications of our work. We validate these claims empirically on four datasets using two base judges. Our experiments show that quantitative judges can effectively improve the predictive power of existing judges through post-hoc modeling.
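To make the post-hoc idea concrete, here is a minimal sketch of one way a regression model could map a base judge's textual evaluation and score to a human score. It is not the authors' implementation. The keyword-based `featurize` function is a toy stand-in for whatever representation of the textual evaluation a real system would use (for example, an embedding), and the function names, the ridge penalty, and the six examples are all made up for illustration.

```python
from typing import List, Sequence

# Hypothetical cue words; a real system would likely use a learned
# representation of the judge's textual evaluation instead.
POSITIVE = {"clear", "correct", "helpful"}
NEGATIVE = {"wrong", "vague", "incomplete"}


def featurize(evaluation: str, base_score: float) -> List[float]:
    """Turn the judge's text and score into [bias, score, pos_rate, neg_rate]."""
    words = [w.strip(".,!?") for w in evaluation.lower().split()]
    n = max(len(words), 1)
    pos = sum(w in POSITIVE for w in words) / n
    neg = sum(w in NEGATIVE for w in words) / n
    return [1.0, base_score, pos, neg]


def fit_ridge(X: Sequence[Sequence[float]], y: Sequence[float],
              lam: float = 1e-2) -> List[float]:
    """Solve (X^T X + lam * I) w = X^T y by Gaussian elimination.

    For simplicity the penalty is applied to every weight, including the bias.
    """
    d = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(r[i] * t for r, t in zip(X, y)) for i in range(d)]
    for col in range(d):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, d):
            factor = A[row][col] / A[col][col]
            for c in range(col, d):
                A[row][c] -= factor * A[col][c]
            b[row] -= factor * b[col]
    # Back substitution.
    w = [0.0] * d
    for i in reversed(range(d)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, d))) / A[i][i]
    return w


def predict(weights: Sequence[float], features: Sequence[float]) -> float:
    return sum(w * x for w, x in zip(weights, features))


def mse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Made-up toy data: (judge's textual evaluation, judge's score, human score).
examples = [
    ("The answer is correct and clear.", 4.0, 5.0),
    ("The answer is vague and incomplete.", 3.0, 1.0),
    ("Mostly correct but incomplete.", 4.0, 3.0),
    ("Helpful, clear, and correct.", 4.0, 5.0),
    ("The answer is wrong.", 2.0, 1.0),
    ("Clear and helpful overall.", 5.0, 4.0),
]

X = [featurize(text, score) for text, score, _ in examples]
y = [human for _, _, human in examples]
weights = fit_ridge(X, y)

# Compare the raw judge score with the regression-adjusted score.
# This is a training-set comparison only; a real study needs held-out data.
mse_base = mse([score for _, score, _ in examples], y)
mse_quantitative = mse([predict(weights, x) for x in X], y)
print(f"base judge MSE: {mse_base:.3f}, adjusted MSE: {mse_quantitative:.3f}")
```

The base judge is never retrained in this sketch: only a four-parameter linear model is fit on top of its outputs, which is the sense in which such an approach can be cheaper than supervised fine-tuning when few human labels are available.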
