Neither Valid nor Reliable? Investigating the Use of LLMs as Judges

Abstract

Evaluating natural language generation (NLG) systems remains a core challenge in natural language processing (NLP), further complicated by the rise of large language models (LLMs) that aim to be general-purpose. Recently, large language models as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This position paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators. Drawing on measurement theory from the social sciences, we identify and critically assess four core assumptions underlying the use of LLJs: their ability to act as proxies for human judgment, their capabilities as evaluators, their scalability, and their cost-effectiveness. We examine how each of these assumptions may be challenged by the inherent limitations of LLMs, LLJs, or current practices in NLG evaluation. To ground our analysis, we explore three applications of LLJs: text summarization, data annotation, and safety alignment. Finally, we highlight the need for more responsible practices in LLJ evaluation, to ensure that the growing role of LLJs in the field supports, rather than undermines, progress in NLG.
