Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation

Abstract

We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including the strength and relevance of its arguments, its coherence and organization, and the appropriateness of its style and tone. This task involves a unique set of cognitive abilities that have previously received limited attention in systematic LLM benchmarking. To explore such skills, we leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task. Our findings reveal a nuanced picture: while larger models can approximate individual human judgments in some respects, they differ substantially in their overall judgment behavior. We also investigate the ability of frontier LLMs to generate persuasive, opinionated speeches, showing that models may perform at a human level on this task.
