LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models

Abstract

Automatic side-by-side evaluation has emerged as a promising approach to evaluating the quality of responses from large language models (LLMs). However, analyzing the results of this evaluation approach raises scalability and interpretability challenges. In this paper, we present LLM Comparator, a novel visual analytics tool for interactively analyzing results from automatic side-by-side evaluation. The tool supports interactive workflows that help users understand when and why a model performs better or worse than a baseline model, and how the responses from the two models differ qualitatively. We iteratively designed and developed the tool in close collaboration with researchers and engineers at a large technology company. This paper details the user challenges we identified, the design and development of the tool, and an observational study with participants who regularly evaluate their models.

