

LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models

February 16, 2024
Authors: Minsuk Kahng, Ian Tenney, Mahima Pushkarna, Michael Xieyang Liu, James Wexler, Emily Reif, Krystal Kallarackal, Minsuk Chang, Michael Terry, Lucas Dixon
cs.AI

Abstract

Automatic side-by-side evaluation has emerged as a promising approach to evaluating the quality of responses from large language models (LLMs). However, analyzing the results from this evaluation approach raises scalability and interpretability challenges. In this paper, we present LLM Comparator, a novel visual analytics tool for interactively analyzing results from automatic side-by-side evaluation. The tool supports interactive workflows for users to understand when and why a model performs better or worse than a baseline model, and how the responses from two models are qualitatively different. We iteratively designed and developed the tool by closely working with researchers and engineers at a large technology company. This paper details the user challenges we identified, the design and development of the tool, and an observational study with participants who regularly evaluate their models.
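As a rough illustration of the kind of automatic side-by-side (pairwise) evaluation the paper builds on, the sketch below scores each prompt by comparing a candidate model's response against a baseline model's response and aggregates win/loss/tie rates. This is a minimal sketch under assumed conventions: the judge here is a stand-in stub, and all names (`Example`, `judge_stub`, `side_by_side_eval`) are hypothetical, not taken from LLM Comparator; a real setup would use an LLM-based autorater as the judge.

```python
# Minimal sketch of automatic side-by-side (pairwise) evaluation.
# The judge is a placeholder; a real system would prompt a judge LLM
# and parse its verdict. Names below are hypothetical.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    prompt: str
    response_a: str  # response from the model under evaluation
    response_b: str  # response from the baseline model


def judge_stub(example: Example) -> float:
    """Placeholder autorater: returns a score in [-1, 1].

    Positive means model A's response is preferred, negative means the
    baseline (model B) is preferred, and 0 means a tie.
    """
    return (len(example.response_a) - len(example.response_b)) / max(
        len(example.response_a), len(example.response_b), 1
    )


def side_by_side_eval(
    examples: List[Example], judge: Callable[[Example], float]
) -> dict:
    """Scores each example with the judge and aggregates simple win rates."""
    scores = [judge(ex) for ex in examples]
    wins = sum(s > 0 for s in scores)
    losses = sum(s < 0 for s in scores)
    ties = len(scores) - wins - losses
    return {
        "win_rate": wins / len(scores),
        "loss_rate": losses / len(scores),
        "tie_rate": ties / len(scores),
        "per_example_scores": scores,
    }


if __name__ == "__main__":
    data = [
        Example("Summarize the article.", "A concise summary.", "A summary."),
        Example("Explain recursion.", "Recursion is ...", "It calls itself ..."),
    ]
    print(side_by_side_eval(data, judge_stub))
```

A tool like LLM Comparator would take the per-example judge outputs produced by such a pipeline and let users interactively drill into when and why one model wins, rather than looking only at the aggregate rates.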