LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models
February 16, 2024
Authors: Minsuk Kahng, Ian Tenney, Mahima Pushkarna, Michael Xieyang Liu, James Wexler, Emily Reif, Krystal Kallarackal, Minsuk Chang, Michael Terry, Lucas Dixon
cs.AI
Abstract
Automatic side-by-side evaluation has emerged as a promising approach to
evaluating the quality of responses from large language models (LLMs). However,
analyzing the results from this evaluation approach raises scalability and
interpretability challenges. In this paper, we present LLM Comparator, a novel
visual analytics tool for interactively analyzing results from automatic
side-by-side evaluation. The tool supports interactive workflows for users to
understand when and why a model performs better or worse than a baseline model,
and how the responses from two models are qualitatively different. We
iteratively designed and developed the tool by closely working with researchers
and engineers at a large technology company. This paper details the user
challenges we identified, the design and development of the tool, and an
observational study with participants who regularly evaluate their models.
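To make the setting concrete, the sketch below illustrates the kind of record automatic side-by-side evaluation produces and how a single win-rate summary is derived from per-prompt judge verdicts. This is a hypothetical illustration, not the paper's actual data format or the tool's API; the record fields and scoring convention are assumptions.

```python
# Hypothetical sketch of automatic side-by-side evaluation data (not the paper's schema):
# for each prompt, an LLM-based judge compares a response from the model under test (A)
# against a baseline model (B) and emits a preference score.
from dataclasses import dataclass
from typing import List


@dataclass
class SideBySideRecord:
    prompt: str
    response_a: str     # response from the model being evaluated
    response_b: str     # response from the baseline model
    judge_score: float  # assumed convention: > 0 favors A, < 0 favors B, 0 is a tie


def win_rate(records: List[SideBySideRecord]) -> float:
    """Fraction of prompts on which the evaluated model is preferred over the baseline."""
    if not records:
        return 0.0
    wins = sum(1 for r in records if r.judge_score > 0)
    return wins / len(records)


# Aggregating verdicts into a single number like this is the starting point;
# a tool such as LLM Comparator then lets users drill down into when and why
# one model wins (e.g., by prompt category or response characteristics).
records = [
    SideBySideRecord("Summarize this article...", "...", "...", 1.0),
    SideBySideRecord("Write a SQL query...", "...", "...", -0.5),
    SideBySideRecord("Explain recursion...", "...", "...", 0.0),
]
print(f"Win rate vs. baseline: {win_rate(records):.2f}")
```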