Bring Your Own Data! Self-Supervised Evaluation for Large Language Models
June 23, 2023
Authors: Neel Jain, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein
cs.AI
Abstract
With the rise of Large Language Models (LLMs) and their ubiquitous deployment
in diverse domains, measuring language model behavior on realistic data is
imperative. For example, a company deploying a client-facing chatbot must
ensure that the model will not respond to client requests with profanity.
Current evaluations approach this problem using small, domain-specific datasets
with human-curated labels. These evaluation sets are often sampled from a
narrow and simplified distribution, and data sources can unknowingly be leaked
into the training set, which can lead to misleading evaluations. To bypass these
drawbacks, we propose a framework for self-supervised evaluation of LLMs by
analyzing their sensitivity or invariance to transformations on the input text.
Self-supervised evaluation can directly monitor LLM behavior on datasets
collected in the wild or streamed during live model deployment. We demonstrate
self-supervised evaluation strategies for measuring closed-book knowledge,
toxicity, and long-range context dependence, in addition to sensitivity to
grammatical structure and tokenization errors. When comparisons to similar
human-labeled benchmarks are available, we find strong correlations between
self-supervised and human-supervised evaluations. The self-supervised paradigm
complements current evaluation strategies that rely on labeled data.
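
As a concrete illustration of the sensitivity-based idea (a minimal sketch, not the paper's exact procedure), the snippet below assumes a HuggingFace causal LM ("gpt2" is a placeholder) and compares the average per-token negative log-likelihood of an original passage against a word-shuffled copy; the gap serves as an unlabeled proxy for how strongly the model relies on grammatical structure. The model name, transformation, and scoring function are all illustrative assumptions.

import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any HuggingFace causal LM should work

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def avg_negative_log_likelihood(text: str) -> float:
    # Average per-token negative log-likelihood of `text` under the model.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return outputs.loss.item()


def shuffle_words(text: str, seed: int = 0) -> str:
    # Illustrative self-supervised transformation: permute word order.
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def word_order_sensitivity(text: str) -> float:
    # Gap in NLL between the transformed and the original text; a larger gap
    # suggests the model depends more strongly on grammatical structure.
    return (avg_negative_log_likelihood(shuffle_words(text))
            - avg_negative_log_likelihood(text))


if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog."
    print(f"word-order sensitivity: {word_order_sensitivity(sample):.3f}")

The same pattern extends to the other probes mentioned in the abstract by swapping in a different transformation (for example, injecting profanity to probe toxicity, or truncating distant context to probe long-range dependence) and an appropriate invariance or sensitivity metric, all without human-curated labels.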