

Bring Your Own Data! Self-Supervised Evaluation for Large Language Models

June 23, 2023
作者: Neel Jain, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein
cs.AI

Abstract

With the rise of Large Language Models (LLMs) and their ubiquitous deployment in diverse domains, measuring language model behavior on realistic data is imperative. For example, a company deploying a client-facing chatbot must ensure that the model will not respond to client requests with profanity. Current evaluations approach this problem using small, domain-specific datasets with human-curated labels. These evaluation sets are often sampled from a narrow and simplified distribution, and data sources can unknowingly be leaked into the training set which can lead to misleading evaluations. To bypass these drawbacks, we propose a framework for self-supervised evaluation of LLMs by analyzing their sensitivity or invariance to transformations on the input text. Self-supervised evaluation can directly monitor LLM behavior on datasets collected in the wild or streamed during live model deployment. We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence, in addition to sensitivity to grammatical structure and tokenization errors. When comparisons to similar human-labeled benchmarks are available, we find strong correlations between self-supervised and human-supervised evaluations. The self-supervised paradigm complements current evaluation strategies that rely on labeled data.
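To make the core idea concrete, below is a minimal sketch of self-supervised evaluation via sensitivity to input transformations: the model's likelihood is computed on original and perturbed versions of unlabeled text, and the average change serves as a sensitivity score. This is an illustrative example, not the paper's exact procedure; the GPT-2 checkpoint, the character-swap perturbation, and the aggregation are assumptions chosen for brevity.

```python
# Sketch: score a model's sensitivity to an input transformation by comparing
# its average per-token loss on original vs. perturbed text (no labels needed).
# The perturbation and scoring here are illustrative, not the paper's recipe.
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_nll(text: str) -> float:
    """Average negative log-likelihood per token under the model."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return loss.item()

def swap_adjacent_chars(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Simulate typo/tokenization noise by swapping adjacent characters."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def sensitivity(texts: list[str]) -> float:
    """Mean change in loss induced by the perturbation.

    Values near zero indicate invariance; larger values indicate sensitivity.
    """
    deltas = [avg_nll(swap_adjacent_chars(t)) - avg_nll(t) for t in texts]
    return sum(deltas) / len(deltas)

if __name__ == "__main__":
    corpus = [
        "The quick brown fox jumps over the lazy dog.",
        "Large language models are deployed in many client-facing products.",
    ]
    print(f"Sensitivity to character swaps: {sensitivity(corpus):.3f}")
```

The same skeleton extends to the other probes described in the abstract by swapping the transformation, e.g., negating statements for closed-book knowledge or truncating context for long-range dependence, while keeping the unlabeled corpus fixed.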