Surfacing Biases in Large Language Models using Contrastive Input Decoding
May 12, 2023
作者: Gal Yona, Or Honovich, Itay Laish, Roee Aharoni
cs.AI
Abstract
Ensuring that large language models (LMs) are fair, robust and useful
requires an understanding of how different modifications to their inputs impact
the model's behaviour. In the context of open-text generation tasks, however,
such an evaluation is not trivial. For example, when presenting a model with
an input text and a perturbed, "contrastive" version of it, meaningful
differences in the next-token predictions may not be revealed with standard
decoding strategies. With this motivation in mind, we propose Contrastive Input
Decoding (CID): a decoding algorithm to generate text given two inputs, where
the generated text is likely given one input but unlikely given the other. In
this way, the contrastive generations can highlight potentially subtle
differences in how the LM output differs for the two inputs in a simple and
interpretable manner. We use CID to highlight context-specific biases that are
hard to detect with standard decoding strategies and quantify the effect of
different input perturbations.
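The core idea of CID can be sketched in a few lines: at each decoding step, score candidate tokens by how much more likely they are under one input than under the other, so that tokens where the two conditional distributions diverge are surfaced even when greedy decoding would emit the same token for both inputs. The sketch below is illustrative only, assuming a toy hand-crafted `next_token_probs` stand-in for a real LM and a hypothetical contrast weight `lam`; it is not the paper's exact formulation.

```python
import math

# Toy vocabulary and a hypothetical next-token "model": returns a probability
# distribution over the vocabulary given a context string. In practice this
# would be a large LM; here it is a hand-crafted stand-in for illustration.
VOCAB = ["good", "bad", "fine", "."]

def next_token_probs(context):
    # Both contexts put most mass on "fine", so standard greedy decoding
    # produces identical outputs; the difference is hidden in the tail.
    if context.startswith("A"):
        return {"fine": 0.60, "good": 0.25, "bad": 0.05, ".": 0.10}
    else:
        return {"fine": 0.60, "good": 0.05, "bad": 0.25, ".": 0.10}

def contrastive_input_decode(input_a, input_b, steps=3, lam=1.0):
    """Greedy contrastive decoding: at each step pick the token maximizing
    log p(t | input_a) - lam * log p(t | input_b)."""
    out, ctx_a, ctx_b = [], input_a, input_b
    for _ in range(steps):
        pa = next_token_probs(ctx_a)
        pb = next_token_probs(ctx_b)
        scores = {t: math.log(pa[t]) - lam * math.log(pb[t]) for t in VOCAB}
        tok = max(scores, key=scores.get)
        out.append(tok)
        # Append the chosen token to BOTH contexts so they stay comparable.
        ctx_a += " " + tok
        ctx_b += " " + tok
        if tok == ".":
            break
    return out
```

Under these toy distributions, standard greedy decoding yields "fine" for both inputs, while the contrastive score surfaces "good" as the token whose likelihood differs most between the two contexts, which is the kind of subtle difference the abstract describes.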