Surfacing Biases in Large Language Models using Contrastive Input Decoding
May 12, 2023
Authors: Gal Yona, Or Honovich, Itay Laish, Roee Aharoni
cs.AI
Abstract
Ensuring that large language models (LMs) are fair, robust and useful requires an understanding of how different modifications to their inputs impact the model's behaviour. In the context of open-text generation tasks, however, such an evaluation is not trivial. For example, when presenting a model with an input text and a perturbed, "contrastive" version of it, standard decoding strategies may not reveal meaningful differences in the next-token predictions. With this motivation in mind, we propose Contrastive Input Decoding (CID): a decoding algorithm that generates text given two inputs, where the generated text is likely given one input but unlikely given the other. In this way, the contrastive generations can highlight, in a simple and interpretable manner, potentially subtle differences in how the LM's output changes across the two inputs. We use CID to highlight context-specific biases that are hard to detect with standard decoding strategies and to quantify the effect of different input perturbations.
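To make the idea concrete, below is a minimal sketch of such a contrastive decoding loop in Python using Hugging Face transformers. The scoring rule (a top-k plausibility filter under the first input, then log p(token | input_a) − λ · log p(token | input_b)), the hyperparameters `lam` and `top_k`, and the example prompts are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def contrastive_input_decode(model, tokenizer, input_a, input_b,
                             max_new_tokens=20, lam=1.0, top_k=50):
    """Greedily generate text that is likely given input_a but unlikely given input_b."""
    ids_a = tokenizer(input_a, return_tensors="pt").input_ids
    ids_b = tokenizer(input_b, return_tensors="pt").input_ids
    generated = []
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logp_a = model(ids_a).logits[0, -1].log_softmax(-1)
            logp_b = model(ids_b).logits[0, -1].log_softmax(-1)
        # Plausibility filter (an assumption of this sketch): only consider
        # tokens reasonably likely given input_a, so the contrastive term
        # cannot promote gibberish tokens that are merely rare under input_b.
        cand = logp_a.topk(top_k)
        # Contrastive score: reward tokens likely under input_a and penalize
        # tokens that are also likely under input_b.
        scores = cand.values - lam * logp_b[cand.indices]
        next_id = cand.indices[scores.argmax()].view(1, 1)
        if next_id.item() == tokenizer.eos_token_id:
            break
        generated.append(next_id.item())
        # Extend both continuations with the same token so they stay aligned.
        ids_a = torch.cat([ids_a, next_id], dim=-1)
        ids_b = torch.cat([ids_b, next_id], dim=-1)
    return tokenizer.decode(generated)


tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
print(contrastive_input_decode(
    lm, tok,
    "John was rejected from the job because",
    "Mary was rejected from the job because",
))
```

With standard greedy decoding, two such prompts would often yield identical continuations; the contrastive score instead steers generation toward tokens whose likelihood differs most between the two inputs, surfacing the kind of context-specific differences the abstract describes.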