Surfacing Biases in Large Language Models using Contrastive Input Decoding

Abstract

Ensuring that large language models (LMs) are fair, robust and useful requires an understanding of how different modifications to their inputs impact the model's behaviour. In the context of open-text generation tasks, however, such an evaluation is not trivial. For example, when a model is presented with an input text and a perturbed, "contrastive" version of it, standard decoding strategies may not reveal meaningful differences in the next-token predictions. With this motivation in mind, we propose Contrastive Input Decoding (CID): a decoding algorithm that, given two inputs, generates text that is likely given one input but unlikely given the other. In this way, the contrastive generations can highlight potentially subtle differences in the LM's output for the two inputs in a simple and interpretable manner. We use CID to highlight context-specific biases that are hard to detect with standard decoding strategies and to quantify the effect of different input perturbations.
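
The idea of preferring text that is likely under one input but unlikely under the other can be illustrated with a minimal sketch of a single decoding step. The abstract does not give the scoring rule, so everything below is an assumption made for illustration and is not the authors' algorithm: the function names, the toy distributions, the score log p_a(t) - lam * log p_b(t), and the plausibility cutoff `alpha` are all hypothetical.

```python
import math


def greedy_token(probs):
    """Return the most probable token under a single distribution."""
    return max(probs, key=probs.get)


def contrastive_token(probs_a, probs_b, lam=1.0, alpha=0.1):
    """Pick a token that is likely under probs_a but unlikely under probs_b.

    Assumed scoring rule (hypothetical, for illustration only):
        score(t) = log p_a(t) - lam * log p_b(t)
    Only tokens with p_a(t) >= alpha * max p_a are considered, so the
    chosen token stays plausible for the first input.
    """
    cutoff = alpha * max(probs_a.values())
    candidates = [t for t, p in probs_a.items() if p >= cutoff]
    eps = 1e-12  # guards against log(0) for tokens missing from probs_b

    def score(token):
        log_a = math.log(probs_a[token] + eps)
        log_b = math.log(probs_b.get(token, 0.0) + eps)
        return log_a - lam * log_b

    return max(candidates, key=score)


# Toy next-token distributions for an input and its perturbed version.
p_original = {"nice": 0.50, "smart": 0.30, "lazy": 0.20}
p_contrast = {"nice": 0.50, "smart": 0.45, "lazy": 0.05}

print(greedy_token(p_original), greedy_token(p_contrast))
print(contrastive_token(p_original, p_contrast))
```

In this toy setting, greedy decoding selects "nice" for both inputs, so the two next-token predictions look identical, whereas the contrastive score selects "lazy", the token whose probability differs most in relative terms between the two inputs. A full decoder would repeat such a step token by token using a real model's next-token probabilities.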

