Counterfactual Generation from Language Models

Abstract

Understanding and manipulating the causal generation mechanisms in language models is essential for controlling their behavior. Previous work has primarily relied on techniques such as representation surgery -- e.g., model ablations or manipulation of linear subspaces tied to specific concepts -- to intervene on these models. To understand the impact of interventions precisely, it is useful to examine counterfactuals -- e.g., how a given sentence would have appeared had it been generated by the model following a specific intervention. We highlight that counterfactual reasoning is conceptually distinct from interventions, as articulated in Pearl's causal hierarchy. Based on this observation, we propose a framework for generating true string counterfactuals by reformulating language models as Generalized Structural-equation Models using the Gumbel-max trick. This allows us to model the joint distribution over original strings and their counterfactuals resulting from the same instantiation of the sampling noise. We develop an algorithm based on hindsight Gumbel sampling that allows us to infer the latent noise variables and generate counterfactuals of observed strings. Our experiments demonstrate that the approach produces meaningful counterfactuals, while also showing that commonly used intervention techniques have considerable undesired side effects.
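
To make the Gumbel-max idea concrete, the following is a minimal single-token sketch, not the authors' implementation. It assumes a toy vocabulary with hand-written logits in place of a real language model, and all function names (`gumbel`, `hindsight_gumbel_noise`, `counterfactual_token`) are hypothetical. Sampling a token is written as an argmax over logits plus Gumbel noise. Given an observed token, noise consistent with that observation is drawn using the standard truncated-Gumbel construction, and the same noise is then reused with intervened logits to obtain a counterfactual token.

```python
import math
import random


def gumbel(rng, loc=0.0):
    """Draw one Gumbel(loc, 1) sample by inverting the CDF."""
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return loc - math.log(-math.log(u))


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def hindsight_gumbel_noise(logits, observed, rng):
    """Sample noise such that argmax(logits + noise) == observed.

    Assumption: this is the usual truncated-Gumbel posterior construction,
    used here as a stand-in for the paper's hindsight sampling step.
    """
    # The maximum perturbed logit is Gumbel(logsumexp(logits)) distributed.
    top = gumbel(rng, logsumexp(logits))
    noise = []
    for i, logit in enumerate(logits):
        if i == observed:
            perturbed = top
        else:
            # Gumbel(logit) truncated to lie below `top`:
            # -log(exp(-top) + exp(-free)), computed stably.
            free = gumbel(rng, logit)
            m = min(top, free)
            perturbed = m - math.log(math.exp(m - top) + math.exp(m - free))
        noise.append(perturbed - logit)
    return noise


def counterfactual_token(cf_logits, noise):
    """Replay the inferred noise under intervened logits."""
    scores = [l + n for l, n in zip(cf_logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy usage: token 3 was observed under the original logits (illustrative values).
rng = random.Random(0)
original_logits = [1.0, 0.5, -0.3, 2.0]
intervened_logits = [1.0, 0.5, 1.5, 0.0]  # hypothetical effect of an intervention
noise = hindsight_gumbel_noise(original_logits, observed=3, rng=rng)
print("counterfactual token:", counterfactual_token(intervened_logits, noise))
```

For a full string, the same step would be applied position by position, with the logits at each position conditioned on the prefix. Because the noise is shared between the original and the intervened model, any difference between the two outputs can be attributed to the intervention rather than to resampling.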