Looking Inward: Language Models Can Learn About Themselves by Introspection

Abstract

Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states, such as subjective feelings or desires, and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior, even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 can (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.
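
The comparison between M1 and M2 described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, exact-match accuracy is an assumed scoring rule, and the labels are placeholders rather than results from the experiments. The sketch scores M1's predictions of its own behavior against M2's predictions of M1's behavior and reports the difference.

```python
from typing import Sequence


def prediction_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of prompts where the predicted behavior property matches the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one prompt")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def self_prediction_advantage(
    m1_actual: Sequence[str],
    m1_self_predictions: Sequence[str],
    m2_cross_predictions: Sequence[str],
) -> float:
    """Accuracy of M1 predicting itself minus accuracy of M2 predicting M1.

    A positive value is the pattern the abstract treats as evidence for
    introspection; this helper is hypothetical and only illustrates the idea.
    """
    self_acc = prediction_accuracy(m1_self_predictions, m1_actual)
    cross_acc = prediction_accuracy(m2_cross_predictions, m1_actual)
    return self_acc - cross_acc


# Placeholder labels for four hypothetical prompts (NOT experimental data).
# Each label is the property "does the output favor the short- or long-term option?"
m1_actual = ["short", "long", "long", "short"]  # M1's ground-truth behavior
m1_self = ["short", "long", "long", "long"]     # M1's predictions about itself
m2_cross = ["short", "short", "long", "long"]   # M2's predictions about M1

advantage = self_prediction_advantage(m1_actual, m1_self, m2_cross)
print(f"self-prediction advantage: {advantage:.2f}")
```

In practice the two prediction lists would come from querying the finetuned models on held-out hypothetical prompts, and a meaningful comparison would need many more prompts than this toy example uses.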
