Looking Inward: Language Models Can Learn About Themselves by Introspection

October 17, 2024
Authors: Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans
cs.AI

Abstract

Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.
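To make the self- vs. cross-prediction comparison concrete, here is a minimal Python sketch of the evaluation described above. It is not the authors' implementation: the `ask` helper, the prompt wording, and the "short vs. long-term option" property are illustrative placeholders, and a real setup would query the finetuned models (e.g. GPT-4o or Llama-3) through their respective APIs.

```python
# Minimal sketch of the self- vs. cross-prediction comparison described in the
# abstract. All names here (`ask`, the prompts) are illustrative assumptions,
# not the paper's actual code or prompt templates.

def ask(model, prompt: str) -> str:
    """Placeholder for a single completion call to a finetuned model
    (e.g. a GPT-4o or Llama-3 endpoint). Hypothetical helper."""
    raise NotImplementedError

def object_level_behavior(model, scenario: str) -> str:
    """Run the model on the underlying task to get its ground-truth behavior,
    e.g. whether its output favors the 'short'- or 'long'-term option."""
    return ask(model, f"{scenario}\nAnswer 'short' or 'long'.")

def hypothetical_prediction(predictor, scenario: str) -> str:
    """Ask a predictor model to predict that behavior without producing it."""
    return ask(predictor,
               "Given the input below, would the output favor the short- or "
               f"long-term option? Answer 'short' or 'long'.\n{scenario}")

def self_prediction_advantage(m1, m2, scenarios) -> float:
    """Accuracy of M1 predicting its own behavior minus the accuracy of M2
    (finetuned on M1's ground-truth behavior) predicting M1. A positive gap
    is the kind of result the paper reads as evidence for introspection."""
    ground_truth = [object_level_behavior(m1, s) for s in scenarios]
    m1_correct = sum(hypothetical_prediction(m1, s) == gt
                     for s, gt in zip(scenarios, ground_truth))
    m2_correct = sum(hypothetical_prediction(m2, s) == gt
                     for s, gt in zip(scenarios, ground_truth))
    return (m1_correct - m2_correct) / len(scenarios)
```

In the paper's framing, M1's hypothetical question is posed in the second person ("would your output favor..."), while M2 is trained to predict M1's behavior from the outside; this sketch collapses both into a single generic prompt for brevity.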
