Self-Recognition in Language Models

Abstract

A rapidly growing number of applications rely on a small set of closed-source language models (LMs). This dependency might introduce novel security risks if LMs develop self-recognition capabilities. Inspired by human identity verification methods, we propose a novel approach for assessing self-recognition in LMs using model-generated "security questions". Because our test does not require access to internal model parameters or output probabilities, it can be administered externally to keep track of frontier models. We use our test to examine self-recognition in ten of the most capable open- and closed-source LMs currently publicly available. Our extensive experiments found no empirical evidence of general or consistent self-recognition in any examined LM. Instead, our results suggest that, given a set of alternatives, LMs seek to pick the "best" answer, regardless of its origin. Moreover, we find indications that preferences about which models produce the best answers are consistent across LMs. We additionally uncover novel insights into position bias in LMs in multiple-choice settings.
