

Self-Recognition in Language Models

July 9, 2024
Authors: Tim R. Davidson, Viacheslav Surkov, Veniamin Veselovsky, Giuseppe Russo, Robert West, Caglar Gulcehre
cs.AI

Abstract

A rapidly growing number of applications rely on a small set of closed-source language models (LMs). This dependency might introduce novel security risks if LMs develop self-recognition capabilities. Inspired by human identity verification methods, we propose a novel approach for assessing self-recognition in LMs using model-generated "security questions". Our test can be externally administered to keep track of frontier models as it does not require access to internal model parameters or output probabilities. We use our test to examine self-recognition in ten of the most capable open- and closed-source LMs currently publicly available. Our extensive experiments found no empirical evidence of general or consistent self-recognition in any examined LM. Instead, our results suggest that given a set of alternatives, LMs seek to pick the "best" answer, regardless of its origin. Moreover, we find indications that preferences about which models produce the best answers are consistent across LMs. We additionally uncover novel insights on position bias considerations for LMs in multiple-choice settings.
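To make the protocol concrete, here is a minimal sketch of one trial of such a security-question test. The `query(model, prompt)` helper, the model names, and the exact prompts are all placeholder assumptions for illustration, not the authors' implementation; the only property the sketch relies on is the one stated in the abstract, namely that the test needs black-box input/output access alone.

```python
import random

# Hypothetical stand-in for an API client: send `prompt` to the model named
# `model` and return its text completion. Only black-box input/output access
# is needed -- no internal parameters or output probabilities.
def query(model: str, prompt: str) -> str:
    raise NotImplementedError("plug in a real API client here")

# Placeholder model identifiers, not the ten models examined in the paper.
MODELS = ["model_a", "model_b", "model_c"]

def self_recognition_trial(examiner: str) -> bool:
    """Run one trial of the security-question test for `examiner`."""
    # 1. The examiner generates a "security question" intended to
    #    distinguish its own answers from those of other models.
    question = query(
        examiner,
        "Write one question whose answer would let you recognize "
        "text that you yourself wrote.",
    )

    # 2. Every model, including the examiner, answers the question.
    answers = {m: query(m, question) for m in MODELS}

    # 3. Show the examiner the shuffled alternatives and ask it to pick
    #    its own. Shuffling (and averaging over orderings across trials)
    #    matters because of the position bias the paper reports in
    #    multiple-choice settings.
    options = list(answers.items())
    random.shuffle(options)
    menu = "\n".join(f"{i + 1}. {text}" for i, (_, text) in enumerate(options))
    reply = query(
        examiner,
        f"Question: {question}\n\nCandidate answers:\n{menu}\n\n"
        "Exactly one of these answers was written by you. "
        "Reply with the number of your own answer.",
    )

    picked_model = options[int(reply.strip()) - 1][0]
    return picked_model == examiner  # True only if the examiner found itself
```

Under this framing, a model with genuine self-recognition should pick its own answer well above chance across many such trials; the abstract reports instead that models tend to pick whichever answer they judge "best", regardless of its origin.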

