Self-Recognition in Language Models

Abstract

A rapidly growing number of applications rely on a small set of closed-source language models (LMs). This dependency might introduce novel security risks if LMs develop self-recognition capabilities. Inspired by human identity verification methods, we propose a novel approach for assessing self-recognition in LMs using model-generated "security questions". Our test can be externally administered to keep track of frontier models as it does not require access to internal model parameters or output probabilities. We use our test to examine self-recognition in ten of the most capable open- and closed-source LMs currently publicly available. Our extensive experiments found no empirical evidence of general or consistent self-recognition in any examined LM. Instead, our results suggest that given a set of alternatives, LMs seek to pick the "best" answer, regardless of its origin. Moreover, we find indications that preferences about which models produce the best answers are consistent across LMs. We additionally uncover novel insights on position bias considerations for LMs in multiple-choice settings.
