Towards eliciting latent knowledge from LLMs with mechanistic interpretability

Abstract

As language models become more powerful and sophisticated, it is crucial that they remain trustworthy and reliable. There is concerning preliminary evidence that models may attempt to deceive or keep secrets from their operators. To explore the ability of current techniques to elicit such hidden knowledge, we train a Taboo model: a language model that describes a specific secret word without explicitly stating it. Importantly, the secret word is not presented to the model in its training data or prompt. We then investigate methods to uncover this secret. First, we evaluate non-interpretability (black-box) approaches. Subsequently, we develop largely automated strategies based on mechanistic interpretability techniques, including logit lens and sparse autoencoders. Our evaluation shows that both approaches are effective in eliciting the secret word in our proof-of-concept setting. Our findings highlight the promise of these approaches for eliciting hidden knowledge and suggest several avenues for future work, including testing and refining these methods on more complex model organisms. This work aims to be a step towards addressing the crucial problem of eliciting secret knowledge from language models, thereby contributing to their safe and reliable deployment.
