Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Abstract

Circuit analysis is a promising technique for understanding the internal mechanisms of language models. However, existing analyses have been carried out on small models far from the state of the art. To address this, we present a case study of circuit analysis in the 70B Chinchilla model, aiming to test the scalability of circuit analysis. In particular, we study multiple-choice question answering, and investigate Chinchilla's capability to identify the correct answer label given knowledge of the correct answer text. We find that the existing techniques of logit attribution, attention pattern visualization, and activation patching naturally scale to Chinchilla, allowing us to identify and categorize a small set of 'output nodes' (attention heads and MLPs). We further study the 'correct letter' category of attention heads, aiming to understand the semantics of their features, with mixed results. For normal multiple-choice question answering, we significantly compress the query, key, and value subspaces of the head without loss of performance when operating on the answer labels, and we show that the query and key subspaces represent an 'Nth item in an enumeration' feature to at least some extent. However, when we attempt to use this explanation to understand the heads' behaviour on a more general distribution including randomized answer labels, we find that it is only a partial explanation, suggesting there is more to learn about the operation of 'correct letter' heads on multiple-choice question answering.
