Hyperdimensional Probe: Decoding LLM Representations via Vector Symbolic Architectures

Abstract

Despite their capabilities, Large Language Models (LLMs) remain opaque, and their internal representations are poorly understood. Current interpretability methods, such as direct logit attribution (DLA) and sparse autoencoders (SAEs), offer only restricted insight because they are constrained by the model's output vocabulary or by unclear feature names. This work introduces the Hyperdimensional Probe, a novel paradigm for decoding information from the LLM vector space. It combines ideas from symbolic representations and neural probing to project the model's residual stream into interpretable concepts via Vector Symbolic Architectures (VSAs). The probe combines the strengths of SAEs and conventional probes while overcoming their key limitations. We validate the decoding paradigm on controlled input-completion tasks, probing the model's final state before next-token prediction on inputs that span syntactic pattern recognition, key-value associations, and abstract inference. We further assess it in a question-answering setting, examining the state of the model both before and after text generation. Our experiments show that the probe reliably extracts meaningful concepts across varied LLMs, embedding sizes, and input domains, and that it also helps identify LLM failures. Our work advances information decoding in LLM vector space, enabling the extraction of more informative, interpretable, and structured features from neural representations.
