CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection

Abstract

Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision. However, their effectiveness in downstream steering tasks is limited by the requirement for contrastive datasets or large activation storage. To address these limitations, we propose CorrSteer, which selects features by correlating sample correctness with SAE activations from generated tokens at inference time. This approach uses only inference-time activations to extract more relevant features, thereby avoiding spurious correlations. It also obtains steering coefficients from average activations, automating the entire pipeline. Our method shows improved task performance on QA, bias mitigation, jailbreaking prevention, and reasoning benchmarks on Gemma 2 2B and LLaMA 3.1 8B, notably achieving a +4.1% improvement in MMLU performance and a +22.9% improvement in HarmBench with only 4000 samples. Selected features demonstrate semantically meaningful patterns aligned with each task's requirements, revealing the underlying capabilities that drive performance. Our work establishes correlation-based selection as an effective and scalable approach for automated SAE steering across language model applications.
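The selection step described in the abstract can be illustrated with a minimal sketch. It assumes each sample has already been reduced to one activation value per SAE feature (for example by pooling over generated tokens) and that correctness is a 0/1 label per sample. The function name `select_features`, the pooling assumption, the use of Pearson correlation, and the choice to average only over samples where a feature is active are all illustrative assumptions, not details confirmed by the abstract.

```python
import numpy as np


def select_features(acts, correct, top_k=1):
    """Rank SAE features by correlation with correctness (hypothetical helper).

    acts:    (n_samples, n_features) pooled SAE activations (assumed input).
    correct: (n_samples,) 0/1 correctness labels.
    Returns the indices of the top_k most positively correlated features,
    their correlations, and one candidate steering coefficient per feature.
    """
    acts = np.asarray(acts, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Pearson correlation between each feature column and the labels.
    a_centered = acts - acts.mean(axis=0)
    y_centered = correct - correct.mean()
    cov = a_centered.T @ y_centered
    denom = np.sqrt((a_centered ** 2).sum(axis=0) * (y_centered ** 2).sum())
    # Features (or labels) with zero variance get correlation 0.
    corr = np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)

    # Keep the features most positively correlated with correctness.
    idx = np.argsort(-corr)[:top_k]

    # Assumption: the coefficient is the mean activation over the samples
    # in which the feature is active (activation > 0).
    active = acts > 0
    counts = active.sum(axis=0)
    sums = np.where(active, acts, 0.0).sum(axis=0)
    coeff = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

    return idx, corr[idx], coeff[idx]


# Toy data: 4 samples, 3 features; feature 0 fires only on correct samples.
toy_acts = [[2, 0, 1], [4, 0, 0], [0, 3, 1], [0, 1, 0]]
toy_correct = [1, 1, 0, 0]
top_idx, top_corr, top_coeff = select_features(toy_acts, toy_correct, top_k=1)
```

In this toy example, feature 0 is selected because it is the only feature that fires more on correct samples than on incorrect ones. How the resulting coefficient is applied to the model during generation is not specified in the abstract and is left out of the sketch.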
