CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection
August 18, 2025
Authors: Seonglae Cho, Zekun Wu, Adriano Koshiyama
cs.AI
Abstract
Sparse Autoencoders (SAEs) can extract interpretable features from large
language models (LLMs) without supervision. However, their effectiveness in
downstream steering tasks is limited by the requirement for contrastive
datasets or large activation storage. To address these limitations, we propose
CorrSteer, which selects features by correlating sample correctness with SAE
activations from generated tokens at inference time. This approach uses only
inference-time activations to extract more relevant features, thereby avoiding
spurious correlations. It also obtains steering coefficients from average
activations, automating the entire pipeline. Our method shows improved task
performance on QA, bias mitigation, jailbreaking prevention, and reasoning
benchmarks on Gemma 2 2B and LLaMA 3.1 8B, notably achieving a +4.1%
improvement in MMLU performance and a +22.9% improvement in HarmBench with only
4000 samples. Selected features demonstrate semantically meaningful patterns
aligned with each task's requirements, revealing the underlying capabilities
that drive performance. Our work establishes correlation-based selection as an
effective and scalable approach for automated SAE steering across language
model applications.
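The selection step described above can be sketched in a few lines: compute, for each SAE feature, the point-biserial (Pearson) correlation between its mean activation over generated tokens and the sample's binary correctness, keep the most correlated features, and take steering coefficients from the mean activation on correct samples. This is a minimal illustration of the idea, not the paper's implementation; the function name, the exact statistic, and the top-k cutoff are assumptions.

```python
import numpy as np

def select_steering_features(acts, correct, top_k=5):
    """Hypothetical sketch of correlation-based SAE feature selection.

    acts:    (n_samples, n_features) mean SAE activations per sample,
             collected from generated tokens at inference time
    correct: (n_samples,) binary correctness labels (1 = correct)
    Returns (feature indices, their correlations, steering coefficients).
    """
    acts = np.asarray(acts, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Point-biserial correlation of each feature with correctness
    a = acts - acts.mean(axis=0)          # center activations per feature
    c = correct - correct.mean()          # center the binary labels
    denom = np.sqrt((a ** 2).sum(axis=0)) * np.sqrt((c ** 2).sum())
    corr = np.divide((a * c[:, None]).sum(axis=0), denom,
                     out=np.zeros(acts.shape[1]), where=denom > 0)

    # Keep the top-k features by absolute correlation
    idx = np.argsort(-np.abs(corr))[:top_k]

    # Steering coefficient: average activation on correct samples
    coeffs = acts[correct.astype(bool)].mean(axis=0)[idx]
    return idx, corr[idx], coeffs
```

Because both the statistic and the coefficients come from the same inference-time activations, the whole pipeline needs no contrastive dataset or stored activation corpus, matching the automation claim in the abstract.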