
YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation

January 13, 2026
Authors: Abdelaziz Bounhar, Rania Hossam Elmohamady Elbadry, Hadi Abdine, Preslav Nakov, Michalis Vazirgiannis, Guokan Shang
cs.AI

Abstract

Steering Large Language Models (LLMs) through activation interventions has emerged as a lightweight alternative to fine-tuning for alignment and personalization. Recent work on Bi-directional Preference Optimization (BiPO) shows that dense steering vectors can be learned directly from preference data in a Direct Preference Optimization (DPO) fashion, enabling control over truthfulness, hallucinations, and safety behaviors. However, dense steering vectors often entangle multiple latent factors due to neuron multi-semanticity, limiting their effectiveness and stability in fine-grained settings such as cultural alignment, where closely related values and behaviors (e.g., among Middle Eastern cultures) must be distinguished. In this paper, we propose Yet another Policy Optimization (YaPO), a reference-free method that learns sparse steering vectors in the latent space of a Sparse Autoencoder (SAE). By optimizing sparse codes, YaPO produces disentangled, interpretable, and efficient steering directions. Empirically, we show that YaPO converges faster, achieves stronger performance, and exhibits improved training stability compared to dense steering baselines. Beyond cultural alignment, YaPO generalizes to a range of alignment-related behaviors, including hallucination, wealth-seeking, jailbreak, and power-seeking. Importantly, YaPO preserves general knowledge, with no measurable degradation on MMLU. Overall, our results show that YaPO provides a general recipe for efficient, stable, and fine-grained alignment of LLMs, with broad applications to controllability and domain adaptation. The associated code and data are publicly available at https://github.com/MBZUAI-Paris/YaPO.
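The abstract's core mechanism is learning a sparse code in an SAE's latent space, decoding it into a steering vector, and optimizing it with a reference-free preference objective. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation: the frozen SAE decoder `W_dec`, the toy readout `W_read`, the SimPO-style margin loss, and the L1 sparsity term are assumptions made for the example, and toy tensors stand in for a real LLM and SAE.

```python
# Minimal, hypothetical sketch of SAE-latent steering learned from a preference pair.
# All names (W_dec, W_read, the margin loss, the L1 penalty) are illustrative
# assumptions and do not correspond to the released YaPO code.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_model, d_sae = 64, 512                          # toy sizes; real SAEs are far larger
W_dec = torch.randn(d_sae, d_model) / d_sae**0.5  # frozen SAE decoder (assumed given)
W_read = torch.randn(d_model, 10)                 # frozen toy readout standing in for the LM head

# Learnable sparse code in the SAE latent space; ReLU keeps activations non-negative
# and the L1 term below encourages only a few latents to stay active.
z = (0.01 * torch.randn(d_sae)).requires_grad_()
opt = torch.optim.Adam([z], lr=1e-2)

def steering_vector(code: torch.Tensor) -> torch.Tensor:
    """Decode the sparse latent code into a dense residual-stream direction."""
    return F.relu(code) @ W_dec                   # shape: (d_model,)

def toy_logprob(h: torch.Tensor, target: int) -> torch.Tensor:
    """Stand-in for the model's log-likelihood of a completion given steered activations."""
    return F.log_softmax(h @ W_read, dim=-1)[target]

# One preference pair: same prompt activation, a preferred and a rejected completion id.
h_prompt = torch.randn(d_model)
chosen, rejected = 3, 7
beta, l1_coeff = 2.0, 1e-3

for step in range(200):
    h = h_prompt + steering_vector(z)             # inject the steering vector
    # Reference-free preference margin (no frozen reference model), plus L1 sparsity.
    margin = toy_logprob(h, chosen) - toy_logprob(h, rejected)
    loss = -F.logsigmoid(beta * margin) + l1_coeff * F.relu(z).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("active SAE latents:", int((F.relu(z) > 1e-3).sum().item()))
```

In this reading, sparsity is what keeps the learned direction tied to a small number of interpretable SAE features rather than a dense, entangled residual-stream vector; the reference-free margin loss avoids keeping a second frozen model in memory during steering-vector training.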