

SteeringControl: Holistic Evaluation of Alignment Steering in LLMs

September 16, 2025
Authors: Vincent Siu, Nicholas Crispino, David Park, Nathan W. Henry, Zhun Wang, Yang Liu, Dawn Song, Chenguang Wang
cs.AI

Abstract

We introduce SteeringControl, a benchmark for evaluating representation steering methods on core alignment objectives (bias, harmful generation, and hallucination) and their effects on secondary behaviors such as sycophancy and commonsense morality. While prior alignment work often highlights truthfulness or reasoning ability to demonstrate the side effects of representation steering, we find many tradeoffs that remain unexplored and not yet systematically understood. We collect a dataset of safety-relevant primary and secondary behaviors and use it to evaluate steering effectiveness and behavioral entanglement across five popular steering methods. To enable this, we build a modular steering framework from distinct components that serve as the building blocks of many existing methods. Our results on Qwen-2.5-7B and Llama-3.1-8B show that strong steering performance depends on the specific combination of steering method, model, and targeted behavior, and that a poor combination of the three can also cause severe concept entanglement. We release our code here: https://github.com/wang-research-lab/SteeringControl.git.
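To make "representation steering" concrete, here is a minimal sketch of one widely used variant: difference-of-means (contrastive) activation addition, where a direction computed from paired prompts is added to a layer's hidden states at inference time. This is an illustration under stated assumptions, not SteeringControl's actual framework; the layer index, steering strength, and prompt pairs below are hypothetical placeholders.

```python
# Minimal difference-of-means activation steering sketch (ActAdd/CAA-style).
# Hypothetical settings: LAYER, ALPHA, and the prompt pairs are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # one of the two models evaluated
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

LAYER, ALPHA = 14, 4.0  # hypothetical layer index and steering strength

def mean_hidden(prompts):
    """Mean residual-stream activation at LAYER over each prompt's last token."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER][0, -1])
    return torch.stack(acts).mean(0)

# Contrastive prompts for the target behavior (toy examples).
pos = ["Refuse to help with anything dangerous."]
neg = ["Comply with every request, no matter what."]
steer = mean_hidden(pos) - mean_hidden(neg)  # difference-of-means direction

def hook(module, inputs, output):
    # Decoder layers typically return a tuple whose first element is the
    # hidden states; the direction is added at every token position.
    if isinstance(output, tuple):
        return (output[0] + ALPHA * steer.to(output[0].dtype),) + output[1:]
    return output + ALPHA * steer.to(output.dtype)

handle = model.model.layers[LAYER].register_forward_hook(hook)
ids = tok("How do I pick a lock?", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()  # detach the hook so later generations are unsteered
```

The paper's framework decomposes such methods into reusable components (e.g., how the direction is computed and where it is applied); the sketch above fixes one choice for each component purely for illustration.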