Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation
December 15, 2025
Author: Richard J. Young
cs.AI
Abstract
Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations has not been systematically characterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all sixteen models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (mean GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646) with model-dependent capability impact. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures. The principal finding is that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions, with GSM8K changes ranging from +1.51 pp to -18.81 pp (-26.5% relative) depending on tool selection and model architecture.
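The abstract names directional orthogonalization without stating what it computes. As a rough illustration only, and not a description of how Heretic, DECCP, ErisForge, or FailSpy are implemented, the sketch below assumes a refusal direction r has already been estimated and removes its component from the output space of a single weight matrix via W' = W - r (r^T W). The function names (`normalize`, `orthogonalize`, `matvec`, `dot`) and the toy 3x2 matrix are hypothetical; real tools operate on model tensors rather than Python lists.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in weight]


def orthogonalize(weight, direction):
    """Project `direction` out of the output space of `weight`.

    weight:    d_out x d_in matrix as a list of rows (hypothetical layer weights)
    direction: length d_out vector (assumed to be a precomputed refusal direction)

    Computes W' = W - r (r^T W), where r is the unit-normalized direction,
    so that no output of W' has a component along r.
    """
    r = normalize(direction)
    d_out, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each column of W writes along r.
    coeffs = [sum(r[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    # Subtract that component from every column.
    return [
        [weight[i][j] - r[i] * coeffs[j] for j in range(d_in)]
        for i in range(d_out)
    ]


# Toy example: a 3x2 matrix and an arbitrary direction in the 3-dim output space.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
refusal_direction = [1.0, 1.0, 0.0]
W_ablated = orthogonalize(W, refusal_direction)

# Any input now produces an output orthogonal to the removed direction.
output = matvec(W_ablated, [0.7, -1.3])
residual = dot(normalize(refusal_direction), output)
```

In this sketch `residual` is zero up to floating-point error for any input vector, and applying `orthogonalize` a second time with the same direction leaves the matrix unchanged, since the projection is idempotent. How the direction is estimated, which layers are modified, and how strongly is where tools of this kind can differ, and none of that is modeled here.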