Hybrid Attribution Priors for Explainable and Robust Model Training
December 9, 2025
Authors: Zhuoran Zhang, Feng Zhang, Shangyuan Li, Yang Shi, Yuanxing Zhang, Wei Chen, Tengjiao Wang, Kam-Fai Wong
cs.AI
Abstract
Small language models (SLMs) are widely used in tasks that require low latency and lightweight deployment, particularly text classification. As interpretability and robustness gain importance, explanation-guided learning, which introduces attribution-based supervision during training, has emerged as an effective framework; however, deriving general and reliable attribution priors remains a significant challenge. Through an analysis of representative attribution methods in classification settings, we find that although these methods reliably highlight class-relevant tokens, they often concentrate on common keywords shared by semantically similar classes. Because such classes are already difficult to distinguish under standard training, these attributions provide insufficient discriminative cues, limiting their ability to improve class discrimination. To overcome this limitation, we propose Class-Aware Attribution Prior (CAP), a novel attribution prior extraction framework that guides language models to capture fine-grained class distinctions and produce more salient, discriminative attribution priors. Building on this idea, we further introduce CAP Hybrid, which combines priors from CAP with those from existing attribution techniques to form a more comprehensive and balanced supervisory signal. By aligning a model's self-attribution with these enriched priors, our approach encourages the learning of diverse, decision-relevant features. Extensive experiments in full-data, few-shot, and adversarial scenarios demonstrate that our method consistently enhances both interpretability and robustness.
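
To make the explanation-guided training setup described above concrete, the minimal sketch below adds an auxiliary term to the standard classification loss that aligns a simple self-attribution (gradient-times-input saliency) with a precomputed attribution prior, such as one produced by CAP or CAP Hybrid. This is an assumed formulation rather than the paper's implementation: the model interface (a HuggingFace-style classifier accepting `inputs_embeds`), the choice of saliency, the KL-based alignment term, and the `prior_weight` coefficient are all illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation) of attribution-guided training:
# the model's self-attribution over input tokens -- here gradient-x-input
# saliency -- is aligned with a precomputed prior (e.g., a CAP / CAP Hybrid prior)
# via an auxiliary loss added to the usual cross-entropy objective.
import torch
import torch.nn.functional as F

def attribution_alignment_loss(model, input_embeds, attention_mask, labels,
                               prior, prior_weight=0.1):
    """Cross-entropy plus alignment between self-attribution and a prior.

    input_embeds:   (B, T, H) token embeddings with requires_grad=True
    attention_mask: (B, T) 1 for real tokens, 0 for padding
    prior:          (B, T) non-negative prior attribution scores (0 on padding)
    """
    outputs = model(inputs_embeds=input_embeds, attention_mask=attention_mask)
    logits = outputs.logits                                   # (B, num_classes)
    ce = F.cross_entropy(logits, labels)

    # Gradient-x-input saliency w.r.t. the gold-class logit; this is one choice
    # of self-attribution, and other attribution methods could be plugged in.
    gold_logit = logits.gather(1, labels.unsqueeze(1)).sum()
    grads = torch.autograd.grad(gold_logit, input_embeds, create_graph=True)[0]
    saliency = (grads * input_embeds).sum(dim=-1).abs()       # (B, T)
    saliency = saliency * attention_mask                      # ignore padding

    # Normalize both attributions into token distributions and align with KL.
    eps = 1e-8
    p_model = saliency / (saliency.sum(dim=-1, keepdim=True) + eps)
    p_prior = prior / (prior.sum(dim=-1, keepdim=True) + eps)
    align = F.kl_div((p_model + eps).log(), p_prior, reduction="batchmean")

    return ce + prior_weight * align
```

In this sketch, `create_graph=True` keeps the saliency differentiable so the alignment term can be backpropagated along with the cross-entropy loss; the relative weight of the two objectives is controlled by the hypothetical `prior_weight` hyperparameter.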