Hybrid Attribution Priors for Explainable and Robust Model Training
December 9, 2025
Authors: Zhuoran Zhang, Feng Zhang, Shangyuan Li, Yang Shi, Yuanxing Zhang, Wei Chen, Tengjiao Wang, Kam-Fai Wong
cs.AI
Abstract
Small language models (SLMs) are widely used in tasks that require low latency and lightweight deployment, particularly classification. As interpretability and robustness gain increasing importance, explanation-guided learning has emerged as an effective framework that introduces attribution-based supervision during training; however, deriving general and reliable attribution priors remains a significant challenge. Through an analysis of representative attribution methods in classification settings, we find that although these methods can reliably highlight class-relevant tokens, they often focus on common keywords shared by semantically similar classes. Because such classes are already difficult to distinguish under standard training, these attributions provide insufficient discriminative cues, limiting their ability to improve the model's ability to differentiate between classes. To overcome this limitation, we propose Class-Aware Attribution Prior (CAP), a novel attribution prior extraction framework that guides language models toward capturing fine-grained class distinctions and producing more salient, discriminative attribution priors. Building on this idea, we further introduce CAP Hybrid, which combines priors from CAP with those from existing attribution techniques to form a more comprehensive and balanced supervisory signal. By aligning a model's self-attribution with these enriched priors, our approach encourages the learning of diverse, decision-relevant features. Extensive experiments in full-data, few-shot, and adversarial scenarios demonstrate that our method consistently enhances both interpretability and robustness.
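To make the explanation-guided training setup concrete, the sketch below shows one way to blend a class-aware prior with one from an existing attribution method and to penalize disagreement between the model's self-attribution and that hybrid prior. The blending weight `alpha`, the gradient-times-input saliency, the cosine alignment term, and the `inputs_embeds` interface are illustrative assumptions; the abstract does not specify the exact CAP or CAP Hybrid formulation.

```python
# Minimal sketch of explanation-guided training with a hybrid attribution prior.
# Assumptions (not from the paper): gradient-x-input saliency, cosine alignment,
# a classifier that accepts `inputs_embeds`, and a fixed mixing weight `alpha`.
import torch
import torch.nn.functional as F


def hybrid_prior(cap_prior: torch.Tensor,
                 base_prior: torch.Tensor,
                 alpha: float = 0.5) -> torch.Tensor:
    """Blend a class-aware prior with one from an existing attribution method.

    Both priors are (batch, seq_len) non-negative token-importance weights.
    """
    mixed = alpha * cap_prior + (1.0 - alpha) * base_prior
    return mixed / (mixed.sum(dim=-1, keepdim=True) + 1e-8)  # per-example normalization


def explanation_guided_loss(model, embeddings, labels, prior, lam: float = 0.1):
    """Cross-entropy plus a term aligning the model's self-attribution with the prior.

    embeddings: (batch, seq_len, dim) token embeddings with requires_grad=True
    prior:      (batch, seq_len) token-importance weights, e.g. from hybrid_prior
    """
    logits = model(inputs_embeds=embeddings)  # (batch, num_classes)
    ce = F.cross_entropy(logits, labels)

    # Self-attribution via gradient-x-input saliency, aggregated over the embedding dim.
    target_score = logits.gather(1, labels.unsqueeze(1)).sum()
    grads = torch.autograd.grad(target_score, embeddings, create_graph=True)[0]
    saliency = (grads * embeddings).sum(dim=-1).abs()  # (batch, seq_len)

    # Encourage the saliency distribution to match the hybrid attribution prior.
    align = 1.0 - F.cosine_similarity(saliency, prior, dim=-1).mean()
    return ce + lam * align
```

In this reading, the supervisory signal is a soft regularizer rather than a hard constraint: the classification loss still drives label fitting, while the alignment term nudges the model toward the decision-relevant tokens highlighted by the combined prior.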