

LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation

April 1, 2026
Authors: Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar, Alham Fikri Aji, Yova Kementchedjhieva
cs.AI

Abstract

Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without modifying the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability, while preserving the student's visual grounding on multimodal tasks. As a result, LinguDistill recovers ~10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.
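
To make the mechanism concrete, below is a minimal PyTorch sketch of the training step the abstract describes. The model wrappers, argument names (`return_kv`, `past_kv`), and batch keys are illustrative assumptions, not the authors' released API; the point is the shape of the computation: share the student's layer-wise KV cache with the frozen teacher so the teacher is conditioned on the visual context, then distill the teacher's predictions on language tokens only.

```python
import torch
import torch.nn.functional as F

def lingudistill_step(student, teacher, batch, text_mask, tau=2.0):
    """One distillation step following the abstract's description.

    Hypothetical interfaces (not the authors' API):
      - student(embeds, return_kv=True) -> (logits, kv), where kv is a list
        of per-layer (K, V) tensors of shape [B, heads, seq, head_dim];
      - teacher(embeds, past_kv=...) -> (logits, kv), a frozen copy of the
        original LM that accepts an externally supplied KV cache.

    `batch["multimodal_embeds"]` is the vision prefix plus text sequence,
    `batch["text_embeds"]` is the text portion alone, and `text_mask` [B, T]
    selects the language tokens to distill on.
    """
    # 1) Student (VLM) forward over the full multimodal sequence, keeping
    #    its layer-wise key/value caches.
    s_logits, s_kv = student(batch["multimodal_embeds"], return_kv=True)

    T = batch["text_embeds"].shape[1]          # number of text positions
    P = s_logits.shape[1] - T                  # multimodal prefix length
    s_text_logits = s_logits[:, -T:]           # student logits at text tokens

    # 2) Layer-wise KV-cache sharing: the frozen teacher attends over the
    #    student's cached prefix representations, so it "sees" the visual
    #    context without any change to either architecture.
    prefix_kv = [(k[:, :, :P], v[:, :, :P]) for k, v in s_kv]
    with torch.no_grad():
        t_logits, _ = teacher(batch["text_embeds"], past_kv=prefix_kv)

    # 3) Selective distillation: temperature-scaled KL on language tokens
    #    only, leaving vision-grounded positions to the student's usual
    #    multimodal objective.
    s_logp = F.log_softmax(s_text_logits / tau, dim=-1)
    t_prob = F.softmax(t_logits / tau, dim=-1)
    kl = F.kl_div(s_logp, t_prob, reduction="none").sum(-1)   # [B, T]
    return (kl * text_mask).sum() / text_mask.sum() * tau**2
```

Restricting the KL term to `text_mask` is what makes the distillation "selective": the teacher's linguistic signal is applied only on language-intensive positions, while multimodal positions remain governed by the student's original objective.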