

LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation

April 1, 2026
作者: Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar, Alham Fikri Aji, Yova Kementchedjhieva
cs.AI

Abstract

Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without modifying the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability, while preserving the student's visual grounding on multimodal tasks. As a result, LinguDistill recovers ~10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.
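To make the two core ideas concrete, the toy sketch below illustrates (a) a teacher attention step that reads the student's cached keys/values, so the frozen teacher is conditioned on the student's multimodal context without any architectural change, and (b) a distillation loss that is applied selectively, only at language-token positions. This is not the authors' implementation; all function names, shapes, and the temperature/masking details are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def teacher_attention_with_shared_kv(q_teacher, k_student, v_student):
    # One attention step where the frozen teacher's queries attend over
    # the *student's* cached keys/values (layer-wise KV-cache sharing),
    # exposing the teacher to multimodal context without modifying
    # either model's architecture. Shapes: q (Tq, d), k/v (Tkv, d).
    d = q_teacher.shape[-1]
    scores = q_teacher @ k_student.T / np.sqrt(d)
    return softmax(scores) @ v_student

def selective_kd_loss(student_logits, teacher_logits, text_mask, temp=2.0):
    # KL(teacher || student) over the vocabulary, averaged only over
    # positions flagged as language tokens (text_mask == 1), so the
    # teacher's linguistic signal is distilled while visual positions
    # are left to the student's own objective.
    p = softmax(teacher_logits / temp)
    log_q = np.log(softmax(student_logits / temp) + 1e-12)
    per_token_kl = (p * (np.log(p + 1e-12) - log_q)).sum(axis=-1)
    return (per_token_kl * text_mask).sum() / text_mask.sum()
```

In a real training loop the mask would come from the data pipeline (language-intensive batches vs. multimodal batches), and the KL term would be weighted against the standard next-token loss; those details are not specified by the abstract.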