LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation

Abstract

Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without modifying the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability, while preserving the student's visual grounding on multimodal tasks. As a result, LinguDistill recovers approximately 10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.
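The abstract does not state the exact training objective, so the following is only a minimal sketch of what "selective" distillation could look like under stated assumptions. It assumes a per-token KL divergence between the frozen teacher's and the student's output distributions on samples flagged as language-intensive, and ordinary cross-entropy against the label on all other samples. The function names, the `is_language_intensive` flag, and the `temperature` parameter are hypothetical and are not taken from the paper; the layer-wise KV-cache sharing that gives the teacher access to visual context is not modeled here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    return -math.log(softmax(logits)[target_index])


def selective_distill_loss(student_logits, teacher_logits, target_index,
                           is_language_intensive, temperature=1.0):
    """Per-token loss that distills only on language-intensive data.

    Assumption (not from the paper): language-intensive tokens use
    KL(teacher || student); all other tokens use plain cross-entropy
    against the ground-truth label, leaving the teacher out entirely.
    """
    if is_language_intensive:
        teacher_probs = softmax(teacher_logits, temperature)
        student_probs = softmax(student_logits, temperature)
        return kl_divergence(teacher_probs, student_probs)
    return cross_entropy(student_logits, target_index)
```

Under this reading, the teacher's signal only shapes the student on text-heavy samples, while vision-heavy samples are trained against their labels alone, which is one way the student's visual grounding might be left undisturbed.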

