LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation

Abstract

Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by using the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without modifying the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability, while preserving the student's visual grounding on multimodal tasks. As a result, LinguDistill recovers approximately 10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.
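The abstract does not state the training objective, so the following is only a minimal sketch of one way "selective" distillation could be read: a KL term toward the teacher is applied on examples flagged as language-intensive, and plain cross-entropy is used otherwise. The function names, the `is_language_intensive` flag, the mixing weight `alpha`, and the `temperature` are all illustrative assumptions, not details from the paper. The layer-wise KV-cache sharing is not modeled here; `teacher_logits` is simply taken as given.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cross_entropy(logits, target_index):
    """Standard next-token cross-entropy for a single position."""
    probs = softmax(logits)
    return -math.log(probs[target_index] + 1e-12)

def selective_distill_loss(student_logits, teacher_logits, target_index,
                           is_language_intensive, alpha=0.5, temperature=2.0):
    """Hypothetical per-token loss: distill only on language-intensive data.

    Assumption: vision-heavy examples keep the plain cross-entropy loss so
    the student's visual grounding is not pulled toward the text-only teacher.
    """
    ce = cross_entropy(student_logits, target_index)
    if not is_language_intensive:
        return ce
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kd = kl_divergence(p_teacher, p_student)
    return (1 - alpha) * ce + alpha * kd
```

In this sketch the gating is a hard per-example switch; a soft weighting by data source would be an equally plausible reading of the abstract.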
