Is Extending Modality The Right Path Towards Omni-Modality?

Abstract

Omni-modal language models (OLMs) aim to integrate and reason over diverse input modalities, such as text, images, video, and audio, while maintaining strong language capabilities. Despite recent advances, existing models, especially open-source ones, remain far from true omni-modality: they struggle to generalize beyond the specific modality pairs they were trained on and to achieve strong performance on multi-modal inputs. We study the effect of modality extension, the dominant technique for training multimodal models, in which an off-the-shelf language model is fine-tuned on target-domain and language data. Specifically, we investigate three key questions: (1) Does modality extension compromise core language abilities? (2) Can model merging effectively integrate independently fine-tuned modality-specific models to achieve omni-modality? (3) Does omni-modality extension lead to better knowledge sharing and generalization than sequential extension? Through extensive experiments, we analyze these trade-offs and provide insights into the feasibility of achieving true omni-modality with current approaches.
