HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

Abstract

Recent advances indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens using a static vision-language mapper, thereby enabling static LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the static tuning strategy (static tuning refers to a trained model whose parameters are fixed), which shares the same parameters across all inputs, may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which adaptively tunes the projector and LLM parameters in conjunction with a dynamic visual expert and a dynamic language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training. Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. Our project is available at https://github.com/DCDmllm/HyperLLaVA.
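
To make the idea of a hypernetwork-generated parameter shift more concrete, the following is a minimal NumPy sketch of a dynamic projector. It is not the authors' implementation: the toy dimensions, the single linear projector, the mean-pooled visual guidance, the low-rank form of the shift, and all names (`visual_expert`, `dynamic_project`, `init_params`) are assumptions made purely for illustration. The abstract applies the same principle to the LLM with language guidance, which this sketch does not cover.

```python
import numpy as np

# Hypothetical toy sizes: visual feature dim, LLM embedding dim,
# hypernetwork hidden dim, and rank of the generated shift.
D_VIS, D_LLM, D_HID, RANK = 8, 16, 12, 2


def init_params(rng, out_scale=0.1):
    """Random parameters for a static projector plus a small hypernetwork."""
    return {
        "W": rng.normal(0.0, 0.1, (D_LLM, D_VIS)),    # static projector weight
        "H1": rng.normal(0.0, 0.1, (D_HID, D_VIS)),   # hypernetwork hidden layer
        "HA": rng.normal(0.0, out_scale, (D_LLM * RANK, D_HID)),  # head for factor A
        "HB": rng.normal(0.0, out_scale, (RANK * D_VIS, D_HID)),  # head for factor B
    }


def visual_expert(params, feats):
    """Generate an input-dependent weight shift from the image's own features.

    Assumption: guidance is the mean patch feature, and the shift is
    parameterised as a low-rank product A @ B to keep the head small.
    """
    guidance = feats.mean(axis=0)                     # (D_VIS,)
    hidden = np.tanh(params["H1"] @ guidance)         # (D_HID,)
    a = (params["HA"] @ hidden).reshape(D_LLM, RANK)
    b = (params["HB"] @ hidden).reshape(RANK, D_VIS)
    return a @ b                                      # (D_LLM, D_VIS)


def static_project(params, feats):
    """Static mapper: the same weight W is used for every input."""
    return feats @ params["W"].T


def dynamic_project(params, feats):
    """Dynamic mapper: W is shifted by a per-input delta before projecting."""
    delta_w = visual_expert(params, feats)
    return feats @ (params["W"] + delta_w).T


rng = np.random.default_rng(0)
params = init_params(rng)
image_a = rng.normal(size=(5, D_VIS))  # 5 patch features for one image
image_b = rng.normal(size=(5, D_VIS))  # 5 patch features for another image

tokens_a = dynamic_project(params, image_a)  # text-like tokens, shape (5, D_LLM)
tokens_b = dynamic_project(params, image_b)
```

In this sketch each image receives its own effective projector weight, whereas `static_project` applies one shared weight to every input. If the hypernetwork head `HA` is set to zero, the shift vanishes and the dynamic projector reduces to the static one, which is one plausible way to start training from a pretrained static mapper.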
