LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

Abstract

We train a suite of multimodal foundation models (MMFMs) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs). Of particular interest is the 2B-parameter Gemma model, which provides opportunities to construct capable small-scale MMFMs. In line with findings from other papers in this space, we test the effect of ablating three design features: pretraining the connector, utilizing a more powerful image backbone, and increasing the size of the language backbone. The resulting models, which we call LLaVA-Gemma, exhibit moderate performance on an array of evaluations but fail to improve past the current comparably sized SOTA models. Closer analysis of performance shows mixed effects: skipping pretraining tends to reduce performance, larger vision models sometimes improve performance, and increasing language model size has inconsistent effects. We publicly release training recipes, code, and weights for the LLaVA-Gemma models.
