

LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

March 29, 2024
Authors: Musashi Hinck, Matthew L. Olson, David Cobbley, Shao-Yen Tseng, Vasudev Lal
cs.AI

Abstract

We train a suite of multimodal foundation models (MMFM) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs). Of particular interest is the 2B parameter Gemma model, which provides opportunities to construct capable small-scale MMFMs. In line with findings from other papers in this space, we test the effect of ablating three design features: pretraining the connector, utilizing a more powerful image backbone, and increasing the size of the language backbone. The resulting models, which we call LLaVA-Gemma, exhibit moderate performance on an array of evaluations, but fail to improve past the current comparably sized SOTA models. Closer analysis of performance shows mixed effects; skipping pretraining tends to reduce performance, larger vision models sometimes improve performance, and increasing language model size has inconsistent effects. We publicly release training recipes, code, and weights for the LLaVA-Gemma models.
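To make the LLaVA-style design described in the abstract concrete, the sketch below illustrates the general wiring: a vision backbone produces patch features, a small MLP "connector" projects them into the language model's embedding space, and the projected image tokens are concatenated with the text embeddings before the LLM forward pass. The dimensions, module names, and two-layer GELU projector here are illustrative assumptions, not the exact LLaVA-Gemma implementation.

```python
# Minimal illustrative sketch of a LLaVA-style connector (assumed design,
# not the authors' released code).
import torch
import torch.nn as nn


class MLPConnector(nn.Module):
    """Projects vision features into the LLM token-embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)


# Hypothetical sizes: CLIP ViT-L/14-336 style features (1024-d, 576 patches)
# projected into a Gemma-2B-sized embedding space (2048-d).
connector = MLPConnector(vision_dim=1024, llm_dim=2048)

batch, num_patches, seq_len = 2, 576, 32
patch_features = torch.randn(batch, num_patches, 1024)  # from the vision backbone
text_embeds = torch.randn(batch, seq_len, 2048)         # from the LLM's embedding layer

image_tokens = connector(patch_features)                    # (2, 576, 2048)
llm_inputs = torch.cat([image_tokens, text_embeds], dim=1)  # fed to the language backbone
print(llm_inputs.shape)  # torch.Size([2, 608, 2048])
```

The paper's ablations then correspond to choices in this wiring: whether the connector is pretrained before joint tuning, which vision backbone supplies the patch features, and how large the language backbone receiving the concatenated sequence is.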
