

LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

March 29, 2024
Authors: Musashi Hinck, Matthew L. Olson, David Cobbley, Shao-Yen Tseng, Vasudev Lal
cs.AI

Abstract

We train a suite of multimodal foundation models (MMFM) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs). Of particular interest is the 2B parameter Gemma model, which provides opportunities to construct capable small-scale MMFMs. In line with findings from other papers in this space, we test the effect of ablating three design features: pretraining the connector, utilizing a more powerful image backbone, and increasing the size of the language backbone. The resulting models, which we call LLaVA-Gemma, exhibit moderate performance on an array of evaluations, but fail to surpass current comparably sized SOTA models. Closer analysis of performance shows mixed effects: skipping pretraining tends to reduce performance, larger vision models sometimes improve performance, and increasing language model size has inconsistent effects. We publicly release training recipes, code, and weights for the LLaVA-Gemma models.
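The abstract describes a LLaVA-style design: features from a vision backbone are mapped by a small pretrained connector into the embedding space of a Gemma language backbone. The sketch below is a rough PyTorch illustration of that idea only; the module name, the two-layer MLP shape, and the dimensions (vision_dim=1024, llm_dim=2048, 576 patches) are placeholder assumptions for the example, not the released LLaVA-Gemma implementation.

```python
# Minimal sketch of a LLaVA-style connector: image features from a vision
# backbone are projected into the language model's hidden size and prepended
# to the text token embeddings. Dimensions and names are illustrative only.
import torch
import torch.nn as nn


class MLPConnector(nn.Module):
    """Two-layer MLP projecting vision features to the LLM hidden size."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        return self.proj(vision_feats)


# Assumed sizes: a CLIP-ViT-like feature width and a Gemma-2B-like hidden width.
connector = MLPConnector(vision_dim=1024, llm_dim=2048)

batch, num_patches = 2, 576
vision_feats = torch.randn(batch, num_patches, 1024)  # stand-in image features
text_embeds = torch.randn(batch, 32, 2048)            # stand-in token embeddings

# Project the image features and concatenate them ahead of the text tokens
# before the combined sequence is passed to the language backbone.
image_embeds = connector(vision_feats)
llm_inputs = torch.cat([image_embeds, text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([2, 608, 2048])
```

In this setup, "pretraining the connector" corresponds to first training only the projection on image-text pairs with the backbones frozen, before joint instruction tuning.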
