LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model
January 4, 2024
Authors: Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, Jian Tang
cs.AI
Abstract
In this paper, we introduce LLaVA-φ (LLaVA-Phi), an efficient
multi-modal assistant that harnesses the power of the recently advanced small
language model, Phi-2, to facilitate multi-modal dialogues. LLaVA-Phi marks a
notable advancement in the realm of compact multi-modal models. It demonstrates
that even smaller language models, with as few as 2.7B parameters, can
effectively engage in intricate dialogues that integrate both textual and
visual elements, provided they are trained with high-quality corpora. Our model
delivers commendable performance on publicly available benchmarks that
encompass visual comprehension, reasoning, and knowledge-based perception.
Beyond its remarkable performance in multi-modal dialogue tasks, our model
opens new avenues for applications in time-sensitive environments and systems
that require real-time interaction, such as embodied agents. It highlights the
potential of smaller language models to achieve sophisticated levels of
understanding and interaction, while maintaining greater resource
efficiency. The project is available at https://github.com/zhuyiche/llava-phi.