LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model
January 4, 2024
Authors: Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, Jian Tang
cs.AI
Abstract
In this paper, we introduce LLaVA-φ (LLaVA-Phi), an efficient
multi-modal assistant that harnesses the power of the recently advanced small
language model, Phi-2, to facilitate multi-modal dialogues. LLaVA-Phi marks a
notable advancement in the realm of compact multi-modal models. It demonstrates
that even smaller language models, with as few as 2.7B parameters, can
effectively engage in intricate dialogues that integrate both textual and
visual elements, provided they are trained with high-quality corpora. Our model
delivers commendable performance on publicly available benchmarks that
encompass visual comprehension, reasoning, and knowledge-based perception.
Beyond its remarkable performance in multi-modal dialogue tasks, our model
opens new avenues for applications in time-sensitive environments and systems
that require real-time interaction, such as embodied agents. It highlights the
potential of smaller language models to achieve sophisticated levels of
understanding and interaction, while maintaining greater resource
efficiency. The project is available at https://github.com/zhuyiche/llava-phi.
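
The LLaVA-style recipe the abstract alludes to pairs a pretrained vision encoder with a small language model: image patch features are projected into the language model's embedding space and prepended to the text tokens. The sketch below illustrates that idea under stated assumptions. The checkpoint names (openai/clip-vit-large-patch14-336, microsoft/phi-2), the two-layer MLP projector, and the `encode` helper are illustrative choices, not the authors' released implementation; the actual weights and training recipe are in the linked repository.

```python
# Minimal sketch of a LLaVA-style multi-modal pipeline built around a small LM.
# Checkpoint names and the projector design are assumptions for illustration only.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPVisionModel

# Vision encoder (frozen in the LLaVA recipe) and the small language backbone.
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
lm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

# Small MLP that maps visual features into the LM's embedding space.
projector = nn.Sequential(
    nn.Linear(vision.config.hidden_size, lm.config.hidden_size),
    nn.GELU(),
    nn.Linear(lm.config.hidden_size, lm.config.hidden_size),
)

@torch.no_grad()
def encode(image_pixels: torch.Tensor, prompt: str) -> torch.Tensor:
    """Build a joint sequence of visual and text embeddings for the LM."""
    # Visual tokens: CLIP patch embeddings projected to the LM's hidden size.
    patches = vision(pixel_values=image_pixels).last_hidden_state   # (1, n_patches+1, d_vis)
    visual_tokens = projector(patches)                              # (1, n_patches+1, d_lm)
    # Text tokens: embed the prompt with the LM's own embedding table.
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_tokens = lm.get_input_embeddings()(ids)
    # Prepend the visual tokens so the LM conditions on the image while answering.
    return torch.cat([visual_tokens, text_tokens], dim=1)
```

Because the language backbone has only 2.7B parameters, the combined stack stays small enough for the latency-sensitive, real-time settings (such as embodied agents) that the abstract highlights.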