LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model

Abstract

In this paper, we introduce LLaVA-φ (LLaVA-Phi), an efficient multi-modal assistant that harnesses the recently released small language model Phi-2 to facilitate multi-modal dialogues. LLaVA-Phi marks a notable advancement in the realm of compact multi-modal models. It demonstrates that even a small language model with as few as 2.7B parameters can engage effectively in intricate dialogues that integrate both textual and visual elements, provided it is trained on high-quality corpora. Our model delivers commendable performance on publicly available benchmarks that encompass visual comprehension, reasoning, and knowledge-based perception. Beyond its performance on multi-modal dialogue tasks, our model opens new avenues for applications in time-sensitive environments and in systems that require real-time interaction, such as embodied agents. It highlights the potential of smaller language models to achieve sophisticated levels of understanding and interaction while remaining resource-efficient. The project is available at https://github.com/zhuyiche/llava-phi.

