LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model

Abstract

In this paper, we introduce LLaVA-φ (LLaVA-Phi), an efficient multi-modal assistant that harnesses the recently developed small language model Phi-2 to support multi-modal dialogue. LLaVA-Phi marks a notable advance among compact multi-modal models. It demonstrates that even a small language model, with as few as 2.7B parameters, can engage effectively in intricate dialogues that integrate textual and visual elements, provided it is trained on high-quality corpora. Our model delivers commendable performance on publicly available benchmarks covering visual comprehension, reasoning, and knowledge-based perception. Beyond its performance on multi-modal dialogue tasks, the model opens new avenues for applications in time-sensitive environments and systems that require real-time interaction, such as embodied agents. It highlights the potential of smaller language models to achieve sophisticated levels of understanding and interaction while remaining more resource-efficient. The project is available at https://github.com/zhuyiche/llava-phi.
