Imp: Highly Capable Large Multimodal Models for Mobile Devices
May 20, 2024
Authors: Zhenwei Shao, Zhou Yu, Jun Yu, Xuecheng Ouyang, Lihao Zheng, Zhenbiao Gai, Mingyang Wang, Jiajun Ding
cs.AI
Abstract
By harnessing the capabilities of large language models (LLMs), recent large
multimodal models (LMMs) have shown remarkable versatility in open-world
multimodal understanding. Nevertheless, they are usually parameter-heavy and
computation-intensive, thus hindering their applicability in
resource-constrained scenarios. To this end, several lightweight LMMs have been
proposed successively to maximize the capabilities under constrained scale
(e.g., 3B). Despite the encouraging results achieved by these methods, most of
them only focus on one or two aspects of the design space, and the key design
choices that influence model capability have not yet been thoroughly
investigated. In this paper, we conduct a systematic study for lightweight LMMs
from the aspects of model architecture, training strategy, and training data.
Based on our findings, we obtain Imp -- a family of highly capable LMMs at the
2B-4B scales. Notably, our Imp-3B model steadily outperforms all the existing
lightweight LMMs of similar size, and even surpasses the state-of-the-art LMMs
at the 13B scale. With low-bit quantization and resolution reduction
techniques, our Imp model can be deployed on a Qualcomm Snapdragon 8Gen3 mobile
chip with a high inference speed of about 13 tokens/s.
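The abstract credits low-bit quantization as one of the techniques enabling on-device deployment. The paper's exact quantization scheme is not specified here; the sketch below illustrates the general idea with generic symmetric per-channel weight quantization to 4 bits (all function names and the bit width are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def quantize_per_channel(w, n_bits=4):
    """Symmetric per-channel quantization of a weight matrix (illustrative sketch).

    Each output row is scaled so its largest absolute weight maps to the
    maximum positive integer representable in n_bits (e.g. 7 for 4-bit signed).
    """
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float matrix from integers and per-row scales."""
    return q.astype(np.float32) * scale

# Toy demonstration: quantize a random weight matrix and measure the error.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_per_channel(w)
err = np.abs(w - dequantize(q, s)).max()  # bounded by half a quantization step
```

Storing `q` in 4 bits (two values per byte) plus one scale per row shrinks the weights roughly 8x versus float32, which is the kind of reduction that, together with lowered input resolution, makes mobile-chip inference speeds like the reported ~13 tokens/s plausible.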