Phantom of Latent for Large Language and Vision Models
September 23, 2024
Authors: Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro
cs.AI
Abstract
The success of visual instruction tuning has accelerated the development of
large language and vision models (LLVMs). Following the scaling laws of
instruction-tuned large language models (LLMs), LLVMs have further increased
their sizes, reaching 26B, 34B, and even 80B parameters. While this
increase in model size has yielded significant performance gains, it demands
substantially more hardware resources for both training and inference.
Consequently, there naturally exists a strong need for efficient LLVMs that
achieve the performance of larger models while being smaller in size. To meet
this need, we present Phantom, a new efficient LLVM family with model sizes
of 0.5B, 1.8B, 3.8B, and 7B parameters, which significantly enhances
learning capabilities within limited structures. By temporarily increasing the
latent hidden dimension during multi-head self-attention (MHSA), we enable
LLVMs to absorb and understand much more vision-language knowledge in the
latent space without substantially increasing the physical model size. To
maximize this
advantage, we introduce Phantom Optimization (PO) using both autoregressive
supervised fine-tuning (SFT) and a direct preference optimization (DPO)-like
concept, which effectively follows correct answers while eliminating incorrect
and ambiguous ones. Phantom outperforms numerous larger open- and closed-source
LLVMs, positioning itself as a leading solution in the landscape of efficient
LLVMs.
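
To make the architectural idea concrete, below is a minimal PyTorch sketch of temporarily widening the hidden dimension only inside multi-head self-attention and projecting back afterward, so the residual stream (and the model's persistent width) stays small. It is reconstructed from the abstract alone: the module name PhantomAttention, the phantom_dim argument, and the example sizes are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class PhantomAttention(nn.Module):
    """MHSA whose Q/K/V live in an enlarged 'phantom' dimension that is
    projected back down before re-entering the residual stream.
    (Illustrative sketch, not the paper's code.)"""

    def __init__(self, hidden_dim: int, phantom_dim: int, num_heads: int):
        super().__init__()
        assert phantom_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = phantom_dim // num_heads
        # Up-projections: the only place the enlarged dimension appears.
        self.q_proj = nn.Linear(hidden_dim, phantom_dim)
        self.k_proj = nn.Linear(hidden_dim, phantom_dim)
        self.v_proj = nn.Linear(hidden_dim, phantom_dim)
        # Down-projection restores the model's persistent width.
        self.out_proj = nn.Linear(phantom_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # (b, t, phantom_dim) -> (b, num_heads, t, head_dim)
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)  # standard attention math
        out = out.transpose(1, 2).reshape(b, t, -1)
        # The widened dimension exists only transiently inside this block.
        return self.out_proj(out)


# Toy example: a 1024-wide model briefly widened to 2048 inside attention.
layer = PhantomAttention(hidden_dim=1024, phantom_dim=2048, num_heads=16)
print(layer(torch.randn(2, 8, 1024)).shape)  # torch.Size([2, 8, 1024])

The extra parameters live only in the Q/K/V and output projections; because the enlarged dimension never persists outside the attention block, the overall model size stays nearly unchanged, which matches the abstract's claim.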
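
Similarly, the abstract's Phantom Optimization (PO) can be read as an autoregressive SFT loss plus a DPO-style preference term that pushes probability mass away from incorrect and ambiguous answers. The sketch below shows one plausible combination; the function name, the beta and lam weights, and the exact loss form are assumptions inferred from the abstract's wording, not the paper's exact objective.

import torch
import torch.nn.functional as F


def phantom_optimization_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(correct answer), per sequence
    policy_rejected_logps: torch.Tensor,  # log p_theta(incorrect/ambiguous answer)
    ref_chosen_logps: torch.Tensor,       # same two quantities under a frozen
    ref_rejected_logps: torch.Tensor,     # reference model
    sft_nll: torch.Tensor,                # autoregressive SFT loss on the correct answer
    beta: float = 0.1,                    # assumed DPO temperature
    lam: float = 1.0,                     # assumed weighting between the two terms
) -> torch.Tensor:
    # DPO-style term: widen the reward margin between correct and
    # incorrect/ambiguous answers, measured as log-ratios vs. the reference.
    chosen_reward = policy_chosen_logps - ref_chosen_logps
    rejected_reward = policy_rejected_logps - ref_rejected_logps
    dpo_like = -F.logsigmoid(beta * (chosen_reward - rejected_reward)).mean()
    # SFT term keeps the model anchored to the correct answers.
    return sft_nll.mean() + lam * dpo_like


# Toy call with random per-sequence log-probabilities for a batch of 4.
loss = phantom_optimization_loss(
    torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4),
    sft_nll=torch.rand(4),
)
print(loss.item())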