Phantom of Latent for Large Language and Vision Models
September 23, 2024
Authors: Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro
cs.AI
Abstract
The success of visual instruction tuning has accelerated the development of
large language and vision models (LLVMs). Following the scaling laws of
instruction-tuned large language models (LLMs), LLVMs have further increased
their sizes, reaching 26B, 34B, and even 80B parameters. While this
increase in model size has yielded significant performance gains, it demands
substantially more hardware resources for both training and inference.
Consequently, there naturally exists a strong need for efficient LLVMs that
achieve the performance of larger models while being smaller in size. To meet
this need, we present Phantom, a new efficient LLVM family with model sizes
of 0.5B, 1.8B, 3.8B, and 7B parameters, which significantly enhances
learning capabilities within limited structures. By temporarily increasing the
latent hidden dimension during multi-head self-attention (MHSA), we enable
LLVMs to absorb and understand much more vision-language knowledge in the
latent space without substantially increasing the physical model size. To
maximize this
advantage, we introduce Phantom Optimization (PO) using both autoregressive
supervised fine-tuning (SFT) and a direct preference optimization (DPO)-like
concept, which effectively follows correct answers while eliminating incorrect
and ambiguous ones. Phantom outperforms numerous larger open- and closed-source
LLVMs, positioning itself as a leading solution in the landscape of efficient
LLVMs.
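
To make the architectural idea concrete, below is a minimal PyTorch sketch of temporarily widening the hidden dimension only inside multi-head self-attention and projecting back afterward, so the residual stream (and the model's persistent width) stays small. It is reconstructed from the abstract alone: the module name PhantomAttention, the phantom_dim argument, and the example sizes are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class PhantomAttention(nn.Module):
    """MHSA whose Q/K/V live in an enlarged 'phantom' dimension that is
    projected back down before re-entering the residual stream.
    (Illustrative sketch, not the paper's code.)"""

    def __init__(self, hidden_dim: int, phantom_dim: int, num_heads: int):
        super().__init__()
        assert phantom_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = phantom_dim // num_heads
        # Up-projections: the only place the enlarged dimension appears.
        self.q_proj = nn.Linear(hidden_dim, phantom_dim)
        self.k_proj = nn.Linear(hidden_dim, phantom_dim)
        self.v_proj = nn.Linear(hidden_dim, phantom_dim)
        # Down-projection restores the model's persistent width.
        self.out_proj = nn.Linear(phantom_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # (b, t, phantom_dim) -> (b, num_heads, t, head_dim)
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)  # standard attention math
        out = out.transpose(1, 2).reshape(b, t, -1)
        # The widened dimension exists only transiently inside this block.
        return self.out_proj(out)


# Toy example: a 1024-wide model briefly widened to 2048 inside attention.
layer = PhantomAttention(hidden_dim=1024, phantom_dim=2048, num_heads=16)
print(layer(torch.randn(2, 8, 1024)).shape)  # torch.Size([2, 8, 1024])

The extra parameters live only in the Q/K/V and output projections; because the enlarged dimension never persists outside the attention block, the overall model size stays nearly unchanged, which matches the abstract's claim.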
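
Similarly, the abstract's Phantom Optimization (PO) can be read as an autoregressive SFT loss plus a DPO-style preference term that pushes probability mass away from incorrect and ambiguous answers. The sketch below shows one plausible combination; the function name, the beta and lam weights, and the exact loss form are assumptions inferred from the abstract's wording, not the paper's exact objective.

import torch
import torch.nn.functional as F


def phantom_optimization_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(correct answer), per sequence
    policy_rejected_logps: torch.Tensor,  # log p_theta(incorrect/ambiguous answer)
    ref_chosen_logps: torch.Tensor,       # same two quantities under a frozen
    ref_rejected_logps: torch.Tensor,     # reference model
    sft_nll: torch.Tensor,                # autoregressive SFT loss on the correct answer
    beta: float = 0.1,                    # assumed DPO temperature
    lam: float = 1.0,                     # assumed weighting between the two terms
) -> torch.Tensor:
    # DPO-style term: widen the reward margin between correct and
    # incorrect/ambiguous answers, measured as log-ratios vs. the reference.
    chosen_reward = policy_chosen_logps - ref_chosen_logps
    rejected_reward = policy_rejected_logps - ref_rejected_logps
    dpo_like = -F.logsigmoid(beta * (chosen_reward - rejected_reward)).mean()
    # SFT term keeps the model anchored to the correct answers.
    return sft_nll.mean() + lam * dpo_like


# Toy call with random per-sequence log-probabilities for a batch of 4.
loss = phantom_optimization_loss(
    torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4),
    sft_nll=torch.rand(4),
)
print(loss.item())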