AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
May 23, 2024
Authors: Fei Zhao, Taotian Pang, Chunhui Li, Zhen Wu, Junjie Guo, Shangyu Xing, Xinyu Dai
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) are widely regarded as crucial in
the exploration of Artificial General Intelligence (AGI). The core of MLLMs
lies in their capability to achieve cross-modal alignment. To attain this goal,
current MLLMs typically follow a two-phase training paradigm: the pre-training
phase and the instruction-tuning phase. Despite their success, there are
shortcomings in the modeling of alignment capabilities within these models.
Firstly, during the pre-training phase, the model usually assumes that all
image-text pairs are uniformly aligned, but in fact the degree of alignment
between different image-text pairs is inconsistent. Secondly, the instructions
currently used for fine-tuning cover a variety of tasks, and the instructions
for different tasks usually require different levels of alignment capability, but
previous MLLMs overlook these differentiated alignment needs. To tackle these
issues, we propose a new multimodal large language model, AlignGPT. In the
pre-training stage, instead of treating all image-text pairs equally, we assign
different levels of alignment capabilities to different image-text pairs. Then,
in the instruction-tuning phase, we adaptively combine these different levels
of alignment capabilities to meet the dynamic alignment needs of different
instructions. Extensive experimental results show that our model achieves
competitive performance on 12 benchmarks.
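
The two ideas in the abstract, assigning each image-text pair an alignment level and later combining levels adaptively, can be illustrated with a small sketch. The abstract does not say how levels are assigned or combined, so everything below is an assumption made for illustration: a precomputed per-pair similarity score, equal-frequency bucketing into discrete levels, one vector per level, and a softmax-weighted mix of those vectors. The names `assign_alignment_levels` and `combine_alignment` are hypothetical and are not taken from the paper.

```python
import math
from bisect import bisect_right


def assign_alignment_levels(scores, num_levels):
    """Bucket similarity scores into levels 0..num_levels-1.

    Assumption: each image-text pair already has a scalar similarity
    score; equal-frequency bucketing is an illustrative choice only.
    """
    ranked = sorted(scores)
    n = len(ranked)
    # Thresholds sit at the boundaries between equal-sized buckets.
    cuts = [ranked[(k * n) // num_levels] for k in range(1, num_levels)]
    return [bisect_right(cuts, s) for s in scores]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_alignment(level_vectors, gate_logits):
    """Mix per-level vectors with softmax weights.

    Assumption: gate_logits would come from some instruction-dependent
    scorer; here they are supplied directly by the caller.
    """
    weights = softmax(gate_logits)
    dim = len(level_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, level_vectors))
        for d in range(dim)
    ]


if __name__ == "__main__":
    levels = assign_alignment_levels([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], 3)
    print(levels)
    mixed = combine_alignment([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
    print(mixed)
```

In this sketch, pairs with higher scores land in higher levels during pre-training, while the gate lets each instruction draw on several levels at once rather than a single fixed one.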