

AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability

May 23, 2024
作者: Fei Zhao, Taotian Pang, Chunhui Li, Zhen Wu, Junjie Guo, Shangyu Xing, Xinyu Dai
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) are widely regarded as crucial in the exploration of Artificial General Intelligence (AGI). The core of MLLMs lies in their ability to achieve cross-modal alignment. To attain this goal, current MLLMs typically follow a two-phase training paradigm: a pre-training phase and an instruction-tuning phase. Despite their success, these models fall short in how they model alignment capability. First, during pre-training, the model usually assumes that all image-text pairs are uniformly aligned, when in fact the degree of alignment varies across pairs. Second, the instructions currently used for fine-tuning cover a variety of tasks, and the instructions for different tasks usually require different levels of alignment capability, yet previous MLLMs overlook these differentiated alignment needs. To tackle these issues, we propose a new multimodal large language model, AlignGPT. In the pre-training stage, instead of treating all image-text pairs equally, we assign different levels of alignment capability to different image-text pairs. Then, in the instruction-tuning stage, we adaptively combine these different levels of alignment capability to meet the dynamic alignment needs of different instructions. Extensive experimental results show that our model achieves competitive performance on 12 benchmarks.
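The pre-training idea of assigning different alignment levels to different image-text pairs can be sketched as follows. This is a minimal illustration, not the authors' exact recipe: it assumes each pair already has a precomputed image-text similarity score (for example, from a pretrained model such as CLIP) and buckets the pairs into discrete alignment levels by quantile rank.

```python
# Illustrative sketch (assumptions): image-text pairs come with
# precomputed similarity scores; we bin them into `num_levels`
# roughly equal-sized quantile buckets, so higher-similarity pairs
# receive higher alignment levels. The bucketing scheme and level
# count are hypothetical choices, not taken from the paper.

def assign_alignment_levels(similarities, num_levels=4):
    """Map each similarity score to a level in [0, num_levels - 1]
    using quantile (equal-frequency) binning."""
    # Rank pairs from least to most similar.
    order = sorted(range(len(similarities)), key=lambda i: similarities[i])
    levels = [0] * len(similarities)
    bucket_size = len(similarities) / num_levels
    for rank, idx in enumerate(order):
        levels[idx] = min(int(rank / bucket_size), num_levels - 1)
    return levels

# Hypothetical precomputed similarity scores for 8 image-text pairs.
scores = [0.12, 0.85, 0.40, 0.95, 0.33, 0.60, 0.05, 0.77]
levels = assign_alignment_levels(scores, num_levels=4)
# Weakly aligned pairs (e.g., 0.05, 0.12) land in level 0;
# strongly aligned pairs (e.g., 0.85, 0.95) land in level 3.
```

During instruction tuning, these discrete levels could then be adaptively recombined per instruction (e.g., via a learned gate over level embeddings) to match each task's alignment needs, as the abstract describes.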

