u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model
November 9, 2023
Authors: Jinjin Xu, Liwu Xu, Yuzhe Yang, Xiang Li, Yanchun Xie, Yi-Jie Huang, Yaqian Li
cs.AI
Abstract
Recent advances such as LLaVA and Mini-GPT4 have successfully integrated visual information into LLMs, yielding inspiring outcomes and giving rise to a new generation of multi-modal LLMs, or MLLMs. Nevertheless, these methods struggle with hallucinations and mutual interference between tasks. To tackle these problems, we propose u-LLaVA, an efficient and accurate approach that adapts to downstream tasks by using the LLM as a bridge connecting multiple expert models. First, we incorporate a modality-alignment module and multi-task modules into the LLM. Then, we reorganize or rebuild multiple types of public datasets to enable efficient modality alignment and instruction following. Finally, task-specific information is extracted from the trained LLM and provided to the corresponding modules to solve downstream tasks. The overall framework is simple and effective, and it achieves state-of-the-art performance across multiple benchmarks. We also publicly release our model, the generated data, and the code base.
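The abstract describes using the LLM as a bridge: task-specific information is extracted from the trained LLM and handed to expert modules. The sketch below only illustrates that routing pattern under assumed names; `StubLLM`, `StubExpert`, `route_task_token`, and the token id used as a task marker are hypothetical placeholders, not the paper's implementation.

```python
# Minimal conceptual sketch (not the authors' code) of routing a task token's
# hidden state from an LLM to an expert module. All names and dimensions are
# hypothetical placeholders chosen for illustration.
import torch
import torch.nn as nn


class StubLLM(nn.Module):
    """Stands in for the multi-modal LLM; returns one hidden state per token."""

    def __init__(self, vocab_size: int = 1000, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(token_ids)  # (seq_len, hidden)


class StubExpert(nn.Module):
    """Stands in for an expert model (e.g., a segmentation or grounding head)."""

    def __init__(self, hidden: int = 64, out_dim: int = 16):
        super().__init__()
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, task_embedding: torch.Tensor) -> torch.Tensor:
        return self.head(task_embedding)  # task-specific prediction


def route_task_token(llm, experts, token_ids, task_token_id, task_name):
    """Pool the hidden state(s) of a special task token and pass it to an expert."""
    hidden_states = llm(token_ids)
    mask = token_ids == task_token_id
    task_embedding = hidden_states[mask].mean(dim=0)  # average if token repeats
    return experts[task_name](task_embedding)


if __name__ == "__main__":
    llm = StubLLM()
    experts = {"segmentation": StubExpert()}
    # Token id 7 plays the role of a hypothetical "[SEG]"-style task token.
    token_ids = torch.tensor([3, 8, 7, 5])
    pred = route_task_token(llm, experts, token_ids, task_token_id=7,
                            task_name="segmentation")
    print(pred.shape)  # torch.Size([16])
```

In this reading, the LLM never produces the final dense output itself; it only supplies a compact, task-conditioned embedding, and each expert module turns that embedding into its own prediction.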