u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model
November 9, 2023
Authors: Jinjin Xu, Liwu Xu, Yuzhe Yang, Xiang Li, Yanchun Xie, Yi-Jie Huang, Yaqian Li
cs.AI
Abstract
Recent advances such as LLaVA and Mini-GPT4 have successfully integrated visual information into LLMs, yielding inspiring outcomes and giving rise to a new generation of multi-modal LLMs, or MLLMs. Nevertheless, these methods struggle with hallucinations and mutual interference between tasks. To tackle these problems, we propose u-LLaVA, an efficient and accurate approach that adapts to downstream tasks by using the LLM as a bridge connecting multiple expert models. First, we incorporate a modality-alignment module and multi-task modules into the LLM. Then, we reorganize or rebuild multiple types of public datasets to enable efficient modality alignment and instruction following. Finally, task-specific information is extracted from the trained LLM and provided to the corresponding modules to solve downstream tasks. The overall framework is simple and effective, and it achieves state-of-the-art performance across multiple benchmarks. We also publicly release our model, the generated data, and the code base.
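The abstract describes using the LLM as a bridge: task-specific information is extracted from the trained LLM and handed to expert modules. The sketch below only illustrates that routing pattern under assumed names; `StubLLM`, `StubExpert`, `route_task_token`, and the token id used as a task marker are hypothetical placeholders, not the paper's implementation.

```python
# Minimal conceptual sketch (not the authors' code) of routing a task token's
# hidden state from an LLM to an expert module. All names and dimensions are
# hypothetical placeholders chosen for illustration.
import torch
import torch.nn as nn


class StubLLM(nn.Module):
    """Stands in for the multi-modal LLM; returns one hidden state per token."""

    def __init__(self, vocab_size: int = 1000, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(token_ids)  # (seq_len, hidden)


class StubExpert(nn.Module):
    """Stands in for an expert model (e.g., a segmentation or grounding head)."""

    def __init__(self, hidden: int = 64, out_dim: int = 16):
        super().__init__()
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, task_embedding: torch.Tensor) -> torch.Tensor:
        return self.head(task_embedding)  # task-specific prediction


def route_task_token(llm, experts, token_ids, task_token_id, task_name):
    """Pool the hidden state(s) of a special task token and pass it to an expert."""
    hidden_states = llm(token_ids)
    mask = token_ids == task_token_id
    task_embedding = hidden_states[mask].mean(dim=0)  # average if token repeats
    return experts[task_name](task_embedding)


if __name__ == "__main__":
    llm = StubLLM()
    experts = {"segmentation": StubExpert()}
    # Token id 7 plays the role of a hypothetical "[SEG]"-style task token.
    token_ids = torch.tensor([3, 8, 7, 5])
    pred = route_task_token(llm, experts, token_ids, task_token_id=7,
                            task_name="segmentation")
    print(pred.shape)  # torch.Size([16])
```

In this reading, the LLM never produces the final dense output itself; it only supplies a compact, task-conditioned embedding, and each expert module turns that embedding into its own prediction.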