

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

November 9, 2023
Authors: Jinjin Xu, Liwu Xu, Yuzhe Yang, Xiang Li, Yanchun Xie, Yi-Jie Huang, Yaqian Li
cs.AI

Abstract

Recent advances such as LLaVA and Mini-GPT4 have successfully integrated visual information into LLMs, yielding inspiring outcomes and giving rise to a new generation of multi-modal LLMs, or MLLMs. Nevertheless, these methods struggle with hallucinations and mutual interference between tasks. To tackle these problems, we propose an efficient and accurate approach to adapting to downstream tasks by using the LLM as a bridge that connects multiple expert models, namely u-LLaVA. First, we incorporate a modality alignment module and multi-task modules into the LLM. Then, we reorganize or rebuild multi-type public datasets to enable efficient modality alignment and instruction following. Finally, task-specific information is extracted from the trained LLM and provided to different modules for solving downstream tasks. The overall framework is simple, effective, and achieves state-of-the-art performance across multiple benchmarks. We also publicly release our model, the generated data, and the code base.
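The abstract describes a bridge pattern: visual features are aligned into the LLM's embedding space, the LLM processes the multi-modal prompt, and task-specific hidden states are extracted and handed to downstream expert modules. The sketch below is a minimal, hypothetical illustration of that pattern; the module names, dimensions, and the task-token pooling strategy are assumptions for clarity, not the paper's actual implementation.

```python
# Hypothetical sketch of an LLM-as-bridge setup: align vision features to the
# LLM embedding space, then route hidden states at special task tokens to an
# expert head. All shapes and names are illustrative assumptions.
import torch
import torch.nn as nn


class ModalityAligner(nn.Module):
    """Projects vision-encoder features into the LLM token-embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_feats)  # (batch, num_patches, llm_dim)


class TaskBridge(nn.Module):
    """Extracts task-specific hidden states from the LLM and projects them
    into the conditioning space of a downstream expert model."""

    def __init__(self, llm_dim: int, expert_dim: int):
        super().__init__()
        self.task_proj = nn.Linear(llm_dim, expert_dim)

    def forward(self, llm_hidden: torch.Tensor, task_token_pos: torch.Tensor) -> torch.Tensor:
        # Gather the hidden state at each sample's task-token position.
        batch_idx = torch.arange(llm_hidden.size(0))
        task_states = llm_hidden[batch_idx, task_token_pos]  # (batch, llm_dim)
        return self.task_proj(task_states)  # conditioning signal for an expert module


# Toy usage with random tensors standing in for real encoder / LLM outputs.
aligner = ModalityAligner(vision_dim=1024, llm_dim=4096)
bridge = TaskBridge(llm_dim=4096, expert_dim=256)

vision_feats = torch.randn(2, 256, 1024)           # from a frozen vision encoder
visual_tokens = aligner(vision_feats)               # fed to the LLM as soft tokens
llm_hidden = torch.randn(2, 512, 4096)              # last hidden states of the LLM
task_token_pos = torch.tensor([120, 87])            # positions of special task tokens
expert_input = bridge(llm_hidden, task_token_pos)   # passed to a downstream expert
print(expert_input.shape)  # torch.Size([2, 256])
```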