u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

Abstract

Recent advances such as LLaVA and Mini-GPT4 have successfully integrated visual information into large language models (LLMs), yielding inspiring results and giving rise to a new generation of multi-modal LLMs (MLLMs). Nevertheless, these methods struggle with hallucinations and with mutual interference between tasks. To tackle these problems, we propose u-LLaVA, an efficient and accurate approach that adapts to downstream tasks by using the LLM as a bridge to connect multiple expert models. First, we incorporate a modality alignment module and multi-task modules into the LLM. Then, we reorganize or rebuild multi-type public datasets to enable efficient modality alignment and instruction following. Finally, task-specific information is extracted from the trained LLM and provided to the different modules to solve downstream tasks. The overall framework is simple and effective, and it achieves state-of-the-art performance across multiple benchmarks. We also make our model, the generated data, and the code base publicly available.
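The final step, in which task-specific information is taken from the LLM and handed to expert modules, can be illustrated with a minimal sketch. This is not the released implementation: the marker tokens `<seg>` and `<box>`, the `LLMOutput` container, and the toy experts are all hypothetical names chosen for illustration, and the sketch assumes that task information is read from the hidden states at marker-token positions, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical task markers; the abstract does not name the real ones.
SEG_TOKEN = "<seg>"
BOX_TOKEN = "<box>"


@dataclass
class LLMOutput:
    """Stand-in for a trained LLM's output (assumed structure)."""
    tokens: List[str]          # generated tokens
    hidden: List[List[float]]  # one hidden-state vector per token


def extract_task_features(out: LLMOutput, marker: str) -> List[List[float]]:
    """Collect the hidden states at every position where `marker` was generated."""
    return [h for t, h in zip(out.tokens, out.hidden) if t == marker]


def route(
    out: LLMOutput,
    experts: Dict[str, Callable[[List[float]], str]],
) -> Dict[str, List[str]]:
    """Send each marker's hidden states to the expert registered for that marker."""
    results: Dict[str, List[str]] = {}
    for marker, expert in experts.items():
        feats = extract_task_features(out, marker)
        if feats:  # only call an expert when its marker actually appears
            results[marker] = [expert(f) for f in feats]
    return results


# Toy experts standing in for real segmentation / grounding modules.
def toy_segmenter(feature: List[float]) -> str:
    return f"mask(dim={len(feature)})"


def toy_grounder(feature: List[float]) -> str:
    return f"box(dim={len(feature)})"


out = LLMOutput(
    tokens=["The", "cat", SEG_TOKEN, "."],
    hidden=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
)
results = route(out, {SEG_TOKEN: toy_segmenter, BOX_TOKEN: toy_grounder})
print(results)
```

In this sketch only the segmentation expert runs, because `<box>` never appears in the generated tokens. Keeping the experts behind a plain dictionary reflects the "LLM as a bridge" idea: the language model decides which task is requested, and the downstream modules stay independent of one another.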
