MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

Abstract

Multimodal Large Language Models (MLLMs) have advanced significantly. However, the quantity and quality of multimodal instruction data have emerged as major bottlenecks to their progress. Manually creating multimodal instruction data is time-consuming and inefficient, and it struggles to produce instructions of high complexity. Moreover, distilling instruction data from black-box commercial models (e.g., GPT-4o, GPT-4V) often yields simplistic instruction data, which caps performance at the level of those models. Curating diverse and complex instruction data remains a substantial challenge. We propose MMEvol, a novel multimodal instruction data evolution framework that combines fine-grained perception evolution, cognitive reasoning evolution, and interaction evolution. This iterative approach breaks through data quality bottlenecks to generate a complex and diverse image-text instruction dataset, thereby empowering MLLMs with enhanced capabilities. Starting from an initial instruction set, SEED-163K, we use MMEvol to systematically broaden the diversity of instruction types, integrate reasoning steps to enhance cognitive capabilities, and extract detailed information from images to improve visual understanding and robustness. To comprehensively evaluate the effectiveness of our data, we train LLaVA-NeXT on the evolved data and conduct experiments across 13 vision-language tasks. Compared to the baseline trained with the seed data, our approach achieves an average accuracy improvement of 3.1 points and reaches state-of-the-art (SOTA) performance on 9 of these tasks.

