OmniFusion Technical Report

Abstract

Last year, multimodal architectures brought about a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose OmniFusion, a model based on a pretrained LLM and adapters for the visual modality. We evaluated and compared several architecture design principles for better coupling of text and visual data: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternVIT, etc.) and approaches to fusing them, the image encoding method (whole-image or tile encoding), and two 7B LLMs (a proprietary one and the open-source Mistral). Experiments on 8 visual-language benchmarks (VizWiz, Pope, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU) show that the best OmniFusion setup achieves the top score across different VQA tasks in comparison with open-source LLaVA-like solutions. We also present a variety of scenarios in which OmniFusion provides highly detailed answers in different domains: housekeeping, sightseeing, culture, medicine, recognition of handwritten and scanned equations, etc. The Mistral-based OmniFusion model is an open-source solution, with weights and training and inference scripts available at https://github.com/AIRI-Institute/OmniFusion.
