InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding

Abstract

Multimodal Large Language Models (MLLMs) have advanced significantly in recent years. Nevertheless, accurately recognizing and comprehending intricate details in high-resolution images remains a challenge. Although this capability is indispensable for building robust MLLMs, the area remains underinvestigated. To address this, our work introduces InfiMM-HD, a novel architecture specifically designed to process images of different resolutions with low computational overhead, which makes it possible to extend MLLMs to higher-resolution inputs. InfiMM-HD incorporates a cross-attention module and visual windows to reduce computation costs. By combining this architectural design with a four-stage training pipeline, our model attains improved visual perception efficiently and cost-effectively. Empirical study underscores the robustness and effectiveness of InfiMM-HD and opens new avenues for exploration in related areas. Code and models are available at https://huggingface.co/Infi-MM/infimm-hd
