INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

Abstract

With advances in data availability and computing resources, Multimodal Large Language Models (MLLMs) have demonstrated their capabilities across various fields. However, the quadratic complexity of the vision encoder in MLLMs constrains the resolution of input images. Most current approaches mitigate this issue by cropping high-resolution images into smaller sub-images, which are then processed independently by the vision encoder. Although these sub-images capture sufficient local detail, they lack global context and do not interact with one another. To address this limitation, we propose a novel MLLM, INF-LLaVA, designed for effective high-resolution image perception. INF-LLaVA incorporates two innovative components. First, we introduce a Dual-perspective Cropping Module (DCM), which ensures that each sub-image contains continuous details from a local perspective and comprehensive information from a global perspective. Second, we introduce a Dual-perspective Enhancement Module (DEM) to enable the mutual enhancement of global and local features, allowing INF-LLaVA to process high-resolution images effectively by simultaneously capturing detailed local information and comprehensive global context. Extensive ablation studies validate the effectiveness of these components, and experiments on a diverse set of benchmarks demonstrate that INF-LLaVA outperforms existing MLLMs. The code and pretrained model are available at https://github.com/WeihuangLin/INF-LLaVA.
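The contrast between a local and a global cropping perspective can be illustrated with a minimal sketch. This is not the authors' DCM implementation: the function names `local_crops` and `global_crops` are hypothetical, and treating the global perspective as interleaved (strided) sampling is an assumption made only for illustration. A toy image is represented as a list of pixel rows.

```python
def local_crops(image, n):
    """Split a 2D grid into n x n contiguous tiles (local perspective).

    Each tile keeps neighbouring pixels together, so fine detail is
    continuous, but a tile only sees one region of the image.
    Assumes the height and width are divisible by n.
    """
    h, w = len(image), len(image[0])
    th, tw = h // n, w // n
    return [
        [row[j * tw:(j + 1) * tw] for row in image[i * th:(i + 1) * th]]
        for i in range(n)
        for j in range(n)
    ]


def global_crops(image, n):
    """Split a 2D grid into n x n interleaved sub-images (global perspective).

    ASSUMPTION for illustration: each sub-image takes every n-th pixel,
    starting at a different offset, so it spans the whole image at
    reduced density.
    """
    return [
        [row[j::n] for row in image[i::n]]
        for i in range(n)
        for j in range(n)
    ]


# Toy 4x4 "image" whose pixel value encodes its position: value = 4 * row + col.
image = [[4 * r + c for c in range(4)] for r in range(4)]

local = local_crops(image, 2)
global_ = global_crops(image, 2)
```

Under these assumptions both functions return four 2x2 sub-images, so each crop presents the same input size to a vision encoder. The first local crop covers only the top-left corner of the toy image, while the first global crop draws pixels from all four quadrants.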
