

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

July 23, 2024
Authors: Yiwei Ma, Zhibin Wang, Xiaoshuai Sun, Weihuang Lin, Qiang Zhou, Jiayi Ji, Rongrong Ji
cs.AI

Abstract

With advancements in data availability and computing resources, Multimodal Large Language Models (MLLMs) have showcased capabilities across various fields. However, the quadratic complexity of the vision encoder in MLLMs constrains the resolution of input images. Most current approaches mitigate this issue by cropping high-resolution images into smaller sub-images, which are then processed independently by the vision encoder. Despite capturing sufficient local details, these sub-images lack global context and fail to interact with one another. To address this limitation, we propose a novel MLLM, INF-LLaVA, designed for effective high-resolution image perception. INF-LLaVA incorporates two innovative components. First, we introduce a Dual-perspective Cropping Module (DCM), which ensures that each sub-image contains continuous details from a local perspective and comprehensive information from a global perspective. Second, we introduce a Dual-perspective Enhancement Module (DEM) to enable the mutual enhancement of global and local features, allowing INF-LLaVA to effectively process high-resolution images by simultaneously capturing detailed local information and comprehensive global context. Extensive ablation studies validate the effectiveness of these components, and experiments on a diverse set of benchmarks demonstrate that INF-LLaVA outperforms existing MLLMs. Code and pretrained model are available at https://github.com/WeihuangLin/INF-LLaVA.
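To make the dual-perspective cropping idea concrete, the sketch below contrasts a local perspective (contiguous tiles that preserve fine, continuous detail) with a global perspective (strided, interleaved sampling so that every sub-image covers the full field of view at reduced density). This is only an illustrative approximation of what the abstract describes for the DCM, not the authors' implementation; the function names, the even-grid assumption, and the 336-pixel sub-image size are placeholders chosen for the example.

```python
# Illustrative sketch of dual-perspective cropping (assumed, not the paper's code).
# Assumes the image divides evenly into an ny x nx grid of sub-images.
import numpy as np

def local_crops(img: np.ndarray, sub_h: int, sub_w: int):
    """Local perspective: contiguous tiles, each keeping continuous local detail."""
    H, W, _ = img.shape
    ny, nx = H // sub_h, W // sub_w
    return [img[i * sub_h:(i + 1) * sub_h, j * sub_w:(j + 1) * sub_w]
            for i in range(ny) for j in range(nx)]

def global_crops(img: np.ndarray, sub_h: int, sub_w: int):
    """Global perspective: interleaved (strided) sampling, so each sub-image
    spans the whole image and retains global context at lower density."""
    H, W, _ = img.shape
    ny, nx = H // sub_h, W // sub_w
    return [img[i::ny, j::nx] for i in range(ny) for j in range(nx)]

if __name__ == "__main__":
    image = np.random.rand(672, 672, 3)        # a 2x2 grid of 336x336 sub-images
    locals_ = local_crops(image, 336, 336)     # 4 tiles with contiguous detail
    globals_ = global_crops(image, 336, 336)   # 4 interleaved full-coverage views
    print(len(locals_), locals_[0].shape, globals_[0].shape)
```

In this reading, both perspectives yield sub-images of the same size that a standard vision encoder can process, while the paper's DEM would then fuse the resulting local and global features; the fusion step itself is not sketched here.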
