

HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments

August 20, 2024
作者: Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji
cs.AI

Abstract

High-resolution Vision-Language Models (VLMs) have been widely used in multimodal tasks to enhance accuracy by preserving detailed image information. However, these models often generate excessive visual tokens because they encode multiple partitions of the input image. Processing these excessive visual tokens is computationally challenging, especially in resource-constrained environments with commodity GPUs. To support high-resolution images while meeting resource constraints, we propose High-Resolution Early Dropping (HiRED), a token-dropping scheme that operates within a fixed token budget before the Large Language Model (LLM) stage. HiRED can be integrated with existing high-resolution VLMs in a plug-and-play manner, as it requires no additional training while still maintaining superior accuracy. We strategically use the vision encoder's attention in the initial layers to assess the visual content of each image partition and allocate the token budget accordingly. Then, using the attention in the final layer, we select the most important visual tokens from each partition within the allocated budget, dropping the rest. Empirically, when applied to LLaVA-Next-7B on an NVIDIA Tesla P40 GPU, HiRED with a 20% token budget increases token generation throughput by 4.7×, reduces first-token generation latency by 15 seconds, and saves 2.3 GB of GPU memory for a single inference.
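The two-stage procedure described above (allocating the token budget across partitions from early-layer attention, then keeping the top-attended tokens per partition from final-layer attention) can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the function names and the proportional-allocation rule are assumptions, and real attention maps would come from the vision encoder rather than be passed in as arrays.

```python
import numpy as np

def allocate_budget(early_attn, total_budget):
    """Split a fixed token budget across image partitions in proportion
    to each partition's early-layer attention mass (illustrative rule)."""
    scores = np.array([a.sum() for a in early_attn], dtype=float)
    weights = scores / scores.sum()
    budgets = np.floor(weights * total_budget).astype(int)
    # Hand any leftover tokens to the highest-weighted partitions.
    leftover = total_budget - budgets.sum()
    for i in np.argsort(-weights)[:leftover]:
        budgets[i] += 1
    return budgets

def drop_tokens(tokens, final_attn, budgets):
    """Within each partition, keep the top-k visual tokens ranked by
    final-layer attention and drop the rest."""
    kept = []
    for toks, attn, k in zip(tokens, final_attn, budgets):
        top = np.argsort(-attn)[:k]      # indices of the most-attended tokens
        kept.append(toks[np.sort(top)])  # preserve original token order
    return kept

# Toy usage: 2 partitions x 10 tokens, 20% budget (4 of 20 tokens kept).
early = [np.ones(10), 3 * np.ones(10)]           # partition 2 is "busier"
budgets = allocate_budget(early, total_budget=4)  # e.g. [1, 3]
tokens = [np.random.randn(10, 4), np.random.randn(10, 4)]
final = [np.random.rand(10), np.random.rand(10)]
kept = drop_tokens(tokens, final, budgets)
```

Because selection happens before the LLM stage, the LLM only ever sees the reduced token sequence, which is what yields the throughput, latency, and memory savings reported above.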

