HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments

August 20, 2024
Authors: Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji
cs.AI

Abstract

High-resolution Vision-Language Models (VLMs) have been widely used in multimodal tasks to enhance accuracy by preserving detailed image information. However, these models often generate excessive visual tokens due to encoding multiple partitions of the input image. Processing these excessive visual tokens is computationally challenging, especially in resource-constrained environments with commodity GPUs. To support high-resolution images while meeting resource constraints, we propose High-Resolution Early Dropping (HiRED), a token-dropping scheme that operates within a fixed token budget before the Large Language Model (LLM) stage. HiRED can be integrated with existing high-resolution VLMs in a plug-and-play manner, as it requires no additional training while still maintaining superior accuracy. We strategically use the vision encoder's attention in the initial layers to assess the visual content of each image partition and allocate the token budget accordingly. Then, using the attention in the final layer, we select the most important visual tokens from each partition within the allocated budget, dropping the rest. Empirically, when applied to LLaVA-Next-7B on an NVIDIA TESLA P40 GPU, HiRED with a 20% token budget increases token generation throughput by 4.7×, reduces first-token generation latency by 15 seconds, and saves 2.3 GB of GPU memory for a single inference.
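
To make the two-stage selection concrete, below is a minimal PyTorch sketch of the idea described in the abstract: early-layer attention decides how many tokens each image partition deserves, and final-layer attention picks which tokens survive within that budget. The function name, tensor shapes, and the proportional budget split are illustrative assumptions, not the authors' implementation; it assumes the CLS-to-patch attention maps have already been extracted from the vision encoder.

```python
import torch

def hired_token_drop(initial_attn, final_attn, visual_tokens, budget_ratio=0.2):
    """Attention-guided token dropping in the spirit of HiRED (illustrative sketch).

    initial_attn:  (P, N) CLS-to-patch attention from an early vision-encoder layer
    final_attn:    (P, N) CLS-to-patch attention from the final vision-encoder layer
    visual_tokens: (P, N, D) encoded visual tokens, P partitions of N tokens each
    budget_ratio:  fraction of all visual tokens to keep (e.g., 0.2)
    """
    P, N, D = visual_tokens.shape
    total_budget = int(budget_ratio * P * N)

    # Step 1: distribute the token budget across partitions in proportion
    # to how much visual content the early-layer attention assigns them.
    part_scores = initial_attn.sum(dim=1)
    part_budgets = (part_scores / part_scores.sum() * total_budget).long()

    # Step 2: within each partition, keep the top-k tokens ranked by
    # final-layer attention and drop the rest before the LLM stage.
    kept = []
    for p in range(P):
        k = min(max(int(part_budgets[p]), 1), N)
        top_idx = final_attn[p].topk(k).indices
        kept.append(visual_tokens[p, top_idx])
    return torch.cat(kept, dim=0)  # roughly total_budget tokens fed to the LLM
```

Because the selection happens before the LLM stage, the pruned token sequence shrinks the LLM's prefill cost directly, which is where the reported throughput, latency, and memory savings come from.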
