PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding

Abstract

Document understanding and GUI interaction are among the highest-value applications of Vision-Language Models (VLMs), yet they impose an exceptionally heavy computational burden: fine-grained text and small UI elements demand high-resolution inputs that produce tens of thousands of visual tokens. We observe that much of this cost is wasted: across document and GUI benchmarks, only 22% to 71% of image patches are pixel-unique, and the rest are exact duplicates of another patch in the same image. We propose PixelPrune, which exploits this pixel-level redundancy through predictive-coding-based compression, pruning redundant patches before the Vision Transformer (ViT) encoder. Because it operates in pixel space prior to any neural computation, PixelPrune accelerates both the ViT encoder and the downstream LLM, covering the full inference pipeline. The method is training-free, requires no learnable parameters, and supports pixel-lossless compression (τ = 0) as well as controlled lossy compression (τ > 0). Experiments across three model scales on document and GUI benchmarks show that PixelPrune maintains competitive task accuracy while delivering up to a 4.2× inference speedup and a 1.9× training speedup. Code is available at https://github.com/OPPO-Mente-Lab/PixelPrune.
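To make the redundancy statistic concrete, the following is a minimal sketch of how the share of distinct patches in an image could be measured. It is not the PixelPrune algorithm and does not use predictive coding; it only counts exact duplicates. The function name `unique_patch_fraction`, the default patch size of 14, and the choice to crop rather than pad are all assumptions made for illustration, and counting one representative per group of identical patches is only one possible reading of "pixel-unique".

```python
import numpy as np


def unique_patch_fraction(image: np.ndarray, patch: int = 14) -> float:
    """Return the share of non-overlapping patch x patch blocks with distinct pixels.

    Hypothetical helper for illustration; patch size 14 is an assumed default.
    """
    h, w = image.shape[:2]
    # Crop to a whole number of patches (an assumption; a real pipeline may pad or resize).
    h, w = h - h % patch, w - w % patch
    if h == 0 or w == 0:
        raise ValueError("image is smaller than one patch")
    # Split into a grid of patches; the trailing -1 covers grayscale and multi-channel input.
    grid = image[:h, :w].reshape(h // patch, patch, w // patch, patch, -1)
    # One row per patch, holding that patch's flattened pixel values.
    rows = grid.transpose(0, 2, 1, 3, 4).reshape((h // patch) * (w // patch), -1)
    # Exact, byte-for-byte comparison: identical rows collapse to one entry.
    distinct = np.unique(rows, axis=0).shape[0]
    return distinct / rows.shape[0]


# A blank 28 x 28 RGB image has four identical patches, so one in four is distinct.
blank = np.zeros((28, 28, 3), dtype=np.uint8)
print(unique_patch_fraction(blank))  # 0.25
```

On a screenshot or scanned page with large uniform margins and backgrounds, a value well below 1.0 would indicate the kind of redundancy described above.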