Scaling Vision Pre-Training to 4K Resolution
March 25, 2025
Authors: Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, Hongxu Yin
cs.AI
Abstract
High-resolution perception of visual details is crucial for daily tasks.
Current vision pre-training, however, is still limited to low resolutions
(e.g., 378 x 378 pixels) due to the quadratic cost of processing larger images.
We introduce PS3, which scales CLIP-style vision pre-training to 4K resolution
at near-constant cost. Instead of contrastive learning on a global image
representation, PS3 is pre-trained by selectively processing local regions and
contrasting them with local detailed captions, enabling high-resolution
representation learning with greatly reduced computational overhead. The
pre-trained PS3 is able to both encode the global image at low resolution and
selectively process local high-resolution regions based on their saliency or
relevance to a text prompt. When PS3 is applied to a multi-modal LLM (MLLM), the
resulting model, named VILA-HD, significantly improves high-resolution visual
perception compared to baselines without high-resolution vision pre-training
such as AnyRes and S^2, while using up to 4.3x fewer tokens. PS3 also unlocks
appealing scaling properties of VILA-HD, including scaling up resolution for
free and scaling up test-time compute for better performance. Compared to the
state of the art, VILA-HD outperforms previous MLLMs such as NVILA and Qwen2-VL
across multiple benchmarks and achieves better efficiency than the latest
token-pruning approaches. Finally, we find that current benchmarks do not require
4K-resolution perception, which motivates us to propose 4KPro, a new benchmark
of image QA at 4K resolution, on which VILA-HD outperforms all previous MLLMs,
including a 14.5% improvement over GPT-4o, and a 3.2% improvement and 2.96x
speedup over Qwen2-VL.
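
To make the pre-training objective concrete, here is a minimal sketch of the local-region contrastive loss described in the abstract, written as a CLIP-style symmetric InfoNCE over (local crop, local caption) pairs. The function name, tensor shapes, and temperature value are illustrative assumptions; how PS3 actually selects regions and pairs them with detailed captions is not reproduced here.

```python
# A minimal sketch, assuming a CLIP-like setup with paired embeddings
# already computed by separate (hypothetical) region and text encoders.
import torch
import torch.nn.functional as F

def local_contrastive_loss(region_emb: torch.Tensor,
                           caption_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over (local high-res region, local detailed caption) pairs.

    region_emb:  (B, D) embeddings of selected high-resolution local crops.
    caption_emb: (B, D) embeddings of the matching local detailed captions.
    """
    region_emb = F.normalize(region_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    logits = region_emb @ caption_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss (region->caption and caption->region), as in CLIP.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Because the loss only ever sees fixed-size local crops rather than the full 4K image, the per-step compute does not grow quadratically with input resolution, which is the point of the near-constant-cost claim above.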
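The inference-time behavior, encoding the whole image cheaply at low resolution and then selectively processing only the high-resolution regions relevant to a prompt, can likewise be sketched as a top-k selection over a coarse grid of crops. The grid size, scorer interface, and top_k value below are hypothetical illustrations, not the released PS3 interface.

```python
# A minimal sketch, assuming a caller-supplied scorer that embeds a crop
# cheaply (e.g., from a low-resolution pass) into the prompt embedding space.
import torch
import torch.nn.functional as F

def select_regions(image_4k: torch.Tensor,   # (3, H, W) high-resolution input
                   region_scorer,            # callable: crop -> (D,) embedding
                   prompt_emb: torch.Tensor, # (D,) embedding of the text prompt
                   grid: int = 8,
                   top_k: int = 4):
    """Score a grid x grid set of local crops by relevance to the prompt
    and return only the top-k crops; only these would be encoded at high
    resolution, keeping compute roughly flat as resolution grows."""
    _, H, W = image_4k.shape
    ch, cw = H // grid, W // grid
    crops, scores = [], []
    for i in range(grid):
        for j in range(grid):
            crop = image_4k[:, i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
            emb = region_scorer(crop)  # cheap scoring pass (assumed)
            scores.append(F.cosine_similarity(emb, prompt_emb, dim=0))
            crops.append(crop)
    top = torch.topk(torch.stack(scores), k=top_k).indices
    return [crops[i] for i in top]
```

Swapping the prompt-similarity score for a learned saliency score would give the prompt-free variant mentioned in the abstract; either way, only the selected crops pay the high-resolution encoding cost.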