Scaling Vision Pre-Training to 4K Resolution

March 25, 2025
Authors: Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, Hongxu Yin
cs.AI

Abstract

High-resolution perception of visual details is crucial for daily tasks. Current vision pre-training, however, is still limited to low resolutions (e.g., 378 x 378 pixels) due to the quadratic cost of processing larger images. We introduce PS3, which scales CLIP-style vision pre-training to 4K resolution at a near-constant cost. Instead of contrastive learning on global image representations, PS3 is pre-trained by selectively processing local regions and contrasting them with local detailed captions, enabling high-resolution representation learning with greatly reduced computational overhead. The pre-trained PS3 can both encode the global image at low resolution and selectively process local high-resolution regions based on their saliency or their relevance to a text prompt. When PS3 is applied to a multimodal LLM (MLLM), the resulting model, named VILA-HD, significantly improves high-resolution visual perception compared to baselines without high-resolution vision pre-training, such as AnyRes and S^2, while using up to 4.3x fewer tokens. PS3 also unlocks appealing scaling properties of VILA-HD, including scaling up resolution for free and scaling up test-time compute for better performance. Compared to the state of the art, VILA-HD outperforms previous MLLMs such as NVILA and Qwen2-VL across multiple benchmarks and achieves better efficiency than the latest token-pruning approaches. Finally, we find that current benchmarks do not require 4K-resolution perception, which motivates us to propose 4KPro, a new benchmark for image QA at 4K resolution, on which VILA-HD outperforms all previous MLLMs, including a 14.5% improvement over GPT-4o, and a 3.2% improvement and a 2.96x speedup over Qwen2-VL.
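
To see why selective processing matters, note that with a typical ViT patch size of 14, a 378 x 378 image yields 27 x 27 = 729 tokens, whereas a ~4K image yields on the order of 100x more tokens and roughly 10,000x the self-attention cost. Below is a minimal, hypothetical sketch of the selective high-resolution scheme the abstract describes. It is not the authors' implementation; every name in it (ToyEncoder, select_top_regions, local_contrastive_loss, the 256-pixel region size, top-k = 8) is an illustrative assumption. A cheap low-resolution pass scores candidate regions, only the top-scoring regions are cropped and encoded at full resolution, and each region embedding is contrasted with the embedding of its local caption.

```python
# Minimal, hypothetical sketch of PS3-style selective high-res pre-training.
# All module and function names are illustrative assumptions, not the paper's API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Stand-in for a real vision backbone: pools the image and projects it."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(3, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, 3, H, W)
        pooled = x.mean(dim=(2, 3))                       # global average pool -> (B, 3)
        return F.normalize(self.proj(pooled), dim=-1)     # unit-norm embeddings

def select_top_regions(image: torch.Tensor, scores: torch.Tensor,
                       k: int, region: int = 256) -> torch.Tensor:
    """Crop the k highest-scoring high-res regions from a (3, H, W) image.
    `scores` is a (grid, grid) saliency / text-relevance map over regions."""
    grid = scores.shape[0]
    top = scores.flatten().topk(k).indices
    crops = []
    for idx in top:
        r, c = divmod(idx.item(), grid)
        crops.append(image[:, r * region:(r + 1) * region,
                              c * region:(c + 1) * region])
    return torch.stack(crops)                             # (k, 3, region, region)

def local_contrastive_loss(region_emb: torch.Tensor,
                           caption_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE: each region should match its own local caption."""
    logits = region_emb @ caption_emb.t() / temperature   # (k, k) similarity matrix
    targets = torch.arange(logits.shape[0])               # diagonal is the positive pair
    return F.cross_entropy(logits, targets)

# --- one pre-training step on a single 4K image (sketch) ---
image = torch.rand(3, 4096, 4096)                 # a 4K-resolution image
grid = 4096 // 256                                # 16 x 16 grid of candidate regions
saliency = torch.rand(grid, grid)                 # placeholder region scores

encoder = ToyEncoder()
low_res = F.interpolate(image[None], size=378, mode="bilinear")
global_emb = encoder(low_res)                     # cheap global pass at low resolution

crops = select_top_regions(image, saliency, k=8)  # only 8 regions go through high res
region_emb = encoder(crops)                       # (8, dim) high-res region embeddings
caption_emb = F.normalize(torch.randn(8, 256), dim=-1)  # stand-in local-caption embeddings

loss = local_contrastive_loss(region_emb, caption_emb)
print(f"local contrastive loss: {loss.item():.3f}")
```

The design point mirrored here is that compute grows with the number of selected regions k, not with the full 4K token count, which is what keeps the pre-training cost near-constant as resolution scales.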
