PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding
May 28, 2026
Authors: Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari, Muhammad Ferjad Naeem
cs.AI
Abstract
Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.
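The split between pool anchors and conditioned queries can be illustrated with a toy sketch. The abstract does not specify how Pool-Conditioned Query Resampling is implemented, so everything below is an assumption made for illustration: the function names (`avg_pool_tokens`, `cross_attend`, `parcel_like_tokens`), the use of plain average pooling, the projection-free single-head attention, the choice to condition queries by letting them attend to the pool tokens first, and the prefix-style nesting of queries. It is not the authors' architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q, kv):
    """Single-head scaled dot-product attention, no learned projections (simplification)."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ kv

def avg_pool_tokens(patches, grid, out):
    """Average-pool a row-major (grid*grid, d) patch grid down to (out*out, d) tokens."""
    d = patches.shape[-1]
    s = grid // out  # pooling window side; assumes grid is divisible by out
    x = patches.reshape(out, s, out, s, d)
    return x.mean(axis=(1, 3)).reshape(out * out, d)

def parcel_like_tokens(patches, queries, grid, pool_side, num_queries):
    """Hypothetical pool-anchored resampler: returns [pool tokens; conditioned query tokens]."""
    pool = avg_pool_tokens(patches, grid, pool_side)  # grid-aligned, low-frequency layout anchors
    q = queries[:num_queries]                          # elastic budget: take a prefix of the queries
    q = q + cross_attend(q, pool)                      # assumed conditioning step: queries read the anchors
    q = q + cross_attend(q, patches)                   # queries then resample the dense patch features
    return np.concatenate([pool, q], axis=0)

# Toy usage with random stand-ins for encoder patches and learned query embeddings.
rng = np.random.default_rng(0)
grid, d = 24, 64
patches = rng.standard_normal((grid * grid, d))
queries = rng.standard_normal((48, d))
budgets = {k: parcel_like_tokens(patches, queries, grid, pool_side=4, num_queries=k)
           for k in (8, 16, 48)}
for k, tokens in budgets.items():
    print(k, tokens.shape)
```

In this toy setup, 576 patch tokens are reduced to 16 pool tokens plus 8, 16, or 48 query tokens, depending on the budget. Because each query is processed independently and smaller budgets use a prefix of the same queries, the token set at a small budget is a prefix of the token set at a larger one. That nesting is a property of this sketch only; whether PARCEL's queries interact with each other is not stated in the abstract.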