PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Abstract

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, which imposes a quadratic computational bottleneck at inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that blurs fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically divides the work of feature extraction between two kinds of tokens. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages the query tokens to focus on complementary visual features rather than on redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka (nested) baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.
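The abstract does not spell out how the pool anchors and conditioned queries are computed, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes average pooling for the anchors, a nested prefix of a shared query bank for elasticity, and plain single-head attention with no learned projections; the function names (`avg_pool_tokens`, `cross_attend`, `parcel_like_tokens`) and all shapes are hypothetical.

```python
import numpy as np


def avg_pool_tokens(feats, grid, pooled):
    """Average-pool a (grid*grid, d) token map down to (pooled*pooled, d)."""
    d = feats.shape[-1]
    s = grid // pooled  # pooling stride; assumes grid is divisible by pooled
    x = feats.reshape(pooled, s, pooled, s, d)
    return x.mean(axis=(1, 3)).reshape(pooled * pooled, d)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(q, kv):
    """Single-head scaled dot-product attention (assumption: no projections)."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ kv


def parcel_like_tokens(feats, grid, pooled, query_bank, num_queries):
    """Return pool anchors followed by anchor-conditioned query tokens."""
    anchors = avg_pool_tokens(feats, grid, pooled)  # low-frequency layout tokens
    q = query_bank[:num_queries]       # nested prefix gives an elastic budget
    q = q + cross_attend(q, anchors)   # condition the queries on the anchors
    q = q + cross_attend(q, feats)     # resample from full-resolution features
    return np.concatenate([anchors, q], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((24 * 24, 64))   # hypothetical 24x24 patch grid
    query_bank = rng.standard_normal((64, 64))   # hypothetical shared query bank
    for k in (4, 16, 64):
        tokens = parcel_like_tokens(feats, grid=24, pooled=6,
                                    query_bank=query_bank, num_queries=k)
        print(k, tokens.shape)  # 36 anchors plus k query tokens
```

In this sketch the same query bank serves every budget, so changing `num_queries` at inference time changes the visual-token count without retraining, which is the sense in which one model covers multiple budgets.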