PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Abstract

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.
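
The division of labor described above can be illustrated with a minimal sketch. The abstract does not specify the layer design, so everything here is an assumption made for illustration: average pooling produces the anchors, the elastic budget is a nested prefix of a query bank (matryoshka-style), conditioning is a single cross-attention from queries to anchors with a residual add, and attention is single-head with no learned projections. The function names (`avg_pool`, `attend`, `parcel_like_tokens`) are hypothetical and are not taken from the paper.

```python
import math
import random


def avg_pool(grid, k):
    """Average-pool an H x W grid of D-dim features with a k x k window, stride k.

    Assumes H and W are divisible by k. Returns (H//k * W//k) grid-aligned tokens.
    """
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for i in range(0, h, k):
        for j in range(0, w, k):
            cell = [grid[a][b] for a in range(i, i + k) for b in range(j, j + k)]
            pooled.append([sum(v[c] for v in cell) / len(cell) for c in range(d)])
    return pooled


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, key)) for key in keys]
        weights = softmax(scores)
        out.append([sum(wt * v[c] for wt, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


def parcel_like_tokens(grid, query_bank, pool_k, num_queries):
    """Return pooled anchor tokens followed by anchor-conditioned query tokens."""
    anchors = avg_pool(grid, pool_k)                 # low-frequency layout anchors
    patches = [v for row in grid for v in row]       # full-resolution patch features
    active = query_bank[:num_queries]                # nested prefix = elastic budget
    # Assumed conditioning step: each query reads the anchors (residual add),
    # so it knows what the pooled tokens already encode.
    context = attend(active, anchors, anchors)
    conditioned = [[a + b for a, b in zip(q, c)] for q, c in zip(active, context)]
    # Conditioned queries then resample from the full-resolution features.
    resampled = attend(conditioned, patches, patches)
    return anchors + resampled


# Toy example: a 4 x 4 grid of 8-dim features and a bank of 6 queries.
random.seed(0)
D = 8
grid = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(4)] for _ in range(4)]
query_bank = [[random.gauss(0, 1) for _ in range(D)] for _ in range(6)]
tokens = parcel_like_tokens(grid, query_bank, pool_k=2, num_queries=3)
```

In this toy setting the visual-token budget is the number of pooled anchors plus the number of active queries, so varying `pool_k` and `num_queries` at inference time changes the budget without touching the query bank. A trained model would add learned projections and a training objective that covers all budgets, which this sketch omits.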