PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Abstract

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, which creates a quadratic computational bottleneck at inference time. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically divides the work of feature extraction between two token types. PARCEL establishes spatial pool tokens as low-frequency layout anchors and, through Pool-Conditioned Query Resampling, generates elastic query tokens conditioned on these anchors. This encourages the query tokens to focus on complementary visual features rather than on redundant spatial mapping. Extensive evaluation across 27 benchmarks shows that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.
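To make the division of labor between pool tokens and query tokens concrete, the following is a minimal NumPy sketch, not the PARCEL implementation. The abstract does not specify how queries are conditioned on anchors, so every detail here is an assumption made for illustration: average pooling for the anchors, an additive shift of the queries by a projected mean of the anchors, single-head cross-attention without learned key/value projections, and all names (`pool_anchors`, `conditioned_queries`, `tokenize`, `w_cond`).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pool_anchors(feats, grid, out):
    """Average-pool a (grid*grid, d) row-major patch grid to (out*out, d) anchors.

    Assumes grid is divisible by out. Average pooling stands in for the
    low-frequency "layout anchor" role described in the abstract.
    """
    d = feats.shape[-1]
    k = grid // out  # side length of each pooling window
    x = feats.reshape(out, k, out, k, d)
    return x.mean(axis=(1, 3)).reshape(out * out, d)

def conditioned_queries(feats, anchors, base_queries, w_cond, n_queries):
    """Cross-attend the first n_queries queries to the patch features.

    Hypothetical conditioning: each query is shifted by a linear projection
    of the mean anchor. The actual mechanism may differ.
    """
    q = base_queries[:n_queries] + anchors.mean(axis=0) @ w_cond
    d = feats.shape[-1]
    attn = softmax(q @ feats.T / np.sqrt(d), axis=-1)  # (n_queries, grid*grid)
    return attn @ feats                                # (n_queries, d)

def tokenize(feats, grid, out, base_queries, w_cond, n_queries):
    """Return anchors followed by n_queries conditioned query tokens."""
    anchors = pool_anchors(feats, grid, out)
    queries = conditioned_queries(feats, anchors, base_queries, w_cond, n_queries)
    return np.concatenate([anchors, queries], axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid, d = 24, 64
    feats = rng.standard_normal((grid * grid, d))
    base_queries = rng.standard_normal((32, d))
    w_cond = rng.standard_normal((d, d)) / np.sqrt(d)
    # One set of weights, several visual-token budgets.
    for n_queries in (4, 16, 32):
        tokens = tokenize(feats, grid, 4, base_queries, w_cond, n_queries)
        print(n_queries, tokens.shape)
```

In this sketch, lowering the budget only truncates the query set, so the tokens produced at a small budget are a prefix of those produced at a larger one. This mirrors the nested (matryoshka-style) behavior the abstract refers to, under the assumptions stated above.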