PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Abstract

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, which creates a quadratic computational bottleneck at inference time. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, acts as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically divides the labor of feature extraction between two token types. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling, which encourages the query tokens to capture complementary visual features rather than duplicate the spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.
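
The abstract names the two token types but does not spell out how Pool-Conditioned Query Resampling is computed, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes average pooling for the pool tokens, a single-head cross-attention with no learned projections, and a nested prefix of a shared query bank to realize different token budgets. All function and parameter names (`avg_pool_grid`, `cross_attend`, `parcel_like_tokens`, `query_bank`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def avg_pool_grid(feats, grid, out):
    """Average-pool a row-major (grid*grid, d) feature map to (out*out, d).

    Assumption: `grid` is divisible by `out`.
    """
    assert grid % out == 0, "grid must be divisible by out"
    d = feats.shape[-1]
    k = grid // out  # side length of each pooling window
    windows = feats.reshape(out, k, out, k, d)
    return windows.mean(axis=(1, 3)).reshape(out * out, d)


def cross_attend(queries, keys_values):
    """Single-head scaled dot-product cross-attention without projections."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ keys_values


def parcel_like_tokens(feats, grid, pool_side, num_queries, query_bank):
    """Return [pool tokens ; query tokens] for one visual-token budget.

    Hypothetical reading of the abstract:
      1. pool tokens act as coarse, grid-aligned layout anchors;
      2. the first `num_queries` entries of a shared query bank are
         conditioned on those anchors (residual cross-attention);
      3. the conditioned queries then resample the full-resolution features.
    """
    pool = avg_pool_grid(feats, grid, pool_side)
    queries = query_bank[:num_queries]  # nested prefix = elastic budget
    conditioned = queries + cross_attend(queries, pool)
    resampled = cross_attend(conditioned, feats)
    return np.concatenate([pool, resampled], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8 * 8, 16))  # stand-in for an 8x8 patch grid
    bank = rng.normal(size=(8, 16))       # stand-in for learned queries
    for n in (2, 4, 8):
        tokens = parcel_like_tokens(feats, 8, 2, n, bank)
        print(n, tokens.shape)
```

In this sketch each query is processed independently, so the tokens produced at a smaller budget are a prefix of those produced at a larger one. That is one simple way to obtain the nested, train-once behavior the abstract describes; the actual model may achieve it differently.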