Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Abstract

Multimodal large language models (MLLMs) require high-resolution visual inputs for fine-grained tasks such as document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries that demand fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. Using Qwen2.5-VL-7B as the primary testbed, Q-Zoom accelerates inference by 2.52× on Document & OCR benchmarks and by 4.39× in High-Resolution scenarios while matching the baseline's peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline's peak performance by 1.1% and 8.1% on these two benchmark groups, respectively. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. The project page is available at https://yuhengsss.github.io/Q-Zoom/.
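The coarse-to-fine routing described above can be illustrated with a small sketch. It is not the authors' implementation: the gate score and the relevance map are taken as plain inputs rather than produced by learned networks, and every name here (`needs_zoom`, `roi_from_relevance`, `route`, `visual_token_counts`), the 0.5 thresholds, and the upscaling factor are hypothetical choices made only for illustration.

```python
from typing import List, Optional, Tuple

# (row_start, col_start, row_end, col_end), end indices exclusive
Box = Tuple[int, int, int, int]


def needs_zoom(gate_score: float, threshold: float = 0.5) -> bool:
    """Hypothetical gate: request high-resolution processing only above a threshold."""
    return gate_score >= threshold


def roi_from_relevance(relevance: List[List[float]], keep: float = 0.5) -> Optional[Box]:
    """Bounding box around grid cells whose relevance is at least keep * peak."""
    peak = max(max(row) for row in relevance)
    if peak <= 0:
        return None
    cutoff = keep * peak
    cells = [
        (r, c)
        for r, row in enumerate(relevance)
        for c, value in enumerate(row)
        if value >= cutoff
    ]
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return (min(rows), min(cols), max(rows) + 1, max(cols) + 1)


def route(gate_score: float, relevance: List[List[float]]) -> Optional[Box]:
    """Return None to answer from coarse features, or an RoI to re-encode finely."""
    if not needs_zoom(gate_score):
        return None
    return roi_from_relevance(relevance)


def visual_token_counts(grid_h: int, grid_w: int, roi: Box, scale: int) -> Tuple[int, int]:
    """Token counts for global upscaling versus coarse view plus upscaled RoI."""
    coarse = grid_h * grid_w
    roi_area = (roi[2] - roi[0]) * (roi[3] - roi[1])
    global_upscaled = coarse * scale * scale
    coarse_plus_roi = coarse + roi_area * scale * scale
    return global_upscaled, coarse_plus_roi


# Toy 4x4 relevance map with a small hot region (made-up values).
relevance_map = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.1],
    [0.1, 0.1, 0.8, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
easy_query = route(0.2, relevance_map)   # gate says coarse features suffice
hard_query = route(0.9, relevance_map)   # gate asks for a zoomed-in RoI
if hard_query is not None:
    print(visual_token_counts(4, 4, hard_query, scale=2))
```

Because self-attention cost grows roughly with the square of the sequence length, the gap between the two token counts understates the difference in attention compute; the toy numbers here say nothing about the speedups reported in the abstract.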
