Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) require high-resolution visual inputs for fine-grained tasks such as document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, which severely bottlenecks inference throughput and ignores both spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries that demand fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. With Qwen2.5-VL-7B as the primary testbed, Q-Zoom accelerates inference by 2.52× on Document & OCR benchmarks and by 4.39× in High-Resolution scenarios while matching the baseline's peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline's peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models.
The project page is available at https://yuhengsss.github.io/Q-Zoom/.
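
The coarse-to-fine control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `route_query`, `num_tokens`, the 0.5 gate threshold, the 28-pixel patch size, and all image sizes are hypothetical placeholders chosen for illustration. The sketch only shows why gating plus a cropped RoI can keep the visual token count far below that of global high-resolution scaling.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Route:
    needs_zoom: bool      # True if the gate requested high-resolution processing
    roi: Optional[Box]    # region to re-encode at high resolution, if any


def route_query(gate_score: float,
                propose_roi: Callable[[], Box],
                threshold: float = 0.5) -> Route:
    """Coarse-to-fine routing: skip the zoom step when the gate score is low.

    gate_score stands in for a gating network's output and propose_roi for a
    region proposal step; both are hypothetical placeholders.
    """
    if gate_score < threshold:
        return Route(needs_zoom=False, roi=None)
    return Route(needs_zoom=True, roi=propose_roi())


def num_tokens(width: int, height: int, patch: int = 28) -> int:
    """Visual tokens for an image, assuming one token per patch x patch block."""
    return (width // patch) * (height // patch)


# Illustrative numbers only, not taken from the paper.
coarse = num_tokens(448, 448)        # low-resolution global view
full = num_tokens(1792, 1792)        # global high-resolution scaling
easy = route_query(0.2, lambda: (0, 0, 448, 448))          # gate says coarse is enough
hard = route_query(0.9, lambda: (896, 896, 1344, 1344))    # gate requests a zoom
left, top, right, bottom = hard.roi
zoomed = coarse + num_tokens(right - left, bottom - top)   # global view + RoI crop
print(coarse, full, zoomed)  # prints "256 4096 512"
```

Under these assumed sizes, the zoomed path uses 512 visual tokens instead of 4096. Because self-attention cost grows quadratically with sequence length, the saving in attention compute is larger than the 8× reduction in token count suggests.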
