SeGPruner: Semantic-Geometric Visual Token Pruner for 3D Question Answering

Abstract

Vision-language models (VLMs) have been widely adopted for 3D question answering (3D QA). In typical pipelines, visual tokens extracted from multiple viewpoints are concatenated with language tokens and jointly processed by a large language model (LLM) for inference. However, aggregating multi-view observations inevitably introduces severe token redundancy, leading to an overly large visual token set that significantly hinders inference efficiency under constrained token budgets. Visual token pruning has emerged as a prevalent strategy to address this issue. Nevertheless, most existing pruners are primarily tailored to 2D inputs or rely on indirect geometric cues, which limits their ability to explicitly retain semantically critical objects and maintain sufficient spatial coverage for robust 3D reasoning. In this paper, we propose SeGPruner, a semantic-aware and geometry-guided token reduction framework for efficient 3D QA with multi-view images. Specifically, SeGPruner first preserves semantically salient tokens through an attention-based importance module (Saliency-aware Token Selector), ensuring that object-critical evidence is retained. It then complements these tokens with spatially diverse ones via a geometry-guided selector (Geometry-aware Token Diversifier), which jointly considers semantic relevance and 3D geometric distance. This cooperation between saliency preservation and geometry-guided diversification balances object-level evidence and global scene coverage under aggressive token reduction. Extensive experiments on ScanQA and OpenEQA demonstrate that SeGPruner substantially improves inference efficiency, reducing the visual token budget by 91% and inference latency by 86%, while maintaining competitive performance in 3D reasoning tasks.
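The two-stage idea described above (keep salient tokens first, then add spatially diverse ones) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how saliency, semantic relevance, or 3D positions are computed, nor how relevance and distance are combined. The function name `prune_tokens`, the parameters `salient_frac` and `alpha`, the weighted-sum score, and the greedy farthest-point-style selection are all assumptions made for illustration.

```python
import numpy as np


def prune_tokens(saliency, relevance, xyz, budget, salient_frac=0.5, alpha=0.5):
    """Illustrative two-stage token pruning (hypothetical, not the paper's code).

    saliency  : (N,) importance score per visual token (assumed precomputed,
                e.g. from attention weights)
    relevance : (N,) semantic relevance score per token (assumed precomputed)
    xyz       : (N, 3) 3D position per token (assumed back-projected from depth)
    budget    : total number of tokens to keep
    Returns the sorted indices of the kept tokens.
    """
    n = saliency.shape[0]
    budget = min(budget, n)
    n_salient = min(budget, max(1, int(round(budget * salient_frac))))

    # Stage 1 (saliency-aware selection): keep the highest-saliency tokens.
    order = np.argsort(-saliency, kind="stable")
    kept = [int(i) for i in order[:n_salient]]
    selected = np.zeros(n, dtype=bool)
    selected[kept] = True

    # Distance from every token to its nearest already-kept token.
    diffs = xyz[:, None, :] - xyz[kept][None, :, :]
    min_dist = np.linalg.norm(diffs, axis=2).min(axis=1)

    # Rescale relevance to [0, 1] so it is comparable with normalized distance.
    rel = (relevance - relevance.min()) / (np.ptp(relevance) + 1e-12)

    # Stage 2 (geometry-aware diversification): greedily add the token that
    # best trades off semantic relevance against distance to the kept set.
    # The weighted sum below is an assumed scoring rule.
    while len(kept) < budget:
        dist = min_dist / (min_dist.max() + 1e-12)
        score = alpha * rel + (1.0 - alpha) * dist
        score[selected] = -np.inf  # never pick a token twice
        j = int(np.argmax(score))
        kept.append(j)
        selected[j] = True
        min_dist = np.minimum(min_dist, np.linalg.norm(xyz - xyz[j], axis=1))

    return np.sort(np.array(kept))
```

In this sketch, `alpha` close to 1 favors semantically relevant tokens in the second stage, while `alpha` close to 0 reduces it to farthest-point sampling over 3D positions, which spreads the remaining budget across the scene.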

