SeGPruner: Semantic-Geometric Visual Token Pruner for 3D Question Answering

Abstract

Vision-language models (VLMs) have been widely adopted for 3D question answering (3D QA). In typical pipelines, visual tokens extracted from multiple viewpoints are concatenated with language tokens and jointly processed by a large language model (LLM) for inference. However, aggregating multi-view observations inevitably introduces severe token redundancy, leading to an overly large visual token set that significantly hinders inference efficiency under constrained token budgets. Visual token pruning has emerged as a prevalent strategy to address this issue. Nevertheless, most existing pruners are primarily tailored to 2D inputs or rely on indirect geometric cues, which limits their ability to explicitly retain semantically critical objects and maintain sufficient spatial coverage for robust 3D reasoning. In this paper, we propose SeGPruner, a semantic-aware and geometry-guided token reduction framework for efficient 3D QA with multi-view images. Specifically, SeGPruner first preserves semantically salient tokens through an attention-based importance module (Saliency-aware Token Selector), ensuring that object-critical evidence is retained. It then complements these tokens with spatially diverse ones via a geometry-guided selector (Geometry-aware Token Diversifier), which jointly considers semantic relevance and 3D geometric distance. This cooperation between saliency preservation and geometry-guided diversification balances object-level evidence and global scene coverage under aggressive token reduction. Extensive experiments on ScanQA and OpenEQA demonstrate that SeGPruner substantially improves inference efficiency, reducing the visual token budget by 91% and inference latency by 86%, while maintaining competitive performance in 3D reasoning tasks.
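
The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `prune_tokens`, the parameters `num_salient` and `alpha`, the use of a per-token saliency score as the measure of semantic relevance, and the greedy farthest-point-style scoring rule are all assumptions made for illustration. It assumes each visual token comes with a non-negative saliency score (for example, attention it receives) and a 3D position.

```python
import math


def prune_tokens(saliency, positions, budget, num_salient, alpha=0.5):
    """Return sorted indices of the tokens to keep (illustrative sketch only).

    saliency:    non-negative importance score per token (assumed given)
    positions:   (x, y, z) coordinate per token (assumed given)
    budget:      total number of tokens to keep
    num_salient: how many of those are chosen purely by saliency (hypothetical knob)
    alpha:       weight of saliency versus 3D distance in stage 2 (hypothetical knob)
    """
    n = len(saliency)
    if len(positions) != n:
        raise ValueError("saliency and positions must have the same length")
    budget = min(budget, n)
    if budget <= 0:
        return []

    # Stage 1: keep the most salient tokens (always at least one).
    num_salient = max(1, min(num_salient, budget))
    order = sorted(range(n), key=lambda i: saliency[i], reverse=True)
    kept = order[:num_salient]

    # Distance from every remaining token to its nearest kept token.
    nearest = {
        i: min(math.dist(positions[i], positions[j]) for j in kept)
        for i in order[num_salient:]
    }
    top_saliency = max(saliency) or 1.0  # avoid division by zero

    # Stage 2: greedily add tokens that are both relevant and far from
    # everything kept so far, which spreads the selection over the scene.
    while len(kept) < budget and nearest:
        farthest = max(nearest.values()) or 1.0
        best = max(
            nearest,
            key=lambda i: alpha * saliency[i] / top_saliency
            + (1 - alpha) * nearest[i] / farthest,
        )
        kept.append(best)
        del nearest[best]
        for i in nearest:
            nearest[i] = min(nearest[i], math.dist(positions[i], positions[best]))

    return sorted(kept)


# Toy scene: tokens 0, 1 and 3 cluster near the origin; tokens 2 and 4 are far away.
saliency = [0.9, 0.8, 0.1, 0.2, 0.05]
positions = [(0, 0, 0), (0.1, 0, 0), (5, 0, 0), (0.2, 0, 0), (0, 5, 0)]
print(prune_tokens(saliency, positions, budget=3, num_salient=1))  # [0, 2, 4]
```

In the toy scene, selecting purely by saliency would keep three tokens from the same cluster (0, 1 and 3), whereas the distance term swaps two of them for the distant tokens 2 and 4. That is the trade-off between object-level evidence and scene coverage that the abstract describes.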
