SeGPruner: Semantic-Geometric Visual Token Pruner for 3D Question Answering

Abstract

Vision-language models (VLMs) are widely used for 3D question answering (3D QA). In a typical pipeline, visual tokens extracted from multiple viewpoints are concatenated with language tokens and jointly processed by a large language model (LLM). However, aggregating multi-view observations inevitably introduces severe token redundancy, and the resulting visual token set is large enough to significantly hinder inference efficiency under a constrained token budget. Visual token pruning has emerged as a common strategy to address this issue. Nevertheless, most existing pruners are tailored to 2D inputs or rely on indirect geometric cues, which limits their ability to explicitly retain semantically critical objects and to maintain sufficient spatial coverage for robust 3D reasoning. In this paper, we propose SeGPruner, a semantic-aware and geometry-guided token reduction framework for efficient 3D QA with multi-view images. Specifically, SeGPruner first preserves semantically salient tokens through an attention-based importance module (Saliency-aware Token Selector), ensuring that object-critical evidence is retained. It then complements these tokens with spatially diverse ones via a geometry-guided selector (Geometry-aware Token Diversifier), which jointly considers semantic relevance and 3D geometric distance. This cooperation between saliency preservation and geometry-guided diversification balances object-level evidence against global scene coverage under aggressive token reduction. Extensive experiments on ScanQA and OpenEQA demonstrate that SeGPruner substantially improves inference efficiency, reducing the visual token budget by 91% and inference latency by 86%, while maintaining competitive performance on 3D reasoning tasks.
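
The two-stage idea described above (keep salient tokens first, then add spatially diverse ones) can be illustrated with a minimal sketch. The abstract does not specify the scoring functions, so everything below is an assumption made for illustration: the function name `prune_tokens`, the `salient_fraction` and `alpha` parameters, the use of a precomputed non-negative per-token saliency score, and the greedy farthest-point-style selection that mixes normalized saliency with normalized 3D distance. It is not the authors' implementation.

```python
import math


def prune_tokens(saliency, positions, budget, salient_fraction=0.5, alpha=0.5):
    """Hypothetical two-stage pruning sketch (not the paper's algorithm).

    saliency:  non-negative importance score per token (assumed precomputed,
               e.g. from attention weights).
    positions: one (x, y, z) coordinate per token (assumed known).
    budget:    number of tokens to keep.
    Returns the sorted indices of the kept tokens.
    """
    n = len(saliency)
    if len(positions) != n:
        raise ValueError("saliency and positions must have the same length")
    budget = min(budget, n)
    if budget <= 0:
        return []

    # Stage 1 (saliency): keep the highest-scoring tokens.
    k_sal = min(budget, max(1, round(budget * salient_fraction)))
    order = sorted(range(n), key=lambda i: saliency[i], reverse=True)
    kept = order[:k_sal]
    remaining = sorted(order[k_sal:])

    # Distance from each remaining token to its nearest kept token.
    min_dist = {
        i: min(math.dist(positions[i], positions[j]) for j in kept)
        for i in remaining
    }

    # Stage 2 (diversity): greedily add the token with the best mix of
    # normalized saliency and normalized distance to the kept set.
    while len(kept) < budget and remaining:
        max_d = max(min_dist[i] for i in remaining) or 1.0
        max_s = max(saliency[i] for i in remaining) or 1.0
        best = max(
            remaining,
            key=lambda i: alpha * saliency[i] / max_s
            + (1 - alpha) * min_dist[i] / max_d,
        )
        kept.append(best)
        remaining.remove(best)
        for i in remaining:
            d = math.dist(positions[i], positions[best])
            if d < min_dist[i]:
                min_dist[i] = d

    return sorted(kept)
```

In this sketch, `alpha` trades semantic relevance against spatial spread: values near 1 favor salient tokens even when they cluster together, while values near 0 approach plain farthest-point sampling over the 3D positions.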
