STMI: Segmentation-Guided Token Modulation with Cross-Modal Hypergraph Interaction for Multi-Modal Object Re-Identification

Abstract

Multi-modal object re-identification (ReID) aims to retrieve specific objects by exploiting complementary information from different modalities. However, existing methods often rely on hard token filtering or simple fusion strategies, which can discard discriminative cues and increase background interference. To address these challenges, we propose STMI, a novel multi-modal learning framework consisting of three key components: (1) a Segmentation-Guided Feature Modulation (SFM) module, which leverages SAM-generated masks to enhance foreground representations and suppress background noise through learnable attention modulation; (2) a Semantic Token Reallocation (STR) module, which employs learnable query tokens and an adaptive reallocation mechanism to extract compact and informative representations without discarding any tokens; and (3) a Cross-Modal Hypergraph Interaction (CHI) module, which constructs a unified hypergraph across modalities to capture high-order semantic relationships. Extensive experiments on public benchmarks (RGBNT201, RGBNT100, and MSVR310) demonstrate the effectiveness and robustness of the proposed STMI framework in multi-modal ReID scenarios.
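
To make the contrast with hard token filtering concrete, the following is a minimal sketch of soft, mask-guided attention modulation in the spirit of SFM. It is not the authors' implementation: the function name `mask_modulated_attention`, the scalar `gamma` (standing in for a learnable modulation strength), and the additive form of the bias are all assumptions made for illustration. The foreground mask is assumed to be a per-token value in [0, 1], for example a SAM mask downsampled to the patch grid.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mask_modulated_attention(logits, fg_mask, gamma=2.0):
    """Bias attention logits toward foreground tokens (hypothetical form).

    logits  : raw attention scores for N tokens
    fg_mask : per-token foreground score in [0, 1] (assumed to come from
              a segmentation mask resized to the token grid)
    gamma   : modulation strength; a fixed float here, assumed to be a
              learnable parameter in a trained model
    """
    if len(logits) != len(fg_mask):
        raise ValueError("logits and fg_mask must have the same length")
    # Additive bias: foreground tokens are boosted, background tokens are
    # left in place rather than removed, so no token is discarded.
    biased = [score + gamma * m for score, m in zip(logits, fg_mask)]
    return softmax(biased)


# Four tokens with equal raw scores; the first two lie on the object.
logits = [0.0, 0.0, 0.0, 0.0]
fg_mask = [1.0, 1.0, 0.0, 0.0]
weights = mask_modulated_attention(logits, fg_mask, gamma=2.0)
print(weights)
```

With `gamma = 0` the weights reduce to the unmodulated softmax, while larger values shift more attention to foreground tokens. Background tokens keep a small nonzero weight throughout, which is the property that separates soft modulation from hard filtering.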
