From 2D Grids to 1D Tokens: Restructuring Shared Representations for Multimodal Image Fusion

Abstract

Multimodal image fusion aims to integrate complementary information from different modalities into a fused image that preserves rich local details while maintaining globally consistent appearance. Existing approaches build shared representations on 2D feature grids, which excel at modeling local structures but offer limited leverage over image-level global appearance factors. To balance these objectives, we introduce a compact 1D token interface based on a frozen pretrained image tokenizer for modeling non-local appearance/base factors. Rather than using the tokenizer as a reconstruction backbone, our design uses the 1D token space as a global carrier while retaining the 2D spatial pathway for local structure restoration. Specifically, we introduce Selective Token Editing (STE), which sparsely updates/replaces a small set of critical tokens, providing a lightweight mechanism to steer global appearance coherence while keeping the fusion backbone unchanged and avoiding extra losses. Experiments on four commonly used benchmarks show that our method achieves the best overall performance, with consistent, multi-metric improvements in both global coherence and local fidelity. Project page: https://zju-xyc.github.io/1D-Fusion-Project-Page/
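The sparse update/replace idea behind STE can be illustrated with a minimal sketch. The abstract does not say how the critical tokens are chosen or what they are replaced with, so everything below is an assumption made for illustration: the function name `selective_token_edit`, the per-token L2-distance score, the fixed budget `k`, and the use of a second token sequence as the replacement source are all hypothetical and are not the authors' implementation.

```python
import numpy as np


def selective_token_edit(base_tokens, donor_tokens, k):
    """Replace the k tokens of base_tokens that differ most from donor_tokens.

    base_tokens, donor_tokens: arrays of shape (num_tokens, dim), e.g. 1D token
    sequences produced by a frozen tokenizer for two inputs (assumed setup).
    Returns the edited copy and the sorted indices of the replaced tokens.
    """
    if base_tokens.shape != donor_tokens.shape:
        raise ValueError("token sequences must have the same shape")
    # Clamp the edit budget to the valid range [0, num_tokens].
    k = max(0, min(k, base_tokens.shape[0]))
    # Hypothetical importance score: per-token L2 distance between sequences.
    scores = np.linalg.norm(base_tokens - donor_tokens, axis=1)
    # Indices of the k highest-scoring tokens (stable sort on negated scores).
    selected = np.argsort(-scores, kind="stable")[:k]
    # Edit a copy so the input sequence is left untouched.
    edited = base_tokens.copy()
    edited[selected] = donor_tokens[selected]
    return edited, np.sort(selected)
```

With a sequence of 32 tokens and `k=2`, only two rows of the token matrix change and the remaining 30 are passed through unchanged. This mirrors the "small set of critical tokens" described above, while the 2D spatial pathway and fusion backbone would sit outside this function.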