From 2D Grids to 1D Tokens: Reshaping the Shared Representation for Multimodal Image Fusion

Abstract

Multimodal image fusion aims to integrate complementary information from different modalities into a single fused image that preserves rich local details while maintaining a globally consistent appearance. Existing approaches build shared representations on 2D feature grids, which excel at modeling local structure but offer limited control over image-level global appearance factors. To balance these two objectives, we introduce a compact 1D token interface, built on a frozen pretrained image tokenizer, for modeling non-local appearance/base factors. Rather than using the tokenizer as a reconstruction backbone, our design uses the 1D token space as a global carrier while retaining the 2D spatial pathway for local structure restoration. Specifically, we propose Selective Token Editing (STE), which sparsely updates or replaces a small set of critical tokens, providing a lightweight mechanism for steering global appearance coherence while keeping the fusion backbone unchanged and requiring no extra losses. Experiments on four commonly used benchmarks show that our method achieves the best overall performance, with consistent multi-metric improvements in both global coherence and local fidelity. Project page: https://zju-xyc.github.io/1D-Fusion-Project-Page/
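To make the idea of sparse token replacement concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not state how critical tokens are chosen, so the selection rule here (the k positions where two 1D token sequences differ most in Euclidean distance) is purely an assumption, as are the function name `selective_token_edit` and its parameters.

```python
import math


def selective_token_edit(base_tokens, ref_tokens, k):
    """Replace the k base tokens that differ most from ref_tokens.

    Hypothetical illustration only: the selection criterion (largest
    per-token Euclidean distance) is an assumption, not the paper's rule.
    """
    if len(base_tokens) != len(ref_tokens):
        raise ValueError("token sequences must have the same length")

    # Per-position disagreement between the two 1D token sequences.
    scores = [math.dist(b, r) for b, r in zip(base_tokens, ref_tokens)]

    # Positions ranked from most to least different; keep the top k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:max(k, 0)])

    # Copy the base sequence so the input is left untouched,
    # then overwrite only the chosen positions.
    edited = [list(t) for t in base_tokens]
    for i in chosen:
        edited[i] = list(ref_tokens[i])
    return edited, chosen


if __name__ == "__main__":
    base = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
    ref = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [0.0, 0.0]]
    edited, chosen = selective_token_edit(base, ref, k=1)
    print(chosen, edited)
```

Only the selected positions change, so most of the token sequence, and everything in the 2D spatial pathway, is left as it was. This is the sense in which such an edit is lightweight.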