Text-conditioned State Space Model For Domain-generalized Change Detection Visual Question Answering
August 12, 2025
Authors: Elman Ghazaei, Erchan Aptoula
cs.AI
Abstract
The Earth's surface is constantly changing, and detecting these changes
provides valuable insights that benefit various aspects of human society. While
traditional change detection methods have been employed to detect changes from
bi-temporal images, these approaches typically require expert knowledge for
accurate interpretation. To enable broader and more flexible access to change
information by non-expert users, the task of Change Detection Visual Question
Answering (CDVQA) has been introduced. However, existing CDVQA methods have
been developed under the assumption that training and testing datasets share
similar distributions. This assumption does not hold in real-world
applications, where domain shifts often occur. In this paper, the CDVQA task is
revisited with a focus on addressing domain shift. To this end, a new
multi-modal and multi-domain dataset, BrightVQA, is introduced to facilitate
domain generalization research in CDVQA. Furthermore, a novel state space
model, termed Text-Conditioned State Space Model (TCSSM), is proposed. The
TCSSM framework is designed to leverage both bi-temporal imagery and
geo-disaster-related textual information in a unified manner to extract
domain-invariant features across domains. The input-dependent parameters of
TCSSM are dynamically predicted from both the bi-temporal images and the
geo-disaster-related description, thereby facilitating alignment between
bi-temporal visual data and the associated textual descriptions. Extensive
experiments are conducted to evaluate the proposed method against
state-of-the-art models, and superior performance is consistently demonstrated.
The code and dataset will be made publicly available upon acceptance at
https://github.com/Elman295/TCSSM.
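
To make the text-conditioning idea concrete, below is a minimal sketch, assuming a Mamba-style selective state space block implemented in plain PyTorch, in which the input-dependent parameters (step size, input and output matrices) are predicted from fused bi-temporal tokens plus a pooled text embedding of the geo-disaster description. The module and variable names (e.g. TextConditionedSSM, text_emb) and the simple pre/post-image difference fusion are illustrative assumptions, not the released TCSSM implementation.

```python
# Hypothetical sketch of a text-conditioned selective SSM block (not the
# authors' code). All names and the fusion strategy are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextConditionedSSM(nn.Module):
    """Selective SSM whose input-dependent parameters (delta, B, C) are
    predicted from bi-temporal visual tokens conditioned on a text embedding."""

    def __init__(self, dim: int, state_dim: int = 16, text_dim: int = 512):
        super().__init__()
        self.state_dim = state_dim
        # Project the geo-disaster text embedding into the token space so it
        # conditions the parameter prediction of every visual token.
        self.text_proj = nn.Linear(text_dim, dim)
        # Input-dependent parameter heads (conditioned on tokens + text).
        self.delta_proj = nn.Linear(dim, dim)     # per-channel step size
        self.B_proj = nn.Linear(dim, state_dim)   # input matrix
        self.C_proj = nn.Linear(dim, state_dim)   # output matrix
        # State transition A: learned, log-parameterized diagonal (kept negative).
        self.A_log = nn.Parameter(torch.zeros(dim, state_dim))

    def forward(self, x: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim) fused bi-temporal tokens
        # text_emb: (batch, text_dim) pooled geo-disaster description embedding
        cond = x + self.text_proj(text_emb).unsqueeze(1)   # text conditioning
        delta = F.softplus(self.delta_proj(cond))          # (B, L, D)
        B = self.B_proj(cond)                              # (B, L, N)
        C = self.C_proj(cond)                              # (B, L, N)
        A = -torch.exp(self.A_log)                         # (D, N), stable

        # Discretize and run the selective scan sequentially (clarity over speed).
        batch, length, dim = x.shape
        h = x.new_zeros(batch, dim, self.state_dim)
        outputs = []
        for t in range(length):
            dA = torch.exp(delta[:, t].unsqueeze(-1) * A)           # (B, D, N)
            dB = delta[:, t].unsqueeze(-1) * B[:, t].unsqueeze(1)   # (B, D, N)
            h = dA * h + dB * x[:, t].unsqueeze(-1)                 # state update
            y_t = (h * C[:, t].unsqueeze(1)).sum(-1)                # (B, D)
            outputs.append(y_t)
        return torch.stack(outputs, dim=1)                          # (B, L, D)


# Usage sketch: fuse pre/post-disaster tokens, condition on the text embedding.
if __name__ == "__main__":
    ssm = TextConditionedSSM(dim=64, state_dim=16, text_dim=512)
    pre, post = torch.randn(2, 196, 64), torch.randn(2, 196, 64)
    text = torch.randn(2, 512)      # e.g. a CLIP-style text feature (assumed)
    out = ssm(pre - post, text)     # simple bi-temporal difference fusion
    print(out.shape)                # torch.Size([2, 196, 64])
```

The sequential loop is written for readability; a practical implementation would use a parallel scan, and the difference-based fusion of the two acquisition dates is only one plausible choice for combining bi-temporal features before the text-conditioned scan.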