Text-conditioned State Space Model For Domain-generalized Change Detection Visual Question Answering
August 12, 2025
Authors: Elman Ghazaei, Erchan Aptoula
cs.AI
Abstract
The Earth's surface is constantly changing, and detecting these changes
provides valuable insights that benefit various aspects of human society. While
traditional change detection methods have been employed to detect changes from
bi-temporal images, these approaches typically require expert knowledge for
accurate interpretation. To enable broader and more flexible access to change
information by non-expert users, the task of Change Detection Visual Question
Answering (CDVQA) has been introduced. However, existing CDVQA methods have
been developed under the assumption that training and testing datasets share
similar distributions. This assumption does not hold in real-world
applications, where domain shifts often occur. In this paper, the CDVQA task is
revisited with a focus on addressing domain shift. To this end, a new
multi-modal and multi-domain dataset, BrightVQA, is introduced to facilitate
domain generalization research in CDVQA. Furthermore, a novel state space
model, termed Text-Conditioned State Space Model (TCSSM), is proposed. The
TCSSM framework is designed to leverage bi-temporal imagery and
geo-disaster-related textual information in a unified manner to extract
domain-invariant features. The input-dependent parameters of TCSSM are
dynamically predicted from both the bi-temporal images and the
geo-disaster-related description, thereby aligning the bi-temporal visual data
with the associated textual descriptions. Extensive experiments comparing the
proposed method against state-of-the-art models consistently demonstrate its
superior performance.
The code and dataset will be made publicly available upon acceptance at
https://github.com/Elman295/TCSSM.
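
Below is a minimal, hedged sketch of what text-conditioned, input-dependent state space parameters could look like, assuming a Mamba-style selective scan. It is not the authors' released code: the module name TextConditionedSSM, the dimensions d_model/d_state/d_text, and the pooled-text conditioning are illustrative assumptions, not the actual TCSSM design.

```python
# Hypothetical sketch of text-conditioned selective SSM parameters (not the
# authors' implementation). Per-token delta, B, C are predicted from visual
# tokens concatenated with a pooled text embedding of the disaster description.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextConditionedSSM(nn.Module):
    def __init__(self, d_model: int = 64, d_state: int = 16, d_text: int = 64):
        super().__init__()
        self.d_state = d_state
        # State matrix A: learned, log-parameterized, input-independent.
        self.A_log = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float()))
        # Input-dependent parameters predicted from [visual ; text] features.
        self.to_delta = nn.Linear(d_model + d_text, d_model)
        self.to_B = nn.Linear(d_model + d_text, d_state)
        self.to_C = nn.Linear(d_model + d_text, d_state)

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model) visual tokens from the bi-temporal pair
        # text: (batch, d_text) pooled geo-disaster description embedding
        B_, L, D = x.shape
        cond = torch.cat([x, text.unsqueeze(1).expand(-1, L, -1)], dim=-1)
        delta = F.softplus(self.to_delta(cond))   # (B, L, D) step sizes
        Bmat = self.to_B(cond)                    # (B, L, N) input projection
        Cmat = self.to_C(cond)                    # (B, L, N) output projection
        A = -torch.exp(self.A_log)                # (N,) stable state decay

        # Discretize and run a sequential scan (written for clarity, not speed).
        h = x.new_zeros(B_, D, self.d_state)
        ys = []
        for t in range(L):
            dA = torch.exp(delta[:, t].unsqueeze(-1) * A)                  # (B, D, N)
            dBx = delta[:, t].unsqueeze(-1) * Bmat[:, t].unsqueeze(1) \
                  * x[:, t].unsqueeze(-1)                                  # (B, D, N)
            h = dA * h + dBx
            ys.append((h * Cmat[:, t].unsqueeze(1)).sum(-1))               # (B, D)
        return torch.stack(ys, dim=1)                                      # (B, L, D)
```

In this sketch, the text embedding enters the prediction of delta, B, and C, so the state update itself, rather than only a late fusion layer, is steered by the geo-disaster description; the actual TCSSM parameterization and fusion strategy may differ.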