Brain-IT-VQA: From Brain Signals to Answers

Abstract

Decoding visual content from functional magnetic resonance imaging (fMRI) signals recorded while a person views images, and in particular answering questions about the images seen, is a long-standing challenge. Although visual question answering (VQA) from fMRI has progressed significantly in recent years, performance remains limited. Moreover, while recent models make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Existing image-fMRI VQA datasets typically provide only a few broad, weakly controlled questions per image. In contrast, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images. We further analyze the contributions of different brain regions across question types.
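
To make the idea of category-controlled evaluation concrete, the following is a minimal sketch of how per-category accuracy could be computed from decoded answers. It is not the authors' evaluation code: the function name `per_category_accuracy`, the record layout, the category names, and the use of case-insensitive exact-match scoring are all assumptions made for illustration.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """Return accuracy per question category.

    records: iterable of (category, predicted_answer, reference_answer).
    Scoring is case-insensitive exact match, an assumption made for this
    sketch; the benchmark itself may score answers differently.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, reference in records:
        total[category] += 1
        if predicted.strip().lower() == reference.strip().lower():
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Hypothetical toy records; the category names are illustrative only and
# are not taken from NSD-VQA.
toy_records = [
    ("object_presence", "yes", "Yes"),
    ("object_presence", "no", "yes"),
    ("scene_type", "indoor", "indoor"),
]
scores = per_category_accuracy(toy_records)
```

Reporting one score per category, rather than a single pooled accuracy, is what lets a benchmark of this kind indicate which types of visual or semantic information are decodable and which are not.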