Brain-IT-VQA: From Brain Signals to Answers

Abstract

Decoding visual content from fMRI signals recorded while a person views images, and in particular answering questions about those images, is a long-standing challenge. Although visual question answering (VQA) from fMRI has advanced considerably in recent years, performance remains limited. Moreover, although recent models make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Unlike existing image-fMRI VQA datasets, which typically provide only a few broad and weakly controlled questions per image, NSD-VQA provides an average of 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images, and we analyze the contributions of different brain regions across question types.
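
To make the idea of category-wise evaluation concrete, the following is a minimal sketch of how accuracy could be broken down by question category. It is an illustration only: the record fields (`category`, `predicted`, `answer`), the category names in the toy data, and the use of case-insensitive exact match as the scoring rule are all assumptions, not the benchmark's stated format or metric.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """Return accuracy per question category.

    `records` is an iterable of dicts with the hypothetical keys
    'category', 'predicted', and 'answer'. Scoring is case-insensitive
    exact match, which is an assumed metric for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        if r["predicted"].strip().lower() == r["answer"].strip().lower():
            correct[r["category"]] += 1
    # Every category in `total` has at least one record, so no division by zero.
    return {c: correct[c] / total[c] for c in total}


# Toy data with made-up categories and answers.
example = [
    {"category": "color", "predicted": "Red", "answer": "red"},
    {"category": "color", "predicted": "blue", "answer": "green"},
    {"category": "count", "predicted": "2", "answer": "2"},
]
scores = per_category_accuracy(example)
```

Reporting one score per category, rather than a single pooled number, is what allows a decoder's strengths and weaknesses to be read off by question type.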