Brain-IT-VQA: From Brain Signals to Answers

Abstract

Decoding visual content from fMRI signals recorded while a person views images, and in particular answering questions about the seen images, is a long-standing challenge. Although visual question answering (VQA) from fMRI has progressed significantly in recent years, performance remains limited. Moreover, while recent models make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Existing image-fMRI VQA datasets typically provide only a few broad, weakly controlled questions per image. In contrast, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images. We further analyze the contributions of different brain regions across question types.
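As an illustration of why category-controlled questions make evaluation easier to interpret, the following minimal Python sketch aggregates answer accuracy separately for each question category. It is not the paper's evaluation code: the function name `per_category_accuracy`, the record layout, the category names, and the case-insensitive exact-match scoring rule are all assumptions made for this example.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """Return accuracy per question category.

    `records` is an iterable of (category, predicted_answer, true_answer)
    tuples. This layout is hypothetical, not the NSD-VQA file format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, truth in records:
        total[category] += 1
        # Assumed scoring rule: case-insensitive exact match.
        if predicted.strip().lower() == truth.strip().lower():
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Toy records with made-up category names, for illustration only.
records = [
    ("object_presence", "yes", "yes"),
    ("object_presence", "no", "yes"),
    ("scene_type", "indoor", "Indoor"),
    ("scene_type", "outdoor", "outdoor"),
]
scores = per_category_accuracy(records)
```

Reporting one score per category, rather than a single pooled accuracy, is what lets differences between levels of visual understanding show up in the results.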