Brain-IT-VQA: From Brain Signals to Answers

Abstract

Decoding visual content from fMRI signals recorded while a person views images, and in particular answering questions about the seen images, is a long-standing challenge. Visual question answering (VQA) from fMRI has progressed significantly in recent years, yet performance remains limited. Moreover, although recent models make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Unlike existing image-fMRI VQA datasets, which typically provide only a few broad and weakly controlled questions per image, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images, and we analyze the contributions of different brain regions across question types.