Brain-IT-VQA: From Brain Signals to Answers

Abstract

Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images, is a long-standing challenge. Although visual question answering (VQA) from fMRI has progressed significantly in recent years, performance remains limited. Moreover, although recent models make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Unlike existing image-fMRI VQA datasets, which typically provide only a few broad and weakly controlled questions per image, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images. We further analyze how different brain regions contribute to each question type.