Meta-Learning an In-Context Transformer Model of Human Higher Visual Cortex
May 21, 2025
Authors: Muquan Yu, Mu Nan, Hossein Adeli, Jacob S. Prince, John A. Pyles, Leila Wehbe, Margaret M. Henderson, Michael J. Tarr, Andrew F. Luo
cs.AI
Abstract
Understanding functional representations within higher visual cortex is a
fundamental question in computational neuroscience. While artificial neural
networks pretrained on large-scale datasets exhibit striking representational
alignment with human neural responses, learning image-computable models of
visual cortex relies on individual-level, large-scale fMRI datasets. The
necessity for expensive, time-intensive, and often impractical data acquisition
limits the generalizability of encoders to new subjects and stimuli. We
introduce BraInCoRL, which uses in-context learning to predict voxelwise neural
responses from few-shot examples without any additional finetuning for novel
subjects and stimuli. We
leverage a transformer architecture that can flexibly condition on a variable
number of in-context image stimuli, learning an inductive bias over multiple
subjects. During training, we explicitly optimize the model for in-context
learning. By jointly conditioning on image features and voxel activations, our
model learns to directly generate better-performing voxelwise models of higher
visual cortex. We demonstrate that BraInCoRL consistently outperforms existing
voxelwise encoder designs in a low-data regime when evaluated on entirely novel
images, while also exhibiting strong test-time scaling behavior. The model also
generalizes to an entirely new visual fMRI dataset, which uses different
subjects and fMRI data acquisition parameters. Further, BraInCoRL facilitates
better interpretability of neural signals in higher visual cortex by attending
to semantically relevant stimuli. Finally, we show that our framework enables
interpretable mappings from natural language queries to voxel selectivity.
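
As a rough illustration of the architecture the abstract describes, the following is a minimal PyTorch sketch (not the authors' implementation; the module names, dimensions, and mean-pooling readout are all assumptions) of a transformer that jointly conditions on a variable number of in-context (image feature, voxel response) pairs and directly generates the weights of a per-voxel linear encoding model, which is then applied to novel query images:

```python
# Hypothetical sketch, not the authors' code: an in-context transformer that,
# given a few (image feature, voxel response) pairs for one voxel, emits the
# weights of a linear voxelwise encoding model for that voxel.
import torch
import torch.nn as nn

class InContextVoxelEncoder(nn.Module):
    def __init__(self, feat_dim=512, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        # Each context token packs an image feature with its measured response.
        self.token_proj = nn.Linear(feat_dim + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Heads that decode the pooled context into a per-voxel linear model.
        self.weight_head = nn.Linear(d_model, feat_dim)
        self.bias_head = nn.Linear(d_model, 1)

    def forward(self, ctx_feats, ctx_resp, query_feats):
        # ctx_feats:   (B, K, feat_dim) features of K in-context images
        # ctx_resp:    (B, K) measured responses of one voxel to those images
        # query_feats: (B, Q, feat_dim) features of novel query images
        tokens = self.token_proj(
            torch.cat([ctx_feats, ctx_resp.unsqueeze(-1)], dim=-1))
        h = self.backbone(tokens).mean(dim=1)  # pool over context; K may vary
        w = self.weight_head(h)                # (B, feat_dim) generated weights
        b = self.bias_head(h)                  # (B, 1) generated bias
        # Apply the generated linear model to the query images: (B, Q).
        return (query_feats * w.unsqueeze(1)).sum(-1) + b
```

Because the backbone attends over however many context tokens are supplied, the same trained weights support both the few-shot regime and the test-time scaling behavior reported in the abstract: providing more context images simply lengthens the token sequence.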
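Likewise, "explicitly optimize the model for in-context learning" can be read as episodic meta-training: each step samples one voxel's stimulus-response data from a random training subject, splits it into a context set and a held-out query set, and backpropagates the query prediction error. The sketch below assumes a hypothetical `sample_voxel_episode` helper; the actual sampling scheme and loss are not specified in the abstract.

```python
# Hypothetical meta-training step (a sketch of optimizing for in-context
# learning across subjects; the episode sampler is an assumption).
import torch.nn.functional as F

def meta_train_step(model, optimizer, sample_voxel_episode):
    # sample_voxel_episode() is assumed to return one episode drawn from a
    # random training subject and voxel: image features and responses split
    # into a variable-size context set and a held-out query set.
    ctx_feats, ctx_resp, qry_feats, qry_resp = sample_voxel_episode()
    pred = model(ctx_feats, ctx_resp, qry_feats)
    loss = F.mse_loss(pred, qry_resp)  # in-context prediction error on queries
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```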