Meta-Learning an In-Context Transformer Model of Human Higher Visual Cortex
May 21, 2025
Authors: Muquan Yu, Mu Nan, Hossein Adeli, Jacob S. Prince, John A. Pyles, Leila Wehbe, Margaret M. Henderson, Michael J. Tarr, Andrew F. Luo
cs.AI
Abstract
Understanding functional representations within higher visual cortex is a
fundamental question in computational neuroscience. While artificial neural
networks pretrained on large-scale datasets exhibit striking representational
alignment with human neural responses, learning image-computable models of
visual cortex relies on individual-level, large-scale functional magnetic
resonance imaging (fMRI) datasets. The
necessity for expensive, time-intensive, and often impractical data acquisition
limits the generalizability of encoders to new subjects and stimuli. We
introduce BraInCoRL, which uses in-context learning to predict voxelwise
neural responses from few-shot examples without any additional finetuning for
novel subjects and stimuli. We
leverage a transformer architecture that can flexibly condition on a variable
number of in-context image stimuli, learning an inductive bias over multiple
subjects. During training, we explicitly optimize the model for in-context
learning. By jointly conditioning on image features and voxel activations, our
model learns to directly generate better-performing voxelwise models of higher
visual cortex. We demonstrate that BraInCoRL consistently outperforms existing
voxelwise encoder designs in a low-data regime when evaluated on entirely novel
images, while also exhibiting strong test-time scaling behavior. The model also
generalizes to an entirely new visual fMRI dataset, which uses different
subjects and fMRI data acquisition parameters. Further, BraInCoRL facilitates
better interpretability of neural signals in higher visual cortex by attending
to semantically relevant stimuli. Finally, we show that our framework enables
interpretable mappings from natural language queries to voxel selectivity.
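The abstract gives no implementation details, but the core idea of a transformer that jointly conditions on image features and voxel activations can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the class name `InContextVoxelEncoder`, the use of a frozen image backbone (e.g., a CLIP-style model) to supply `feat_dim`-dimensional features, and all hyperparameters are hypothetical rather than taken from the paper.

```python
import torch
import torch.nn as nn

class InContextVoxelEncoder(nn.Module):
    """Minimal in-context voxelwise encoder sketch (hypothetical; not the
    paper's exact architecture). Each context token fuses an image feature
    with the measured response of one voxel; query tokens carry image
    features only, and the transformer predicts their responses."""

    def __init__(self, feat_dim=768, d_model=256, n_heads=8, n_layers=6):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, d_model)
        self.resp_proj = nn.Linear(1, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.readout = nn.Linear(d_model, 1)

    def forward(self, ctx_feats, ctx_resps, qry_feats):
        # ctx_feats: (B, K, feat_dim) features of K in-context stimuli
        # ctx_resps: (B, K)           measured responses of one voxel each
        # qry_feats: (B, Q, feat_dim) features of Q held-out query images
        ctx = self.img_proj(ctx_feats) + self.resp_proj(ctx_resps.unsqueeze(-1))
        qry = self.img_proj(qry_feats)             # queries carry no response
        h = self.transformer(torch.cat([ctx, qry], dim=1))
        return self.readout(h[:, ctx.shape[1]:]).squeeze(-1)  # (B, Q)
```

Because the sequence length is not fixed, the same weights accept any number K of context pairs, which is consistent with the test-time scaling behavior described above: predictions can improve simply by supplying more context pairs, with no gradient updates.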
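The claim that the model is "explicitly optimized for in-context learning" during training can likewise be sketched: each step builds an episode from a batch of voxels drawn across training subjects, splits each voxel's measured (image, response) pairs into a context set and a query set, and backpropagates the error on the queries only. The synthetic tensors below are hypothetical stand-ins for real fMRI data, and the loop reuses the `InContextVoxelEncoder` sketch above.

```python
import torch
import torch.nn.functional as F

def meta_train_step(model, optimizer,
                    ctx_feats, ctx_resps, qry_feats, qry_resps):
    """One meta-training step. The loss is computed only on held-out query
    images, so gradients reward in-context prediction itself rather than
    memorization of any single voxel's tuning."""
    pred = model(ctx_feats, ctx_resps, qry_feats)
    loss = F.mse_loss(pred, qry_resps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Synthetic episode standing in for a batch of voxels drawn from random
# training subjects (B voxels, K context images, Q query images).
B, K, Q, D = 8, 32, 16, 768
model = InContextVoxelEncoder(feat_dim=D)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = meta_train_step(model, opt,
                       torch.randn(B, K, D), torch.randn(B, K),
                       torch.randn(B, Q, D), torch.randn(B, Q))
print(f"query-set MSE: {loss:.3f}")
```

Under this training scheme the weights are frozen at test time: adapting to a new subject amounts to a single forward pass with that subject's few measured (image, response) pairs as context, which is why no finetuning is needed.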