Meta-Learning an In-Context Transformer Model of Human Higher Visual Cortex

Abstract

Understanding functional representations within higher visual cortex is a fundamental question in computational neuroscience. While artificial neural networks pretrained on large-scale datasets exhibit striking representational alignment with human neural responses, learning image-computable models of visual cortex relies on individual-level, large-scale fMRI datasets. The necessity for expensive, time-intensive, and often impractical data acquisition limits the generalizability of encoders to new subjects and stimuli. BraInCoRL uses in-context learning to predict voxelwise neural responses from few-shot examples without any additional finetuning for novel subjects and stimuli. We leverage a transformer architecture that can flexibly condition on a variable number of in-context image stimuli, learning an inductive bias over multiple subjects. During training, we explicitly optimize the model for in-context learning. By jointly conditioning on image features and voxel activations, our model learns to directly generate better-performing voxelwise models of higher visual cortex. We demonstrate that BraInCoRL consistently outperforms existing voxelwise encoder designs in a low-data regime when evaluated on entirely novel images, while also exhibiting strong test-time scaling behavior. The model also generalizes to an entirely new visual fMRI dataset, which uses different subjects and fMRI data acquisition parameters. Further, BraInCoRL facilitates better interpretability of neural signals in higher visual cortex by attending to semantically relevant stimuli. Finally, we show that our framework enables interpretable mappings from natural language queries to voxel selectivity.
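The idea of conditioning on a variable number of (image feature, voxel activation) pairs and directly generating a voxelwise model can be illustrated with a small sketch. This is not the BraInCoRL architecture: the single attention layer, mean pooling, linear readout, and every name below (`init_params`, `generate_voxel_encoder`, `predict`) are hypothetical choices made for illustration. The parameters are random and untrained, and random vectors stand in for image embeddings from a pretrained backbone. The sketch only shows the data flow: a context set of any size is mapped to the weights of a linear encoder for one voxel, which is then applied to new images without finetuning.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(feat_dim, model_dim, seed=0):
    """Random (untrained) parameters; a real model would be meta-trained."""
    rng = np.random.default_rng(seed)
    s = 0.1
    return {
        "embed": rng.normal(0, s, (feat_dim + 1, model_dim)),
        "wq": rng.normal(0, s, (model_dim, model_dim)),
        "wk": rng.normal(0, s, (model_dim, model_dim)),
        "wv": rng.normal(0, s, (model_dim, model_dim)),
        # Readout emits feat_dim weights plus one bias term.
        "out": rng.normal(0, s, (model_dim, feat_dim + 1)),
    }


def generate_voxel_encoder(params, ctx_feats, ctx_resps):
    """Map n context (feature, response) pairs to a linear encoder (w, b).

    ctx_feats: (n, feat_dim) image features; ctx_resps: (n,) voxel responses.
    """
    # Each token jointly encodes one image feature and its voxel activation.
    pairs = np.concatenate([ctx_feats, ctx_resps[:, None]], axis=1)
    tokens = pairs @ params["embed"]                      # (n, model_dim)
    q = tokens @ params["wq"]
    k = tokens @ params["wk"]
    v = tokens @ params["wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))        # (n, n)
    hidden = tokens + attn @ v                            # residual connection
    # Mean pooling accepts any n and ignores the order of context examples.
    pooled = hidden.mean(axis=0)
    out = pooled @ params["out"]
    return out[:-1], out[-1]


def predict(w, b, feats):
    """Apply the generated voxelwise encoder to new image features."""
    return feats @ w + b


# Demo with synthetic data: the same parameters handle 5 or 50 context images.
rng = np.random.default_rng(1)
params = init_params(feat_dim=16, model_dim=32)
for n in (5, 50):
    feats = rng.normal(size=(n, 16))
    resps = rng.normal(size=n)
    w, b = generate_voxel_encoder(params, feats, resps)
    preds = predict(w, b, rng.normal(size=(3, 16)))
    print(n, w.shape, preds.shape)
```

In this sketch the context size is never fixed in the parameters, which is what allows the same function to be called with more examples at test time. Whether the actual model pools, reads out, or parameterizes the voxelwise encoder this way is not stated in the abstract.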
