BrainG3N: A Dual-Objective Tokenizer for Controllable 3D Brain MRI Generation

Abstract

Three-dimensional (3D) brain MRI is central to clinical neurology and neuro-oncology, where generative models could augment under-represented cohorts, simulate disease trajectories, and support privacy-preserving data sharing. Latent diffusion has become the standard approach for modeling imaging data, but it places two competing demands on the tokenizer: the encoder embeddings must retain the clinical information that downstream tasks act on, and the decoder must reconstruct anatomically faithful volumes. Existing reconstruction-driven tokenizers achieve the second at the expense of the first. To address this, we introduce a fully volumetric masked-autoencoder (MAE)-based tokenizer for 3D brain MRI latent diffusion that decouples encoder and decoder: a frozen 3D MAE encoder produces clinically informative embeddings, while a dedicated CNN decoder reconstructs voxels from a linear projection of those embeddings. We pretrain the encoder on 35,309 volumes from 18 public cohorts spanning four modalities, ten disease categories, and 200+ acquisition sites, and demonstrate its dual utility in two settings. First, on a 23-task linear-probing benchmark, the encoder outperforms or matches state-of-the-art models (BrainIAC, BrainSegFounder, and MedicalNet) on 21 of 23 tasks. Second, a conditional diffusion transformer (DiT) trained on these clinically informative embeddings supports both conditional generation across six variables and patient-specific longitudinal forecasting. Together, these results establish a single 3D brain-MRI embedding space that supports both downstream clinical tasks and controllable generation.
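
As a rough illustration of the decoupled design described in the abstract, the Python sketch below tracks only array shapes through the pipeline: a volume is split into 3D patches, a frozen encoder yields one embedding per patch, a linear projection compresses each embedding into a latent, and a separate decoder maps the latents back to voxels. The class name `TokenizerShapes` and every size in it (volume, patch, embedding width, latent width) are hypothetical values chosen for illustration; the abstract does not state them, and this is not the authors' implementation.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class TokenizerShapes:
    """Shape bookkeeping for a decoupled 3D tokenizer (all sizes are assumptions)."""
    volume: tuple = (128, 128, 128)   # assumed input volume, in voxels
    patch: tuple = (16, 16, 16)       # assumed 3D patch size
    embed_dim: int = 768              # assumed encoder embedding width
    latent_dim: int = 16              # assumed width after the linear projection

    def grid(self):
        """Number of patches along each spatial axis."""
        if any(v % p for v, p in zip(self.volume, self.patch)):
            raise ValueError("volume must be divisible by the patch size on every axis")
        return tuple(v // p for v, p in zip(self.volume, self.patch))

    def num_tokens(self):
        """Total number of patch tokens."""
        return prod(self.grid())

    def encoder_output(self):
        """Frozen encoder: one embedding per patch token."""
        return (self.num_tokens(), self.embed_dim)

    def latent(self):
        """Linear projection of the embeddings; a diffusion model would operate here."""
        return (self.num_tokens(), self.latent_dim)

    def decoder_output(self):
        """The decoder maps the latent grid back to a full-resolution volume."""
        return self.volume

    def compression_ratio(self):
        """Input voxels per latent scalar."""
        return prod(self.volume) / prod(self.latent())


shapes = TokenizerShapes()
print(shapes.grid())               # (8, 8, 8)
print(shapes.encoder_output())     # (512, 768)
print(shapes.latent())             # (512, 16)
print(shapes.compression_ratio())  # 256.0
```

Under these assumed sizes, the embeddings used for linear probing have shape (512, 768), while the diffusion model would see a much smaller (512, 16) latent. Keeping the encoder frozen means the probing space and the generation space are linked only through the projection, which is the decoupling the abstract describes.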