QG-MIL: A Gated Transformer Aggregator for Domain-Agnostic Multiple Instance Learning in Medical Imaging

Abstract

Attention-based Multiple Instance Learning (MIL) aggregators in medical imaging are prone to attention concentration, which produces overconfident and unstable predictions. We introduce QG-MIL, a gated transformer aggregator that addresses this problem through four synergistic architectural components: RMSNorm-based pre-normalization, per-head QK normalization, fine-grained attention output gating, and SwiGLU-style feed-forward modules. Together, these design choices stabilize training and distribute attention more uniformly across instances without auxiliary losses, masking, or multi-stage regularization. We evaluate QG-MIL on six benchmarks spanning whole-slide pathology and cell-level hematology, which cover two fundamentally different MIL scales. The best-performing QG-MIL variants outperform leading baselines on all six benchmarks, with an average improvement of +6.1 points in mean macro F1. Attention overlays and attention mass analysis confirm more distributed instance weighting. Ablation studies show that, while individual components can match the full model on specific datasets, the QG-MIL design provides the most consistent cross-domain performance and the lowest variance when compared to selected baselines. To support reproducibility, we release a configurable implementation at: https://github.com/unica-visual-intelligence-lab/QG-MIL
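To make the four components named above concrete, the following is a minimal NumPy sketch of how one such block could be wired together. It is an illustration under stated assumptions, not the released implementation: the function and parameter names (`gated_block`, `init_params`, `wg`, and so on), the sigmoid form of the output gate, the fixed `1/sqrt(d)` logit scale, and the mean pooling used to obtain a bag embedding are all hypothetical choices made for this example.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale the last axis by its root mean square (no mean-centering)."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    return x * sigmoid(x)

def init_params(dim, heads, hidden, seed=0):
    """Random weights for illustration only (hypothetical layout, not the paper's)."""
    rng = np.random.default_rng(seed)
    w = lambda m, n: rng.normal(0.0, m ** -0.5, size=(m, n))
    head_dim = dim // heads
    return {
        "heads": heads,
        "norm1": np.ones(dim), "norm2": np.ones(dim),
        "q_gain": np.ones(head_dim), "k_gain": np.ones(head_dim),
        "wq": w(dim, dim), "wk": w(dim, dim), "wv": w(dim, dim),
        "wg": w(dim, dim), "wo": w(dim, dim),
        "w_gate": w(dim, hidden), "w_up": w(dim, hidden), "w_down": w(hidden, dim),
    }

def gated_block(x, p):
    """One pre-norm block over a bag x of shape (n_instances, dim)."""
    n, dim = x.shape
    h = p["heads"]
    d = dim // h
    split = lambda t: t.reshape(n, h, d).transpose(1, 0, 2)  # -> (h, n, d)

    # 1) RMSNorm-based pre-normalization before the attention sublayer.
    z = rms_norm(x, p["norm1"])
    # 2) Per-head QK normalization: queries and keys are RMS-normalized per head,
    #    which bounds the attention logits.
    q = rms_norm(split(z @ p["wq"]), p["q_gain"])
    k = rms_norm(split(z @ p["wk"]), p["k_gain"])
    v = split(z @ p["wv"])
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))    # (h, n, n)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, dim)
    # 3) Element-wise output gate (assumed here: sigmoid of a linear map of z).
    out = out * sigmoid(z @ p["wg"])
    x = x + out @ p["wo"]

    # 4) SwiGLU-style feed-forward sublayer, also pre-normalized.
    z = rms_norm(x, p["norm2"])
    x = x + (silu(z @ p["w_gate"]) * (z @ p["w_up"])) @ p["w_down"]
    return x, attn

# Usage: a bag of 50 instance embeddings of width 64 (synthetic data).
bag = np.random.default_rng(1).normal(size=(50, 64))
params = init_params(dim=64, heads=4, hidden=128)
out, attn = gated_block(bag, params)
bag_embedding = out.mean(axis=0)  # assumed pooling; the paper's choice may differ
```

The returned `attn` array holds one row-stochastic attention matrix per head, which is the kind of quantity an attention mass analysis would inspect to see how evenly weight is spread across instances.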