QG-MIL: A Gated Transformer Aggregator for Domain-Agnostic Multiple Instance Learning in Medical Imaging

Abstract

Attention-based Multiple Instance Learning (MIL) aggregators in medical imaging are prone to attention concentration, which produces overconfident and unstable predictions. We introduce QG-MIL, a gated transformer aggregator that addresses this problem through four synergistic architectural components: RMSNorm-based pre-normalization, per-head QK normalization, fine-grained attention output gating, and SwiGLU-style feed-forward modules. Together, these design choices stabilize training and distribute attention more uniformly across instances without auxiliary losses, masking, or multi-stage regularization. We evaluate QG-MIL on six benchmarks spanning whole-slide pathology and cell-level hematology, which cover two fundamentally different MIL scales. The best-performing QG-MIL variants outperform leading baselines on all six benchmarks, with an average gain of 6.1 points in mean macro F1. Attention overlays and attention mass analysis confirm more distributed instance weighting. Ablation studies show that, while individual components can match the full model on specific datasets, the QG-MIL design provides the most consistent cross-domain performance and the tightest variance relative to the selected baselines. We release a configurable implementation to support reproducibility at https://github.com/unica-visual-intelligence-lab/QG-MIL
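To make the four components named in the abstract concrete, the following is a minimal NumPy sketch of one gated transformer block applied to a bag of instance embeddings. It is an illustration under stated assumptions, not the released implementation: the function and parameter names (`gated_block`, `init_params`, `wq`, `wg`, and so on), the element-wise sigmoid gate computed from the normalized input, the omission of learnable RMSNorm gains, and the random initialization are all hypothetical choices made here. The repository linked above remains the reference for the actual design.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Scale the last axis to unit root-mean-square (learnable gain omitted).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def split_heads(x, n_heads):
    # (N, D) -> (H, N, D // H)
    n, d = x.shape
    return x.reshape(n, n_heads, d // n_heads).transpose(1, 0, 2)

def init_params(dim, hidden, seed=0):
    # Hypothetical random weights, for illustration only.
    rng = np.random.default_rng(seed)
    shapes = {
        "wq": (dim, dim), "wk": (dim, dim), "wv": (dim, dim),
        "wg": (dim, dim), "wo": (dim, dim),
        "w1": (dim, hidden), "w3": (dim, hidden), "w2": (hidden, dim),
    }
    return {k: rng.normal(0.0, s[0] ** -0.5, size=s) for k, s in shapes.items()}

def gated_block(x, p, n_heads):
    """One block over a bag x of shape (N instances, D features)."""
    n, dim = x.shape

    # 1) RMSNorm-based pre-normalization before attention.
    h = rms_norm(x)

    # 2) Per-head QK normalization: normalize queries and keys within
    #    each head, which bounds the magnitude of the attention logits.
    q = rms_norm(split_heads(h @ p["wq"], n_heads))
    k = rms_norm(split_heads(h @ p["wk"], n_heads))
    v = split_heads(h @ p["wv"], n_heads)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]))  # (H, N, N)

    # 3) Attention output gating: an element-wise sigmoid gate per head
    #    and channel (assumed form of "fine-grained" gating).
    gate = sigmoid(split_heads(h @ p["wg"], n_heads))
    ctx = gate * (attn @ v)                                          # (H, N, D // H)
    x = x + ctx.transpose(1, 0, 2).reshape(n, dim) @ p["wo"]

    # 4) SwiGLU-style feed-forward: SiLU(h W1) * (h W3), projected by W2.
    h = rms_norm(x)
    a = h @ p["w1"]
    x = x + ((a * sigmoid(a)) * (h @ p["w3"])) @ p["w2"]
    return x, attn

# Usage: a synthetic bag of 16 instances with 32-dimensional features.
bag = np.random.default_rng(1).normal(size=(16, 32))
out, attn = gated_block(bag, init_params(dim=32, hidden=64), n_heads=4)
```

The block returns both the updated instance embeddings and the per-head attention weights, so the weights can be inspected for concentration. A bag-level prediction would additionally need a pooling step and a classifier head, which this sketch leaves out.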