ACE-LoRA: Graph-Attentive Context Enhancement for Parameter-Efficient Adaptation of Medical Vision-Language Models

Abstract

The success of CLIP-like vision-language models (VLMs) on natural images has inspired medical counterparts, yet existing approaches largely fall into two extremes. Specialist models trained on single-domain data capture domain-specific details but generalize poorly, while generalist medical VLMs trained on multi-domain data retain broad semantics but dilute fine-grained diagnostic cues. Bridging this specialization-generalization trade-off remains challenging. To address it, we propose ACE-LoRA, a parameter-efficient adaptation framework for generalist medical VLMs that maintains robust zero-shot generalization. ACE-LoRA integrates Low-Rank Adaptation (LoRA) modules into the frozen image and text encoders and introduces an Attention-based Context Enhancement Hypergraph Neural Network (ACE-HGNN) module. This module captures higher-order contextual interactions beyond pairwise similarity to enrich global representations with localized diagnostic cues, addressing a key limitation of prior Parameter-Efficient Fine-Tuning (PEFT) methods, which overlook fine-grained details. To further enhance cross-modal alignment, we formulate a label-guided InfoNCE loss that suppresses false negatives between semantically related image-text pairs. Despite adding only 0.95M trainable parameters, ACE-LoRA consistently outperforms state-of-the-art medical VLMs and PEFT baselines across zero-shot classification, segmentation, and detection benchmarks spanning multiple domains. Our code is available at https://github.com/icon-lab/ACE-LoRA.
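
The abstract does not spell out the label-guided InfoNCE loss, so the following is only a minimal sketch of one plausible reading, not the authors' formulation. It assumes that any off-diagonal image-text pair sharing a class label with the anchor is a potential false negative and is simply dropped from the contrastive denominator. The function name `label_guided_info_nce`, the `temperature` default, and the image-to-text-only direction are all illustrative assumptions; a symmetric version would average this with the text-to-image direction.

```python
import math


def label_guided_info_nce(image_emb, text_emb, labels, temperature=0.07):
    """Image-to-text InfoNCE with same-label pairs removed from the negatives.

    Assumption (not from the paper): an off-diagonal pair whose label equals
    the anchor's label is excluded from the denominator instead of being
    pushed away as a negative.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    images = [normalize(v) for v in image_emb]
    texts = [normalize(v) for v in text_emb]
    n = len(images)
    total = 0.0
    for i in range(n):
        # Temperature-scaled cosine similarity of image i to every text.
        logits = [
            sum(a * b for a, b in zip(images[i], texts[j])) / temperature
            for j in range(n)
        ]
        # Keep the matched pair plus negatives that carry a different label.
        kept = [logits[j] for j in range(n) if j == i or labels[j] != labels[i]]
        # Numerically stable log-sum-exp over the kept logits.
        peak = max(kept)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in kept))
        total += log_denominator - logits[i]
    return total / n
```

With all labels distinct this reduces to the standard InfoNCE loss. When two samples share a label, their cross pair no longer contributes to the denominator, so the loss stops penalizing the model for placing semantically related images and texts close together.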
