ACE-LoRA: Graph-Attentive Context Enhancement for Parameter-Efficient Adaptation of Medical Vision-Language Models

Abstract

The success of CLIP-like vision-language models (VLMs) on natural images has inspired medical counterparts, yet existing approaches largely fall into two extremes. Specialist models trained on single-domain data capture domain-specific details but generalize poorly, while generalist medical VLMs trained on multi-domain data retain broad semantics but dilute fine-grained diagnostic cues. Bridging this specialization-generalization trade-off remains challenging. To address this problem, we propose ACE-LoRA, a parameter-efficient adaptation framework for generalist medical VLMs that maintains robust zero-shot generalization. ACE-LoRA integrates Low-Rank Adaptation (LoRA) modules into frozen image-text encoders and introduces an Attention-based Context Enhancement Hypergraph Neural Network (ACE-HGNN) module. This module captures higher-order contextual interactions beyond pairwise similarity and uses them to enrich global representations with localized diagnostic cues, addressing a key limitation of prior Parameter-Efficient Fine-Tuning (PEFT) methods, which overlook fine-grained details. To further enhance cross-modal alignment, we formulate a label-guided InfoNCE loss that suppresses false negatives between semantically related image-text pairs. Despite adding only 0.95M trainable parameters, ACE-LoRA consistently outperforms state-of-the-art medical VLMs and PEFT baselines across zero-shot classification, segmentation, and detection benchmarks spanning multiple domains. Our code is available at https://github.com/icon-lab/ACE-LoRA.
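The idea behind a label-guided InfoNCE loss can be illustrated with a minimal sketch. The abstract does not state the exact formulation, so the version below is an assumption rather than the authors' method: it takes one plausible reading, in which off-diagonal image-text pairs that share a class label with the anchor are dropped from the denominator instead of being pushed apart as negatives. The function name `label_masked_info_nce`, the helper names, and the temperature value are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def label_masked_info_nce(image_embs, text_embs, labels, temperature=0.07):
    """Image-to-text InfoNCE that ignores same-label negatives.

    Assumption (not taken from the paper): a pair (i, j) with j != i and
    labels[j] == labels[i] is treated as a potential false negative and is
    removed from the softmax denominator for anchor i.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    total = 0.0
    for i in range(n):
        # Cosine similarity of image i to every text, scaled by temperature.
        logits = [dot(imgs[i], txts[j]) / temperature for j in range(n)]
        # Keep the matched pair and all pairs with a different label.
        kept = [logits[j] for j in range(n)
                if j == i or labels[j] != labels[i]]
        # Numerically stable log-sum-exp over the kept logits.
        m = max(kept)
        log_denom = m + math.log(sum(math.exp(z - m) for z in kept))
        total += log_denom - logits[i]
    return total / n


if __name__ == "__main__":
    imgs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    txts = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    # Distinct labels reduce to plain InfoNCE; shared labels mask pairs.
    print(label_masked_info_nce(imgs, txts, [0, 1, 2]))
    print(label_masked_info_nce(imgs, txts, [0, 0, 1]))
```

The sketch covers only the image-to-text direction; a symmetric loss would average it with the text-to-image counterpart. When all labels in the batch are distinct, the function reduces to standard InfoNCE, which makes the effect of the label mask easy to isolate.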
