Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

Abstract

Hepatocellular carcinoma (HCC) diagnosis relies heavily on the interpretation of gigapixel whole slide images (WSIs). However, current computational approaches are constrained by fixed-resolution processing and inefficient feature aggregation, which inevitably lead to either severe information loss or high feature redundancy. To address these challenges, we propose Hepato-LLaVA, a specialized multimodal large language model (MLLM) designed for fine-grained hepatocellular pathology analysis. We introduce a novel Sparse Topo-Pack Attention mechanism that explicitly models 2D tissue topology. This mechanism aggregates local diagnostic evidence into semantic summary tokens while preserving global context. Furthermore, to overcome the lack of multi-scale data, we present HepatoPathoVQA, a clinically grounded dataset comprising 33K hierarchically structured question-answer pairs validated by expert pathologists. Our experiments demonstrate that Hepato-LLaVA achieves state-of-the-art performance on HCC diagnosis and captioning tasks, significantly outperforming existing methods. Our code and implementation details are available at https://pris-cv.github.io/Hepto-LLaVA/.
