Hepato-LLaVA: An Expert Multimodal Large Language Model with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

Abstract

Hepatocellular carcinoma (HCC) diagnosis relies heavily on the interpretation of gigapixel whole slide images (WSIs). However, current computational approaches are constrained by fixed-resolution processing and inefficient feature aggregation, which lead to either severe information loss or high feature redundancy. To address these challenges, we propose Hepato-LLaVA, a specialized multimodal large language model (MLLM) designed for fine-grained hepatocellular pathology analysis. We introduce a novel Sparse Topo-Pack Attention mechanism that explicitly models 2D tissue topology. This mechanism aggregates local diagnostic evidence into semantic summary tokens while preserving global context. Furthermore, to overcome the lack of multi-scale data, we present HepatoPathoVQA, a clinically grounded dataset comprising 33K hierarchically structured question-answer pairs validated by expert pathologists. Our experiments demonstrate that Hepato-LLaVA achieves state-of-the-art performance on HCC diagnosis and captioning tasks, significantly outperforming existing methods. Our code and implementation details are available at https://pris-cv.github.io/Hepto-LLaVA/.
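
As a rough illustration of how local evidence on a 2D patch grid might be aggregated into summary tokens while a global path is kept, the sketch below builds a sparse attention mask over patch tokens and per-window summary tokens. It is not the Hepato-LLaVA implementation: the non-overlapping window ("pack") layout, the mask rules, and the names `build_pack_mask` and `masked_attention` are all illustrative assumptions.

```python
import numpy as np


def build_pack_mask(grid_h, grid_w, pack):
    """Boolean attention mask over [patch tokens..., summary tokens...].

    Assumed (hypothetical) layout: patches lie on a grid_h x grid_w grid and are
    grouped into non-overlapping pack x pack windows, one summary token each.
    """
    assert grid_h % pack == 0 and grid_w % pack == 0
    n_patch = grid_h * grid_w
    packs_w = grid_w // pack
    n_pack = (grid_h // pack) * packs_w

    # Map every patch index to the id of the 2D window that contains it.
    rows, cols = np.divmod(np.arange(n_patch), grid_w)
    pack_id = (rows // pack) * packs_w + (cols // pack)

    n = n_patch + n_pack
    mask = np.zeros((n, n), dtype=bool)
    member = pack_id[None, :] == np.arange(n_pack)[:, None]  # (n_pack, n_patch)

    mask[:n_patch, :n_patch] = pack_id[:, None] == pack_id[None, :]  # patch -> same-pack patches
    mask[n_patch:, :n_patch] = member        # summary -> patches of its own pack
    mask[:n_patch, n_patch:] = member.T      # patch -> summary of its own pack
    mask[n_patch:, n_patch:] = True          # summary <-> summary (global context)
    return mask, pack_id


def masked_attention(x, mask):
    """Single-head scaled dot-product attention with q = k = v = x, limited by mask."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # every row has at least one allowed entry
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x


# Example: a 4 x 4 patch grid with 2 x 2 packs gives 16 patch + 4 summary tokens.
mask, pack_id = build_pack_mask(4, 4, 2)
tokens = np.random.default_rng(0).normal(size=(mask.shape[0], 8))
summaries = masked_attention(tokens, mask)[16:]  # one aggregated vector per pack
```

In this toy layout only 112 of the 400 possible token pairs are attended, and the gap widens as the grid grows, which is the kind of saving a sparse, topology-aware scheme aims for on gigapixel slides.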