Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

Abstract

Hepatocellular carcinoma (HCC) diagnosis relies heavily on the interpretation of gigapixel Whole Slide Images (WSIs). However, current computational approaches are constrained by fixed-resolution processing and inefficient feature aggregation, which lead to either severe information loss or high feature redundancy. To address these challenges, we propose Hepato-LLaVA, a specialized Multi-modal Large Language Model (MLLM) designed for fine-grained hepatocellular pathology analysis. We introduce a novel Sparse Topo-Pack Attention mechanism that explicitly models 2D tissue topology. This mechanism aggregates local diagnostic evidence into semantic summary tokens while preserving global context. Furthermore, to overcome the lack of multi-scale data, we present HepatoPathoVQA, a clinically grounded dataset comprising 33K hierarchically structured question-answer pairs validated by expert pathologists. Our experiments show that Hepato-LLaVA achieves state-of-the-art performance on HCC diagnosis and captioning tasks, significantly outperforming existing methods. Our code and implementation details are available at https://pris-cv.github.io/Hepto-LLaVA/.
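The abstract does not describe how Sparse Topo-Pack Attention is implemented, so the following is only a minimal sketch of the general idea it names: attention that is restricted by 2D position, with one summary token per local window. Everything specific here is an assumption for illustration and not the authors' method: the function name `build_topo_pack_mask`, the `pack` window size, the token ordering (patch tokens first, summary tokens last), and the three attention rules written in the comments.

```python
from itertools import product


def build_topo_pack_mask(grid_h, grid_w, pack):
    """Build a boolean attention mask for a grid of patch tokens plus one
    summary token per pack x pack window.

    Hypothetical layout for illustration only; not taken from the paper.
    mask[q][k] is True when query token q may attend to key token k.
    """
    if grid_h % pack or grid_w % pack:
        raise ValueError("grid size must be divisible by pack size")
    n_patches = grid_h * grid_w
    packs_w = grid_w // pack
    n_packs = (grid_h // pack) * packs_w
    n_tokens = n_patches + n_packs

    def pack_of(idx):
        # Map a flat patch index back to its 2D position, then to its pack id.
        row, col = divmod(idx, grid_w)
        return (row // pack) * packs_w + (col // pack)

    mask = [[False] * n_tokens for _ in range(n_tokens)]

    # Assumed local rule: patches attend only to patches in the same 2D window.
    for q, k in product(range(n_patches), repeat=2):
        mask[q][k] = pack_of(q) == pack_of(k)

    # Assumed aggregation rule: each patch and its own summary token see each other.
    for p in range(n_patches):
        s = n_patches + pack_of(p)
        mask[p][s] = True
        mask[s][p] = True

    # Assumed global rule: all summary tokens attend to one another.
    for s1, s2 in product(range(n_patches, n_tokens), repeat=2):
        mask[s1][s2] = True

    return mask


mask = build_topo_pack_mask(grid_h=4, grid_w=4, pack=2)
allowed = sum(sum(row) for row in mask)
density = allowed / (len(mask) ** 2)
```

In this toy 4 × 4 grid with 2 × 2 packs there are 16 patch tokens and 4 summary tokens, and 112 of the 400 query-key pairs are allowed (a density of 0.28). Patch 0 may attend to patch 4, which sits directly below it in the grid, but not to patch 2, which is closer in the flattened sequence yet lies in a different window. That difference is what distinguishes grouping by 2D position from grouping by 1D sequence order.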
