SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL
January 14, 2026
Authors: Lijun Liu, Linwei Chen, Zhishou Zhang, Meng Tian, Hengfu Cui, Ruiyang Li, Zhaocheng Liu, Qiang Ju, Qianxi Li, Hong-Yu Zhou
cs.AI
Abstract
General-purpose Large Vision-Language Models (LVLMs), despite their massive scale, often falter in dermatology because of "diffuse attention": an inability to disentangle subtle pathological lesions from background noise. In this paper, we challenge the assumption that parameter scaling is the only path to medical precision. We introduce SkinFlow, a framework that treats diagnosis as an optimization of visual information transmission efficiency. Our approach uses a Virtual-Width Dynamic Vision Encoder (DVE) to "unfold" complex pathological manifolds without physical parameter expansion, coupled with a two-stage Reinforcement Learning strategy that first aligns explicit medical descriptions (Stage I) and then reconstructs implicit diagnostic textures (Stage II) within a constrained semantic space. Furthermore, we propose a clinically grounded evaluation protocol that prioritizes diagnostic safety and hierarchical relevance over rigid label matching. Empirically, our 7B model establishes a new state of the art on the Fitzpatrick17k benchmark, with gains of +12.06% in Top-1 accuracy and +28.57% in Top-6 accuracy over much larger general-purpose models (e.g., Qwen3VL-235B and GPT-5.2). These findings demonstrate that optimizing geometric capacity and information flow yields better diagnostic reasoning than raw parameter scaling.
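For readers unfamiliar with the Top-1 and Top-6 figures quoted above, the following is a minimal sketch of how Top-k accuracy is conventionally computed from ranked predictions. It assumes plain exact-match scoring; the abstract states that the proposed evaluation protocol favors diagnostic safety and hierarchical relevance over rigid label matching, so this is a simplification and not the paper's protocol. The function name `top_k_accuracy` and the toy diagnoses are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    labels: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose true label is among the first k ranked predictions.

    Assumption: exact string match between prediction and label
    (hypothetical simplification, not the paper's hierarchical protocol).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked_predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one case is required")

    hits = sum(
        1
        for preds, label in zip(ranked_predictions, labels)
        if label in preds[:k]  # only the k highest-ranked guesses count
    )
    return hits / len(labels)


# Toy data (hypothetical): each inner list is a model's ranked differential.
ranked = [
    ["psoriasis", "eczema", "lichen planus"],
    ["eczema", "psoriasis", "tinea"],
    ["melanoma", "nevus", "seborrheic keratosis"],
    ["acne", "rosacea", "folliculitis"],
]
truth = ["psoriasis", "psoriasis", "nevus", "folliculitis"]

top1 = top_k_accuracy(ranked, truth, k=1)
top2 = top_k_accuracy(ranked, truth, k=2)
top3 = top_k_accuracy(ranked, truth, k=3)
```

Because a larger k admits more of the ranked differential, Top-k accuracy can only stay the same or rise as k grows, which is why Top-6 figures are generally higher than Top-1 figures for the same model.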