

Efficient Medical VIE via Reinforcement Learning

June 16, 2025
Authors: Lijun Liu, Ruiyang Li, Zhaocheng Liu, Chenglin Zhu, Chong Li, Jiehan Cheng, Qiang Ju, Jian Xie
cs.AI

Abstract

Visual Information Extraction (VIE) converts unstructured document images into structured formats like JSON, which is critical for medical applications such as report analysis and online consultations. Traditional methods rely on OCR and language models, while end-to-end multimodal models offer direct JSON generation. However, domain-specific schemas and high annotation costs limit their effectiveness in medical VIE. We base our approach on the Reinforcement Learning with Verifiable Rewards (RLVR) framework to address these challenges using only 100 annotated samples. Our approach ensures dataset diversity, uses a balanced precision-recall reward mechanism to reduce hallucinations and improve field coverage, and employs innovative sampling strategies to enhance reasoning capabilities. Fine-tuning Qwen2.5-VL-7B with our RLVR method, we achieve state-of-the-art performance on medical VIE tasks, significantly improving F1, precision, and recall. While our models excel on tasks similar to the medical datasets, performance drops on dissimilar tasks, highlighting the need for domain-specific optimization. Case studies further demonstrate the value of reasoning during training and inference for VIE.
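The balanced precision-recall reward described above can be illustrated with a minimal sketch. This is not the authors' implementation; it merely shows one plausible verifiable reward that compares a model's predicted JSON fields against gold annotations, where precision penalizes hallucinated fields and recall rewards field coverage, combined F1-style. The function name `vie_reward` and the flat key-value schema are assumptions for illustration.

```python
def vie_reward(pred: dict, gold: dict) -> float:
    """Hypothetical verifiable reward for VIE: F1 over extracted
    key-value fields, balancing precision (fewer hallucinated
    fields) against recall (better field coverage)."""
    pred_items = {(k, str(v)) for k, v in pred.items()}
    gold_items = {(k, str(v)) for k, v in gold.items()}
    if not pred_items or not gold_items:
        return 0.0
    tp = len(pred_items & gold_items)       # exactly matched fields
    precision = tp / len(pred_items)        # penalizes extra/hallucinated fields
    recall = tp / len(gold_items)           # rewards covering gold fields
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the reward is computed purely from string-matching against annotations, it is cheaply verifiable, which is the property RLVR relies on when training from as few as 100 labeled samples.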