Efficient Medical VIE via Reinforcement Learning

Abstract

Visual Information Extraction (VIE) converts unstructured document images into structured formats such as JSON, and it is critical for medical applications such as report analysis and online consultations. Traditional methods rely on OCR and language models, while end-to-end multimodal models generate JSON directly. However, domain-specific schemas and high annotation costs limit their effectiveness in medical VIE. To address these challenges, we build on the Reinforcement Learning with Verifiable Rewards (RLVR) framework and use only 100 annotated samples. Our approach provides a diverse dataset, a balanced precision-recall reward mechanism that reduces hallucinations and improves field coverage, and innovative sampling strategies that enhance reasoning capabilities. By fine-tuning Qwen2.5-VL-7B with our RLVR method, we achieve state-of-the-art performance on medical VIE tasks, with significant improvements in F1, precision, and recall. Our models excel on tasks similar to the medical datasets, but performance drops on dissimilar tasks, which highlights the need for domain-specific optimization. Case studies further demonstrate the value of reasoning during both training and inference for VIE.
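The abstract does not state the exact form of the balanced precision-recall reward, so the following is only a minimal sketch of how such a verifiable reward could be computed. It assumes the reward is a field-level F1 between the model's JSON output and the annotated JSON. The names `flatten` and `field_f1_reward`, the exact-string field matching, and the zero reward for unparseable output are all illustrative assumptions, not the authors' implementation.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested JSON into {dotted.path: stringified leaf value}."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}{index}."))
    else:
        items[prefix.rstrip(".")] = str(obj).strip()
    return items


def field_f1_reward(predicted_text, gold):
    """Hypothetical reward: field-level F1 of predicted JSON against gold JSON.

    Precision penalizes fields that are not in the annotation (hallucinations);
    recall penalizes annotated fields that the model left out (coverage).
    """
    try:
        predicted = json.loads(predicted_text)
    except json.JSONDecodeError:
        return 0.0  # assumption: unparseable output earns no reward
    if not isinstance(predicted, (dict, list)):
        return 0.0  # assumption: a bare scalar is not a valid extraction

    pred_fields = flatten(predicted)
    gold_fields = flatten(gold)
    # A field counts as correct only if both its path and its value match.
    correct = sum(1 for k, v in pred_fields.items() if gold_fields.get(k) == v)
    if correct == 0:
        return 0.0

    precision = correct / len(pred_fields)
    recall = correct / len(gold_fields)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = {"patient": {"name": "A", "age": 40}, "diagnosis": "flu"}
    # One wrong value (age) and one extra field (drug) lower the reward.
    pred = '{"patient": {"name": "A", "age": 41}, "diagnosis": "flu", "drug": "x"}'
    print(field_f1_reward(pred, gold))
```

In this sketch, an extra field lowers precision and a missing or wrong field lowers recall, so the harmonic mean discourages both over-generation and under-extraction. A real reward would likely need softer value matching, for example normalizing dates and units.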
