Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation

Abstract

Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE, a benchmark dataset for structured radiology report generation that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies. It contains 1,473 annotated chest X-ray reports, each reviewed by experts; 80 of them also carry expert-reviewed longitudinal annotations that capture disease progression and inter-study intervals. Using this benchmark, we develop a two-stage framework that transforms generated reports into fine-grained, schema-aligned structured representations, enabling longitudinal interpretation. We also propose LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute levels while modeling temporal consistency across patient timelines. These contributions establish the first benchmark dataset, structuring framework, and evaluation metric for sequential radiology reporting, and empirical results demonstrate that LUNGUAGESCORE effectively supports structured report evaluation. The code is available at: https://github.com/SuperSupermoon/Lunguage
