MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

Abstract

Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce MedConclusion, a large-scale dataset of 5.7M PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-to-conclusion reasoning. MedConclusion also includes journal-level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and an LLM-as-a-judge. We find that conclusion writing is behaviorally distinct from summary writing, that strong models remain closely clustered under current automatic metrics, and that judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence-to-conclusion reasoning. Our code and data are available at https://github.com/Harvard-AI-and-Robotics-Lab/MedConclusion.
