EmpiriGraph-Psy: A Dataset and LLM Pipeline for Extracting Empirical Relation Graphs from Psychology Abstracts

Abstract

Existing scientific relation extraction benchmarks mainly target domains such as computer science, where entities are tasks, methods, datasets, materials, or metrics. This leaves a gap in variable-oriented empirical fields such as psychology, where findings are expressed as relations among constructs, measurements, interventions, and outcomes. We introduce variable-centered empirical graph extraction, the task of mapping scientific abstracts to typed graphs whose nodes are normalized variables and whose edges represent empirical and hierarchical relations. To support this task, we construct EmpiriGraph-Psy, a benchmark of 210 psychology abstracts annotated by domain-trained annotators with normalized variables, concept hierarchies, empirical relation types, and validation states. We evaluate frontier and open-weight LLMs using both direct extraction and a staged graph-construction pipeline that separates variable extraction, normalization, hierarchy construction, evidence selection, relation extraction, and edge validation. The staged pipeline substantially outperforms direct extraction, with the best configuration achieving a macro-F1 of 0.74. Error analysis shows that moderation relations and concept hierarchies remain the most challenging cases, highlighting the difficulty of extracting higher-order empirical claims and implicit abstraction structure from scientific abstracts.
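To make the target representation and the reported metric concrete, the following is a minimal sketch of a typed edge set and a macro-F1 computed over relation types. It is an illustration under stated assumptions, not the benchmark's actual schema or scorer: the `Edge` class, the relation labels (`negative_effect`, `moderates`, `is_a`), the example variables, and exact-match scoring of (source, target, relation) triples are all hypothetical, and moderation is flattened to a simple pair for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A typed edge between two normalized variables (hypothetical schema)."""
    source: str
    target: str
    relation: str  # assumed labels; the abstract does not list the real inventory


def macro_f1(gold: set, pred: set) -> float:
    """Average per-relation-type F1, using exact matching of whole edges."""
    labels = {e.relation for e in gold | pred}
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        g = {e for e in gold if e.relation == label}
        p = {e for e in pred if e.relation == label}
        tp = len(g & p)
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented example: one empirical edge, one moderation edge, one hierarchy edge.
gold = {
    Edge("mindfulness training", "perceived stress", "negative_effect"),
    Edge("social support", "perceived stress", "moderates"),
    Edge("perceived stress", "psychological distress", "is_a"),
}
pred = {
    Edge("mindfulness training", "perceived stress", "negative_effect"),
    Edge("perceived stress", "social support", "moderates"),  # wrong direction
    Edge("perceived stress", "psychological distress", "is_a"),
}
score = macro_f1(gold, pred)
```

Because macro averaging weights every relation type equally, a single failure on a rare type such as moderation lowers the score as much as a failure on a frequent type, which is one reason the hardest relation types can dominate the aggregate figure.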